On‑Demand Production MongoDB Clones for Testing (and DR)
Date‑targeted restore, validation, and safe teardown on AWS via Jenkins
The Challenge
Teams needed fresh, production‑realistic datasets for testing and analysis without touching the live environment. Manually cloning from snapshots was slow and error‑prone. We also wanted the flow to serve as a DR drill to validate restore readiness and procedures.
The Solution
1) Restore pipeline: mongodb-restore-prod
Purpose: Create an on‑demand clone of production by launching an EC2 instance from an AMI, attaching volumes created from the selected snapshots (data + logs), updating the MongoDB config, and validating service health.
Required inputs:
- SOURCE_INSTANCE_ID
- Either SNAPSHOT_DATE or both DB_SNAPSHOT_ID and LOGS_SNAPSHOT_ID
Optional inputs:
- AMI_ID / AMI_NAME (else create an AMI)
- TARGET_INSTANCE_NAME
- INSTANCE_TYPE
- AWS_REGION
- TIMEOUT
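The "either a date or both snapshot IDs" rule can be checked before any AWS call is made. The following is a minimal sketch of such a check, not the pipeline's actual code: the function name `validate_restore_inputs` is hypothetical, and rejecting a mix of both styles is an assumption the source does not state.

```python
from typing import Mapping, Optional


def validate_restore_inputs(params: Mapping[str, Optional[str]]) -> str:
    """Return "date" or "ids" depending on how the snapshots were specified.

    Raises ValueError when required inputs are missing or ambiguous.
    """
    def given(key: str) -> bool:
        # Treat missing, None, and whitespace-only values as "not provided".
        return bool((params.get(key) or "").strip())

    if not given("SOURCE_INSTANCE_ID"):
        raise ValueError("SOURCE_INSTANCE_ID is required")

    has_date = given("SNAPSHOT_DATE")
    has_db = given("DB_SNAPSHOT_ID")
    has_logs = given("LOGS_SNAPSHOT_ID")

    if has_date and (has_db or has_logs):
        # Assumption: mixing both styles is rejected rather than silently prioritized.
        raise ValueError("Provide SNAPSHOT_DATE or explicit snapshot IDs, not both")
    if has_date:
        return "date"
    if has_db and has_logs:
        return "ids"
    raise ValueError(
        "Provide SNAPSHOT_DATE, or both DB_SNAPSHOT_ID and LOGS_SNAPSHOT_ID"
    )
```

Failing at this point keeps a half-specified request from launching an instance that later has no volumes to attach.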
Flow
Read source instance metadata (tags, networking, AZ).
Use or create a fresh AMI from the source instance (if permitted).
Identify the data and logs snapshots (by IDs or tags/size heuristic).
Launch a new EC2 instance and attach the data and logs volumes; reuse the appropriate security groups and subnet.
Wait for the instance to become healthy within a bounded timeout.
Remotely update MongoDB configuration to allow standalone validation, restart the database service, and verify basic health.
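The snapshot identification step ("by IDs or tags/size heuristic") could look roughly like the sketch below when a date is given. It works on dictionaries shaped like the entries EC2's `describe_snapshots` returns (`SnapshotId`, `StartTime`, `VolumeSize`, `Tags`). The `Role` tag key and the "largest volume is data, smallest is logs" fallback are assumptions made for illustration; the source does not say which tags or sizes the real pipeline relies on.

```python
from datetime import date, datetime, timezone
from typing import List, Tuple


def pick_snapshots(snapshots: List[dict], target: date) -> Tuple[str, str]:
    """Return (data_snapshot_id, logs_snapshot_id) for snapshots taken on `target`."""
    same_day = [s for s in snapshots if s["StartTime"].date() == target]

    def role(snap: dict) -> str:
        tags = {t["Key"]: t["Value"] for t in snap.get("Tags", [])}
        return tags.get("Role", "")  # hypothetical tag key

    def newest(group: List[dict]) -> dict:
        return max(group, key=lambda s: s["StartTime"])

    data = [s for s in same_day if role(s) == "data"]
    logs = [s for s in same_day if role(s) == "logs"]

    if not (data and logs):
        # Fallback heuristic (assumption): largest volume holds data, smallest holds logs.
        if len(same_day) < 2:
            raise LookupError(f"need two snapshots on {target}, found {len(same_day)}")
        by_size = sorted(same_day, key=lambda s: s["VolumeSize"])
        logs, data = [by_size[0]], [by_size[-1]]

    return newest(data)["SnapshotId"], newest(logs)["SnapshotId"]


# Example input in the shape of describe_snapshots entries (IDs are made up).
example = [
    {"SnapshotId": "snap-data", "VolumeSize": 500,
     "StartTime": datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc),
     "Tags": [{"Key": "Role", "Value": "data"}]},
    {"SnapshotId": "snap-logs", "VolumeSize": 50,
     "StartTime": datetime(2024, 5, 1, 2, 5, tzinfo=timezone.utc),
     "Tags": [{"Key": "Role", "Value": "logs"}]},
]
print(pick_snapshots(example, date(2024, 5, 1)))
```

Preferring explicit tags and using size only as a fallback keeps the heuristic from quietly picking the wrong volume when more than two snapshots exist for the same day.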
Naming
- Default instance name = <sourceName>-<YYYY-MM-DD> when TARGET_INSTANCE_NAME is not set.
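The naming rule is small enough to express directly. This is a sketch under stated assumptions: the helper name `target_instance_name` is hypothetical, and the source does not say whether the date in the name is the snapshot date or the restore date, so the caller decides which one to pass.

```python
from datetime import date
from typing import Optional


def target_instance_name(source_name: str, day: date,
                         override: Optional[str] = None) -> str:
    """Use TARGET_INSTANCE_NAME when set, else <sourceName>-<YYYY-MM-DD>."""
    if override and override.strip():
        return override.strip()
    # date.isoformat() always yields zero-padded YYYY-MM-DD.
    return f"{source_name}-{day.isoformat()}"
```

Keeping the date suffix zero-padded matters because the cleanup pipeline's RESTORE_DATE selector matches names ending in -YYYY-MM-DD.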
Outputs:
- New INSTANCE_ID and public/private IP
- Attached volume IDs for root, /dev/sdb, /dev/sdc
- Slack thread updates for each stage
- Timeouts guard long waits and fail fast with context
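One way to get bounded waits that "fail fast with context" is a generic polling helper like the one below. It is an illustrative sketch rather than the pipeline's code: `wait_until`, its parameters, and the error message format are all hypothetical. In practice the `check` callable would wrap something like an instance status query.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float,
               interval_s: float = 5.0, what: str = "condition") -> float:
    """Poll check() until it returns True; return the elapsed seconds.

    Raises TimeoutError naming what was awaited once timeout_s has passed.
    """
    start = time.monotonic()
    deadline = start + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if check():
            return time.monotonic() - start
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"{what} not reached within {timeout_s}s ({attempts} checks)"
            )
        # Never sleep past the deadline.
        time.sleep(min(interval_s, remaining))
```

Using `time.monotonic()` instead of wall-clock time keeps the deadline stable if the system clock is adjusted during a long wait.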
2) Cleanup pipeline: terminate-mongo-restored-instance
Purpose: Find and terminate MongoDB “restore” EC2 instances.
Selection (exactly one):
- INSTANCE_ID(s) OR INSTANCE_NAME(s) OR PRIVATE_IP(s) OR RESTORE_DATE (matches Name ending with -YYYY-MM-DD)
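The "exactly one selector" rule might be enforced along these lines. This is a sketch, not the pipeline's implementation: `parse_selector` is a hypothetical name, and accepting comma-separated lists for the "(s)" selectors is an assumption about how multiple values are passed.

```python
import re
from datetime import date
from typing import List, Mapping, Optional, Tuple

SELECTOR_KEYS = ("INSTANCE_ID", "INSTANCE_NAME", "PRIVATE_IP", "RESTORE_DATE")


def parse_selector(params: Mapping[str, Optional[str]]) -> Tuple[str, List[str]]:
    """Return (selector_key, values); raise ValueError unless exactly one is set."""
    provided = {}
    for key in SELECTOR_KEYS:
        value = (params.get(key) or "").strip()
        if value:
            provided[key] = value

    if len(provided) != 1:
        raise ValueError(
            f"exactly one of {', '.join(SELECTOR_KEYS)} is required; "
            f"got {sorted(provided) or 'none'}"
        )

    key, raw = next(iter(provided.items()))
    # Assumption: multiple values arrive as a comma-separated string.
    values = [part.strip() for part in raw.split(",") if part.strip()]
    if not values:
        raise ValueError(f"{key} has no usable values")

    if key == "RESTORE_DATE":
        if len(values) != 1 or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", values[0]):
            raise ValueError("RESTORE_DATE must be a single YYYY-MM-DD date")
        date.fromisoformat(values[0])  # rejects impossible dates such as 2024-02-30
    return key, values
```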
Defaults:
- DRY_RUN=true (permission check only). Set DRY_RUN=false to actually terminate.
Flow:
- Validate selector (mutually exclusive) and resolve to instance IDs.
- Terminate the selected instances (or perform a dry run) and wait until termination completes.
- Send Slack updates (initial message suppressed if DRY_RUN=true).
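Resolving a RESTORE_DATE selector to instance IDs can be sketched as a pure function over a response shaped like EC2's `describe_instances` output (`Reservations` containing `Instances` with `Tags` and `State`). The function name is hypothetical, and skipping instances that are already terminated or shutting down is an assumption, not something the source specifies.

```python
from typing import List


def instance_ids_for_restore_date(describe_response: dict,
                                  restore_date: str) -> List[str]:
    """Return IDs of instances whose Name tag ends with -<restore_date>."""
    suffix = f"-{restore_date}"
    matches = []
    for reservation in describe_response.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            state = inst.get("State", {}).get("Name")
            # Assumption: instances already going away are not selected again.
            if state in ("terminated", "shutting-down"):
                continue
            if tags.get("Name", "").endswith(suffix):
                matches.append(inst["InstanceId"])
    return matches
```

Matching on the name suffix alone would also catch any unrelated instance that happens to end with the same date, so a real pipeline would probably want an additional guard such as a dedicated tag; that guard is left out here because the source does not describe one.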
3) Health & Integrity Validation
- Scripted checks after restart: basic connectivity, index counts, collection sizes, and representative queries.
- Export diagnostic artifacts (logs, validation summaries) for audit and triage.
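A validation summary that goes beyond "service up" could be produced by comparing per-collection statistics against a baseline, as in the sketch below. Everything here is illustrative: the function name, the `count` and `indexes` keys, and the 0.9 tolerance are assumptions, and the statistics themselves would have to be gathered separately from the restored instance.

```python
import json
from typing import Dict


def validate_collections(expected: Dict[str, dict], observed: Dict[str, dict],
                         min_count_ratio: float = 0.9) -> dict:
    """Compare per-collection stats and return a JSON-serializable summary."""
    findings = []
    for name, exp in expected.items():
        obs = observed.get(name)
        if obs is None:
            findings.append({"collection": name, "problem": "missing"})
            continue
        if obs["indexes"] != exp["indexes"]:
            findings.append({"collection": name, "problem": "index count",
                             "expected": exp["indexes"], "observed": obs["indexes"]})
        # A snapshot is older than the baseline, so allow some shortfall (assumed tolerance).
        if obs["count"] < exp["count"] * min_count_ratio:
            findings.append({"collection": name, "problem": "document count",
                             "expected": exp["count"], "observed": obs["count"]})
    return {"checked": len(expected), "passed": not findings, "findings": findings}


def summary_as_artifact(summary: dict) -> str:
    """Serialize the summary so it can be stored as an audit artifact."""
    return json.dumps(summary, indent=2, sort_keys=True)
```

Returning structured findings rather than a single pass/fail flag makes the exported artifact useful for triage as well as for audit.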
4) CI/CD Integration
- Entry points via Jenkins pipelines for scheduled DR drills and ad‑hoc invocations.
Technologies Used
- Jenkins
- AWS (EC2/EBS snapshots, IAM, S3 for artifacts)
- MongoDB tools (mongorestore, mongodump as needed)
- Kubernetes (optional validation jobs)
- Shell scripting
Results Achieved
- Reliable DR exercises with consistent runbooks
- Lower RTO by automating restore and validation
- Cost control via automatic cleanup of helper resources
- Auditable artifacts for compliance and post‑mortems
Key Metrics
- RTO: ~5–15 minutes
- DR Drill Cadence: Weekly + on‑demand
- Cleanup Success: 100%
Key Learnings
- Codify validation beyond “service up” checks to verify data quality
- Prefer temporary, tagged resources for DR to simplify teardown
- Store artifacts centrally for audits and continuous improvement