On‑Demand Production MongoDB Clones for Testing (and DR)

Date‑targeted restore, validation, and safe teardown on AWS via Jenkins

Key Metrics

RTO: ~5–15 minutes
DR Drill Cadence: Weekly + on‑demand
Cleanup Success: 100%

The Challenge

Teams needed fresh, production‑realistic datasets for testing and analysis without touching the live environment. Manually cloning from snapshots was slow and error‑prone. We also wanted the flow to serve as a DR drill to validate restore readiness and procedures.

The Solution

1) Restore pipeline: `mongodb-restore-prod`

Purpose: Create an on‑demand clone of production by launching an EC2 instance from an AMI, attaching volumes created from the selected snapshots (data + logs), updating the MongoDB configuration, and validating service health.

Required inputs:

SOURCE_INSTANCE_ID
Either SNAPSHOT_DATE or both DB_SNAPSHOT_ID and LOGS_SNAPSHOT_ID

Optional inputs:

AMI_ID or AMI_NAME (if neither is set, an AMI is created from the source instance)
TARGET_INSTANCE_NAME
INSTANCE_TYPE
AWS_REGION
TIMEOUT
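
The input rules above can be checked before any AWS call is made. The following is a minimal sketch, not the pipeline's actual code: the `RestoreParams` class and `validate_restore_params` function are hypothetical names, and the choice to let explicit snapshot IDs take precedence over SNAPSHOT_DATE when both are supplied is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestoreParams:
    """Hypothetical container for the restore job's parameters."""
    source_instance_id: str
    snapshot_date: Optional[str] = None      # e.g. "2024-05-01"
    db_snapshot_id: Optional[str] = None
    logs_snapshot_id: Optional[str] = None


def validate_restore_params(p: RestoreParams) -> str:
    """Return "ids" or "date" to say how snapshots will be selected.

    Raises ValueError when the required inputs are missing or incomplete.
    """
    if not p.source_instance_id:
        raise ValueError("SOURCE_INSTANCE_ID is required")

    has_both_ids = bool(p.db_snapshot_id and p.logs_snapshot_id)
    has_any_id = bool(p.db_snapshot_id or p.logs_snapshot_id)

    if has_both_ids:
        # Assumption: explicit IDs win over SNAPSHOT_DATE if both are given.
        return "ids"
    if has_any_id:
        raise ValueError(
            "DB_SNAPSHOT_ID and LOGS_SNAPSHOT_ID must be provided together"
        )
    if p.snapshot_date:
        return "date"
    raise ValueError(
        "Provide SNAPSHOT_DATE or both DB_SNAPSHOT_ID and LOGS_SNAPSHOT_ID"
    )
```

Failing at this point keeps a misconfigured run from creating an AMI or instance that would then need cleanup.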

Flow

Read source instance metadata (tags, networking, AZ).
Use or create a fresh AMI from the source instance (if permitted).
Identify the data and logs snapshots, either by explicit IDs or by a tag and size heuristic.
Launch a new EC2 instance and attach the data and logs volumes; reuse the appropriate security groups and subnet.
Wait for the instance to become healthy within a bounded timeout.
Remotely update MongoDB configuration to allow standalone validation, restart the database service, and verify basic health.
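
The snapshot identification step can be illustrated with a small, pure function. This is a sketch under stated assumptions: the input dictionaries use the `SnapshotId`, `StartTime`, `VolumeSize`, and `Tags` fields that EC2 returns when describing snapshots, while the `Role` tag and the "larger volume is data, smaller is logs" rule are hypothetical stand-ins for whatever tags and sizes the real pipeline relies on.

```python
from datetime import date
from typing import Dict, List, Optional


def _tag(snapshot: dict, key: str) -> Optional[str]:
    """Read one tag value from an EC2-style Tags list."""
    for t in snapshot.get("Tags", []):
        if t.get("Key") == key:
            return t.get("Value")
    return None


def pick_snapshots(snapshots: List[dict], target: date) -> Dict[str, str]:
    """Return {"data": snapshot_id, "logs": snapshot_id} for the target date."""
    same_day = [s for s in snapshots if s["StartTime"].date() == target]

    # First choice: an explicit (hypothetical) Role tag on each snapshot.
    by_role: Dict[str, str] = {}
    for s in same_day:
        role = _tag(s, "Role")
        if role in ("data", "logs"):
            by_role[role] = s["SnapshotId"]
    if len(by_role) == 2:
        return by_role

    # Fallback: size heuristic, assuming the data volume is the larger one.
    if len(same_day) != 2:
        raise LookupError(
            f"expected 2 snapshots on {target}, found {len(same_day)}"
        )
    small, large = sorted(same_day, key=lambda s: s["VolumeSize"])
    if small["VolumeSize"] == large["VolumeSize"]:
        raise LookupError("snapshots have equal size; cannot tell data from logs")
    return {"data": large["SnapshotId"], "logs": small["SnapshotId"]}
```

Raising on ambiguity, rather than guessing, matters here because attaching the wrong volume would only surface later as a MongoDB startup failure.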

Naming

Default instance name = <sourceName>-<YYYY-MM-DD> when TARGET_INSTANCE_NAME is not set.
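
The naming rule is simple enough to express directly. In this sketch the function name `restore_instance_name` is hypothetical, and which date goes into the suffix (the snapshot date or the run date) is not stated in the rule, so the caller passes it in.

```python
from datetime import date
from typing import Optional


def restore_instance_name(
    source_name: str, day: date, target_name: Optional[str] = None
) -> str:
    """Use TARGET_INSTANCE_NAME when set, else <sourceName>-<YYYY-MM-DD>."""
    if target_name:
        return target_name
    # date.isoformat() yields YYYY-MM-DD, matching the documented suffix.
    return f"{source_name}-{day.isoformat()}"
```

Keeping the date as a fixed-format suffix is what lets a later cleanup step select instances by name alone.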

Outputs:

New INSTANCE_ID and public/private IP
Attached volume IDs for root, /dev/sdb, /dev/sdc
Slack thread updates for each stage
Bounded timeouts on long waits, so a stalled stage fails fast with context
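
A bounded wait that fails fast with context can be sketched as a small polling helper. The name `wait_until` and its parameters are hypothetical; the real pipeline may rely on AWS waiters or Jenkins timeouts instead. The `check` callable stands in for whatever health probe is being polled.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    what: str = "condition",
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll check() until it returns True; return elapsed seconds.

    Raises TimeoutError naming what was awaited once timeout_s has passed.
    """
    start = clock()
    deadline = start + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if check():
            return clock() - start
        now = clock()
        if now >= deadline:
            raise TimeoutError(
                f"{what} not met after {now - start:.0f}s ({attempts} checks)"
            )
        # Never sleep past the deadline.
        sleep(min(interval_s, deadline - now))
```

A call such as `wait_until(instance_is_healthy, timeout_s=600, what="instance status checks")`, where `instance_is_healthy` is a caller-supplied probe, would then surface a stalled stage as an error message rather than a hung job.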

2) Cleanup pipeline: `terminate-mongo-restored-instance`

Purpose: Find and terminate MongoDB “restore” EC2 instances.

Selection (exactly one):

One of INSTANCE_ID(s), INSTANCE_NAME(s), PRIVATE_IP(s), or RESTORE_DATE (matches instances whose Name ends with -YYYY-MM-DD)

Defaults:

DRY_RUN=true, which performs a permission check only and terminates nothing. Set DRY_RUN=false to actually terminate.
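
One way to implement a permission-only dry run is EC2's own `DryRun` flag, which makes the API reject the call with the error code `DryRunOperation` when the caller would have been allowed and `UnauthorizedOperation` when not. The sketch below assumes a boto3 EC2 client is passed in; whether the actual job uses this mechanism is not stated, and `can_terminate` is a hypothetical name. The exception is inspected by duck typing so the example does not need botocore imported.

```python
from typing import List


def can_terminate(ec2_client, instance_ids: List[str]) -> bool:
    """Permission check via EC2's DryRun flag; nothing is terminated."""
    try:
        ec2_client.terminate_instances(InstanceIds=instance_ids, DryRun=True)
    except Exception as exc:
        # botocore's ClientError carries the API error in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "DryRunOperation":
            return True    # the real call would have succeeded
        if code == "UnauthorizedOperation":
            return False   # IAM does not allow termination
        raise              # anything else (bad ID, throttling) is a real error
    # A dry-run request is expected to raise; returning normally is a surprise.
    raise RuntimeError("dry run returned without the expected error")
```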

Flow:

Validate that exactly one selector is set and resolve it to instance IDs.
Terminate the selected instances (or perform a dry run) and wait until termination completes.
Send Slack updates (the initial message is suppressed when DRY_RUN=true).
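
The selector validation and the RESTORE_DATE name match can be sketched without touching AWS. The function names here are hypothetical, the parameter names are reduced to their singular form, and resolving the chosen selector to actual instance IDs is left out.

```python
import re
from typing import Dict, List, Optional, Tuple

SELECTORS = ("INSTANCE_ID", "INSTANCE_NAME", "PRIVATE_IP", "RESTORE_DATE")


def pick_selector(params: Dict[str, Optional[str]]) -> Tuple[str, str]:
    """Return (name, value) of the single selector that was provided."""
    provided = [
        (k, params[k].strip())
        for k in SELECTORS
        if (params.get(k) or "").strip()
    ]
    if len(provided) != 1:
        names = ", ".join(k for k, _ in provided) or "none"
        raise ValueError(f"exactly one selector is required, got: {names}")
    return provided[0]


def names_matching_restore_date(names: List[str], restore_date: str) -> List[str]:
    """Keep instance names that end with -YYYY-MM-DD for the given date."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", restore_date):
        raise ValueError("RESTORE_DATE must look like YYYY-MM-DD")
    suffix = f"-{restore_date}"
    return [n for n in names if n.endswith(suffix)]
```

Rejecting zero or multiple selectors up front avoids the worst outcome for a termination job, which is an empty or overly broad filter matching instances nobody intended to remove.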

3) Health & Integrity Validation

Scripted checks after restart: basic connectivity, index counts, collection sizes, and representative queries.
Export diagnostic artifacts (logs, validation summaries) for audit and triage.
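
The index-count and collection-size checks can be framed as a comparison between a baseline and what the restored instance reports. This is a sketch under assumptions: the baseline source, the 5% size tolerance, and the function name are hypothetical, and the dictionaries use the `nindexes` and `size` field names from MongoDB's `collStats` output. Gathering those stats from the live server is left to the caller.

```python
from typing import Dict, List


def compare_collection_stats(
    expected: Dict[str, dict],
    actual: Dict[str, dict],
    size_tolerance: float = 0.05,
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []
    for name, exp in expected.items():
        got = actual.get(name)
        if got is None:
            problems.append(f"{name}: collection missing")
            continue
        # Index counts should match exactly after a snapshot restore.
        if got["nindexes"] != exp["nindexes"]:
            problems.append(
                f"{name}: expected {exp['nindexes']} indexes, found {got['nindexes']}"
            )
        # Sizes may drift slightly from the baseline, so allow a tolerance.
        if exp["size"] > 0:
            drift = abs(got["size"] - exp["size"]) / exp["size"]
            if drift > size_tolerance:
                problems.append(f"{name}: size differs by {drift:.1%}")
        elif got["size"] != 0:
            problems.append(f"{name}: expected empty collection")
    return problems
```

Returning a list of findings rather than a single pass/fail flag makes the result easy to write into the validation summary that is exported as an artifact.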

4) CI/CD Integration

Entry points via Jenkins pipelines for scheduled DR drills and ad‑hoc invocations.

Technologies Used

Jenkins
AWS (EC2/EBS snapshots, IAM, S3 for artifacts)
MongoDB tools (mongorestore, mongodump as needed)
Kubernetes (optional validation jobs)
Shell scripting

Results Achieved

Reliable DR exercises with consistent runbooks
Lower RTO by automating restore and validation
Cost control via automatic cleanup of helper resources
Auditable artifacts for compliance and post‑mortems

Key Learnings

Codify validation beyond “service up” checks to verify data quality
Prefer temporary, tagged resources for DR to simplify teardown
Store artifacts centrally for audits and continuous improvement

Technologies & Tools

MongoDB, AWS, Disaster Recovery, Snapshots, Jenkins, Kubernetes, Validation, Automation
