On‑Demand Production MongoDB Clones for Testing (and DR)

Date‑targeted restore, validation, and safe teardown on AWS via Jenkins

Key Metrics

  • RTO: ~5–15 minutes
  • DR Drill Cadence: Weekly + on‑demand
  • Cleanup Success: 100%

The Challenge

Teams needed fresh, production‑realistic datasets for testing and analysis without touching the live environment. Manually cloning from snapshots was slow and error‑prone. We also wanted the flow to serve as a DR drill to validate restore readiness and procedures.

The Solution

1) Restore pipeline: mongodb-restore-prod

Purpose: Create an on‑demand clone of production by launching an EC2 instance from an AMI, attaching volumes restored from the selected snapshots (data + logs), updating the MongoDB config, and validating service health.

Required inputs:

  • SOURCE_INSTANCE_ID
  • Either SNAPSHOT_DATE or both DB_SNAPSHOT_ID and LOGS_SNAPSHOT_ID

Optional inputs:

  • AMI_ID or AMI_NAME (if neither is set, a fresh AMI is created from the source instance)
  • TARGET_INSTANCE_NAME
  • INSTANCE_TYPE
  • AWS_REGION
  • TIMEOUT
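
The input rules above can be checked before any AWS call is made. The following is a minimal sketch, not the pipeline's actual code: the function and class names are hypothetical, and giving explicit snapshot IDs precedence over SNAPSHOT_DATE when both are supplied is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SnapshotSelection:
    """Either a snapshot date or an explicit pair of snapshot IDs."""
    snapshot_date: Optional[str] = None
    db_snapshot_id: Optional[str] = None
    logs_snapshot_id: Optional[str] = None


def validate_restore_inputs(params: Mapping[str, str]) -> SnapshotSelection:
    """Validate required inputs; raise ValueError naming the missing piece."""
    def get(name: str) -> str:
        return (params.get(name) or "").strip()

    if not get("SOURCE_INSTANCE_ID"):
        raise ValueError("SOURCE_INSTANCE_ID is required")

    date = get("SNAPSHOT_DATE")
    db, logs = get("DB_SNAPSHOT_ID"), get("LOGS_SNAPSHOT_ID")

    # Assumption: an explicit ID pair wins over SNAPSHOT_DATE.
    if db and logs:
        return SnapshotSelection(db_snapshot_id=db, logs_snapshot_id=logs)
    if db or logs:
        raise ValueError("DB_SNAPSHOT_ID and LOGS_SNAPSHOT_ID must be set together")
    if date:
        return SnapshotSelection(snapshot_date=date)
    raise ValueError("set SNAPSHOT_DATE or both DB_SNAPSHOT_ID and LOGS_SNAPSHOT_ID")
```

Failing here, before any instance is launched, keeps a bad parameter from leaving half-built resources behind.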

Flow

  1. Read source instance metadata (tags, networking, AZ).

  2. Use the supplied AMI, or create a fresh one from the source instance (if permitted).

  3. Identify the data and logs snapshots (by IDs or tags/size heuristic).

  4. Launch a new EC2 instance and attach the data and logs volumes; reuse the appropriate security groups and subnet.

  5. Wait for the instance to become healthy within a bounded timeout.

  6. Remotely update MongoDB configuration to allow standalone validation, restart the database service, and verify basic health.
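
Step 3 can be illustrated with a small selection function. This is a sketch under stated assumptions: the case study does not spell out the tags/size heuristic, so the "Role" tag key and the "larger volume is data, smaller is logs" rule are both hypothetical. The input dictionaries use the field names of EC2 DescribeSnapshots entries (SnapshotId, StartTime, VolumeSize, Tags).

```python
from datetime import date
from typing import Any, Dict, List, Tuple

Snapshot = Dict[str, Any]  # shaped like one entry of EC2 DescribeSnapshots


def _tag(snapshot: Snapshot, key: str) -> str:
    """Return the value of a tag, or an empty string if it is absent."""
    for tag in snapshot.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return ""


def pick_snapshots(snapshots: List[Snapshot], target: date) -> Tuple[str, str]:
    """Return (data_snapshot_id, logs_snapshot_id) for the target date."""
    same_day = [s for s in snapshots if s["StartTime"].date() == target]

    # Preferred path: explicit tags (hypothetical "Role" tag key).
    by_role = {_tag(s, "Role"): s for s in same_day}
    if "data" in by_role and "logs" in by_role:
        return by_role["data"]["SnapshotId"], by_role["logs"]["SnapshotId"]

    # Fallback (assumed heuristic): the larger volume holds the data.
    if len(same_day) != 2:
        raise LookupError(f"expected 2 snapshots on {target}, found {len(same_day)}")
    logs, data = sorted(same_day, key=lambda s: s["VolumeSize"])
    if logs["VolumeSize"] == data["VolumeSize"]:
        raise LookupError(f"cannot tell data from logs by size on {target}")
    return data["SnapshotId"], logs["SnapshotId"]
```

Raising on an ambiguous match rather than guessing keeps a drill from silently restoring the wrong volume.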

Naming:

  • Default instance name = <sourceName>-<YYYY-MM-DD> when TARGET_INSTANCE_NAME is not set.
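
As a small illustration of the naming rule, the helper below builds the default name. The function name is hypothetical, and which date is used (snapshot date or run date) is left to the caller because the rule above does not say.

```python
from datetime import date
from typing import Optional


def restore_instance_name(
    source_name: str,
    restore_date: date,
    target_name: Optional[str] = None,
) -> str:
    """Return TARGET_INSTANCE_NAME if given, else <sourceName>-<YYYY-MM-DD>."""
    if target_name and target_name.strip():
        return target_name.strip()
    # date.isoformat() yields YYYY-MM-DD.
    return f"{source_name}-{restore_date.isoformat()}"
```

Keeping the date as a fixed-format suffix is what lets the cleanup pipeline later select instances by RESTORE_DATE.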

Outputs:

  • New INSTANCE_ID and public/private IP
  • Attached volume IDs for root, /dev/sdb, /dev/sdc
  • Slack thread updates for each stage
  • Bounded waits: every long wait is guarded by a timeout and fails fast with context
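
The bounded-wait behavior can be sketched as a generic polling helper. This is an illustration rather than the pipeline's code: the function name, the polling interval, and the injectable clock and sleep parameters (there to make the helper testable) are all assumptions.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    what: str = "condition",
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll check() until it returns True; return the number of checks made.

    Raises TimeoutError naming the stage once the deadline has passed.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(f"{what} not met after {timeout_s}s ({attempts} checks)")
        sleep(interval_s)
```

Passing a descriptive `what` (for example "instance status checks") is one way to get the "context" into the failure message that reaches Slack or the build log.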

2) Cleanup pipeline: terminate-mongo-restored-instance

Purpose: Find and terminate MongoDB “restore” EC2 instances.

Selection (exactly one):

  • INSTANCE_ID(s) OR INSTANCE_NAME(s) OR PRIVATE_IP(s) OR RESTORE_DATE (matches Name ending with -YYYY-MM-DD)
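
A sketch of the "exactly one selector" rule follows. The plural parameter names and the comma-separated value format are assumptions made for illustration; the real job may name and parse them differently.

```python
import re
from typing import List, Mapping, Tuple

# Hypothetical parameter names for the four selectors.
SELECTORS = ("INSTANCE_IDS", "INSTANCE_NAMES", "PRIVATE_IPS", "RESTORE_DATE")
_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_selector(params: Mapping[str, str]) -> Tuple[str, List[str]]:
    """Return (selector_name, values); reject zero or multiple selectors."""
    chosen = {}
    for key in SELECTORS:
        value = (params.get(key) or "").strip()
        if value:
            chosen[key] = value
    if len(chosen) != 1:
        got = ", ".join(sorted(chosen)) or "none"
        raise ValueError(f"exactly one selector is required, got: {got}")

    key, raw = next(iter(chosen.items()))
    values = [v.strip() for v in raw.split(",") if v.strip()]
    if key == "RESTORE_DATE" and not all(_DATE_RE.fullmatch(v) for v in values):
        raise ValueError("RESTORE_DATE must be YYYY-MM-DD")
    return key, values


def matches_restore_date(instance_name: str, restore_date: str) -> bool:
    """True when the Name ends with -YYYY-MM-DD for the given date."""
    return instance_name.endswith(f"-{restore_date}")
```

Rejecting ambiguous input outright is the conservative choice for a job whose end result is instance termination.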

Defaults:

  • DRY_RUN=true (permission check only). Set DRY_RUN=false to actually terminate.

Flow:

  1. Validate selector (mutually exclusive) and resolve to instance IDs.
  2. Terminate the selected instances (or perform a dry run) and wait until termination completes.
  3. Send Slack updates (initial message suppressed if DRY_RUN=true).
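
Step 2 of this flow can be sketched as below, assuming the job talks to EC2 through a boto3 client (the case study does not say which SDK or CLI it uses). With DryRun=True, EC2 does not terminate anything; it answers with a DryRunOperation error when the caller has permission, which is what makes the dry run a permission check. The function name and return values are hypothetical.

```python
from typing import Any, List


def terminate_instances(ec2: Any, instance_ids: List[str], dry_run: bool = True) -> str:
    """Terminate instances, or only verify permission when dry_run is True.

    `ec2` is assumed to be a boto3 EC2 client (or an object with the same
    terminate_instances / get_waiter methods).
    Returns "dry-run-ok" or "terminated".
    """
    try:
        ec2.terminate_instances(InstanceIds=instance_ids, DryRun=dry_run)
    except Exception as exc:
        # botocore's ClientError carries the AWS error code in .response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if dry_run and code == "DryRunOperation":
            return "dry-run-ok"  # permitted, nothing was terminated
        raise  # e.g. UnauthorizedOperation or an unknown instance ID

    if dry_run:
        return "dry-run-ok"  # defensive: never wait on a dry run

    # Block until EC2 reports every instance as terminated.
    ec2.get_waiter("instance_terminated").wait(InstanceIds=instance_ids)
    return "terminated"
```

Defaulting `dry_run` to True mirrors the pipeline default, so a real termination always requires an explicit opt-in.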

3) Health & Integrity Validation

  • Scripted checks after restart: basic connectivity, index counts, collection sizes, and representative queries.
  • Export diagnostic artifacts (logs, validation summaries) for audit and triage.
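
The index-count and collection-size checks could look roughly like the sketch below. It works on plain Python values so it stays independent of how the numbers are collected (for example with a MongoDB driver or shell script). The class and function names, the exact-match rule for index counts, and the minimum-document threshold are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CollectionStats:
    documents: int
    indexes: int


def validate_collections(
    observed: Dict[str, CollectionStats],
    baseline: Dict[str, CollectionStats],
) -> List[str]:
    """Compare restored collections to a baseline; return failure messages.

    Assumed rules: index counts must match exactly, and document counts
    must be at least the baseline value. An empty list means all passed.
    """
    failures: List[str] = []
    for name, expected in sorted(baseline.items()):
        got = observed.get(name)
        if got is None:
            failures.append(f"{name}: collection missing")
            continue
        if got.indexes != expected.indexes:
            failures.append(
                f"{name}: expected {expected.indexes} indexes, found {got.indexes}"
            )
        if got.documents < expected.documents:
            failures.append(
                f"{name}: expected at least {expected.documents} documents, "
                f"found {got.documents}"
            )
    return failures
```

The returned list can be written out as the validation summary artifact, with a non-empty list failing the build.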

4) CI/CD Integration

  • Entry points via Jenkins pipelines for scheduled DR drills and ad‑hoc invocations.

Technologies Used

  • Jenkins
  • AWS (EC2/EBS snapshots, IAM, S3 for artifacts)
  • MongoDB tools (mongorestore, mongodump as needed)
  • Kubernetes (optional validation jobs)
  • Shell scripting

Results Achieved

  • Reliable DR exercises with consistent runbooks
  • Lower RTO by automating restore and validation
  • Cost control via automatic cleanup of helper resources
  • Auditable artifacts for compliance and post‑mortems

Key Learnings

  • Codify validation beyond “service up” checks to verify data quality
  • Prefer temporary, tagged resources for DR to simplify teardown
  • Store artifacts centrally for audits and continuous improvement

© 2026 Bisman Singh. Built with passion for DevOps and automation.