Multi-tenant “Prefix” Kubernetes Environment Automation
Idempotent provisioning for AWS/EKS with EFS, ALB, Route53, CloudFront, SSM, and Jenkins
Key Metrics
- Provisioning Time: 30-40 minutes
- Rerun Behavior: Idempotent
The Challenge
Bringing up multiple staging environments (“prefixes”) consistently is hard. Teams needed a safe, repeatable, and mostly idempotent way to provision all shared primitives (EFS mounts, security groups, ALB + Route53, CloudFront for two domains, S3, SSM parameters, and bootstrap Jenkins jobs) before rolling out the core workloads on EKS.
The Solution
1) Idempotent Infrastructure with Terraform
- Reuse-first strategy: when a resource already exists (e.g., a security group), import it into Terraform state; otherwise create it.
- Detect and fix drift in-place where safe.
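The reuse-first decision can be sketched as a small planning step that runs before `terraform apply`. This is a minimal illustration, not the project's actual tooling: the function name, the resource addresses, and the `discovered` mapping (address to AWS ID, filled by some earlier lookup) are all hypothetical. Only the `terraform import ADDRESS ID` command form comes from the Terraform CLI.

```python
from typing import Dict, List, Optional, Set


def plan_imports(
    desired: List[str],
    in_state: Set[str],
    discovered: Dict[str, Optional[str]],
) -> List[List[str]]:
    """Build `terraform import` commands for resources that already exist
    in AWS but are not yet tracked in Terraform state.

    Resources that are neither in state nor discovered are skipped here;
    a later `terraform apply` would create them.
    """
    commands = []
    for address in desired:
        if address in in_state:
            continue  # already managed, nothing to import
        resource_id = discovered.get(address)
        if resource_id:
            commands.append(["terraform", "import", address, resource_id])
    return commands
```

Returning argument lists rather than running the commands keeps the decision easy to inspect and test before anything touches state.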
2) EFS Reuse and Repair
- Locate filesystems via tags, verify mount targets, add mount targets for any AZ that lacks one, and propagate tags.
- If reuse isn’t possible, restore from backups and tag consistently for future reuse.
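As a rough sketch of the reuse check, the two functions below work on plain dictionaries shaped like the EFS `DescribeFileSystems` and `DescribeMountTargets` responses. The tag key and value, the AZ-to-subnet mapping, and the function names are assumptions made for illustration; the backup-restore path is not covered.

```python
from typing import Dict, List, Optional


def select_filesystem(file_systems: List[dict], tag_key: str, tag_value: str) -> Optional[dict]:
    """Return the first available filesystem carrying the given tag, or None."""
    for fs in file_systems:
        if fs.get("LifeCycleState") != "available":
            continue
        tags = {t["Key"]: t["Value"] for t in fs.get("Tags", [])}
        if tags.get(tag_key) == tag_value:
            return fs
    return None


def missing_mount_targets(mount_targets: List[dict], subnet_by_az: Dict[str, str]) -> Dict[str, str]:
    """Return {az: subnet_id} for every wanted AZ without a usable mount target."""
    covered = {
        mt["AvailabilityZoneName"]
        for mt in mount_targets
        if mt.get("LifeCycleState") in ("available", "creating")
    }
    return {az: subnet for az, subnet in subnet_by_az.items() if az not in covered}
```

An empty result from `missing_mount_targets` means a rerun has nothing to add for that filesystem.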
3) ALB + Route53 “Safe Apply”
- Create/attach ALB and target groups; for DNS, check for existing records and apply only when changes are needed.
- This avoids DNS flapping and respects existing hosted zones and TTLs.
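The "apply only when changes are needed" check for DNS might look like the sketch below. It assumes an alias A record pointing at the ALB and compares it against records shaped like a Route53 `ListResourceRecordSets` response; the function names are hypothetical, and the normalization covers only letter case and the trailing dot.

```python
from typing import List, Optional


def _norm(name: str) -> str:
    """Route53 returns names with a trailing dot; compare without it."""
    return name.rstrip(".").lower()


def alias_change(existing_records: List[dict], desired: dict) -> Optional[dict]:
    """Return a Route53 change batch, or None when the record already matches."""
    wanted = desired["AliasTarget"]
    for rec in existing_records:
        same_record = (
            _norm(rec["Name"]) == _norm(desired["Name"])
            and rec["Type"] == desired["Type"]
        )
        if not same_record:
            continue
        current = rec.get("AliasTarget")
        if (
            current
            and _norm(current["DNSName"]) == _norm(wanted["DNSName"])
            and current["HostedZoneId"] == wanted["HostedZoneId"]
        ):
            return None  # already correct, so no API call and no flapping
        break  # record exists but differs: fall through to UPSERT
    return {"Changes": [{"Action": "UPSERT", "ResourceRecordSet": desired}]}
```

A `None` result lets the caller skip the change request entirely, which is what keeps reruns quiet.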
4) CloudFront by Alias with In-Place Updates
- Find distributions by alias: if one exists and has drifted, update it in place; if none exists, create it.
- S3 bucket policy updates are deferred until the OAI/OAC is available, to avoid race conditions.
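A minimal sketch of the lookup-by-alias decision follows, operating on items shaped like the CloudFront `ListDistributions` summaries. The drift check is deliberately simplified and is an assumption: it only compares origin domain names, whereas a real check would likely compare more of the distribution config. Function names are hypothetical.

```python
from typing import List, Optional, Tuple


def find_distribution_by_alias(distributions: List[dict], alias: str) -> Optional[dict]:
    """Return the distribution that serves `alias`, or None."""
    for dist in distributions:
        # "Items" can be absent when a distribution has no aliases.
        aliases = dist.get("Aliases", {}).get("Items", [])
        if alias.lower() in (a.lower() for a in aliases):
            return dist
    return None


def decide_action(
    distributions: List[dict], alias: str, desired_origin: str
) -> Tuple[str, Optional[str]]:
    """Return ("create" | "update" | "noop", distribution_id_or_None)."""
    dist = find_distribution_by_alias(distributions, alias)
    if dist is None:
        return ("create", None)
    origins = [o["DomainName"] for o in dist.get("Origins", {}).get("Items", [])]
    if desired_origin not in origins:
        return ("update", dist["Id"])  # simplified drift: origin mismatch
    return ("noop", dist["Id"])
```

Only the "create" and "update" outcomes would lead to an API call; the bucket policy step can then run once the distribution and its OAI/OAC are known to exist.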
5) SSM Parameters and Secrets
- Create parameters only when they are missing; leave existing values intact to prevent accidental rotation.
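One way to get create-only behavior is to let SSM enforce it: calling `put_parameter` with `Overwrite=False` fails with `ParameterAlreadyExists` when the name is taken. The sketch below assumes `ssm` is a boto3 SSM client (or anything with the same interface); the function name and the default parameter type are illustrative choices, not taken from the project.

```python
def ensure_parameter(ssm, name: str, value: str, param_type: str = "SecureString") -> bool:
    """Create the parameter if it is missing.

    Returns True when the parameter was created, False when it already
    existed. An existing value is never read, compared, or overwritten.
    """
    try:
        ssm.put_parameter(Name=name, Value=value, Type=param_type, Overwrite=False)
        return True
    except ssm.exceptions.ParameterAlreadyExists:
        return False
```

Relying on the service-side check avoids a read-then-write race between two concurrent provisioning runs.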
6) Jenkins Job Templating
- Seed jobs from XML templates; if a job already exists, use it as‑is (no in‑place API mutation) to avoid breaking running automations.
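The seed-or-skip rule can be illustrated with a small templating helper. This sketch assumes `$NAME`-style placeholders in the XML template and a set of existing job names obtained elsewhere; both, along with the function names, are hypothetical, and the source does not say which templating mechanism is actually used.

```python
from string import Template
from typing import Dict, Optional, Set
from xml.sax.saxutils import escape


def render_job_xml(template_xml: str, params: Dict[str, str]) -> str:
    """Fill placeholders, XML-escaping each value so it cannot break the config."""
    safe = {key: escape(value) for key, value in params.items()}
    # substitute() raises KeyError if the template names a missing parameter.
    return Template(template_xml).substitute(safe)


def seed_job(
    existing_jobs: Set[str], name: str, template_xml: str, params: Dict[str, str]
) -> Optional[str]:
    """Return rendered config XML for a new job, or None if the job exists."""
    if name in existing_jobs:
        return None  # conservative: never mutate an existing job
    return render_job_xml(template_xml, params)
```

A `None` result means nothing is sent to Jenkins, so a rerun cannot disturb a job that is already in use.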
7) Core App Deployment
- After infra is ready, shell scripts apply Kubernetes manifests/Helm charts to bring up core services in the new prefix.
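The project drives this step from shell scripts; the Python sketch below only illustrates the kind of commands such a script might issue. It assumes each prefix maps to a Kubernetes namespace of the same name that already exists, which the source does not state, and the manifest directories, release names, and chart paths are placeholders.

```python
from typing import Dict, List


def deploy_commands(
    prefix: str, manifest_dirs: List[str], charts: Dict[str, str]
) -> List[List[str]]:
    """Build kubectl/helm invocations for one prefix.

    Both `kubectl apply` and `helm upgrade --install` can be rerun
    against an existing deployment, which suits a rerunnable rollout.
    """
    commands = [["kubectl", "apply", "-n", prefix, "-f", d] for d in manifest_dirs]
    for release, chart in charts.items():
        commands.append(
            ["helm", "upgrade", "--install", release, chart, "--namespace", prefix]
        )
    return commands
```

Each list could be passed to `subprocess.run(cmd, check=True)` so that the first failing command stops the rollout.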
Technologies Used
- Terraform (EFS, SGs, ALB/ELBv2, Route53, CloudFront)
- AWS (EFS, EC2, ELBv2, S3, CloudFront, Route53, Backup, SSM)
- Amazon EKS + kubectl
- Jenkins
- Shell scripting
Results Achieved
- Consistent, repeatable bring‑up of new staging prefixes
- Safe reruns: reuse what already exists, repair drift, and keep destructive changes to a minimum
- Shorter onboarding time for new environments and services
- Clear separation between infra provisioning and app rollout
Key Learnings
- Favor safe imports and tag‑based discovery to preserve existing assets
- Make CloudFront OAI/OAC steps resilient to ordering to avoid policy races
- Keep Jenkins job creation conservative; mutate via code reviews, not API
- Separate infra and app deployment to improve safety and debuggability