Project Overview:
- Build an automated system that can back up a running Kubernetes application, destroy the existing cluster, recreate identical infrastructure, and fully restore the application with a single click.
- The system must perform automatic backups of Kubernetes resources and persistent data, store backups securely, and verify their integrity before and after restore.
Objectives and Deliverables:
- Implement a one-click disaster recovery workflow that creates and validates backups, destroys the source cluster in a controlled manner, provisions a new identical cluster via Infrastructure-as-Code, and restores the application to full working order.
- Deliver working automation scripts/playbooks, IaC templates, verification tests, and documentation demonstrating end-to-end restoration of an example application.
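As an illustration of the workflow ordering described above, the following is a minimal Python sketch, not a prescribed design. It assumes each stage is exposed as a callable returning True on success; the names `Step` and `run_pipeline` are hypothetical. The point it shows is that the destructive stage can only run once every earlier stage, including backup validation, has succeeded.

```python
from typing import Callable, List, Tuple

# Each step is a (name, action) pair; the action returns True on success.
# These names are illustrative and not part of any existing tool.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that reports failure."""
    completed: List[str] = []
    for name, action in steps:
        if not action():
            # Later steps (for example the cluster teardown) never run.
            raise RuntimeError(
                f"step {name!r} failed; completed so far: {completed}"
            )
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder actions; a real workflow would call Velero, Ansible, etc.
    demo_steps: List[Step] = [
        ("backup", lambda: True),
        ("validate", lambda: True),
        ("destroy", lambda: True),
        ("provision", lambda: True),
        ("restore", lambda: True),
    ]
    print(run_pipeline(demo_steps))
```

A one-click UI could trigger a single call to such a function and report the name of the failing step if an exception is raised.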
Technologies & Tools:
- Use Velero to back up and restore Kubernetes resources and persistent volume data.
- Use Ansible and Kubespray (or equivalent IaC/provisioning tools) to provision an identical cluster automatically; develop glue code in Python and provide a minimal React JS UI for the one-click operation.
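To make the role of the Python glue code more concrete, here is a hedged sketch of how it might assemble Velero CLI invocations. It assumes the `velero` binary is installed and already configured with a backup storage location; the helper names `backup_command` and `restore_command` are hypothetical, and the functions only build argument lists without running anything.

```python
from typing import List


def backup_command(backup_name: str, namespace: str) -> List[str]:
    """Build the argument list for a Velero backup of one namespace."""
    return [
        "velero", "backup", "create", backup_name,
        "--include-namespaces", namespace,
        "--wait",  # block until the backup completes
    ]


def restore_command(restore_name: str, backup_name: str) -> List[str]:
    """Build the argument list for restoring from an existing backup."""
    return [
        "velero", "restore", "create", restore_name,
        "--from-backup", backup_name,
        "--wait",  # block until the restore completes
    ]
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)`, so a non-zero exit code from Velero raises an exception instead of being silently ignored.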
Responsibilities & Tasks:
- Design and implement backup orchestration, integrity checks, and restore procedures for both Kubernetes resources and persistent volumes.
- Implement automated cluster teardown and provisioning using Infrastructure-as-Code; ensure the new cluster is functionally identical (networking, storage, RBAC, CRDs) and test restores end-to-end.
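One possible way to check that the new cluster is functionally identical is to compare inventories of named objects (CRDs, storage classes, cluster roles, and so on) captured from both clusters. The sketch below assumes those inventories have already been collected into dictionaries of sets, for example by parsing `kubectl get` output; the type alias `Inventory`, the function `inventory_diff`, and the sample names are all hypothetical.

```python
from typing import Dict, Set

# Maps a resource kind (e.g. "crd") to the set of object names found.
Inventory = Dict[str, Set[str]]


def inventory_diff(source: Inventory, target: Inventory) -> Dict[str, Dict[str, Set[str]]]:
    """Report, per kind, names missing from or unexpectedly present in target."""
    report: Dict[str, Dict[str, Set[str]]] = {}
    for kind in sorted(set(source) | set(target)):
        missing = source.get(kind, set()) - target.get(kind, set())
        extra = target.get(kind, set()) - source.get(kind, set())
        if missing or extra:
            report[kind] = {"missing": missing, "extra": extra}
    return report


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    old = {"crd": {"backups.velero.io"}, "storageclass": {"fast", "standard"}}
    new = {"crd": {"backups.velero.io"}, "storageclass": {"standard"}}
    print(inventory_diff(old, new))
```

An empty report means the two inventories match. This checks names only; comparing specifications (for example RBAC rules or network policies) would need a deeper diff.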
Required Profile:
- Engineering profile: 1 trainee (intern level) with interest or experience in DevOps, Kubernetes, cloud infrastructure, scripting, and automation.
- Desired skills: Kubernetes administration, Velero experience (preferred), Ansible and IaC experience, Python scripting, and familiarity with Kubespray and basic React JS for a minimal control UI.
Verification & Testing:
- Create automated verification steps to validate backup integrity and confirm application correctness after restore (smoke tests, data consistency checks).
- Document failure modes and recovery times; provide logs and reproducible test scenarios to demonstrate reliability.
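A simple form of integrity check is to record a checksum when a backup artifact is written and compare it again before restoring. The sketch below assumes the artifact is available as a local file (for example a downloaded copy of the backup archive); the helper names `sha256_of` and `is_intact` are hypothetical. It is one possible building block, not a complete verification strategy.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, expected_digest: str) -> bool:
    """Compare the current digest of a file with the one recorded earlier."""
    return sha256_of(path) == expected_digest
```

The digest recorded at backup time should be stored separately from the artifact itself, so that corruption of the artifact cannot also corrupt the reference value.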
How to Apply:
- Apply online via the trainees platform: https://trainees-platform.proxym-group.net
- In your application, reference the project REF PRX-2026-17 and the title "32 Automated System for Kubernetes Disaster Recovery PFE"; include a brief CV and relevant experience with Kubernetes/DevOps.