Dynamic RAG Dataset for Multi-Cloud HPC

ReDX Technologies

Internship · Hybrid · 4 to 6 months · Paid · Application deadline: December 25, 2025

Machine Learning (LLM) · Cloud Architecture · ETL / Data Engineering · High-Performance Computing

Description

Teach an LLM to act like a cloud architect for HPC: map HPC code/repositories to concrete hosting plans across AWS, GCP, Azure, and heterogeneous multi-cloud setups.
Collect deployable architecture examples and build a provider-agnostic schema and an auto-refreshed knowledge base of SKUs, prices, and capabilities to ground the model in facts.

Collect & curate a dataset of single-cloud and multi-cloud deployable architectures with short rationales, cost snapshots, and optional diagrams.
Build provider component catalogs (compute, storage, network, scheduler) including specs, limits, regional availability, and pricing for at least three providers.
Produce an auto-refreshed RAG dataset (e.g., weekly) and a simple retrieval API to keep SKUs, prices, and limits up to date.
Fine-tune LLM layers on the curated dataset (config + checkpoints or LoRA adapters) and deliver a final capability demo where the LLM recommends end-to-end cloud hosting architectures.
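
As an illustration of what a provider-agnostic catalog entry and a weekly refresh check might look like, here is a minimal Python sketch. It is not a prescribed design: the `ComputeSku` class, its field names, and the `is_stale` helper are hypothetical, and the example values are placeholders rather than real SKUs or prices.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ComputeSku:
    """One provider-agnostic compute entry (all field names are hypothetical)."""
    provider: str        # e.g. "aws", "gcp", "azure"
    sku: str             # the provider's own SKU name, stored verbatim
    region: str
    vcpus: int
    memory_gib: float
    gpus: int
    usd_per_hour: float  # on-demand price at snapshot time
    snapshot: date       # day the specs and price were collected


def is_stale(entry: ComputeSku, today: date, max_age_days: int = 7) -> bool:
    """Return True when the entry is older than the refresh interval (weekly by default)."""
    return today - entry.snapshot > timedelta(days=max_age_days)


# Placeholder values for illustration only, not a real SKU or price.
example = ComputeSku(
    provider="provider-a",
    sku="example-sku-1",
    region="region-1",
    vcpus=64,
    memory_gib=256.0,
    gpus=0,
    usd_per_hour=1.0,
    snapshot=date(2025, 1, 1),
)
```

Storing the snapshot date on every record lets a scheduled job decide which entries to re-fetch, and lets the retrieval API report how old a quoted price is.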

Parse user hints (desired hardware/architecture/budget), validate against codebases, and map requirements to best-fit cloud components per layer (compute, storage, network, HPC cluster choices), ranking providers.
Implement end-to-end architecture synthesis grounded in RAG; optionally emit diagrams to illustrate the designs.
Implement scheduled refresh mechanisms for provider data to ensure recommendations remain current and factual.
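
The requirement-to-component mapping described above could start from something as simple as filtering and ranking catalog entries against the user's hints. The sketch below assumes a flat list of compute offers; the `Offer` type, the `rank_offers` function, and every entry in `CATALOG` are hypothetical placeholders, and a real system would rank on more than price.

```python
from typing import List, TypedDict


class Offer(TypedDict):
    """A single compute offer (hypothetical fields)."""
    provider: str
    sku: str
    vcpus: int
    memory_gib: float
    usd_per_hour: float


# Placeholder catalog for illustration only, not real SKUs or prices.
CATALOG: List[Offer] = [
    {"provider": "provider-a", "sku": "a-small", "vcpus": 16, "memory_gib": 64.0, "usd_per_hour": 0.5},
    {"provider": "provider-b", "sku": "b-large", "vcpus": 64, "memory_gib": 256.0, "usd_per_hour": 1.8},
    {"provider": "provider-c", "sku": "c-large", "vcpus": 64, "memory_gib": 256.0, "usd_per_hour": 1.5},
    {"provider": "provider-a", "sku": "a-huge", "vcpus": 128, "memory_gib": 512.0, "usd_per_hour": 4.0},
]


def rank_offers(
    offers: List[Offer],
    min_vcpus: int,
    min_memory_gib: float,
    max_usd_per_hour: float,
) -> List[Offer]:
    """Keep the offers that satisfy the user's hints and sort them cheapest first."""
    feasible = [
        o for o in offers
        if o["vcpus"] >= min_vcpus
        and o["memory_gib"] >= min_memory_gib
        and o["usd_per_hour"] <= max_usd_per_hour
    ]
    # Break price ties by provider name so the ordering is deterministic.
    return sorted(feasible, key=lambda o: (o["usd_per_hour"], o["provider"]))


ranked = rank_offers(CATALOG, min_vcpus=32, min_memory_gib=128.0, max_usd_per_hour=2.0)
```

The same filter-then-rank step would be repeated per layer (compute, storage, network, scheduler) before the LLM assembles the pieces into a full architecture.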

ML/NLP basics: dataset design, prompt/response schemas, instruction fine-tuning.
Cloud literacy: familiarity with AWS/GCP/Azure building blocks (instances/VMs, storage, regions, pricing) and cloud components for HPC.
Data tooling: Python, JSON, simple ETL/versioning; basic vector search/RAG.
Good software practices: Git, reproducibility, documentation, validation, and safety guardrails.

Deliver a cleaned, preprocessed dataset in the common schema, an auto-refreshed RAG dataset covering at least three providers, fine-tuned LLM artifacts, a retrieval API, a live demo, and a technical report documenting schema, curation, refresh, fine-tuning, and evaluation results.

Recommended duration: 6 months (the listing accepts 4 to 6 months).
Compensation: monthly stipend, with a possible end-of-internship performance bonus and possible co-authorship on a published paper.

📧 To apply: contact@redxt.com