Context & Motivation
- HPC network fabrics (InfiniBand, high-speed Ethernet) are critical for scaling performance, latency predictability, energy efficiency, and TCO in large clusters.
- Knowledge is fragmented across vendor docs (NVIDIA/Mellanox, Intel, Broadcom), standards bodies (IEEE, OpenFabrics Alliance), whitepapers, and community guides, which evolve frequently.
Goal & Problem Statement
- Build an autonomous, domain-specialized, LLM-assisted networking design assistant (GPTFabrics) that discovers, interprets, normalizes, and reasons over InfiniBand and Ethernet information.
- The assistant should expose validated insights to the Cluster Configurator, covering topology design, bandwidth/latency estimation, congestion control, and offload compatibility.
Required Features & System Capabilities
- Discover relevant networking information from heterogeneous sources without relying on fixed structures or vendor-specific assumptions.
- Extract and normalize key fields: link speed (Gb/s), port count, ASIC generation, latency, buffer depth, PFC/ECN support, RDMA/RoCE capabilities, routing type, topology constraints, and oversubscription ratios.
- Support design reasoning (e.g., topology non-blocking checks, oversubscription calculations) and estimate topology-aware metrics (bisection bandwidth, port utilization) while flagging configuration issues.
- Track changes over time, detect new hardware generations or unfamiliar field names, and flag them for human inspection rather than failing silently.
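
To make the normalization and design-reasoning features above more concrete, here is a minimal Python sketch. It is illustrative only: the alias table `FIELD_ALIASES`, the canonical field names, the `LeafSwitch` type, and all function names are assumptions made for this example and are not part of any existing GPTFabrics code or vendor API. It maps raw field names to a canonical schema, collects unfamiliar names for human review instead of dropping them, and computes a leaf-switch oversubscription ratio as downlink capacity divided by uplink capacity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical alias table: vendor-specific field names -> canonical names.
FIELD_ALIASES: Dict[str, str] = {
    "link speed": "link_speed_gbps",
    "port speed": "link_speed_gbps",
    "ports": "port_count",
    "port count": "port_count",
}


def normalize_fields(raw: Dict[str, object]) -> Tuple[Dict[str, object], List[str]]:
    """Map raw field names to canonical ones; return unknown names separately."""
    normalized: Dict[str, object] = {}
    unknown: List[str] = []
    for name, value in raw.items():
        key = FIELD_ALIASES.get(name.strip().lower())
        if key is None:
            # Unfamiliar field: keep it visible for human inspection.
            unknown.append(name)
        else:
            normalized[key] = value
    return normalized, unknown


@dataclass
class LeafSwitch:
    """Hypothetical description of one leaf switch in a leaf-spine fabric."""
    downlinks: int        # ports facing compute nodes
    downlink_gbps: float  # speed of each downlink
    uplinks: int          # ports facing spine switches
    uplink_gbps: float    # speed of each uplink


def oversubscription_ratio(leaf: LeafSwitch) -> float:
    """Total downlink capacity divided by total uplink capacity."""
    down = leaf.downlinks * leaf.downlink_gbps
    up = leaf.uplinks * leaf.uplink_gbps
    if up <= 0:
        raise ValueError("leaf switch has no uplink capacity")
    return down / up


def is_non_blocking(leaf: LeafSwitch) -> bool:
    """A leaf is treated as non-blocking when the ratio does not exceed 1:1."""
    return oversubscription_ratio(leaf) <= 1.0
```

For instance, a leaf with 48 × 100 Gb/s downlinks and 8 × 400 Gb/s uplinks has 4800 Gb/s facing the nodes and 3200 Gb/s facing the spine, so this sketch reports a 1.5:1 oversubscription ratio and does not classify it as non-blocking.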
Tools, Data Sources & Resources
- Hardware resources: 2–8 × NVIDIA H100 GPUs, Lustre filesystem, HPC cluster access.
- Software ecosystem: HuggingFace Transformers, NVIDIA NeMo, FAISS, LangChain or LlamaIndex; data sources include vendor docs, IEEE papers, OpenFabrics Alliance resources, and HPC case studies.
Learning Objectives & Evaluation
- Analyze differences between InfiniBand and Ethernet in latency, throughput, congestion behavior, and scalability.
- Design robust data and retrieval pipelines resilient to heterogeneous and evolving documentation formats; compare reasoning strategies (rule-based, semantic search, LLM prompting, fine-tuning, hybrid RAG).
- Define evaluation metrics (precision/recall for field extraction, correctness of topology reasoning, hallucination rate, adaptability to new generations) and communicate trade-offs and limitations.
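
As one possible starting point for the field-extraction metric above, the following sketch scores extracted (field, value) pairs against a hand-labelled reference. The function name and the exact-match scoring rule are assumptions chosen for illustration; a real evaluation might need unit-aware or tolerance-based matching instead.

```python
from typing import Dict, Hashable, Tuple


def extraction_precision_recall(
    predicted: Dict[str, Hashable], gold: Dict[str, Hashable]
) -> Tuple[float, float]:
    """Exact-match precision and recall over (field, value) pairs.

    Values must be hashable (numbers, strings, booleans, tuples).
    """
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    true_positives = len(pred_pairs & gold_pairs)
    # Define the score as 0.0 when the denominator is empty.
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall
```

With a reference of three fields and a prediction containing two fields, one of which has the wrong value, this yields a precision of 1/2 and a recall of 1/3.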
Required Skills
- Python programming: HTTP requests, HTML parsing, data processing.
- Networking fundamentals: InfiniBand, Ethernet, RDMA/RoCE, PFC/ECN, topology design (fat-tree, Dragonfly, leaf-spine).
- Data normalization and database management: schema definition, DuckDB/PostgreSQL.
- Introductory AI/LLM skills (preferred): prompting, mapping unstructured fields into schemas, producing factual summaries of HPC network concepts.
Duration & Compensation
- Duration: 4-6 months as listed, with 6 months recommended.
- Compensation: Monthly stipend, with a possible end-of-internship performance bonus and possible co-authorship on a resulting paper.
📧 To apply: contact@redxt.com