Our Guide to AI Infrastructure: Understanding DGX, HGX, MGX, and More

A blog by Allan Kaye, CEO and co-founder at Vespertec

Release date: 3 July 2024

Decoding Reference Architectures and Product Categories

 

As an Elite NVIDIA compute partner, Vespertec is immersed in the AI infrastructure and accelerated computing space on a daily basis. We frequently work with different GPU form factors and system types when designing or recommending platforms. Given the prevalence of acronyms like DGX and HGX in this domain, we thought it would be useful to provide a summary explaining these terms and how we approach deploying GPU-accelerated computing solutions.

This post aims to demystify the various product names and form factors, enabling readers to better understand the landscape of AI hardware infrastructure.

 

Exploring Form Factors and Connectivity – PCIe, SXM, NVLink, and Superchip

The GPU landscape offers two primary form factors and connectivity standards: PCIe cards and SXM modules. PCIe card-based GPUs slot into servers like network cards, requiring an appropriately sized interface (typically x16 full-length) and auxiliary power for more powerful models. GPU-optimized servers provide suitable slots, power, and cooling capabilities.

For PCIe-based GPU servers, a single GPU is the starting point. However, advancements in CPU architecture from AMD and Intel have introduced more and faster PCIe lanes, with the current generation being PCIe Gen 5.

These improvements enable systems to support a greater number of GPUs, NVMe-based storage devices (which also utilise the PCIe interface), and network cards, thereby enhancing overall system performance and capabilities.

Multiple PCIe GPU servers are commonplace, with current systems supporting up to 8-10 double-width or 16 single-width PCIe GPU cards. Inter-GPU communication occurs over the PCIe bus, with maximum throughput determined by the system’s PCIe generation.
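To see how your own system has negotiated these links, you can query the driver directly. Here's a minimal sketch assuming the nvidia-ml-py (pynvml) Python bindings and an installed NVIDIA driver:

```python
# Minimal sketch: report each GPU's current PCIe link generation and width.
# Assumes the nvidia-ml-py package (imported as pynvml) and an NVIDIA driver.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        print(f"GPU {i} ({name}): PCIe Gen{gen} x{width}")
finally:
    pynvml.nvmlShutdown()
```

On a Gen 5 platform with a full-length slot, you'd expect to see "Gen5 x16" per card under load; a lower figure often points to a constrained slot or riser.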

Some PCIe GPU models support NVIDIA’s NVLink technology, enabling high-bandwidth bridging between two GPUs for increased resource pooling. However, this bridging is limited to two GPUs and unsuitable for larger-scale interconnection.

In contrast, SXM GPUs employ a specific board and socket design, highly interconnected via NVLink. Designed for dense, high-performance computing clusters, SXM GPUs deliver serious computational power and energy efficiency. NVLink, a high-speed interconnect, allows multiple SXM GPUs to communicate, enabling improved parallel processing and data transfer rates. NVLink’s higher bandwidth and lower latency excel in workloads requiring significant inter-GPU communication, such as large-scale deep learning and scientific simulations.
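If you want to verify which of your GPUs are actually connected over NVLink, the same pynvml bindings expose per-link state. A minimal sketch follows; on PCIe-only GPUs the link queries simply fail and the count stays at zero:

```python
# Minimal sketch: count active NVLink connections per GPU.
# Assumes nvidia-ml-py (pynvml); non-NVLink devices raise NVMLError on query.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        active = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link):
                    active += 1
            except pynvml.NVMLError:
                break  # no further NVLink links on this device
        print(f"GPU {i}: {active} active NVLink link(s)")
finally:
    pynvml.nvmlShutdown()
```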

Recently, the Superchip architecture has emerged, starting with the Grace Hopper and, more recently, Grace Blackwell superchips. These integrate CPU, GPU, and high-bandwidth memory (HBM) into a single package, representing a significant evolution in NVIDIA’s computing platform. The Superchip architecture offers a tightly integrated and optimised solution for high-performance computing and AI workloads, leveraging unified memory and improved CPU-GPU coupling. It aims to provide higher performance, scalability, and power efficiency compared to traditional GPU solutions. NVIDIA’s 2026 roadmap includes Vera Rubin, a new generation succeeding Blackwell Ultra, underlining the Superchip architecture’s longevity.
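To illustrate the unified-memory idea in practice, here is a minimal sketch using the CuPy library to opt in to CUDA managed memory. CuPy and the managed-memory allocator are our example choices here, not something specific to the Superchip platform; on a coherent CPU-GPU system such as Grace Hopper, managed allocations can migrate between CPU and GPU without explicit copies:

```python
# Minimal sketch: opt in to CUDA managed (unified) memory with CuPy.
# Assumes CuPy installed against a matching CUDA toolkit.
import cupy

# All subsequent CuPy allocations use managed memory, addressable by
# both the CPU and the GPU.
cupy.cuda.set_allocator(cupy.cuda.malloc_managed)

x = cupy.arange(1_000_000, dtype=cupy.float32)
print(float(x.sum()))  # computed on the GPU from managed memory
```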

The choice between PCIe and SXM GPUs has far-reaching implications for procurement strategies and infrastructure planning. While PCIe GPUs offer a cost-effective and flexible solution for moderate computational demands, SXM GPU platforms excel in large-scale, compute-intensive workloads like AI training and scientific simulations. To select the appropriate form factor, carefully assess your current and projected organizational needs, factoring in performance requirements, scalability, and budgetary constraints.

 

System Categories Explained

 

DGX: The Flagship AI Solution

NVIDIA’s DGX system represents the pinnacle of AI computing, designed specifically for demanding training applications and HPC. The system includes 8-way NVLink connectivity, allowing unparalleled GPU-to-GPU communication within the server and across wider clusters of DGX systems via a dedicated network card for each of the eight onboard GPUs (one NIC per GPU).
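On a DGX-class system you can inspect this topology yourself. The sketch below simply shells out to nvidia-smi (assumed to be on the PATH); on an 8-GPU NVLink system the matrix shows NV-series paths between every GPU pair alongside each GPU’s PCIe affinity to its NIC:

```python
# Minimal sketch: print the GPU/NIC interconnect topology matrix.
# Assumes the nvidia-smi utility is installed and on the PATH.
import subprocess

print(subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
).stdout)
```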

The DGX has a fixed hardware bill of materials (BOM), streamlining procurement processes and ensuring consistent performance. Support for both the hardware and the comprehensive software suite is provided directly by NVIDIA, ensuring the closest possible alignment with the vendor.

Here’s a great example of how BMW is using DGX to train AI-powered robots for manufacturing use cases.

NVIDIA has built a comprehensive ecosystem around DGX, including the best tools, infrastructure, and support to enable IT professionals, data scientists, and business units to develop and deploy AI solutions as quickly and efficiently as possible. This ecosystem is designed to reduce the complexity of AI deployments, ensuring that organisations can bring AI technologies to bear without a steep learning curve and making DGX the fastest route to production for AI projects.

Notably, the powerful NVIDIA AI Enterprise cloud-native suite of AI and data analytics software is included within the DGX package, but it can also be added to any other system on a per-GPU basis.

NVIDIA AI Enterprise is a comprehensive, end-to-end software solution that fundamentally changes how businesses deploy AI. It’s a cloud-native platform designed to streamline the entire data science pipeline, from development to deployment.

By offering a suite of easy-to-use microservices, NVIDIA AI Enterprise ensures optimised model performance, coupled with enterprise-grade security, support, and stability. The platform eases the transition from prototype to production, making it an essential tool for any enterprise looking to leverage the full potential of AI to drive their business forward.
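As a purely illustrative sketch of what “easy-to-use microservices” means in practice, here is how a locally deployed inference microservice exposing an OpenAI-compatible HTTP API might be queried from Python. The endpoint URL and model name below are hypothetical placeholders, not values from any specific NVIDIA product documentation:

```python
# Illustrative sketch only: query a locally deployed inference microservice
# over an OpenAI-compatible HTTP API. URL and model name are hypothetical
# placeholders; real deployments expose their own values.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # hypothetical local endpoint
    json={
        "model": "example/llm-model",  # placeholder model identifier
        "messages": [{"role": "user", "content": "Summarise NVLink in one line."}],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```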

The DGX is the only system offered by NVIDIA directly. To understand why NVIDIA took this strategic approach, it helps to look at DGX’s origin story. Initially, there was a gap in the market for optimised hardware solutions for AI training and inference. At the time, original equipment manufacturers (OEMs) were hesitant to develop these solutions, which prompted NVIDIA to create the DGX platform. NVIDIA was confident that it could provide an optimised configuration unmatched by any other offering, and it was proven right. The success of DGX has since influenced every major OEM to develop HGX-optimised solutions based on NVIDIA’s reference architecture, which was born with the DGX platform.

 

HGX: Flexible GPU Solutions

In contrast to the fixed BOM of DGX systems, HGX offerings provide a more customisable solution, allowing users to tailor their configurations to project requirements. That’s the main difference between NVIDIA’s DGX and HGX platforms: DGX is made and directly supported by NVIDIA with a fixed specification.

HGX is an OEM product that uses the same NVIDIA SXM GPU and NVLINK baseboard as the DGX but allows for customisation—for example, to the network card specification or the capacity of the local NVMe storage. The product itself is supported by the OEM vendor concerned: e.g. Gigabyte, QCT, Supermicro, Inspur, or other manufacturers.

HGX systems offer the flexibility to balance performance with budgetary constraints, making them an attractive option if you have a specific use case in mind.

 

DPU (Data Processing Unit)

The DPU, or Data Processing Unit, is a critical component that features in every DGX and many HGX deployments.

The NVIDIA BlueField-3 DPU is the latest generation in this product category. It is installed into systems in the same way as a traditional network card and is used to offload, accelerate, and isolate software-defined networking, storage, security, and management functions.

Offloading these workloads frees up CPU and GPU resources, enabling the system to focus on core computing tasks. By handling these ancillary workloads more efficiently, the DPU enables higher overall system performance.

Although not a GPU or a reference architecture in itself, the BlueField-3 DPU is considered best practice in high-performance GPU clusters. Much of its capability is due to its 16 Arm Cortex-A78 cores, which provide the processing power for advanced networking and security functions.

DPUs are becoming increasingly popular in modern data centres thanks to the DOCA software framework, which provides a wide range of development tools and pre-built packages that significantly enhance the DPU’s utility.

In the context of NVIDIA’s AI and HPC platforms, the DPU plays a vital role in enabling high-speed data movement and low-latency communication between the various system components, ensuring that data bottlenecks don’t impede the performance of the massively parallel GPU workloads.
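A quick way to confirm whether a DPU is present in a system is to look for it on the PCIe bus. This is a heuristic sketch, assuming the lspci utility is available, rather than an official detection method:

```python
# Minimal sketch: check whether a BlueField DPU is visible on the PCIe bus.
# Assumes the lspci utility is installed; matching on the "BlueField" string
# is a heuristic, not an official detection method.
import subprocess

lspci = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
dpus = [line for line in lspci.stdout.splitlines()
        if "bluefield" in line.lower()]
print("\n".join(dpus) if dpus else "No BlueField DPU detected on the PCIe bus.")
```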

 

MGX: Modular Reference Architecture for Cutting-edge Projects

NVIDIA released the MGX reference architecture to streamline and accelerate the process of creating AI and HPC systems for its OEM partners. This framework simplifies system design, standardises components, and significantly reduces development time and costs for new AI/HPC servers. MGX offers pre-validated designs and a modular approach, allowing OEMs to easily mix and match components, reducing engineering effort and accelerating prototyping.

Key features include:

  • Common motherboard designs
  • Standardised power and cooling solutions
  • Flexible GPU configurations
  • Compatibility with various CPUs
  • Integrated networking options

This approach enables OEMs to focus on differentiation and value-added features while facilitating faster adoption of new NVIDIA technologies, resulting in a more consistent experience for end users across different OEM products. Essentially, MGX provides OEMs with a substantial head start in system design, dramatically shortening the time-to-market for new AI and HPC solutions compared to designing from scratch.

These advantages for OEMs translate into tangible benefits for end customers, but how exactly does this impact enterprises seeking AI and HPC solutions?

NVIDIA’s MGX reference architecture offers significant advantages for end customers and enterprises. It accelerates access to cutting-edge AI and HPC solutions, providing a wider range of customisable options from various OEMs. The pre-validated designs enhance reliability and consistency across different offerings, while the modular approach facilitates easier scalability and future-proofing. Potential cost savings arise from standardisation and reduced development costs.

Improved interoperability simplifies integration with existing infrastructure and management across multi-vendor environments. The architecture’s optimised designs contribute to better power efficiency, potentially lowering operating costs and environmental impact. Shorter deployment times and a more consistent support experience further streamline adoption. Ultimately, MGX allows enterprises to focus on their core AI and HPC workloads instead of hardware complexities, fostering faster innovation and improved return on investment in technology.

 

EGX: Empowering AI at the Edge

NVIDIA’s EGX platform brings real-time AI to the edge of enterprise networks, combining NVIDIA GPUs with networking, storage, and security technologies. This architecture enables AI processing at the point of action, rather than in centralized data centres. EGX provides a scalable platform for deploying AI applications in edge environments, from small form-factor devices like the Jetson AGX Xavier to full-scale servers powered by NVIDIA H100 or L40S GPUs.

The EGX platform integrates NVIDIA’s AI software stack, including the NVIDIA AI Enterprise suite, with support for containerised applications. It incorporates built-in security features and remote management capabilities through NVIDIA Fleet Command, creating a comprehensive ecosystem for edge AI deployment.

Various industries stand to benefit significantly from EGX. In manufacturing, EGX can power real-time quality control systems using computer vision, potentially utilizing NVIDIA Metropolis for intelligent video analytics. Healthcare providers can leverage EGX for rapid medical image processing and analysis at the point of care, possibly integrating with NVIDIA Clara for healthcare-specific AI workflows. Retail businesses can employ EGX for in-store analytics and personalized shopping experiences, while smart cities can use it for traffic management and public safety applications.

For enterprises, EGX translates into reduced latency and bandwidth usage by processing data closer to its source. The platform’s scalability allows organisations to start with smaller deployments and expand to more powerful systems as needed. EGX’s support for containerized applications facilitates easier deployment and management of AI workloads across distributed environments.
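As a simple illustration of the containerised model, the sketch below launches a GPU-enabled container and runs nvidia-smi inside it. It assumes Docker with the NVIDIA Container Toolkit is installed; the image tag is a placeholder you would swap for your own workload image:

```python
# Minimal sketch: launch a containerised GPU workload, as an EGX-style edge
# deployment would. Assumes Docker plus the NVIDIA Container Toolkit; the
# image tag below is a placeholder for your own workload image.
import subprocess

result = subprocess.run(
    ["docker", "run", "--rm", "--gpus", "all",
     "nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder CUDA base image
     "nvidia-smi"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```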

The platform’s emphasis on security helps protect sensitive data at the edge, addressing privacy concerns and regulatory requirements. Remote management features through Fleet Command simplify the operation of edge AI systems, reducing the need for on-site IT support. By bringing AI capabilities to the edge with NVIDIA-Certified Systems, EGX enables enterprises to make faster, more informed decisions based on real-time data analysis, leading to improved operational efficiency, enhanced customer experiences, and new opportunities for innovation across industries.

 

Specialised Categories: OVX, AGX, and IGX

While the DGX, HGX, MGX, and EGX categories cater to the majority of AI and HPC workloads, NVIDIA’s line-up also includes specialised offerings for specific project requirements. These categories include OVX, AGX, and IGX:

  • OVX systems are designed to support NVIDIA’s Omniverse platform, enabling collaborative 3D design and digital twin simulations.
  • AGX solutions target embedded edge computing applications, such as autonomous vehicles and robotics.
  • IGX caters to industrial and edge computing needs, offering robust and reliable solutions for harsh environments.

 

Building Optimised Data Centre Infrastructure  

If you’re responsible for compute, networking, storage, or security infrastructure, it’s crucial to understand NVIDIA’s diverse product categories and how they interact with your existing systems and future needs.

Whether your primary metrics revolve around meeting business demand, ensuring uptime and secure storage, or keeping IT spending within budget, working out which reference architecture or category is best for your use case can help you make informed decisions – aligned with both current and future technological needs.

If you’re evaluating your AI or HPC workloads and seeking expert guidance on the best infrastructure to support them, get in touch and we can schedule a chat. As an NVIDIA Elite Compute Partner – and one of only two NVIDIA Elite Networking Partners – our team can help you navigate NVIDIA’s offerings and integrate the perfect AI technologies into your data centre infrastructure.
