Q&A with NVIDIA and Arm: What You Need to Know About the Arm-Based NVIDIA Grace CPU Superchip

By Ben Langstreth, Senior Account Manager at Vespertec, Sergey Shatov, EMEA Senior Sales Manager, Datacenter GPU and CPU at NVIDIA, and Tim Thornton, Director of Arm-based Engineering at Arm.

Release date: 17 September 2024

NVIDIA Grace CPU Superchip: A High-Performance, Energy-Efficient, Arm-based CPU.

In our latest blog, we sit down with Tim Thornton, Director of Arm-based Engineering at Arm, and Sergey Shatov, EMEA Senior Sales Manager, Datacenter GPU and CPU at NVIDIA, to explore the groundbreaking NVIDIA Grace CPU Superchip. This Arm-based CPU is setting new standards for performance and energy efficiency in data centres, particularly for AI and HPC workloads. We discussed why NVIDIA chose the Arm Neoverse V2 architecture, how Grace compares to leading x86 CPUs, and what companies should consider when migrating to this next-generation technology.

 

The NVIDIA Grace architecture is designed for an AI- and cloud-first world, with GPUs accelerating the data centre and the CPU delivering outstanding single-threaded performance, power efficiency, fast memory bandwidth, and a coherent high-speed connection to the GPU or another CPU.

The Grace CPU combines 72 high-performance and power-efficient Arm® Neoverse™ V2 cores, connected with the NVIDIA Scalable Coherency Fabric (SCF) that delivers 3.2 TB/s of bisection bandwidth – double that of traditional CPUs. This architecture keeps data flowing between CPU cores, cache, memory, and system input and output (IO) to maximise system performance. Grace is the first data centre CPU to utilise server-class high-speed LPDDR5X memory, with a wide memory subsystem that delivers 500 GB/s of bandwidth at one fifth the power of traditional DDR memory and at a similar cost.

The Grace CPU Superchip is composed of two Grace CPU chips connected coherently over NVIDIA NVLink Chip-to-Chip (C2C) at 900 GB/s. It packs 144 Neoverse V2 cores into a single module, with server-class LPDDR5X memory that delivers up to 1 TB/s of memory bandwidth. The Grace CPU Superchip forms the heart of a two-socket server in a compact module, delivering 2x the performance in the same power envelope as traditional server CPUs with DDR5 memory.

The Grace CPU can also be found in the NVIDIA Grace Hopper and NVIDIA Grace Blackwell architectures. NVLink-C2C connects the CPU and GPU with 7x the bandwidth of x16 Gen5 PCIe and provides memory coherency to simplify programming. The NVIDIA Grace Hopper™ Superchip pairs the NVIDIA Grace CPU with an NVIDIA Hopper GPU to maximise the capabilities for accelerated computing and generative AI workloads. The NVIDIA GB200 NVL72 combines 36 NVIDIA Grace CPUs and 72 NVIDIA Blackwell GPUs in a liquid-cooled, rack-scale design that unlocks real-time performance on trillion-parameter-scale models.
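As a rough sanity check on the 7x figure, the ratio can be worked out from raw link rates. The sketch below assumes that the 900 GB/s NVLink-C2C figure is a bidirectional total, and compares it with the theoretical bidirectional bandwidth of a PCIe Gen5 x16 link (32 GT/s per lane with 128b/130b encoding). The constant and function names are illustrative only.

```cpp
// Theoretical PCIe Gen5 x16 bandwidth in GB/s, counting both directions.
// 32 GT/s per lane, 128b/130b line encoding, 8 bits per byte, 16 lanes.
constexpr double pcie_gen5_x16_bidirectional_gbs() {
    return 32.0 * (128.0 / 130.0) / 8.0 * 16.0 * 2.0;  // roughly 126 GB/s
}

// NVLink-C2C figure quoted in the text (assumed here to be bidirectional).
constexpr double kNvlinkC2cGbs = 900.0;

// Ratio of NVLink-C2C to PCIe Gen5 x16 bandwidth.
constexpr double nvlink_to_pcie_ratio() {
    return kNvlinkC2cGbs / pcie_gen5_x16_bidirectional_gbs();
}

// Compile-time check that the ratio lands near the quoted 7x.
static_assert(nvlink_to_pcie_ratio() > 7.0 && nvlink_to_pcie_ratio() < 7.2,
              "NVLink-C2C should be roughly 7x PCIe Gen5 x16");
```

Under these assumptions the ratio comes out at a little over 7, which is consistent with the figure quoted above. Real-world PCIe throughput is lower than the theoretical rate once protocol overhead is included.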

 

Why did NVIDIA choose Neoverse V2 cores for Grace CPUs?

The Grace CPU incorporates the Neoverse V2 core, Arm’s most advanced data centre CPU core available in chips today. The Neoverse V2 implements the Armv9 architecture, bringing improvements in security, vector processing (SVE2), and overall performance. The Grace CPU integrates the Neoverse V2 cores with the NVIDIA SCF and high-bandwidth, energy-efficient LPDDR5X memory to deliver flagship x86-class CPU performance at only half the power across hyperscale, enterprise, and HPC applications.

 

Are there compatibility issues between Arm-based Grace architecture and x86, or will applications just run?

The NVIDIA Grace architecture is fully compatible with the Arm ecosystem, ensuring that any application designed for 64-bit Arm in data centres will also operate seamlessly on Grace, and vice versa. To run code written for x86 on Grace, a recompilation is typically all that is required. This is generally a simple job, and even complex code can be enabled on Arm with limited effort.
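Where recompilation needs more than a rebuild, the usual cause is code that is tied to x86, such as vendor intrinsics. A minimal sketch of one common pattern follows: guard the x86-specific path behind the compiler's predefined architecture macros and keep a portable fallback that compiles everywhere, including on Grace. The function names `target_architecture` and `sum_floats` are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>  // x86-only header; must not be included on Arm builds
#endif

// Reports which architecture this code was compiled for.
inline std::string target_architecture() {
#if defined(__aarch64__) || defined(_M_ARM64)
    return "aarch64";
#elif defined(__x86_64__) || defined(_M_X64)
    return "x86_64";
#else
    return "other";
#endif
}

// Sums n floats. The x86 build uses SSE intrinsics for the bulk of the work;
// every other target (including 64-bit Arm) uses the plain loop, which the
// compiler is free to auto-vectorise.
inline float sum_floats(const float* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__x86_64__) || defined(_M_X64)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; ++i) {  // portable path, and the tail on x86
        total += data[i];
    }
    return total;
}
```

Code that is already written this way, or that uses no intrinsics at all, generally needs nothing more than a rebuild with an AArch64 toolchain.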

Compatibility for interpreted codes is also typically simple to achieve; one general rule of thumb is that upgrading to newer versions of an application will usually reduce any compatibility challenges.

NVIDIA has developed a dedicated, self-paced lab on NVIDIA LaunchPad for testing and porting. It is available now, free of charge, to everyone.

 

What do companies considering migrating workloads to Arm-based Grace architecture need to be aware of? 

Both Grace and Grace Hopper Superchips are already powering a significant share of newly built supercomputers and AI centres. In most workloads, the Grace CPU performs similarly to the leading x86 CPUs, while requiring significantly lower power. For companies prioritising both performance and sustainability, migration should be strongly considered.

Additionally, Arm uses a fine-grained, weakly ordered memory model that gives developers more explicit control over their applications, which can open new avenues for optimisation and security. Developers can use tools such as memory fences and atomic operations, which are readily available in most modern programming languages and libraries, to define explicitly the required ordering of memory accesses and so ensure correct behaviour across different memory models.
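To make the point about atomic operations concrete, here is a minimal sketch of the classic message-passing pattern using standard C++ atomics. It is an illustration rather than NVIDIA or Arm sample code, and the function name `publish_and_read` is hypothetical. Multithreaded code that omits this kind of ordering can sometimes appear to work on x86, whose hardware ordering is stronger, and then fail on a weakly ordered architecture; stating the ordering explicitly makes it correct on both.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Passes `value` from a producer thread to a consumer thread.
// The release store and acquire load guarantee that once the consumer
// sees ready == true, it also sees the write to payload.
inline int publish_and_read(int value) {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the data
    int observed = 0;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // 3. wait for the flag
            std::this_thread::yield();
        }
        observed = payload;  // 4. ordered after the acquire load
    });

    producer.join();
    consumer.join();
    assert(observed == value);
    return observed;
}
```

Calling `publish_and_read(42)` returns 42 on any conforming implementation. On some toolchains the program must be built with thread support enabled (for example `-pthread` with GCC or Clang).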

 

Are there sections of the software landscape that work more easily? Conversely, are there types of applications that are always difficult?

Data centre CPU workloads need high per-core performance for latency-sensitive applications and for branchy, compute-intensive applications. In addition, high-performance computing (HPC) CPU codes are often memory-bound. To meet these varied requirements, it is critical that modern CPUs have high per-core performance, a fast fabric, and high memory bandwidth. Data centres are power- and space-constrained, so performance efficiency is critical.

That is why Grace is designed to deliver high single-thread performance, high memory bandwidth, and fast data movement in a very power-efficient platform. The full Grace CPU Superchip requires up to 500 W for the memory and CPU combined, while rival platforms require over 800 W.

 

What are the benefits of the Grace Superchip’s onboard memory vs. standard DRAM?

The LPDDR5X memory in NVIDIA Grace balances cost, power, bandwidth, and capacity. It delivers up to 500 GB/s at only about 16 W, approximately one fifth the power of conventional DDR5 memory.
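The arithmetic behind these figures can be spelled out in a few lines. The sketch below uses only the numbers quoted in this answer; the names are illustrative, and the DDR5 value is simply what "one fifth" implies, not a measured figure.

```cpp
// Figures quoted in the text.
constexpr double kGraceBandwidthGbs = 500.0;  // GB/s per Grace CPU
constexpr double kGraceMemoryWatts  = 16.0;   // approximate LPDDR5X power

// Memory bandwidth delivered per watt of memory power.
constexpr double bandwidth_per_watt(double gb_per_s, double watts) {
    return gb_per_s / watts;
}

// If LPDDR5X uses about one fifth of the power of DDR5,
// DDR5 would need about five times as much for the same job.
constexpr double implied_ddr5_watts(double lpddr5x_watts) {
    return lpddr5x_watts * 5.0;
}

static_assert(bandwidth_per_watt(kGraceBandwidthGbs, kGraceMemoryWatts) == 31.25,
              "about 31 GB/s per watt for LPDDR5X");
static_assert(implied_ddr5_watts(kGraceMemoryWatts) == 80.0,
              "about 80 W implied for conventional DDR5");
```

In other words, the quoted numbers work out to roughly 31 GB/s per watt for LPDDR5X, against roughly 6 GB/s per watt for a DDR5 subsystem drawing the implied 80 W at the same bandwidth.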

 

Power in the data centre is a hot topic, so what can we expect from Arm-based Grace vs x86?

Users are encouraged to benchmark their own applications, but the Arm-based NVIDIA Grace CPU delivers exceptional power efficiency. Power efficiency is part of Arm’s DNA, and in Grace it is coupled with the fast NVIDIA SCF and low-power memory. As a result, organisations that are density- and power-constrained can do up to 2x more compute within the same power envelope, or do the same amount of compute with half the energy, compared with AMD Genoa in multiple workloads such as graph analytics, data analytics, and microservices.
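The two framings in this answer, twice the compute in the same power envelope or the same compute for half the energy, are the same performance-per-watt claim seen from two sides, because energy is power multiplied by time. A small sketch with made-up workload numbers (the 1,000 work units and the throughput values are hypothetical; only the 2x ratio comes from the text):

```cpp
// Energy in joules needed to finish a job at a constant power draw.
constexpr double energy_joules(double work_units,
                               double throughput_units_per_s,
                               double power_watts) {
    return work_units / throughput_units_per_s * power_watts;
}

// Hypothetical job: 1,000 units of work inside a 500 W envelope.
constexpr double kBaselineEnergy = energy_joules(1000.0, 10.0, 500.0);  // 100 s at 500 W
constexpr double kDoubledEnergy  = energy_joules(1000.0, 20.0, 500.0);  // 50 s at 500 W

// Doubling throughput at the same power halves the energy per job.
static_assert(kDoubledEnergy * 2.0 == kBaselineEnergy,
              "2x throughput at equal power means half the energy");
```

The same relationship is why benchmarking your own application matters: the energy saving tracks whatever speed-up that particular workload actually achieves.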

The Grace Superchip also delivers better, more power-efficient performance than x86 in HPC applications such as seismic data processing, CFD (OpenFOAM), molecular dynamics (CP2K), climate (NEMO), and weather (ICON, WRF).

 

Is Arm scalable like x86?

NVIDIA Grace has been engineered with scalability in mind. For applications that need to scale across all cores, with heavy communication between threads, and keep data moving, NVIDIA developed a fast on-chip fabric, the NVIDIA SCF, which offers twice the bisection bandwidth of the best x86 CPU. This means that cores can communicate efficiently on large problems with limited overhead to deliver incredible performance.

 

What use cases are the best fit for NVIDIA Grace?

The NVIDIA Grace CPU is an ideal choice when absolute performance, energy efficiency, and data centre density are critical. This includes scientific computing, data analytics, and enterprise and hyperscale applications.

 
