What exactly is the recently popular “super node” used for?

2025.05.01
Recently, a new term has become popular in the AI circle: the supernode.

Supernodes have appeared frequently at major exhibitions and forums, and industry leaders have championed them as an important trend in the development of intelligent computing, one poised to usher in a wave of growth.

So, what is a supernode? Why do we need a supernode?

In this article today, Xiaozaojun will give you an in-depth explanation.

What is a supernode?


Supernode, known in English as SuperPod, is a concept first proposed by NVIDIA.

As we all know, the GPU is a key piece of computing hardware, providing strong support for training AIGC large models.

With the continuous growth of large-model parameter scales, the demand for larger GPU clusters keeps growing as well: from thousand-card clusters to 10,000-card clusters, then 100,000-card clusters, and possibly even larger in the future.
So, how do we build GPU clusters that are getting bigger and bigger?

The answer is simple: Scale Up and Scale Out.

Scale Up, also called vertical expansion, increases the resources within a single node. Scale Out, also called horizontal expansion, increases the number of nodes.

Putting more GPUs into each server is Scale Up; in this case, one server is one node.
Connecting multiple servers (nodes) through a network is Scale Out.
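As a trivial illustration (a toy sketch with made-up numbers, not a sizing guide), the two routes simply grow different factors of the same product:

```python
# Toy illustration of the two scaling routes (all numbers are made up).
gpus_per_node = 8
num_nodes = 16
print("baseline cluster:", gpus_per_node * num_nodes, "GPUs")          # 128

# Scale Up: put more GPUs into each node
print("after Scale Up:  ", (gpus_per_node * 4) * num_nodes, "GPUs")    # 512

# Scale Out: add more nodes
print("after Scale Out: ", gpus_per_node * (num_nodes * 4), "GPUs")    # 512
```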

Let’s talk about Scale Up first.

For a single server, the number of GPUs that can be installed is limited by space, power consumption, and heat dissipation, usually 8 or 12 cards.

To install that many GPUs, we also need to consider whether the server's internal communication capabilities can keep up. If GPU-to-GPU interconnection becomes a bottleneck, the expected benefit of Scale Up cannot be realized.
In the past, communication inside the server was mainly based on the PCIe protocol, whose transfer rates are relatively slow and latency relatively high, falling far short of these requirements.

In 2014, NVIDIA introduced its proprietary NVLink bus protocol to solve this problem. NVLink allows GPUs to communicate point-to-point, with much higher speed and much lower latency than PCIe.
NVLink was originally used only for communication inside a single machine. In 2022, NVIDIA took the NVSwitch chip out of the server and turned it into a standalone NVLink switch, used to connect GPUs across servers. This means a node is no longer limited to a single server; it can be composed of multiple servers plus networking equipment.

These devices all belong to the same HBD (High Bandwidth Domain). NVIDIA calls such a Scale Up system, in which more than 16 GPUs are interconnected with ultra-high GPU-to-GPU bandwidth, a supernode.
After years of development, NVLink has reached its fifth generation. Each Blackwell GPU has 18 NVLink connections, and its total NVLink bandwidth can reach 1800 GB/s, far exceeding the bandwidth of PCIe Gen6.

In March 2024, NVIDIA released the GB200 NVL72, which integrates 36 Grace CPUs and 72 Blackwell GPUs into a single liquid-cooled cabinet, delivering a total of 720 PFLOPS of AI training performance, or 1440 PFLOPS of inference performance.
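As a rough sanity check on those headline figures (simple back-of-the-envelope arithmetic on the numbers quoted above, not official vendor math):

```python
# Back-of-the-envelope arithmetic on the NVL72 / NVLink figures quoted above.
gpus_per_rack = 72
nvlink_links_per_gpu = 18
per_gpu_nvlink_bw_gb_s = 1800        # total NVLink bandwidth per Blackwell GPU

rack_training_pflops = 720           # training figure quoted for NVL72
rack_inference_pflops = 1440         # inference figure quoted for NVL72

print("NVLink bandwidth per link :", per_gpu_nvlink_bw_gb_s / nvlink_links_per_gpu, "GB/s")  # 100.0
print("Training PFLOPS per GPU   :", rack_training_pflops / gpus_per_rack)                   # 10.0
print("Inference PFLOPS per GPU  :", rack_inference_pflops / gpus_per_rack)                  # 20.0
```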
What are the advantages of supernodes?
At this point, you may ask: why do we need supernodes at all? If the Scale Up route is hard to follow, can't we just take the Scale Out route, increase the number of nodes, and still build a large-scale GPU cluster?

The answer is simple. We need the supernode, an enhanced form of Scale Up, because it brings major advantages in performance, cost, networking, maintenance, and more.
Scale Out depends on the communication capability between nodes. Currently, the main networking technologies used are InfiniBand (IB) and RoCEv2.

Both technologies are based on the RDMA (Remote Direct Memory Access) protocol, and have higher speeds, lower latency, and stronger load balancing capabilities than traditional Ethernet.

IB is NVIDIA's proprietary technology: it started early, performs strongly, and is expensive. RoCEv2 is an open standard, the product of fusing traditional Ethernet with RDMA, and is cheaper. The gap between the two is shrinking.
In terms of bandwidth, IB and RoCEv2 can only provide Tbps-level bandwidth. Scale Up can achieve 10Tbps bandwidth interconnection between hundreds of GPUs.

In terms of latency, IB and RoCEv2 latency can be as high as 10 microseconds, whereas Scale Up has extremely strict latency requirements, on the order of 100 nanoseconds (100 nanoseconds = 0.1 microseconds).
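To get a feel for what these numbers mean, here is a minimal sketch (a toy latency-plus-bandwidth model with illustrative message sizes and link speeds, not measured data) of how long a single GPU-to-GPU transfer would take over each kind of link:

```python
# Toy alpha-beta model: transfer_time = latency + size / bandwidth.
# Link parameters are illustrative; real systems depend on topology,
# congestion, and protocol overheads.

def transfer_time_us(size_bytes, latency_us, bandwidth_gb_s):
    return latency_us + size_bytes / (bandwidth_gb_s * 1e9) * 1e6

msg = 64 * 1024 * 1024  # a 64 MB slice of activations/gradients (illustrative)

scale_up  = transfer_time_us(msg, latency_us=0.1, bandwidth_gb_s=900)  # NVLink-class link
scale_out = transfer_time_us(msg, latency_us=10,  bandwidth_gb_s=50)   # ~400 Gbps RDMA NIC

print(f"Scale Up link : {scale_up:.1f} us")   # ~74.7 us
print(f"Scale Out link: {scale_out:.1f} us")  # ~1352.2 us
```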
In AI training, there are many parallel computing strategies, such as TP (tensor parallelism), EP (expert parallelism), PP (pipeline parallelism), and DP (data parallelism).

Generally speaking, the communication volume of PP and DP is small, and is usually handled by Scale Out. The communication volume of TP and EP is large, and needs to be handled by Scale Up (inside the super node).
As the current optimal solution for Scale Up, supernodes can effectively support parallel computing tasks through internal high-speed bus interconnection, accelerate parameter exchange and data synchronization between GPUs, and shorten the training cycle of large models.
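As a concrete illustration (a simplified sketch with assumed group sizes, not any particular framework's placement logic), here is how a job might map its parallel dimensions onto the two domains, keeping tensor-parallel groups inside one supernode while data parallelism spans supernodes:

```python
# Hypothetical placement for 1024 GPUs grouped into supernodes of 64 GPUs.
# Tensor parallelism (heavy traffic) stays inside one supernode (Scale Up);
# data parallelism (lighter traffic) crosses supernodes (Scale Out).

SUPERNODE_SIZE = 64
TP_SIZE = 8            # tensor-parallel group size; must divide SUPERNODE_SIZE

def placement(rank):
    supernode = rank // SUPERNODE_SIZE   # which supernode this GPU sits in
    tp_group = rank // TP_SIZE           # TP ranks are contiguous, so a TP group
    #                                      never straddles a supernode boundary
    dp_group = rank % TP_SIZE            # DP peers occupy the same slot in different
    #                                      TP groups and may sit in other supernodes
    return supernode, tp_group, dp_group

for rank in (0, 7, 8, 63, 64):
    print(rank, placement(rank))
```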

Supernodes generally also support memory semantics, and GPUs can directly read each other's memory, which is also not available in Scale Out.
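As a rough software-level illustration of what memory semantics looks like (a minimal PyTorch sketch assuming two peer-capable GPUs in one machine; actual supernode memory interfaces are vendor-specific), peer-to-peer access lets one GPU read data that lives in another GPU's memory without staging it through host RAM:

```python
import torch

# Minimal sketch: requires at least two GPUs that support peer-to-peer access
# (e.g. connected via NVLink). Supernode-scale memory semantics extend the same
# idea across servers; this snippet only covers a single machine.

assert torch.cuda.device_count() >= 2, "needs two GPUs"

if torch.cuda.can_device_access_peer(0, 1):
    src = torch.randn(1024, 1024, device="cuda:1")  # data lives in GPU 1's memory
    dst = src.to("cuda:0")                          # direct GPU-to-GPU copy,
    #                                                 no bounce through host memory
    print("peer copy ok:", torch.allclose(dst.cpu(), src.cpu()))
else:
    print("GPUs 0 and 1 cannot directly access each other's memory")
```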
From the perspective of networking and operation and maintenance, supernodes also have obvious advantages.

The larger the supernode's HBD (High Bandwidth Domain), the more GPUs can be covered by Scale Up, and the simpler the Scale Out network becomes, greatly reducing network complexity.
A supernode is a highly integrated small cluster whose internal buses are already connected. This also reduces the difficulty of network deployment, shortens the deployment cycle, and makes subsequent operation and maintenance much more convenient.
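A rough sizing sketch (with assumed cluster and HBD sizes, purely illustrative) makes the point: the number of endpoints the Scale Out network has to stitch together shrinks in proportion to the HBD size:

```python
# Illustrative only: how a larger HBD shrinks the Scale Out fabric.
TOTAL_GPUS = 65536                      # a 64K-card cluster

for hbd_size in (8, 64, 256):
    scale_out_groups = TOTAL_GPUS // hbd_size
    print(f"HBD of {hbd_size:>3} GPUs -> {scale_out_groups:>5} supernodes to interconnect with Scale Out")
# HBD of   8 GPUs ->  8192 supernodes to interconnect with Scale Out
# HBD of  64 GPUs ->  1024 supernodes to interconnect with Scale Out
# HBD of 256 GPUs ->   256 supernodes to interconnect with Scale Out
```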

Of course, a supernode cannot be infinitely large, and its own cost factors must also be considered. The specific scale needs to be calculated based on the demand scenario.

In general, the advantage of a supernode is that it raises local bandwidth at a much lower cost than raising global bandwidth, thereby delivering greater overall benefit.
What are the options for supernodes?

Because supernodes have significant advantages, the concept immediately attracted industry attention after NVIDIA proposed it, and many manufacturers have since joined the research on supernodes.

At present, the mainstream supernode solutions in the industry mainly include the following:

1. Proprietary protocol solutions.
The representative manufacturer is, of course, NVIDIA.
Besides NVIDIA, Huawei, a major domestic manufacturer, recently released its blockbuster CloudMatrix 384 supernode, which also uses a proprietary protocol.

CloudMatrix 384 uses 384 Ascend compute cards to form a single supernode, the largest single unit among commercial supernodes, and can provide up to 300 PFLOPS of dense BF16 compute, nearly twice that of NVIDIA's GB200 NVL72 system.

2. Open-standard solutions.
Where there are proprietary protocols, there are bound to be open standards. In the Internet era, openness and decoupling are the general trend.
Proprietary protocols often mean high costs. For a booming field like AI, developing open standards helps lower industry barriers and democratize the technology.

At present, there is more than one open standard for supernodes, but they are all basically based on Ethernet technology (ETH), because Ethernet is the most mature and open technology and has the most participating companies.
From a technical perspective, Ethernet has the largest switching chip capacity (single chips of 51.2 Tbps are already commercial), the fastest SerDes technology (currently 112 Gbps per lane), and very low switching chip latency (around 200 ns), which can fully meet the performance requirements of Scale Up.
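To put those chip-level numbers in context (simple arithmetic on the figures above; actual port configurations vary by product), a single 51.2 Tbps switch chip can be carved into a small number of very fat Ethernet ports:

```python
# Simple arithmetic on the Ethernet figures quoted above (illustrative).
chip_capacity_gbps = 51200      # 51.2 Tbps switching chip
serdes_lane_gbps = 100          # ~100 Gbps effective per 112G SerDes lane

for port_speed in (400, 800):   # common high-end Ethernet port speeds
    ports = chip_capacity_gbps // port_speed
    lanes_per_port = port_speed // serdes_lane_gbps
    print(f"{port_speed}G ports: {ports} per chip, {lanes_per_port} SerDes lanes each")
# 400G ports: 128 per chip, 4 SerDes lanes each
# 800G ports: 64 per chip, 8 SerDes lanes each
```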

Among the open standards for supernodes, the most representative one is the ETH-X open supernode project led by the Open Data Center Committee (ODCC) and designed by the China Academy of Information and Communications Technology and Tencent.
More than 30 industry, academia, and research institutions participated in this project, including operators (China Mobile), cloud vendors (Tencent, etc.), equipment vendors (Ruijie, ZTE, etc.), compute card suppliers (Suyuan Technology, Biren Technology, etc.), and high-speed interconnect solution providers (Luxshare Technology, etc.).

Let's take a brief look at the technical details of ETH-X open super nodes.
ETH-X is based on Ethernet technology to build a high-bandwidth, elastic and scalable HBD with high computing density, high interconnection bandwidth, high power density and high energy efficiency.

It is worth noting that ETH-X not only includes Scale Up, but also Scale Out. The typical network topology is shown in the figure below:
According to data presented by Tencent at the 2024 Open Data Center Conference, in a training scenario based on the ETH-X supernode, a performance-per-cost comparison of the Llama-70B dense model on a 64K-card cluster showed that a 256-card Scale Up domain reduced training cost by 38% compared with an 8-card Scale Up domain.

In the inference scenario, an FP4-precision, 128-card-instance performance-per-cost comparison on Llama-70B showed that the 256-card Scale Up domain improved inference cost-efficiency by 40.48% compared with the 8-card Scale Up domain.
This effect is still very good.

The ETH-X supernode technical specification 1.0 has already been released. Not long ago (April 8), the ETH-X open supernode project held a power-on ceremony for its first prototype at Huaqin Technology's Dongguan smart manufacturing base.
Let's take a look at the physical architecture of the ETH-X open supernode.

The AI Rack is the concrete implementation of the ETH-X supernode. The SerDes rate inside the rack currently supports up to 112 Gbps and will support 224 Gbps in the future.

The rack includes computing nodes, switching nodes, and key components.
The entire cabinet can realize an NOC (Network-on-Chip)-level communication topology among multiple GPUs, and supports cross-GPU direct access (Direct Access) and zero-copy transfer (Direct Copy) through unified memory addressing and a memory-semantic interface.

According to actual test data, cross-card data access latency can be reduced by a factor of 12.7, and supernodes can be dynamically and flexibly composed in units ranging from 8 to 512 cards.
Among the key components, the Cable Tray is particularly noteworthy.

The ETH-X supernode AI Rack adopts an in-cabinet copper interconnect solution. The Cable Tray is a high-speed copper cable assembly that interconnects the various subsystems and serves as the key connector hardware providing high-speed interconnect capability.
NVIDIA's latest NVLink solution likewise uses a Cable Cartridge design. For short-distance transmission, in-cabinet copper connections offer higher reliability and lower cost than optical fiber (by reducing the use of optical modules) and are easier to cable. Direct copper-cable connections inside the Scale Up domain have become a mainstream trend.

Final words
Well, that wraps up this introduction to supernodes. Did you get it all?
As the AI wave continues to build, the industry's demand for supernodes will grow ever stronger, and more manufacturers will join the relevant open standards. This will strongly promote the maturity of related technologies and standards and foster a more prosperous and diverse ecosystem.

Supernodes: the future is promising!