From "slow" to "lightning": How does the NVMe protocol make data fly?

In the field of data storage and transmission, speed has always been a key metric that people constantly strive for. Traditional hard disk drives (HDDs), limited by their mechanical structure, made data reading and writing slow and time-consuming. Even with the advent of solid-state drives (SSDs), early products based on the SATA interface experienced limited performance improvements due to interface bandwidth and protocol limitations. It was like a powerful sports car stuck on a narrow, congested road, unable to fully accelerate.
Everything changed with the advent of the NVMe protocol. It's like building an ultra-fast "data highway" for data transmission, completely breaking through existing performance bottlenecks. So, what unique design of the NVMe protocol allows data transmission to leap from "snail-speed" to "lightning-fast"? Let's delve deeper and uncover the mysteries of the NVMe protocol.
Part 1. What is the NVMe protocol
1.1 Dilemma of Traditional Storage Protocols
In the early stages of solid-state drive development, the Advanced Host Controller Interface (AHCI) protocol played a crucial role. The AHCI protocol was designed around the fundamental principles and performance characteristics of mechanical hard disk drives (HDDs). HDDs use a magnetic head to read and write data on a spinning platter. This mechanical structure results in relatively slow read and write speeds and long seek times. For example, in a traditional HDD, the magnetic head must move across the platter to locate the data, a process that introduces significant latency, typically ranging from a few milliseconds to over ten milliseconds.
With the advent of solid-state drives (SSDs), the limitations of the AHCI protocol gradually became apparent. SSDs use flash memory chips for data storage. Data is read and written via electronic signals that control the state of the flash memory chips, without the physical movement of mechanical parts. This theoretically enables extremely fast read and write speeds. However, the AHCI protocol's single-queue design became a bottleneck that limited SSD performance.
In the AHCI protocol, there's only one queue between the host and storage device to handle input/output (I/O) requests. All I/O requests must enter this queue sequentially for processing. This is like a single-lane highway where vehicles can only pass one by one. Even if the vehicle behind has a more urgent task, it must wait for the vehicle in front to pass before moving forward. This single-queue design fails to fully utilize the high-speed read/write capabilities of solid-state drives (SSDs), resulting in a large number of I/O requests lingering in the queue, significantly increasing data transmission latency and reducing overall performance.
Furthermore, the AHCI protocol's command processing mechanism is relatively complex, further increasing system overhead and latency. The AHCI protocol requires a SATA controller for data transmission and command processing. This process involves multiple layers of protocol conversion and data processing, making data transfer from the host to the storage device and vice versa cumbersome, reducing data transfer efficiency.
Moreover, the AHCI protocol has a high CPU utilization rate. When processing I/O requests, the CPU needs to spend a lot of time and resources on queue management, command parsing, and data transmission coordination, which to some extent also affects the overall performance of the system.
1.2 The dawn of PCIe interface
Just as the AHCI protocol was struggling with SSDs, the emergence of the PCIe (Peripheral Component Interconnect Express) interface offered new hope for improved SSD performance. With its exceptional high bandwidth and low latency, the PCIe interface laid a solid hardware foundation for the NVMe protocol.
The PCIe interface utilizes serial communication technology, enabling it to operate at higher signaling rates than traditional parallel interfaces and thus achieve higher data transfer rates. Furthermore, the PCIe interface supports parallel transmission across multiple lanes, each capable of independent data transfer, so the overall bandwidth scales with the number of lanes. For example, a PCIe 3.0 x4 link offers a theoretical bandwidth of roughly 4 GB/s, while a PCIe 4.0 x4 link doubles that to roughly 8 GB/s. This high bandwidth provides SSDs with ample data transfer channels, allowing them to fully utilize their high-speed read and write performance without being limited by the bandwidth of traditional interfaces.
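As a rough sanity check, the per-direction bandwidth of a PCIe link can be estimated from the lane rate, lane count, and line encoding. The short sketch below reproduces the figures above; it is illustrative arithmetic only, since real links lose a few percent more to packet overhead:

```python
# Rough per-direction PCIe bandwidth estimate (illustrative figures only;
# real links lose a few percent more to TLP/DLLP packet overhead).

def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int) -> float:
    """Usable payload bandwidth in GB/s for a PCIe 3.0+ link."""
    encoding_efficiency = 128 / 130  # 128b/130b line encoding
    return gt_per_s * lanes * encoding_efficiency / 8  # gigabits -> gigabytes

gen3_x4 = pcie_bandwidth_gb_s(8.0, 4)    # PCIe 3.0: 8 GT/s per lane
gen4_x4 = pcie_bandwidth_gb_s(16.0, 4)   # PCIe 4.0: 16 GT/s per lane
print(f"PCIe 3.0 x4 ~ {gen3_x4:.2f} GB/s, PCIe 4.0 x4 ~ {gen4_x4:.2f} GB/s")
```

Doubling the per-lane rate or the lane count doubles the link's ceiling, which is why each PCIe generation roughly doubles the headroom available to an SSD.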
In addition to high bandwidth, the PCIe interface also boasts extremely low latency. Traditional storage interfaces require data transfers to go through multiple intermediaries, such as SATA controllers, which introduce latency. However, the PCIe interface utilizes a direct connection, allowing SSDs to communicate directly with the CPU via the PCIe bus. This bypasses traditional intermediaries like SATA controllers, significantly reducing the number of intermediate links and latency. This low latency allows SSDs to respond more quickly to system I/O requests, improving overall system responsiveness and performance.
It is precisely these advantages of the PCIe interface that paved the way for the emergence of the NVMe protocol. Designed precisely based on the PCIe interface, the NVMe protocol fully leverages the high bandwidth and low latency of the PCIe interface. Through optimized command processing mechanisms and a multi-queue design, it significantly improves SSD performance. It can be said that the PCIe interface is like a highway, and the NVMe protocol is the supercar that speeds along it. The combination of the two unleashes unprecedented performance in SSDs.
1.3 NVMe SSD
(1) Basic architecture
Generally speaking, an NVMe SSD involves three main components. On the host side, reference drivers are available from the official NVMe organization, and mainstream operating systems such as Linux and Windows ship with NVMe drivers built in. For transmission and control, a controller is implemented over the PCIe interface and the NVMe protocol; this controller acts as the "command center" of the entire storage system, responsible for data processing and transmission scheduling. The storage medium consists of the FTL (Flash Translation Layer) and NAND flash chips: the NAND flash holds the actual data, while the FTL manages key operations such as address mapping and wear leveling, ensuring efficient and stable operation of the storage medium.
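To make the FTL's role concrete, the toy model below sketches its core idea. This is a hypothetical simplification (a real FTL also handles wear leveling and garbage collection): flash pages cannot be overwritten in place, so every write lands on a fresh physical page and the logical-to-physical map is updated.

```python
# Minimal sketch of an FTL-style logical-to-physical page map
# (hypothetical and greatly simplified).

class TinyFTL:
    def __init__(self, num_pages: int):
        self.l2p = {}                       # logical page -> physical page
        self.free = list(range(num_pages))  # unwritten physical pages
        self.pages = {}                     # physical page -> data

    def write(self, lpn: int, data: bytes) -> None:
        # Flash cannot overwrite in place: take a fresh physical page,
        # store the data there, and remap the logical page to it.
        ppn = self.free.pop(0)
        self.pages[ppn] = data
        self.l2p[lpn] = ppn

    def read(self, lpn: int) -> bytes:
        return self.pages[self.l2p[lpn]]

ftl = TinyFTL(num_pages=8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")   # rewrite of the same logical page
print(ftl.read(0))    # b'v2'
print(ftl.l2p[0])     # 1 -> remapped away from physical page 0
```

The stale copy left on the old physical page is what garbage collection later reclaims; hiding all of this behind a simple read/write interface is exactly the FTL's job.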
(2) NVMe controller
NVMe controllers essentially combine DMA (Direct Memory Access) technology with a multi-queue mechanism. DMA plays a key role in data transfer, efficiently transferring instructions and reading and writing user data between storage devices and system memory. This significantly reduces CPU intervention during data transfers, freeing up CPU resources for other critical tasks. The multi-queue mechanism is the core of fully exploiting the parallel potential of flash memory.
By building multiple queues, flash memory can process task requests from different queues simultaneously, allowing data operations that were originally executed sequentially to be carried out in parallel, just like a highway with multiple lanes running in parallel. This greatly improves data processing efficiency, allows the parallel processing capabilities of flash memory to be fully utilized, and comprehensively improves the data transmission and processing performance of the NVMe storage system.
Part 2. NVMe protocol core mechanism
2.1 PCIe-based direct communication
(1) Innovation of the physical layer
A key feature of the NVMe protocol is its direct PCIe-based communication method. NVMe SSDs connect directly to the CPU via the PCIe bus. This direct physical connection completely bypasses the traditional SATA controller. In traditional storage architectures, data must be processed and forwarded multiple times by the SATA controller, which not only increases the data transmission path but also introduces additional latency. Imagine traveling from city A to city B, which once required multiple transfer stations. Now, a direct highway significantly reduces travel time.
This direct connection significantly reduces the number of intermediate links in data transmission, allowing data to move more quickly and directly between the CPU and SSD and substantially cutting latency. In practice, SSDs using the NVMe protocol reduce data transmission latency to a fraction of that of traditional AHCI SSDs. This enables faster system response when processing large amounts of data, significantly improving overall operational efficiency.
(2) Basic structure of PCIe bus
The PCIe bus utilizes a layered architecture comprising the physical layer, data link layer, and transaction layer (similar to the layered structure of computer networks), and data moves across the bus in the form of packets. Within this architecture, the NVMe protocol sits above PCIe as the application layer. In other words, PCIe provides the underlying transport for NVMe: the PCIe physical layer carries the actual signals, the data link layer ensures data reliability, and the transaction layer handles packet ordering, flow control, and related tasks. These layers provide the underlying support for the efficient operation of the NVMe protocol, allowing NVMe to focus on application-level functionality such as the storage device's logical interface and command set. From the bus's point of view, an NVMe SSD is simply a PCIe endpoint (EP) device.
In contrast, the AHCI protocol stack is relatively complex. The AHCI protocol requires the SATA controller to handle data transmission and command processing, which involves multiple layers of protocol conversion and data processing. For example, when transferring data from the host to the storage device, the SATA controller must perform multiple protocol parsing, data encapsulation, and decapsulation operations. These operations not only increase system overhead but also reduce data transmission efficiency.
The streamlined design of the NVMe protocol stack enables more efficient use of system resources when processing data, reducing time wasted during protocol processing and further improving SSD performance. This streamlined protocol stack structure is like a sharp scalpel, precisely removing redundant components from traditional protocol stacks, making data transmission smoother and more efficient.
2.2 Collaborative Operations of Multiple Queues and Multithreading
(1) Multi-queue mechanism
The traditional AHCI protocol supports only a single-queue data processing method between storage devices and hosts, and this queue can only accommodate a maximum of 32 commands. This is like a parking lot with a single entrance: all vehicles must queue at this entrance to enter, and even if there are plenty of empty spaces inside, additional vehicles still cannot enter at the same time. This single-queue design is prone to congestion when faced with a large number of I/O requests, resulting in inefficient data processing.
The NVMe protocol overcomes this limitation, supporting up to 65,535 I/O queues, each up to 65,536 commands deep. Each queue can independently process read and write requests, much like a large parking lot with multiple entrances that lets many vehicles enter simultaneously. This significantly increases parallel processing capability. When the system receives multiple I/O requests, the NVMe protocol can distribute them across different queues, enabling simultaneous processing of multiple requests and significantly improving data processing efficiency and speed.
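A minimal sketch of the idea, assuming the common driver policy of giving each CPU core its own submission queue:

```python
# Toy model of NVMe-style multi-queue dispatch: each CPU core submits to
# its own queue, so requests never funnel through a single chokepoint.
from collections import deque

NUM_QUEUES = 4   # the spec allows up to 65,535 I/O queues

queues = [deque() for _ in range(NUM_QUEUES)]

def submit(request_id: int, cpu_core: int) -> None:
    # Common mapping: one queue pair per core, chosen by core index.
    queues[cpu_core % NUM_QUEUES].append(request_id)

# Twelve requests arriving on four cores spread evenly across the queues.
for rid in range(12):
    submit(rid, cpu_core=rid % 4)

print([list(q) for q in queues])
# [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

Because each core only ever touches its own queue, the queues can be drained in parallel by the controller with no cross-core coordination on the submission path.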
(2) Multi-core CPU parallel processing
The NVMe protocol's multi-queue mechanism perfectly aligns with multi-core CPU architectures, fully leveraging their parallel processing capabilities. In modern computer systems, CPUs typically have multiple cores, each capable of independently executing tasks. The NVMe protocol assigns different queues to different CPU cores, allowing each core to focus on processing I/O requests in its own queue.
This is like a large factory with multiple production workshops, each with its own production line and workers. Different orders can be assigned to different workshops for production, and all workshops can operate simultaneously, greatly improving production efficiency. In this way, the NVMe protocol fully utilizes the parallel processing capabilities of multi-core CPUs, enabling the system to operate more efficiently when handling large numbers of I/O requests, further improving overall throughput.
(3) Lock-free design
To further improve the efficiency of its multi-queue mechanism, the NVMe protocol adopts a lock-free design. In traditional multi-threaded programming, when multiple threads access shared resources simultaneously, locking mechanisms are often required to prevent data conflicts and inconsistencies. However, locking mechanisms introduce overhead such as thread contention and context switching, reducing system performance.
The NVMe protocol uses atomic operations to achieve concurrent access to queues, avoiding thread contention. Atomic operations are indivisible and cannot be interrupted by other threads during execution, thus ensuring data consistency and integrity. Through this lock-free design, the NVMe protocol reduces contention and latency between threads, improving the system's concurrent processing capabilities and efficiency. This is like a busy intersection without traffic lights. Instead, vehicles coordinate their passage through a more intelligent system, significantly improving traffic efficiency.
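The flavor of such lock-free coordination can be sketched with a compare-and-swap retry loop. Python has no true hardware CAS on plain integers, so the class below merely models the primitive; on real hardware this would be a single atomic instruction (e.g. x86 `CMPXCHG`):

```python
# Conceptual sketch of lock-free slot reservation on a shared queue tail.
# Single-threaded model: AtomicInt stands in for a hardware atomic.

class AtomicInt:
    def __init__(self, value: int = 0):
        self.value = value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        # On real hardware this check-and-update is one indivisible step.
        if self.value == expected:
            self.value = new
            return True
        return False

tail = AtomicInt(0)

def reserve_slot(queue_depth: int) -> int:
    while True:
        old = tail.value
        # If another thread raced us, the CAS fails and we simply retry;
        # no thread ever blocks holding a lock.
        if tail.compare_and_swap(old, (old + 1) % queue_depth):
            return old   # caller now owns slot `old`

slots = [reserve_slot(queue_depth=8) for _ in range(3)]
print(slots)  # [0, 1, 2]
```

The key property is that a failed CAS costs only a retry, whereas a contended lock costs a context switch, which is why lock-free designs scale better under heavy I/O concurrency.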
2.3 The Ultimate Pursuit of Low Latency and High IOPS
(1) Command processing optimization
The NVMe protocol has been meticulously optimized for command processing to achieve low latency and high IOPS (input/output operations per second). Its command format is deliberately lean, enabling the CPU to parse and execute commands more quickly. Compared with traditional storage protocols, NVMe command fields carry less redundant information, saving CPU processing time.
Furthermore, the NVMe protocol supports 64-bit addressing, enabling it to access a larger address space and process more data. The advantages of 64-bit addressing are particularly evident when faced with the need for large-scale data storage and processing. It can avoid data processing bottlenecks caused by insufficient address space and improve system scalability and performance.
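This compactness is visible in the command layout itself: every NVMe command is a fixed 64-byte structure whose pointer fields are full 64-bit host memory addresses. The sketch below packs a simplified submission queue entry; the field positions loosely follow the NVMe base specification, but treat the details as illustrative:

```python
# Simplified 64-byte NVMe submission queue entry (illustrative layout:
# CDW0, NSID, reserved dwords, metadata pointer, PRP1/PRP2, CDW10-15).
import struct

def build_sqe(opcode: int, cid: int, nsid: int, prp1: int, cdw10: int) -> bytes:
    cdw0 = opcode | (cid << 16)   # opcode in bits 7:0, command ID in bits 31:16
    return struct.pack(
        "<IIQQQQIIIIII",
        cdw0, nsid,
        0,             # CDW2-3 (unused here)
        0,             # metadata pointer
        prp1, 0,       # PRP1 / PRP2: 64-bit host memory addresses
        cdw10, 0, 0, 0, 0, 0,   # CDW10-15: command-specific fields
    )

sqe = build_sqe(opcode=0x02, cid=7, nsid=1, prp1=0x1000_0000, cdw10=0)
print(len(sqe))   # 64 -> one fixed-size, easily parsed command
```

A fixed 64-byte entry means the controller can fetch and decode commands with simple, predictable logic, with none of the variable-length parsing older protocol stacks required.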
(2) End-to-end DMA
The NVMe protocol utilizes end-to-end Direct Memory Access (DMA) technology, another key factor in achieving low latency and high IOPS. Traditional data transfer methods require the CPU to transfer data between storage devices and memory, increasing CPU load and introducing additional latency.
DMA technology allows data to be transferred directly between the SSD and memory without CPU intervention. This is like building a direct highway between two cities, allowing goods to be transported directly from one city to the other without transiting through other cities. Through end-to-end DMA technology, the NVMe protocol significantly reduces data transmission latency and improves data transmission efficiency. The latency of a single I/O operation can be as low as microseconds, enabling NVMe SSDs to deliver exceptional performance when handling large numbers of random read and write requests, providing strong support for application scenarios with extremely high response speed requirements, such as databases and real-time analytics.
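How does the controller know where in host memory to place the data? The command carries physical page addresses (PRPs) describing the buffer, so the controller's DMA engine can move data directly into host pages. A hypothetical sketch, assuming a 4 KiB page size and a page-aligned buffer:

```python
# Sketch of describing host memory for an NVMe DMA transfer: the command
# carries physical page addresses (PRPs), so the controller moves data
# into host pages directly, with no CPU copy loop. 4 KiB pages assumed.

PAGE = 4096

def build_prp_addresses(buffer_phys_addr: int, length: int) -> list:
    """Physical page addresses covering a page-aligned buffer."""
    assert buffer_phys_addr % PAGE == 0, "simplified: buffer must be page-aligned"
    pages = (length + PAGE - 1) // PAGE
    return [buffer_phys_addr + i * PAGE for i in range(pages)]

# A 16 KiB read needs four page pointers; in a real command the first goes
# in the PRP1 field and the rest into a PRP list referenced by PRP2.
prps = build_prp_addresses(0x10_0000, 16 * 1024)
print([hex(a) for a in prps])
```

Once these addresses are handed over, the CPU's involvement in the transfer ends; the controller raises a completion only after the data is already sitting in host memory.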
2.4 Perfect adaptation of high bandwidth
The NVMe protocol complements the high-bandwidth nature of the PCIe interface, fully utilizing the high-speed data transfer capability it offers. For example, a PCIe 3.0 x4 link provides a theoretical bandwidth of roughly 4 GB/s, while a PCIe 4.0 x4 link provides roughly 8 GB/s. These high-speed links give NVMe SSDs ample data transfer channels, enabling them to fully leverage their performance advantages.
In comparison, the traditional SATA interface used by AHCI has a line rate of only 6 Gbps, which translates to roughly 600 MB/s of usable bandwidth after 8b/10b encoding overhead. This is insufficient for the ever-increasing data processing demands of modern computer systems. The NVMe protocol's efficient use of high bandwidth yields significant performance improvements over the traditional AHCI protocol: in practice, NVMe SSDs easily sustain sequential read and write speeds of several thousand MB/s, a level AHCI SSDs cannot approach.
This huge difference in performance makes the NVMe protocol the preferred solution for those pursuing high-performance storage. Whether in enterprise-level data centers or in high-end gaming laptops and workstations with extremely high performance requirements, NVMe SSDs provide users with a smoother and more efficient experience with their outstanding performance.
Part 3. NVMe protocol full process
3.1 Initialization phase
When a computer system boots up, it's like the opening of a sophisticated symphony, with the initialization phase serving as the prelude. During this crucial stage, the system comprehensively enumerates connected devices via the PCIe bus, much like a meticulous conductor checking each member of an orchestra. Among the numerous devices, the system accurately identifies NVMe devices, much like a conductor identifying the lead violinist in an orchestra. Once successfully identified, the system quickly loads the corresponding NVMe driver, which acts as a translator between the device and the system, ensuring accurate communication between the two parties.
At the same time, the system meticulously constructs an Admin Queue in memory. This crucial channel is primarily used to transmit commands and information related to device management. It's like an orchestra's back-office department: while not directly involved in the performance, it plays an indispensable role in the orchestra's smooth operation. Within this queue, a series of initialization commands are transmitted, such as setting device configuration parameters and querying capabilities. These commands, like instructions during an orchestra rehearsal, ensure that the device is operating optimally and fully prepared for subsequent data transmission.
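A simulated sketch of that handshake follows. The opcodes 0x06, 0x05, and 0x01 are the real admin opcodes for Identify, Create I/O Completion Queue, and Create I/O Submission Queue; the queue itself is just a Python list standing in for the memory ring:

```python
# Simulated admin-queue traffic during initialization. Real commands are
# 64-byte structures written to a ring in host memory; dicts stand in here.

ADMIN_IDENTIFY = 0x06       # query controller/namespace capabilities
ADMIN_CREATE_IO_CQ = 0x05   # a completion queue must exist before its SQ
ADMIN_CREATE_IO_SQ = 0x01

admin_sq = []

def admin_submit(opcode: int, cid: int) -> None:
    admin_sq.append({"opcode": opcode, "cid": cid})

# Typical init order: identify the controller, then create I/O queue pairs.
admin_submit(ADMIN_IDENTIFY, cid=0)
admin_submit(ADMIN_CREATE_IO_CQ, cid=1)
admin_submit(ADMIN_CREATE_IO_SQ, cid=2)
print([hex(c["opcode"]) for c in admin_sq])   # ['0x6', '0x5', '0x1']
```

Only after this admin-queue dialogue completes do the I/O queues exist at all, which is why the Admin Queue is built first during boot.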
3.2 Command submission phase
When an application or operating system generates an I/O request, it's like an orchestra receiving a booking for an important performance. These requests are quickly encapsulated into NVMe commands, much like organizing the performance's repertoire and requirements into a clear musical score. These commands contain detailed operation information, such as the data read/write address, transfer length, and operation type. Every detail is crucial, just as every note on a score determines the quality of the performance.
Encapsulated NVMe commands are sequentially written to the submission queue. The submission queue is like an orchestra's performance schedule, where all performance tasks are sequentially scheduled for execution. During this process, the operating system uses a scheduling algorithm to rationally arrange the order of command submission to ensure efficient system operation. For example, urgent data requests are prioritized at the front of the queue for timely processing, much like an orchestra prioritizing important pieces during a performance.
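In code terms, the submission queue is a ring buffer in host memory plus a "doorbell" the host writes to announce new entries. A minimal single-queue sketch (the doorbell is a plain variable here; in hardware it is an MMIO register on the controller, and 0x01/0x02 are the real NVM write/read opcodes):

```python
# Minimal ring-buffer model of one submission queue plus its tail doorbell.

QUEUE_DEPTH = 8
sq = [None] * QUEUE_DEPTH
sq_tail = 0
doorbell = 0   # last value the host wrote to the SQ tail doorbell

def submit_command(cmd: dict) -> None:
    global sq_tail, doorbell
    sq[sq_tail] = cmd                      # place the command in the ring
    sq_tail = (sq_tail + 1) % QUEUE_DEPTH  # advance the tail
    doorbell = sq_tail                     # notify the controller

submit_command({"opcode": 0x02, "lba": 0})     # read
submit_command({"opcode": 0x01, "lba": 128})   # write
print(doorbell)   # 2 -> controller will fetch entries 0 and 1
```

A single doorbell write can announce a whole batch of queued commands, which keeps per-command overhead on the host side very low.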
3.3 Command processing phase
The NVMe controller is like a conductor in an orchestra, constantly monitoring the submission queue. When it detects a new command from the submission queue, it quickly retrieves the command, much like a conductor picking up a sheet of music and conducting a performance. Then, based on the command's content, the NVMe controller methodically executes the corresponding data read, write, or management operations.
When performing data read and write operations, the NVMe controller interacts efficiently with the flash memory chips to accurately read or write data. This is like the tacit cooperation between a conductor and an orchestra: the conductor conveys instructions through gestures and eye contact, and the orchestra members accurately play the corresponding notes. During this process, the NVMe controller fully utilizes its internal caching mechanism to improve data read and write speeds. For example, when reading data, if the data is already in the cache, it can be directly retrieved from the cache, significantly reducing the read time. This is similar to the orchestra members' intimate familiarity with frequently performed pieces, allowing them to perform them quickly and accurately.
For management operations, the NVMe controller performs tasks such as device status query and configuration update. These operations are like an orchestra debugging and maintaining its instruments between performances, ensuring that the devices are always in good working condition.
3.4 Completion notification phase
When the NVMe controller successfully completes an operation, it's like an orchestra completing a flawless performance. It meticulously writes the results of the operation into the completion queue. The completion queue acts like a post-performance feedback sheet, recording the performance's success and evaluation. The result information includes the operation status (success or failure), the associated status code, and any possible error messages. This information is crucial for the system to understand the operation's execution, just as audience feedback is valuable for an orchestra to improve its performance.
The CPU retrieves the results from the completion queue through either interrupts or polling. The interrupt method is like the audience's enthusiastic applause after a performance, which immediately draws the CPU's attention and allows it to quickly process the completed operation. When the NVMe controller completes an operation, it sends an interrupt signal to the CPU. Upon receiving the signal, the CPU pauses the currently executing task and processes the results in the completion queue. The polling method is like an orchestra staff member regularly checking the feedback table. The CPU actively queries the completion queue at regular intervals to see if any operations have completed. Regardless of the method used, the CPU obtains the operation results in a timely manner and performs subsequent processing based on the results, such as returning data to the application and handling errors, just like an orchestra adjusts the schedule for the next performance based on audience feedback.
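The polling path can be sketched as follows. New completion entries are recognized by a phase tag that the controller flips each time the ring wraps, so the host needs no separate "valid" flag (simplified single-queue model):

```python
# Completion-queue polling with a phase tag (simplified). The controller
# writes entries carrying the current phase; the host consumes entries
# whose phase matches its expectation, flipping it on each wrap-around.

QUEUE_DEPTH = 4
cq = [{"cid": None, "status": 0, "phase": 0} for _ in range(QUEUE_DEPTH)]
cq_head, expected_phase = 0, 1

def controller_complete(slot: int, cid: int, phase: int) -> None:
    cq[slot] = {"cid": cid, "status": 0, "phase": phase}

def poll_completions() -> list:
    """Consume every new entry whose phase bit matches the expected phase."""
    global cq_head, expected_phase
    done = []
    while cq[cq_head]["phase"] == expected_phase:
        done.append(cq[cq_head]["cid"])
        cq_head = (cq_head + 1) % QUEUE_DEPTH
        if cq_head == 0:
            expected_phase ^= 1   # ring wrapped: expected phase flips
    return done

controller_complete(0, cid=7, phase=1)
controller_complete(1, cid=8, phase=1)
result = poll_completions()
print(result)   # command IDs 7 and 8 have completed
```

The command ID in each entry lets the host match a completion back to the submission it answers, since commands may complete out of order across queues.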
Part 4. Advantages and Applications of NVMe Protocol
4.1 Four major technical advantages of NVMe protocol
1. Performance Leap: The performance improvement brought by the NVMe protocol is truly a leap forward. Over a PCIe 3.0 interface, solid-state drives using the NVMe protocol can achieve sequential read and write speeds of up to 3.5GB/s; over PCIe 4.0, this rises to around 7GB/s. Traditional AHCI SSDs, limited by the bandwidth of the SATA interface, usually manage sequential speeds of only around 500MB/s, an order of magnitude slower. When copying large files, an NVMe SSD can finish in seconds a task that might take an AHCI SSD several minutes, a difference that is very noticeable in everyday use.
2. Low Latency: Random read and write latency is a key metric for measuring storage performance, and the NVMe protocol excels in this regard, with latency as low as 10μs. Database systems frequently perform random read and write operations to retrieve and update data. Low-latency storage devices can significantly improve database response speeds, enabling query results to be returned to users more quickly. In the field of real-time analytics, timely data processing is crucial. The low latency of the NVMe protocol ensures real-time analysis results, providing timely and accurate data support for decision-making.
3. High Reliability: Data security and stability are core requirements of storage systems, and the NVMe protocol provides strong assurance in this regard. It supports end-to-end data protection. By adding protection information (such as CRC) to data, data integrity is verified during transmission and storage. Any data errors are detected and corrected promptly, effectively preventing silent errors during transmission and storage that could lead to data loss or corruption. The NVMe protocol also supports power loss protection technology. In the event of a sudden power outage, it ensures that cached data is securely written to the storage media, preventing data loss due to power outages and providing a double layer of security for data storage.
4. Scalability: With growing storage demands and increasingly complex storage architectures, storage system scalability is becoming increasingly important. The NVMe protocol supports multiple namespaces, partitioning a physical storage device into multiple logical namespaces. Each namespace can be independently managed and used, providing more flexible storage resource allocation for different applications or users. For example, within a data center, different business systems can be assigned separate namespaces to achieve isolated and efficient storage resource utilization. Furthermore, the NVMe protocol supports virtualization (SR-IOV) technology, providing independent storage device interfaces for virtual machines in a virtualized environment. This improves storage performance and isolation for virtual machines, making the NVMe protocol more adaptable to complex application scenarios such as cloud computing.
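Among these, the end-to-end protection idea from point 3 can be illustrated in a few lines: a guard tag (a CRC) travels with each block of data, so corruption anywhere along the path is detectable. Real NVMe protection information attaches a CRC-16 or CRC-32 guard per logical block; `zlib.crc32` below is just a stand-in for the principle:

```python
# Sketch of end-to-end data protection: a per-block guard tag (CRC)
# accompanies the data so any silent corruption is caught on read-back.
import zlib

def protect(block: bytes) -> tuple:
    """Attach a guard tag computed over the block's contents."""
    return block, zlib.crc32(block)

def verify(block: bytes, guard: int) -> bool:
    """Recompute the CRC and compare it against the stored guard."""
    return zlib.crc32(block) == guard

data, guard = protect(b"payload")
print(verify(data, guard))        # True  -> intact block passes

corrupted = b"paiload"            # a single flipped character
print(verify(corrupted, guard))   # False -> silent error detected
```

Because the guard is checked at every hop (host, controller, media), an error introduced anywhere on the path is caught rather than silently returned to the application.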
4.2 Wide range of application scenarios
1. Enterprise Storage: In the enterprise storage sector, data centers and cloud computing place extremely high demands on storage system performance, reliability, and scalability. The NVMe protocol's high performance and low latency enable it to meet the needs of large-scale data storage and rapid processing in data centers. In a hyperconverged architecture, NVMe SSDs, as storage media, can significantly improve the storage performance of compute nodes, enabling efficient integration of compute and storage resources and reducing overall system costs. In distributed storage systems, the NVMe protocol's multiple queues and high concurrency support can support simultaneous access by numerous clients, ensuring fast data read and write speeds and consistency, and providing stable and reliable storage support for cloud computing services.
2. High-Performance Computing: In high-performance computing fields such as AI training, scientific computing, and high-frequency trading, the requirements for computing speed and data processing capabilities are extremely stringent. In AI training, model training requires processing massive amounts of data. The high-speed read and write capabilities of NVMe SSDs can quickly provide training data, significantly reducing training time and improving model training efficiency. In scientific computing, such as weather simulation and gene sequencing, large amounts of scientific data must be analyzed and processed in real time. The low latency and high bandwidth characteristics of the NVMe protocol meet the data processing speed requirements of these applications. In high-frequency trading, every millisecond of delay can affect the success of a transaction. The low latency performance of NVMe SSDs ensures the rapid execution of trading instructions, creating more opportunities for investors.
3. Consumer Devices: Within the consumer device market, high-end gaming laptops and workstations are increasingly demanding storage performance. Using an NVMe SSD as a system drive significantly improves device boot speeds and application loading speeds. For gamers, this significantly reduces loading times, allowing them to enter the game world faster and enjoy a smoother gaming experience. In workstations, when running applications with high storage requirements, such as large-scale design software and video editing software, NVMe SSDs ensure fast software response, improve work efficiency, and provide users with a more efficient and convenient experience.
4. NVMe over Fabrics (NVMe-oF): NVMe over Fabrics (NVMe-oF) technology enables access to remote NVMe devices using network technologies such as Ethernet and Fibre Channel, extending the advantages of the NVMe protocol to storage networks. In data centers, NVMe-oF enables multiple servers to connect to remote NVMe storage devices, enabling shared storage resources and centralized management, improving storage resource utilization and flexibility. Furthermore, NVMe-oF offers higher storage performance and lower latency than traditional network storage protocols. Compared to traditional network storage protocols, it better meets the storage performance requirements of enterprise applications and provides strong support for building more efficient storage network architectures.