Notes on Scraping Together a Heterogeneous System

CPU

  • Higher frequency is better for CPU bound applications.

  • Higher \(\#_\text{cores}\) and multiprocessing support are better for perfectly parallel workloads.

    • The speedup of applications with an inherently serial portion is bounded above by Amdahl’s law.

    • Physical cores represent the number of physical processing units on a chip.

    • The number of logical cores (\(\#_\text{cores} \times \#_\text{threads/core}\)) is the maximum number of concurrent threads the chip supports via simultaneous multithreading (SMT).

      • SMT works well when the threads have highly different characteristics e.g. one thread doing mostly integer operations, another mainly doing floating point operations.

      • Note that in hardware virtualization, a logical core is called a virtual CPU, vCPU, or virtual processor.

  • Larger multi-level cache size is better for cache bound applications.

  • The chip also specifies what is supported in terms of memory size, memory type, max number of memory channels, PCIe data rate, and max number of PCIe lanes.

Multiprocessor/Multiprocessing

Outside of perfectly parallel workloads, a single CPU system is more cost effective than a system with multiple physical CPUs because existing software (e.g. After Effects, Premiere Pro, SolidWorks) mostly does not take advantage of the additional processor cores. A multi-CPU system may even be slower than a single CPU system, possibly due to inter-processor communication bandwidth (e.g. NUMA). As an aside, a useful term to know is transfers per second: multiplying the transfer rate by the width of the information channel gives the data transmission rate.
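
As a rough illustration of that formula, the snippet below (a minimal Python sketch; the helper name and the PCIe Gen 3 x16 example are mine) multiplies a transfer rate by the channel width and an encoding efficiency to recover the familiar per-direction figure.

def data_rate_gb_per_s(giga_transfers_per_s, width_bits, encoding_efficiency=1.0):
    """Data rate (GB/s) = transfer rate (GT/s) x channel width (bytes) x encoding efficiency."""
    return giga_transfers_per_s * (width_bits / 8) * encoding_efficiency

# PCIe Gen 3: 8 GT/s per lane with 128b/130b encoding; an x16 link moves 16 bits per transfer.
print(data_rate_gb_per_s(8, 16, 128 / 130))  # ~15.75 GB/s per direction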

Intel QuickPath Interconnect (QPI)

Initially, Intel’s CPUs used an FSB to access the northbridge and DMI to link to the southbridge (a.k.a. ICH). Intel later replaced the FSB with QPI and integrated the northbridge into the CPU die itself. The southbridge became redundant and was replaced by the PCH. The PCH still uses DMI, but Intel has started to replace QPI with UPI.

Any communication to other CPUs and uncore components (e.g. remote memory, L3 cache) uses QPI. Other external communications (e.g. local memory, devices) use pins, PCIe, SATAe, etc… In the Skylake microarchitecture, core-to-core (intra-chip) communication uses a ring bus interconnect; Intel has since replaced it with a mesh topology interconnect.

AMD Infinity Fabric (IF)

AMD’s CPUs initially used an FSB to access the northbridge and UMI to link to the southbridge (a.k.a. FCH). AMD later replaced the FSB with HT and integrated the northbridge into the CPU die itself when it introduced the APU, which still uses UMI. In a move towards an SoC design, AMD integrated the southbridge into the die and replaced HT with IF.

The IF’s Scalable Data Fabric (SDF) connects each CCX (CPU Complex) to uncore devices such as memory controllers and PCIe controllers. It is a 256-bit bi-directional crossbar that is used to simultaneously transport data for multiple buses to their final destination and runs at the speed of the memory controller. In the Zen microarchitecture, die-to-die (intra-chip) communication uses AMD’s Global Memory Interconnect (GMI).

Memory

Storage

Non-volatile data storage (NVM) can be either mechanically addressed or electrically addressed. The former has additional mechanical performance characteristics to be aware of when examining I/O bound applications. Those measurements can be mapped onto the commonly accepted metrics of sequential and random read/write performance.
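
As a rough, hypothetical illustration of how those metrics relate: a drive sustaining \(100\,000\) random 4 KiB operations per second delivers roughly \(100\,000 \times 4\ \text{KiB} \approx 400\ \text{MB/s}\), whereas sequential throughput is usually quoted directly in MB/s.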

Both storage system types are accessed through a predefined set of logical device interfaces.

  • SATA and SAS were designed primarily for HDDs.

    • SATA targets the lowest cost per gigabyte and is the most cost effective for low frequency access of reference/streaming/sequential data e.g. archival data, file-sharing, email, web, backups.

    • SAS is geared towards maximal performance, reliability, and availability on high frequency immediate random access data e.g. database transactions.

    • Hard drive reliability is highly dependent on capacity and the manufacturer (e.g. HGST, Western Digital, Seagate Technology).

  • SATA could not keep up with the speed of SSDs, so SATAe was introduced to interface with PCIe SSDs through the AHCI driver.

  • AHCI did not fully exploit the low latency and parallelism of PCIe SSDs, so it was replaced by NVMe.

    • M.2 and U.2 are realizations of NVMe in different physical formats.

The aforementioned interfaces support RAID on a single system. When scaling beyond a single machine, the only viable solution is a distributed file system.

Network Interface Controller

Gigabit Ethernet (1GbE) is a typical setup for small clusters running workloads that are not IO bound. Technologies such as standard Ethernet switches and LAN connections use RJ45 connectors to terminate twisted-pair copper cables.

If a longer maximum distance is desired, then optical fiber transceivers (e.g. SFP, QSFP) can be used with LC connectors. If more bandwidth is needed, 10GbE is available through SFP+ and QSFP+; 100GbE is supported through QSFP28 and CFP.

If the goal is to reduce latency, then the network adapters need to support Converged Enhanced Ethernet, which provides reliability without requiring the complexity of TCP. Furthermore, this functionality is necessary to perform RDMA in computer clusters. The initial implementation of RDMA over Ethernet was iWARP, but it has since been superseded by RoCE.

Some alternative network interconnects are InfiniBand, Fibre Channel, and proprietary technologies such as Intel Omni-Path. Note that InfiniBand provides RDMA capabilities through its own set of protocols (e.g. IPoIB) and needs to use a network bridge to communicate with Ethernet devices. Fibre Channel instead supports FCP and interacts with Ethernet via FCoE. Omni-Path supports Ethernet and InfiniBand protocols as well as RDMA. InfiniBand currently achieves the lowest latency and highest throughput, followed by RoCE, Omni-Path, and lastly Fibre Channel [VCWUR+12][VWKE16].

Coprocessor Interconnect

Modern coprocessors can be categorized into four types in order of increasing cost: GPUs, manycore processors, FPGAs, and ASICs. Even though their performance is highly dependent on the workload, all of them share two characteristics:

  • They require a local host CPU to configure and operate them through the root complex (RC), which limits the number of accelerators per host.

  • The unbalanced communication between distributed accelerators is further exacerbated by the limitations of PCIe Gen 3.

The RC logically aggregates PCIe hierarchy domains into a single PCIe hierarchy [Tsa16]. This hierarchy along with the RC is known as the PCIe fabric. Since the CPU dictates the maximum number of supported PCIe lanes that all PCIe links communicate over, the hierarchy typically includes switches and bridges. Switches provide an aggregation capability and allow more devices to be attached to a single root port. They act as packet routers and recognize which path a given packet will need to take based on its address or other routing information. A switch may have several downstream ports, but it can only have one upstream port. Bridges serve to interface between different buses (e.g. PCI, PCIe).

The PCIe tree topology has several limitations. Simultaneous communication between all devices will induce congestion in the PCIe fabric, resulting in bandwidth reduction. The congestion factors include upstream port conflicts, downstream port conflicts, head-of-line blocking, and conflicts from crossing the RC [MKA+16][Law14]. Furthermore, when there are multiple RCs, inter-processor communication needs to be taken into account if the devices are not under a single RC.
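
The JSON-style listings later in these notes describe exactly such hierarchies, and they can be reasoned about programmatically. Below is a minimal Python sketch, assuming a simplified nested-dict topology of the same shape as the first listings; the function names and the example configuration are illustrative. If the lowest common ancestor of two devices is a switch, their peer traffic stays below that switch; if it is a CPU/RC, the traffic crosses the root complex and contends with the congestion factors above.

def find_path(topology, device, path=()):
    """Return the chain of ancestors (root .. parent) leading to `device`, or None."""
    for node, children in topology.items():
        if isinstance(children, dict):
            sub_path = find_path(children, device, path + (node,))
            if sub_path is not None:
                return sub_path
        elif device in children:  # leaf devices attached to this node
            return path + (node,)
    return None

def lowest_common_ancestor(topology, dev_a, dev_b):
    """Deepest node shared by the two devices' paths from the root."""
    shared = None
    for a, b in zip(find_path(topology, dev_a), find_path(topology, dev_b)):
        if a != b:
            break
        shared = a
    return shared

# Hypothetical single-CPU configuration with two switches.
topo = {
    "CPU 0": {
        "PCIe Switch 0": ["GPU 0", "GPU 1"],
        "PCIe Switch 1": ["GPU 2", "GPU 3"],
    }
}
print(lowest_common_ancestor(topo, "GPU 0", "GPU 1"))  # PCIe Switch 0: stays below one switch
print(lowest_common_ancestor(topo, "GPU 0", "GPU 2"))  # CPU 0: traffic must cross the RC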

To overcome these limitations, different technology groups have banded together to propose three new interconnect standards: CAPI, CCIX, and Gen-Z.

  • CAPI is a new physical layer standard focused on low-latency high-speed coherent DMA between devices of different ISAs.

    • CCIX has the same goal, but builds upon PCIe Gen 4 and additionally supports switched fabric topologies.

    • NVLink is an alternative proprietary interconnect technology tailored for Nvidia’s GPUs.

    • There has been speculation that CAPI and CCIX will converge at some point.

  • Gen-Z is a memory semantic fabric that enables memory operations to direct-attached and disaggregated memory and storage.

    • Its packet-based protocol supports both CCIX and CAPI.

PCIe Topology

There are two common communication patterns:

  • Point-to-point communication between a single sender and a single receiver.

  • Collective communication between multiple senders and receivers.

Most collectives are amenable to bandwidth-optimal implementation on rings, and many topologies can be interpreted as one or more rings, so ring-based collectives enable optimal intra-node communication.
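
As a concrete illustration of the ring pattern, here is a minimal NumPy simulation of ring allreduce (reduce-scatter followed by all-gather); the in-memory `workers` list stands in for GPUs arranged in a ring, and the function is a sketch rather than a real communication library.

import numpy as np

def ring_allreduce(workers):
    """Simulate ring allreduce; `workers[r]` is worker r's list of p chunks."""
    p = len(workers)
    # Reduce-scatter: after p - 1 steps, worker r holds the fully reduced chunk (r + 1) % p.
    for step in range(p - 1):
        sends = [(r, (r - step) % p, workers[r][(r - step) % p].copy()) for r in range(p)]
        for r, idx, payload in sends:
            workers[(r + 1) % p][idx] += payload
    # All-gather: circulate each fully reduced chunk around the ring.
    for step in range(p - 1):
        sends = [(r, (r + 1 - step) % p, workers[r][(r + 1 - step) % p].copy()) for r in range(p)]
        for r, idx, payload in sends:
            workers[(r + 1) % p][idx] = payload
    return workers

# Four simulated workers, each contributing four chunks filled with (rank + 1).
workers = [[np.full(2, r + 1, dtype=float) for _ in range(4)] for r in range(4)]
print(ring_allreduce(workers)[0])  # every chunk sums to 1 + 2 + 3 + 4 = 10 on every worker

Each step moves one chunk per worker, so every link in the ring carries the same volume of traffic at every step, which is what makes the schedule bandwidth-optimal.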

Digits DevBox

Bandwidth between the two GPU groups is not as high as within a single group.

{
  "CPU 0": {
    "8747 PCIe Switch 0": ["GPU 0", "GPU 1"],
    "8747 PCIe Switch 1": ["GPU 2", "GPU 3"]
  }
}

Inefficient Configuration of 8 GPUs

Inter-group bandwidth is half of intra-group bandwidth due to crossing the RC(s).

{
  "CPU 0": {
    "8796 PCIe Switch 0": ["GPU 0", "GPU 1", "GPU 2", "GPU 3", "NIC 0"]
  },
  "CPU 1": {
    "8796 PCIe Switch 1": ["GPU 4", "GPU 5", "GPU 6", "GPU 7", "NIC 1"]
  }
}

or

{
  "CPU 0": {
    "8796 PCIe Switch 0": ["GPU 0", "GPU 1", "GPU 2", "GPU 3", "NIC 0"],
    "8796 PCIe Switch 1": ["GPU 4", "GPU 5", "GPU 6", "GPU 7", "NIC 1"]
  },
  "CPU 1": {}
}

Big Sur

Inter-group bandwidth is equivalent to intra-group bandwidth. This configuration is also known as cascading or daisy chaining switches.

{
  "CPU 0": {
    "8796 PCIe Switch 0": [
      "GPU 0", "GPU 1", "GPU 2", "GPU 3",
      {
        "8796 PCIe Switch 1": ["GPU 4", "GPU 5", "GPU 6", "GPU 7", "NIC 0"]
      }
    ]
  },
  "CPU 1": {}
}

DGX-1

{
  "CPU 0": {
    "8664 PCIe Switch 0": ["GPU 0", "GPU 2", "NIC 0"],
    "8664 PCIe Switch 1": ["GPU 1", "GPU 3", "NIC 1"]
  },
  "CPU 1": {
    "8664 PCIe Switch 2": ["GPU 4", "GPU 6", "NIC 2"],
    "8664 PCIe Switch 3": ["GPU 5", "GPU 7", "NIC 3"]
  },
  "GPU 0": {
    "NVLink" : ["GPU 1", "GPU 2", "GPU 3", "GPU 5"]
  },
  "GPU 1": {
    "NVLink" : ["GPU 0", "GPU 2", "GPU 3", "GPU 4"]
  },
  "GPU 2": {
    "NVLink" : ["GPU 0", "GPU 1", "GPU 3", "GPU 7"]
  },
  "GPU 3": {
    "NVLink" : ["GPU 0", "GPU 1", "GPU 2", "GPU 6"]
  },
  "GPU 4": {
    "NVLink" : ["GPU 5", "GPU 6", "GPU 7", "GPU 1"]
  },
  "GPU 5": {
    "NVLink" : ["GPU 4", "GPU 6", "GPU 7", "GPU 0"]
  },
  "GPU 6": {
    "NVLink" : ["GPU 4", "GPU 5", "GPU 7", "GPU 3"]
  },
  "GPU 7": {
    "NVLink" : ["GPU 4", "GPU 5", "GPU 6", "GPU 2"]
  }
}
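
The NVLink lists above form a graph in which some GPU pairs (e.g. GPU 0 and GPU 4) have no direct link. The minimal Python sketch below, using the adjacency copied from the listing above as its input, computes the NVLink hop count between two GPUs; pairs more than one hop apart must either route through an intermediate GPU or fall back to PCIe/QPI.

from collections import deque

# NVLink adjacency copied from the DGX-1 listing above.
nvlink = {
    "GPU 0": ["GPU 1", "GPU 2", "GPU 3", "GPU 5"],
    "GPU 1": ["GPU 0", "GPU 2", "GPU 3", "GPU 4"],
    "GPU 2": ["GPU 0", "GPU 1", "GPU 3", "GPU 7"],
    "GPU 3": ["GPU 0", "GPU 1", "GPU 2", "GPU 6"],
    "GPU 4": ["GPU 5", "GPU 6", "GPU 7", "GPU 1"],
    "GPU 5": ["GPU 4", "GPU 6", "GPU 7", "GPU 0"],
    "GPU 6": ["GPU 4", "GPU 5", "GPU 7", "GPU 3"],
    "GPU 7": ["GPU 4", "GPU 5", "GPU 6", "GPU 2"],
}

def nvlink_hops(src, dst):
    """Minimum number of NVLink hops between two GPUs (breadth-first search)."""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        node, hops = frontier.popleft()
        if node == dst:
            return hops
        for peer in nvlink[node]:
            if peer not in seen:
                seen.add(peer)
                frontier.append((peer, hops + 1))
    return None

print(nvlink_hops("GPU 0", "GPU 5"))  # 1: direct NVLink
print(nvlink_hops("GPU 0", "GPU 4"))  # 2: no direct link; route through a peer or over PCIe/QPI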

GPU(s)

Table 5 Comparison of GPU Capability

                              Quadro              Tesla               GeForce
(DP) FLOPS                    High                Medium to High      Low
Memory Bandwidth              High                Medium to High      Low
Memory Quantity               High                Medium to High      Low
ECC                           Yes                 Yes                 No
Data Transfer Interconnect    PCIe/NVLink         PCIe/NVLink         PCIe
DMA Engines                   Dual                Dual                Single
P2P                           Yes                 Yes                 Yes
RDMA                          Yes                 Yes                 No
Hyper-Q                       Full                Full                Partial
GPU Boost                     Configurable        Configurable        Automatic
Target                        Graphics/Compute    Compute             Graphics/Compute
Cluster Management Tools      Yes                 Yes                 No

Besides the device-to-host and device-to-device interconnect technology, the DMA Engines, RDMA, and Hyper-Q are equally important features in high-performance computing.

Dual DMA engines enable simultaneous execution of the following pipelined workload:

  1. Transfer results from data chunk \(n - 1\) from device to host.

  2. Run kernel that operates on data chunk \(n\).

  3. Transfer data chunk \(n + 1\) from host to device.

A single DMA Engine can only transfer data in one direction at a time, so the data transfer steps of the proposed pipeline will be executed sequentially.
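
Below is a minimal PyTorch sketch of that three-stage pipeline; the chunk sizes, the doubling "kernel", and the use of three streams (standing in for the H2D copy engine, the compute engine, and the D2H copy engine) are illustrative assumptions rather than a production recipe.

import torch

device = torch.device("cuda:0")
h2d_stream, compute_stream, d2h_stream = (torch.cuda.Stream() for _ in range(3))

n_chunks, chunk_elems = 8, 1 << 20
host_in = [torch.randn(chunk_elems).pin_memory() for _ in range(n_chunks)]
host_out = [torch.empty(chunk_elems).pin_memory() for _ in range(n_chunks)]
dev_in, dev_out = [None] * n_chunks, [None] * n_chunks

with torch.cuda.stream(h2d_stream):  # prologue: stage the first chunk on the device
    dev_in[0] = host_in[0].to(device, non_blocking=True)

for n in range(n_chunks):
    if n > 0:
        d2h_stream.wait_stream(compute_stream)
        with torch.cuda.stream(d2h_stream):  # 1. results of chunk n - 1 back to the host
            host_out[n - 1].copy_(dev_out[n - 1], non_blocking=True)
    compute_stream.wait_stream(h2d_stream)
    with torch.cuda.stream(compute_stream):  # 2. "kernel" on chunk n
        dev_out[n] = dev_in[n] * 2.0 + 1.0
    if n + 1 < n_chunks:
        with torch.cuda.stream(h2d_stream):  # 3. chunk n + 1 to the device
            dev_in[n + 1] = host_in[n + 1].to(device, non_blocking=True)

d2h_stream.wait_stream(compute_stream)
with torch.cuda.stream(d2h_stream):  # epilogue: copy out the last result
    host_out[-1].copy_(dev_out[-1], non_blocking=True)
torch.cuda.synchronize()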

P2P communication between multiple GPUs on a single machine is fully supported when all of them are under a single RC. Nvidia has an implementation of this called GPUDirect. The GPUs directly access and transfer memory between each other over PCIe without involving the CPU and host memory. When sending data between GPUs across a network, this solution uses shared pinned memory to avoid a host-memory-to-host-memory copy; however, the host memory and CPU are still involved in the data transfer. Nvidia later collaborated with Mellanox to introduce GPUDirect RDMA, which transfers data directly from GPU memory to Mellanox’s InfiniBand adapter over PCIe. The CPU and host memory are no longer involved in the data transfer. Note that this particular functionality requires the GPU and the network card to share the same RC.
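
A quick way to check whether the driver will allow direct peer access between a pair of devices is sketched below; it assumes a PyTorch installation with at least two visible GPUs, and it says nothing about GPUDirect RDMA to a NIC.

import torch

# True typically requires the two GPUs to sit under the same RC or to share an NVLink connection.
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))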

Hyper-Q enables multiple CPU threads or processes to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and slashing CPU idle time. It accepts connections from CUDA streams, threads within a process, and MPI processes. Note that GeForce products cannot use Hyper-Q with MPI.
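
A hedged host-side sketch of that usage pattern, assuming PyTorch: several CPU threads submit work to the same GPU on their own streams, and on hardware with full Hyper-Q support those streams can be scheduled concurrently instead of being serialized onto a single work queue.

import threading
import torch

def worker(stream):
    with torch.cuda.stream(stream):
        a = torch.randn(1024, 1024, device="cuda")
        for _ in range(8):
            a = torch.relu(a @ a)  # stand-in workload issued from this thread's stream
    stream.synchronize()

streams = [torch.cuda.Stream() for _ in range(4)]
threads = [threading.Thread(target=worker, args=(s,)) for s in streams]
for t in threads:
    t.start()
for t in threads:
    t.join()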

A technology that is completely unrelated to GPGPU is SLI. The goal of SLI is to increase rendering performance by dividing the workload across multiple GPUs. All graphics resources that would normally be placed in GPU memory are automatically broadcast to the memory of all the GPUs in the SLI configuration.

Miscellaneous

The last pieces of a system are the motherboard, power supply unit (PSU), and chassis. Ensure that the motherboard supports the desired configuration. The PSU in turn needs to be efficient enough to power up such a system. The chassis just needs to house all of the components.

For machines with more than two GPUs, consider Cirrascale products. They are well-known and provide great service. However, if one wishes to have barebone hardware without any service fees, then assembling individual Supermicro components is one cost-effective solution. To avoid the hassle of assembly, configure and order from Thinkmate or Silicon Mechanics.

hwloc, lspci, and lstopo are tools for gathering information about increasingly complex parallel computing platforms so that they can be exploited efficiently.
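
A minimal sketch of invoking those tools from Python (assuming a Linux host with pciutils, hwloc, and the Nvidia driver installed; adjust the commands to taste):

import subprocess

for cmd in (["lspci", "-tv"],              # PCIe tree, including bridges and switches
            ["lstopo-no-graphics"],         # hwloc view of packages, caches, cores, and PCI devices
            ["nvidia-smi", "topo", "-m"]):  # GPU/NIC connectivity matrix
    subprocess.run(cmd, check=False)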

Custom Deep Learning System

One marketing tactic Nvidia employs is framing the presentation of their latest GPUs in a way that implies the latest product is essential in attaining the best performance. Hence one should always verify that claim with domain-specific benchmarks (e.g. Premiere Pro Quadro vs Titan) before making a purchase.

The Titan Xp/X is superior to the GTX 1080 Ti in terms of specs, but that does not translate into huge gains. Furthermore, the application may not be able to fully utilize the extra resources.

NVLink is another example of where it’s not cost effective to get the latest technology. Unless an algorithm (e.g. sorting) makes use of this increased bandwidth, replacing the PCIe Gen 3 fabric with NVLink can only give at most a 2x performance boost.

There are two concrete specifications that can be said about current deep learning systems.

  • Frameworks like PyTorch, MXNet, and TensorFlow exhibit near-linear scaling with multiple GPUs, so eight GPUs per node is sufficient.

    • Having more than eight GPUs in a node is not recommended because P2P is not supported beyond eight devices at any given instant.

    • The cascading GPU topology is not advised because these frameworks do not account for this type of dataflow.

  • In terms of system memory, while twice the GPU memory footprint would normally be sufficient to manage background data moves and back buffering, four times as much gives greater flexibility for managing in-memory working sets and streaming data movement.
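
As a rough, hypothetical sizing example: a node with eight 16 GB GPUs exposes \(8 \times 16 = 128\) GB of device memory, so the guideline above suggests about 256 GB of system memory as a floor and 512 GB for comfortable headroom.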

References

Law14

Jason Lawley. Understanding Performance of PCI Express Systems. https://www.xilinx.com/support/documentation/white_papers/wp350.pdf, 2014.

MKA+16

Maxime Martinasso, Grzegorz Kwasniewski, Sadaf R. Alam, Thomas C. Schulthess, and Torsten Hoefler. A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 63. IEEE Press, 2016.

Tsa16

Dan Tsafrir. PCI Express Architecture in a Nutshell. https://webcourse.cs.technion.ac.il/236376/Spring2016/ho/WCFiles/chipset_microarch.pdf, 2016.

VWKE16

Faith Virginia Van Wig, Luke Anthony Kachelmeier, and Kari Natania Erickson. Comparison of High Performance Networks: EDR InfiniBand vs. 100Gb RDMA Capable Ethernet. Technical Report, Los Alamos National Laboratory (LANL), 2016.

VCWUR+12

Jerome Vienne, Jitong Chen, Md Wasi-Ur-Rahman, Nusrat S. Islam, Hari Subramoni, and Dhabaleswar K. Panda. Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems. In High-Performance Interconnects (HOTI), 2012 IEEE 20th Annual Symposium on, 48–55. IEEE, 2012.