#################################################
Notes on Scraping Together a Heterogeneous System
#################################################

CPU
===

- Higher frequency is better for `CPU bound`_ applications.
- Higher :math:`\#_\text{cores}` and `multiprocessing`_ support are better for
  `perfectly parallel`_ workloads.

  - Inherently serial applications are bounded above by `Amdahl's law`_; see
    the sketch below.

- Physical cores represent the number of physical processing units on a chip.

  - The number of logical cores (:math:`\#_\text{cores} \times \#_\text{threads}`)
    is the maximum number of concurrent threads the chip supports via
    simultaneous multithreading (`SMT`_).
  - SMT works well when the threads have highly different characteristics,
    e.g. one thread doing mostly integer operations and another mainly doing
    floating point operations.
  - Note that in hardware virtualization, a logical core is called a virtual
    CPU, vCPU, or virtual processor.

- Larger multi-level cache size is better for `cache bound`_ applications.
- The chip also specifies what is supported in terms of memory size, memory
  type, max number of memory channels, PCIe data rate, and max number of PCIe
  lanes.

  - Higher values for these are better for `I/O bound`_ and `memory bound`_
    workloads.

.. _CPU bound: https://en.wikipedia.org/wiki/CPU-bound
.. _multiprocessing: https://en.wikipedia.org/wiki/Multiprocessing
.. _perfectly parallel: https://en.wikipedia.org/wiki/Embarrassingly_parallel
.. _Amdahl's law: https://en.wikipedia.org/wiki/Amdahl%27s_law
.. _SMT: https://en.wikipedia.org/wiki/Simultaneous_multithreading
.. _cache bound: https://software.intel.com/en-us/node/605613
.. _I/O bound: https://en.wikipedia.org/wiki/I/O_bound
.. _memory bound: https://en.wikipedia.org/wiki/Memory_bound_function
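
The bound from `Amdahl's law`_ is easy to compute.  Below is a minimal Python
sketch; the parallel fractions and core counts are made-up values chosen only
for illustration.

.. code:: python

   def amdahl_speedup(parallel_fraction, cores):
       """Upper bound on speedup when only ``parallel_fraction`` of the
       runtime benefits from ``cores`` workers (Amdahl's law)."""
       serial_fraction = 1.0 - parallel_fraction
       return 1.0 / (serial_fraction + parallel_fraction / cores)

   if __name__ == "__main__":
       # Hypothetical workloads: 95% and 50% parallelizable.
       for p in (0.95, 0.50):
           for n in (4, 16, 64):
               print(f"p={p:.2f}, cores={n:3d} -> speedup <= {amdahl_speedup(p, n):5.2f}x")

Even with 64 cores, a workload that is 95% parallel tops out around a 15x
speedup, which is why per-core frequency still matters for the serial portion.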

Multiprocessor/Multiprocessing
==============================

Outside of perfectly parallel workloads, a single CPU system is more cost
effective than a system with multiple physical CPUs because existing software
(e.g. `After Effects`_, `Premiere Pro`_, `SolidWorks`_) mostly does not take
advantage of the additional processor cores.  A system with additional
physical CPUs may even be slower than a single CPU system, possibly due to
`communication bandwidth`_ (e.g. `NUMA`_).

As an aside, a term that is useful to know is `transfers per second`_.
Multiplying the transfer rate by the information channel width gives the data
transmission rate.

.. _After Effects: http://ppbm7.com/index.php/tweakers-page/95-single-or-dual-cpu/109-single-or-dual-cpu
.. _Premiere Pro: https://www.pugetsystems.com/labs/articles/Should-you-use-a-Dual-Xeon-for-Premiere-Pro-CC-2017-932
.. _SolidWorks: https://www.pugetsystems.com/labs/articles/Solidworks-2016-Multi-Core-Performance-741
.. _communication bandwidth: https://en.wikipedia.org/wiki/List_of_device_bit_rates#Computer_buses
.. _NUMA: https://en.wikipedia.org/wiki/Non-uniform_memory_access
.. _transfers per second: https://en.wikipedia.org/wiki/Transfer_(computing)

Intel QuickPath Interconnect (QPI)
----------------------------------

Initially, Intel's CPUs used an `FSB`_ to access the `northbridge`_ and `DMI`_
to link to the `southbridge`_ (a.k.a. `ICH`_).  Intel later on replaced the
FSB with `QPI`_ and integrated the northbridge into the CPU die itself.  The
southbridge became redundant and was replaced by the `PCH`_.  The PCH still
uses DMI, but Intel has started to replace QPI with `UPI`_.  Any communication
to other CPUs and `uncore`_ components (e.g. remote memory, L3 cache) uses
QPI.  Other external communications (e.g. local memory, devices) use pins,
PCIe, SATAe, etc.  In the `Skylake microarchitecture`_, core-to-core
(intra-chip) communication uses a ring bus interconnect; Intel has since
replaced it with a `mesh topology interconnect`_.

.. _FSB: https://en.wikipedia.org/wiki/Front-side_bus
.. _northbridge: https://en.wikipedia.org/wiki/Northbridge_(computing)
.. _DMI: https://en.wikipedia.org/wiki/Direct_Media_Interface
.. _ICH: https://en.wikipedia.org/wiki/I/O_Controller_Hub
.. _southbridge: https://en.wikipedia.org/wiki/Southbridge_(computing)
.. _QPI: https://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect
.. _PCH: https://en.wikipedia.org/wiki/Platform_Controller_Hub
.. _UPI: http://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/7
.. _uncore: https://en.wikipedia.org/wiki/Uncore
.. _Skylake microarchitecture: https://en.wikichip.org/wiki/intel/microarchitectures/skylake
.. _mesh topology interconnect: http://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd-epyc-7000-cpu-battle-of-the-decade/5

AMD Infinity Fabric (IF)
------------------------

AMD's CPUs initially used an FSB to access the northbridge and `UMI`_ to link
to the southbridge (a.k.a. `FCH`_).  AMD later on replaced the FSB with `HT`_
and integrated the northbridge into the CPU die itself when it introduced the
`APU`_, which still uses UMI.  In an effort towards a `SoC`_ design, AMD
integrated its southbridge into the die and replaced HT with `IF`_.  The IF's
Scalable Data Fabric (SDF) connects each CCX (CPU Complex) to uncore devices
such as memory controllers and PCIe controllers.  It is a 256-bit
bi-directional crossbar that simultaneously transports data for multiple buses
to their final destination and runs at the speed of the memory controller.  In
the `Zen microarchitecture`_, die-to-die (intra-chip) communication uses AMD's
Global Memory Interconnect (GMI).

.. _UMI: https://en.wikipedia.org/wiki/Unified_Media_Interface
.. _FCH: https://en.wikipedia.org/wiki/List_of_AMD_chipsets#FCH
.. _HT: https://en.wikipedia.org/wiki/HyperTransport
.. _APU: https://en.wikipedia.org/wiki/AMD_Accelerated_Processing_Unit
.. _SoC: https://en.wikipedia.org/wiki/System_on_a_chip
.. _IF: https://en.wikichip.org/wiki/amd/infinity_fabric
.. _Zen microarchitecture: https://en.wikichip.org/wiki/amd/microarchitectures/zen

Memory
======

- Higher `data transfer rate`_ is better for `DRAM bandwidth bound`_
  applications.

  - The most common `DRAM`_ interface is `DDR`_, which allows either a read or
    a write at each clock edge.
  - A higher bandwidth interface is `GDDR`_, which allows a read and a write
    at each clock edge.
  - With the advent of stackable DRAM die technologies such as `HBM`_ and
    `HMC`_, the leap to higher bandwidth is achieved through adding more
    memory channels with wider bus width.

- Lower `memory timings`_ are better for latency bound applications.

  - Note that the `CAS latency`_ can only accurately measure the time to
    transfer the first word of memory.

- `Unregistered DIMMs`_ do not support very large amounts of memory.

  - `RDIMMs are faster than LRDIMMs`_, but the former can only support up to
    512GB while the latter can go beyond 1TB.
  - `RDIMMs`_ (and the like) require `ECC`_.

- `More channels of communication in the memory architecture`_ are better for
  DRAM bandwidth bound applications; see the estimate below.
- `Memory Deep Dive Series`_ is a nice overview of a server's memory
  subsystem.

.. _data transfer rate: https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory
.. _DRAM bandwidth bound: https://software.intel.com/en-us/node/638229
.. _DRAM: https://en.wikipedia.org/wiki/Dynamic_random-access_memory
.. _DDR: https://en.wikipedia.org/wiki/Double_data_rate
.. _GDDR: https://en.wikipedia.org/wiki/GDDR5_SDRAM
.. _HBM: https://en.wikipedia.org/wiki/High_Bandwidth_Memory
.. _HMC: https://en.wikipedia.org/wiki/Hybrid_Memory_Cube
.. _memory timings: https://en.wikipedia.org/wiki/Memory_timings
.. _CAS latency: https://en.wikipedia.org/wiki/CAS_latency
.. _Unregistered DIMMs: https://en.wikipedia.org/wiki/DIMM
.. _RDIMMs are faster than LRDIMMs: https://www.microway.com/hpc-tech-tips/ddr4-rdimm-lrdimm-performance-comparison/
.. _RDIMMs: https://en.wikipedia.org/wiki/Registered_memory
.. _ECC: https://en.wikipedia.org/wiki/ECC_memory
.. _More channels of communication in the memory architecture: https://en.wikipedia.org/wiki/Multi-channel_memory_architecture
.. _Memory Deep Dive Series: http://frankdenneman.nl/2015/02/18/memory-configuration-scalability-blog-series/
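
As a back-of-the-envelope check on the bullets above, peak DRAM bandwidth is
roughly the transfer rate times the bus width times the number of channels.
The sketch below assumes a hypothetical DDR4-3200 configuration; the same
arithmetic (transfer rate times channel width) applies to interconnects such
as QPI and IF.

.. code:: python

   def peak_dram_bandwidth_gbs(mts, bus_width_bits=64, channels=1):
       """Approximate peak bandwidth in GB/s: mega-transfers per second
       times bytes per transfer times the number of channels."""
       return mts * 1e6 * (bus_width_bits / 8) * channels / 1e9

   # Assumed DDR4-3200 modules: one channel vs. a four-channel configuration.
   print(peak_dram_bandwidth_gbs(3200))              # ~25.6 GB/s
   print(peak_dram_bandwidth_gbs(3200, channels=4))  # ~102.4 GB/s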

Storage
=======

Non-volatile data storage (`NVM`_) can be either `mechanically addressed`_ or
`electrically addressed`_.  The former has additional
`mechanical performance characteristics`_ to be aware of when examining I/O
bound applications.  Those measurements can be mapped onto a commonly accepted
metric consisting of `sequential and random operations`_.

Both storage system types are accessed through a predefined set of logical
device interfaces.

- `SATA`_ and `SAS`_ were designed primarily for HDDs.

  - SATA targets the lowest cost per gigabyte and is the most cost effective
    for low frequency access of reference/streaming/sequential data, e.g.
    archival data, file-sharing, email, web, backups.
  - SAS is geared towards maximal performance, reliability, and availability
    for high frequency, immediate, random access data, e.g. database
    transactions.
  - `Hard drive reliability`_ is highly dependent on capacity and the
    manufacturer (e.g. HGST, Western Digital, Seagate Technology).

- SATA could not keep up with the speed of SSDs, so `SATAe`_ was introduced to
  interface with PCIe SSDs through the `AHCI`_ drivers.

  - AHCI did not fully exploit the low latency and parallelism of PCIe SSDs,
    so it was replaced by `NVMe`_.
  - M.2 and U.2 are realizations of NVMe in different physical formats.

The aforementioned interfaces support `RAID`_ on a single system.  When
scaling beyond a single machine, the only viable solution is a
`distributed file system`_.

.. _NVM: https://en.wikipedia.org/wiki/Non-volatile_memory
.. _mechanically addressed: https://en.wikipedia.org/wiki/Hard_disk_drive
.. _electrically addressed: https://en.wikipedia.org/wiki/Solid-state_storage
.. _mechanical performance characteristics: https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics
.. _sequential and random operations: https://en.wikipedia.org/wiki/IOPS
.. _SATA: https://en.wikipedia.org/wiki/Serial_ATA
.. _SAS: https://en.wikipedia.org/wiki/Serial_Attached_SCSI
.. _Hard drive reliability: https://www.backblaze.com/b2/hard-drive-test-data.html
.. _SATAe: https://en.wikipedia.org/wiki/SATA_Express
.. _AHCI: https://en.wikipedia.org/wiki/Advanced_Host_Controller_Interface
.. _NVMe: https://en.wikipedia.org/wiki/NVM_Express
.. _RAID: https://en.wikipedia.org/wiki/RAID
.. _distributed file system: https://en.wikipedia.org/wiki/Clustered_file_system
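
To see how much the access pattern matters for an I/O bound job, the sketch
below estimates read times for a dataset under sequential versus random 4 KiB
access; the HDD and NVMe figures are rough assumptions, not measurements.

.. code:: python

   def read_time_s(total_bytes, seq_mbps=None, iops=None, page_bytes=4096):
       """Estimate read time from either sequential throughput (MB/s) or
       random IOPS at a fixed page size."""
       if seq_mbps is not None:
           return total_bytes / (seq_mbps * 1e6)
       return (total_bytes / page_bytes) / iops

   dataset = 100e9  # 100 GB
   # Assumed figures: HDD ~150 MB/s sequential, ~100 IOPS random;
   # NVMe SSD ~3000 MB/s sequential, ~400k IOPS random.
   print(f"HDD  sequential: {read_time_s(dataset, seq_mbps=150) / 60:7.1f} min")
   print(f"HDD  random 4K : {read_time_s(dataset, iops=100) / 3600:7.1f} h")
   print(f"NVMe sequential: {read_time_s(dataset, seq_mbps=3000):7.1f} s")
   print(f"NVMe random 4K : {read_time_s(dataset, iops=400_000):7.1f} s")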

Network Interface Controller
============================

Gigabit Ethernet (`1GbE`_) is a typical setup for small clusters running
workloads that are not I/O bound.  Technologies such as standard Ethernet
switches and LAN connections use `RJ45 connectors`_ to terminate
`twisted-pair copper cables`_.  If a longer maximum distance is desired, then
optical fiber transceivers (e.g. `SFP`_, `QSFP`_) can be used with
`LC connectors`_.  If more bandwidth is needed, `10GbE`_ is available through
`SFP+`_ and `QSFP+`_; `100GbE`_ is supported through QSFP+ and `CFP`_.

If the goal is to reduce latency, then the network adapters need to support
`Converged Enhanced Ethernet`_, which provides reliability without requiring
the complexity of `TCP`_.  Furthermore, this functionality is necessary to
perform `RDMA`_ in `computer clusters`_.  The initial implementation of RDMA
was `iWARP`_, but `iWARP has since been superseded`_ by `RoCE`_.

.. _1GbE: https://en.wikipedia.org/wiki/Gigabit_Ethernet
.. _RJ45 connectors: https://en.wikipedia.org/wiki/Modular_connector#8P8C
.. _twisted-pair copper cables: https://en.wikipedia.org/wiki/Ethernet_over_twisted_pair
.. _SFP: https://en.wikipedia.org/wiki/Small_form-factor_pluggable_transceiver#1_and_2.5_Gbit.2Fs_SFP
.. _QSFP: https://en.wikipedia.org/wiki/QSFP#4_x_1_Gbit.2Fs_QSFP
.. _LC connectors: https://en.wikipedia.org/wiki/Optical_fiber_connector
.. _SFP+: https://en.wikipedia.org/wiki/Small_form-factor_pluggable_transceiver#10_Gbit.2Fs_SFP.2B
.. _QSFP+: https://en.wikipedia.org/wiki/QSFP#4_x_10_Gbit.2Fs_QSFP.2B
.. _10GbE: https://en.wikipedia.org/wiki/10_Gigabit_Ethernet
.. _100GbE: https://en.wikipedia.org/wiki/100_Gigabit_Ethernet
.. _CFP: https://en.wikipedia.org/wiki/C_Form-factor_Pluggable
.. _Converged Enhanced Ethernet: https://en.wikipedia.org/wiki/Data_center_bridging
.. _TCP: https://en.wikipedia.org/wiki/Transmission_Control_Protocol
.. _RDMA: https://en.wikipedia.org/wiki/Remote_direct_memory_access
.. _computer clusters: https://en.wikipedia.org/wiki/Computer_cluster
.. _RoCE: https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
.. _iWARP: https://en.wikipedia.org/wiki/IWARP
.. _iWARP has since been superseded: http://www.mellanox.com/pdf/whitepapers/WP_RoCE_vs_iWARP.pdf

Some alternative network interconnects are `InfiniBand`_, `Fibre Channel`_,
and proprietary technologies such as Intel `Omni-Path`_.  Note that InfiniBand
provides RDMA capabilities through its own `set of protocols`_ (e.g. `IPoIB`_)
and needs to use a `network bridge`_ to communicate with Ethernet devices.
Fibre Channel instead supports `FCP`_ and interacts with Ethernet via `FCoE`_.
Omni-Path supports Ethernet and InfiniBand protocols as well as RDMA.
InfiniBand currently achieves the lowest latency and highest throughput,
followed by RoCE, Omni-Path, and lastly Fibre Channel
:cite:`vienne2012performance,van2016comparison`.

.. _InfiniBand: https://en.wikipedia.org/wiki/InfiniBand
.. _Fibre Channel: https://en.wikipedia.org/wiki/Fibre_Channel
.. _Omni-Path: https://en.wikipedia.org/wiki/Omni-Path
.. _set of protocols: http://www.infinibandta.org/content/pages.php?pg=technology_public_specification
.. _IPoIB: https://tools.ietf.org/html/rfc4391
.. _network bridge: https://en.wikipedia.org/wiki/Bridging_(networking)
.. _FCP: https://en.wikipedia.org/wiki/Fibre_Channel_Protocol
.. _FCoE: https://en.wikipedia.org/wiki/Fibre_Channel_over_Ethernet
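
A simple latency-plus-serialization model makes these fabric trade-offs
concrete.  The line rates and one-way latencies below are rough assumed
figures for illustration, not benchmark results.

.. code:: python

   def transfer_time_us(message_bytes, bandwidth_gbps, latency_us):
       """Time to move one message: one-way latency plus serialization time."""
       return latency_us + message_bytes * 8 / (bandwidth_gbps * 1e3)

   msg = 4 * 1024 * 1024  # e.g. a 4 MiB gradient buffer
   # (fabric, assumed line rate in Gbit/s, assumed one-way latency in us)
   fabrics = [("1GbE + TCP", 1, 50.0),
              ("10GbE + RoCE", 10, 5.0),
              ("InfiniBand EDR", 100, 1.0)]
   for name, gbps, lat in fabrics:
       print(f"{name:15s}: {transfer_time_us(msg, gbps, lat) / 1e3:8.2f} ms")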

Coprocessor Interconnect
========================

Modern `coprocessors`_ can be categorized into four types in order of
increasing cost: `GPUs`_, `manycore processors`_, `FPGAs`_, and `ASICs`_.
Even though their performance is highly dependent on the workload, all of
them share two characteristics:

- They require a local host CPU to configure and operate them through the
  root complex (`RC`_), which limits the number of accelerators per host.
- The unbalanced communication between distributed accelerators is further
  exacerbated by the limitations of `PCIe Gen 3`_ (quantified below).

.. _coprocessors: https://en.wikipedia.org/wiki/Coprocessor
.. _GPUs: https://en.wikipedia.org/wiki/Graphics_processing_unit
.. _manycore processors: https://en.wikipedia.org/wiki/Manycore_processor
.. _FPGAs: https://en.wikipedia.org/wiki/Field-programmable_gate_array
.. _ASICs: https://en.wikipedia.org/wiki/Application-specific_integrated_circuit
.. _RC: https://en.wikipedia.org/wiki/Root_complex
.. _PCIe Gen 3: https://en.wikipedia.org/wiki/PCI_Express

The RC `logically aggregates`_ PCIe hierarchy domains into a single PCIe
hierarchy :cite:`tsafrir2016pciea`.  This hierarchy, along with the RC, is
known as the PCIe fabric.  Since the CPU dictates the maximum number of
supported PCIe lanes that all PCIe links communicate over, the hierarchy
typically includes `switches and bridges`_.  Switches provide an aggregation
capability and allow more devices to be attached to a single root port.  They
act as packet routers and recognize which path a given packet needs to take
based on its address or other routing information.  A switch may have several
downstream ports, but it can only have one upstream port.  Bridges serve to
interface between different buses (e.g. PCI, PCIe).

.. _logically aggregates: https://en.wikipedia.org/wiki/PCI_configuration_space
.. _switches and bridges: https://en.wikipedia.org/wiki/PLX_Technology

The PCIe tree topology has several limitations.  Simultaneous communication
between all devices will induce congestion in the PCIe fabric, resulting in
reduced bandwidth.  The congestion factors include upstream port conflicts,
downstream port conflicts, `head-of-line blocking`_, and conflicts from
crossing the RC :cite:`martinasso2016pcie,lawley2014understanding`.
Furthermore, when there are multiple RCs, `inter-processor communication`_
needs to be taken into account if the devices are not under a single RC.

.. _head-of-line blocking: https://en.wikipedia.org/wiki/Head-of-line_blocking
.. _inter-processor communication: https://exxactcorp.com/blog/exploring-the-complexities-of-pcie-connectivity-and-peer-to-peer-communication/

To overcome these limitations, different technology groups have banded
together and proposed three new interconnect standards: `CAPI`_, `CCIX`_, and
`Gen-Z`_.

- CAPI is a new physical layer standard focused on low-latency, high-speed,
  coherent DMA between devices of different `ISAs`_.
- CCIX has the same goal, but builds upon PCIe Gen 4 and additionally supports
  `switched fabric`_ topologies.
- `NVLink`_ is an alternative proprietary interconnect technology tailored for
  Nvidia's GPUs.
- There has been speculation that CAPI and CCIX will converge at some point.
- Gen-Z is a memory semantic fabric that enables memory operations to
  direct-attached and disaggregated memory and storage.

  - Its packet-based protocol supports both CCIX and CAPI.

.. _CAPI: https://en.wikipedia.org/wiki/Coherent_Accelerator_Processor_Interface
.. _CCIX: https://www.ccixconsortium.com/
.. _Gen-Z: https://en.wikipedia.org/wiki/Gen-Z
.. _ISAs: https://en.wikipedia.org/wiki/Instruction_set_architecture
.. _switched fabric: https://en.wikipedia.org/wiki/Switched_fabric
.. _NVLink: https://en.wikipedia.org/wiki/NVLink
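
The PCIe limitation mentioned above is easy to quantify: per-direction
bandwidth is the signaling rate times the encoding efficiency times the lane
count.  A minimal sketch using the commonly quoted per-generation figures:

.. code:: python

   # (generation, GT/s per lane, encoding efficiency)
   PCIE_GENERATIONS = [("Gen 2", 5.0, 8 / 10),
                       ("Gen 3", 8.0, 128 / 130),
                       ("Gen 4", 16.0, 128 / 130)]

   def pcie_bandwidth_gbs(gts, efficiency, lanes):
       """Per-direction bandwidth in GB/s: GT/s * efficiency * lanes / 8 bits."""
       return gts * efficiency * lanes / 8

   for name, gts, eff in PCIE_GENERATIONS:
       print(f"{name} x16: ~{pcie_bandwidth_gbs(gts, eff, 16):4.1f} GB/s per direction")

A 16-lane Gen 3 link therefore tops out just under 16 GB/s per direction, and
that budget is shared as soon as several accelerators communicate through the
same switch or RC.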

PCIe Topology
-------------

There are two common communication patterns:

- Point-to-point communication between a single sender and a single receiver.
- Collective communication between multiple senders and receivers.

Most collectives are amenable to a bandwidth-optimal implementation on rings,
and many topologies can be interpreted as one or more rings.  `Ring-based
collectives`_ enable optimal intra-node communication.

.. _Ring-based collectives: http://images.nvidia.com/events/sc15/pdfs/NCCL-Woolley.pdf

`Digits DevBox`_
^^^^^^^^^^^^^^^^

Bandwidth between the two GPU groups is not as high as within a single group.

.. include:: pcie-topology-digits-devbox.json
   :code: javascript

.. _Digits DevBox: https://developer.nvidia.com/devbox

`Inefficient Configuration of 8 GPUs`_
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Inter-group bandwidth is half of intra-group bandwidth due to crossing the
RC(s).

.. include:: pcie-topology-balanced-8-gpus.json
   :code: javascript

or

.. include:: pcie-topology-balanced-8-gpus-single-root-complex.json
   :code: javascript

.. _Inefficient Configuration of 8 GPUs: http://on-demand.gputechconf.com/gtc/2016/presentation/s6492-scott-le-grand-deterministic-machine-learning-molecular-dynamics.pdf

Big Sur
^^^^^^^

Inter-group bandwidth is equivalent to intra-group bandwidth.  This
configuration is also known as cascading or daisy chaining switches.

.. include:: pcie-topology-big-sur.json
   :code: javascript

`DGX-1`_
^^^^^^^^

.. include:: pcie-topology-dgx-1.json
   :code: javascript

.. _DGX-1: https://devblogs.nvidia.com/parallelforall/dgx-1-fastest-deep-learning-system/

GPU(s)
======

.. list-table:: `Comparison of GPU Capability`_
   :stub-columns: 1

   * -
     - `Quadro`_
     - `Tesla`_
     - `GeForce`_
   * - (DP) `FLOPS`_
     - High
     - Medium to High
     - Low
   * - Memory Bandwidth
     - High
     - Medium to High
     - Low
   * - Memory Quantity
     - High
     - Medium to High
     - Low
   * - ECC
     - Yes
     - Yes
     - No
   * - Data Transfer Interconnect
     - PCIe/NVLink
     - PCIe/NVLink
     - PCIe
   * - DMA Engines
     - Dual
     - Dual
     - Single
   * - P2P
     - Yes
     - Yes
     - Yes
   * - RDMA
     - Yes
     - Yes
     - No
   * - `Hyper-Q`_
     - Full
     - Full
     - Partial
   * - `GPU Boost`_
     - Configurable
     - Configurable
     - Automatic
   * - Target
     - Graphics/Compute
     - Compute
     - Graphics/Compute
   * - Cluster Management Tools
     - Yes
     - Yes
     - No

.. _GeForce: https://en.wikipedia.org/wiki/GeForce
.. _Tesla: https://en.wikipedia.org/wiki/Nvidia_Tesla
.. _Quadro: https://en.wikipedia.org/wiki/Nvidia_Quadro
.. _FLOPS: https://en.wikipedia.org/wiki/FLOPS
.. _Comparison of GPU Capability: https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/
.. _Hyper-Q: http://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf
.. _GPU Boost: https://www.nvidia.com/content/PDF/kepler/nvidia-gpu-boost-tesla-k40-06767-001-v02.pdf

Besides the device-to-host and device-to-device interconnect technology, the
DMA engines, RDMA, and Hyper-Q are equally important features in
`high-performance computing`_.

.. _high-performance computing: https://en.wikipedia.org/wiki/High-performance_computing

Dual DMA engines enable simultaneous execution of the following pipelined
workload:

#. Transfer results from data chunk :math:`n - 1` from device to host.
#. Run the kernel that operates on data chunk :math:`n`.
#. Transfer data chunk :math:`n + 1` from host to device.

A single DMA engine can only transfer data in one direction at a time, so the
data transfer steps of the proposed pipeline will be executed sequentially, as
the rough timing model below illustrates.
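
The per-chunk copy and kernel times in this sketch are arbitrary assumptions
chosen only to illustrate the overlap, and the model ignores launch overheads.

.. code:: python

   def pipeline_time_ms(chunks, h2d_ms, kernel_ms, d2h_ms, dual_dma=True):
       """Approximate steady-state time to process ``chunks`` data chunks.
       With dual DMA engines, both copies overlap the kernel; with a single
       engine, the two copies serialize with each other (but can still
       overlap the kernel, which runs on its own engine)."""
       if dual_dma:
           per_chunk = max(h2d_ms, kernel_ms, d2h_ms)
       else:
           per_chunk = max(h2d_ms + d2h_ms, kernel_ms)
       return chunks * per_chunk

   # Assumed per-chunk costs in milliseconds.
   print(pipeline_time_ms(100, h2d_ms=4.0, kernel_ms=5.0, d2h_ms=4.0, dual_dma=True))   # ~500 ms
   print(pipeline_time_ms(100, h2d_ms=4.0, kernel_ms=5.0, d2h_ms=4.0, dual_dma=False))  # ~800 ms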

P2P communication between multiple GPUs on a single machine is fully supported
when all of them are under a single RC.  Nvidia's implementation of this is
called `GPUDirect`_.  The GPUs directly access and transfer memory between
each other over PCIe without involving the CPU and host memory.  When sending
data between GPUs across a network, this solution uses shared `pinned memory`_
to avoid a host-memory-to-host-memory copy.  However, the host memory and CPU
are still involved in the data transfer.  Nvidia later on collaborated with
Mellanox to introduce `GPUDirect RDMA`_, which transfers data directly from
GPU memory to Mellanox's InfiniBand adapter over PCIe.  The CPU and host
memory are no longer involved in the data transfer.  Note that this particular
functionality requires the GPU and the network card to share the same RC.

.. _GPUDirect: https://developer.nvidia.com/gpudirect
.. _pinned memory: https://en.wikipedia.org/wiki/CUDA_Pinned_memory
.. _GPUDirect RDMA: https://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms/

Hyper-Q enables multiple CPU threads or processes to launch work on a single
GPU simultaneously, thereby dramatically increasing GPU utilization and
slashing CPU idle times.  It accepts connections from `CUDA`_ streams, threads
within a process, and `MPI`_ processes.  Note that GeForce products cannot use
Hyper-Q with MPI.

.. _CUDA: https://en.wikipedia.org/wiki/CUDA
.. _MPI: https://en.wikipedia.org/wiki/Message_Passing_Interface

A technology that is completely unrelated to `GPGPU`_ is `SLI`_.  The goal of
SLI is to increase rendering performance by dividing the workload across
multiple GPUs.  All graphics resources that would normally be expected to be
placed in GPU memory are `automatically broadcasted`_ to the memory of all the
GPUs in the SLI configuration.

.. _GPGPU: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
.. _SLI: https://en.wikipedia.org/wiki/Scalable_Link_Interface
.. _automatically broadcasted: http://developer.download.nvidia.com/whitepapers/2011/SLI_Best_Practices_2011_Feb.pdf

Miscellaneous
=============

The last pieces of a system are the `motherboard`_, power supply unit
(`PSU`_), and `chassis`_.  Ensure that the motherboard supports the desired
configuration.  The PSU in turn needs to be `efficient enough`_ to
`power up such a system`_.  The chassis just needs to house all of the
components.

.. _motherboard: https://en.wikipedia.org/wiki/Motherboard
.. _PSU: https://en.wikipedia.org/wiki/Power_supply_unit_(computer)
.. _chassis: https://en.wikipedia.org/wiki/Computer_case
.. _efficient enough: https://en.wikipedia.org/wiki/80_Plus
.. _power up such a system: https://outervision.com/power-supply-calculator

For machines with more than two GPUs, consider `Cirrascale`_ products.  They
are well-known and provide great service.  However, if one wishes to have
barebone hardware without any service fees, then assembling individual
`Supermicro`_ components is one cost-effective solution.  To avoid the hassle
of assembly, configure and order from `Thinkmate`_ or `Silicon Mechanics`_.

.. _Cirrascale: http://cirrascale.com/
.. _Supermicro: http://supermicro.com
.. _Thinkmate: http://www.thinkmate.com/systems/supermicro
.. _Silicon Mechanics: https://www.siliconmechanics.com

`hwloc`_, `lspci`_, and `lstopo`_ are ways to gather information about
increasingly complex parallel computing platforms so that they can be
exploited efficiently.

.. _hwloc: https://www.open-mpi.org/projects/hwloc/
.. _lspci: https://linux.die.net/man/8/lspci
.. _lstopo: https://linux.die.net/man/1/lstopo
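
These tools can also be driven from a script to snapshot a node's layout
before pinning work to it.  A minimal sketch, assuming ``lspci``, hwloc's
``lstopo-no-graphics``, and (on Nvidia machines) ``nvidia-smi`` are installed
and on the ``PATH``:

.. code:: python

   import shutil
   import subprocess

   def dump(cmd):
       """Run a topology-inspection command if it is available and print its output."""
       if shutil.which(cmd[0]) is None:
           print(f"{cmd[0]}: not installed, skipping")
           return
       print(subprocess.run(cmd, capture_output=True, text=True).stdout)

   dump(["lstopo-no-graphics"])        # hwloc's textual view of caches, cores, and PCI devices
   dump(["lspci", "-tv"])              # PCI device tree
   dump(["nvidia-smi", "topo", "-m"])  # GPU-to-GPU / GPU-to-NIC connection matrix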

Custom Deep Learning System
===========================

One marketing tactic Nvidia employs is framing the presentation of their
latest GPUs in a way that implies the latest product is essential to attaining
the best performance.  Hence, one should always verify that claim with
domain-specific benchmarks (e.g. `Premiere Pro Quadro vs Titan`_) before
making a purchase.

.. _Premiere Pro Quadro vs Titan: https://www.pugetsystems.com/labs/articles/Premiere-Pro-CC-2017-NVIDIA-Quadro-Pascal-Performance-938

The Titan Xp/X is superior to the GTX 1080 Ti in terms of specs, but that
`does not translate into huge gains`_.  Furthermore, the application may not
be able to `fully utilize the extra resources`_.

.. _does not translate into huge gains: https://www.pugetsystems.com/labs/hpc/TitanXp-vs-GTX1080Ti-for-Machine-Learning-937/
.. _fully utilize the extra resources: https://www.pugetsystems.com/labs/hpc/PCIe-X16-vs-X8-for-GPUs-when-running-cuDNN-and-Caffe-887/

NVLink is another example of where it is `not cost effective`_ to get the
latest technology.  Unless an algorithm (e.g. sorting) makes use of this
increased bandwidth, replacing the PCIe Gen 3 fabric with NVLink can only give
at most a 2x performance boost.

.. _not cost effective: http://www.azken.com/images/dgx1_images/dgx1-system-architecture-whitepaper1.pdf

Two concrete recommendations can be made about current deep learning systems.

- Frameworks like `PyTorch`_, `MXNet`_, and `TensorFlow`_ exhibit near-linear
  scaling with multiple GPUs, so eight GPUs per node is sufficient.

  - Having more than eight GPUs in a node is `not recommended`_ because P2P is
    not supported beyond eight devices at any given instant.
  - The cascading GPU topology is not advised because these frameworks do not
    account for this type of dataflow.

- In terms of system memory, while twice the GPU memory footprint would
  normally be sufficient to manage background data moves and back buffering,
  four times as much gives greater flexibility for managing in-memory working
  sets and streaming data movement.

.. _PyTorch: http://pytorch.org/
.. _MXNet: http://mxnet.io/
.. _TensorFlow: https://www.tensorflow.org/performance/benchmarks
.. _not recommended: https://devtalk.nvidia.com/default/topic/1004967/max-number-of-cuda-devices/

.. rubric:: References

.. bibliography:: refs.bib
   :all: