Nvidia GPU¶
GPGPU: General-purpose computing on graphics processing units.
Nvidia: A company that designs graphics processing units (GPUs) for the gaming and professional markets. Nvidia GPUs have a better ecosystem for machine learning.
Monitoring¶
nvidia-smi
watch -n 1 nvidia-smi
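For logging utilization from a script instead of watching the terminal, nvidia-smi also has a CSV query mode. The sketch below is a minimal Python wrapper, assuming nvidia-smi is on the PATH. The helper names parse_gpu_csv and query_gpus are made up for illustration, and it simply returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess

# Properties requested through nvidia-smi's --query-gpu option.
FIELDS = ["utilization.gpu", "memory.used", "memory.total"]

def _to_number(value):
    """Convert a CSV cell to float; keep the raw string for cells like '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return value

def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split(",")]
        rows.append(dict(zip(FIELDS, (_to_number(c) for c in cells))))
    return rows

def query_gpus():
    """Run nvidia-smi once; return [] if it is not installed or exits with an error."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)

if __name__ == "__main__":
    for gpu_index, stats in enumerate(query_gpus()):
        print(gpu_index, stats)
```

With nounits, utilization is reported in percent and memory in MiB, so the dicts can be written straight to a log or a plot.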
GPU Summary Table¶
Mixed precision¶
Mixed Precision Training
Half precision (FP16) can lower memory consumption and speed up computation.
- GPUs whose half precision is slower than single precision are considered as not supporting it.
- Tensor precision stats (1:8) come from the Nvidia Turing whitepaper; the others come from the wiki.
With Tensor Cores (4x4 matrix cores) and FP16 accumulate (TensorRT), the theoretical speedup is 8x (usually 2x to 3x in practice, because workloads are memory bound).
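The memory saving and the precision cost of FP16 can be checked on the CPU. This is a minimal sketch using Python's struct module, whose "e" format is IEEE half precision. The parameter count is a hypothetical number, and tensor_bytes and to_fp16 are illustrative helpers, not part of any GPU library.

```python
import struct

def tensor_bytes(num_elements, fmt):
    """Bytes needed for num_elements values of struct format fmt ('e' = FP16, 'f' = FP32)."""
    return num_elements * struct.calcsize(fmt)

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

n = 25_000_000  # hypothetical parameter count, for illustration only
fp32_mib = tensor_bytes(n, "f") / 2**20
fp16_mib = tensor_bytes(n, "e") / 2**20
print(f"FP32: {fp32_mib:.1f} MiB, FP16: {fp16_mib:.1f} MiB")

# FP16 has a 10-bit mantissa, so the spacing between values near 1.0 is 2**-10.
# An update smaller than half of that spacing is lost when stored in FP16.
print(to_fp16(1.0 + 1e-4) == 1.0)  # True: the 1e-4 update vanished
```

The lost update is one reason mixed precision training schemes typically keep an FP32 copy of the weights and use FP16 only where the reduced range and precision are acceptable.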
Processing power: GeForce stats from wiki/GeForce_10_series and wiki/GeForce_20_series, Tesla stats from Nvidia.
Google Cloud Platform(GCP): https://cloud.google.com/compute/gpus-pricing
CUDA¶
Compute Unified Device Architecture
Some SOTA papers ship their own CUDA code to implement a novel idea, such as deformable convolution or optical flow warping, so it is worth understanding some basic CUDA programming.
An Even Easier Introduction to CUDA
- Add a specifier to the code that runs on the GPU
  - __global__: runs on the GPU and can be called from CPU code (a kernel)
  - __device__: runs on the GPU and can be called from GPU code (device code)
  - __host__: runs on the CPU and can be called from CPU code (host code)
- Memory Allocation in CUDA
cudaMallocManaged, cudaFree
- The caller adds the execution configuration <<<numBlocks, blockSize>>>
  - numBlocks: the number of thread blocks
  - blockSize: the number of threads in a thread block, which should be a multiple of 32
- in the kernel, assign index for parallel computation
value|description
---|---
gridDim.x|the number of blocks in the grid
blockIdx.x|the index of the current thread block in the grid
blockDim.x|the number of threads in the block
threadIdx.x|the index of the current thread within its block
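The index arithmetic can be traced on the CPU before writing a real kernel. The following is a Python emulation, not CUDA code: it loops over every (blockIdx.x, threadIdx.x) pair sequentially and applies the grid-stride loop pattern. The function names num_blocks_for and simulate_add and the sizes used are assumptions for illustration.

```python
def num_blocks_for(n, block_size):
    """Ceiling division: enough blocks so that each of n elements gets a thread."""
    return (n + block_size - 1) // block_size

def simulate_add(x, y, num_blocks, block_size):
    """CPU emulation of an 'add' kernel launched as <<<num_blocks, block_size>>>."""
    n = len(x)
    grid_dim, block_dim = num_blocks, block_size          # gridDim.x, blockDim.x
    for block_idx in range(grid_dim):                     # blockIdx.x
        for thread_idx in range(block_dim):               # threadIdx.x
            index = block_idx * block_dim + thread_idx    # global index of this thread
            stride = block_dim * grid_dim                 # total threads in the grid
            # Grid-stride loop: each thread handles index, index + stride, ...
            for i in range(index, n, stride):
                y[i] = x[i] + y[i]
    return y

n = 1000
block_size = 256                          # a multiple of 32
num_blocks = num_blocks_for(n, block_size)
y = simulate_add([1.0] * n, [2.0] * n, num_blocks, block_size)
print(num_blocks, y[0], y[-1])            # 4 3.0 3.0
```

Because of the stride, the result stays correct even when the grid has fewer threads than elements, for example a single block of 256 threads covering all 1000 elements.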
Documents¶
CUDA C++ Programming Guide
CUDA C++ Best Practices Guide
CUDA Math API
Datatype: half vs half2¶
half2 structures store two half values in the space of a single 32-bit word, as the bottom of Figure 1 shows.
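The packing can be reproduced on the CPU with Python's struct module, whose "e" format is IEEE half precision. This is only a sketch of the memory layout, assuming the first element sits in the low 16 bits of the word (little-endian order). pack_half2 and unpack_half2 are illustrative names, not CUDA intrinsics.

```python
import struct

def pack_half2(lo, hi):
    """Pack two FP16 values into one 32-bit word, lo in the low 16 bits."""
    (word,) = struct.unpack("<I", struct.pack("<ee", lo, hi))
    return word

def unpack_half2(word):
    """Split a 32-bit word back into its two FP16 values."""
    return struct.unpack("<ee", struct.pack("<I", word))

word = pack_half2(1.0, -2.0)
print(hex(word))            # 0xc0003c00: 0x3c00 is 1.0, 0xc000 is -2.0
print(unpack_half2(word))   # (1.0, -2.0)
```

Since both values travel in one 32-bit word, a single load or arithmetic instruction on a half2 can process two elements at once, which is where the throughput gain over scalar half comes from.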
cuDNN¶
The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
TensorRT¶
NVIDIA TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications. TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications. Reduced precision inference significantly reduces application latency, which is a requirement for many real-time services, auto and embedded applications.