cuda wiki

This blog post collects the knowledge points I ran into while learning CUDA.

It mainly follows the format of https://github.com/yszheda/wiki/wiki/CUDA: notes are recorded as links, so when you need the details you can go read the linked page.

CUDA Reading List

Just follow https://github.com/yszheda/wiki/wiki/CUDA-Reading-List.

CUDA Programming

modular arithmetic

For formulas like (a*b - c*d) mod m or (a*b + c) mod m,
use double-precision arithmetic to avoid the expensive integer div and mod operations.
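A concrete instance of the trick, sketched in host-side C for clarity (the same code works inside a CUDA kernel, where the reciprocal multiply maps onto fast double-precision FMA hardware; `mul_mod` is a hypothetical helper name):

```c
/* (a*b) mod m without an integer div/mod: multiply by a precomputed
 * double-precision reciprocal of m, then correct the +-1 rounding error.
 * Exact while a*b fits in a double's 53-bit mantissa (a, b < 2^26). */
unsigned int mul_mod(unsigned int a, unsigned int b, unsigned int m)
{
    double inv_m = 1.0 / (double)m;          /* precompute once per m */
    double ab = (double)a * (double)b;       /* exact for a, b < 2^26 */
    long long q = (long long)(ab * inv_m);   /* approximately floor(a*b / m) */
    long long r = (long long)ab - q * (long long)m;
    if (r < 0)             r += m;           /* q was rounded one too high */
    if (r >= (long long)m) r -= m;           /* q was rounded one too low  */
    return (unsigned int)r;
}
```

Since `q` can be off by at most one from the true quotient, the two conditional corrections are enough to land `r` in `[0, m)`.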

The difference between single- and double-precision floating point is summarized here:
+ What's the difference between a single precision and double precision floating point operation?

CUDA structure

volatile

The compiler automatically optimizes reads and writes to global and shared memory, e.g. caching a global-memory variable in a register or in the L1 cache.
The volatile keyword prevents this optimization: the compiler must assume that a volatile-qualified variable can be read or modified by another thread at any time, so every reference compiles to an actual memory access.
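The classic place this matters is the warp-level tail of a block reduction (a sketch, following the well-known shared-memory reduction pattern; it assumes pre-Volta implicit warp synchrony, so on newer architectures warp shuffles or `__syncwarp()` are preferred):

```cpp
// Last 6 steps of a 64+-element shared-memory reduction, done by one warp.
// Within a warp there is no __syncthreads(), so volatile is what forces the
// compiler to re-read sdata from shared memory each step instead of keeping
// stale partial sums in registers.
__device__ void warpReduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```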

warp

CUDA runtime API

cudaMemcpyAsync

cudaMemcpyAsync is the asynchronous version of cudaMemcpy. If both of the following conditions are met: - a non-default stream is used - the host memory is a pinned allocation,
the GPU assigns a free DMA copy engine, so the copy can overlap with other GPU operations, such as kernel execution or another copy in the opposite direction (if the GPU has multiple DMA copy engines).
If the two conditions are not both met, the behavior on the GPU is the same as cudaMemcpy; the only difference is that the call does not block the host.

In other words, cudaMemcpyAsync does not necessarily have to be used with a user-created stream and page-locked memory.
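A minimal host-side sketch of the two conditions (the kernel name, launch configuration, and `N` are placeholders):

```cpp
#include <cuda_runtime.h>
#include <string.h>

void async_copy_sketch(float *d_buf, const float *src, size_t N)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);                  // condition 1: non-default stream

    float *h_buf;
    cudaMallocHost(&h_buf, N * sizeof(float));  // condition 2: pinned host memory
    memcpy(h_buf, src, N * sizeof(float));

    // Both conditions met: the copy is handed to a free DMA engine, the call
    // returns immediately on the host, and the transfer can overlap with
    // kernels (or with a device-to-host copy on GPUs with two copy engines).
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... launch work in other streams here to get the overlap ...

    cudaStreamSynchronize(stream);              // wait before reusing h_buf
    cudaFreeHost(h_buf);
    cudaStreamDestroy(stream);
}
```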

MPS

MPS overview

Volta and later architectures implement MPS with hardware acceleration and isolate the address spaces of client processes, which further reduces kernel-launch latency. On Volta, an MPS server allows up to 48 simultaneous clients.
+ Multi-Process Scheduling (translated article)

Using MPS

start the MPS server:

#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d

stop the MPS server:

#!/bin/bash
echo quit | nvidia-cuda-mps-control
nvidia-smi -i 2 -c DEFAULT

IPC

pthreadMigration

A pthread can only have one context bound at a time; for a single device, the contexts created inside multiple pthreads are the same one.
pthreads cannot be accelerated by MPS.

Simple sample demonstrating multi-GPU/multithread functionality using the CUDA Context Management API. This API allows a CUDA context to be associated with a CPU process. CUDA contexts have a one-to-one correspondence with host threads. A host thread may have only one device context current at a time.

See threadMigration.cpp for the code.
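The Driver API calls that sample is built around can be sketched like this (names follow the CUDA Driver API context-management functions; error checking and the thread-creation boilerplate are omitted):

```cpp
#include <cuda.h>

CUcontext ctx;  // shared between threads; at most one thread owns it at a time

void createOnMainThread()
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);  // ctx becomes current on *this* thread
    cuCtxPopCurrent(NULL);      // detach it so another thread can take it
}

void workerThread()
{
    cuCtxPushCurrent(ctx);      // migrate the context to this thread
    // ... cuMemAlloc / cuLaunchKernel / cuMemcpy using ctx ...
    cuCtxPopCurrent(NULL);      // release it for the next thread
}
```

The push/pop pair is what enforces the "one current context per host thread" rule quoted above.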

get SM-ID in cuda thread

static __device__ __inline__ uint32_t __mysmid()
{
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

%smid and %warpid are defined as volatile values: they reflect where the thread is at the moment of the read and may change over the thread's lifetime, so the result should not be assumed stable.
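A hypothetical usage of the helper above, recording which SM each block ran on:

```cpp
#include <stdint.h>

// Assumes __mysmid() from the snippet above is in scope.
// out must have at least gridDim.x elements.
__global__ void recordSmid(uint32_t *out)
{
    if (threadIdx.x == 0)
        out[blockIdx.x] = __mysmid();  // snapshot only: blocks are not
                                       // pinned to any particular SM
}
```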

Dynamic Parallelism

dynamic kernel creation

PTX

CUDA Snippets

review

  • Interview questions on CUDA Programming?
    • How many different kinds of memory are there in a GPU?
    • What does coalesced / uncoalesced mean?
    • Can you implement a matrix transpose kernel?
    • What is a warp?
    • How many warps can run simultaneously inside a multiprocessor?
    • What is the difference between a block and a thread?
    • Can threads communicate between them? And blocks?
    • Can you describe how a cache works?
    • What is the difference between shared memory and registers?
    • Which algorithms perform better on the GPU: data-bound or CPU-bound?
    • Which steps will you perform to port an application to CUDA?
    • What is a barrier?
    • What is a stream?
    • Can you describe what occupancy of a kernel means?
    • What does structure of arrays vs. array of structures mean?
  • Nvidia Interview | Set 1