Contents
  1. 1. CUDA Reading List
  2. 2. CUDA Programming
    1. 2.1. modular arithmetic
    2. 2.2. CUDA structure
      1. 2.2.1. volatile
    3. 2.3. warp
    4. 2.4. MPS
      1. 2.4.1. MPS Introduction
      2. 2.4.2. MPS Usage
    5. 2.5. IPC
    6. 2.6. pthreadMigration
    7. 2.7. get SM-ID in cuda thread
    8. 2.8. Dynamic Parallelism
  3. 3. PTX
  4. 4. CUDA Snippets
  5. 5. review

This blog post collects summaries of the knowledge points I ran into while learning CUDA.

It mostly follows the format of https://github.com/yszheda/wiki/wiki/CUDA, recording topics as links; when something is needed, go to the linked page and search there.

CUDA Reading List

Just follow https://github.com/yszheda/wiki/wiki/CUDA-Reading-List and you are set.

CUDA Programming

modular arithmetic

For formulas like (a * b - c * d) mod m or (a * b + c) mod m, use double-precision arithmetic to avoid expensive div and mod operations.
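
A minimal sketch of the double-precision trick (the function name mul_mod and the 2^26 operand bound are my assumptions, not from the post): the quotient is approximated with a double-precision division, and the remainder is recovered with cheap integer multiplies.

// (a * b) mod m without a hardware div/mod, assuming a, b, m < 2^26
// so that a * b < 2^52 is represented exactly by a double.
__device__ unsigned int mul_mod(unsigned int a, unsigned int b, unsigned int m)
{
    // approximate quotient via double-precision division
    unsigned int q = (unsigned int)((double)a * (double)b / (double)m);
    // 32-bit wraparound is harmless here: the true result is small
    int r = (int)(a * b - q * m);
    if (r < 0)        r += m;   // q was one too large
    if (r >= (int)m)  r -= m;   // q was one too small
    return (unsigned int)r;
}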

The differences between single- and double-precision floating point are summarized here:

CUDA structure

volatile

The compiler automatically optimizes reads and writes to global and shared memory, for example by caching a global-memory variable in a register or in the L1 cache.
The volatile keyword prevents this optimization: the compiler must assume that a volatile variable may be read or modified by another thread at any time, so every reference is compiled into an actual memory access.
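
A minimal sketch of where this matters, e.g. busy-waiting on a flag written by another thread (spin_wait is an illustrative name, not from the post):

// Without volatile, the compiler may cache *flag in a register and
// spin on the stale copy forever; volatile forces a fresh load from
// memory on every iteration.
__device__ void spin_wait(volatile int *flag)
{
    while (*flag == 0)
        ;
}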

warp

MPS

MPS介绍

Volta and later architectures implement MPS with hardware acceleration and isolate the address spaces of client processes, which further reduces the latency of kernel launches. Under Volta, the MPS server allows at most 48 simultaneous clients.

MPS使用

start the MPS server:

#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"       # select the GPU the clients will see
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS  # same GPU index as above
nvidia-cuda-mps-control -d            # start the MPS daemon

stop the MPS server:

#!/bin/bash
echo quit | nvidia-cuda-mps-control   # shut down the MPS daemon
nvidia-smi -i 0 -c DEFAULT            # restore the default compute mode

IPC

pthreadMigration

A pthread can have only one CUDA context current at a time, and for a single device the context created from multiple pthreads (via the runtime API) is one and the same.
Multiple pthreads within one process cannot be accelerated by MPS (MPS multiplexes processes, not threads of a single process).

Simple sample demonstrating multi-GPU/multithread functionality using
the CUDA Context Management API. This API allows a CUDA context to be
associated with a CPU process. CUDA contexts have a one-to-one correspondence
with host threads. A host thread may have only one device context current
at a time.

See threadMigration.cpp for the code.
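
A minimal sketch of the same idea with the Driver API, assuming one device and omitting error checks; a context created on one host thread is popped there and pushed current on another:

#include <cuda.h>

CUcontext ctx;

// runs on thread A: create a context and detach it
void create_and_release(void)
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);   // ctx becomes current on this thread
    cuCtxPopCurrent(NULL);       // detach it so another thread can use it
}

// runs on thread B: make the same context current and use it
void attach_and_use(void)
{
    cuCtxPushCurrent(ctx);       // ctx is now current on this thread
    /* launch kernels, allocate memory, ... */
    cuCtxPopCurrent(NULL);
}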

get SM-ID in cuda thread

static __device__ __inline__ uint32_t __mysmid()
{
    // read the special register %smid via inline PTX
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

Note that %smid and %warpid are defined as volatile values: they return the location of a thread at the moment they are read, and the value may change during execution (for example after preemption and rescheduling).
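
A hypothetical usage example, letting each block print the SM it happens to run on:

#include <cstdio>

__global__ void report_smid(void)
{
    // one thread per block reports the block's current SM
    if (threadIdx.x == 0)
        printf("block %d runs on SM %u\n", blockIdx.x, __mysmid());
}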

Dynamic Parallelism

Dynamic parallelism allows dynamic kernel creation: a kernel running on the device can launch further kernels without returning control to the host.
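
A minimal sketch, assuming compute capability 3.5 or newer and compilation with nvcc -rdc=true (the kernel names are illustrative):

__global__ void child(int *data)
{
    data[threadIdx.x] *= 2;
}

__global__ void parent(int *data)
{
    // device-side launch: the child grid is created
    // without a round-trip to the host
    if (threadIdx.x == 0)
        child<<<1, 32>>>(data);
}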

PTX

CUDA Snippets

review

  • Interview questions on CUDA Programming?

    • How many different kinds of memory are there in a GPU?
    • What do coalesced / uncoalesced mean?
    • Can you implement a matrix transpose kernel?
    • What is a warp?
    • How many warps can run simultaneously inside a multiprocessor?
    • What is the difference between a block and a thread?
    • Can threads communicate with each other? And blocks?
    • Can you describe how a cache works?
    • What is the difference between shared memory and registers?
    • Which algorithms perform better on the GPU: data-bound or CPU-bound?
    • Which steps would you take to port an application to CUDA?
    • What is a barrier?
    • What is a stream?
    • Can you describe what the occupancy of a kernel means?
    • What do structure of arrays vs. array of structures mean?
  • Nvidia Interview | Set 1
