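cudaMemcpyAsync is the asynchronous version of cudaMemcpy. If both of the following conditions hold:
- a non-default stream is used
- the host memory is a pinned (page-locked) allocation

the GPU assigns a free DMA copy engine to the transfer, so the copy can overlap with other GPU work such as kernel execution or another copy (provided the GPU has more than one DMA copy engine).
If the two conditions are not both met, the operation on the GPU behaves the same as cudaMemcpy, except that it does not block the host.

In other words, cudaMemcpyAsync does not strictly require a user-created stream and pinned memory; it only needs them to achieve overlap.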
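
A minimal sketch of the two conditions, assuming a single CUDA device is available. The kernel `scale` and the helpers `enqueue` and `overlap_demo` are hypothetical names made up for this illustration. Whether the two streams actually overlap depends on the hardware (number of copy engines) and is best confirmed with a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: gives the GPU some work that a copy can overlap with.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Enqueue H2D copy -> kernel -> D2H copy on one stream.
// Nothing in this function waits for the GPU.
static void enqueue(float *h, float *d, int n, float factor, cudaStream_t s) {
    size_t bytes = n * sizeof(float);
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d, n, factor);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);
}

// Returns true if both host buffers hold the expected values afterwards.
bool overlap_demo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_a = nullptr, *h_b = nullptr, *d_a = nullptr, *d_b = nullptr;
    cudaStream_t s0, s1;

    // Condition 2: pinned (page-locked) host allocations.
    if (cudaMallocHost(&h_a, bytes) != cudaSuccess) return false;
    if (cudaMallocHost(&h_b, bytes) != cudaSuccess) return false;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Condition 1: non-default streams.
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Work in s0 and s1 is independent, so copies in one stream
    // may overlap with the kernel or copies in the other.
    enqueue(h_a, d_a, n, 2.0f, s0);
    enqueue(h_b, d_b, n, 3.0f, s1);

    // Only here does the host wait for all enqueued work.
    cudaError_t err = cudaDeviceSynchronize();
    bool ok = (err == cudaSuccess) && h_a[n - 1] == 2.0f && h_b[n - 1] == 6.0f;

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    return ok;
}

int main() {
    printf("overlap demo %s\n", overlap_demo() ? "passed" : "failed");
    return 0;
}
```

Replacing cudaMallocHost with malloc, or passing stream 0 instead of s0/s1, should still give the same results, but the transfers can no longer be expected to overlap with other GPU work.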

MPS

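MPS introduction

Volta and later architectures provide a hardware-accelerated implementation of MPS and isolate the address spaces of client processes, which further reduces kernel launch latency. On Volta, the MPS server supports up to 48 concurrent clients.
+ Multi-Process Scheduling (translated article)

Using MPS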

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

start the MPS server:

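#!/bin/bash
# the following must be performed with root privilege
export CUDA_VISIBLE_DEVICES="0"
# nvidia-smi indexes GPUs in PCI bus order, which may differ from the CUDA index above;
# "-i 2" must refer to the same physical GPU as CUDA device 0 on your system
nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
nvidia-cuda-mps-control -d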

stop the MPS server:

#!/bin/bash
echo quit | nvidia-cuda-mps-control
nvidia-smi -i 2 -c DEFAULT

IPC

pthreadMigration

A pthread can have only one context bound at a time, and for a given device the contexts created inside multiple pthreads are the same one.
pthreads cannot be accelerated by MPS.

Simple sample demonstrating multi-GPU/multithread functionality using the CUDA Context Management API. This API allows a CUDA context to be associated with a CPU process. CUDA contexts have a one-to-one correspondence with host threads. A host thread may have only one device context current at a time.

See threadMigration.cpp for the code.
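
The note about multiple threads seeing the same context can be checked with a small experiment. This is a sketch and not the threadMigration sample itself. It assumes the runtime API is used, where all host threads share the device's primary context; contexts created explicitly through the driver API are separate objects. The helper names `query_context` and `threads_share_context` are hypothetical. Build with something like `nvcc demo.cu -lcuda`.

```cuda
#include <cstdio>
#include <thread>
#include <cuda.h>
#include <cuda_runtime.h>

// Bind device 0 in the calling thread via the runtime API, then ask the
// driver API which context is current in this thread.
static void query_context(CUcontext *out) {
    cudaSetDevice(0);
    cudaFree(0);            // forces runtime initialisation in this thread
    cuCtxGetCurrent(out);   // context currently bound to this host thread
}

// Returns true if two host threads observe the same context for device 0.
bool threads_share_context() {
    CUcontext a = nullptr, b = nullptr;
    std::thread t0(query_context, &a);
    std::thread t1(query_context, &b);
    t0.join();
    t1.join();
    printf("thread 0 ctx = %p, thread 1 ctx = %p\n", (void *)a, (void *)b);
    return a != nullptr && a == b;
}

int main() {
    printf("same context: %s\n", threads_share_context() ? "yes" : "no");
    return 0;
}
```

Each thread still has exactly one current context at any moment; the two threads simply have the same one bound.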

get SM-ID in cuda thread

static __device__ __inline__ uint32_t __mysmid()
{
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

%smid and %warpid are defined as volatile values
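
As a usage sketch for the helper above, the following records which SM each block ran on. The kernel `record_smid` and the host function `collect_smids` are hypothetical names. Because %smid is volatile, the value is only a snapshot at the time of the read, and the block-to-SM mapping can differ from run to run.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Same helper as in the note above: reads the %smid special register.
static __device__ __inline__ uint32_t __mysmid() {
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block stores the SM id that block is running on.
__global__ void record_smid(uint32_t *out) {
    if (threadIdx.x == 0) out[blockIdx.x] = __mysmid();
}

// Launches `blocks` blocks and copies their SM ids into h_out.
cudaError_t collect_smids(uint32_t *h_out, int blocks) {
    uint32_t *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, blocks * sizeof(uint32_t));
    if (err != cudaSuccess) return err;
    record_smid<<<blocks, 32>>>(d_out);
    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(h_out, d_out, blocks * sizeof(uint32_t),
                     cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err;
}

int main() {
    const int blocks = 8;
    uint32_t smids[blocks];
    if (collect_smids(smids, blocks) != cudaSuccess) {
        printf("CUDA error\n");
        return 1;
    }
    for (int b = 0; b < blocks; ++b)
        printf("block %d ran on SM %u\n", b, smids[b]);
    return 0;
}
```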

Dynamic Parallelism

dynamic kernel creation

PTX

The reference guide for inlining PTX

CUDA Snippets

review

Interview questions on CUDA Programming?
- How many different kinds of memory are there in a GPU?
- What does coalesced / uncoalesced mean?
- Can you implement a matrix transpose kernel?
- What is a warp?
- How many warps can run simultaneously inside a multiprocessor?
- What is the difference between a block and a thread?
- Can threads communicate with each other? And blocks?
- Can you describe how a cache works?
- What is the difference between shared memory and registers?
- Which algorithms perform better on the GPU: data-bound or CPU-bound?
- Which steps would you perform to port an application to CUDA?
- What is a barrier?
- What is a stream?
- Can you describe what the occupancy of a kernel means?
- What does structure of arrays vs. array of structures mean?
Nvidia Interview | Set 1