This is a small introductory text to get started with writing GPGPU code. This isn’t an in-depth tutorial that teaches you how to do everything, but rather points you in a direction to go explore.
After doing the “assignments” you will have achieved the following goals:
- IDE with language server and linting.
- Compile and run GPU code.
- Able to write a simple GPU-“accelerated” hello, world! program.
- Exchange your soul for cool hardware technology.
GPUs come in a lot of different flavors. Each has its own way of being interacted with, and most of them have different instruction sets. For now, let’s pick the software stack for either of the two major vendors: ROCm (HIP) for AMD, and CUDA for Nvidia. Both of these platforms provide a C++ runtime API and language extensions for writing kernels. The assignments and text fluff are mostly here to give you some kind of guidance on what to do. But the bulk of the information and knowledge required needs to be acquired via research! Feed a person a fish, and they will be fed for a day.
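To give you a small taste of what those language extensions look like, here is roughly the smallest thing that counts as GPU code (written with the CUDA spellings; don’t worry about compiling it yet, that’s what Assignment 1 is for):

// the __global__ qualifier marks a function that runs on the
// gpu, and the <<<blocks, threads>>> syntax launches it from
// ordinary host code. (hip additionally wants
// #include <hip/hip_runtime.h> at the top.)
__global__ void do_nothing() {}

int main() {
    do_nothing<<<1, 1>>>();
    return 0;
}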
Assignment 1. Sell your soul to big-hardware.
Find out what (dedicated) GPU you have, and install the appropriate software stack. Make sure you can invoke the now-installed C++ compiler: hipcc for AMD, and nvcc for Nvidia. Alternatively, if you feel daring, you could use something such as OpenCL for better cross-platform support. I guess, if you really want to, you could do something with Rust and the SPIR-V targets, but I have no experience with that. I just know it’s possible… somehow.
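A quick sanity check that the toolchain actually works: throw something like the following into a file and see if the compiler eats it. The file name and flags below are just one way of doing it.

// hello.cu (hello.cpp works fine for hipcc) -- host-only code,
// but compiled by the gpu toolchain, so it proves the install.
// compile and run with something like:
//   nvcc hello.cu -o hello && ./hello      (Nvidia)
//   hipcc hello.cpp -o hello && ./hello    (AMD)
#include <iostream>

int main() {
    std::cout << "the compiler lives!\n";
    return 0;
}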
Since we’re now able to compile and run things on the GPU in the most basic sense, let’s make our future self’s life a bit easier. Having an IDE that can communicate with a language server (via a protocol such as LSP) is very useful. The IDE can then help you with autocomplete and even find code issues.
Assignment 2. Typing software.
Pick an IDE (I use VSCode a lot, but you could use Neovim), and set up the language server. Since we’re using C++ for our kernels (HIP C++ and CUDA C++), something like clangd works really well. Make sure you get red highlighting when you write something stupid, and actually good autocomplete suggestions when you start writing.
Now, the real cool kids on the block are tired of typing nvcc and hipcc all the time. And you have to link stuff too? Crazy! Luckily we can use some tooling to manage the builds for us.
Assignment 3. IKEA manuals. (optional)
Get something like cmake. Well really, just get CMake. Also, you want ninja, because make -j is stupid. Build systems aren’t the focus here. Just vibe code your way to being able to do ninja my_target_or_whatever.
Okay, you did it! The most annoying part of the tutorial is over. Let’s finally use that expensive piece of hardware in your puter! I assume you already know how to do a hello, world! in plain ol’ C++. If not, do that now!
As for making sure everything actually works: let’s make a simple kernel that will run on our GPU device, and some host code that will fill a buffer with numbers!
// count_up.cu
// note: we're missing some includes. your IDE should
// be smart enough to help you fill them in!
// this will be our common type between host and
// device.
using data_t = int;
// this is the gpu kernel. it will run on the gpu!
__global__ void counting_kernel(std::size_t num_elems, data_t* data) {
    // this code runs for every thread on the gpu.
    // but we want each thread to do something slightly
    // different, so we need a way to differentiate
    // between them. luckily, we can use the following
    // variables:
    // - blockIdx: the index of the launched block
    // - blockDim: the number of threads in this block
    // - threadIdx: the index of this thread in this block
    // we can combine these to get the global thread id.
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // let's do something really simple. each thread
    // will write its own global thread id to the
    // data buffer. the bounds check keeps us from ever
    // writing past the end of the buffer, even though the
    // launch configuration below already guarantees that.
    if (tid < num_elems) {
        data[tid] = tid;
    }
}
int main() {
    // define how many elements we want to work on.
    constexpr std::size_t num_elems = 64;
    // allocate the data buffer on the device.
    data_t* d_data;
    cudaMalloc(&d_data, num_elems * sizeof(data_t));
    // for a kernel launch we need to launch a grid
    // of blocks, where each block contains a number
    // of threads.
    //
    // we will investigate this deeper later. for now,
    // just make sure we don't launch more threads
    // than the amount of work we need to do.
    constexpr unsigned int grid_size = 2;
    constexpr unsigned int block_size = 32;
    static_assert(num_elems >= grid_size * block_size, "kernel launch must not exceed allocated memory!");
    // launch a kernel that fills the data on device.
    counting_kernel<<<grid_size, block_size>>>(num_elems, d_data);
    // we have the data on device, but need it on host.
    // create a buffer on host to store the results.
    auto h_data = std::vector<data_t>(num_elems);
    // copy the data from the device to the host.
    cudaMemcpy(h_data.data(), d_data, num_elems * sizeof(data_t), cudaMemcpyDeviceToHost);
    // we don't need the data on the device anymore.
    // let's clean up after ourselves and free the memory.
    cudaFree(d_data);
    // oh right, still need to print the results for
    // us to verify!
    for (std::size_t i = 0; i < num_elems; ++i) {
        std::cout << i << "\t: " << h_data[i] << std::endl;
    }
    // graceful exit :3
    return 0;
}
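One caveat: the example above uses the CUDA spellings of everything. If you went down the AMD route, the HIP runtime mirrors it almost one-to-one. Here’s my sketch of the same program with the HIP names (comments trimmed; the file name is just a suggestion):

// count_up.hip.cpp -- the same program, spelled the HIP way.
// the kernel and the <<<grid, block>>> launch syntax are
// unchanged; only the include and the runtime calls differ.
#include <hip/hip_runtime.h>
#include <cstddef>
#include <iostream>
#include <vector>

using data_t = int;

__global__ void counting_kernel(std::size_t num_elems, data_t* data) {
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < num_elems) {
        data[tid] = tid;
    }
}

int main() {
    constexpr std::size_t num_elems = 64;
    constexpr unsigned int grid_size = 2;
    constexpr unsigned int block_size = 32;
    static_assert(num_elems >= grid_size * block_size, "kernel launch must not exceed allocated memory!");

    data_t* d_data;
    hipMalloc(&d_data, num_elems * sizeof(data_t));

    counting_kernel<<<grid_size, block_size>>>(num_elems, d_data);

    auto h_data = std::vector<data_t>(num_elems);
    hipMemcpy(h_data.data(), d_data, num_elems * sizeof(data_t), hipMemcpyDeviceToHost);
    hipFree(d_data);

    for (std::size_t i = 0; i < num_elems; ++i) {
        std::cout << i << "\t: " << h_data[i] << "\n";
    }
    return 0;
}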
Assignment 4. Hello, world!
This program should do the following:
- Allocate a buffer on device.
- Launch a kernel that does work on this buffer. It should fill the entire buffer with the string Hello, world! repeated.
- Copy this buffer from device to host.
- Print the contents of this buffer.
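If the “repeated” part has you staring at the ceiling, here’s the indexing idea on the host side. None of this is the assignment itself, and all the names are made up; turning the loop body into a kernel (one thread per element) and getting the message characters onto the device is the actual exercise.

// hint.cpp -- plain host-side c++, just to show the indexing.
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    constexpr char msg[] = "Hello, world!";
    // sizeof includes the trailing '\0'; skip it so the
    // repetitions run back to back.
    constexpr std::size_t msg_len = sizeof(msg) - 1;
    constexpr std::size_t num_elems = 64;

    auto buffer = std::vector<char>(num_elems);
    for (std::size_t i = 0; i < num_elems; ++i) {
        buffer[i] = msg[i % msg_len];
    }

    for (std::size_t i = 0; i < num_elems; ++i) {
        std::cout << buffer[i];
    }
    std::cout << "\n";
    return 0;
}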
The world of GPUs is now your oyster (as long as you can afford the power bill). Perhaps try to implement a cool parallel algorithm now!