When you compute geometry on the CPU and draw it with OpenGL, data has to travel from system RAM to GPU VRAM over PCIe every frame. For a large mesh updated every frame, that transfer becomes the bottleneck. The math is fast. The bus is slow.
CUDA-OpenGL interop removes that transfer. A CUDA kernel writes directly into an OpenGL buffer object while it lives in GPU memory. OpenGL then reads from the same memory. No round-trip through the CPU. No copy.
The Four Steps
Interop follows a strict protocol every frame. Always in this order:
- Register the OpenGL buffer with CUDA (done once at init)
- Map the resource to give CUDA exclusive access
- Get a device pointer into the buffer that your kernel can write to
- Unmap to return the buffer to OpenGL
When the buffer is mapped, OpenGL must not use it. When it is unmapped, CUDA must not touch it. Violating this is undefined behavior.
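Put together, the per-frame portion of the protocol can be sketched as a single function. This is a minimal sketch: updateVerts, renderFrame, and vertexCount are placeholder names, not part of the interop API, and error checking is omitted.

```cuda
// Per-frame sketch of the map -> write -> unmap ordering.
// cuResource was registered once at init; updateVerts is a hypothetical kernel.
void renderFrame(cudaGraphicsResource_t cuResource, int vertexCount) {
    // Step 2: hand the buffer to CUDA.
    cudaGraphicsMapResources(1, &cuResource, 0);

    // Step 3: fetch a device pointer into the VBO's storage.
    float4 *devPtr = NULL;
    size_t size = 0;
    cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &size, cuResource);

    // Write new vertex data directly into the VBO.
    updateVerts<<<(vertexCount + 255) / 256, 256>>>(devPtr, vertexCount);

    // Step 4: return the buffer to OpenGL, then issue the draw calls.
    cudaGraphicsUnmapResources(1, &cuResource, 0);
}
```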
Step 1: Register
Create a normal OpenGL VBO, then register it with CUDA. This is a one-time call at initialization:
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
struct cudaGraphicsResource *cuResource;
cudaGraphicsGLRegisterBuffer(&cuResource, vbo, cudaGraphicsMapFlagsWriteDiscard);
cudaGraphicsMapFlagsWriteDiscard is a hint that tells CUDA the kernel will overwrite the entire buffer, so there is no need to preserve its previous contents. This allows the driver to skip any synchronization needed to maintain coherency with the old data. Use it when your kernel recomputes everything each frame.
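Registration can fail, for example when no CUDA-capable device is behind the current GL context, so it is worth checking the return value of this one-time call. A minimal sketch:

```cuda
// cudaGraphicsGLRegisterBuffer returns a cudaError_t like other runtime calls.
cudaError_t err = cudaGraphicsGLRegisterBuffer(&cuResource, vbo,
                                               cudaGraphicsMapFlagsWriteDiscard);
if (err != cudaSuccess) {
    fprintf(stderr, "register failed: %s\n", cudaGetErrorString(err));
}
```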
Step 2: Map
Before launching your kernel each frame, map the resource. This hands ownership to CUDA; from this point until you unmap, OpenGL must not touch the buffer:
cudaGraphicsMapResources(1, &cuResource, 0);
The first argument is the count of resources to map, the second is a pointer to the resource (or an array of them), and the third is an optional CUDA stream. Passing 0 uses the default stream.
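If the rest of your work already runs on a non-default stream, pass that stream here so the map, the kernel launch, and the unmap are ordered together. A short sketch (the stream name is illustrative):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Map, launch, and unmap all on the same stream so they execute in order.
cudaGraphicsMapResources(1, &cuResource, stream);
// ... fetch the device pointer and launch the kernel on `stream` ...
cudaGraphicsUnmapResources(1, &cuResource, stream);
```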
Step 3: Get a Device Pointer
Once mapped, ask for a raw pointer into the buffer's GPU memory:
float4 *devPtr = NULL;
size_t mappedSize = 0;
cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &mappedSize, cuResource);
devPtr now addresses GPU memory that belongs to the VBO. Pass it to your kernel like any other device pointer. Whatever the kernel writes there, OpenGL will see after you unmap.
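As an example of such a kernel, here is a sketch that animates a width-by-height grid of points into a traveling wave, writing one float4 per thread straight into the mapped buffer. The kernel, its parameters, and the launch configuration are illustrative, not part of the interop API.

```cuda
// Writes one float4 position per thread directly into the mapped VBO.
__global__ void waveKernel(float4 *pos, unsigned width, unsigned height,
                           float time) {
    unsigned x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Map grid coordinates into [-1, 1] and displace z with a sine wave.
    float u = x / (float)width  * 2.0f - 1.0f;
    float v = y / (float)height * 2.0f - 1.0f;
    float z = sinf(u * 8.0f + time) * cosf(v * 8.0f + time) * 0.2f;

    pos[y * width + x] = make_float4(u, v, z, 1.0f);
}

// Launch with a 2D grid covering the mesh (width, height, time assumed defined):
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// waveKernel<<<grid, block>>>(devPtr, width, height, time);
```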
Step 4: Unmap and Draw
After the kernel finishes, unmap to return the buffer to OpenGL, then draw as normal:
cudaGraphicsUnmapResources(1, &cuResource, 0);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(ATTRIB_POSITION, 4, GL_FLOAT, GL_FALSE, 0, NULL);
glEnableVertexAttribArray(ATTRIB_POSITION);
glDrawArrays(GL_POINTS, 0, vertexCount);
The VBO that OpenGL binds is the same one CUDA just wrote into. No upload happened.
Cleanup
On shutdown, unregister the resource before deleting the buffer:
cudaGraphicsUnregisterResource(cuResource);
glDeleteBuffers(1, &vbo);
Deleting a VBO that is still registered with CUDA is undefined behavior. Always unregister first.
When It Is Worth It
Interop pays off when a large GPU buffer is updated every frame by a computation that parallelizes well across thousands of threads. Particle systems, fluid and wave surfaces, physics-driven meshes. If the buffer is small, or updated rarely, the overhead of registration and the map/unmap calls each frame is not worth it. The benefit is proportional to how much data you would otherwise be transferring.