Reliable way to automatically get CUDA Cores and Blocks?

I'm trying to learn parallel programming with CUDA, and I'm wondering if there's a way to automatically get the number of blocks and threads at runtime, without looking up the card's specifications online, and then use those values in the <<<blocks, threads>>> launch configuration. I want to do this automatically so that the code utilizes the whole card regardless of which model it runs on.

Right now I'm using a GTX 1050, which has 640 CUDA cores according to the spec sheet on the NVIDIA website. If that's the total number of cores, how do I get the number of blocks? Right now I'm using 1.

Below is the code I'm trying to run.


#include <cuda.h>
#include "DataManager.h"

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <cmath>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>     /* srand, rand */
#include <time.h>
#include <array>
#include <random>

#define THREADS 640
DataManager dm;

__global__ 
void Operation(int n, float *x, float *io) {
	
	int index = threadIdx.x;
	int stride = blockDim.x;

	for (int i = index; i < n; i+= stride) {
		io[i] = x[i] * io[i];
		printf("%d %f\n", i , io[i]);
	}
}


int main(int argc, char ** argv){
	int N = THREADS;
	
	
	float *x, *y;
	cudaMallocManaged(&x, N * sizeof(float));
	cudaMallocManaged(&y, N * sizeof(float));

	for (int i = 0; i < N; i++) {
		x[i] = (float)(rand() % 100);
		y[i] = (float)(rand() % 100);
		
	}
	
	Operation <<<1, THREADS >>> (N, x, y);
	

	cudaDeviceSynchronize();
	cudaFree(x);
	cudaFree(y);
	return 0;
}

If you look at the deviceQuery.cpp in the nvidia toolkit, it gives a list of commands for returning each value for the device properties.
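For example, a minimal sketch of the relevant call (cudaGetDeviceProperties is what deviceQuery wraps; note that cores-per-SM isn't reported directly, deviceQuery derives it from the compute capability with a lookup table):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device:                %s\n", prop.name);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

multiProcessorCount and maxThreadsPerBlock are usually the two values you actually want for sizing a launch, rather than the raw core count.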

Don't use a single block; that forces the whole kernel onto a single SM (streaming multiprocessor) and leaves the rest of the GPU idle.

You should decide on the number of threads per block (a multiple of 32, up to a maximum of 1024 on current GPUs) and scale the number of blocks with the number of work items.
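Applied to the OP's code, that looks roughly like this (a sketch, assuming 256 threads per block; note the kernel index has to include blockIdx.x once more than one block is launched):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void Operation(int n, const float *x, float *io) {
    // Grid-stride loop: correct for any grid size, covers all n elements
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride)
        io[i] = x[i] * io[i];
}

int main() {
    const int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 2.0f; y[i] = 3.0f; }

    int threadsPerBlock = 256;                                    // multiple of 32
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up
    Operation<<<numBlocks, threadsPerBlock>>>(N, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // 2 * 3 = 6
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```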

Apart from trying all possibilities yourself, CUDA also has a function that tries to guess the optimal block size for your kernel:

https://devblogs.nvidia.com/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/

Note that the optimal block size might change from kernel to kernel because of register pressure.
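Used on the OP's kernel, that API looks something like this (a sketch; cudaOccupancyMaxPotentialBlockSize suggests a block size that maximizes occupancy for this specific kernel):

```cuda
#include <cuda_runtime.h>

__global__ void Operation(int n, const float *x, float *io) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride)
        io[i] = x[i] * io[i];
}

int main() {
    const int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for an occupancy-maximizing block size for this kernel
    // (last two args: dynamic shared memory per block, block size limit)
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, Operation, 0, 0);

    int numBlocks = (N + blockSize - 1) / blockSize;
    Operation<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```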

Also make sure that you schedule a few times more blocks than you have compute units (SMs) so that the GPU can efficiently hide memory latency (the problem size should be at least a few times larger than the number of cores in your GPU).
