Reliable way to automatically get CUDA Cores and Blocks?

I'm trying to learn parallel programming with CUDA, and I'm wondering if there's a way to automatically get the number of blocks and threads at runtime, without looking up the card's specifications online, and then use those values in the <<<blocks, threads>>> launch configuration. I want to do this automatically so that the code utilizes the whole card regardless of which model it runs on.

Right now I'm using a GTX 1050, which has 640 CUDA cores according to the spec sheet on the NVIDIA website. If that's the total number of cores, how do I get the number of blocks? Right now I'm using 1.

Below is the code I'm trying to run.


#include <cuda.h>
#include "DataManager.h"

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <cmath>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>     /* srand, rand */
#include <time.h>
#include <array>
#include <random>

#define THREADS 640
DataManager dm;

__global__ 
void Operation(int n, float *x, float *io) {
	
	int index = threadIdx.x;
	int stride = blockDim.x;

	for (int i = index; i < n; i+= stride) {
		io[i] = x[i] * io[i];
		printf("%d %f\n", i , io[i]);
	}
}


int main(int argc, char ** argv){
	int N = THREADS;
	
	
	float *x, *y;
	cudaMallocManaged(&x, N * sizeof(float));
	cudaMallocManaged(&y, N * sizeof(float));

	for (int i = 0; i < N; i++) {
		x[i] = (float)(rand() % 100);
		y[i] = (float)(rand() % 100);
		
	}
	
	Operation <<<1, THREADS >>> (N, x, y);
	

	cudaDeviceSynchronize();
	cudaFree(x);
	cudaFree(y);
	return 0;
}

If you look at the deviceQuery.cpp in the nvidia toolkit, it gives a list of commands for returning each value for the device properties.
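For example, a minimal sketch of the relevant call (cudaGetDeviceProperties is what deviceQuery wraps; note that cores-per-SM isn't reported directly, deviceQuery derives it from the compute capability with a lookup table):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device:                %s\n", prop.name);
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

multiProcessorCount and maxThreadsPerBlock are usually the two values you actually want for sizing a launch, rather than the raw core count.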

Don't use a single block; that forces the whole kernel onto a single SM (streaming multiprocessor) and leaves the rest of the GPU idle.

You should decide on the number of threads per block (a multiple of 32, up to a maximum of 1024 on current GPUs) and scale the number of blocks with the number of work items.
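Applied to the OP's code, that looks roughly like this (a sketch, assuming 256 threads per block; note the kernel index has to include blockIdx.x once more than one block is launched):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void Operation(int n, const float *x, float *io) {
    // Grid-stride loop: correct for any grid size, covers all n elements
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride)
        io[i] = x[i] * io[i];
}

int main() {
    const int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 2.0f; y[i] = 3.0f; }

    int threadsPerBlock = 256;                                    // multiple of 32
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // round up
    Operation<<<numBlocks, threadsPerBlock>>>(N, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // 2 * 3 = 6
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```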

Apart from trying all possibilities yourself, CUDA also has a function that tries to guess the optimal block size for your kernel:

https://devblogs.nvidia.com/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/

Note that the optimal block size might change from kernel to kernel because of register pressure.
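Used on the OP's kernel, that API looks something like this (a sketch; cudaOccupancyMaxPotentialBlockSize suggests a block size that maximizes occupancy for this specific kernel):

```cuda
#include <cuda_runtime.h>

__global__ void Operation(int n, const float *x, float *io) {
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = index; i < n; i += stride)
        io[i] = x[i] * io[i];
}

int main() {
    const int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for an occupancy-maximizing block size for this kernel
    // (last two args: dynamic shared memory per block, block size limit)
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, Operation, 0, 0);

    int numBlocks = (N + blockSize - 1) / blockSize;
    Operation<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```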

Also make sure that you schedule a few times more blocks than you have compute units (SMs) so that the GPU can efficiently hide memory latency (the problem size should be at least a few times larger than the number of cores in your GPU).
