GPGPU

General-Purpose computing on
Graphics Processor Units

Tomek Janiszewski / @janiszt

Why should I bother with GPUs?

Moore is no more

Real parallel computation

Available to normal people

Applications

  • Graphics
  • Simulations
  • AI
  • Data Analysis

“Talk is cheap. Show me the code.”

Linus Torvalds

Add the corresponding elements of A and B, and store the result in C.

Plain Old C


void vecadd(int *A, int *B, int *C)
{
    // L is the vector length, defined elsewhere
    for (int i = 0; i < L; i++) {
        C[i] = A[i] + B[i];
    }
}
						

OpenMP


void vecadd(int *A, int *B, int *C)
{
    int chunk = CHUNKSIZE;
    #pragma omp parallel shared(A, B, C, chunk)
    {
        // Iterations are handed out to the threads in chunks
        #pragma omp for schedule(dynamic, chunk) nowait
        for (int i = 0; i < L; i++) {
            C[i] = A[i] + B[i];
        }
    }
}
							

GLSL


#version 110

// The inputs live in two textures; each fragment adds one pair of texels
uniform sampler2D texture1;
uniform sampler2D texture2;

void main() {
    vec4 A = texture2D(texture1, gl_TexCoord[0].st);
    vec4 B = texture2D(texture2, gl_TexCoord[0].st);
    gl_FragColor = A + B;
}
							

OpenCL


__kernel
void vecadd(__global int *A,
            __global int *B,
            __global int *C)
{
    // One work-item per element
    int id = get_global_id(0);
    C[id] = A[id] + B[id];
}
							

CUDA


__global__
void vecadd(int *A, int *B, int *C)
{
    // One thread per element
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    C[id] = A[id] + B[id];
}
							

Not so fast

  1. Copy data to GPU
  2. Launch kernel
  3. Check for errors
  4. Copy output back to RAM
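In CUDA those four steps look roughly like the sketch below. It assumes the vecadd kernel from the CUDA slide and, because that kernel has no bounds check, a length n that is a multiple of the block size; vecadd_on_gpu is a made-up wrapper name.

#include <stdio.h>
#include <cuda_runtime.h>

// Host-side driver for the vecadd kernel (sketch only)
void vecadd_on_gpu(const int *A, const int *B, int *C, int n)
{
    int *dA, *dB, *dC;
    size_t bytes = n * sizeof(int);

    // 1. Copy data to GPU
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // 2. Launch kernel: one thread per element
    int threads = 256;
    vecadd<<<n / threads, threads>>>(dA, dB, dC);

    // 3. Check for errors
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));

    // 4. Copy output back to RAM
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}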

Results

http://hpclab.blogspot.com/2011/09/is-gpu-good-for-large-vector-addition.html

but for harder problems

http://hpclab.blogspot.com/2011/09/is-gpu-good-for-large-vector-addition.html

How does it work inside?

Rules Of Thumb

RTFM

“Life is too short for man pages, but occasionally much too short without them.”

Randall Munroe (xkcd.com)

Think parallel

SIMD (Single Instruction, Multiple Data)

Problems

Problem: Development is hard

Solution: Always have a spare GPU in your computer

Problem: Debugging is impossible

Solution: Write tests and run them!
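A minimal sketch of such a test: compute the same result on the CPU and compare element by element. vecadd_on_gpu is the hypothetical host wrapper from the earlier sketch.

#include <stdio.h>
#include <stdlib.h>

// Returns 0 when the GPU result matches the CPU reference
int test_vecadd(int n)
{
    int *A = (int *)malloc(n * sizeof(int));
    int *B = (int *)malloc(n * sizeof(int));
    int *C = (int *)malloc(n * sizeof(int));

    for (int i = 0; i < n; i++) {
        A[i] = rand() % 1000;
        B[i] = rand() % 1000;
    }

    vecadd_on_gpu(A, B, C, n);

    // CPU reference check
    int failed = 0;
    for (int i = 0; i < n; i++) {
        if (C[i] != A[i] + B[i]) {
            printf("mismatch at %d: got %d, expected %d\n", i, C[i], A[i] + B[i]);
            failed = 1;
            break;
        }
    }

    free(A); free(B); free(C);
    return failed;
}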

Problem: Copying data to/from GPU is slow

Solution: Use streams and compute while data is being transferred
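A rough sketch of that overlap with CUDA streams: split the vectors into chunks so one chunk's copy can overlap another chunk's kernel. The chunk count, the helper name, and the assumptions that the host buffers are pinned (cudaMallocHost) and that n divides evenly by the chunk and block sizes are all illustrative.

#include <cuda_runtime.h>

#define NCHUNKS 4

// Assumes the vecadd kernel, pinned host buffers A, B, C,
// and device buffers dA, dB, dC of n elements each
void vecadd_streamed(const int *A, const int *B, int *C,
                     int *dA, int *dB, int *dC, int n)
{
    cudaStream_t streams[NCHUNKS];
    int chunk = n / NCHUNKS;
    size_t bytes = chunk * sizeof(int);

    for (int i = 0; i < NCHUNKS; i++)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NCHUNKS; i++) {
        int off = i * chunk;
        // Copies and the kernel are queued in the same stream,
        // so chunks in different streams can overlap
        cudaMemcpyAsync(dA + off, A + off, bytes, cudaMemcpyHostToDevice, streams[i]);
        cudaMemcpyAsync(dB + off, B + off, bytes, cudaMemcpyHostToDevice, streams[i]);
        vecadd<<<chunk / 256, 256, 0, streams[i]>>>(dA + off, dB + off, dC + off);
        cudaMemcpyAsync(C + off, dC + off, bytes, cudaMemcpyDeviceToHost, streams[i]);
    }

    for (int i = 0; i < NCHUNKS; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}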

Problem: GPUs don't like 64-bit computation

Solution: Wait for the next release

Problem: I don't want to write a lot of code

Solution: Use libs

  • ArrayFire
  • Thrust (STL for CUDA)
  • cuBLAS (Basic Linear Algebra Subprograms)
  • cuFFT
  • cuDNN (GPU-accelerated library of primitives for deep neural networks)

Check these before you code a custom solution.
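For instance, the whole vector-addition example collapses to a few lines with Thrust. A sketch: the host-to-device copies are hidden in the device_vector constructors, the addition runs on the GPU via thrust::transform, and thrust::copy brings the result back.

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <vector>

// Element-wise A + B without a hand-written kernel or explicit copies
std::vector<int> vecadd_thrust(const std::vector<int> &A, const std::vector<int> &B)
{
    thrust::device_vector<int> dA(A.begin(), A.end());  // host -> device
    thrust::device_vector<int> dB(B.begin(), B.end());
    thrust::device_vector<int> dC(A.size());

    thrust::transform(dA.begin(), dA.end(), dB.begin(), dC.begin(),
                      thrust::plus<int>());             // runs on the GPU

    std::vector<int> C(A.size());
    thrust::copy(dC.begin(), dC.end(), C.begin());      // device -> host
    return C;
}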

PG-Strom


postgres=# SELECT COUNT(*) FROM t1 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 15;
 count
-------
  6718
(1 row)

Time: 7019.855 ms
						

postgres=# SELECT COUNT(*) FROM t2 WHERE sqrt((x-25.6)^2 + (y-12.8)^2) < 15;
 count
-------
  6718
(1 row)

Time: 176.301 ms
						

t1 and t2 contain the same 10 million records, but t1 is a regular table and t2 is a foreign table managed by PG-Strom.

Questions?