[hpsdr] Intel's benchmark: GPU vs. CPU

Hermann hvh.net at gmail.com
Fri Jun 25 11:00:49 PDT 2010


Dear Charles, Ken, Jeremy and All,

First, thanks a lot, Jeremy! I wasn't aware of this tutorial, but it
will surely help. I spend most of my time optimizing my CUDA kernels
using Nvidia's tools. It is very important that the three kinds of
memory on the graphics adapter (global, shared and registers) are
used correctly; otherwise you end up with very bad performance in
comparison to a good CPU. E.g. accessing global memory on the adapter
carries a penalty of roughly a factor of one hundred in comparison to
shared memory. You have to design the algorithms around the use of
the memories, around well-known parallel patterns, coalesced writes,
etc.

That's why I guess you won't see too much relief of CPU usage in your
implementation, Charles! It is also known that CUFFT does not have
the best performance at small FFT sizes. There is much discussion
about this going on in the Nvidia CUDA forum. I use the FFT CUDA
implementation by Vasily Volkov (just google for it), which is faster
for smaller FFT sizes.

The most recent code I implemented was a sort of (block) ring buffer
using only global memory reads and writes (just a quick
implementation to see how this could work). This was easy because I
did not have to think about the size of the blocks and the size of
the grid. The performance was really bad. Then I started to think
about the number of threads and the block and grid sizes, which are
closely related to the problem you want to solve, the size of your
buffers, etc. In the end I had an implementation with minimal global
reads and writes, making maximum use of shared memory (shared memory
reads and writes are very fast) and (I hope) a clever organization of
indices. Suddenly it worked like a charm. I now read 16 kB of raw
data from Ozy directly into the GPU; the block ring buffer organizes
the shuffling of 2016 samples per 16 kB read (after converting the
24-bit raw samples to 32-bit floats). The same will be done when
forming the Ozy data to be sent back - same principle. The C&C bytes
are added to the buffer just before sending them to Ozy. I guess I
can do the same with the I and Q samples to be sent to Ozy.

The best thing is that I can now scale up and adapt the GPU
algorithms to any size I like very easily. Imagine OzyII is here...

I will report on my status in more detail shortly!

Vy 73,
Hermann

DL3HVH



On Fri, Jun 25, 2010 at 3:59 PM, Jeremy McDermond
<mcdermj at xenotropic.com> wrote:
> Hermann --
>
> There's a good tutorial at Mac Research on the memory layout of the NVidia GPUs that should be just as applicable to CUDA as OpenCL.  You might find it interesting for optimization of your kernels:
>
> http://www.macresearch.org/opencl_episode4
