[hpsdr] Announcement of a CudaSharpDSP package for HPSDR: doing parallel DSP processing on your GPU

Hermann hvh.net at gmail.com
Thu Apr 8 22:39:41 PDT 2010


Thanks for the information, Jeremy!

Being able to address not only different OS platforms but also
different types of hardware (CPU, Cell proc, GPU, etc) is indeed a big
advantage. I will take a look at it, because as I saw, the Cuda
toolkit already comprises OpenCL.

Since you mentioned the small block sizes we use: that is exactly the
reason why the performance gain I have seen so far doesn't really
justify the effort. This is a general problem with parallel
computing: when does the overhead of parallelization pay
off? In my case I minimized copying data to and from GPU memory,
because this is by far the biggest bottleneck - although we're
talking here about "big" bandwidths. There are only two places where
data is transferred between CPU RAM and GPU RAM: when I feed
the GPU with Ozy data, and on the way back, when I copy the output
buffer (the audio) back to Ozy.
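The pay-off question above can be put into a back-of-the-envelope model: offloading wins only when the GPU's per-sample speedup outweighs the fixed launch overhead plus the two host<->device copies. Here is a minimal sketch of that break-even arithmetic; all constants (per-sample costs, PCIe bandwidth, launch overhead) are illustrative assumptions, not measurements from CudaSharpDSP:

```python
def gpu_pays_off(n_samples,
                 cpu_ns_per_sample=30.0,    # assumed CPU cost per sample
                 gpu_ns_per_sample=5.0,     # assumed GPU kernel cost per sample
                 bytes_per_sample=8,        # complex float: 2 * 4 bytes
                 pcie_gb_per_s=4.0,         # assumed effective PCIe bandwidth
                 launch_overhead_us=60.0):  # assumed fixed per-call overhead
    """Rough model: is GPU offload faster than the CPU for a block of n samples?"""
    cpu_time = n_samples * cpu_ns_per_sample                       # ns
    # 1 GB/s == 1 byte/ns, and data crosses the bus twice (in and out)
    copy_time = 2 * n_samples * bytes_per_sample / pcie_gb_per_s   # ns
    gpu_time = launch_overhead_us * 1000 + copy_time + n_samples * gpu_ns_per_sample
    return gpu_time < cpu_time
```

Under these (made-up) numbers, 1k and 2k blocks stay on the CPU side of the break-even point, while 16k blocks come out ahead on the GPU - which matches the general picture, even if the real constants differ.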

There's a little benchmark I found on the web
(http://www.cv.nrao.edu/~pdemores/gpu/fft_times.png), where you can
see that for small FFT sizes, FFTW is comparable in performance. I
saw yet another benchmark (whose web page I no longer remember)
where FFTW was considerably faster than a CUDA FFT for small sizes,
say < 16k samples! And our sizes are about 1k or 2k!
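If anyone wants to reproduce the CPU side of such a crossover plot, measuring the per-transform time over a range of sizes is straightforward. A minimal sketch using NumPy's FFT as a stand-in for FFTW (the GPU side would need cuFFT and is not shown):

```python
import timeit
import numpy as np

def fft_time_per_transform(n, repeats=200):
    """Average wall-clock time (seconds) of one complex FFT of size n."""
    x = (np.random.randn(n) + 1j * np.random.randn(n)).astype(np.complex64)
    t = timeit.timeit(lambda: np.fft.fft(x), number=repeats)
    return t / repeats

# Compare the sizes we actually use (1k-2k) against the ~16k crossover region:
for n in (1024, 2048, 16384):
    print(n, fft_time_per_transform(n))
```

Plotting these against the corresponding cuFFT timings (including the copy time!) would show where the crossover lies on one's own hardware.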

So what we are doing is really highly experimental. And that is
also the reason why I can't wait for Ozy II :D ! I'm curious what I can
get out of Ozy II on my home Gigabit network, where I see
net copy rates of 20-30 MBytes/s. If I could feed my GPU memory
directly (bypassing the CPU - maybe via DMA?), that would really be great!

73, Hermann
DL3HVH


