[hpsdr] Announcement of a CudaSharpDSP package for HPSDR: doing parallel DSP processing on your GPU

Sat Apr 10 13:27:16 PDT 2010

On Apr 9, 2010, at 1:39 AM, Hermann wrote:

> There's a little benchmark I found on the web
> (http://www.cv.nrao.edu/~pdemores/gpu/fft_times.png) where you can
> see that, for small FFT sizes, FFTW is comparable in performance. I
> saw yet another benchmark (I no longer remember the web page) where
> FFTW was noticeably faster than a CUDA FFT for small sizes, say
> < 16k samples! And we are working with sizes of about 1k or 2k!

Yeah.  MacHPSDR is using 1k block sizes to feed the DSP, and the biggest FFT is only around 4k points.  I'm not sure how much time you save by doing the parts of the DSP chain other than the FFT on the GPU, though.  I'm already noticing a decent amount of improvement just from doing things like the magnitude calculation of complex numbers with functions written using SIMD instructions on the main CPU.
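
To make the magnitude step concrete, here is a scalar reference sketch of the operation in Python.  This is not the MacHPSDR code: the function name `magnitudes` and the interleaved I/Q layout are assumptions for illustration only.  A SIMD version computes the same values, just several samples per instruction.

```python
import math


def magnitudes(iq):
    """Return |I + jQ| for each sample of an interleaved I/Q sequence.

    Assumption (not from MacHPSDR): samples arrive as [I0, Q0, I1, Q1, ...].
    """
    if len(iq) % 2:
        raise ValueError("interleaved I/Q data must have an even length")
    # One hypot() call per complex sample; a SIMD routine would process
    # several of these pairs at once.
    return [math.hypot(iq[n], iq[n + 1]) for n in range(0, len(iq), 2)]


if __name__ == "__main__":
    block = [3.0, 4.0, 0.0, 1.0, -6.0, 8.0]
    print(magnitudes(block))
```

A reference like this is mostly useful for checking the output of a faster SIMD or GPU version against known-good values.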

> So, what we are doing is really highly experimental. And that is
> also the reason why I can't wait for Ozy II :D ! I'm curious what I can
> get out of Ozy II on my home Gigabit network. On this network I get
> net copy rates of 20-30 MBytes/s. If I could feed my GPU memory
> directly (without the CPU - maybe DMA?????), this would really be great!

Again, the problem is the size of the chunks you have to process.  As Bill mentioned last night, this will probably really come into its own when you are trying to implement many receivers, or with something like CW Skimmer.  The problem here is latency.  Transferring big blocks is necessary to make the memory copy worthwhile, but any delay through the audio chain is also a little bit touchy.  It's not such a big deal with the phone modes, but for CW, people get grumpy.  For that reason alone this may be worth doing.

Another thing to consider with OpenCL, though, is the ability to run the kernels on the main CPU.  This avoids a bit of the memory copying, and you can control which compute devices you send work to.  So, for stuff that's going to be worth it, you can send it to the GPU or any other compute device out there.  If it doesn't look like it's going to be a speed improvement, you can tell OpenCL to send it only to the CPU.
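
To put rough numbers on the latency concern, here is a back-of-the-envelope sketch in Python.  It only computes how long it takes to collect one block of samples before any processing can start.  The 192 kHz sample rate is an assumption (the DSP may be fed at a lower rate), and `block_latency_ms` is a made-up helper name.  Copy time and processing time are ignored, so real latency would be higher.

```python
def block_latency_ms(block_size, sample_rate_hz):
    """Time needed to fill one block of samples, in milliseconds."""
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return 1000.0 * block_size / sample_rate_hz


if __name__ == "__main__":
    # Assumption: samples arrive at the full 192 kHz rate.
    sample_rate_hz = 192_000
    # 1k is the block size mentioned above, 4k the largest FFT, and 16k the
    # size below which FFTW was reported to beat a CUDA FFT.
    for block_size in (1024, 4096, 16384):
        latency = block_latency_ms(block_size, sample_rate_hz)
        print(f"{block_size:6d} samples -> {latency:6.2f} ms")
```

Under that assumption a 1k block takes about 5.3 ms to fill, a 4k block about 21.3 ms, and a 16k block about 85.3 ms.  The last figure is already in the range where CW operators are likely to notice.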

Most of this is just speculation, since I haven't implemented anything yet.  But I'm looking forward to having a chance to look at your work and possibly collaborate on putting some GPGPU functions together to accelerate things.  It might be neat to use it to run many receivers within the 192 kHz passband.

> 
> 73, Hermann
> DL3HVH
> 

--
Jeremy McDermond (NH6Z)
Xenotropic Systems
mcdermj at xenotropic.com
