[hpsdr] Announcement of a CudaSharpDSP package for HPSDR: doing parallel DSP processing on your GPU

Hermann hvh.net at gmail.com
Sun Apr 11 01:07:50 PDT 2010


On Sat, Apr 10, 2010 at 10:27 PM, Jeremy McDermond
<mcdermj at xenotropic.com> wrote:
>
> On Apr 9, 2010, at 1:39 AM, Hermann wrote:
>
>> There's a little benchmark I found on the web
>> (http://www.cv.nrao.edu/~pdemores/gpu/fft_times.png), where you can
>> see that for small FFT sizes, FFTW is comparable in performance. I
>> saw yet another benchmark (whose web page I don't remember any more)
>> where FFTW was considerably faster than a Cuda FFT for small sizes,
>> say < 16k samples! Now, we are working with sizes of about 1k or 2k!
>
> Yeah.  MacHPSDR is using 1k block sizes to feed the DSP, and the biggest FFT is only around 4k points.
> I'm not sure how much time you save by doing the parts of the DSP chain other than the FFT on the GPU, though.  I'm
> noticing a decent amount of improvement just by doing things like magnitude calculations of complex numbers
> using functions written with SIMD instructions on the main CPU.
>

Yes, most of my Cuda implementation is already about parallelizing
the 1k or 2k loops in the KISS code (AGC, filters, spectra, output).
The point is, once you have copied, say, 1k samples into GPU memory,
it's better to stay there, because of the large bandwidth difference
between on-device accesses and host-to-device copies. On a GTX280 the
peak bandwidth between device memory (the memory on the graphics
adapter) and the GPU is about 141 GB/s, compared to roughly 8 GB/s
when copying data from host memory to the GPU over PCIe Gen2.
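
To illustrate the idea, here is a minimal, hypothetical CUDA sketch
(not taken from my CudaSharpDSP code): one 1k block is copied to the
device once, two simple kernels run back-to-back on device memory, and
only the final result comes back over the slow PCIe path.

// Hypothetical sketch: copy a 1k complex block to the device once,
// chain two kernels on it, copy back only the final result.
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cstdio>

// Example kernel: complex magnitude of each sample.
__global__ void magnitude(const cuFloatComplex *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = cuCabsf(in[i]);
}

// Example kernel: apply a single AGC-style gain in place.
__global__ void applyGain(float *buf, float gain, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= gain;
}

int main(void)
{
    const int N = 1024;                  // one 1k block of samples
    cuFloatComplex h_in[N];              // host input (dummy data here)
    float h_out[N];

    for (int i = 0; i < N; ++i)
        h_in[i] = make_cuFloatComplex(1.0f, 0.0f);

    cuFloatComplex *d_in;
    float *d_mag;
    cudaMalloc(&d_in, N * sizeof(cuFloatComplex));
    cudaMalloc(&d_mag, N * sizeof(float));

    // One host-to-device copy per block (the ~8 GB/s path) ...
    cudaMemcpy(d_in, h_in, N * sizeof(cuFloatComplex),
               cudaMemcpyHostToDevice);

    // ... then stay on the device (the ~141 GB/s path) for the chain.
    magnitude<<<(N + 255) / 256, 256>>>(d_in, d_mag, N);
    applyGain<<<(N + 255) / 256, 256>>>(d_mag, 0.5f, N);

    // A single device-to-host copy of the final result.
    cudaMemcpy(h_out, d_mag, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("first sample: %f\n", h_out[0]);

    cudaFree(d_in);
    cudaFree(d_mag);
    return 0;
}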

Now I have found on the Nvidia Cuda forum an FFT implementation that
does not use CuFFT and is reported to be three times faster than
CuFFT. If I can use that implementation, the problem that CuFFT does
not support multithreaded programming could also be solved (that is, I
would be able to introduce a high-priority thread for DSP processing
again). This will take me a little further.
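
For reference, the CuFFT path for the block sizes we are talking about
is just a plan plus an execute call. A minimal sketch of a 1k in-place
complex forward FFT (illustration only, error checking omitted):

// Minimal CuFFT sketch for a 1k complex forward FFT.
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1024;                       // FFT size discussed above
    cufftComplex *d_data;
    cudaMalloc(&d_data, N * sizeof(cufftComplex));
    cudaMemset(d_data, 0, N * sizeof(cufftComplex));   // dummy input

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);      // one 1k C2C transform
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // in-place FFT
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}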

>> So what we are doing is really highly experimental. And that is
>> also the reason why I can't wait for Ozy II :D ! I'm curious what I
>> can get out of Ozy II on my home Gigabit network. On that network I
>> see net copy rates of 20-30 MBytes/s. If I could feed my GPU memory
>> directly (without the CPU - maybe DMA?????), that would really be
>> great!
>
> Again, the problem is the chunks with which you have to process.  As Bill mentioned last night, this will
> probably really come into its own when you are trying to implement many receivers, or with something like CW
> Skimmer.  The problem here is latency.  Transferring up big blocks is necessary to make the memory copy
> worthwhile, but any delay through the audio chain is also a little bit touchy.  It's not such a big deal with the
> phone modes, but for CW, people get grumpy.  For that reason alone this may be worth doing.  Another thing
> when you look at OpenCL though is the ability to run the kernels on the main CPU.  This avoids a bit of the
> memory copying, and you can control which compute units you send work to.  So, for stuff that's going to be
> worth it, you can send it to the GPU or any other compute devices out there.  If it doesn't look like it's going to
> be a speed improvement, you can tell OpenCL to only send it to the CPU.
>
> Most of this is just speculation, since I haven't implemented anything yet.  But, I'm looking forward to having a
> chance to look at your work and possibly collaborate on putting some GPGPU functions together to accelerate
> things.  It might be neat to use for having many receivers within the 192kHz passband.

Yes, latency will be one of the most important issues here. I will
finish the whole Cuda implementation first (I am learning a lot about
DSP processing at the same time), and then see where the latency
bottlenecks are. Meanwhile I should really finish my Penelope
installation and see how my transmit path is doing! All the hardware
is here, but the Cuda work has kept me busy for weeks now.
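
As a starting point for that latency hunt, here is a rough sketch (my
own illustration, not CudaSharpDSP code) of timing the host-to-device
copy and a kernel launch with CUDA events:

// Time the per-block copy and a kernel with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= 2.0f;                    // stand-in for a DSP step
}

int main(void)
{
    const int N = 1024;
    float h_buf[N] = {0};
    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(afterCopy, 0);
    dummyKernel<<<(N + 255) / 256, 256>>>(d_buf, N);
    cudaEventRecord(afterKernel, 0);
    cudaEventSynchronize(afterKernel);

    float copyMs, kernelMs;
    cudaEventElapsedTime(&copyMs, start, afterCopy);
    cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);
    printf("copy: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterCopy);
    cudaEventDestroy(afterKernel);
    cudaFree(d_buf);
    return 0;
}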


vy 73,
Hermann

DL3HVH
