[hpsdr] Announcement of a CudaSharpDSP package for HPSDR: doing parallel DSP processing on your GPU

Fri Apr 30 22:53:23 PDT 2010

Dear All,

there's an interesting discussion going on at the moment about software
development, and I want to take the opportunity to report on my modest
progress with the CudaSharpDSP package. I will go through it in the
order of the DSP main loop in the receiver's "DoDSPProcess" method.

1. There's a new 'section' called ozy_input, where the raw Ozy data
samples (I and Q, without control bytes) are copied to the device (I
will call the graphics adapter the 'device', in contrast to the
'host', where all CPU-related computations are done). The conversion
of the raw 24-bit samples to 32-bit floats is done in parallel on the
device. Not a big problem.
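
To illustrate the conversion step, here is a minimal host-side sketch
in Python. It is not the device code. It assumes the samples arrive
as packed big-endian signed 24-bit values and are normalised by 2**23;
the function name is made up for this example.

```python
def int24_be_to_float(raw):
    """Convert packed big-endian signed 24-bit samples to floats in [-1.0, 1.0)."""
    if len(raw) % 3 != 0:
        raise ValueError("length must be a multiple of 3 bytes")
    # Assumption: normalise by 2**23, the magnitude of the most negative 24-bit value.
    scale = 1.0 / (1 << 23)
    out = []
    for i in range(0, len(raw), 3):
        # Each sample depends only on its own three bytes, which is why
        # this loop can become one GPU thread per sample.
        value = int.from_bytes(raw[i:i + 3], byteorder="big", signed=True)
        out.append(value * scale)
    return out
```

Since no sample depends on another one, the loop body is exactly what
a single device thread would do for its own index.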

2. The Noise Blanker and the Average Noise Blanker methods on the
device are working. At least, they do their job. I tested the Noise
Blanker with a very bad power supply I have here, which pollutes my HF
surroundings up to 10 MHz. Very nice to see how periodic noise is
cancelled by the NB! Surely these methods deserve an update in a next
round.

3. The Local Oscillator method is not yet implemented. I looked at the
C (or C#) code, and I think it is straightforward to implement on the
device.
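
As a rough sketch of why this should parallelize well: if the
oscillator phase for sample n is computed directly from n instead of
from the previous sample's phase, no sample depends on its neighbour.
The Python below is only an illustration of that idea, not a port of
the existing C code; the function and parameter names are assumptions.

```python
import cmath
import math

def mix_with_lo(samples, freq_hz, sample_rate_hz, start_index=0):
    """Multiply complex samples by exp(-j*2*pi*f*n/fs), shifting the spectrum down by f."""
    step = -2.0 * math.pi * freq_hz / sample_rate_hz
    # The phase is derived from the absolute sample index, so each
    # output value can be computed by an independent thread.
    return [s * cmath.exp(1j * step * (start_index + n))
            for n, s in enumerate(samples)]
```

start_index would carry the running sample count from one buffer to
the next so that the phase stays continuous across buffers.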

4. Since the first application to use it will be the little HPSDR
server, spectrum computations are not done at the moment. As with the
Local Oscillator, I don't see big problems parallelizing these methods.

5. Most important: the Filter. I now use Vasily Volkov's (University
of California) FFT implementation and have abandoned Nvidia's original
CUFFT library. Now, finally, 'DoDSPProcess' is running in its own
thread (which in KISS is the USBLoop). I needed some time to figure
out how to change the Cuda Contexts, especially when the filter
changes (changing bandwidth) or the inbuffer/outbuffer array length
changes, because you have to switch Contexts back and forth between
the main thread and the 'dsp thread'.

The FFT from Vasily is working great - thanks to him! Even for the
small FFT sizes we are using here, I have a little performance gain.
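
For readers who have not seen a filter built on an FFT, here is a
small sketch of block-wise fast convolution (overlap-save) in plain
Python. It is an assumption that the Filter works along these lines;
the toy recursive FFT below merely stands in for Volkov's GPU FFT, and
all names are hypothetical.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def overlap_save(signal, taps, fft_size):
    """Apply FIR `taps` to `signal` by multiplying block spectra (overlap-save)."""
    m = len(taps)
    hop = fft_size - (m - 1)            # new input samples consumed per block
    if hop <= 0:
        raise ValueError("fft_size must be at least len(taps)")
    h_freq = fft(list(taps) + [0j] * (fft_size - m))
    history = [0j] * (m - 1)            # last m-1 samples of the previous block
    out = []
    for start in range(0, len(signal), hop):
        chunk = list(signal[start:start + hop])
        valid = len(chunk)
        chunk += [0j] * (hop - valid)   # zero-pad a short final block
        block = history + chunk
        product = [a * b for a, b in zip(fft(block), h_freq)]
        y = [v / fft_size for v in fft(product, inverse=True)]
        # The first m-1 outputs are corrupted by circular wrap-around; drop them.
        out.extend(y[m - 1:m - 1 + valid])
        history = block[hop:]
    return out
```

The spectrum of the taps is computed once and reused for every block,
so a bandwidth change only means recomputing h_freq.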

6. As for Metering, it's more or less the same as with the spectrum
computations. Not done yet.

7. AGC: I have been working on this for three weeks or so now (well,
that is to say, only in the evenings... there's another job for a
living) and it gave me some headaches because of the recursion of the
gain factor. Computing the energy of the signal is straightforward,
and so is scaling the buffer, but the recursion is not so easy to
parallelize. I tried several approaches (like unrolling the
recursion), and I now have two solutions which work in principle, but
I still hear distortions in the output.
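
One textbook way to parallelize such a recursion, sketched here in
Python, is a prefix scan. It is not one of the two solutions mentioned
above, and it only applies under a strong assumption: that the gain
update can be written as g[n] = a[n]*g[n-1] + b[n], with a[n] and b[n]
depending on the input alone. If the choice between attack and decay
depends on the previous gain, this does not carry over directly. All
names are hypothetical.

```python
def sequential_gain(a, b, g0):
    """Reference: g[n] = a[n] * g[n-1] + b[n], evaluated one sample at a time."""
    out, g = [], g0
    for an, bn in zip(a, b):
        g = an * g + bn
        out.append(g)
    return out

def combine(first, second):
    """Compose two affine steps g -> a*g + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def scan_gain(a, b, g0):
    """Same recurrence via a Hillis-Steele inclusive scan over (a, b) pairs."""
    steps = list(zip(a, b))
    offset = 1
    while offset < len(steps):
        # Each element of a pass reads only values from the previous pass,
        # so on a GPU every index could be handled by its own thread.
        steps = [combine(steps[i - offset], steps[i]) if i >= offset else steps[i]
                 for i in range(len(steps))]
        offset *= 2
    # steps[i] now maps g0 directly to g[i].
    return [an * g0 + bn for an, bn in steps]
```

The scan needs about log2(N) passes over the buffer instead of N
dependent steps, because composing affine steps is associative.
Rounding differs slightly from the sequential loop, so the two results
agree only to within floating-point tolerance.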

8. Squelch: not done.

9. PLL process: not done, should be no problem.

10. Noise filters: as with AGC, we have some recursions here. I have a
version which works in principle, but with too many distortions. It
still needs some work.

11. 'DoOutput' method: straightforward, no problem.

All 'little helper' methods with big buffer-length loops are
implemented on the device.

Summarizing: the receiver functions are working in principle, and the
main progress during the last weeks was figuring out how Cuda Context
switching can be done. Now I have a solid 'dsp thread' running again,
as it should be. Interestingly, if I run the application with
CudaSharpDSP, turn on AGC and Filter, and use a 192 kHz sample rate
with an 8k input buffer and a 4k output buffer, the CPU load is still
30% (averaged over the cores). That must all come from shuffling data
around from Ozy to the GPU and back, or is there another reason?
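
A back-of-the-envelope calculation of the data volume, as a small
Python sketch. It assumes that "8k" means 8192 complex samples and
that each sample travels as two 32-bit floats; both are assumptions.

```python
SAMPLE_RATE_HZ = 192_000
BYTES_PER_COMPLEX_SAMPLE = 2 * 4   # assumption: I and Q as 32-bit floats
IN_BUFFER_SAMPLES = 8192           # assumption: "8k" = 8192 complex samples

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_COMPLEX_SAMPLE
buffer_period_ms = 1000.0 * IN_BUFFER_SAMPLES / SAMPLE_RATE_HZ
buffers_per_second = SAMPLE_RATE_HZ / IN_BUFFER_SAMPLES

print(bytes_per_second)             # 1536000
print(round(buffer_period_ms, 2))   # 42.67
print(buffers_per_second)           # 23.4375
```

Under these assumptions the stream is about 1.5 MB per second in each
direction, with a new input buffer roughly every 43 ms. The numbers
say nothing about per-call overhead or about how the thread waits for
the device, so they do not settle where the CPU time goes.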

Any hints, comments and suggestions are welcome!

vy 73s,

Hermann

DL3HVH
