[hpsdr] Announcement of a CudaSharpDSP package for HPSDR: doing parallel DSP processing on your GPU

Hermann hvh.net at gmail.com
Mon Jun 21 23:28:39 PDT 2010


On Tue, Jun 22, 2010 at 12:52 AM, Jeremy McDermond
<mcdermj at xenotropic.com> wrote:
>
> That's kinda interesting.  I haven't gotten beyond the point of merely initializing and recognizing hardware devices in my OpenCL work, but OpenCL supposedly supports multiple kernels on a single device; you just have to compile each kernel and upload it to the device.  OpenCL then has the idea of a "pipeline" into a computing context (like an OpenGL context).  You have to deal with thread locking for multiple threads using a single pipeline, but multiple pipelines are guaranteed to be separate from a threading standpoint.
>

Hi Jeremy,

You can run multiple kernels on a single device, even concurrently
(with compute capability > 1.1, AFAIK, e.g. on an Nvidia GTX 285 or
even a Tesla device), but not (yet) if the different kernels are
started from different CPU threads - correct me if I'm wrong. But that
would be exactly what we need: different CPU threads working on DSP,
audio, spectra etc., each launching parallel kernels concurrently.
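
To be concrete, here is a minimal sketch of the single-thread case
that works today: with CUDA streams, one host thread can queue
independent kernels so that they may overlap on the device. The two
kernels are hypothetical placeholders for the DSP and spectrum work,
not code from my package:

  #include <cuda_runtime.h>

  // Placeholder kernels standing in for the DSP and spectrum work.
  __global__ void dsp_kernel(float *iq, int n)      { /* ... */ }
  __global__ void spectrum_kernel(float *iq, int n) { /* ... */ }

  int main()
  {
      const int n = 4096;
      float *d_iq;
      cudaMalloc(&d_iq, n * sizeof(float));

      // Kernels launched into different streams may execute
      // concurrently on devices that support it; on older
      // hardware they simply serialize.
      cudaStream_t s0, s1;
      cudaStreamCreate(&s0);
      cudaStreamCreate(&s1);

      dsp_kernel<<<n / 256, 256, 0, s0>>>(d_iq, n);
      spectrum_kernel<<<n / 256, 256, 0, s1>>>(d_iq, n);

      cudaDeviceSynchronize();  // wait for both streams to finish

      cudaStreamDestroy(s0);
      cudaStreamDestroy(s1);
      cudaFree(d_iq);
      return 0;
  }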

>> But this is what I want to do next. In view of the upcoming OzyII
>> with fast Ethernet
>
> Gigabit Ethernet even!

Sure...I can't wait for OzyII! I have a little Gigabit network at home
with a Gigabit switch :D I run my backups at up to 25 MByte/s.

>> it may well be, that using Cuda for computing and
>> displaying the data makes more sense than trying to do all the DSP on
>> the GPU.
>
> I'm sure it will become more useful still with multiple receivers and wider passbands.

Yes. As soon as I have a stable version of my implementation, I will
flash my Mercury with the 3-receiver version.

What I do right now is load the Ozy data directly onto the device in
chunks of 16 kB, scan out the sync, control, and I/Q bytes of the Ozy
protocol, convert them to floats (all on the device), and push them
forward for DSP processing. Why 16 k? Because at least 16 kB of 24-bit
raw data are needed to fill, in one step, a 1024-byte buffer of 32-bit
floats in a ring buffer. At the end the audio data is stuffed into
another parallel version of a ring buffer (with the right decimation
according to the sample rate) and copied back to the host thread.
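
To illustrate the on-device conversion step, here is a rough sketch of
a kernel that turns packed big-endian 24-bit samples into normalized
32-bit floats. The frame parsing (sync and control bytes) is left out,
and the name cvt24to32f is just a placeholder, not the actual function
in my package:

  // 'in' is assumed to already point at the raw sample bytes
  // extracted from the Ozy frames.
  __global__ void cvt24to32f(const unsigned char *in, float *out,
                             int nSamples)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i >= nSamples) return;

      const unsigned char *p = in + 3 * i;
      // Assemble the 24-bit two's-complement value (MSB first),
      int v = (p[0] << 16) | (p[1] << 8) | p[2];
      // sign-extend it to 32 bits,
      if (v & 0x800000) v |= ~0xFFFFFF;
      // and scale it into [-1.0, 1.0).
      out[i] = (float)v / 8388608.0f;  // 2^23
  }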

But now I have a synchronization problem: I get more data in than out
;-) which is why the audio is currently corrupted. I need to do some
balancing and buffering. It's almost OK with a 16 kB buffer size for
the input data and a 4 kB buffer size for the output data. I also have
a version where I do 4096-byte FFTs for the filtering.
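
For the FFT version, the cuFFT call sequence looks roughly like this
(I'm assuming a 4096-point complex transform here, and the kernel that
multiplies by the filter's frequency response is omitted):

  #include <cufft.h>
  #include <cuda_runtime.h>

  int main()
  {
      const int N = 4096;
      cufftComplex *d_signal;
      cudaMalloc(&d_signal, N * sizeof(cufftComplex));
      // Stand-in for the real I/Q data already on the device.
      cudaMemset(d_signal, 0, N * sizeof(cufftComplex));

      cufftHandle plan;
      cufftPlan1d(&plan, N, CUFFT_C2C, 1);
      cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);
      // ... multiply by the filter's frequency response here ...
      cufftExecC2C(plan, d_signal, d_signal, CUFFT_INVERSE);
      // cuFFT's inverse transform is unscaled: divide by N afterwards.

      cufftDestroy(plan);
      cudaFree(d_signal);
      return 0;
  }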

Vy 73,
Hermann

DL3HVH
