CUDASynth

These CUDA filters are packaged into DGDecodeNV, which is part of DGDecNV.
admin
Site Admin
Posts: 4002
Joined: Thu Sep 09, 2010 3:08 pm

CUDASynth

Post by admin » Thu Aug 30, 2018 1:57 pm

Looking at the available frameworks for CUDA/NVDec with Avisynth/Vapoursynth, we do not see proper pipelines running on the GPU that eliminate unnecessary PCIe frame transfers and copies into Avisynth/Vapoursynth buffers. For example, we may see a set of internal filters ported to CUDA, but the spurious PCIe transfers and copies between intermediate filters are not eliminated. A solution is not so easy, because CUDA imposes constraints that make a true GPU pipeline of filters challenging to implement. Importantly, we want to avoid having to modify Avisynth/Vapoursynth itself.

The basic CUDASynth idea is to have two GPU frame buffers accessible to every filter: one holds a filter's input and the other takes its output. For the last filter in the chain, the output destination is the CPU. One floating CUDA context is used globally, so that all the filters in the chain (script) can access the frame buffers. The buffers are used in a ping-pong manner: one filter's output buffer becomes the next filter's input buffer.

The source filter creates the global CUDA context and allocates the input/output frame buffers gpu0 and gpu1. The context id and frame buffer pointers must be communicated to the other filters. Currently a file is used but this can be changed to a named pipe, shared memory, etc.
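
As a rough sketch of how this might look (hypothetical names and a plain binary file write, not the actual DGDecodeNV code), the source filter could allocate the buffers and publish the handles like this:

Code:

#include <cuda.h>
#include <cstdio>

// Handles shared with the downstream filters. Since all filters run in the
// same process, the raw handles remain valid when read back from the file.
struct CudaSynthInfo
{
    CUcontext ctx;     // floating CUDA context created by the source filter
    CUdeviceptr gpu0;  // ping frame buffer
    CUdeviceptr gpu1;  // pong frame buffer
};

void publish_buffers(CUcontext ctx, size_t frame_bytes, const char *path)
{
    CudaSynthInfo info;
    info.ctx = ctx;
    cuCtxPushCurrent(ctx);
    cuMemAlloc(&info.gpu0, frame_bytes);   // input/output frame buffers on the GPU
    cuMemAlloc(&info.gpu1, frame_bytes);
    cuCtxPopCurrent(NULL);

    FILE *f = fopen(path, "wb");           // downstream filters read this back
    fwrite(&info, sizeof(info), 1, f);
    fclose(f);
}

A named pipe or shared memory would work the same way; only the transport of the info block changes.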

A CUDASynth script allowing for a pipeline of execution on the GPU might be:

DGSource(..., dst="gpu0")
DGSharpen(..., src="gpu0", dst="gpu1")
DGDenoise(..., src="gpu1", dst="cpu")

That is the design concept. Four full-frame copies CPU<-->GPU and two copies into Avisynth/Vapoursynth frames are eliminated: without the pipeline, DGSource would download its frame to the CPU, DGSharpen would upload its input and download its output, and DGDenoise would upload its input, with each intermediate result also copied into an Avisynth/Vapoursynth frame; with the pipeline, only DGDenoise's final download and copy remain.

Currently, the global context and communication to the other script filters have been implemented and tested. Next will be the declaration of the GPU input/output buffers and testing of the GPU pipeline.

gonca
Distinguished Member
Posts: 607
Joined: Sun Apr 08, 2012 6:12 pm

Re: CUDASynth

Post by gonca » Thu Aug 30, 2018 2:36 pm

Sounds like you're making progress.

fidodkk
Posts: 1
Joined: Fri Jul 20, 2018 10:30 am

Re: CUDASynth

Post by fidodkk » Sat Sep 01, 2018 11:40 am

Interesting idea

admin
Site Admin
Posts: 4002
Joined: Thu Sep 09, 2010 3:08 pm

Re: CUDASynth

Post by admin » Sat Sep 01, 2018 11:51 am

Thanks guys!

I've implemented the declaration of the ping-pong buffers by the DGSource filter and tested that the intermediate filters can access them. The next main step is to modify DGSource to convert the NV12 delivered by CUVID to YUV420P8/16 on the GPU, so that the intermediate filters get the right color space. Previously, this conversion was done on the CPU after transferring the decoded source frame back to the CPU, just before returning the frame to Avisynth. This change should improve the performance of DGSource() in both the pipelined and non-pipelined use cases, because the expensive NV12->YUV420 conversion moves to the GPU. After that, I'll add the src/dst parameters and implement the pipeline. I expect this framework to bring out the true power of CUDA in an Avisynth/Vapoursynth environment.

gonca
Distinguished Member
Posts: 607
Joined: Sun Apr 08, 2012 6:12 pm

Re: CUDASynth

Post by gonca » Sat Sep 01, 2018 1:54 pm

DGSource even faster, and the filters as well.
Awesome

admin
Site Admin
Posts: 4002
Joined: Thu Sep 09, 2010 3:08 pm

Re: CUDASynth

Post by admin » Tue Sep 04, 2018 9:43 am

The NV12->YV12 conversion on the GPU was a bit harder than I expected. Nevertheless, I have it working now. It did not produce a significant DGSource performance improvement over my previous code (just a few fps). That makes sense: the conversion is a small part of the overall frame decode and delivery, and I was not able (as I had hoped) to avoid a pitched frame copy into the Avisynth frame, because the decoded frame pitch can differ from the Avisynth frame pitch. The main point is that I can now source a YV12 frame down the pipeline on the GPU.
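
For reference, a pitched copy like the one described can be done with the driver API's cuMemcpy2D, which handles differing source and destination pitches. This is just an illustrative sketch with assumed parameter names, not the DGSource code:

Code:

#include <cuda.h>

// Copy one plane from the decoded GPU frame into the (host) Avisynth frame,
// where the two buffers can have different pitches.
void copy_plane_to_avs(CUdeviceptr src, size_t src_pitch,
                       unsigned char *dst, size_t dst_pitch,
                       size_t row_bytes, size_t rows)
{
    CUDA_MEMCPY2D m = {};
    m.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    m.srcDevice     = src;
    m.srcPitch      = src_pitch;   // pitch of the decoded frame on the GPU
    m.dstMemoryType = CU_MEMORYTYPE_HOST;
    m.dstHost       = dst;
    m.dstPitch      = dst_pitch;   // pitch of the Avisynth frame buffer
    m.WidthInBytes  = row_bytes;   // bytes of real pixel data per row
    m.Height        = rows;
    cuMemcpy2D(&m);                // copies row by row, respecting both pitches
}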

The next step is to implement the fsrc/fdst parameters for DGSource and another filter such as DGSharpen, and then try to get the pipeline running. A challenge may be to ensure that the downstream filters use the correct pitch.

admin
Site Admin
Posts: 4002
Joined: Thu Sep 09, 2010 3:08 pm

Re: CUDASynth

Post by admin » Tue Sep 04, 2018 6:31 pm

gonca wrote:
Tue Sep 04, 2018 9:43 am
Huh?
Gosh, I thought it was perfectly clear. :(

Here's the CUDA kernel code. It should make everything clear for you. Note that this just deinterleaves the chroma. The luma does not have to be touched and is just copied with cuMemcpyDtoD(). Also note that w is actually passed in as the pitch and h is half the frame height. That simplifies the code.

Code:

extern "C"

__global__ void NV12toYV12(unsigned char *src, unsigned char *dst, int w, int h)
{
	const int ix = blockDim.x * blockIdx.x + threadIdx.x;
	const int iy = blockDim.y * blockIdx.y + threadIdx.y;
	unsigned char *u, *v, *s;

	u = dst + (h * 2) * w;
	v = u + h * w / 2;
	s = src + (h * 2) * w;
	if (ix < w && iy < h)
	{
		if (!(ix & 1))
			u[w * iy / 2 + ix / 2] = s[w * iy + ix];
		else
			v[w * iy / 2 + ix / 2] = s[w * iy + ix];
	}
}
It's quick and dirty, but it works. ;) 16-bit comes later.
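
In case it helps, here is a minimal sketch of how a kernel like that might be launched with the CUDA runtime API, assuming w is the pitch and h is half the frame height as noted above (the actual launch code in DGDecodeNV may look different):

Code:

// Hypothetical launch wrapper for the chroma deinterleave kernel above.
// The grid covers the interleaved UV plane: 'pitch' bytes wide, height/2 rows tall.
void launch_nv12_to_yv12(unsigned char *src, unsigned char *dst, int pitch, int height)
{
    dim3 block(32, 8);
    dim3 grid((pitch + block.x - 1) / block.x,
              (height / 2 + block.y - 1) / block.y);
    NV12toYV12<<<grid, block>>>(src, dst, pitch, height / 2);
}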

hydra3333
Distinguished Aussie Member
Posts: 132
Joined: Wed Oct 06, 2010 3:34 am

Re: CUDASynth

Post by hydra3333 » Thu Sep 06, 2018 8:33 am

Interesting, and thank you for your ideas and work.

admin
Site Admin
Posts: 4002
Joined: Thu Sep 09, 2010 3:08 pm

Re: CUDASynth

Post by admin » Thu Sep 06, 2018 10:26 am

You're welcome.

I'm working on sharing the ping-pong buffers between the filter threads. It should be easy, but DGDecodeNV has a CUDA autolock that I need to make available to the other filters so they can all avoid collisions when pushing/popping the context.
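
For what it's worth, a minimal sketch of what such an autolock could look like (hypothetical, not the actual DGDecodeNV implementation): a shared mutex plus RAII push/pop of the context, so the filter threads never interleave their context operations.

Code:

#include <cuda.h>
#include <mutex>

// Scoped lock: serializes access to the shared context and makes it current
// for the calling thread while the object is alive.
class CudaAutoLock
{
public:
    CudaAutoLock(CUcontext ctx, std::mutex &m) : lock_(m)
    {
        cuCtxPushCurrent(ctx);   // make the shared context current for this thread
    }
    ~CudaAutoLock()
    {
        cuCtxPopCurrent(NULL);   // pop the context, then the mutex is released
    }
private:
    std::lock_guard<std::mutex> lock_;
};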

admin
Site Admin
Posts: 4002
Joined: Thu Sep 09, 2010 3:08 pm

Re: CUDASynth

Post by admin » Sat Sep 08, 2018 10:10 am

I found the bug that prevented the downstream filters from accessing the GPU ping-pong buffers declared by DGSource(). All the filters can now access the ping-pong buffers using the information passed in the file, which means they do not have to be built together. The file now also passes the CUDA lock that prevents collisions on context usage. Passing the context itself is no longer necessary, because it is implicit in the lock. So I am now passing the lock and the gpu0/gpu1 references in binary form. It's all working!

I ran my first pipeline DGSource() -> DGSharpen() -> Avisynth+. With just DGSource(fdst="cpu") alone I get 206 fps for 3840x2160 HEVC, so that is the maximum fps that can be expected. Adding DGSharpen() with no pipeline, i.e.:

DGSource(fdst="cpu")
DGSharpen(fsrc="cpu")

I get 127 fps. Adding DGSharpen() pipelined, i.e.:

DGSource(fdst="gpu0")
DGSharpen(fsrc="gpu0")

I get 190 fps. So one way to look at this is that the "price" of DGSharpen() without a pipeline is 206 - 127 = 79 fps, while the price of DGSharpen() with a pipeline is 206 - 190 = 16 fps. The price is reduced by a factor of about 5. The total savings would increase with the number of filters in the pipeline.
