CUDASynth

Post by **admin** » Thu Aug 30, 2018 1:57 pm

Looking at available frameworks for CUDA/NVDec with Avisynth/Vapoursynth, we do not see proper pipelines running on the GPU that eliminate unnecessary PCIe frame transfers and copies into Avisynth/Vapoursynth buffers. For example, we may see the set of internal filters ported to CUDA. The spurious PCIe transfers and copies for intermediate filters are not eliminated. A solution is not so easy because CUDA enforces some constraints that are challenging for implementing such a true GPU pipeline of filters. Importantly, we should prefer to avoid having to modify Avisynth/Vapoursynth itself.

The basic CUDASynth idea is to have two GPU frame buffers accessible to every filter. One holds the input and one takes the output. The output would be the CPU for the last filter in the chain. One floating CUDA context is used globally, so that all the filters in the chain (script) can access the frame buffers. The frame buffers are used in a ping-pong manner.

The source filter creates the global CUDA context and allocates the input/output frame buffers gpu0 and gpu1. The context id and frame buffer pointers must be communicated to the other filters. Currently a file is used but this can be changed to a named pipe, shared memory, etc.

A CUDASynth script allowing for a pipeline of execution on the GPU might be:

DGSource(..., dst="gpu0")
DGSharpen(...,src="gpu0", dst="gpu1")
DGDenoise(..., src="gpu1", dst="cpu")

That is the design concept. Four full frame copies CPU<-->GPU and two copies to Avisynth/Vapoursynth are eliminated!

Currently, the global context and communication to the other script filters have been implemented and tested. Next will be the declaration of the GPU input/output buffers and testing of the GPU pipeline.

Current version of CUDASynth:

http://rationalqm.us/misc/CUDASynth_0.3.rar

Guest · Post by **Guest** » Thu Aug 30, 2018 2:36 pm

Sounds like you're making progress.

fidodkk · Post by **fidodkk** » Sat Sep 01, 2018 11:40 am

Interesting idea

Post by **admin** » Sat Sep 01, 2018 11:51 am

Thanks guys!

I've implemented the ping-pong buffers declaration by the DGSource filter and tested that the intermediate filters can access the buffers. Next main step is to modify DGSource to convert the NV12 delivered by CUVID to YUV420P8/16 on the GPU so that the intermediate filters get the right color space. Previously, this conversion was done on the CPU after transferring the decoded source frame back to the CPU and just before returning the frame to Avisynth. This change should improve performance of DGSource() in both the pipelined and nonpipelined use cases as the expensive NV12->YUV420 conversion is moved to the GPU. After that, I'll add the src/dst parameters and implement the pipeline. I expect this framework to bring out the true power of CUDA in an Avisynth/Vapoursynth environment.

Guest · Post by **Guest** » Sat Sep 01, 2018 1:54 pm

DDSource even faster, and the filters as well.
Awesome

Post by **admin** » Tue Sep 04, 2018 9:43 am

The NV12->YV12 conversion on the GPU was a bit harder than I expected. Nevertheless, I have it working now. It did not produce a significant DGSource performance improvement over my previous code (just a few fps). I can understand that because it is a small part of the overall frame decode and delivery, and because I was not able (as I had hoped) to avoid a pitched frame copy into the Avisynth frame because the decoded frame pitch can differ from the Avisynth frame pitch. The main point is that I can now source a YV12 frame down the pipeline on the GPU.

The next step is to implement the fsrc/fdst parameters for DGSource and another filter such as DGSharpen, and then try to get the pipeline running. A challenge may be to ensure that the downstream filters use the correct pitch.

Post by **admin** » Tue Sep 04, 2018 6:31 pm

gonca wrote: ↑
Tue Sep 04, 2018 9:43 am
Huh?

Gosh, I thought it was perfectly clear.

Here's the CUDA kernel code. It should make everything clear for you. Note that this just deinterleaves the chroma. The luma does not have to be touched and is just copied with cuMemcpyDtoD(). Also note that w is actually passed in as the pitch and h is half the frame height. That simplifies the code.

Code: Select all

extern "C"

__global__ void NV12toYV12(unsigned char *src, unsigned char *dst, int w, int h)
{
	const int ix = blockDim.x * blockIdx.x + threadIdx.x;
	const int iy = blockDim.y * blockIdx.y + threadIdx.y;
	unsigned char *u, *v, *s;

	u = dst + (h * 2) * w;
	v = u + h * w / 2;
	s = src + (h * 2) * w;
	if (ix < w && iy < h)
	{
		if (!(ix & 1))
			u[w * iy / 2 + ix / 2] = s[w * iy + ix];
		else
			v[w * iy / 2 + ix / 2] = s[w * iy + ix];
	}
}

It's quick and dirty, but it works.

16-bit comes later.

hydra3333 · Post by **hydra3333** » Thu Sep 06, 2018 8:33 am

Interesting, and thank you for your ideas and work.

Post by **admin** » Thu Sep 06, 2018 10:26 am

You're welcome.

Working on the sharing of the ping-pong buffers between the filter threads. It should be easy but I have a DGDecodeNV CUDA autolock that I have to make available to the other filters so they can all avoid collisions when pushing/popping the context.

Post by **admin** » Sat Sep 08, 2018 10:10 am

I found the bug that prevented the downstream filters from accessing the GPU ping-pong buffers declared by DGSource(). All the filters can now access the ping-pong buffers using the information passed in the file, which means they do not have to be built together. The file now also passes the CUDA lock that prevents collisions on the context usage. It's not actually needed to pass the context now, because that is implicit in the lock. So I am now passing the lock and the gpu0/1 references in binary form. It's all working!

I ran my first pipeline DGSource() -> DGSharpen() -> Avisynth+. With just DGSource(fdst="cpu") alone I get 206 fps for 3840x2160 HEVC, so that is the maximum fps that can be expected. Adding DGSharpen() with no pipeline, i.e.:

DGSource(fdst="cpu")
DGSharpen(fsrc="cpu")

I get 127 fps. Adding DGSharpen() pipelined, i.e.:

DGSource(fdst="gpu0")
DGSharpen(fsrc="gpu0")

I get 190 fps. So one way to look at this is that the "price" of DGSharpen() without a pipeline is 206-127 = 79fps. The price of DGSharpen() with a pipeline is 206-190 = 16 fps. The price is reduced by a factor of 5. The total savings would increase with the number of filters in the pipeline.

Guest · Post by **Guest** » Sat Sep 08, 2018 10:46 am

The price is reduced by a factor of 5

Sales are good, faster is better

Post by **admin** » Sat Sep 08, 2018 12:40 pm

Wow, gonca, listen to this...

I did some optimizations to reduce the sizes of the critical sections for the lock. Then I ran this script:

DGSource(fdst="cpu")

I got 207.1 fps. Note that adding prefetch here does not help and in fact greatly reduces the fps.

Now I ran this script:

DGSource(fdst="gpu0")
DGSharpen(fsrc="gpu0")
prefetch(2)

I got 206.3 fps!

This means that thanks to CUDASynth, a limited sharpen filter is being executed on a 3840x2160 frame essentially for free.

This is what I mean by bringing out the true power of NVDec/CUDA for Avisynth/Vapoursynth.

Guest · Post by **Guest** » Sat Sep 08, 2018 2:02 pm

This brings up the issue of hardware encoding.
Depending on Nvidia's capabilities on the new cards, a two card system could be amazingly fast
One card pre-processing and frame serving, and the second encoding

Post by **admin** » Sat Sep 08, 2018 2:05 pm

gonca wrote: ↑
Sat Sep 08, 2018 2:02 pm
This brings up the issue of hardware encoding.
Depending on Nvidia's capabilities on the new cards, a two card system could be amazingly fast
One card pre-processing and frame serving, and the second encoding

For sure. Someone send me another 1080 Ti, or maybe an RTX 2080 Ti. I would accept either one.

Even with one GPU we can make an encoder filter and put it at the end of the pipeline taking input from gpu0/1, saving PCIe and copies between the Avisynth output and the encoder. On the list now.

hydra3333 · Post by **hydra3333** » Sat Sep 08, 2018 10:15 pm

admin wrote: ↑
Sat Sep 08, 2018 12:40 pm
Now I ran this script:
DGSource(fdst="gpu0")
DGSharpen(fsrc="gpu0")
prefetch(2)

I got 206.3 fps! This means that thanks to CUDASynth, a limited sharpen filter is being executed on a 3840x2160 frame essentially for free.

This is what I mean by bringing out the true power of NVDec/CUDA for Avisynth/Vapoursynth.

That is truly impressive.

admin wrote: ↑
Sat Sep 08, 2018 12:40 pm
Even with one GPU we can make an encoder filter and put it at the end of the pipeline taking input from gpu0/1, saving PCIe and copies between the Avisynth output and the encoder. On the list now.

And I'd thought I could not get any more excited ...

Post by **admin** » Mon Sep 10, 2018 8:44 am

I finished supporting both fsrc and fdst for DGSharpen so I am able to show the results on a longer pipeline. Here is the script:

dgsource("LG Chess 4K Demo.dgi",fulldepth=false,fdst="gpu0")
DGSharpen(fsrc="gpu0",fdst="gpu1")
DGSharpen(fsrc="gpu1",fdst="gpu0")
DGSharpen(fsrc="gpu0")
prefetch(5)

For the 3840x2160 59.94 fps stream I get 161.6 fps. When I do not use the pipeline (all fsrc and fdst are "cpu"), then I get 76.8 fps. So CUDASynth is twice as fast and makes the difference between real-time and non-real-time operation. There is still a lot of headroom to have more filters while remaining real-time. The average "price" of the sharpens here is about 15 fps each.

Next, after I add P16 support, I'm going to CUDASynth-enable DGHDRtoSDR and see what kind of frame rate we can get for a full HDR to SDR script.

Guest · Post by **Guest** » Mon Sep 10, 2018 3:32 pm

what kind of frame rate we can get for a full HDR to SDR script.

This should be good!

Post by **admin** » Mon Sep 17, 2018 12:26 pm

Oy, it took a whole week to get the pipeline running in P16 [DGSource(fulldepth=true]. My NV12toYV12 kernel was broken for P16 and I had some bad pitch handling in DGSharpen for the case of fsrc=gpu and fdst=gpu. The implementation was tricky as debugging is hard when the intermediate ping-pong buffers are not visible from the host to check their contents at various points along the pipeline. Writing code to copy them down to the host would be possible but tricky so I resorted to some hunches and other tricks such as memsetting the device memory at various points to see if the values showed up at the final output as expected. It was something of a week-long nightmare because when I go to bed with outstanding bugs in my code my brain works overtime all night. And the bugs were like a layered onion; rack your brains to fix one layer and another gets exposed. After multiple layers in a row stretching over a week I got pretty zonked out from lack of proper sleep.

But I am a patient and persistent soul and never give up (unless the goal is theoretically impossible), and so everything is working perfectly now

for a 4-filter GPU pipeline (DGSource + 3 x DGSharpen) in P8 and P16. I'm going to CUDASynth-enable DGHDRtoSDR now and see how it performs in a DGSource->DGHDRtoSDR pipeline.

At some point I will publish a specification for how to implement a CUDASynth compatible filter together with a source code example. Without that CUDASynth acceleration would be limited to my own filters.

Guest · Post by **Guest** » Mon Sep 17, 2018 3:20 pm

I guess you never give up then since theoretically impossible is theoretically impossible.
Theoretically improbable, maybe
Theoretically not possible at this time, ok
But with technological improvements and increases in knowledge what is impossible today might be possible tomorrow
Therefore
theoretically impossible is theoretically impossible
and QED
you never give up

Post by **admin** » Mon Sep 17, 2018 3:57 pm

OK.

Guest · Post by **Guest** » Mon Sep 17, 2018 4:40 pm

I am just happy you never give up
We keep getting new and better avs/vpy tools to use
Thank you
:bravo:

Post by **admin** » Tue Sep 18, 2018 9:34 am

I have CUDASynth-enabled DGHDRtoSDR and have some preliminary performance numbers for your enjoyment. The source is the same as previously used: 3840x2160 59.94 fps HDR10. The script with GPU pipelining is:

dgsource("LG Chess 4K Demo.dgi",fulldepth=true,fdst="gpu0")
dghdrtosdr(impl="255",light=250,fsrc="gpu0") # outputs YV12
prefetch(4)

Not pipelined on GPU: 80 fps, CPU 13%
Pipelined on GPU: 204 fps, CPU 8%

Quite a substantial performance boost!

Guest · Post by **Guest** » Tue Sep 18, 2018 3:34 pm

Nice boost, 250%

Guest · Post by **Guest** » Tue Sep 18, 2018 5:08 pm

Any ETA on public testing?
No rush, just getting antsy

Post by **admin** » Tue Sep 18, 2018 8:25 pm

Still have some things to finish up: Vapoursynth support, fdst parameter for DGHDRtoSDR, CUDASynth-enable DGDenoise, documentation, source code example. And I'm timesharing with DGIndex MKV support. Hang in there.

BTW, the CPU reduction is also important as it leaves more CPU for encoding.

It's hard to find 2080 Ti's:

https://www.nowinstock.net/computers/vi ... rtx2080ti/

Puts the lie to some of the whining at other forums by people saying it's too expensive, nobody wants it, nVidia are dirty rotten criminal capitalists, how dare they make a GPU I can't afford, blah blah blah.

gonca, what's the fastest most powerful Threadripper likely to be available within a few months?

I saved a lot of dough doing my own bathroom remodeling so I have the ready green to dish out for the best hardware.

CUDASynth

CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth