DGDecomb

These CUDA filters are packaged into DGDecodeNV, which is part of DGDecNV.
admin
Posts: 4551
Joined: Thu Sep 09, 2010 3:08 pm

DGDecomb

Post by admin »

Just starting a thread for tracking a CUDA accelerated Decomb, which I name DGDecomb. The first thing to do, however, is to make a 64-bit version of Decomb. I can't believe I never made one. :facepalm:

Adding the existing Telecide() and Decimate() drops the frame rate from 395 fps to 85 fps for 1080p, so there is a lot of room for speed-up here.

Re: About DGDecomb

Post by Guest »

Would you consider a sub-forum just for the DG (GPU) cudasynth filters?
DGCudasynth might be a good name?

Re: About DGDecomb

Post by admin »

I was thinking about that this morning. If I have a forum called (say) CUDA/CUVID Tools, then I would have to put DGDecNV in there also. So now I am thinking of just renaming the DGDecNV forum to CUDA/CUVID Tools. Or just do nothing. What would make sense for you?

Meanwhile, I'm trying to recall how Decomb works. :scratch: I have my old journal entries to remind me, thank heavens.

Re: About DGDecomb

Post by Guest »

Leave DGDecode where it is. After all, it is used with many third-party tools outside of your control; DGDecode and DGIndex are universal tools.
The Cudasynth filters rely on DGDecode and are under your control.
I just thought that a separate sub-forum would keep things neater and easier to track and manage.
Aleron Ives
Posts: 126
Joined: Fri May 31, 2013 8:36 pm

Re: About DGDecomb

Post by Aleron Ives »

I would suggest leaving the DGDecNV forum the way it is, but create a sub-forum named "DGDecNV Plugins" or something like that where you can put threads to discuss all of these secondary filters that rely on DGDecNV. You can then have all of these "master" threads gathered in one place while regular DGDecNV discussion continues as usual in the parent forum. At this point I don't think it's necessary to have separate sub-forums for DGDenoise/DGSharpen/DGDecomb, etc., but I think you have enough plugins/extensions now to warrant a general sub-forum for them as a group.

Re: About DGDecomb

Post by admin »

Sounds good, gonca and Aleron. Thank you.

Re: DGDecomb

Post by Aleron Ives »

Oh, it looks like we got not a sub-forum, but a top-level forum. CUDA Filters are too proud to be subordinates of the DGDecNV forum! 8-)

:lol:

Re: DGDecomb

Post by admin »

Oh yeah, I just noticed you suggested a subforum. :facepalm: Oh well, this works fine too.

Re: DGDecomb

Post by admin »

Here's my current thinking on a design for fast matching. I will have two GPU textures set up, one for the current frame and one for the previous frame. I will have two GPU global memory arrays, one for the current-current per-pixel calculation results, and one for the current-previous calculations. The kernel will make the two difference calculations for a single pixel and store the results in the global memory arrays (of course, all the pixels run in parallel to the extent supported by the GPU). Then a second kernel will perform parallel reductions to get the sum of differences for current-current and current-previous.
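
As a rough illustration of the two-pass idea above, here is a CPU-side C++ sketch. It is not the DGDecomb code: the `Plane` struct, the function names, and the plain absolute-difference metric are all assumptions made for the example (the real Telecide metric may differ), and ordinary loops stand in for the per-pixel GPU threads and the reduction kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical 8-bit luma plane; stands in for a GPU texture.
struct Plane {
    int width;
    int height;
    std::vector<uint8_t> pix;  // row-major, width * height bytes
    uint8_t at(int x, int y) const {
        return pix[static_cast<std::size_t>(y) * width + x];
    }
};

// Pass 1: per-pixel differences. Each inner iteration is the work one GPU
// thread would do. For every even line y of the current frame, compare it
// with line y+1 of the current frame (cc) and of the previous frame (cp).
// The absolute difference is an assumed stand-in metric.
void pixelDiffs(const Plane& cur, const Plane& prev,
                std::vector<uint32_t>& cc, std::vector<uint32_t>& cp) {
    assert(cur.width == prev.width && cur.height == prev.height);
    cc.assign(cur.pix.size(), 0);
    cp.assign(cur.pix.size(), 0);
    for (int y = 0; y + 1 < cur.height; y += 2) {
        for (int x = 0; x < cur.width; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * cur.width + x;
            const int top = cur.at(x, y);
            cc[i] = static_cast<uint32_t>(std::abs(top - cur.at(x, y + 1)));
            cp[i] = static_cast<uint32_t>(std::abs(top - prev.at(x, y + 1)));
        }
    }
}

// Pass 2: reduce a per-pixel result array to a single sum. On the GPU this
// would be a parallel reduction; here it is a simple serial loop.
uint64_t sumDiffs(const std::vector<uint32_t>& d) {
    uint64_t s = 0;
    for (uint32_t v : d) s += v;
    return s;
}
```

Calling `pixelDiffs` and then `sumDiffs` on each result array gives the two totals (current-current and current-previous) that the match decision would compare.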

When we step to the next frame, we don't want to have to update two textures, because on the GPU the current texture can become the previous texture. But swapping the references requires texture objects, which are supported only on Kepler+. So to keep the older cards, I will have two kernels which differ only in that they treat texture A as current and B as previous, and vice versa. Then the caller just calls the appropriate kernel, knowing which texture was just updated (the caller toggles between them, of course). On a random access, both textures are updated and then linear stepping is performed from there as described.

We'll actually need 4 differencing kernels in total, two as described above for TFF processing, and two for BFF. That will be more efficient than passing a kernel parameter and having conditionals in the kernel code.
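
To make the four-variant idea concrete, here is a CPU-side C++ sketch under stated assumptions. It is not the actual kernel code: `Frame`, `fieldDiff`, and `Matcher` are hypothetical names, two host buffers stand in for the two textures, and the metric is a simple absolute difference. Field order and buffer roles are template parameters, so the compiler produces four branch-free variants and the caller picks one per frame.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

enum class Order { TFF, BFF };
using Frame = std::vector<uint8_t>;  // hypothetical 8-bit luma plane, row-major

// Four instantiations (TFF/BFF x A-current/B-current), mirroring four kernels.
// The inner loop contains no field-order or role conditionals.
template <Order O, bool AIsCurrent>
uint64_t fieldDiff(const Frame& a, const Frame& b, int width, int height) {
    const Frame& cur = AIsCurrent ? a : b;
    const Frame& prev = AIsCurrent ? b : a;
    constexpr int first = (O == Order::TFF) ? 0 : 1;  // line parity taken from cur
    constexpr int step = (O == Order::TFF) ? 1 : -1;  // neighbouring line in prev
    uint64_t sum = 0;
    for (int y = first; y < height; y += 2) {
        const int other = y + step;
        if (other < 0 || other >= height) continue;  // no neighbour at the edge
        for (int x = 0; x < width; ++x) {
            const int d = cur[std::size_t(y) * width + x] -
                          prev[std::size_t(other) * width + x];
            sum += static_cast<uint64_t>(std::abs(d));
        }
    }
    return sum;
}

class Matcher {
public:
    Matcher(int width, int height, Order order)
        : width_(width), height_(height), order_(order) {}

    // Random access: both buffers are refreshed.
    uint64_t seek(const Frame& prev, const Frame& cur) {
        check(prev);
        check(cur);
        a_ = prev;
        b_ = cur;
        aIsCurrent_ = false;
        return dispatch();
    }

    // Linear step: only the stale buffer is overwritten, then the roles flip.
    uint64_t step(const Frame& next) {
        check(next);
        (aIsCurrent_ ? b_ : a_) = next;
        aIsCurrent_ = !aIsCurrent_;
        return dispatch();
    }

private:
    void check(const Frame& f) const {
        assert(f.size() == std::size_t(width_) * height_);
    }

    // The only runtime branching happens here, once per frame, in the caller.
    uint64_t dispatch() const {
        if (order_ == Order::TFF)
            return aIsCurrent_ ? fieldDiff<Order::TFF, true>(a_, b_, width_, height_)
                               : fieldDiff<Order::TFF, false>(a_, b_, width_, height_);
        return aIsCurrent_ ? fieldDiff<Order::BFF, true>(a_, b_, width_, height_)
                           : fieldDiff<Order::BFF, false>(a_, b_, width_, height_);
    }

    int width_;
    int height_;
    Order order_;
    Frame a_, b_;
    bool aIsCurrent_ = false;
};
```

A caller would use `seek()` after a random access and `step()` for each following frame, so only one buffer is written per frame during linear playback.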

This should be super fast. We'll see.

Postprocessing and decimation come later.

Re: DGDecomb

Post by Aleron Ives »

:wow:

:lol:

Seriously though, thank you for preserving support for older cards.

Re: DGDecomb

Post by Guest »

I won't pretend to understand most of the technical things you said, or how you implement them, but I believe I catch the gist.
By increasing the number of kernels you reduce the number of conditionals and therefore clock cycles. Running the first two in parallel and then the appropriate one for TFF or BFF makes it even more efficient, and easier to modify in future if needed.
A very logical and elegant approach.
That is why you are the man when it comes to CUDA coding.

Re: DGDecomb

Post by Guest »

Aleron Ives wrote:
:wow: :lol:
Seriously though, thank you for preserving support for older cards.

+1

Good thing D.G. knows what it means

Re: DGDecomb

Post by admin »

Thanks, guys. I'm glad you didn't find any holes in my design.

Re: DGDecomb

Post by admin »

Started coding. :D
Sharc
Posts: 233
Joined: Thu Sep 23, 2010 1:53 pm

Re: DGDecomb

Post by Sharc »

Looking forward ..... :D

Re: DGDecomb

Post by admin »

Good morning, all.

I completed a very rough first cut of TelecideNV to test the concept. It works and fields get matched as expected. However, I have not yet implemented any optimizations, that is:

* no frame subsampling (vanilla Telecide subsamples by 4 in both X and Y)
* no toggling to avoid an extra texture update
* reduction (summing of the differences) performed on the CPU (very slow, especially when not subsampled) instead of a parallel reduction kernel
* no kernel optimizations
* using floats rather than ints (because I started with the DGSharpen code)
* no pinned memory
* some other stuff

Performance is comparable to vanilla Telecide. Now I'll implement the optimizations and we'll see how much performance can be squeezed out.

On another matter, I have successfully brought up my new system (with a 1050Ti for now). I managed to sneak in an order for the 1080Ti at the nVidia store and it seems to have taken. This site is great for getting alerted when buying windows open:

https://www.nowinstock.net/computers/vi ... gtx1080ti/

Re: DGDecomb

Post by Guest »

It will be fun to see the fps of the 1080ti

Re: DGDecomb

Post by admin »

1080Ti arrives tomorrow. :D

Here is the state of play for TelecideNV on my 1050Ti. This includes all optimizations except parallel reduction and pinned host memory. Looking good! Performance is limited by host<-->device memory bandwidth.

Script:

loadplugin("telecidenv.dll")
loadplugin("decomb.dll")
blankclip(length=10000,pixel_type="YV12",width=1920,height=1080)
telecidenv()
#telecide(post=0,chroma=false)

TelecideNV (CUDA) --------------------------------------------------------------------
Number of frames: 10000
Length (hh:mm:ss.ms): 00:06:56.667
Frame width: 1920
Frame height: 1080
Framerate: 24.000 (24/1)
Colorspace: YV12
Audio channels: 1
Audio bits/sample: 16
Audio sample rate: 44100
Audio samples: 18375000

Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 1428 | 1812 | 1739
Memory usage (phys | virt): 135 | 128 MiB
Thread count: 17
CPU usage (average): 11%

Time (elapsed): 00:00:05.749

Telecide (Classic) --------------------------------------------------------------------
Number of frames: 10000
Length (hh:mm:ss.ms): 00:06:56.667
Frame width: 1920
Frame height: 1080
Framerate: 24.000 (24/1)
Colorspace: YV12
Audio channels: 1
Audio bits/sample: 16
Audio sample rate: 44100
Audio samples: 18375000

Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 462.2 | 628.0 | 584.2
Memory usage (phys | virt): 18 | 15 MiB
Thread count: 9
CPU usage (average): 12%

Time (elapsed): 00:00:17.116

Re: DGDecomb

Post by Aleron Ives »

Ooooh, aaaah...

I sure hope it works with my poor old GPU driver. :?

Re: DGDecomb

Post by admin »

Added pinning of the metrics array on the host:

TelecideNV: 1870 fps
Telecide: 579 fps

I cannot pin the source frame data because it is allocated by Avisynth.

Re: DGDecomb

Post by admin »

I implemented SSE2 reduction. I chose to do that rather than use a CUDA reduction kernel because, by the time the host<-->device transfers, kernel launches, and synchronization are factored in, there is no advantage for CUDA.
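
For readers unfamiliar with this kind of host-side reduction, here is a minimal C++ sketch of summing a metrics array with SSE2 intrinsics. It is an illustration only, not the TelecideNV code: the function name `sumMetrics` and the 32-bit unsigned element type are assumptions. Values are widened to 64-bit lanes before accumulation so the running sum cannot overflow, and a scalar loop handles the tail (and the whole array on builds without SSE2).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Sum n unsigned 32-bit metrics into a 64-bit total.
uint64_t sumMetrics(const uint32_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    uint64_t total = 0;
    std::size_t i = 0;
#if SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;  // two 64-bit partial sums
    for (; i + 4 <= n; i += 4) {
        // Unaligned load of four 32-bit values.
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // Interleaving with zero widens each value to an unsigned 64-bit lane.
        acc = _mm_add_epi64(acc, _mm_unpacklo_epi32(v, zero));
        acc = _mm_add_epi64(acc, _mm_unpackhi_epi32(v, zero));
    }
    alignas(16) uint64_t lanes[2];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    total = lanes[0] + lanes[1];
#endif
    // Scalar tail (or the full array when SSE2 is unavailable).
    for (; i < n; ++i) total += data[i];
    return total;
}
```

The appeal of doing this on the CPU is that the metrics are already in host memory after the copy back, so no extra kernel launch or synchronization is needed.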

Current timings for 1080p on 1050Ti:

TelecideNV: 2050 fps
Telecide: 579 fps

Now I can leave the optimizations, implement BFF handling, and then think about postprocessing and decimation.

Re: DGDecomb

Post by Guest »

TelecideNV: 2050 fps
Is that with the GTX1050ti or GTX1080ti?

Just noticed that the 1080ti will arrive today, so this speed is on a 1050ti.
Very impressive!

Re: DGDecomb

Post by admin »

I clarified my post to indicate that I used the 1050Ti. The kernel is so fast that performance is now being limited mainly by Avisynth and application overhead.

Regarding performance, some points should be borne in mind. With a single GPU all the filters have to compete for CUDA time. With multiple GPUs you could, for example, run DGDecodeNV on one of them and DGDenoise on another, etc. Even with DGDecodeNV, TelecideNV, DGDenoise, and DGSharpen all running at once, even if computation is serialized in the case of one GPU, you recover a lot of CPU time for encoding. For a CPU-bound encoder, the main thing is that the delivery of filtered frames to the encoder must not be a bottleneck.

Re: DGDecomb

Post by Guest »

But the filters are so fast that, even with Avisynth overhead as the limit, a CPU-bound encoder will always be the bottleneck.
Now, if somebody could come up with a better (CUDA) version of NVEnc, say DGNVenc... :ugeek:
I realize that it probably isn't possible with CUDA :facepalm:

Re: DGDecomb

Post by admin »

Nothing is off the table.