DGDecomb

Posted: Sun Mar 12, 2017 10:00 am
by admin
Just starting a thread to track a CUDA-accelerated Decomb, which I'm calling DGDecomb. The first thing to do, however, is to make a 64-bit version of Decomb. I can't believe I never made one. :facepalm:

Adding the existing Telecide() and Decimate() drops the frame rate from 395 fps to 85 fps for 1080p, so there is a lot of room for speed-up here.

Re: About DGDecomb

Posted: Sun Mar 12, 2017 11:10 am
by gonca
Would you consider a sub-forum just for the DG (GPU) cudasynth filters?
DGCudasynth might be a good name?

Re: About DGDecomb

Posted: Sun Mar 12, 2017 11:15 am
by admin
I was thinking about that this morning. If I have a forum called (say) CUDA/CUVID Tools, then I would have to put DGDecNV in there also. So now I am thinking of just renaming the DGDecNV forum to CUDA/CUVID Tools. Or just do nothing. What would make sense for you?

Meanwhile, I'm trying to recall how Decomb works. :scratch: I have my old journal entries to remind me, thank heavens.

Re: About DGDecomb

Posted: Sun Mar 12, 2017 11:32 am
by gonca
Leave DGDecode where it is. After all, it is used with many third-party tools outside of your control: DGDecode and DGIndex are universal tools.
The Cudasynth filters rely on DGDecode and are under your control.
I just thought that the separate sub-forum would keep things neater and easier to track / manage

Re: About DGDecomb

Posted: Sun Mar 12, 2017 3:23 pm
by Aleron Ives
I would suggest leaving the DGDecNV forum the way it is, but create a sub-forum named "DGDecNV Plugins" or something like that where you can put threads to discuss all of these secondary filters that rely on DGDecNV. You can then have all of these "master" threads gathered in one place while regular DGDecNV discussion continues as usual in the parent forum. At this point I don't think it's necessary to have separate sub-forums for DGDenoise/DGSharpen/DGDecomb, etc., but I think you have enough plugins/extensions now to warrant a general sub-forum for them as a group.

Re: About DGDecomb

Posted: Sun Mar 12, 2017 5:41 pm
by admin
Sounds good, gonca and Aleron. Thank you.

Re: DGDecomb

Posted: Sun Mar 12, 2017 5:51 pm
by Aleron Ives
Oh, it looks like we got not a sub-forum, but a top-level forum. CUDA Filters are too proud to be subordinates of the DGDecNV forum! 8-)

:lol:

Re: DGDecomb

Posted: Sun Mar 12, 2017 7:29 pm
by admin
Oh yeah, I just noticed you suggested a subforum. :facepalm: Oh well, this works well too.

Re: DGDecomb

Posted: Mon Mar 13, 2017 10:56 am
by admin
Here's my current thinking on a design for fast matching. I will have two GPU textures set up, one for the current frame and one for the previous frame. I will have two GPU global memory arrays, one for the current-current per-pixel calculation results, and one for the current-previous calculations. The kernel will make the two difference calculations for a single pixel and store the results in the global memory arrays (of course, all the pixels run in parallel to the extent supported by the GPU). Then a second kernel will perform parallel reductions to get the sum of differences for current-current and current-previous.
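The two-stage design above (per-pixel differences written to two arrays, then a sum over each array) can be sketched on the CPU as follows. This is only an illustration under stated assumptions: the `Frame` type, the function names, and the use of a plain absolute difference between vertically adjacent lines are all hypothetical, and the thread does not show the metric Decomb actually uses. On the GPU, each iteration of the inner loop would be one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <numeric>
#include <vector>

// Hypothetical frame type: 8-bit luma, row-major, no row padding.
struct Frame {
    int width;
    int height;
    std::vector<std::uint8_t> luma;
    int at(int x, int y) const {
        return luma[static_cast<std::size_t>(y) * width + x];
    }
};

// Stage 1 (stand-in for the differencing kernel): one independent
// calculation per pixel. Each pixel on an even line of the current frame is
// compared with the pixel one line below it, taken once from the current
// frame (cc) and once from the previous frame (cp). The absolute difference
// is an assumed, illustrative metric.
void difference_pass(const Frame& cur, const Frame& prev,
                     std::vector<std::uint32_t>& cc,
                     std::vector<std::uint32_t>& cp) {
    assert(cur.width == prev.width && cur.height == prev.height);
    cc.assign(cur.luma.size(), 0);
    cp.assign(cur.luma.size(), 0);
    for (int y = 0; y + 1 < cur.height; y += 2) {
        for (int x = 0; x < cur.width; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * cur.width + x;
            const int a = cur.at(x, y);
            cc[i] = static_cast<std::uint32_t>(std::abs(a - cur.at(x, y + 1)));
            cp[i] = static_cast<std::uint32_t>(std::abs(a - prev.at(x, y + 1)));
        }
    }
}

// Stage 2 (stand-in for the reduction kernel): collapse an array of
// per-pixel results into a single sum.
std::uint64_t reduce_sum(const std::vector<std::uint32_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::uint64_t{0});
}
```

Comparing `reduce_sum(cc)` with `reduce_sum(cp)` then gives one number per candidate match, which is what a field matcher needs to choose between them.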

When we step to the next frame, we don't want to have to update two textures, because on the GPU the current texture can become the previous texture. But swapping the references requires texture objects, which are supported only on Kepler+. So to keep the older cards, I will have two kernels which differ only in that they treat texture A as current and B as previous, and vice versa. Then the caller just calls the appropriate kernel, knowing which texture was just updated (the caller toggles between them, of course). On a random access, both textures are updated and then linear stepping is performed from there as described.
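The toggling scheme can be modelled without any GPU code. In this sketch the class `PingPongFrames` and its members are hypothetical names, and an "upload" is just an assignment standing in for a host-to-device texture update. It shows the bookkeeping only: one upload per linear step, two on a random access.

```cpp
#include <cassert>

// Hypothetical stand-in for a pair of GPU textures (slot 0 = A, slot 1 = B).
class PingPongFrames {
public:
    // Make frame n current and frame n-1 previous.
    // Returns how many uploads were needed.
    int fetch(int n) {
        assert(n >= 0);
        int uploads = 0;
        if (last_ >= 0 && n == last_ + 1) {
            // Linear step: the slot holding the old current frame becomes
            // "previous" just by toggling; only the new frame is uploaded.
            current_ = 1 - current_;
            upload(current_, n);
            uploads = 1;
        } else if (n != last_) {
            // Random access: both slots are refreshed.
            // For n == 0 there is no earlier frame, so frame 0 is reused.
            upload(1 - current_, n > 0 ? n - 1 : 0);
            upload(current_, n);
            uploads = 2;
        }
        last_ = n;
        return uploads;
    }

    int current_frame() const { return slot_[current_]; }
    int previous_frame() const { return slot_[1 - current_]; }

    // The caller would use this to choose between the "A is current" and
    // "B is current" kernels.
    bool a_is_current() const { return current_ == 0; }

private:
    void upload(int slot, int frame) { slot_[slot] = frame; }

    int slot_[2] = {-1, -1};  // frame number held by each slot
    int current_ = 0;         // index of the slot holding the current frame
    int last_ = -1;           // last frame number requested
};
```

Calling `fetch(10)` and then `fetch(11)` costs two uploads followed by one, and `a_is_current()` flips between the two calls.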

We'll actually need 4 differencing kernels in total, two as described above for TFF processing, and two for BFF. That will be more efficient than passing a kernel parameter and having conditionals in the kernel code.
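One way to get four specialised variants without writing the code four times is to make the two choices compile-time parameters. The sketch below uses C++ templates on the CPU (CUDA kernels can be templated the same way). The names `match_metric` and `dispatch`, and the simple absolute-difference metric, are assumptions for illustration and not the actual TelecideNV code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// AIsCurrent: whether buffer a or buffer b plays the "current" role.
// Tff: which line parity is taken from the current frame.
// Both are template parameters, so each of the four instantiations is
// compiled without run-time tests on them inside the pixel loop.
template <bool AIsCurrent, bool Tff>
std::uint64_t match_metric(const std::vector<std::uint8_t>& a,
                           const std::vector<std::uint8_t>& b,
                           int width, int height) {
    assert(a.size() == b.size());
    assert(a.size() == static_cast<std::size_t>(width) * height);
    const std::vector<std::uint8_t>& cur = AIsCurrent ? a : b;
    const std::vector<std::uint8_t>& prev = AIsCurrent ? b : a;
    const int first = Tff ? 0 : 1;
    std::uint64_t sum = 0;
    for (int y = first; y + 1 < height; y += 2) {
        for (int x = 0; x < width; ++x) {
            const int c = cur[static_cast<std::size_t>(y) * width + x];
            const int p = prev[static_cast<std::size_t>(y + 1) * width + x];
            sum += static_cast<std::uint64_t>(std::abs(c - p));
        }
    }
    return sum;
}

// The caller decides once per frame which of the four variants to run.
std::uint64_t dispatch(bool a_is_current, bool tff,
                       const std::vector<std::uint8_t>& a,
                       const std::vector<std::uint8_t>& b,
                       int width, int height) {
    if (a_is_current) {
        return tff ? match_metric<true, true>(a, b, width, height)
                   : match_metric<true, false>(a, b, width, height);
    }
    return tff ? match_metric<false, true>(a, b, width, height)
               : match_metric<false, false>(a, b, width, height);
}
```

The branch is paid once per call in `dispatch` rather than once per pixel, which is the trade the post describes.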

This should be super fast. We'll see.

Postprocessing and decimation come later.

Re: DGDecomb

Posted: Mon Mar 13, 2017 3:10 pm
by Aleron Ives
:wow:

:lol:

Seriously though, thank you for preserving support for older cards.

Re: DGDecomb

Posted: Mon Mar 13, 2017 3:15 pm
by gonca
I won't pretend to understand most of the technical things you said, or how you implement them, but I believe I catch the gist.
By increasing the number of kernels you reduce the number of conditionals and therefore clock cycles, and running the first two in parallel and then the appropriate one for TFF or BFF makes it even more efficient and easier to modify in the future if needed.
Very logical and elegant approach
That is why you are the man when it comes to CUDA coding.

Re: DGDecomb

Posted: Mon Mar 13, 2017 3:18 pm
by gonca
Aleron Ives wrote: :wow:

:lol:

Seriously though, thank you for preserving support for older cards.
+1

Good thing D.G. knows what it means

Re: DGDecomb

Posted: Mon Mar 13, 2017 5:16 pm
by admin
Thanks, guys. I'm glad you didn't find any holes in my design.

Re: DGDecomb

Posted: Mon Mar 13, 2017 8:32 pm
by admin
Started coding. :D

Re: DGDecomb

Posted: Tue Mar 14, 2017 5:14 am
by Sharc
Looking forward ..... :D

Re: DGDecomb

Posted: Wed Mar 15, 2017 11:13 am
by admin
Good morning, all.

I completed a very rough first cut of TelecideNV to test the concept. It works and fields get matched as expected. However, I have not yet implemented any optimizations, that is:

* no frame subsampling (vanilla Telecide subsamples by 4 in both X and Y)
* no toggling to avoid an extra texture update
* reduction (summing of the differences) performed on the CPU (very slow, especially when not subsampled) instead of a parallel reduction kernel
* no kernel optimizations
* using floats rather than ints (because I started with the DGSharpen code)
* no pinned memory
* some other stuff
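To give a feel for the first item in the list, here is a minimal sketch of a difference sum that visits only every `step`-th pixel in X and Y. The function name `subsampled_sad` is hypothetical, and exactly how vanilla Telecide picks its samples is not shown in this thread, so a regular grid is assumed. With `step = 4` the loop touches one sixteenth of the pixels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Sum of absolute differences between two equally sized 8-bit planes,
// sampling every `step`-th pixel horizontally and vertically.
std::uint64_t subsampled_sad(const std::vector<std::uint8_t>& a,
                             const std::vector<std::uint8_t>& b,
                             int width, int height, int step) {
    assert(step > 0);
    assert(a.size() == b.size());
    assert(a.size() == static_cast<std::size_t>(width) * height);
    std::uint64_t sum = 0;
    for (int y = 0; y < height; y += step) {
        for (int x = 0; x < width; x += step) {
            const std::size_t i = static_cast<std::size_t>(y) * width + x;
            sum += static_cast<std::uint64_t>(std::abs(int(a[i]) - int(b[i])));
        }
    }
    return sum;
}
```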

Performance is comparable to vanilla Telecide. Now I'll implement the optimizations and we'll see how much performance can be squeezed out.

On another matter, I have successfully brought up my new system (with a 1050Ti for now). I managed to sneak in an order for the 1080Ti at the nVidia store and it seems to have taken. This site is great for getting alerted when buying windows open:

https://www.nowinstock.net/computers/vi ... gtx1080ti/

Re: DGDecomb

Posted: Wed Mar 15, 2017 5:37 pm
by gonca
It will be fun to see the fps of the 1080ti

Re: DGDecomb

Posted: Thu Mar 16, 2017 1:51 pm
by admin
1080Ti arrives tomorrow. :D

Here is the state of play for TelecideNV on my 1050Ti. This includes all optimizations except parallel reduction and pinned host memory. Looking good! Performance is limited by host<-->device memory bandwidth.

Script:

loadplugin("telecidenv.dll")
loadplugin("decomb.dll")
blankclip(length=10000,pixel_type="YV12",width=1920,height=1080)
telecidenv()
#telecide(post=0,chroma=false)

TelecideNV (CUDA) --------------------------------------------------------------------
Number of frames: 10000
Length (hh:mm:ss.ms): 00:06:56.667
Frame width: 1920
Frame height: 1080
Framerate: 24.000 (24/1)
Colorspace: YV12
Audio channels: 1
Audio bits/sample: 16
Audio sample rate: 44100
Audio samples: 18375000

Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 1428 | 1812 | 1739
Memory usage (phys | virt): 135 | 128 MiB
Thread count: 17
CPU usage (average): 11%

Time (elapsed): 00:00:05.749

Telecide (Classic) --------------------------------------------------------------------
Number of frames: 10000
Length (hh:mm:ss.ms): 00:06:56.667
Frame width: 1920
Frame height: 1080
Framerate: 24.000 (24/1)
Colorspace: YV12
Audio channels: 1
Audio bits/sample: 16
Audio sample rate: 44100
Audio samples: 18375000

Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 462.2 | 628.0 | 584.2
Memory usage (phys | virt): 18 | 15 MiB
Thread count: 9
CPU usage (average): 12%

Time (elapsed): 00:00:17.116

Re: DGDecomb

Posted: Thu Mar 16, 2017 2:21 pm
by Aleron Ives
Ooooh, aaaah...

I sure hope it works with my poor old GPU driver. :?

Re: DGDecomb

Posted: Thu Mar 16, 2017 6:08 pm
by admin
Added pinning of the metrics array on the host:

TelecideNV: 1870 fps
Telecide: 579 fps

I cannot pin the source frame data because it is allocated by Avisynth.

Re: DGDecomb

Posted: Fri Mar 17, 2017 8:59 am
by admin
I implemented SSE2 reduction. I chose to do that rather than use a CUDA reduction kernel because by the time the host<-->device transfers, kernel launches, and synchronization are factored in, there is no advantage for CUDA.
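For reference, a host-side SSE2 reduction over an array of metrics might look like the sketch below. It is not the actual TelecideNV code: the function name `sse2_sum` is hypothetical, and it assumes the per-pixel metrics come back from the GPU as unsigned 32-bit integers. Each iteration widens four values to 64 bits and adds them to two running partial sums, so the total cannot overflow a 32-bit lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Sum n unsigned 32-bit values using SSE2, four values per iteration.
std::uint64_t sse2_sum(const std::uint32_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;  // two 64-bit partial sums
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // Interleaving with zero extends each 32-bit value to 64 bits.
        acc = _mm_add_epi64(acc, _mm_unpacklo_epi32(v, zero));
        acc = _mm_add_epi64(acc, _mm_unpackhi_epi32(v, zero));
    }
    std::uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    std::uint64_t total = lanes[0] + lanes[1];
    for (; i < n; ++i) {
        total += data[i];  // scalar tail for the last n % 4 values
    }
    return total;
}
```

Unaligned loads are used so the sketch works on any buffer. If the metrics array is allocated with 16-byte alignment, `_mm_load_si128` could be used instead.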

Current timings for 1080p on 1050Ti:

TelecideNV: 2050 fps
Telecide: 579 fps

Now I can leave the optimizations, implement BFF handling, and then think about postprocessing and decimation.

Re: DGDecomb

Posted: Fri Mar 17, 2017 9:14 am
by gonca
TelecideNV: 2050 fps
Is that with the GTX1050ti or GTX1080ti?

Just noticed that the 1080ti will arrive today, so this speed is on a 1050ti.
Very impressive!

Re: DGDecomb

Posted: Fri Mar 17, 2017 10:29 am
by admin
I clarified my post to indicate that I used the 1050Ti. The kernel is so fast that performance is now being limited mainly by Avisynth and application overhead.

Regarding performance, some points should be borne in mind. With a single GPU all the filters have to compete for CUDA time. With multiple GPUs you could, for example, run DGDecodeNV on one of them and DGDenoise on another, etc. Even with DGDecodeNV, TelecideNV, DGDenoise, and DGSharpen all running at once, even if computation is serialized in the case of one GPU, you recover a lot of CPU time for encoding. For a CPU-bound encoder, the main thing is that the delivery of filtered frames to the encoder must not be a bottleneck.

Re: DGDecomb

Posted: Fri Mar 17, 2017 11:29 am
by gonca
But the filters are so fast that, even with Avisynth overhead limiting them, a CPU-bound encoder will always be the bottleneck.
Now, if somebody could come up with a better (CUDA) version of NVEnc, say DGNVenc... :ugeek:
I realize that it probably isn't possible with CUDA :facepalm:

Re: DGDecomb

Posted: Fri Mar 17, 2017 2:04 pm
by admin
Nothing is off the table.