DGDecomb

These CUDA filters are packaged into DGDecodeNV, which is part of DGDecNV.
User avatar
gonca
Curly Approved/Moose Approved
Posts: 956
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

I won't pretend to understand most of the technical things you said, or how you implement them, but I believe I catch the gist.
By increasing the number of kernels you can reduce the number of conditionals and therefore clock cycles. Running the first two in parallel and then the appropriate one for TFF or BFF makes it even more efficient, and easier to modify in future if needed.
Very logical and elegant approach.
That is why you are the man when it comes to CUDA coding.
User avatar
gonca
Curly Approved/Moose Approved
Posts: 956
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

Aleron Ives wrote: :wow:

Image

:lol:

Seriously though, thank you for preserving support for older cards.
+1

Good thing D.G. knows what it means
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

Thanks, guys. I'm glad you didn't find any holes in my design.
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

Started coding. :D
DAE avatar
Sharc
Moose Approved
Posts: 231
Joined: Thu Sep 23, 2010 1:53 pm

Re: DGDecomb

Post by Sharc »

Looking forward ..... :D
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

Good morning, all.

I completed a very rough first cut of TelecideNV to test the concept. It works and fields get matched as expected. However, I have not yet implemented any optimizations, that is:

* no frame subsampling (vanilla Telecide subsamples by 4 in both X and Y)
* no toggling to avoid an extra texture update
* reduction (summing of the differences) performed on the CPU (very slow, especially when not subsampled) instead of a parallel reduction kernel
* no kernel optimizations
* using floats rather than ints (because I started with the DGSharpen code)
* no pinned memory
* some other stuff

Performance is comparable to vanilla Telecide. Now I'll implement the optimizations and we'll see how much performance can be squeezed out.
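For readers following along, here is a rough sketch of what field matching with subsampling and a reduction looks like. It is plain Python and not the TelecideNV code: the function names, the candidate labels "p"/"c"/"n", and the squared-difference metric are assumptions chosen for illustration. Only the idea of sampling every 4th pixel in X and Y and summing the differences comes from the post above.

```python
def weave(top_src, bottom_src):
    """Build a frame from the top field (even rows) of one frame
    and the bottom field (odd rows) of another."""
    return [top_src[y] if y % 2 == 0 else bottom_src[y]
            for y in range(len(top_src))]


def comb_metric(frame, step=4):
    """Sum of squared vertical differences, sampling every step-th
    row and column. The final summation is the 'reduction' step."""
    total = 0
    for y in range(1, len(frame) - 1, step):
        above, row, below = frame[y - 1], frame[y], frame[y + 1]
        for x in range(0, len(row), step):
            d = 2 * row[x] - above[x] - below[x]
            total += d * d
    return total


def best_match(prev, cur, nxt, step=4):
    """Keep the top field of the current frame and try the bottom field
    of the previous, current, and next frames (hypothetical labels)."""
    candidates = {
        "p": weave(cur, prev),
        "c": cur,
        "n": weave(cur, nxt),
    }
    scores = {k: comb_metric(v, step) for k, v in candidates.items()}
    # The candidate with the least combing wins.
    return min(scores, key=scores.get), scores
```

With step=4 the metric touches only 1/16 of the pixels, which is why subsampling matters so much when the reduction runs on the CPU.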

On another matter, I have successfully brought up my new system (with a 1050Ti for now). I managed to sneak in an order for the 1080Ti at the nVidia store and it seems to have taken. This site is great for getting alerted when buying windows open:

https://www.nowinstock.net/computers/vi ... gtx1080ti/
User avatar
gonca
Curly Approved/Moose Approved
Posts: 956
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

It will be fun to see the fps of the 1080ti
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

1080Ti arrives tomorrow. :D

Here is the state of play for TelecideNV on my 1050Ti. This includes all optimizations except parallel reduction and pinned host memory. Looking good! Performance is limited by host<-->device memory bandwidth.

Script:

loadplugin("telecidenv.dll")
loadplugin("decomb.dll")
blankclip(length=10000,pixel_type="YV12",width=1920,height=1080)
telecidenv()
#telecide(post=0,chroma=false)

TelecideNV (CUDA) --------------------------------------------------------------------
Number of frames: 10000
Length (hh:mm:ss.ms): 00:06:56.667
Frame width: 1920
Frame height: 1080
Framerate: 24.000 (24/1)
Colorspace: YV12
Audio channels: 1
Audio bits/sample: 16
Audio sample rate: 44100
Audio samples: 18375000

Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 1428 | 1812 | 1739
Memory usage (phys | virt): 135 | 128 MiB
Thread count: 17
CPU usage (average): 11%

Time (elapsed): 00:00:05.749

Telecide (Classic) --------------------------------------------------------------------
Number of frames: 10000
Length (hh:mm:ss.ms): 00:06:56.667
Frame width: 1920
Frame height: 1080
Framerate: 24.000 (24/1)
Colorspace: YV12
Audio channels: 1
Audio bits/sample: 16
Audio sample rate: 44100
Audio samples: 18375000

Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 462.2 | 628.0 | 584.2
Memory usage (phys | virt): 18 | 15 MiB
Thread count: 9
CPU usage (average): 12%

Time (elapsed): 00:00:17.116
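
As a sanity check on the two AVSMeter logs above, the elapsed times and the average frame rates can be cross-checked with a few lines of Python. The numbers are copied from the logs; the variable names are just for illustration.

```python
frames = 10000

# Average fps and elapsed seconds as reported in the two logs above.
runs = {
    "TelecideNV": {"avg_fps": 1739.0, "elapsed_s": 5.749},
    "Telecide": {"avg_fps": 584.2, "elapsed_s": 17.116},
}

# Frame rate implied by frame count and elapsed time.
implied = {name: frames / r["elapsed_s"] for name, r in runs.items()}

for name, r in runs.items():
    print(f"{name}: reported {r['avg_fps']:.1f} fps, implied {implied[name]:.1f} fps")

# Speed-up of the CUDA version over classic Telecide for this test.
speedup = runs["Telecide"]["elapsed_s"] / runs["TelecideNV"]["elapsed_s"]
print(f"speed-up: {speedup:.2f}x")
```

The implied rates agree with the reported averages to within 1 fps, and the speed-up for this blank-clip test comes out just under 3x.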
DAE avatar
Aleron Ives
Posts: 113
Joined: Fri May 31, 2013 8:36 pm

Re: DGDecomb

Post by Aleron Ives »

Ooooh, aaaah...

I sure hope it works with my poor old GPU driver. :?
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

Added pinning of the metrics array on the host:

TelecideNV: 1870 fps
Telecide: 579 fps

I cannot pin the source frame data because it is allocated by Avisynth.
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

I implemented SSE2 reduction. I chose to do that rather than use a CUDA reduction kernel because by the time the host<-->device transfers, kernel launches, and synchronization are factored in, there is no advantage for CUDA.
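
For anyone unfamiliar with the term, a reduction collapses an array of per-pixel differences into a single sum. The sketch below is plain Python, not the SSE2 code or a CUDA kernel. It shows the pairwise pattern a parallel reduction follows, where each pass halves the data and the additions within a pass could run concurrently. The function name is made up for illustration.

```python
def tree_reduce(values):
    """Sum values by repeated pairwise addition.

    Returns (total, passes). Each pass corresponds to one parallel
    step; the additions inside a pass are independent of each other.
    """
    data = list(values)
    passes = 0
    while len(data) > 1:
        if len(data) % 2:
            data.append(0)  # pad so every element has a partner
        half = len(data) // 2
        data = [data[i] + data[i + half] for i in range(half)]
        passes += 1
    return (data[0] if data else 0), passes


total, passes = tree_reduce(range(1000))
print(total, passes)
```

The pass count grows with log2 of the input size. On a GPU each pass still needs synchronization, and the result must be copied back to the host, which is the overhead described above.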

Current timings for 1080p on 1050Ti:

TelecideNV: 2050 fps
Telecide: 579 fps

Now I can leave the optimizations, implement BFF handling, and then think about postprocessing and decimation.
User avatar
gonca
Curly Approved/Moose Approved
Posts: 956
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

TelecideNV: 2050 fps
Is that with the GTX1050ti or GTX1080ti?

Just noticed that the 1080ti will arrive today, so this speed is on a 1050ti.
Very impressive!
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

I clarified my post to indicate that I used the 1050Ti. The kernel is so fast that performance is now being limited mainly by Avisynth and application overhead.

Regarding performance, some points should be borne in mind. With a single GPU all the filters have to compete for CUDA time. With multiple GPUs you could, for example, run DGDecodeNV on one of them and DGDenoise on another, etc. Even with DGDecodeNV, TelecideNV, DGDenoise, and DGSharpen all running at once, even if computation is serialized in the case of one GPU, you recover a lot of CPU time for encoding. For a CPU-bound encoder, the main thing is that the delivery of filtered frames to the encoder must not be a bottleneck.
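
A back-of-the-envelope model of that point, in Python. This is a simplification that assumes the filters run strictly one after another on a single GPU with no overlap, and all frame rates in the usage example are hypothetical placeholders, not measurements.

```python
def serialized_fps(stage_fps):
    """Effective frame rate when stages run back to back:
    per-frame times add, so the rates combine harmonically."""
    return 1.0 / sum(1.0 / f for f in stage_fps)


def bottleneck(filter_fps, encoder_fps):
    """Name the slower side of the filter chain / encoder pair."""
    return "filters" if filter_fps < encoder_fps else "encoder"


# Hypothetical standalone rates for four GPU filters, and a hypothetical encoder.
chain = serialized_fps([2000.0, 2000.0, 1000.0, 1000.0])
print(round(chain, 1), bottleneck(chain, encoder_fps=60.0))
```

Even when serialization cuts the chain's rate well below that of any single filter, the chain only becomes a problem once it drops below what the encoder can consume.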
User avatar
gonca
Curly Approved/Moose Approved
Posts: 956
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

But the filters are so fast that, even with Avisynth overhead limiting them, a CPU-bound encoder will always be the bottleneck.
Now, if somebody could come up with a better (CUDA) version of NVEnc, say DGNVenc... :ugeek:
I realize that it probably isn't possible with CUDA :facepalm:
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

Nothing is off the table.
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

OK, girls and boys, are you ready? Are you ready for the first performance comparison between the 1050Ti and the 1080Ti?

I ran this script with a 720x480 clip needing field matching. I added denoising and sharpening as well. Note that TelecideNV has not been integrated into DGDecodeNV.dll yet, but it will be.

loadplugin("dgdecodenv.dll")
loadplugin("telecidenv.dll")
a=dgsource("lain.dgi")
a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a # Lain is a short clip; make a long enough test for AVSMeter
telecidenv()
DGDenoise(strength=0.1)
DGSharpen(strength=0.5)

The results:

1050Ti: 470 fps
1080Ti: 818 fps

That seems like a useful speed-up to me. ;)
User avatar
gonca
Curly Approved/Moose Approved
Posts: 956
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

That's a substantial speed increase.
In Canada they are still out of stock
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

I reckon your 1070 would weigh in at about 650 fps for that test.

BTW, while profiling the filters I was able to find a useful speed gain for DGSource(). About 20%. I'll slipstream it at some point.

Can you imagine having two of these, the first decoding and filtering, and the second encoding? Maybe it's time for me to look into the NV encoding samples. After I complete DGDecomb, of course.
User avatar
gonca
Curly Approved/Moose Approved
Posts: 956
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

2 gtx1080ti cards in one system
That might require a 1000 watt power supply
But the CPU could be cut down

Edit
Now that I think about it, the CPU can't be cut down.
Each card requires 16 lanes for a total of 32 lanes
You would need an extreme edition CPU to handle that many lanes
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

Now I have to find out what a lane is. :cry:
DAE avatar
Aleron Ives
Posts: 113
Joined: Fri May 31, 2013 8:36 pm

Re: DGDecomb

Post by Aleron Ives »

admin wrote:Now I have to find out what a lane is. :cry:
It's time for a driver's ed refresher course! :lol:

;)
DAE avatar
Sharc
Moose Approved
Posts: 231
Joined: Thu Sep 23, 2010 1:53 pm

Re: DGDecomb

Post by Sharc »

DGSource has the boolean parameter "use_pf".
Is the decision about the frame type (progressive/interlaced) based on flags of the source, or does the algorithm analyse the frames and decide for combed or non-combed frames for deinterlacing (similar to DGDecomb)?

How will this be with the new DGDecombNV? Is it a pure FM / Decimate function?
User avatar
gonca
Curly Approved/Moose Approved
Posts: 956
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

lane >>> PCIe 3.0 lane
Different CPU types support different numbers of lanes
User avatar
admin
Site Admin
Posts: 4449
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

Sharc wrote:DGSource has the boolean parameter "use_pf".
Is the decision about the frame type (progressive/interlaced) based on flags of the source, or does the algorithm analyse the frames and decide for combed or non-combed frames for deinterlacing (similar to DGDecomb)?

How will this be with the new DGDecombNV? Is it a pure FM / Decimate function?
use_pf uses source stream flags; there is no analysis. It isn't something you can rely on generally, but if you know that your stream properly sets the progressive_frame flag, it can be useful to avoid deinterlacing frames marked as progressive. Even so, a frame may be coded and marked interlaced but have no motion and thus not need deinterlacing. So analysis is the fully correct general strategy to preserve frames whose content is progressive.

I am just now starting to think about postprocessing/deinterlacing for DGDecomb. It will perform analysis in some way (I'm trying to re-use the field matching metrics) and then any frames appearing combed after field matching will be deinterlaced with an as yet to-be-determined CUDA kernel. We will save one frame copy to the GPU because it is already there from the field matching. The primary goal is speed, and not necessarily to reproduce all classic Decomb behavior and options.
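
To make the "analyse, then deinterlace only what is combed" idea concrete, here is a minimal Python sketch. It is not the planned CUDA kernel, which the post says is still to be determined. The comb test, the threshold values, and the line-averaging interpolation are all assumptions picked for illustration.

```python
def combed_pixels(frame, threshold=400):
    """Count pixels that differ from both vertical neighbours in the
    same direction; the product test is a common comb heuristic."""
    count = 0
    for y in range(1, len(frame) - 1):
        for x in range(len(frame[y])):
            c = frame[y][x]
            if (c - frame[y - 1][x]) * (c - frame[y + 1][x]) > threshold:
                count += 1
    return count


def interpolate_odd_rows(frame):
    """Rebuild odd rows as the average of the even rows around them."""
    out = [row[:] for row in frame]
    for y in range(1, len(frame), 2):
        above = frame[y - 1]
        below = frame[y + 1] if y + 1 < len(frame) else frame[y - 1]
        out[y] = [(a + b) // 2 for a, b in zip(above, below)]
    return out


def postprocess(frame, max_combed=0, threshold=400):
    """Deinterlace only if analysis says the frame still looks combed.
    Returns (frame, was_deinterlaced)."""
    if combed_pixels(frame, threshold) > max_combed:
        return interpolate_odd_rows(frame), True
    return frame, False
```

A flag-based decision such as use_pf skips the analysis step and trusts the stream. The sketch above ignores flags entirely, so a static frame coded as interlaced passes through untouched.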
User avatar
hydra3333
Moose Approved
Posts: 214
Joined: Wed Oct 06, 2010 3:34 am
Contact:

Re: DGDecomb

Post by hydra3333 »

admin wrote:BTW, while profiling the filters I was able to find a useful speed gain for DGSource(). About 20%. I'll slipstream it at some point.
Thanks !
admin wrote:Maybe it's time for me to look into the NV encoding samples. After I complete DGDecomb, of course.
... and DGdeblockNV :) ? OK, this looks like it has options you may or may not consider useful; just a thought.
https://forum.videohelp.com/threads/370 ... U-encoding
http://rigaya34589.blog135.fc2.com/blog ... ry-17.html
https://github.com/rigaya/NVEnc
this also may or may not be of interest (a new deblocker) [link removed]