DGDecomb

Post by **Guest** » Mon Mar 13, 2017 3:15 pm

I won't pretend to understand most of the technical things you said, or how you implement them, but I believe I catch the gist.
By increasing the number of kernels you can reduce the number of conditionals and therefore clock cycles. Running the first two in parallel and then the appropriate one for TFF or BFF makes it even more efficient, and easier to modify in future if needed.
A very logical and elegant approach.
That is why you are the man when it comes to CUDA coding.

Post by **Guest** » Mon Mar 13, 2017 3:18 pm

Aleron Ives wrote:

Seriously though, thank you for preserving support for older cards.

+1

Good thing D.G. knows what it means.

Post by **admin** » Mon Mar 13, 2017 5:16 pm

Thanks, guys. I'm glad you didn't find any holes in my design.

Post by **admin** » Mon Mar 13, 2017 8:32 pm

Started coding.

Post by **Sharc** » Tue Mar 14, 2017 5:14 am

Looking forward .....

Post by **admin** » Wed Mar 15, 2017 11:13 am

Good morning, all.

I completed a very rough first cut of TelecideNV to test the concept. It works and fields get matched as expected. However, I have not yet implemented any optimizations, that is:

* no frame subsampling (vanilla Telecide subsamples by 4 in both X and Y)
* no toggling to avoid an extra texture update
* reduction (summing of the differences) performed on the CPU (very slow, especially when not subsampled) instead of a parallel reduction kernel
* no kernel optimizations
* using floats rather than ints (because I started with the DGSharpen code)
* no pinned memory
* some other stuff

Performance is comparable to vanilla Telecide. Now I'll implement the optimizations and we'll see how much performance can be squeezed out.
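A minimal sketch of the idea behind the first bullet, subsampling the difference metric by 4 in both X and Y. It is an illustration only: the metric, the `weave`, `comb_metric`, and `best_match` helpers, and the toy frames are all assumptions made for this example, not the actual Telecide or TelecideNV code.

```python
# Illustrative only: a toy field-matching metric, not the real Telecide code.

def weave(top_src, bottom_src):
    """Take even rows (top field) from top_src and odd rows (bottom field) from bottom_src."""
    return [top_src[y] if y % 2 == 0 else bottom_src[y] for y in range(len(top_src))]


def comb_metric(frame, step=4):
    """Sum |above + below - 2*current| over a grid subsampled by `step` in X and Y."""
    height, width = len(frame), len(frame[0])
    total = 0
    for y in range(1, height - 1, step):
        for x in range(0, width, step):
            total += abs(frame[y - 1][x] + frame[y + 1][x] - 2 * frame[y][x])
    return total


def best_match(prev, cur, nxt, step=4):
    """Keep the top field of cur and pick the bottom field (c, p or n) with the lowest metric."""
    candidates = {
        "c": cur,                # bottom field from the current frame
        "p": weave(cur, prev),   # bottom field from the previous frame
        "n": weave(cur, nxt),    # bottom field from the next frame
    }
    metrics = {name: comb_metric(frame, step) for name, frame in candidates.items()}
    return min(metrics, key=metrics.get), metrics


# Toy data: the current frame holds the top field of film frame A
# and the bottom field of film frame B, so it is combed.
film_a = [[4 * x for x in range(16)] for _ in range(16)]
film_b = [[200 - 4 * x for x in range(16)] for _ in range(16)]
cur = weave(film_a, film_b)

match, metrics = best_match(film_a, cur, film_b)
print(match, metrics)
```

With `step=4` the loops visit one sixteenth of the pixels that `step=1` would, which is where the saving in summing work comes from.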

On another matter, I have successfully brought up my new system (with a 1050Ti for now). I managed to sneak in an order for the 1080Ti at the nVidia store and it seems to have taken. This site is great for getting alerted when buying windows open:

https://www.nowinstock.net/computers/vi ... gtx1080ti/

Post by **Guest** » Wed Mar 15, 2017 5:37 pm

It will be fun to see the fps of the 1080Ti.

Post by **admin** » Thu Mar 16, 2017 1:51 pm

1080Ti arrives tomorrow.

Here is the state of play for TelecideNV on my 1050Ti. This includes all optimizations except parallel reduction and pinned host memory. Looking good! Performance is limited by host<-->device memory bandwidth.

Script:

loadplugin("telecidenv.dll")
loadplugin("decomb.dll")
blankclip(length=10000,pixel_type="YV12",width=1920,height=1080)
telecidenv()
#telecide(post=0,chroma=false)

TelecideNV (CUDA) --------------------------------------------------------------------
Number of frames: 10000
Length (hh:mm:ss.ms): 00:06:56.667
Frame width: 1920
Frame height: 1080
Framerate: 24.000 (24/1)
Colorspace: YV12
Audio channels: 1
Audio bits/sample: 16
Audio sample rate: 44100
Audio samples: 18375000

Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 1428 | 1812 | 1739
Memory usage (phys | virt): 135 | 128 MiB
Thread count: 17
CPU usage (average): 11%

Time (elapsed): 00:00:05.749

Telecide (Classic) --------------------------------------------------------------------
Number of frames: 10000
Length (hh:mm:ss.ms): 00:06:56.667
Frame width: 1920
Frame height: 1080
Framerate: 24.000 (24/1)
Colorspace: YV12
Audio channels: 1
Audio bits/sample: 16
Audio sample rate: 44100
Audio samples: 18375000

Frames processed: 10000 (0 - 9999)
FPS (min | max | average): 462.2 | 628.0 | 584.2
Memory usage (phys | virt): 18 | 15 MiB
Thread count: 9
CPU usage (average): 12%

Time (elapsed): 00:00:17.116
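
A rough back-of-the-envelope check of the bandwidth remark, using only the frame size and the average fps figures reported above. The thread does not say whether the filter uploads luma only or the whole YV12 frame, so both cases are computed; the helper names are made up for this example.

```python
# Rough arithmetic only; helper names are hypothetical.

def frame_bytes_yv12(width, height):
    """Return (luma_bytes, total_bytes) for an 8-bit YV12 (4:2:0) frame."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)  # two quarter-size chroma planes
    return luma, luma + chroma


def transfer_rate_gb_per_s(bytes_per_frame, fps):
    """Host-to-device upload rate in GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_frame * fps / 1e9


luma, full = frame_bytes_yv12(1920, 1080)
nv_fps, classic_fps = 1739, 584.2  # average fps from the two runs above

print("luma only :", round(transfer_rate_gb_per_s(luma, nv_fps), 2), "GB/s")
print("full frame:", round(transfer_rate_gb_per_s(full, nv_fps), 2), "GB/s")
print("speed-up  :", round(nv_fps / classic_fps, 2), "x")
```

At 1739 fps this works out to roughly 3.6 GB/s for luma alone and 5.4 GB/s for full frames, copied from ordinary pageable host memory, with the overall speed-up over classic Telecide at about 3x.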

Post by **Aleron Ives** » Thu Mar 16, 2017 2:21 pm

Ooooh, aaaah...

I sure hope it works with my poor old GPU driver.

Post by **admin** » Thu Mar 16, 2017 6:08 pm

Added pinning of the metrics array on the host:

TelecideNV: 1870 fps
Telecide: 579 fps

I cannot pin the source frame data because it is allocated by Avisynth.

Post by **admin** » Fri Mar 17, 2017 8:59 am

I implemented SSE2 reduction. I chose to do that rather than use a CUDA reduction kernel because by the time the host<-->device transfers, kernel launches, and synchronization are factored in, there is no advantage for CUDA.
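
For readers unfamiliar with the pattern, here is a small model of how a SIMD sum reduction is usually organised: consume 16 bytes per step, keep independent partial sums per lane, and add the lanes together once at the end. It is written in plain Python purely to show the data flow; real code would use the SSE2 intrinsics from `<emmintrin.h>`. The byte-sized metrics and the function name are assumptions, since the thread does not show the actual layout of the metrics array.

```python
# Model of a two-lane, 16-bytes-per-step sum reduction (data flow only).

def sum_bytes_two_lanes(data):
    """Sum a bytes-like object the way a 128-bit register with two 64-bit lanes would."""
    lane_lo = 0  # partial sum of bytes 0..7 of each 16-byte block
    lane_hi = 0  # partial sum of bytes 8..15 of each 16-byte block
    full = len(data) - len(data) % 16
    for i in range(0, full, 16):
        lane_lo += sum(data[i:i + 8])
        lane_hi += sum(data[i + 8:i + 16])
    tail = sum(data[full:])           # leftover bytes handled by a scalar loop
    return lane_lo + lane_hi + tail   # single horizontal add at the end


metrics = bytes(range(256)) * 3 + bytes([7, 9, 200])
print(sum_bytes_two_lanes(metrics))
```

The point of the layout is that the per-block work has no dependency between lanes, so the only serial step is the final add.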

Current timings for 1080p on 1050Ti:

TelecideNV: 2050 fps
Telecide: 579 fps

Now I can leave the optimizations, implement BFF handling, and then think about postprocessing and decimation.

Post by **Guest** » Fri Mar 17, 2017 9:14 am

TelecideNV: 2050 fps

Is that with the GTX 1050Ti or the GTX 1080Ti?

Just noticed that the 1080Ti will arrive today, so this speed is on a 1050Ti.
Very impressive!

Post by **admin** » Fri Mar 17, 2017 10:29 am

I clarified my post to indicate that I used the 1050Ti. The kernel is so fast that performance is now being limited mainly by Avisynth and application overhead.

Regarding performance, some points should be borne in mind. With a single GPU all the filters have to compete for CUDA time. With multiple GPUs you could, for example, run DGDecodeNV on one of them and DGDenoise on another, etc. Even with DGDecodeNV, TelecideNV, DGDenoise, and DGSharpen all running at once, even if computation is serialized in the case of one GPU, you recover a lot of CPU time for encoding. For a CPU-bound encoder, the main thing is that the delivery of filtered frames to the encoder must not be a bottleneck.

Post by **Guest** » Fri Mar 17, 2017 11:29 am

But the filters are so fast that, even with Avisynth overhead as the limiting factor, a CPU-bound encoder will always be the bottleneck.
Now, if somebody could come up with a better (CUDA) version of NVEnc, say DGNVenc...

I realize that it probably isn't possible with CUDA.

Post by **admin** » Fri Mar 17, 2017 2:04 pm

Nothing is off the table.

Post by **admin** » Fri Mar 17, 2017 7:55 pm

OK, girls and boys, are you ready? Are you ready for the first performance comparison between the 1050Ti and the 1080Ti?

I ran this script with a 720x480 clip needing field matching. I added denoising and sharpening as well. Note that TelecideNV has not been integrated into DGDecodeNV.dll yet, but it will be.

loadplugin("dgdecodenv.dll")
loadplugin("telecidenv.dll")
a=dgsource("lain.dgi")
a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a # Lain is a short clip; make a long enough test for AVSMeter
telecidenv()
DGDenoise(strength=0.1)
DGSharpen(strength=0.5)

The results:

1050Ti: 470 fps
1080Ti: 818 fps

That seems like a useful speed-up to me.

Post by **Guest** » Fri Mar 17, 2017 8:29 pm

That's a substantial speed increase.
In Canada they are still out of stock.

Post by **admin** » Fri Mar 17, 2017 9:08 pm

I reckon your 1070 would weigh in at about 650 fps for that test.

BTW, while profiling the filters I was able to find a useful speed gain for DGSource(). About 20%. I'll slipstream it at some point.

Can you imagine having two of these, the first decoding and filtering, and the second encoding? Maybe it's time for me to look into the NV encoding samples. After I complete DGDecomb, of course.

Post by **Guest** » Fri Mar 17, 2017 9:20 pm

Two GTX 1080Ti cards in one system.
That might require a 1000-watt power supply.
But the CPU could be cut down.

Edit:
Now that I think about it, the CPU can't be cut down.
Each card requires 16 lanes, for a total of 32 lanes.
You would need an Extreme Edition CPU to handle that many lanes.

Post by **admin** » Fri Mar 17, 2017 10:37 pm

Now I have to find out what a lane is.

Post by **Aleron Ives** » Fri Mar 17, 2017 11:28 pm

admin wrote: Now I have to find out what a lane is.

It's time for a driver's ed refresher course!

Post by **Sharc** » Sat Mar 18, 2017 3:26 am

DGSource has the boolean parameter "use_pf".
Is the decision about the frame type (progressive/interlaced) based on flags of the source, or does the algorithm analyse the frames and decide for combed or non-combed frames for deinterlacing (similar to DGDecomb)?

How will this be with the new DGDecombNV? Is it a pure FM / Decimate function?

Post by **Guest** » Sat Mar 18, 2017 6:26 am

lane >>> PCIe 3.0 lane
Different CPU types support different numbers of lanes.

Post by **admin** » Sat Mar 18, 2017 6:38 am

Sharc wrote: DGSource has the boolean parameter "use_pf".
Is the decision about the frame type (progressive/interlaced) based on flags of the source, or does the algorithm analyse the frames and decide for combed or non-combed frames for deinterlacing (similar to DGDecomb)?

How will this be with the new DGDecombNV? Is it a pure FM / Decimate function?

use_pf uses source stream flags; there is no analysis. It isn't something you can rely on generally, but if you know that your stream properly sets the progressive_frame flag, it can be useful to avoid deinterlacing frames marked as progressive. Even so, a frame may be coded and marked interlaced but have no motion and thus not need deinterlacing. So analysis is the fully correct general strategy to preserve frames whose content is progressive.

I am just now starting to think about postprocessing/deinterlacing for DGDecomb. It will perform analysis in some way (I'm trying to re-use the field matching metrics) and then any frames appearing combed after field matching will be deinterlaced with an as yet to-be-determined CUDA kernel. We will save one frame copy to the GPU because it is already there from the field matching. The primary goal is speed, and not necessarily to reproduce all classic Decomb behavior and options.
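
To make the flow concrete, here is a minimal sketch of "analyze, then deinterlace only the frames that still look combed". The combing test (a pixel counts as combed when it differs strongly from both vertical neighbours in the same direction), the thresholds, and the interpolation used as the deinterlacer are all assumptions chosen for illustration; the actual DGDecomb metric and kernel had not been decided at the time of this post.

```python
# Illustrative only: an assumed comb test and a trivial deinterlacer.

def is_combed(frame, threshold=400, min_pixels=8):
    """Flag the frame when enough pixels differ from both vertical neighbours the same way."""
    count = 0
    for y in range(1, len(frame) - 1):
        above, cur, below = frame[y - 1], frame[y], frame[y + 1]
        for x in range(len(cur)):
            if (above[x] - cur[x]) * (below[x] - cur[x]) > threshold:
                count += 1
    return count >= min_pixels


def interpolate_odd_rows(frame):
    """Rebuild the bottom field by averaging the rows above and below each odd row."""
    height = len(frame)
    out = [row[:] for row in frame]
    for y in range(1, height, 2):
        above = frame[y - 1]
        below = frame[y + 1] if y + 1 < height else frame[y - 1]
        out[y] = [(a + b + 1) // 2 for a, b in zip(above, below)]
    return out


def postprocess(frame):
    """Leave progressive-looking frames untouched; deinterlace the rest."""
    return interpolate_odd_rows(frame) if is_combed(frame) else frame


progressive = [[10 * y] * 8 for y in range(8)]                # smooth vertical gradient
combed = [[50] * 8 if y % 2 == 0 else [150] * 8 for y in range(8)]

print(is_combed(progressive), is_combed(combed), is_combed(postprocess(combed)))
```

The progressive frame passes through unchanged, which is the behaviour that relying on stream flags alone cannot guarantee.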

Post by **hydra3333** » Sat Mar 18, 2017 7:49 am

admin wrote: BTW, while profiling the filters I was able to find a useful speed gain for DGSource(). About 20%. I'll slipstream it at some point.

Thanks !

admin wrote: Maybe it's time for me to look into the NV encoding samples. After I complete DGDecomb, of course.

... and DGdeblockNV

OK, this looks like it has options you may or may not consider useful; just a thought.
https://forum.videohelp.com/threads/370 ... U-encoding
http://rigaya34589.blog135.fc2.com/blog ... ry-17.html
https://github.com/rigaya/NVEnc
this also may or may not be of interest (a new deblocker) [link removed]
