DGDecomb

These CUDA filters are packaged into DGDecodeNV, which is part of DGDecNV.
User avatar
admin
Site Admin
Posts: 4447
Joined: Thu Sep 09, 2010 3:08 pm

Re: DGDecomb

Post by admin »

I implemented SSE2 reduction. I chose to do that rather than use a CUDA reduction kernel because by the time the host<-->device transfers, kernel launches, and synchronization are factored in, there is no advantage for CUDA.

Current timings for 1080p on 1050Ti:

TelecideNV: 2050 fps
Telecide: 579 fps

Now I can leave the optimizations, implement BFF handling, and then think about postprocessing and decimation.
User avatar
gonca
Curly Approved/Moose Approved
Posts: 952
Joined: Sun Apr 08, 2012 6:12 pm

Re: DGDecomb

Post by gonca »

TelecideNV: 2050 fps
Is that with the GTX 1050 Ti or the GTX 1080 Ti?

Just noticed that the 1080 Ti will arrive today, so this speed is on the 1050 Ti.
Very impressive!

Re: DGDecomb

Post by admin »

I clarified my post to indicate that I used the 1050Ti. The kernel is so fast that performance is now being limited mainly by Avisynth and application overhead.

Regarding performance, some points should be borne in mind. With a single GPU all the filters have to compete for CUDA time. With multiple GPUs you could, for example, run DGDecodeNV on one of them and DGDenoise on another, etc. Even with DGDecodeNV, TelecideNV, DGDenoise, and DGSharpen all running at once, and even if computation is serialized in the case of one GPU, you recover a lot of CPU time for encoding. For a CPU-bound encoder, the main thing is that the delivery of filtered frames to the encoder must not be a bottleneck.

Re: DGDecomb

Post by gonca »

But the filters are so fast that even with Avisynth overhead limiting them, a CPU-bound encoder will always be the bottleneck.
Now, if somebody could come up with a better (CUDA) version of NVEnc, say DGNVenc... :ugeek:
I realize that it probably isn't possible with CUDA :facepalm:

Re: DGDecomb

Post by admin »

Nothing is off the table.

Re: DGDecomb

Post by admin »

OK, girls and boys, are you ready? Are you ready for the first performance comparison between the 1050Ti and the 1080Ti?

I ran this script with a 720x480 clip needing field matching. I added denoising and sharpening as well. Note that TelecideNV has not been integrated into DGDecodeNV.dll yet, but it will be.

loadplugin("dgdecodenv.dll")
loadplugin("telecidenv.dll")
a=dgsource("lain.dgi")
a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a+a # Lain is a short clip; make a long enough test for AVSMeter
telecidenv()
DGDenoise(strength=0.1)
DGSharpen(strength=0.5)

The results:

1050Ti: 470 fps
1080Ti: 818 fps

That seems like a useful speed-up to me. ;)

Re: DGDecomb

Post by gonca »

That's a substantial speed increase.
In Canada they are still out of stock.

Re: DGDecomb

Post by admin »

I reckon your 1070 would weigh in at about 650 fps for that test.

BTW, while profiling the filters I was able to find a useful speed gain for DGSource(). About 20%. I'll slipstream it at some point.

Can you imagine having two of these, the first decoding and filtering, and the second encoding? Maybe it's time for me to look into the NV encoding samples. After I complete DGDecomb, of course.

Re: DGDecomb

Post by gonca »

2 GTX 1080 Ti cards in one system
That might require a 1000 watt power supply
But the CPU could be cut down

Edit
Now that I think about it, the CPU can't be cut down.
Each card requires 16 lanes, for a total of 32 lanes.
You would need an extreme edition CPU to handle that many lanes.

Re: DGDecomb

Post by admin »

Now I have to find out what a lane is. :cry:
DAE avatar
Aleron Ives
Posts: 113
Joined: Fri May 31, 2013 8:36 pm

Re: DGDecomb

Post by Aleron Ives »

admin wrote:Now I have to find out what a lane is. :cry:
It's time for a driver's ed refresher course! :lol:

;)
DAE avatar
Sharc
Moose Approved
Posts: 231
Joined: Thu Sep 23, 2010 1:53 pm

Re: DGDecomb

Post by Sharc »

DGSource has the boolean parameter "use-pf".
Is the decision about the frame type (progressive/interlaced) based on flags of the source, or does the algorithm analyse the frames and decide for combed or non-combed frames for deinterlacing (similar to DGDecomb)?

How will this be with the new DGDecombNV? Is it a pure FM / Decimate function?

Re: DGDecomb

Post by gonca »

lane >>> PCIe 3.0 lane
Different CPUs support different numbers of lanes

Re: DGDecomb

Post by admin »

Sharc wrote:DGSource has the boolean parameter "use-pf".
Is the decision about the frame type (progressive/interlaced) based on flags of the source, or does the algorithm analyse the frames and decide for combed or non-combed frames for deinterlacing (similar to DGDecomb)?

How will this be with the new DGDecombNV? Is it a pure FM / Decimate function?
use_pf uses source stream flags; there is no analysis. It isn't something you can rely on generally, but if you know that your stream properly sets the progressive_frame flag, it can be useful to avoid deinterlacing frames marked as progressive. Even so, a frame may be coded and marked interlaced but have no motion and thus not need deinterlacing. So analysis is the fully correct general strategy to preserve frames whose content is progressive.

I am just now starting to think about postprocessing/deinterlacing for DGDecomb. It will perform analysis in some way (I'm trying to re-use the field matching metrics), and then any frames appearing combed after field matching will be deinterlaced with an as-yet-undetermined CUDA kernel. We will save one frame copy to the GPU because the frame is already there from the field matching. The primary goal is speed, not necessarily reproducing all classic Decomb behavior and options.
User avatar
hydra3333
Moose Approved
Posts: 210
Joined: Wed Oct 06, 2010 3:34 am
Contact:

Re: DGDecomb

Post by hydra3333 »

admin wrote:BTW, while profiling the filters I was able to find a useful speed gain for DGSource(). About 20%. I'll slipstream it at some point.
Thanks !
admin wrote:Maybe it's time for me to look into the NV encoding samples. After I complete DGDecomb, of course.
... and DGdeblockNV :) ? OK, this looks like it has options you may or may not consider useful; just a thought.
https://forum.videohelp.com/threads/370 ... U-encoding
http://rigaya34589.blog135.fc2.com/blog ... ry-17.html
https://github.com/rigaya/NVEnc
this also may or may not be of interest (a new deblocker) [link removed]

Re: DGDecomb

Post by admin »

Thanks for the references, hydra3333. Yes, deblocking is on the list and has a high priority.

Meanwhile, I have completed postprocessing for TelecideNV and I added a show option so you can see the metrics and decisions. I'll give y'all a test version after I fix up DGDecIM to use the Avisynth 2.6 interface for my friend Selur. Then comes DecimateNV followed by DeblockNV.

Re: DGDecomb

Post by hydra3333 »

Many thanks for your ongoing clarity of thought and action, and really useful code.
:bravo:

Re: DGDecomb

Post by Sharc »

admin wrote:
Sharc wrote:DGSource has the boolean parameter "use-pf".
Is the decision about the frame type (progressive/interlaced) based on flags of the source, or does the algorithm analyse the frames and decide for combed or non-combed frames for deinterlacing (similar to DGDecomb)?

How will this be with the new DGDecombNV? Is it a pure FM / Decimate function?
use_pf uses source stream flags; there is no analysis. It isn't something you can rely on generally, but if you know that your stream properly sets the progressive_frame flag, it can be useful to avoid deinterlacing frames marked as progressive. Even so, a frame may be coded and marked interlaced but have no motion and thus not need deinterlacing. So analysis is the fully correct general strategy to preserve frames whose content is progressive.

I am just now starting to think about postprocessing/deinterlacing for DGDecomb. It will perform analysis in some way (I'm trying to re-use the field matching metrics) and then any frames appearing combed after field matching will be deinterlaced with an as yet to-be-determined CUDA kernel. We will save one frame copy to the GPU because it is already there from the field matching. The primary goal is speed, and not necessarily to reproduce all classic Decomb behavior and options.
Thanks for clarification.
Yes, the flags can be misleading or confusing. For example, I have seen different practices for 3:2 hard telecined material: in one case the 3 progressive frames were flagged progressive_frame=true (progressive) and the 2 combed frames progressive_frame=false (interlaced), whereas in another case all 5 frames were flagged progressive_frame=false (interlaced)..... :o

Re: DGDecomb

Post by admin »

I got DecimateNV working. Now I have a full CUDA IVTC solution. I need to add licensing to DecimateNV and make a few optimizations, and then I'll give y'all a beta of both TelecideNV and DecimateNV.

There's something very interesting I learned about CUDA. Memory transfers are so expensive relative to kernel processing that it often pays off to do some suboptimal stuff in the kernel in order to save on memory transferred back to the host. For example, suppose you have a kernel that runs one thread per pixel and calculates a difference between two frames. Then you would transfer a full frame-sized array of differences back to the host and sum them on the host. But if you (say) allocated one thread per 16 pixels (sacrificing some parallelism) and calculated all sixteen differences and summed them in the kernel, then the memory transfer back to the host is 1/16 of a full frame. Not only that, but the summation on the host is faster.

There is a sweet spot for performance that has to be determined empirically. This is one of the optimizations I mentioned above. I want to complete those before giving any more timings. Trust me, it blows away classic Decomb!
DAE avatar
jpsdr
Moose Approved
Posts: 183
Joined: Tue Sep 21, 2010 4:16 am

Re: DGDecomb

Post by jpsdr »

I don't know exactly how your IVTC works, but I'll share the approach and results of the VDub IVTC filter I made almost 15 years ago... Get the code on my GitHub if you're curious.
At the time, all the automatic IVTC filters worked by detecting interlaced frames through field correlation values (differences between fields), and taking as the IVTC pair the two fields with the highest correlation. But I wasn't satisfied with the results.
I chose a different approach. The rough idea is this: do the IVTC on the frame, compute the correlation, and consider the frame telecined if the correlation of the IVTC'd frame drops below the correlation of the original frame by more than some threshold. I also used one of your ideas from your smart deinterlacer: computing the correlation only on areas detected as interlaced by a simple threshold test, as you do in your smart deinterlacer. If I remember correctly, the interlace detection map is built only from the original frame, and those same areas are used on both the original and telecined frames to compute the correlation.
And I was using two computations: correlation over the whole frame, and correlation only over the detected areas.

Do whatever you want with these two cents' worth of thoughts... ;)

Re: DGDecomb

Post by admin »

Thanks for bringing this to my attention, jpsdr.

Do you have a clip that shows your method performing better than the traditional approach? And how does your method compare in performance? From what you said it sounds like it would be way slower.

I had a look at your code. It is so complex and extensive, with only very limited commenting, that I have no hope of figuring out what you are doing. And to be honest, your previous post is rather unclear. Finally, telling me about this after I complete my implementation is a bit perverse. ;)

Re: DGDecomb

Post by jpsdr »

Sorry, no bad/perverse intention... :?
I'll PM you later an FTP account with a clip I use for my tests, but I don't remember whether my filter performs better on that specific clip. I think I remember it performing better than the automatic IVTC included in VDub...

If my post wasn't clear, I'll try again to explain the idea:
N.o: odd field of frame N (bottom field, lines 1,3,5,...)
N.e: even field of frame N (top field, lines 0,2,4,...)
- Compute correlation data between N.o and N.e: value A.
- Compute correlation data between N.e and N-1.o: value B.
Two correlation values are computed for each frame:
One computed from the whole frame.
One computed only on zones detected as interlaced, using the same idea/method as your smart deinterlacer. The map of interlaced zones is built from the original frame, and then both correlation values (call them A' and B') are computed only on the zones from that map.
As the filter is old, it was made at a time when VDub filters worked only on RGB32 data, so all computations are done on RGB data.
To remove noise from the correlation data and greatly increase accuracy, the 2 LSBs are stripped (from the RGB data).
If A' and B' are "good": if B' << A', the frame is a telecined frame, otherwise not. If A' and B' are "not good", A and B are used instead.
To validate a telecine pattern, the two detected frames must be contiguous, except... if a scene change is detected.
The filter has a pipeline structure; it computes data only on the current frame. That means it works only when run through the whole file, display doesn't work, and that's why it has no preview function.
Another thing: when my program finds an IVTC pattern, it stays locked onto it unless a "strong" new pattern is detected. That's typically for anime where a character is talking without moving, and the only motion in the picture is a small mouth. That's another thing preventing any "preview" from working: history/the past has an effect on the present.
These are the rough ideas.
If you want to play with it, put the following filters into the VDub filter chain:
IVTC (with default settings)
Remove frame (with default settings)

then run the process and look at the saved result file.

Re: DGDecomb

Post by admin »

Thanks, jpsdr. Looking forward to your IVTC torture clip(s).

As I mentioned, for a CUDA implementation my focus is on speed and your algorithm would be both difficult to implement and rather slow for me. I notice you did not comment on its speed, even though I specifically asked about it. Nevertheless, thank you for the further explanation.

Re: DGDecomb

Post by admin »

Folks, please do some testing with this beta of DGTelecide/DGDecimate:

http://rationalqm.us/misc/Beta.rar

If no outright bugs are found I'll slipstream it. Remember, this has no bells and whistles. Based on results, I'll enhance it as needed.
DAE avatar
Sharc
Moose Approved
Posts: 231
Joined: Thu Sep 23, 2010 1:53 pm

Re: DGDecomb

Post by Sharc »

First quick tests with DGTelecide():
I am getting strong residual combing even though show=true reports that the frame has been deinterlaced.
I don't get such combing with the classic telecide().

What is the valid range of pthresh? 0.0 to 1.0? (The documentation calls it "strength", btw.)