CUDASynth

Post by **Rocky** » Tue Feb 13, 2024 10:05 am

I knew about that after a former member suggested it should be done to me. Guess why he is former.

Post by **hydra3333** » Tue Feb 13, 2024 6:05 pm

Please forgive me if I enquire on the status of and the ins and outs of the much anticipated cudasynth/dgdenoise et al

I feel like I'm an excited little kid again ... are we there yet ?

Post by **Rocky** » Tue Feb 13, 2024 8:04 pm

Happy to oblige. The NLM part is fine and isn't changing from test3.

We made some good progress with the CUDA temporal filter. As mentioned we prioritize speed but of course still require that the filtering is useful and doesn't create artifacts. Going to a full block-matching kind of motion compensation is not consistent with that, because the speed would not be satisfying. Then the question is whether there is a way to achieve our goal, i.e., to have good to great filtering at high speed.

Those of you that were around in the early days of desktop image processing starting with VirtualDub filters may remember a filter called TemporalCleaner by Jim Casaburi. The concept was simple:

---
For each pixel in the current frame, if the difference between the previous frame's corresponding pixel and the current pixel is below a threshold then replace the current pixel by the average of the previous and current pixels, otherwise keep the current pixel.
---

https://avisynth.org.ru/docs/english/ex ... leaner.htm
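
For readers who want to see the quoted rule spelled out, here is a minimal Python sketch. It is not Jim Casaburi's code: the function name `temporal_cleaner`, the list-of-rows frame layout, and the rounded integer average are all assumptions made for illustration.

```python
def temporal_cleaner(prev, cur, threshold):
    """Per-pixel rule from the quote above.

    prev, cur: single-plane frames as lists of rows of ints (assumed layout).
    threshold: maximum difference that is still treated as noise.
    """
    out = []
    for prev_row, cur_row in zip(prev, cur):
        out_row = []
        for p, c in zip(prev_row, cur_row):
            if abs(p - c) < threshold:
                # Small difference: average previous and current pixel
                # (rounding to nearest is an assumption of this sketch).
                out_row.append((p + c + 1) // 2)
            else:
                # Large difference: keep the current pixel untouched.
                out_row.append(c)
        out.append(out_row)
    return out


if __name__ == "__main__":
    prev = [[100, 100, 100]]
    cur = [[104, 100, 180]]
    print(temporal_cleaner(prev, cur, threshold=10))  # [[102, 100, 180]]
```

Because the decision looks at one pixel only, raising `threshold` far enough to catch strong noise also lets genuine motion get averaged.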

This sounds naive by modern standards but was a useful advance at the time, when desktop image processing was just getting started. While it sometimes improved compressibility for MPEG2, the idea performs poorly for denoising, with the threshold being critical. Nevertheless, it is a reasonable fundamental approach that can be easily improved.

Let's consider why it performed poorly. If the threshold was too large, then you would get visible motion blur in the filtered video. If it was too small, the blur would be removed but only noise amplitudes less than the threshold would be filtered. So you couldn't have both good filtering and no blurring.

Attempts were made at the time to solve this by doing scene change detection. At a scene change the original pixels would be used. While this improved things by suppressing blurring across scene changes, it did nothing for motion not involving the entire frame, such as a moving object in an otherwise static frame. Learn to love motion blur? No!

So, our insight was to in effect apply scene change detection but only considering a window around the current pixel. If the neighborhood is changing enough, determined by a threshold, then the original pixel is used, otherwise it is filtered. The problem with the original TemporalCleaner was that the "scene change detection window" was just the current pixel, leading to the conflict between filtering and blurring.

We implemented this idea in our CUDA filter. We used a 5x5 window around the current pixel, and accumulated a change metric between the current and previous frames. If the metric is under a threshold, the pixel is filtered, otherwise not. In order to increase the filtering effect, when below threshold, we averaged the current pixel with the two previous frames.
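
A rough CPU-side sketch of the windowed idea, in plain Python rather than CUDA, may make the difference from the per-pixel rule clearer. This is not the actual filter code. The post does not say which change metric is used, so the sum of absolute differences here is an assumption, as are the clamped borders, the rounded three-frame average, and every name in the block.

```python
def windowed_temporal_filter(prev2, prev1, cur, threshold, radius=2):
    """Filter a pixel only if its neighborhood is static.

    prev2, prev1, cur: single-plane frames as lists of rows of ints.
    radius=2 gives the 5x5 window mentioned above.
    Returns (filtered_frame, motion_map); motion_map[y][x] is True
    where the pixel was left unfiltered.
    """
    height, width = len(cur), len(cur[0])
    out = [row[:] for row in cur]
    motion = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Assumed change metric: sum of absolute differences between
            # the current and previous frame over the window.
            metric = 0
            for dy in range(-radius, radius + 1):
                yy = min(max(y + dy, 0), height - 1)  # clamp at borders
                for dx in range(-radius, radius + 1):
                    xx = min(max(x + dx, 0), width - 1)
                    metric += abs(cur[yy][xx] - prev1[yy][xx])
            if metric < threshold:
                # Static neighborhood: average with the two previous frames.
                total = cur[y][x] + prev1[y][x] + prev2[y][x]
                out[y][x] = (total + 1) // 3
            else:
                # Changing neighborhood: keep the original pixel.
                motion[y][x] = True
    return out, motion


if __name__ == "__main__":
    flat = [[100] * 8 for _ in range(8)]
    noisy = [row[:] for row in flat]
    noisy[3][3] = 106  # isolated noise on a static background
    out, motion = windowed_temporal_filter(flat, flat, noisy, threshold=50)
    print(out[3][3], motion[3][3])  # 102 False
```

An isolated noisy pixel barely moves the window metric, so it gets filtered, while a moving object changes many pixels in the window at once and pushes the metric over the threshold. That is what lets the threshold be set high enough for noise without blurring motion.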

This algorithm performs surprisingly well. We also implemented an option to show the motion map to allow for tweaking of the user-configurable threshold. If all this reminds you of our early work on Smart Deinterlacer, it means you are an old codger with a great memory. It still remains to determine the best window size for motion detection, but 5x5 works very well. The algorithm as a whole delivers a good tradeoff between quality and speed.

I hope to be able to give you a test4 build with this tomorrow. BTW, even though we removed the SIMD temporal median algorithm, we are going to keep the multiple-CPU DLLs, as there are performance gains from the SIMD for other things that the compiler applies.

Post by **Curly** » Tue Feb 13, 2024 8:21 pm

sounds gud Rock way 2 go

quik report
i won big at the rulette table 2nite
the ladies are swarming me
more later knerk

Post by **hydra3333** » Wed Feb 14, 2024 12:51 am

Cool !

Feel free to ignore:

Speaking of roulette tables, just wondering ... over at d9, cudasynth post p=1997434#post1997434 says in part

if CUDA will provide ME data from hardware ASIC and you can use it to make motion compensated frames you can use temporal median as final output stage (or simple weighted averaging as in MDegrain). As I read NVIDIA also provide some API for CUDA-programmers to get ME data from MPEG encoder ASIC where available. Other implementation is for DX12.
See https://docs.nvidia.com/video-techno...ide/index.html

Motion Estimation Only Mode

NVENC can be used as a hardware accelerator to perform motion search and generate motion vectors and mode information. The resulting motion vectors or mode decisions can be used, for example, in motion compensated filtering

I didn't readily spot further commentary, so :
not sure if that is relevant or useful, for example if possible to use the nvenc asic for ME then is the relative cost of transiting the PCIE "negligible enough" for reasonable speed vs CPU-only ME, and even if obtainable could the ME vectors be easily used or not in the cudasynth architecture, and would it yield superior enough results for any potential effort to implement.

edit: ps: if there's an ffmpeg nvenc encode going on at the same time (almost certain) then could that interfere with the asic concurrently performing the ME ?

Just wondering.

Dear Curly, girlfriends can be expensive; just a thought, perhaps choose ones with less makeup, there's a reasonable chance they'll still look OK when you wake up. Natural beauties like Britney.

Post by **Rocky** » Wed Feb 14, 2024 5:27 am

I can't answer for Curly but regarding the ME stuff, I live in the real world. Just because someone (not you, hydra3333) expatiates about something in a manner designed to demonstrate purported expertise or preeminence in a domain, it doesn't necessarily mean that the ideas are sound or practical. Let that person show the proof of concept.

Post by **Curly** » Wed Feb 14, 2024 5:33 am

gud advice hydra3333 amigo mio thx
gonna try blackjack 2nite

Post by **hydra3333** » Wed Feb 14, 2024 6:43 am

OK and thanks for clarifying, current course maintained, cool.

If you mean me it's zero friends and I'm ok with that, being on the lower end of the bell curve for many things

I've had my fair share of "well that didn't work, it seemed like a good idea at the time to have a go with, let's try a different tack"

Just happy you're doing cool stuff !

Cheers

Post by **Curly** » Wed Feb 14, 2024 8:55 am

Rocky was not referring to you, m8! I hope we are friends.

Post by **Rocky** » Thu Feb 15, 2024 6:25 am

Some clarifications on performance:

Previously we had searchw 5/7/9. Now we have good/better/best. The good/better/best are scaled higher than searchw. That means "good" is already equivalent to searchw = 9. So for accurate comparison to the previous searchw = 9 you have to choose "good". Settings "better" and "best" provide additional high-quality settings that we did not have before.

You also have to turn off temporal denoising in the new version to compare to the previous version, because the previous version did not have temporal denoising.

Another thing not to forget is that the more filters in the chain, the more relative speed improvement there will be, as more and more expensive PCIe transfers are avoided. This is why I intend to add more filters, even those supplied by third parties. If not too difficult, I'll even port them to CUDA. The case with one filter doesn't showcase the CUDASynth philosophy, but it already shows significant gains. The filter chain with HDR to SDR and denoising starts to get quite impressive, for example.

Post by **Rocky** » Thu Feb 15, 2024 9:02 am

Here is test4 adding the new temporal denoiser. Refer to the Notes.txt file for details.

Note that the standalone DGDenoise() filter has not yet been updated with the new filter.

https://rationalqm.us/misc/DGDecodeNV_test4.zip

I think I'll add DGSharpen() next.

Post by **hydra3333** » Thu Feb 15, 2024 9:30 pm

Thanks. Creating test scripts now to try it out. dgsharpen, cool ! dgdeblock after that ?

Post by **hydra3333** » Fri Feb 16, 2024 2:46 am

Hi, I must be doing something wrong, advice would be welcomed.

Using the same source and almost the same .vpy script, I seem to be getting half the speed from the new cudasynth compared to pre-cudasynth.
109 fps vs 52 fps.
Perhaps it is because I specified dn_quality="best" ?

I have a 3900X with 32 GB, and an 8 GB 2060-Super video card.
https://www.msi.com/Graphics-Card/GeFor ... cification

Here is a log showing the .vpy scripts etc, in the hope you may find time to consider it.