CUDASynth

Post by **admin** » Wed Sep 19, 2018 4:14 pm

gonca wrote: ↑
Wed Sep 19, 2018 3:37 pm
(note: here goes the budget)
Get the 32 core CPU and 2 GPUs

That sounds good. I don't have a budget. I'm getting older so I'm going to blow it all on hardware and travel adventures.

BestBuy went to preorder on the 2080 Ti but by the time I saw the alert it was all taken. Nobody wants these things.

Thanks for making this thread go into flames (the icon on the forum list).

Guest · Post by **Guest** » Wed Sep 19, 2018 4:28 pm

Thanks for making this thread go into flames

That is how I learned about the hardware side of things.
Making things go up in flames

Post by **admin** » Wed Sep 19, 2018 5:03 pm

Wow, parallel ATA connectors.

What's the CPU, a 386?

The first processor I coded for was an 8080. The OS was CP/M. It had a 100K floppy drive that raised and lowered the head on each sector access (the infamous head-loading solenoid), causing a pleasing bang-bang-bang that the neighbors loved. I clearly remember tossing that system (Heathkit H8) in the dumpster when I upgraded to a 386-based system.

Guest · Post by **Guest** » Wed Sep 19, 2018 5:38 pm

8080 and 386
Those were the days, never to be seen again, thank whichever supreme deity for that small favor
Gee-sh, now I am getting politically correct

Post by **admin** » Wed Sep 19, 2018 8:26 pm

If using the word "God" is politically incorrect, then we are done. Praise the Lord!

Post by **admin** » Sun Sep 30, 2018 11:31 am

Status report...

CUDASynth-enabling of DGDenoise is complete. It was an involved thing because I had to code and test all combinations of: fulldepth=true/false [times] chroma=true/false [times] fsrc=cpu/gpu0/gpu1 [times] fdst= cpu/gpu0/gpu1. That is a total of 2x2x3x3 = 36 combinations, each with its unique mix of CUDA kernel launches and pitched 2D memcpy's. It all seems to be working, thankfully. Don't ever accuse me of not being persistent.

There are four things to do now:

1) Ensure all filters still work fine when the source filter does not declare GPU buffers and a lock, i.e., when the source filter is not CUDASynth-enabled. This is context creation/management stuff. I also want to add an integer pipeline ID that can be specified in the script, thereby allowing for multiple pipelines to run simultaneously.

2) Thorough code review and any needed refactoring.

3) Vapoursynth native support.

4) Documentation and code sample (open source).

I'm going to hold off on 3) for now, do the others and then give y'all the new toy to play around with.

Have to open source the CUDA filter framework to recruit others to develop compatible filters. The core filters should be CUDASynth-enabled as well.

hydra3333 · Post by **hydra3333** » Mon Oct 01, 2018 5:31 am

Thank you very much.

Post by **admin** » Mon Oct 01, 2018 8:44 pm

You're welcome, hydra3333!

Item 1) above is finished, but without the pipeline ID. That is going to need some deeper thinking, as it must allow any filter to create ping-pong buffers, etc. Let's hold off on that for now. So tomorrow, I want to make the documentation and source code example and get it into your hands. Code review can be done in parallel with your testing.

Consider this script:

dgsource("LG Chess 4K Demo.dgi",fulldepth=true,fdst="gpu0")
dghdrtosdr(light=250,fsrc="gpu0",fdst="gpu1",fulldepth=true)
dgdenoise(fsrc="gpu1",fdst="gpu0",chroma=true)
dgsharpen(fsrc="gpu0")
prefetch(2)

The source is 3840x2160 59.94 HDR10. The resulting frame rate is 95 fps, 1.6 x real-time. If the pipeline is not used (all fsrc and fdst set to "cpu", then the frame rate is 30 fps, half of real-time. So for this real-world example CUDASynth is increasing performance by over 300%. Gotta love it, am I wrong?

Guest · Post by **Guest** » Tue Oct 02, 2018 5:00 am

Gotta love it, am I wrong?

You are right

Post by **admin** » Tue Oct 02, 2018 11:48 am

The source code example filter is done. I used DGSharpen (debundled from DGDecodeNV) and removed the licensing and CUDA code encryption. Now I just have to write some documentation.

Post by **admin** » Wed Oct 03, 2018 12:06 pm

I am happy to announce CUDASynth 0.1:

http://rationalqm.us/misc/CUDASynth_0.1.rar

Testing and feedback will be appreciated.

Guest · Post by **Guest** » Wed Oct 03, 2018 3:48 pm

Got a couple of things to finish off and then the testing will begin

Guest · Post by **Guest** » Wed Oct 03, 2018 5:43 pm

Speed looks good

cudasynth.log: (719.6 KiB) Downloaded 644 times

test.log: (244.36 KiB) Downloaded 631 times

Can't see the results of the cudasynth file in VDub or MPC-HC so I will run quick encode and check

Guest · Post by **Guest** » Wed Oct 03, 2018 6:34 pm

No visible issues that I could see on the encoded file
CPU usage down, and surprisingly so is GPU and VPU load.
Speed is right up there though
Looks good

Post by **admin** » Wed Oct 03, 2018 8:24 pm

Thanks for the test results, gonca. Now to get some critical mass we need to make more CUDASynth-enabled filters. Feel free to suggest possibilities. If there are any good open source ones it would not be hard to port them.

hydra3333 · Post by **hydra3333** » Thu Oct 04, 2018 12:56 am

testing, sorry, per the CUDASynth.txt "* Vapoursynth is not yet supported" unfortunately I no longer have avisynth.

Filters possibilities ? You have current functionality for
- decode / deinterlace / crop / resize
- denoise
- sharpen
- HDR10 to SDR

That's about all I use, other than maybe an occasional
- deblock for low quality TV broadcasts (Aus telly can be bitrate starved)
- video stabilisation, rarely, more for the home videos that one must share including vhs type captures
- croprel, addborders, rarely, more for the home videos that one must share including vhs type captures
- HDRAGC or equivalent, rarely, more for the home videos that one must share including vhs type captures
- despot, very rarely for some vhs type captures

- mdegrain, very rarely for some vhs captures etc
- anti-alias ?sangnom, almost never nowadays
- QTGMC deinterlacing, almost never nowadays

Guest · Post by **Guest** » Thu Oct 04, 2018 4:55 am

DGTelecide
DGDecimate
DGPQtoHLG

I tend to use the DG filters more than any other

DJATOM · Post by **DJATOM** » Fri Oct 05, 2018 7:49 am

I'd like to have nnedi3/eedi3 CUDA versions. They are cpu consuming, and offloading to gpu will help a lot. Currently we have nnedi3 openCL (full rewrite to use gpu only) and eedi3 openCL (partial rewrite to use gpu for calculating connection costs), so the second one still consuming cpu for main processing.
If you want to look at them, eedi3 | nnedi3.

Post by **admin** » Fri Oct 05, 2018 8:39 am

Thank you, gentlemen, for the thoughts and links. eedi3 and nnedi3 look like they might be fun to try. First, though, I need to get serious about DGIndex MKV support.

@hydra3333

For Vapoursynth, can't you use the avscompat layer? I can add native support later, although I have to confess that the required code duplication is a royal pain in the you-know-what.

Guest · Post by **Guest** » Mon Oct 08, 2018 2:03 pm

Masktools 2 seems to be popular
https://github.com/pinterf/masktools/releases

DJATOM · Post by **DJATOM** » Mon Oct 08, 2018 2:14 pm

avscompat layer seems to be working, but speed is near the same as in "cpu" mode.
But still relatively fast - near 65 fps (default settings, DGSource -> DGDenoise -> DGSharpen) and 105 fps (default settings, DGSource -> DGDenoise).
Hardware: GTX 750, i5-4670k

Post by **admin** » Mon Oct 08, 2018 4:04 pm

Thanks for the results, DJ. Just out of interest I'd like to see a benchmark of a script for these three (no CUDASynth):

Avisynth+
Vapoursynth native
Vapoursynth avscompat

I seem to recall when doing some testing recently both Vapoursynth ways fell short compared to Avisynth+, but I haven't tried it recently.

DJATOM · Post by **DJATOM** » Mon Oct 08, 2018 4:37 pm

Ok, for now I've checked same script

ClearAutoloadDirs()
LoadPlugin("C:\322\x64\DGDecodeNV.dll")
DGSource("J:\Darling6\STREAM\EP16.dgi")
DGDenoise()
DGSharpen()
trim(0,6000)

and

ClearAutoloadDirs()
LoadPlugin("C:\322\x64\DGDecodeNV.dll")
DGSource("J:\Darling6\STREAM\EP16.dgi",fdst="gpu0")
DGDenoise(fsrc="gpu0",fdst="gpu0")
DGSharpen(fsrc="gpu0",fdst="cpu")
trim(0,6000)

So CUDASynth works in the native avs+.

C:\322>avs2yuv64 EP16.avs -o NUL
Avs2YUV 0.28
Script file: EP16.avs
Resolution: 1920x1080
Frames per sec: 24000/1001 (23.976)
Total frames: 6001
CSP: YV12
Progress Frames FPS Elapsed Remain
[100.0%] 6000/6001 86.72 0:01:09 0:00:00
Started: Tue Oct 9 00:24:34 2018
Finished: Tue Oct 9 00:25:43 2018
Elapsed: 0:01:09

C:\322>avs2yuv64 EP16.avs -o NUL
Avs2YUV 0.28
Script file: EP16.avs
Resolution: 1920x1080
Frames per sec: 24000/1001 (23.976)
Total frames: 6001
CSP: YV12
Progress Frames FPS Elapsed Remain
[100.0%] 6000/6001 102.32 0:00:58 0:00:00
Started: Tue Oct 9 00:26:26 2018
Finished: Tue Oct 9 00:27:25 2018
Elapsed: 0:00:59

I'll measure avscompat (without and with fsrc/fdst) soon, need to close browser to have more GPU RAM for testing.
And as there are no native Vapoursynth versions for DGDenoise/DGSharpen, should I check them in avscompat and DGSource in the native modes?

Guest · Post by **Guest** » Mon Oct 08, 2018 4:46 pm

LoadPlugin("C:/Program Files (Portable)/dgdecnv/x64 Binaries/DGDecodeNV.dll")
DGSource("I:\test.dgi", fieldop=0, fulldepth=True)
ConvertBits(10)

FPS 92.3

import vapoursynth as vs
core = vs.get_core()
core.std.LoadPlugin("C:/Program Files (Portable)/dgdecnv/x64 Binaries/DGDecodeNV.dll")
clip = core.dgdecodenv.DGSource(r'I:\test.dgi', fieldop=0, fulldepth=True)
clip = core.resize.Point(clip, format=vs.YUV420P10)
clip.set_output()

FPS 133.0

import vapoursynth as vs
core = vs.get_core()
core.avs.LoadPlugin("C:/Program Files (Portable)/dgdecnv/x64 Binaries/DGDecodeNV.dll")
clip = core.avs.DGSource("I:/test.dgi", fieldop=0, fulldepth=True)
clip = core.resize.Point(clip, format=vs.YUV420P10)
clip.set_output()

FPS 129.8
Source was a 4K clip

Edit
Avs compatability is 2x faster with cudasynth than with the regular version, 4K sample with DGHDRtoSDR (default) and DGSharpen (default)

DJATOM · Post by **DJATOM** » Mon Oct 08, 2018 4:52 pm

cudasynth in avscompat:

import vapoursynth as vs
core = vs.get_core()

core.avs.LoadPlugin(r'C:\322\x64\DGDecodeNV.dll')

clip = core.avs.DGSource(r'J:\Darling6\STREAM\EP16.dgi', fdst="gpu0")
clip = core.avs.DGDenoise(clip, fsrc="gpu0", fdst="gpu0")
clip = core.avs.DGSharpen(clip, fsrc="gpu0", fdst="cpu")
clip = core.std.Trim(clip, 0, 6000)
clip.set_output()

no cudasynth in avscompat:

import vapoursynth as vs
core = vs.get_core()

core.avs.LoadPlugin(r'C:\322\x64\DGDecodeNV.dll')

clip = core.avs.DGSource(r'J:\Darling6\STREAM\EP16.dgi")
clip = core.avs.DGDenoise(clip)
clip = core.avs.DGSharpen(clip)
clip = core.std.Trim(clip, 0, 6000)
clip.set_output()

native DGSource + avscompat DGDenoise and DGSharpen:

import vapoursynth as vs
core = vs.get_core()

core.std.LoadPlugin(r'C:\322\x64\DGDecodeNV.dll')
core.avs.LoadPlugin(r'C:\322\x64\DGDecodeNV.dll')

clip = core.dgdecodenv.DGSource(r'J:\Darling6\STREAM\EP16.dgi')
clip = core.avs.DGDenoise(clip)
clip = core.avs.DGSharpen(clip)
clip = core.std.Trim(clip, 0, 6000)
clip.set_output()

I don't know why we have such results, at least I tried to compare with minimum differences in the resource usage (with closed browser, etc).

CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth

Re: CUDASynth