I've been a busy bee for the last three days and have a lot to report.
The CUDA adaptive deinterlacer [equivalent to Decomb's FieldDeinterlace()] has been successfully implemented. At first, I was disappointed in the performance (all results were measured on a 1080 Ti). Consider this script [I'll explain why the deinterlacer lives in DGTelecide() in a moment; for now, just note that mode 2 does the same thing as FieldDeinterlace(full=true)]:
loadplugin("dgdecodenv.dll")
loadplugin("decomb.dll")
dgsource("lain.dgi").loop()
dgtelecide(mode=2)
#fielddeinterlace(full=true)
720x480
DGTelecide() 1092 fps
FieldDeinterlace() 645 fps
OK, a 1.7 times speedup is welcome but not eye-popping. Then I discovered that the benefits of CUDA scale with the frame size. It's not surprising when you think about it: a larger frame gives the GPU more parallel work per frame. Here are the results for larger frame sizes:
1920x1080
DGTelecide() 345 fps
FieldDeinterlace() 143 fps
A 2.4 times speedup. That's starting to impress.
3840x2160
DGTelecide() 99 fps
FieldDeinterlace() 19 fps
A 5.2 times speedup. Now we're cooking with gas! Importantly, for 24 fps 4K content the CPU solution (19 fps) is slower than real time, while the CUDA solution (99 fps) runs at about 4 times real time. Now that is eye-popping.
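For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the speedups and the real-time ratios from the fps figures above. The function and variable names are just illustrative, and the 24 fps content rate is the one assumed in the 4K comparison.

```python
# Benchmark figures copied from the results above:
# (frame size, DGTelecide fps, FieldDeinterlace fps)
RESULTS = [
    ("720x480", 1092, 645),
    ("1920x1080", 345, 143),
    ("3840x2160", 99, 19),
]

CONTENT_FPS = 24  # playback rate assumed for the real-time comparison


def summarize(results, content_fps=CONTENT_FPS):
    """Return speedup and real-time ratios for each frame size."""
    rows = []
    for frame_size, gpu_fps, cpu_fps in results:
        rows.append({
            "frame_size": frame_size,
            # How many times faster the CUDA filter is than the CPU filter.
            "speedup": round(gpu_fps / cpu_fps, 1),
            # Throughput relative to the playback rate (1.0 = real time).
            "gpu_realtime": round(gpu_fps / content_fps, 1),
            "cpu_realtime": round(cpu_fps / content_fps, 1),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(row["frame_size"], row["speedup"],
              row["gpu_realtime"], row["cpu_realtime"])
```

A cpu_realtime value below 1.0 means the CPU filter cannot keep up with playback at that frame size.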
Now I'll tell you about the new packaging concept. I knew that the previous first cut of CUDA DGTelecide() was not using an adaptive deinterlace for post-processed frames. Obviously that is suboptimal because the progressive parts of the frame get degraded. So I had to retrofit the new CUDA deinterlacer into the DGTelecide() postprocessor. But if that code is going to be in DGTelecide() anyway, why bother having a separate deinterlacing filter? So I redesigned DGTelecide() to support four modes, selected via the mode parameter:
DGTelecide(clip,mode,pthresh,dthresh,blend,map,show,device)
Mode 0: Field matching without adaptive deinterlacing of frames with a bad match (no postprocessing).
Mode 1: Field matching with adaptive deinterlacing of frames with a bad match (postprocessing).
Mode 2: Adaptive deinterlacing of all frames (unconditional deinterlacing).
Mode 3: Adaptive deinterlacing of frames determined to be combed (conditional deinterlacing).
Just as in FieldDeinterlace(), separate thresholds are used for deciding whether a frame is combed and for deinterlacing combed frames. Users can decide for themselves whether mode 3 really gives any benefit over mode 2 with a good dthresh, given the extra overhead of checking whether a frame is combed. The 'show' information overlay displays the current mode and is tailored to that mode.
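To make the four modes concrete, here is a rough Python sketch of the per-frame decision they imply. This is not the filter's code (the real thing is CUDA), and the names Mode, plan_frame, match_is_bad and frame_is_combed are made up for illustration. How a bad match or a combed frame is actually detected is left out entirely.

```python
from enum import IntEnum


class Mode(IntEnum):
    # Names are hypothetical; the numeric values follow the mode list above.
    MATCH_ONLY = 0          # field matching, no postprocessing
    MATCH_POSTPROCESS = 1   # field matching, deinterlace bad matches
    DEINTERLACE_ALL = 2     # unconditional deinterlacing
    DEINTERLACE_COMBED = 3  # conditional deinterlacing


def plan_frame(mode, match_is_bad=False, frame_is_combed=False):
    """Return (do_field_match, do_deinterlace) for one frame.

    match_is_bad and frame_is_combed stand in for the filter's own
    detection results, which are not modeled here.
    """
    mode = Mode(mode)  # raises ValueError for anything outside 0..3
    if mode is Mode.MATCH_ONLY:
        return True, False
    if mode is Mode.MATCH_POSTPROCESS:
        return True, match_is_bad
    if mode is Mode.DEINTERLACE_ALL:
        return False, True
    return False, frame_is_combed


if __name__ == "__main__":
    print(plan_frame(1, match_is_bad=True))
    print(plan_frame(3, frame_is_combed=False))
```

The sketch shows where the mode 3 overhead comes from: only modes 1 and 3 need a detection result before deciding whether to deinterlace, while mode 2 skips the check and always deinterlaces.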
I just have to write the documentation and then I will slipstream it. I am very happy with how it turned out.
I also want to post some technical material about everything I have discovered about optimizing CUDA for filtering. There are some interesting and surprising things. One of them is that Avisynth itself limits performance because it does not allow frame buffers to be allocated by CUDA rather than by the OS. Host memory allocated by CUDA (pinned, i.e. page-locked) can be transferred to and from the GPU much faster than ordinary OS-allocated (pageable) memory. Perhaps Avisynth could be modified to accept a user-supplied allocator. More on all that later.