Welcome to my journal!

 So what's this all about?

This is my journal where I expound about anything and everything that interests me at the time. It includes descriptions of my thinking and work in progress, rants, raves, and other random musings. The formatting is shamelessly borrowed from Avery Lee's VirtualDub site (imitation is the sincerest form of flattery!). My only hope for this page is that it won't bore you (too badly).

Note that I add new entries at the top, so if you are an infrequent visitor, you'll probably want to scroll down a little to get into the flow.

[All contents are Copyright (c) 2003, 2004 Donald A. Graft, All Rights Reserved.]

 8-10-2003: KernelDeint(): A Fixed Spatial/Temporal Filter for Deinterlacing

Time flies when you're having fun. And I am having fun. Recently, Colin Browell sent me an email calling my attention to an old US patent that described a deinterlacing method based on a fixed spatial/temporal filter. To be honest, while I found it interesting, I wasn't very optimistic about it being useful: it's an old idea that you don't see in use today, and if it were worth anything, it would already be well known and widely used. Nevertheless, I decided to crank out a quick prototype to see how it looked. Suffice it to say I was surprised and delighted! To see why, let me show you some frame grabs. Then I'll discuss the theory and a refinement of the basic idea. I polished the prototype into a finished Avisynth filter called KernelDeint(), and I encourage you to download it and use it for your interlaced video.

Following is a grab of a test clip deinterlaced using field discarding and bicubic interpolation to restore the original height. This acts as a sort of standard for the resolution we can expect with simple field discarding.


 Field Discarding

And following is a grab of the same frame processed with KernelDeint(). The threshold has been set to 0 to force the entire frame to be deinterlaced (and not just "moving" areas), in order to compare it fairly to the previous grab.


 KernelDeint()

Cognoscenti will have no problem seeing a noticeable improvement in retained resolution. For those with less sensitive eyes, following is a frame showing the difference between the images (made using Avisynth's Subtract() filter). This image allows you to visualize the extra detail retained by KernelDeint().


 Difference

Obviously, this spatial/temporal approach has something to offer compared to simple spatial interpolation! But how does it work? If you read the patent you can gain a complete understanding, but I will state the main point. Information from the previous and following fields is included in the calculations made for interpolating the missing field. This information is band-passed such that low frequency (DC) components are excluded, and high frequency components (including those that would cause combing) are excluded. But the vertical frequencies that are included are enough to improve the overall resolution while still removing the combing. Following is the small kernel used to generate the frame grabs.


 Small Kernel

Even more dramatic results can be obtained by using the larger kernel described in the patent. It performs some sharpening, however, that some may not want. But often people add a sharpener after field discarding, so there is a precedent for having a sharpening kernel, and with the large kernel you don't need a separate sharpening filter.
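
To make the kernel idea concrete, here is a rough C sketch of how one pixel of the missing line might be synthesized. The tap values below are illustrative placeholders chosen only to show the band-pass structure (the temporal taps sum to zero, so DC from the adjacent fields is excluded); they are not the actual coefficients from the patent or from KernelDeint():

/* Sketch: synthesize one pixel of the missing line using a vertical
   spatial/temporal kernel.  The frame buffer holds both fields weaved
   together; line y is missing from the kept field, lines y-1/y+1 belong
   to the kept field, and prev/next supply the opposite-parity lines.
   Border handling is omitted.  Tap values are illustrative only. */
static unsigned char kernel_pixel(const unsigned char *cur,
                                  const unsigned char *prev,
                                  const unsigned char *next,
                                  int x, int y, int stride)
{
    /* Spatial taps from the kept field of the current frame. */
    int spatial  = 4 * cur[(y - 1) * stride + x]
                 + 4 * cur[(y + 1) * stride + x];

    /* Temporal taps from the previous and following fields.  They sum
       to zero, so only band-passed detail (no DC) is contributed. */
    int temporal = 2 * prev[y * stride + x]
                 + 2 * next[y * stride + x]
                 -     prev[(y - 2) * stride + x]
                 -     prev[(y + 2) * stride + x]
                 -     next[(y - 2) * stride + x]
                 -     next[(y + 2) * stride + x];

    int val = (spatial + temporal) / 8;   /* spatial taps sum to 8 */
    if (val < 0)   val = 0;
    if (val > 255) val = 255;
    return (unsigned char)val;
}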

One very important point that doesn't show up very well in this particular example is that this filter also produces significantly fewer jaggies on diagonal edges (and especially on near-horizontal edges). Try it on one of your problematic clips. You will be impressed!

My friend Ivo ('i4004' on the forum) suggested that a standard motion test as done in SmartDeinterlacer could be performed and then the kernel applied only to the "moving" areas. This would allow static picture areas to be faithfully retained while moving areas still benefit from the better resolution of the spatial/temporal approach. I implemented this idea in KernelDeint(). It also allows a choice of small versus large kernel, and allows the user to show the motion map. Following is the frame from above with the motion test enabled at threshold=10.


 Motion Test

Short of full motion estimation/compensation techniques, it's hard to conceive of a better result. This algorithm can run in real time on fast processors. For that reason, and because of its impressive results, this spatial/temporal approach should arguably be the standard operating procedure for deinterlacing moving areas of interlaced video.
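
As for the motion test mentioned above, here is a minimal sketch of the kind of per-pixel check that could gate the kernel; comparing against the same-parity pixel of the previous frame is my assumption of a plausible test, not necessarily the exact one KernelDeint() implements:

#include <stdlib.h>   /* abs */

/* Sketch: decide per pixel whether to apply the kernel or simply weave.
   threshold == 0 forces the whole frame to be kernel-deinterlaced, as in
   the comparison grabs above. */
static int pixel_is_moving(const unsigned char *cur_frame,
                           const unsigned char *prev_frame,
                           int x, int y, int stride, int threshold)
{
    if (threshold == 0)
        return 1;   /* deinterlace everything */

    /* Hypothetical test: same-parity pixel of the previous frame. */
    return abs((int)cur_frame[y * stride + x] -
               (int)prev_frame[y * stride + x]) > threshold;
}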

Are you wondering how it performs when you throw hybrid progressive/interlaced material at it? I'll talk about that in a later journal entry.

 6-25-2003: Pattern Guidance

Let's talk about pattern guidance. Consider a standard 3:2 pulldown sequence, using our familiar notation:
a a b c d
a b c c d
Decomb 5's matching strategy will match the first frame to current (it could match to next also, we'll come back to this subtlety), the second to next, the third to next, the fourth to current, and the fifth to current. We can write a shorthand for the sequence, then, as:
...c n n c c c n n c c c n n c c c n n c c...
Given a clip that is clean 3:2 pulled down throughout, we might expect the output of the blind field matching to give the exact pattern shown above. In the real world, however, noise and the imperfection of the blind matching algorithm result in errors. The output will be corrupted with sporadic little failures of the matching. For example, for the source string above, our matching result might be:
...c n n c n c n n c c c n n c c c n c c c...

The challenge for a pattern guidance algorithm is now clear. Somehow take advantage of the fact that a 3:2 pattern is known to be present to correct the sporadic mismatches. This will require us to detect and track the 3:2 pattern and to generate a prediction for a match. If the actual blind match is the same as the predicted match we just accept it. If they differ, we reach a crossroads because two things could be happening. First, it could just be one of the sporadic mismatches we mentioned above. Second, it could be a real scenario where the pattern is no longer valid. If the first case applies, we want to overrule the blind match and use the predicted match. If the second case applies, we want to accept the blind match. I have found that the two cases can be adequately distinguished by comparing the predicted and blind match metrics; if they differ by less than a threshold amount, you have the first case, and if they differ by too much, you have the second case. The hard part is tracking the pattern and making good predictions.

Ideally, we would like to have a solution that will allow us to randomly navigate to any frame and the result would be completely determined and independent of any previous actual match decisions. While this is possible, it doesn't give the best solution, as will become clear. Nevertheless, since at the beginning of a clip we have no history of actual matches, and because the user can navigate to a random frame, our algorithm has to start in the ideal mode, which I call "soft" guidance. Let's consider how the soft guidance is implemented.

Consider again the ideal 3:2 sequence shown above. I pointed out that for the first frame the match metrics for a current match and for a next match will be equal or very close. Such a case occurs only once per cycle, so if we can detect it, we will have determined the pattern phase. Let's denote that frame by 'c*'. Then we have this:

...c* n n c c c* n n c c c* n n c c c* n n c c...
Our soft guidance strategy, then, is as follows (this is the easy version; Decomb does something better, but let's understand the easy way first). When trying to match frame N, examine the current and next match metric pairs for frames N+1 through N+5. Find the frame with the lowest difference between the current and next match metrics. This frame fixes the phase of the 3:2 pattern! Given the phase, we can easily predict the match for frame N. If the predicted match's metric is close enough to the blind match's metric, then overrule the blind match with the predicted match.

Well now, that was pretty simple, and entirely worthy of Joe Six-Pack. And it works well when there is steady motion in the clip. It fails when a non-3:2 section arrives and when motion becomes intermittent. For the first case, we simply require that the lowest current/next difference be low; if it isn't, then the section can't be 3:2 and we make no prediction, thereby disabling guidance for it. The second case occurs when there is no motion, or low motion, or duplicated frames in the cycle we are examining. In these cases there are multiple candidate phases, and the lowest one is not always the correct one (again due to noise and algorithmic imperfections).

To handle this multiple candidate phase problem, Decomb generates a list of candidate phases sorted by goodness, i.e., by lowness of the current/next difference. Decomb then tries to apply guidance using the candidates in best-first order. As soon as one succeeds (predicted/blind mismatch is low enough), the prediction is accepted and used. If none of them succeed, the blind match is used. There, now you know how soft pattern guidance works.
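
Here is a rough C sketch of soft guidance. The get_metric() accessor is a placeholder standing in for a lookup in the metrics cache, and a real implementation would also insist that the best candidate's current/next difference itself be small before trusting the phase:

#include <math.h>
#include <stdlib.h>

/* Placeholder: cached field-difference metric for matching 'frame' to
   current ('C') or next ('N').  Not real Decomb code. */
extern double get_metric(int frame, char match);

typedef struct { int offset; double diff; } Candidate;

static int by_diff(const void *a, const void *b)
{
    double da = ((const Candidate *)a)->diff, db = ((const Candidate *)b)->diff;
    return (da > db) - (da < db);
}

/* Returns the guided match for frame n ('C' or 'N'), or the blind match
   if no candidate phase yields an acceptable prediction. */
char soft_guidance(int n, char blind_match, double gthresh)
{
    /* One cycle of the 3:2 match pattern; position 0 is the 'c*' frame
       whose current and next metrics are nearly equal. */
    static const char pattern[5] = { 'C', 'N', 'N', 'C', 'C' };

    Candidate cand[5];
    for (int k = 1; k <= 5; k++) {
        cand[k - 1].offset = k;
        cand[k - 1].diff   = fabs(get_metric(n + k, 'C') - get_metric(n + k, 'N'));
    }
    qsort(cand, 5, sizeof(Candidate), by_diff);   /* best (lowest) first */

    for (int i = 0; i < 5; i++) {
        int k = cand[i].offset;
        /* A fuller version would also require cand[i].diff to be small
           before accepting frame n+k as a credible 'c*' frame. */
        char predicted = pattern[(5 - k) % 5];    /* frame n's position in the cycle */
        if (predicted == blind_match)
            return blind_match;                   /* nothing to override */
        if (fabs(get_metric(n, predicted) - get_metric(n, blind_match)) < gthresh)
            return predicted;                     /* overrule the blind match */
    }
    return blind_match;
}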

Soft pattern guidance works quite well, but in the presence of low motion and duplicated frames, it can still jump back and forth between possible phases, and thereby emit slightly wrong matches. Soft guidance is good when we arrive at a frame with no history of a 3:2 pattern of actual matches, but when we know that a 3:2 pattern of actual matches is behind us, we ought to be able to simply continue it. That is "hard" pattern guidance. It clearly is not completely deterministic for random access, because it depends upon the actual matches that precede the randomly accessed frame. And the actual matches for those depend upon their predecessors, and so on, and so on. So we cannot achieve a hard guidance and complete determinism with random access. Let's see how hard guidance is done.

Hard guidance is easily accomplished. We simply store the last five actual matches if they exist. They will exist if we have played through them. When encoding we will be playing through from beginning to end, so hard guidance will always have the 5-frame history (except for the first 5 frames of the clip). Then, if the 5 frames match a 3:2 pattern, we know the phase and make a hard prediction. If the hard prediction succeeds (predicted/blind mismatch is low enough), then we use it.

The final piece of the puzzle is how hard and soft guidance are combined. We start by trying hard guidance. If no hard prediction is possible, or the prediction fails, then we invoke soft guidance. If both fail, the blind field match is used.
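
And here is a minimal sketch of hard guidance and the overall combination, again with placeholder accessors; the array of the last five actual matches is assumed to be maintained elsewhere:

#include <math.h>

extern double get_metric(int frame, char match);   /* placeholder, as above */
extern char   soft_guidance(int n, char blind_match, double gthresh);

/* Sketch of hard guidance: if the last five actual matches fit some
   rotation of the 3:2 pattern, the phase is known.  Because the pattern
   has period 5, the predicted match for frame n is simply the actual
   match made five frames ago. */
static char hard_prediction(const char history[5])
{
    static const char pattern[5] = { 'C', 'N', 'N', 'C', 'C' };
    for (int phase = 0; phase < 5; phase++) {
        int ok = 1;
        for (int i = 0; i < 5 && ok; i++)
            ok = (history[i] == pattern[(phase + i) % 5]);
        if (ok)
            return history[0];
    }
    return 0;   /* history does not fit a 3:2 pattern */
}

/* Combination: hard guidance first, then soft guidance, then the blind match. */
char guided_match(int n, char blind_match, const char history[5],
                  int have_history, double gthresh)
{
    if (have_history) {
        char hard = hard_prediction(history);
        if (hard && (hard == blind_match ||
            fabs(get_metric(n, hard) - get_metric(n, blind_match)) < gthresh))
            return hard;
    }
    return soft_guidance(n, blind_match, gthresh);
}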

There you have it: pattern guidance in Decomb 5. It works surprisingly well, as long as the threshold for accepting an override ('gthresh') is not set too high. Fortunately, most of the sporadic mismatches are small and can be successfully overridden by the pattern guidance.

 6-21-2003: DGBob(): More Artifact Reduction

DGBob() was still producing artifacts on some unusual clips, so I modified the motion detection to better detect motion by considering more pixels in the temporal neighborhood. Consider this:

a   b   c   d   e
      x _ y 
f   g   h   i   j
The revised test is now this:
if ((abs(c-a) < D) && (abs(c-b) < D) && (abs(c-e) < D) && (abs(c-d) < D) &&
    (abs(h-f) < D) && (abs(h-g) < D) && (abs(h-j) < D) && (abs(h-i) < D) &&
    (abs(x-y) < D)) 
    pixel position '_' is static
else 
    pixel position '_' is moving
I also added an optional 'artifact protection' mode. In this mode, when the test above determines that a pixel position is static so that the previous field's pixel can be used, extra checking is performed to test whether that would produce a visible artifact. The test is as follows:
if (((x + AP < c) && (x + AP < h)) || ((x - AP > c) && (x - AP > h)))
	use (c + h) / 2
else
	use x
AP is the artifact protection threshold. It is set to a relatively high value to avoid preempting desirable weaving while still catching many obvious artifacts. The artifact protection mode should be used only when absolutely necessary, as it can increase flickering a bit (because some valid weaving gets preempted), and it requires more processing. The motion detection is now good enough that this mode is usually not required, but when that rare perverse clip comes along, it is available to almost guarantee the absence of perceivable artifacts.

I can't think of any other ways to improve DGBob(), and it now supports all color spaces, so I am now going to return to Decomb 5.

 6-18-2003: DGBob() Revisited: Artifact Reduction

DGBob() in its first release was, in my humble opinion, barely acceptable. And that applies to all the smart bob filters, including SmoothDeinterlace(). The reason for my opinion, of course, is the artifacts that they produce. Just to be fair to my own work, here is a typical artifact from SmoothDeinterlace(), but the first release of DGBob() produced the same kind of artifacts (but see below!):



Smart bob artifacts [SmoothDeinterlace()]

Here we have a light pole that is being panned across the field of view. The artifacts along the pole are plainly obvious and quite objectionable. They arise because pixels are changing in just the current field and not the previous and following fields. Consider this depiction of three lines from five successive fields:

x   x   x
  o _ o 
x   x   x

Suppose that we are trying to invent a pixel to replace the pixel position labeled by the underscore character. We want to either interpolate from the 'x' pixels above and below if the pixel position is in a moving area, or we want to simply use the 'o' pixel from the previous field if the pixel position is in a static area. But how do we determine whether the pixel position is moving?

A naive approach is to consider the two 'o' pixels on each side. If they differ by no more than a threshold amount, then the pixel position marked by '_' can be considered to be static. One hopes that this test will be correct most of the time. But it fails surprisingly often. Any kind of change that appears only in the single field containing the '_' pixel will fail the test. Such changes can be caused by single field events, such as flashes, but more commonly by fast motion, as in the case of the moving pole shown in the image above. This is a serious artifact, and no consideration of the history or future of the line containing the '_' pixel gives us any way to avoid it.

I was in despair about this until just this morning. Then I had the brainstorm that we could get an indication of the single-field change in the '_' pixel position by looking at the lines above and below. It's quite unlikely that a single-field change would be limited to a single line, so we can use the lines above and below to help us determine what is happening at our '_' pixel position. Here is a recasting of the pixel map with the pixels replaced by different letters, so that I can refer to them:

a   b   c
  x _ y 
d   e   f

To determine if our '_' pixel position is moving, we will make the following test:

if ((abs(a-b) < D) && (abs(d-e) < D) && (abs(x-y) < D))
    pixel position '_' is static
else 
    pixel position '_' is moving

The improvement due to this new motion test is impressive. Here is how DGBob() so modified renders the pole:



Result from new motion detection [DGBob()]

I'm certainly quite pleased with the result. My next steps for DGBob() will be addition of YV12 support, addition of a 'show motion map' option, and improved interpolation (cubic and/or edge-directed). When DGBob() is completed, I will return full-force to Decomb 5, with an attack on pattern guidance.

 6-14-2003: DGBob(): A MetricX Spinoff

I needed a little break from Decomb 5, so I decided to tackle an area that I have been frustrated by for some time, i.e., performing high-quality smart bobbing (converting fields to frames and doubling the frame rate). It is not hard to write a basic smart bob filter. I wrote the first one for VirtualDub many years ago. But writing a smart bob that can mitigate the effects of flutter is not so easy. Gunnar Thalin was a pioneer in this area with his popular SmoothDeinterlacer() for VirtualDub (later ported to Avisynth by 'Xesdeeni'). SmoothDeinterlacer() successfully achieves its goal of significantly reducing the flutter and shimmering that often result from bobbing.

SmoothDeinterlacer(), however, arguably suffers from two problems. First, it is very slow. Second, it can take some time before areas are detected as static and anti-flutter mitigation put into play. So I am trying to write a new smart bob for Avisynth that is much faster and which engages the anti-flutter mitigation faster.

The challenge for engaging anti-flutter mitigation is to know when a pixel position is static. When we are creating a missing pixel (when creating a frame from a field we have to create the missing lines), if the pixel is static we can use the pixel from the previous field, rather than having to interpolate it somehow from the current field. We can look at the corresponding pixels in the previous and following fields, and if they are close enough, we can (with fingers crossed) assume that the pixel is static. This doesn't always work, because sometimes the pixel can differ validly for one field period. To reduce artifacts from that, I added a test for the two previous corresponding pixels rather than just one. This reduces the artifacts but does not eliminate them completely. Fortunately, they occur only for fast motions, and the eye doesn't notice them. The artifacts of DGBob() are comparable to or less than those of SmoothDeinterlacer() in this regard.

For pixels that are not static, we potentially have to deinterlace them. I used the new metric that I described earlier and which I called 'MetricX'.

Initial results with DGBob() are encouraging. It performs much faster than SmoothDeinterlacer() and appears to give comparable results. But there are some wrinkles to be worked out. Following is a frame grab showing what DGBob() can produce. Compare it to Figure 1 in the MetricX description below (6-6-2003). To see the anti-flutter mitigation, you'll need to play a clip because a single frame grab shows only one interpolated field and flutter results from the alternation of fields.



DGBob() Frame from Field

 6-10-2003: Scene Changes Revisited

When I talked about bad edits (aka "scene changes"), I did not treat the subject in a complete and rigorous way. So I need to revisit bad edit handling.

Suppose we have a field sequence like this: ... a [x y] d ... I list below the possibilities for the frame [x y], where b and c are fields different from a and d. After each combination, I give the appropriate match result, where 'C' means match current, 'N' means match next, and 'P' means match previous. 'WEIRD' means the fields are out of order and thus the combination is highly unlikely unless the stream is seriously malformed. 'NO MATCH' means there is no successful match. Finally I give the number of orphaned fields for each combination (orphaned fields imply more perverse bad edits). Note that preference is always given to matching current.

a [a a] d => C 0
a [a b] d => P 1
a [a c] d => P 1
a [a d] d => N or P 0  <***>
a [b a] d => WEIRD
a [b b] d => C 0
a [b c] d => NO MATCH
a [b d] d => N 1
a [c a] d => WEIRD
a [c b] d => NO MATCH
a [c c] d => C 0
a [c d] d => N 1
a [d a] d => WEIRD
a [d b] d => WEIRD
a [d c] d => WEIRD
a [d d] d => C 0

If we adopt a two-way matching strategy, i.e., one where we can return either the current match or one other match that is always either the forward (N) or the backward (P) match, it is clear from the enumerated combinations that there is no reason to prefer either current-plus-next or current-plus-previous matching. Either way, we would miss two good matching combinations.

Let's now consider the more radical three-way matching strategy, where we consider the backward, current, and forward matches. It appears to be a good idea, because it successfully returns a progressive frame for all combinations where there is a progressive frame to be returned. But appearances can be deceiving, we are told, and there is no exception to the rule here.

Consider the combination marked with <***>. Here we have a simple edit cut between the fields of a frame. Other than the clean edit cuts on frame boundaries, this is the cleanest cut, because it leaves no orphaned fields. So we might expect it to be more common than the other, more perverse cuts. Our problem now is that this relatively clean cut puts us in a serious dilemma. Do we return the forward match or the backward match? We cannot compare the metrics meaningfully, because there are no fields in common (a/a versus d/d). It appears to be an arbitrary decision. But if it occurs arbitrarily we can randomly drop or duplicate frames! Here's an example to show why:

a [a b] [b c] [c d] d
Suppose for the first bracketed frame we match backward, for the second we match forward. We have lost frame b/b! The reader can confirm through analysis that extra frame duplicates can also be created. Our conclusion is that three-way matching can result in random frame deletion and duplication when cuts are made between the fields of frames. Experience shows that such cuts are common. The resulting jerkiness, or juddering, can be visible, depending upon the clip type. For animations, where there is sporadic motion and duplicated frames in the source, the effect is disguised and often is invisible. For smooth motion in normal video, such as pans, it might be quite annoying. And of course, the extent of the problem depends on how frequent this type of cut is in the clip.

We should exhaust all options to make three-way matching work, because it succeeds in more cases (all of them!) than does two-way matching. Maybe there is a solution to the juddering problem. Suppose we try to make the decision in the <***> case non-arbitrary. One way is to examine the two matches for combing. If both are combed or both are clean, then always take the forward match. If one is combed and the other is not, take the uncombed one. It seems that this strategy will avoid juddering while allowing the three-way match to be used.

This idea suggests the following modified two-way matching strategy to achieve the same effect. Perform forward two-way matching and if the returned frame is combed, test the backward match. If it is not combed use it; if it is combed use current. This is the strategy currently implemented in the new Decomb. I would welcome new ideas about all of this from my readers. But I cannot think of anything better to do.
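
In sketch form, with match() and is_combed() standing in for Decomb's actual field-matching and combed-frame detection routines, the modified strategy might look like this:

/* Sketch of the modified two-way matching strategy.  match() builds a
   frame from the given pairing and is_combed() is the combed-frame test;
   both are placeholders standing in for the real Decomb routines. */
typedef enum { MATCH_PREVIOUS, MATCH_CURRENT, MATCH_NEXT } Match;

extern void *match(int frame, Match m);
extern int   is_combed(const void *frame);

void *match_with_fallback(int n, Match forward_best)
{
    /* forward_best is the winner of the normal current-vs-next matching */
    void *f = match(n, forward_best);
    if (!is_combed(f))
        return f;                       /* forward two-way match is fine */

    void *b = match(n, MATCH_PREVIOUS); /* try the backward match */
    if (!is_combed(b))
        return b;

    return match(n, MATCH_CURRENT);     /* both combed: fall back to current */
}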

 6-6-2003: A New Combing Metric for Decomb

The last release of Decomb contained a new metric for deciding whether a frame is combed. Initial feedback indicates that it is an improvement. Let's take a look at it in detail.

First, here is a frame from one of my deinterlacing "torture test" clips. The figures in the foreground are completely still while the (out-of-focus) hand moves in the background.


 Figure 1

Let's forget everything we know, or think we know, about deinterlacing. Let's ask Joe Six-Pack for his opinion. How do we get rid of the ugly combing? Well, Joe says, you can pass through every other line and just worry about the rest. For the rest, we have to consider each pixel on the line. Consider one such pixel B with pixel A above it and pixel C below it. Now it's obvious, Joe continues, that if pixel B is combed, then it must be either lighter than both A and C, or it must be darker than both A and C. So just test that and declare the pixel combed if it is true.

Well now, that certainly cuts to the chase, doesn't it? We can test it on our Figure 1 above quite easily. Using Joe's test and setting combed pixels to full white yields the following map of combed pixels according to Joe's algorithm:


 Figure 2

Joe's algorithm certainly picked up the combed areas very well. But ouch, look what it did to the rest of the frame! We need to suppress all that noise, otherwise we will have trouble distinguishing it from real combing when we use our window technique previously described. But Joe says don't worry, it's easy to suppress that noise! (I'm beginning to develop greater respect for beer drinkers.) Joe goes on to say that we can't even perceive contrast differences below a certain threshold T, say about 7, so just adjust the test to allow for this. Here is the equation he gives, explaining that if R is true then the pixel is combed:

R = (B+T < A && B+T < C) || (B-T > A && B-T > C);

Well, there's nothing to lose; let's try it. Here is the result:


 Figure 3

Hey Joe! That's not too shabby. But Joe's not done. He says he has to run off to the bar but he'll give us one more trick before he goes. He says a lot of the remaining noise in the combing map is isolated pixels. You can suppress those by requiring a map pixel to have another map pixel to the immediate right or left. He says not to try that in the vertical direction because it creates noticeable artifacts. Hmmm. Anyway, he flashes this image and disappears:


 Figure 4

The metric that produces Figure 4 looks very useful! We could just run a fixed window over it and look for a threshold number of combed pixels to declare a combed frame. And this is just what Decomb now does. The variable T is called vthresh.
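
Putting Joe's pieces together, a sketch of the whole detector might look like this; the 16x16 window and the combed-pixel count threshold are illustrative, and the real Decomb code certainly differs in detail:

#include <stdlib.h>

/* Sketch of the combed-frame detector built from Joe's metric:
   per-pixel threshold test, suppression of isolated map pixels, then a
   windowed count of combed pixels. */
int frame_is_combed(const unsigned char *p, int width, int height,
                    int stride, int vthresh, int count_thresh)
{
    enum { WIN = 16 };
    int combed = 0;
    unsigned char *map = (unsigned char *)calloc((size_t)width * height, 1);
    if (!map)
        return 0;

    /* 1. Raw combing map: Joe's thresholded lighter/darker test. */
    for (int y = 1; y < height - 1; y++) {
        for (int x = 0; x < width; x++) {
            int A = p[(y - 1) * stride + x];
            int B = p[y * stride + x];
            int C = p[(y + 1) * stride + x];
            map[y * width + x] =
                (B + vthresh < A && B + vthresh < C) ||
                (B - vthresh > A && B - vthresh > C);
        }
    }

    /* 2. Windowed count, ignoring isolated map pixels (those with no
          combed neighbor immediately to the left or right). */
    for (int wy = 0; wy + WIN <= height && !combed; wy += WIN) {
        for (int wx = 0; wx + WIN <= width && !combed; wx += WIN) {
            int count = 0;
            for (int y = wy; y < wy + WIN; y++)
                for (int x = wx; x < wx + WIN; x++)
                    if (map[y * width + x] &&
                        ((x > 0 && map[y * width + x - 1]) ||
                         (x < width - 1 && map[y * width + x + 1])))
                        count++;
            if (count >= count_thresh)
                combed = 1;            /* one combed window is enough */
        }
    }

    free(map);
    return combed;
}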

Ah, dear reader, I see you gesticulating wildly. What's that you say? "Use this map to actually perform the deinterlacing as well!" It's certainly a possibility. Let's suppose that we simply do a blend on each of the pixels detected as combed in Figure 4, while passing through all the other pixels directly from the source image:


 Figure 5

Nothing wrong with that. We could improve it with some edge-directed interpolation, something I've been meaning to add to Decomb for a while. And it may even give us a solution to the Marching Ants problem.

So there you are: a great new combed frame detector and a potentially great general-purpose deinterlacer. See y'all later. I've got some work to do!

 6-5-2003: New Beta

I just finished up and released a major new beta of New Generation Decomb. It replaces the combed frame detection metric with a better one, adds vthresh override capability, and displays the pattern guidance mismatch metric to allow tweaking of gthresh. Initial feedback is encouraging. Be sure to read the tutorial to understand the new vthresh handling.

After that burst of energy I need a rest before resuming my exposition here. When I do resume, I will describe the new combing metric and then talk about pattern guidance.

 6-1-2003: A Digression: The Holy Grail of Hybrid Rendering

We've all run into those nasty clips that contain a mix of 3:2 pulldown and straight video content. Fans of Star Trek know all about how hard such clips are to render satisfactorily! If you leave them at 30fps while deinterlacing the video sections, then the video sections look fine but the film sections look jerky. On the other hand, decimating the clip to 24fps leaves the film sections fine but makes the video jerky. This unfortunate dilemma has led some people to go to ridiculous extremes. For example, people have created clips at 120fps because it has 30fps and 24fps as factors.

Decomb's Decimate filter provides a special mode (mode=3) for decimating hybrid clips. It improves matters but is not miraculous. A recent email from Kevin Atkinson contained an idea that got me thinking that it might be possible to do better than Decimate mode=3. So, I have run some experiments.

Download the clips that I reference below (right click and Save Target As...) and play them off your hard disk in BSPlayer or any other decent media player.

The first clip is the raw 30fps video source, deinterlaced of course (I do recommend Fox News for your viewing pleasure):

Original 30fps clip

As expected, this clip is smooth as silk, especially the scrolling banner at the bottom. Let's try converting it to 24fps by simply tossing every fifth frame:

24fps clip made by tossing every fifth frame

Seems a mite, shall we say, jerky! Yes, indeed. No big surprise, though. Let's try Decimate(mode=3):

24fps clip made by Decimate(mode=3)

Better, isn't it? The video looks OK but the scrolling banner still looks rather dodgy. The movement seems spatially smoother, but there is an annoying strobing effect. Also, if we single step, we'll see that some of the frames are now blends of two original frames. Some purists lose their lunch over that.

OK, the stage has been set. All ears are a-twitter. What magic is coming our way?:

24fps clip made by new method

Is it the holy grail of hybrid rendering? Not quite. But to my eyes it is an improvement. There are no hideous blended frames now and the movement, while not perfectly smooth, seems controlled and consistent. It almost looks like it is intended to be as it is, sort of a stock ticker effect.

But how is it done? Here is a cycle before decimation:

[a b] [c d] [e f] [g h] [i j]
And here is a field-decimated cycle, intended to spread the decimation more evenly through the cycle:
[a b] [d e] [f g] [i j]
Looks good, except that the frames d/e and f/g are now unusable, because top fields will be displayed as bottom fields and vice-versa. Fortunately, it's not a deal killer because we can resample the fields to put them back in the right spatial positions (compare VirtualDub's "field bob" filter). Doing that creates the "new method" clip.
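
For clarity, here is the field selection of the decimated cycle expressed as a small C table, with fields numbered a=0 through j=9; the parity flag marks the two frames whose fields must be resampled back to their proper lines:

/* Sketch: which fields of a 10-field (5-frame) cycle survive the field
   decimation shown above, and which output frames end up with swapped
   field parity. */
typedef struct { int top; int bottom; int parity_swapped; } OutFrame;

static const OutFrame decimated_cycle[4] = {
    { 0, 1, 0 },   /* [a b] : unchanged           */
    { 3, 4, 1 },   /* [d e] : fields swap parity  */
    { 5, 6, 1 },   /* [f g] : fields swap parity  */
    { 8, 9, 0 },   /* [i j] : unchanged           */
};
/* Fields c (2) and h (7) are dropped.  The two parity_swapped frames are
   usable only after resampling each field back to its proper lines
   (compare VirtualDub's "field bob" filter). */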

Is it an improvement that should be incorporated into Decomb? I await the opinion of my readers before deciding. But I believe it is.

 5-30-2003 [2]: Fixes and "A Scene Change is as Good as a Rest"

I've repaired the big bugs in the new Decomb and fixed scene change handling as earlier described, and I now feel comfortable in making a general beta available. Get it here: Decomb 5.0.0 beta 5. Please be sure to carefully read the tutorial because, as you might imagine, a lot of things are changed from Decomb classic.

Let's talk about scene changes a little more. It's actually a misnomer when you think about it, but it's the terminology people use, so we will continue to use it as well. Think about it, though: every time a new picture comes along, that is a "scene change". The change can be bigger or smaller, but there is no reason why any change in picture in and of itself should cause a matching problem. What causes the problem is bad edits. Of course they typically occur when editors cut or insert clips based on different scenes of the movie, so there is a coincidental correspondence between scene changes and problems resulting from bad edits. But if a scene change is made with a good edit, there is no problem for field matching. That is why not all scene changes experience problems.

Let's consider a typical PAL scene change done with a good edit (we assume top field first in all the following examples):

a a b b
a a b b
This clip causes no problems whatsoever and the field matcher continues to match "current" right through it without skipping a beat. But consider this edit:
a a a b b
a a b b b
On the third frame we have to match backwards but that is not a problem and no combed frames are emitted. What if our editor is really drunk and makes this edit, leaving an orphaned field:
a a b c c
a a c c c
On the third frame, we have to match to "next" to avoid emitting a combed frame. That is why I made the modification to Decomb earlier described. In PAL, the only way to get a match failure is to make a double bad edit leaving two orphaned fields:
a a b d d
a a c d d
There is no good match for the third frame. Can we assume no editor would ever do this? Probably. It's almost perverse. I have seen one instance of it in my entire life. But still, there is an even more compelling reason to avoid adding special handling for this, although we easily could (we could deliver a/a or d/d). The frame b/c can be viewed as the insertion of a video frame. It's a very small video section! Recall that we need to pass video sections through to the deinterlacer. So for those reasons, we do not attempt to handle this special case. The worst that will happen is that one frame in a zillion will get deinterlaced when we could theoretically deliver a duplicate good frame instead.

That's PAL. Similar arguments apply to NTSC 3:2 pulldown. But there is a single cut that will produce the match failure (the blank part represents a cut-out portion):

a a b c d e        l m m n o
a b c c d        k l m n o o
The combined frame e/k has no good match. Again, however, I have seen only a few of these in my life and the resulting frame looks like a video frame. Good editing practice of course avoids silly things like this.

I suppose one could argue that if there is just one video frame in a row, then treat it as a failed match due to a really bad edit. But until one of my users actually complains about it and submits a source clip of original material showing multiple instances, I think it will be fine just to deinterlace in the rare cases in which it occurs. So sayeth I.

 5-30-2003: New Generation Decomb Initial Results

I completed a first beta of New Generation Decomb and released it to my favorite testers at Doom9 (I will release a beta here shortly). Initial results are encouraging. A major bug was identified rather quickly and was squashed, although I haven't released the fixed beta yet, as I want to add a new feature to address another deficiency, that of handling of scene changes.

In Decomb classic, Telecide() performed a three-way match. Therefore, at some scene changes, where there is no available match to the previous frame, Telecide() could match to the following frame and output a non-combed frame. But now, because the new Telecide() performs only a two-way match for speed and better matching reliability, it emits a combed frame at some scene changes. This is a degradation of performance versus Decomb classic and something we cannot accept. Yet we do not want to restore three-way matching. What to do?

I think a good compromise solution is this: when a two-way matched frame is declared interlaced, Telecide() tries the third match. If it is an improvement it is used. This requires postprocessing to be enabled, but postprocessing is now so fast that I can't imagine that anyone would choose to run without it. The best part of this compromise is that it is free: the comparison needed for the forward match is the same one that serves as the backward match for the following frame! Having a metrics cache means that we have to calculate metrics for a frame only once; it doesn't matter when we do it. I'll explain more about that when I discuss pattern guidance, because it relies heavily on the metrics cache.
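
A minimal sketch of the metrics cache idea, with placeholder names: each frame's comparison metrics are computed at most once, whenever first requested, and the forward metric of a frame is just the backward metric of the frame after it:

/* Sketch of a per-frame metrics cache.  compute_metrics() stands in for
   the real sampling/differencing pass; it runs at most once per frame no
   matter how many times or in what order the metrics are requested. */
typedef struct {
    int    valid;
    double p_metric;   /* field difference for the backward (P) match */
    double c_metric;   /* field difference for the current (C) match  */
} FrameMetrics;

#define MAX_FRAMES 200000             /* sketch only */
static FrameMetrics cache[MAX_FRAMES];

extern void compute_metrics(int frame, double *p, double *c);   /* placeholder */

const FrameMetrics *get_frame_metrics(int frame)
{
    if (!cache[frame].valid) {
        compute_metrics(frame, &cache[frame].p_metric, &cache[frame].c_metric);
        cache[frame].valid = 1;
    }
    return &cache[frame];
}
/* The forward (N) metric of frame n is simply the backward (P) metric of
   frame n+1, so checking the third match costs nothing extra:           */
/*     double n_metric = get_frame_metrics(n + 1)->p_metric;             */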

So, I will implement this solution and then release a wider beta. After that, I hope to start expounding on pattern guidance, which is a very interesting problem.

 5-27-2003: Postprocessing At a Discount

We've established an effective field comparison algorithm that allows us to correctly match the fields for each frame. Now we require a means to decide whether a matched frame is still combed, as it might be for several well-known reasons (noise, too much vertical detail at the wrong spatial frequency, missing fields, hybrid clips, hybrid frames, blended fields, poor edits, etc.). Combed frames need to be deinterlaced.

Let's take the direct approach. No pussyfooting around. We already have a nice metric that we calculated for the best field match. Why don't we simply threshold that? We'll say that metrics above a certain value indicate an interlaced frame, i.e., one in which the fields are not from the same picture. If we can live with that, then we will have made the postprocessing decision totally for free!

Unfortunately, we cannot live with that. The problem is that the metric is too blunt. We already know that there is a certain amount of pseudo-difference that remains due to the spatial offset between the fields. We have to set the threshold above that floor. Furthermore, the floor level varies depending on the contents of the frame. Now consider a small mouth that comes out combed. The actual metric contribution from the very small mouth is swamped by the residual pseudo-difference floor, such that we cannot reliably distinguish between expected variation of the floor and a small area of interlacing.

There is also the problem that the metrics will be dependent upon the size of the frames. It would be preferable for the user to have a normalized metric so that "threshold=40" always means the same thing.

I'm sure that my dear readers have already conceived the solution. We simply calculate difference metrics for small windows on the frame. Every window (say 16x16 pixels) has an independent difference metric calculated. Then we select the window with the highest difference metric and use that as our metric for deciding if the frame is combed. This solves our two problems neatly. First, the contribution of a small area is magnified. Second, the metric is automatically normalized because the window size is fixed. We will still need to threshold the metric to make the final decision, but it is not problematic or sensitive thanks to the windowing.

Readers who have studied the source code are thinking, "Old generation Decomb windowed the combed frame detection, so what is the big deal?" The big deal is that in old generation Decomb, the frame was first sampled and calculated for the field matching and then, if postprocessing was enabled, it was sampled and calculated again. Why? It is a legacy of the fact that FieldDeinterlace() was written before Telecide() had integrated postprocessing, and FieldDeinterlace()'s code was absorbed into Telecide() uncritically. Clearly, as long as the subsampling works for both matching and combed frame detection, the two should share the sampling and calculations. New generation Decomb does this, resulting in a major speedup of postprocessing.

Some practical details are still dangling. How do we efficiently calculate the metrics per window? That is actually trivial. We declare an array of accumulators, one for each block. Then after we calculate the contribution of a sampled pixel, we accumulate it into the overall sum for matching purposes and into the appropriate window sum for combed frame detection purposes. Then after all the pixels have been processed, we scan the window metrics and pick the highest one.
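
A sketch of that shared accumulation, with an illustrative window size and subsampling step; pixel_difference() stands in for the per-pixel field comparison (a sketch of it appears with the 5-26 entry below):

#include <string.h>

/* Sketch: one pass accumulates both the overall metric used for field
   matching and the per-window metrics used for combed-frame detection. */
extern int pixel_difference(const unsigned char *p, int x, int y, int stride);

double shared_metrics(const unsigned char *p, int width, int height,
                      int stride, double *worst_window)
{
    enum { WIN = 16 };
    static double window[256][256];              /* sketch bound on the window grid */
    memset(window, 0, sizeof(window));
    double overall = 0.0;

    for (int y = 2; y < height - 2; y += 2) {    /* illustrative subsampling */
        for (int x = 0; x < width; x += 2) {
            int d = pixel_difference(p, x, y, stride);
            overall += d;                        /* contributes to field matching   */
            window[y / WIN][x / WIN] += d;       /* and to its window's accumulator */
        }
    }

    *worst_window = 0.0;                         /* pick the highest window metric */
    for (int wy = 0; wy * WIN < height; wy++)
        for (int wx = 0; wx * WIN < width; wx++)
            if (window[wy][wx] > *worst_window)
                *worst_window = window[wy][wx];

    return overall;
}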

Of course we still have to deinterlace the frames that we have decided are combed. We cannot re-use our calculations for that because a) deinterlacing requires full sampling, and b) the deinterlacing combing detection algorithm differs from the frame differencing algorithm. In our favor, however, is the fact that relatively few frames need to be deinterlaced when we are processing progressive material. The main processing hit has been eliminated by re-using the field matching calculations for combed frame detection.

Old generation Decomb was capable of automatically adapting to the field order of the clip. This is a mixed blessing. While it is slightly easier for the user, it makes the filter slower and, worse, it makes possible extra spurious bad field matches. New generation Decomb therefore requires the user to specify the field order of the clip. Fortunately there is a simple, quick, and reliable way to determine the field order. I believe most users would trade off the few minutes required to determine the field order against several hours of saved processing time and a diminished frequency of bad field matches. In fact old generation Decomb had the 'mm' option to suppress the spurious match test and thereby achieve the same result.

Next time, we will look at the major revamp of pattern guidance in new generation Decomb. It supports full random timeline navigation and is not dependent upon a history of previous decisions, yet it performs better than old generation Decomb's pattern guidance. Think about how that can be the case.

Here's a status report on new generation Decomb. All coding is complete except for YV12 support. The new reference manual is complete. A user manual is in progress. When that is done, I will release an alpha version. While that is out I will complete the YV12 support and fix any bugs that are found.

 5-26-2003: Field Differencing

Recall that we were considering a field sequence ...a [b c]... where b and c make up a frame. We concluded that the frame is progressive if and only if a and b are from the same picture (temporal moment) or b and c are from the same picture. Our problem is how to reliably determine if two fields are from the same picture.

The direct approach is to simply subtract one field from the other, hoping that the spatial offset is not too serious. We can get an idea about how that works using Avisynth:

a=SeparateFields(clip)
Subtract(a, Trim(a,1,0))
The differences are centered on a middle gray level by Subtract() to show positive and negative differences. Following is a typical progressive frame. We hope that the picture will be totally flat, featureless gray. But alas, we are disappointed; there is a lot of strong detail.

The spatial offset results in detail that looks like movement between the fields, making us think that this frame is not progressive. What to do? Maybe it will help if we resample the fields so as to move one up a bit and the other down, so that we are comparing the corresponding spatial points. The following image uses VirtualDub's built-in field bob filter to align the fields. Then the Avisynth Subtract() operation is applied as before.

Obviously it's much better but you can still make out significant detail, implying field difference. Is there anything else we can do? We can blur the bobbed fields before the Subtract operation, as in the following image.

Now we're getting somewhere! For comparison, the following image shows the same bob/blur/subtract operation applied to fields from two different pictures. In this case we hope and expect to see a big difference.

Comparison of these last two images gives us confidence that with the help of a threshold we can reliably distinguish between progressive and video frames. In fact, new generation Decomb uses this approach for both field matching and video frame detection. When we try to match field b to either field c or field a in the example ...a [b c]..., we simply calculate the two field differences and pick the pairing that produces the smallest field difference. The beauty of this is that the most expensive part of postprocessing, determining if a frame is video, will come almost for free from the field matching operation!

Let us examine the details of how Decomb (assume new generation unless specifically qualified from now on) performs these two tasks.

First consider the field matching. Clearly, performing the bobbing and blurring as described above is computationally expensive. We need an algorithm that approaches it but can be computed efficiently. Decomb subsamples the frame to select pixels to examine. Suppose we denote one of these sampled pixels as C. The pixels above and below C then are as follows:

a
b
C
d
e
Pixels a, C, and e are from the top field and b and d are from the bottom field. Now we calculate:
difference = abs((a+C+e)/3 - (b+d)/2)
In practice we use a trick to avoid the division by 3. It is clear that the bobbing effect is achieved by the choice of pixels and the blurring is achieved by the averaging. We now accumulate all the differences calculated for all our sample points and the result is our difference metric for the field comparison. Finally, returning to the matching scenario ...a [b c]..., we get the field difference metric for [b a] and for [b c] and we select the pairing that has the lowest field difference metric. That is how Decomb performs its field matching. There are some refinements but that is the essence of it.

In my next journal entry I will describe how the progressive/video decision is derived from this field matching calculation and how it brings us postprocessing at a discount. Think about it! It's not so simple as you may think because you'll see that the obvious approach fails.

 5-23-2003: In the beginning

In the beginning man created film. And it was progressive. And man said, Let there be interlacing: and there was interlacing. And man saw the interlacing, that it was good: and man divided the even fields from the odd.

If God had been making these decisions we'd be a lot better off.

Before I transition into an explanation of this parable, I want to alert readers to the fact that I have added two items to the Decomb deficiency list adduced below on 5-21. Now for the transition...

This fundamental yin-yang of video, progressive versus interlaced, is the first consideration we need to pay attention to when deciding how to render a clip. And we need to decide on a per-frame basis, not on a per-clip one, because hybrid material is ubiquitous. An interlaced video frame (one that cannot be field-matched back to a progressive source frame) must be passed through the field restoration process untouched, and then it must be deinterlaced.

This need to weed out video frames before applying field reconstruction and pattern guidance means that it is the first problem that we need to solve. The problem statement is simple. How do we examine a frame and determine if it is video as defined above, or if it is actually a good progressive frame, or can result in a good progressive frame if appropriately matched?

Several solutions suggest themselves. We might take the tack that if it is not progressive film, it must be video. So we'll look for the 3:2 pulldown pattern and if we don't see it, we have video. But that is unacceptable because we can have progressive frames in the absence of any pattern, e.g., 24fps or 30fps progressive. Being out of pattern is not the same as being video.

We are left with the idea of comparing the fields of the frame. Consider this sequence, where the top field is first (each letter denotes a field; a field and the field below are a frame, and a different letter denotes a new picture):

a c e g i ...
b d f h j ...
Here we have frames of interlaced video. Now consider this:
a b c d e...
a b c d e...
Here we have simple progressive frames. Consider this:
. a b c d e ...
a b c d e . ...
Here we have progressive frames but with a one-field phase shift.
a a b c d ...
a b c c d ...
Here we have 3:2 pulldown.

Considering all of the above patterns, what is the key that allows us to look at a given frame of the video sequence and say, "that is a video frame"? We cannot simply reply that if the frame is combed it is video, because the phase-shift case and the 3:2 case both have combed frames, but they are not video, because the progressive original frames can be recovered through field matching.

Consider the frame c/d in the video case. We have the temporal field sequence ... x [c d] y. If c does not match to d, it must match to x, otherwise it is video. Consideration of the cases above shows that this is a valid conclusion. But comparing c to d and c to x is just the same comparison strategy we use to attempt to match fields! This will be a useful coincidence for performance reasons. For now, we just note that our algorithm for determining if a frame is video is as follows:

Attempt to field-match the frame. Select the best match. If the two fields of the match are not the same picture, i.e., they differ, then the frame is video.
That makes it sound easy, but it isn't. Sadly. The problem is that it is not straightforward to compare fields to see if they are the same picture. The reason is not just that there may be noise. We can allow for that. The problem is that the two fields are spatially offset from one another. Even though they may come from the same picture, they differ because one field has the even lines and the other has the odd lines. We need a differencing method that will not be confused by this spatial offset. And it must also not be confused by a progressive picture that has vertical detail whose spatial frequency would spoof simple line-to-line comparison. It is not an easy problem.

In my next journal entry, I will describe my solution to this problem. The solution is not the same as the one currently implemented in Telecide, and it is an improvement. It has other highly useful consequences too, as you will see. For now, it will be instructive for you to think about this problem and its possible solutions.

 5-21-2003: New Generation Decomb

My Decomb filter package for Avisynth has become probably my most popular and often-used tool. It is now mature and stable and has accumulated useful features that make it a very flexible solution capable of meeting many diverse needs in desktop video. Certainly it has some features that were unique when introduced, such as the idea of using blind field matching for progressive frame restoration as one operation, and decimation as a separate decoupled operation, such that combining the two results in an inverse telecine (IVTC) operation. The decoupling of the functions allowed Decomb to address more application domains than an integrated tool would have done.

Another example of an innovative feature is the idea of allowing the deinterlacer to decide whether a frame is combed before applying adaptive deinterlacing to the frame. This allows the deinterlacer to be applied as a post-processor to a progressive frame restoration process: only the frames emerging still combed from the restoration process are touched; good progressive frames are not degraded. FieldDeinterlace() is still the only deinterlacer that offers this functionality.

A final example is Decimate()'s special modes for hybrid material. While this functionality is still not the holy grail of hybrid rendering that we all seek, it offers an advance over previous practice.

Despite having important features that commend it, Decomb has some deficiencies. Living as I do by the philosophy "we can improve anything!", I'm motivated to look squarely at the deficiencies and think about what can be done to improve matters. Here I am amply and ably helped by my users, who are not shy to tell me what they don't like!

So what is there not to like about Decomb? Following are the major deficiencies that make me sleep poorly for having designed and implemented such things (I have named the problems so that I can make specific references to the deficiencies later). There are surely other problems, but I don't lose sleep over them.

  • The Mouth Problem  While Telecide() generally does a good job of field matching, it can choose poorly when motion is limited to very small areas, such as moving mouths. It is irritating to have a great encode except for a few mouths that are combed. I call this the "mouth problem". Of course, pattern guidance can significantly reduce the occurrence of the problem, but the implementation of pattern guidance is not great (see below) and not all clips can use pattern guidance. Manual override of the field matching is also possible but it is a cumbersome process that users prefer to avoid.

  • The Indeterministic Pattern Guidance Problem  Pattern guidance is the process of biasing the blind field matching towards a declared clip type. For example, if the clip is known to be 3:2 pulldown material, then Telecide(guide=1) is used to bias the field matching to the 3:2 pattern. This can eliminate some bad field matches that might otherwise occur when only blind field matching is used. Unfortunately, the pattern guidance implementation relies on a buffering of the previous four frames. This means that inconsistent results are achieved when there is random timeline access. Ideally, rendering of a given frame should not be dependent on the past history of frames processed. To be clear, rendering is dependent on the previous frames, but it should not require that the clip be played through those frames to reach the current frame. One of the strengths of Avisynth is random frame access; when a frame is requested the previous frames can be obtained without having had to play through them. Doing it that way is better, but it raises some practical issues that need to be addressed in an efficient way (caching of previous calculations).

  • The Insufficient Pattern Guidance Support Problem  The current implementation of pattern guidance demonstrably helps the field matching, but it is half-hearted. It relies on a history of only the past four frames (for 3:2 guidance). If the current frame combined with the past four frames looks like a 3:2 pattern, then the field match decision is biased appropriately. This is really a minimal approach for two reasons. First, too few frames are included in the analysis. If one of them is out-of-pattern, then guidance fails. Better would be to consider many more frames and use an approximate matching to allow for individual frames to be noisy, misinterpreted by the combing metric, etc. Second, a forward search should also occur. Ideally, if a backward search fails to provide guidance, a forward search should be performed. This would allow a seamless continuity of pattern lock even across a pattern change! A simpler way of saying it is that pattern lock would always be instantaneous (Decomb currently requires playing through a full in-pattern cycle to get back into pattern lock).

  • The Film versus Video (Hybrid) Problem  In practice, a lot of source material consists of a mix of "film" (24 progressive frames per second 3:2 pulled-down to 60 fields per second) and "video" (60 fields per second with each field a separate temporal moment). While Decomb can produce reasonable results with such material (partially explaining its popularity), its handling of such hybrid material is not correct from a theoretical perspective. Here's why. When Telecide() encounters a video frame, it doesn't know that and applies field matching anyway, when it should just pass through the frame untouched so that it can be deinterlaced by the post-processing. If it matches to current, then all is well, but if it matches to previous or next, then the correct frame is not passed through. In practice it is not disastrous because the result is only a small jerkiness of the resulting video sequence. And because the user usually sets the final frame rate to 24fps to get good film rendering, he attributes the jerkiness due to this mechanism to the decimation of the video sequence! Of course, if the user leaves the frame rate at 30fps, any remaining jerkiness can be attributed only to this faulty mechanism. Matters could be improved if Telecide() could reliably distinguish video frames and pass them through to the deinterlacer. It would also improve Decimate()'s hybrid modes because it would have a more accurate indication of which cycles contain film and which contain video.

  • The Field Matching Problem  Decomb's field matching is simply not as good as that found in some other field matching/IVTC solutions. It's taken me a long time to face this and to discover the reason, but I now know the reason, and the new generation Decomb will correct that.

  • The Postprocessing Problem  Postprocessing is the process whereby restored progressive frames are examined to see if they are still combed (there are various reasons why they could be) and then are deinterlaced if they are. Decomb's postprocessing is simply too darned slow. Is there anything else to say?

  • The Marching Ants Problem  Decomb's deinterlacer, embodied in Telecide()'s postprocessing and in FieldDeinterlace(), has an irritating deficiency. Where there is a hard stable horizontal boundary, for example, the boundary between a black letterbox area and the video, or between a static logo area and the video, a corruption of the edge can occur that one user characterized as marching ants. This occurs because a very minimal combing metric is used for speed. An edge avoidance term in the metric would dramatically improve matters at some performance expense.
So, why am I telling you all this? Good question. I'm glad you asked. The answer is that I have already begun a redesign and implementation of a second-generation Decomb to address these issues. In future journal entries I will describe the new design and how it addresses the deficiencies. You will be happy to know that solutions to the Indeterministic Pattern Guidance Problem and the Insufficient Pattern Guidance Support Problem have already been designed and tested. The solution to the Film versus Video Problem has been designed and is now being coded.

Y'all come back, now, and I'll tell you all about it.