Welcome to my journal!

So what's this all about? This is my journal where I expound about anything and everything that interests me at the time. It includes descriptions of my thinking and work in progress, rants, raves, and other random musings. The formatting is shamelessly borrowed from Avery Lee's VirtualDub site (imitation is the sincerest form of flattery!). My only hope for this page is that it won't bore you (too badly). Note that I add new entries at the top, so if you are an infrequent visitor, you'll probably want to scroll down a little to get into the flow.

[All contents are Copyright (c) 2003, 2004 Donald A. Graft, All Rights Reserved.]

8-10-2003: KernelDeint(): A Fixed Spatial/Temporal Filter for Deinterlacing

Time flies when you're having fun. And I am having fun. Recently, Colin Browell sent me an email calling my attention to an old US patent that described a deinterlacing method based on a fixed spatial/temporal filter. To be honest, while I found it interesting, I wasn't very optimistic about it being useful, because it's an old idea that you don't see in use today; if it were worth anything, it would be well-known and widely used. Nevertheless, I decided to crank out a quick prototype to see how it looked. Suffice it to say I was surprised and delighted! To see why, let me show you some frame grabs. Then I'll discuss the theory and a refinement of the basic idea. I polished the prototype into a finished Avisynth filter called KernelDeint(), and I encourage you to download it and use it for your interlaced video.

Following is a grab of a test clip deinterlaced using field discarding and bicubic interpolation to restore the original height. This acts as a sort of standard for the resolution we can expect with simple field discarding. And following is a grab of the same frame processed with KernelDeint().
The threshold has been set to 0 to force the entire frame to be deinterlaced (and not just "moving" areas), in order to compare it fairly to the previous grab. Cognoscenti will have no problem seeing a noticeable improvement in retained resolution. For those with less sensitive eyes, following is a frame showing the difference between the images (made using Avisynth's Subtract() filter). This image allows you to visualize the extra detail retained by KernelDeint(). Obviously, this spatial/temporal approach has something to offer compared to simple spatial interpolation!

But how does it work? If you read the patent you can gain a complete understanding, but I will state the main point. Information from the previous and following fields is included in the calculations made for interpolating the missing field. This information is band-passed such that low-frequency (DC) components are excluded, and high-frequency components (including those that would cause combing) are excluded. But the vertical frequencies that are included are enough to improve the overall resolution while still removing the combing. Following is the small kernel used to generate the frame grabs. Even more dramatic results can be obtained by using the larger kernel described in the patent. It performs some sharpening, however, that some may not want. But often people add a sharpener after field discarding, so there is a precedent for having a sharpening kernel, and with the large kernel you don't need a separate sharpening filter.

One very important point that doesn't show up very well in this particular example is that this filter also produces significantly fewer jaggies on diagonal edges (and especially near-horizontal edges). Try it on one of your problematic clips. You will be impressed! My friend Ivo ('i4004' on the forum) suggested that a standard motion test as done in SmartDeinterlacer could be performed and then the kernel applied only to the "moving" areas.
This would allow static picture areas to be faithfully retained while moving areas still benefit from the better resolution of the spatial/temporal approach. I implemented this idea in KernelDeint(). It also allows a choice of small versus large kernel, and allows the user to show the motion map. Following is the frame from above with the motion test enabled at threshold=10.
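To make the idea concrete, here is a small C sketch of how a spatial/temporal kernel of this kind might be applied to one pixel of the missing field. The coefficient values below are illustrative only (they are my assumptions, not the kernel from the patent or from KernelDeint() itself); the point is the structure: the current field supplies the low frequencies via the neighboring lines, and the previous frame contributes band-passed vertical detail whose weights sum to zero, so its DC component cancels.

```c
/* Sketch of a fixed spatial/temporal deinterlacing kernel in the spirit of
 * KernelDeint(). Coefficients are illustrative, NOT the patent's kernel.
 * cur[] is a vertical column of pixels from the current frame, prev[] the
 * same column from the previous frame; i is the line being synthesized. */
#include <stdlib.h>

static int clamp255(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

int kernel_deint_pixel(const int *cur, const int *prev, int i)
{
    /* spatial part: neighboring lines of the current field (weights sum to 4) */
    int spatial  = 2 * cur[i - 1] + 2 * cur[i + 1];
    /* temporal part: band-passed contribution from the previous frame
       (weights 2, -1, -1 sum to 0, so DC is excluded) */
    int temporal = 2 * prev[i] - prev[i - 2] - prev[i + 2];
    return clamp255((spatial + temporal + 2) / 4);
}
```

Note that for static content (cur equal to prev and locally flat) the temporal term vanishes and the output reduces to the original pixel value, which is exactly the behavior we want where nothing moves.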
Short of full motion estimation/compensation techniques, it's hard to conceive of a better result. This algorithm can run in real time on fast processors. For that reason, and because of its impressive results, this spatial/temporal approach should arguably be the standard operating procedure for deinterlacing moving areas of interlaced video. Are you wondering how it performs when you throw hybrid progressive/interlaced material at it? I'll talk about that in a later journal entry.

6-25-2003: Pattern Guidance

Let's talk about pattern guidance. Consider a standard 3:2 pulldown sequence, using our familiar notation (top fields over bottom fields):

a a b c d
a b c c d

Decomb 5's matching strategy will match the first frame to current (it could match to next also; we'll come back to this subtlety), the second to next, the third to next, the fourth to current, and the fifth to current. We can write a shorthand for the sequence, then, as:

...c n n c c c n n c c c n n c c c n n c c...

Given a clip that is clean 3:2 pulled down throughout, we might expect the output of the blind field matching to give the exact pattern shown above. In the real world, however, noise and the imperfection of the blind matching algorithm result in errors. The output will be corrupted with sporadic little failures of the matching. For example, for the source string above, our matching result might be:

...c n n c n c n n c c c n n c c c n c c c...

The challenge for a pattern guidance algorithm is now clear: somehow take advantage of the fact that a 3:2 pattern is known to be present to correct the sporadic mismatches. This will require us to detect and track the 3:2 pattern and to generate a prediction for a match. If the actual blind match is the same as the predicted match, we just accept it. If they differ, we reach a crossroads, because two things could be happening. First, it could just be one of the sporadic mismatches we mentioned above. Second, it could be a real scenario where the pattern is no longer valid.
If the first case applies, we want to overrule the blind match and use the predicted match. If the second case applies, we want to accept the blind match. I have found that the two cases can be adequately distinguished by comparing the predicted and blind match metrics; if they differ by less than a threshold amount, you have the first case, and if they differ by too much, you have the second case.

The hard part is tracking the pattern and making good predictions. Ideally, we would like a solution that allows us to navigate randomly to any frame, with the result completely determined and independent of any previous actual match decisions. While this is possible, it doesn't give the best solution, as will become clear. Nevertheless, since at the beginning of a clip we have no history of actual matches, and because the user can navigate to a random frame, our algorithm has to start in the ideal mode, which I call "soft" guidance.

Let's consider how soft guidance is implemented. Consider again the ideal 3:2 sequence shown above. I pointed out that for the first frame the match metrics for a current match and for a next match will be equal or very close. Such a case occurs only once per cycle, so if we can detect it, we will have determined the pattern phase. Let's denote that frame by 'c*'. Then we have this:

...c* n n c c c* n n c c c* n n c c c* n n c c...

Our soft guidance strategy, then, is as follows (this is the easy version; Decomb does something better, but let's understand the easy way first). When trying to match frame N, examine the current and next match metric pairs for frames N+1 through N+5. Find the frame with the lowest difference between the current and next match metrics. This frame fixes the phase of the 3:2 pattern! Given the phase, we can easily predict the match for frame N. If the predicted match is close enough to the blind match, then overrule the blind match with the predicted match.
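The easy version of soft guidance described above can be sketched in C as follows. The function names and array packaging are mine; the arrays hold the blind-match metrics for matching each frame to current and to next (lower meaning a better match), as assumed from the discussion.

```c
/* Sketch of the "easy version" of soft pattern guidance. curm[] and
 * nextm[] hold the blind-match metrics for a current match and a next
 * match for each frame (lower = better). */
#include <stdlib.h>

/* Find the phase-fixing frame: among frames n+1..n+5, the one whose
 * current and next metrics are closest is the 'c*' frame. Returns its
 * offset (1..5) from frame n. */
int find_phase(const int *curm, const int *nextm, int n)
{
    int best = 1, bestdiff = abs(curm[n + 1] - nextm[n + 1]);
    for (int k = 2; k <= 5; k++) {
        int d = abs(curm[n + k] - nextm[n + k]);
        if (d < bestdiff) { bestdiff = d; best = k; }
    }
    return best;
}

/* Given the offset of the c* frame, predict the match for frame n.
 * In the cycle c* n n c c, c* is position 0, so frame n sits at
 * position (5 - offset) % 5 of the preceding cycle. */
char predict_match(int offset)
{
    static const char pattern[5] = { 'c', 'n', 'n', 'c', 'c' };
    return pattern[(5 - offset) % 5];
}
```

The real filter, per the text, also requires the winning difference to be low (otherwise it makes no prediction) and tries multiple candidate phases in best-first order; those refinements are omitted here for clarity.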
Well now, that was pretty simple, and entirely worthy of Joe Six-Pack. And it works well when there is steady motion in the clip. It fails when a non-3:2 section arrives and when motion becomes intermittent. For the first case, we simply require that the lowest current/next difference be low; if it isn't, then it can't be 3:2 and we make no prediction, thereby disabling guidance for the section. The second case occurs when there is no motion, or low motion, or duplicated frames in the cycle we are examining. In these cases there are multiple candidate phases, and the lowest one is not always the correct one (again due to noise and algorithmic imperfections). To handle this multiple-candidate-phase problem, Decomb generates a list of candidate phases sorted by goodness, i.e., by lowness of the current/next difference. Decomb then tries to apply guidance using the candidates in best-first order. As soon as one succeeds (the predicted/blind mismatch is low enough), the prediction is accepted and used. If none of them succeed, the blind match is used. There, now you know how soft pattern guidance works.

Soft pattern guidance works quite well, but in the presence of low motion and duplicated frames, it can still jump back and forth between possible phases, and thereby emit slightly wrong matches. Soft guidance is good when we arrive at a frame with no history of a 3:2 pattern of actual matches, but when we know that a 3:2 pattern of actual matches is behind us, we ought to be able to simply continue it. That is "hard" pattern guidance. It clearly is not completely deterministic for random access, because it depends upon the actual matches that precede the randomly accessed frame. And the actual matches for those depend upon their predecessors, and so on. So we cannot achieve hard guidance and complete determinism with random access. Let's see how hard guidance is done. Hard guidance is easily accomplished.
We simply store the last five actual matches, if they exist. They will exist if we have played through them. When encoding, we will be playing through from beginning to end, so hard guidance will always have the 5-frame history (except for the first 5 frames of the clip). Then, if the 5 frames match a 3:2 pattern, we know the phase and make a hard prediction. If the hard prediction succeeds (the predicted/blind mismatch is low enough), then we use it.

The final piece of the puzzle is how hard and soft guidance are combined. We start by trying hard guidance. If no hard prediction is possible, or the prediction fails, then we invoke soft guidance. If both fail, the blind field match is used. There you have it: pattern guidance in Decomb 5. It works surprisingly well, as long as the threshold for accepting an override ('gthresh') is not set too high. Fortunately, most of the sporadic mismatches are small and can be successfully overridden by the pattern guidance.

6-21-2003: DGBob(): More Artifact Reduction

DGBob() was still producing artifacts on some unusual clips, so I modified the motion detection to better detect motion by considering more pixels in the temporal neighborhood. Consider this:

a b c d e
  x _ y
f g h i j

The revised test is now this:

if ((abs(c-a) < D) && (abs(c-b) < D) && (abs(c-e) < D) && (abs(c-d) < D) &&
    (abs(h-f) < D) && (abs(h-g) < D) && (abs(h-j) < D) && (abs(h-i) < D) &&
    (abs(x-y) < D))
    pixel position '_' is static
else
    pixel position '_' is moving

I also added an optional 'artifact protection' mode. In this mode, when the test above determines that a pixel position is static, so that the previous field's pixel can be used, extra checking is performed to test whether that would produce a visible artifact. The test is as follows:

if (((x + AP < c) && (x + AP < h)) || ((x - AP > c) && (x - AP > h)))
    use (c + h) / 2
else
    use x

AP is the artifact protection threshold.
It is set to a relatively high value to avoid preempting desirable weaving while still catching many obvious artifacts. The artifact protection mode should be used only when absolutely necessary, as it can increase flickering a bit (because some valid weaving gets preempted), and it requires more processing. The motion detection is now good enough that this mode is usually not required, but when that rare perverse clip comes along, it is available to almost guarantee the absence of perceivable artifacts. I can't think of any other ways to improve DGBob(), and it now supports all color spaces, so I am now going to return to Decomb 5.

6-18-2003: DGBob() Revisited: Artifact Reduction

DGBob() in its first release was, in my humble opinion, barely acceptable. And that applies to all the smart bob filters, including SmoothDeinterlacer(). The reason for my opinion, of course, is the artifacts that they produce. Just to be fair to my own work, here is a typical artifact from SmoothDeinterlacer(), but the first release of DGBob() artifacts in the same way (but see below!):

Smart bob artifacts [SmoothDeinterlacer()]

Here we have a light pole that is being panned across the field of view. The artifacts along the pole are plainly obvious and quite objectionable. They arise because pixels are changing in just the current field and not the previous and following fields. Consider this depiction of three lines from three successive fields (columns are fields, rows are lines):

x x x
o _ o
x x x

Suppose that we are trying to invent a pixel to replace the pixel position labeled by the underscore character. We want to either interpolate from the 'x' pixels above and below if the pixel position is in a moving area, or simply use the 'o' pixel from the previous field if the pixel position is in a static area. But how do we determine whether the pixel position is moving? A naive approach is to consider the two 'o' pixels on each side.
If they differ by no more than a threshold amount, then the pixel position marked by '_' can be considered to be static. One hopes that this test will be correct most of the time. But it fails surprisingly often. Any kind of change that appears only in the single field containing the '_' pixel will fail the test. Such changes can be caused by single-field events, such as flashes, but more commonly by fast motion, as in the case of the moving pole shown in the image above. This is a serious artifact, and no consideration of the history or future of the line containing the '_' pixel gives us any way to avoid it.

I was in despair about this until just this morning. Then I had the brainstorm that we could get an indication of the single-field change in the '_' pixel position by looking at the lines above and below. It's quite unlikely that a single-field change would be limited to a single line, so we can use the lines above and below to help us determine what is happening at our '_' pixel position. Here is a recasting of the pixel map with the pixels replaced by different letters, so that I can refer to them:

a b c
x _ y
d e f

To determine if our '_' pixel position is moving, we make the following test:

if ((abs(a-b) < D) && (abs(d-e) < D) && (abs(x-y) < D))
    pixel position '_' is static
else
    pixel position '_' is moving

The improvement due to this new motion test is impressive. Here is how DGBob() so modified renders the pole:

Result from new motion detection [DGBob()]

I'm certainly quite pleased with the result. My next steps for DGBob() will be addition of YV12 support, addition of a 'show motion map' option, and improved interpolation (cubic and/or edge-directed). When DGBob() is completed, I will return full-force to Decomb 5, with an attack on pattern guidance.
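The motion test and the resulting pixel synthesis described above can be packaged as a couple of small C functions. The test itself is exactly the one from the text; the function names and the parameter packaging are mine.

```c
/* Sketch of DGBob()'s improved motion test. Pixel layout around the
 * missing position '_' (columns are fields, rows are lines):
 *     a b c
 *     x _ y
 *     d e f
 * x and y are the previous/following fields' pixels at '_'; a..c and
 * d..f are the corresponding pixels on the lines above and below. */
#include <stdlib.h>

int is_static(int a, int b, int d, int e, int x, int y, int D)
{
    return abs(a - b) < D && abs(d - e) < D && abs(x - y) < D;
}

/* Weave the previous field's pixel when static; otherwise interpolate
 * from the current field's lines above and below. */
int synthesize(int a, int b, int d, int e, int x, int y,
               int above, int below, int D)
{
    return is_static(a, b, d, e, x, y, D) ? x : (above + below) / 2;
}
```

A single-field change such as the moving pole will trip one of the line-above/line-below comparisons, forcing interpolation where the naive x-versus-y test alone would have woven a stale pixel.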
6-14-2003: DGBob(): A MetricX Spinoff

I needed a little break from Decomb 5, so I decided to tackle an area that has frustrated me for some time: performing high-quality smart bobbing (converting fields to frames and doubling the frame rate). It is not hard to write a basic smart bob filter. I wrote the first one for VirtualDub many years ago. But writing a smart bob that can mitigate the effects of flutter is not so easy. Gunnar Thalin was a pioneer in this area with his popular SmoothDeinterlacer() for VirtualDub (later ported to Avisynth by 'Xesdeeni'). SmoothDeinterlacer() successfully achieves its goal of significantly reducing the flutter and shimmering that often result from bobbing.

SmoothDeinterlacer(), however, arguably suffers from two problems. First, it is very slow. Second, it can take some time before areas are detected as static and anti-flutter mitigation put into play. So I am trying to write a new smart bob for Avisynth that is much faster and engages the anti-flutter mitigation faster. The challenge for engaging anti-flutter mitigation is to know when a pixel position is static. When we are creating a missing pixel (when creating a frame from a field we have to create the missing lines), if the pixel is static we can use the pixel from the previous field, rather than having to interpolate it somehow from the current field. We can look at the corresponding pixels in the previous and following fields, and if they are close enough, we can (with fingers crossed) assume that the pixel is static. This doesn't always work, because sometimes the pixel can differ validly for one field period. To reduce artifacts from that, I added a test for the two previous corresponding pixels rather than just one. This reduces the artifacts but does not eliminate them completely. Fortunately, they occur only for fast motion, and the eye doesn't notice them.
The artifacts of DGBob() are comparable to or less than those of SmoothDeinterlacer() in this regard. For pixels that are not static, we potentially have to deinterlace them. I used the new metric that I described earlier and called 'MetricX'. Initial results with DGBob() are encouraging. It performs much faster than SmoothDeinterlacer() and appears to give comparable results. But there are some wrinkles to be worked out. Following is a frame grab showing what DGBob() can produce. Compare it to Figure 1 in the MetricX description below (6-6-2003). To see the anti-flutter mitigation, you'll need to play a clip, because a single frame grab shows only one interpolated field, and flutter results from the alternation of fields.

DGBob() Frame from Field

6-10-2003: Scene Changes Revisited

When I talked about bad edits (aka "scene changes"), I did not approach it in a fully complete and rigorous way. So I need to revisit bad edit handling. Suppose we have a field sequence like this:

... a [x y] d ...

I list below the possibilities for the frame [x y], where b and c are fields different from a and d. After each combination, I give the appropriate match result, where 'C' means match current, 'N' means match next, and 'P' means match previous. 'WEIRD' means the fields are out of order and thus the combination is highly unlikely unless the stream is seriously malformed. 'NO MATCH' means there is no successful match. Finally, I give the number of orphaned fields for each combination (orphaned fields imply more perverse bad edits). Note that preference is always given to matching current.
a [a a] d => C         0
a [a b] d => P         1
a [a c] d => P         1
a [a d] d => N or P    0   <***>
a [b a] d => WEIRD
a [b b] d => C         0
a [b c] d => NO MATCH
a [b d] d => N         1
a [c a] d => WEIRD
a [c b] d => NO MATCH
a [c c] d => C         0
a [c d] d => N         1
a [d a] d => WEIRD
a [d b] d => WEIRD
a [d c] d => WEIRD
a [d d] d => C         0

If we adopt a two-way matching strategy, i.e., one where we consider the current match plus either the forward (N) or backward (P) match, it is clear from the enumerated combinations that there is no reason to prefer either current-plus-next or current-plus-previous matching. Either way, we would miss two good matching combinations.

Let's now consider the more radical three-way matching strategy, where we consider the backward, current, and forward matches. It appears to be a good idea, because it successfully returns a progressive frame for all combinations where there is a progressive frame to be returned. But appearances can be deceiving, we are told, and there is no exception to the rule here. Consider the combination marked with <***>. Here we have a simple edit cut between the fields of a frame. Other than the clean edit cuts on frame boundaries, this is the cleanest cut, because it leaves no orphaned fields. So we might expect it to be more common than the other, more perverse cuts. Our problem now is that this relatively clean cut puts us in a serious dilemma. Do we return the forward match or the backward match? We cannot compare the metrics meaningfully, because there are no fields in common (a/a versus d/d). It appears to be an arbitrary decision. But if it is made arbitrarily, we can randomly drop or duplicate frames! Here's an example to show why:

a [a b] [b c] [c d] d

Suppose for the first bracketed frame we match backward, and for the second we match forward. We have lost frame b/b! The reader can confirm through analysis that extra frame duplicates can also be created.
Our conclusion is that three-way matching can result in random frame deletion and duplication when cuts are made between the fields of frames. Experience shows that such cuts are common. The resulting jerkiness, or juddering, can be visible, depending upon the clip type. For animations, where there is sporadic motion and duplicated frames in the source, the effect is disguised and often invisible. For smooth motion in normal video, such as pans, it might be quite annoying. And of course, the extent of the problem depends on how frequently this type of cut occurs in the clip.

We should exhaust all options to make three-way matching work, because it succeeds in more cases (all of them!) than does two-way matching. Maybe there is a solution to the juddering problem. Suppose we try to make the decision in the <***> case non-arbitrary. One way is to examine the two matches for combing. If both are combed or both are clean, then always take the forward match. If one is combed and the other is not, take the uncombed one. It seems that this strategy will avoid juddering while allowing the three-way match to be used. This idea suggests the following modified two-way matching strategy to achieve the same effect. Perform forward two-way matching and, if the returned frame is combed, test the backward match. If it is not combed, use it; if it is combed, use current. This is the strategy currently implemented in the new Decomb. I would welcome new ideas about all of this from my readers. But I cannot think of anything better to do.

6-6-2003: A New Combing Metric for Decomb

The last release of Decomb contained a new metric for deciding whether a frame is combed. Initial feedback indicates that it is an improvement. Let's take a look at it in detail. First, here is a frame from one of my deinterlacing "torture test" clips. The figures in the foreground are completely still while the (out-of-focus) hand moves in the background.
Let's forget everything we know, or think we know, about deinterlacing. Let's ask Joe Six-Pack for his opinion. How do we get rid of the ugly combing? Well, Joe says, you can pass through every other line and just worry about the rest. For the rest, we have to consider each pixel on the line. Consider one such pixel B with pixel A above it and pixel C below it. Now it's obvious, Joe continues, that if pixel B is combed, then it must be either lighter than both A and C, or darker than both A and C. So just test that and declare the pixel combed if it is true. Well now, that certainly cuts to the chase, doesn't it? We can test it on our Figure 1 above quite easily. Using Joe's test and setting combed pixels to full white yields the following map of combed pixels according to Joe's algorithm:

Joe's algorithm certainly picked up the combed areas very well. But ouch, look what it did to the rest of the frame! We need to suppress all that noise; otherwise we will have trouble distinguishing it from real combing when we use our window technique previously described. But Joe says don't worry, it's easy to suppress that noise! (I'm beginning to develop greater respect for beer drinkers.) Joe goes on to say that we can't even perceive contrast differences below a certain threshold T, say about 7, so just adjust the test to allow for this. Here is the equation he gives, explaining that if R is true then the pixel is combed:

R = (B+T < A && B+T < C) || (B-T > A && B-T > C);

Well, there's nothing to lose; let's try it. Here is the result:

Hey Joe! That's not too shabby. But Joe's not done. He says he has to run off to the bar, but he'll give us one more trick before he goes. He says a lot of the remaining noise in the combing map is isolated pixels. You can suppress those by requiring a map pixel to have another map pixel to the immediate right or left. He says not to try that in the vertical direction because it creates noticeable artifacts. Hmmm.
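Joe's full recipe, thresholded test plus horizontal isolated-pixel suppression, can be written as a short C routine that builds the combing map for a grayscale frame. The function and buffer names are mine; the test itself is exactly the R equation above with T = 7.

```c
/* Sketch of Joe's combing map: thresholded comb test plus horizontal
 * isolated-pixel suppression. src is a W x H grayscale frame; map gets
 * 1 where a pixel is flagged as combed, 0 elsewhere. */
#include <stdlib.h>
#include <string.h>

static int combed(const unsigned char *src, int W, int x, int y)
{
    int A = src[(y - 1) * W + x];  /* pixel above */
    int B = src[y * W + x];        /* pixel under test */
    int C = src[(y + 1) * W + x];  /* pixel below */
    const int T = 7;               /* perceptual contrast threshold */
    return (B + T < A && B + T < C) || (B - T > A && B - T > C);
}

void comb_map(const unsigned char *src, unsigned char *map, int W, int H)
{
    unsigned char *raw = calloc((size_t)W * H, 1);
    memset(map, 0, (size_t)W * H);
    for (int y = 1; y < H - 1; y++)
        for (int x = 0; x < W; x++)
            raw[y * W + x] = (unsigned char)combed(src, W, x, y);
    /* Joe's last trick: keep a flagged pixel only if its immediate
     * left or right neighbor is also flagged. */
    for (int y = 1; y < H - 1; y++)
        for (int x = 0; x < W; x++) {
            int left  = x > 0     && raw[y * W + x - 1];
            int right = x < W - 1 && raw[y * W + x + 1];
            map[y * W + x] = (unsigned char)(raw[y * W + x] && (left || right));
        }
    free(raw);
}
```

Running a fixed window over this map and counting flagged pixels against a threshold is then all that is needed to declare a combed frame, as described next.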
Anyway, he flashes this image and disappears:

The metric that produces Figure 4 looks very useful! We could just run a fixed window over it and look for a threshold number of combed pixels to declare a combed frame. And this is just what Decomb now does. The variable T is called vthresh. Ah, dear reader, I see you gesticulating wildly. What's that you say? "Use this map to actually perform the deinterlacing as well!" It's certainly a possibility. Let's suppose that we simply do a blend on each of the pixels detected as combed in Figure 4, while passing through all the other pixels directly from the source image:

Nothing wrong with that. We could improve it with some edge-directed interpolation, something I've been meaning to add to Decomb for a while. And it may even give us a solution to the Marching Ants problem. So there you are: a great new combed frame detector and a potentially great general-purpose deinterlacer. See y'all later. I've got some work to do!

6-5-2003: New Beta

I just finished up and released a major new beta of New Generation Decomb. It replaces the combed frame detection metric with a better one, adds vthresh override capability, and displays the pattern guidance mismatch metric to allow tweaking of gthresh. Initial feedback is encouraging. Be sure to read the tutorial to understand the new vthresh handling. After that burst of energy I need a rest before resuming my exposition here. When I do resume, I will describe the new combing metric and then talk about pattern guidance.

6-1-2003: A Digression: The Holy Grail of Hybrid Rendering

We've all run into those nasty clips that contain a mix of 3:2 pulldown and straight video content. Fans of Star Trek know all about how hard such clips are to render satisfactorily! If you leave them at 30fps while deinterlacing the video sections, then the video sections look fine but the film sections look jerky.
On the other hand, decimating the clip to 24fps leaves the film sections fine but makes the video jerky. This unfortunate dilemma has led some people to go to ridiculous extremes. For example, people have created clips at 120fps, because 120 has both 30 and 24 as factors. Decomb's Decimate filter provides a special mode (mode=3) for decimating hybrid clips. It improves matters but is not miraculous. A recent email from Kevin Atkinson contained an idea that got me thinking it might be possible to do better than Decimate mode=3. So, I have run some experiments. Download the clips that I reference below (right click and Save Target As...) and play them off your hard disk in BSPlayer or any other decent media player.

The first clip is the raw 30fps video source, deinterlaced of course (I do recommend Fox News for your viewing pleasure). As expected, this clip is smooth as silk, especially the scrolling banner at the bottom. Let's try converting it to 24fps by simply tossing every fifth frame:

24fps clip made by tossing every fifth frame

Seems a mite, shall we say, jerky! Yes, indeed. No big surprise, though. Let's try Decimate(mode=3):

24fps clip made by Decimate(mode=3)

Better, isn't it? The video looks OK, but the scrolling banner still looks rather dodgy. The movement seems spatially smoother, but there is an annoying strobing effect. Also, if we single step, we'll see that some of the frames are now blends of two original frames. Some purists lose their lunch over that. OK, the stage has been set. All ears are a-twitter. What magic is coming our way? Is it the holy grail of hybrid rendering? Not quite. But to my eyes it is an improvement. There are no hideous blended frames now, and the movement, while not perfectly smooth, seems controlled and consistent. It almost looks like it is intended to be as it is, sort of a stock ticker effect. But how is it done?
Here is a cycle before decimation:

[a b] [c d] [e f] [g h] [i j]

And here is a field-decimated cycle, intended to spread the decimation more evenly through the cycle:

[a b] [d e] [f g] [i j]

Looks good, except that the frames d/e and f/g are now unusable, because top fields will be displayed as bottom fields and vice-versa. Fortunately, it's not a deal killer, because we can resample the fields to put them back in the right spatial positions (compare VirtualDub's "field bob" filter). Doing that creates the "new method" clip. Is it an improvement that should be incorporated into Decomb? I await the opinion of my readers before deciding. But I believe it is.

5-30-2003 [2]: Fixes and "A Scene Change is as Good as a Rest"

I've repaired the big bugs in the new Decomb and fixed scene change handling as earlier described, and I now feel comfortable in making a general beta available. Get it here: Decomb 5.0.0 beta 5. Please be sure to carefully read the tutorial because, as you might imagine, a lot of things have changed from Decomb classic.

Let's talk about scene changes a little more. "Scene change" is actually a misnomer when you think about it, but it's the terminology people use, so we will continue to use it as well. Think about it, though; every time a new picture comes along, that is a "scene change". The change can be bigger or smaller, but there is no reason why any change in picture in and of itself should cause a matching problem. What causes the problem is bad edits. Of course, they typically occur when editors cut or insert clips based on different scenes of the movie, so there is a coincidental correspondence between scene changes and problems resulting from bad edits. But if a scene change is made with a good edit, there is no problem for field matching. That is why not all scene changes experience problems.
Let's consider a typical PAL scene change done with a good edit (we assume top field first in all the following examples, with top fields over bottom fields):

a a b b
a a b b

This clip causes no problems whatsoever, and the field matcher continues to match "current" right through it without skipping a beat. But consider this edit:

a a a b b
a a b b b

On the third frame we have to match backwards, but that is not a problem and no combed frames are emitted. What if our editor is really drunk and makes this edit, leaving an orphaned field:

a a b c c
a a c c c

On the third frame, we have to match to "next" to avoid emitting a combed frame. That is why I made the modification to Decomb earlier described. In PAL, the only way to get a match failure is to make a double bad edit leaving two orphaned fields:

a a b d d
a a c d d

There is no good match for the third frame. Can we assume no editor would ever do this? Probably. It's almost perverse. I have seen one instance of it in my entire life. But still, there is an even more compelling reason to avoid adding special handling for this, although we easily could (we could deliver a/a or d/d). The frame b/c can be viewed as the insertion of a video frame. It's a very small video section! Recall that we need to pass video sections through to the deinterlacer. So for those reasons, we do not attempt to handle this special case. The worst that will happen is that one frame in a zillion will get deinterlaced when we could theoretically deliver a duplicate good frame instead.

That's PAL. Similar arguments apply to NTSC 3:2 pulldown. But there is a single cut that will produce the match failure (a portion has been cut out, leaving the orphaned fields e and k adjacent):

a a b c d e l m m n o
a b c c d k l m n o o

The combined frame e/k has no good match. Again, however, I have seen only a few of these in my life, and the resulting frame looks like a video frame. Good editing practice of course avoids silly things like this.
I suppose one could argue that if there is just one video frame in a row, then treat it as a failed match due to a really bad edit. But until one of my users actually complains about it and submits a source clip of original material showing multiple instances, I think it will be fine just to deinterlace in the rare cases in which it occurs. So sayeth I.

5-30-2003: New Generation Decomb Initial Results

I completed a first beta of New Generation Decomb and released it to my favorite testers at Doom9 (I will release a beta here shortly). Initial results are encouraging. A major bug was identified rather quickly and was squashed, although I haven't released the fixed beta yet, as I want to add a new feature to address another deficiency: the handling of scene changes. In Decomb classic, Telecide() performed a three-way match. Therefore, at some scene changes, where there is no available match to the previous frame, Telecide() could match to the following frame and output a non-combed frame. But now, because the new Telecide() performs only a two-way match for speed and better matching reliability, it emits a combed frame at some scene changes. This is a degradation of performance versus Decomb classic and something we cannot accept. Yet we do not want to restore three-way matching. What to do? I think a good compromise solution is this: when a two-way matched frame is declared interlaced, Telecide() tries the third match. If it is an improvement, it is used. This requires postprocessing to be enabled, but postprocessing is now so fast that I can't imagine anyone choosing to run without it. The best part of this compromise is that it is free: the metrics for the forward match have already been calculated as the backward match for the following frame! Having a metrics cache means that we have to calculate metrics for a frame only once; it doesn't matter when we do it.
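As a sketch of this compromise, the following Python fragment models the decision. The names, the threshold value, and the metric numbers are all illustrative, not Decomb's actual API; the point is only the control flow: two-way match first, forward match only as a free fallback.

```python
COMBED_THRESHOLD = 30  # hypothetical postprocessing threshold

def choose_match(back_metric, current_metric, forward_metric):
    """Two-way match first; fall back to the forward match only when the
    two-way winner would still be declared combed. The forward metric is
    free: it was already computed as the next frame's backward metric."""
    best = min(("back", back_metric), ("current", current_metric),
               key=lambda m: m[1])
    if best[1] > COMBED_THRESHOLD and forward_metric < best[1]:
        return ("forward", forward_metric)
    return best

# A bad scene change: both two-way candidates are combed, forward is clean.
print(choose_match(back_metric=80, current_metric=75, forward_metric=5))
# → ('forward', 5)
```

Note that in the common case (the two-way winner is below the threshold) the forward metric is never even consulted, so the usual two-way speed is preserved.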
I'll explain more about that when I discuss pattern guidance, because it relies heavily on the metrics cache. So, I will implement this solution and then release a wider beta. After that, I hope to start expounding on pattern guidance, which is a very interesting problem.

5-27-2003: Postprocessing At a Discount

We've established an effective field comparison algorithm that allows us to correctly match the fields for each frame. Now we require a means to decide whether a matched frame is still combed, as it might be for several well-known reasons (noise, too much vertical detail at the wrong spatial frequency, missing fields, hybrid clips, hybrid frames, blended fields, poor edits, etc.). Combed frames need to be deinterlaced. Let's take the direct approach. No pussyfooting around. We already have a nice metric that we calculated for the best field match. Why don't we simply threshold that? We'll say that metrics above a certain value indicate an interlaced frame, i.e., one in which the fields are not from the same picture. If we can live with that, then we will have made the postprocessing decision totally for free! Unfortunately, we cannot live with that. The problem is that the metric is too blunt. We already know that a certain amount of pseudo-difference remains due to the spatial offset between the fields. We have to set the threshold above that floor. Furthermore, the floor level varies depending on the contents of the frame. Now consider a small mouth that comes out combed. The metric contribution from the very small mouth is swamped by the residual pseudo-difference floor, such that we cannot reliably distinguish between expected variation of the floor and a small area of interlacing. There is also the problem that the metrics will depend on the size of the frames. It would be preferable for the user to have a normalized metric, so that "threshold=40" always means the same thing.
I'm sure that my dear readers have already conceived the solution. We simply calculate difference metrics for small windows on the frame. Every window (say 16x16 pixels) has an independent difference metric calculated. Then we select the window with the highest difference metric and use that as our metric for deciding if the frame is combed. This solves our two problems neatly. First, the contribution of a small area is magnified. Second, the metric is automatically normalized because the window size is fixed. We will still need to threshold the metric to make the final decision, but it is not problematic or sensitive, thanks to the windowing. Readers who have studied the source code are thinking, "Old generation Decomb windowed the combed frame detection, so what is the big deal?" The big deal is that in old generation Decomb, the frame was first sampled and calculated for the field matching and then, if postprocessing was enabled, it was sampled and calculated again. Why? It is a legacy of the fact that FieldDeinterlace() was written before Telecide() had integrated postprocessing, and FieldDeinterlace()'s code was absorbed into Telecide() uncritically. Clearly, as long as the subsampling works for both matching and combed frame detection, the two should share the sampling and calculations. New generation Decomb does this, resulting in a major speedup of postprocessing. Some practical details are still dangling. How do we efficiently calculate the metrics per window? That is actually trivial. We declare an array of accumulators, one for each window. Then, after we calculate the contribution of a sampled pixel, we accumulate it into the overall sum for matching purposes and into the appropriate window sum for combed frame detection purposes. After all the pixels have been processed, we scan the window metrics and pick the highest one. Of course, we still have to deinterlace the frames that we have decided are combed.
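The per-window accumulation just described can be sketched in a few lines of Python. This is illustrative only: `diffs` stands in for the per-pixel difference contributions of the real kernel, and the real filter subsamples rather than visiting every pixel.

```python
WINDOW = 16  # window size in pixels, as suggested above

def frame_metrics(diffs, width, height):
    """diffs[y][x] is the per-pixel difference contribution. Returns the
    overall sum (for field matching) and the maximum window sum (for
    combed-frame detection), both from a single pass over the samples."""
    wx = (width + WINDOW - 1) // WINDOW
    wy = (height + WINDOW - 1) // WINDOW
    windows = [[0] * wx for _ in range(wy)]
    total = 0
    for y in range(height):
        for x in range(width):
            d = diffs[y][x]
            total += d                              # matching metric
            windows[y // WINDOW][x // WINDOW] += d  # detection metric
    return total, max(max(row) for row in windows)

# A 64x64 frame with a pseudo-difference floor of 1 per pixel, plus one
# small combed area (think of the small mouth) of 16 strong pixels.
diffs = [[1] * 64 for _ in range(64)]
for y in range(20, 24):
    for x in range(16, 20):
        diffs[y][x] = 50
total, worst = frame_metrics(diffs, 64, 64)
print(total, worst)  # → 4880 1040
```

The small combed patch nearly vanishes in the whole-frame sum (4880 against a floor of 4096) but dominates its window metric (1040 against a per-window floor of 256), which is exactly the magnification effect described above.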
We cannot re-use our calculations for that because a) deinterlacing requires full sampling, and b) the deinterlacing combing detection algorithm differs from the frame differencing algorithm. In our favor, however, is the fact that relatively few frames need to be deinterlaced when we are processing progressive material. The main processing hit has been eliminated by re-using the field matching calculations for combed frame detection. Old generation Decomb was capable of automatically adapting to the field order of the clip. This is a mixed blessing. While it is slightly easier for the user, it makes the filter slower and, worse, it makes extra spurious bad field matches possible. New generation Decomb therefore requires the user to specify the field order of the clip. Fortunately, there is a simple, quick, and reliable way to determine the field order. I believe most users would trade off the few minutes required to determine the field order against several hours of saved processing time and a diminished frequency of bad field matches. In fact, old generation Decomb had the 'mm' option to suppress the spurious match test and thereby achieve the same result. Next time, we will look at the major revamp of pattern guidance in new generation Decomb. It supports full random timeline navigation and is not dependent upon a history of previous decisions, yet it performs better than old generation Decomb's pattern guidance. Think about how that can be the case. Here's a status report on new generation Decomb: all coding is complete except for YV12 support, the new reference manual is complete, and a user manual is in progress. When that is done, I will release an alpha version. While that is out, I will complete the YV12 support and fix any bugs that are found.

5-26-2003: Field Differencing

Recall that we were considering a field sequence ...a [b c]... where b and c make up a frame.
We concluded that the frame is progressive if and only if a and b are from the same picture (temporal moment) or b and c are from the same picture. Our problem is how to reliably determine if two fields are from the same picture. The direct approach is to simply subtract one field from the other, hoping that the spatial offset is not too serious. We can get an idea of how that works using Avisynth:

a = SeparateFields(clip)
Subtract(a, Trim(a, 1, 0))

The differences are centered on a middle gray level by Subtract() to show positive and negative differences. Following is a typical progressive frame. We hope that the picture will be totally flat, featureless gray. But alas, we are disappointed; there is a lot of strong detail.
The spatial offset results in detail that looks like movement between the fields, making us think that this frame is not progressive. What to do? Maybe it will help if we resample the fields so as to move one up a bit and the other down, so that we are comparing the corresponding spatial points. The following image uses VirtualDub's built-in field bob filter to align the fields. Then the Avisynth Subtract() operation is applied as before.
Obviously it's much better, but you can still make out significant detail, implying a field difference. Is there anything else we can do? We can blur the bobbed fields before the Subtract operation, as in the following image.
Now we're getting somewhere! For comparison, the following image shows the same bob/blur/subtract operation applied to fields from two different pictures. In this case we hope and expect to see a big difference.
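As an aside, the bob/blur/subtract idea can be sketched in a few lines of Python on a single pixel column. This is a toy 1-D model with made-up numbers, not the filter's actual code; it just shows why the pipeline flattens same-picture fields while preserving the difference between fields from different pictures.

```python
def blur(xs):
    """Simple 3-tap vertical average, clamped at the ends."""
    return [(xs[max(i - 1, 0)] + xs[i] + xs[min(i + 1, len(xs) - 1)]) / 3
            for i in range(len(xs))]

def field_difference(column):
    top = column[0::2]     # even lines
    bottom = column[1::2]  # odd lines
    # "Bob": average adjacent top-field lines to estimate the picture at
    # the bottom field's spatial positions.
    bobbed = [(top[i] + top[min(i + 1, len(top) - 1)]) / 2
              for i in range(len(bottom))]
    a, b = blur(bobbed), blur(bottom)
    return sum(abs(x - y) for x, y in zip(a, b))

# A progressive column: a smooth vertical ramp, both fields from the same
# picture, so the bob/blur/subtract result is nearly zero.
ramp = list(range(0, 32, 2))
# A "video" column: the bottom field comes from a different picture.
video = [v if i % 2 == 0 else v + 40 for i, v in enumerate(ramp)]
print(field_difference(ramp), field_difference(video))
```

Without the bob step, the raw line-to-line subtraction of the ramp would report a large spurious difference; with it, the progressive column scores near zero while the video column still scores high.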
Comparison of these last two images gives us confidence that, with the help of a threshold, we can reliably distinguish between progressive and video frames. In fact, new generation Decomb uses this approach for both field matching and video frame detection. When we try to match field b to either field c or field a in the example ...a [b c]..., we simply calculate the two field differences and pick the pairing that produces the smallest field difference. The beauty of this is that the most expensive part of postprocessing, determining if a frame is video, will come almost for free from the field matching operation! Let us examine the details of how Decomb (assume new generation unless specifically qualified from now on) performs these two tasks. First consider the field matching. Clearly, performing the bobbing and blurring as described above is computationally expensive. We need an algorithm that approximates it but can be computed efficiently. Decomb subsamples the frame to select pixels to examine. Suppose we denote one of these sampled pixels as C. The pixels above and below C then are as follows:

a
b
C
d
e

Pixels a, C, and e are from the top field, and b and d are from the bottom field. Now we calculate:

difference = abs((a+C+e)/3 - (b+d)/2)

In practice we use a trick to avoid the division by 3. It is clear that the bobbing effect is achieved by the choice of pixels and the blurring is achieved by the averaging. We now accumulate all the differences calculated for all our sample points, and the result is our difference metric for the field comparison. Finally, returning to the matching scenario ...a [b c]..., we get the field difference metric for [b a] and for [b c], and we select the pairing that has the lowest field difference metric. That is how Decomb performs its field matching. There are some refinements, but that is the essence of it.
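A Python sketch of this sampled kernel and the pairing decision follows. The field values are made up for illustration, and unlike the real filter this version visits every interior position and keeps the plain floating-point formula rather than the integer trick that avoids the division by 3.

```python
def kernel_diff(a, b, C, d, e):
    """a, C, e from one field; b, d from the other, vertically adjacent."""
    return abs((a + C + e) / 3 - (b + d) / 2)

def field_metric(top, bottom):
    """Accumulate the kernel over every interior sample position, treating
    top[i] as line 2i and bottom[i] as line 2i+1 of the assembled frame."""
    total = 0.0
    for i in range(1, len(top) - 1):
        # a=top[i-1], b=bottom[i-1], C=top[i], d=bottom[i], e=top[i+1]
        total += kernel_diff(top[i - 1], bottom[i - 1],
                             top[i], bottom[i], top[i + 1])
    return total

# Matching scenario ...a [b c]...: compare field b against both neighbors
# and keep the pairing with the lower metric.
b_field = [10, 20, 30, 40]
c_field = [11, 21, 31, 41]   # same picture as b: small metric
a_field = [80, 10, 90, 15]   # different picture: large metric
m_bc = field_metric(b_field, c_field)
m_ba = field_metric(b_field, a_field)
print("match to", "c" if m_bc < m_ba else "a")  # → match to c
```

The choice of a, C, e versus b, d provides the bob (the two averages sit at the same spatial position), and the averaging itself provides the blur.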
In my next journal entry I will describe how the progressive/video decision is derived from this field matching calculation and how it brings us postprocessing at a discount. Think about it! It's not as simple as you may think, because you'll see that the obvious approach fails.

5-23-2003: In the beginning

In the beginning man created film. And it was progressive. And man said, Let there be interlacing: and there was interlacing. And man saw the interlacing, that it was good: and man divided the even fields from the odd. If God had been making these decisions, we'd be a lot better off. Before I transition into an explanation of this parable, I want to alert readers to the fact that I have added two items to the Decomb deficiency list adduced below on 5-21. Now for the transition... This fundamental yin-yang of video, progressive versus interlaced, is the first consideration we need to pay attention to when deciding how to render a clip. And we need to decide on a per-frame basis, not a per-clip one, because hybrid material is ubiquitous. An interlaced video frame (one that cannot be field-matched back to a progressive source frame) must be passed through the field restoration process as is, and then it must be deinterlaced. This need to weed out video frames before applying field reconstruction and pattern guidance means that it is the first problem we need to solve. The problem statement is simple: how do we examine a frame and determine whether it is video as defined above, or whether it is actually a good progressive frame, or can result in a good progressive frame if appropriately matched? Several solutions suggest themselves. We might take the tack that if it is not progressive film, it must be video. So we'll look for the 3:2 pulldown pattern, and if we don't see it, we have video. But that is unacceptable, because we can have progressive frames in the absence of any pattern, e.g., 24fps or 30fps progressive. Being out of pattern is not the same as being video.
We are left with the idea of comparing the fields of the frame. Consider this sequence, where the top field is first (each letter denotes a field; a field and the field below it make up a frame, and a different letter denotes a new picture):

top:    a c e g i ...
bottom: b d f h j ...

Here we have frames of interlaced video. Now consider this:

top:    a b c d e ...
bottom: a b c d e ...

Here we have simple progressive frames. Consider this:

top:    . a b c d e ...
bottom: a b c d e . ...

Here we have progressive frames, but with a one-field phase shift.

top:    a a b c d ...
bottom: a b c c d ...

Here we have 3:2 pulldown. Considering all of the above patterns, what is the key that allows us to look at a given frame of the video sequence and say, "that is a video frame"? We cannot simply reply that if the frame is combed it is video, because the phase-shift case and the 3:2 case both have combed frames, yet they are not video, because the progressive original frames can be recovered through field matching. Consider the frame c/d in the video case. We have the temporal field sequence ...x [c d] y... If c does not match to d, it must match to x; otherwise it is video. Consideration of the cases above shows that this is a valid conclusion. But comparing c to d and c to x is just the same comparison strategy we use to attempt to match fields! This will be a useful coincidence for performance reasons. For now, we just note that our algorithm for determining if a frame is video is as follows:

1. Attempt to field-match the frame.
2. Select the best match.
3. If the two fields of the best match are not from the same picture, i.e., they differ, then the frame is video.

That makes it sound easy, but it isn't. Sadly. The problem is that it is not straightforward to compare fields to see if they are the same picture. The reason is not just that there may be noise. We can allow for that. The problem is that the two fields are spatially offset from one another.
Even though they may come from the same picture, they differ because one field has the even lines and the other has the odd lines. We need a differencing method that will not be confused by this spatial offset. And it must also not be confused by a progressive picture that has vertical detail whose spatial frequency would spoof simple line-to-line comparison. It is not an easy problem. In my next journal entry, I will describe my solution to this problem. The solution is not the same as the one currently implemented in Telecide, and it is an improvement. It has other highly useful consequences too, as you will see. For now, it will be instructive for you to think about this problem and its possible solutions.
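The decision rule stated in this entry can be sketched as a few lines of Python. The names and the threshold value are hypothetical, and the metric itself is the field differencing developed in the later (5-26 and after) entries above.

```python
VIDEO_THRESHOLD = 15  # hypothetical threshold on the field difference

def classify(prev_diff, curr_diff):
    """prev_diff: metric for matching field c to the previous field x.
    curr_diff: metric for the frame's own pairing [c d].
    If even the best available pairing still differs strongly, the two
    fields are from different pictures and the frame is video."""
    if min(prev_diff, curr_diff) > VIDEO_THRESHOLD:
        return "video"
    return "matched previous" if prev_diff < curr_diff else "matched current"

print(classify(prev_diff=3, curr_diff=40))   # → matched previous
print(classify(prev_diff=50, curr_diff=60))  # → video
```

Note that the classification falls out of metrics the field matcher has to compute anyway, which is the "useful coincidence for performance reasons" mentioned above.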
5-21-2003: New Generation Decomb

My Decomb filter package for Avisynth has become probably my most popular and often-used tool. It is now mature and stable, and has accumulated useful features that make it a very flexible solution capable of meeting many diverse needs in desktop video. Certainly it has some features that were unique when introduced, such as the idea of using blind field matching for progressive frame restoration as one operation, and decimation as a separate, decoupled operation, such that combining the two results in an inverse telecine (IVTC) operation. The decoupling of the functions allowed Decomb to address more application domains than an integrated tool would have done. Another example of an innovative feature is the idea of letting the deinterlacer decide whether a frame is combed before applying adaptive deinterlacing to it. This allows the deinterlacer to be applied as a postprocessor to a progressive frame restoration process: only the frames emerging still combed from the restoration process are touched; good progressive frames are not degraded. FieldDeinterlace() is still the only deinterlacer that offers this functionality. A final example is Decimate()'s special modes for hybrid material. While this functionality is still not the holy grail of hybrid rendering that we all seek, it offers an advance over previous practice. Despite having important features that commend it, Decomb has some deficiencies. Living as I do by the philosophy "we can improve anything!", I'm motivated to look squarely at the deficiencies and think about what can be done to improve matters. Here I am amply and ably helped by my users, who are not shy to tell me what they don't like! So what is there not to like about Decomb? Following are the major deficiencies that make me sleep poorly for having designed and implemented such things (I have named the problems so that I can make specific references to the deficiencies later).
There are surely other problems, but I don't lose sleep over them.
Y'all come back, now, and I'll tell you all about it.