Fields: Why Video Is Crucially Different from Graphics

By Chris Pirazzi. Information provided by folks throughout the company.

The major video signals used in the world today are field-based, not frame based. Whenever you deal with video, it is absolutely crucial that you understand a few basic facts about fields. Correctly dealing with fields in software is tricky; it is fundamentally different than dealing with plain ol' graphics images. This document explains many of these basic concepts.

Note that the information here applies to any video signal format that has two interlaced fields per frame, including all of the major video signal formats which SGI machines deal with: NTSC, PAL, and 525- and 625-line Rec. 601 digital video (often incorrectly referred to as "D1").

Important: this document will give you a general understanding of the programming issues brought out by field-based video. But before you can go and write some code, it is also crucial that you understand the basic terms used to describe fields in our documentation and our library APIs; see "Important! Crucial Terminology Relating to Fields" near the end of this document.

What the Heck is a Field?

You probably know that a field, for the signal formats listed above, is an image that contains only half of the lines you would need to make a complete picture. Many computer types think of fields as simply a weird way to lay out the lines of a picture in memory. They are more than that. A field is a set of image data all of which was sampled at the same instant of time.* Each field in a video sequence is sampled at a different time, determined by the video signal's field rate. This temporal difference between all fields, not just fields of different frames, is what makes dealing with fields so tricky.
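
To make the point concrete, here is a minimal sketch of what a field really is as a data structure. The struct and its members are purely illustrative (this is not any SGI API); the thing to notice is that the sample time belongs to the field, not to any "frame":

    /* Illustrative only; not an SGI library type. */
    typedef struct Field {
        double         sample_time;  /* when this field was sampled, in seconds         */
        int            is_f1;        /* which of the two sets of picture lines it holds */
        int            width;        /* pixels per line                                 */
        int            num_lines;    /* lines in this field: half of a frame's lines    */
        unsigned char *pixels;       /* width * num_lines samples                       */
    } Field;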

An illustration. Pretend you have a film camera that can take 60 pictures per second. Say you use that camera to take a picture of a ball whizzing by the field of view. Here are 10 pictures from that sequence:

The time delay between each picture is a 60th of a second, so this sequence lasts 1/6th of a second.

Now say you take a modern* NTSC video camera and shoot the same sequence. We all know that NTSC video is 60 fields a second, so you might think that the video camera would record the same as the above. This is incorrect. The video camera does record 60 images per second, but each image consists of only half of the scanlines of the complete picture at a given time, like this:

Note that the odd-numbered images contain one set of lines, and the even-numbered images contain the other set of lines (if you can't see this, try bringing up snoop or mag). The data captured by the video camera does not look like this:

and it does not look like this:

The fields captured are all temporally different. The harsh reality of video is that in any video sequence, you are missing half of the spatial information for every temporal instant. This is what we mean when we say "video is not frames." In fact, the notion of video as "frames" is something we computer people made up so as not to go insane---but sooner or later, we have to face the fact...

Why Do I Care?

Why is this an issue for the kinds of software one writes on SGI?

Say you want to take a video sequence which you have recorded (perhaps as uncompressed, or perhaps as JPEG-compressed data) and you want to show a still frame of this sequence. Well, a still frame would require a complete set of spatial information at a single instant of time---the data is simply not available to do a still frame correctly. So one thing that much of our software does today to deal with this problem (often without knowledge of the real issue at hand) is to choose two adjacent fields and grab one set of lines from each. This technique has the rather ugly problem shown below:

No matter which pair of fields you choose, the resulting still frame looks quite bad. This artifact, known as "tearing" or "fingering," is an inevitable consequence of putting together an image from bits of images snapped at different times. You wouldn't notice the artifact if the fields whizzed past your eye at field rate, but as soon as you try to do a freeze frame, the effect is highly visible and bad. You also wouldn't notice the artifact if the objects being captured were not moving between fields.
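
In code, this field-pairing approach is simply an interleave of two temporally adjacent field buffers. A minimal sketch, assuming 8-bit pixels and that field a supplies the top picture line (which field that actually is depends on the signal; see the terminology section near the end of this document):

    #include <string.h>

    /* Interleave two temporally adjacent fields into one frame buffer.
       This produces the "tearing"/"fingering" artifact shown above
       whenever anything moved between the two fields' sample times. */
    void weave_fields(const unsigned char *a,   /* width x (height/2), top line    */
                      const unsigned char *b,   /* width x (height/2), second line */
                      unsigned char *frame,     /* width x height                  */
                      int width, int height)
    {
        int line;
        for (line = 0; line < height; line++) {
            const unsigned char *field = (line % 2 == 0) ? a : b;
            memcpy(frame + line * width, field + (line / 2) * width, width);
        }
    }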

There's another thing about these fingering artifacts which we've often ignored in our software---they are terrible for most compressors. If you are making still frames so that you can pass frame-sized images on to a compressor, you definitely want to avoid tearing at all costs. The compressor will waste lots of bits trying to encode the high-frequency information in the tearing artifacts and fewer bits encoding your actual picture. Depending on what size and quality of compressed image you will end up with, you might even consider just sending every other field (perhaps decimated horizontally) to the compressor, rather than trying to create frames that will compress well.
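
As an example, here is a minimal sketch of that kind of horizontal decimation, assuming 8-bit pixels and simple pixel-pair averaging (a real system might well use a better filter):

    /* Halve a field's width by averaging horizontal pixel pairs. Since a
       field already has half the lines of a frame, this gives an image
       with roughly the original aspect ratio to hand to a compressor. */
    void decimate_field_horizontally(const unsigned char *field, /* width x num_lines     */
                                     unsigned char *out,         /* (width/2) x num_lines */
                                     int width, int num_lines)
    {
        int y, x;
        for (y = 0; y < num_lines; y++)
            for (x = 0; x < width / 2; x++)
                out[y * (width / 2) + x] = (unsigned char)
                    ((field[y * width + 2 * x] +
                      field[y * width + 2 * x + 1] + 1) / 2);
    }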

Another possible technique for producing still-frames is to choose some field and double the lines in that field:

As you can see, this looks a little better, but there is an obvious loss of spatial resolution (i.e., there are now lots of jaggies and vertical blockiness visible). To some extent, this can be reduced by interpolating adjacent lines in one field to get the lines of the other field:

But there is also a more subtle problem with any technique that uses one field only, which we'll see later.

There is an endless variety of more elaborate tricks you can use to come up with good still frames, all of which come under the heading of "de-interlacing methods." Some of these tricks attempt to use data from both fields in areas of the image that are not moving (so you get high spatial resolution), and double or interpolate lines of one field in areas of the image that are moving (so you get high temporal resolution). Many of the tricks take more than two fields as input. Since the data is simply not available to produce a spatially complete picture for one instant, there is no perfect solution. But depending on why you want the still frame, the extra effort may well be worth it.
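
As a sketch of the two simplest single-field techniques described above, line doubling and line interpolation, again assuming 8-bit pixels and a field that occupies the even picture lines:

    #include <string.h>

    /* Build a full frame from ONE field, either by repeating each field
       line (line doubling) or by averaging adjacent field lines to guess
       the missing lines (interpolation). Both give full temporal
       resolution but only half the vertical detail. */
    void deinterlace_one_field(const unsigned char *field, /* width x (height/2) */
                               unsigned char *frame,       /* width x height     */
                               int width, int height, int interpolate)
    {
        int fl, x;
        int field_lines = height / 2;

        for (fl = 0; fl < field_lines; fl++) {
            /* Copy the field line onto its own picture line... */
            memcpy(frame + (2 * fl) * width, field + fl * width, width);

            /* ...then synthesize the picture line below it. */
            if (!interpolate || fl == field_lines - 1) {
                memcpy(frame + (2 * fl + 1) * width, field + fl * width, width);
            } else {
                for (x = 0; x < width; x++)
                    frame[(2 * fl + 1) * width + x] = (unsigned char)
                        ((field[fl * width + x] +
                          field[(fl + 1) * width + x]) / 2);
            }
        }
    }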

You Mean It Matters for Output Too?

Yup, afraid so. When a CRT-based television monitor displays interlaced video, it doesn't flash one frame at a time on the screen. During each field time (each 50th or 60th of a second), the CRT lights up the phosphors of the lines of that field only. Then, in the next field interval, the CRT lights up the phosphors belonging to the lines of the other field. So, for example, at the instant when a pixel on a given picture line is refreshed, the pixels just above and below that pixel have not been refreshed for a 50th or 60th of a second, and will not be refreshed for another 50th or 60th of a second.

So if that's true, then how come video images don't flicker hideously or jump up and down as alternate fields are refreshed?

This is partially explained by the persistence of the phosphors on the screen. Once refreshed, the lines of a given field start to fade out slowly, and so the monitor is still emitting some light from those lines when the lines of the other field are being refreshed. The lack of flicker is also partially explained by a similar persistence in your visual system.

Unfortunately though, these are not the only factors. Much of the reason why you do not perceive flicker on a video screen is that good-looking video signals themselves have built-in characteristics that reduce the visibility of flicker. It is important to understand these characteristics, because when you synthesize images on a computer or process digitized images, you must produce an image that also has these characteristics. An image which looks good on a non-interlaced computer monitor can easily look abysmal on an interlaced video monitor.

Disclaimer: a complete understanding of when flicker is likely to be perceivable and how to get rid of it requires an in-depth analysis of the properties of the phosphors of a particular monitor (not only their persistence but also their size, overlap, and average viewing distance); it requires more knowledge of the human visual system; and it may also require an in-depth analysis of the source of the video (for example, the persistence, size, and overlap of the CCD elements used in the camera, the shape of the camera's aperture, etc.). This description is only intended to give a general sense of the issues.

Disclaimer 2: standard analog video (NTSC and PAL) is fraught with design "features" (bandwidth limitations, etc.) which can introduce artifacts similar to the ones we are describing here into the final result of video output from a computer. These artifacts are beyond the scope of this document, but they are also important to consider when creating data to be converted to an analog video signal. An example would be antialiasing (blurring!) data in a computer to avoid chroma aliasing when the data is converted to analog video.

Here are some of the major gotchas to worry about when creating data for video output:

Abrupt Vertical Transitions: One-Pixel-High Lines

First of all, typical video images do not have abrupt vertical changes. For example, say you output an image that is entirely black except for a single one-pixel-high line in the middle.

Since the non-black data is contained on only one line, it will appear in only one field. A video monitor will update the image of the line only 25 or 30 times a second, and it will flicker on and off quite visibly. To see this on a video-capable machine, run "videoout," turn off the anti-flicker-filter, and point videoout's screen window at the image above.

You do not have to have a long line for this effect to be visible: thin, non-antialiased text exhibits the same objectionable flicker.

Typical video images are more vertically blurry; even where there is a sharp vertical transition (the bottom of an object in sharp focus, for example), the method typical cameras use to capture the image will cause the transition to blur over more than one line. It is often necessary to simulate this blurring when creating synthetic images for video.

Abrupt Vertical Transitions: Two-Pixel-High Lines

So you might think one solution would be never to output single-pixel-high lines. Ok, how about changing the image above so that it has a two-pixel-high line?

These lines would include data in both fields, so part of the line is updated each 50th or 60th of a second. Unfortunately, when you actually look at the image of this line on a video monitor, the line appears to be solid in time, but it appears to jump up and down as the top and bottom lines alternate between being brighter and darker. You can also see this with the "videoout" program.

Flicker Filter

The severity of both of these effects depends greatly on the monitor and its properties, but you can pretty much assume that someone will find them objectionable. One partial solution is to vertically blur the data you are outputting. Turning on the "flicker filter" option to videoout will cause some boards (such as ev1) to vertically prefilter the screen image by a simple 3-tap (1/4,1/2,1/4) filter. This noticeably improves (but does not remove) the flickering effect.
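
Here is a minimal sketch of that kind of 3-tap (1/4, 1/2, 1/4) vertical filter applied to an 8-bit image in memory; this shows the filtering idea only, not the ev1 hardware implementation:

    /* Vertically low-pass filter an image with a (1/4, 1/2, 1/4) kernel,
       clamping at the top and bottom edges. This trades away some
       vertical sharpness in exchange for less flicker on an interlaced
       monitor. in and out are width x height, one byte per pixel. */
    void flicker_filter(const unsigned char *in, unsigned char *out,
                        int width, int height)
    {
        int y, x;
        for (y = 0; y < height; y++) {
            int above = (y == 0) ? 0 : y - 1;
            int below = (y == height - 1) ? height - 1 : y + 1;
            for (x = 0; x < width; x++)
                out[y * width + x] = (unsigned char)
                    ((in[above * width + x] +
                      2 * in[y * width + x] +
                      in[below * width + x] + 2) / 4);
        }
    }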

There is no particular magic method that will produce flicker-free video. The more you understand about the display devices you care about, and about when the human vision system perceives flicker and when it does not, the better a job you can do at producing a good image.

Synthetic Imagery Must Also Consist of Fields

When you modify digitized video data or synthesize new video data, the result must consist of fields with all the same properties--temporally offset and spatially disjoint. This may not be trivial to implement in a typical renderer without wasting lots of rendering resources (rendering 50/60 images a second, throwing out unneeded lines in each field) unless the developer has fields in mind from the start.

You might think that you could generate synthetic video by taking the output of a frame-based renderer at 25/30 frames per second and pulling two fields out of each frame image. This will not work well: the motion in the resulting sequence on an interlaced video monitor will noticeably stutter, due to the fact that the two fields are scanned out at different times, yet represent an image from a single time. Your renderer must know that it is rendering 50/60 temporally distinct images per second.
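
To make the timing explicit, here is a sketch of the wasteful approach mentioned above: render a full image at every field time and keep only that field's lines. render_scene() and output_field() are hypothetical stand-ins, and the resolution and 60-field rate are assumptions:

    #include <string.h>

    #define WIDTH      640
    #define HEIGHT     480
    #define FIELD_RATE 60.0

    extern void render_scene(double t, unsigned char *frame);           /* hypothetical */
    extern void output_field(const unsigned char *field, int width,
                             int num_lines);                            /* hypothetical */

    /* Wasteful but temporally correct: render 60 distinct images per
       second and throw away the lines that do not belong to each field. */
    void render_interlaced(int num_fields)
    {
        static unsigned char frame[WIDTH * HEIGHT];
        static unsigned char field[WIDTH * (HEIGHT / 2)];
        int f, line;

        for (f = 0; f < num_fields; f++) {
            render_scene(f / FIELD_RATE, frame);     /* each field gets its own time */
            for (line = f % 2; line < HEIGHT; line += 2)
                memcpy(field + (line / 2) * WIDTH, frame + line * WIDTH, WIDTH);
            output_field(field, WIDTH, HEIGHT / 2);
        }
    }

A field-aware renderer would instead render only the lines each field needs, at that field's time, and skip the wasted work.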

Playing Back "Slow," or Synthesizing Dropped Fields

Two tasks which are relatively easy to do with frame-based data, such as movies, are playing slowly (by outputting some frames more than once) and dealing with frames that are missing in the input stream (by duplicating previous frames). Certainly there are more elaborate ways to generate better-looking results in these cases, and they too are not so hard on frame-based data.

When fields enter the picture, things get ugly. Say you are playing a video sequence, and run up against a missing field (the issues we are discussing also come up when you wish to play back video slowly). You wish to keep the playback rate of the video sequence constant, so you have to put some video data in that slot:

Which field do you choose? Say you choose to duplicate the previous field, field 2:

You could also try duplicating field 4 or interpolating between 2 and 4. But with all of these methods there is a crucial problem: those fields contain data from a different spatial location than the missing field. If you viewed the resulting video, you would immediately notice that the image visually jumps up and down at this point. This is a large-scale version of the same problem that made the two-pixel-high line jump up and down: your eye is very good at picking up on the vertical "motion" caused by an image being drawn to the lines of one field, then being drawn again one picture line higher, into the lines of the other field. Note that you would see this even if the ball was not in motion.

Ok, so you respond to this by instead choosing to fill in the missing field with the last non-missing field that occupies the same spatial locations:

Now you have a more obvious problem: you are displaying the images temporally out of order. The ball appears to fly down, fly up again for a bit, and then fly down. Clearly, this method is no good for video which contains motion. But for video containing little or no motion, it would work pretty well, and would not suffer the up-and-down jittering of the above approach.

Which of these two methods is best thus depends on the video being used. For general-purpose video where motion is common, you'd be better off using the first technique, the "temporally correct" technique. For certain situations such as computer screen capture or video footage of still scenes, however, you can often get guarantees that the underlying image is not changing, and the second technique, the "spatially correct" technique, is a win.
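
In code the two choices look nearly identical, which is part of why the wrong one is so easy to pick without thinking about it. A sketch, assuming the fields of the sequence are stored in order as equal-sized buffers:

    #include <string.h>

    /* fields[i] points to the pixels of field i (all fields the same size).
       Field n is missing and must be filled with something. */
    void fill_missing_field(unsigned char **fields, size_t field_bytes, int n,
                            int spatially_correct)
    {
        if (spatially_correct)
            /* Same set of picture lines, but a whole frame older: any
               motion appears to run backwards for an instant. */
            memcpy(fields[n], fields[n - 2], field_bytes);
        else
            /* Closest in time, but the wrong set of picture lines: the
               whole image appears to jump up or down. */
            memcpy(fields[n], fields[n - 1], field_bytes);
    }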

As with de-interlacing methods, there are tons of more elaborate methods for interpolating fields which use more of the input data. For example, you could interpolate fields 2 and 4 and then interpolate the result of that vertically to guess at the content of the other field's lines. Depending on the situation, these techniques may or may not be worth the effort.

Still Frames on Video Output

By this point you've probably guessed that the problem of getting a good still frame from a video input has a counterpart in video output. Say you have a digitized video sequence and you wish to pause playback of the sequence. Either you, the video driver, or the video hardware must continue to output video fields even though the data stream has stopped, so which fields do you output?

If you choose the "temporally correct" method and repeatedly output one field (effectively giving you the "line-doubled" look described above), then you get an image with reduced vertical resolution. But you also get another problem: as soon as you pause, the image appears to jump up or down, because your eye picks up on an image being drawn into the lines of one field, and then being drawn one picture line higher or lower, into the lines of another field. Depending on the monitor and other factors, the paused image may appear to jump up and down constantly or it may only appear to jump when you enter and exit pause.

If you choose the "spatially correct" method and repeatedly output a pair of fields, then if there happened to be any motion at the instant where you paused, you will see that motion happening back and forth, 60 times a second. This can be very distracting.

There are, of course, more elaborate heuristics that can be used to produce good looking pauses. For example, vertically interpolating an F1 to make an F2 or vice versa works well for slow-motion, pause, and vari-speed play. In addition, it can be combined with inter-field interpolation for "super slow-mo" effects.

How Do I Show Video on the Graphics Screen?

Another permutation we haven't talked about is this: say you have some video coming into memory, and you want to show it on the graphics screen (a monitor function for capturing, for example).

The simplest method is to use the VL to capture already-interleaved frames, and display each frame on the screen at 25/30 frames per second using lrectwrite() or glDrawPixels(). Displaying In-Memory Video Using OpenGL provides some tips and code samples for this method.
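
For reference, here is a minimal sketch of that simple method using OpenGL with GLUT; the synthetic test pattern stands in for a captured, interleaved frame, and nothing here is SGI-specific. Note that glDrawPixels() draws rows bottom-up, so real captured frames (which normally arrive top line first) need their rows flipped, or need to be drawn with glPixelZoom(1, -1) and the raster position at the top of the window:

    /* Draw one interleaved frame with glDrawPixels().
       Compile with something like: cc frameview.c -lglut -lGLU -lGL */
    #include <GL/glut.h>

    #define W 640
    #define H 480

    static unsigned char frame[H][W][3];   /* stand-in for captured video */

    static void make_test_pattern(void)
    {
        int y, x;
        for (y = 0; y < H; y++)
            for (x = 0; x < W; x++) {
                /* Shade the two fields' lines differently so the line
                   structure is visible; row 0 is the BOTTOM row here. */
                unsigned char v = (y % 2) ? 200 : 100;
                frame[y][x][0] = frame[y][x][1] = frame[y][x][2] = v;
            }
    }

    static void display(void)
    {
        glClear(GL_COLOR_BUFFER_BIT);
        glPixelStorei(GL_UNPACK_ALIGNMENT, 1);
        glRasterPos2f(-1.0f, -1.0f);    /* bottom-left corner of the window */
        glDrawPixels(W, H, GL_RGB, GL_UNSIGNED_BYTE, frame);
        glutSwapBuffers();
    }

    int main(int argc, char **argv)
    {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
        glutInitWindowSize(W, H);
        glutCreateWindow("interleaved frame");
        make_test_pattern();
        glutDisplayFunc(display);
        glutMainLoop();
        return 0;
    }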

While this looks okay, it does not look like a video monitor does. A video monitor is "interlaced." It scans across the entire screen, refreshing one field at a time, every 50th or 60th of a second. A typical graphics monitor is "progressive scan." It scans across the entire screen, refreshing every line of the picture, generally 50, 60, 72, or 76 times a second. Because graphics monitors are designed to refresh more often, their phosphors have a much shorter persistence than those of a video monitor.

If you viewed a video monitor in slow motion, you'd see a two-part pattern repeating 25 or 30 times a second: you'd see one field's lines light up brightly while the other field is fading out, then a 50th or a 60th of a second later, you'd see the other field's lines light up brightly while the first field's lines were fading out, as seen in this diagram:*

On a computer monitor running at 50 or 60 Hz, using the simple frame-based technique described above, you'd see a full-screen pattern repeating 50 or 60 times a second. The entire video image (the lines from both fields) lights up and fades out uniformly, as in:*

These differences in the slow-motion view can lead to noticeable differences when viewed at full-rate. Some applications demand that preview on the graphics screen look as much like the actual view on a monitor as possible, including (especially) the jitter effects associated with using fields incorrectly. Customers want to avoid having to buy an external monitor to verify whether or not their images will look ok on interlaced video.

Making video on a graphics monitor look like video is no easy task. Essentially, you have to create some software or hardware which will simulate the light which a video monitor would emit using the pixels of a graphics monitor.

So far, SGI has come up with two solutions to this problem:

It's Not Even That Simple *

This document has attempted to disabuse you of the notion that you can treat field pairs as frames, citing the temporal difference between fields as the main cause for concern. Well, it turns out that in some cases, the temporal reality of video is even more harsh.

The Reality of Cameras

It's now clear that when you capture a scene using a video camera, the fields you capture are temporally distinct. One question which we have not addressed until now is: how about the individual lines of data within a field---do they represent samplings of the scene from the same instant of time? How about the individual pieces of data along a given line of a given field?

The answer to this depends on the kind of camera. Modern cameras use CCD arrays, which produce a field of data by sampling the light incident on a grid of sensors (throughout a certain exposure time) simultaneously. Therefore all of the pixels of a field are coincident: each pixel is a sampling of the light from a particular part of the scene during the same, finite period of time.

Older tube-based cameras (which were distinguished by crusty old names like vidicon and plumbicon) would sample a scene by scanning through that scene in much the same way a video monitor scans across its tube. Therefore, in the fields created by such a camera, none of the data is coincident! Instead of capturing the crispy images which we presented to you above:

A tube camera would capture an image more like this:

Tube cameras are dinosaurs and are being replaced by CCD-based cameras everywhere. But it is still quite possible that you'd run into one or possibly even be asked to write software to deal with video data from one.

The Reality of Monitors

A similar split exists in monitors. There are array-based display devices which change the state of all the pixels on the screen simultaneously or all the pixels on a given line simultaneously, and there are tube-based display devices whose electron beams take a whole field time to scan each line across the screen (from left-to-right then top-to-bottom). Obviously, tube-based display devices are by far still in the majority.

When considering questions like how to photograph or videotape a computer monitor using a camera, this harsh reality can come into play.

However, because most of the flickering effects in interlaced video are due to local phenomena (i.e., the appearance of data on adjacent picture lines), and because the temporal difference between samples on adjacent picture lines is so close to the field period, it is often the case that you don't have to worry about this harsh reality.

Other Video Signal Formats

We should also mention that the four major video signal formats, while popular, are only a tiny part of the gamut of video signal formats. Other video formats, many of which are used for graphics monitors, have only one field per frame (often the term field is not used at all in these cases), which is called "non-interlaced" or "progressive scan." Sometimes, video signals have fields, but the fields are not temporally different. Instead, the fields each contain the information for one color basis vector (R, G, and B for example); these signals are called "field sequential." Basically, if you can imagine it, somebody has implemented it, and InfiniteReality will probably generate it.

Important! Crucial Terminology Relating to Fields

This document has so far handily avoided giving a name to one field versus another. If you are going to write any software at all or talk to anyone else on a matter of fields, it is crucial that you know the exact definition of the different terms used to refer to fields. You will see terms like F1/F2, dominant/non-dominant, even/odd, each of which has a meaning that is different in a subtle but important way. Many bugs have been introduced into SGI software by people who were not clear on these definitions, or who were, but assumed a different definition of these terms than other people did.

It's worth your while to check out those definitions and save yourself some headaches.

More Fun With Fields: 3:2 Pulldown

This section is a bit of an aside, but since 3:2 pulldown seems to come up in any discussion of fields, it's worth defining here.

3:2 pulldown is a method of going between photographic film images at 24 frames per second and interlaced video images at 60 fields per second. It does not apply to 50-field-per-second signals. The method consists of a sequence which repeats every 4 film frames and 5 video frames (this chart assumes F1 dominance):

3:2 Pulldown

Film Frames:   [  frame A ][     frame B    ][  frame C ][     frame D    ]
Video Fields:  F1    F2    F1    F2    F1    F2    F1    F2    F1    F2
Video Frames:  [  frame 1 ][  frame 2 ][  frame 3 ][  frame 4 ][  frame 5 ]

This chart tells you which film image to use in order to produce each video field. The resulting video will then contain many fields which are duplicates of other fields in the sequence. It is often very useful to tag the video stream with information indicating which video fields are redundant, so that agents which operate on that data, such as compressors or video processors, can avoid wasted effort.
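
Here is a small sketch of that mapping in code, using 0-based field indices and the cadence in the chart above (frame A supplies two fields, B three, C two, D three); a real implementation would also need to know where in the stream the cadence starts:

    /* Which 24 frame/sec film frame supplies a given 60 field/sec video
       field, and whether that field merely repeats an earlier one. */
    static const int film_of_field[10]   = { 0, 0, 1, 1, 1, 2, 2, 3, 3, 3 };
    static const int field_is_repeat[10] = { 0, 0, 0, 0, 1, 0, 0, 0, 0, 1 };

    int film_frame_for_field(int field_index)      /* 0-based video field count */
    {
        return (field_index / 10) * 4 + film_of_field[field_index % 10];
    }

    int video_field_is_redundant(int field_index)  /* a candidate for tagging */
    {
        return field_is_repeat[field_index % 10];
    }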

The lurkers guess that it's called 3:2 pulldown because the pattern of fields you get contains sequences of 3 fields followed by 2. Or perhaps it's called that because 3 of the 5 video frames do not end up coinciding with the start of a film frame and 2 do.