Hands on with high-touch encoding: Streaming Media All-Stars Redo

As Expression Encoder 2 approaches its imminent release, I've been using it for more and more real-world projects. This recent one was particularly chewy fun, and I thought it would make a good tutorial for a high-touch workflow.

As you may remember from a few weeks ago, I was one of the inaugural class of Streaming Media's Streaming Media All-Stars. There was a fun video montage of all of us on baseball cards being announced by ballpark-style narration. Good stuff, but the FLV compression wasn't quite up to my standards for this rare intersection of compression obsession and personal vanity. So I contacted Streaming Media and asked if I could take my own whack at it.

I'll have an expanded version of this post as an article in an upcoming issue of Streaming Media Magazine. If you don't get it, you can sign up for a free subscription.


The Source

One thing I noticed in the original is that the background graphics and a few of the animations were interlaced, as you can see in the last "before" image at the very bottom of the page.

While deinterlacing it may have been possible, the heavyweight motion-adaptive deinterlacers available for technologies like AVISynth can be finicky to configure, and extremely slow. And in the end, nothing beats getting the source fixed in the first place. Compression is the art of getting output that's as close to the original as possible with the bits you have available; often getting access to higher quality sources can provide a much bigger improvement to final quality than all the codec tweaking in the world.

So, I contacted the post house, and they fixed the background interlacing (it was just a matter of properly flagging the source as interlaced in After Effects) and re-rendered it for me as a lossless RGB PNG codec QuickTime .mov file. However, there were two shots that snuck through where one layer was still interlaced. I didn't want to wait for another disc, so I dived into After Effects (in the end, all difficult preprocessing jobs seem to wind up in After Effects). I used the "Reduce Interlace Flicker" filter with a softness of 1 to blend the two fields together. Traditional deinterlace methods messed up the text on the cards too much. However, the softness increase from that filter wound up causing a slight visual discontinuity when it kicked in. So, I broke out the two shots with interlacing into layers, and then used a five-frame cross-dissolve transition from the original progressive frames to the start of the interlaced shot, which hid the slight loss of focus (masked in part by the motion). Both interlaced shots ended on a hard cut, so I was able to switch back to the original video without a transition.
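To make the blend-and-dissolve idea concrete, here is a minimal Python sketch. It is not what After Effects does internally; `blend_fields` and `cross_dissolve` are hypothetical helpers that assume a frame is a plain list of rows of luma values. The first averages each line with its neighbor so the two fields merge, and the second ramps from the untouched frame to the softened one over five frames.

```python
def blend_fields(frame):
    """Merge the two interlaced fields by averaging each line with its neighbor.

    Assumption: `frame` is a list of rows, each row a list of numeric samples.
    """
    height = len(frame)
    blended = []
    for y in range(height):
        # Use the next line, or the previous one on the last line.
        neighbor = frame[y + 1] if y + 1 < height else frame[y - 1]
        blended.append([(a + b) / 2 for a, b in zip(frame[y], neighbor)])
    return blended


def cross_dissolve(original, softened, step, length=5):
    """Mix from `original` toward `softened` on frame `step` of a `length`-frame dissolve."""
    return [
        [o + (s - o) * (step + 1) / (length + 1) for o, s in zip(row_o, row_s)]
        for row_o, row_s in zip(original, softened)
    ]


# A fully "combed" test frame: alternating lines from two different fields.
combed = [[0, 0], [100, 100], [0, 0], [100, 100]]
softened = blend_fields(combed)
first_dissolve_frame = cross_dissolve(combed, softened, step=0)
```

The blend trades the comb pattern for a softer double image, which is why the dissolve is there to ease the viewer into it.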

I then rendered the new version out from After Effects in 32-bit float (to reduce the risk of introducing banding via an 8-bit to 8-bit conversion) into the Lagarith codec in YV12 mode, which uses the native 8-bit 4:2:0 colorspace of VC-1 and other codecs. This means that Expression Encoder doesn't need to do any color space conversion, making compression slightly faster.



The other notable issue with the original clip was "keyframe popping": an obvious "jump" in the video that recurs at the keyframe interval. Watch the original FLV, and you'll see it during any of the longer static shots. Since the whole section with the cards is one single long shot over 3 minutes long without any hard cuts, there wasn't a place for natural keyframes (automatically inserted at a hard cut) to go. Thus keyframe transitions would happen while the cards were otherwise static, making even a slight change visible.

I also wanted to show off the Expression Encoder templates a bit by doing thumbnail navigation. In EEv2, I'm able to graphically set markers on particular frames, and set them to be keyframes and/or thumbnails. A thumbnail becomes an image file which, with the supported templates, automatically gets included in the menus for navigation (think a chapter on a DVD). Normally you also want to make the chapter points keyframes, since keyframes support immediate random access, as no other frames need to be decoded before displaying a keyframe.

This was an opportunity to kill two birds with one stone; if I set the markers on the first static frame of every card, each would be a nice high-quality image that all the later frames referencing that I-frame can be based on, propagating its quality forward. If I set my keyframe spacing long enough, there wouldn't be any other keyframes in that interval to cause keyframe popping, and so the static card would be very consistent.
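The "long enough" test is easy to sketch: find the longest gap between consecutive markers and make sure the keyframe interval is at least that long. The Python below is only an illustration; the marker frame numbers are made up, and `longest_marker_gap_seconds` is a hypothetical helper, not anything Expression Encoder exposes.

```python
def longest_marker_gap_seconds(marker_frames, fps):
    """Return the longest time between consecutive keyframe markers, in seconds."""
    ordered = sorted(marker_frames)
    longest_gap_frames = max(b - a for a, b in zip(ordered, ordered[1:]))
    return longest_gap_frames / fps


def interval_is_long_enough(marker_frames, fps, keyframe_interval_seconds):
    """True if no automatic keyframe would be forced between two markers."""
    return longest_marker_gap_seconds(marker_frames, fps) <= keyframe_interval_seconds


# Hypothetical marker positions (frame numbers), NOT the real ones from this clip.
example_markers = [360, 795, 1242, 1650]
ok = interval_is_long_enough(example_markers, fps=29.97, keyframe_interval_seconds=15)
```

If the check fails, either the keyframe interval needs to grow or an extra non-thumbnail keyframe marker needs to go somewhere less visible.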

So, I set a marker for each person, flagged to be both a thumbnail and a keyframe. The audio doesn't always sync up exactly so that the person's name begins after their card is down, so sometimes the first name is cut off. This would have been easy to fix by just delaying the audio a second.

You can also use non-thumbnail keyframe markers; these become keyframes without showing up in navigation. I stuck a few of those in as well in the intro/outro sections, on the first full frames after the logo gets built. Since the sponsor pays the bills (Ripcode in this case), I always want to make sure that logos remain nice and crisp.

Setting keyframes has been around in compression projects for ages now; I did a lot of this in Premiere 4.0 for Cinepak encodes in the pre-Media Cleaner days, since Cinepak was prone to keyframe popping issues. Modern codecs like VC-1 do a much better job of finding good natural keyframes, and of reducing popping issues. The Silverlight version would have looked a lot better than the Flash even if I hadn't set them, but they did give a further boost in quality. But don't think this is something you should be doing in every case; this clip is unusual in having minutes without cuts with a mix of static and moving elements, at an extremely low bitrate.


Encoding Settings

Now, what encoding settings do we want to use?


Video

  • Frame Rate: Source. We want to capture all the motion in the source perfectly (29.97 frames per second in this case).
  • Keyframe Interval: 15 seconds. The longest gap between markers in the source is a hair less than 15 seconds, so this will prevent keyframe popping between cards.
  • Profile: VC-1 Advanced Profile. So we can use DQuant, as discussed below.
  • Mode: VBR Peak Constrained. This is a progressive download project, so VBR Peak Constrained gives us optimum quality.
  • Bitrate (average): 488 Kbps. Matching the original FLV's actual bitrate (400 Kbps was requested, but VP6 overshot by over 20%).
  • Peak bitrate: 896 Kbps. So video + audio + overhead (9 Kbps in this case) max bitrate is a consumer broadband friendly 1000 Kbps total.
  • Peak Buffer Size: 15 seconds. So the buffer duration can contain an entire Group of Pictures (a keyframe and frames that reference it).
  • Width and Height: 640x480. Same as source. The original project had both 320x240 and 640x480, but they used the same data rate, so I'm doing just the 640x480 and Silverlight embed can be set to the desired size.


Audio

  • Codec: WMA. We're targeting Silverlight 1.0 compatibility, so WMA Pro isn't an option.
  • Mode: VBR. Always better quality for progressive download.
  • Bitrate: 48 Kbps. Matching the data rate of the FLV source. Also, this is the minimum bitrate for WMA VBR. I always try to use at least 48 Kbps for WMA progressive for that reason; it's a massive quality jump from 32 Kbps CBR for typical content.
  • Sample Rate: 44.1 kHz. Same as source. Also, 44.1 is the native audio rendering mode for Silverlight, and so offers the same quality and better performance versus 48 kHz.
  • Bits per Sample: 16. The only option for WMA.
  • Channels: Stereo. VBR audio requires stereo. I'd use mono if I needed to do 32 Kbps, since there isn't stereo separation important to the experience here.
  • Audio peak bitrate: 96 Kbps. Again, so total peak comes out as 1000 Kbps. The audio isn't that difficult or variable, so higher likely wouldn't sound any different.
  • Audio peak buffer size: 1.5 seconds. The default is almost always fine.
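The bitrate figures above are just a budget, and it can help to see the arithmetic written out. This is a rough Python sketch using only the numbers listed; the helper names are mine, and "kilobytes" here assumes 1 Kbps means 1000 bits per second.

```python
VIDEO_AVG_KBPS = 488
VIDEO_PEAK_KBPS = 896
AUDIO_PEAK_KBPS = 96
OVERHEAD_KBPS = 9
VIDEO_PEAK_BUFFER_SECONDS = 15


def total_peak_kbps(video_peak, audio_peak, overhead):
    """Worst-case combined data rate the viewer's connection has to sustain."""
    return video_peak + audio_peak + overhead


def overshoot_percent(requested_kbps, actual_kbps):
    """How far the actual bitrate landed above the requested one, in percent."""
    return (actual_kbps - requested_kbps) * 100 / requested_kbps


def buffer_kilobytes(peak_kbps, seconds):
    """Data held by a buffer of `seconds` at the peak rate (assumes 8 bits per byte)."""
    return peak_kbps * seconds / 8


peak_total = total_peak_kbps(VIDEO_PEAK_KBPS, AUDIO_PEAK_KBPS, OVERHEAD_KBPS)
vp6_overshoot = overshoot_percent(400, VIDEO_AVG_KBPS)
video_buffer_kb = buffer_kilobytes(VIDEO_PEAK_KBPS, VIDEO_PEAK_BUFFER_SECONDS)
```

With the figures exactly as listed, the peak total works out to 1001 Kbps, which is the "1000 Kbps" target to within rounding, and the VP6 overshoot comes out at 22%.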



Advanced Codec Settings

  • Video Complexity: Best (5). It's a short clip at a reasonable frame size. Complexity 3 probably would have been just as good, but the encode only takes about 12 minutes at 5, so I didn't bother doing anything less than the max (I love my new 8-core workstation!).

Perceptual Optimizations

  • Adaptive Dead Zone: Conservative. The normal default. It softens out edges that might ring or get too blocky, but not by too much. I tried both Off and Aggressive, and Conservative definitely looked the best, as usual.
  • DQuant: I-Frames Only. There aren't many I-frames (mainly the few dozen we set manually, and perhaps a few more natural ones), but they contain the important visual data of the faces on the cards, so we want them to be as high quality as possible. DQuant spends too many bits on smooth parts of the image to use on every frame, but upping the bitrates on a few dozen I-frames won't hurt quality much, and improves the quality of the static parts of the card we wind up staring at for those many seconds.


  • In-Loop: On. Always on unless using Simple Profile; it helps reduce artifacts and improve quality, particularly at these aggressive bitrates.
  • Overlap: On. Further hides artifacts, which are a challenge with motion graphics at such a low bitrate.
  • Denoise: Off. The source doesn't have a hint of noise. If there were a lot of textures, Denoise could help to soften them some for easier encoding, but there's not much texture either.
  • Noise Edge Removal: Off. This is really only useful for noisy edges of analog captures, and even then we're better off cropping. It obviously doesn't apply here.

Group of Pictures

  • B-Frame Number: 2. Normally we use 1 for film and video sources, but for this kind of motion graphics, 2 is more efficient. This gives us an IBBP pattern, so each B-frame is adjacent to an I- or P-frame (only I and P frames can be reference frames). Using 3 B-frames is less efficient in this case, probably because with the IBBBP pattern the middle B-frame is two frames away from a reference frame, and the P-frames are farther apart and so require more bits to store the change over four frames instead of three since the previous I or P frame. Using 2 also gives us better random access than 1, since worst-case random access time is based on the maximum number of P-frames between I-frames. With 15 seconds between keyframes at 30 fps, that gives us 450 frames per GOP maximum. With 2 B-frames, that'll give us 149 P-frames per GOP, the same (and thus the same random access) as if we had a 5 second GOP without B-frames (the old Windows Media Encoder default).
  • Scene Change Detection: On. This will give us natural keyframes where we need them. The codec seems to do a good job of putting them in the right place. I've never changed this in EE.
  • Adaptive GOP: On. Always have this on.
  • Closed GOP: Off. This is required to be on for CBR encodes in EE, but slightly reduces quality with VBR encoding. In particular, it can increase keyframe popping. An Open GOP pattern starts with B-frames before the first keyframe/I-frame, so you get BBIBBPBBP..., with the leading B-frames able to reference the last P-frame of the previous GOP. This helps smooth over changes between GOPs, since you have the leading B-frame(s) to spread the change over.
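The P-frame counts in the B-Frame Number item above are simple arithmetic, sketched here in Python. The function names are hypothetical, and the model assumes a fixed pattern of one reference frame followed by a constant number of B-frames, which ignores whatever Adaptive GOP decides to do in practice.

```python
def gop_pattern(b_frames, frame_count):
    """Frame types for the first `frame_count` frames of a GOP, e.g. IBBPBBP."""
    period = b_frames + 1
    return "".join(
        "I" if i == 0 else ("P" if i % period == 0 else "B")
        for i in range(frame_count)
    )


def p_frames_per_gop(gop_seconds, fps, b_frames):
    """Maximum number of P-frames between two I-frames for a fixed B-frame count."""
    total_frames = round(gop_seconds * fps)
    reference_frames = total_frames // (b_frames + 1)  # the I-frame plus all P-frames
    return reference_frames - 1


two_b = p_frames_per_gop(15, 30, b_frames=2)   # the settings used here
one_b = p_frames_per_gop(15, 30, b_frames=1)   # same interval, 1 B-frame
old_default = p_frames_per_gop(5, 30, b_frames=0)  # 5 second GOP, no B-frames
```

Under these assumptions, two B-frames and a 15 second interval give 149 P-frames, the same as a 5 second GOP with no B-frames, while a single B-frame at 15 seconds would give 224.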

Motion Estimation

  • Chroma Search: Full True Chroma. Motion graphics is a canonical case where we want chroma search. The encode is so fast, there's no reason to not go for the full meal deal and do Full True Chroma.
  • Match Method: SAD. For this kind of content with very simple, flat areas, the Sum of Absolute Differences Motion Match is actually both higher quality and faster than either Hadamard or my normal video/film default of Adaptive.
  • Search Range: Adaptive. The smallest range works for most of the frames, but there's some very fast motion when the cards zoom in which needs the bigger range. Adaptive it is.
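For readers who haven't met SAD before, here is a toy Python version of a Sum of Absolute Differences block search. It is an illustration of the metric only, not how the VC-1 encoder implements it; the function names and the tiny test frame are made up, and a real encoder works on larger blocks with much smarter search patterns.

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract(frame, top, left, size):
    """Cut a size x size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_offset(reference, block, top, left, search_range):
    """Exhaustively search `reference` around (top, left); return (cost, dy, dx)."""
    size = len(block)
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > len(reference) or x + size > len(reference[0]):
                continue
            cost = sad(extract(reference, y, x, size), block)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Hypothetical 6x6 reference frame with one distinctive 2x2 patch at row 1, column 2.
reference = [[0] * 6 for _ in range(6)]
reference[1][2], reference[1][3] = 10, 20
reference[2][2], reference[2][3] = 30, 40

# In the current frame, the same patch has moved one pixel down and right.
match = best_offset(reference, [[10, 20], [30, 40]], top=2, left=3, search_range=2)
```

The search range is what bounds `dy` and `dx`, which is why fast zooms need a bigger one: the matching block can simply be farther away than a small window allows.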




The Output pane has some of my favorite usability features of Expression Encoder, letting us apply rich templates and automatic publishing.

First, the Template. I picked the "Clean" template, which has a nice subtle overlay control, and a popup navigation via the thumbnails we made above when you mouse over the top of the window. It also supports going full screen with a double-click. One thing I like about Clean is that the video fills the frame exactly, without having to account for the control bar or other elements. So I can embed at exactly 640x480 for a 640x480 clip.

The publish mode (I've got the optional Silverlight Streaming publishing plugin installed) lets me automatically or manually upload the final project to our Silverlight Streaming service. This is a great way to test or deliver Silverlight projects. You can sign up for a free account with 10 GB of storage and 5 TB/month of bandwidth.




So, how much did all this help? Here are a couple of the more pronounced before/after shots. All the below are inserted as 100% scale PNG, so there's no scaling or further compression to complicate the comparison. Note that the FLV came out darker for some reason. I'm not sure what the cause of that was; the VC-1 brightness matches the source. Perhaps something to do with the Mac/Windows gamma difference on the platform the FLV was encoded on? This actually makes VC-1's job relatively harder, since the motion graphics are easier to see.

And you can see the actual clips in action here:


Before: FLV VP6 in Flash

After: WMV VC-1 in Silverlight


Detail improvements

I grabbed a frame right after the transition that really shows the detail difference between VP6 and VC-1 here; it's especially striking in the texture of the shirt. The VP6 gets sharper after a keyframe pop, but this is how it starts. VC-1 quality in the card is maintained perfectly throughout.






Deinterlacing improvements

In this frame (man, do I look like a stiff!), you can see the effect of my blend deinterlace to hide the fields. Notice the ringing artifacts in the original frame. Encoding fields as progressive is extremely challenging for codecs, since you have high motion 1-pixel high horizontal lines, combining high frequency and high detail. I normally don't like doing a blend, since those double-images are also hard to encode, but it was only for a very short duration in this clip, and the deinterlacing filters I had handy had a lot of trouble preserving the text perfectly.




