My World: Video Telephony (VT)

Some time back I worked on implementing a video telephony solution and thought I could document it for a wider audience. By now video telephony is known to a fair number of people through Apple’s “FaceTime” application. FaceTime is nothing innovative from Apple, since this technology was long established in the Chinese, Japanese, and Korean mobile markets, but Apple still has the pleasure of being the first mover in the market. I don’t consider FaceTime a real video telephony application, since it runs over WiFi and the internet backbone (high bandwidth) instead of the cellular backbone (low bandwidth), which makes the problem space much easier for Apple to tackle. One reason for that choice could be the wireless bandwidth crunch among US cellular network providers. One thing that amuses me about the US is that cellular chip technology is advancing faster than the cellular communication infrastructure. US networks are still at least one generation behind Asian markets like China, Japan, and Korea in the kinds of applications and bandwidth they support.
As far as multimedia is concerned, I see the following three performance-hungry applications:
(1) Video/Image capture
(2) Video playback of different codecs (resolution, encoding, bit rate, frame rate) and container formats and image decoding
(3) Video telephony (VT)
I consider VT one of the more interesting problems to work on. To start with, here is some of the design know-how and specs:
On the Tx side:
Video Chain: Camera (Driver to imaging sensor) -> Video Capture Filter -> Video Encode Filter -> Mux Filter
Audio Chain: Mic (Audio Driver) -> Audio Capture Filter -> Audio Encode Filter -> Mux Filter
Video Preview Chain: Camera -> Video Capture Filter -> Video Renderer -> Display Driver

On the Rx side:
Video Chain: Demux Filter -> Video Decode Filter -> Video Renderer -> Display Driver
Audio Chain: Demux Filter -> Audio Decode Filter -> Audio Renderer -> Audio Driver (Speakers)
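
To make the chains above a little more concrete, here is a minimal Python sketch that models them as a small filter graph and walks every path from a source to a sink. The filter names come from the chains listed above. The graph representation, the `chains` helper, and the idea that the encode and preview chains share a single Video Capture Filter are my own assumptions for illustration, not how any particular multimedia framework does it.

```python
from typing import Dict, List

# Each key feeds the filters listed as its value (assumed graph layout).
TX_GRAPH: Dict[str, List[str]] = {
    "Camera": ["Video Capture Filter"],
    # Assumption: one capture filter feeds both the encoder and the preview.
    "Video Capture Filter": ["Video Encode Filter", "Video Renderer"],
    "Video Encode Filter": ["Mux Filter"],
    "Video Renderer": ["Display Driver"],
    "Mic": ["Audio Capture Filter"],
    "Audio Capture Filter": ["Audio Encode Filter"],
    "Audio Encode Filter": ["Mux Filter"],
}

RX_GRAPH: Dict[str, List[str]] = {
    "Demux Filter": ["Video Decode Filter", "Audio Decode Filter"],
    "Video Decode Filter": ["Video Renderer"],
    "Video Renderer": ["Display Driver"],
    "Audio Decode Filter": ["Audio Renderer"],
    "Audio Renderer": ["Audio Driver"],
}


def chains(graph: Dict[str, List[str]], source: str) -> List[List[str]]:
    """Return every path from `source` to a sink (a filter with no outputs)."""
    outputs = graph.get(source, [])
    if not outputs:
        return [[source]]
    return [[source] + tail for nxt in outputs for tail in chains(graph, nxt)]


if __name__ == "__main__":
    for graph, source in ((TX_GRAPH, "Camera"), (TX_GRAPH, "Mic"),
                          (RX_GRAPH, "Demux Filter")):
        for chain in chains(graph, source):
            print(" -> ".join(chain))
```

Walking the graph from "Camera" yields both the video chain and the video preview chain, which shows why the capture filter is a natural place to split the stream.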

A couple of big blocks are missing from the chains above: the acoustic engine, which reduces audio feedback noise, and the VT engine, which takes care of "packetization" and "depacketization". VT packets are transmitted and received at 64 kbps. Of each 64 kbps, 48 kbps is encoded video, 8 kbps is encoded audio, and the remaining 8 kbps is allocated to VT packetization overhead.

Interestingly, there is no special synchronization mechanism at the Rx end, so that is one less problem to deal with compared to video playback. Just follow the specs, and as long as no video frames are dropped, synchronization is taken care of by design. One of the biggest problems at the receiving end is how to assemble the fragments of a video frame (VT packets) back into a frame. The problem gets more interesting because a packet carrying the video header will sometimes be dropped or corrupted, which happens more often over wireless cellular links than over the wired internet. You need a robust solution to handle such random scenarios, and that is where engineering ingenuity comes in.

H.263 and MPEG-4 are the most popular encoding standards for VT applications, and H.264 is slowly catching up. I would say Chinese mobile operators have a better solution to the bandwidth problem and network conditions: all VT calls from A to B are routed through their servers, which take care of packet loss and other random corruption in VT packets. I thoroughly enjoyed implementing the end-to-end VT solution for mobile phones. It is one of the best engineering problems I have worked on.
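
To get a feel for how tight the 64 kbps budget is, here is a small back-of-the-envelope calculation in Python. The 48/8/8 split is from the description above. The 15 fps and 10 fps frame rates are purely my assumptions for illustration, and I am taking 1 kbps to mean 1000 bits per second.

```python
# Channel split in kbps, as described in the text.
TOTAL_KBPS = 64
VIDEO_KBPS = 48
AUDIO_KBPS = 8
OVERHEAD_KBPS = 8


def bytes_per_second(kbps: int) -> int:
    """Convert kilobits per second to bytes per second (1 kbps = 1000 bit/s)."""
    return kbps * 1000 // 8


def avg_video_frame_bytes(fps: int) -> float:
    """Average encoded video frame size at a given (assumed) frame rate."""
    return bytes_per_second(VIDEO_KBPS) / fps


if __name__ == "__main__":
    print("whole channel:", bytes_per_second(TOTAL_KBPS), "bytes/s")
    print("video share:  ", bytes_per_second(VIDEO_KBPS), "bytes/s")
    print("overhead:     ", 100 * OVERHEAD_KBPS / TOTAL_KBPS, "% of the channel")
    for fps in (15, 10):  # assumed frame rates, not from any spec
        print(fps, "fps ->", avg_video_frame_bytes(fps), "bytes per frame on average")
```

Under these assumptions the encoder has only a few hundred bytes per frame on average, which is why a single lost or corrupted packet can hurt so much.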
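
The frame reassembly problem described above can be sketched roughly as follows. This is only an illustration in Python, not the solution I implemented: the `VTPacket` layout (frame id, fragment index, fragment count, a corruption flag) is hypothetical and not taken from any VT spec, and I assume fragment 0 is the one carrying the video header. A frame is handed to the decoder only when every fragment arrived intact. Otherwise it is marked as not rebuildable.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class VTPacket:
    """Hypothetical VT packet; real packet layouts differ."""
    frame_id: int          # which video frame this fragment belongs to
    seq: int               # fragment index; 0 is assumed to carry the video header
    total: int             # number of fragments in the frame
    payload: bytes
    corrupt: bool = False  # set when an integrity check on the packet failed


def reassemble(packets: Iterable[VTPacket]) -> Dict[int, Optional[bytes]]:
    """Group fragments by frame.

    Returns frame_id -> frame bytes, or None when a fragment (including the
    header fragment) is missing. Frames that lost every packet do not appear.
    """
    fragments: Dict[int, Dict[int, VTPacket]] = {}
    for packet in packets:
        if packet.corrupt:
            continue  # treat a corrupted fragment exactly like a lost one
        # Keying by seq tolerates out-of-order and duplicate fragments.
        fragments.setdefault(packet.frame_id, {})[packet.seq] = packet

    frames: Dict[int, Optional[bytes]] = {}
    for frame_id, parts in sorted(fragments.items()):
        total = next(iter(parts.values())).total
        complete = all(seq in parts for seq in range(total))
        frames[frame_id] = (
            b"".join(parts[seq].payload for seq in range(total))
            if complete else None
        )
    return frames


if __name__ == "__main__":
    received = [
        VTPacket(1, 0, 2, b"H1"), VTPacket(1, 1, 2, b"d1"),
        VTPacket(2, 1, 2, b"d2"),                            # header fragment lost
        VTPacket(3, 1, 2, b"d3"), VTPacket(3, 0, 2, b"H3"),  # arrived out of order
    ]
    print(reassemble(received))
```

What to do with a frame that comes back as None (skip it, repeat the previous frame, or wait for the next intact frame) is a separate policy decision that this sketch leaves out.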


Thursday, November 4, 2010

Video Telephony (VT) – Know-how
