Space Savers:
Bend, Twist and Crunch Your Audio Files for the Web

by Neil Leonard
www.neilleonard.com

published in Electronic Musician, July 1998

Now that so many people are connected via the Web, it is easier to exchange music than ever before. All you have to do is point your browser to a Web page with posted music files, click on a file name and download the music. Right? Actually, in many respects the Web has increased the complexities of distribution. When you buy a CD or standard audio cassette there is no question that it will play on your home system.

Instant playback is not guaranteed when dealing with audio files that you download from the Internet. Web distribution is more complex than CD or cassette distribution partly due to the limitations of available bandwidth. A person using a dedicated T1 line can transfer up to 1.544 Mbps (million bits per second), which is just fast enough to download a CD quality audio file in real-time. However, a vast number of people accessing the Internet still use 28.8 Kbps (thousand bits per second) or even 14.4 Kbps modems. If you try to download one minute of CD quality audio using one of these modems you might find yourself waiting for the better part of an hour, or well over an hour on the slower modem.
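To put rough numbers on that wait, here is a minimal Python sketch of the arithmetic. It assumes the modem delivers its full nominal rate with no protocol overhead, which real connections rarely manage, so treat the results as best-case figures. The function names are made up for this illustration.

```python
# Best-case download time for uncompressed CD-quality audio.
# Assumption: the link runs at its full nominal rate with no overhead.

CD_SAMPLE_RATE = 44_100   # samples per second
CD_BITS = 16              # bits per sample
CD_CHANNELS = 2           # stereo


def cd_bits_per_second():
    """Data rate of CD-quality audio in bits per second."""
    return CD_SAMPLE_RATE * CD_BITS * CD_CHANNELS


def download_minutes(audio_seconds, link_bps):
    """Minutes needed to transfer audio_seconds of CD audio at link_bps."""
    total_bits = cd_bits_per_second() * audio_seconds
    return total_bits / link_bps / 60


print("CD audio data rate:", cd_bits_per_second(), "bits per second")
for label, bps in [("14.4 Kbps modem", 14_400), ("28.8 Kbps modem", 28_800)]:
    minutes = download_minutes(60, bps)
    print(f"{label}: {minutes:.0f} minutes to fetch one minute of audio")
```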

In order to expedite transfer rates and reduce the amount of disk space occupied by digital audio files, several approaches to encoding audio data have evolved. You will find yourself using these technologies when you download audio clips, tune in to a Web radio station, play a digital video, or listen to audio streaming from a musician's or record company's Web site. Even when you are not on-line you might use an audio compression algorithm simply to save disk space.

The piece of software that makes all this possible is called a codec, which stands for compressor/decompressor. It's worth pointing out that a codec is unlike a software-based dynamic range compressor. Here the goal is to reduce the number of bits that are required to represent the waveform.

The majority of codecs that we will look at here are lossy, meaning that once the audio signal has been encoded, there is no guarantee that decompressing it will produce an exact replica of the original data. You might ask, 'How can we just throw away parts of the signal?' To answer that question let's look at how a type of codec called a waveform coder works.

Waveform coders produce a close approximation of the waveform using fewer bits. One widely used waveform coder is IMA-ADPCM, which stands for the Interactive Multimedia Association's specification for Adaptive Differential Pulse-Code Modulation. This is a variant of the ADPCM that is widely used in the telecommunications industry. IMA-ADPCM was designed specifically for desktop audio applications and is incorporated into the Windows operating system, where it is referred to as ADPCM. It is also part of Apple's QuickTime software; Mac users often refer to it as IMA compression. In some cases, files compressed with IMA-ADPCM can sound indistinguishable from the original uncompressed version.

IMA-ADPCM works because the amplitude of an audio waveform tends to change gradually. As it turns out, the waveform can be represented more efficiently by saving the difference between consecutive samples, as opposed to saving the absolute value of individual samples.

When an audio signal is sampled by a 16-bit analog to digital converter, the incoming analog signal is measured at periodic intervals and converted to corresponding 16-bit quantities. These 16-bit values can represent any whole number between 0 and 65,535. If we measure the difference between consecutive samples in one channel, we might find that the absolute difference between two values rarely exceeds 100. Well, if the difference values of a waveform never exceed 100, then we do not need 16 bits per sample to represent it.
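The idea is easy to try for yourself. The Python sketch below is only an illustration: it uses a quiet 440 Hz sine wave as a stand-in for real audio, and the helper names are invented for this example. It measures how large the sample-to-sample differences get and how many bits a signed value of that size would need.

```python
import math


def sine_samples(freq_hz, amplitude, count, sample_rate=44_100):
    """Integer samples of a sine wave (a stand-in for recorded audio)."""
    return [round(amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate))
            for n in range(count)]


def differences(samples):
    """Difference between each sample and the one before it."""
    return [b - a for a, b in zip(samples, samples[1:])]


def bits_needed(max_magnitude):
    """Bits for a signed value whose magnitude is at most max_magnitude."""
    return max_magnitude.bit_length() + 1   # one extra bit for the sign


samples = sine_samples(freq_hz=440, amplitude=1000, count=1000)
deltas = differences(samples)
largest = max(abs(d) for d in deltas)

print("largest sample magnitude:", max(abs(s) for s in samples))
print("largest difference between neighbors:", largest)
print("bits needed per difference:", bits_needed(largest))
```

A louder or brighter signal produces larger differences, which is why a practical codec has to adapt its scale rather than assume a fixed range.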

In IMA-ADPCM each sample is represented by a 4-bit difference value, so a 16-bit, 44.1 kHz, stereo file will be reduced to 25% of its original size. This gives us a fixed 4:1 compression ratio. [Fig 1. Sound Converter can be used to encode audio files using a variety of codecs including IMA-ADPCM.] Encoding the difference between samples is sometimes referred to as difference modulation, or delta modulation. Difference modulation is not unique to IMA-ADPCM; in fact, it is the basis for other audio codecs including Dolby Labs' AC-1 codec.

Let's have a closer look at how these four-bit values are used. Four bits can represent sixteen different values, which is enough for a sign and a magnitude from 0 to 7. So, in its simplest form a four-bit difference value can represent a number between -7 and 7. However, what happens when the waveform's amplitude jumps by a value greater than 7? Well, rather than using a range of numbers between -7 and 7, we could use the same four bits to represent even numbers between -14 and +14. To wrap up our overview of this codec, let's look at how IMA-ADPCM formats the waveform data.

The IMA-ADPCM codec groups consecutive samples in bundles. On the Macintosh each bundle consists of 64 samples. Bundles begin with a step index, or multiplier to scale the difference values. For example, this value determines whether the difference values are on a -7 to 7 or -14 to 14 scale. The step index value can vary, or be adapted to the needs of each bundle, hence the A in ADPCM. The beginning of each bundle also has a predictor value to specify the absolute amplitude of the first sample of each bundle.
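The bundle idea can be sketched in Python as follows. Be warned that this is a simplified illustration of the description above and not the actual IMA-ADPCM algorithm, which uses a standard table of step sizes and adjusts the step as it moves through the samples. Here each bundle simply picks one multiplier large enough for its biggest jump. All function names are invented for this example.

```python
import math

BUNDLE_SIZE = 64   # samples per bundle, as on the Macintosh


def encode_bundle(samples):
    """Return (predictor, step, codes) for one bundle.

    predictor: absolute value of the first sample
    step:      multiplier chosen so the largest jump fits in -7..7
    codes:     one value in -7..7 for each remaining sample
    """
    predictor = samples[0]
    largest = max((abs(b - a) for a, b in zip(samples, samples[1:])), default=0)
    step = max(1, -(-largest // 7))            # ceiling division
    codes = []
    current = predictor
    for target in samples[1:]:
        code = round((target - current) / step)
        code = max(-7, min(7, code))           # keep it within four bits
        codes.append(code)
        current += code * step                 # track what the decoder will hear
    return predictor, step, codes


def decode_bundle(predictor, step, codes):
    """Rebuild the samples of one bundle from its header and codes."""
    out = [predictor]
    for code in codes:
        out.append(out[-1] + code * step)
    return out


def encode(samples):
    """Split the samples into bundles and encode each one."""
    return [encode_bundle(samples[i:i + BUNDLE_SIZE])
            for i in range(0, len(samples), BUNDLE_SIZE)]


def decode(bundles):
    """Decode a list of bundles back into one list of samples."""
    out = []
    for predictor, step, codes in bundles:
        out.extend(decode_bundle(predictor, step, codes))
    return out


# Try it on a sine wave and report the step chosen for each bundle.
wave = [round(8000 * math.sin(2 * math.pi * 440 * n / 44_100)) for n in range(256)]
bundles = encode(wave)
restored = decode(bundles)
print("step per bundle:", [step for _, step, _ in bundles])
print("worst error:", max(abs(a - b) for a, b in zip(wave, restored)))
```

Because the encoder keeps track of what the decoder will reconstruct, rounding errors do not pile up from one sample to the next. In this sketch the error on any sample stays within half a step.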

Despite its often stunning results, IMA-ADPCM has drawbacks. The IMA did not define the number of samples that are in a bundle or the number of bits that are allocated for the step index and predictor values. As a result, Microsoft and Apple came up with their own incompatible implementations of IMA-ADPCM that use different bundle sizes and bit allotments for step index and predictor bytes. You might need to know what platform was used to create an IMA-ADPCM file prior to selecting a piece of software to listen to it.

There are additional caveats. ADPCM does not lend itself to random access. You might have to decode your IMA-ADPCM files with a piece of utility software before editing them with your favorite waveform editor. IMA-ADPCM encoders convert the incoming file to 16 bits before creating the final file. If you process an 8-bit sample file, it will automatically be converted to 16-bit before it is reduced to a 4-bit-per-sample file. So, you are better off encoding a 16-bit version of the file. It will sound much better and use the same amount of disk space.

The waveform coder is just one type of codec. What happens when you go to a Web page where audio playback happens nearly instantaneously and files are not downloaded? These streaming technologies rely on perceptual coders, which use more intensive algorithms to provide even greater data reduction ratios. Perceptual coders are the basis of MPEG (used in Shockwave) and Dolby AC-2 and Dolby AC-3. (See "Surfing the Pipeline," EM, September 1997, for an overview of products that use these technologies.)

Perceptual coders radically reduce the amount of stored data, yet can yield CD quality sound files. The music industry now views this as a powerful alternative to traditional distribution methods. Web distribution practically eliminates manufacturing costs and provides around-the-clock shopping.

At present you can audition high quality preview files or even purchase tracks that have been encoded using a perceptual coder. If you download a track you can use a piece of software to decode it and burn it to a standard Red Book audio CD. One such system, Liquid Audio, has already been used to publish Duran Duran's new album. Liquid Audio's file server software generates broadcast reports for BMI, ASCAP and the Harry Fox Agency. While this delivery medium is in its infancy, some specialists believe that on-line sales will reach $1.3 billion by the end of this millennium. [Fig 2. Liquid Audio Screen Shot] Both Liquid Audio and RealAudio use perceptual coders developed by Dolby Laboratories as the basis of their streaming technologies.

Unlike waveform coders, perceptual coders do not attempt to preserve the contour of the original waveform. Instead, the goal is to ensure that the final output signal sounds like the original. To achieve this, the encoding algorithm uses a model of the human auditory system to determine what parts of the signal are masked, or inaudible. These parts of the audio signal are deemed irrelevant and are removed. Hence, the amount of information that needs to be stored is reduced.

For example, if a guitar concerto were encoded using a perceptual coder, the algorithm would determine that the frequencies produced by the guitar are of critical importance during cadenzas. However, when the full string section comes in we cannot always distinguish the guitar part. At these points the coder would eliminate the frequencies produced by the guitar, without any perceptible loss of audio quality.

In order to perform these tasks the encoder analyzes the input signal within consecutive overlapping time blocks that might be anywhere from a few hundred to a few thousand samples long. Each block is divided into narrow frequency sub-bands of different sizes according to the frequency sensitivity of human hearing. A psychoacoustic model is then used to determine which sub-bands contain irrelevant information that can be discarded.
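The analysis stage can be sketched in Python, with some heavy simplifications. The block size is kept tiny so that a slow, dependency-free transform will do. The band layout (each band twice as wide as the one below it) is a hypothetical stand-in for a layout based on human hearing. The "model" is nothing more than a loudness threshold, where a real psychoacoustic model is far more elaborate. None of this reflects how any particular commercial coder works.

```python
import cmath
import math


def overlapping_blocks(samples, block_size, hop):
    """Yield consecutive blocks that overlap by block_size - hop samples."""
    for start in range(0, len(samples) - block_size + 1, hop):
        yield samples[start:start + block_size]


def magnitude_spectrum(block):
    """Naive DFT magnitudes for bins 0 through N/2 (slow but simple)."""
    n = len(block)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(block)))
            for k in range(n // 2 + 1)]


def band_edges(num_bins):
    """Hypothetical layout: narrow bands at low frequencies, wide ones higher up."""
    edges, width, start = [], 1, 1             # bin 0 (DC) is skipped
    while start < num_bins:
        end = min(start + width, num_bins)
        edges.append((start, end))
        start, width = end, width * 2
    return edges


def band_energies(spectrum):
    """Total energy inside each sub-band."""
    return [sum(m * m for m in spectrum[lo:hi])
            for lo, hi in band_edges(len(spectrum))]


def irrelevant_bands(energies, threshold_db=40.0):
    """Placeholder model: flag bands more than threshold_db below the loudest."""
    loudest = max(energies)
    if loudest == 0:
        return [True] * len(energies)
    floor = loudest / (10 ** (threshold_db / 10))
    return [e < floor for e in energies]


# A loud low tone plus a very quiet high tone (60 dB down).
signal = [math.sin(2 * math.pi * 4 * i / 64)
          + 0.001 * math.sin(2 * math.pi * 20 * i / 64)
          for i in range(128)]

for number, block in enumerate(overlapping_blocks(signal, block_size=64, hop=32)):
    flags = irrelevant_bands(band_energies(magnitude_spectrum(block)))
    print("block", number, "discardable bands:", flags)
```

In this toy signal the quiet high tone lands in a band that the threshold rule marks as discardable, while the band holding the loud tone is kept.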

Perceptual coders are scalable codecs, meaning that the compression ratio can be adjusted by the user. It is common for these coders to include a dialog box that allows the user to set the compression ratio to meet a minimum bit rate that is expected when the file is played back via a modem of a particular speed. [Fig 3. Macromedia's Shockwave Audio codec allows the user to scale the size of the encoded file to match the limits of a particular modem speed.] This information is used to help determine how many bits to allocate for different frequency ranges. Sub-bands that are deemed more critical to our perception of the music get a more generous allotment of bits.
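One simple way to picture bit allocation is a greedy hand-out: keep giving one more bit to whichever sub-band currently matters most. The Python sketch below assumes each extra bit buys about 6 dB of quality and treats every band as a single value. Both are simplifications, and the importance figures are made up. It illustrates the principle and does not describe the method used by Shockwave or any other product.

```python
def bits_per_block(target_bps, sample_rate, hop):
    """Bit budget for each block of hop new samples at a target bit rate."""
    return int(target_bps * hop / sample_rate)


def allocate_bits(importance_db, budget, max_bits_per_band=16):
    """Greedy allocation: repeatedly give one bit to the neediest band.

    importance_db: how far each band sits above the point where it
                   stops mattering (hypothetical figures, in dB)
    budget:        total bits available for this block
    """
    remaining = list(importance_db)
    bits = [0] * len(remaining)
    for _ in range(budget):
        candidates = [i for i in range(len(bits))
                      if bits[i] < max_bits_per_band and remaining[i] > 0]
        if not candidates:
            break                              # every band is already satisfied
        best = max(candidates, key=lambda i: remaining[i])
        bits[best] += 1
        remaining[best] -= 6.0                 # assume about 6 dB per bit
    return bits


budget = bits_per_block(target_bps=28_800, sample_rate=44_100, hop=32)
importance = [30.0, 18.0, 6.0, 0.0]            # made-up values for four bands
print("bits available per block:", budget)
print("bits per band:", allocate_bits(importance, budget))
print("bits per band on a tighter budget:", allocate_bits(importance, 6))
```

Lowering the target bit rate shrinks the budget, so the least important bands are the first to go without.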

Encoded files can be streamed or posted on the Web for downloading. In either case, a special piece of software is required to play back the file. At playback time the decoder uses an inverse filter bank to synthesize audio.

So, we have examined two types of audio codecs. Are there more? Definitely. If you are running Windows 95, look at the Advanced Multimedia Properties in the Control Panel. Chances are that you will find over a half dozen audio codecs listed here. Fortunately, in most basic cases Windows finds the right codec for the task, and you might not even know that codecs are being used.

Once you begin to explore the available codecs for your OS you might run across µ-Law, which is used for telephony in North America and Japan. It was defined by the CCITT (International Telegraph and Telephone Consultative Committee). It compresses audio using 8 bits per sample and can achieve a signal-to-noise ratio and dynamic range equivalent to that of a 12-bit system. Its quantization steps follow a logarithmic scale that is well suited for encoding speech. Another waveform coder is Apple's MACE (Macintosh Audio Compression Expansion), which encodes 8-bit files using difference modulation.
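The logarithmic scale behind µ-Law can be tried out with a few lines of Python. This sketch uses the continuous µ-Law curve with µ = 255 and then rounds the result to a signed 8-bit code. It is an illustration of the companding idea only. The telephone standard specifies a segmented approximation and an exact bit layout, which are not reproduced here, and the function names are invented for this example.

```python
import math

MU = 255   # the value used by the mu-law curve in telephony


def compress(x):
    """Map x in the range -1..1 onto a logarithmic scale, also -1..1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress()."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def encode(x):
    """Quantize the compressed value to a signed 8-bit code (-127..127)."""
    return round(compress(x) * 127)


def decode(code):
    """Turn an 8-bit code back into a sample in the range -1..1."""
    return expand(code / 127)


def linear_8bit(x):
    """Plain 8-bit quantization with no companding, for comparison."""
    return round(x * 127) / 127


for level in (0.001, 0.01, 0.1, 1.0):
    print(f"input {level}: mu-law gives {decode(encode(level)):.5f}, "
          f"linear 8-bit gives {linear_8bit(level):.5f}")
```

Quiet signals get far finer steps than loud ones, which is how eight bits can cover a dynamic range that would otherwise call for a longer word.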

Does it end here? Hardly. Audio coding technologies are being updated on a monthly, if not weekly, basis. Emagic just introduced ZAP (Zero-loss Audio Packer), a stand-alone application that allows users to archive their work with up to 60% savings in file size. When expanded from the compressed files, the original waveform is restored unchanged. ZAP supports Sound Designer II, AIFF and Windows Wave file formats. Files that have been compressed with ZAP can be saved as self-extracting files, making it possible to decompress the files without additional software.

By the time you read this, Apple should have released QuickTime 3.0, which ups the ante even further by incorporating two new audio codecs. The QDesign Music Codec (QDMC) is designed to deliver CD quality music via a 28.8 Kbps modem in real-time. QDMC offers 99 percent file size reduction, without reducing audio quality. QUALCOMM's PureVoice is optimized for speech and can stream telephone quality speech over a 28.8 Kbps modem.