- For processes which reduce the amount of time it takes to
listen to and understand a recording, see time-compressed speech.
Audio compression is a form of
data compression designed to reduce the
transmission bandwidth requirement of digital
audio streams and the storage size of
audio files. Audio compression
algorithms are implemented in
computer software as
audio codecs. Generic data compression
algorithms perform poorly with audio data, seldom reducing data
size much below 87% of the original, and are not designed for use
in real-time applications. Consequently, specifically optimized
audio
lossless and
lossy algorithms have been
created. Lossy algorithms provide greater compression rates and are
used in mainstream consumer audio devices.
In both lossy and lossless compression,
information redundancy is
reduced, using methods such as
coding,
pattern recognition and
linear prediction to reduce the amount of
information used to represent the uncompressed data.
For most practical audio applications, the savings in transmission
bandwidth or storage size outweigh the slight reduction in audio
quality, which users often do not perceive on playback. For example,
one compact disc (CD) holds approximately one hour of uncompressed
high fidelity music, less than 2 hours of music compressed losslessly,
or 7 hours of music compressed in the
MP3 format at medium
bit rates.
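The playing times quoted above follow from the compression ratio alone. A minimal sketch of the arithmetic, assuming the disc holds exactly one hour of uncompressed audio (the text says approximately) and using a hypothetical helper name, `hours_on_disc`:

```python
def hours_on_disc(compressed_fraction, uncompressed_hours=1.0):
    """Playing time that fits when audio shrinks to `compressed_fraction` of its size."""
    return uncompressed_hours / compressed_fraction

# Lossless compression to 50-60% of original size: roughly 1.7 to 2 hours.
lossless_range = (hours_on_disc(0.60), hours_on_disc(0.50))

# Seven hours on the same disc implies files at about 1/7 (14%) of original size.
mp3_fraction = 1.0 / 7.0
mp3_hours = hours_on_disc(mp3_fraction)
```

The implied fraction of about 14% sits inside the 5–20% range this article gives for lossy compression.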
Lossless audio compression
Lossless audio compression produces a representation of digital
data that can be expanded to an exact digital duplicate of the
original audio stream. This is in contrast to the irreversible
changes upon playback from lossy compression techniques such as
Vorbis and
MP3.
Compression ratios are similar to those for generic lossless data
compression (around 50–60% of original size), so the size reduction is
substantially smaller than for lossy compression, which typically
yields 5–20% of original size.
Applications
The primary application areas of lossless encoding are:
- Archives: For archival purposes, the
best possible quality is generally desired.
- Editing: Audio engineers use
lossless compression for audio editing to avoid digital generation loss.
- High fidelity playback: Audiophiles
prefer lossless compression formats to avoid compression artifacts.
- Mastering of casual-use audio media: High quality master copies of recordings are used to
produce lossily compressed versions for digital audio players. As formats and
encoders improve, updated lossily compressed files may be generated
from the lossless master.
As file storage and communications bandwidth have become less
expensive and more available, lossless audio compression has become
more popular.
Formats
Shorten was an early lossless format; newer ones
include Free Lossless Audio
Codec (FLAC), Apple's
Apple Lossless, MPEG-4
ALS, Monkey's Audio, and TTA.
Some audio formats feature a combination of a lossy format and a
lossless correction; this allows stripping the correction to easily
obtain a lossy file. Such formats include
MPEG-4 SLS (Scalable to Lossless),
WavPack, and
OptimFROG DualStream.
Some formats are associated with a technology, such as:
Difficulties in lossless compression of audio data
It is difficult to maintain all the data in an audio stream and
achieve substantial compression. First, the vast majority of sound
recordings are highly complex, recorded from the real world. As one
of the key methods of compression is to find patterns and
repetition, more chaotic data such as audio doesn't compress well.
In a similar manner,
photographs compress
less efficiently with lossless methods than simpler
computer-generated images do. Even computer-generated
sounds can contain very complicated
waveforms that present a challenge to many
compression algorithms. This is due to the nature of audio
waveforms, which are generally difficult to simplify without a
(necessarily lossy) conversion to frequency information, as
performed by the human ear.
The second reason is that values of audio
samples change very quickly, so repeated strings of consecutive bytes
seldom appear and generic data compression algorithms do not work well
for audio. However, convolution with the filter [-1 1] (that is, taking
the first difference) tends to slightly whiten (decorrelate, flatten)
the spectrum, thereby allowing traditional lossless compression at the
encoder to do its job; integration at the decoder restores the original
signal.
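The first-difference step and its inverse can be sketched as follows. This is an illustrative example only; the function names and the sample values are made up, and the first sample is differenced against an assumed initial value of zero.

```python
def first_difference(samples):
    """Encoder side: filter [-1 1], i.e. d[n] = x[n] - x[n-1], with x[-1] taken as 0."""
    prev = 0
    out = []
    for x in samples:
        out.append(x - prev)
        prev = x
    return out

def integrate(diffs):
    """Decoder side: a running sum restores the original samples exactly."""
    total = 0
    out = []
    for d in diffs:
        total += d
        out.append(total)
    return out

samples = [1000, 1004, 1009, 1013, 1016, 1018]  # slowly varying, made-up values
residual = first_difference(samples)             # [1000, 4, 5, 4, 3, 2]
restored = integrate(residual)                   # equal to samples
```

After the first value, the residual consists of small numbers clustered near zero, which a general-purpose entropy coder can represent in fewer bits than the original samples.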
Codecs such as
FLAC,
Shorten and
TTA use
linear prediction to estimate the
spectrum of the signal. At the encoder, the estimator's inverse is
used to whiten the signal by removing spectral peaks while the
estimator is used to reconstruct the original signal at the
decoder.
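As a rough illustration of prediction-based whitening, the sketch below uses a fixed second-order predictor, pred[n] = 2x[n-1] - x[n-2], rather than coefficients fitted to the signal. The function names are hypothetical and the code is not taken from any of the codecs named above.

```python
def encode_residual(samples):
    """Replace each sample by its prediction error; the first two samples are kept as-is."""
    residual = list(samples[:2])
    for n in range(2, len(samples)):
        prediction = 2 * samples[n - 1] - samples[n - 2]
        residual.append(samples[n] - prediction)
    return residual

def decode_residual(residual):
    """Rebuild the samples by running the same predictor and adding the stored error."""
    samples = list(residual[:2])
    for n in range(2, len(residual)):
        prediction = 2 * samples[n - 1] - samples[n - 2]
        samples.append(residual[n] + prediction)
    return samples

samples = [0, 10, 21, 33, 46, 60, 75]   # made-up, smoothly accelerating values
residual = encode_residual(samples)      # [0, 10, 1, 1, 1, 1, 1]
restored = decode_residual(residual)     # equal to samples
```

Because the decoder runs the same predictor on samples it has already reconstructed, the round trip is exact with integer arithmetic.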
Evaluation criteria
Lossless audio codecs have no quality issues, so their usability can
be estimated by:
- Speed of compression and decompression
- Degree of compression
- Robustness and error correction
- Product support
Lossy audio compression
Lossy audio compression is used in an extremely wide range of
applications. In addition to the direct applications (MP3 players
or computers), digitally compressed audio streams are used in most
video DVDs; digital television; streaming media on the
internet; satellite and cable radio; and
increasingly in terrestrial radio broadcasts. Lossy compression
typically achieves far greater compression than lossless
compression (data of 5 percent to 20 percent of the original
stream, rather than 50 percent to 60 percent), by discarding
less-critical data.
The innovation of lossy audio compression was to use
psychoacoustics to recognize that not all
data in an audio stream can be perceived by the human auditory
system. Most lossy compression reduces perceptual redundancy by
first identifying sounds which are considered perceptually
irrelevant, that is, sounds that are very hard to hear. Typical
examples include high frequencies, or sounds that occur at the same
time as louder sounds. Those sounds are coded with decreased
accuracy or not coded at all.
While removing or reducing these 'unhearable' sounds may account
for a small percentage of bits saved in lossy compression, the real
savings come from a complementary phenomenon: noise shaping. Reducing
the number of bits used to code a signal increases the amount of noise
in that signal. In psychoacoustics-based lossy compression, the key is
to 'hide' the noise generated by the bit savings in areas of the audio
stream where it cannot be perceived. This is done by, for instance,
using very few bits to code the high frequencies of most signals: not
because the signal has little high-frequency information (though this
is often true), but because the human ear can only perceive very loud
signals in this region, so that softer sounds 'hidden' there simply are
not heard.
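The link between bit depth and noise can be checked numerically. The sketch below covers only that first step (fewer bits, more noise); it does not shape the noise spectrum. The test tone, sample rate, and function names are assumptions chosen for illustration.

```python
import math

def quantize(x, bits):
    """Uniformly quantize x in [-1, 1) to the given number of bits."""
    levels = 2 ** (bits - 1)
    q = round(x * levels)
    q = max(-levels, min(levels - 1, q))  # clamp to the representable range
    return q / levels

def snr_db(bits, n=4800):
    """Signal-to-noise ratio of a 440 Hz test tone (48 kHz sampling) after quantization."""
    signal = [0.9 * math.sin(2 * math.pi * 440 * t / 48000) for t in range(n)]
    noise = [s - quantize(s, bits) for s in signal]
    p_signal = sum(s * s for s in signal)
    p_noise = sum(e * e for e in noise)
    return 10 * math.log10(p_signal / p_noise)

# Each bit removed costs roughly 6 dB of signal-to-noise ratio.
snr_by_bits = {bits: snr_db(bits) for bits in (16, 8, 4)}
```

A perceptual coder accepts this added noise but tries to place it where the ear is least likely to notice it.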
If reducing perceptual redundancy does not achieve sufficient
compression for a particular application, further lossy compression
may be required. Depending on the audio source, this still may
not produce perceptible differences. Speech, for example, can be
compressed far more than music. Most lossy compression schemes
allow compression parameters to be adjusted to achieve a target
rate of data, usually expressed as a
bit
rate. Again, the data reduction will be guided by some model of
how important the sound is as perceived by the human ear, with the
goal of efficiency and optimized quality for the target data rate.
(There are many different models used for this perceptual analysis,
some better suited to different types of audio than others.) Hence,
depending on the bandwidth and storage requirements, the use of
lossy compression may result in a perceived reduction of the audio
quality that ranges from none to severe, but generally an obviously
audible reduction in quality is unacceptable to listeners.
Because data is removed during lossy compression and cannot be
recovered by decompression, some people prefer not to use lossy
compression for archival storage. Hence, as noted, even those who
use lossy compression (for portable audio applications, for
example) may wish to keep a losslessly compressed archive for other
applications. In addition, the technology of compression continues
to advance, and achieving a state-of-the-art lossy compression
would require one to begin again with the lossless, original audio
data and compress with the new lossy codec. The nature of lossy
compression (for both audio and images) results in increasing
degradation of quality if data are decompressed, then recompressed
using lossy compression.
Coding methods
Transform domain methods
In order to determine what information in an audio signal is
perceptually irrelevant, most lossy compression algorithms use
transforms such as the
modified discrete cosine
transform (MDCT) to convert
time
domain sampled waveforms into a transform domain. Once
transformed, typically into the
frequency domain, component frequencies can
be allocated bits according to how audible they are. Audibility of
spectral components is determined by first calculating a
masking threshold, below which it is
estimated that sounds will be beyond the limits of human
perception.
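To make the transform step concrete, here is a minimal sketch of a direct (slow, O(N²)) MDCT and its inverse with a sine window and 50% overlap. It assumes the common textbook definition of the MDCT; the block size, window choice, and function names are illustrative, and a real codec would use a fast algorithm and quantize the coefficients between the two transforms.

```python
import math

def mdct(block):
    """Direct MDCT: 2N windowed time samples -> N coefficients."""
    n = len(block) // 2
    return [sum(block[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples (scaled for windowed use)."""
    n = len(coeffs)
    return [(2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                            for k in range(n))
            for t in range(2 * n)]

def sine_window(length):
    """Window satisfying w[t]^2 + w[t+N]^2 = 1, needed for aliasing cancellation."""
    return [math.sin(math.pi * (t + 0.5) / length) for t in range(length)]

def analysis_synthesis(signal, n):
    """Transform overlapping blocks, invert them, and overlap-add the results."""
    w = sine_window(2 * n)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = [signal[start + t] * w[t] for t in range(2 * n)]
        y = imdct(mdct(block))
        for t in range(2 * n):
            out[start + t] += y[t] * w[t]
    return out

signal = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(32)]
rebuilt = analysis_synthesis(signal, 8)
```

Samples covered by two overlapping blocks (indices 8 to 23 here) are reconstructed to within floating-point error; the first and last half-blocks are not, because they lack an overlapping neighbour.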
The masking threshold is calculated using the
absolute threshold of hearing
and the principles of
simultaneous
masking - the phenomenon wherein a signal is masked by another
signal separated by frequency - and, in some cases,
temporal masking - where a signal is masked
by another signal separated by time.
Equal-loudness contours may also be
used to weight the perceptual importance of different components.
Models of the human ear-brain combination incorporating such
effects are often called
psychoacoustic models.
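One ingredient of such a model, the absolute threshold of hearing, is often approximated by a closed-form curve. The sketch below uses a widely quoted approximation usually attributed to Terhardt; the formula is not given in this article and is brought in here as an assumption, and the function names are hypothetical. It ignores masking by other signals entirely.

```python
import math

def absolute_threshold_db(freq_hz):
    """Approximate threshold in quiet, in dB SPL, for a pure tone at freq_hz."""
    f = freq_hz / 1000.0  # frequency in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def audible_in_quiet(freq_hz, level_db_spl):
    """True if a tone at this level lies above the approximate threshold in quiet."""
    return level_db_spl > absolute_threshold_db(freq_hz)

# The ear is most sensitive around 3-4 kHz and far less sensitive at the extremes.
thresholds = {f: absolute_threshold_db(f) for f in (100, 1000, 3300, 16000)}
```

Under this approximation a 10 dB SPL tone is above threshold at 3.3 kHz but well below it at 16 kHz, which is one reason high-frequency components can often be coded coarsely.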
Time domain methods
Other types of lossy compressors, such as the
linear predictive coding (LPC) used
with speech, are
source-based coders. These coders use a
model of the sound's generator (such as the human vocal tract with
LPC) to whiten the audio signal (i.e., flatten its spectrum) prior
to quantization. LPC may also be thought of as a basic perceptual
coding technique; reconstruction of an audio signal using a linear
predictor shapes the coder's quantization noise into the spectrum
of the target signal, partially masking it.
Applications
Due to the nature of lossy algorithms,
audio quality suffers when a file is
decompressed and recompressed (
digital generation loss). This makes
lossy compression unsuitable for storing the intermediate results
in professional audio engineering applications, such as sound
editing and multitrack recording. However, lossy formats are very popular
with end users (particularly
MP3), as a megabyte
can store about a minute's worth of music at adequate
quality.
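The "megabyte per minute" figure can be checked with simple arithmetic. The sketch assumes a bit rate of 128 kbit/s, a value not stated in the text and chosen only as a plausible medium bit rate; the function name is hypothetical.

```python
def file_size_bytes(bit_rate_bps, seconds):
    """Size of a constant-bit-rate stream, ignoring container and header overhead."""
    return bit_rate_bps * seconds / 8

one_minute = file_size_bytes(128_000, 60)  # 960000.0 bytes, a little under one megabyte
```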
Usability
Usability of lossy audio codecs is determined by:
- Perceived audio quality
- Compression factor
- Speed of compression and decompression
- Inherent latency of algorithm (critical for real-time streaming
applications; see below)
- Product support
Lossy formats are often used for the distribution of streaming
audio, or interactive applications (such as the coding of speech
for digital transmission in cell phone networks). In such
applications, the data must be decompressed as the data flows,
rather than after the entire data stream has been transmitted. Not
all audio codecs can be used for streaming applications, and for
such applications a codec designed to stream data effectively will
usually be chosen.
Latency results from the methods used to encode and decode the
data. Some codecs will analyze a longer segment of the data to
optimize efficiency, and then code it in a manner that requires a
larger segment of data at one time in order to decode. (Codecs
often divide the data into discrete segments called "frames"
for encoding and decoding.) The inherent
latency of the coding algorithm can be
critical; for example, when there is two-way transmission of data,
such as with a telephone conversation, significant delays may
seriously degrade the perceived quality.
In contrast to the speed of compression, which is proportional to
the number of operations required by the algorithm, here latency
refers to the number of samples which must be analysed before a
block of audio is processed. In the minimum case, latency is zero
samples (e.g., if the coder/decoder simply reduces the number of
bits used to quantize the signal). Time domain algorithms such as
LPC also often have low latencies, hence their popularity in speech
coding for telephony. In algorithms such as MP3, however, a large
number of samples have to be analyzed in order to implement a
psychoacoustic model in the frequency domain, and latency is on the
order of 23 ms (46 ms for two-way communication).
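Latency measured in samples converts to time by dividing by the sample rate. The sketch below assumes 1024 samples at a 44.1 kHz sample rate, values chosen because they reproduce the figure quoted above; neither number is stated in the text, and the function name is hypothetical.

```python
def latency_ms(samples, sample_rate_hz):
    """Time needed to collect `samples` samples, in milliseconds."""
    return 1000.0 * samples / sample_rate_hz

one_way = latency_ms(1024, 44100)  # about 23.2 ms
two_way = 2 * one_way              # about 46.4 ms
```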
Speech encoding
Speech encoding is an important
category of audio data compression. The perceptual models used to
estimate what a human ear can hear are generally somewhat different
from those used for music. The range of frequencies needed to
convey the sounds of a human voice is normally far narrower than
that needed for music, and the sound is normally less complex. As a
result, speech can be encoded at high quality using relatively low
bit rates.
This is accomplished, in general, by some combination of two
approaches:
- Only encoding sounds that could be made by a single human
voice.
- Throwing away more of the data in the signal—keeping just
enough to reconstruct an "intelligible" voice rather than the full
frequency range of human hearing.
Perhaps the earliest algorithms used in speech encoding (and audio
data compression in general) were the
A-law algorithm and the
µ-law algorithm.
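The idea behind µ-law companding can be sketched with its continuous compression curve, F(x) = sgn(x) ln(1 + µ|x|) / ln(1 + µ). This is an illustrative example of the curve only, not the segmented 8-bit encoding used in telephony; µ = 255 is the commonly used value, and the function names are hypothetical.

```python
import math

MU = 255  # commonly used value for mu-law

def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1], giving small amplitudes more of the output range."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

quiet = mu_law_compress(0.01)          # about 0.23: a quiet sample is boosted before quantization
round_trip = mu_law_expand(mu_law_compress(-0.3))
```

Quantizing the compressed value instead of the raw sample spends more of the available levels on quiet sounds, where the ear is more sensitive to quantization error.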
History
Solidyne 922: The world's first commercial audio bit compression
card for PC, 1990
A literature compendium for a large variety of audio coding systems
was published in the IEEE Journal on Selected Areas in
Communications (JSAC), February 1988. While there were some papers
from before that time, this collection documented an entire variety
of finished, working audio coders, nearly all of them using
perceptual (i.e. masking) techniques and some kind of frequency
analysis and back-end noiseless coding. Several of these papers
remarked on the difficulty of obtaining good, clean digital audio
for research purposes. Most, if not all, of the authors in the JSAC
edition were also active in the MPEG-1 Audio committee.
The world's first commercial broadcast automation audio compression
system was developed by Oscar Bonello, an engineering professor at
the University of Buenos Aires. In 1983, using the psychoacoustic
principle of the masking of critical bands first published in 1967, he
started developing a practical application based on the recently
developed IBM PC computer, and the broadcast automation system was
launched in 1987 under the name Audicom. Twenty years later, almost all
the radio stations in the world were using similar technology,
manufactured by a number of companies.
Glossary
- ABR: Average bitrate
- CBR: Constant bitrate
- VBR: Variable bitrate
References
- Journal on Selected Areas in Communications, February 1988
- Solidyne... 40 years of
innovation
- The Ear as a Communication Receiver. English
translation of Das Ohr als Nachrichtenempfänger by
Eberhard Zwicker and Richard Feldtkeller. Translated from German by
Hannes Müsch, Søren Buus, and Mary Florentine. Originally published
in 1967; Translation published in 1999