Lossless data compression is a class of
data compression algorithms that allows the exact original data to
be reconstructed from the compressed data. The term
lossless is in contrast to
lossy data compression, which only
allows an approximation of the original data to be reconstructed,
in exchange for better
compression rates.
Lossless data compression is used in many applications. For
example, it is used in the popular
ZIP file format and in the
Unix tool
gzip. It is also often
used as a component within lossy data compression
technologies.
Lossless compression is used when it is important that the original
and the decompressed data be identical, or when it cannot be assumed
that certain deviations from the original are uncritical. Typical
examples are executable programs and source code. Some image file
formats, like
PNG or
GIF, use only lossless
compression, while others like
TIFF and
MNG may use either
lossless or lossy methods.
Lossless compression techniques
Most lossless compression programs do two things in sequence: the
first step generates a
statistical model for the input
data, and the second step uses this model to map input data to bit
sequences in such a way that "probable" (e.g. frequently
encountered) data will produce shorter output than "improbable"
data.
The primary encoding algorithms used to produce bit sequences are
Huffman coding (also used by
DEFLATE) and
arithmetic coding. Arithmetic coding
achieves compression rates close to the best possible for a
particular statistical model, which is given by the
information entropy, whereas Huffman
compression is simpler and faster but produces poor results for
models that deal with symbol probabilities close to 1.
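To make the Huffman step concrete, the following is a minimal sketch of
how such a code can be built from symbol frequencies. The function name,
the byte-oriented alphabet and the string representation of the codes are
illustrative choices for exposition, not part of any particular tool:

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict[int, str]:
    """Build a Huffman code for the byte frequencies in `data` (assumed
    non-empty). Returns a map from byte value to its bit string; frequent
    bytes receive shorter codes, which is what yields compression."""
    freq = Counter(data)
    # Each heap entry: (weight, tie-breaker, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol input
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)      # two least likely subtrees
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_code(b"abracadabra")
encoded = "".join(codes[b] for b in b"abracadabra")
```

An arithmetic coder, by contrast, represents the whole message as a single
number inside an interval whose width is the product of the symbol
probabilities, which is how it avoids the whole-bit-per-symbol granularity
that limits Huffman coding when one probability is close to 1.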
There are two primary ways of constructing statistical models: in a
static model, the data are analyzed and a model is
constructed, then this model is stored with the compressed data.
This approach is simple and modular, but has the disadvantage that
the model itself can be expensive to store, and also that it forces
a single model to be used for all data being compressed, and so
performs poorly on files containing heterogeneous data.
Adaptive models dynamically update the model as the data
are compressed. Both the encoder and decoder begin with a trivial
model, yielding poor compression of initial data, but as they learn
more about the data performance improves. Most popular types of
compression used in practice now use adaptive coders.
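As an illustration of the adaptive case, the sketch below shows a trivial
order-0 adaptive model: both encoder and decoder start from the same flat
counts and apply the same updates, so the model never has to be stored or
transmitted. The class and method names are hypothetical; a real coder
would feed these probabilities into an arithmetic or range coder:

```python
class AdaptiveModel:
    """Order-0 adaptive model over byte values."""

    def __init__(self, alphabet_size: int = 256):
        self.counts = [1] * alphabet_size    # trivial model: all symbols equally likely
        self.total = alphabet_size

    def probability(self, symbol: int) -> float:
        return self.counts[symbol] / self.total

    def update(self, symbol: int) -> None:
        self.counts[symbol] += 1             # symbols seen so far become more probable
        self.total += 1
```

Because the decoder performs `update` on every symbol it reconstructs, its
counts stay in lockstep with the encoder's, which is why compression of the
initial data is poor but improves as the statistics accumulate.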
Lossless compression methods may be categorized according to the
type of data they are designed to compress. While, in principle,
any general-purpose lossless compression algorithm
(
general-purpose meaning that it can accept any
bitstring) can be used on any type of data, many are unable to
achieve significant compression on data that are not of the type
they were designed to compress. Many of the lossless
compression techniques used for text also work reasonably well for
indexed images.
Text
Statistical modeling algorithms for text (or text-like binary data
such as executables) include:
Multimedia
Techniques that take advantage of the specific characteristics of
images, such as the common phenomenon of contiguous 2-D areas of
similar tones, are useful here. For example, every pixel but the
first can be replaced by the difference to its left neighbor. This
leads to small values having a much higher probability than large
values. The same idea is often applied to sound files, and can
compress files which contain mostly low frequencies and low volumes.
For images this step can be repeated by taking the difference to the
pixel above, and in videos the difference to the corresponding pixel
in the previous frame can be taken.
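A minimal sketch of this differencing step is shown below; the function
names are illustrative, and a real codec would pass the residuals to an
entropy coder rather than store them directly:

```python
def delta_encode(row: list[int]) -> list[int]:
    """Replace every value (pixel or audio sample) except the first with
    the difference to its left neighbor; smooth data yields many small
    values, which an entropy coder can represent in few bits."""
    return [row[0]] + [row[i] - row[i - 1] for i in range(1, len(row))]

def delta_decode(deltas: list[int]) -> list[int]:
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)              # undo the differencing exactly
    return out

row = [100, 101, 103, 103, 102, 100]
assert delta_decode(delta_encode(row)) == row   # lossless round trip
```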
A hierarchical version of this technique takes neighboring pairs of
data points, stores their difference and sum, and on a higher level
with lower resolution continues with the sums. This is called
discrete wavelet
transform.
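A sketch of one such hierarchical pass is given below, assuming the input
length is a power of two. This unnormalised sum/difference pairing is the
idea behind the Haar wavelet; practical transforms (including the one in
JPEG 2000) differ in their filters and scaling:

```python
def haar_step(values: list[int]) -> tuple[list[int], list[int]]:
    """For each neighboring pair keep the sum (a coarser, half-resolution
    signal) and the difference (the detail, small for smooth data)."""
    half = len(values) // 2
    sums = [values[2 * i] + values[2 * i + 1] for i in range(half)]
    diffs = [values[2 * i] - values[2 * i + 1] for i in range(half)]
    return sums, diffs

def wavelet_transform(values: list[int]) -> list[int]:
    """Repeat the pairing on the sums until one value remains, collecting
    the differences from every level; assumes len(values) is a power of two."""
    details: list[int] = []
    while len(values) > 1:
        values, diffs = haar_step(values)
        details = diffs + details            # coarser details end up first
    return values + details                  # [coarsest sum, details ...]
```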
JPEG 2000 additionally uses data points from other pairs and
multiplication factors to mix them into the difference. These
factors must be integers, so that the result is an integer under all
circumstances. The values are therefore increased, increasing file
size, but hopefully the distribution of values becomes more peaked.
The adaptive encoding uses the probabilities from the previous
sample in sound encoding, from the left and upper pixel in image
encoding, and additionally from the previous frame in video
encoding. In the wavelet transformation the probabilities are also
passed through the hierarchy.
Historical legal issues
Many of these methods are implemented in open-source and
proprietary tools, particularly LZW and its variants.
Some algorithms are
patented in the USA and other countries and their legal usage requires
licensing by the patent holder. Because of patents on
certain kinds of LZW compression, and in particular licensing
practices by patent holder Unisys that many developers considered
abusive, some open source proponents encouraged people to avoid
using the
Graphics Interchange Format (GIF) for compressing image files in
favor of Portable Network Graphics (PNG), which combines the
LZ77-based
deflate algorithm with a selection of
domain-specific prediction filters. However, the patents on LZW
have now expired.
Many of the lossless compression techniques used for text also work
reasonably well for
indexed images,
but there are other techniques that do not work for typical text
that are useful for some images (particularly simple bitmaps), and
other techniques that take advantage of the specific
characteristics of images (such as the common phenomenon of
contiguous 2-D areas of similar tones, and the fact that color
images are usually dominated by a limited range of the colors
representable in the color space).
As mentioned previously, lossless sound compression is a somewhat
specialised area. Lossless sound compression algorithms can take
advantage of the repeating patterns shown by the wave-like nature
of the data – essentially using models to predict the "next" value
and encoding the (hopefully small) difference between the expected
value and the actual data. If the difference between the predicted
and the actual data (called the "error") tends to be small, then
certain difference values (like 0, +1, -1 etc. on sample values)
become very frequent, which can be exploited by encoding them in
few output bits.
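The sketch below illustrates this predict-and-encode-the-error idea with a
simple second-order linear predictor; real lossless audio codecs choose
among several, more refined predictors, and the function names here are
only for exposition:

```python
def predict_residuals(samples: list[int]) -> list[int]:
    """Predict each sample by linear extrapolation of the previous two and
    store only the prediction error, which is small for smooth signals."""
    residuals = []
    for i, s in enumerate(samples):
        if i < 2:
            prediction = samples[i - 1] if i == 1 else 0
        else:
            prediction = 2 * samples[i - 1] - samples[i - 2]
        residuals.append(s - prediction)
    return residuals

def reconstruct(residuals: list[int]) -> list[int]:
    """Exact inverse: the decoder recomputes the same predictions from the
    samples it has already reconstructed."""
    samples: list[int] = []
    for i, r in enumerate(residuals):
        if i < 2:
            prediction = samples[i - 1] if i == 1 else 0
        else:
            prediction = 2 * samples[i - 1] - samples[i - 2]
        samples.append(prediction + r)
    return samples
```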
It is sometimes beneficial to compress only the differences between
two versions of a file (or, in
video
compression, of an image). This is called
delta compression (from the Greek letter
Δ which is commonly used in
mathematics to denote a difference), but the term is typically only
used if both versions are meaningful outside compression and
decompression. For example, while the process of compressing the
error in the above-mentioned lossless audio compression scheme
could be described as
delta
compression from the approximated sound wave to the original
sound wave, the approximated version of the sound wave is not
meaningful in any other context.
Lossless compression methods
By operation of the
pigeonhole
principle, no lossless compression algorithm can efficiently
compress all possible data, and completely random data streams
cannot be compressed. For this reason, many different algorithms
exist that are designed either with a specific type of input data
in mind or with specific assumptions about what kinds of redundancy
the uncompressed data are likely to contain.
Some of the most common lossless compression algorithms are listed
below.
General purpose
- Run-length encoding – a
simple scheme that provides good compression of data containing
lots of runs of the same value (see the sketch after this list)
- LZW – used by GIF and compress, among
others
- Deflate – used by gzip, modern versions
of zip and as part of the compression process of PNG, PPP, HTTP,
SSH
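As a concrete illustration of the first entry above, a minimal run-length
encoder could look like the following sketch (function names illustrative;
real RLE formats pack the counts into the byte stream rather than into
Python tuples):

```python
def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Store each byte together with the number of times it repeats;
    effective only when the input contains long runs of the same value."""
    runs: list[tuple[int, int]] = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

def rle_decode(runs: list[tuple[int, int]]) -> bytes:
    return bytes(b for b, count in runs for _ in range(count))

assert rle_decode(rle_encode(b"aaaabbbcca")) == b"aaaabbbcca"
```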
Audio
Graphics
- ABO – Adaptive
Binary Optimization
- GIF – (lossless, but
restricted to a very limited color palette)
- JBIG2 – (lossless or lossy compression of
B&W images)
- JPEG-LS – (lossless/near-lossless
compression standard)
- JPEG 2000 – (includes a lossless
compression method)
- JPEG XR – formerly WMPhoto and
HD Photo, includes a lossless compression method
- PGF – Progressive
Graphics File (lossless or lossy compression)
- PNG – Portable Network
Graphics
- TIFF – Tagged Image File Format
3D Graphics
- OpenCTM – Lossless compression of 3D
triangle meshes
Video
Cryptography
Cryptosystems often compress data
before encryption for added security; compression prior to
encryption helps remove redundancies and patterns that might
facilitate cryptanalysis. However, many ordinary lossless
compression algorithms introduce predictable patterns (such as
headers, wrappers, and tables) into the compressed data that may
actually make cryptanalysis easier. Therefore, cryptosystems often
incorporate specialized compression algorithms specific to the
cryptosystem—or at least demonstrated or widely held to be
cryptographically secure—rather than standard compression
algorithms that are efficient but provide potential opportunities
for cryptanalysis.
Limitations
Lossless data compression algorithms cannot guarantee compression
for all input data sets. In other words, for any (lossless) data
compression algorithm, there will be an input data set that does
not get smaller when processed by the algorithm. This is easily
proven with elementary mathematics using a
counting argument, as follows:
- Assume that each file is represented as a string of bits of
some arbitrary length.
- Suppose that there is a compression algorithm that transforms
every file into a distinct file which is no longer than the
original file, and that at least one file will be compressed into
something that is shorter than itself.
- Let M be the least number such that there is a file F with
length M bits that compresses to something shorter. Let N be the
length (in bits) of the compressed version of F.
- Because N < M, every file of length N
keeps its size during compression. There are 2^N such files.
Together with F, this makes 2^N + 1 files which all compress into
one of the 2^N files of length N.
- But 2^N is smaller than 2^N + 1, so by the pigeonhole principle there must be some
file of length N which is simultaneously the output of the
compression function on two different inputs. That file cannot be
decompressed reliably (which of the two originals should that
yield?), which contradicts the assumption that the algorithm was
lossless.
- We must therefore conclude that our original hypothesis (that
the compression function makes no file longer) is necessarily
untrue.
Any lossless compression algorithm that makes some files shorter
must necessarily make some files longer, but it is not necessary
that those files become
very much longer. Most practical
compression algorithms provide an "escape" facility that can turn
off the normal coding for files that would become longer by being
encoded. Then the only increase in size is a few bits to tell the
decoder that the normal coding has been turned off for the entire
input. For example,
DEFLATE compressed files
never need to grow by more than 5 bytes per 65,535 bytes of
input.
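This behaviour is easy to observe with zlib's DEFLATE implementation:
incompressible input is passed through in stored blocks, so the output
exceeds the input only by a small, bounded overhead (plus the zlib header
and checksum). A sketch; the exact output length will vary slightly:

```python
import os
import zlib

random_data = os.urandom(65536)              # high-entropy, effectively incompressible
compressed = zlib.compress(random_data, 9)

# The compressed stream is slightly larger than the input, but only by a
# few bytes of framing, never by anything proportional to the input size.
print(len(random_data), len(compressed))
```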
In fact, if we consider files of length N, if all files were
equally probable, then for any lossless compression that reduces
the size of some file, the expected length of a compressed file
(averaged over all possible files of length N) must necessarily be
greater than N. So if we know nothing about the properties
of the data we are compressing, we might as well not compress it at
all. A lossless compression algorithm is useful only when we are
more likely to compress certain types of files than others; then
the algorithm could be designed to compress those types of data
better.
Thus, the main lesson from the argument is not that one risks big
losses, but merely that one cannot always win. To choose an
algorithm always means implicitly to select a
subset of
all files that will become usefully shorter. This is the
theoretical reason why we need to have different compression
algorithms for different kinds of files: there cannot be any
algorithm that is good for all kinds of data.
The "trick" that allows lossless compression algorithms, used on
the type of data they were designed for, to consistently compress
such files to a shorter form is that the files the algorithm is
designed to act on all have some form of easily modeled
redundancy that the
algorithm is designed to remove, and thus belong to the subset of
files that that algorithm can make shorter, whereas other files
would not get compressed or even get bigger. Algorithms are
generally quite specifically tuned to a particular type of file:
for example, lossless audio compression programs do not work well
on text files, and
vice versa.
In particular, files of
random data cannot be
consistently compressed by any conceivable lossless data
compression algorithm: indeed, this result is used to
define the concept of randomness in
algorithmic complexity
theory.
There have been many claims through the years of companies
achieving 'perfect compression', where any N random
bits can always be compressed to N - 1 bits. This is, of course,
impossible: if such an algorithm existed, it could be applied
repeatedly to losslessly reduce any file to length 0. These kinds
of claims can be safely discarded without even looking at any
further details regarding the purported compression scheme.
The existence of an algorithm that can losslessly
compress any data stream is thus provably impossible. In a
more general sense, any compression algorithm whose proposed
properties contradict fundamental laws of
mathematics may be called magic.
On the other hand, it has also been proven that there is no
algorithm to determine whether a file is incompressible in the
sense of
Kolmogorov
complexity; hence, given any particular file, even if it
appears random, it may turn out to be significantly
compressible, even including the size of the decompressor. An example
is the digits of the mathematical constant
pi, which appear random but can be generated by a
very small program. However, even though it cannot be determined
whether a particular file is incompressible, a
simple theorem about
incompressible strings shows that over 99% of files of any
given length cannot be compressed by more than one byte (including
the size of the decompressor).
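The counting behind that figure can be sketched directly. A file of n bits
that is "compressed by more than one byte" must have a description at
least 8 bits shorter than itself, and there are too few such short
descriptions to cover more than a tiny fraction of the files:

```latex
% Number of bit strings of length at most n-8 (candidate short descriptions):
\[
  \sum_{k=0}^{n-8} 2^{k} \;=\; 2^{\,n-7} - 1 \;<\; 2^{\,n-7}
\]
% Fraction of the 2^n files of length n that can possibly have one:
\[
  \frac{2^{\,n-7}}{2^{n}} \;=\; 2^{-7} \;=\; \frac{1}{128} \;<\; 1\%
\]
```

so more than 99% of the files of any given length admit no description
shorter by a full byte or more.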
Mathematical background
Any
compression algorithm can
be viewed as a
function that
maps sequences of units (normally
octets) into
other sequences of the same units. Compression is successful if the
resulting sequence is shorter than the original sequence. In order
for a compression algorithm to be considered
lossless, there needs to exist a reverse mapping
from compressed bit sequences to original bit sequences; that is to
say, the compression method would need to encapsulate a
bijection between "plain" and "compressed" bit
sequences.
The sequences of length N or less are clearly a strict superset of
the sequences of length N-1 or less. It follows that there are more
sequences of length N or less than there are sequences of length
N-1 or less. It therefore follows from the
pigeonhole principle that it is not
possible to map every sequence of length N or less to a unique
sequence of length N-1 or less. Therefore it is not possible to
produce an algorithm that reduces the size of
every
possible input sequence.
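The count in question is immediate for bit sequences:

```latex
% Bit sequences of length at most N versus at most N-1:
\[
  \sum_{k=0}^{N} 2^{k} \;=\; 2^{N+1} - 1
  \qquad\text{versus}\qquad
  \sum_{k=0}^{N-1} 2^{k} \;=\; 2^{N} - 1
\]
```

so any map from the larger set into the smaller one must send two
different sequences to the same output, and such a map cannot be
inverted.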
Psychological background
Most everyday files are relatively 'sparse' in an
information entropy sense, and thus,
most lossless algorithms a layperson is likely to apply on regular
files compress them relatively well. This may, through
misapplication of
intuition,
lead some individuals to conclude that a well-designed compression
algorithm can compress
any input, thus, constituting a
magic compression algorithm.
Points of application in real compression theory
Real compression algorithm designers accept that streams of high
information entropy cannot be compressed, and accordingly, include
facilities for detecting and handling this condition. An obvious
way of detection is applying a raw compression algorithm and
testing if its output is smaller than its input. Sometimes,
detection is made by
heuristics; for
example, a compression application may consider files whose names
end in ".zip", ".arj" or ".lha" uncompressible without any more
sophisticated detection. A common way of handling this situation is
quoting the input, or uncompressible parts of the input, in the
output, minimising the compression overhead. For example, the
zip data format specifies the 'compression
method' of 'Stored' for input files that have been copied into the
archive verbatim.
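A sketch of this detect-and-quote strategy using zlib is shown below. The
function name and the returned tag are illustrative; the zip format records
the analogous choice in its 'compression method' field:

```python
import zlib

def compress_or_store(data: bytes) -> tuple[str, bytes]:
    """Try DEFLATE; if the result is not smaller than the input, fall back
    to storing the input verbatim, so the worst-case growth is bounded by
    the few bytes needed to record which method was used."""
    compressed = zlib.compress(data, 9)
    if len(compressed) < len(data):
        return "deflated", compressed
    return "stored", data

method, payload = compress_or_store(b"\x00" * 1000)   # -> ("deflated", ...)
```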
The Million Random Number Challenge
Mark Nelson, in response to claims of magic compression algorithms
appearing in
comp.compression, constructed a 415,241-byte binary file of highly
entropic content, and issued a public challenge of $100 to anyone to
write a program that, together with its input, would be smaller than
his provided binary data yet be able to reconstitute ("decompress")
it without error.
The
FAQ for the
comp.compression newsgroup contains a challenge by Mike Goldman
offering $5,000 for a program that can compress random data.
Patrick Craig took up the challenge, but rather than compressing
the data, he split it up into separate files, all of which ended in
the digit '5', which was not stored as part of the file. Omitting
this character allowed the resulting files (plus, in accordance
with the rules, the size of the program that reassembled them) to
be smaller than the original file. However, no actual compression
took place: the information stored in the names of the files was
needed to reassemble them in the correct order into the original
file, and this information was not taken into account in the file
size comparison. The files themselves are thus not sufficient to
reconstitute the original file; the file names are also necessary.
A full history of the event, including discussion
on whether or not the challenge was technically met, is on Patrick
Craig's web site.