A
file format is a particular way that information
is encoded for storage in a
computer
file.
Since a
disk drive, or indeed any
computer storage, can store only
bits, the computer must have some way of
converting
information to 0s and 1s and
vice versa. There are different kinds of formats for different
kinds of information. Within any format type, e.g.,
word processor documents, there will
typically be several different formats. Sometimes these formats
compete with each other.
File formats are divided into
proprietary and
open formats.
Generality
Some file formats are designed to store very particular sorts of
data: the
JPEG format, for example, is designed
only to store static photographic
images.
Other file formats, however, are designed for storage of several
different types of data: the
GIF format supports storage of
both still images and simple animations, and the
QuickTime format can act as a
container for many different types of
multimedia. A
text
file is simply one that stores any text, in a format such as
ASCII or
UTF-8, with few
if any
control characters. Some
file formats, such as
HTML, or the
source code of some particular programming
language, are in fact also text files, but adhere to more specific
rules which allow them to be used for specific purposes.
Specifications
Many file formats, including some of the best-known file
formats, have a published
specification document (often with a
reference implementation) that
describes exactly how the data is to be encoded, and which can be
used to determine whether or not a particular
program treats a particular file format
correctly. There are, however, two reasons why this is not always
the case. First, some file format developers view their
specification documents as
trade
secrets, and therefore do not release them to the public.
Second, some file format developers never spend time writing a
separate specification document; rather, the format is defined only
implicitly, through the program(s) that manipulate data in the
format.
Using file formats without a publicly available specification can
be costly. Learning how the format works will require either
reverse engineering it from a
reference implementation or acquiring the specification document
for a fee from the format developers. This second approach is
possible only when there
is a specification document, and
typically requires the signing of a
non-disclosure agreement. Both
strategies require significant time, money, or both. Therefore, as
a general rule, file formats with publicly available specifications
are supported by a large number of programs, while non-public
formats are supported by only a few programs.
Patent law, rather than
copyright, is more often used to protect a file
format. Although patents for file formats are not directly
permitted under US law, some formats require the encoding of data
with patented
algorithms. For example,
using compression with the GIF file format requires the use of a
patented algorithm, and although initially the patent owner did not
enforce it, they later began collecting fees for use of the
algorithm. This has resulted in a significant decrease in the use
of
GIFs, and is partly responsible for the
development of the alternative
PNG format. However, the patent
expired in the US in mid-2003, and worldwide in mid-2004.
Algorithms are usually held not to be
patentable under current European law, which also includes a
provision that members "shall ensure that, wherever the use of a
patented technique is needed for a significant purpose such as
ensuring conversion of the conventions used in two different
computer systems or networks so as to allow communication and
exchange of data content between them, such use is not considered
to be a patent infringement", which would apparently allow
implementation of a patented file system where necessary to allow
two different computers to interoperate.
Identifying the type of a file
Since files are seen by programs as streams of data, a method is
required to determine the format of a particular file within the
filesystem—an example of
metadata. Different
operating systems have traditionally taken
different approaches to this problem, with each approach having its
own advantages and disadvantages.
Of course, most modern operating systems, and individual
applications, need to use all of these approaches to process
various files, at least to be able to read 'foreign' file formats,
if not work with them completely.
Filename extension
One popular method in use by several operating systems, including
Mac OS X,
CP/M,
DOS,
VMS,
VM/CMS, and
Windows,
is to determine the format of a file based on the section of its
name following the final period. This portion of the filename is
known as the
filename extension.
For example, HTML documents are identified by names that end with
.html (or
.htm), and GIF images by
.gif.
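The extension lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular operating system implements it; the `KNOWN_FORMATS` table and the function names are hypothetical.

```python
from pathlib import PurePath

# Hypothetical lookup table; a real system would hold far more entries.
KNOWN_FORMATS = {
    "html": "HTML document",
    "htm": "HTML document",
    "gif": "GIF image",
}


def extension_of(filename: str) -> str:
    """Return the part of the name after the final period, lower-cased."""
    return PurePath(filename).suffix.lstrip(".").lower()


def guess_format(filename: str) -> str:
    """Guess a format from the filename alone, without reading the file."""
    return KNOWN_FORMATS.get(extension_of(filename), "unknown")
```

Note that `guess_format` never opens the file, which is what makes this method both cheap and easy to fool.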
In the original
FAT
filesystem, filenames were limited to an eight-character identifier
and a three-character extension, a scheme known as an
8.3 filename. Many formats thus still use
three-character extensions, even though modern operating systems
and application programs no longer have this limitation. Since
there is no standard list of extensions, more than one format can
use the same extension, which can confuse the operating system and
consequently users.
One artifact of this approach is that the system can easily be
tricked into treating a file as a different format simply by
renaming it—an HTML file can, for instance, be easily treated as
plain text by renaming it from
filename.html to
filename.txt. Although this strategy was useful to expert
users who could easily understand and manipulate this information,
it was frequently confusing to less technical users, who might
accidentally make a file unusable (or 'lose' it) by renaming it
incorrectly.
This led more recent
operating
system shells, such as
Windows 95 and
Mac OS X, to hide the extension when
displaying lists of recognized files. This separates the user from
the complete filename, preventing the accidental changing of a file
type, while allowing expert users to retain the original
functionality by enabling the display of file
extensions.
A downside of hiding the extension is that it then becomes possible
to have what appear to be two or more identical filenames in the
same folder. This is especially true when image files are needed in
more than one format for different applications. For example, a
company logo may be needed both in
.tif format (for
publishing) and
.gif format (for web sites). With the
extensions visible, these would appear as the unique filenames
"
CompanyLogo.tif" and "
CompanyLogo.gif". With the
extensions hidden, these would both appear to have the identical
filename "
CompanyLogo", making it more difficult to
determine which to select for a particular application.
A further downside is that hiding such information can become a
security risk. On a system that relies on filename extensions,
all usable files have such an extension (for example,
all JPEG images have ".jpg" or ".jpeg" at the end of their
name), so users see extensions routinely and
may depend on them to judge a file's format. When
file extensions are hidden, a malicious user can create an
executable program with an
innocent name such as "
Holiday photo.jpg.exe". In this
case the "
.exe" will be hidden and a user will see this
file as "
Holiday photo.jpg", which appears to be a JPEG
image, unable to harm the machine save for bugs in the application
used to view it. However, the operating system will still see the
"
.exe" extension and thus will run the program, which is
then able to cause harm. To further
trick users, a program can carry its own icon, as
is possible on Microsoft Windows, overriding the operating system's
icon assignment with an icon commonly used to
represent JPEG images, so that the program looks like, and appears
to be named as, an image until it is opened. Users
who keep extensions hidden must therefore be vigilant and never
open a file that seems to display a known extension even though
hiding is enabled (since it must then have two
extensions, the real one being unknown until hiding is disabled).
This presents a practical problem for Windows systems where
extension hiding is turned on by default.
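The double-extension trick can be illustrated with a short check. This is only a sketch: the two extension sets are assumed and deliberately incomplete, and `displayed_name` merely imitates a shell that hides the final extension.

```python
from pathlib import PurePath

# Assumed, deliberately incomplete sets, for illustration only.
EXECUTABLE_EXTENSIONS = {".exe", ".com", ".bat", ".scr"}
DOCUMENT_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".txt", ".pdf"}


def displayed_name(filename: str) -> str:
    """Name as shown by a shell that hides the final extension."""
    return PurePath(filename).stem


def looks_disguised(filename: str) -> bool:
    """True if an executable extension hides behind a document-like one."""
    suffixes = [s.lower() for s in PurePath(filename).suffixes]
    return (
        len(suffixes) >= 2
        and suffixes[-1] in EXECUTABLE_EXTENSIONS
        and suffixes[-2] in DOCUMENT_EXTENSIONS
    )
```

With these definitions, "Holiday photo.jpg.exe" is displayed as "Holiday photo.jpg" and is flagged by `looks_disguised`, while a genuine "Holiday photo.jpg" is not.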
Internal metadata
A second way to identify a file format is to store information
regarding the format inside the file itself. Usually, such
information is written as one or more binary strings, or as tagged or
raw text, placed in fixed, specific locations within the file.
Since the easiest place to locate this information is the beginning of the file,
that area is usually called a
file header when it is
greater than a few bytes, or a
magic number if it is just
a few bytes long.
File header
The metadata contained in a
file header is not necessarily stored only at
the beginning of the file; it may be present in other areas too, often
including the end of the file, depending on the file format or
the type of data it contains. Character-based (text) files have
character-based human-readable headers, whereas binary formats
usually feature binary headers, although that is not a rule: a
human-readable file header may require more bytes, but is easily
discernible with simple text or hexadecimal editors. File headers
may contain not only the information required by algorithms to
identify the file format, but also real metadata about the
file and its contents. For example, most
image file formats store information
about image size, resolution,
colour
space/format and optionally other
authoring information such as who made it, when and where it
was made, and what camera model and shooting parameters were used
(if any; see
Exif), and so on. Such
metadata may be used by a program reading or interpreting the file
both during the loading process and after that, but can also be
used by the operating system to quickly capture information about
the file itself without loading it all into memory.
Using the file header to identify a file format has
at least two downsides. First, at least a few (initial) blocks of the
file need to be read in order to gain such information; those could
be
fragmented in different
locations of the same storage medium, thus requiring more seek and
I/O time, which is particularly costly when identifying large
numbers of files at once (as when a
GUI
browses a folder with thousands of files and
must determine file icons or
thumbnails for
all of them). Second, if the header is binary
hard-coded (i.e. the header itself is subject to a non-trivial
interpretation in order to be recognized), especially for the sake of
protecting metadata content, there is some risk that the file format is
misinterpreted at first sight, or even badly written at the source,
often resulting in corrupt metadata (which, in extremely
pathological cases, might even render the file
unreadable).
A more logically sophisticated example of file header is that used
in
wrapper (or container)
file formats.
Magic number
One way to incorporate such metadata, often associated with
Unix and its derivatives, is just to store a
"magic number" inside the file itself. Originally, this term was
used for a specific set of 2-byte identifiers
at the beginning of a file, but since any undecoded binary sequence
can be regarded as a number, any feature of a file format which
uniquely distinguishes it can be used for identification. GIF
images, for instance, always begin with the
ASCII representation of either
GIF87a or
GIF89a, depending upon the standard to which they adhere.
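A magic number test for the GIF case above might look like the following sketch. The function names are illustrative; only the two signatures come from the text.

```python
from typing import Optional

# The two signatures named in the text, as ASCII bytes.
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def is_gif(data: bytes) -> bool:
    """True if the data begins with one of the GIF signatures."""
    return data.startswith(GIF_SIGNATURES)


def gif_version(data: bytes) -> Optional[str]:
    """Return '87a' or '89a' for GIF data, or None for anything else."""
    if not is_gif(data):
        return None
    return data[3:6].decode("ascii")
```

In practice only the first six bytes of the file need to be read, for example with `open(path, "rb").read(6)`.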
Many file types, most especially plain-text files, are harder to
spot by this method. HTML files, for example, might begin with the
string
<html> (which is not case sensitive), or an
appropriate
document type
definition that starts with
<!DOCTYPE, or, for
XHTML, the
XML identifier,
which begins with
<?xml. The files can also begin with
HTML comments, random text, or several empty lines, but still be
usable HTML.
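The difficulty with text formats shows up in even a simple sniffing heuristic. The sketch below checks only the three prefixes mentioned above after skipping leading whitespace; the 512-byte window is an arbitrary assumption, and an `<?xml` prefix by itself does not prove the file is XHTML.

```python
# Prefixes from the text, lower-cased because the match is case-insensitive.
HTML_PREFIXES = (b"<html", b"<!doctype", b"<?xml")


def looks_like_html(data: bytes) -> bool:
    """Heuristic only: inspect the start of the data for a known prefix."""
    head = data[:512].lstrip().lower()
    return head.startswith(HTML_PREFIXES)
```

A file that opens with a comment or with plain text fails this test even though a browser may still render it as HTML, which is exactly the weakness described above.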
The magic number approach offers better guarantees that the format
will be identified correctly, and can often determine more precise
information about the file. However, since reasonably reliable "magic
number" tests can be fairly complex, and each file must effectively
be tested against every possibility in the magic database, this
approach is relatively inefficient, especially for displaying large
lists of files (in contrast, filename and metadata-based methods
need check only one piece of data, and match it against a sorted
index). Also, data must be read from the file itself, increasing
latency as opposed to metadata stored in the directory. Where
filetypes don't lend themselves to recognition in this way, the
system must fall back to metadata. It is, however, the best way for
a program to check if a file it has been told to process is of the
correct format: while the file's name or metadata may be altered
independently of its content, failing a well-designed magic number
test is a pretty sure sign that the file is either corrupt or of
the wrong type. On the other hand a valid magic number does not
guarantee that the file is not corrupt or of a wrong type.
So-called
shebang lines in
script files are a special
case of magic numbers. Here, the magic number is human-readable
text that identifies a specific
command interpreter and
options to be passed to the command interpreter.
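Reading a shebang line can be sketched as follows. This is a simplified illustration: it returns everything after the interpreter path as a single option string, and how a real system splits that remainder varies between platforms.

```python
from typing import Optional, Tuple


def parse_shebang(first_line: str) -> Optional[Tuple[str, str]]:
    """Split a '#!' line into (interpreter, options), or return None."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None
    interpreter = parts[0]
    options = parts[1] if len(parts) > 1 else ""
    return interpreter, options
```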
Another operating system using magic numbers is
AmigaOS, where magic numbers were called "Magic
Cookies" and were adopted as a standard system to recognize
executables in
Hunk executable file
format and also to let single programs, tools and utilities deal
automatically with their saved data files, or any other kind of
file types when saving and loading data. This system was then
enhanced with the Amiga standard
Datatype
recognition system.
External metadata
A final way of storing the format of a file is to explicitly store
information about the format in the file system, rather than within
the file itself.
This approach keeps the metadata separate from both the main data
and the name, but is also less
portable than
either file extensions or "magic numbers", since the format has to
be converted from filesystem to filesystem. While this is also true
to an extent with filename extensions — for instance, for
compatibility with
MS-DOS's three character
limit — most forms of storage have a roughly equivalent definition
of a file's data and name, but may have varying or no
representation of further metadata.
Archive files, such as zip files,
sidestep the problem of handling metadata. A utility program collects
multiple files, together with metadata about each file and the
folders/directories they came from, into one new file (e.g. a
zip file with the extension .zip). The new file is typically compressed and
possibly encrypted, and can be transmitted as a single file
across operating systems by FTP or as an email attachment. At
the destination, it must be unpacked by a compatible utility to be
useful, but the problems of transmission are solved this way.
Mac OS type-codes
The
Mac OS'
Hierarchical File System stores
codes for
creator and
type as part of the directory
entry for each file. These codes are referred to as
OSTypes, and for instance a
HyperCard "stack" file has a
creator of
WILD (from HyperCard's previous name, "WildCard") and a
type of
STAK. The type code specifies the format
of the file, while the creator code specifies the default program
to open it with when double-clicked by the user. For example, the
user could have several text files all with the type code of
TEXT, but which each open in a different program, due to
having differing creator codes.
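As a rough sketch of how such codes behave: an OSType is a four-character code that fits in a 32-bit integer, a detail assumed here rather than stated above. In the table below, only the WILD entry comes from the text; the creator codes EDT1 and EDT2 and their application names are made up.

```python
def ostype_to_int(code: str) -> int:
    """Pack a four-character code into a 32-bit big-endian integer."""
    raw = code.encode("mac_roman")
    if len(raw) != 4:
        raise ValueError("an OSType is exactly four characters")
    return int.from_bytes(raw, "big")


def int_to_ostype(value: int) -> str:
    """Unpack a 32-bit integer back into its four-character code."""
    return value.to_bytes(4, "big").decode("mac_roman")


# Hypothetical creator table; only WILD is taken from the text.
CREATOR_APPS = {"WILD": "HyperCard", "EDT1": "Editor One", "EDT2": "Editor Two"}


def default_application(type_code: str, creator_code: str) -> str:
    """The creator code, not the type code, selects the program to open."""
    return CREATOR_APPS.get(creator_code, "unknown")
```

Two files that share the type code TEXT but carry the creators EDT1 and EDT2 would open in different programs, as the text describes.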
RISC OS uses
a similar system, consisting of a 12-bit number
which can be looked up in a table of descriptions — e.g. the
hexadecimal number FF5 is "aliased" to
PoScript,
representing a
PostScript file.
Mac OS X Uniform Type Identifiers (UTIs)
A Uniform Type Identifier (UTI) is a method used in
Mac OS X for uniquely identifying "typed" classes
of entity, such as file formats.
It was developed by Apple
as a replacement for OSType
(type & creator codes).
The UTI is a
Core Foundation
string, which uses
reverse-DNS notation. Common or standard
types use the
public domain (e.g.
public.png for
a
Portable Network
Graphics image), while other domains can be used for
third-party types (e.g.
com.adobe.pdf for
Portable Document Format). UTIs can
be defined within a hierarchical structure, known as a conformance
hierarchy. Thus,
public.png conforms to a supertype of
public.image, which itself conforms to a supertype of
public.data. A UTI can exist in multiple hierarchies,
which provides great flexibility.
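A conformance hierarchy can be modelled as a small graph. The sketch below is not Apple's API; it only encodes the three identifiers mentioned above and walks upward through their declared supertypes.

```python
# Illustrative subset: each identifier maps to its direct supertypes.
# A list is used because an identifier may belong to several hierarchies.
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
}


def conforms_to(uti: str, ancestor: str) -> bool:
    """True if `uti` is `ancestor` or conforms to it, directly or indirectly."""
    if uti == ancestor:
        return True
    return any(conforms_to(parent, ancestor) for parent in CONFORMS_TO.get(uti, []))
```

A program that can handle any `public.image` can therefore accept a `public.png` file without knowing about PNG specifically.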
In addition to file formats, UTIs can also be used for other
entities which can exist in the OS X
file
system, including:
- Pasteboard data
- Folders
(directories)
- Translatable types (as handled by the Translation Manager)
- Bundles
- Frameworks
- Streaming data
- Aliases and symlinks
OS/2 extended attributes
The
HPFS,
FAT12 and FAT16 (but not FAT32)
filesystems allow the storage of "extended attributes" with files.
These comprise an arbitrary set of triplets with a name, a coded
type for the value and a value, where the names are unique and
values can be up to 64 KB long. There are standardized meanings for
certain types and names (under OS/2). One such is that the ".TYPE"
extended attribute is used to determine the file type. Its value
comprises a list of one or more file types associated with the
file, each of which is a string, such as "Plain Text" or "HTML
document". Thus a file may have several types.
The
NTFS filesystem also allows OS/2
extended attributes to be stored, as one of the file
forks, but this
feature is present merely to support the OS/2 subsystem (not
present in XP), so the Win32 subsystem treats this information as
an opaque block of data and does not use it. Instead, it relies on
other file forks to store meta-information in Win32-specific
formats. OS/2 extended attributes can still be read and written by
Win32 programs, but the data must be parsed entirely by
applications.
POSIX extended attributes
On
Unix and
Unix-like
systems, the
ext2,
ext3,
ReiserFS version 3,
XFS,
JFS,
FFS, and
HFS+ filesystems allow the storage of extended
attributes with files. These include an arbitrary list of
"name=value" strings, where the names are unique, which can be
accessed by their "name" parts.
PRONOM Unique Identifiers (PUIDs)
The
PRONOM Persistent Unique Identifier is an extensible scheme of
persistent, unique and unambiguous identifiers for file formats,
which has been developed by The National Archives of the UK
as part of its PRONOM technical registry
service. PUIDs can be expressed as
Uniform Resource Identifiers
using the
info:pronom/ namespace. Although not yet widely
used outside of UK government and some
digital preservation programmes, the
PUID scheme does provide greater granularity than most alternative
schemes.
MIME types
MIME types are widely used in many
Internet-related applications, and increasingly
elsewhere, although their usage for on-disc type information is
rare. These consist of a standardised system of identifiers
(managed by
IANA) consisting of a
type and a
sub-type, separated by a
slash — for instance,
text/html
or
image/gif. These were originally intended as a way of
identifying what type of file was attached to an
e-mail, independent of the source and target
operating systems. MIME types identify files on
BeOS,
AmigaOS 4.0 and
MorphOS, as well as store unique application
signatures for application launching. In AmigaOS and MorphOS the
MIME type system works in parallel with the Amiga-specific
Datatype
system.
MIME types have a weakness, though: several
organisations and people have created their own MIME types without
registering them properly with IANA, which makes the use of this
standard awkward in some cases.
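The type/sub-type structure is easy to work with in code. The sketch below uses Python's standard `mimetypes` module, which guesses a MIME type from a filename extension; the `split_mime` helper is hypothetical.

```python
import mimetypes
from typing import Tuple


def split_mime(mime_type: str) -> Tuple[str, str]:
    """Split 'type/subtype' into its two parts."""
    major, _, minor = mime_type.partition("/")
    if not major or not minor:
        raise ValueError("expected the form type/subtype")
    return major, minor


def mime_for(filename: str) -> str:
    """Guess a MIME type from the filename, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"
```

Note that `mimetypes` maps extensions to MIME types, so it inherits the weaknesses of extension-based identification.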
File format identifiers (FFIDs)
File format identifiers are another, not widely used, way to identify
file formats according to their origin and their file category. The scheme
was created for the Description Explorer suite of software. An identifier is
composed of several digits of the form
NNNNNNNNN-XX-YYYYYYY. The first part indicates the
organisation origin/maintainer (this number represents a value in a
company/standards organisation database); the two following digits
categorize the type of file in hexadecimal. The final part is
composed of the usual file extension of the file or the
international standard number of the file, padded left with zeros.
For example, the PNG file specification has the FFID of
000000001-31-0015948, where 31 indicates an image file,
0015948 is the standard number and 000000001 indicates the ISO
Organisation.
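Splitting an identifier of this form into its three parts is straightforward. The sketch below follows only the layout described above; the class and field names are illustrative, not part of any published FFID tooling.

```python
from typing import NamedTuple


class FFID(NamedTuple):
    origin: int        # organisation number
    category: int      # file category, written as two hexadecimal digits
    identifier: str    # extension or standard number, zero padding removed


def parse_ffid(text: str) -> FFID:
    """Parse an identifier of the form NNNNNNNNN-XX-YYYYYYY."""
    origin, category, identifier = text.split("-")
    if (len(origin), len(category), len(identifier)) != (9, 2, 7):
        raise ValueError("expected the form NNNNNNNNN-XX-YYYYYYY")
    return FFID(int(origin), int(category, 16), identifier.lstrip("0"))
```

For the PNG example, `parse_ffid("000000001-31-0015948")` yields origin 1, category 0x31 and identifier "15948".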
File content based format identification
Another, though the least popular, way to identify a file format is to
look at the file contents for patterns that distinguish
file types. The contents of a file are a sequence of bytes, and a
byte can take 256 distinct values (0 to 255). Counting the occurrences
of each byte value, often referred to as the byte frequency
distribution, therefore gives distinguishable patterns for identifying file types.
Many content-based file type identification schemes
use the byte frequency distribution to build representative models
for each file type, and then apply statistical and data mining techniques to
identify file types.
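A byte frequency distribution is simple to compute. In the sketch below, the comparison by summed absolute difference is an arbitrary choice made for illustration; it does not reproduce any particular published scheme.

```python
from collections import Counter
from typing import List


def byte_frequency(data: bytes) -> List[float]:
    """Return the relative frequency of each of the 256 byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid dividing by zero on empty input
    return [counts.get(value, 0) / total for value in range(256)]


def distance(a: List[float], b: List[float]) -> float:
    """Summed absolute difference between two distributions (illustrative)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

A classifier along these lines would compare the distribution of an unknown file against a stored model for each file type and pick the closest one.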
File structure
There are several ways to structure data in a file. The
most usual ones are described below.
Unstructured formats (raw memory dumps)
Earlier file formats used raw data formats that consisted of
directly dumping the memory images of one or more structures into
the file.
This has several drawbacks. Unless the memory images also have
reserved spaces for future extensions, extending and improving this
type of file is very difficult. It also creates files
that might be specific to one platform or programming language (for
example a structure containing a
Pascal string is not
recognized as such in
C).
On the other hand, developing tools for reading and writing these
types of files is very simple.
The limitations of the unstructured formats led to the development
of other types of file formats that could be easily extended and be
backward compatible at the same time.
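The platform dependence of a raw memory dump can be shown with Python's `struct` module. The record layout here (a 16-bit id followed by a 32-bit size) is hypothetical; the point is that the same two values produce different bytes depending on byte order and alignment.

```python
import struct

# Hypothetical record: a 16-bit id followed by a 32-bit size.
RECORD_LE = struct.Struct("<HI")      # little-endian, no padding
RECORD_BE = struct.Struct(">HI")      # big-endian, no padding
RECORD_NATIVE = struct.Struct("@HI")  # host byte order and host alignment

little = RECORD_LE.pack(1, 258)
big = RECORD_BE.pack(1, 258)

# Reading a little-endian dump with big-endian rules gives the wrong values.
misread = RECORD_BE.unpack(little)
```

`RECORD_NATIVE.size` may also exceed `RECORD_LE.size`, because the host compiler's alignment rules can insert padding between the two fields.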
Chunk based formats
Electronic Arts and
Commodore-Amiga
pioneered this kind of file format in 1985, with their IFF
(Interchange File Format) file
format. In this kind of file structure, each piece of data is
embedded in a container that contains a signature identifying the
data, as well as the length of the data (for binary encoded files).
This type of container is called a
"chunk". The signature
is usually called a chunk id, chunk identifier, or tag
identifier.
With this type of file structure, tools that do not know certain
chunk identifiers simply skip those that they do not
understand.
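A reader that skips unknown chunks can be sketched as follows. The layout assumed here, a 4-byte identifier, a 32-bit big-endian length and padding to an even size, follows the IFF convention and is not spelled out in the text above; the chunk names NAME, XTRA and BODY are made up.

```python
import struct
from io import BytesIO


def write_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Serialise one chunk: id, big-endian length, payload, pad to even size."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack(">I", len(payload)) + payload + pad


def iter_chunks(stream):
    """Yield (chunk_id, payload) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        chunk_id, length = struct.unpack(">4sI", header)
        payload = stream.read(length)
        if length % 2:
            stream.read(1)  # skip the pad byte
        yield chunk_id, payload


KNOWN = {b"NAME", b"BODY"}  # identifiers this hypothetical tool understands


def read_known(data: bytes) -> dict:
    """Keep the chunks we understand and silently skip the rest."""
    return {cid: body for cid, body in iter_chunks(BytesIO(data)) if cid in KNOWN}


SAMPLE = (
    write_chunk(b"NAME", b"logo")
    + write_chunk(b"XTRA", b"abc")
    + write_chunk(b"BODY", b"\x01\x02")
)
```

Because every chunk states its own length, the reader can step over the unknown XTRA chunk without understanding its contents.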
This concept has been adopted again and again, by
RIFF (Microsoft-IBM equivalent of IFF),
PNG,
JPEG storage, DER (
Distinguished Encoding Rules)
encoded streams and files (which were originally described in CCITT
X.409:1984 and therefore predate IFF), and
Structured Data Exchange Format. Even
XML can be considered a kind of chunk based format,
since each data element is surrounded by tags which are akin to
chunk identifiers.
Directory based formats
This is another extensible kind of format, one that closely resembles a file
system (
OLE Documents are actual filesystems),
where the file is composed of 'directory entries' that contain the
location of the data within the file itself as well as its
signatures (and in certain cases its type). Good examples of these
types of file structures are
disk images,
OLE documents and
TIFF
images.
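The directory idea can be illustrated with a toy format. The layout below (an entry count, then fixed-size entries holding a name, an offset and a length) is hypothetical and far simpler than TIFF or OLE; it only shows how a reader finds data through the directory rather than by scanning the file.

```python
import struct
from typing import Dict, Optional

# Hypothetical entry layout: 8-byte padded name, 32-bit offset, 32-bit length.
ENTRY = struct.Struct("<8sII")


def build(files: Dict[str, bytes]) -> bytes:
    """Write a count, then the directory, then the data blocks."""
    offset = 4 + len(files) * ENTRY.size
    directory, body = b"", b""
    for name, data in files.items():
        directory += ENTRY.pack(name.encode("ascii"), offset, len(data))
        offset += len(data)
        body += data
    return struct.pack("<I", len(files)) + directory + body


def read_entry(blob: bytes, wanted: str) -> Optional[bytes]:
    """Look the name up in the directory and slice out its data."""
    (count,) = struct.unpack_from("<I", blob, 0)
    for index in range(count):
        raw_name, offset, length = ENTRY.unpack_from(blob, 4 + index * ENTRY.size)
        if raw_name.rstrip(b"\x00").decode("ascii") == wanted:
            return blob[offset:offset + length]
    return None
```

New kinds of entries can be added without disturbing readers that only look up the names they know.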