The human genome is the genome of Homo sapiens, which is stored on 23
chromosome pairs. Twenty-two of these are autosomal chromosome pairs,
while the remaining pair is sex-determining. The haploid human genome
occupies a total of just over 3 billion DNA base pairs. The Human
Genome Project (HGP) produced a reference sequence of the euchromatic
human genome, which is used worldwide in biomedical sciences.
The haploid human genome contains ca. 23,000 protein-coding
genes, far fewer than had been expected before
its sequencing. In fact, only about 1.5% of the genome codes for
proteins, while the rest consists of
non-coding RNA genes,
regulatory sequences,
introns and (controversially)
"junk" DNA.
Features
Genes
The human genome is estimated to contain ca. 23,000 protein-coding
genes. The estimate of the number of human genes has
been repeatedly revised down from initial predictions of 100,000 or
more as genome sequence quality and gene-finding methods have
improved.
Surprisingly, the number of human genes seems to be less than a
factor of two greater than that of many much simpler organisms,
such as the
roundworm and the
fruit fly. However, human
cells make extensive use of
alternative splicing to produce several
different proteins from a single gene, and the human
proteome is thought to be much larger than those of
the aforementioned organisms. In addition, most human genes have
multiple exons, and human introns are frequently much longer than the
flanking exons.
Human genes are distributed unevenly across the chromosomes. Each
chromosome contains various gene-rich and gene-poor regions, which
seem to be correlated with
chromosome
bands and
GC-content. The
significance of these nonrandom patterns of gene density is not
well understood. In addition to protein-coding genes, the human
genome contains thousands of RNA genes, including tRNA, ribosomal
RNA, microRNA, and other non-coding RNA genes.
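GC-content, which the text above links to gene density, is simply the
fraction of bases in a stretch of DNA that are G or C. A minimal Python
sketch of a windowed calculation follows; the function names, the
window size and the toy sequence are illustrative assumptions, not
taken from any genome-analysis tool.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases in `sequence` that are G or C (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)


def windowed_gc(sequence: str, window: int) -> list[float]:
    """GC-content of consecutive, non-overlapping windows of length `window`."""
    return [
        gc_content(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, window)
    ]


# Toy sequence (hypothetical): a GC-rich stretch followed by an AT-rich one.
toy = "GCGCGGCCGC" + "ATATTAAATA"
print(windowed_gc(toy, window=10))  # [1.0, 0.0]
```

Plotting such window values along a chromosome is one simple way to see
GC-rich and GC-poor regions side by side.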
Regulatory sequences
The human genome has many different
regulatory sequences which are crucial to
controlling
gene expression. These
are typically short sequences that appear near or within genes. A
systematic understanding of these regulatory sequences and how they
together act as a
gene
regulatory network is only beginning to emerge from
computational, high-throughput expression and
comparative genomics studies. Some
types of non-coding DNA are genetic "switches" that do not encode
proteins, but do regulate when and where genes are expressed.
Identification of regulatory sequences relies in part on
evolutionary conservation. The human and mouse lineages, for example,
diverged 70–90 million years ago, so non-coding sequences that
computer comparisons show to be conserved between the two genomes are
likely to be important for functions such as gene regulation.
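The idea of flagging conserved stretches can be illustrated with a
sliding-window identity score over two sequences that are already
aligned. This is only a sketch: real comparative genomics pipelines
align whole genomes and model substitution rates, and the sequences,
window size and threshold below are made-up assumptions.

```python
def window_identity(seq_a: str, seq_b: str, window: int) -> list[float]:
    """Fraction of identical positions in each sliding window of two
    pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = [a == b for a, b in zip(seq_a, seq_b)]
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


def conserved_windows(seq_a: str, seq_b: str, window: int,
                      threshold: float) -> list[int]:
    """Start positions of windows whose identity is at least `threshold`."""
    scores = window_identity(seq_a, seq_b, window)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Hypothetical aligned fragments: the first 8 bases match, the rest differ.
human = "ACGTACGTAAAAAA"
mouse = "ACGTACGTCCCCCC"
print(conserved_windows(human, mouse, window=8, threshold=1.0))  # [0]
```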
Another comparative genomic approach to locating regulatory
sequences in humans is the genome sequencing of the
puffer fish. These vertebrates have essentially
the same genes and regulatory sequences as humans, but with
only one-eighth the "junk" DNA. The compact DNA sequence of the
puffer fish makes it much easier to locate the regulatory
sequences.
Other DNA
Protein-coding sequences (specifically, coding
exons) comprise less than 1.5% of the human genome.
Aside from genes and known regulatory sequences, the human genome
contains vast regions of DNA the function of which, if any, remains
unknown. These regions in fact comprise the vast majority, by some
estimates 97%, of the human
genome size.
Much of this is composed of:
- Repeat elements
- Transposons
- Junk DNA
However, there is also a large amount of sequence that does not
fall under any known classification. Much of this sequence may be
an evolutionary artifact that serves no present-day purpose, and
these regions are sometimes collectively referred to as
"junk" DNA. There are, however, a variety of
emerging indications that many of these sequences are likely to
function in ways that are not fully understood. Recent experiments
using microarrays have revealed that a substantial fraction of
non-genic DNA is in fact transcribed into RNA ("...a tiling array with
5-nucleotide resolution that mapped transcription activity along 10
human chromosomes revealed that an average of 10% of the genome
(compared to the 1 to 2% represented by bona fide exons) corresponds to
polyadenylated transcripts, of which more than half do not overlap
with known gene locations."), which leads to the possibility
that the resulting transcripts may have some unknown function.
Also, the evolutionary conservation across the
mammalian genomes of much more sequence than can be
explained by protein-coding regions indicates that many, and
perhaps most, functional elements in the genome remain
unknown ("...the proportion of small (50-100 bp) segments in
the mammalian genome that is under (purifying) selection can be
estimated to be about 5%. This proportion is much higher than can
be explained by protein-coding sequences alone, implying that the
genome contains many additional features (such as untranslated
regions, regulatory elements, non-protein-coding genes, and
chromosomal structural elements) under selection for biological
function."). The investigation of the vast quantity of
sequence information in the human genome whose function remains
unknown is currently a major avenue of scientific inquiry.
Information content
The 3 billion base pairs of the haploid human genome correspond to
an
information content of about 750
megabytes. The
entropy rate of the genome differs
significantly between coding and non-coding sequences. It is close
to the maximum of 2 bits per base pair for the coding sequences
(about 45 million base pairs), and between 1.5 and 1.9 bits per
base pair for each individual chromosome, except for the Y
chromosome, which has an entropy rate below 0.9 bits per base
pair.
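The 750-megabyte figure follows from simple arithmetic: four possible
bases need 2 bits each, and 8 bits make a byte. The sketch below
reproduces it, taking a megabyte as 10^6 bytes (an assumption, but one
that is consistent with the figures quoted here). The function name is
illustrative.

```python
BITS_PER_BASE = 2            # log2(4) for the four bases A, C, G, T
BITS_PER_BYTE = 8
BYTES_PER_MEGABYTE = 10**6   # assumption: decimal megabytes


def raw_megabytes(base_pairs: int, bits_per_base: float = BITS_PER_BASE) -> float:
    """Storage needed for `base_pairs` bases at `bits_per_base` bits each."""
    return base_pairs * bits_per_base / BITS_PER_BYTE / BYTES_PER_MEGABYTE


print(raw_megabytes(3_000_000_000))     # 750.0
print(raw_megabytes(247_000_000))       # 61.75 (chromosome 1, 247 million bp)
print(raw_megabytes(58_000_000, 0.84))  # Y chromosome at its quoted entropy rate
```

Passing an entropy rate below 2 bits per base pair as the second
argument gives a rough idea of how far a sequence could in principle be
compressed.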
Information content of the haploid human genome by chromosome:
| Chromosome | Million base pairs | Megabytes (raw data) | Megabytes in zipped Human Genome Project file | Entropy rate in bits per base pair (Liu, Venkatesh and Maley 2008) |
|---|---|---|---|---|
| total (XY) | 3,080 | 770 | 827 | 1.70 |
| total (XX) | 3,022 | 756 | 819 | 1.71 |
| 1 | 247 | 61.8 | 65.1 | 1.82 |
| 2 | 243 | 60.7 | 68.2 | 1.80 |
| 3 | 199 | 49.9 | 57.4 | 1.82 |
| 4 | 191 | 47.8 | 52.3 | 1.82 |
| 5 | 181 | 45.2 | 51.3 | 1.83 |
| 6 | 171 | 42.7 | 48.8 | 1.82 |
| 7 | 159 | 39.7 | 45.3 | 1.81 |
| 8 | 146 | 36.6 | 38.6 | 1.83 |
| 9 | 140 | 35.1 | 33.9 | 1.59 |
| 10 | 135 | 33.9 | 39.1 | 1.83 |
| 11 | 134 | 33.6 | 39.8 | 1.84 |
| 12 | 132 | 33.1 | 38.8 | 1.59 |
| 13 | 114 | 28.5 | 28.8 | 1.56 |
| 14 | 106 | 26.6 | 26.5 | 1.53 |
| 15 | 100 | 25.1 | 22.9 | 1.66 |
| 16 | 89 | 22.2 | 22.5 | 1.82 |
| 17 | 79 | 19.7 | 22.7 | 1.87 |
| 18 | 76 | 19.0 | 22.2 | 1.58 |
| 19 | 63 | 16.0 | 16.4 | 1.86 |
| 20 | 62 | 15.6 | 18.9 | 1.82 |
| 21 | 47 | 11.7 | 10.4 | 1.62 |
| 22 | 50 | 12.4 | 10.4 | 1.83 |
| X | 155 | 38.7 | 38.6 | 1.80 |
| Y | 58 | 14.4 | 8.0 | 0.84 |
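The entropy rates in the table were obtained with Lempel-Ziv estimators
(Liu, Venkatesh and Maley 2008). As a much cruder illustration of why
2 bits per base pair is the ceiling, the following sketch computes the
zero-order Shannon entropy of the base frequencies in a sequence. It
ignores correlations between neighbouring bases, so it is not the
estimator used in that paper, and the toy sequences are invented.

```python
import math
from collections import Counter


def shannon_entropy_per_base(sequence: str) -> float:
    """Zero-order Shannon entropy, in bits per base, of the symbol
    frequencies in `sequence`."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    # p * log2(1/p), summed over the symbols that actually occur
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy_per_base("ACGT" * 25))  # 2.0: all four bases equally frequent
print(shannon_entropy_per_base("AAAT" * 25))  # lower: skewed base composition
```

Highly repetitive sequence scores even lower under estimators that take
context into account, which is the direction in which the Y chromosome
stands out in the table.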
Sequencing
DNA sequencing determines the
order of the nucleotide bases in a
genome.
Composite
The
Human Genome Project and a
parallel project by
Celera Genomics
each produced and published a
haploid human
genome sequence, both of which were a composite of the DNA sequence
of several individuals.
Personal
A personal genome sequence is a complete
sequencing of the chemical base pairs that
make up the
DNA of a single person. Because genetic variations such as
single-nucleotide polymorphisms (SNPs) cause medical treatments to
affect people differently, the analysis of personal genomes may lead
to personalized medical treatment based on individual
genotypes.
The completion of the fifth such map was announced in December
2008. The genome mapped was that of the Korean researcher
Seong-Jin Kim. Genome maps had previously been completed for
Craig Venter of the U.S. in 2007, Dan Stoicescu in January 2008,
James Watson of the U.S. in April 2008, and Yang Huanming of China
in November 2008.
Personal genomes had not been sequenced in the Human Genome Project
to protect the identity of volunteers who provided DNA samples.
That sequence was derived from the DNA of several volunteers from a
diverse population. Another distinction is that the HGP sequence is
haploid, whereas the sequence maps for Venter and Watson, for
example, are diploid, representing both sets of chromosomes.
Kim’s genome had 1.58 million SNPs that had never been reported
before, which suggests that six out of 10,000 DNA bases are unique to
Koreans. Kim's sequence map can be used to assist in building a
standard Korean genome, against which the genomes of other Korean
individuals can then be compared for personalized medical
treatments.
Mapping
Whereas a genome sequence lists the order of every DNA base in a
genome, a genome map identifies the landmarks. A genome map is less
detailed than a genome sequence and aids in navigating around the
genome.
Variation
An example of a variation map is the HapMap being developed by the
International HapMap
Project. The HapMap is a
haplotype map
of the human genome, "which will describe the common patterns of
human DNA sequence variation." It catalogs the patterns of
small-scale variations in the genome that involve single DNA
letters, or bases.
Researchers published the first sequence-based map of large-scale
structural variation across the human genome in the journal
Nature in May 2008.
Large-scale structural variations are differences in the genome
among people that range from a few thousand to a few million DNA
bases; some are gains or losses of stretches of genome sequence and
others appear as re-arrangements of stretches of sequence. These
variations include
differences in
the number of copies individuals have of a particular
gene.
Variation
Most studies of human genetic variation have focused on
single nucleotide polymorphisms (SNPs), which are substitutions in
individual bases along a chromosome. Most analyses estimate that SNPs
occur on average once every 100 to 1,000 base pairs in the
euchromatic human genome,
although they do not occur at a uniform density. Thus follows the
popular statement that "we are all, regardless of
race, genetically
99.9% the same", although this would be somewhat qualified by most
geneticists. For example, a much larger fraction of the genome is
now thought to be involved in
copy
number variation. A large-scale collaborative effort to catalog
SNP variations in the human genome is being undertaken by the
International HapMap
Project.
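The "99.9%" figure corresponds to the low end of the SNP density range
quoted above. A back-of-the-envelope sketch, assuming a
3-billion-base-pair haploid genome and counting single-base
substitutions only (so copy number variation is ignored); the function
name is illustrative:

```python
GENOME_SIZE = 3_000_000_000  # haploid base pairs, rounded as in the text


def snp_summary(bases_per_snp: int) -> tuple[int, float]:
    """Expected number of SNP sites and percent identity for one SNP
    every `bases_per_snp` bases."""
    snp_sites = GENOME_SIZE // bases_per_snp
    percent_identity = 100 * (1 - 1 / bases_per_snp)
    return snp_sites, percent_identity


for spacing in (100, 1_000):
    sites, identity = snp_summary(spacing)
    print(f"1 SNP per {spacing} bp: {sites:,} sites, {identity:.1f}% identical")
# 1 SNP per 100 bp: 30,000,000 sites, 99.0% identical
# 1 SNP per 1000 bp: 3,000,000 sites, 99.9% identical
```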
The genomic loci and length of certain types of small
repetitive sequences are highly
variable from person to person, which is the basis of
DNA fingerprinting and DNA
paternity testing technologies. The
heterochromatic portions of the
human genome, which total several hundred million base pairs, are
also thought to be quite variable within the human population (they
are so repetitive and so long that they cannot be accurately
sequenced with current technology). These regions contain few
genes, and it is unclear whether any significant
phenotypic effect results from typical variation
in repeats or heterochromatin.
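The repeat-length variation behind DNA fingerprinting can be
illustrated by counting how many times a short motif is repeated back
to back. The motif and the two sequences below are invented for
illustration; real profiling is done in the laboratory at specific
loci, not by scanning raw sequence like this.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Largest number of back-to-back copies of `motif` found in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Hypothetical alleles that differ only in how often "GATA" is repeated.
person_a = "TTAC" + "GATA" * 7 + "CCGT"
person_b = "TTAC" + "GATA" * 11 + "CCGT"
print(longest_tandem_run(person_a, "GATA"))  # 7
print(longest_tandem_run(person_b, "GATA"))  # 11
```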
Most gross genomic mutations in germ cells probably result in
inviable embryos; however, a number of
human diseases are related to large-scale genomic abnormalities.
Down syndrome, Turner syndrome, and a number of other
diseases result from nondisjunction of entire chromosomes.
Cancer cells
frequently have
aneuploidy of chromosomes
and chromosome arms, although a
cause and
effect relationship between aneuploidy and cancer has not been
established.
Genetic disorders
Most aspects of human biology involve both genetic (inherited) and
non-genetic (environmental) factors. Some inherited variation
influences aspects of our biology that are not medical in nature
(height, eye color, ability to taste or smell certain compounds,
etc.). Moreover, some genetic disorders only cause disease in
combination with the appropriate environmental factors (such as
diet). With these caveats, genetic disorders may be described as
clinically defined diseases caused by genomic DNA sequence
variation. In the most straightforward cases, the disorder can be
associated with variation in a single gene. For example,
cystic fibrosis is caused by mutations in
the CFTR gene, and is the most common recessive disorder in
Caucasian populations, with over 1,300 different mutations known.
Disease-causing mutations in specific genes are usually severe in
terms of gene function and are fortunately rare; genetic
disorders are therefore individually rare as well. However, since
there are many genes that can vary to cause genetic disorders, in
aggregate they make up a significant component of known medical
conditions, especially in pediatric medicine. Molecularly
characterized genetic disorders are those for which the underlying
causal gene has been identified; currently there are approximately
2,200 such disorders annotated in the OMIM database.
Studies of genetic disorders are often performed by means of
family-based studies. In some instances population-based approaches
are employed, particularly in the case of so-called founder
populations such as those in Finland, French Canada, Utah,
Sardinia, etc. Diagnosis and treatment of genetic disorders are
usually performed by a geneticist-physician trained in
clinical/medical genetics. The results of the Human Genome Project
are likely to provide increased availability of
genetic testing for gene-related disorders,
and eventually improved treatment. Parents can be screened for
hereditary conditions and counselled on the consequences, the
probability that a condition will be inherited, and how to avoid or
ameliorate it in their offspring.
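For a recessive single-gene disorder such as cystic fibrosis, the
inheritance probability mentioned above can be worked out by
enumerating the allele combinations a child can receive. The sketch
assumes simple Mendelian inheritance with one normal allele ("A") and
one disease allele ("a"); the genotype notation and function name are
illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotypes(parent1: str, parent2: str) -> dict[str, Fraction]:
    """Probability of each child genotype, given two-letter parental
    genotypes such as "Aa", assuming each allele is passed on with
    probability 1/2."""
    combos = Counter(
        "".join(sorted(a + b)) for a, b in product(parent1, parent2)
    )
    total = sum(combos.values())
    return {genotype: Fraction(n, total) for genotype, n in combos.items()}


# Two unaffected carriers: 1/4 non-carrier (AA), 1/2 carrier (Aa), 1/4 affected (aa).
print(offspring_genotypes("Aa", "Aa"))
```

Real counselling also has to account for factors this toy model leaves
out, such as the environmental influences described earlier in this
section.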
As noted above, there are many different kinds of DNA sequence
variation, ranging from complete extra or missing chromosomes down
to single nucleotide changes. It is generally presumed that much
naturally occurring genetic variation in human populations is
phenotypically neutral, i.e. has little or no detectable effect on
the physiology of the individual (although there may be fractional
differences in fitness defined over evolutionary time frames).
Genetic disorders can be caused by any or all known types of
sequence variation. To molecularly characterize a new genetic
disorder, it is necessary to establish a causal link between a
particular genomic sequence variant and the clinical disease under
investigation. Such studies constitute the realm of human molecular
genetics.
With the advent of the Human Genome Project and the
International HapMap Project,
it has become feasible to explore subtle genetic influences on many
common disease conditions such as diabetes, asthma, migraine,
schizophrenia, etc. Although some causal links have been made
between genomic sequence variants in particular genes and some of
these diseases, often with much publicity in the general media,
these are usually not considered to be genetic disorders
per se as their causes are complex, involving many different
genetic and environmental factors. Thus there may be disagreement
in particular cases whether a specific medical condition should be
termed a genetic disorder.
Evolution
Comparative genomics studies of
mammalian genomes suggest that approximately 5% of the human genome
has been conserved by evolution since the divergence of those
species approximately 200 million years ago, and that this conserved
fraction contains the vast majority of genes. Intriguingly, since
genes and known regulatory sequences probably comprise less than 2%
of the genome, this suggests that there may be more unknown
functional sequence than known functional sequence. A smaller, yet
large, fraction of human genes seem to be shared among most known
vertebrates.
The chimpanzee genome is 95% identical to the human
genome. On average, a typical human protein-coding gene differs
from its chimpanzee ortholog by only two
amino acid substitutions; nearly one
third of human genes have exactly the same protein translation as
their chimpanzee orthologs. A major difference between the two
genomes is human chromosome 2,
which is equivalent to a fusion product of chimpanzee chromosomes
12 and 13 ("Human chromosome 2 resulted from a fusion of two
ancestral chromosomes that remained separate in the chimpanzee
lineage"; "Large-scale sequencing of the chimpanzee genome is now
imminent.").
Humans have undergone an extraordinary loss of
olfactory receptor genes during our
recent evolution, which explains our relatively crude sense of
smell compared to most other mammals.
Evolutionary evidence suggests that the emergence of
color vision in humans and several other
primate species has diminished the need for
the sense of smell ("Our findings suggest that the
deterioration of the olfactory repertoire occurred concomitant with
the acquisition of full trichromatic color vision in
primates.").
Mitochondrial genome
The human
mitochondrial genome,
while usually not included when referring to the "human genome", is
of tremendous interest to geneticists, since it undoubtedly plays a
role in
mitochondrial disease.
It also sheds light on human evolution; for example, analysis of
variation in the human mitochondrial genome has led to the
postulation of a recent common ancestor for all humans on the
maternal line of descent (see Mitochondrial Eve).
Due to the lack of a system for checking for copying errors,
mitochondrial DNA (mtDNA) has a more rapid rate of variation than
nuclear DNA. This 20-fold increase in the mutation rate allows
mtDNA to be used for more accurate tracing of maternal ancestry.
Studies of
mtDNA in populations have allowed ancient migration paths to be
traced, such as the migration of Native Americans from
Siberia
or Polynesians from
southeastern Asia. It has also been used
to show that there is no trace of
Neanderthal DNA in the European gene mixture
inherited through purely maternal lineage.
Epigenome
A variety of features of the human genome that transcend its
primary DNA sequence, such as
chromatin
packaging,
histone modifications and
DNA methylation, are important in
regulating gene expression, genome replication and other cellular
processes. These "epigenetic" features are thought to be involved in
cancer and other abnormalities, and some may be heritable across
generations.
References
- More than 9,000,000 Unique Genes in Human Gut
Bacterial Community: Estimating Gene Numbers Inside a Human Body,
27 Jun 2009
- Evolutionary Trajectories of Primate Genes Involved
in HIV Pathogenesis, 2 September 2009
- Carroll, Sean B. et al. (May 2008). "Regulating Evolution",
Scientific American, pp. 60–67.
- Summary
- Zhandong Liu, Santosh S Venkatesh and Carlo C Maley,
Sequence space coverage, entropy of genomes and the potential
to detect non-human DNA in human samples, BMC Genomics 2008,
9:509, doi:10.1186/1471-2164-9-509, fig. 6, using the
Lempel-Ziv
estimators of entropy rate.
- at Project Gutenberg, zipped plaintext file, includes header
information
- http://www.nytimes.com/2008/03/04/health/research/04geno.html
- from Bill Clinton's 2000 State of the Union
address
- Online Mendelian Inheritance in Man (OMIM)