A
database is an integrated collection of
logically related records or files consolidated into a common pool
that provides data for one or more multiple uses.
One way of classifying databases involves the type of content, for
example: bibliographic, full-text, numeric, image. Other
classification methods start from examining
database models or database architectures:
see below.
Software organizes the data in a database according to a
database model. the
relational model occurs most commonly.
Other models such as the
hierarchical
model and the
network model use a
more explicit representation of relationships.
Architecture
A number of database architectures exist. Many databases use a
combination of strategies.
Databases consist of software-based "containers" that are
structured to collect and store information so users can retrieve,
add, update or remove such information in an automatic fashion.
Database programs are designed for users so that they can add or
delete any information needed. The structure of a database is the
table, which consists of rows and columns of information.
Online Transaction Processing systems (OLTP) often use a "row
oriented" or an "object oriented" data store architecture, whereas
data-warehouse and other retrieval focused applications like
Google's
BigTable, or
bibliographic database (library catalog) systems may use a Column
oriented DBMS architecture.
Document-Oriented,
XML,
knowledgebases, as well as frame databases and
RDF-stores (also
known as
triple stores), may also use a
combination of these architectures in their implementation.
Not all databases have or need a
database schema ("schema-less
databases").
general-purpose database systems have dominated the database industry. These offer a wide range of functions, applicable to many, if not most circumstances in modern data processing. These have been enhanced with extensible datatypes (pioneered in the PostgreSQL project) to allow development of a very wide range of applications.
There are also other types of databases which cannot be classified
as relational databases. Most notable is the
object database management system, which
stores language objects natively without using a separate data
definition language and without translating into a separate storage
schema. Unlike relational systems, these object databases store the
relationship between complex data types as part of their storage
model in a way that does not require runtime calculation of related
data using relational algebra execution algorithms.
Database management systems
A database management system (DBMS) consists of
software that organizes the storage of data. A DBMS
controls the creation, maintenance, and use of the database storage
structures of social organizations and of their users. It allows
organizations to place control of organization wide database
development in the hands of Database Administrators (DBAs) and
other specialists. In large systems, a DBMS allows users and other
softwares to store and retrieve data in a structured way.
Database management systems are usually categorized according to
the
database model that they support,
such as the network, relational or object model. The model tends to
determine the
query languages that
are available to access the database. One commonly used query
language for the relational database is
SQL,
although SQL syntax and function can vary from one DBMS to another.
A common query language for the object database is
OQL, although not all vendors of object databases
implement this, majority of them do implement this method. A great
deal of the internal engineering of a DBMS is independent of the
data model, and is concerned with managing factors such as
performance, concurrency, integrity, and recovery from hardware
failures. In these areas there are large differences between the
products.
A relational database management system (RDBMS) implements features
of the relational model. In this context,
Date's "Information Principle" states:
"the entire information content of the database is represented in
one and only one way. Namely as explicit values in column positions
(attributes) and rows in
relation (
tuples).
Therefore, there are no explicit pointers between related tables."
This contrasts with the object database management system (ODBMS),
which does store explicit pointers between related types.
Components of DBMS
According to the wikibooks open-content textbooks, "
Design of Main Memory Database System/Overview of
DBMS", most DBMS implement a relational model. Other less-used
DBMS systems, such as the object DBMS, generally operate in areas
of application-specific data management where performance and
scalability take higher priority than the flexibility of
ad
hoc query capabilities provided via the
relational-algebra execution algorithms
of a relational DBMS.
RDBMS components
- Interface drivers - A user or application
program initiates either schema modification or content
modification. These drivers are built on top of SQL. They provide
methods to prepare statements, execute statements, fetch results,
etc. Examples include DDL, DCL, DML, ODBC, and JDBC. Some vendors
provide language-specific proprietary interfaces. For example MySQL
provides drivers for PHP, Python, etc.
- SQL engine - This component interprets and
executes the SQL query. It comprises three major components
(compiler, optimizer, and execution engine).
- Transaction engine - Transactions are
sequences of operations that read or write database elements, which
are grouped together.
- Relational engine - Relational objects such as
Table, Index, and Referential integrity constraints are implemented
in this component.
- Storage engine -
This component stores and retrieves data records. It also provides
a mechanism to store metadata and control
information such as undo logs, redo logs, lock tables, etc.
ODBMS components
- Language drivers - A user or application
program initiates either schema modification or content
modification via the chosen programming language. The drivers then
provide the mechanism to manage object lifecycle coupling of the
application memory space with the underlying persistent storage.
Examples include C++, Java, .NET, and Ruby.
- Query engine - This component interprets and
executes language-specific query commands in the form of OQL, LINQ,
JDOQL, JPAQL, others. The query engine returns language specific
collections of objects which satisfy a query predicate expressed as
logical operators e.g. >, <,>=, <=, AND,="" OR,=""
NOT,="" GroupBY,="" etc.=""></=,></,>
- Transaction engine - Transactions are
sequences of operations that read or write database elements, which
are grouped together. The transaction engine is concerned with such
things as data isolation and consistency in the driver cache and
data volumes by coordinating with the storage engine.
- Storage engine -
This component stores and retrieves objects in an arbitrarily
complex model. It also provides a mechanism to manage and store
metadata and control information such as undo logs, redo logs, lock
graphs, etc.
Primary tasks of DBMS packages
- Database Development: used to define and
organize the content, relationships, and structure of the data
needed to build a database.
- Database Interrogation: can access the data in
a database for information retrieval and report generation. End
users can selectively retrieve and display information and produce
printed reports and documents.
- Database Maintenance: used to add, delete,
update, correct, and protect the data in a database.
- Application Development: used to develop
prototypes of data entry screens, queries, forms, reports, tables,
and labels for a prototyped application. Or use 4GL or 4th
Generation Language or application generator to develop program
codes.
Types
Operational database
These databases store detailed data needed to support the
operations of an entire organization. They are also called
subject-area databases (SADB), transaction databases, and
production databases. For example:
- customer databases
- personal databases
- inventory databases
Analytical database
These databases store data and information extracted from selected
operational and external databases. They consist of summarized data
and information most needed by an organization's management and
other end-users. Some people refer to analytical databases as
multidimensional databases, management databases, or information
databases.
Data warehouse
A data warehouse stores data from current and previous years — data
extracted from the various operational databases of an
organization. It becomes the central source of data that has been
screened, edited, standardized and integrated so that it can be
used by managers and other end-user professionals throughout an
organization
Distributed database
These are databases of local work-groups and departments at
regional offices, branch offices, manufacturing plants and other
work sites. These databases can include segments of both common
operational and common user databases, as well as data generated
and used only at a user’s own site.
End-user database
These databases consist of a variety of data files developed by
end-users at their workstations. Examples of these are collections
of documents in spreadsheets, word processing and even downloaded
files.
External database
These databases provide access to external, privately-owned data
online — available for a fee to end-users and organizations from
commercial services. Access to a wealth of information from
external database is available for a fee from commercial online
services and with or without charge from many sources in the
Internet.
Hypermedia databases on the web
These are a set of interconnected multimedia pages at a web-site.
They consist of a home page and other hyperlinked pages of
multimedia or mixed media such as text, graphic, photographic
images, video clips, audio etc.
Navigational database
In navigational databases, queries find objects primarily by
following references from other objects. Traditionally navigational
interfaces are procedural, though one could characterize some
modern systems like XPath as being simultaneously navigational and
declarative.
In-memory databases
In-memory databases primarily rely on main memory for computer data
storage. This contrasts with database management systems which
employ a disk-based storage mechanism. Main memory databases are
faster than disk-optimized databases since the internal
optimization algorithms are simpler and execute fewer CPU
instructions. Accessing data in memory provides faster and more
predictable performance than disk. In applications where response
time is critical, such as telecommunications network equipment that
operates emergency systems, main memory databases are often
used.
Document-oriented databases
Document-oriented databases are computer programs designed for
document-oriented applications. These systems may be implemented as
a layer above a relational database or an object database. As
opposed to relational databases, document-based databases do not
store data in tables with uniform sized fields for each record.
Instead, they store each record as a document that has certain
characteristics. Any number of fields of any length can be added to
a document. Fields can also contain multiple pieces of data.
Real-time databases
A real-time database is a processing system designed to handle
workloads whose state may change constantly. This differs from
traditional databases containing persistent data, mostly unaffected
by time. For example, a stock market changes rapidly and
dynamically. Real-time processing means that a transaction is
processed fast enough for the result to come back and be acted on
right away. Real-time databases are useful for accounting, banking,
law, medical records, multi-media, process control, reservation
systems, and scientific data analysis. As computers increase in
power and can store more data, real-time databases become
integrated into society and are employed in many
applications.
Relational Database
The standard of
business
computing , relational databases are the most commonly used
database today . It uses the table to structure information so that
it can be readily and easily searched through.
Models
Post-relational database models
Products offering a more general data model than the relational
model are sometimes classified as
post-relational. The data model in such
products incorporates
relation
but is not constrained by the Information Principle , which
requires the representation of all information by
data values in relation to it.
Some of these extensions to the relational model actually integrate
concepts from technologies that pre-date the
relational model. For example, they allow
representation of a
directed graph
with
trees on the
nodes.
Some products implementing such models do so by extending
relational database systems with non-relational features. Others,
however, have arrived in much the same place by adding relational
features to pre-relational systems. Paradoxically, this allows
products that are historically pre-relational, such as
PICK and
MUMPS,
to make a plausible claim to be post-relational in their current
architecture.
Object database models
In , the
object-oriented paradigm
has been applied to database technology, creating various kinds of
new programming models known as
object
databases. These databases attempt to bring the database world
and the application-programming world closer together, in
particular by ensuring that the database uses the same
type system as the application program. This
aims to avoid the overhead (sometimes referred to as the
impedance
mismatch) of converting information between its
representation in the database (for example as rows in tables) and
its representation in the application program (typically as
objects). At the same time, object databases attempt to introduce
key ideas of object programming, such as
encapsulation and
polymorphism, into the world
of databases.
A variety of these ways have been tried for storing objects in a
database. Some products have approached the problem from the
application-programming side, by making the objects manipulated by
the program
persistent. This also
typically requires the addition of some kind of query language,
since conventional programming languages do not have the ability to
find objects based on their information content. Others have
attacked the problem from the database end, by defining an
object-oriented data model for the database, and defining a
database
programming language
that allows full programming capabilities as well as traditional
query facilities.
Storage structures
Databases may store relational tables/indexes in memory or on hard
disk in one of many forms:
These have various advantages and disadvantages - discussed further
in the articles on each topic. The most commonly used are B+ trees
and ISAM.
Object databases use a range of storage mechanisms. Some use
virtual memory-mapped files to make the native language (
C++,
Java
etc.) objects persistent. This can be highly efficient but it can
make multi-language access more difficult. Others break the objects
down into fixed- and varying-length components that are then
clustered tightly together in fixed sized blocks on disk and
reassembled into the appropriate format either for the client or in
the client address space. Another popular technique involves
storing the objects in tuples (much like a relational database)
which the database server then reassembles for the client.
Other important design choices relate to the clustering of data by
category (such as grouping data by month, or location), creating
pre-computed views known as
materialized views, partitioning data by
range or hash. Memory management and storage topology can be
important design choices for database designers as well. Just as
normalization is used to
reduce storage requirements and improve the
extensibility of the database, conversely
denormalization is often used to reduce join complexity and reduce
execution time for queries.
Indexing
All of these databases can take advantage of
indexing to increase their speed. This
technology has advanced tremendously since its early uses in the
1960s and 1970s. The most common kind of index uses a sorted list
of the contents of some particular table column, with pointers to
the row associated with the value. An index allows a set of table
rows matching some criterion to be quickly located. Typically,
indexes are also stored in the various forms of data-structure
mentioned above (such as
B-trees,
hash, and
linked
lists). Usually, a database designer selects specific
techniques to increase efficiency in the particular case of the
type of index required.
Most relational DBMSs and some object DBMSs have the advantage that
indexes can be created or dropped without changing existing
applications making use of them, The database chooses between many
different strategies based on which one it estimates will run the
fastest. In other words, indexes act transparently to the
application or end-user querying the database; while they affect
performance, any SQL command will run with or without indexes to
compute the result of an
SQL statement. The
RDBMS will produce a
query plan of how to
execute the query: often generated by analyzing the run times of
the different algorithms and select the quickest process. Some of
the key algorithms that deal with
joins
are
nested loop join,
sort-merge join and
hash join. Which of these an RDBMS selects may
depend on whether an index exists, what type it is, and its
cardinality.
An index speeds up access to data, but it has disadvantages as
well. First, every index increases the amount of storage used on
the hard drive which is also necessary for the database file, and
second, the index must be updated each time the data are altered,
and this costs time. (Thus an index saves time in the reading of
data, but it costs time in entering and altering data. It thus
depends on the use to which the data are to be put whether an index
is overall a net plus or minus in the quest for efficiency.)
A special case of an index is a primary index based on a primary
key: a primary index must ensure a unique reference to a record.
Often, for this purpose one simply uses a running index-number (ID
number). Primary indexes play a significant role in relational
databases, and they can speed up access to data considerably.
Transactions and concurrency
In addition to their data model, most practical databases
("transactional databases") attempt to enforce
database transactions. Ideally, the
database software should enforce the
ACID
rules, summarized here:
- Atomicity: Either
all the tasks in a transaction must happen, or none of them. The
transaction must be completed, or else it must be undone (rolled
back).
- Consistency:
Every transaction must preserve the integrity constraints — the
declared consistency rules — of the database. It cannot leave the
data in a contradictory state.
- Isolation: Two
simultaneous transactions cannot interfere with one another.
Intermediate results within a transaction must remain invisible to
other transactions.
- Durability:
Completed transactions cannot be aborted later or their results
discarded. They must persist through (for instance) restarts of the
DBMS after crashes.
In practice, many DBMSs allow the selective relaxation of most of
these rules — for better performance.
Concurrency control ensures that
transactions execute in a safe manner and follow the ACID rules.
The DBMS must be able to ensure that only
serializable,
recoverable
schedules are allowed, and that no actions of committed
transactions are lost while undoing aborted transactions.
Replication
Replication of databases often relates closely to transactions. If
a database can log its individual actions, one can create a
duplicate of the data in real time.DBAs can use the duplicate to
improve performance and/or the availability of the whole database
system.
Common replication concepts include:
- master/slave Replication: All write-requests are performed on
the master and then replicated to the slave(s)
- quorum: The result of Read and Write requests are calculated by
querying a "majority" of replicas
- multimaster: Two or more replicas sync each other via a
transaction identifier
Parallel synchronous replication of databases enables the
replication of transactions on multiple servers simultaneously,
which provides a method for backup and security as well as data
availability. This is commonly referred to as
database
clustering.
Security
Database security denotes the
system, processes, and procedures that protect a database from
unintended activity. Enforcing security is one of the major tasks
of the DBA.
DBMSs usually enforce security through
access control,
auditing, and
encryption:
- Access control ensures and restricts who can connect and what
they can do to the database.
- Auditing logs what action or change has been performed, when
and by whom.
- Encryption: many commercial databases include built-in
encryption mechanisms to encode data natively into tables and to
decipher information "on the fly" when a query comes in. DBAs can
also secure and encrypt connections if required using DSA, MD5,
SSL or legacy encryption
standards.
In the
United
Kingdom
, legislation protecting the public from
unauthorized disclosure of personal information held on databases
falls under the Office of the Information
Commissioner. Organizations based in the United Kingdom
and holding personal data in electronic format (databases for
example) must register with the Data Commissioner.
Locking
Databases handles multiple concurrent operations with
locking. This is how concurrency and
some form of basic integrity is managed within the database system.
Such locks can be applied on a row level, or on other levels like
page (a basic data block), extent (multiple array of pages) or even
an entire table. This helps maintain the integrity of the data by
ensuring that only one process at a time can modify the
same data.
In basic filesystem files or folders, only one lock at a time can
be set , restricting the usage to one process only. Databases, on
the other hand, can set and hold mutiple locks at the same time on
the different levels of the physical data structure. The
database engine locking scheme determines
how to set and maintain locks based on the submitted SQL or
transactions by the users. Generally speaking, any activity on the
database should involve some or extensive locking.
most DBMS systems use shared and exclusive locks.
Exclusive locks mean that no other lock can acquire the current
data object as long as the exclusive lock lasts. DBMSs usually set
exclusive locks when the database needs to change data, as during
an UPDATE or DELETE operation.
Shared locks can take ownership one from the other of the current
data structure. Shared locks are usually used while the database is
reading data (during a SELECT operation).The number, nature of
locks and time the lock holds a data block can have a huge impact
on the database performances. Bad locking can lead to disastrous
performance response (usually the result of poor SQL requests, or
inadequate database physical structure)
The
isolation level of the data server enforces
default locking behavior. Changing the isolation level will affect
how shared or exclusive locks must be set on the data for the
entire database system. Default isolation is generally 1, where
data can not be read while it is modified, forbidding the return of
"ghost data" to end users.
At some point intensive or inappropriate exclusive locking can lead
to a "
deadlock" situation between two
locks, where none of the locks can be released because they try to
acquire resources mutually from each other. The database should
have a fail-safe mechanism which will automatically "sacrifice" one
of the locks, thus releasing the resource. Processes or
transactions involved in the "deadlock" get rolled back.
Databases can also be locked for other reasons, like access
restrictions for given levels of user.Some DBAs also lock databases
for routine maintenance, which prevents changes being made during
the maintenance. See
"Locking tables and databases" (section in some
documentation / explanation from IBM) for more detail.)
However, many modern databases don't lock the database during
routine maintenance. e.g.
"Routine Database Maintenance" for
PostgreSQL.
Applications
Databases function in many applications, spanning virtually the
entire range of
computer software.
Databases have become the preferred method of storage for large
multiuser applications, where coordination between many users is
needed. Even individual users find them convenient, and many
electronic mail programs and personal organizers are based on
standard database technology. Software database drivers are
available for most database platforms so that
application software can use a common
API to retrieve
the information stored in a database. Commonly used database APIs
include
JDBC and
ODBC.
Databases in new media
Within
new media, databases collect items
on which the user can carry out various operations such as viewing,
navigating, create, and searching. Though there are various types
of items within the database, each item has the same significance.
Unlike a narrative or film, the collections are computerized and
therefore may offer a unique experience with each view. This form
of data may present a unique presentation of what the world is
like. Databases can be seen as a symbolic form of the computer
age.
See also
References
Further reading
- Ling Liu and Tamer M. Özsu (Eds.) (2009). " Encyclopedia of Database Systems, 4100
p. 60 illus. ISBN 978-0-387-49616-0. Table of Content
available at
http://refworks.springer.com/mrw/index.php?id=1217
- Beynon-Davies, P. (2004). Database Systems. 3rd Edition.
Palgrave, Houndmills, Basingstoke.
- Connolly, Thomas and Carolyn Begg. Database Systems.
New York: Harlow, 2002.
- Date, C. J. An Introduction to Database Systems,
Eighth Edition, Addison Wesley, 2003.
- Galindo, J.; Urrutia, A.; Piattini, M. Fuzzy Databases:
Modeling, Design and Implementation (FSQL
guide). Idea Group Publishing Hershey, USA, 2006.
- Galindo, J., Ed. Handbook on Fuzzy Information Processing
in Databases. Hershey, PA: Information Science Reference (an
imprint of Idea Group Inc.), 2008.
- Gray, J. and Reuter, A. Transaction Processing: Concepts
and Techniques, 1st edition, Morgan Kaufmann Publishers,
1992.
- Kroenke, David M. Database Processing: Fundamentals,
Design, and Implementation (1997), Prentice-Hall, Inc., pages
130-144.
- Kroenke, David M. and David J. Auer. Database
Concepts. 3rd ed. New York: Prentice, 2007.
- Lightstone, S.; T. Teorey and T. Nadeau, Physical Database
Design: the database professional's guide to exploiting indexes,
views, storage, and more, Morgan Kaufmann Press, 2007. ISBN
0-12369-389-6.
- O'Brien, James. "Management Information Systems". New York
1999
- Shih, J. " Why
Synchronous Parallel Transaction Replication is Hard, But
Inevitable?", white paper, 2007.
- Teorey, T.; Lightstone, S. and Nadeau, T. Database Modeling
& Design: Logical Design, 4th edition, Morgan Kaufmann
Press, 2005. ISBN 0-12-685352-5
- Tukey, John W. Exploratory Data Analysis. Reading, MA:
Addison Wesley, 1977.
- Manovich, Lev.Database as a Symbolic Form, Cambridge:
MIT press, 2001
External links