In web archiving, an archive site is a website that stores
information about web pages from the past, or copies of the pages
themselves, for anyone to view.
Common techniques
Two common techniques are (1) using a web crawler and (2) accepting
user submissions.
- A web crawler frees the service from depending on an active
community for its content, so the database can grow larger and
faster, which in turn tends to attract a larger community. However,
website developers and system administrators can block crawlers
from accessing certain web pages by using a robots.txt file.
- User submissions make a service harder to start, because
submission rates may be low at first, but this approach can yield
some of the best results. A crawler can only collect what people
have already posted to the Internet. Material may remain unposted
because its owners assume no one would be interested, lack a
suitable medium, and so on. If they see that someone wants their
information, they may be more willing to submit it.
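The robots.txt restriction mentioned above can be illustrated with a minimal sketch. It uses Python's standard `urllib.robotparser` module to decide whether a crawler may fetch a page. The robots.txt rules, the user agent name `ExampleArchiveBot`, and the `example.org` URLs are assumptions made up for illustration. They do not describe how any particular archive site's crawler works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleArchiveBot is a made-up crawler name.
    for url in (
        "https://example.org/public/page.html",
        "https://example.org/private/page.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleArchiveBot", url))
```

A real crawler would usually download each site's robots.txt before requesting pages (for instance with `RobotFileParser.set_url` and `read`) and skip any URL for which the check returns False.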
Examples
Google Groups
On February 12, 2001, Google acquired the Usenet discussion group
archives from Deja.com and turned them into its Google Groups
service [181773]. The service lets users search old discussions
with Google's search technology, while still allowing them to post
to the newsgroups.
Internet Archive
The Internet Archive is building a compendium of websites and
digital media. Since 1996, it has used a web crawler to build up
its database. It is one of the best-known archive sites.
TextFiles.com
TextFiles.com is a large library of old text files maintained by
Jason Scott Sadofsky. Its mission is to archive the old documents
that circulated on the bulletin board systems (BBSes) of his youth
and to document other people's experiences on BBSes.
PANDORA Archive
PANDORA (Pandora Archive), founded in 1996 by the National Library
of Australia, stands for Preserving and Accessing Networked
Documentary Resources of Australia, which encapsulates its mission.
It provides a long-term catalog of selected online publications and
websites that are written by Australians or that cover Australian
topics. The catalog is built with PANDAS (PANDORA Digital Archiving
System).
See also