rsync is a software application for Unix systems which synchronizes
files and directories from one location to another while minimizing
data transfer using delta encoding when appropriate. An important
feature of rsync not found in most similar programs/protocols is that
the mirror takes place with only one transmission in each direction.
rsync can copy or display directory contents and copy files,
optionally using compression and recursion.
In daemon mode, rsync listens on the default TCP port of 873, serving
files in the native rsync protocol or via a remote shell such as RSH
or SSH. In the latter case, the rsync client executable must be
installed on both the local and the remote host.
Released under the GNU General Public License, rsync is free
software.
Algorithm
The rsync utility uses an algorithm (invented by the Australian
computer programmer Andrew Tridgell) for efficiently transmitting a
structure (such as a file) across a communications link when the
receiving computer already has a different version of the same
structure.
The recipient splits its copy of the file into fixed-size
non-overlapping chunks of size S and computes two checksums for each
chunk: the MD4 hash and a weaker 'rolling checksum'. It sends these
checksums to the sender. Version 30 of the protocol (released with
rsync version 3.0.0) uses MD5 hashes rather than MD4.
The sender computes the rolling checksum for every chunk of size S in
its own version of the file, even overlapping chunks.
This can be calculated efficiently because of a special property of
the rolling checksum: if the rolling checksum of
bytes n through n+S-1 is R, the rolling checksum of
bytes n+1 through n+S can be computed from R, byte n, and byte n+S
without having to examine the intervening bytes. Thus, if one had
already calculated the rolling checksum of bytes 1–25, one could
calculate the rolling checksum of bytes 2–26 solely from the
previous checksum, and from bytes 1 and 26.
The rolling checksum used in rsync is based on Mark Adler's adler-32
checksum, which is used in zlib, and which itself is based on
Fletcher's checksum.
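The rolling property described above can be illustrated with a short
Python sketch. It follows the general Adler-style construction (a
running sum plus a position-weighted sum, each kept modulo 2^16), but
the function names, the modulus constant M and the way the two halves
are combined are assumptions chosen for illustration and are not taken
from the rsync source code.

```python
M = 1 << 16  # modulus for each 16-bit half (an assumption of this sketch)


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute both halves (a, b) of the checksum from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte is weighted n, the last byte is weighted 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, size: int) -> tuple[int, int]:
    """Slide the window one byte to the right.

    Only the byte leaving the window and the byte entering it are
    needed; the intervening bytes are never examined.
    """
    a_new = (a - out_byte + in_byte) % M
    b_new = (b - size * out_byte + a_new) % M
    return a_new, b_new


def combine(a: int, b: int) -> int:
    """Pack the two halves into a single 32-bit value."""
    return a | (b << 16)
```

Starting from weak_checksum(data[0:S]), calling roll once per byte
yields the checksum of every overlapping window of size S at a cost
that does not depend on S.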
The sender then compares its rolling checksums with the set sent by
the recipient to determine if any matches exist. If they do, it
verifies the match by computing the hash for the matching
block and by comparing it with the hash for that block
sent by the recipient.
The sender then sends the recipient those parts of its file that
did not match the recipient's blocks, along with information on
where to merge these blocks into the recipient's version. This
makes the copies identical. However, there is a small probability
that differences between chunks in the sender and recipient are not
detected and thus remain uncorrected. This requires a simultaneous
hash collision in MD5 and the rolling checksum. It is possible to
generate MD5 collisions, and the rolling checksum is not
cryptographically strong, but the chance of this occurring by
accident is nevertheless extremely remote. With 128 bits from MD5
plus 32 from the rolling checksum, and assuming maximum entropy in
these bits, the probability of a hash collision with this combined
checksum is 2^-(128+32) = 2^-160. The actual probability is a few
times higher, since good checksums approach maximum output entropy
without quite reaching it.
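To give a feel for the size of that number, the idealized figure can
be evaluated directly. This is only the arithmetic from the paragraph
above under its maximum-entropy assumption; the constant names are
chosen for this sketch.

```python
from fractions import Fraction

STRONG_BITS = 128  # length of an MD5 digest in bits
WEAK_BITS = 32     # length of the rolling checksum in bits

# Idealized probability that one differing chunk matches on both
# checksums at once, assuming every output bit is uniformly random.
p_collision = Fraction(1, 2 ** (STRONG_BITS + WEAK_BITS))

print(float(p_collision))  # on the order of 7e-49
```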
If the sender's and recipient's versions of the file have many
sections in common, the utility needs to transfer relatively little
data to synchronize the files.
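The exchange described in this section can be sketched end to end in
Python. This is a simplified illustration and not the rsync
implementation: the function names and the "copy"/"data" instruction
format are invented for the sketch, only whole blocks of the
recipient's file are indexed, and everything runs in one process
instead of over a network link. MD5 is used as the strong hash, as in
protocol version 30.

```python
import hashlib

M = 1 << 16  # modulus for the weak checksum halves (sketch assumption)


def weak(block: bytes) -> tuple[int, int]:
    """Adler-style weak checksum of one block, computed from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def make_signatures(old: bytes, size: int) -> dict:
    """Recipient: index every full block by weak checksum."""
    table = {}
    for index in range(len(old) // size):
        block = old[index * size:(index + 1) * size]
        strong = hashlib.md5(block).digest()
        table.setdefault(weak(block), []).append((strong, index))
    return table


def make_delta(new: bytes, table: dict, size: int) -> list:
    """Sender: emit ("copy", block_index) or ("data", literal_bytes)."""
    delta, literal, pos = [], bytearray(), 0
    ab = weak(new[:size]) if len(new) >= size else None
    while pos + size <= len(new):
        match = None
        if ab in table:
            # Weak match found: confirm it with the strong hash.
            strong = hashlib.md5(new[pos:pos + size]).digest()
            match = next((i for h, i in table[ab] if h == strong), None)
        if match is not None:
            if literal:
                delta.append(("data", bytes(literal)))
                literal.clear()
            delta.append(("copy", match))
            pos += size
            if pos + size <= len(new):
                ab = weak(new[pos:pos + size])
        else:
            literal.append(new[pos])
            if pos + size < len(new):
                # Roll the window forward by one byte.
                a, b = ab
                a = (a - new[pos] + new[pos + size]) % M
                b = (b - size * new[pos] + a) % M
                ab = (a, b)
            pos += 1
    literal.extend(new[pos:])
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list, size: int) -> bytes:
    """Recipient: rebuild the sender's file from its own blocks."""
    out = bytearray()
    for kind, value in delta:
        out += old[value * size:(value + 1) * size] if kind == "copy" else value
    return bytes(out)
```

When the two versions share most of their content, the delta consists
mainly of short "copy" instructions, so only the changed bytes travel
as literal data.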
While the rsync algorithm forms the heart of the rsync application
and essentially optimizes transfers between two computers over
TCP/IP, the rsync application supports other key features that aid
significantly in data transfers or backup. They include compression
and decompression of data block by block using zlib at the sending
and receiving ends, respectively, and support for protocols such as
ssh that enable encrypted transmission of the compressed, efficient
differential data produced by the rsync algorithm. Instead of ssh,
stunnel can also be used to create an encrypted tunnel to secure the
data transmitted.
Finally, rsync is capable of limiting the bandwidth consumed during
a transfer, a useful feature that few other standard file transfer
protocols offer.
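As an illustration of how these features are selected on the command
line, the following Python sketch assembles an rsync invocation. The
helper name and its parameters are hypothetical; the options it emits
(-a for archive mode, -z for compression, -e ssh to choose the remote
shell and --bwlimit to cap bandwidth) are standard rsync options. The
sketch only builds the argument list and does not run anything.

```python
def build_rsync_command(src, dest, *, compress=True, use_ssh=True, bwlimit=None):
    """Return an argument list for rsync (hypothetical helper)."""
    cmd = ["rsync", "-a"]          # archive mode: recurse, keep metadata
    if compress:
        cmd.append("-z")           # compress data during the transfer
    if use_ssh:
        cmd += ["-e", "ssh"]       # use ssh as the remote shell
    if bwlimit is not None:
        cmd.append(f"--bwlimit={bwlimit}")  # limit transfer bandwidth
    cmd += [src, dest]
    return cmd


# Example with placeholder host and paths:
command = build_rsync_command(
    "/home/alice/", "alice@backup.example.org:/srv/backup/alice/", bwlimit=500
)
# The list could then be passed to subprocess.run(command, check=True).
```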
Uses
rsync was originally written as a replacement for rcp and scp. As
such, it has a similar
syntax to its parent programs. Like its predecessors, it still
requires a source and a destination to be specified, one of which
may be remote. Because of the flexibility, speed and script-ability
of rsync, its popularity with system administrators has resulted in
rsync being ported to Windows, Mac and Linux operating
systems.
Possible uses:
rsync [OPTION] … SRC [SRC]… [USER@]HOST:DEST
rsync [OPTION] … [USER@]HOST:SRC [DEST]
One of the earliest applications of rsync was to implement
mirroring or backup for multiple Unix clients onto a central Unix
server using rsync/ssh and standard Unix accounts.
With a scheduling utility such as cron, one can even schedule
automated encrypted rsync-based mirroring between multiple host
computers and a central server.
An alternative to scripting rsync is GUI software such as
BackupAssist, which uses rsync to perform automatic, scheduled
backups of Windows-based servers to other rsync servers.
Variations
A utility called rdiff uses the rsync algorithm to generate a delta
file with the difference from file A to file B (like the utility
diff, but in a different delta format). The delta file can then be
applied to file A, turning it into file B (similar to the patch
utility).
Unlike diff, the process of creating a delta file has two steps:
first a signature file is created from file A, and then this
(relatively small) signature and file B are used to create the delta
file. Also unlike diff, rdiff works well with binary files.
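The steps above map onto three rdiff subcommands: signature, delta and
patch. The Python sketch below only assembles the corresponding
argument lists; the helper name and the default file names are
hypothetical and chosen for illustration.

```python
def rdiff_commands(file_a, file_b, signature="a.sig",
                   delta="a-to-b.delta", output="b.rebuilt"):
    """Return the rdiff invocations for a signature/delta/patch cycle."""
    return [
        # Step 1: create the (relatively small) signature of file A.
        ["rdiff", "signature", file_a, signature],
        # Step 2: combine the signature with file B to produce the delta.
        ["rdiff", "delta", signature, file_b, delta],
        # Applying the delta to file A reconstructs file B.
        ["rdiff", "patch", file_a, delta, output],
    ]


steps = rdiff_commands("old.bin", "new.bin")
```

Only the signature has to reach the machine holding file B, and only
the delta has to travel back, so neither side needs both files at
once.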
Using rdiff, a utility called rdiff-backup has been created, capable
of maintaining a backup mirror of a file or directory either locally
or remotely over the network, on another server. rdiff-backup stores
incremental rdiff deltas with the backup, with which it is possible
to recreate any backup point.
duplicity is a variation on
rdiff-backup that allows for backups without cooperation from the
storage server, as with simple storage services like
Amazon S3. It works by generating the hashes for
each block in advance, encrypting them, and storing them on the
server, then retrieving them when doing an incremental backup. The
rest of the data is also stored encrypted for security
purposes.
rsyncrypto is a utility to encrypt files in an
rsync-friendly fashion. The rsyncrypto algorithm ensures that two
almost identical files, such as the same file before and after a
change, when encrypted using rsyncrypto and the same key, will
produce almost identical encrypted files. This allows for the
low-overhead data transfer achieved by rsync while providing
encryption for secure transfer and storage of sensitive data in a
remote location.
History
Andrew Tridgell and Paul Mackerras wrote the original rsync. Tridgell
discusses the design, implementation and performance of rsync in
chapters 3 through 5 of his Australian National University PhD
thesis.
rsync was first announced on 19 June 1996.
rsync 3.0 was released on 1 March 2008.
References
- http://rsync.samba.org/ftp/rsync/src/rsync-3.0.0-NEWS
- See the README file
- Andrew Tridgell: Efficient Algorithms for Sorting and
  Synchronization, February 1999. Retrieved 29 Sept. 2009.