Mojibake is the occurrence of incorrect, unreadable characters shown when computer software fails to render text correctly according to its associated character encoding.
Causes
Mojibake is often caused when a
character encoding is not correctly
tagged in a document, or when a document is moved to a system with
a different default encoding. Such incorrect display occurs when
writing systems or
character encodings are mistagged or
"foreign" to the user's computer system: if a computer does not
have the software required to process a foreign language's
characters, it will attempt to process them in its default language
encoding, usually resulting in gibberish. Messages transferred
between different encodings of the same language can also have
mojibake problems. Japanese language users, with several
different encodings historically employed, would encounter this
problem relatively often. For example, the intended word "文字化け",
encoded in UTF-8, is incorrectly displayed as "æ–‡å—åŒ–ã‘" in
software that is configured to expect text in the Windows-1252 or
ISO-8859-1 encodings, usually labeled Western.
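The mismatch described above can be reproduced in a few lines. The following is a minimal sketch in Python using the standard codec names `utf-8` and `cp1252`; the helper name `misdecode` is hypothetical. Windows-1252 leaves a few byte values unassigned, so the sketch assumes it is acceptable to substitute U+FFFD for those bytes rather than fail.

```python
def misdecode(text: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Encode text in one encoding, then decode the bytes as if they were another."""
    raw = text.encode(actual)
    # errors="replace" turns bytes the assumed encoding cannot map into U+FFFD.
    return raw.decode(assumed, errors="replace")


if __name__ == "__main__":
    # Text that was written as UTF-8 but is read as Windows-1252.
    print(misdecode("文字化け"))
```

Pure ASCII text passes through unchanged, which is why the damage is usually confined to non-ASCII characters.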
A
web browser may not be able to
distinguish a page coded in
EUC-JP and
another in
Shift-JIS if the coding scheme
is not assigned explicitly using the
HTTP headers sent along with the
documents, or using the
HTML document's
meta tags that are used to substitute for
missing HTTP headers if the server cannot be configured to send the
proper HTTP headers. Heuristics can be applied to guess at the
character set, but these are not always successful.
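The lookup order described above (HTTP header first, then a meta tag, then a heuristic guess) might be sketched as follows. This is an illustration under stated assumptions, not how any particular browser works: the function names `declared_charset` and `guess_charset`, the regular expression, the 2048-byte scan window, and the candidate list are all hypothetical choices.

```python
import re
from email.message import Message

# Looks for charset=... inside a <meta> tag near the start of the document.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def declared_charset(content_type, body):
    """Return the charset named by the HTTP header, else by a meta tag, else None."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    match = META_CHARSET.search(body[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


def guess_charset(body, candidates=("utf-8", "euc_jp", "shift_jis")):
    """Heuristic fallback: return the first candidate that decodes without error."""
    for name in candidates:
        try:
            body.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

Trial decoding of this kind can return a wrong answer when the same bytes happen to be valid in more than one candidate encoding, which is one reason such heuristics are not always successful.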
In the mid-1990s, as this problem became common, several websites
featured mojibake not as a problem to be tackled but simply for
amusement. Words and even sentences were "deciphered" with meanings
made up to deliver funny messages.
Mojibake can also occur between what appear to be the same
encodings. For example, some Microsoft software and Eudora for
Windows purportedly encoded their output using ISO-8859-1 while in
reality using Windows-1252, which contains extra printable
characters in the C1 range. These characters were not displayed
properly in standards-compliant software; this especially affected
software running under other operating systems (e.g. Unix).
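The difference in the C1 range can be seen by decoding the same bytes both ways. This is a small sketch in Python; the sample bytes and the helper name `c1_controls` are chosen for illustration only.

```python
# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252,
# but C1 control codes in ISO-8859-1.
RAW = b"\x93quoted\x94"

as_cp1252 = RAW.decode("cp1252")
as_latin1 = RAW.decode("latin-1")


def c1_controls(text):
    """Return the characters of text that fall in the C1 range U+0080..U+009F."""
    return [ch for ch in text if 0x80 <= ord(ch) <= 0x9F]
```

Decoded as Windows-1252 the string contains printable quotation marks; decoded as ISO-8859-1 it contains two invisible control characters instead.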
Resolutions
Applications using
UTF-8 as a default encoding
may achieve a greater degree of interoperability due to its
widespread use and backwards compatibility with
US-ASCII.
The difficulty of resolving an instance of mojibake varies
depending on the application within which it occurs and the causes
of it. Two of the most common applications in which mojibake may
occur are
web browsers and
word processors. Modern browsers and word
processors often support a wide array of character encodings.
Browsers often allow a user to change their
rendering engine's encoding setting on the
fly, while word processors allow the user to select the appropriate
encoding when opening a file. It may take some
trial and error for users to find the
correct encoding.
The problem gets more complicated when it occurs in an application
that normally does not support a wide range of character encodings,
such as a non-Unicode computer game. In this case, the user must
change the operating system's encoding settings to match those of
the game. However, changing the system-wide encoding settings can
also cause mojibake in pre-existing applications. In Windows XP or
later, a user also has the option to use Microsoft AppLocale, an
application that allows locale settings to be changed per
application. Changing the operating system's encoding settings is
not possible on earlier operating systems such as Windows 98; to
resolve the issue there, a user would have to use third-party font
rendering applications.
Problems in specific languages
Mojibake in English texts generally occurs in punctuation, such as
em dashes (—),
en
dashes (–), and
curly quotes (“, ”),
but rarely in character text, since most encodings agree with
ASCII on the encoding of the
English alphabet. For example, the pound sign "£" will appear as
"Â£" if it was encoded by the sender as UTF-8 but interpreted by
the recipient as CP1252 or ISO 8859-1. If iterated under CP1252,
this can lead to "Ã‚Â£", "Ãƒâ€šÃ‚Â£", etc.
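The iteration described above, and its reversal, can be sketched in Python as follows. The function names `garble` and `repair` are hypothetical, and the sketch assumes the wrong encoding was Windows-1252 (`cp1252`) on every round. Both functions raise an error if a byte or character has no mapping, which can happen because Windows-1252 leaves a few byte values unassigned.

```python
def garble(text: str, times: int = 1) -> str:
    """Encode as UTF-8 and misread as Windows-1252, repeated `times` times."""
    for _ in range(times):
        text = text.encode("utf-8").decode("cp1252")
    return text


def repair(text: str, times: int = 1) -> str:
    """Undo `times` rounds of garble() by reversing each step."""
    for _ in range(times):
        text = text.encode("cp1252").decode("utf-8")
    return text


if __name__ == "__main__":
    for rounds in range(1, 4):
        print(rounds, garble("£", rounds))
```

Each round roughly doubles the length of the damaged sequence, so repeated round trips quickly make short strings unrecognizable, although the damage remains reversible as long as no byte was lost along the way.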
In Japanese, the phenomenon is, as mentioned, called mojibake
(文字化け). It is often encountered by non-Japanese users when
attempting to run software written for the Japanese market.
Users of
Central and
Eastern European languages can also be
affected. Because most computers were not connected to any network
during the mid- to late-1980s, there were different character
encodings for
every language with diacritical
characters.

Figure: Sender's handwritten krakozyabry corrected by a postal employee before delivery.

Figure: Mojibake on a webpage.
In Russian slang, mojibake is humorously called krakozyabry
(кракозябры, meaning "childish scribbles"). During the 1990s,
several different encodings for the Cyrillic alphabet (Unix KOI8-R,
Windows CP-1251, DOS CP 866, standard ISO 8859-5, and several
others) competed. Poorly
configured servers and lack of compatibility made garbled text a
common and frustrating experience. Many e-mail servers stripped the
eighth bit from the characters, as permitted by earlier standards
(which rendered
UTF-8 unreadable, as well as
the non-KOI8 Russian encodings). For this reason many Cyrillic
users resorted to
Volapuk encoding.
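The effect of stripping the eighth bit can be simulated directly. The sketch below is in Python; the helper name `strip_eighth_bit` and the sample word are illustrative assumptions. It relies on a property of KOI8-R's layout that the text above only implies: its Cyrillic letters sit 128 positions above similar-sounding Latin letters, so a stripped message degrades into a case-swapped Latin transliteration rather than noise.

```python
def strip_eighth_bit(text: str, encoding: str) -> str:
    """Encode text, clear the high bit of every byte, and return the 7-bit result."""
    raw = text.encode(encoding)
    stripped = bytes(b & 0x7F for b in raw)
    # Every byte is now below 0x80, so decoding as ASCII cannot fail.
    return stripped.decode("ascii")


if __name__ == "__main__":
    word = "Привет"
    print(strip_eighth_bit(word, "koi8_r"))   # Latin look-alike of the word
    print(strip_eighth_bit(word, "cp1251"))   # unrelated ASCII characters
    print(strip_eighth_bit(word, "utf-8"))    # unrelated ASCII characters
```

With KOI8-R the sample comes out as "pRIWET", which a Russian reader can still decipher; the same word in the other encodings loses any resemblance to the original.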
An even more frustrating problem emerged in the early 2000s, when
the popular e-mail client
Microsoft
Outlook started to replace correctly entered Cyrillic
characters with question marks when replying to or forwarding
messages created in competing encodings.
In Bulgarian, mojibake is often called maymunitsa (маймуница),
meaning "monkey's alphabet". In Serbian, it is called ђубре
(đubre), meaning "trash". In German, Buchstabensalat ("letter
salad") and Krähenfüße ("crow's feet") are common terms for this
phenomenon.
In
Poland
every company selling early DOS
computers created its own encoding, and simply reprogrammed the
EPROMs of the video cards (typically CGA, EGA or Hercules) with the needed glyphs for
Polish — arbitrarily located without reference to where other
computer sellers had placed them. Additionally, users of
then-popular home computers (such as the
Atari
ST) invented their own encodings, incompatible with
international standards (
ISO 8859-2),
vendor standards (IBM
CP852,
Windows CP1250) and locally agreed-upon PC/MS
DOS standards (
Mazovia). The
situation began to improve when, after pressure from academic and
user groups,
ISO 8859-2 succeeded as the
"Internet standard" with limited support of the dominant vendors'
software (today largely replaced by Unicode). With the numerous
problems caused by the variety of encodings, even today some users
tend to refer to Polish diacritical characters as
krzaczki
("bushes").
Commodore brand
8-bit computers used
PETSCII
encoding, particularly notable for
inverting the upper and
lower case compared to standard
ASCII. PETSCII
printers worked fine on other computers of the era, but flipped the
case of all letters.
Among the Nordic languages, mojibake is not uncommon, but is more
of an annoyance than a problem.
Finnish and
Swedish use the letters of the
English alphabet and three more characters:
å, ä and ö, and typically these three are the only ones that become
corrupted. The situation is similar for Norwegian and Danish,
except the three relevant letters are æ, ø and å. In Swedish,
Norwegian and Danish, vowels are rarely repeated, and it is usually
obvious when one character gets corrupted, such as the second
letter in "kÃ¤rlek" (kärlek, "love"). This way, even
though the reader has to guess among å, ä and ö, almost all texts
remain perfectly readable. However, Finnish does have repeating
vowels in words like "HÃ¤Ã¤yÃ¶" (Hääyö), which can sometimes render
text very hard to read.
Icelandic is worse off, with ten possibly confounding characters:
á, ð, é, í, ó, ú, ý, þ, æ and ö.
Another type of mojibake occurs when text is erroneously parsed in
a multi-byte encoding, such as one of the East Asian encodings.
With this kind of mojibake more than one (typically two) characters
are corrupted at once, e.g. "k舐lek" (
kärlek) in Swedish,
where "är" is parsed as "舐". Compared to the above mojibake, this
is harder to read, since letters unrelated to the problematic å, ä
or ö are missing, and is especially problematic for short words
starting with å, ä or ö such as "än" (which becomes "舅"). Since two
letters are combined, the mojibake also seems more random (over 50
variants compared to the normal three, not counting the rarer
capitals). In some rare cases, an entire text string which happens
to include a pattern of particular word lengths, such as the
sentence "Bush hid the facts", may be misinterpreted.
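The two-letters-into-one effect can be reproduced by decoding single-byte text with a double-byte codec. The sketch below is in Python and assumes, for illustration, that the text was stored as ISO-8859-1 and wrongly parsed as Shift-JIS; the helper name `misparse_multibyte` is hypothetical.

```python
def misparse_multibyte(text: str, actual: str = "latin-1",
                       assumed: str = "shift_jis") -> str:
    """Encode single-byte text, then decode it with a multi-byte codec."""
    raw = text.encode(actual)
    # A high byte such as 0xE4 ("ä") is treated as a lead byte and swallows
    # the byte that follows it, so two letters collapse into one character.
    return raw.decode(assumed, errors="replace")


if __name__ == "__main__":
    for word in ("kärlek", "än"):
        damaged = misparse_multibyte(word)
        print(word, "->", damaged, f"({len(word)} letters became {len(damaged)})")
```

Because the accented letter consumes its neighbour, the result is one character shorter than the original, and the missing neighbour has to be guessed along with the accented letter.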
In the languages of most of the former Yugoslavia, the basic Latin
alphabet is extended with the letters š, đ, č, ć, ž, and their
capital counterparts Š, Đ, Č, Ć, Ž. All of these letters are defined
in Latin2 and Windows-1250, while only some (š, Š, ž, Ž, Đ) exist in
the usual OS-default Western encoding. Although even those that
exist in the extended Western set (Windows-1252) are not immune to
errors, the ones that do not are
much more prone to errors. Thus, even nowadays, "šđčćž ŠĐČĆŽ" is
all too often interpreted as "šðèæž ŠÐÈÆŽ", making users wonder
where ð, è, æ, È, Æ are used. When confined to basic ASCII (most
usernames, for example), common replacements are: š→s, đ→dj, č→c,
ć→c, ž→z (capital forms analogously, with Đ→Dj or Đ→DJ depending on
word case). All of these replacements introduce ambiguities, so
reconstructing the original from such a form is usually done
manually if required.
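Both effects described above, the Windows-1250 text shown as Windows-1252 and the lossy ASCII replacements, can be sketched in Python. The names `view_as_western`, `ASCII_MAP`, and `to_ascii` are hypothetical, and treating a word as upper case when `str.isupper()` is true is an assumed reading of "depending on word case".

```python
def view_as_western(text: str) -> str:
    """Show Windows-1250 text the way a Windows-1252 system would display it."""
    return text.encode("cp1250").decode("cp1252")


# Replacements listed in the text; Đ is handled separately below.
ASCII_MAP = {"š": "s", "đ": "dj", "č": "c", "ć": "c", "ž": "z",
             "Š": "S", "Č": "C", "Ć": "C", "Ž": "Z"}


def to_ascii(text: str) -> str:
    """Replace the extra letters with basic ASCII, word by word."""
    words = []
    for word in text.split(" "):
        # Đ becomes "DJ" in an all-capitals word and "Dj" otherwise.
        dj = "DJ" if word.isupper() else "Dj"
        words.append("".join(dj if ch == "Đ" else ASCII_MAP.get(ch, ch)
                             for ch in word))
    return " ".join(words)


if __name__ == "__main__":
    print(view_as_western("šđčćž ŠĐČĆŽ"))
    print(to_ascii("Đorđe"), to_ascii("ĐORĐE"))
```

Since č and ć both map to "c", and "dj" may also be a genuine letter pair, no function can reliably invert `to_ascii`, which matches the observation that reconstruction is usually done manually.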
Hungarian is another affected
language, which uses the 26 basic English characters, plus the
accented forms á, é, í, ó, ú, ö, ü (all present in the Latin-1
character set), plus the 2 characters
ő and
ű, which are not in Latin-1. These 2
characters can be correctly encoded in Latin-2, Windows-1250 and
Unicode. Before Unicode became common in e-mail clients, e-mails
containing Hungarian text often had the letters ő and ű corrupted,
sometimes to the point of unrecognizability. It is common to
respond to an e-mail rendered unreadable by character mangling
(referred to as "betűszemét", meaning "garbage lettering") with the
phrase "Árvíztűrő tükörfúrógép", a nonsense phrase (literally
"Flood-resistant mirror-drilling machine") containing all accented
characters used in Hungarian.
Esperanto was once commonly encoded in
Latin-3 to display the accented characters:
ĉ, ĝ, ĥ, ĵ, ŝ, and ŭ. In some browsers these characters would be
incorrectly displayed as Latin-1 characters with the same encoding:
æ, ø, ¶, ¼, þ, and ý.
Another affected language is
Arabic
(see below).
A similar effect can occur in Indic text, even if the character set
used is properly recognized by the application. This is because, in
many Indic scripts, the rules by which individual letter symbols
combine to create symbols for syllables may not be properly
understood by a computer missing the appropriate software, even if
the glyphs for the individual letter forms are available.
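The point that correct decoding is not sufficient can be illustrated by listing the code points behind a single displayed syllable. The sketch below is in Python and assumes, purely as an example, the Devanagari syllable "वि" (roughly "vi" or "wi"); the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(text: str):
    """List each code point in text with its Unicode character name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# One displayed syllable, stored as a consonant followed by a vowel sign.
WI = "\u0935\u093f"

if __name__ == "__main__":
    for code, name in describe(WI):
        print(code, name)
```

The vowel sign is stored after the consonant but is drawn to its left, so software that has the glyphs but lacks the script's shaping rules will lay the pieces out incorrectly even though every character was decoded correctly.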
A particularly notable example of this is the Wikipedia logo, which
attempts to show the character analogous to "w" or "wi" (the first
letter or syllable of "Wikipedia") on each of many puzzle pieces.
Instead, the puzzle piece meant to bear the Sanskrit character for
"wi" actually shows a somewhat nonsensical scribble with a dangling
line at the end, easily recognizable as mojibake generated by a
computer not configured to display Indic text. That this occurs in
the venerable front-page logo and has never been corrected over
many years has been seen as humorously emblematic of Wikipedia's
alleged accuracy and reliability problems.
Example
| Output encoding | Setting in browser | Result |
| Arabic example: | | منتدى عرب شير - مشاهدة الملف الشخصي |
| Windows-1251 | ISO 8859-1 | Ù…Ù†ØªØ¯Ù‰ Ø¹Ø±Ø¨ Ø´ÙŠØ± - Ù…Ø´Ø§Ù‡Ø¯Ø© Ø§Ù„Ù…Ù„Ù Ø§Ù„Ø´Ø®ØµÙŠ |
| | KOI8-R | ы┘ы├ь╙ь╞ы┴ ь╧ь╠ь╗ ь╢ы┼ь╠ - ы┘ь╢ь╖ы┤ь╞ь╘ ь╖ы└ы┘ы└ы│ ь╖ы└ь╢ь╝ь╣ы┼ |
| | ISO 8859-5 | ййиЊиЏй иЙиБиЈ иДйиБ - йиДиЇйиЏиЉ иЇйййй иЇйиДиЎиЕй |
| | CP 866 | ┘Е┘Ж╪к╪п┘Й ╪╣╪▒╪и ╪┤┘К╪▒ - ┘Е╪┤╪з┘З╪п╪й ╪з┘Д┘Е┘Д┘Б ╪з┘Д╪┤╪о╪╡┘К |
| Windows-1256 | ISO 8859-6 | ظ…ظ†طھط¯ظ‰ ط¹ط±ط¨ ط´ظٹط± - ظ…ط´ط§ظ‡ط¯ط© ط§ظ„ظ…ظ„ظپ ط§ظ„ط´ط®طµظٹ |
| | CP 852 | ┘ů┘ćě¬ě»┘ë ě╣ě▒ěĘ ě┤┘Őě▒ - ┘ůě┤ěž┘çě»ěę ěž┘ä┘ů┘ä┘ü ěž┘äě┤ě«ěÁ┘Ő |
| | ISO 8859-2 | ŮŮتد٠ؚعب Ř´ŮŘą - ŮشاŮŘŻŘŠ اŮŮŮŮ اŮش؎ؾ٠|
| | ASMO 708 | عàع╢ظ╔ظ»عë ظ╥ظ▒ظ╟ ظ═عèظ▒ - عàظ═ظ╞عçظ»ظ╚ ظ╞ع╡عàع╡ع┤ ظ╞ع╡ظ═ظ«ظ╬عè |
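Results of this kind can be regenerated by decoding one byte sequence under several browser settings. The following Python sketch assumes the page bytes are UTF-8, which is what the byte patterns in the KOI8-R and CP 852 results suggest; the names `render_as`, `mojibake_table`, and `BROWSER_SETTINGS` are hypothetical, and only encodings with a codec in the Python standard library are listed.

```python
SAMPLE = "منتدى عرب شير - مشاهدة الملف الشخصي"

# Python codec names for some of the browser settings in the table.
BROWSER_SETTINGS = ["latin-1", "koi8_r", "iso8859_5", "cp866",
                    "cp1256", "cp852", "iso8859_2"]


def render_as(text: str, browser_encoding: str,
              output_encoding: str = "utf-8") -> str:
    """Encode text as the page would, then decode it as the browser would."""
    raw = text.encode(output_encoding)
    # Bytes the browser encoding cannot map are shown as U+FFFD.
    return raw.decode(browser_encoding, errors="replace")


def mojibake_table(text: str) -> dict:
    """Map each browser setting to the text it would display."""
    return {name: render_as(text, name) for name in BROWSER_SETTINGS}


if __name__ == "__main__":
    for setting, shown in mojibake_table(SAMPLE).items():
        print(f"{setting:10} {shown}")
```

Some of the decoded strings contain invisible control characters, so the printed output may look shorter than the strings actually are.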