Posted by Sharon Correll on 2013-11-07 10:54:00
Welcome to the first in a series of posts on the Unicode Bidirectional Algorithm. Often called the "bidi" algorithm, it describes how software should process text that contains both left-to-right and right-to-left sequences of characters. The algorithm is described in excruciating detail in the Unicode Standard Annex #9, but for the less-than-intrepid among you, these posts offer a gentle introduction.
Bidi processing is needed in two possible situations:
- A Semitic script like Hebrew or Arabic that is primarily written from right to left, but with numbers (usually) written from left to right.
- A combination of left-to-right and right-to-left scripts, such as Turkish containing an Arabic quote, or Hebrew that includes German phrases.
Let's examine the first situation above. Many Semitic scripts, while written from right to left, have numbers written from left to right. This means that the order of the characters in memory does not match the order in which they are written. The bidi algorithm includes specifications for reordering the numeric characters, as shown in the Hebrew sentence below.
In the graphic above, the data is shown vertically, indicating that "logically" there is no direction; it is simply a sequence of characters. Notice the difference between a straightforward right-to-left rendering and the correct rendering for Hebrew, which reverses the digits of each number back into left-to-right order.
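To make the idea concrete, here is a minimal Python sketch of that reordering. It is only a toy, not the actual bidi algorithm: it assumes a purely right-to-left line in which the only left-to-right runs are sequences of digits. The function name `visual_order_rtl` is made up for this illustration, and uppercase Latin letters stand in for Hebrew letters so that your editor or terminal does not reorder the example on its own.

```python
def visual_order_rtl(logical: str) -> str:
    """Toy reordering for a right-to-left line.

    Returns the characters in visual order, leftmost first.
    Assumption: digits are the only left-to-right content.
    """
    # Step 1: a straightforward right-to-left rendering simply
    # reverses the whole logical sequence.
    reversed_chars = list(reversed(logical))

    # Step 2: flip each run of digits back so that numbers
    # read from left to right again.
    result = []
    i = 0
    while i < len(reversed_chars):
        if reversed_chars[i].isdigit():
            j = i
            while j < len(reversed_chars) and reversed_chars[j].isdigit():
                j += 1
            result.extend(reversed(reversed_chars[i:j]))
            i = j
        else:
            result.append(reversed_chars[i])
            i += 1
    return "".join(result)


# Uppercase letters play the role of Hebrew letters here.
print(visual_order_rtl("ABC 123 DEF"))  # FED 123 CBA
```

The letters end up mirrored, as you would expect for right-to-left text, while "123" keeps its left-to-right order.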
Because of this, one of the main complexities of the bidi algorithm involves handling numerical data: numbers, arithmetic expressions, ranges, dates, etc. To further complicate things, Hebrew script, which uses European numbers, behaves differently from Arabic, which has its own traditional set of digits.
In addition to Hebrew and Arabic, bidirectional scripts include Thaana and modern Syriac using Arabic-Indic numerals. On the other hand, there are right-to-left scripts that are not bidirectional, such as N'Ko, Mende, and Tifinagh. For these scripts, numbers are simply written from right to left as the characters are.
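If you would like to see how these distinctions show up in the Unicode character data, Python's standard `unicodedata` module can report the bidirectional class of any character. The sketch below is just a quick inspection aid; the sample characters and their labels are my own choices for illustration, and the results depend on the Unicode version bundled with your Python.

```python
import unicodedata

# A few sample characters, chosen for illustration.
samples = {
    "Hebrew letter alef": "\u05D0",
    "European digit 1": "1",
    "Arabic letter alef": "\u0627",
    "Arabic-Indic digit 1": "\u0661",
    "Thaana letter haa": "\u0780",
    "N'Ko letter a": "\u07CA",
    "N'Ko digit 1": "\u07C1",
}

# unicodedata.bidirectional() returns the character's Bidi_Class,
# e.g. "R" (right-to-left), "AL" (Arabic letter),
# "EN" (European number), "AN" (Arabic number).
bidi_classes = {
    name: unicodedata.bidirectional(ch) for name, ch in samples.items()
}

for name, bidi_class in bidi_classes.items():
    print(f"{name}: {bidi_class}")
```

European digits and Arabic-Indic digits fall into separate number classes (EN and AN), which is part of why Hebrew and Arabic text behave differently. The N'Ko digit, by contrast, has the same right-to-left class as the N'Ko letter, so no special reordering of numbers is needed.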
In the second situation, involving more than one language or script, the range of text that flows in the opposite direction may involve several words or even multiple sentences.
It is even possible to have several layers of embedded directionality, such as a German article that includes a Hebrew quote which in turn includes English phrases. The bidi algorithm is defined to handle many levels of embedding (up to 125, in fact, as of Unicode 6.3), although in real life you would rarely see more than two or perhaps three.
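One way to picture these layers is as a number attached to each character: even levels flow left to right, odd levels flow right to left. The Python sketch below assumes the nesting is marked with the explicit embedding controls RLE (U+202B), LRE (U+202A), and PDF (U+202C). It is a deliberately simplified illustration: it ignores overrides, isolates, and the maximum-depth limit, and the function name `embedding_levels` is hypothetical.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embedding_levels(text: str, base_level: int = 0) -> list:
    """Return (character, level) pairs for the visible characters.

    Simplified sketch: handles only LRE, RLE, and PDF, and does
    not enforce any maximum depth.
    """
    stack = [base_level]
    levels = []
    for ch in text:
        current = stack[-1]
        if ch == RLE:
            # Next odd level above the current one.
            stack.append(current + 1 if current % 2 == 0 else current + 2)
        elif ch == LRE:
            # Next even level above the current one.
            stack.append(current + 2 if current % 2 == 0 else current + 1)
        elif ch == PDF:
            # Close the innermost embedding; never pop the base level.
            if len(stack) > 1:
                stack.pop()
        else:
            levels.append((ch, current))
    return levels


# "g" = German text, "H" = Hebrew quote, "e" = English phrase.
sample = "g" + RLE + "H" + LRE + "e" + PDF + "H" + PDF + "g"
print(embedding_levels(sample))
# [('g', 0), ('H', 1), ('e', 2), ('H', 1), ('g', 0)]
```

The German text sits at level 0, the Hebrew quote at level 1, and the English phrase inside it at level 2, which matches the "two or perhaps three" layers you are likely to meet in practice.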
In the next post, we'll discuss how paragraph flow affects bidirectional rendering.