If you missed them, here are the previous posts in this series:

  1. Introduction
  2. Paragraph direction and flow

The main approach of the Unicode bidi algorithm involves assigning a directionality code to each character in the text, and then figuring out how they interact. Some directionality codes indicate left-to-right flow, others right-to-left. As well as direction, some codes are considered "strong", meaning that they are not influenced by neighboring text, while others are "weak", in that their directionality may change if they are next to characters with a different direction. Still other characters are assigned neutral codes, so their behavior is totally controlled by surrounding text or the paragraph direction.

Here are some of the most important bidi codes:

  • L = left-to-right: used for the basic characters of left-to-right scripts (strong)
  • R = right-to-left: used for the basic characters of Hebrew and other non-Arabic RTL scripts (e.g., N'Ko, Mende Kikakui) (strong)
  • AL = right-to-left Arabic: basic characters and punctuation of Arabic and similar scripts (strong)
  • EN = European Number: the digits used with the Latin script (1, 2, 3...) and other digits that behave similarly (weak)
  • AN = Arabic Number: the Arabic-Indic digits ٠ through ٩ (U+0660 to U+0669) (weak)
  • ES = European Number Separator: plus and minus (weak)
  • ET = European Number Terminator: degree sign, percent sign, number sign (#), currency symbols (weak)
  • WS = Whitespace: spaces (neutral)
  • ON = Other Neutrals: displayable characters that don't have any directionality associated with them - e.g., parentheses, underscore, equals sign (neutral)
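
If you want to check these codes for yourself, Python's standard `unicodedata` module reports a character's bidi code through `unicodedata.bidirectional()`. Here is a minimal sketch; the `SAMPLES` table and the `bidi_class` helper are names made up for this illustration, and the results come from whatever Unicode version ships with your Python.

```python
import unicodedata

# Illustrative sample characters, one for each code in the list above.
SAMPLES = {
    "A": "L",        # LATIN CAPITAL LETTER A
    "\u05D0": "R",   # HEBREW LETTER ALEF
    "\u0627": "AL",  # ARABIC LETTER ALEF
    "1": "EN",       # DIGIT ONE
    "\u0661": "AN",  # ARABIC-INDIC DIGIT ONE
    "+": "ES",       # PLUS SIGN
    "%": "ET",       # PERCENT SIGN
    " ": "WS",       # SPACE
    "(": "ON",       # LEFT PARENTHESIS
}

def bidi_class(ch: str) -> str:
    """Return the bidi code that Python's Unicode database assigns to ch."""
    return unicodedata.bidirectional(ch)

for ch, expected in SAMPLES.items():
    # Print the code point, its official name, and the code actually reported.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class(ch)} (listed as {expected})")
```

This only covers the first step, assigning a code to each character; it says nothing yet about how the codes interact.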

A note on terminology: you might be used to referring to the numbers that are used with the Latin script (1, 2, 3) as "Arabic numerals", but this term is ambiguous, since Arabic script has its own set of digits. So the bidi algorithm uses the term "European numbers" to refer to these characters. The digits that are used with the Arabic script are often called "Arabic-Indic digits", but this is also somewhat ambiguous, as there is a set of "eastern" (or "extended") Arabic-Indic digits, used with languages such as Persian and Urdu, that behave somewhat differently!
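
To make the terminology concrete, here is a small sketch that looks up the bidi code of the digit "one" in each of the three sets, again using Python's `unicodedata` module. The `DIGIT_ONES` table and `digit_class` helper are illustrative names only. As far as I can tell from the Unicode data bundled with current Python versions, the eastern digits are classed as European numbers for bidi purposes, which is one way in which they behave differently.

```python
import unicodedata

# The digit "one" from each set, written as escapes to avoid display confusion.
DIGIT_ONES = {
    "DIGIT ONE": "1",
    "ARABIC-INDIC DIGIT ONE": "\u0661",
    "EXTENDED ARABIC-INDIC DIGIT ONE": "\u06F1",
}

def digit_class(ch: str) -> str:
    """Return the bidi code for a single digit character."""
    return unicodedata.bidirectional(ch)

for name, ch in DIGIT_ONES.items():
    # All three have the numeric value 1; only the bidi code differs.
    print(f"{name:32} U+{ord(ch):04X} value={unicodedata.digit(ch)} bidi={digit_class(ch)}")
```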

Arabic and Hebrew letters get separate codes, as do European and Arabic numbers, because the two kinds of numbers interact differently with certain symbols. For instance, in Hebrew text using European numbers, the sequence "123-456+78" is considered a single expression and is treated as a left-to-right unit. In Arabic script using Arabic-Indic digits, each number in the expression is treated as a separate unit, so the numbers themselves are laid out right-to-left: the same expression appears with 78 on the left and 123 on the right (78+456-123, shown here in European digits for readability). In addition, a terminator such as a degree sign or percent sign is displayed to the right of a number in Hebrew, but to the left in Arabic.
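
The difference can be sketched in a few lines of Python. This is a deliberately simplified illustration, loosely modeled on rules W4 to W6 of the bidi algorithm as I understand them, and it handles only the ES and ET codes. It ignores everything else the real algorithm does (for example, European numbers that follow Arabic letters being reclassified as Arabic numbers). The function name `resolve_separators` is hypothetical.

```python
import unicodedata

def resolve_separators(classes):
    """Simplified sketch: decide what ES and ET codes turn into."""
    out = list(classes)
    # A single separator between two European numbers joins the expression.
    for i in range(1, len(out) - 1):
        if out[i] == "ES" and out[i - 1] == "EN" and out[i + 1] == "EN":
            out[i] = "EN"
    # A run of terminators next to a European number is absorbed into it.
    i = 0
    while i < len(out):
        if out[i] != "ET":
            i += 1
            continue
        j = i
        while j < len(out) and out[j] == "ET":
            j += 1
        before = out[i - 1] if i > 0 else None
        after = out[j] if j < len(out) else None
        if before == "EN" or after == "EN":
            out[i:j] = ["EN"] * (j - i)
        i = j
    # Any separator or terminator left over is treated as a neutral.
    return ["ON" if c in ("ES", "ET") else c for c in out]

def classes_of(text):
    """Look up the bidi code of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

european = "123-456+78"
# The same expression in Arabic-Indic digits, written as escapes.
arabic = "\u0661\u0662\u0663-\u0664\u0665\u0666+\u0667\u0668"

print(resolve_separators(classes_of(european)))
print(resolve_separators(classes_of(arabic)))
print(resolve_separators(classes_of("25%")))
print(resolve_separators(classes_of("\u0662\u0665%")))
```

With European digits, the separators and the percent sign all end up as EN, so the whole expression stays together as one unit. With Arabic-Indic digits they fall back to ON, which leaves their placement to be decided by the surrounding text.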

After the directionality codes are assigned, the meat of the bidi algorithm runs. This consists of a whole series of rules that consider, combine, and adjust the directionality of each character until the final flow of the text is determined.