Direct link to this page: http://scriptsource.org/glossary
This glossary covers a wide range of terms used for discussing writing systems. If you have suggestions or corrections please contact us.
a form of writing in which the vowels are omitted or optional, such as Hebrew and Arabic scripts.
a unit of information used for the organization, control or representation of textual data. Abstract characters may be non-graphic characters used in textual information systems to control the organization of textual data (e.g. U+FFF9 INTERLINEAR ANNOTATION ANCHOR), or to control the presentation of textual data (e.g. U+200D ZERO WIDTH JOINER).
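The two non-graphic characters mentioned above can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch; the helper name `describe` is an assumption made for the example.

```python
import unicodedata

def describe(codepoint):
    """Return the Unicode name and general category of a codepoint."""
    ch = chr(codepoint)
    return unicodedata.name(ch), unicodedata.category(ch)

# Both characters are format controls (category "Cf"): they have no
# visible shape of their own but affect how surrounding text is handled.
for cp in (0xFFF9, 0x200D):
    name, category = describe(cp)
    print(f"U+{cp:04X} {name} ({category})")
```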
|abstract character repertoire||
a collection of abstract characters compiled for the purposes of encoding. See also charset.
a form of writing in which the consonants and vowels in a syllable are treated as a cluster or unit; typical of scripts from South Asia.
the amount by which the current display position is adjusted vertically after rendering a given glyph. This number is generally only meaningful for vertical writing systems, and is usually zero within fonts used for horizontal writing systems.
the amount by which the current display position is adjusted horizontally after rendering a given glyph.
the phonological process by which a simple stop, such as [t], is converted to an affricate, such as [tʃ]. For example, in some dialects of British English the word "tuna" is pronounced [tʃu:na], the first consonant having been affricated.
a variant form of a grapheme. For example, in Arabic, each letter has an initial, medial and final form depending on its position within a word. These are considered allographs of a single grapheme. The term may also be used in cases where a group of letters is used as an alternative for a single grapheme, as in the English word cough, where the sequence g+h represents the same sound as the single grapheme f.
a variant of a phoneme. It is not distinctive, that is, substituting one allophone for another of the same phoneme will not change the meaning of the word, although it may sound unnatural. Broadly speaking, the test to determine whether two sounds are allophones of the same phoneme, or separate phonemes, is to see whether they are in complementary distribution, that is, whether each is found only in environments where the other is not. For example, in English [pʰ] only occurs syllable-initially when followed by a stressed vowel, but [p] occurs in all other environments. This is illustrated by the words pin [pʰin] and spin [spin]. Therefore, [pʰ] and [p] are seen to be in complementary distribution, and are therefore allophones of the phoneme /p/. This test is not foolproof; some sounds are in complementary distribution but are not considered to be allophones. For example, in English /h/ only occurs syllable-initially and /ʔ/ only occurs syllable-finally. However they are phonetically so different that they are still considered to be separate phonemes. One allophone can be assigned to more than one phoneme, as illustrated in some North American English dialects, where the phonemes /t/ and /d/ can both be realized as the allophone [ɾ].
a segmental writing system having symbols for individual sounds, rather than for syllables or morphemes. In a true alphabet, consonants and vowels are written as independent letters, in contrast to an abugida or an abjad. In a perfectly phonemic alphabet, phonemes and letters would be predictable in both directions; that is, the sound of a word could be predicted from its spelling and vice-versa. A phonetic alphabet is also predictable in this way, however it uses separate letters for separate allophones, whereas a phonemic alphabet may describe allophones of the same phoneme using a single letter.
see attachment point.
any vocal organ used to form a speech sound. For example, the articulators used to form the sound [f] are the lower lip (a labial articulator) and the upper teeth (a dental articulator), so [f] is described as a 'labiodental' sound. Active articulators are those organs which can move (for example, the tongue) and passive articulators are those which are fixed (for example, the roof of the mouth).
a standard that defines the 7-bit numbers (codepoints) needed for most of the U.S. English writing system. The initials stand for American Standard Code for Information Interchange. Also specified as ISO 646-IRV.
a point defined relative to a glyph outline such that if two attachment points on two glyphs are positioned on top of each other, the glyphs are positioned correctly relative to each other. For example, a base character may have an attachment point used to position a diacritic, which would also have a corresponding attachment point. Also called anchor point.
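The positioning rule described above reduces to simple vector arithmetic. The following is a minimal sketch, assuming each glyph stores its attachment point as an (x, y) offset from its own origin; the function name and the sample coordinates are hypothetical.

```python
def attach(base_origin, base_anchor, mark_anchor):
    """Return the origin at which to draw a mark glyph so that its
    attachment point lands exactly on the base glyph's attachment point.

    All arguments are (x, y) tuples. The anchors are relative to their
    own glyph's origin; base_origin is in line coordinates.
    """
    bx, by = base_origin
    return (bx + base_anchor[0] - mark_anchor[0],
            by + base_anchor[1] - mark_anchor[1])

# Hypothetical values: a base glyph drawn at x=100 with a top anchor at
# (250, 700), and an acute accent whose anchor sits at (0, 650).
print(attach((100, 0), (250, 700), (0, 650)))
```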
the vertical point of origin for all the glyphs rendered on a single line. Roman scripts have a baseline on which the glyphs appear to 'sit', with occasional descenders below. Many Indic scripts have a hanging baseline, in which the bulk of the letters are placed below the baseline, with occasional ascenders above the line. Some scripts, such as Chinese, use a centered baseline, where the glyphs are all positioned with their centers on the baseline.
|Basic Multilingual Plane (BMP)||
the portion of Unicode's codespace in which all of the most commonly used characters are encoded, corresponding to codepoints U+0000 to U+FFFF, abbreviated as BMP. Also known as Plane 0. See also Supplementary Planes.
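Because each plane spans 0x10000 codepoints, the plane number of a codepoint is simply its value shifted right by 16 bits. A short sketch (the function names are illustrative only):

```python
def plane(codepoint):
    """Return the Unicode plane number (0 to 16) of a codepoint."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return codepoint >> 16

def in_bmp(codepoint):
    """True if the codepoint lies in the Basic Multilingual Plane."""
    return plane(codepoint) == 0

print(in_bmp(0x0041))    # LATIN CAPITAL LETTER A
print(plane(0x1F600))    # an emoji in a supplementary plane
```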
describes a script with two sets of symbols that correspond to each phoneme, most often upper- and lower-case. See also unicameral. Examples of bicameral scripts include Roman (or Latin), Greek, and Cyrillic.
the characteristic of some writing systems to contain ranges of text that are written left-to-right as well as ranges that are written right-to-left. Specifically, in Arabic and Hebrew scripts, most text is written right-to-left, but numbers are written left-to-right. This can also be used to refer to text containing runs in multiple writing systems, some RTL and some LTR.
see byte order mark.
a way of writing in which successive lines of text alternate between left-to-right and right-to-left directionality. (See also blog entry on boustrophedon)
a diacritic mark (˘) shaped like a concave-up half-circle and positioned centrally above a letter. In phonetic transcription it indicates a short vowel. It is also sometimes used in other writing systems to represent language-specific sounds (usually vowels) which are not represented by other, single letters in the script. For example, Romanian uses the character Ă/ă to represent the sound [ə]. Contrast with macron and caron.
|byte order mark (BOM)||
the Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE when used as the first character in a UTF-16 or UTF-32 plain text file to indicate the byte serialization order, i.e. whether the least significant byte comes first (little-endian) or the most significant byte comes first (big-endian). Byte order is not an issue for UTF-8, though the byte order mark is sometimes added to the beginning of UTF-8 encoded files as an encoding signature that applications can look for to detect that the file is encoded in UTF-8.
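As a rough illustration, a byte order mark can be detected by comparing the first bytes of a file against the BOM constants in Python's standard `codecs` module. The helper name `sniff_bom` is an assumption for this sketch. The UTF-32 marks are checked first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM; a UTF-16 file whose first real character is U+0000 would still be misread, so this is a heuristic rather than a guarantee.

```python
import codecs

# Order matters: the longer UTF-32 marks must be tested before UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom("\ufeffhello".encode("utf-16-be")))
print(sniff_bom(b"hello"))
```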
a diacritic mark (ˇ), also called an inverted circumflex or haček, shaped like a pointed breve or superscripted 'v' and positioned centrally above a letter. For some Central European languages, the caron combined with certain letters (lower-case ť, ď, ľ, and upper-case Ľ) is reduced to a small stroke resembling but not identical to an apostrophe. It is used in a number of Baltic, Slavic, and Finno-Lappic languages to indicate that a sound has been palatalized, iotatized (mixed with the approximant /j/), or articulated with the tongue near or touching the back of the alveolar ridge. It is also used in some African writing systems and in Chinese pinyin romanization to mark tone.
|cascading style sheets (CSS)||
one of two stylesheet languages used in Web-based protocols (the other is XSL). CSS is mainly used for rendering HTML, but can also be used for rendering XML. It is much less complex than XSL, i.e., it can only be used when the structure of the source document is already very close to what is desired in the final form.
|character encoding form||
a system for representing the codepoints associated with a particular coded character set in terms of code values of a particular datatype or size. For many situations, this is a trivial mapping: codepoints are represented by bytes with the same integer value as the codepoint. Some encoding forms may represent codepoints in terms of 16- or 32-bit values, though, and some 8-bit encoding forms may be able to represent a codespace that has more than 256 codepoints by using multiple-byte sequences. Most encoding forms are designed specifically for use in connection with a particular coded character set; e.g. UTF-8 is used specifically for encoded representation of the Universal Character Set defined by Unicode and ISO/IEC 10646. Some encoding forms may be designed for use with multiple repertoires, however. For example, the ISO 2022 encoding form supports an open collection of coded character sets and specifies changes between character sets in a data stream using escape sequences.
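To make the idea concrete, the sketch below lists the code units that represent a single codepoint in the three Unicode encoding forms, using only Python's built-in codecs. The function name `code_units` is illustrative. Big-endian codecs are used purely so the bytes can be regrouped into 16- and 32-bit integers; the code unit values themselves do not depend on byte order.

```python
def code_units(ch):
    """Return (utf8, utf16, utf32) lists of integer code units for ch."""
    utf8 = list(ch.encode("utf-8"))               # 8-bit units
    b16 = ch.encode("utf-16-be")
    utf16 = [int.from_bytes(b16[i:i + 2], "big")  # 16-bit units
             for i in range(0, len(b16), 2)]
    b32 = ch.encode("utf-32-be")
    utf32 = [int.from_bytes(b32[i:i + 4], "big")  # 32-bit units
             for i in range(0, len(b32), 4)]
    return utf8, utf16, utf32

# One codepoint inside the BMP and one outside it.
for ch in ("\u00e1", "\U0001F600"):
    u8, u16, u32 = code_units(ch)
    print(f"U+{ord(ch):04X}",
          [f"{u:02X}" for u in u8],
          [f"{u:04X}" for u in u16],
          [f"{u:08X}" for u in u32])
```

A codepoint outside the BMP needs four 8-bit units in UTF-8 and two 16-bit units (a surrogate pair) in UTF-16, but always exactly one 32-bit unit in UTF-32.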
|character encoding scheme||
a character encoding form with a specific byte order serialization (relevant mainly for 16- or 32-bit encoding forms).
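The difference between an encoding form and an encoding scheme shows up only at the byte level. In this small sketch, the same 16-bit code unit is serialized in the two possible byte orders using Python's built-in codecs (the helper name `serialize` is illustrative):

```python
def serialize(ch, scheme):
    """Return the bytes of ch under a named encoding scheme, as hex."""
    return ch.encode(scheme).hex()

ch = "\u4e2d"  # a CJK ideograph whose UTF-16 code unit is 0x4E2D
print(serialize(ch, "utf-16-be"))  # most significant byte first
print(serialize(ch, "utf-16-le"))  # least significant byte first
```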
|character set encoding|
an identifier used to specify a set of characters. Used particularly in Microsoft Windows and TrueType fonts, and in HTML and other Internet or Web protocols to refer to identifiers for particular subsets of the Universal Character Set.
|CJKV (Chinese, Japanese, Korean and Vietnamese)|
The Common Locale Data Repository. An extensive repository of locale data, where a locale is a language, spoken in a particular country, written in a particular script. The CLDR is designed to provide key building blocks for software to support the world's languages, and is hosted by the Unicode Consortium.
character-glyph map: the table within a font containing a mapping of codepoints (characters) to glyph ID numbers. In a Unicode-based font the codepoints are Unicode values; in other fonts they correspond to other encodings.
|coded character set|
(1) synonym for coded character set.
(2) synonym for character set encoding; i.e. in some contexts, codepage is used to refer to a specification of a character repertoire and an encoding form for representing that repertoire.
a numeric value used as an encoded representation of some abstract character within a computer or information system. Codepoints are integer values used to represent particular characters within a particular encoding.
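Since a codepoint only has meaning within a particular encoding, the same numeric value can stand for different characters. The sketch below decodes one byte value under two legacy 8-bit encodings that ship with Python; the helper name `interpret` is an assumption for the example.

```python
def interpret(value, encoding):
    """Decode a single 8-bit codepoint under the given encoding."""
    return bytes([value]).decode(encoding)

# The value 0xE9 is a Latin letter in ISO 8859-1 (Latin-1) but a
# Cyrillic letter in Windows codepage 1251.
print(interpret(0xE9, "latin-1"))
print(interpret(0xE9, "cp1251"))
```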
in writing, the distribution of text into sense lines, so that a new clause starts on a new line.
the relation between two or more variants of a given entity (sound, morpheme, etc.) which appear in mutually exclusive contexts. x and y are said to be in complementary distribution if x is used where y is not used, y is used where x is not used, and between the two environments where x and y are used, every potential environment for that element is covered. In English, the plural suffix is pronounced [s] after non-sibilant voiceless consonants as in trucks [trʌks], [z] after voiced non-sibilants and vowels, as in trees [tɹi:z], and [əz] after sibilant consonants as in buses [bʌsəz]. Therefore the three variants of the plural suffix are said to be in complementary distribution. Sounds which are in complementary distribution are known as allophones; morphemes which are in complementary distribution are known as allomorphs. The counterparts to complementary distribution are contrastive distribution and free variation.
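The English plural example can be written as a function in which every possible final sound falls into exactly one of three branches, which is what makes the distribution complementary. This is a simplified sketch: the sound inventories below are deliberately incomplete and the names are chosen for illustration.

```python
# Simplified, incomplete inventories of English sounds (IPA).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS_NON_SIBILANTS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(final_sound):
    """Pick the plural suffix variant from the stem's final sound."""
    if final_sound in SIBILANTS:
        return "əz"   # buses
    if final_sound in VOICELESS_NON_SIBILANTS:
        return "s"    # trucks
    return "z"        # trees: voiced non-sibilants and vowels

for word, final in (("truck", "k"), ("tree", "i:"), ("bus", "s")):
    print(word, "+", plural_allomorph(final))
```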
a script characterized by one or more of the following: a very large set of characters, right-to-left or vertical rendering, bidirectionality, contextual glyph selection (shaping), use of ligatures, complex glyph positioning, glyph reordering, and splitting characters into multiple glyphs.
a ligature, in particular, a ligature representing a consonant cluster in an Indic script.
in phonetics, a speech sound which is produced without complete closure of the vocal tract. That is, any sound other than a stop, or any sound which can be articulated continuously.
the relation between two or more variants of a given entity (sound, morpheme etc) which distinguish between units. For example, a pair of phonemes such as
a fully-functioning language which has developed as a result of interaction between two (or more) parent languages. Often, a creole develops from a pidgin if the pidgin is used for long enough for a sophisticated grammar and vocabulary to evolve, and if the pidgin acquires native speakers (if children learn it as their first language).
a key in a particular keyboard layout that does not generate a character, but rather changes the character generated by a following keystroke. Dead keys are commonly used to enter accented forms of letters in writing systems based on Roman script.
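Dead-key handling can be modeled as a two-state machine: either no dead key is pending, or one is waiting to modify the next keystroke. The sketch below is illustrative only; the table contents and the fallback behavior (emitting both characters when no combination is defined) are assumptions, since real keyboard layouts differ on this point.

```python
# Hypothetical dead-key table: dead key -> {following key: result}.
DEAD_KEYS = {
    "´": {"a": "á", "e": "é"},
    "`": {"a": "à", "e": "è"},
}

def process_keys(keys):
    """Turn a sequence of keystrokes into the text it would produce."""
    output = []
    pending = None  # the dead key waiting for its following keystroke
    for key in keys:
        if pending is not None:
            # Assumed fallback: emit both characters if no combination.
            output.append(DEAD_KEYS[pending].get(key, pending + key))
            pending = None
        elif key in DEAD_KEYS:
            pending = key  # generates nothing by itself
        else:
            output.append(key)
    if pending is not None:
        output.append(pending)
    return "".join(output)

print(process_keys(["c", "a", "f", "´", "e"]))
```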
see semantic encoding.
with regard to writing systems, a writing system which does not represent all the distinctive sounds of the language it represents.
in semantics, a class of words that indicates, specifies or limits a noun, such as the definite or indefinite article, the genitive (possessive) marker, or cardinal numbers.
a written symbol which is structurally dependent upon another symbol; that is, a symbol that does not occur independently, but always occurs with and is visually positioned in relation to another character, usually above or below. Diacritics are also sometimes referred to as accents. For example, acute, grave, circumflex, etc.
a diacritic mark (¨), also called tréma, commonly placed over the second of two adjacent vowels to indicate that they are to be pronounced as separate sounds rather than as a diphthong, as in the English word naïve. It can also be used to indicate that an otherwise unpronounced vowel is to be pronounced, as in the English name Brontë or the French word cigüe. In Welsh orthography, it is often written on the first of two adjacent vowels to indicate that the first vowel bears stress. The same mark when used over a single vowel in Germanic writing is called an umlaut and indicates a change in vowel quality.
a multigraph composed of two components.
in phonetics, a complex speech sound occupying one syllable, which begins with one vowel and ends with another. For example [eɪ] in British (RP) pronunciation of the word lane. See also monophthong.
also contrastive. An element which makes a distinction between units. In phonology, a process or a pair of sounds, the alternation of which changes the meaning of a word. See also phoneme, minimal pair. For example, voicing is distinctive in most non-tonal languages, as illustrated by the difference between English fan and van, or German Kern and gern.
a collection of information. This includes the common sense of the word, i.e. an organization of primarily textual information that can be produced by a word processing or data processing application. It goes beyond this, however, to include structured information held within an XML file. Each XML file is considered to contain one document, whatever the structure and type of that information.
|Document Type Definition (DTD)||
a markup declaration used by SGML and XML that contains the formal specifications, or grammar, of an SGML or XML document. One use of the DTD is to run a validation process over an XML file, which indicates if it matches the DTD, or if not, provides a listing of each line at which the file fails some part of the required structure.
the square grid which is the basis for the design of all glyphs within a given font; so called because it historically corresponded to the size of the letter M. When rendering, the requested point size specifies the size of the font's em square, to which all glyphs are scaled.
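The scaling step can be sketched as follows, assuming glyph measurements are stored in font design units with a fixed number of units per em (the parameter names and the sample value of 2048 units per em are assumptions for the example; one point is taken as 1/72 inch).

```python
def font_units_to_pixels(value, units_per_em, point_size, dpi=72):
    """Scale a measurement in font design units to device pixels.

    The em square is mapped to the requested point size, so a value of
    units_per_em design units becomes exactly one em on the device.
    """
    pixels_per_em = point_size * dpi / 72  # one point = 1/72 inch
    return value * pixels_per_em / units_per_em

# A glyph half an em wide, in a font with 2048 units per em,
# rendered at 12 pt on a 96 dpi display.
print(font_units_to_pixels(1024, 2048, 12, dpi=96))
```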
(1) synonym for a character encoding form.
(2) synonym for a character set encoding. This usage is common, especially in cases in which the distinction between a coded character set and a character encoding form is not important (i.e. 8-bit, single-byte implementations). Someone might think of an encoding as simply a mapping between byte sequences and the abstract characters they represent, though this model is not adequate to describe some implementations, particularly CJKV standards, or Unicode and ISO/IEC 10646.
|Extensible Markup Language (XML)||
a standard for marking up data so as to clearly indicate its structure, generally in a way that indicates the meaning of different parts of it rather than how they will be displayed. See http://www.w3.org/XML/ for details.
|Extensible Stylesheet Language (XSL)||
a language for expressing stylesheets. It consists of two parts: XSL transformations (XSLT) and an XML vocabulary for specifying formatting semantics. See http://www.w3.org/Style/XSL for full details.
|Extensible Stylesheet Language Transformations (XSLT)|
|featural writing system||
a writing system in which phonetic features, rather than phones (sounds), are represented. For example, there might be a symbol to represent the feature bilabial (a sound produced with both lips), a symbol to represent the feature voiced, and a symbol to represent the feature stop. These symbols would be combined to represent the sound [b]. The closest functioning writing system to this is the Korean Hangul, in which many of the strokes making up the symbols represent place or manner of articulation. Some writing systems used for representing signed languages also contain symbols which stand for particular features of signs. In this case, the symbol often visually resembles the feature it represents, such as direction of movement.
the relation between entities which have similar distributions but which are not distinctive, that is, they are interchangeable without changing the meaning of the word or sounding unnatural. It can apply to sounds, for example, in many southern dialects of British English, inter-vocalic [ʔ] and [t] are in free variation in words such as butter [bʌʔə] / [bʌtə]. It can also apply to entire words, for example the past tense of dream in English can be pronounced either [dri:md] or [drɛmpt] with no change in meaning. Free variation can also apply to writing systems, with regard to variants of symbols. Mongolian uses largely predictable variant forms of many letters; these are encoded in Unicode in such a way that the rendering system can select the correct form. However, some letter forms are unpredictable and in free variation; in this case a free variation selector must be appended manually by the user to indicate to the rendering system which form is required.
in phonetics, consonant lengthening, usually by about a time-and-a-half of the length of a 'short' consonant. Geminated fricatives, trills, nasals and approximants are simply prolonged. In geminated stops, the 'hold' is prolonged. In some languages, such as Japanese, Hungarian, Arabic, Italian and Finnish, gemination is distinctive, but in most it is not. In languages where it is distinctive, it is usually restricted to certain consonants. English contains very few words in which gemination affects the meaning; among these are unnamed vs. unaimed or, in some dialects, sixths /siks:/ vs. six /siks/ (source: John Lawler, University of Michigan). In some languages, consonant length and vowel length depend on each other. For example in Swedish and Italian a short vowel must be followed by a long consonant (geminate), whereas a long vowel must be followed by a short consonant.
a shape that is the visual representation of a character. It is a graphic object stored within a font. Glyphs are objects that are recognizably related to particular characters and which are dependent on particular design (e.g. the letter g set in three different typefaces yields three distinct glyphs). Glyphs may or may not correspond to characters in a one-to-one manner. For example, a single character may correspond to multiple glyphs that have complementary distributions based upon context (e.g. final and non-final sigma in Greek), or several characters may correspond to a single glyph known as a ligature (e.g. conjuncts in Devanagari script). (For more information on glyphs and their relationship to characters, see ISO/IEC TR 15285.)
a character or sequence of characters that functions as a distinct unit within an orthography. A grapheme may be a single character, a multigraph, or a diacritic, but in all cases graphemes are defined in relation to the particular orthography. Most graphemes represent a single phoneme, but some represent a sequence of phonemes. For instance, the character sequence ‹ch› is often used to represent the phoneme /tʃ/ in English, while the single letter ‹x› usually represents the phoneme sequence /ks/. In a highly phonemic writing system, there is a close correspondence between graphemes and phonemes. English (written with Latin script) is an example of a writing system that is not highly phonemic, and therefore the mappings between graphemes and phonemes are more complex. Graphemes are often written enclosed in angle brackets (‹›).
a package developed by SIL to provide 'smart rendering' for complex writing systems in an extensible way. It is programmable using a language called Graphite Description Language (GDL). Because it is extensible, it can be used to provide rendering for minority languages not supported by Uniscribe.
homographs which, although spelled the same way, are pronounced differently and have different meanings. For example, in English 'wind' (noun, as in weather) and 'wind' (verb, to coil something).
one of multiple words having the same spelling but different meanings. They may be pronounced differently (for example in English 'tear: rip' and 'tear: secreted when crying'), in which case they are also heteronyms, or they may be pronounced the same (for example in American English 'tire: cause to be fatigued' and 'tire: wheel of a car'), in which case they are also homophones.
one of multiple words having the same pronunciation but different meanings. They may be spelled differently (for example in English 'write' and 'right'), in which case they are called heterographs, or the same (for example in English 'bark: on a tree' and 'bark: of a dog'), in which case they are also homographs.
see input method editor.
any mechanism used to enter textual data, such as keyboards, speech recognition or handwriting recognition. The most common form of input method is the keyboard. The term "input method" is intended to include all forms of keyboard handling, including but not limited to input methods that are available for Chinese and other very-large-character-set languages and that are commonly known as input method editors (IMEs). An IME is taken to be a specific type of the more general class of input methods.
|input method editor (IME)||
a special form of keyboard input method that makes use of additional windows for character editing or selection in order to facilitate keyboard entry of writing systems with very large character sets.
a process for producing software that can easily be adapted for use in (almost) any cultural environment; i.e. a methodology for producing software that can be script-enabled and is localizable. Sometimes abbreviated as 'I18N'.
to adjust the display position whilst rendering in order to visually improve the spacing between two glyphs. For instance, kerning might be used on the word WAVE to reduce the illusion of white space between the diagonal strokes of the W, A, and V.
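In a simple left-to-right layout, kerning amounts to adding a pair-specific adjustment to the display position before each glyph is placed. The sketch below assumes per-glyph advance widths and a kerning table keyed by glyph pairs; all the numbers and names are hypothetical.

```python
def layout_with_kerning(text, advances, kerning):
    """Return (glyph, x) positions for a single horizontal line.

    advances maps each glyph to its advance width; kerning maps a
    (left, right) pair to an adjustment, negative to tighten.
    """
    x = 0
    previous = None
    positions = []
    for glyph in text:
        if previous is not None:
            x += kerning.get((previous, glyph), 0)
        positions.append((glyph, x))
        x += advances[glyph]  # move on by the glyph's advance width
        previous = glyph
    return positions

# Hypothetical metrics in font units.
advances = {"W": 900, "A": 650, "V": 650, "E": 600}
kerning = {("W", "A"): -80, ("A", "V"): -90}
print(layout_with_kerning("WAVE", advances, kerning))
```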
an input method program which changes and rearranges incoming characters to allow easy ways of typing data in writing systems that would otherwise be difficult or inconvenient to type. See www.tavultesoft.com/keyman.
the use of the lips as a secondary articulator while the rest of the oral cavity produces a different sound. Generally - but not always - this means that a sound is modified by simultaneous rounding of the lips. The term is normally used to refer to consonants; the process can also be applied to vowels, but these are more commonly referred to as rounded rather than labialized. The process is most commonly applied to velar consonants (that is, those produced with the back of the tongue against the soft palate, such as [k] or [x]). For example, in the English word cool [ku:l], the [k] is often pronounced with the lips rounded, whereas in keep [ki:p] it is not. Labialization is extremely widespread across the world's languages and, in some, it is distinctive.
in the Microsoft Win32 API, a 16-bit integer used to identify a language or locale. A LANGID is composed of a 10-bit primary language identifier together with a 6-bit sub-language identifier (the latter being used to indicate regional distinctions for locales that use the same language).
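The bit layout described above places the primary language identifier in the low 10 bits and the sub-language identifier in the high 6 bits. A short sketch of packing and unpacking such a value in Python (the function names are illustrative and are not the Win32 macro names):

```python
def split_langid(langid):
    """Split a 16-bit LANGID into (primary, sublanguage) identifiers."""
    primary = langid & 0x3FF       # low 10 bits
    sublanguage = langid >> 10     # high 6 bits
    return primary, sublanguage

def make_langid(primary, sublanguage):
    """Combine primary and sublanguage identifiers into a LANGID."""
    return (sublanguage << 10) | primary

# 0x0409 is the identifier commonly used for English (United States).
print([hex(part) for part in split_langid(0x0409)])
```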
a constant value within some system used for metadata identification of the language in which information is expressed. May be numeric or character based, depending on the system.
see Roman script.
the white space at the left edge of a glyph's visual representation, or more specifically, the distance between the current horizontal display position and the left edge of the glyph's bounding box. A positive left side-bearing indicates white space between the glyph and the previous one; a negative left side-bearing indicates overlap or overhang between them.
a collection of parameters that affect how information is expressed or presented within a particular group of users, generally distinguished from one another on the basis of language or location (usually country). Locale settings affect things such as number formats, calendrical systems and date and time formats, as well as language and writing system.
the extent to which the design and implementation of a software product allows potential for localization of the software.
the process of adapting software for use by users of different languages or in different geographic regions. For purposes of this document, localization has to do with the language and script of users, and is distinct from script enabling, which has to do with the script in which language data is written. The localization process may include such modifications as translating user-interface text, translating help files and documentation, changing icons, modifying the visual design of dialog boxes, etc. Sometimes abbreviated 'L10N'.
a way of storing characters. Roughly, the order in which characters are read or pronounced, as opposed to the order in which they appear on the page (visual order). In many cases, these are the same. However, in some left-to-right Indic scripts, certain vowels are written to the left of a consonant but pronounced after it. In these cases, the vowel character is stored as the second character in the sequence, although visually it is the first. Similarly, in bidirectional text, some portions are read from left to right and some are read from right to left, so there is a discrepancy between the order in which they appear on the page and the order in which they are read. One of the characteristics of Unicode is that characters are stored in logical order.
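The Indic case can be checked directly. In the Devanagari syllable "ki", the vowel sign is drawn to the left of the consonant, yet the stored sequence puts the consonant first. The helper below (its name is an assumption for this sketch) lists characters in their stored order using the standard `unicodedata` module.

```python
import unicodedata

def stored_order(text):
    """List the characters of text in the order they are stored."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Devanagari "ki": consonant KA followed by VOWEL SIGN I in memory,
# although the vowel sign is rendered to the left of the consonant.
for codepoint, name in stored_order("\u0915\u093F"):
    print(codepoint, name)
```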
also called a logogram or ideograph. A written symbol representing a whole word. Technically, this is distinct from an ideogram, which represents a concept independently of words, although the two are often used interchangeably.
|logographic writing system||
also known as an ideographic writing system. A writing system in which each symbol represents a complete word or morpheme. The symbols do not indicate the word's pronunciation, only its meaning. Historically, Sumerian cuneiform and Egyptian hieroglyphics were logographic, but today Chinese is the only known writing system in the world that remains logographic. See also logosyllabary.
a writing system in which each sign is used primarily to represent words or morphemes, with some subsidiary usage to represent syllables. Most natural logosyllabaries employ the rebus principle to extend the character set so that syllables as well as morphemes can be represented. Logosyllabaries may also include determinatives to mark semantic categories which would otherwise be ambiguous. The extent to which syllabic sounds are represented varies from one writing system to another. In instances where a relatively large number of symbols represent syllabic sounds, a logosyllabary may evolve into an abugida or an abjad as the syllabic use overtakes the logographic use.
a diacritic mark (ˉ) shaped like a short horizontal line and placed centrally above or below a symbol. Above a vowel it normally indicates that the vowel is long. Some writing systems use macron above a vowel to represent tone. Transliterations of Biblical Hebrew use macron below certain consonants to represent sounds which cannot be written as a single Latin letter.
(lit. "mothers of reading"). The use of certain consonant letters, primarily in Semitic abjads, to indicate vowel sounds.
a phonological change in which the order of segments, particularly successive sounds, in a word is reversed. For example, the English word 'ask' was pronounced [æks] between the 5th and 12th centuries, and some dialects have reverted back to this pronunciation in modern times.
a pair of words distinguished by only one phoneme, for example in German /kern/ 'centre' and /gern/ 'like, with pleasure'.
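The definition can be expressed as a simple check on two phoneme sequences. This is a simplified sketch: it assumes each word is already given as a list of phonemes, that the two words differ in meaning, and that the difference is a substitution at a single position. The function name is illustrative.

```python
def is_minimal_pair(first, second):
    """True if two phoneme sequences differ in exactly one position."""
    if len(first) != len(second):
        return False
    differences = sum(a != b for a, b in zip(first, second))
    return differences == 1

# German Kern and gern, transcribed as in the definition above.
print(is_minimal_pair(["k", "e", "r", "n"], ["g", "e", "r", "n"]))
```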
a keyboard layout based on the characters appearing on the keytops of the keyboard. See also positional keyboard.
a vowel sound which does not change in quality as it is articulated. (Contrast with diphthong.) It can be short, as in English bed [bɛd], or long, as in English bead [bi:d]. A single short monophthong is the shortest syllable in any language. The process by which monophthongs change to diphthongs or vice versa is an important factor in language change. Diphthongization in the 15th or 16th century changed the long German monophthong [iː] to [aɪ], as in Eis 'ice', and long [uː] to [aʊ] as in Haus 'house'. A characteristic of Southern American English is the monophthongization of certain diphthongs such as [aɪ] to long [a:] in words such as kite.
a unit of rhythmic measurement based on syllable weight, which is distinctive in some languages. Japanese is one of the most well-documented of these languages. Short (or light) syllables are monomoraic, consisting of one mora. Long (or heavy) syllables are bimoraic, consisting of two morae. Some languages contain superheavy syllables, for example Hindi, in which a long vowel can be followed by a geminate consonant. These syllables are said to be trimoraic. The first consonant of a syllable does not represent any morae, as it does not constitute a syllable in itself. Syllable-final consonants can either form the final part of a bi- or trimoraic syllable, as is the case in Goidelic Irish, or they can represent a mora in themselves, as is the case in Japanese. Although there is a relation between syllables and morae, they are not necessarily interchangeable. For example, the Japanese word for 'photograph', [sjasin], consists of 2 syllables: sja + sin, but 3 morae: sja + si + n. (source: Jouji Miwa at Mora and Syllable)
see script enabling.
see script enabling.
an encoding implementation for some particular language that is designed to enable input to and rendering from that encoding using more than one writing system. When such an implementation is used, the different writing systems are normally based on different scripts.
a combination of two or more written symbols or orthographic characters (e.g. letters) that are used together within an orthography to represent a single sound. (Combinations consisting of two characters are also known as digraphs.)
a script using a set of characters other than those used by the ancient Romans. Non-Roman scripts include relatively simple ones such as Cyrillic, Georgian, and Vai, and complex scripts such as Arabic, Tamil, and Khmer.
transformation of data to a normal form. For historical reasons, the Unicode standard allows some characters to have more than one encoded representation. For example, á may be represented as a single codepoint, U+00E1 LATIN SMALL LETTER A WITH ACUTE, or two codepoints, U+0061 LATIN SMALL LETTER A and U+0301 COMBINING ACUTE ACCENT. A normalization scheme is used to standardize the codepoints so that every character is always represented by the same sequence of codepoints. Normalization is described in the Unicode Standard Section 5.7, Normalization.
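The two representations of á described above can be compared with a minimal Python sketch. It uses the standard library function unicodedata.normalize; the variable names are illustrative only, and NFC and NFD are just two of the normalization forms that Unicode defines.

```python
import unicodedata

# The same character, á, in its two encoded representations.
composed = "\u00e1"      # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # U+0061 followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization the two strings compare as different.
print(composed == decomposed)  # False

# NFC composes to the single codepoint; NFD decomposes to the sequence.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)     # True
print(nfd == decomposed)   # True
print(len(nfc), len(nfd))  # 1 2
```

Normalizing both strings to the same form before comparing or searching avoids mismatches between text that looks identical on screen.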
a written symbol that is conventionally perceived as a distinct unit of writing in some writing system or orthography.
a speech sound which is identified as the audible realization of a phoneme.
the smallest distinctive segment of sound in any language. It actually comprises a group of similar sounds, called allophones, which native speakers of a language may perceive as being all the same. If a pair of words exists which differ only in one phonological element (known as a minimal pair), the element in which they differ is distinctive, and represents two phonemes in the language. For example, in English, bit and pit are a minimal pair; [b] and [p] are distinct phonemes, written as /b/ and /p/. (Phonemes are conventionally written surrounded by slashes.) Phonemes are not consistent across languages; two sounds may be separate phonemes in one language and allophones in another.
an inventory of all the distinctive sounds (phonemes) in a given language, also called a phoneme inventory. A language's phonemic inventory is not fixed over time; as the language changes, sounds which were previously allophones may become phonemes. The smallest documented phoneme inventory belongs to the Rotokas language, which uses only 11 phonemes. The largest belongs to !Xóõ, with an estimated 112 phonemes. The number of phonemes used in speech does not necessarily correspond to the number of symbols used in writing for a given language. For example, the English alphabet contains 26 letters, but the phonemic inventory numbers between 35 and 47 depending on the dialect used (source: Wikipedia). In a true phonemic script the symbols should map on a one-to-one basis to the sounds in the phonemic inventory.
a writing system in which each symbol tends to correspond to one phoneme. For example, the N'ko alphabet assigns one symbol to each phoneme. Also sometimes called a phonetic script although technically this is not accurate, as a true phonetic script should represent every allophone in a language.
see the rebus principle.
a simplified contact language that develops among, and is used by, people who do not speak a common language but share a geographic area. Pidgins are characterized by a simplistic grammatical structure, and a limited (usually context-specific) vocabulary. They are not spoken as a first language. Given time, a pidgin may evolve into a creole.
one of two types of phonological accent, realized by differences in pitch (fundamental frequency) between accented and unaccented syllables. Its counterpart is a stress accent. Phonetically, a higher pitch is due to a more rapid vibration of the vocal cords. The placement of pitch may determine the meaning of a word, for example in the case of the two Japanese words pronounced /haꜜsi/ chopsticks and /hasiꜜ/ bridge. Technically, a pitch accent language is differentiated from a true tone language in that every syllable is characterized by a particular tone in the latter, but only some in the former. However, this distinction is often lost in practice. Some languages, such as Welsh, have a pitch accent which does not differentiate the meaning of words. Accents may or may not be marked in writing, depending on the orthographic conventions of a particular language.
textual data that contains no document-structure or format markup, or any tagging devices that are controlled by a higher-level protocol. The meaning of plain text data is determined solely by the character encoding convention used for the data.
in Unicode, a range of 64K codepoints. Plane zero is the original 64K codepoints that can be represented in a single 16-bit character. See also Basic Multilingual Plane, supplementary planes, and surrogate pair.
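Because each plane holds 64K (65,536) codepoints, the plane number is simply the codepoint divided by 65,536. The following is a minimal Python sketch of that arithmetic; the function name plane_of is illustrative and not part of any standard API.

```python
def plane_of(codepoint: int) -> int:
    """Return the Unicode plane (0 to 16) that contains the codepoint."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    # Each plane spans 0x10000 (65,536) codepoints, so the plane number
    # is whatever remains above the low 16 bits.
    return codepoint >> 16


print(plane_of(0x0061))    # 0  (plane zero)
print(plane_of(0x1F600))   # 1
print(plane_of(0x10FFFF))  # 16
```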
|Portable Document Format (PDF)|
a keyboard layout defined in terms of the relative positions of keys rather than what they have printed on them. See also mnemonic keyboard.
a page description language defined by Adobe. It was originally implemented in laser printers so that pages could be described in terms of line drawing commands rather than as a bitmap.
a font in a format suitable for use within a Postscript document. There are many types. Type 1 is the most common and is what is meant most commonly when people refer to Postscript fonts. There are also ways of embedding other font formats into a Postscript document. For example a Type 42 font is a TrueType font formatted for use within a Postscript document. Type 1 fonts differ from TrueType fonts in the way their outlines are described.
|Practical Extraction and Reporting Language (PERL)||
an interpreted programming language particularly strong for text processing.
the addition of a short nasal onset to another consonant sound produced at the same place. Prenasalized sounds behave phonetically as a single consonant, not as a cluster. The African Bantu languages are famous for having prenasalized consonants, for example in the Lingala words /mboka/ village, /ndako/ house, and /ŋgomba/ hill.
a character encoding system in which the abstract characters that are encoded match one-for-one with the glyphs required for text display. Such encodings allow correct rendering of writing systems on 'dumb' rendering systems by having distinct codepoints for contextual forms, positional variants, etc. and are designed on the basis of rendering needs rather than on the basis of character semantics (the linguistically relevant information). Also known as glyph encoding, display encoding or surface encoding; distinguished from semantic encoding.
|Private Use Area (PUA)||
a range of Unicode codepoints (E000 - F8FF and planes 15 and 16) that are reserved for private definition and use within an organization or corporation for creating proprietary, non-standard character definitions. For more information see The Unicode Consortium, 1996, pp. 619 ff.
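A check for whether a codepoint falls in these ranges might look like the following Python sketch. The function and constant names are illustrative. One detail is an addition to the definition above: the last two codepoints of planes 15 and 16 are noncharacters in the Unicode Standard, so the plane ranges here stop at xxFFFD.

```python
# Private-use ranges: the BMP block plus planes 15 and 16.
# The final two codepoints of each plane (xxFFFE, xxFFFF) are
# noncharacters, so they are left out of the plane ranges.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the BMP
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)


def is_private_use(codepoint: int) -> bool:
    """Return True if the codepoint lies in a private-use range."""
    return any(low <= codepoint <= high for low, high in PRIVATE_USE_RANGES)


print(is_private_use(0xF020))   # True
print(is_private_use(0x0041))   # False
print(is_private_use(0xF0000))  # True
```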
see Private Use Area.
converting a graphical image described in terms of lines and fills into a bitmap for display on an imaging device.
also known as phonetization. The use of a pre-existing logograph to represent a syllabic sound having the same sound as, but a different meaning from, that of the word originally represented. The rebus principle is especially useful for representing function words, proper names, and other words which would otherwise be difficult to depict. A well-known example is the Egyptian use of the symbol representing “swallow” (pronounced wr) to represent the word “big” (which was also pronounced wr). A symbol used in this way is called a rebus. The rebus strengthens the phonetic aspect of a logographic writing system by exploiting the phonetic similarities between words. If a logographic writing system is fully (or almost fully) phonetized, it may become an abugida or an abjad. Other times, it is only partially phonetized and develops into a logosyllabary.
a test (usually a whole set of tests, often automated) designed to check that a program has not 'regressed', that is, that previous capabilities have not been compromised by introducing new ones.
to display or draw text on an output device (usually the computer screen or paper). This usually consists of two processes: transforming a sequence of characters to a set of positioned glyphs and rasterizing those glyphs into a bitmap for display on the output device.
the white space at the right edge of a glyph's visual representation, or more specifically, the distance between the display position after a glyph is rendered and the right edge of the glyph's bounding box. A positive right side-bearing indicates white space between the glyph and the following one; a negative right side-bearing indicates overlap or overhang between them.
the script based on the alphabet developed by the ancient Romans ("A B C D E F G ..."), and used by most of the languages of Europe, including English, French, German, Czech, Polish, Swedish, Estonian, etc. Also called Latin script.
a small, annotative character which is written above or to the right of a Chinese character to indicate its pronunciation. Ruby characters are usually written in the Latin, Japanese, or Bopomofo script.
in markup, a set of rules for document structure and content.
a maximal collection of characters used for writing languages or for transcribing linguistic data that share common characteristics of appearance, share a common set of typical behaviours, have a common history of development, and that would be identified as being related by some community of users. Examples: Roman (or Latin) script, Arabic script, Cyrillic script, Thai script, Devanagari script, Chinese script, etc.
|Script Description File (SDF)||
a file describing certain kinds of complex script behaviour, used to control a rendering engine to which it has given its name. Created by Tim Erickson and used in Shoebox, LinguaLinks, and ScriptPad.
providing the capability in software to allow documents to include text in multiple languages or scripts, and to handle input, display, editing and other text-related operations of text data in multiple languages and scripts. Script enabling has to do with the script in which language data is written, as opposed to localization, which has to do with the language and script of the user interface.
|segmental writing system||
one of two categories of phonologically-based (that is, not logographic or featural) writing systems, the other being syllabic writing systems, or syllabaries. Segmental writing systems represent consonants and vowels, rather than whole syllables, as individual units. Alphabets, abugidas, and abjads are all classed as segmental writing systems. There is potential for confusion over the inclusion of abugidas in this category, as each character in this type of script does represent a full syllable. However, individual consonants and vowels are acknowledged as discrete elements in that there is a systematic graphic similarity between characters which represent syllables sharing a particular consonant or vowel. As a test to determine whether a writing system is a syllabary or a segmental abugida, if it is syllabic there will be no systematic visual similarity between, for example, the characters or character sequences representing [ka], [ke], [ko], or [ki], [pi], [ti], but in an abugida there will be.
an encoding that has the property of one codepoint for every semantically distinct character (the linguistically relevant units). In general, such encodings require the use of 'smart' rendering systems for correct appearance to be achieved, but are more appropriate for all other operations performed on the text, especially for any form of analysis. Also known as deep encoding; distinguished from presentation-form encoding.
a font capable of performing transformations on complex patterns of glyphs, above and beyond the simple character-to-glyph mapping that is a basic function of font rendering (see cmap). The information specifying the smart behavior is typically in the form of extra tables embedded in the font, and will generally allow layered transformations involving one-to-many, many-to-one, and many-to-many mappings of glyphs.
a sequence of numbers that when appropriately processed using a particular standard algorithm will position the corresponding string in the correct sort position in relation to other strings. The sort key need not correspond one number to one codepoint in the input string.
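The last point, that a sort key need not have one number per codepoint, can be illustrated with a toy Python sketch. The weight table below is entirely hypothetical: it assumes an alphabet in which the digraph "ch" sorts as a single letter between "c" and "d". It is not the Unicode Collation Algorithm or any other standard algorithm.

```python
# Hypothetical collation weights for a toy alphabet in which "ch"
# is treated as one letter that sorts after "c" and before "d".
WEIGHTS = {"a": 1, "b": 2, "c": 3, "ch": 4, "d": 5, "e": 6, "i": 7}


def sort_key(word: str) -> list[int]:
    """Build a list of weights, consuming "ch" as a single unit."""
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in WEIGHTS:           # two codepoints, one weight
            key.append(WEIGHTS[pair])
            i += 2
        else:                         # one codepoint, one weight
            key.append(WEIGHTS[word[i]])
            i += 1
    return key


words = ["chi", "ci", "da", "ca"]
print(sort_key("chi"))              # [4, 7]
print(sorted(words))                # ['ca', 'chi', 'ci', 'da']
print(sorted(words, key=sort_key))  # ['ca', 'ci', 'chi', 'da']
```

Comparing the two sorted lists shows the effect: plain codepoint order puts "chi" before "ci", while the sort keys place it after every word beginning with a plain "c".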
|Standard Format Marker (SFM)||
an element of a proprietary format developed by SIL International and used by some linguistic software applications. A standard format marker begins with a backslash (\); for example, \p would represent a paragraph tag. It is possible (and even probable) that SFMs in a single document have different character encodings. When converting to one encoding (Unicode) these must be converted with different mapping files.
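As a rough illustration of the backslash convention, the Python sketch below splits a block of text into (marker, content) pairs. It assumes that each field starts on its own line with a backslash marker and that unmarked lines continue the previous field; real SFM data can be more varied (inline markers, for example), and the sample markers and text are made up for the example.

```python
import re

# A marker is a backslash followed by non-space characters at the start
# of a line; the rest of the line is the field content.
MARKER = re.compile(r"\\(\S+)[ \t]*(.*)")


def parse_sfm(text: str) -> list[tuple[str, str]]:
    """Split text into (marker, content) pairs, one per field."""
    fields = []
    for line in text.splitlines():
        match = MARKER.match(line)
        if match:
            fields.append((match.group(1), match.group(2)))
        elif fields and line.strip():
            # An unmarked line is treated as a continuation (an assumption).
            marker, content = fields[-1]
            fields[-1] = (marker, (content + " " + line.strip()).strip())
    return fields


sample = "\\lx kitabu\n\\ps n\n\\ge book\n"  # illustrative data only
print(parse_sfm(sample))
# [('lx', 'kitabu'), ('ps', 'n'), ('ge', 'book')]
```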
|Standard Generalized Markup Language (SGML)||
a notation for generalized markup developed by the International Organization for Standardization (ISO). It separates textual information from the processing function used for formatting. It was found difficult to parse, due to the many variants possible, and so XML was developed as a subset to resolve the ambiguities and to make parsing easier.
also called a plosive. In phonology, a speech sound whose production involves a complete blockage of the air flow. The term may be restricted to consonants in which the air flow is blocked through both the mouth and the nose, such as [p] or [k], or it may also include those in which the air flow is blocked through the mouth only, such as [m] or [n]. Sounds in which the airflow is blocked through both the mouth and the nose cannot be articulated continuously.
one of two types of phonological accent by which one syllable is heard to be more prominent than others, its counterpart being a pitch accent. Phonetically, stress is due to a difference in length, volume, vowel quality, or a combination of these. These differences are thought to reflect a greater muscular energy in the production of the stressed syllable. The placement of stress may determine the meaning of a word, for example in the case of the two English words /conˈtent/ and /ˈcontent/. Accents may or may not be marked in writing, depending on the orthographic conventions of a particular language.
Unicode Planes 1 through 16, consisting of the supplementary code points, corresponding to codepoints U+10000 to U+10FFFF. In The Unicode Standard 3.1, characters were assigned in the supplementary planes for the first time, in Planes 1, 2 and 14. See also Basic Multilingual Plane.
a unit or feature whose domain extends over more than one minimal element. For example, stress is classed as a suprasegmental feature because its domain is a whole syllable, comprised of the smaller minimal elements consonants and vowels. Suprasegmental features may be marked in writing; in these cases, the area in which they are written is called the suprasegmental box.
a mechanism in the UTF-16 encoding form of Unicode in which two 16-bit code units from the range 0xD800 to 0xDFFF are used to encode Unicode supplementary plane characters, i.e. with Unicode scalar values in the range U+10000 to U+10FFFF.
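The arithmetic behind a surrogate pair can be sketched in Python as follows. The function names are illustrative; the calculation itself follows the UTF-16 definition, in which the first (high) code unit comes from 0xD800 to 0xDBFF and the second (low) code unit from 0xDC00 to 0xDFFF.

```python
def to_surrogates(codepoint: int) -> tuple[int, int]:
    """Split a supplementary-plane codepoint into (high, low) code units."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only supplementary-plane codepoints need surrogates")
    offset = codepoint - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")          # D83D DE00
print(hex(from_surrogates(high, low)))  # 0x1f600
```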
a form of writing in which the symbols represent syllables--most commonly a vowel-and-consonant combination. A syllabary differs from an abugida in that there are no distinct elements of the symbols to correspond to the syllable's phonemes.
part of a syllabary, a character which represents a syllable. The term is used almost exclusively to refer to a character from the Ethiopic script, though this tendency does not preclude its use in other contexts.
a font used either for non-orthographic collections of shapes (such as Wingdings) or for legacy orthographies (e.g., SIL Ezra, SIL Galatia, SIL IPA) created prior to availability of Unicode-based solutions. Symbol-encoded fonts encode characters in the Private Use Area, typically U+F020 .. U+F0FF.
the process of analysing a string into a contiguous sequence of smaller units: for example, word breaking or syllable breaking or the creation of a sort key.
a unit belonging to a set characterized primarily by differences or changes in the levels of pitch. In a tone language, tone is used to distinguish each syllable, as illustrated by the three Ngbaka words /mà/ (with a low tone) magic, /mā/ (mid tone) I, and /má/ (high tone) to me. Tone can also be used in a system of intonation; for example, in English a rising tone may indicate surprise while a falling tone may indicate disappointment.
a font format, used primarily in Windows and on the Mac, that allows for glyph scaling and hinting.
a diacritic mark (¨) used to represent fronting or rounding of a vowel, particularly in Germanic languages. For example, in German an umlaut changes a [a] to ä [ɛ]. In modern computer systems, umlaut and diaeresis are represented identically with a pair of dots, although in handwritten texts umlaut can vary from two short vertical lines to a single horizontal line over the vowel.
a style of writing entirely in single stroke, rounded upper case forms, commonly found in European texts from the 4th to the 8th century. Uncial writing is the predecessor of modern capital letters.
describes a script with only one set of symbols, that is, one that makes no distinction between upper and lower case. See also bicameral.
|Unicode Scalar Value (USV)||
a number written as a hexadecimal (base 16) value that serves as the codepoint for Unicode characters. Characters in the BMP are written with four hex digits, e.g. U+0061, U+AA32. Characters in supplementary planes use five or six digits.
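The notation can be produced with a short Python sketch. The helper name format_usv is illustrative; the padding to at least four hexadecimal digits follows the convention described above.

```python
def format_usv(char: str) -> str:
    """Format a single character's scalar value in U+XXXX notation."""
    # Pad to at least four hex digits; supplementary-plane characters
    # naturally come out with five or six.
    return f"U+{ord(char):04X}"


print(format_usv("a"))           # U+0061
print(format_usv("\U0001F600"))  # U+1F600
print(format_usv("\U0010FFFF"))  # U+10FFFF
```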
|Uniscribe (Unicode Script Processor)||
due to technical limitations in OpenType, it is necessary to pre-process strings before applying OpenType smart behaviour. Microsoft uses a particular DLL (Dynamic Link Library) called Uniscribe to do this pre-processing. Uniscribe does all of the script specific, font generic processing of a string (such as reordering) leaving the font specific processing (such as contextual forms) to the OpenType lookups of a font.
|Universal Character Set (UCS)|
see Unicode Scalar Value.
an encoding form for storing Unicode codepoints in 32-bit words. Since 32 bits encompasses the entire range of Unicode, every codepoint is encoded as a single 32-bit word. See Unicode Technical Report #19.
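The one-word-per-codepoint property can be seen in a small Python sketch. It uses Python's built-in "utf-32-be" codec (big-endian, no byte order mark); the function name is illustrative.

```python
def utf32_words(text: str) -> list[int]:
    """Return the 32-bit words of the text's big-endian UTF-32 encoding."""
    data = text.encode("utf-32-be")
    # Every codepoint occupies exactly four bytes.
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


words = utf32_words("a\U0001F600")
print([hex(w) for w in words])  # ['0x61', '0x1f600']
```

Each 32-bit word is numerically equal to the codepoint it encodes, whether the character is in the BMP or a supplementary plane.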
an encoding form for storing Unicode codepoints in terms of 8-bit bytes. Characters are encoded using sequences of 1-4 bytes. Characters in the ASCII character set are all represented using a single byte. See http://www.unicode.org/unicode/faq/utf_bom.html.
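The variable length of UTF-8 sequences can be observed with a short Python sketch using the built-in "utf-8" codec. The function name and the sample characters are illustrative.

```python
def utf8_length(char: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(char.encode("utf-8"))


# One sample character for each possible sequence length.
samples = ["A", "\u00e1", "\u20ac", "\U0001F600"]
for char in samples:
    print(f"U+{ord(char):04X} -> {utf8_length(char)} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E1 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```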
a numeral system based on the number twenty (as opposed to a decimal counting system which is based on the number ten). In the context of writing systems this usually means that twenty individual numeral symbols are used. Vigesimal systems are used in many languages in Europe, Asia, South America and Africa.
the generic name for a written symbol, particularly common in Brahmic abugidas, having the function of silencing the inherent vowel that otherwise occurs with every consonant character. The shape of the symbol varies from script to script, but it is often a diacritic, written above, below or alongside the consonant which it modifies.
a software component that allows a user to enter characters, often characters which cannot be accessed through a standard physical keyboard. Virtual keyboards can be useful for multilingual users who need to switch between different character sets.
|Visual OpenType Layout Tool (VOLT)|
a way of storing characters so that the order in which they are stored corresponds to the order in which they appear on the page, as opposed to the order in which they are read. In many cases these are the same, but the distinction is particularly pertinent in bidirectional or mixed text, as the order in which the characters appear on the page does not necessarily correspond to the order in which they are pronounced. Older legacy font encodings tended to store characters in visual order, but Unicode-encoded fonts use logical order.
an implementation of one or more scripts to form a complete system for writing a particular language. Most writing systems are based primarily upon a single script; writing systems for Japanese and Korean are notable exceptions. Many languages have multiple writing systems, however, each based on different scripts; e.g. the Mongolian language can be written using Mongolian or Cyrillic scripts. A writing system uses some subset of the characters of the script or scripts on which it is based with most or all of the behaviours typical to that script and possibly certain behaviours that are peculiar to that particular writing system.
the distance from the baseline of a line of text to the top of the main body of lower-case letters, that is, without ascenders or descenders. It is the height of a lower-case x, as well as a lower-case u, v, w, and z. Curved letters such as a, e, n, and s tend to be slightly taller than the x-height for aesthetic purposes.