Writing Systems Computing - NRSI

Posted Entries

This blog contains general ideas about computing and writing systems from the NRSI team, mostly applicable to those implementing and using writing system implementations (WSIs), and also to those developing WSI components.

  • Posted by Lorna Evans on 2017-11-09 04:37:31

    Steven Loomis has written a blog post called Full Stack Language Enablement. He says:

    There are a lot of steps to be taken in order to ensure that a language is fully supported. The objective of this document is to collect the steps needed and begin to plan how to accomplish them for particular languages. The intent is for this to serve as a guide to language community members and other interested parties in how to improve the support for a particular language.

    This is a great start, and we hope others will assist in turning it into the "go to" document for where to find help with each of the steps needed to achieve full support for a language.

  • Posted by Martin Hosken on 2017-08-25 06:58:26

    Introduction

    There are many languages that are written in more than one script. The same text looks completely different in each script, yet it sounds and means exactly the same. There are two approaches to producing the same text in multiple scripts. One is simply to convert from one script to another. But what if more than two scripts are involved: could we convert to a common, interlingua-style encoding from which text in any of the other scripts could be generated easily? We will examine both approaches and their relative problems and merits.

    Direct Conversion

    The simplest approach to consider is that of mapping between two different scripts. A program reads text in one script and generates the identical text in another script. Given that most orthographies are based on the phonemes of a language, one can simply convert the source orthography into the phonemes and then express those in the target script.
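
    As a rough sketch of this first approach, the following Python fragment converts text through a phonemic form using longest-match rules. The mapping tables here are invented purely for illustration; real rules would come from the orthographies concerned.

        # Hypothetical orthography-to-phoneme and phoneme-to-orthography rules.
        SOURCE_TO_PHONEME = {"sh": "ʃ", "ch": "tʃ", "a": "a", "i": "i", "t": "t"}
        PHONEME_TO_TARGET = {"ʃ": "š", "tʃ": "č", "a": "a", "i": "i", "t": "t"}

        def apply_rules(text, mapping):
            """Greedily match the longest rule in `mapping` at each position."""
            keys = sorted(mapping, key=len, reverse=True)
            out, i = [], 0
            while i < len(text):
                for key in keys:
                    if text.startswith(key, i):
                        out.append(mapping[key])
                        i += len(key)
                        break
                else:
                    out.append(text[i])  # pass unmapped characters through
                    i += 1
            return "".join(out)

        def convert(text):
            phonemes = apply_rules(text, SOURCE_TO_PHONEME)
            return apply_rules(phonemes, PHONEME_TO_TARGET)

        print(convert("chat"))  # "ch" -> /tʃ/ -> "č", giving "čat"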

    The problem here is that most orthographies, especially those that have survived any kind of language shift, are not all that pure. Some orthographies are over-differentiated: they have two ways of spelling the same thing. This makes them easy to 'read' but hard to generate text in, since the converter must choose the right spelling. Some orthographies are under-differentiated: the same characters can correspond to different phonemes, and one can only tell which from context. An orthography may start out relatively pure, with little under- or over-differentiation, but as languages shift (phonemes split or combine, tonal systems change), orthographies often do not, and the result is that the orthography loses its purity and becomes more complex.

    Add into this mix loanwords from foreign languages. They can introduce unexpected phonemes or phonemic sequences that are not in the main language. Again, how an orthography handles these may add complexity to the orthography.

    The result is that while it is possible to convert a lot of text in a language directly between two orthographies, there is almost always a residue of words that do not fit the pattern or that need to be handled directly because they are ambiguous in one or both of the scripts.
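
    In practice, then, a converter pairs its rules with an exception lexicon: each word is looked up first, and the regular rules are applied only when no entry exists. A minimal sketch, with invented data:

        # Toy Latin-to-Cyrillic letter rules plus an exception lexicon for
        # the residue of loanwords and ambiguous spellings (data invented).
        RULES = str.maketrans({"a": "а", "t": "т", "i": "и"})
        EXCEPTIONS = {"taxi": "такси"}  # words the rules would get wrong

        def convert_word(word):
            # The exception lexicon wins; otherwise apply the regular rules.
            return EXCEPTIONS.get(word) or word.translate(RULES)

        print(convert_word("tit"))   # by rule: "тит"
        print(convert_word("taxi"))  # by exception: "такси"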

    Common Encoding

    We've looked at direct conversion between two scripts. But what if we could come up with a pure script, a common encoding that represents the language, and then convert from that? Could that give us a simple transformation that always works?

    One of the difficulties with this approach is that the common encoding needs to hold all the over-differentiation that occurs in all the scripts the text may be output in. For Latin script, for example, it must store case for proper nouns.

    The result is, in effect, yet another orthography for the language that may not be in a convenient script for a typist. Someone typing the language in Lao script has to be concerned with case, even though there is no concept of case in Lao script. How is this extra information, then, presented in Lao script? A Latin script typist has to handle the different kinds of word breaks that Lao script has, with no space between words but spaces between phrases. The Lao script user has to think about commas and periods. The Latin script user may not have as expressive a way of presenting tone as does the Lao script, and so the complexities go on.
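
    To make that burden concrete, here is a sketch of what a single token in such a common encoding might have to carry. The field names are invented for illustration:

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class CommonToken:
            phonemes: str        # the shared phonemic core
            tone: Optional[str]  # needed for Lao output, invisible in Latin
            capitalized: bool    # needed for Latin output, meaningless in Lao
            phrase_final: bool   # drives Lao phrase spacing vs. Latin punctuation

        # A Lao-script typist must somehow supply `capitalized`; a Latin-script
        # typist must somehow supply `phrase_final`. Every script adds fields.
        token = CommonToken("ban", tone="high", capitalized=True, phrase_final=False)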

    The common encoding becomes something that only a few experts can use and even then may need changing in response to other scripts in which the language may need to be rendered.

    Having worked with various cross-script conversion projects, I see the value in a common encoding, but in practice its value is only as a transitional step between two orthographies, a way of reducing the number of conversion descriptions needed. The encoding itself is an implementation detail and not something any reader or writer of the language should ever have to deal with.

    It would therefore be inappropriate to publish such a common encoding as part of a standard. This is on top of the core principle of Unicode that it is a character encoding, not an encoding of morphophonemics.

    Solution

    The ideal situation is that a user can type their language in the script of their choice, select the text and then convert it to another script, complete with appropriate font and style changes. There are a number of steps that need to happen for this to be available.

    The conversion takes text from one script to another. It may be a two-step process, from the source script to an intermediate encoding and then out to the target script, or it may be direct. Different languages and script collections have different needs and solutions. How the conversion occurs is an implementation detail.
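
    Where an intermediate encoding is used, the full conversion is simply the composition of two smaller ones, which is what reduces the number of conversion descriptions needed: n scripts require 2n descriptions to and from the intermediate encoding rather than n × (n − 1) direct ones. A sketch, with placeholder converters:

        def lao_to_common(text):
            return text  # placeholder for the real Lao rules

        def common_to_latin(text):
            return text  # placeholder for the real Latin rules

        def pipeline(*steps):
            """Compose per-script converters into one source-to-target function."""
            def run(text):
                for step in steps:
                    text = step(text)
                return text
            return run

        lao_to_latin = pipeline(lao_to_common, common_to_latin)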

    Different applications may support conversion in different ways, and there is plenty of opportunity for growth in this area. SIL Converters is one such solution which, while not aimed specifically at multi-script conversion, can be used for it.

  • Posted by Martin Hosken on 2016-10-11 06:13:01

    I had the privilege recently of attending the LibreOffice Conference 2016 in Brno in the Czech Republic. What a lovely city and what a lovely bunch of people. There were people ranging from first time attendees to those who have been involved with the code for over 20 years. And yet for all that these people know each other so well, they are still welcoming to newbies and new ideas.

    It's amazing how many issues you can resolve with all the right people in the same room. I was able to flesh out some of the ideas we have been having about providing minority language locale information for general access, and then making that useful in a program like LibreOffice. Likewise it will be exciting to see the new work to reduce the complexity of the text layout stack down to a few classes, so that we can then really address the overall layout issue and speed up text rendering. All this through face-to-face interaction. That's hard from a quarter of the way across the world (in the wrong direction).

    Of course, this is LibreOffice, and while huge strides have been made to make LibreOffice development so much easier, what with a much improved build system and Gerrit commit workflow, it is still a hugely complex beast, and making significant changes to something so large and mature will not happen overnight. But I wish Khaled Hosny every success in his work on simplifying the text layout code. Hopefully I will be able to contribute a little to that in some helpful way.

    So, all in all, a very successful conference. I'm very glad I was able to go and I look forward to some great fruit from it. Now all I have to do is start digging!

  • Posted by Jim Brase on 2016-03-31 05:53:02

    Microsoft Keyboard Layout Creator (MSKLC) has been one of our tools for creating Windows keyboards for a number of years. The following is a summary of my experience with this tool on Windows 10.

    Installing existing keyboards

    I was able to install several keyboards that had previously been created with MSKLC, and they all worked OK. So keyboards we created in the past seem to be safe.

    Using MSKLC on Windows 10 to create new keyboards

    One of my colleagues installed MSKLC on Windows 10, but when she tried to compile a keyboard, MSKLC would consistently crash.

    I tried installing MSKLC on Windows 10 twice. The first time it seemed to install OK, but I didn't have time to actually try using it. Later I started to get system crashes (reason unknown), and it was decided that we needed to rebuild Windows and reinstall everything. When I got back to trying to install MSKLC, it gave me the usual message about needing to install .NET Framework 2.0 first. I tried installing .NET 3.5 (which includes 2.0), but wasn't able to get it to install (more details available here). After a couple of days I gave up.

    Conclusions

    1. The future of MSKLC

    Given these experiences, it seems that the future of MSKLC is uncertain at best. One should be able to run it on an earlier version of Windows within a VM on Windows 10, but that's an awkward solution. If you have had success running MSKLC on Windows 10, please let us know.

    Fortunately, we have a better solution in Keyman, and given the upcoming changes that we anticipate in Keyman licensing, it seems likely that we will put more emphasis (perhaps exclusive emphasis) on that solution for our keyboards.

    2. .NET 2.0 and 3.5 on Windows 10

    Through our local IT people, I have heard other reports of people having problems installing .NET Framework on Windows 10. This is disturbing, and we would be interested in knowing what other software may be affected by this problem.

  • Posted by Lorna Evans on 2015-07-10 07:18:43

    We recently updated the writing system information for languages in ISO 639-3. We are often asked for information on how many languages are written (or not written). This information is not easily derived. However, this is our best stab at the information for “Living” languages.

                                                 Africa  Americas   Asia  Europe  Pacific  Total
    Living Languages                              2,138     1,064  2,301     286    1,313  7,102
    Known to be written                           1,118       637  1,079     191      636  3,661
    Known to be unwritten                           333        76    242      14       54    719
    No information on whether written or not¹       662       319    945      41      619  2,586
    Sign Languages²                                  25        32     35      40        4    136

    ¹ They could be newly written, or not written at all.
    ² Sign Languages have not been included in either “written” or “unwritten” statistics.

Previous Posts

  • Detecting font usage in a web browser

    sharoncorrell | 2015-02-27 08:18:28

  • The Bidi Algorithm, Part 5: Overrides and embedding

    sharoncorrell | 2014-09-18 05:11:57

  • Unicode Character Browsing

    BobHallissy | 2014-05-08 05:31:25

  • What is a Warsh Orthography?

    priestla | 2014-04-09 04:02:19

  • The Bidi Algorithm, Part 4: Mirroring

    sharoncorrell | 2014-04-03 10:29:40

  • The Bidi Algorithm, Part 3: Directionality codes

    sharoncorrell | 2014-03-31 04:46:38

  • Accessing Graphite features in LibreOffice

    sharoncorrell | 2014-02-11 04:57:47

  • The Bidi Algorithm, Part 2: Paragraph direction and flow

    sharoncorrell | 2013-11-18 04:02:00

  • The Unicode Bidirectional Algorithm: a gentle introduction

    sharoncorrell | 2013-11-07 10:54:00

  • Keyman for iPhone and iPad

    BobHallissy | 2013-11-07 04:51:00

  • Dotted circle issues on web browsers

    sharoncorrell | 2013-10-03 16:49:00

  • Everyday Unicode

    priestla | 2013-09-09 06:44:00

  • Introduction to Text Conversion and Transliteration

    davidr | 2013-06-06 08:59:00

  • New documentation for Keyboard Systems

    davidr | 2013-04-19 06:04:00

  • Different kinds of hamza characters in Arabic script

    sharoncorrell | 2013-03-24 09:38:00

  • The Inclusion of TypeTuner in FontUtils

    wardak | 2013-03-20 17:08:00

  • Unicode conversion of non-breaking hyphens in MS Word

    jlbrase | 2012-12-11 10:50:00

  • Tai Heritage Pro 2.5 released

    jlbrase | 2012-11-27 06:22:00

  • Graide 0.5 released

    sharoncorrell | 2012-10-11 14:13:00

  • fontinfo Firefox add-on

    priestla | 2012-09-26 03:37:00

  • Virama or Halant, which model do I choose?

    hoskenmj | 2012-06-06 03:49:00

  • Towards a system for organizing glyph collision rules

    sharoncorrell | 2012-05-16 04:49:00

  • Graphite font features in Firefox 11

    sharoncorrell | 2012-03-27 14:31:00

  • Graphite support in Firefox 11

    sharoncorrell | 2012-03-26 05:00:00

Copyright © 2017 SIL International and released under the Creative Commons Attribution-ShareAlike 3.0 license (CC-BY-SA) unless noted otherwise. Language data includes information from the Ethnologue. Script information partially from the ISO 15924 Registration Authority. Some character data from The Unicode Standard Character Database and locale data from the Common Locale Data Repository. Used by permission.