ScriptSource

Unicode conversion of non-breaking hyphens in MS Word

Posted by Jim Brase on 2012-12-11 10:50:48

Issue

The Unicode scalar value (USV) of the non-breaking hyphen is U+2011. However, when a document is saved in Microsoft Word as 8-bit text, the non-breaking hyphen is written as the byte 0x1E. This can affect data created with MS Word in the following situations:

  • Data is saved as plain text. (The default in current versions of MS Word is to save data as Unicode text.) I have verified that Word 2010 on a Windows 7 platform will convert a non-breaking hyphen to 0x1E when a document is saved as plain text.
  • The document was created in a pre-Unicode version of MS Word, which would be very old. I believe the colleague who brought this issue to my attention was working with such a document.

It is not certain how many of our mapping tables recognize 0x1E as a non-breaking hyphen. When I ran a test using the CP1252 converter, it changed 0x1E to U+001E. That is normally the desired result, but it does not handle 8-bit data from MS Word correctly, because the non-breaking hyphen ends up as a control character.
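
The behaviour described above can be reproduced with a few lines of Python. This is only a sketch: Python's built-in cp1252 codec stands in for the CP1252 converter mentioned above, and the sample bytes are invented for illustration.

```python
# Invented sample: the word "well-known" as 8-bit Word text would store it,
# with the non-breaking hyphen written as the byte 0x1E.
raw = b"well\x1eknown"

# Python's cp1252 codec passes the control byte 0x1E through as U+001E.
text = raw.decode("cp1252")

# List the control characters that survived the conversion.
controls = [f"U+{ord(ch):04X}" for ch in text if ord(ch) < 0x20]
print(controls)              # ['U+001E']
print("\u2011" in text)      # False: no non-breaking hyphen was produced
```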

Solution

If you have 8-bit data from MS Word that contains 0x1E, and your mapping table converts 0x1E to U+001E, you can handle it in one of the following ways:

  • Customize the mapping table to convert 0x1E to U+2011, or:
  • Convert that data with the existing mapping table, then use a global 'find and replace' to change U+001E to U+2011, or:
  • First use a global 'find and replace' to change 0x1E to an ASCII character sequence that does not otherwise occur in the data, such as ^~, to represent the non-breaking hyphen. Then convert the data with an existing map.
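
The three options above might be sketched in Python as follows. This is an illustration under assumptions, not the method used by any particular converter: Python's cp1252 codec stands in for the mapping table, and the function names are hypothetical.

```python
NB_HYPHEN = "\u2011"  # NON-BREAKING HYPHEN
U001E = "\x1e"        # what a plain CP1252 mapping produces for byte 0x1E


def convert_with_custom_map(raw: bytes) -> str:
    """Option 1 (approximation): a per-byte table in which 0x1E maps to U+2011."""
    table = {}
    for b in range(256):
        try:
            table[b] = bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few bytes undefined.
            table[b] = "\ufffd"
    table[0x1E] = NB_HYPHEN
    return "".join(table[b] for b in raw)


def convert_then_replace(raw: bytes) -> str:
    """Option 2: convert with the existing map, then replace U+001E."""
    return raw.decode("cp1252").replace(U001E, NB_HYPHEN)


def placeholder_then_convert(raw: bytes, placeholder: bytes = b"^~") -> str:
    """Option 3: swap 0x1E for an unused ASCII sequence, then convert."""
    if placeholder in raw:
        raise ValueError("placeholder already occurs in the data")
    return raw.replace(b"\x1e", placeholder).decode("cp1252")
```

For the invented sample b"well\x1eknown", the first two functions return "well" + U+2011 + "known", while the third returns "well^~known", leaving the placeholder in the converted text.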

If you have converted MS Word data and hyphens seem to have disappeared, or the data is left with seemingly extraneous U+001E characters, check whether this is the cause.
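
One way to run that check is to scan the converted text for U+001E and look at what surrounds each hit. The helper below is a hypothetical sketch; its name and the `context` parameter are assumptions, not part of any existing tool.

```python
def find_stray_u001e(text: str, context: int = 10):
    """Yield (offset, snippet) for every U+001E found in converted text."""
    for i, ch in enumerate(text):
        if ch == "\x1e":
            # Show a little text on either side to judge whether a
            # hyphen is missing at this position.
            yield i, text[max(0, i - context): i + context + 1]


hits = list(find_stray_u001e("well\x1eknown"))
print(len(hits))  # 1
```

If the hits fall between word parts where a hyphen would be expected, the data most likely came from MS Word 8-bit text.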

It is not clear that all mapping tables should be modified to include the 0x1E > U+2011 change, since this is specific to MS Word 8-bit data, and tables are often used with data from other sources.

Properties

Created 2012-11-29 15:30:29 by jlbrase
Modified 2012-12-11 10:50:48 by raymondmj
Status approved

Copyright © 2013 SIL International and released under the Creative Commons Attribution-ShareAlike 3.0 license (CC-BY-SA) unless noted otherwise. Language data includes information from the Ethnologue. Script information partially from the ISO 15924 Registration Authority. Some character data from The Unicode Standard Character Database and locale data from the Common Locale Data Repository. Used by permission.