Unicode conversion of non-breaking hyphens in MS Word
Posted by Jim Brase on 2012-12-11 10:50:00
The USV for the non-breaking hyphen is U+2011. However, if a document is saved in Microsoft Word as 8-bit text, a non-breaking hyphen is represented as 0x1E. This may apply to data created with MS Word in the following situations:
- Data is saved as plain text. (The default in current versions of MS Word is to save data as Unicode text.) I have verified that Word 2010 on a Windows 7 platform will convert a non-breaking hyphen to 0x1E when a document is saved as plain text.
- The document was created in a pre-Unicode version of MS Word, which would be very old. I believe this was the kind of document my colleague, who brought this issue to my attention, was working with.
I do not know how many of our mapping tables recognize 0x1E as a non-breaking hyphen. When I ran a test using the CP1252 converter, it changed 0x1E to U+001E. That is normally the desired result, since U+001E is the control character at that code point, but it does not handle 8-bit data from MS Word correctly.
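The same behavior can be reproduced with a generic converter outside our mapping tables. The following is a minimal sketch that uses Python's built-in cp1252 codec as a stand-in for the CP1252 converter; the sample bytes are invented for illustration.

```python
# Stand-in for a generic CP1252 converter: Python's built-in cp1252 codec.
# The sample bytes are made up; 0x1E marks where Word stored the
# non-breaking hyphen in its 8-bit output.
raw = b"non\x1ebreaking"

text = raw.decode("cp1252")

# Show the code points around the position of the 0x1E byte.
print([f"U+{ord(ch):04X}" for ch in text[2:5]])
# prints ['U+006E', 'U+001E', 'U+0062']
```

The byte comes through as the control character U+001E, not as U+2011.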
If you have 8-bit data that originated in MS Word which contains 0x1E, and your mapping table converts 0x1E to U+001E, you can handle it in one of the following ways:
- Customize the mapping table to convert 0x1E to U+2011, or:
- Convert that data with the existing mapping table, then use a global 'find and replace' to change U+001E to U+2011, or:
- First use a global 'find and replace' to change 0x1E to an otherwise unused ASCII character sequence, such as ^~, to stand for the non-breaking hyphen. Then convert the data with an existing map, and change that sequence to U+2011 in the converted text.
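The second and third options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python's cp1252 codec stands in for an existing mapping table, the function names are hypothetical, and the last step of the third option (turning the placeholder into U+2011 after conversion) is my reading of how the placeholder would be used.

```python
NON_BREAKING_HYPHEN = "\u2011"


def convert_then_replace(raw: bytes, encoding: str = "cp1252") -> str:
    """Second option: convert first, then change U+001E to U+2011."""
    return raw.decode(encoding).replace("\x1e", NON_BREAKING_HYPHEN)


def placeholder_then_convert(
    raw: bytes, encoding: str = "cp1252", placeholder: bytes = b"^~"
) -> str:
    """Third option: mark 0x1E with an unused ASCII sequence, then convert.

    Assumption: the placeholder is replaced with U+2011 after conversion.
    """
    # The placeholder must not already occur in the data, or genuine
    # occurrences would be turned into non-breaking hyphens as well.
    if placeholder in raw:
        raise ValueError("placeholder already occurs in the data")
    marked = raw.replace(b"\x1e", placeholder)
    text = marked.decode(encoding)
    return text.replace(placeholder.decode("ascii"), NON_BREAKING_HYPHEN)


sample = b"a\x1eb"  # invented sample data
print(convert_then_replace(sample) == placeholder_then_convert(sample))
# prints True
```

Both routes give the same result here. The placeholder route is mainly useful when the converter would otherwise drop or alter the 0x1E byte.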
If you have converted MS Word data and hyphens seem to have disappeared, or the data was left with seemingly extraneous U+001E characters, check whether this is the cause.
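One way to run that check is to scan the converted text for U+001E. This is a rough sketch; the function name and the sample string are hypothetical.

```python
def find_u001e(text: str) -> list:
    """Return the character offsets of every U+001E in converted text."""
    return [i for i, ch in enumerate(text) if ch == "\x1e"]


converted = "well\x1eknown and non\x1ebreaking"  # invented sample
print(find_u001e(converted))
# prints [4, 18]
```

An empty list means the converted text contains no U+001E, so the missing hyphens have some other cause.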
It is not clear that all mapping tables should be modified to include the 0x1E > U+2011 change, since this is specific to MS Word 8-bit data, and tables are often used with data from other sources.