Examining Unicode, Part I – The dissection

by Fraser Gordon on March 31, 2014

As I mentioned in my previous blog post, Unicode text is hard (which is one of the reasons it has taken such a monumental effort to get LiveCode 7.0 ready for release – it is now in public testing if you’d like to try it out). In order to make everything work transparently for the writers and users of LiveCode stacks, a lot has to go on behind the scenes. In this post and its follow-up, I hope to explain how some of these innards work. This first post is a bit technical but will lay the groundwork for some new Unicode text processing techniques.

The most important thing with Unicode is to understand what is meant by a "character" – different people have different definitions, some quite technical. Older software often refers to a single 8-bit byte as a character, a convention which LiveCode and its predecessors followed. Sometimes, "character" is used for the symbols defined by the Unicode standard (these are more properly termed "codepoints"). Neither of these is necessarily what a human reader would think of as a character, however.

Consider the letter "é" – that’s obviously a single character, right? Well, it depends on who you ask… Considered as 8-bit bytes, it could be anywhere between 1 and 8 "characters". Looking at it with Unicode-coloured glasses, it could be either 1 or 2 codepoints. However, in LiveCode 7, it is always a single character. If you were a Unicode geek like me, you’d call this LiveCode definition a "grapheme cluster".
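
You can see LiveCode's definition at work for yourself. Here is a quick sketch for the message box (run each put line in turn), using LiveCode 7's numToCodepoint function to build each form of "é" explicitly:

    -- Composed form: the single codepoint U+00E9 (233)
    put numToCodepoint(233) into tComposed
    -- Decomposed form: U+0065 (101) followed by U+0301 (769)
    put numToCodepoint(101) & numToCodepoint(769) into tDecomposed

    put the number of codepoints in tComposed    -- 1
    put the number of codepoints in tDecomposed  -- 2
    -- Both forms should count as a single character (grapheme cluster)
    put the number of chars in tComposed         -- 1
    put the number of chars in tDecomposed       -- 1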

Why do these different interpretations arise? If you’ll bear with me, I’ll take it apart piece-by-piece.

First come the codepoints. The Unicode standard defines two types of representation for accented characters, known as "composed" and "decomposed". Continuing with "é" as our example, Unicode calls the composed form U+00E9 "LATIN SMALL LETTER E WITH ACUTE". In its decomposed form, it is U+0065 "LATIN SMALL LETTER E" followed by U+0301 "COMBINING ACUTE ACCENT". Essentially, composed versus decomposed is the choice between an accented character being a character in its own right or an un-accented character with an accent atop it. Conversion between these forms is called "normalisation" and will be discussed in my next post.
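
As a taster, LiveCode 7's normalizeText function converts between the two forms. A minimal sketch, assuming the normal forms "NFC" and "NFD", which correspond to composed and decomposed respectively:

    -- Start from the composed form, U+00E9
    put numToCodepoint(233) into tComposed
    -- "NFD" decomposes it into U+0065 followed by U+0301...
    put normalizeText(tComposed, "NFD") into tDecomposed
    put the number of codepoints in tDecomposed                        -- 2
    -- ...and "NFC" recomposes it into a single codepoint
    put the number of codepoints in normalizeText(tDecomposed, "NFC")  -- 1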

Next comes the variable number of bytes used to store these codepoints – this comes down to how the codepoints are encoded. Some old 8-bit encodings represent a particular composed character with a single byte. Unfortunately, these encodings can only represent 256 different characters, so Unicode encodings are used instead. The particular encoding used within LiveCode is UTF-16 (but this is internal to the engine and isn't visible to LiveCode scripts).
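
Because the internal encoding is hidden, encodings only matter at the boundaries of your stack – reading and writing files, sockets and the like. LiveCode 7 provides the textEncode and textDecode functions for exactly this. A quick sketch:

    -- Encode text into binary data, e.g. before writing it to a file
    put textEncode("café", "UTF-8") into tBytes
    put the number of bytes in tBytes  -- 5: one byte each for c, a, f; two for é

    -- Decode binary data back into text after reading it
    put textDecode(tBytes, "UTF-8") into tText
    put tText is "café"                -- true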

The UTF-16 encoding uses 16-bit values, termed "code units", to store codepoints. This extra term is needed because, although many languages have all of their symbols representable using a single code unit, some (including Chinese) have so many symbols that certain codepoints need two code units. Because of this, a codepoint can be either 2 or 4 bytes in length when encoded with UTF-16.
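
You can watch this happen with a codepoint from outside the 16-bit range – here the musical G clef symbol, U+1D11E, again built with numToCodepoint. LiveCode 7's codeunit chunk should show the split:

    -- U+1D11E MUSICAL SYMBOL G CLEF is codepoint 119070
    put numToCodepoint(119070) into tClef
    put the number of codepoints in tClef  -- 1
    put the number of codeunits in tClef   -- 2 (a "surrogate pair" in Unicode jargon)
    put the number of bytes in textEncode(tClef, "UTF-16")  -- 4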

Other common text encodings are as follows (a short sketch comparing them appears after the list):

  1. UTF-8. Uses between 1 and 4 bytes to encode a codepoint. Common on Linux and Mac OS X systems.
  2. UTF-32. Always uses 4 bytes per codepoint. Trades space efficiency for simplicity.
  3. MacRoman. Always 1 byte, non-Unicode. Legacy encoding on most Mac OS systems.
  4. ISO-8859-1. Always 1 byte, non-Unicode. Legacy encoding on many Linux systems.
  5. CP1252. Always 1 byte, non-Unicode. Legacy encoding on many Windows systems.
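
Here is the sketch promised above: it re-encodes the same one-character string with textEncode and reports the byte count for each encoding. I'm using the special "native" encoding name to stand for whichever legacy encoding – MacRoman, ISO-8859-1 or CP1252 – your platform uses; "é" exists in all three:

    repeat for each item tEncoding in "UTF-8,UTF-16,UTF-32,native"
        put tEncoding & ":" && the number of bytes in textEncode("é", tEncoding) & return after tReport
    end repeat
    put tReport
    -- Expect UTF-8: 2, UTF-16: 2, UTF-32: 4 and native: 1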

As you can see, there is a fair bit of complexity behind the transparent Unicode support in LiveCode 7. In my next post, I’ll show you how you can take advantage of knowing how it all fits together.

Comments

  • Michael - March 31, 2014

    Hi Fraser, I was wondering if the push for Unicode in LC7 will mean we can type the trademark symbol and have it recognized across both Mac and Windows?

    As it stands, I have to use a hacky sort of workaround involving typing ™ into the code and then having LiveCode perform an OS-specific NumToChar() to display a trademark symbol across all platforms.

    If Unicode allows that symbol to appear regardless of the platform, it would save a couple of lines of otherwise-redundant code.

    Fraser Gordon - April 1, 2014

    Devin got it spot-on: enter the TM symbol and it will be stored as Unicode and should work on all platforms. The script editor also supports Unicode text so you can do:


    put "AcmeSoft Widgetizer™" into field "Banner"

    and it should “just work™” 😉

  • Devin - March 31, 2014

    Not wanting to speak for Fraser here, but in LC 7 Unicode text in LiveCode fields “just works” cross-platform.

    If you have the DP of LC 7, try a short experiment:

    Create a field and type a ™ character into it. Now in the message box:

    put codepointToNum(char 1 of fld 1)

    It should put 8482 into the result area of the message box; i.e., a codepoint in the Unicode range. In LC before v7, charToNum(char 1 of fld 1) instead gives 170 – a value in the upper-ASCII range that is not reliably consistent between platforms. (Actually, charToNum in LC 7 still gives this result, for backward compatibility.)
