Examining Unicode, Part II – Digesting Text

by Fraser Gordon on April 2, 2014

In my last article, I described how Unicode text can be broken down into its individual subcomponents: characters are composed of one or more codepoints and these codepoints are encoded into code units which comprise one or more bytes. Expressed in LiveCode:

byte a of codeunit b of codepoint c of character d
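To see these levels in action, here is a minimal sketch that builds a decomposed “é” using the engine’s numToCodepoint function (the variable name is my own):

-- "é" assembled from "e" plus a combining acute accent (U+0301)
put "e" & numToCodepoint(0x0301) into tText
answer the number of characters in tText  -- 1: one visible character
answer the number of codepoints in tText -- 2: base letter plus accent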

This article will explain how you can use these components when processing Unicode text.

Characters to Codepoints: Normalisation

Following the chunk expression above, the first step is breaking a character up into its constituent codepoints. As discussed yesterday, these could be in either composed or decomposed form (or even somewhere in between!). Should you prefer one particular form over another, LiveCode includes a conversion function:

put normalizeText("é", "NFD") into tDecomposed

The “NFD” supplied as the second parameter says you want the string in Normalisation Form D – that is, decomposed. For composition, you would specify “NFC”. (There are also “NFKC” and “NFKD” forms, but these are not often useful; the “K” stands for “compatibility”.)
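The two forms look identical on screen but differ in length, which you can check with the codepoint chunk (a quick sketch):

answer the number of codepoints in normalizeText("é", "NFC") -- 1
answer the number of codepoints in normalizeText("é", "NFD") -- 2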

What do you think will happen when you execute the following line of code?

answer normalizeText("é", "NFC") is normalizeText("é", "NFD")

LiveCode will happily tell you that both strings are equal! This shouldn’t really be a surprise; when considered as graphical characters they are the same, even if the underlying representation is different. Just like case sensitivity, you have to explicitly ask LiveCode to treat them differently:

set the formSensitive to true

With that set, LiveCode will now consider composed and decomposed forms of the same text to be different, just as it treats “a” and “A” as different when in case-sensitive mode. Also like the caseSensitive property, it only applies to the current handler.
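Putting the two together, the earlier comparison now gives a different answer (a quick sketch):

set the formSensitive to true
answer normalizeText("é", "NFC") is normalizeText("é", "NFD") -- now "false"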

Let’s use this knowledge to do something useful. Consider a search function – maybe you’d like to match the word “café” when the user enters “cafe”. Here’s how you’d remove the accents from a piece of text:

function stripAccents pInput
   local tDecomposed
   local tStripped
   
   -- Separate the accents from the base letters
   put normalizeText(pInput, "NFD") into tDecomposed
   
   repeat for each codepoint c in tDecomposed
      -- Copy everything but the accent marks
      if codepointProperty(c, "Diacritic") is false then
         put c after tStripped
      end if
   end repeat
   
   return tStripped
end stripAccents
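With that in place, an accent-insensitive search is just a comparison of stripped strings. A small usage sketch (the variable names are my own):

put "café latte" into tHaystack
put "cafe" into tNeedle
if stripAccents(tNeedle) is in stripAccents(tHaystack) then
   answer "Match found!"
end if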

The stripAccents function also demonstrates another very useful function – codepointProperty – which will be our next port of call.

Codepoint Properties

The supporting library that LiveCode uses for much of its Unicode support (libICU) provides an interface for querying various properties of codepoints, and this is exposed to LiveCode scripts via the new codepointProperty function. To use it, simply pass a codepoint as the first parameter and the name of the property you’d like to retrieve as the second.

A large number of properties exist, some more useful than others. For an overview of the properties the Unicode Character Database provides, see Unicode Technical Report #44 (http://www.unicode.org/reports/tr44/). Some of my personal favourites are:

  1. “Name” – returns the official Unicode name of the codepoint
  2. “Script” – script the character belongs to, e.g. Latin or Cyrillic
  3. “Numeric Value” – the value of the character when interpreted as a number
  4. “Lowercase Mapping” and “Uppercase Mapping” – lower- or upper-cases the character


Example output from these properties:

answer codepointProperty("©", "Name")              -- "COPYRIGHT SIGN"
answer codepointProperty("Ω", "Script")            -- "Greek"
answer codepointProperty("¾", "Numeric Value")     -- 0.75
answer codepointProperty("ß", "Uppercase Mapping") -- "SS" 

Code Units and Bytes: Encoding

The LiveCode engine does a lot of work to hide the complications of Unicode from the user but, unfortunately, not all software is written in LiveCode. This means that when you talk to other software, you have to tell the engine how to represent the text you exchange. This is where text encodings come in – every time you read from or write to a file, process, network socket or URL, the text has to be encoded in some way.

To convert between text and one of these binary encodings, use one of the aptly named textEncode and textDecode functions:

put url("binfile:input.txt") into tInputEncoded
put textDecode(tInputEncoded, "UTF-8") into tInput
…
put textEncode(tOutput, "UTF-8") into tOutputEncoded
put tOutputEncoded into url("binfile:output.txt")

If you are using the open file/socket/process syntax, you can have the conversion done for you:

open file tFile for utf-8 text read
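A minimal sketch of reading an entire UTF-8 file this way (the variable names are mine; the engine decodes the bytes to text as it reads):

open file tFile for utf-8 text read
read from file tFile until EOF
put it into tContents
close file tFile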

Unfortunately, the URL syntax does not offer the same convenience. It can, however, auto-detect the correct encoding in some circumstances: when reading from a file URL, the beginning of the file is examined for a “byte order mark” that identifies the encoding of the text, and for HTTP URLs the encoding reported by the web server is used. If the encoding cannot be determined, the platform’s native text encoding is assumed. As the native encodings do not support Unicode, it is usually better to be explicit when writing to files and the like.
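If you do want a byte order mark in your own output, so the file can be auto-detected later, one approach (a sketch reusing the textEncode pattern from above) is to prepend the codepoint U+FEFF before encoding:

-- U+FEFF at the start of a stream serves as the byte order mark
put textEncode(numToCodepoint(0xFEFF) & tOutput, "UTF-16LE") into tOutputEncoded
put tOutputEncoded into url("binfile:output.txt")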

As an aside, we are hoping to improve the URL syntax to allow for the same auto-conversion, but have not yet settled on what it will be.



11 comments

  • Richmond - April 2, 2014

    Thanks for a great article. However, I don’t entirely understand “Diacritic” (or, put another way, how does the engine work out which Unicode points are diacritics and which are not?); and was sad to see that 7.0 dp 1 seems to have a wee problem with the search field in the Dictionary so one is unable to look up ‘codepointProperty’.

    Fraser Gordon - April 2, 2014

    It works it out by asking libICU. The ICU library includes the Unicode Character Database and the UCD defines a number of properties for characters, one of which is “Diacritic”. The UCD itself is rather sizeable – almost 30MB even in the compressed form that ICU stores it in.

    This is obviously quite a lot of data to be hauling around so we’re working on a feature for the standalone builder that will allow you to specify how much of this database you want to be included in your apps – some of the data is fairly esoteric and can probably be jettisoned without too much loss of functionality.

  • Richmond - April 2, 2014

    Well, that explains why LC 7.0 dp 1 is so much larger than earlier versions: all the ICU stuff wrapped up in it.

  • Dave - April 2, 2014

    Thank you Fraser – great explanations 🙂

    I have a really-really-simple question about Unicode which I’ve been wondering about for a while …

    I’ve been assuming (and hope I’m right in assuming) that in the new world of Unicode I’ll still be able to write “delete the last char of tList” after a repeat where I’ve been adding a variable plus a cr, return, tab or similar to tList

    Fraser Gordon - April 2, 2014

    I can’t think of any reason why it wouldn’t work – tab and cr are still characters. If, for some reason, it doesn’t work then it is most likely a bug.

    Dave - April 2, 2014

    Excellent! Just wanted confirmation 🙂

  • Paul Dupuis - April 2, 2014

    Very helpful post on the expanded Unicode functionality. Are there any new file functions to determine the encoding of a text file? i.e. if you have a file path to a “text” file, is there any new 7.0 feature to help the programmer determine whether the file is UTF-8, UTF-16, MacRoman, Windows text or whatever, so it can then be read properly?

    Fraser Gordon - April 2, 2014

    In some circumstances, the engine will guess for you. But it does seem like a good idea to have a guessEncoding or similar function to tell you what the file looks like it might be.

    Such a function could never be 100% accurate but, as long as people are aware of that, it should be correct in most cases.
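    Such a guess could be as simple as sniffing the first few bytes for a byte order mark. A script-level sketch of the idea (guessEncoding is my own hypothetical handler here, not engine syntax):

    function guessEncoding pData
       -- UTF-8 BOM: bytes EF BB BF; UTF-16 BOM: FF FE (little-endian) or FE FF (big-endian)
       if byte 1 to 3 of pData is numToByte(0xEF) & numToByte(0xBB) & numToByte(0xBF) then
          return "UTF-8"
       else if byte 1 to 2 of pData is numToByte(0xFF) & numToByte(0xFE) then
          return "UTF-16LE"
       else if byte 1 to 2 of pData is numToByte(0xFE) & numToByte(0xFF) then
          return "UTF-16BE"
       else
          -- No BOM found: fall back to a default or deeper analysis
          return empty
       end if
    end guessEncoding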

    Tom B. - February 23, 2016

    Fraser, thanks for this article. You mention ‘the beginning of the file is examined for a “byte order mark” that specifies the encoding of the text.’ How is that “byte order mark” created? Can I specify a “byte order mark” in a text file my stack exports so I can later re-import that file with the same encoding? Thanks.

