In my last article, I described how Unicode text can be broken down into its individual subcomponents: characters are composed of one or more codepoints, and these codepoints are encoded into code units, each of which comprises one or more bytes. Expressed in LiveCode:
byte a of codeunit b of codepoint c of character d
This article will explain how you can use these components when processing Unicode text.
Characters to Codepoints: Normalisation
Following the chunk expression above, the first step is breaking a character up into its constituent codepoints. As discussed yesterday, these could be in either composed or decomposed form (or even somewhere in between!). Should you prefer one particular form over another, LiveCode includes a conversion function:
put normalizeText("é", "NFD") into tDecomposed
The “NFD” supplied as the second parameter says you want the string in a Normal Form, Decomposed. For composition, you would specify “NFC”. (There are also “NFKC” and “NFKD” forms but these are not often useful. The “K” stands for “compatibility”…).
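If you want to see the difference in representation for yourself, counting codepoints makes it visible. Here is a quick sketch; the expected counts assume the fully composed and fully decomposed forms of “é”:
put normalizeText("é", "NFC") into tComposed
put normalizeText("é", "NFD") into tDecomposed
answer the number of codepoints in tComposed -- should report 1
answer the number of codepoints in tDecomposed -- should report 2: "e" plus a combining acute accent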
What do you think will happen when you execute the following line of code?
answer normalizeText("é", "NFC") is normalizeText("é", "NFD")
LiveCode will happily tell you that both strings are equal! This shouldn’t really be a surprise; when considered as graphical characters they are the same, even if the underlying representation is different. Just like case sensitivity, you have to explicitly ask LiveCode to treat them differently:
set the formSensitive to true
With that set, LiveCode will now consider composed and decomposed forms of the same text to be different, just as it treats “a” and “A” as different when in case-sensitive mode. Also like the caseSensitive property, it only applies to the current handler.
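For example, re-running the earlier comparison inside a handler that sets the property should now report false (a quick sketch):
set the formSensitive to true
answer normalizeText("é", "NFC") is normalizeText("é", "NFD") -- should now report false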
Let’s use this knowledge to do something useful. Consider a search function: perhaps you’d like to match the word “café” when the user enters “cafe”. Here’s how you’d remove the accents from a piece of text:
function stripAccents pInput
   local tDecomposed
   local tStripped

   -- Separate the accents from the base letters
   put normalizeText(pInput, "NFD") into tDecomposed

   repeat for each codepoint c in tDecomposed
      -- Copy everything but the accent marks
      if codepointProperty(c, "Diacritic") is false then
         put c after tStripped
      end if
   end repeat

   return tStripped
end stripAccents
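Here is one way you might use it in the search scenario mentioned above. This is only a sketch; tUserQuery and tDocumentText are hypothetical variables standing in for your own data:
-- Accent-insensitive containment test (variables are placeholders)
if stripAccents(tUserQuery) is in stripAccents(tDocumentText) then
   answer "Match found"
end if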
The function also demonstrates another very useful function, codepointProperty, which will be our next port of call.
Codepoint Properties
The supporting library that LiveCode uses for much of its Unicode support (libICU) provides an interface for querying various properties of codepoints, and this is exposed to LiveCode scripts via the new codepointProperty function. To use it, simply pass a codepoint as the first parameter and the name of the property you’d like to retrieve as the second.
There are a large number of properties that exist, some of which are more useful than others. For an overview of the properties that the Unicode character database provides, please see here (http://www.unicode.org/reports/tr44/). Some of my personal favourites are:
- “Name” – returns the official Unicode name of the codepoint
- “Script” – script the character belongs to, e.g. Latin or Cyrillic
- “Numeric Value” – the value of the character when interpreted as a number
- “Lowercase Mapping” and “Uppercase Mapping” – lower- or upper-cases the character
Example output from these properties:
answer codepointProperty("©", "Name") -- "COPYRIGHT SIGN"
answer codepointProperty("Ω", "Script") -- "Greek"
answer codepointProperty("¾", "Numeric Value") -- 0.75
answer codepointProperty("ß", "Uppercase Mapping") -- "SS"
Code Units and Bytes: Encoding
The LiveCode engine does a lot of work to hide the complications of Unicode from the user but, unfortunately, not all software is written in LiveCode. This means that when you talk to other software, you have to tell the engine how to represent the Unicode text it sends and receives. This is where text encodings come in: every time you read from or write to a file, process, network socket or URL, the text has to be encoded in some way.
To convert between text and one of these binary encodings, use one of the aptly named textEncode and textDecode functions:
put url("binfile:input.txt") into tInputEncoded
put textDecode(tInputEncoded, "UTF-8") into tInput
…
put textEncode(tOutput, "UTF-8") into tOutputEncoded
put tOutputEncoded into url("binfile:output.txt")
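If you read and write files like this in several places, it can be worth wrapping the calls up. The helpers below are only a sketch (the names readUTF8File and writeUTF8File are my own, not part of the engine) and assume the files on disk really are UTF-8:
function readUTF8File pPath
   -- Read the raw bytes and decode them explicitly
   return textDecode(url("binfile:" & pPath), "UTF-8")
end readUTF8File

on writeUTF8File pPath, pText
   -- Encode explicitly and write the raw bytes
   put textEncode(pText, "UTF-8") into url("binfile:" & pPath)
end writeUTF8File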
If you are using the open file/socket/process syntax, you can have the conversion done for you:
open file tFile for utf-8 text read
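Put together, reading a whole file this way might look like the following sketch, where tFile is assumed to hold the path to a UTF-8 text file:
open file tFile for utf-8 text read
read from file tFile until EOF
put it into tContents
close file tFile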
Unfortunately, the URL syntax does not offer the same convenience. It can, however, auto-detect the correct encoding in some circumstances: when reading from a file URL, the beginning of the file is examined for a “byte order mark” that specifies the encoding of the text, and for HTTP URLs the encoding reported by the web server is used. If no encoding can be determined, the platform’s native text encoding is assumed. As the native encodings do not support Unicode, it is usually better to be explicit when writing to files and the like.
As an aside, we are hoping to improve the URL syntax to allow the same automatic conversion, but we have not yet settled on what it will be.