Unicode

Binary vs Text

by Fraser Gordon on June 2, 2014 8 comments

One aspect of LiveCode 7.0 that I keep bringing up in my blog posts is the distinction between textual data and binary data. Although LiveCode does not implement data types for scripts, it does use them internally. Being aware of how the engine treats these types is important for getting the maximum speed out of your scripts.

The four basic types that the engine operates on are Text, BinaryData, Numbers and Arrays – there are a number of sub-types for each of these but the majority of them are not important here. Arrays are very different to the other three (for example, you can’t treat a number as an array or vice-versa) so I’ll ignore them in this blog post.

Bird, Plane or Binary?

First things first: what exactly is each of the types?

Number: integer or floating-point (decimal) value
Binary: a sequence of bytes
Text: human-readable text

The distinction between Binary and Text doesn’t seem all that clear; after all, isn’t text ultimately just a sequence of bytes stored somewhere?

The important difference is how you, the script writer, intend to use the data: if you think terms like "char" or "word" are meaningful for your data, it is probably Text. On the other hand, Binary is just a bunch of bytes to which the engine assigns no particular meaning. Another useful rule-of-thumb is that anything a human will interact with is text while anything a computer will use is binary (notable exceptions to this are HTML and XML, which are text-based formats).

Promotions

LiveCode will happily hide issues of binary vs text vs number from you. It is, however, very useful to know exactly how it does this, because getting it wrong can lead to unexpected results.

The types used by the engine can be thought of as a hierarchy, with the least meaningful types at the bottom and most meaningful at the top. This order, from top-to-bottom is:

Number
Text
Binary

Inside the engine, types get moved up and down this hierarchy as necessary. A type can always be moved downwards to a less-specific type but cannot always be moved upwards. For example, the string "hello" isn’t exactly meaningful as a number.

One potentially surprising consequence of this is the code "put byte 1 of 42" will return ‘4’ rather than the byte with value 42. This happens because the conversion from Number to Binary passes through Text, where the number is converted to the string "42", of which the first byte is "4".
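To actually get the byte with value 42, the number has to be turned into Binary explicitly rather than being allowed to pass through Text. A minimal sketch, assuming the built-in numToByte and byteToNum functions; the handler name is made up for illustration:

```livecode
on showByteOfNumber
   local tByte
   -- Implicit route: Number -> Text "42" -> Binary, so this shows "4"
   answer byte 1 of 42
   -- Explicit route: build one byte with the value 42, then read its value back
   put numToByte(42) into tByte
   answer byteToNum(byte 1 of tByte)   -- 42
end showByteOfNumber
```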

Types in a Typeless Language

Long-time LiveCoders amongst you will no doubt now be thinking that LiveCode is a typeless language and wondering how this information can be used. The key to it is simply being consistent in how you use any piece of data.

Implicit type-conversion within the engine can be slow, especially for large chunks of data, so you don’t want to do things like this:

answer byte "1" to -1 of "¡Hola señor!"

That short line of code contains a number of unnecessary type conversions. The obvious one is specifying the byte using a string instead of a number. It also contains a conversion of Text to Binary – the "byte" chunk expression indicates you want to treat the input as binary data instead of text. Finally, the "answer" command expects Text, so it converts it back again.

Although the example is contrived, you do need to be careful with chunk expressions. Using "byte" means that you want the data to be Binary while any other chunk type is used for Text.
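A tidier version of that line keeps each value in one type throughout. This is only a sketch (the handler name is made up): use a numeric index with a Text chunk when the data is Text, and make the conversion explicit with textEncode and textDecode when bytes are genuinely wanted.

```livecode
on showGreeting
   local tBytes
   -- Text in, Text chunk, Text out: no hidden conversions
   answer char 1 to -1 of "¡Hola señor!"

   -- If bytes are really needed, state the encoding explicitly in both directions
   put textEncode("¡Hola señor!", "UTF-8") into tBytes
   answer textDecode(byte 1 to -1 of tBytes, "UTF-8")
end showGreeting
```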

Text is Slow

It isn’t just type conversions that are slow; treating data as the wrong type can also be slow. In particular, comparison operations are slower for Text than for Binary or Number. To compare text, the caseSensitive and formSensitive properties have to be taken into consideration; case conversion and text normalisation operations mean that the Text operations are slower than the equivalent for Binary.

On the other hand, Text is far more flexible than Binary. Operations on Binary are limited to the following (anything outside this will result in conversion to Text):

"is" and "is not" (must be Binary on both sides of the operation)
"byte" chunk expressions
"byteOffset" function
concatenation ("put after", "put before" and "&" but not "&&")
I/O from/to files and sockets opened as binary
"binfile:" URLs
functions explicitly dealing with binary data (e.g. compress, base64encode)
properties marked as being binary data
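As an illustration, the sketch below stays within that list: it encodes once and then uses only byteOffset and byte chunks on the resulting Binary. The function and variable names are hypothetical.

```livecode
function bytesBeforeColon pText
   local tData, tMarker, tPos
   -- Convert to Binary once, up front
   put textEncode(pText, "UTF-8") into tData
   put textEncode(":", "UTF-8") into tMarker

   -- byteOffset and byte chunks are in the list of Binary operations above
   put byteOffset(tMarker, tData) into tPos
   if tPos is 0 then return tData

   -- The result is still Binary; decode it with textDecode only when Text is needed
   return byte 1 to (tPos - 1) of tData
end bytesBeforeColon
```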

Text Encodings and Strict-Mode

Those of you who have been paying attention to my previous blog posts (if you exist!) will have heard me mention that to convert between Text and Binary you need to use textEncode and textDecode. With these functions, you specify an encoding. But when the engine does it automatically, what encoding does it use?

The answer is the "native" encoding of the OS on which LiveCode is running. This means "CP1252" on Windows, "MacRoman" on OS X and iOS and "ISO-8859-1" on Linux and Android. All of these platforms fully support Unicode these days but these were the traditional encodings on these platforms before the Unicode standard came about. LiveCode keeps these encodings for backwards-compatibility.
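The practical difference is easy to see with a character that the legacy encodings store in one byte but UTF-8 stores in two. A small sketch, assuming "native" is accepted as an encoding name by textEncode and that numToCodepoint is available; the handler name is made up:

```livecode
on compareEncodings
   local tText
   put numToCodepoint(233) into tText   -- U+00E9, "é" in composed form
   -- One byte in CP1252, MacRoman and ISO-8859-1 alike
   answer the number of bytes in textEncode(tText, "native")
   -- Two bytes (0xC3 0xA9) in UTF-8
   answer the number of bytes in textEncode(tText, "UTF-8")
end compareEncodings
```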

The reason LiveCode continues to automatically convert between these encodings is because the engine previously did not treat Text and Binary any differently (one impact of this is having to set caseSensitive to true before doing comparisons with binary data). In some situations, you might want to turn this legacy behaviour off.

To aid this, we are planning to (at some point in the future) implement a "strict mode" feature similar to the existing Variable Checking option – in this mode, there will be no auto-conversion between Binary and Text and any attempt to do so will throw an error (similar to using a non-numeric string where a number is expected). Like variable checking, it will be optional but using it should help find bugs and potential speed improvements.

LiveCode 7 – ‘put 0 into slowdown’

by Sébastien Nouat on April 30, 2014 9 comments

Since we finished refactoring the engine and got every piece of functionality working as it should, several community members have raised performance issues in different parts of the new engine. I have been working closely with Ali on this aspect, and here is my take on it, with fewer eggs but a nice musical clef.

We were of course aware of this slowdown, which was mainly caused by the fact the engine was working in a uniform way, regardless of the kind of characters in the strings in use. This consideration of the strings was the first part of the global refactoring plan, to allow us to bring modifications impacting the whole engine instead of micro-changes targeting one area, as had been done on the former versions; the pitfalls behind a single modification were obviously numerous.

Chiefly, the main slowdown comes with the handling of Unicode: there is much, much work to be done any time a string operation is executed. As Fraser explained in his blog post, Unicode is not simply the ability to use characters outside the ISO-8859-1 encoding. It also comes with all the subtlety of characters made up of more than one codepoint or code unit, be it:

– combining chars, to add as many accents as needed to the same letter (OS X readily gives you this type of character, since it uses the NFD form).

– surrogate pairs: who uses music? The treble clef (U+1D11E) is one of these characters stored as a surrogate pair.

As one may guess, it becomes slightly more difficult to compare two strings when they can be the same, even having a different number of characters – even worse, finding a substring within a string is an operation starting from the beginning, since no character indexing is valid when combining chars or surrogate pairs are present. And since LiveCode users love to use ‘items’, ‘word’, ‘line’ or any of the new chunks introduced in 7.0, it would be best to bring the script executions back to their pre-Unicode timing, when possible.

The main goal has been to avoid as much as possible using the CPU-costly Unicode functions, which boils down to storing more states for a string – is it native, combined, does it include surrogate pairs? This makes the string operations work differently according to their content, and discards any slowdown which could be caused by Unicode’s intricate rules. In the end, some operations even became faster than they were before – ‘before’ shows it for instance.

In the same way, the engine is now clever enough to keep in mind when a string has been converted to a number. Since a LiveCode variable only stores strings, that is something which comes in quite handy when used in a loop, and probably could be tagged as a speedup!

Following on the examples coming as bug reports against the slowdown, here is a comparison between 6.6.1 and (future) DP-3:

That should have you enjoying the DP-3!

Examining Unicode, Part II – Digesting Text

by Fraser Gordon on April 2, 2014 11 comments

In my last article, I described how Unicode text can be broken down into its individual subcomponents: characters are composed of one or more codepoints and these codepoints are encoded into code units which comprise one or more bytes. Expressed in LiveCode:

byte a of codeunit b of codepoint c of character d

This article will explain how you can use these components when processing Unicode text.

Characters to Codepoints: Normalisation

Following the chunk expression above, the first step is breaking a character up into its constituent codepoints. As discussed yesterday, these could be in either composed or decomposed form (or even somewhere in between!). Should you prefer one particular form over another, LiveCode includes a conversion function:

put normalizeText("é", "NFD") into tDecomposed

The “NFD” supplied as the second parameter says you want the string in a Normal Form, Decomposed. For composition, you would specify “NFC”. (There are also “NFKC” and “NFKD” forms but these are not often useful. The “K” stands for “compatibility”…).

What do you think will happen when you execute the following line of code?

answer normalizeText("é", "NFC") is normalizeText("é", "NFD")

LiveCode will happily tell you that both strings are equal! This shouldn’t really be a surprise; when considered as graphical characters they are the same, even if the underlying representation is different. Just like case sensitivity, you have to explicitly ask LiveCode to treat them differently:

set the formSensitive to true

With that set, LiveCode will now consider composed and decomposed forms of the same text to be different, just as it treats “a” and “A” as different when in case-sensitive mode. Also like the caseSensitive property, it only applies to the current handler.
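Put together, the two comparisons look like this. It is a sketch with a made-up handler name, and the results in the comments are the ones described above:

```livecode
on compareForms
   local tComposed, tDecomposed
   put normalizeText("é", "NFC") into tComposed
   put normalizeText("é", "NFD") into tDecomposed

   -- Default behaviour: the two forms compare as equal
   answer tComposed is tDecomposed   -- true

   -- Applies to this handler only, like the caseSensitive
   set the formSensitive to true
   answer tComposed is tDecomposed   -- false
end compareForms
```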

Let’s use this knowledge to do something useful. Consider a search function – maybe you’d like to match the word “café” when the user enters “cafe” – here’s how you’d remove accents from a bunch of text:

function stripAccents pInput
   local tDecomposed
   local tStripped

   -- Separate the accents from the base letters
   put normalizeText(pInput, "NFD") into tDecomposed

   repeat for each codepoint c in tDecomposed
      -- Copy everything but the accent marks
      if codepointProperty(c, "Diacritic") is false then
         put c after tStripped
      end if
   end repeat

   return tStripped
end stripAccents

The function also demonstrates another very useful function – codepointProperty – which will be our next port-of-call.

Codepoint Properties

The supporting library that LiveCode uses to assist in some of the Unicode support (libICU) provides an interface for querying various properties of codepoints and this is exposed to LiveCode scripts via the new codepointProperty function. To use this function, simply provide a codepoint as the first parameter and the name of the property you’d like to retrieve as the second parameter.

There are a large number of properties that exist, some of which are more useful than others. For an overview of the properties that the Unicode character database provides, please see here (http://www.unicode.org/reports/tr44/). Some of my personal favourites are:

“Name” – returns the official Unicode name of the codepoint
“Script” – script the character belongs to, e.g. Latin or Cyrillic
“Numeric Value” – the value of the character when interpreted as a number
“Lowercase Mapping” and “Uppercase Mapping” – lower- or upper-cases the character

Example output from these properties:

answer codepointProperty("©", "Name")              -- "COPYRIGHT SIGN"
answer codepointProperty("Ω", "Script")            -- "Greek"
answer codepointProperty("¾", "Numeric Value")     -- 0.75
answer codepointProperty("ß", "Uppercase Mapping") -- "SS"

Code Units and Bytes: Encoding

The LiveCode engine does a lot of work to hide the complications of Unicode from the user but, unfortunately, not all software is written in LiveCode. This means that when you talk to other software, you have to tell the engine how to talk to it in Unicode. This is where text encodings come in – every time you read from or write to a file, process, network socket or URL, text has to be encoded in some way.

To convert between text and one of these binary encodings, use one of the aptly named textEncode and textDecode functions:

put url("binfile:input.txt") into tInputEncoded
put textDecode(tInputEncoded, "UTF-8") into tInput
…
put textEncode(tOutput, "UTF-8") into tOutputEncoded
put tOutputEncoded into url("binfile:output.txt")

If you are using the open file/socket/process syntax, you can have the conversion done for you:

open file tFile for utf-8 text read

Unfortunately, the URL syntax does not offer the same convenience. It can, however, auto-detect the correct encoding to use in some circumstances: when reading from a file URL, the beginning of the file is examined for a “byte order mark” that specifies the encoding of the text. It also uses the encoding returned by the web server when HTTP URLs are used. If the encoding is not recognised, it assumes the platform’s native text encoding is used. As the native encodings do not support Unicode, it is usually better to be explicit when writing to files, etc.

As an aside, we are hoping to improve the URL syntax in order to allow for the same auto-conversion but have not yet settled on what it will be.

Examining Unicode, Part I – The dissection

by Fraser Gordon on March 31, 2014 5 comments

As I mentioned in my previous blog post, Unicode text is hard (which is one of the reasons it has taken such a monumental effort to get LiveCode 7.0 ready for release – it is now in public testing if you’d like to try it out). In order to make everything work transparently for the writers and users of LiveCode stacks, a lot has to go on behind the scenes. In this post and its follow-up, I hope to explain how some of these innards work. This first post is a bit technical but will lay the groundwork for some new Unicode text processing techniques.

The most important thing with Unicode is to understand what is meant by a character – different people have different definitions, some quite technical. Older computer software will often refer to 8-bit bytes as a character, a standard which LiveCode and its predecessors followed. Sometimes, "character" is used for the symbols defined by the Unicode standard (these are more properly termed "codepoints"). Neither of these is necessarily what a human reader would think of as a character, however.

Consider the letter "é" – that’s obviously a single character, right? Well, it depends on who you ask… Considered as 8-bit bytes, it could be anywhere between 1 and 8 "characters". Looking at it with Unicode-coloured glasses, it could be either 1 or 2 codepoints. However, in LiveCode 7, it is always a single character. If you were a Unicode geek like me, you’d call this LiveCode definition a "grapheme cluster".

Why do these different interpretations arise? If you’ll bear with me, I’ll take it apart piece-by-piece.

First come the codepoints. The Unicode standard defines two types of representation for accented characters known as "composed" and "decomposed". Continuing with "é" as our example, Unicode would call this U+00E9 "LATIN SMALL LETTER E WITH ACUTE" in its composed form. In its decomposed form, it would be a U+0065 "LATIN SMALL LETTER E" followed by U+0301 "COMBINING ACUTE ACCENT". Basically, composed versus decomposed is the choice between accented characters being characters in their own right or instead being an un-accented character with an accent atop it. Conversion between these forms is called "normalisation" and will be discussed in my next post.
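In LiveCode 7 the difference can be observed from script with the codepoint chunk and the normalizeText function (both covered in the follow-up post). A minimal sketch, with a hypothetical handler name:

```livecode
on countCodepoints
   local tComposed, tDecomposed
   put normalizeText("é", "NFC") into tComposed
   put normalizeText("é", "NFD") into tDecomposed
   answer the number of codepoints in tComposed     -- 1: U+00E9
   answer the number of codepoints in tDecomposed   -- 2: U+0065 then U+0301
   answer the number of chars in tDecomposed        -- still 1 character
end countCodepoints
```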

Next comes the variable number of bytes that are used to store these codepoints – this comes down to how these codepoints are encoded. Sometimes, old 8-bit encodings have a single byte to represent a particular composed character. Unfortunately, these encodings can only represent 256 different characters so Unicode encodings are used instead. The particular encoding used within LiveCode is UTF-16 (but this is internal to the engine and isn’t visible to LiveCode scripts).

The UTF-16 encoding uses 16-bit values to store codepoints, termed "code units". This extra term is needed because although many languages have all of their symbols representable using a single code unit, a number (including Chinese) need two code units per codepoint for certain characters, due to the large number of symbols within the language. Because of this, a codepoint can be either 2 or 4 bytes in length when encoded with UTF-16.
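Any codepoint above U+FFFF needs two code units, for example the musical symbol G clef (U+1D11E, decimal 119070). A rough sketch of how the chunk counts differ for it, assuming the numToCodepoint function is available; the handler name is made up:

```livecode
on countClefUnits
   local tClef
   put numToCodepoint(119070) into tClef
   answer the number of codepoints in tClef                      -- 1
   answer the number of codeunits in tClef                       -- 2 (a surrogate pair)
   answer the number of bytes in textEncode(tClef, "UTF-16LE")   -- 4
end countClefUnits
```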

Other common text encodings are:

UTF-8. Uses between 1 and 4 bytes to encode codepoints. Common on Linux and MacOS X systems.
UTF-32. Always uses 4 bytes per codepoint. Trades space efficiency for simplicity.
MacRoman. Always 1 byte, non-Unicode. Legacy encoding on most MacOS systems.
ISO-8859-1. Always 1 byte, non-Unicode. Legacy encoding on many Linux systems.
CP1252. Always 1 byte, non-Unicode. Legacy encoding on many Windows systems.

As you can see, there is a fair bit of complexity behind the transparent Unicode support in LiveCode 7. In my next post, I’ll show you how you can take advantage of knowing how it all fits together.

7 Chunks for LiveCode 7

by Ali Lloyd on March 20, 2014 7 comments

There are a lot of things to tell you about LiveCode 7.0, and of course we will be writing about many of them in upcoming blog posts. For this blog I’m going to tell you a bit about some of the new chunk types that have been introduced. There are 7 new chunk types if you count the new synonym, which I do if only to allow for the chiastic blog title – byte, codeunit, codepoint, trueWord, segment, sentence and paragraph.

Arguably the most important new chunk types are the sentence and trueWord chunks. These new chunk types make analysing text much easier than it was previously. Suppose you want to know how the first sentence of Mary Shelley’s Frankenstein compares to the rest of the text, in terms of the number of words it contains (I’m sure you do). Well you can find out if it is smaller than average with ease:
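A sketch of how that comparison might be written, assuming the text has already been loaded into a variable (the function name is made up for illustration):

```livecode
function firstSentenceIsShorter pText
   local tFirstCount, tAverage
   -- Words in the opening sentence
   put the number of trueWords in sentence 1 of pText into tFirstCount
   -- Average words per sentence over the whole text
   put (the number of trueWords in pText) / (the number of sentences in pText) into tAverage
   return tFirstCount < tAverage
end firstSentenceIsShorter
```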

But you needn’t be restricted to English texts, control names or indeed variable names in 7.0. Perhaps you’re interested in the frequencies of different words in the original Russian of Dostoevsky’s Crime and Punishment (I’ve no doubt that you are). Simply repeat over the trueWords of your “Преступление и наказание” field, count, and process.
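The counting step could be sketched like this, using an array keyed by the lower-cased word. The function name is hypothetical, and the text is assumed to have been fetched from the field already:

```livecode
function wordFrequencies pText
   local tCounts
   repeat for each trueWord tWord in pText
      -- An empty array element counts as 0, so the first "add" yields 1
      add 1 to tCounts[toLower(tWord)]
   end repeat
   return tCounts
end wordFrequencies
```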

The most important feature of the new trueWord and sentence chunks is that they draw on a large base of rules about sentence and word boundaries provided by the ICU library. This means in particular that word breaks are identified in places that would be impossible to detect in older versions of LiveCode. Suppose you’ve got a sentence written in Chinese which you want to divide into its constituent words (I’m absolutely certain you have). In this example I’m just using “Hello World.”

The segment chunk, a synonym of the old word chunk, looks for space characters as delimiters, whereas the trueWord chunk uses the ICU data to split the string correctly.
Note that in addition to all the variables, LiveCode 7 is perfectly happy to have a handler named 处理 (“process” – although I don’t claim it is conjugated correctly for this context)! I also exported the snapshot of that stack to a file named “快照.png” from the message box because… well, just because.

Also added is the paragraph chunk. It is very similar to the existing line chunk except it can also be delimited by the Unicode paragraph separator character. This means that it is more useful for processing text which is or will be displayed in a field – field text breaks are precisely the delimiters of the paragraph chunk.

Finally there are the codepoint and codeunit chunks. Well, there’s also the byte chunk, but that should only be used for binary data. For greater detail on these chunks, you should consult the release notes for LiveCode 7.0, but I thought I’d mention a potential application that occurred to me. Imagine, if you will, that you want to make sure Georges Perec and his translators had done a good job with their constrained writing (of course you do). Well the codepoint chunk (with a bit of help from normalizeText) can help you out!
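One way that check might look, as a sketch: decompose the text so that accented letters such as "é" split into a base letter plus an accent, then look for the forbidden letter codepoint by codepoint. The function name and parameters are hypothetical.

```livecode
function usesLetter pText, pLetter
   local tDecomposed
   -- NFD separates base letters from their accents
   put normalizeText(pText, "NFD") into tDecomposed
   repeat for each codepoint tCodepoint in tDecomposed
      -- Comparison ignores case unless the caseSensitive is set to true
      if tCodepoint is pLetter then return true
   end repeat
   return false
end usesLetter
```

For a text that really does avoid the letter, usesLetter(tNovel, "e") should return false, where tNovel is a variable holding the text.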

Now here’s a challenge. What’s the longest chunk expression you can find a genuine use for? Put codeunit a of codepoint b of char c of token d of trueWord e of segment f of item g of sentence h of paragraph i of line j… (of field k of group l…)

To upgrade to this release please download the installers directly at: http://downloads.livecode.com/livecode/

To view the release notes please visit: http://downloads.livecode.com/livecode/7_0_0/LiveCodeNotes-7_0_0_dp_1.pdf

7.0 Alchemy

by Sébastien Nouat on March 14, 2014 1 comment

The longed-for day of the 7.0 engine’s stable release is coming soon. As some of my colleagues have already described, the most appealing feature of this new version is the handling of Unicode characters. But that’s not all: allowing a flow so different from plain text to run through the engine touched so many parts of it that a deep refactoring had to be carried out.

In the 7.0 developers team, we’ve been fighting for almost a year the massive beast that renewing the whole engine was, and it’s now more than ever tamed and keen to act as expected.
All this process involved the use of artful alchemy; starting from a closed-mind golem, we had to move and refashion almost every single stone he was made of, and to rearrange all the cogs linking those moving parts to keep the communication between them going in the same way. This included giving a new shape to his brain; and he now has the ability to learn more easily a new knowledge or update independently what he already has, thus making the addition of a new feature a simpler way to go.

Now that the final shape of the new engine is finished, the only thing separating the community from the first stable release is the correctness of the cogs. All these relocations and updates of parts have let a little play creep into the interaction of a few mechanisms, and with such a huge number of changes, tracking down those tiny errors is far easier with your help: given the symptoms, it’s always possible for us to find the issue and bring the engine closer to the perfection we are aiming for.

The best point in all of this is not that our creature can now handle Unicode: it’s more about all the advantages the community’s applications will be able to gain from it!

Diverging and Merging

by Ali Lloyd on March 7, 2014 No comments

As you probably already know we are getting close to the first developer preview of the Unicode-compatible LiveCode 7. It’s very exciting for me personally, as I have been working on this in one way or another since I started here in November 2012. I knew there was a lot to do, but suffice it to say that at the time I didn’t realise quite how much! In many ways it has been the perfect introduction to the LiveCode engine for me – a not-so-whistlestop tour around all the areas touched on by the refactoring project, which is to say almost all of it.

One of the challenges of maintaining the refactored engine is keeping it up to date, by continually merging in new bug fixes and features. It can already be a little tricky to resolve merge conflicts when so much of the engine has changed, but in many cases code has been moved from its original location to a new file. Sometimes this can result in code getting merged automatically into blocks of code that are no longer executed, so we’ve had to come up with a system to ensure that any updates which land in the old location are flagged up.

Challenged by Ben to come up with a name for this system, I opted for syntax caravan – the idea being that the syntax branch is diverging from the main branch (going on holiday), but needs to keep being updated on what the main branch is doing (needs to bring the main branch in a caravan). Ok it doesn’t quite work as an analogy, but I was thinking fast and I was asked to be cryptic! It’s a bit snappier than ‘the occasional telephone call from master to tell syntax what it is up to’.

Here is a screenshot of the tool written by Seb which shows the syntax caravan in action:


You can see that the usePixelScaling property, and a change to the pixelScale property, have been happily merged by Git. Unfortunately the active version of that code now resides in a completely different file. Thanks to the syntax caravan we can ensure that none of these things get lost in the merge. It shouldn’t be too long now before the syntax branch comes home, and we can put the caravan away for good. Or at least until the next engine overhaul…

7.0 – Unicode Strikes Back

by Fraser Gordon on February 27, 2014 14 comments

It has been a number of months since Ali reported our progress on the engine refactoring project and the integration of Unicode into LiveCode (Slaying the Unicode Monster) and in that time, much has changed. The project is nearly complete and, as Kevin said yesterday, we are approaching a DP release.

Supporting Unicode and international text has required extensive changes throughout the engine – too extensive to cover in a single blog entry – so today I’ll explain the changes to one of the most visible parts of LiveCode: fields.

In the current releases of LiveCode, it is possible to use Unicode text in fields. Unfortunately, it requires special syntax and can be a bit cumbersome to manipulate properly. In addition, the support is fairly rudimentary and doesn’t work properly for languages requiring complex text layout (for example, Arabic).

7.0 will change all that – Unicode text in fields (and throughout the engine) is manipulated the same way as any other text. In fact, the engine doesn’t distinguish between Unicode text and “plain” text anymore – they are both just text. But that’s a story for another time.

Most of the changes in the field to support Unicode are “below-the-hood” and won’t be immediately apparent. They have, however, allowed for a much greater deal of flexibility in how text in fields is processed and I’ll summarise what this has allowed us to do:

East Asian languages such as Chinese and Japanese. Previously, these could be entered but the field had difficulty with certain characters that required a certain type of Unicode encoding called “surrogate pairs” – the components of these pairs were treated as separate characters, causing problems when one of them was deleted or had its style changed.

Complex scripts where multiple character fragments combine to form one graphical character (called a “grapheme”). For text manipulation, these are now treated as single characters (and new chunk types “codepoint” and “codeunit” have been added for those who need to access the individual components).

Cursor navigation working appropriately for non-English text. Navigating left and right through a field happens on grapheme boundaries, ensuring that the cursor never ends up between a character and its accent. The keyboard commands for moving forwards and backwards by whole words also work for text that doesn’t use spaces as word separators (e.g. Chinese).

Right-to-left and bidirectional text. Mixing left-to-right text with right-to-left text (e.g. Hebrew and Arabic) in a field now lays the text out in the correct order, including the situation when LTR is embedded within RTL or vice-versa.

All of this is available without any extra work on the part of a developer creating a LiveCode app – our goal with our Unicode support is to make it just as easy to create an app with Unicode support as without. We hope you’ll be pleased with the result!
