Better Theming in LiveCode 8

by Fraser Gordon on March 9, 2016

The design of user interfaces has changed substantially since LiveCode was first created. For example, instead of a single font used throughout the UI, modern designs use a variety of different fonts for different purposes, in varying sizes and styles.



Full Access to the Clipboard

by Fraser Gordon on November 4, 2015

LiveCode has had support for clipboard operations since the very early days, using the clipboard function and the clipboardData property. These are perfectly good if plain text, RTF and images are all you need to put on the clipboard, but sometimes you need a little more than that.
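For example, reading and writing plain text with these classic APIs looks something like this (a minimal sketch; tClipText is a hypothetical variable):

put the clipboardData["text"] into tClipText  -- read plain text from the clipboard
set the clipboardData["text"] to "Hello!"     -- place plain text on the clipboard
put the clipboard                             -- reports the kind of data there, e.g. "text"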


How the Business Application Framework Works

by Fraser Gordon on August 31, 2015

Wish it was easier to write complex software?

Want an easier way to write applications?

The Business Application Framework (BAF) is a toolkit that assists in writing complex software in LiveCode. It does this by making it easier to write applications using the Model-View-Controller (MVC) design pattern. But what does this actually mean?


Binary vs Text

by Fraser Gordon on June 2, 2014

One aspect of LiveCode 7.0 that I keep bringing up in my blog posts is the distinction between textual data and binary data. Although LiveCode does not implement data types for scripts, it does use them internally. Being aware of how the engine treats these types is important for getting the maximum speed out of your scripts.

The four basic types that the engine operates on are Text, BinaryData, Numbers and Arrays – there are a number of sub-types for each of these but the majority of them are not important here. Arrays are very different to the other three (for example, you can’t treat a number as an array or vice-versa) so I’ll ignore them in this blog post.

Bird, Plane or Binary?

First things first: what exactly is each of these types?

  1. Number: integer or floating-point (decimal) value
  2. Binary: a sequence of bytes
  3. Text: human-readable text

The distinction between Binary and Text doesn’t seem all that clear; after all, isn’t text ultimately just a sequence of bytes stored somewhere?

The important difference is how you, the script writer, intend to use the data: if you think terms like "char" or "word" are meaningful for your data, it is probably Text. On the other hand, Binary is just a bunch of bytes to which the engine assigns no particular meaning. Another useful rule-of-thumb is that anything a human will interact with is text while anything a computer will use is binary (notable exceptions to this are HTML and XML, which are text-based formats).

Promotions

LiveCode will happily hide issues of binary vs text vs number from you. It is, however, very useful to know exactly how it does this, because getting it wrong can produce unexpected results.

The types used by the engine can be thought of as a hierarchy, with the least meaningful types at the bottom and the most meaningful at the top. This order, from top to bottom, is:

  1. Number
  2. Text
  3. Binary

Inside the engine, types get moved up and down this hierarchy as necessary. A type can always be moved downwards to a less-specific type but cannot always be moved upwards. For example, the string "hello" isn’t exactly meaningful as a number.

One potentially surprising consequence of this is that the code "put byte 1 of 42" will give "4" rather than the byte with value 42. This happens because the conversion from Number to Binary passes through Text, where the number is converted to the string "42", of which the first byte is "4".
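If what you actually want is the single byte with value 42, you have to construct it explicitly. A minimal sketch using the numToByte and byteToNum functions available in LiveCode 7:

put byte 1 of 42             -- "4": the number becomes the text "42" first
put numToByte(42) into tByte -- the single byte with value 42, no Text detour
put byteToNum(tByte)         -- 42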

Types in a Typeless Language

Long-time LiveCoders amongst you will no doubt now be thinking that LiveCode is a typeless language and wondering how this information can be used. The key to it is simply being consistent in how you use any piece of data.

Implicit type-conversion within the engine can be slow, especially for large chunks of data, so you don’t want to do things like this:

answer byte "1" to -1 of "¡Hola señor!"

That short line of code contains a number of unnecessary type conversions. The obvious one is specifying the byte using a string instead of a number. It also contains a conversion of Text to Binary – the "byte" chunk expression indicates you want to treat the input as binary data instead of text. Finally, the "answer" command expects Text, so it converts it back again.

Although the example is contrived, you do need to be careful with chunk expressions. Using "byte" means that you want the data to be Binary while any other chunk type is used for Text.
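For reference, a conversion-free rewrite of that line keeps everything as Text by using a numeric index and a Text chunk type (a sketch):

-- no conversions: a numeric index and the "char" chunk keep the data as Text
answer char 1 to -1 of "¡Hola señor!"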

Text is Slow

It isn’t just type conversions that are slow; treating data as the wrong type can also be slow. In particular, comparison operations are slower for Text than for Binary or Number. To compare text, the caseSensitive and formSensitive properties have to be taken into consideration; case conversion and text normalisation operations mean that the Text operations are slower than the equivalent for Binary.

On the other hand, Text is far more flexible than Binary. Operations on Binary are limited to the following (anything outside this will result in conversion to Text; a sketch of binary-safe usage follows the list):

  1. "is" and "is not" (must be Binary on both sides of the operation)
  2. "byte" chunk expressions
  3. "byteOffset" function
  4. concatenation ("put after", "put before" and "&" but not "&&")
  5. I/O from/to files and sockets opened as binary
  6. "binfile:" URLs
  7. functions explicitly dealing with binary data (e.g. compress, base64encode)
  8. properties marked as being binary data
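Here is a sketch that stays within those binary-safe operations from start to finish (the file name is hypothetical, and numToByte builds the single byte to search for):

put url "binfile:image.png" into tData             -- "binfile:" keeps the data Binary
put byte 1 to 8 of tData into tHeader              -- "byte" chunks stay Binary
put byteOffset(numToByte(0), tData) into tFirstNul -- byteOffset works directly on Binary
put tHeader & tData into tCopy                     -- "&" concatenation stays Binary
put compress(tData) into tZipped                   -- compress explicitly deals with binary data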

Text Encodings and Strict-Mode

Those of you who have been paying attention to my previous blog posts (if you exist!) will have heard me mention that to convert between Text and Binary you need to use textEncode and textDecode. With these functions, you specify an encoding. But when the engine does it automatically, what encoding does it use?

The answer is the "native" encoding of the OS on which LiveCode is running. This means "CP1252" on Windows, "MacRoman" on OS X and iOS, and "ISO-8859-1" on Linux and Android. All of these platforms fully support Unicode these days, but these were the traditional encodings on each platform before the Unicode standard came about. LiveCode keeps these encodings for backwards-compatibility.
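In practice, this means an implicit conversion, such as putting Text into a "binfile:" URL, behaves like an explicit textEncode with the "native" encoding name. A sketch (the file name and variable are hypothetical):

put tText into url "binfile:out.txt"                        -- implicit: the native encoding is used
put textEncode(tText, "native") into url "binfile:out.txt"  -- explicit equivalent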

LiveCode continues to convert between these encodings automatically because the engine previously did not treat Text and Binary any differently (one consequence of this is having to set caseSensitive to true before doing comparisons with binary data). In some situations, you might want to turn this legacy behaviour off.
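In concrete terms, a byte-for-byte comparison should look something like this sketch:

set the caseSensitive to true -- stop the legacy case-folding of bytes
if tBinaryA is tBinaryB then  -- now a true byte-for-byte comparison
   -- the two byte sequences are identical
end if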

To aid this, we are planning to (at some point in the future) implement a "strict mode" feature similar to the existing Variable Checking option – in this mode, there will be no auto-conversion between Binary and Text and any attempt to do so will throw an error (similar to using a non-numeric string where a number is expected). Like variable checking, it will be optional but using it should help find bugs and potential speed improvements.


Examining Unicode, Part II – Digesting Text

by Fraser Gordon on April 2, 2014

In my last article, I described how Unicode text can be broken down into its individual subcomponents: characters are composed of one or more codepoints and these codepoints are encoded into code units which comprise one or more bytes. Expressed in LiveCode:

byte a of codeunit b of codepoint c of character d

This article will explain how you can use these components when processing Unicode text.

Characters to Codepoints: Normalisation

Following the chunk expression above, the first step is breaking a character up into its constituent codepoints. As discussed in the previous article, these could be in either composed or decomposed form (or even somewhere in between!). Should you prefer one particular form over another, LiveCode includes a conversion function:

put normalizeText("é", "NFD") into tDecomposed

The “NFD” supplied as the second parameter says you want the string in a Normal Form, Decomposed. For composition, you would specify “NFC”. (There are also “NFKC” and “NFKD” forms but these are not often useful. The “K” stands for “compatibility”…).

What do you think will happen when you execute the following line of code?

answer normalizeText("é", "NFC") is normalizeText("é", "NFD")

LiveCode will happily tell you that both strings are equal! This shouldn’t really be a surprise; when considered as graphical characters they are the same, even if the underlying representation is different. Just like case sensitivity, you have to explicitly ask LiveCode to treat them differently:

set the formSensitive to true

With that set, LiveCode will now consider composed and decomposed forms of the same text to be different, just as it treats “a” and “A” as different when in case-sensitive mode. Also like the caseSensitive property, it only applies to the current handler.
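You can see the effect directly; a sketch:

set the formSensitive to true
answer normalizeText("é", "NFC") is normalizeText("é", "NFD") -- "false"
set the formSensitive to false
answer normalizeText("é", "NFC") is normalizeText("é", "NFD") -- "true" again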

Let’s use this knowledge to do something useful. Consider a search function – maybe you’d like to match the word “café” when the user enters “cafe” – here’s how you’d remove accents from a bunch of text:

function stripAccents pInput
   local tDecomposed
   local tStripped
   
   -- Separate the accents from the base letters
   put normalizeText(pInput, "NFD") into tDecomposed
   
   repeat for each codepoint c in tDecomposed
      -- Copy everything but the accent marks
      if codepointProperty(c, "Diacritic") is false then
         put c after tStripped
      end if
   end repeat
   
   return tStripped
end stripAccents
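A quick usage sketch:

answer stripAccents("café") -- displays "cafe"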

The function also demonstrates another very useful function – codepointProperty – which will be our next port-of-call.

Codepoint Properties

The supporting library that LiveCode uses to assist in some of the Unicode support (libICU) provides an interface for querying various properties of codepoints and this is exposed to LiveCode scripts via the new codepointProperty function. To use this function, simply provide a codepoint as the first parameter and the name of the property you’d like to retrieve as the second parameter.

There are a large number of properties that exist, some of which are more useful than others. For an overview of the properties that the Unicode character database provides, please see here (http://www.unicode.org/reports/tr44/). Some of my personal favourites are:

  1. “Name” – returns the official Unicode name of the codepoint
  2. “Script” – script the character belongs to, e.g. Latin or Cyrillic
  3. “Numeric Value” – the value of the character when interpreted as a number
  4. “Lowercase Mapping” and “Uppercase Mapping” – lower- or upper-cases the character


Example output from these properties:

answer codepointProperty("©", "Name")              -- "COPYRIGHT SIGN"
answer codepointProperty("Ω", "Script")            -- "Greek"
answer codepointProperty("¾", "Numeric Value")     -- 0.75
answer codepointProperty("ß", "Uppercase Mapping") -- "SS" 

Code Units and Bytes: Encoding

The LiveCode engine does a lot of work to hide the complications of Unicode from the user but, unfortunately, not all software is written in LiveCode. This means that when you talk to other software, you have to tell the engine how to talk to it in Unicode. This is where text encodings come in – every time you read from or write to a file, process, network socket or URL, text has to be encoded in some way.

To convert between text and one of these binary encodings, use one of the aptly named textEncode and textDecode functions:

put url("binfile:input.txt") into tInputEncoded
put textDecode(tInputEncoded, "UTF-8") into tInput
…
put textEncode(tOutput, "UTF-8") into tOutputEncoded
put tOutputEncoded into url("binfile:output.txt")

If you are using the open file/socket/process syntax, you can have the conversion done for you:

open tFile for utf-8 text read 

Unfortunately, the URL syntax does not offer the same convenience. It can, however, auto-detect the correct encoding to use in some circumstances: when reading from a file URL, the beginning of the file is examined for a “byte order mark” that specifies the encoding of the text. It also uses the encoding returned by the web server when HTTP URLs are used. If the encoding is not recognised, it assumes the platform’s native text encoding is used. As the native encodings do not support Unicode, it is usually better to be explicit when writing to files, etc.

As an aside, we are hoping to improve the URL syntax in order to allow for the same auto-conversion but have not yet settled on what it will be.


Examining Unicode, Part I – The dissection

by Fraser Gordon on March 31, 2014

As I mentioned in my previous blog post, Unicode text is hard (which is one of the reasons it has taken such a monumental effort to get LiveCode 7.0 ready for release – it is now in public testing if you’d like to try it out). In order to make everything work transparently for the writers and users of LiveCode stacks, a lot has to go on behind the scenes. In this post and its follow-up, I hope to explain how some of these innards work. This first post is a bit technical but will lay the groundwork for some new Unicode text processing techniques.

The most important thing with Unicode is to understand what is meant by a character – different people have different definitions, some quite technical. Older computer software will often treat each 8-bit byte as a character, a standard which LiveCode and its predecessors followed. Sometimes, "character" is used for the symbols defined by the Unicode standard (these are more properly termed "codepoints"). Neither of these is necessarily what a human reader would think of as a character, however.

Consider the letter "é" – that’s obviously a single character, right? Well, it depends on who you ask… Considered as 8-bit bytes, it could be anywhere between 1 and 8 "characters". Looking at it with Unicode-coloured glasses, it could be either 1 or 2 codepoints. However, in LiveCode 7, it is always a single character. If you were a Unicode geek like me, you’d call this LiveCode definition a "grapheme cluster".

Why do these different interpretations arise? If you’ll bear with me, I’ll take it apart piece-by-piece.

First come the codepoints. The Unicode standard defines two types of representation for accented characters, known as "composed" and "decomposed". Continuing with "é" as our example, Unicode would call this U+00E9 "LATIN SMALL LETTER E WITH ACUTE" in its composed form. In its decomposed form, it would be a U+0065 "LATIN SMALL LETTER E" followed by U+0301 "COMBINING ACUTE ACCENT". Basically, composed versus decomposed is the choice between accented characters being characters in their own right or instead being an un-accented character with an accent atop it. Conversion between these forms is called "normalisation" and will be discussed in my next post.
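You can observe both forms from script; a sketch using the new chunk types and the normalizeText function covered in the next post:

put the number of codepoints in normalizeText("é", "NFC") -- 1
put the number of codepoints in normalizeText("é", "NFD") -- 2
put the number of characters in normalizeText("é", "NFD") -- 1: still one grapheme cluster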

Next comes the variable number of bytes that are used to store these codepoints – this comes down to how these codepoints are encoded. Sometimes, old 8-bit encodings have a single byte to represent a particular composed character. Unfortunately, these encodings can only represent 256 different characters so Unicode encodings are used instead. The particular encoding used within LiveCode is UTF-16 (but this is internal to the engine and isn’t visible to LiveCode scripts).

The UTF-16 encoding uses 16-bit values to store codepoints, termed "code units". This extra term is needed because although many languages have all of their symbols representable using a single code unit, a number (including Chinese) need two code units per codepoint for certain characters, due to the large number of symbols within the language. Because of this, a codepoint can be either 2 or 4 bytes in length when encoded with UTF-16.

Other common text encodings are:

  1. UTF-8. Uses between 1 and 4 bytes to encode codepoints. Common on Linux and Mac OS X systems.
  2. UTF-32. Always uses 4 bytes per codepoint. Trades space efficiency for simplicity.
  3. MacRoman. Always 1 byte, non-Unicode. Legacy encoding on most Mac OS systems.
  4. ISO-8859-1. Always 1 byte, non-Unicode. Legacy encoding on many Linux systems.
  5. CP1252. Always 1 byte, non-Unicode. Legacy encoding on many Windows systems.
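To see these size differences from script, you can encode the same character with textEncode and count the bytes; a sketch (it assumes the composed form of "é" as input):

put the number of bytes in textEncode("é", "UTF-8")    -- 2
put the number of bytes in textEncode("é", "UTF-16")   -- 2: a single code unit
put the number of bytes in textEncode("é", "UTF-32")   -- 4
put the number of bytes in textEncode("é", "MacRoman") -- 1: legacy single-byte encoding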

As you can see, there is a fair bit of complexity behind the transparent Unicode support in LiveCode 7. In my next post, I’ll show you how you can take advantage of knowing how it all fits together.


7.0 – Unicode Strikes Back

by Fraser Gordon on February 27, 2014

It has been a number of months since Ali reported our progress on the engine refactoring project and the integration of Unicode into LiveCode (Slaying the Unicode Monster) and in that time, much has changed. The project is nearly complete and, as Kevin said yesterday, we are approaching a DP release.

Supporting Unicode and international text has required extensive changes throughout the engine – too extensive to cover in a single blog entry – so today I’ll explain the changes to one of the most visible parts of LiveCode: fields.

In the current releases of LiveCode, it is possible to use Unicode text in fields. Unfortunately, it requires special syntax and can be a bit cumbersome to manipulate properly. In addition, the support is fairly rudimentary and doesn’t work properly for languages requiring complex text layout (for example, Arabic).

7.0 will change all that – Unicode text in fields (and throughout the engine) is manipulated the same way as any other text. In fact, the engine doesn’t distinguish between Unicode text and “plain” text anymore – they are both just text. But that’s a story for another time.

Most of the changes in the field to support Unicode are “below-the-hood” and won’t be immediately apparent. They have, however, allowed for much greater flexibility in how text in fields is processed, and I’ll summarise what this has allowed us to do:

East Asian languages such as Chinese and Japanese. Previously, these could be entered but the field had difficulty with characters that require a feature of Unicode encodings called “surrogate pairs” – the components of these pairs were treated as separate characters, causing problems when one of them was deleted or had its style changed.

Complex scripts where multiple character fragments combine to form one graphical character (called a “grapheme”). For text manipulation, these are now treated as single characters (and new chunk types “codepoint” and “codeunit” have been added for those who need to access the individual components).

Cursor navigation working appropriately for non-English text. Navigating left and right through a field happens on grapheme boundaries, ensuring that the cursor never ends up between a character and its accent. The keyboard commands for moving forwards and backwards by whole words also work for text that doesn’t use spaces as word separators (e.g. Chinese).

Right-to-left and bidirectional text. Mixing left-to-right and right-to-left text (e.g. Hebrew and Arabic) in a field now lays it out in the correct order, including when LTR text is embedded within RTL or vice-versa.

All of this is available without any extra work on the part of a developer creating a LiveCode app – our goal with our Unicode support is to make it just as easy to create an app with Unicode support as without. We hope you’ll be pleased with the result!
