LabTalk and OriginC Changes Due to Unicode Support in Origin 2018


NOTE: This post pertains to Origin 2018, the upcoming version of Origin, which will be released in October 2017. Origin 2017 and earlier versions are not Unicode-aware, so this information does not apply to them.

Introduction

In a break from the past, Origin 2018 will use Unicode for text data (strings), enabling it to support a large variety of languages with much less dependency on computer locales. Unicode itself is a character set; how its text is stored and manipulated depends on a character encoding. A number of encodings are available for Unicode data, but OriginLab settled on UTF-8 as the internal Unicode encoding for a number of reasons, not the least of which is that it plays well with ASCII data, a very compelling advantage.

Without going into too much background information, Unicode treats text “characters” as integer-based Code Points and not actual characters per se. In UTF-8, each Code Point is encoded as a sequence of 1 to 4 bytes, with the ASCII range occupying 1 byte and other characters variously occupying 2, 3 or 4 bytes. For those who are interested, this is quite a good discussion of the subject.
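
As a quick illustration of the 1 to 4 byte range, here is a minimal sketch in plain Python (not LabTalk or OriginC, and not part of Origin). The four sample characters are arbitrary choices made for this example.

```python
# Illustration only: how many bytes UTF-8 needs for Code Points in
# different ranges. The sample characters are arbitrary.
samples = [
    "A",           # U+0041, ASCII range
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u5b9e",      # U+5B9E, a CJK ideograph
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ').upper()}")

# Number of UTF-8 bytes per sample, in order.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
```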

Let’s take a look at some actual text data and break it down to illustrate what is going on. In the table below, we see the Simplified Chinese version of the text “Experiment #1”: 实验 #1. The most important thing to notice is that the first two Code Points occupy 3 bytes each, while the last 3 Code Points are 1 byte each because they are within the ASCII range. Also notice that there is a one-to-one relationship between “visual characters” (graphemes) and Code Points. For the vast majority of cases this holds true, but there are some exceptions that will be outlined at the end of this post.
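
The same breakdown can be reproduced with a few lines of plain Python. This is only a sketch for checking the byte counts described above; it is not Origin code, and the variable names are made up for the example.

```python
# Illustration only: break the sample text into Code Points and UTF-8 bytes.
text = "\u5b9e\u9a8c #1"  # the Simplified Chinese text for "Experiment #1"

# One row per Code Point: (character, Code Point, UTF-8 bytes in hex).
rows = [(ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ").upper()) for ch in text]

for ch, code_point, hex_bytes in rows:
    print(f"{ch!r:6} {code_point:8} {hex_bytes}")

total_code_points = len(text)           # Python strings count Code Points
total_bytes = len(text.encode("utf-8")) # length of the UTF-8 encoding
print(total_code_points, "Code Points,", total_bytes, "bytes")
```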

What does this mean for the LabTalk or OriginC programmer who wants to manipulate text containing non-ASCII, UTF-8 encoded characters? Let’s see.

LabTalk

In order to properly support UTF-8 encoded text strings, a new system variable, @SCOL, has been added to Origin 2018. When it is set to 1 (the default), most common LabTalk string functions and methods of the LabTalk String Object will use Code Points for character offset and character count (like Excel). If @SCOL is set to 0 (zero), those functions will use byte number for offset and number of bytes for count. Remember that in LabTalk, offsets and positions are 1-based.

So if we apply two LabTalk string functions, Len() and Mid(), to the Simplified Chinese text illustrated in the table above we can see how the results differ between using Code Points versus bytes:

Notice that when Code Points are used (@SCOL=1), Len() returns the expected number and Mid() returns the expected substring. But if we look at the result when bytes are used (@SCOL=0) and reference the table above, we can see that Len() gives us the number of bytes used for the text string. Mid() starts at byte #4 and returns byte #4 & byte #5. Well, those two bytes are part of a 3-byte Code Point. This means that the returned substring is incomplete and essentially garbage.
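
To make the difference concrete, the following plain Python sketch imitates the two modes. It is not LabTalk: len_sim and mid_sim are hypothetical helpers written for this illustration, and the Mid() arguments (start 4, count 2) are inferred from the description above.

```python
# Illustration only: imitate Code Point mode (scol=1) versus byte mode (scol=0).
def len_sim(text, scol=1):
    """Length in Code Points when scol is 1, in UTF-8 bytes when scol is 0."""
    return len(text) if scol else len(text.encode("utf-8"))

def mid_sim(text, start, count, scol=1):
    """Substring with a 1-based start, as in LabTalk."""
    if scol:
        return text[start - 1:start - 1 + count]
    raw = text.encode("utf-8")[start - 1:start - 1 + count]
    # A slice that cuts through a multi-byte Code Point is not valid UTF-8,
    # so undecodable bytes are shown as U+FFFD replacement characters.
    return raw.decode("utf-8", errors="replace")

text = "\u5b9e\u9a8c #1"
print(len_sim(text, 1), repr(mid_sim(text, 4, 2, 1)))  # Code Point mode
print(len_sim(text, 0), repr(mid_sim(text, 4, 2, 0)))  # byte mode
```

In byte mode the slice covers only the first two bytes of a 3-byte Code Point, which is why the result cannot be decoded into a real character.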

While LabTalk now supports proper UTF-8 encoded text strings, there are some restrictions on where non-ASCII range text may be used in scripting. It is not supported in the Script Window or Command Window. If non-ASCII range text (more than 1 byte per “character”) is output to either of those windows, or non-ASCII range string literals are entered into them, it is likely that garbage will be displayed in place of the text. On the other hand, output to the Results Log, the Message Log, or a dialog box will display the text properly.

If there is a need to use UTF-8 encoded text outside the ASCII range in LabTalk scripts, then OGS files must be used. What’s more, the OGS file itself MUST be saved as UTF-8 with a Byte Order Mark (BOM) in order for the script to execute properly. Code Builder will prompt you to save an OGS file as UTF-8 if needed and will automatically add a BOM, but other text editors may not do so. For example, Notepad can save UTF-8 and will automatically add a BOM, but Notepad++ will not add a BOM unless specifically told to do so. So it is a good idea to determine what your text editor of choice is doing if it is not Code Builder.
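
If you are unsure what your editor did, the file can be checked outside of Origin. The plain Python sketch below writes the same hypothetical script text twice, once with a BOM (Python's "utf-8-sig" encoding) and once without, and then tests the first three bytes of each file. The file names and the one-line script content are made up for this example.

```python
# Illustration only: write a file with and without a UTF-8 BOM, then detect it.
import codecs
import os
import tempfile

# Hypothetical one-line OGS content containing non-ASCII text.
script = 'string title$ = "\u5b9e\u9a8c #1";\n'

def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.ogs")
    without_bom = os.path.join(tmp, "without_bom.ogs")

    # "utf-8-sig" prepends the BOM on write; plain "utf-8" does not.
    with open(with_bom, "w", encoding="utf-8-sig", newline="") as f:
        f.write(script)
    with open(without_bom, "w", encoding="utf-8", newline="") as f:
        f.write(script)

    bom_results = (has_utf8_bom(with_bom), has_utf8_bom(without_bom))

print(bom_results)
```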

Note: the older LabTalk Substring Notation will NOT support UTF-8 strings outside of the ASCII range and there are no plans to implement it. However, you are encouraged to use modern string functions anyway, so this should not present much of an issue with new scripts.

Finally, it is very good practice NOT to use characters outside of the ASCII range for variable or function names in LabTalk. This is a general best practice in most programming languages, not just LabTalk.

OriginC

Like LabTalk, OriginC can now support UTF-8 encoded text strings, and this is controlled via the new system variable @SCOC. When it is set to 1, most methods of the string class will use Code Points for position and character count. When it is set to 0 (zero), those methods will use a 0-based byte offset for position and number of bytes for count. Unlike the LabTalk version of this system variable, @SCOC is set to 0 (zero) by default, meaning that “out of box” OriginC won’t consider Code Points and will stick with bytes. This is a choice made for Origin 2018 to ensure that there aren’t unforeseen issues in the large code base of OriginC that ships with the product.

While the methods of the string class can support UTF-8 encoded text strings, most other character and string manipulation functions will NOT take non-ASCII range text into account regardless of the value of @SCOC; they will only deal with bytes. Therefore, one must be thoughtful in their use, especially with strings that may contain characters outside of the ASCII range (more than 1 byte).

So if OriginC does not use Code Points “out of box”, how can you use them when calling members of the string class?

Well, an OriginC macro is provided to enable such functionality. The macro is ENABLE_STR_LENGTH_USE_CHAR_COUNT; It sets @SCOC to 1 only within the scope in which the macro is used. Once that scope ends, @SCOC is set back to its previous value. Thus the macro should be placed within the scope where it needs to be used.

For example, you can use the function below to perform the same thing as the LabTalk code discussed above (remembering that OriginC uses 0-based offsets):

Notice the difference before and after the macro is used? And because it is only within the scope of the function, @SCOC reverts to its previous value when the function returns.
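
The set-then-restore behavior can be pictured with a rough analogy in plain Python. This is not OriginC and says nothing about how the macro is implemented: the settings dictionary stands in for the system variable, and scoped_scoc and get_length are hypothetical names invented for the sketch.

```python
# Analogy only: a scoped override that restores the previous value on exit.
from contextlib import contextmanager

settings = {"SCOC": 0}  # stand-in for the system variable; 0 means bytes

@contextmanager
def scoped_scoc(value):
    """Set the stand-in variable inside the with-block, then restore it."""
    previous = settings["SCOC"]
    settings["SCOC"] = value
    try:
        yield
    finally:
        settings["SCOC"] = previous  # restored even if an error occurs

def get_length(text):
    """Code Points when the stand-in variable is 1, UTF-8 bytes otherwise."""
    return len(text) if settings["SCOC"] else len(text.encode("utf-8"))

text = "\u5b9e\u9a8c #1"
before = get_length(text)   # byte mode
with scoped_scoc(1):
    inside = get_length(text)  # Code Point mode, only inside this block
after = get_length(text)    # back to byte mode
print(before, inside, after)
```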

Finally, along these lines, there is also a macro that explicitly sets @SCOC to 0 if needed: DISABLE_STR_LENGTH_USE_CHAR_COUNT;

To further support UTF-8 encoded text strings, some methods of the string class have been modified.

string::GetLength() is now prototyped as:

The bAuto param is only applicable when @SCOC=1. If true (the default), the method returns the length in Code Points. If false, it returns the length in bytes. If @SCOC=0, then the length in bytes is always returned regardless of the value of bAuto.
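
The rule reads more easily as a small decision function. The plain Python sketch below mirrors only the behavior described in the paragraph above; get_length_sim is a hypothetical stand-in and not the OriginC method or its prototype.

```python
# Illustration only: the length rule as described, for each setting combination.
def get_length_sim(text, scoc, b_auto=True):
    """Code Points only when scoc is 1 and b_auto is true; otherwise bytes."""
    if scoc == 1 and b_auto:
        return len(text)
    return len(text.encode("utf-8"))

text = "\u5b9e\u9a8c #1"
results = {
    (scoc, b_auto): get_length_sim(text, scoc, b_auto)
    for scoc in (0, 1)
    for b_auto in (True, False)
}
for (scoc, b_auto), length in results.items():
    print(f"@SCOC={scoc}, bAuto={b_auto}: {length}")
```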

 

string::GetAt() and string::SetAt() have been overloaded with additional versions of each that allow getting and setting substrings (and not just char types) at certain offsets within a given string. This allows for characters that span multiple bytes in UTF-8.

Finally, if you want to get the buffer associated with a string variable, it is tempting to call str.GetBuffer(str.GetLength()). Don’t do it like that, because if Code Point usage is in effect, the length of the buffer will be wrong. Instead, call str.GetBuffer(0), which is agnostic to whether Code Points or bytes are used.
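
The size mismatch is easy to see outside of Origin. In this plain Python sketch (an illustration of the arithmetic only, not of how GetBuffer works internally), a byte buffer sized by the Code Point count is too short to hold the UTF-8 data.

```python
# Illustration only: a buffer sized in Code Points cannot hold the UTF-8 bytes.
text = "\u5b9e\u9a8c #1"
raw = text.encode("utf-8")     # the bytes that must fit in the buffer

code_point_len = len(text)     # what a Code Point based length reports
byte_len = len(raw)            # what the buffer actually needs

truncated = raw[:code_point_len]  # what fits if the buffer is sized wrongly
print(code_point_len, byte_len, truncated != raw)
```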

 

What you See Isn’t Always What you Get

The following is a vast oversimplification of a complex topic but should be good enough to get the point across. In most common cases, it should NOT come up but nonetheless it is good to be aware of.

One of the confounding aspects of Unicode text from a programming perspective is the concept that there is not always a 1 to 1 relationship between “visual characters” (graphemes) and Code Points. Some “visual characters” are actually what are known as grapheme clusters. That is, the “visual character” is composed of a number of Code Points rather than just one. These grapheme clusters are typically composed of a base character and a number of combining characters or nonspacing marks. Those are special Code Points that visually modify the base character in some way but are not visible by themselves.

An example borrowed from this Wikipedia entry is the Swedish surname “Åström”. Visually the name is 6 “characters” but it can be composed of a number of different Code Points. In the following table, we can see two variations of the name that look exactly the same:

Text      Code Points
Åström    U+00C5 U+0073 U+0074 U+0072 U+00F6 U+006D
Åström    U+0041 U+030A U+0073 U+0074 U+0072 U+006F U+0308 U+006D

In the 1st example, the Å and ö (U+00C5 and U+00F6) are precomposed characters. That is, the base character and any combining characters are rolled into one Code Point. So, this example presents no problems. However, in the 2nd example, the base characters (U+0041 and U+006F) are modified by the combining characters that follow them (U+030A and U+0308). For example, U+0041 is the base character A. It is modified by U+030A, which is the combining ring above character. U+006F is the base character o, modified by the combining diaeresis U+0308. Both examples are perfectly legitimate and valid, though.

So what does this mean for LabTalk and OriginC? Well, when working with text strings using Code Points, neither language takes into account the “clusters”. That is, the relevant string functions will consider a cluster of Code Points to be individual Code Points. For example, while LabTalk’s len() function will return 6 in the case of the 1st example, it will report 8 for the 2nd because there are indeed 8 Code Points in the entire string.
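
The two spellings of the name can be compared in plain Python, which also counts Code Points. The sketch is an illustration only. The normalization step at the end uses Python's standard unicodedata module; it is a general Unicode technique and not something this post says Origin performs, and it only helps where a precomposed form exists.

```python
# Illustration only: two spellings that look identical but differ in Code Points.
import unicodedata

precomposed = "\u00c5str\u00f6m"       # uses precomposed U+00C5 and U+00F6
decomposed = "A\u030astro\u0308m"      # base letters plus combining marks

print(len(precomposed), len(decomposed))  # Code Point counts
print(precomposed == decomposed)          # compared Code Point by Code Point

# Canonical normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", precomposed)  # decompose fully
print(nfc == precomposed, nfd == decomposed)
```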

This is not uncommon behavior among programming languages and other products. For example, Excel will report the same results as Origin. Without specialized modules, Python displays the same behavior. And while there are coding methods for determining the boundaries of grapheme clusters, they are, like the subject in general, quite complicated. Thus, such handling will not be implemented in Origin 2018. Luckily, the vast majority of Unicode text does not involve multi-Code-Point grapheme clusters, so this particular issue may not have an impact on most coding efforts.

Wrapping Up

I hope this post has been helpful in understanding the changes that Unicode support has brought to programming in Origin. Unicode is not the easiest thing to grasp, and dealing with Unicode text can require a somewhat different mindset on the programmer’s part. For the most part, those who code in LabTalk or OriginC will not be dealing with non-ASCII Unicode (UTF-8) text. But if the need arises, fairly good support has been built into both languages.
