[Cuis] Unicode in Cuis

Wed Feb 6 20:59:46 CST 2013

Hi Folks,

I wasn't able to jump into the recent discussion earlier, but I've been 
thinking a bit about all this. This http://en.wikipedia.org/wiki/UTF8 
made me realize that using UTF-8 internally to represent Unicode strings 
is a way to minimize required changes to existing software.

Hannes, your proposal for representing WideStrings as a 
variableWordSubclass is like http://en.wikipedia.org/wiki/UTF-32, right? 
And if I remember correctly, it is what Squeak uses.

I ended up sketching this to compare the alternatives:

ISO 8859-15
========
    pros:
    -------
       - We already have a lot of specific primitives in the VM
       - Efficient use of memory
       - Very fast #at: and #at:put:
    cons:
    -------
       - Can only represent Latin alphabets and not full Unicode

Unicode UTF-8 (in a new variableByteSubclass)
========
    pros:
    -------
       - We could reuse many existing String primitives
       - It was created to allow existing code to deal with Unicode with 
minimum changes
       - Efficient use of memory
    cons:
    -------
       - Does not allow #at:put: (characters take 1 to 4 bytes, so a 
replacement may not fit in place)
       - #at: is very slow, O(n) instead of O(1)
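
To make the O(n) cost of #at: concrete, here is a minimal sketch in Python (not Cuis code). The names utf8_char_len and utf8_at are hypothetical, and the input is assumed to be well-formed UTF-8. Finding the n-th character means walking over every character before it, because each one may occupy a different number of bytes.

```python
def utf8_char_len(lead_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:           # 0xxxxxxx: ASCII, one byte
        return 1
    if lead_byte >> 5 == 0b110:    # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:   # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:  # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_at(data: bytes, index: int) -> str:
    """Return the index-th character, 1-based like Smalltalk's #at:.

    Assumes data is well-formed UTF-8. The loop skips index - 1
    characters one by one, which is where the O(n) cost comes from.
    """
    if index < 1:
        raise IndexError("index must be >= 1")
    pos = 0
    for _ in range(index - 1):
        if pos >= len(data):
            raise IndexError("index out of range")
        pos += utf8_char_len(data[pos])
    if pos >= len(data):
        raise IndexError("index out of range")
    width = utf8_char_len(data[pos])
    return data[pos:pos + width].decode("utf-8")
```

The same variable width explains the #at:put: problem: overwriting a one-byte character with a three-byte one would require shifting everything after it.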

Unicode UTF-32 (in a new variableWordSubclass)
========
    pros:
    -------
       - Very fast #at: and #at:put: (although I guess we don't use this 
much... especially #at:put:, as Strings are usually regarded as immutable)
    cons:
    -------
       - very inefficient use of memory (4 bytes for every character!)
       - doesn't take advantage of existing String primitives (sloooow)
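
The memory trade-off between the three alternatives can be checked with a small Python sketch (again, not Cuis code). The function name encoded_sizes and the sample text are made up for illustration, and the text is assumed to be representable in ISO 8859-15 so that all three encodings can be compared.

```python
def encoded_sizes(text: str) -> dict:
    """Byte count of text under each of the three candidate encodings."""
    return {
        "iso-8859-15": len(text.encode("iso-8859-15")),
        "utf-8": len(text.encode("utf-8")),
        # The "-be" variant is used so that no byte order mark is counted.
        "utf-32": len(text.encode("utf-32-be")),
    }


sizes = encoded_sizes("Se\u00f1or \u20ac5")  # 8 characters: "Señor €5"
# ISO 8859-15 uses 8 bytes, UTF-8 uses 11 (ñ takes 2, € takes 3),
# and UTF-32 uses 32 (4 bytes for every character).
```

For mostly Latin text, UTF-8 stays close to the single-byte encoding, while UTF-32 is always four times the character count.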

I think that for some time, our main Character/String representation 
should be ISO 8859-15. Switching to full Unicode would require a lot of 
work in the paragraph layout and text rendering engines. But we can 
start building a Unicode representation that could be used for webapps.

So I suggest:

- Build a Utf8String variableByteSubclass and a Utf8Character. Try to use 
String primitives. Do not include operations such as #at:, to discourage 
their use.

- Make conversion to/from ISO 8859-15 lossless, by a good encoding 
of CodePoints in regular Strings (not unlike Hannes' recent contribution).

- Use this for the Clipboard. This is pretty close to what we have now, 
but we'd allow using an external Unicode text editor to build any Unicode 
text, and by copying and pasting into a String literal in Cuis code, 
we'd have something that can be converted back into UTF-8 without losing 
any content.

- Web servers should convert Strings back into UTF-8. This would let us 
handle and serve content using full Unicode, without needing to wait 
until our tools can display it properly.

- Use this when editing external files, if they happen to be in UTF-8 
(there are good heuristics for determining this). On save, we offer the 
option to save as UTF-8 or ISO 8859-15.

- We can start adapting the environment to avoid using the String 
protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and 
related). This would ease an eventual migration to using only UTF-8 for 
everything.
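
Two of the points above, the lossless conversion and the UTF-8 detection heuristic, can be sketched in Python (not Cuis code). Everything here is an assumption made for illustration: the function names are hypothetical, and the escape format is a numeric character reference such as &#955;, which may differ from the encoding in Hannes' contribution. The detection heuristic shown is the simplest one, a strict decode attempt.

```python
import re

# Assumed escape format: decimal numeric character references, e.g. "&#955;".
_NCR = re.compile(r"&#(\d+);")


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: bytes that decode strictly as UTF-8 are treated as UTF-8.

    Pure ASCII passes too, which is harmless because ASCII reads the
    same in UTF-8 and in ISO 8859-15.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf8_to_latin9(data: bytes) -> bytes:
    """UTF-8 bytes to ISO 8859-15 bytes, escaping what does not fit."""
    text = data.decode("utf-8")
    # xmlcharrefreplace turns each unencodable character into "&#NNN;".
    return text.encode("iso-8859-15", errors="xmlcharrefreplace")


def latin9_to_utf8(data: bytes) -> bytes:
    """ISO 8859-15 bytes back to UTF-8, expanding the "&#NNN;" escapes."""
    text = data.decode("iso-8859-15")
    text = _NCR.sub(lambda m: chr(int(m.group(1))), text)
    return text.encode("utf-8")


original = "\u03bbx. x \u20ac \u00f1".encode("utf-8")  # "λx. x € ñ"
latin9 = utf8_to_latin9(original)
restored = latin9_to_utf8(latin9)
```

Here λ has no place in ISO 8859-15 and travels as &#955;, while € and ñ map to single bytes. One limitation of this sketch: text that already contains a literal sequence like &#955; would be altered on the way back, so a real scheme would also have to escape the marker itself.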

What do you think? Do we have a plan?

Cheers,
Juan Vuletich