[Cuis] Unicode in Cuis

Thu Feb 7 05:55:07 CST 2013

Hi Janko,

On 2/7/2013 4:44 AM, Janko Mivšek wrote:
> Hi Juan,
>
> Please don't use UTF8 encoded strings internally but store Unicode
> strings as plain Unicode in the image, while providing encoding to/from
> UTF8 "on the edge" of the image, that is, when interfacing to the outside
> world, like the file system, sockets etc.

There's no such thing as "plain Unicode". What you mean is UTF-32.
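
To make the distinction concrete, here is a small sketch in Python (used only for illustration; nothing here is Cuis code, and the sample string is my own choice). Unicode assigns code points; UTF-8, UTF-16 and UTF-32 are different byte encodings of those same code points, and only UTF-32 uses a fixed width per code point.

```python
# Illustration only: the same code points under three encodings.
text = "Mivšek €"  # sample string chosen for this sketch

# Unicode itself: a sequence of abstract code points.
code_points = [ord(ch) for ch in text]

# Three concrete byte encodings of those code points (big-endian, no BOM).
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")

print("code points:", len(code_points))
print("bytes in UTF-8 / UTF-16 / UTF-32:", len(utf8), len(utf16), len(utf32))
```

In UTF-32 the byte count is always four times the number of code points, which is what makes indexed access trivial there.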

> All other Smalltalks have such a Unicode solution and I think this is
> the simplest to achieve. Especially because you can preserve all the usual
> string manipulations, starting with the simplest, like determining the
> string size. Try this on a UTF8 encoded string ...
>
> All you need for that is a ByteString for ASCII/ISO8859-1 etc (<256) and

(side note: Cuis already uses ISO 8859-15, which is quite a bit better
than ISO 8859-1)

> WideString for all other Unicode, and specially a flexible Character
> class, which can handle both 8bit and wider Unicode "code points". It
> seems that a Character class is most problematic to achieve.

Do we need multiple String representations? Can just one suffice?

> I once worked on a Unicode patch to Squeak (to introduce interim
> TwoByteString as in VW for more efficient storing Unicode strings in
> image), maybe this can help you (and Hannes) a bit:
>
>          Unicode patch
> 	http://forum.world.st/Unicode-patch-tt64011.html
> 	as part of this very long discussion:
> http://forum.world.st/New-Win32-VM-m17n-testers-needed-tp63730p63798.html

Interesting. It shows that this can get complex.

> One more resource:
>
> 1) The Design and Implementation of Multilingualized Squeak
>      Yoshiki Ohshima, Kazuhiro Abe
>      http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/main.html
>
> Best regards
> Janko

Let me rephrase my thought. Right now, Cuis uses single-byte Char/String.
No matter what, we need to support UTF-8. So, is UTF-8 enough? Can we
avoid adding a third encoding and implementation? Can we drop
ISO-8859-15 some day (and use ONLY UTF-8 everywhere)?

In Cuis, we try to do the simplest thing. Having one single encoding for 
Chars and String would indeed be the simplest.

Looking a bit at the links you sent, everybody agrees with me that the
main drawbacks of UTF-8 are slow #at: and #size (thanks for mentioning
#size), and impossible #at:put:. So, is this a serious problem? Maybe
not, especially if we encourage stream-like processing rather than
array-like access. This is the main question I'd like to answer.
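
A rough sketch of that trade-off, in Python rather than Smalltalk and purely illustrative: the function names below are hypothetical and are not Cuis or Squeak API, and the decoder assumes well-formed UTF-8. On a raw UTF-8 byte sequence, both size and indexed access need a scan, while a stream-style pass visits each code point exactly once.

```python
def utf8_size(data: bytes) -> int:
    """Count code points: every byte that is not a continuation
    byte (10xxxxxx) starts a new code point. O(n) in the byte count."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_code_points(data: bytes):
    """Stream-style decoding: yield code points one by one in a single
    pass. Assumes valid UTF-8 (no error handling in this sketch)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # 0xxxxxxx: 1 byte
            length, cp = 1, lead
        elif lead >> 5 == 0b110:     # 110xxxxx: 2 bytes
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:    # 1110xxxx: 3 bytes
            length, cp = 3, lead & 0x0F
        else:                        # 11110xxx: 4 bytes
            length, cp = 4, lead & 0x07
        for b in data[i + 1:i + length]:
            cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits
        yield cp
        i += length


def utf8_at(data: bytes, index: int) -> int:
    """Analogue of #at: (1-based). It has to walk from the start,
    so it is O(n) instead of O(1)."""
    for position, cp in enumerate(utf8_code_points(data), start=1):
        if position == index:
            return cp
    raise IndexError(index)


sample = "Mivšek €".encode("utf-8")
print(utf8_size(sample), hex(utf8_at(sample, 4)))
```

An in-place #at:put: is harder still: replacing "e" (1 byte) with "€" (3 bytes) changes the length of the byte sequence, so it cannot be done on a fixed-size object without rebuilding it.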

Cheers,
Juan Vuletich

> On 07. 02. 2013 03:59, Juan Vuletich wrote:
>> Hi Folks,
>>
>> I wasn't able to jump before into the recent discussion, but I've been
>> thinking a bit about all this. This http://en.wikipedia.org/wiki/UTF8
>> made me realize that using UTF8 internally to represent Unicode strings
>> is a way to minimize required changes to existing software.
>>
>> Hannes, your proposal for representing WideStrings as
>> variableWordSubclass is like http://en.wikipedia.org/wiki/UTF-32, right?
>> And if I remember correctly, it is what Squeak uses.
>>
>> I ended up sketching this to compare alternatives:
>>
>>
>> ISO 8859-15
>> ========
>>     pros:
>>     -------
>>        - We already have a lot of specific primitives in the VM
>>        - Efficient use of memory
>>        - Very fast #at: and #at:put:
>>     cons:
>>     -------
>>        - Can only represent Latin alphabets and not full Unicode
>>
>> Unicode UTF-8 (in a new variableByteSubclass)
>>     pros:
>>     -------
>>        - We could reuse many existing String primitives
>>        - It was created to allow existing code deal with Unicode with
>> minimum changes
>>        - Efficient use of memory
>>     cons:
>>     -------
>>        - Does not allow #at:put:
>>        - #at: is very slow, O(n) instead of O(1)
>>
>> Unicode UTF-32 (in a new variableWordSubclass)
>>     pros:
>>     -------
>>        - Very fast #at: and #at:put: (although I guess we don't use this
>> much... especially #at:put:, as Strings are usually regarded as immutable)
>>     cons:
>>     --------
>>        - very inefficient use of memory (4 bytes for every character!)
>>        - doesn't take advantage of existing String primitives (sloooow)
>>
>>
>> I think that for some time, our main Character/String representation
>> should be ISO 8859-15. Switching to full Unicode would require a lot of
>> work in the paragraph layout and text rendering engines. But we can
>> start building a Unicode representation that could be used for webapps.
>>
>> So I suggest:
>>
>> - Build a Utf8String variableByteSubclass and Utf8Character. Try to use
>> String primitives. Do not include operations such as #at:, to discourage
>> their use.
>>
>> - Make conversion to/from ISO 8859-15 lossless, by a good codification
>> of CodePoints in regular Strings (not unlike Hannes' recent contribution).
>>
>> - Use this for the Clipboard. This is pretty close to what we have now,
>> but we'd allow using an external Unicode textEditor to build any Unicode
>> text, and by copying and pasting into a String literal in Cuis code,
>> we'd have something that can be converted back into UTF-8 without losing
>> any content.
>>
>> - Web servers should convert back Strings into UTF-8. This would let us
>> handle and serve content using full Unicode, without needing to wait
>> until our tools can display it properly.
>>
>> - Use this when editing external files, if they happen to be in UTF-8
>> (there are good heuristics for determining this). On save, we offer the
>> option to save as UTF-8 or ISO 8859-15.
>>
>> - We can start adapting the environment to avoid using the String
>> protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and
>> related). This would ease an eventual migration to using only UTF-8 for
>> everything.
>>
>> What do you think? Do we have a plan?
>>
>> Cheers,
>> Juan Vuletich