[Cuis] Unicode in Cuis

Thu Feb 7 01:44:58 CST 2013

Hi Juan,

Please don't use UTF-8 encoded strings internally. Store Unicode
strings as plain Unicode in the image, and provide encoding to/from UTF-8
"on the edge" of the image, that is, when interfacing with the outside
world: file system, sockets etc.
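
To make the "edge" idea concrete, here is a minimal sketch in Python
(not Smalltalk, only to keep it short and runnable). The helper names
read_text and write_text are made up for illustration, and io.BytesIO
stands in for a file or socket:

```python
import io

def read_text(stream) -> str:
    """Inbound edge: decode UTF-8 bytes into an in-memory Unicode string."""
    return stream.read().decode("utf-8")

def write_text(stream, text: str) -> None:
    """Outbound edge: encode the in-memory Unicode string as UTF-8 bytes."""
    stream.write(text.encode("utf-8"))

# Bytes arriving from "outside" (hypothetical payload).
outside = io.BytesIO("Janko Mivšek".encode("utf-8"))

# Inside, the program only ever sees characters, never encoded bytes.
inside = read_text(outside)
shouted = inside.upper()

# Encoding happens again only when the text leaves.
reply = io.BytesIO()
write_text(reply, shouted)
```

Everything between the two helpers works on characters, so ordinary
string operations need no knowledge of the external encoding.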

All other Smalltalks have such a Unicode solution, and I think it is the
simplest to achieve, especially because you preserve all the usual string
manipulations, starting with the simplest, like determining the string
size. Try this on a UTF-8 encoded string ...
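
A small illustration of the size problem, again in Python rather than
Smalltalk. The sample word is just an assumption chosen because it
contains one non-ASCII letter:

```python
text = "Mivšek"                 # six characters
encoded = text.encode("utf-8")  # "š" (U+0161) takes two bytes in UTF-8

char_count = len(text)      # number of characters
byte_count = len(encoded)   # number of bytes, no longer the same thing

# Indexing characters gives the letter; indexing bytes lands on only the
# first half of the two-byte sequence for "š".
fourth_char = text[3]
fourth_byte = encoded[3:4]
```

Here char_count is 6 while byte_count is 7, and the byte at the same
index is not a complete character on its own.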

All you need for that is a ByteString for ASCII/ISO 8859-1 etc. (<256), a
WideString for all other Unicode, and especially a flexible Character
class that can handle both 8-bit and wider Unicode "code points". The
Character class seems to be the most problematic part to achieve.
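
Roughly, the choice between the two representations could look like
this. It is a Python sketch under my own assumptions, not how any
particular Smalltalk does it: store and char_at are hypothetical names,
and the byte case assumes ISO 8859-1, where byte values equal code
points:

```python
from array import array

def store(text: str):
    """Pick the narrowest storage that holds every code point of `text`."""
    code_points = [ord(ch) for ch in text]
    if all(cp < 256 for cp in code_points):
        # One byte per character, like a ByteString.
        return "ByteString", bytes(code_points)
    # One machine word per character, like a WideString.
    return "WideString", array("I", code_points)

def char_at(stored, index: int) -> str:
    """Constant-time access that works the same for both representations."""
    _kind, units = stored
    return chr(units[index])

narrow = store("café")    # all code points below 256
wide = store("Mivšek")    # "š" is U+0161, so this needs the wide form
```

Callers of char_at do not care which representation they got, which is
the job the flexible Character class has to do in the image.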

I once worked on a Unicode patch for Squeak (to introduce an interim
TwoByteString, as in VW, for storing Unicode strings more efficiently in
the image). Maybe this can help you (and Hannes) a bit:

    Unicode patch
    http://forum.world.st/Unicode-patch-tt64011.html
    as part of this very long discussion:
    http://forum.world.st/New-Win32-VM-m17n-testers-needed-tp63730p63798.html

One more resource:

1) The Design and Implementation of Multilingualized Squeak
    Yoshiki Ohshima, Kazuhiro Abe
    http://www.is.titech.ac.jp/~ohshima/squeak/m17npaper/main.html

Best regards
Janko

On 07. 02. 2013 03:59, Juan Vuletich wrote:
> Hi Folks,
> 
> I wasn't able to jump into the recent discussion before, but I've been
> thinking a bit about all this. This http://en.wikipedia.org/wiki/UTF8
> made me realize that using UTF8 internally to represent Unicode strings
> is a way to minimize required changes to existing software.
> 
> Hannes, your proposal for representing WideStrings as
> variableWordSubclass is like http://en.wikipedia.org/wiki/UTF-32, right?
> And if I remember correctly, it is what Squeak uses.
> 
> I ended up sketching this to compare alternatives:
> 
> 
> ISO 8859-15
> ========
>    pros:
>    -------
>       - We already have a lot of specific primitives in the VM
>       - Efficient use of memory
>       - Very fast #at: and #at:put:
>    cons:
>    -------
>       - Can only represent Latin alphabets and not full Unicode
> 
> Unicode UTF-8 (in a new variableByteSubclass)
>    pros:
>    -------
>       - We could reuse many existing String primitives
>       - It was created to allow existing code to deal with Unicode with
> minimum changes
>       - Efficient use of memory
>    cons:
>    -------
>       - Does not allow #at:put:
>       - #at: is very slow, O(n) instead of O(1)
> 
> Unicode UTF-32 (in a new variableWordSubclass)
>    pros:
>    -------
>       - Very fast #at: and #at:put: (although I guess we don't use this
> much... especially #at:put:, as Strings are usually regarded as immutable)
>    cons:
>    --------
>       - very inefficient use of memory (4 bytes for every character!)
>       - doesn't take advantage of existing String primitives (sloooow)
> 
> 
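
The trade-offs in the comparison above can be tried out with a short
Python sketch. It is only an illustration: utf8_len, utf8_char_at and
utf32_char_at are hypothetical helpers, the input is assumed to be valid
UTF-8, and indexes are 0-based, unlike Smalltalk's #at:.

```python
def utf8_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2

def utf8_char_at(data: bytes, index: int) -> str:
    """O(n): every earlier character must be skipped to find the start."""
    pos = 0
    for _ in range(index):
        pos += utf8_len(data[pos])
    return data[pos:pos + utf8_len(data[pos])].decode("utf-8")

def utf32_char_at(data: bytes, index: int) -> str:
    """O(1): each character sits at a fixed 4-byte offset."""
    return data[4 * index:4 * index + 4].decode("utf-32-le")

text = "Mivšek €"
latin9 = text.encode("iso8859-15")  # one byte per character
utf8 = text.encode("utf-8")         # one to four bytes per character
utf32 = text.encode("utf-32-le")    # four bytes per character, no BOM

sizes = (len(latin9), len(utf8), len(utf32))
```

For this sample the three encodings take 8, 11 and 32 bytes, and only
the UTF-8 accessor has to walk the string to reach a given character.
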
> I think that for some time, our main Character/String representation
> should be ISO 8859-15. Switching to full Unicode would require a lot of
> work in the paragraph layout and text rendering engines. But we can
> start building a Unicode representation that could be used for webapps.
> 
> So I suggest:
> 
> - Build a Utf8String variableByteSubclass and Utf8Character. Try to use
> String primitives. Do not include operations such as #at:, to discourage
> their use.
> 
> - Make conversion to/from ISO 8859-15 lossless, by a good codification
> of CodePoints in regular Strings (not unlike Hannes' recent contribution).
> 
> - Use this for the Clipboard. This is pretty close to what we have now,
> but we'd allow using an external Unicode textEditor to build any Unicode
> text, and by copying and pasting into a String literal in Cuis code,
> we'd have something that can be converted back into UTF-8 without losing
> any content.
> 
> - Web servers should convert back Strings into UTF-8. This would let us
> handle and serve content using full Unicode, without needing to wait
> until our tools can display it properly.
> 
> - Use this when editing external files, if they happen to be in UTF-8
> (there are good heuristics for determining this). On save, we offer the
> option to save as UTF-8 or ISO 8859-15.
> 
> - We can start adapting the environment to avoid using the String
> protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and
> related). This would ease an eventual migration to using only UTF-8 for
> everything.
> 
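
One possible shape for the lossless conversion mentioned in the list
above, sketched in Python. The codification used here (decimal numeric
character references such as &#960;) is an assumption for illustration
and not necessarily the scheme in Hannes' contribution; to_latin9 and
from_latin9 are made-up names.

```python
import re

def to_latin9(text: str) -> bytes:
    """Encode to ISO 8859-15; characters outside it become &#NNN; references."""
    # Escape literal ampersands first so that decoding stays unambiguous.
    escaped = text.replace("&", "&#38;")
    return escaped.encode("iso8859-15", errors="xmlcharrefreplace")

def from_latin9(data: bytes) -> str:
    """Reverse to_latin9: expand every &#NNN; back into its code point."""
    decoded = data.decode("iso8859-15")
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), decoded)

original = "π ≈ 3.14 & š €"
stored = to_latin9(original)      # fits in a regular 8-bit String
restored = from_latin9(stored)    # full Unicode again
as_utf8 = restored.encode("utf-8")  # what a web server would send out
```

Because literal ampersands are escaped as well, the round trip gives
back exactly the original text, including text that already looked like
a reference.
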
> What do you think? Do we have a plan?
> 
> Cheers,
> Juan Vuletich
> 

-- 
Janko Mivšek
IT Consultant
Eranova d.o.o.
Ljubljana, Slovenia
www.eranova.si
tel:  01 514 22 55
fax:  01 514 22 56
gsm: 031 674 565