[Cuis] Unicode in Cuis

Thu Feb 7 04:07:22 CST 2013

Hi people!

I just found:
http://www.cprogramming.com/tutorial/unicode.html

where similar pros and cons were discussed

and
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx

Well, AFAIK, UTF-16 was the way adopted by C and others to
represent Unicode strings in memory. According to the first article, a direct
representation of any Unicode character would need 4 bytes. But UTF-16 is
the encoding adopted to:
- avoid wasting space
- keep access to a character easy
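
To make the space trade-off concrete, here is a quick sketch (in Python, only because it is easy to run; the sample strings and the function name are my own, not from the articles) that counts the bytes each encoding needs for the same text:

```python
# Compare how many bytes the same text needs in each Unicode encoding.
# The "-le" codecs are used so that no byte order mark is added to the count.
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Arbitrary sample strings: plain ASCII, Latin with an accent, and CJK.
for sample in ["hello", "año", "日本語"]:
    print(sample, encoded_sizes(sample))
```

For ASCII text UTF-8 is the smallest, for CJK text UTF-16 is, and UTF-32 always costs 4 bytes per character.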

I also found
http://www.evanjones.ca/unicode-in-c.html
Contrary to popular belief, it is possible for a Unicode character to
require multiple 16-bit values.
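
A small sketch of that point (Python again; `utf16_units` is a helper name I made up): a character outside the Basic Multilingual Plane is stored in UTF-16 as a surrogate pair, that is, two 16-bit values.

```python
import struct

def utf16_units(ch):
    """Return the list of 16-bit code units UTF-16 uses for the given text."""
    data = ch.encode("utf-16-le")  # little-endian, no byte order mark
    return [unit for (unit,) in struct.iter_unpack("<H", data)]

# "A" (U+0041) fits in a single 16-bit unit.
print([hex(u) for u in utf16_units("A")])
# U+1D11E (MUSICAL SYMBOL G CLEF) needs a surrogate pair.
print([hex(u) for u in utf16_units("\U0001D11E")])
```

So indexing a UTF-16 string by 16-bit unit is not the same as indexing by character once such characters appear.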

What format do you use for data that goes in and out of your software, and
what format do you use internally?

Then, the author points to another link
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
Inside your software, store text as UTF-8 or UTF-16; that is to say, pick
one of the two and stick with it.

Another interesting post by Tim Bray
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

My guess:

- Cuis Smalltalk could support both internal representations, UTF-8 AND
UTF-16. A WideString class could be written. And, I guess, a WideChar. Their
protocols should be the same as their "byte" counterparts. What am I missing:
what are the problems with this approach? All the problems should be in the
internal implementation of WideString and WideChar, and at the points where
the code has to decide:

- a new string is needed: what kind?
- I have a string object (String or WideString) and need to encode it to
store it outside memory: what encoding?
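
Those two decision points could look roughly like this. It is only a sketch in Python, not Cuis code: the rule "use the byte class when the text fits ISO 8859-15" and both function names are my assumptions, and the class names are returned as plain labels.

```python
def pick_representation(text):
    """Hypothetical rule for "what kind of string?": use the byte-per-character
    class when every character fits ISO 8859-15, else the wide class."""
    try:
        text.encode("iso8859_15")
        return "String"
    except UnicodeEncodeError:
        return "WideString"

def externalize(text, encoding="utf-8"):
    """"What kind of encoding?" is answered only at the boundary, and it does
    not depend on which in-memory class was picked above."""
    return text.encode(encoding)

print(pick_representation("año"))     # fits ISO 8859-15
print(pick_representation("日本語"))  # does not fit
print(externalize("año"))             # UTF-8 bytes for storage or I/O
```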

Yes, it could be a lot of work, but it's the way adopted by many languages,
libraries, and technologies: a clear separation between internal
representation and encoding. The only twist, Smalltalk-like, is the
possibility of having TWO internal representations using Smalltalk
capabilities.

The other way: bite the bullet, ONE internal representation, UTF-16, all
encapsulated in String (I guess it is the strategy adopted by Java and
.NET). And add new methods where needed to indicate the encoding (I guess
in I/O and serialization), with a default encoding (UTF-8?).

I don't understand why web servers should convert Strings back into UTF-8.
Is there no declared encoding for text responses in the HTTP protocol? I read
http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding

AFAIK, everything related to string I/O has its encoding specified in some
way these days.
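
For example, an HTTP response can declare its encoding in the charset parameter of the Content-Type header. A rough sketch (Python standard library; `charset_of` and `build_body` are names I made up, and falling back to UTF-8 when no charset is given is my assumption):

```python
from email.message import Message

def charset_of(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default

def build_body(text, content_type):
    """Encode the response text with whatever charset the header declares."""
    return text.encode(charset_of(content_type))

print(charset_of("text/html; charset=ISO-8859-15"))
print(build_body("año", "text/html; charset=ISO-8859-15"))
print(build_body("año", "text/plain"))  # no charset given, falls back to UTF-8
```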

Angel "Java" Lopez
@ajlopez

On Wed, Feb 6, 2013 at 11:59 PM, Juan Vuletich <juan at jvuletich.org> wrote:

> Hi Folks,
>
> I wasn't able to jump into the recent discussion earlier, but I've been
> thinking a bit about all this. http://en.wikipedia.org/wiki/UTF8 made me
> realize that using UTF-8 internally to represent Unicode strings is a way
> to minimize required changes to existing software.
>
> Hannes, your proposal for representing WideStrings as variableWordSubclass
> is like http://en.wikipedia.org/wiki/UTF-32, right? And if I remember
> correctly, it is what Squeak uses.
>
> I ended up sketching this to compare alternatives:
>
>
> ISO 8859-15
> ========
>    pros:
>    -------
>       - We already have a lot of specific primitives in the VM
>       - Efficient use of memory
>       - Very fast #at: and #at:put:
>    cons:
>    -------
>       - Can only represent Latin alphabets and not full Unicode
>
> Unicode UTF-8 (in a new variableByteSubclass)
>    pros:
>    -------
>       - We could reuse many existing String primitives
>       - It was created to allow existing code to deal with Unicode with
> minimum changes
>       - Efficient use of memory
>    cons:
>    -------
>       - Does not allow #at:put:
>       - #at: is very slow, O(n) instead of O(1)
>
> Unicode UTF-32 (in a new variableWordSubclass)
>    pros:
>    -------
>       - Very fast #at: and #at:put: (although I guess we don't use this
> much... especially #at:put:, as Strings are usually regarded as immutable)
>    cons:
>    --------
>       - very inefficient use of memory (4 bytes for every character!)
>       - doesn't take advantage of existing String primitives (sloooow)
>
>
> I think that for some time, our main Character/String representation
> should be ISO 8859-15. Switching to full Unicode would require a lot of
> work in the paragraph layout and text rendering engines. But we can start
> building a Unicode representation that could be used for webapps.
>
> So I suggest:
>
> - Build a Utf8String variableByteSubclass and Utf8Character. Try to use
> String primitives. Do not include operations such as #at:, to discourage
> their use.
>
> - Make conversion to/from ISO 8859-15 lossless, by a good encoding of
> CodePoints in regular Strings (not unlike Hannes' recent contribution).
>
> - Use this for the Clipboard. This is pretty close to what we have now,
> but we'd allow using an external Unicode textEditor to build any Unicode
> text, and by copying and pasting into a String literal in Cuis code, we'd
> have something that can be converted back into UTF-8 without losing any
> content.
>
> - Web servers should convert back Strings into UTF-8. This would let us
> handle and serve content using full Unicode, without needing to wait until
> our tools can display it properly.
>
> - Use this when editing external files, if they happen to be in UTF-8
> (there are good heuristics for determining this). On save, we offer the
> option to save as UTF-8 or ISO 8859-15.
>
> - We can start adapting the environment to avoid using the String
> protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and related).
> This would ease an eventual migration to using only UTF-8 for everything.
>
> What do you think? Do we have a plan?
>
> Cheers,
> Juan Vuletich