[Cuis] Unicode in Cuis

Thu Feb 7 06:09:33 CST 2013

Why UTF-16?

- Quick access with at: and at:put:, since the most common Unicode characters
map directly to 2 bytes

- If a string contains an out-of-band character, it can easily be marked as special

- This is (AFAIK) the preferred internal representation in .NET, Java, and
most C++ implementations of wchar_t (please confirm). So, by current
standards, memory is not a problem for string representations.

In practice, UTF-16 is plain Unicode in most cases
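
A minimal Python sketch of the point above (Python is used only for illustration, it is not part of Cuis, and the function names are hypothetical). It counts the 16-bit code units UTF-16 needs per character: characters up to U+FFFF take one unit, while characters above U+FFFF take two (a surrogate pair), which is the "special" case a string would have to be marked for.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to store one character in UTF-16."""
    # "utf-16-le" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
    return len(ch.encode("utf-16-le")) // 2


def needs_surrogates(text: str) -> bool:
    """True if any character lies above U+FFFF and so needs a surrogate pair."""
    return any(ord(ch) > 0xFFFF for ch in text)


if __name__ == "__main__":
    # 'A', n-tilde, euro sign, and an emoji outside the 16-bit range.
    for ch in ["A", "\u00f1", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X}: {utf16_units(ch)} code unit(s)")
    print(needs_surrogates("a\u00f1o"), needs_surrogates("hi \U0001F600"))
```

With a check like `needs_surrogates`, a string could keep O(1) at: for the common case and fall back to a slower path only when flagged.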

On Thu, Feb 7, 2013 at 9:02 AM, Juan Vuletich <juan at jvuletich.org> wrote:

> Hi Angel,
>
>
> On 2/7/2013 7:07 AM, Angel Java Lopez wrote:
>
> Hi people!
>
> I just found:
> http://www.cprogramming.com/tutorial/unicode.html
>
>  where similar pros/cons were discussed
>
>  and
>
> http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx
>
>  Well, AFAIK, Unicode UTF-16 was the way adopted by C and others to
> represent Unicode strings in memory. According to the first article, a direct
> representation of any Unicode character should be 4 bytes. But UTF-16 is
> the encoding adopted to:
> - avoid wasting space
> - allow easy access to a character
>
>  I also found
> http://www.evanjones.ca/unicode-in-c.html
> Contrary to popular belief, it is possible for a Unicode character to
> require multiple 16-bit values.
>
>  What format do you use for data that goes in and out of your software,
> and what format do you use internally?
>
>  Then, the author points to another link
> http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
> Inside your software, store text as UTF-8 or UTF-16; that is to say, pick
> one of the two and stick with it.
>
>  Another interesting post by Tim Bray
> http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
>
>
> Thanks for the links.
>
>
>  My guess:
>
>  - Cuis Smalltalk could support both internal representations, UTF-8 AND
> UTF-16. A WideString class could be written. And I guess, a WideChar. Their
> protocols should be the same as their "byte" counterparts. What am I missing:
> what are the problems with this approach? All problems should be in the
> internal implementation of WideString and WideChar, and in the points where
> the code should decide:
>
>  - a new string is needed, what kind?
>
>
> One that can store Unicode. Candidates are UTF-8 and UTF-32.
>
>
>  - I have a string object (String or WideString). I need to encode it to
> store it outside memory. What kind of encoding?
>
>
> That's easy to answer: UTF-8.
>
>
>  Yes, it could be a lot of work, but it's the way adopted by many
> languages, libraries, technologies:
>
>
> I don't mean to sound harsh, but we don't decide on statistics. If we did
> that, we would not be doing Smalltalk :) .
>
>
>  a clear separation between internal representation and encoding. The only
> twist, Smalltalk-like, is the possibility of having TWO internal
> representations using Smalltalk capabilities.
>
>  The other way: bite the bullet, ONE internal representation, UTF-16, all
> encapsulated in String (I guess it is the strategy adopted by Java and
> .NET). And have new methods, where needed, to indicate the encoding (I guess
> in I/O, serialization), with a default encoding
> (UTF-8?)
>
>
> UTF-16? Why? I'd rather choose UTF-8 or UTF-32.
>
>
>  I don't understand why web servers should convert Strings back into
> UTF-8. Is there no encoding for text responses in the HTTP protocol? I read
> http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding
>
>
> Serving text in UTF-8 allows using full Unicode web content, and minimizes
> compatibility risks.
>
>
>  AFAIK, everything related to string I/O has specified encoding in some
> way, these days.
>
>  Angel "Java" Lopez
> @ajlopez
>
>
> Cheers,
> Juan Vuletich
>
>
>
>  On Wed, Feb 6, 2013 at 11:59 PM, Juan Vuletich <juan at jvuletich.org> wrote:
>
>> Hi Folks,
>>
>> I wasn't able to jump into the recent discussion earlier, but I've been
>> thinking a bit about all this. This http://en.wikipedia.org/wiki/UTF8
>> made me realize that using UTF8 internally to represent Unicode strings is
>> a way to minimize required changes to existing software.
>>
>> Hannes, your proposal for representing WideStrings as
>> variableWordSubclass is like http://en.wikipedia.org/wiki/UTF-32, right?
>> And if I remember correctly, it is what Squeak uses.
>>
>> I ended up sketching this to compare alternatives:
>>
>>
>> ISO 8859-15
>> ========
>>    pros:
>>    -------
>>       - We already have a lot of specific primitives in the VM
>>       - Efficient use of memory
>>       - Very fast #at: and #at:put:
>>    cons:
>>    -------
>>       - Can only represent Latin alphabets and not full Unicode
>>
>> Unicode UTF-8 (in a new variableByteSubclass)
>>    pros:
>>    -------
>>       - We could reuse many existing String primitives
>>       - It was created to allow existing code to deal with Unicode with
>> minimum changes
>>       - Efficient use of memory
>>    cons:
>>    -------
>>       - Does not allow #at:put:
>>       - #at: is very slow, O(n) instead of O(1)
>>
>> Unicode UTF-32 (in a new variableWordSubclass)
>>    pros:
>>    -------
>>       - Very fast #at: and #at:put: (although I guess we don't use this
>> much... especially #at:put:, as Strings are usually regarded as immutable)
>>    cons:
>>    --------
>>       - very inefficient use of memory (4 bytes for every character!)
>>       - doesn't take advantage of existing String primitives (sloooow)
>>
>>
>> I think that for some time, our main Character/String representation
>> should be ISO 8859-15. Switching to full Unicode would require a lot of
>> work in the paragraph layout and text rendering engines. But we can start
>> building a Unicode representation that could be used for webapps.
>>
>> So I suggest:
>>
>> - Build a Utf8String variableByteSubclass and Utf8Character. Try to use
>> String primitives. Do not include operations such as #at:, to discourage
>> their use.
>>
>> - Make conversion to/from ISO 8859-15 lossless, by a good codification of
>> CodePoints in regular Strings (not unlike Hannes' recent contribution).
>>
>> - Use this for the Clipboard. This is pretty close to what we have now,
>> but we'd allow using an external Unicode text editor to build any Unicode
>> text, and by copying and pasting into a String literal in Cuis code, we'd
>> have something that can be converted back into UTF-8 without losing any
>> content.
>>
>> - Web servers should convert back Strings into UTF-8. This would let us
>> handle and serve content using full Unicode, without needing to wait until
>> our tools can display it properly.
>>
>> - Use this when editing external files, if they happen to be in UTF-8
>> (there are good heuristics for determining this). On save, we offer the
>> option to save as UTF-8 or ISO 8859-15.
>>
>> - We can start adapting the environment to avoid using the String
>> protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and related).
>> This would ease an eventual migration to using only UTF-8 for everything.
>>
>> What do you think? Do we have a plan?
>>
>> Cheers,
>> Juan Vuletich
>>
>> _______________________________________________
>> Cuis mailing list
>> Cuis at jvuletich.org
>> http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org
>
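
The trade-offs in Juan's quoted comparison (ISO 8859-15 vs. UTF-8 vs. UTF-32) can be sketched in Python, used here only for illustration. The helpers below are hypothetical and are not Cuis or VM code: one reports the byte size of a string in each encoding, and the other two contrast a 1-based character lookup that must scan UTF-8 bytes from the start with a fixed-width UTF-32 lookup that computes the offset directly.

```python
def encoded_sizes(text: str) -> dict:
    """Bytes needed for `text` in each candidate representation.

    Raises UnicodeEncodeError if `text` has characters ISO 8859-15 lacks.
    """
    return {
        "iso-8859-15": len(text.encode("iso-8859-15")),
        "utf-8": len(text.encode("utf-8")),
        "utf-32": len(text.encode("utf-32-le")),  # "-le": no byte-order mark
    }


def utf8_at(data: bytes, index: int) -> str:
    """Character at 1-based `index` in UTF-8 bytes, found by scanning: O(n)."""
    if index < 1:
        raise IndexError(index)
    count = 0
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")  # the requested character was the last


def utf32_at(data: bytes, index: int) -> str:
    """Character at 1-based `index` in UTF-32 bytes, by direct offset: O(1)."""
    offset = (index - 1) * 4
    if index < 1 or offset + 4 > len(data):
        raise IndexError(index)
    return data[offset:offset + 4].decode("utf-32-le")


if __name__ == "__main__":
    sample = "a\u20acb"  # 'a', euro sign, 'b'
    print(encoded_sizes(sample))
    print(utf8_at(sample.encode("utf-8"), 2))
    print(utf32_at(sample.encode("utf-32-le"), 2))
```

An in-place at:put: on the UTF-8 bytes would additionally have to shift the tail whenever the new character's byte length differs from the old one, which is one way to read the "does not allow #at:put:" entry.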