[Cuis] Unicode in Cuis

Juan Vuletich juan at jvuletich.org
Thu Feb 7 06:21:42 CST 2013


On 2/7/2013 9:09 AM, Angel Java Lopez wrote:
> Why UTF-16?
>
> - Quick access with at: and at:put:; the most usual UNICODE characters 
> are directly mapped to 2 bytes
>

Not if you support full Unicode. So, we'd need to add a flag to each 
instance saying "I only use 2-byte chars", to avoid the cost of scanning 
the whole instance every time just to find that out, as that cost would 
kill the "quick access" argument. In any case, and as was just said in 
this thread, the main question to answer is "does this issue really 
matter at all?"
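
Just to make that scanning cost concrete, here is a rough sketch of what 
a surrogate-aware #at: would have to do on a UTF-16 string without such a 
flag (hypothetical method and accessors, not existing Cuis code). It has 
to walk the 16-bit units from the start, so it is O(n):

    codePointAt: index
        "Answer the index-th code point. Characters outside the Basic
        Multilingual Plane take two 16-bit units (a surrogate pair), so we
        cannot simply index by units. Assumes #unitAt: answers the i-th
        16-bit unit and #unitSize the number of units; both hypothetical."
        | unitIndex seen unit |
        unitIndex := 1.
        seen := 0.
        [ unitIndex <= self unitSize ] whileTrue: [
            unit := self unitAt: unitIndex.
            seen := seen + 1.
            (unit between: 16rD800 and: 16rDBFF)
                ifTrue: [
                    "A high surrogate: combine with the following low surrogate"
                    seen = index ifTrue: [
                        ^ 16r10000
                            + (unit - 16rD800 * 16r400)
                            + ((self unitAt: unitIndex + 1) - 16rDC00) ].
                    unitIndex := unitIndex + 2 ]
                ifFalse: [
                    seen = index ifTrue: [ ^ unit ].
                    unitIndex := unitIndex + 1 ] ].
        self error: 'index out of bounds'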

> - If some string contains an out-of-band character, it can be easily marked as special

That's when things start to get messy.

> - This is (AFAIK) the preferred internal representation in .NET, Java, 
> and most C++ implementations of wchar_t (please confirm). So, by current 
> standards, memory is not a problem for string representations.

I _don't_ care about the internal representation others use. I just care 
about which one is best.

> In practice, UTF-16 is plain UNICODE for most cases
>

Cheers,
Juan Vuletich

> On Thu, Feb 7, 2013 at 9:02 AM, Juan Vuletich <juan at jvuletich.org> wrote:
>
>     Hi Angel,
>
>
>     On 2/7/2013 7:07 AM, Angel Java Lopez wrote:
>>     Hi people!
>>
>>     I just found:
>>     http://www.cprogramming.com/tutorial/unicode.html
>>
>>     where a similar pro/cons were discussed
>>
>>     and
>>     http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx
>>
>>     Well, AFAIK, UTF-16 was the way adopted by C and others to
>>     represent Unicode strings in memory. According to the first
>>     article, a direct representation of any Unicode character would
>>     need 4 bytes. But UTF-16 is the encoding adopted to:
>>     - avoid wasting space
>>     - allow easy access to a character
>>
>>     I also found
>>     http://www.evanjones.ca/unicode-in-c.html
>>     Contrary to popular belief, it is possible for a Unicode
>>     character to require multiple 16-bit values.
>>
>>     What format do you use for data that goes in and out of your
>>     software, and what format do you use internally?
>>
>>     Then, the author points to another link
>>     http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
>>     Inside your software, store text as UTF-8 or UTF-16; that is to
>>     say, pick one of the two and stick with it.
>>
>>     Another interesting post by Tim Bray:
>>     http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
>>
>
>     Thanks for the links.
>
>
>>     My guess:
>>
>>     - Cuis Smalltalk could support both internal representations,
>>     UTF-8 AND UTF-16. A WideString class could be written, and I
>>     guess a WideChar too. Their protocols should be the same as their
>>     "byte" counterparts. What am I missing: what are the problems in
>>     this approach? All the problems should be confined to the internal
>>     implementation of WideString and WideChar, and to the points where
>>     the code should decide:
>>
>>     - a new string is needed: what kind?
>
>     One that can store Unicode. Candidates are UTF-8 and UTF-32.
>
>
>>     - I have a string object (String or WideString). I need to
>>     encode it to store it outside of memory. What encoding?
>
>     That's easy to answer: UTF-8.
>
>
>>     Yes, it could be a lot of work, but it's the way adopted by many
>>     languages, libraries, technologies:
>
>     I don't mean to sound harsh, but we don't decide based on
>     statistics. If we did, we would not be doing Smalltalk :).
>
>
>>     a clear separation between internal representation and encoding.
>>     The only twist, Smalltalk-like, is the possibility of having TWO
>>     internal representations, using Smalltalk capabilities.
>>
>>     The other way: bite the bullet, ONE internal representation,
>>     UTF-16, all encapsulated in String (I guess that is the strategy
>>     adopted by Java and .NET). And add new methods to indicate the
>>     encoding where it is needed (I guess in I/O and serialization),
>>     with a default encoding (UTF-8?).
>>
>
>     UTF-16? Why? I'd rather choose UTF-8 or UTF-32.
>
>
>>     I don't understand why web servers should convert Strings back
>>     into UTF-8. Is there no encoding for text responses in the HTTP
>>     protocol? I read
>>     http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding
>>
>
>     Serving text in UTF-8 allows using full Unicode web content, and
>     minimizes compatibility risks.
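>
>     Just as an illustration (these lines are not tied to any particular
>     Cuis web framework), the encoding can be declared both in the HTTP
>     response header and in the HTML document itself:
>
>         Content-Type: text/html; charset=utf-8
>
>         <meta charset="utf-8">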
>
>
>>     AFAIK, everything related to string I/O has a specified encoding
>>     in some way these days.
>>
>>     Angel "Java" Lopez
>>     @ajlopez
>>
>
>     Cheers,
>     Juan Vuletich
>
>
>>
>>     On Wed, Feb 6, 2013 at 11:59 PM, Juan Vuletich
>>     <juan at jvuletich.org> wrote:
>>
>>         Hi Folks,
>>
>>         I wasn't able to jump into the recent discussion before, but
>>         I've been thinking a bit about all this. This
>>         http://en.wikipedia.org/wiki/UTF8 made me realize that using
>>         UTF-8 internally to represent Unicode strings is a way to
>>         minimize required changes to existing software.
>>
>>         Hannes, your proposal for representing WideStrings as a
>>         variableWordSubclass is like
>>         http://en.wikipedia.org/wiki/UTF-32, right? And if I remember
>>         correctly, it is what Squeak uses.
>>
>>         I ended up sketching this to compare alternatives:
>>
>>
>>         ISO 8859-15
>>         ========
>>            pros:
>>            -------
>>               - We already have a lot of specific primitives in the VM
>>               - Efficient use of memory
>>               - Very fast #at: and #at:put:
>>            cons:
>>            -------
>>               - Can only represent Latin alphabets and not full Unicode
>>
>>         Unicode UTF-8 (in a new variableByteSubclass)
>>            pros:
>>            -------
>>               - We could reuse many existing String primitives
>>               - It was created to allow existing code deal with
>>         Unicode with minimum changes
>>               - Efficient use of memory
>>            cons:
>>            -------
>>               - Does not allow #at:put:
>>               - #at: is very slow, O(n) instead of O(1) (see the
>>         sketch after this comparison)
>>
>>         Unicode UTF-32 (in a new variableWordSubclass)
>>            pros:
>>            -------
>>               - Very fast #at: and #at:put: (although I guess we don't
>>         use this much... especially #at:put:, as Strings are usually
>>         regarded as immutable)
>>            cons:
>>            --------
>>               - Very inefficient use of memory (4 bytes for every
>>         character!)
>>               - Doesn't take advantage of existing String primitives
>>         (sloooow)
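>>
>>         As a rough sketch of why #at: is O(n) with UTF-8 (a hypothetical
>>         method for the proposed Utf8String, not existing code): to reach
>>         the n-th code point you have to skip over all the preceding ones,
>>         because each code point takes 1 to 4 bytes.
>>
>>             byteIndexOfCodePointAt: index
>>                 "Answer the position of the first byte of the index-th code
>>                 point. Continuation bytes (of the form 2r10xxxxxx) are
>>                 skipped; #basicSize answers the number of bytes."
>>                 | count |
>>                 count := 0.
>>                 1 to: self basicSize do: [ :byteIndex |
>>                     ((self basicAt: byteIndex) bitAnd: 2r11000000) = 2r10000000
>>                         ifFalse: [
>>                             count := count + 1.
>>                             count = index ifTrue: [ ^ byteIndex ] ] ].
>>                 self error: 'index out of bounds'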
>>
>>
>>         I think that for some time, our main Character/String
>>         representation should be ISO 8859-15. Switching to full
>>         Unicode would require a lot of work in the paragraph layout
>>         and text rendering engines. But we can start building a
>>         Unicode representation that could be used for webapps.
>>
>>         So I suggest:
>>
>>         - Build a Utf8String variableByteSubclass and a Utf8Character.
>>         Try to reuse String primitives. Do not include operations such
>>         as #at:, to discourage their use (see the sketch after this
>>         list).
>>
>>         - Make conversion to/from ISO 8859-15 lossless, by a good
>>         encoding of code points in regular Strings (not unlike
>>         Hannes' recent contribution).
>>
>>         - Use this for the Clipboard. This is pretty close to what we
>>         have now, but we'd allow using an external Unicode text editor
>>         to build any Unicode text, and by copying and pasting into a
>>         String literal in Cuis code, we'd have something that can be
>>         converted back into UTF-8 without losing any content.
>>
>>         - Web servers should convert Strings back into UTF-8. This
>>         would let us handle and serve content using full Unicode,
>>         without needing to wait until our tools can display it properly.
>>
>>         - Use this when editing external files, if they happen to be
>>         in UTF-8 (there are good heuristics for determining this). On
>>         save, we offer the option to save as UTF-8 or ISO 8859-15.
>>
>>         - We can start adapting the environment to avoid using the
>>         String protocols that are a problem for UTF-8 (i.e. #at:,
>>         #at:put: and related). This would ease an eventual migration
>>         to using only UTF-8 for everything.
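>>
>>         Just so we are talking about the same thing, here is a minimal
>>         sketch of what I have in mind (superclass, class names and
>>         category are only placeholders):
>>
>>             String variableByteSubclass: #Utf8String
>>                 instanceVariableNames: ''
>>                 classVariableNames: ''
>>                 poolDictionaries: ''
>>                 category: 'Kernel-Text'
>>
>>         And the heart of the encoder is just the standard UTF-8 byte
>>         layout, e.g. something like this when writing code points onto a
>>         ByteArray write stream:
>>
>>             encodeCodePoint: codePoint on: aStream
>>                 "Append the UTF-8 bytes for codePoint (an Integer) to aStream."
>>                 codePoint < 16r80 ifTrue: [ ^ aStream nextPut: codePoint ].
>>                 codePoint < 16r800 ifTrue: [
>>                     aStream nextPut: 16rC0 + (codePoint bitShift: -6).
>>                     ^ aStream nextPut: 16r80 + (codePoint bitAnd: 16r3F) ].
>>                 codePoint < 16r10000 ifTrue: [
>>                     aStream nextPut: 16rE0 + (codePoint bitShift: -12).
>>                     aStream nextPut: 16r80 + ((codePoint bitShift: -6) bitAnd: 16r3F).
>>                     ^ aStream nextPut: 16r80 + (codePoint bitAnd: 16r3F) ].
>>                 aStream nextPut: 16rF0 + (codePoint bitShift: -18).
>>                 aStream nextPut: 16r80 + ((codePoint bitShift: -12) bitAnd: 16r3F).
>>                 aStream nextPut: 16r80 + ((codePoint bitShift: -6) bitAnd: 16r3F).
>>                 ^ aStream nextPut: 16r80 + (codePoint bitAnd: 16r3F)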
>>
>>         What do you think? Do we have a plan?
>>
>>         Cheers,
>>         Juan Vuletich
>>
>>
>>
>>
>
>
>
>
>
