[Cuis] Unicode in Cuis

Juan Vuletich juan at jvuletich.org
Thu Feb 7 06:02:51 CST 2013


Hi Angel,

On 2/7/2013 7:07 AM, Angel Java Lopez wrote:
> Hi people!
>
> I just found:
> http://www.cprogramming.com/tutorial/unicode.html
>
> where similar pros and cons were discussed
>
> and
> http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx 
>
> Well, AFAIK, UTF-16 was the encoding adopted by C and others to 
> represent Unicode strings in memory. According to the first article, a 
> direct representation of any Unicode character would take 4 bytes. But 
> UTF-16 is the encoding adopted to:
> - avoid wasting space
> - allow easy access to a character
>
> I found too
> http://www.evanjones.ca/unicode-in-c.html
> Contrary to popular belief, it is possible for a Unicode character to 
> require multiple 16-bit values.
>
> What format do you use for data that goes in and out of your software, 
> and what format do you use internally?
>
> Then, the author points to another link
> http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
> Inside your software, store text as UTF-8 or UTF-16; that is to say, 
> pick one of the two and stick with it.
>
> Another interesting post by Tim Bray
> http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
>

Thanks for the links.

> My guess:
>
> - Cuis Smalltalk could support both internal representations, UTF-8 AND 
> UTF-16. A WideString class could be written. And, I guess, a WideChar. 
> Their protocols should be the same as their "byte" counterparts. What 
> am I missing: what are the problems with this approach? All problems 
> should be in the internal implementation of WideString and WideChar, and 
> in the points where the code should decide:
>
> - a new string is needed: what kind?

One that can store Unicode. Candidates are UTF-8 and UTF-32.
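
To make the trade-off between these two candidates concrete, here is a minimal sketch in Python (not Cuis code). The helper names `utf8_char_at` and `utf32_char_at` are hypothetical, chosen only for illustration, and the input is assumed to be valid. It shows that finding the n-th character in UTF-8 needs a scan from the start, while UTF-32 allows direct indexing at the cost of four bytes per character.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at `index` (0-based) in UTF-8 `data`: O(n) scan."""
    if index < 0:
        raise IndexError(index)
    count = -1    # index of the most recently seen character
    start = None  # byte offset where the requested character begins
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other
        # byte starts a new character.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")


def utf32_char_at(data: bytes, index: int) -> str:
    """Return the character at `index` in big-endian UTF-32 `data`: O(1)."""
    if index < 0 or 4 * index + 4 > len(data):
        raise IndexError(index)
    return data[4 * index:4 * index + 4].decode("utf-32-be")


text = "año €"
utf8 = text.encode("utf-8")       # 1 to 3 bytes per character here
utf32 = text.encode("utf-32-be")  # always 4 bytes per character
```

For this five-character sample, `len(utf8)` is 8 and `len(utf32)` is 20, which is the memory versus indexing-speed trade-off in miniature.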

> - I have a string object (String or WideString). I need to encode it to 
> store it outside memory. What kind of encoding?

That's easy to answer: UTF-8.

> Yes, it could be a lot of work, but it's the way adopted by many 
> languages, libraries, technologies:

I don't mean to sound harsh, but we don't decide on statistics. If we 
did that, we would not be doing Smalltalk :) .

> a clear separation between internal representation and encoding. The only 
> twist, Smalltalk-like, is the possibility of having TWO internal 
> representations using Smalltalk capabilities.
>
> The other way: bite the bullet, ONE internal representation, UTF-16, 
> all encapsulated in String (I guess it is the strategy adopted by Java 
> and .NET). And have new methods to indicate the encoding where needed 
> (I guess in I/O and serialization), with a default 
> encoding (UTF-8?)
>

UTF-16? Why? I'd rather choose UTF-8 or UTF-32.
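
One concrete reason to be wary of UTF-16 is that it is also a variable-width encoding: code points above U+FFFF take two 16-bit units (a surrogate pair), so it does not give constant-time access by character either. A minimal sketch in Python (not Cuis code), where the helper name `utf16_units` is hypothetical:

```python
def utf16_units(text: str) -> list:
    """Return the 16-bit code units of `text` encoded as big-endian UTF-16."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


bmp = "\u20ac"         # U+20AC EURO SIGN, inside the Basic Multilingual Plane
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

bmp_units = utf16_units(bmp)        # a single code unit
astral_units = utf16_units(astral)  # two code units: a surrogate pair
```

So UTF-16 uses more memory than UTF-8 for mostly-ASCII text and still needs a scan to index by character once text leaves the Basic Multilingual Plane.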

> I don't understand why web servers should convert Strings back into 
> UTF-8.  Is there no encoding for text responses in the HTTP protocol? I read 
> http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding 
>

Serving text in UTF-8 allows using full Unicode web content, and 
minimizes compatibility risks.
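
As a rough illustration of that conversion step, here is a sketch in Python (not Cuis code). The function name `latin9_to_utf8_response` and the sample body are assumptions made for this example. It re-encodes ISO 8859-15 bytes as UTF-8 and declares the charset in the `Content-Type` header, so the client does not have to guess the encoding.

```python
def latin9_to_utf8_response(latin9_body: bytes):
    """Re-encode an ISO 8859-15 body as UTF-8 and build matching headers."""
    text = latin9_body.decode("iso8859-15")
    body = text.encode("utf-8")
    headers = {
        # Declare the encoding explicitly in the response.
        "Content-Type": "text/html; charset=utf-8",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


# 0xA4 is the euro sign in ISO 8859-15; it becomes three bytes in UTF-8.
headers, body = latin9_to_utf8_response(b"Precio: 10 \xa4")
```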

> AFAIK, everything related to string I/O has its encoding specified in 
> some way, these days.
>
> Angel "Java" Lopez
> @ajlopez
>

Cheers,
Juan Vuletich

>
> On Wed, Feb 6, 2013 at 11:59 PM, Juan Vuletich <juan at jvuletich.org> wrote:
>
>     Hi Folks,
>
>     I wasn't able to jump into the recent discussion earlier, but I've
>     been thinking a bit about all this. This
>     http://en.wikipedia.org/wiki/UTF8 made me realize that using UTF8
>     internally to represent Unicode strings is a way to minimize
>     required changes to existing software.
>
>     Hannes, your proposal for representing WideStrings as
>     variableWordSubclass is like http://en.wikipedia.org/wiki/UTF-32,
>     right? And if I remember correctly, it is what Squeak uses.
>
>     I ended up sketching this to compare alternatives:
>
>
>     ISO 8859-15
>     ========
>        pros:
>        -------
>           - We already have a lot of specific primitives in the VM
>           - Efficient use of memory
>           - Very fast #at: and #at:put:
>        cons:
>        -------
>           - Can only represent Latin alphabets and not full Unicode
>
>     Unicode UTF-8 (in a new variableByteSubclass)
>        pros:
>        -------
>           - We could reuse many existing String primitives
>           - It was created to allow existing code to deal with Unicode
>     with minimum changes
>           - Efficient use of memory
>        cons:
>        -------
>           - Does not allow #at:put:
>           - #at: is very slow, O(n) instead of O(1)
>
>     Unicode UTF-32 (in a new variableWordSubclass)
>        pros:
>        -------
>           - Very fast #at: and #at:put: (although I guess we don't use
>     this much... especially #at:put:, as Strings are usually regarded
>     as immutable)
>        cons:
>        --------
>           - very inefficient use of memory (4 bytes for every character!)
>           - doesn't take advantage of existing String primitives (sloooow)
>
>
>     I think that for some time, our main Character/String
>     representation should be ISO 8859-15. Switching to full Unicode
>     would require a lot of work in the paragraph layout and text
>     rendering engines. But we can start building a Unicode
>     representation that could be used for webapps.
>
>     So I suggest:
>
>     - Build a Utf8String variableByteSubclass and Utf8Character. Try
>     to use String primitives. Do not include operations such as #at:,
>     to discourage their use.
>
>     - Make conversion to/from ISO 8859-15 lossless, by a good
>     codification of CodePoints in regular Strings (not unlike Hannes'
>     recent contribution).
>
>     - Use this for the Clipboard. This is pretty close to what we have
>     now, but we'd allow using an external Unicode textEditor to build
>     any Unicode text, and by copying and pasting into a String literal
>     in Cuis code, we'd have something that can be converted back into
>     UTF-8 without losing any content.
>
>     - Web servers should convert back Strings into UTF-8. This would
>     let us handle and serve content using full Unicode, without
>     needing to wait until our tools can display it properly.
>
>     - Use this when editing external files, if they happen to be in
>     UTF-8 (there are good heuristics for determining this). On save,
>     we offer the option to save as UTF-8 or ISO 8859-15.
>
>     - We can start adapting the environment to avoid using the String
>     protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and
>     related). This would ease an eventual migration to using only
>     UTF-8 for everything.
>
>     What do you think? Do we have a plan?
>
>     Cheers,
>     Juan Vuletich
>
>     _______________________________________________
>     Cuis mailing list
>     Cuis at jvuletich.org
>     http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org
>
