[Cuis] Unicode in Cuis
Juan Vuletich
juan at jvuletich.org
Thu Feb 7 06:02:51 CST 2013
Hi Angel,
On 2/7/2013 7:07 AM, Angel Java Lopez wrote:
> Hi people!
>
> I just found:
> http://www.cprogramming.com/tutorial/unicode.html
>
> where similar pros and cons are discussed
>
> and
> http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx
>
> Well, AFAIK, UTF-16 was the way adopted by C and others to
> represent Unicode strings in memory. According to the first article, a
> direct representation of any Unicode character would need 4 bytes. But
> UTF-16 is the encoding adopted to:
> - avoid wasting space
> - allow easy access to a character
>
> I also found
> http://www.evanjones.ca/unicode-in-c.html
> Contrary to popular belief, it is possible for a Unicode character to
> require multiple 16-bit values.
>
> What format do you use for data that goes in and out of your software,
> and what format do you use internally?
>
> Then, the author points to another link
> http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
> Inside your software, store text as UTF-8 or UTF-16; that is to say,
> pick one of the two and stick with it.
>
> Another interesting post by Tim Bray
> http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
>
Thanks for the links.
> My guess:
>
> - Cuis Smalltalk could support both internal representations, UTF-8 AND
> UTF-16. A WideString class could be written. And, I guess, a WideChar.
> Their protocols should be the same as their "byte" counterparts. What
> am I missing: what are the problems with this approach? All problems should
> be in the internal implementation of WideString and WideChar, and in
> the points where the code should decide:
>
> - a new string is needed, what kind?
One that can store Unicode. Candidates are UTF-8 and UTF-32.
> - I have a string object (String or WideString). I need to encode it to
> store it outside memory. What kind of encoding?
That's easy to answer: UTF-8.
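
To make the trade-off between the two internal candidates concrete, here is a minimal sketch in Python (chosen only because it is easy to run; this is not Cuis code, and the helper names `utf8_at` and `utf32_at` are made up for illustration). It assumes UTF-32 is stored little-endian without a BOM. UTF-32 allows direct indexing at 4 bytes per character, while UTF-8 is compact but has to scan from the start to find the n-th character.

```python
def utf8_at(data: bytes, index: int) -> str:
    """Return the character at 0-based position `index` in UTF-8 `data`.

    Hypothetical helper: it walks the bytes from the start, so it is O(n).
    """
    if index < 0:
        raise IndexError(index)
    seen = -1      # index of the last character whose lead byte was passed
    start = None   # byte offset where the wanted character begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new character starts
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:      # reached the next character: slice is complete
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")  # wanted character was the last one


def utf32_at(data: bytes, index: int) -> str:
    """Return the character at `index` in UTF-32 LE `data`: O(1), fixed 4-byte slots."""
    if index < 0 or 4 * index + 4 > len(data):
        raise IndexError(index)
    return data[4 * index:4 * index + 4].decode("utf-32-le")


# "a", n-with-tilde, "o", space, euro sign, "5": six characters
text = "a\u00f1o \u20ac5"
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32-le")
print(len(text), len(as_utf8), len(as_utf32))  # 6 9 24
```

Both helpers return the same character for the same index; only the cost of getting there and the memory footprint differ.
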
> Yes, it could be a lot of work, but it's the way adopted by many
> languages, libraries, technologies:
I don't mean to sound harsh, but we don't decide on statistics. If we
did that, we would not be doing Smalltalk :) .
> a clear separation between internal representation and encoding. The only
> twist, Smalltalk-like, is the possibility of having TWO internal
> representations using Smalltalk capabilities.
>
> The other way: bite the bullet, ONE internal representation, UTF-16,
> all encapsulated in String (I guess it is the strategy adopted by Java
> and .NET). And have new methods to indicate the encoding where needed
> (I guess in I/O and serialization), with a default
> encoding (UTF-8?)
>
UTF-16? Why? I'd rather choose UTF-8 or UTF-32.
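
A short Python sketch of why UTF-16 sits awkwardly between the two (an illustration only, not Cuis code; the function name `code_units` is made up): characters outside the Basic Multilingual Plane need two 16-bit units in UTF-16, so it is not fixed-width like UTF-32, and for ASCII-heavy text it is not as compact as UTF-8.

```python
def code_units(ch: str) -> dict:
    """Number of code units needed for one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }


samples = {
    "LATIN SMALL LETTER A": "\u0061",
    "EURO SIGN": "\u20ac",
    "MUSICAL SYMBOL G CLEF": "\U0001d11e",  # outside the BMP
}
for name, ch in samples.items():
    print(name, code_units(ch))
```

The G clef takes four units in UTF-8, two in UTF-16 (a surrogate pair) and one in UTF-32, so code that indexes UTF-16 by 16-bit unit has the same variable-width problem as UTF-8.
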
> I don't understand why web servers should convert Strings back into
> UTF-8. Is there no encoding for text responses in the HTTP protocol? I read
> http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding
>
Serving text in UTF-8 allows using full Unicode web content, and
minimizes compatibility risks.
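
As a rough sketch of what that conversion could look like on the server side (in Python rather than Cuis, and assuming the in-image String holds plain ISO 8859-15 bytes; the function name `utf8_response` is hypothetical):

```python
def utf8_response(latin9_bytes: bytes) -> tuple:
    """Re-encode ISO 8859-15 text as UTF-8 and declare the charset.

    Hypothetical helper for illustration; returns (headers, body).
    """
    text = latin9_bytes.decode("iso-8859-15")
    body = text.encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),  # length in bytes, not characters
    }
    return headers, body


# 0xE9 is e-acute; 0xA4 is the euro sign in ISO 8859-15
# (in ISO 8859-1 the same byte is the generic currency sign).
headers, body = utf8_response(b"caf\xe9 5\xa4")
print(headers["Content-Type"], headers["Content-Length"])
```

The sketch only covers characters that ISO 8859-15 can hold directly. Any code points escaped inside a regular String, as in the plan quoted below, would need to be expanded before this step; that part is not shown.
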
> AFAIK, everything related to string I/O has specified encoding in some
> way, these days.
>
> Angel "Java" Lopez
> @ajlopez
>
Cheers,
Juan Vuletich
>
> On Wed, Feb 6, 2013 at 11:59 PM, Juan Vuletich <juan at jvuletich.org> wrote:
>
> Hi Folks,
>
> I wasn't able to jump into the recent discussion earlier, but I've
> been thinking a bit about all this. This
> http://en.wikipedia.org/wiki/UTF8 made me realize that using UTF-8
> internally to represent Unicode strings is a way to minimize
> required changes to existing software.
>
> Hannes, your proposal for representing WideStrings as
> variableWordSubclass is like http://en.wikipedia.org/wiki/UTF-32,
> right? And if I remember correctly, it is what Squeak uses.
>
> I ended up sketching this to compare alternatives:
>
>
> ISO 8859-15
> ===========
> pros:
> -----
> - We already have a lot of specific primitives in the VM
> - Efficient use of memory
> - Very fast #at: and #at:put:
> cons:
> -----
> - Can only represent Latin alphabets and not full Unicode
>
> Unicode UTF-8 (in a new variableByteSubclass)
> =============================================
> pros:
> -----
> - We could reuse many existing String primitives
> - It was created to allow existing code to deal with Unicode
> with minimum changes
> - Efficient use of memory
> cons:
> -----
> - Does not allow #at:put:
> - #at: is very slow, O(n) instead of O(1)
>
> Unicode UTF-32 (in a new variableWordSubclass)
> ==============================================
> pros:
> -----
> - Very fast #at: and #at:put: (although I guess we don't use
> this much... especially #at:put:, as Strings are usually regarded
> as immutable)
> cons:
> -----
> - Very inefficient use of memory (4 bytes for every character!)
> - Doesn't take advantage of existing String primitives (sloooow)
>
>
> I think that for some time, our main Character/String
> representation should be ISO 8859-15. Switching to full Unicode
> would require a lot of work in the paragraph layout and text
> rendering engines. But we can start building a Unicode
> representation that could be used for webapps.
>
> So I suggest:
>
> - Build a Utf8String variableByteSubclass and Utf8Character. Try
> to use String primitives. Do not include operations such as #at:,
> to discourage their use.
>
> - Make conversion to/from ISO 8859-15 lossless, by a good
> encoding of CodePoints in regular Strings (not unlike Hannes'
> recent contribution).
>
> - Use this for the Clipboard. This is pretty close to what we have
> now, but we'd allow using an external Unicode text editor to build
> any Unicode text, and by copying and pasting into a String literal
> in Cuis code, we'd have something that can be converted back into
> UTF-8 without losing any content.
>
> - Web servers should convert back Strings into UTF-8. This would
> let us handle and serve content using full Unicode, without
> needing to wait until our tools can display it properly.
>
> - Use this when editing external files, if they happen to be in
> UTF-8 (there are good heuristics for determining this). On save,
> we offer the option to save as UTF-8 or ISO 8859-15.
>
> - We can start adapting the environment to avoid using the String
> protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and
> related). This would ease an eventual migration to using only
> UTF-8 for everything.
>
> What do you think? Do we have a plan?
>
> Cheers,
> Juan Vuletich
>
> _______________________________________________
> Cuis mailing list
> Cuis at jvuletich.org
> http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org