[Cuis] Unicode in Cuis

Angel Java Lopez ajlopez2000 at gmail.com
Thu Feb 7 06:33:38 CST 2013


So, forget full Unicode, and support only the first plane (the Basic
Multilingual Plane). Baby steps. One internal representation that covers
>9x% of cases. If it were my personal project, I would follow this path.
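
A minimal sketch of what such a single representation could look like,
written in Python rather than Smalltalk so it can be run anywhere. The class
name BmpString and its at/at_put methods are hypothetical, chosen only to
mirror #at: and #at:put:; none of this is Cuis code. Every character takes
one 16-bit unit, so indexed access stays O(1), and anything outside the first
plane is rejected.

```python
from array import array


class BmpString:
    """Hypothetical fixed-width string limited to the first Unicode plane."""

    def __init__(self, text):
        for ch in text:
            self._check(ch)
        # 'H' is an unsigned 16-bit item on mainstream platforms.
        self._units = array("H", (ord(ch) for ch in text))

    @staticmethod
    def _check(ch):
        if ord(ch) > 0xFFFF:
            raise ValueError("U+%X is outside the first plane" % ord(ch))

    def __len__(self):
        return len(self._units)

    def at(self, index):
        """1-based read, like Smalltalk's #at:. Constant time."""
        return chr(self._units[index - 1])

    def at_put(self, index, ch):
        """1-based write, like Smalltalk's #at:put:. Constant time."""
        self._check(ch)
        self._units[index - 1] = ord(ch)
```

For example, BmpString("a\u00f1\u20ac").at(3) returns the euro sign, while
BmpString("\U0001F600") raises ValueError because that code point lies
outside the first plane.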

But if you want to go full Unicode, I don't see the scan problem: usually,
you MUST scan the original string anyway, because it arrives encoded in some
way. Once that is done, you know whether its UTF-16 form contains 4-byte
characters (surrogate pairs) or not. You must then decide whether the
internal representation is UTF-16 marked as special, or plain Unicode.
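
As a rough illustration of that decision, here is a Python sketch (not Cuis
code; the function name load_string and the layout labels "fixed16" and
"fixed32" are made up for the example). Decoding the external bytes has to
visit every code point anyway, so checking for characters beyond the first
plane at load time adds little extra cost, and the result can be remembered
instead of rescanning on every access.

```python
from array import array


def load_string(raw, encoding="utf-8"):
    """Decode external bytes and pick an internal layout at load time.

    Returns (layout, units). The layout is "fixed16" when every code point
    fits in one 16-bit unit, so no surrogate pairs would be needed, and
    "fixed32" otherwise.
    """
    code_points = [ord(ch) for ch in raw.decode(encoding)]
    if all(cp <= 0xFFFF for cp in code_points):
        # Assumption: 'H' items are 16 bits wide (true on mainstream platforms).
        return "fixed16", array("H", code_points)
    # Assumption: 'I' items are 32 bits wide (true on mainstream platforms).
    return "fixed32", array("I", code_points)
```

With this sketch, a string made only of first-plane characters comes back as
"fixed16", and one containing U+1F600 comes back as "fixed32". The
alternative mentioned above, keeping UTF-16 and flagging the instance as
special, would store the same scan result as a flag instead of switching
layouts.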

On Thu, Feb 7, 2013 at 9:21 AM, Juan Vuletich <juan at jvuletich.org> wrote:

> On 2/7/2013 9:09 AM, Angel Java Lopez wrote:
>
> Why UTF-16?
>
>  - Quick access, at:, at:put:, the most usual UNICODE characters are
> directly mapped to 2 bytes
>
>
> Not if you support full Unicode. So, we'd need to add a flag to each
> instance to say "I only use 2 byte chars", to avoid the cost of scanning
> the whole instance every time, to know, as that would kill the "quick
> access" argument. In any case, and as just said in this thread, the main
> question to answer is "does this issue really matter at all?"
>
>
>  - If some string has an out-of-band character, it can be easily marked as special
>
>
> That's when things start to get messy.
>
>
>  - This is (AFAIK) the preferred internal representation in .NET, Java,
> and most C++ implementations of wchar_t (please confirm). So, in current
> standards, memory is not a problem for string representations.
>
>
> I _don't_ care about the internal representation others use. I just care
> about which one is best.
>
>
>  In practice, UTF-16 is plain UNICODE for most cases
>
>
> Cheers,
> Juan Vuletich
>
>
>  On Thu, Feb 7, 2013 at 9:02 AM, Juan Vuletich <juan at jvuletich.org> wrote:
>
>>  Hi Angel,
>>
>>
>> On 2/7/2013 7:07 AM, Angel Java Lopez wrote:
>>
>> Hi people!
>>
>> I just found:
>> http://www.cprogramming.com/tutorial/unicode.html
>>
>>  where similar pros/cons were discussed
>>
>>  and
>>
>> http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx
>>
>>  Well, AFAIK, Unicode UTF-16 was the way adopted by C and others to
>> represent Unicode strings in memory. According to the first article, a direct
>> representation of any Unicode character should be 4 bytes. But UTF-16 is
>> the encoding adopted to:
>> - avoid wasting space
>> - allow easy access to a character
>>
>>  I also found
>> http://www.evanjones.ca/unicode-in-c.html
>> Contrary to popular belief, it is possible for a Unicode character to
>> require multiple 16-bit values.
>>
>>  What format do you use for data that goes in and out of your software,
>> and what format do you use internally?
>>
>>  Then, the author points to another link
>> http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
>> Inside your software, store text as UTF-8 or UTF-16; that is to say, pick
>> one of the two and stick with it.
>>
>>  Another interesting post by Tim Bray
>> http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
>>
>>
>>  Thanks for the links.
>>
>>
>>  My guess:
>>
>>  - Cuis Smalltalk could support both internal representations, UTF-8 AND
>> UTF-16. A WideString class could be written. And, I guess, a WideChar. Their
>> protocols should be the same as their "byte" counterparts. What am I missing:
>> what are the problems in this approach? All problems should be in the
>> internal implementation of WideString and WideChar, and in the points where
>> the code should decide:
>>
>>  - a new string is needed, of what kind?
>>
>>
>>  One that can store Unicode. Candidates are UTF-8 and UTF-32.
>>
>>
>>  - I have a string object (String or WideString). I need to encode it to
>> store it outside memory. With what kind of encoding?
>>
>>
>>  That's easy to answer: UTF-8.
>>
>>
>>  Yes, it could be a lot of work, but it's the way adopted by many
>> languages, libraries, technologies:
>>
>>
>>  I don't mean to sound harsh, but we don't decide on statistics. If we
>> did that, we would not be doing Smalltalk :) .
>>
>>
>>  a clear separation between internal representation and encoding. The only
>> twist, Smalltalk-like, is the possibility of having TWO internal
>> representations using Smalltalk capabilities.
>>
>>  The other way: bite the bullet, ONE internal representation, UTF-16,
>> all encapsulated in String (I guess it is the strategy adopted by Java and
>> .NET). And have new methods, where needed, to indicate the encoding (I guess
>> in I/O, serialization), with a default encoding
>> (UTF-8?)
>>
>>
>>  UTF-16? Why? I'd rather choose UTF-8 or UTF-32.
>>
>>
>>  I don't understand why web servers should convert Strings back into
>> UTF-8.  Is there no encoding for text responses in the HTTP protocol? I read
>> http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding
>>
>>
>>  Serving text in UTF-8 allows using full Unicode web content, and
>> minimizes compatibility risks.
>>
>>
>>  AFAIK, everything related to string I/O has a specified encoding in some
>> way, these days.
>>
>>  Angel "Java" Lopez
>> @ajlopez
>>
>>
>>  Cheers,
>> Juan Vuletich
>>
>>
>>
>>  On Wed, Feb 6, 2013 at 11:59 PM, Juan Vuletich <juan at jvuletich.org> wrote:
>>
>>> Hi Folks,
>>>
>>> I wasn't able to jump into the recent discussion before, but I've been
>>> thinking a bit about all this. This http://en.wikipedia.org/wiki/UTF8
>>> made me realize that using UTF-8 internally to represent Unicode strings is
>>> a way to minimize required changes to existing software.
>>>
>>> Hannes, your proposal for representing WideStrings as
>>> variableWordSubclass is like http://en.wikipedia.org/wiki/UTF-32,
>>> right? And if I remember correctly, it is what Squeak uses.
>>>
>>> I ended up sketching this to compare alternatives:
>>>
>>>
>>> ISO 8859-15
>>> ========
>>>    pros:
>>>    -------
>>>       - We already have a lot of specific primitives in the VM
>>>       - Efficient use of memory
>>>       - Very fast #at: and #at:put:
>>>    cons:
>>>    -------
>>>       - Can only represent Latin alphabets and not full Unicode
>>>
>>> Unicode UTF-8 (in a new variableByteSubclass)
>>> ========
>>>    pros:
>>>    -------
>>>       - We could reuse many existing String primitives
>>>       - It was created to allow existing code deal with Unicode with
>>> minimum changes
>>>       - Efficient use of memory
>>>    cons:
>>>    -------
>>>       - Does not allow #at:put:
>>>       - #at: is very slow, O(n) instead of O(1)
>>>
>>> Unicode UTF-32 (in a new variableWordSubclass)
>>> ========
>>>    pros:
>>>    -------
>>>       - Very fast #at: and #at:put: (although I guess we don't use this
>>> much... especially #at:put:, as Strings are usually regarded as immutable)
>>>    cons:
>>>    --------
>>>       - very inefficient use of memory (4 bytes for every character!)
>>>       - doesn't take advantage of existing String primitives (sloooow)
>>>
>>>
>>> I think that for some time, our main Character/String representation
>>> should be ISO 8859-15. Switching to full Unicode would require a lot of
>>> work in the paragraph layout and text rendering engines. But we can start
>>> building a Unicode representation that could be used for webapps.
>>>
>>> So I suggest:
>>>
>>> - Build a Utf8String variableByteSubclass and Utf8Character. Try to use
>>> String primitives. Do not include operations such as #at:, to discourage
>>> their use.
>>>
>>> - Make conversion to/from ISO 8859-15 lossless, by a good codification
>>> of CodePoints in regular Strings (not unlike Hannes' recent contribution).
>>>
>>> - Use this for the Clipboard. This is pretty close to what we have now,
>>> but we'd allow using an external Unicode textEditor to build any Unicode
>>> text, and by copying and pasting into a String literal in Cuis code, we'd
>>> have something that can be converted back into UTF-8 without losing any
>>> content.
>>>
>>> - Web servers should convert back Strings into UTF-8. This would let us
>>> handle and serve content using full Unicode, without needing to wait until
>>> our tools can display it properly.
>>>
>>> - Use this when editing external files, if they happen to be in UTF-8
>>> (there are good heuristics for determining this). On save, we offer the
>>> option to save as UTF-8 or ISO 8859-15.
>>>
>>> - We can start adapting the environment to avoid using the String
>>> protocols that are a problem for UTF-8 (i.e. #at:, #at:put: and related).
>>> This would ease an eventual migration to using only UTF-8 for everything.
>>>
>>> What do you think? Do we have a plan?
>>>
>>> Cheers,
>>> Juan Vuletich
>>>
>>> _______________________________________________
>>> Cuis mailing list
>>> Cuis at jvuletich.org
>>> http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org