[Cuis] Unicode in Cuis

Fri Feb 8 09:45:25 CST 2013

On 2/7/2013 1:45 PM, KenD wrote:
> On Thu, 07 Feb 2013 09:21:42 -0300
> Juan Vuletich<juan at jvuletich.org>  wrote:
>
>> Not if you support full Unicode. So, we'd need to add a flag to each
>> instance to say "I only use 2 byte chars", to avoid the cost of scanning
>> the whole instance every time, to know, as that would kill the "quick
>> access" argument.
> Actually, the quick thing to do is have multiple representations and let Smalltalk do the (double) dispatching as required.
>
> I would recommend that Unicode support be almost exclusively a separate library, with minimal changes to core.  A small add-on change-set like I did for Color could be used to allow loading the Unicode library(s) with no changes to core and little impact on Juan's time until things look good.
>
> I know Juan's tendency to peek and comment, but it would not be required.  ;^)

:)

I'll try to let you guys find the best approach.

Yesterday I played a bit to see how quickly the system would break if 
#at: and #at:put: were banned for Strings. It breaks rather quickly. So, 
converting everything to use only UTF-8 as I had suggested is not a 
trivial task, and you are right that I'd better focus on Morphic 3...
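
For anyone wondering why indexed access is the sticking point with UTF-8, here is a minimal sketch. It is in Python rather than Smalltalk, it is not code from Cuis, and the names utf8_seq_len and utf8_char_at are made up for illustration. Reaching the character at a given position means walking every variable-length sequence before it, so an #at:-style lookup costs O(n) instead of O(1).

```python
def utf8_seq_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2                      # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                      # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                      # 11110xxx
    raise ValueError(f"not a UTF-8 lead byte: {lead:#x}")


def utf8_char_at(data: bytes, index: int) -> str:
    """Character at 0-based position `index` (hypothetical helper).

    Assumes `data` is well-formed UTF-8. The loop has to skip every
    earlier character, which is what makes the lookup linear.
    """
    if index < 0:
        raise IndexError(index)
    pos = 0
    for _ in range(index):
        if pos >= len(data):
            raise IndexError(index)
        pos += utf8_seq_len(data[pos])
    if pos >= len(data):
        raise IndexError(index)
    return data[pos:pos + utf8_seq_len(data[pos])].decode("utf-8")
```

An in-place #at:put: is worse still, because the new character may need a different number of bytes than the one it replaces, which shifts everything after it.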

> My suggestion is a new alternate reader for Unicode which scans the text and auto-converts input into one of 3 categories of internal representation: current (or utf-8), utf-16, or utf-32, each of which is "pure".  I.e. only one encoding size in each case.  If I "squash" a Chinese character into a current/utf-8 string, the result is a new, utf-32 string.  So we give up mutation in favor of functional strings, but access is O(1).
>
> This also means 3 classes of characters for 8, 16, or 32 bits along with the cross-product of comparison, up/down-case and (much work in Unicode) sorting.  I suggest we just sort on "code points" in the short term.
>
> This has the advantage of testing simplicity.  Doing multiple representations where characters are a mix of 8/16/32 bits gives too much work with edge cases in unit tests.  Too much time would be spent "testing tests".
>
> The string writer could output in current or utf-8 with utf-16 and utf-32 added later.
>
> Proper character display is addressed separately only after Unicode support is implemented and properly working.
>
> The implementation would not be optimal w.r.t. memory utilization, but would minimize development (including unit test writing) time.  Complex handling of mixed size characters in strings could be done later as an optimization -- but only if/when this becomes useful.
>
> There is an issue of mutable strings (e.g. in text editing).  There are several ways of approaching this, including a "caching" implementation which converts all to utf-32 for fast access and optionally back and so forth.  Again, this can be approached on an as-needed basis after full (but simple) Unicode support is in place at the Character and String level.
>
> $0.02,
> -KenD
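
To make the "pure" representations proposed above concrete, here is a rough sketch, again in Python and not taken from Cuis or from Ken's code. The class FixedWidthString and the helper width_for are hypothetical. It assumes the narrow case holds code points 0 to 255 in one byte each, the middle case holds code points up to 0xFFFF in two bytes with no surrogate pairs, and the wide case uses four bytes. Every character in one string has the same width, so access is O(1), and storing a character that does not fit answers a new, wider string instead of mutating the receiver.

```python
from dataclasses import dataclass


def width_for(code_point: int) -> int:
    """Smallest of the three widths (in bytes) that can hold `code_point`."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point}")
    if code_point <= 0xFF:
        return 1
    if code_point <= 0xFFFF:
        return 2
    return 4


@dataclass(frozen=True)
class FixedWidthString:
    """Immutable string whose characters all use `width` bytes (illustrative only)."""
    width: int    # bytes per character: 1, 2 or 4
    data: bytes   # len(data) == width * number of characters

    @classmethod
    def from_text(cls, text: str) -> "FixedWidthString":
        # The "reader" step: scan once and pick the narrowest pure representation.
        width = max((width_for(ord(c)) for c in text), default=1)
        return cls(width, b"".join(ord(c).to_bytes(width, "little") for c in text))

    def __len__(self) -> int:
        return len(self.data) // self.width

    def at(self, index: int) -> int:
        """Code point at 0-based `index`, found by plain arithmetic: O(1)."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.width
        return int.from_bytes(self.data[start:start + self.width], "little")

    def with_char_at(self, index: int, code_point: int) -> "FixedWidthString":
        """Answer a new string with one character replaced, widened if needed."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        width = max(self.width, width_for(code_point))
        points = [self.at(i) for i in range(len(self))]
        points[index] = code_point
        return FixedWidthString(
            width, b"".join(p.to_bytes(width, "little") for p in points))

    def to_text(self) -> str:
        return "".join(chr(self.at(i)) for i in range(len(self)))
```

Replacing a character copies the whole string, so a text editor would still want something like the caching approach mentioned above. The sketch only covers the storage idea, not comparison, case conversion or sorting.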

Cheers,
Juan Vuletich