[Cuis] Unicode in Cuis

Thu Feb 7 15:16:21 CST 2013

On 2/7/13, KenD <Ken.Dickey at whidbey.com> wrote:
> On Thu, 07 Feb 2013 09:21:42 -0300
> Juan Vuletich <juan at jvuletich.org> wrote:
>
>> Not if you support full Unicode. So, we'd need to add a flag to each
>> instance to say "I only use 2 byte chars", to avoid the cost of scanning
>> the whole instance every time, to know, as that would kill the "quick
>> access" argument.
>
> Actually, the quick thing to do is have multiple representations and let
> Smalltalk do the (double) dispatching as required.
>
> I would recommend that Unicode support be almost exclusively a separate
> library, with minimal changes to core.

+1 for a separate library, with minimal changes to core.

> A small add-on change-set like I did
> for Color could be used to allow loading the Unicode library(s) with no
> changes to core and little impact on Juan's time until things look good.

+1 for little impact on Juan's time so that he can continue to focus on Morphic.


> I know Juan's tendency to peek and comment, but it would not be required.
> ;^)
>
> My suggestion is a new alternate reader for Unicode which scans the text and
> auto-converts input into one of 3 categories of internal representation:
> current (or utf-8), utf-16, or utf-32, each of which is "pure".  I.e. only
> one encoding size in each case.  If I "squash" a Chinese character into a
> current/utf-8 string, the result is a new, utf-32 string.  So we give up
> mutation in favor of functional strings, but access is O(1).
>
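
To make the "pure representation" idea concrete, here is a minimal sketch.
It is written in Python rather than Smalltalk purely for illustration, and
every name in it (PureString, width_for, with_char_at) is hypothetical, not
Cuis API. It also assumes the 8-bit category simply means code points below
256 and the 16-bit category means code points below 65536. Under that
assumption a character such as U+4E2D widens to 16 bits, and only a code
point above U+FFFF forces 32 bits.

```python
def width_for(code_point):
    """Smallest of the three assumed widths that can hold this code point."""
    if code_point < 0x100:
        return 8
    if code_point < 0x10000:
        return 16
    return 32


class PureString:
    """Immutable string whose characters all share one fixed width."""

    def __init__(self, code_points):
        self._cps = tuple(code_points)
        # One scan at creation time; the result is remembered, so later
        # accesses never have to rescan the contents.
        self.width = max((width_for(cp) for cp in self._cps), default=8)

    def at(self, index):
        # O(1): every element has the same size, so indexing is direct.
        return self._cps[index]

    def with_char_at(self, index, code_point):
        """Return a NEW string, widened if the new character needs it."""
        cps = list(self._cps)
        cps[index] = code_point
        return PureString(cps)
```

Replacing a character never changes the receiver: with_char_at answers a
fresh string, which is the "give up mutation" trade-off described above.
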
> This also means 3 classes of characters for 8, 16, or 32 bits along with the
> cross-product of comparison, up/down-case and (much work in Unicode)
> sorting.  I suggest we just sort on "code points" in the short term.
>
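
For reference, sorting on code points can be sketched as below (Python
again, for illustration only; sort_by_code_points is a hypothetical name).
It shows what the short-term shortcut costs: the order is purely numeric,
so all ASCII capitals sort before lowercase letters and accented letters
sort after "z".

```python
def sort_by_code_points(strings):
    """Order strings by comparing their code points numerically."""
    return sorted(strings, key=lambda s: [ord(ch) for ch in s])


words = ["zebra", "Zulu", "apple", "\u00e9clair"]
print(sort_by_code_points(words))
# prints ['Zulu', 'apple', 'zebra', 'éclair']
```
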
> This has the advantage of testing simplicity.  Doing multiple
> representations where characters are a mix of 8/16/32 bits gives too much
> work with edge cases in unit tests.  Too much time would be spent "testing
> tests".
>
> The string writer could output in current or utf-8 with utf-16 and utf-32
> added later.
>
> Proper character display is addressed separately only after Unicode support
> is implemented and properly working.
>
> The implementation would not be optimal w.r.t. memory utilization, but would
> minimize development (including unit test writing) time.  Complex handling
> of mixed size characters in strings could be done later as an optimization
> -- but only if/when this becomes useful.
>
> There is an issue of mutable strings (e.g. in text editing).  There are
> several ways of approaching this, including a "caching" implementation which
> converts all to utf-32 for fast access and optionally back and so forth.
> Again, this can be approached on an as-needed basis after full (but simple)
> Unicode support is in place at the Character and String level.
>
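
One way the "caching" approach for editing might look, as a rough sketch
only (Python for illustration; EditBuffer, put, insert and freeze are
hypothetical names, and the width thresholds are the same assumption of
8 bits below 256 and 16 bits below 65536). The working copy is held at
the widest width so edits never need to widen it, and it is converted
back to the narrowest pure form only on request.

```python
def narrowest_width(code_points):
    """Smallest assumed width (8, 16 or 32 bits) that fits every element."""
    top = max(code_points, default=0)
    if top < 0x100:
        return 8
    if top < 0x10000:
        return 16
    return 32


class EditBuffer:
    """Mutable working copy, conceptually stored at 32 bits per character."""

    def __init__(self, code_points):
        self._cps = list(code_points)

    def put(self, index, code_point):
        # In-place replacement; any code point fits, so no widening step.
        self._cps[index] = code_point

    def insert(self, index, code_point):
        self._cps.insert(index, code_point)

    def freeze(self):
        """Convert back: answer (width, immutable contents)."""
        return narrowest_width(self._cps), tuple(self._cps)
```
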
> $0.02,
> -KenD