[Cuis] Sorting Unicode strings (Re: [Unicode] collation sequences (Re: [squeak-dev] Unicode Support))

Dale Henrichs dale.henrichs at gemtalksystems.com
Mon Dec 7 14:43:52 CST 2015


Hannes,

For GemStone, we are using the ICU library[1]. We have Unicode7, 
Unicode16 and Unicode32 classes (subclasses of CharacterCollection) for 
internal Strings and the class Utf8 (a subclass of ByteArray) for Utf8 
encoded strings ...

The ICU library provides the primitive implementations for working with 
the Unicode* and Utf8 classes.

When we started considering Unicode support, we looked at what it would 
take to support collation (our main reason for looking at Unicode in 
the first place). Once we saw just how complicated the collation rules 
can be[2], we were glad to see that someone had already done the hard 
work[1]...
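
To give a feel for why plain code point comparison is not enough, here is 
a minimal Python sketch. It is not ICU and not what GemStone does; the 
function name toy_phonebook_key and the word list are made up for 
illustration. It only mimics one German phonebook rule (treating ä, ö, ü 
as ae, oe, ue and ß as ss) and ignores everything else in [2].

```python
import unicodedata

# Hypothetical, heavily simplified expansions for German phonebook order.
# Real collation (UCA/CLDR, as implemented by ICU) has multiple comparison
# levels and many more rules than this.
_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def toy_phonebook_key(word):
    """Return a sort key that expands umlauts the way a phonebook would."""
    # Compose first so that 'a' + combining diaeresis becomes a single 'ä'.
    composed = unicodedata.normalize("NFC", word).lower()
    return "".join(_EXPANSIONS.get(ch, ch) for ch in composed)


words = ["Zebra", "Äpfel", "Apfel", "Ofen", "Öl"]

# Plain code point order pushes the umlauts past 'Z'.
by_code_point = sorted(words)
# → ['Apfel', 'Ofen', 'Zebra', 'Äpfel', 'Öl']

# The toy key files 'Äpfel' under 'aepfel' and 'Öl' under 'oel'.
by_toy_key = sorted(words, key=toy_phonebook_key)
# → ['Äpfel', 'Apfel', 'Öl', 'Ofen', 'Zebra']
```

Even this toy version needs normalization, case handling, and expansions, 
and it still covers only a sliver of one locale.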

Reconciling our legacy String implementations (String, DoubleByteString, 
and QuadByteString) with the Unicode* classes was also interesting, 
because the rules for Unicode equality and our legacy equality 
implementation were not quite compatible.

If you are interested in more information, I can share additional 
details ...

Dale

[1] http://site.icu-project.org/
[2] http://unicode.org/reports/tr10/

On 12/07/2015 11:54 AM, H. Hirzel wrote:
> Hello
>
> According to http://www.unicode.org/cldr/charts/27/collation/de.html the German
> phonebook sort order is
>
> a A ä Ä ą̈ Ą̈ ǟ Ǟ ạ̈ Ạ̈ ḁ̈ Ḁ̈ b B c C d D e E f F g G h H i I j J k K
> l L m M n N o O ö Ö ǫ̈ Ǫ̈ ȫ Ȫ ơ̈ Ơ̈ ợ̈ Ợ̈ ọ̈ Ọ̈ p P q Q r R s S ss ß t
> T u U ü Ü ǘ Ǘ ǜ Ǜ ǚ Ǚ ų̈ Ų̈ ǖ Ǖ ư̈ Ư̈ ự̈ Ự̈ ụ̈ Ụ̈ ṳ̈ Ṳ̈ ṷ̈ Ṷ̈ ṵ̈ Ṵ̈ v
> V w W x X y Y z Z
>
> I wonder why it looks like this. A lot of characters which never
> appear in a German text.
>
>
> For Spanish there is 'traditional' and 'standard'
>
> http://www.unicode.org/cldr/charts/27/collation/es.html
>
> standard	a A á Á b B c C d D e E é É f F g G h H i I í Í j J k K l L m
> M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s S t T u U ú Ú
> ü Ü v V w W x X y Y z Z
>
> traditional	a A á Á b B c C ch Ch CH cĥ Cĥ CĤ cȟ Cȟ CȞ cḧ Cḧ CḦ cḣ Cḣ
> CḢ cḩ Cḩ CḨ cḥ Cḥ CḤ cḫ Cḫ CḪ cẖ Cẖ d D e E é É f F g G h H i I í Í j
> J k K l L ll Ll LL lĺ Lĺ LĹ lľ Lľ LĽ lļ Lļ LĻ lḷ Lḷ LḶ lḹ Lḹ LḸ lḽ Lḽ
> LḼ lḻ Lḻ LḺ m M n N ñ Ñ ņ̃ Ņ̃ ṇ̃ Ṇ̃ ṋ̃ Ṋ̃ ṉ̃ Ṉ̃ o O ó Ó p P q Q r R s
> S t T u U ú Ú ü Ü v V w W x X y Y z Z
>
> And French is not easily found
> http://www.unicode.org/cldr/charts/27/collation/index.html
> or seems to be defined elsewhere
>
> http://unicode.org/repos/cldr/tags/release-27/common/collation/fr.xml
>
> Suggestions and hints are welcome
>
> --Hannes
> _______________________________________________
> Cuis mailing list
> Cuis at jvuletich.org
> http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org




