[Cuis] [Pharo-dev] Unicode Support
EuanM
euanmee at gmail.com
Fri Dec 11 19:45:54 CST 2015
Eliot, what's your take on having heterogeneous collections for the
composed Unicode?
i.e. collections with one element for each character, with some
characters being themselves a collection of characters
(A simple character like "a" is one element, while the fully composed
version of Ǖ (01d5) is itself a collection of characters: a
U (0055) with a diaeresis - ¨ - (00a8, or 0308 in combining form)
on top, forming the compatibility character Ü (00dc), which then gets a
macron - ̄ - (0304) on top of that.)
so:
a #(0061)
Ǖ #(01d5) = #( 00dc 0304) = #( 0055 0308 0304)
i.e. a string alternating those two characters
'aǕaǕaǕaǕ'
would be represented by something equivalent to:
#( 0061 #( 0055 0308 0304) 0061 #( 0055 0308 0304) 0061 #( 0055 0308
0304) 0061 #( 0055 0308 0304) )
as opposed to a string of compatibility characters:
#( 0061 01d5 0061 01d5 0061 01d5 0061 01d5)
Does alternating the type used for characters in a string have a
significant effect on speed?
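The precomposed vs. decomposed forms above can be checked with a short sketch in Python (standing in for Smalltalk here): NFC normalization yields the precomposed compatibility characters, NFD the fully decomposed sequences.

```python
import unicodedata

# 'aǕaǕaǕaǕ' built from the precomposed character U+01D5
s = 'a\u01d5' * 4

# NFC keeps one code point per character; NFD expands each Ǖ
# into U+0055 U+0308 U+0304, as in the example above.
nfc = unicodedata.normalize('NFC', s)
nfd = unicodedata.normalize('NFD', s)

print([hex(ord(c)) for c in nfc][:2])  # ['0x61', '0x1d5']
print([hex(ord(c)) for c in nfd][:4])  # ['0x61', '0x55', '0x308', '0x304']
```

Note that the two normal forms have different lengths (8 vs. 16 code points here), which is exactly the indexing question raised above.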
On 11 December 2015 at 23:08, Eliot Miranda <eliot.miranda at gmail.com> wrote:
> Hi Todd,
>
> On Dec 11, 2015, at 12:57 PM, Todd Blanchard <tblanchard at mac.com> wrote:
>
>
> On Dec 11, 2015, at 12:19, EuanM <euanmee at gmail.com> wrote:
>
> "If it hasn't already been said, please do not conflate Unicode and
> UTF-8. I think that would be a recipe for
> a high P.I.T.A. factor." --Richard Sargent
>
>
> Well, yes. But I think you guys are making this way too hard.
>
> A unicode character is an abstract idea - for instance the letter 'a'.
> The letter 'a' has a code point - it's the number 97. How the number 97 is
> represented in the computer is irrelevant.
>
> Now we get to transfer encodings. These are UTF8, UTF16, etc. A
> transfer encoding specifies the binary representation of the sequence of
> code points.
>
> UTF8 is a variable length byte encoding. You read it one byte at a time,
> aggregating byte sequences to 'code points'. ByteArray would be an
> excellent choice as a superclass, but it must be understood that #at: or
> #at:put: refers to a byte, not a character. If you want characters, you have
> to start at the beginning and process it sequentially, like a stream (if
> working in the ASCII domain you can generally 'cheat' this a bit). A C
> representation would be char utf8[].
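The byte-vs-character distinction Todd describes can be sketched in Python, with `bytes` standing in for a ByteArray-backed UTF-8 string:

```python
s = 'a\u01d5'                 # 'aǕ': U+0061 then U+01D5
b = s.encode('utf-8')         # the byte-level, UTF-8 view

print(b.hex(' '))             # '61 c7 95' — 'a' is one byte, 'Ǖ' is two
print(b[1])                   # 199 — indexing yields a lead byte, not a character
print(b.decode('utf-8'))      # sequential decoding recovers 'aǕ'
```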
>
> UTF16 is also a variable length encoding of two byte quantities - what C
> used to call a 'short int'. You process it in two byte chunks instead of
> one byte chunks. Like UTF8, you must read it sequentially to interpret the
> characters. #at: and #at:put: would necessarily refer to byte pairs and not
> characters. A C representation would be short utf16[]. It would also be
> 50% space inefficient for ASCII - which is normally the bulk of your text.
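The same byte-pair view, sketched in Python (the `utf16_units` helper is made up for illustration): BMP characters occupy one 16-bit unit, anything above U+FFFF needs a surrogate pair, and plain ASCII doubles in size versus UTF-8.

```python
import struct

def utf16_units(s):
    # hypothetical helper: the 16-bit code units of s, little-endian
    return [u for (u,) in struct.iter_unpack('<H', s.encode('utf-16-le'))]

print([hex(u) for u in utf16_units('a\u01d5')])     # ['0x61', '0x1d5'] — one unit each
print([hex(u) for u in utf16_units('\U0001F600')])  # ['0xd83d', '0xde00'] — surrogate pair

ascii_text = 'hello' * 100
print(len(ascii_text.encode('utf-8')),       # 500 bytes in UTF-8
      len(ascii_text.encode('utf-16-le')))   # 1000 bytes in UTF-16
```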
>
> Realistically, you need exactly one in-memory format and stream
> readers/writers that can convert (these are typically table driven state
> machines). My choice would be UTF8 for the internal memory format and the
> ability to read and write from UTF8 to UTF16.
>
> But I stress again...strings don't really need indexability as much as you
> think and neither UTF8 nor UTF16 provide this property anyhow as they are
> variable length encodings. I don't see any sensible reason to have more
> than one in-memory binary format in the image.
>
>
> The only reasons are space and time. If a string only contains code points
> in the range 0-255 there's no point in squandering 4 bytes per code point
> (same goes for 0-65535). Further, if in some application interchange is
> more important than random access, it may make sense on performance grounds
> to use utf-8 directly.
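CPython happens to implement exactly this scheme internally (PEP 393 "flexible" strings): strings whose code points all fit in 0-255 use one byte each, BMP-only strings two, everything else four. A rough measurement sketch, assuming CPython:

```python
import sys

def bytes_per_code_point(ch, n=1000):
    # cancel out the fixed object header by comparing two lengths
    return (sys.getsizeof(ch * n) - sys.getsizeof(ch)) // (n - 1)

print(bytes_per_code_point('a'))           # 1 — Latin-1 range
print(bytes_per_code_point('\u01d5'))      # 2 — BMP
print(bytes_per_code_point('\U0001F600'))  # 4 — beyond the BMP
```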
>
> Again, Smalltalk's dynamic typing makes it easy to have one's cake and eat
> it too.
>
> My $0.02c
>
>
> _,,,^..^,,,_ (phone)
>
>
> I agree. :-)
>
> Regarding UTF-16, I just want to be able to export to, and receive
> from, Windows (and any other platforms using UTF-16 as their native
> character representation).
>
> Windows will always be able to accept UTF-16. All Windows apps *might
> well* export UTF-16. There may be other platforms which use UTF-16 as
> their native format. I'd just like to be able to cope with those
> situations. Nothing more.
>
> All this requires is a Utf16String class that has an asUtf8String
> method (and any other required conversion methods).
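A sketch in Python of what that conversion amounts to (Utf16String and asUtf8String are class and selector names proposed above, not existing code):

```python
def utf16le_to_utf8(data: bytes) -> bytes:
    # decode the UTF-16-LE bytes Windows hands over, re-encode as UTF-8
    return data.decode('utf-16-le').encode('utf-8')

win_bytes = 'a\u01d5'.encode('utf-16-le')  # b'a\x00\xd5\x01'
print(utf16le_to_utf8(win_bytes))          # b'a\xc7\x95'
```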
>
>