<div dir="ltr">Hi Euan,<div class="gmail_extra"><br><div class="gmail_quote">On Fri, Dec 11, 2015 at 5:45 PM, EuanM <span dir="ltr"><<a href="mailto:euanmee@gmail.com" target="_blank">euanmee@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Elliot, what's your take on having heterogenous collections for the<br>

composed Unicode?<br></blockquote><div><br></div><div>I'm not sure I understand the question, but... I'm told by someone in the know that string concatenation is a big deal in certain applications, so providing tree-like representations for strings can be a win, since concatenation becomes O(1) (allocate a new root and assign the two subtrees). It seems reasonable to have a rich library offering several representations with different trade-offs. But I'd let requirements drive design, not feature dreams.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">i.e. collections with one element for each character, with some<br>

characters being themselves a collection of characters<br>

<br>
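</blockquote><div><br></div><div>For what it's worth, the tree-like representation mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the <code>Rope</code> class and its fields are hypothetical names, not an existing Smalltalk or Python library. Concatenation allocates one new root; the cost is deferred to flattening.</div>

```python
class Rope:
    """Hypothetical tree string: a leaf holds text, an inner node two subtrees."""

    def __init__(self, text="", left=None, right=None):
        self.text = text
        self.left = left
        self.right = right
        # Cache the length so concatenation never has to walk the tree.
        if left is None:
            self.length = len(text)
        else:
            self.length = left.length + right.length

    def __add__(self, other):
        # O(1): allocate a new root and assign the two subtrees.
        return Rope(left=self, right=other)

    def __len__(self):
        return self.length

    def __str__(self):
        # O(n): flatten iteratively, visiting leaves left to right.
        parts, stack = [], [self]
        while stack:
            node = stack.pop()
            if node.left is None:
                parts.append(node.text)
            else:
                stack.append(node.right)
                stack.append(node.left)
        return "".join(parts)


greeting = Rope("Hello, ") + Rope("wor") + Rope("ld")
print(len(greeting), str(greeting))
```

<div>The trade-off is that indexing and iteration get slower than on a flat array, which is why this belongs next to the flat representations in a library, not in place of them.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">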

(A simple character like "a" is one char, and a character which is a<br>

collection of characters is Ǖ (01d5) in fully decomposed form: a<br>

U (0055) with a diaeresis - ¨ - (00a8, aka 0308 in combining form)<br>

on top, forming the precomposed character Ü (00DC), which then gets a<br>

macron -  ̄  - (0304) on top of that)<br>

<br>

so<br>

a  #(0061)<br>

<br>

Ǖ  #(01d5) = #( 00dc 0304) = #( 0055 0308 0304)<br>

<br>

i.e. a string which alternates those two characters<br>

<br>

'aǕaǕaǕaǕ'<br>

<br>

would be represented by something equivalent to:<br>

<br>

#( 0061 #( 0055 0308 0304)  0061 #( 0055 0308 0304)  0061 #( 0055 0308<br>

0304)  0061 #( 0055 0308 0304) )<br>

<br>

as opposed to a string of precomposed characters:<br>

#( 0061 01d5  0061 01d5 0061 01d5 0061 01d5)<br>

<br>
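</blockquote><div><br></div><div>The three spellings of Ǖ quoted above can be checked against the Unicode normalization forms. A minimal Python sketch, using the standard <code>unicodedata</code> module; the variable names and the <code>code_points</code> helper are made up for illustration:</div>

```python
import unicodedata

composed = "\u01d5"            # Ǖ as a single precomposed code point
partly = "\u00dc\u0304"        # Ü followed by a combining macron
decomposed = "U\u0308\u0304"   # U, combining diaeresis, combining macron

spellings = (composed, partly, decomposed)
nfd = [unicodedata.normalize("NFD", s) for s in spellings]
nfc = [unicodedata.normalize("NFC", s) for s in spellings]


def code_points(s):
    """Format each code point of s as four hex digits."""
    return [format(ord(ch), "04x") for ch in s]


print(code_points(nfd[0]))   # fully decomposed form
print(code_points(nfc[0]))   # fully composed form

# The alternating string from the example, in both forms.
text = "a\u01d5" * 4
print(len(unicodedata.normalize("NFC", text)),
      len(unicodedata.normalize("NFD", text)))
```

<div>All three spellings normalize to the same sequence, so a flat string in one agreed normalization form is an alternative to nesting collections inside the string. Which is faster is still a question for the profiler.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">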

Does alternating the type used for characters in a string have a<br>

significant effect on speed?<br></blockquote><div><br></div><div>I honestly don't know.  You've just gone well beyond my familiarity with the issues :-).  I'm just a VM guy :-). But I will say that in cases like this, real applications and the profiler are your friends.  Be guided by what you need now, not by what you think you'll need further down the road.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5"><br>

On 11 December 2015 at 23:08, Eliot Miranda <<a href="mailto:eliot.miranda@gmail.com">eliot.miranda@gmail.com</a>> wrote:<br>

> Hi Todd,<br>

><br>

> On Dec 11, 2015, at 12:57 PM, Todd Blanchard <<a href="mailto:tblanchard@mac.com">tblanchard@mac.com</a>> wrote:<br>

><br>

><br>

> On Dec 11, 2015, at 12:19, EuanM <<a href="mailto:euanmee@gmail.com">euanmee@gmail.com</a>> wrote:<br>

><br>

> "If it hasn't already been said, please do not conflate Unicode and<br>

> UTF-8. I think that would be a recipe for<br>

> a high P.I.T.A. factor."  --Richard Sargent<br>

><br>

><br>

> Well, yes. But I think you guys are making this way too hard.<br>

><br>

> A Unicode character is an abstract idea - for instance the letter 'a'.<br>

> The letter 'a' has a code point - it's the number 97.  How the number 97 is<br>

> represented in the computer is irrelevant.<br>

><br>

> Now we get to transfer encodings.  These are UTF8, UTF16, etc.  A<br>

> transfer encoding specifies the binary representation of the sequence of<br>

> code points.<br>

><br>

> UTF8 is a variable length byte encoding.  You read it one byte at a time,<br>

> aggregating byte sequences to 'code points'.  ByteArray would be an<br>

> excellent choice as a superclass but it must be understood that #at: or<br>

> #at:put: refers to a byte, not a character.  If you want characters, you have<br>

> to start at the beginning and process it sequentially, like a stream (if<br>

> working in the ASCII domain - you can generally 'cheat' this a bit).  A C<br>

> representation would be char utf8[]<br>

><br>
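</div></div></blockquote><div><br></div><div>To illustrate the point that indexing UTF-8 storage addresses bytes, and that characters have to be recovered by reading sequentially, here is a simplified Python sketch. <code>utf8_code_points</code> is a hypothetical helper; it assumes well-formed input and does no validation, whereas a real reader would be a state machine with error handling.</div>

```python
def utf8_code_points(data):
    """Yield code points from well-formed UTF-8 bytes, reading sequentially.

    Simplified: malformed or overlong sequences are not detected.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >> 7 == 0b0:        # 0xxxxxxx: one byte (ASCII)
            size, cp = 1, lead
        elif lead >> 5 == 0b110:    # 110xxxxx: two bytes
            size, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:   # 1110xxxx: three bytes
            size, cp = 3, lead & 0x0F
        else:                       # 11110xxx: four bytes
            size, cp = 4, lead & 0x07
        for byte in data[i + 1:i + size]:
            cp = (cp << 6) | (byte & 0x3F)   # append 6 payload bits
        yield cp
        i += size


raw = "a\u01d5".encode("utf-8")
print(len(raw), hex(raw[1]))                  # byte count; raw[1] is a byte
print([hex(cp) for cp in utf8_code_points(raw)])
```

<div>Here <code>raw[1]</code> is only the first byte of Ǖ; finding the n-th character means scanning from the start.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">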

> UTF16 is also a variable length encoding of two byte quantities - what C<br>

> used to call a 'short int'.  You process it in two byte chunks instead of<br>

> one byte chunks.  Like UTF8, you must read it sequentially to interpret the<br>

> characters.  #at: and #at:put: would necessarily refer to byte pairs and not<br>

> characters.  A C representation would be short utf16[];  It would also be<br>

> 50% space inefficient for ASCII - which is normally the bulk of your text.<br>

><br>
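</div></div></blockquote><div><br></div><div>The two UTF-16 claims (variable length, and twice the space for ASCII) are easy to check. A small Python sketch; <code>utf16_units</code> is a hypothetical helper name, and little-endian order without a byte order mark is assumed:</div>

```python
def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (little-endian, no BOM)."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


print([hex(u) for u in utf16_units("a")])            # one unit
print([hex(u) for u in utf16_units("\U0001F600")])   # a surrogate pair: two units

ascii_text = "hello"
print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-16-le")))
```

<div>A code point above FFFF takes two units, so indexing by unit is not indexing by character, and pure ASCII text takes two bytes per character instead of one.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">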

> Realistically, you need exactly one in-memory format and stream<br>

> readers/writers that can convert (these are typically table driven state<br>

> machines).  My choice would be UTF8 for the internal memory format and the<br>

> ability to read and write from UTF8 to UTF16.<br>

><br>

> But I stress again...strings don't really need indexability as much as you<br>

> think and neither UTF8 nor UTF16 provide this property anyhow as they are<br>

> variable length encodings.  I don't see any sensible reason to have more<br>

> than one in-memory binary format in the image.<br>

><br>

><br>

> The only reasons are space and time.  If a string only contains code points<br>

> in the range 0-255 there's no point in squandering 4 bytes per code point<br>

> (same goes for 0-65535).  Further, if in some application interchange is<br>

> more important than random access it may make sense on performance grounds<br>

> to use utf-8 directly.<br>

><br>

> Again, Smalltalk's dynamic typing makes it easy to have one's cake and eat<br>

> it too.<br>

><br>
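</div></div></blockquote><div><br></div><div>The space argument above amounts to choosing, per string, the narrowest fixed width that holds every code point, which keeps random access while avoiding 4 bytes per character where 1 or 2 will do. A sketch in Python, assuming a hypothetical helper <code>bytes_per_code_point</code>; it illustrates the idea and is not how any particular image implements it:</div>

```python
def bytes_per_code_point(text):
    """Narrowest fixed width (1, 2 or 4 bytes) that can hold every code point."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1      # everything fits in 0-255
    if highest <= 0xFFFF:
        return 2      # everything fits in 0-65535
    return 4          # full code point range needed


for sample in ("plain ASCII", "a\u01d5", "a\U0001F600"):
    width = bytes_per_code_point(sample)
    print(repr(sample), width, "byte(s) per code point,", width * len(sample), "bytes total")
```

<div>With dynamic typing, each width can be its own class behind the same string protocol, so callers need not care which one they hold.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">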

> My $0.02c<br>

><br>

><br>

> _,,,^..^,,,_ (phone)<br>

><br>

><br>

> I agree. :-)<br>

><br>

> Regarding UTF-16, I just want to be able to export to, and receive<br>

> from, Windows (and any other platforms using UTF-16 as their native<br>

> character representation).<br>

><br>

> Windows will always be able to accept UTF-16.  All Windows apps *might<br>

> well* export UTF-16.  There may be other platforms which use UTF-16 as<br>

> their native format.  I'd just like to be able to cope with those<br>

> situations.  Nothing more.<br>

><br>

> All this requires is a Utf16String class that has an asUtf8String<br>

> method (and any other required conversion methods).<br>

><br>

><br>

<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div dir="ltr"><div><span style="font-size:small;border-collapse:separate"><div>_,,,^..^,,,_<br></div><div>best, Eliot</div></span></div></div></div>

</div></div>