[Cuis] Ropes & Unicode

Fri Feb 22 12:57:04 CST 2013

Hi Folks,

On 2/18/2013 2:53 PM, Ken Dickey wrote:
> On Mon, 18 Feb 2013 10:07:22 +0000
> "H. Hirzel"<hannes.hirzel at gmail.com>  wrote:
>
>
>>>> c) finding a subrope
>> This is what I am interested in next. The general String find. And
>> maybe #splitBy:
> Right.  Unicode has character classes, so word/token breaks, whitespace, casing, and especially sorting are typically more complex and expensive to implement.  E.g. you really have to know both the language and locale to sort properly.
>
> My strategy at this point is to go for simple comparison and sorting using the numeric code-point values (#<) and do the more complex comparisons (#comesBefore:) after the codePoint basics are in place and tested.
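
A minimal sketch of the "code-point basics" half of that strategy, in Python rather than Smalltalk. The names code_points and code_point_less are made up for illustration and are not taken from the Cuis-Unicode code. It only shows why plain numeric ordering differs from what a reader of a given language would expect.

```python
def code_points(s):
    """Return the numeric code points of s, one per character."""
    return [ord(ch) for ch in s]

def code_point_less(a, b):
    """Numeric comparison (the #< case): compare code point by code point."""
    return code_points(a) < code_points(b)

# "\u00e9" is a precomposed e-acute (U+00E9).
words = ["zebra", "Apple", "\u00e9clair", "apple"]
by_code_point = sorted(words, key=code_points)

# Upper case sorts before lower case, and the accented word lands after
# "zebra", because U+00E9 (233) is numerically larger than "z" (122).
print(by_code_point)
```

A language- and locale-aware comparison (the #comesBefore: case) would need collation data on top of this and is not sketched here.
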
>
> One challenge is to have good names that describe the operation but also make the distinction between Unicode and numeric ordering clear.
>
>> BTW could you provide the links to Java, Python and C implementations
>> of Ropes so that I can have a look at it as well?
> I did that in these emails and Wikipedia has references, but no reason not to gather them together in the documentation.
>
>
>>>> I actually would expect the class Rope to have
>>>>     Rope class>>  fromString:
>>> Good call.  I'll add it.
>> Thank you, this makes using Ropes effortless.  I used it in the
>> performance tests. I only need to know about class Rope for getting
>> started.
> Wow!  Thanks for doing performance tests!  Good information!
>
>
>> Maybe we should complete the API the subclasses provide in the class
>> Rope, as is done in String in Squeak (String is the superclass of
>> ByteString and WideString in Squeak).
> Ah.  You are catching me out.  I have been consciously NOT looking at the Squeak implementation until I did a "code sketch" of a Ropes implementation for Unicode.
>
> I tend to "think out loud" by writing code.  I confuse myself easily and the more abstract the concept, the better it is to do concrete implementations (IMHO).
>
> I just created
> 	https://github.com/KenDickey/Cuis-Unicode
>
> Note the read-me states "Not yet ready to see the light of day" !!
>
> No real Unicode scanning yet, just sketching out the dispatch mechanics.
>
> Note that I have changed names (Rope ->  UniString, ConcatRope ->  UniSplice, etc.)
>
> The plan is to qualify when scanning/inputting characters to see when it makes sense to use 8/16/32 bit charBlocks to hold codePoints.  I.e. if more than XXX % of characters are 32 bit, then use the 32-bit storage class (WordArray) for a chunk of codePoints.  We need to measure to figure out what XXX is.
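
One way to picture that measurement, as a rough Python sketch. The function names and the threshold parameter are hypothetical; the threshold stands in for the still-unknown XXX and no particular value is implied.

```python
from collections import Counter

def bits_needed(code_point):
    """Smallest of 8/16/32 bits that can hold this code point."""
    if code_point <= 0xFF:
        return 8
    if code_point <= 0xFFFF:
        return 16
    return 32

def width_histogram(text):
    """Count how many characters of text need 8, 16, or 32 bits."""
    return Counter(bits_needed(ord(ch)) for ch in text)

def wants_32_bit_block(text, threshold):
    """True if the fraction of 32-bit characters exceeds threshold.

    threshold is a fraction between 0 and 1; it is a placeholder for
    the XXX % that still has to be measured.
    """
    if not text:
        return False
    return width_histogram(text)[32] / len(text) > threshold
```

Running width_histogram over representative texts would give the numbers needed to pick the cutoff.
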
>
>
>>> The point of this first code was as proof of concept.  I wanted to get
>>> enough experience to see if this was worth our time to carry further.  Lazy
>>> ropes and other specializations are easy to add.
>> What is "lazy" in lazy Ropes?
> If I edit a large file, I don't really want to read the entire file into memory.  I can read in a chunk and concatenate this FlatRope with a LazyRope which is essentially a promise to get things as needed.  In practice, #at: and friends are closures or methods which will read in more of the file as required.  This is called "lazy", as in I am too lazy to read the file all at once.  I'll do it automagically on demand.
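
The "promise" idea can be sketched in a few lines of Python. The class names follow the ones used in this thread, but the code is only an illustration and not the Cuis implementation. It assumes the length of the lazy part is known up front, uses 0-based indexing (unlike Smalltalk's #at:), and uses a plain function as the hypothetical loader where a real one would read from the file.

```python
class FlatRope:
    """A chunk of text that is already in memory."""
    def __init__(self, text):
        self.text = text
    def __len__(self):
        return len(self.text)
    def at(self, index):
        return self.text[index]

class LazyRope:
    """A promise: the text is fetched by calling load() on first access."""
    def __init__(self, length, load):
        self._length = length   # assumed to be known without loading
        self._load = load
        self._text = None
    def __len__(self):
        return self._length
    def at(self, index):
        if self._text is None:  # first access: fetch the content now
            self._text = self._load()
        return self._text[index]

class ConcatRope:
    """Concatenation of two ropes; routes an index to the correct side."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __len__(self):
        return len(self.left) + len(self.right)
    def at(self, index):
        if index < len(self.left):
            return self.left.at(index)
        return self.right.at(index - len(self.left))

loads = []
def load_tail():
    loads.append("tail")        # record that the loader ran
    return "world"

rope = ConcatRope(FlatRope("hello "), LazyRope(5, load_tail))
```

Reading from the first part never calls load_tail; the first read past the end of the FlatRope does, and later reads reuse the loaded text.
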
>
> Think "virtual rope" like "virtual file system".  You don't have to have everything in memory.  So "distributed", "transactions", ... plenty of fun here! 8^)
>
>
>>> This will take me a while.  I still have much to learn about paragraph,
>>> parsing, and the text editors.
>> With paragraph parsing you mean finding where the paragraph boundaries
>> are? It might be worthwhile considering using paragraph boundaries as
>> end points for Ropes?
> Paragraph breaks are another complex concept, particularly in mixing left-to-right and right-to-left word orderings (say a newspaper review in Arabic, Hebrew, and French).
>
> I think paragraph breaks must be orthogonal to storage.  You don't store paragraphs as  "disk blocks".  You make index structures.

In addition, most users won't need Strings but Texts, i.e. including 
attributes. The Pharo folks did a new implementation of Text that uses 
something very similar to Ropes. They call them 'extents'. This would be 
nice to have.

> Hey, this is getting way too long and I have a bunch of my life that does not fit here.

:) I can understand you. Cuis and Morphic 3 started this way!

> More later.
>
>
> Thanks again for everything!!  I'll try and get the "pulls" in later today.
> -KenD

Cheers,
Juan Vuletich