[Cuis] About adding a Unicode handling porting layer

H. Hirzel hannes.hirzel at gmail.com
Sat Feb 2 13:29:17 CST 2013


Your feedback, Germán,

makes me consider analyzing what it would take to write a simple
Unicode porting layer.

I am thinking of an add-on to Cuis which people can load if they want
to work in a more Unicode-compliant way.

In fact, 50% of all HTML files on the internet are encoded in UTF-8,
and the text files I work with are mostly UTF-8. So if I use
WebClient to download an HTML file and want to process it further, I
have to resort to workarounds. Even HTML files in major European
languages often contain Unicode characters such as special hyphens,
quotation marks, graphical symbols, etc.
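
To make this kind of workaround concrete, here is a minimal sketch. It
is written in Python rather than Cuis Smalltalk, purely as an
illustration, and it does not claim to show how Cuis does it. The
function name to_latin9_with_refs is made up for this example. It
decodes UTF-8 input and turns every character that has no ISO 8859-15
code point into a decimal reference of the form &#nnn;, using Python's
standard "xmlcharrefreplace" error handler.

```python
def to_latin9_with_refs(utf8_bytes: bytes) -> str:
    """Decode UTF-8; characters outside ISO 8859-15 become &#nnn; references.

    Hypothetical helper for illustration only, not part of Cuis or WebClient.
    """
    text = utf8_bytes.decode("utf-8")
    # xmlcharrefreplace swaps each unencodable character for a decimal reference
    latin9 = text.encode("iso8859_15", errors="xmlcharrefreplace")
    # Decode again so the result is an ordinary string restricted to Latin-9
    return latin9.decode("iso8859_15")


sample = "5 Ω costs 3 €".encode("utf-8")
print(to_latin9_with_refs(sample))  # 5 &#937; costs 3 €
```

The Euro sign survives because it is part of ISO 8859-15, while Omega
does not and is therefore written as a reference.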

One idea I'd like to try out is simply to replace the class String,
which only stores bytes (8 bit), with a String class which stores
words (32 bit). It is somewhat wasteful in terms of space, but
conceptually it would be straightforward. Space measurements have
shown that there are not all that many strings in Cuis; the major
part is taken up by bitmaps.
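
The space trade-off can be sketched as follows, again in Python only
as an illustration and not as Cuis code. The helper storage_sizes is
hypothetical, and UTF-32 merely stands in for a String that stores one
32-bit word per character, which is an assumption about how such a
class would lay out its data.

```python
def storage_sizes(text):
    """Return (characters, bytes at 8 bits each, bytes at 32 bits each).

    Hypothetical helper for illustration only.
    """
    # One byte per character; raises UnicodeEncodeError outside ISO 8859-15
    narrow = text.encode("iso8859_15")
    # Four bytes per character; explicit byte order, so no BOM is added
    wide = text.encode("utf-32-le")
    return len(text), len(narrow), len(wide)


print(storage_sizes("Grüße aus Zürich"))  # (16, 16, 64)
```

The wide form costs four times the space, but every character sits at
a fixed offset, so indexing stays as simple as with a byte string.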

I just have to figure out how to work with these
variableByteSubclasses with which I have not done much in the past.

--Hannes





On 2/1/13, Germán Arduino <garduino at gmail.com> wrote:
> Thanks Hannes, this is very useful to me.
>
> My next step in porting stuff is to polish WebClient and, among other
> things, Unicode is an issue.
>
> Germán.
>
> 2013/2/1 H. Hirzel <hannes.hirzel at gmail.com>:
>> Thank you Juan,
>> for adding the Unicode fix so that pasting text through the clipboard
>> does not silently lose characters. More things like this (including
>> comments) later.
>>
>> I have realized that what I wrote earlier is wrong. Cuis reads and
>> saves files in ISO8859-15 by default and not with Unicode. However it
>> is not too difficult to read and write a Unicode file.
>>
>> I have started some notes on this here
>> https://github.com/hhzl/Cuis-Add-Ons/blob/master/UnicodeNotes.md
>>
>> Regards
>> Hannes
>>
>> On 1/23/13, Juan Vuletich <juan at jvuletich.org> wrote:
>>> Thanks Hannes, just integrated this.
>>>
>>> Cheers,
>>> Juan Vuletich
>>>
>>> H. Hirzel wrote:
>>>> The attached change set prevents Cuis from silently ignoring
>>>> characters which are not in ISO 8859-15.
>>>>
>>>> For example if you paste a text snippet which contains the letter
>>>> Omega (Ω) into a TextWindow it is displayed as &#937;
>>>>
>>>> The part which does it the other way round is not included.
>>>>
>>>> --Hannes
>>>>
>>>>
>>>>
>>>> On 1/22/13, H. Hirzel <hannes.hirzel at gmail.com> wrote:
>>>>
>>>>> Hello Germán
>>>>>
>>>>> On 1/22/13, Germán Arduino <garduino at gmail.com> wrote:
>>>>>
>>>>>> Nice if you will develop the needed code!
>>>>>>
>>>>>> The first need I have is in the methods of Swazoo that I commented on
>>>>>> in another mail, but I think that is simpler; it is only that I was
>>>>>> not aware of the support already in place in Cuis itself.
>>>>>>
>>>>> Yes, it took me some time as well to find out that Cuis indeed has
>>>>> some limited Unicode support.
>>>>>
>>>>> Juan originally wrote that Cuis had dropped Unicode support.
>>>>>
>>>>> When I have a look at Cuis from outside I cannot say that it is the
>>>>> case as Cuis consumes and writes UTF-8 text files. Unicode text
>>>>> snippets pasted through the clipboard into a Cuis TextEditor also pass
>>>>> in well. The only limitation is that internally it only handles the
>>>>> code points which are in https://de.wikipedia.org/wiki/ISO_8859-15.
>>>>> And if I work in a Cuis workspace  with
>>>>>
>>>>>     nn asCharacter
>>>>>
>>>>> where nn is an Integer
>>>>>
>>>>>    nn must belong to ISO_8859-15
>>>>>
>>>>>
>>>>> ISO_8859-15 is good for most European languages. If we had an
>>>>> add-on to cater for the occasional Unicode characters which do
>>>>> not fall into the set covered by ISO_8859-15, that would make UTF-8
>>>>> text file processing with Cuis safe.
>>>>>
>>>>>
>>>>> --Hannes
>>>>>
>>>>>
>>>>>
>>>>>> Germán.
>>>>>>
>>>>>> 2013/1/22 H. Hirzel <hannes.hirzel at gmail.com>:
>>>>>>
>>>>>>> Hello Germán and Juan
>>>>>>>
>>>>>>> As we have seen we can say that Cuis handles Unicode to a certain
>>>>>>> limited extent.
>>>>>>>
>>>>>>> I will post a summary writeup of what I know about it later. I am
>>>>>>> interested in working on and contributing to an add-on which loads
>>>>>>> Unicode support into Cuis.
>>>>>>>
>>>>>>> For general work I need
>>>>>>>
>>>>>>> a)
>>>>>>> an add-on so that Cuis can process arbitrary UTF-8 text files. However
>>>>>>> the majority of the content characters will fall into the
>>>>>>>   https://de.wikipedia.org/wiki/ISO_8859-15
>>>>>>> range. So it is fine if the other characters are rendered as \unnn or
>>>>>>> &#nnn;
>>>>>>>
>>>>>>> b)
>>>>>>> Another, more rewarding but maybe more difficult, way would be to
>>>>>>> replace the String class with a class which handles 16bit characters
>>>>>>> instead of 8 bit characters. In terms of structure all would remain
>>>>>>> the same. Characters would be 16bit like in Java.
>>>>>>>
>>>>>>>
>>>>>>> This will come later. At the moment I am working on ContentPack
>>>>>>> version 2 which will run on Cuis, Squeak and Pharo.
>>>>>>>
>>>>>>> Kind regards
>>>>>>>
>>>>>>> --Hannes
>>>>>>>
>>>>>>>
>>>>>>>> 2013/1/22 Germán Arduino <garduino at gmail.com>:
>>>>>>>>
>>>>>>>>> Thanks for the comments Hannes / Juan:
>>>>>>>>>
>>>>>>>>> I will look into it when I have time, or if you prefer, Hannes,
>>>>>>>>> and want to help, I will integrate it when I finish with Aida.
>>>>>>>>>
>>>>>>>>> Germán.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2013/1/21 Juan Vuletich <juan at jvuletich.org>:
>>>>>>>>>
>>>>>>>>>> Hi Germán,
>>>>>>>>>>
>>>>>>>>>> Cool! Just a remark: Cuis does include conversion to/from UTF-8
>>>>>>>>>> for the charset it supports (ISO-8859-15, covering nearly all
>>>>>>>>>> the Latin alphabets).
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Juan Vuletich
>>>>>>>>>>
>>>>>>>>>> Germán Arduino wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi:
>>>>>>>>>>>
>>>>>>>>>>> The first versions of Sport and Swazoo working in Cuis 4.1 with
>>>>>>>>>>> all tests green are ready to install.
>>>>>>>>>>>
>>>>>>>>>>> The changes I did in Swazoo are:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - Avoid Unicode support that doesn't exist in Cuis
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>> ......
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Cuis mailing list
>>>>>>> Cuis at jvuletich.org
>>>>>>> http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sincerely,
>>>>>> Germán Arduino
>>>>>> about.me/garduino
>>>>>>
>>>>>>
>>>
>>>
>>
>