[Cuis] Unicode Support

KenD Ken.Dickey at Whidbey.com
Fri Dec 4 09:48:24 CST 2015


On Fri, 4 Dec 2015 11:42:11 +0000
EuanM <euanmee at gmail.com> wrote:

> I'm currently groping my way to seeing how feature-complete our
> Unicode support is.  I am doing this to establish what still needs to
> be done to provide full Unicode support.

Moby work!

I did some prototyping with a simple Unicode strikefont to get an idea of the scale of the work:

  https://github.com/KenDickey/Cuis-Smalltalk-Unicode

It is useful for simple, very limited editing of web pages.

The prototype is notable for using immutable strings (ropes) to allow mixing characters of different sizes.  E.g. adding a 32-bit code point to an ASCII string does not convert the whole string from narrow to wide characters.  The theory is that most editing adds a small amount of text to a relatively large document, so immutable strings can save a lot of space.
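To make the rope idea concrete, here is a minimal sketch in Python (not the Smalltalk code from the prototype; the names Leaf, Concat, leaf and storage_bytes are made up for illustration). It assumes just two leaf widths, 1 byte for ASCII and 4 bytes for everything else, and shows that appending a wide code point leaves the existing narrow leaf untouched and shared.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    """Immutable run of characters, all stored at one fixed width."""
    data: bytes   # raw storage
    width: int    # bytes per character: 1 (ASCII) or 4 (UTF-32)

    def __len__(self) -> int:
        return len(self.data) // self.width

    def text(self) -> str:
        return self.data.decode("ascii" if self.width == 1 else "utf-32-le")


@dataclass(frozen=True)
class Concat:
    """Immutable node that joins two ropes without copying either one."""
    left: "Rope"
    right: "Rope"

    def __len__(self) -> int:
        return len(self.left) + len(self.right)

    def text(self) -> str:
        return self.left.text() + self.right.text()


Rope = Union[Leaf, Concat]


def leaf(s: str) -> Leaf:
    """Store s narrow if it is pure ASCII, otherwise wide (hypothetical policy)."""
    try:
        return Leaf(s.encode("ascii"), 1)
    except UnicodeEncodeError:
        return Leaf(s.encode("utf-32-le"), 4)


def storage_bytes(rope: Rope) -> int:
    """Total bytes of character data held by the leaves of a rope."""
    if isinstance(rope, Leaf):
        return len(rope.data)
    return storage_bytes(rope.left) + storage_bytes(rope.right)


doc = leaf("x" * 1000)                      # 1000 ASCII characters, 1000 bytes
edited = Concat(doc, leaf("\U0001F600"))    # append one 32-bit code point
print(edited.left is doc)                   # True: the narrow leaf is shared
print(storage_bytes(edited))                # 1004, versus 4004 if all wide
```

A real rope also needs balancing, indexing and slicing; those are left out to keep the space argument visible.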

My impression from doing this basic bit is that a decent job with ligatures, multi-directional text, and layouts adds a significant amount of code and data and, I would guess, doubles the complexity of the system.  I.e. _good_ Unicode support is on the order of the Smalltalk VM and current core class set in terms of size and complexity.

Commercially, this is done over time by fairly well funded teams of people.

===

So it is probably useful to talk a bit about what you want to accomplish and the relative costs.

One decision is the level of support.  
  [A] Is Unicode for the image itself (replace ASCII)? 
  [B] Is Unicode to edit and maintain Web Pages (page editor)?
  [C] Is Unicode to support general fonts for page layout (LibreOffice/Word/..)?

I like Juan's list:

> 0) No Unicode support at all. Like ST-80 or Squeak previous to 3.7.

+ Cheap (done!)
- Non-existent (less useful)

> 1) Limited Unicode support (as in Cuis). Can handle Unicode characters in
> Strings. Comfortable Display / edit of text is restricted to Latin
> alphabet. Non ISO-8859-15 characters are represented as NCRs, but are
> not instances of Character themselves. For example, an NCR such as
> '&#945;' (made of 6 8-bit Characters) represents the greek letter Alpha,
> and is properly handled if such string is converted to a UTF-8
> ByteArray (for example for copying into the Clipboard or for serving Web
> pages). In short, you cannot directly edit or display general Unicode,
> but you can embed it in code, include it in Strings, copy&paste, and
> serve web pages with it.

+ Fairly Inexpensive
- No unifonts for editors [but could add (4)!]
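As a rough illustration of the NCR handling in level (1), here is a small Python sketch (not the Cuis implementation; the function name ncr_to_utf8 is made up). It assumes decimal NCRs of the form &#NNN; only and expands them before encoding the string as UTF-8.

```python
import re

# Decimal numeric character references only, e.g. "&#945;".
# Hexadecimal forms such as "&#x3B1;" are deliberately not handled here.
_NCR = re.compile(r"&#(\d+);")


def ncr_to_utf8(text: str) -> bytes:
    """Expand decimal NCRs to their code points, then encode as UTF-8."""
    expanded = _NCR.sub(lambda m: chr(int(m.group(1))), text)
    return expanded.encode("utf-8")


print(ncr_to_utf8("&#945;"))   # b'\xce\xb1', the two UTF-8 bytes of Greek alpha
```

The string stays 8-bit inside the image and only becomes real Unicode at the boundary (clipboard, web server), which is why this level is cheap.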

> 2) Ken's Cuis-Smalltalk-Unicode. Can display and edit. Includes the
> great Ropes representation for Strings. Limited font support.

+ Moderately Expensive (Complex, tables take a large amount of space)
- Probably slower in Smalltalk than an external library (4)
- Strikefonts do not do ligatures


> 3) Squeak. Can display and edit Unicode strings. Includes broad font
> support with TrueType / OpenType. Does not do grapheme composition,
> right to left, or ligatures.

+ Does a good basic job for text display
- Code base could use some cleanup
- Bi-directional sorting and editing support would still need to be added

I have not tried to use this to edit large documents; perhaps the Pharo/Seaside communities have experience here.
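For a feel of what "grapheme composition" and "right to left" mean at the data level, here is a tiny Python sketch using only the standard unicodedata module. It is an illustration of the Unicode properties involved, not of anything in Squeak, and it says nothing about the layout work, which is the hard part.

```python
import unicodedata

# Grapheme composition: "e" followed by COMBINING ACUTE ACCENT is two
# code points that a reader sees as one character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))        # 2 1

# Directionality: every code point carries a bidirectional class that a
# layout engine must consult before placing glyphs.
print(unicodedata.bidirectional("a"))        # L  (left to right)
print(unicodedata.bidirectional("\u05d0"))   # R  (Hebrew alef, right to left)
```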


> 4) Scratch. Most complete support, including all that's missing in the
> previous approaches. Uses PanGo. This means that text composition is not
> in Smalltalk, but in a large and complex external library. This sounds
> appropriate: Full Unicode is so complex that it took PanGo many years to do
> it reasonably well, and most projects rely on it.

+ Reuses decades of design and implementation experience.
- Relies on external libraries

Personally, I think this is the least cost for the benefit.

==

Note that there are more compact encodings of glyph layouts than fonts.  See "character description language".


$0.02,
-KenD



