[Cuis] Fwd: [Pharo-dev] Unicode Support

EuanM euanmee at gmail.com
Wed Dec 9 07:17:01 CST 2015


"Well. I do not agree with this. I agree with the quote.

Can you explain a bit more about what you mean by abstract character
and concept?"  --Guillermo

The problem with the quote is that, *while true*, it *does not
disambiguate* between:
either
compatibility character and abstract character;
or
character as a composable component of an abstract character and
character as the entire embodiment of an abstract character.

Abstract character is the key concept of Unicode.  The distinction
between abstract characters and codepoints is what chiefly separates
the Unicode approach from most previous approaches to character
encoding, e.g. ASCII, EBCDIC, ISO Latin 1, etc.

Please see my previous posts which use the example of Angstrom,
Capital A with circle (or whatever the canonical name is) and the
composed sequence of "Capital A" and "circle above a letter" for a
fuller explanation of the concept of "abstract character".
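
As a rough illustration of that example, here is a minimal sketch in
Python (my choice of language for the sketch, not something used in this
thread), relying only on the standard unicodedata module. It takes the
three codepoint sequences for Å and shows that they differ as raw
sequences but collapse to a single sequence under Unicode normalization.
The names forms and codepoints are made up for the illustration.

```python
import unicodedata

# Three code point sequences from the Angstrom example.
forms = {
    "composed sequence": "\u0041\u030A",  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
    "single codepoint 00C5": "\u00C5",    # LATIN CAPITAL LETTER A WITH RING ABOVE
    "single codepoint 212B": "\u212B",    # ANGSTROM SIGN
}


def codepoints(s):
    """Return the code points of s as a list of 4-digit hex strings."""
    return [format(ord(ch), "04X") for ch in s]


# Compared codepoint by codepoint, the three strings are all different.
distinct_raw = len(set(forms.values()))

# After normalization (decomposed NFD, or composed NFC) they collapse
# to one and the same sequence.
nfd = {unicodedata.normalize("NFD", s) for s in forms.values()}
nfc = {unicodedata.normalize("NFC", s) for s in forms.values()}

for label, s in forms.items():
    print(label, codepoints(s), "-> NFD",
          codepoints(unicodedata.normalize("NFD", s)))
```

Comparing the normalized strings, rather than the raw codepoint
sequences, is one way to treat all three as the same abstract character.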



On 9 December 2015 at 09:35, Guillermo Polito <guillermopolito at gmail.com> wrote:
>
>> On 8 dic 2015, at 10:07 p.m., EuanM <euanmee at gmail.com> wrote:
>>
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>> using a given encoding."
>>
>> No.
>>
>> A codepoint may represent a component part of an abstract character,
>> or may represent an abstract character, or it may do both (but not
>> always at the same time).
>>
>> Codepoints represent a single encoding of a single concept.
>>
>> Sometimes that concept represents a whole abstract character.
>> Sometimes it represents part of an abstract character.
>
> Well. I do not agree with this. I agree with the quote.
>
> Can you explain a bit more about what you mean by abstract character and concept?
>
>>
>> This is the key difference between Unicode and most character encodings.
>>
>> A codepoint does not always represent a whole character.
>>
>> On 7 December 2015 at 13:06, Henrik Johansen
>> <henrik.s.johansen at veloxit.no> wrote:
>>>
>>> On 07 Dec 2015, at 1:05 , EuanM <euanmee at gmail.com> wrote:
>>>
>>> Hi Henry,
>>>
>>> To be honest, at some point I'm going to long for the much
>>> more succinct semantics of healthcare systems and sports scoring and
>>> administration systems again.  :-)
>>>
>>> codepoints are any of *either*
>>> - the representation of a component of an abstract character, *or*
>>> eg. "A" #(0041) as a component of
>>> - the sole representation of the whole of an abstract character, *or*
>>> - a representation of an abstract character provided for backwards
>>> compatibility which is more properly represented by a series of
>>> codepoints representing a composed character
>>>
>>> e.g.
>>>
>>> The "A" #(0041) as a codepoint can be:
>>> the sole representation of the whole of an abstract character "A" #(0041)
>>>
>>> The representation of a component of the composed (i.e. preferred)
>>> version of the abstract character Å #(0041 030a)
>>>
>>> Å (#00C5) represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> Å (#212b) also represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> With any luck, this satisfies both our semantic understandings of the
>>> concept of "codepoint"
>>>
>>> Would you agree with that?
>>>
>>> In Unicode, codepoints are *NOT* an abstract numerical representation
>>> of a text character.
>>>
>>> At least not as we generally understand the term "text character" from
>>> our experience of non-Unicode character mappings.
>>>
>>>
>>> I agree, they are numerical representations of what Unicode refers to as
>>> characters.
>>>
>>>
>>> codepoints represent "*encoded characters*"
>>>
>>>
>>> No. a codepoint is the numerical value assigned to a character. An "encoded
>>> character" is the way a codepoint is represented in bytes using a given
>>> encoding.
>>>
>>> and "a *text element* ...
>>> is represented by a sequence of one or more codepoints".  (And the
>>> term "text element" is deliberately left undefined in the Unicode
>>> standard)
>>>
>>> Individual codepoints are very often *not* the encoded form of an
>>> abstract character that we are interested in.  Unless we are
>>> communicating to or from another system  (Which in some cases is the
>>> Smalltalk ByteString class)
>>>
>>>
>>>
>>>
>>> In other words:
>>>
>>> *Some* individual codepoints *may* be a representation of a specific
>>> *abstract character*, but only in special cases.
>>>
>>> The general case in Unicode is that Unicode defines (a)
>>> representation(s) of a Unicode *abstract character*.
>>>
>>> The Unicode standard representation of an abstract character is a
>>> composed sequence of codepoints, where in some cases that sequence is
>>> as short as 1 codepoint.
>>>
>>> In other cases, Unicode has a compatibility alias of a single
>>> codepoint which is *also* a representation of an abstract character
>>>
>>> There are some cases where an abstract character can be represented by
>>> more than one single-codepoint compatibility codepoint.
>>>
>>> Cheers,
>>> Euan
>>>
>>>
>>> I agree you have a good grasp of the distinction between an abstract
>>> character (characters and character sequences which should be treated
>>> as equivalent wrt. equality / sorting / display, etc.) and characters
>>> (each of which has a code point assigned).
>>> That is beside the point both Sven and I tried to get across, which is the
>>> difference between a code point and the encoded form(s) of said code point.
>>> When you write:
>>> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
>>> and as the composed character #(0065 00b4) (all in hex) and as the
>>> same composed character as both
>>> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
>>> included"
>>>
>>> It's quite clear you confuse the two. 0xFEFF is the codepoint of the
>>> character used as BOM.
>>> When you state that it can be written ffef (I assume you meant FFFE), you
>>> are again confusing the code point and its encoded value (an encoded value
>>> which only occurs in UTF16/32, no less).
>>>
>>> When this distinction is clear, it might be easier to see the value in
>>> keeping Strings as arrays of Unicode code points, and converting them to
>>> encoded forms when entering/exiting the system.
>>>
>>> Cheers,
>>> Henry
>>>
>>
>
>
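
For reference, the separate distinction Henry draws in the quoted text
above, between a code point and its encoded forms, can be sketched in a
few lines of Python (again an illustrative choice, using only built-in
string methods). The BOM character has exactly one code point, FEFF; the
byte orders FE FF and FF FE only appear once it is encoded as UTF-16.
Variable names here are made up for the sketch.

```python
# The character used as byte order mark has a single code point.
bom = "\uFEFF"
code_point = ord(bom)                 # the number 0xFEFF, independent of any encoding

# Its encoded forms depend on the encoding chosen at the system boundary.
utf16_be = bom.encode("utf-16-be")    # bytes FE FF
utf16_le = bom.encode("utf-16-le")    # bytes FF FE
utf8 = bom.encode("utf-8")            # bytes EF BB BF

# The same holds for an ordinary letter: the code point of e-acute is 00E9,
# but its UTF-8 encoded form is two bytes, not the single byte E9.
e_acute = "\u00E9"
e_acute_utf8 = e_acute.encode("utf-8")  # bytes C3 A9

# Decoding any of the encoded forms gives back the same code point.
round_trips = [
    utf16_be.decode("utf-16-be"),
    utf16_le.decode("utf-16-le"),
    utf8.decode("utf-8"),
]

print(hex(code_point), utf16_be.hex(), utf16_le.hex(), utf8.hex())
```

Inside the program the string holds code points; the byte sequences only
exist after an explicit encode step.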



