<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi Angel,<br>
<br>
On 2/7/2013 7:07 AM, Angel Java Lopez wrote:
<blockquote
cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"
type="cite">Hi people!<br>
<br>
I just found:
<div><a moz-do-not-send="true"
href="http://www.cprogramming.com/tutorial/unicode.html">http://www.cprogramming.com/tutorial/unicode.html</a></div>
<div><br>
</div>
<div>where similar pros/cons were discussed</div>
<div>
<br>
</div>
<div>and</div>
<div><a moz-do-not-send="true"
href="http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061%28v=vs.85%29.aspx">http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx</a></div>
<div><br>
</div>
<div>Well, AFAIK, Unicode UTF-16 was the way adopted by C and
others to represent Unicode strings in memory. According to the
first article, a direct representation of any Unicode character
should be 4 bytes. But UTF-16 is the encoding adopted to:</div>
<div>- avoid wasting space</div>
<div>- give easy access to a character</div>
<div><br>
</div>
<div>I also found</div>
<div><a moz-do-not-send="true"
href="http://www.evanjones.ca/unicode-in-c.html">http://www.evanjones.ca/unicode-in-c.html</a></div>
<div><span style="font-family: 'Times New Roman'; font-size:
medium;">Contrary to popular belief, it is possible for a
Unicode character to require multiple 16-bit values.</span></div>
<div><br>
</div>
<div><span style="font-family: 'Times New Roman'; font-size:
medium;">What format do you use for data that goes in and out
of your software, and what format do you use internally? </span></div>
<div><br>
</div>
<div>Then, the author points to another link </div>
<div><a moz-do-not-send="true"
href="http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode">http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode</a></div>
<div><span style="font-family: Verdana,Arial,Helvetica,sans-serif;
font-size: medium; background-color: rgb(255, 255, 255);">Inside
your software, store text as UTF-8 or UTF-16; that is to say,
pick one of the two and stick with it.</span></div>
<div><br>
</div>
<div>Another interesting post by Tim Bray</div>
<div><a moz-do-not-send="true"
href="http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF">http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF</a></div>
<div><br>
</div>
</blockquote>
<br>
Thanks for the links.<br>
<br>
<blockquote
cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"
type="cite">
<div>My guess:</div>
<div><br>
</div>
<div>- Cuis Smalltalk could support both internal representations,
UTF-8 AND UTF-16. A WideString class could be written. And I
guess, a WideChar. Their protocols should be the same as their
"byte" counterparts. What am I missing: what are the problems with
this approach? All problems should be in the internal
implementation of WideString and WideChar, and in the points
where the code should decide: </div>
<div><br>
</div>
<div>- a new string is needed, what kind?</div>
</blockquote>
<br>
One that can store Unicode. Candidates are UTF-8 and UTF-32.<br>
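<br>
As a rough illustration of the size trade-off between these two
candidates, here is a small sketch. Python is used only as a stand-in
(nothing here is Cuis code), and encoded_sizes is a hypothetical helper
name.<br>
<pre>
```python
# Illustrative only: Python stands in for Smalltalk here.
def encoded_sizes(text):
    """Return the number of bytes needed to store text in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variant is used so that no byte order mark is counted.
        "utf-32": len(text.encode("utf-32-le")),
    }


ascii_text = "Smalltalk"
mixed_text = "caf\u00e9 \u20ac5"  # contains e-acute and the euro sign

print(encoded_sizes(ascii_text))
print(encoded_sizes(mixed_text))
```
</pre>
For "Smalltalk" this gives 9 bytes in UTF-8 against 36 in UTF-32; the
7-character mixed string takes 10 bytes against 28.<br>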
<br>
<blockquote
cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"
type="cite">
<div>- I have a string object (String or WideString). I need to
encode it to store it outside memory. What kind of encoding?</div>
</blockquote>
<br>
That's easy to answer: UTF-8.<br>
<br>
<blockquote
cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"
type="cite">
<div>Yes, it could be a lot of work, but it's the way adopted by
many languages, libraries, technologies: </div>
</blockquote>
<br>
I don't mean to sound harsh, but we don't decide based on statistics. If
we did that, we would not be doing Smalltalk :) .<br>
<br>
<blockquote
cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"
type="cite">
<div>a clear separation between internal representation and encoding.
The only twist, Smalltalk-like, is the possibility of having
TWO internal representations using Smalltalk capabilities.</div>
<div><br>
</div>
<div>The other way: bite the bullet, ONE internal representation,
UTF-16, all encapsulated in String (I guess it is the strategy
adopted by Java and .NET). And add new methods where needed
(I guess in I/O, serialization) to indicate the encoding,
with a default encoding (UTF-8?)</div>
<div><br>
</div>
</blockquote>
<br>
UTF-16? Why? I'd rather choose UTF-8 or UTF-32.<br>
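<br>
For context on the UTF-16 question: UTF-16 is itself a variable-width
encoding, because code points above U+FFFF need two 16-bit code units
(a surrogate pair). So it is less compact than UTF-8 for ASCII-heavy
text and still lacks the constant-time indexing of UTF-32. A minimal
Python sketch, where utf16_code_units is a hypothetical helper name:<br>
<pre>
```python
def utf16_code_units(char):
    """Return how many 16-bit code units UTF-16 needs for one character."""
    # "utf-16-le" is used so that no byte order mark is included.
    return len(char.encode("utf-16-le")) // 2


print(utf16_code_units("A"))           # U+0041, Latin capital A
print(utf16_code_units("\u20ac"))      # U+20AC, euro sign
print(utf16_code_units("\U0001d11e"))  # U+1D11E, musical G clef
```
</pre>
The first two characters fit in one code unit each; the G clef needs
two.<br>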
<br>
<blockquote
cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"
type="cite">
<div>I don't understand why web servers should convert Strings back
into UTF-8. Is there no encoding for text responses in the HTTP protocol? I
read <a moz-do-not-send="true"
href="http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding">http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding</a>
<br>
</div>
</blockquote>
<br>
Serving text in UTF-8 allows using full Unicode web content, and
minimizes compatibility risks.<br>
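<br>
To make this concrete, here is a hedged sketch of what "serving text in
UTF-8" amounts to: the body is encoded to UTF-8 bytes and the encoding
is declared in the Content-Type header. It is written in Python for
illustration only; build_text_response is a hypothetical name, not part
of any Cuis web server.<br>
<pre>
```python
def build_text_response(text):
    """Return (headers, body) for a plain-text response encoded as UTF-8."""
    body = text.encode("utf-8")
    headers = {
        # The charset parameter tells the client how to decode the body.
        "Content-Type": "text/plain; charset=utf-8",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_text_response("Precio: 5 \u20ac")
print(headers)
print(body)
```
</pre>
The 11-character text becomes a 13-byte body, since the euro sign takes
three bytes in UTF-8.<br>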
<br>
<blockquote
cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"
type="cite">
<div>AFAIK, everything related to string I/O has a specified
encoding in some way, these days.</div>
<div><br>
</div>
<div>Angel "Java" Lopez</div>
<div>@ajlopez</div>
<div><br>
</div>
</blockquote>
<br>
Cheers,<br>
Juan Vuletich<br>
<br>
<blockquote
cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"
type="cite">
<div><br>
<div class="gmail_quote">
On Wed, Feb 6, 2013 at 11:59 PM, Juan Vuletich <span
dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:juan@jvuletich.org" target="_blank">juan@jvuletich.org</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt
0.8ex; border-left: 1px solid rgb(204, 204, 204);
padding-left: 1ex;">
Hi Folks,<br>
<br>
I wasn't able to jump into the recent discussion earlier, but
I've been thinking a bit about all this. This <a
moz-do-not-send="true"
href="http://en.wikipedia.org/wiki/UTF8" target="_blank">http://en.wikipedia.org/wiki/UTF8</a>
made me realize that using UTF8 internally to represent
Unicode strings is a way to minimize required changes to
existing software.<br>
<br>
Hannes, your proposal for representing WideStrings as
variableWordSubclass is like <a moz-do-not-send="true"
href="http://en.wikipedia.org/wiki/UTF-32" target="_blank">http://en.wikipedia.org/wiki/UTF-32</a>,
right? And if I remember correctly, it is what Squeak uses.<br>
<br>
I ended up sketching this to compare alternatives:<br>
<br>
<br>
ISO 8859-15<br>
========<br>
pros:<br>
-------<br>
- We already have a lot of specific primitives in the
VM<br>
- Efficient use of memory<br>
- Very fast #at: and #at:put:<br>
cons:<br>
-------<br>
- Can only represent Latin alphabets and not full
Unicode<br>
<br>
Unicode UTF-8 (in a new variableByteSubclass)<br>
pros:<br>
-------<br>
- We could reuse many existing String primitives<br>
- It was created to allow existing code to deal with
Unicode with minimum changes<br>
- Efficient use of memory<br>
cons:<br>
-------<br>
- Does not allow #at:put:<br>
- #at: is very slow, O(n) instead of O(1)<br>
<br>
Unicode UTF-32 (in a new variableWordSubclass)<br>
pros:<br>
-------<br>
- Very fast #at: and #at:put: (although I guess we don't
use this much... especially #at:put:, as Strings are usually
regarded as immutable)<br>
cons:<br>
--------<br>
- very inefficient use of memory (4 bytes for every
character!)<br>
- doesn't take advantage of existing String primitives
(sloooow)<br>
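<br>
The O(n) cost of #at: on UTF-8 noted above comes from characters having
different byte lengths, so finding the n-th one means scanning from the
start. A minimal sketch in Python (illustration only, not Cuis code):
utf8_at is a hypothetical name, the index is 1-based as in Smalltalk,
and the input is assumed to be valid UTF-8.<br>
<pre>
```python
def utf8_at(data, index):
    """Return the character at a 1-based index by scanning UTF-8 bytes."""
    seen = 0      # number of characters whose first byte has been passed
    start = None  # byte offset where the requested character begins
    for pos, byte in enumerate(data):
        # Continuation bytes have the form 10xxxxxx (0x80 to 0xBF).
        if byte in range(0x80, 0xC0):
            continue
        # Any other byte begins a new character.
        if start is not None:
            return data[start:pos].decode("utf-8")
        seen += 1
        if seen == index:
            start = pos
    if start is not None:
        return data[start:].decode("utf-8")
    raise IndexError("index out of range")


data = "a\u20acb".encode("utf-8")  # 5 bytes holding 3 characters
print(utf8_at(data, 2))
```
</pre>
With a one-byte-per-character encoding such as ISO 8859-15, the same
lookup is a direct byte access.<br>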
<br>
<br>
I think that for some time, our main Character/String
representation should be ISO 8859-15. Switching to full
Unicode would require a lot of work in the paragraph layout
and text rendering engines. But we can start building a
Unicode representation that could be used for webapps.<br>
<br>
So I suggest:<br>
<br>
- Build a Utf8String variableByteSubclass and
Utf8Character. Try to use String primitives. Do not include
operations such as #at:, to discourage their use.<br>
<br>
- Make conversion to/from ISO 8859-15 lossless, by a good
codification of CodePoints in regular Strings (not unlike
Hannes' recent contribution).<br>
<br>
- Use this for the Clipboard. This is pretty close to what
we have now, but we'd allow using an external Unicode
textEditor to build any Unicode text, and by copying and
pasting into a String literal in Cuis code, we'd have
something that can be converted back into UTF-8 without
losing any content.<br>
<br>
- Web servers should convert Strings back into UTF-8. This
would let us handle and serve content using full Unicode,
without needing to wait until our tools can display it
properly.<br>
<br>
- Use this when editing external files, if they happen to be
in UTF-8 (there are good heuristics for determining this).
On save, we offer the option to save as UTF-8 or ISO
8859-15.<br>
<br>
- We can start adapting the environment to avoid using the
String protocols that are a problem for UTF-8 (i.e. #at:,
#at:put: and related). This would ease an eventual migration
to using only UTF-8 for everything.<br>
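<br>
The heuristics for recognizing UTF-8 files mentioned in the list above
are not spelled out in this message. One common approach, assumed here
for illustration, is to attempt a strict decode: text in ISO 8859-15
that contains accented characters almost never forms valid UTF-8
sequences. A Python sketch with a hypothetical looks_like_utf8
helper:<br>
<pre>
```python
def looks_like_utf8(data):
    """Heuristic: treat bytes as UTF-8 if they decode without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


as_utf8 = "caf\u00e9".encode("utf-8")
as_latin9 = "caf\u00e9".encode("iso-8859-15")

print(looks_like_utf8(as_utf8))
print(looks_like_utf8(as_latin9))
```
</pre>
Pure ASCII content passes this check as well, which is harmless because
ASCII bytes mean the same thing in both encodings.<br>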
<br>
What do you think? Do we have a plan?<br>
<br>
Cheers,<br>
Juan Vuletich<br>
<br>
_______________________________________________<br>
Cuis mailing list<br>
<a moz-do-not-send="true" href="mailto:Cuis@jvuletich.org"
target="_blank">Cuis@jvuletich.org</a><br>
<a moz-do-not-send="true"
href="http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org"
target="_blank">http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org</a><br>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>