<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#ffffff" text="#000000">

    Hi Angel,<br>

    <br>

    On 2/7/2013 7:07 AM, Angel Java Lopez wrote:

    <blockquote

cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"

      type="cite">Hi people!<br>

      <br>

      I just found:

      <div><a moz-do-not-send="true"

          href="http://www.cprogramming.com/tutorial/unicode.html">http://www.cprogramming.com/tutorial/unicode.html</a></div>

      <div><br>

      </div>

      <div>where similar pros and cons were discussed</div>

      <div>

        <br>

      </div>

      <div>and</div>

      <div><a moz-do-not-send="true"

href="http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061%28v=vs.85%29.aspx">http://msdn.microsoft.com/en-us/library/windows/desktop/dd374061(v=vs.85).aspx</a></div>

      <div><br>

      </div>

      <div>Well, AFAIK, UTF-16 was the encoding adopted by C and

        others to represent Unicode strings in memory. According to the

        first article, a direct representation of any Unicode character

        would take 4 bytes. But UTF-16 is the encoding adopted to:</div>

      <div>- avoid wasting space</div>

      <div>- give easy access to a character</div>

      <div><br>

      </div>

      <div>I also found</div>

      <div><a moz-do-not-send="true"

          href="http://www.evanjones.ca/unicode-in-c.html">http://www.evanjones.ca/unicode-in-c.html</a></div>

      <div><span style="font-family: 'Times New Roman'; font-size:

          medium;">Contrary to popular belief, it is possible for a

          Unicode character to require multiple 16-bit values.</span></div>

      <div><br>

      </div>

      <div><span style="font-family: 'Times New Roman'; font-size:

          medium;">What format do you use for data that goes in and out

          of your software, and what format do you use internally? </span></div>

      <div><br>

      </div>

      <div>Then, the author points to another link </div>

      <div><a moz-do-not-send="true"

          href="http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode">http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode</a></div>

      <div><span style="font-family: Verdana,Arial,Helvetica,sans-serif;

          font-size: medium; background-color: rgb(255, 255, 255);">Inside

          your software, store text as UTF-8 or UTF-16; that is to say,

          pick one of the two and stick with it.</span></div>

      <div><br>

      </div>

      <div>Another interesting post by Tim Bray</div>

      <div><a moz-do-not-send="true"

          href="http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF">http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF</a></div>

      <div><br>

      </div>

    </blockquote>

    <br>

    Thanks for the links.<br>

    <br>

    <blockquote

cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"

      type="cite">

      <div>My guess:</div>

      <div><br>

      </div>

      <div>- Cuis Smalltalk could support both internal representations,

        UTF-8 AND UTF-16. A WideString class could be written. And I

        guess, a WideChar. Their protocols should be the same as their

        "byte" counterparts. What am I missing: what are the problems with

        this approach? All problems should be in the internal

        implementation of WideString and WideChar, and in the points

        where the code should decide: </div>

      <div><br>

      </div>

      <div>- a new string is needed, what kind?</div>

    </blockquote>

    <br>

    One that can store Unicode. Candidates are UTF-8 and UTF-32.<br>
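
    <br>

    As a rough illustration of the trade-off between the two candidates,
    here is a Python sketch (not Cuis code; the helper name utf8_at is
    made up for this example). It compares storage size and shows why
    indexing by character is a scan in UTF-8 but simple arithmetic in
    UTF-32.<br>

    <pre>
```python
def utf8_at(data, index):
    """Return the character at a 0-based character index by scanning UTF-8 bytes.

    Hypothetical helper: it walks the bytes from the start, so cost grows with index.
    """
    seen = -1
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; every other byte starts a character.
        if byte // 64 != 2:
            seen += 1
            if seen == index:
                # Find where this character ends: the next lead byte or the end.
                end = pos + 1
                while end != len(data) and data[end] // 64 == 2:
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError(index)


text = "a\u00f1\u20ac\U0001d11e"  # 1-, 2-, 3- and 4-byte characters in UTF-8
utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")

utf8_size = len(utf8)    # 1 + 2 + 3 + 4 bytes
utf32_size = len(utf32)  # always 4 bytes per character

# UTF-32 indexing is plain arithmetic on the byte offset.
third_utf32 = utf32[2 * 4:3 * 4].decode("utf-32-le")
# UTF-8 indexing needs a scan from the beginning.
third_utf8 = utf8_at(utf8, 2)
```
</pre>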

    <br>

    <blockquote

cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"

      type="cite">

      <div>- I have a string object (String or WideString). I need to

        encode it to store it outside memory. What kind of encoding?</div>

    </blockquote>

    <br>

    That's easy to answer: UTF-8.<br>

    <br>

    <blockquote

cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"

      type="cite">

      <div>Yes, it could be a lot of work, but it's the way adopted by

        many languages, libraries, technologies: </div>

    </blockquote>

    <br>

    I don't mean to sound harsh, but we don't decide on statistics. If

    we did that, we would not be doing Smalltalk :) .<br>

    <br>

    <blockquote

cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"

      type="cite">

      <div>a clear separation between internal representation and encoding.

        The only twist, Smalltalk-like, is the possibility of having

        TWO internal representations using Smalltalk capabilities.</div>

      <div><br>

      </div>

      <div>The other way: bite the bullet, ONE internal representation,

        UTF-16, all encapsulated in String (I guess it is the strategy

        adopted by Java and .NET). And have new methods, where needed, to

        indicate the encoding (I guess in I/O, serialization), with a

        default encoding (UTF-8?)</div>

      <div><br>

      </div>

    </blockquote>

    <br>

    UTF-16? Why? I'd rather choose UTF-8 or UTF-32.<br>
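
    <br>

    The reason is that UTF-16 is a variable-width encoding too: characters
    outside the Basic Multilingual Plane take two 16-bit units (a
    surrogate pair), so it is neither as compact as UTF-8 nor as simple
    to index as UTF-32. A small Python sketch, only for illustration
    (utf16_units is a made-up helper name):<br>

    <pre>
```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u20ac"         # EURO SIGN, inside the Basic Multilingual Plane
astral_char = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside it

bmp_units = utf16_units(bmp_char)        # one 16-bit unit
astral_units = utf16_units(astral_char)  # a surrogate pair: two 16-bit units

# The same surrogate pair computed by hand from the code point.
offset = ord(astral_char) - 0x10000
high = 0xD800 + offset // 0x400
low = 0xDC00 + offset % 0x400
pair = high.to_bytes(2, "little") + low.to_bytes(2, "little")
```
</pre>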

    <br>

    <blockquote

cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"

      type="cite">

      <div>I don't understand why web servers should convert Strings back

        into UTF-8.  Is there no encoding for text responses in the HTTP protocol? I

        read <a moz-do-not-send="true"

href="http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding">http://en.wikipedia.org/wiki/Character_encodings_in_HTML#Specifying_the_document.27s_character_encoding</a>

        <br>

      </div>

    </blockquote>

    <br>

    Serving text in UTF-8 allows using full Unicode web content, and

    minimizes compatibility risks.<br>
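
    <br>

    In practice this just means encoding the text to UTF-8 bytes on the
    way out and declaring the charset in the Content-Type header. A
    minimal Python sketch, assuming a hypothetical helper named
    utf8_response that is not tied to any particular web server:<br>

    <pre>
```python
def utf8_response(text):
    """Build (headers, body) for a text response encoded as UTF-8.

    Hypothetical helper for illustration only.
    """
    body = text.encode("utf-8")
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),  # length in bytes, not characters
    }
    return headers, body


headers, body = utf8_response("caf\u00e9 \u20ac")
```
</pre>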

    <br>

    <blockquote

cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"

      type="cite">

      <div>AFAIK, everything related to string I/O has specified

        encoding in some way, these days.</div>

      <div><br>

      </div>

      <div>Angel "Java" Lopez</div>

      <div>@ajlopez</div>

      <div><br>

      </div>

    </blockquote>

    <br>

    Cheers,<br>

    Juan Vuletich<br>

    <br>

    <blockquote

cite="mid:CAMs+DqJ-kqCds3HvoMcCHQtruDi-k7s0yitU+kGKwwx9i3zyYg@mail.gmail.com"

      type="cite">

      <div><br>

        <div class="gmail_quote">

          On Wed, Feb 6, 2013 at 11:59 PM, Juan Vuletich <span

            dir="ltr">&lt;<a moz-do-not-send="true"

              href="mailto:juan@jvuletich.org" target="_blank">juan@jvuletich.org</a>&gt;</span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt

            0.8ex; border-left: 1px solid rgb(204, 204, 204);

            padding-left: 1ex;">

            Hi Folks,<br>

            <br>

            I wasn't able to jump into the recent discussion earlier, but

            I've been thinking a bit about all this. This <a

              moz-do-not-send="true"

              href="http://en.wikipedia.org/wiki/UTF8" target="_blank">http://en.wikipedia.org/wiki/UTF8</a>

            made me realize that using UTF8 internally to represent

            Unicode strings is a way to minimize required changes to

            existing software.<br>

            <br>

            Hannes, your proposal for representing WideStrings as

            variableWordSubclass is like <a moz-do-not-send="true"

              href="http://en.wikipedia.org/wiki/UTF-32" target="_blank">http://en.wikipedia.org/wiki/UTF-32</a>,

            right? And if I remember correctly, it is what Squeak uses.<br>

            <br>

            I ended up sketching this to compare alternatives:<br>

            <br>

            <br>

            ISO 8859-15<br>

            ========<br>

               pros:<br>

               -------<br>

                  - We already have a lot of specific primitives in the

            VM<br>

                  - Efficient use of memory<br>

                  - Very fast #at: and #at:put:<br>

               cons:<br>

               -------<br>

                  - Can only represent Latin alphabets and not full

            Unicode<br>

            <br>

            Unicode UTF-8 (in a new variableByteSubclass)<br>

               pros:<br>

               -------<br>

                  - We could reuse many existing String primitives<br>

                  - It was created to allow existing code to deal with

            Unicode with minimal changes<br>

                  - Efficient use of memory<br>

               cons:<br>

               -------<br>

                  - Does not allow #at:put:<br>

                  - #at: is very slow, O(n) instead of O(1)<br>

            <br>

            Unicode UTF-32 (in a new variableWordSubclass)<br>

               pros:<br>

               -------<br>

                  - Very fast #at: and #at:put: (although I guess we don't

            use this much... especially #at:put:, as Strings are usually

            regarded as immutable)<br>

               cons:<br>

               --------<br>

                  - very inefficient use of memory (4 bytes for every

            character!)<br>

                  - doesn't take advantage of existing String primitives

            (sloooow)<br>

            <br>

            <br>

            I think that for some time, our main Character/String

            representation should be ISO 8859-15. Switching to full

            Unicode would require a lot of work in the paragraph layout

            and text rendering engines. But we can start building a

            Unicode representation that could be used for webapps.<br>

            <br>

            So I suggest:<br>

            <br>

            - Build a Utf8String variableByteSubclass and

            Utf8Character. Try to use String primitives. Do not include

            operations such as #at:, to discourage their use.<br>

            <br>

            - Make conversion to/from ISO 8859-15 lossless, by a good

            codification of CodePoints in regular Strings (not unlike

            Hannes' recent contribution).<br>

            <br>

            - Use this for the Clipboard. This is pretty close to what

            we have now, but we'd allow using an external Unicode

            textEditor to build any Unicode text, and by copying and

            pasting into a String literal in Cuis code, we'd have

            something that can be converted back into UTF-8 without

            losing any content.<br>

            <br>

            - Web servers should convert back Strings into UTF-8. This

            would let us handle and serve content using full Unicode,

            without needing to wait until our tools can display it

            properly.<br>

            <br>

            - Use this when editing external files, if they happen to be

            in UTF-8 (there are good heuristics for determining this).

            On save, we offer the option to save as UTF-8 or ISO

            8859-15.<br>

            <br>

            - We can start adapting the environment to avoid using the

            String protocols that are a problem for UTF-8 (i.e. #at:,

            #at:put: and related). This would ease an eventual migration

            to using only UTF-8 for everything.<br>

            <br>

            What do you think? Do we have a plan?<br>

            <br>

            Cheers,<br>

            Juan Vuletich<br>

            <br>

            _______________________________________________<br>

            Cuis mailing list<br>

            <a moz-do-not-send="true" href="mailto:Cuis@jvuletich.org"

              target="_blank">Cuis@jvuletich.org</a><br>

            <a moz-do-not-send="true"

              href="http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org"

              target="_blank">http://jvuletich.org/mailman/listinfo/cuis_jvuletich.org</a><br>

          </blockquote>

        </div>

        <br>

      </div>


    </blockquote>

    <br>

  </body>

</html>