I finally worked it all out.

For future reference, here's a record of my research on this:
http://stackoverflow.com/questions/715958/how-do-you-handle-different-string-encodings
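
A minimal sketch, not taken from the thread, of one factor that seems to be at play in the quoted with-out-encoded macro below. When java.io.PrintWriter is constructed directly on an OutputStream, it inserts an OutputStreamWriter that uses the platform default charset, so the encoding handed to the inner PrintStream is never applied to the characters. Passing the charset to an OutputStreamWriter avoids that. The names bytes-printed-as and with-out-charset are hypothetical helpers introduced only for this illustration, and the string uses a \u escape so the result does not depend on the encoding of the source file or of the REPL's input.

```clojure
(import '(java.io ByteArrayOutputStream OutputStreamWriter PrintWriter))

;; Hypothetical helper (not from the thread): call f with *out* bound to a
;; writer that encodes characters using `encoding`, then return the bytes
;; that were actually written, as a vector.
(defn bytes-printed-as
  [encoding f]
  (let [sink (ByteArrayOutputStream.)]
    (binding [*out* (PrintWriter. (OutputStreamWriter. sink encoding) true)]
      (f)
      (flush))
    (vec (.toByteArray sink))))

;; "\u00e9" is é, written as an escape on purpose (see the note above).
(def s "qu\u00e9")

(bytes-printed-as "ISO-8859-1" #(print s)) ; => [113 117 -23]      é is 1 byte
(bytes-printed-as "UTF-8"      #(print s)) ; => [113 117 -61 -87]  é is 2 bytes

;; Hypothetical variant of the quoted macro that writes to System/out and
;; lets the OutputStreamWriter do the encoding.
(defmacro with-out-charset
  [encoding & body]
  `(binding [*out* (PrintWriter. (OutputStreamWriter. System/out ~encoding) true)]
     ~@body
     (flush)))
```

Whether the result is readable still depends on the console decoding those bytes with the same charset, which is consistent with the Eclipse observations quoted below.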


Daniel Jomphe wrote:
> I made some progress.
>
> [By the way, NetBeans' console displays *everything* 100% fine.
>  I decided to use one of the worst REPL consoles: that of IntelliJ.
>  I want to make sure I really understand what's behind all
>  this.]
>
>   (import '(java.io PrintWriter PrintStream FileInputStream)
>           '(java.nio CharBuffer ByteBuffer)
>           '(java.nio.charset Charset CharsetDecoder CharsetEncoder)
>           '(org.xml.sax InputSource))
>
>   (def   utf8 "UTF-8")
>   (def d-utf8 (.newDecoder (Charset/forName utf8)))
>   (def e-utf8 (.newEncoder (Charset/forName utf8)))
>
>   (def   latin1 "ISO-8859-1")
>   (def d-latin1 (.newDecoder (Charset/forName latin1)))
>   (def e-latin1 (.newEncoder (Charset/forName latin1)))
>
>   (defmacro with-out-encod
>     [encoding & body]
>     `(binding [*out* (PrintWriter.
>                        (PrintStream. System/out true ~encoding)
>                        true)]
>               ~@body
>               (flush)))
>
>   (def s "québécois français")
>
>   (print s)                         ;quÔøΩbÔøΩcois franÔøΩaisnil
>   (with-out-encod latin1 (print s)) ;qu?b?cois fran?aisnil
>   (with-out-encod utf8   (print s)) ;qu?b?cois fran?aisnil
>
>   (def encoded (.encode e-utf8
>                         (CharBuffer/wrap "québécois français")))
>   (def s-d
>     (.toString (.decode d-utf8 encoded)))
>
>   (print s-d)                         ;quÔøΩbÔøΩcois franÔøΩaisnil
>   (with-out-encod latin1 (print s-d)) ;qu?b?cois fran?aisnil
>   (with-out-encod utf8   (print s-d)) ;qu?b?cois fran?aisnil
>
>   (def f-d
>     (:content (let [x (InputSource. (FileInputStream. "french.xml"))]
>          (.setEncoding x latin1)
>          (clojure.xml/parse x))))
>
>   (print f-d)                         ;quÔøΩbÔøΩcois franÔøΩaisnil
>   (with-out-encod latin1 (print f-d)) ;québécois français
>   (with-out-encod utf8   (print f-d)) ;québécois français
>
> So my theory, which is still almost certainly wrong, is:
>
> 1. When the input is a file whose encoding is, say, latin-1, it's easy
> to decode it and then encode it however one wants.
> 2. When the input is a literal string in the source file, it looks
> like it's impossible to encode it correctly, unless one first decodes
> it from the source file's encoding. But then, I don't yet know how to
> do this without actually reading the source file. :\
>
> Daniel Jomphe wrote:
> > I tried under eclipse.
>
> > Default console encoding configuration (MacRoman):
>
> >   #'user/s
> >   quÔøΩbÔøΩcois franÔøΩaisnil
> >   qu?b?cois fran?aisnil
>
> >   #'user/snc
> >   qu?b?cois fran?aisnil
> >   qu?b?cois fran?aisnil
>
> > Console configured to print using ISO-8859-1:
>
> >   #'user/s
> >   qu�b�cois fran�aisnil
> >   qu?b?cois fran?aisnil
>
> >   #'user/snc
> >   qu?b?cois fran?aisnil
> >   qu?b?cois fran?aisnil
>
> > Console configured to print using UTF-8:
>
> >   #'user/s
> >   québécois françaisnil
> >   québécois françaisnil
>
> >   #'user/snc
> >   québécois françaisnil
> >   québécois françaisnil
>
> > So as I come to understand it, it looks like UTF-8 should be the
> > Rolls-Royce for my needs.
>
> > May I correctly conclude the following?
>
> >   Don't bother about encodings unless you're displaying something and
> > it's unreadable; then, don't bother about it in the code; find a
> > proper console or viewer.
>
> > Doesn't that sound like offloading a problem to users? Isn't there
> > something reliable that can be done in the code?
>
> > Daniel Jomphe wrote:
> > > Sorry for all these posts.
>
> > > I pasted my last post's code into a fresh repl (not in my IDE), and
> > > here's what I got (cleaned up):
>
> > >   #'user/s
> > >   québécois françaisnil
> > >   qu?b?cois fran?aisnil
>
> > >   #'user/snc
> > >   québécois françaisnil
> > >   qu?b?cois fran?aisnil
>
> > > I'm not sure what to make of it.
>
> > > My terminal (Apple Terminal) supports the encoding, and prints
> > > s and snc correctly out of the box.
> > > When I use with-out-encoded, I actually break the printing of
> > > both s and snc.
>
> > > Daniel Jomphe wrote:
> > > > Now that I know for sure how to bind *out* to something else over
> > > > System/out, it's time to bring back my encoding issues into scope:
>
> > > >   (import '(java.io PrintWriter PrintStream))
>
> > > >   (defmacro with-out-encoded
> > > >     [encoding & body]
> > > >     `(binding [*out* (java.io.PrintWriter.
> > > >                        (java.io.PrintStream. System/out true ~encoding)
> > > >                        true)]
> > > >               ~@body
> > > >               (flush)))
>
> > > >   (def nc "ISO-8859-1")
>
> > > >   ;;; with a normal string
> > > >   (def s "québécois français")
>
> > > >   (print s)
> > > >   ; quÔøΩbÔøΩcois franÔøΩaisnil
>
> > > >   (with-out-encoded nc (print s))
> > > >   ; qu?b?cois fran?aisnil
>
> > > >   ;;; with a correctly-encoded string
> > > >   (def snc (String. (.getBytes s nc) nc))
>
> > > >   (print snc)
> > > >   ; qu?b?cois fran?aisnil
>
> > > >   (with-out-encoded nc (print snc))
> > > >   ; qu?b?cois fran?aisnil
>
> > > > I'm certainly missing something fundamental somewhere.