I finally worked it all out. For future reference, here's a record of my research on this: http://stackoverflow.com/questions/715958/how-do-you-handle-different-string-encodings
Daniel Jomphe wrote:
> I made some progress.
>
> [By the way, NetBeans' console displays *everything* 100% fine.
> I decided to use one of the worst repl consoles: that of IntelliJ.
> I want to make sure I really understand what's the point behind all
> this.]
>
> (import '(java.io PrintWriter PrintStream FileInputStream)
>         '(java.nio CharBuffer ByteBuffer)
>         '(java.nio.charset Charset CharsetDecoder CharsetEncoder)
>         '(org.xml.sax InputSource))
>
> (def utf8 "UTF-8")
> (def d-utf8 (.newDecoder (Charset/forName utf8)))
> (def e-utf8 (.newEncoder (Charset/forName utf8)))
>
> (def latin1 "ISO-8859-1")
> (def d-latin1 (.newDecoder (Charset/forName latin1)))
> (def e-latin1 (.newEncoder (Charset/forName latin1)))
>
> (defmacro with-out-encod
>   [encoding & body]
>   `(binding [*out* (PrintWriter. (PrintStream. System/out true
>                      ~encoding) true)]
>      ~@body
>      (flush)))
>
> (def s "québécois français")
>
> (print s)                           ;quÔøΩbÔøΩcois franÔøΩaisnil
> (with-out-encod latin1 (print s))   ;qu?b?cois fran?aisnil
> (with-out-encod utf8 (print s))     ;qu?b?cois fran?aisnil
>
> (def encoded (.encode e-utf8
>                (CharBuffer/wrap "québécois français")))
> (def s-d
>   (.toString (.decode d-utf8 encoded)))
>
> (print s-d)                         ;quÔøΩbÔøΩcois franÔøΩaisnil
> (with-out-encod latin1 (print s-d)) ;qu?b?cois fran?aisnil
> (with-out-encod utf8 (print s-d))   ;qu?b?cois fran?aisnil
>
> (def f-d
>   (:content (let [x (InputSource. (FileInputStream. "french.xml"))]
>               (.setEncoding x latin1)
>               (clojure.xml/parse x))))
>
> (print f-d)                         ;quÔøΩbÔøΩcois franÔøΩaisnil
> (with-out-encod latin1 (print f-d)) ;québécois français
> (with-out-encod utf8 (print f-d))   ;québécois français
>
> So my theory, which is still almost certainly wrong, is:
>
> 1. When the input is a file whose encoding is, say, latin-1, it's easy
> to decode it and then encode it however one wants.
> 2. When the input is a literal string in the source file, it looks
> like it's impossible to encode it correctly, unless one first decodes
> it from the source file's encoding. But then, I don't yet know how to
> do this without actually reading the source file. :\
>
> Daniel Jomphe wrote:
> > I tried under Eclipse.
> >
> > Default console encoding configuration (MacRoman):
> >
> > #'user/s
> > quÔøΩbÔøΩcois franÔøΩaisnil
> > qu?b?cois fran?aisnil
> >
> > #'user/snc
> > qu?b?cois fran?aisnil
> > qu?b?cois fran?aisnil
> >
> > Console configured to print using ISO-8859-1:
> >
> > #'user/s
> > qu�b�cois fran�aisnil
> > qu?b?cois fran?aisnil
> >
> > #'user/snc
> > qu?b?cois fran?aisnil
> > qu?b?cois fran?aisnil
> >
> > Console configured to print using UTF-8:
> >
> > #'user/s
> > québécois françaisnil
> > québécois françaisnil
> >
> > #'user/snc
> > québécois françaisnil
> > québécois françaisnil
> >
> > So as I come to understand it, it looks like UTF-8 should be the
> > rolls-royce for my needs.
> >
> > May I correctly conclude the following?
> >
> > Don't bother about encodings unless you're displaying something and
> > it's unreadable; then, don't bother about it in the code; find a
> > proper console or viewer.
> >
> > Doesn't that sound like offloading a problem to users? Isn't there
> > something reliable that can be done in the code?
> >
> > Daniel Jomphe wrote:
> > > Sorry for all these posts.
> > >
> > > I pasted my last post's code into a fresh repl (not in my IDE), and
> > > here's what I got (cleaned up):
> > >
> > > #'user/s
> > > québécois françaisnil
> > > qu?b?cois fran?aisnil
> > >
> > > #'user/snc
> > > québécois françaisnil
> > > qu?b?cois fran?aisnil
> > >
> > > I'm not sure what to make out of it.
> > >
> > > My terminal (Apple Terminal) supports the encoding, and prints
> > > correctly s and snc out of the box.
> > > When I use with-out-encoded, I actually screw up both s and snc's
> > > printing.
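One way to take the console out of the picture is to look at the string's code points instead of at what gets printed. The sketch below is only an illustration and not code from the posts above: it assumes the literal is written with \u escapes, so it does not depend on how the source file or REPL input was decoded, and the name first-points is made up for the example.

```clojure
;; Sketch: inspect the string itself rather than trusting the console.
;; \u00e9 (e acute) and \u00e7 (c cedilla) are Unicode escapes, so this
;; literal is independent of the encoding used to read the source.
(def s "qu\u00e9b\u00e9cois fran\u00e7ais")

;; Code points of the first three characters: q, u, e-acute.
(def first-points (map int (take 3 s)))

;; Encoding to ISO-8859-1 bytes and decoding them again with the same
;; charset returns an equal string, so snc carries no extra
;; "correctness" compared to s.
(def nc "ISO-8859-1")
(def snc (String. (.getBytes s nc) nc))

(println first-points) ; prints (113 117 233)
(println (= s snc))    ; prints true
```

If the same check on a literal typed with the accented characters directly shows something other than 233 in third position, the string was presumably already damaged when it was read, before any printing happened.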
> > >
> > > Daniel Jomphe wrote:
> > > > Now that I know for sure how to bind *out* to something else over
> > > > System/out, it's time to bring back my encoding issues into scope:
> > > >
> > > > (import '(java.io PrintWriter PrintStream))
> > > >
> > > > (defmacro with-out-encoded
> > > >   [encoding & body]
> > > >   `(binding [*out* (java.io.PrintWriter. (java.io.PrintStream.
> > > >                      System/out true ~encoding) true)]
> > > >      ~@body
> > > >      (flush)))
> > > >
> > > > (def nc "ISO-8859-1")
> > > >
> > > > ;;; with a normal string
> > > > (def s "québécois français")
> > > >
> > > > (print s)
> > > > ; quÔøΩbÔøΩcois franÔøΩaisnil
> > > >
> > > > (with-out-encoded nc (print s))
> > > > ; qu?b?cois fran?aisnil
> > > >
> > > > ;;; with a correctly-encoded string
> > > > (def snc (String. (.getBytes s nc) nc))
> > > >
> > > > (print snc)
> > > > ; qu?b?cois fran?aisnil
> > > >
> > > > (with-out-encoded nc (print snc))
> > > > ; qu?b?cois fran?aisnil
> > > >
> > > > I'm certainly missing something fundamental somewhere.
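To see what a macro like with-out-encoded actually sends to the stream, it can help to capture the bytes instead of letting a console interpret them. The following is a rough sketch under my own assumptions, not the code from the thread: bytes-written is a hypothetical helper, and it writes into a ByteArrayOutputStream rather than System/out.

```clojure
(import '(java.io ByteArrayOutputStream OutputStreamWriter))

;; Hypothetical helper: returns the bytes (as unsigned values 0-255)
;; produced when s is printed through a writer using the given encoding.
(defn bytes-written
  [s encoding]
  (let [buf (ByteArrayOutputStream.)]
    ;; Rebind *out* to a writer that encodes characters with `encoding`.
    (binding [*out* (OutputStreamWriter. buf encoding)]
      (print s)
      (flush))
    ;; JVM bytes are signed; mask them so the values are easy to read.
    (vec (map #(bit-and % 0xff) (.toByteArray buf)))))

(println (bytes-written "\u00e9" "UTF-8"))      ; prints [195 169]
(println (bytes-written "\u00e9" "ISO-8859-1")) ; prints [233]
```

The same character becomes two bytes in UTF-8 and one byte in ISO-8859-1. A terminal that expects UTF-8 cannot decode a lone 233 byte, which would be one plausible reason why forcing ISO-8859-1 onto a UTF-8 terminal garbles output that printed fine by default.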