Jan Böcker <jan.boec...@jboecker.de> writes:

> On 06.02.2010 14:50, Jan Böcker wrote:
>> AFAIK, your current approach is correct.
>
> I was wrong. The attached patch fixes a bug in the encode_uri function.
> That fixes the non-ASCII characters problem in xournal for me.
>
> The gchar type is just typedef'd to char, which means it is signed. To
> get the byte value, it must be cast to unsigned int first.
>
> - Jan

Hi Jan and Daniel!

Sorry for answering after such a long delay. I read Daniel's mail last week, but I had to think about the answer.

I'll just describe what the `org-protocol-unhex-string' functions do here, and what they expect as arguments.

Basically, it is OK to url-encode each byte whose binary representation starts with a 1 (i.e., whose value is higher than 127). The text to be url-encoded should ideally be UTF-8. If you use glib::ustring, it's easy to transform any iso-8859 string to UTF-8.

Each byte whose binary representation starts with a 1 has to be url-encoded, as does the `%' character [1], but you could just as well url-encode the entire UTF-8 string.

The function that does the decoding is `org-protocol-unhex-string', which in turn uses `org-protocol-unhex-compound'. `man utf-8' shows how org-protocol tries to decode characters.

The JavaScript function `encodeURIComponent()' returns exactly what we need. It recodes a string to UTF-8 and then encodes all characters except digits, ASCII letters and these punctuation characters: -_.!~*'()

See ECMA-262 Standard, Section 15.1.3 (http://bclary.com/2004/11/07/ecma-262.html#a-15.1.3 [2]):

  "The character is first transformed into a sequence of octets using the UTF-8 transformation..."

Again, note that the decoding mechanism relies on the fact that the sequence to decode is url-encoded UTF-8.

Example:

The url-encoded UTF-8 representation of the German umlaut `ö' is `%C3%B6'. Thus

  (org-protocol-unhex-string "%C3%B6")

gives you "ö".

In iso-8859-1, the url-encoded representation of the same character `ö' is `%F6'. But

  (org-protocol-unhex-string "%F6")

gives you "", the empty string. A lone byte 0xF6 is not a valid UTF-8 character, since every byte starting with a 1 (i.e. bigger than 127) belongs to a multibyte sequence (2 or more bytes).

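To make the difference between the two encodings concrete, here is a minimal sketch in Python (not the Emacs Lisp that org-protocol itself uses). The helper names `encode_for_org_protocol' and `decode_strict' are made up for this illustration, and `decode_strict' only mirrors the idea of the decoder described above, not its actual implementation.

```python
from typing import Optional
from urllib.parse import quote, unquote_to_bytes


def encode_for_org_protocol(text: str) -> str:
    """Hypothetical helper: percent-encode the UTF-8 bytes of `text`.

    safe="" makes quote() encode "/" and "%" as well, so only
    unreserved ASCII characters are left untouched.
    """
    return quote(text, safe="", encoding="utf-8")


def decode_strict(encoded: str) -> Optional[str]:
    """Hypothetical helper: undo the percent-encoding, then decode as UTF-8.

    Returns None when the resulting bytes are not valid UTF-8.
    """
    raw = unquote_to_bytes(encoded)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(encode_for_org_protocol("ö"))                 # %C3%B6
    print(quote("ö", safe="", encoding="iso-8859-1"))   # %F6
    print(decode_strict("%C3%B6"))                      # ö
    print(decode_strict("%F6"))                         # None
```

Note that Python's quote() leaves a slightly different set of punctuation unencoded than `encodeURIComponent()' does; that is harmless here, since encoding more characters than strictly necessary is still decodable.
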
Note, however, that

  (org-protocol-unhex-string "%2F%3C")

gives you, as expected, "/<", which shows that you can safely url-encode each and every character of a UTF-8 encoded string.

== Footnotes:

[1] The percent character `%' has to be encoded if it is followed by [0-9A-Fa-f]{2}, because org-protocol will assume that a sequence matching "\\(%[0-9a-f][0-9a-f]\\)+" is an encoded character. In practice, a `%' should always be url-encoded, since one will hardly ever know for sure that a `%' is never followed by "[0-9a-f][0-9a-f]".

[2] Get a PDF version of ECMA-262 third edition here: http://www.ecma-international.org/publications/standards/Ecma-262.htm