Re: not quite 1252

2006-04-29 Thread Anton Vredegoor
Martin v. Löwis wrote: > Well, if the document is UTF-8, you should decode it as UTF-8, of > course. Thanks. This and: http://en.wikipedia.org/wiki/UTF-8 solved my problem with understanding the encoding. Anton proof that I understand it now (please anyone, prove me wrong if you can): from z
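A minimal sketch of the point about decoding UTF-8 as UTF-8, assuming Python 3 (where bytes and text are separate types) and using a left curly quote as a hypothetical sample character; it is not the proof Anton posted:

```python
# Assumption: Python 3, where bytes and str are distinct types.
left_quote = "\u201c"                      # LEFT DOUBLE QUOTATION MARK
utf8_bytes = left_quote.encode("utf-8")    # three bytes: e2 80 9c

# Decoding with the codec the bytes were written in restores the character.
decoded_right = utf8_bytes.decode("utf-8")

# Decoding the same bytes as cp1252 yields three unrelated characters.
decoded_wrong = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex())
print(ascii(decoded_right))
print(ascii(decoded_wrong))
```

The wrong codec does not raise an error here; it silently produces one character per byte, which is why the mismatch can go unnoticed.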

Re: not quite 1252

2006-04-28 Thread Martin v. Löwis
Anton Vredegoor wrote: >> So if that is the case: What is the problem then? If you interpret >> the document as cp1252, and it contains \x93 and \x94, what is >> it that you don't like about that? In yet other words: what actions >> are you performing, what are the results you expect to get, and >>

Re: not quite 1252

2006-04-28 Thread Anton Vredegoor
Martin v. Löwis wrote: > So if that is the case: What is the problem then? If you interpret > the document as cp1252, and it contains \x93 and \x94, what is > it that you don't like about that? In yet other words: what actions > are you performing, what are the results you expect to get, and > wha

Re: not quite 1252

2006-04-28 Thread Serge Orlov
Anton Vredegoor wrote: > Serge Orlov wrote: > > > Anton Vredegoor wrote: > >> In fact there are a lot of printable things that haven't got a text > >> attribute, for example some items with tag ()s. > > > > In my sample file I see , is that what you're talking > > about? Since my file is small I ca

Re: not quite 1252

2006-04-28 Thread Serge Orlov
Anton Vredegoor wrote: > Anton Vredegoor wrote: > > > So, probably yes. If it doesn't have a text attribute if you iterate > > over it using OOopy for example: > > Sorry about that, I meant if the text attribute is None, but there *is* > some text. OK, I think I understand what you're talking ab

Re: not quite 1252

2006-04-28 Thread Anton Vredegoor
Anton Vredegoor wrote: > So, probably yes. If it doesn't have a text attribute if you iterate > over it using OOopy for example: Sorry about that, I meant if the text attribute is None, but there *is* some text. Anton -- http://mail.python.org/mailman/listinfo/python-list

Re: not quite 1252

2006-04-28 Thread Anton Vredegoor
Serge Orlov wrote: > Anton Vredegoor wrote: >> In fact there are a lot of printable things that haven't got a text >> attribute, for example some items with tag ()s. > > In my sample file I see , is that what you're talking > about? Since my file is small I can say for sure this tag represents > t

Re: not quite 1252

2006-04-28 Thread Serge Orlov
Anton Vredegoor wrote: > Serge Orlov wrote: > > > I extracted content.xml from a test file and the header is: > > > > > > So any xml library should handle it just fine, without you trying to > > guess the encoding. > > Yes my header also says UTF-8. However some kind person sent me an > e-mail sta

Re: not quite 1252

2006-04-28 Thread Anton Vredegoor
Richard Brodie wrote: > "Anton Vredegoor" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] > >> Yes my header also says UTF-8. However some kind person sent me an e-mail >> stating that >> since I am getting \x94 and such output when using repr (even if str is >> giving correct

Re: not quite 1252

2006-04-28 Thread Serge Orlov
Anton Vredegoor wrote: > In fact there are a lot of printable things that haven't got a text > attribute, for example some items with tag ()s. In my sample file I see , is that what you're talking about? Since my file is small I can say for sure this tag represents two space characters.

Re: not quite 1252

2006-04-28 Thread John Machin
JM>> No, not quite. If you saw \x94 in the repr() output, but it looked "OK" when displayed using str(), then the only reasonable hypotheses are (a) the data was in an 8-bit string, presumably encoded as cp1252 (definitely NOT UTF-8), rather than a Unicode string (b) yo
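Hypothesis (a) can be checked directly. The sketch below assumes Python 3 and a hypothetical one-byte sample; it only illustrates why a lone \x94 byte points at cp1252 rather than UTF-8:

```python
# Assumption: Python 3; `raw` is a hypothetical one-byte sample.
raw = b"\x94"

# In cp1252 this byte is the right curly quote, U+201D.
as_cp1252 = raw.decode("cp1252")

# On its own the same byte is not valid UTF-8, so decoding fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Real UTF-8 for that character takes three bytes, none of them \x94.
utf8_form = as_cp1252.encode("utf-8")

print(ascii(as_cp1252), valid_utf8, utf8_form.hex())
```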

Re: not quite 1252

2006-04-28 Thread Richard Brodie
"Anton Vredegoor" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Yes my header also says UTF-8. However some kind person sent me an e-mail > stating that > since I am getting \x94 and such output when using repr (even if str is > giving correct > output) there could be some pr

Re: not quite 1252

2006-04-28 Thread Anton Vredegoor
Serge Orlov wrote: > I extracted content.xml from a test file and the header is: > > > So any xml library should handle it just fine, without you trying to > guess the encoding. Yes my header also says UTF-8. However some kind person sent me an e-mail stating that since I am getting \x94 and s

Re: not quite 1252

2006-04-27 Thread Anton Vredegoor
4 codes were left inside the document. But that was an *artifact*, because if one prints something using s.__repr__() as is used for example when printing a list of strings (duh) the output is not the same as when one prints with 'print s'. I guess what is called then is str(s).
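The artifact described here can be reproduced in a few lines. This is a sketch assuming Python 3 and a hypothetical cp1252 byte string; the thread itself predates Python 3, where 8-bit strings behaved somewhat differently:

```python
# Assumption: Python 3; `raw` is a hypothetical cp1252-encoded byte string.
raw = b"say \x93hi\x94"

# Converting a list to text applies repr() to each element,
# so the escape sequences appear in the output.
as_list = str([raw])

# Decoding first gives real text with curly quotes and no escapes.
as_text = raw.decode("cp1252")

print(as_list)
print(ascii(as_text))
```

The escapes are a property of how the value is displayed, not extra characters stored in the data.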

Re: not quite 1252

2006-04-27 Thread John Machin
On 27/04/2006 12:49 AM, Anton Vredegoor wrote: > Fredrik Lundh wrote: > >> Anton Vredegoor wrote: >> >>> I'm trying to import text from an open office document (save as .sxw and >>> read the data from content.xml inside the sxw-archive using >>> elementtree and such tools). >>> >>> The encoding t

Re: not quite 1252

2006-04-26 Thread Martin v. Löwis
Anton Vredegoor wrote: >> Not sure I understand the question. If you process data in cp1252, >> then \x93 and \x94 are legal characters, and the Python codec should >> support them just fine. > > Tell that to the guys from open-office. Ok, I'll rephrase: Can you please explain your problem again,

Re: not quite 1252

2006-04-26 Thread Serge Orlov
Anton Vredegoor wrote: > I'm trying to import text from an open office document (save as .sxw and > read the data from content.xml inside the sxw-archive using > elementtree and such tools). > > The encoding that gives me the least problems seems to be cp1252, > however it's not completely perfe

Re: not quite 1252

2006-04-26 Thread Anton Vredegoor
Martin v. Löwis wrote: > Not sure I understand the question. If you process data in cp1252, > then \x93 and \x94 are legal characters, and the Python codec should > support them just fine. Tell that to the guys from open-office. Anton

Re: not quite 1252

2006-04-26 Thread Martin v. Löwis
Anton Vredegoor wrote: > The encoding that gives me the least problems seems to be cp1252, > however it's not completely perfect because there are still characters > in it like \x93 or \x94. Has anyone handled this before? I'd rather not > reinvent the wheel and start translating strings 'by hand'.

Re: not quite 1252

2006-04-26 Thread Anton Vredegoor
Fredrik Lundh wrote: > Anton Vredegoor wrote: > >> I'm trying to import text from an open office document (save as .sxw and >> read the data from content.xml inside the sxw-archive using >> elementtree and such tools). >> >> The encoding that gives me the least problems seems to be cp1252, >> ho

Re: not quite 1252

2006-04-26 Thread Fredrik Lundh
Anton Vredegoor wrote: > I'm trying to import text from an open office document (save as .sxw and > read the data from content.xml inside the sxw-archive using > elementtree and such tools). > > The encoding that gives me the least problems seems to be cp1252, > however it's not completely perfec

not quite 1252

2006-04-26 Thread Anton Vredegoor
I'm trying to import text from an open office document (save as .sxw and read the data from content.xml inside the sxw-archive using elementtree and such tools). The encoding that gives me the least problems seems to be cp1252, however it's not completely perfect because there are still chara
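As a rough sketch of the task described above, the following reads content.xml out of a zip archive and hands the raw bytes to the standard-library ElementTree parser, which honours the encoding named in the XML declaration. It assumes Python 3; the helper names and the tiny stand-in archive are hypothetical, and a real .sxw file has a richer, namespaced structure:

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_content_root(archive_bytes):
    """Parse content.xml from a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        xml_bytes = archive.read("content.xml")
    # Pass bytes, not decoded text: the parser reads the declared encoding.
    return ET.fromstring(xml_bytes)


def make_sample_archive():
    """Build a tiny stand-in archive (hypothetical content, not a real .sxw)."""
    xml_text = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<doc><p>\u201cquoted\u201d</p></doc>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", xml_text.encode("utf-8"))
    return buffer.getvalue()


root = read_content_root(make_sample_archive())
paragraph_text = root.find("p").text
print(ascii(paragraph_text))
```

No codec is chosen by hand anywhere in this sketch; the curly quotes come back as ordinary Unicode characters because the parser did the decoding.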