On Aug 1, 7:44 am, william tanksley <[EMAIL PROTECTED]> wrote:
> John Machin <[EMAIL PROTECTED]> wrote:
> > william tanksley <[EMAIL PROTECTED]> wrote:
> > Let's try again:
>
> Cool. Sorry for the misunderstanding. Thank you for helping again!
>
> Postscript: your request to print the actual data did the trick.

I'd back inspecting actual data against armchair philosophy any time :-)

> I'm including the rest of my reply just to provide context, but the
> answer was that the Unicode was actually embedded in the URL, encoded
> as distinct bytes. Thus, it *had* to be url-decoded and then UTF-8
> decoded, in that order, in order to recover the original filename.
>
> So the problem was indeed purely in my head -- I should have looked at
> the original data (unfortunately, I was fooled by looking at the song
> title, which is the same thing but with the raw UTF-8 bytes instead of
> the URL escape codes).
>
> > >> track_id = url2pathname(urlparse(track_id).path)
> > >> print repr(track_id)
> > >> parse_result = urlparse(track_id).path
> > >> print repr(parse_result)
> > >> track_id_replacement = url2pathname(parse_result)
> > >> print repr(track_id_replacement)
> > >
> > > The "important" value here is track_id_replacement; it contains the
> > > data that's throwing me. It appears that some UTF-8 characters are
> > > being read as multiple bytes by ElementTree rather than being decoded
> > > into Unicode.
> > >
> > > Here's one example. The others are similar -- they have the same
> > > things that look like problems to me.
> > >
> > > "Buffett Time - Annual Shareholders\xc2\xa0L.mp3"
> >
> > ROTFL! I thought the Buffett thing was a Windows filename! What I was
> > expecting was THREE lots of repr() output, and I'm quite unused to
> > seeing repr() output with quotes around it instead of apostrophes; how
> > did you achieve that?
>
> I don't know -- but I got it again when I printed out the original
> version. My *guess* would be that this is what repr prints when asked
> to print a byte string (but I don't know how to confirm that).
> Alternatively, the fact that I'm running these inside SPE might be
> changing some defaults. I'm not sure.
>
> You're right that single quotes are expected -- and I'd expect a
> preceding u, since they're supposed to be Unicode. I dunno what's
> going on.

Why do you suppose that the contents are Unicode? It's a URL-encoded
string, i.e. *deliberately* ASCII, in fact sub-ASCII (see all the %20
stuff?). What's going on is that ElementTree presents text as a plain
ASCII string if it can be so represented, otherwise as a Unicode
string. This is actually a *convenience*. Get used to it. Enjoy it.

> > So you're saying that track_id_replacement contains utf8 characters.
> > It is obtained by track_id_replacement = url2pathname(parse_result).
> > You don't show us what is in parse_result. url2pathname() is nothing
> > to do with ElementTree. urlparse() is nothing to do with ElementTree.
> > You have provided no evidence that ElementTree is doing what you
> > accuse it of.
>
> Okay. Here's the evidence... Or something. Looking at this I begin to
> see why things work the way they do. It's utterly bizarre, quite
> frankly.
>
> > Please try again. Backtrack in your code to where you are pulling the
> > url out of an element. Do print repr(some_element.some_attribute).
> > Show us.
>
> Okay, the repr of the string that comes out of the .text attribute is:
>
> "file://localhost/C:/Documents%20and%20Settings/TanksleyJrW/My%20Documents/My%20Music/iTunes/iTunes%20Music/Podcasts/Brian%20Preston's%20_Money%20Guy_%20Blog%20and%20Pod/Buffett%20Time%20-%20Annual%20Shareholders%C2%A0L.mp3"
>
> Looking at the XML, and THIS TIME actually looking at the correct
> attribute (I was looking at the title before) I see... surprise!
> That's the correct data.
>
> So all of the mysteries are solved (except for my Python's
> doublequotes, but who cares), and ElementTree is entirely vindicated.

Shucks. I can sense that you'd been looking forward to conducting an
auto-da-fe followed by tossing the author on a bonfire ... but you
can't burn a bot anyway :-)
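
For anyone finding this thread later, here is a minimal sketch of the "url-decode first, then UTF-8 decode" order described above. It is written for Python 3 (the thread itself used Python 2's urlparse and url2pathname), the function name decode_file_url is made up for illustration, and the URL is a shortened stand-in for the real iTunes path, so treat it as an assumption-laden example rather than the code actually used.

```python
from urllib.parse import urlparse, unquote_to_bytes


def decode_file_url(url):
    """Return (raw_bytes, text) for the path part of a file:// URL.

    Hypothetical helper, not from the thread: it only illustrates the
    decoding order (percent-escapes first, UTF-8 second).
    """
    # The path is still pure ASCII here: every non-ASCII byte is a %XX escape.
    path = urlparse(url).path
    # Step 1: undo the URL escaping. %C2%A0 becomes the two bytes C2 A0.
    raw = unquote_to_bytes(path)
    # Step 2: interpret those bytes as UTF-8. C2 A0 becomes U+00A0.
    text = raw.decode("utf-8")
    return raw, text


# Shortened, made-up stand-in for the URL quoted in the thread.
url = ("file://localhost/C:/My%20Music/"
       "Buffett%20Time%20-%20Annual%20Shareholders%C2%A0L.mp3")
raw, text = decode_file_url(url)
print(repr(raw))
print(repr(text))
```

As for the double quotes: repr() switches from apostrophes to double quotes when the string contains an apostrophe and no double quote, which the full path does ("Preston's"), so that is a likely explanation rather than anything SPE was doing.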