how to remove 'FFFD' character
Does anyone know a way to remove the 'FFFD' character with python?

You can see the browser output I'm dealing with here:
http://webcomm.webfactional.com/htdocs/fffd.JPG
I deleted a big chunk out of the middle of that JPG to protect sensitive data.

I don't know what the character encoding of this data is and don't know what the 'FFFD' represents. I guess it is something that can't be represented in whatever this particular encoding is, or maybe it is something corrupt that can't be represented in any encoding. I just want to scrub it out. I tried this...

    clean = txt.encode('ascii','ignore')

...but the 'FFFD' still comes through. Other ideas?

Thanks,
Ryan
--
http://mail.python.org/mailman/listinfo/python-list
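For reference, 'FFFD' is U+FFFD, the Unicode REPLACEMENT CHARACTER, which decoders and browsers substitute for bytes they cannot decode. Below is a minimal sketch of scrubbing it from a string that has already been decoded. It assumes Python 3, and strip_replacement_chars is a hypothetical helper name, not something from the thread. If the marker is only being drawn by the browser, stripping will not help, and the likelier fix is to decode the bytes with the correct encoding.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def strip_replacement_chars(text):
    """Return text with every U+FFFD removed (hypothetical helper)."""
    return text.replace(REPLACEMENT_CHAR, "")


# A byte that is not valid UTF-8 becomes U+FFFD when decoded with errors="replace".
damaged = b"Balance\xffDue".decode("utf-8", errors="replace")
cleaned = strip_replacement_chars(damaged)
print(repr(damaged), "->", repr(cleaned))
```

Note that stripping hides the symptom: the original byte is lost either way, so this is only reasonable when the damaged characters really are disposable.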
Re: distinction between unzipping bytes and unzipping a file
On Jan 9, 6:07 pm, John Machin wrote:
> Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
> God^H^H^HGates intended:
>
> >>> buff = open('data', 'rb').read()
> >>> buff[:100]
> '<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
> >>> buff[:100].decode('utf_16_le')

There it is. Thanks.

> u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
>
> > But if I return it to my browser with python+django,
> > there are bad characters every other character
>
> Please consider that we might have difficulty guessing what "return it
> to my browser with python+django" means. Show actual code.

I did stop and consider what code to show. I tried to show only the code that seemed relevant, as there are sometimes complaints on this and other groups when someone shows more than the relevant code. You solved my problem with decode('utf_16_le'). I can't find any description of that encoding on the WWW... and I thought *everything* was on the WWW. :)

I didn't know the data was utf_16_le-encoded because I'm getting it from a service. I don't even know if *they* know what encoding they used. I'm not sure how you knew what the encoding was.

> Please consider reading the Unicode HOWTO
> at http://docs.python.org/howto/unicode.html

Probably wouldn't hurt, though reading that HOWTO wouldn't have given me the encoding, I don't think.

-Ryan

> Cheers,
> John
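On the question of how the encoding could be guessed: in the bytes quoted above, every second byte is \x00, which is what mostly-ASCII text looks like in UTF-16 little-endian. A rough sketch of that check follows, assuming Python 3. The function name looks_like_utf16_le and the 0.9 threshold are made up for illustration, and the heuristic will misfire on text that is not mostly ASCII.

```python
def looks_like_utf16_le(data, threshold=0.9):
    """Heuristic: mostly-ASCII UTF-16-LE text has NUL at nearly every odd offset."""
    odd_bytes = data[1::2]
    if not odd_bytes:
        return False
    return odd_bytes.count(b"\x00") / len(odd_bytes) >= threshold


# Stand-in for the start of the unzipped file, built from the tags shown in the thread.
raw = "<Registration><BalanceDue>0.0000</BalanceDue>".encode("utf_16_le")

if looks_like_utf16_le(raw):
    text = raw.decode("utf_16_le")
else:
    text = raw.decode("utf-8", errors="replace")
print(text)
```

A guess like this is a fallback. Asking the provider which encoding they use, or having them declare it in the document, is more reliable.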
Re: BadZipfile "file is not a zip file"
On Jan 9, 7:33 pm, John Machin wrote:
> It is not impossible for a file with dummy data to have been
> handcrafted or otherwise produced by a process different to that used
> for a real-data file.

I knew it was produced by the same process, or I wouldn't have shared it. : )
But you couldn't have known that.

> > Not sure if you've seen this
> > thread...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>
> Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
> with one thread ...

Thanks... I thought I was posting about separate issues and would annoy people who were only interested in one of the issues if I put them both in the same thread. I guess all posts re: the same script should go in one thread, even if the questions posed may be unrelated and may be separate issues. There are grey areas.

Problem solved in John Machin's post at
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864/03b8341539d87989?hl=en&lnk=raot#03b8341539d87989

I'll post the final code when it's prettier.

-Ryan
Re: BadZipfile "file is not a zip file"
If anyone's interested, here are my django views...

    from django.shortcuts import render_to_response
    from django.http import HttpResponse
    from xml.etree.ElementTree import ElementTree
    import urllib, base64, subprocess

    def get_data(request):
        service_url = 'http://www.something.com/webservices/someservice/etc?user=etc&pass=etc'
        xml = urllib.urlopen(service_url)
        #the base64-encoded string is in a one-element xml doc...
        tree = ElementTree()
        xml_doc = tree.parse(xml)
        datum = ""
        for node in xml_doc.getiterator():
            datum = "%s" % (node.text)
        decoded = base64.b64decode(datum)
        dir = '/path/to/data/'
        f = open(dir+'data.zip', 'wb')
        f.write(decoded)
        f.close()
        file = subprocess.call('unzip '+dir+'data.zip -d '+dir, shell=True)
        file = open(dir+'data', 'rb').read()
        txt = file.decode('utf_16_le')
        return render_to_response('output.html',{
            'output' : txt
            })

    def read_xml(request):
        xml = urllib.urlopen('http://www.something.org/get_data/') #page using the get_data view
        xml = xml.read()
        xml = unicode(xml)
        xml = '\n'+xml+''
        f = open('/path/to/temp.txt','w')
        f.write(xml)
        f.close()
        tree = ElementTree()
        xml_doc = tree.parse('/path/to/temp.txt')
        datum = ""
        for node in xml_doc.getiterator():
            datum = "%s%s - %s" % (datum, node.tag, node.text)
        return render_to_response('output.html',{
            'output' : datum
            })
Re: BadZipfile "file is not a zip file"
On Jan 12, 11:53 am, "Chris Mellon" wrote:
> On Sat, Jan 10, 2009 at 1:32 PM, webcomm wrote:
> > On Jan 9, 7:33 pm, John Machin wrote:
> >> It is not impossible for a file with dummy data to have been
> >> handcrafted or otherwise produced by a process different to that used
> >> for a real-data file.
> >
> > I knew it was produced by the same process, or I wouldn't have shared
> > it. : )
> > But you couldn't have known that.
> >
> >> > Not sure if you've seen this
> >> > thread...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
> >
> >> Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
> >> with one thread ...
> >
> > Thanks... I thought I was posting about separate issues and would
> > annoy people who were only interested in one of the issues if I put
> > them both in the same thread. I guess all posts re: the same script
> > should go in one thread, even if the questions posed may be unrelated
> > and may be separate issues. There are grey areas.
> >
> > Problem solved in John Machin's post at
> > http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>
> It's worth pointing out (although the provider probably doesn't care)
> that this isn't really an XML document and this was a bad way of them
> to distribute the data. If they'd used a correctly formatted XML
> document (with the prelude and everything) with the correct encoding
> information, existing XML parsers should have just Done The Right
> Thing with the data, instead of you needing to know the encoding a
> priori to extract an XML fragment.

Agreed. I can't say I understand their rationale for doing it this way.
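To illustrate the point about encoding information, here is a small sketch, assuming Python 3 and a made-up document that reuses the tag names seen earlier in the thread. When the bytes carry an XML declaration and a byte order mark, the parser can work out the encoding itself and no decode() call is needed.

```python
import xml.etree.ElementTree as ET

# Dummy document for illustration only; the real service output is not shown in the thread.
document = (
    '<?xml version="1.0" encoding="utf-16"?>'
    "<Registration><BalanceDue>0.0000</BalanceDue></Registration>"
)

# str.encode("utf-16") prepends a byte order mark; together with the
# declaration, that tells the parser how to read the bytes.
payload = document.encode("utf-16")

root = ET.fromstring(payload)
print(root.tag, root.find("BalanceDue").text)
```

The fragment the service actually returned had neither a declaration nor a BOM, which is why the encoding had to be supplied by hand.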
practical limits of urlopen()
Hi,

Am I going to have problems if I use urlopen() in a loop to get data from 3000+ URLs? There will be about 2KB of data on average at each URL. I will probably run the script about twice per day. Data from each URL will be saved to my database.

I'm asking because I've never opened that many URLs before in a loop. I'm just wondering if it will be particularly taxing for my server. Is it very uncommon to get data from so many URLs in a script? I guess search spiders do it, so I should be able to as well?

Thank you,
Ryan
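For scale, 3000 URLs at about 2KB each is roughly 6MB per run, so the data volume is small. The practical concerns are more likely slow or failing requests. A hedged sketch of a sequential loop with a timeout and per-URL error handling follows, assuming Python 3 (where urlopen lives in urllib.request). The names fetch_url and fetch_all, the 10 second timeout, and the delay parameter are illustrative choices, not anything from the original post.

```python
import time
import urllib.request


def fetch_url(url, timeout=10):
    """Fetch one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_url, delay=0.0):
    """Fetch URLs one at a time; return (results, failures) dictionaries."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            failures[url] = exc
        if delay:
            time.sleep(delay)  # pause between requests to go easy on the remote server
    return results, failures
```

A call such as fetch_all(list_of_urls, delay=0.1) would then return the bodies keyed by URL, plus whichever URLs failed, so one bad URL does not abort the whole run.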
BadZipfile "file is not a zip file"
The error...

    >>> file = zipfile.ZipFile('data.zip', "r")
    Traceback (most recent call last):
      File "", line 1, in
        file = zipfile.ZipFile('data.zip', "r")
      File "C:\Python25\lib\zipfile.py", line 346, in __init__
        self._GetContents()
      File "C:\Python25\lib\zipfile.py", line 366, in _GetContents
        self._RealGetContents()
      File "C:\Python25\lib\zipfile.py", line 378, in _RealGetContents
        raise BadZipfile, "File is not a zip file"
    BadZipfile: File is not a zip file

When I look at data.zip in Windows, it appears to be a valid zip file. I am able to uncompress it in Windows XP, and can also uncompress it with 7-Zip. It looks like zipfile is not able to read a "table of contents" in the zip file. That's not a concept I'm familiar with.

data.zip is created in this script...

    decoded = base64.b64decode(datum)
    f = open('data.zip', 'wb')
    f.write(decoded)
    f.close()
    file = zipfile.ZipFile('data.zip', "r")

datum is a base64 encoded zip file. Again, I am able to open data.zip as if it's a valid zip file. Maybe there is something wrong with the approach I've taken to writing the data to data.zip? I'm not sure if it matters, but the zipped data is Unicode.

What would cause a zip file to not have a table of contents? Is there some way I can add a table of contents to a zip file using python? Maybe there is some more fundamental problem with the data that is making it seem like there is no table of contents?

Thanks in advance for your help.
Ryan
Re: BadZipfile "file is not a zip file"
On Jan 8, 8:02 pm, MRAB wrote:
> You're just creating a file called "data.zip". That doesn't make it a
> zip file. A zip file has a specific format. If the file doesn't have
> that format then the zipfile module will complain.

Hmm. When I open it in Windows or with 7-Zip, it contains a text file that has the data I would expect it to have. I guess that alone doesn't necessarily prove it's a zip file?

datum is something I'm downloading via a web service. The providers of the service say it's a zip file, and have provided a code sample in C# (which I know nothing about) that shows how to deal with it. In the code sample, the file is base64 decoded and then unzipped. I'm trying to write something in Python to decode and unzip the file.

I checked the file for comments and it has none. At least, when I view the properties in Windows, there are no comments.
Re: BadZipfile "file is not a zip file"
On Jan 8, 8:39 pm, "James Mills" wrote:
> Send us a sample of this file in question...

It contains data that I can't share publicly. I could ask the providers of the service if they have a dummy file I could use that doesn't contain any real data, but I don't know how responsive they'll be. It's an event registration service called RegOnline.
Re: BadZipfile "file is not a zip file"
On Jan 8, 8:54 pm, MRAB wrote:
> Have you tried gzip instead?

There's no option to download the data in a gzipped format. The files are .zip archives.
Re: BadZipfile "file is not a zip file"
On Jan 9, 3:16 am, Steven D'Aprano wrote:
> The full signature of ZipFile is:
>
> ZipFile(file, mode="r", compression=ZIP_STORED, allowZip64=True)
>
> Try passing compression=zipfile.ZIP_DEFLATED and/or allowZip64=False and
> see if that makes any difference.

Those arguments didn't make a difference in my case.

> The zip format does support alternative compression methods, it's
> possible that this particular file uses a different sort of compression
> which Python doesn't deal with.
>
> > What would cause a zip file to not have a table of contents?
>
> What makes you think it doesn't have one?

Because when I search for the "file is not a zip file" error in zipfile.py, there is a function that checks for a table of contents. Tho it looks like there are other ideas in this thread about what might cause that error... I'll keep reading...
Re: BadZipfile "file is not a zip file"
On Jan 9, 3:46 am, Carl Banks wrote:
> The zipfile format is kind of brain dead, you can't tell where the end
> of the file is supposed to be by looking at the header. If the end of
> file hasn't yet been reached there could be more data. To make
> matters worse, somehow zip files came to have text comments simply
> appended to the end of them. (Probably this was for the benefit of
> people who would cat them to the terminal.)
>
> Anyway, if you see something that doesn't adhere to the zipfile
> format, you don't have any foolproof way to know if it's because the
> file is corrupted or if it's just an appended comment.
>
> Most zipfile readers use a heuristic to distinguish. Python's zipfile
> module just assumes it's corrupted.
>
> The following post from a while back gives a solution that tries to
> snip the comment off so that zipfile module can handle it. It might
> help you out.
>
> http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
>
> Carl Banks

Thanks Carl. I tried Scott's getzip() function yesterday... I stumbled upon it in my searches. It didn't seem to help in my case, though it did produce a different error: ValueError, substring not found. Not sure what that means.
Re: BadZipfile "file is not a zip file"
On Jan 9, 5:42 am, John Machin wrote:
> And here's a little gadget that might help the diagnostic effort; it
> shows the archive size and the position of all the "magic" PKnn
> markers. In a "normal" uncommented archive, EndArchive_pos + 22 ==
> archive_size.

I ran the diagnostic gadget...

    archive size is 69888
    FileHeader at 0
    CentralDir at 43796
    EndArchive at 43846
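Applying the rule quoted above to these numbers: 43846 + 22 = 43868, but the archive is 69888 bytes, so 26020 bytes follow the end-of-archive record. The sketch below shows the same check on a small archive built in memory. It assumes Python 3 and is not John Machin's gadget; trailing_bytes is a hypothetical name.

```python
import io
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # marks the end-of-central-directory record
EOCD_FIXED_SIZE = 22            # size of that record when the comment is empty


def trailing_bytes(data):
    """Count the bytes after the fixed part of the last EOCD record (None if absent)."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        return None
    return len(data) - (pos + EOCD_FIXED_SIZE)


# Build a tiny, well-formed archive in memory to test against.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data", b"dummy payload")
clean = buffer.getvalue()

print(trailing_bytes(clean))                  # uncommented archive
print(trailing_bytes(clean + b"\x00" * 176))  # same archive with padding appended
```

A non-zero result means the file has either a genuine archive comment or extra bytes tacked on after the archive.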
Re: BadZipfile "file is not a zip file"
On Jan 9, 10:14 am, "Chris Mellon" wrote:
> This is a ticket about another issue or 2 with invalid zipfiles that
> the zipfile module won't load, but that other tools will compensate
> for:
>
> http://bugs.python.org/issue1757072

Hmm. That's interesting. Are there other tools I can use in a python script that are more forgiving? I am using the zipfile module only because it seems to be the most widely used. Are other options in python likely to be just as unforgiving? Guess I'll look and see...
Re: BadZipfile "file is not a zip file"
On Jan 9, 10:14 am, "Chris Mellon" wrote:
> This is a ticket about another issue or 2 with invalid zipfiles that
> the zipfile module won't load, but that other tools will compensate
> for:
>
> http://bugs.python.org/issue1757072

Looks like I just need to do this to unzip with unix...

    from os import popen
    popen("unzip data.zip")

That works for me. No idea why I didn't think of that earlier. I'm new to python but should have realized I could run unix commands with python. I had blinders on.

Now I just need to get rid of some bad characters in the unzipped file. I'll start a new thread if I need help with that...
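A hedged alternative to os.popen for the same job: the subprocess module with an argument list, which avoids building a shell command string and returns the exit status. This sketch assumes Python 3 and that the Info-ZIP unzip tool is on the PATH. The function names are made up for illustration.

```python
import subprocess


def build_unzip_command(archive_path, dest_dir):
    """Build an argument list for the Info-ZIP unzip tool."""
    # -o overwrites existing files without prompting; -d sets the output directory.
    return ["unzip", "-o", archive_path, "-d", dest_dir]


def unzip(archive_path, dest_dir):
    """Run unzip and return its exit status without raising on non-zero codes."""
    completed = subprocess.run(build_unzip_command(archive_path, dest_dir))
    return completed.returncode
```

The status is returned rather than enforced with check=True because unzip can report a warning status for an archive it still manages to extract, which may well be the case for a file with extra bytes at the end.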
distinction between unzipping bytes and unzipping a file
Hi,

In python, is there a distinction between unzipping bytes and unzipping a binary file to which those bytes have been written?

The following code is, I think, an example of writing bytes to a file and then unzipping...

    decoded = base64.b64decode(datum)
    #datum is a base64 encoded string of data downloaded from a web service
    f = open('data.zip', 'wb')
    f.write(decoded)
    f.close()
    x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a Unicode string of text from it."

If so, I'm not sure how to do what he's suggesting, or if it's really different from what I've done.

I find that I am able to unzip the resulting data.zip using the unix unzip command, but the file inside contains some FFFD characters, as described in this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#

I don't know if the unwanted characters might be the result of my trying to write and unzip a file, rather than unzipping the bytes. The file does contain a semblance of what I ultimately want -- it's not all garbage.

Apologies if it's not appropriate to start a new thread for this. It just seems like a different topic than how to deal with the resulting FFFD characters.

Thanks for your help,
Ryan
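One reading of the provider's advice is to skip the temporary file: wrap the decoded bytes in a file-like object, unzip in memory, and decode the member's bytes to text. Here is a minimal sketch, assuming Python 3 and assuming the member is UTF-16-LE text, as it turned out to be elsewhere in this thread. The function name text_from_base64_zip and the stand-in payload are hypothetical.

```python
import base64
import io
import zipfile


def text_from_base64_zip(datum, member=None, encoding="utf_16_le"):
    """Decode base64, unzip in memory, and decode one member to text."""
    zipped = base64.b64decode(datum)
    with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
        name = member or archive.namelist()[0]  # default to the first member
        return archive.read(name).decode(encoding)


# Build a stand-in for the service's payload so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data", "<Registration/>".encode("utf_16_le"))
datum = base64.b64encode(buffer.getvalue())

print(text_from_base64_zip(datum))
```

As far as zipfile is concerned the bytes are the same either way, so writing to disk first should not by itself introduce bad characters. The decode step at the end is what turns the raw bytes into readable text.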
Re: distinction between unzipping bytes and unzipping a file
On Jan 9, 2:49 pm, webcomm wrote:
> decoded = base64.b64decode(datum)
> #datum is a base64 encoded string of data downloaded from a web
> service
> f = open('data.zip', 'wb')
> f.write(decoded)
> f.close()
> x = zipfile.ZipFile('data.zip', 'r')

Sorry, that code is not what I meant to paste. This is what I intended...

    decoded = base64.b64decode(datum)
    #datum is a base64 encoded string of data downloaded from a web service
    f = open('data.zip', 'wb')
    f.write(decoded)
    f.close()
    x = popen("unzip data.zip")
Re: BadZipfile "file is not a zip file"
On Jan 9, 1:32 pm, Scott David Daniels wrote:
> I'd certainly try to figure out if the archive was mis-handled
> somewhere along the way.

Quite possible that I'm mishandling something, or the service provider is mishandling something. Probably the former. Please see this more recent thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864?hl=en#
Re: distinction between unzipping bytes and unzipping a file
On Jan 9, 3:15 pm, Steve Holden wrote:
> webcomm wrote:
> > Hi,
> > In python, is there a distinction between unzipping bytes and
> > unzipping a binary file to which those bytes have been written?
> >
> > The following code is, I think, an example of writing bytes to a file
> > and then unzipping...
> >
> > decoded = base64.b64decode(datum)
> > #datum is a base64 encoded string of data downloaded from a web
> > service
> > f = open('data.zip', 'wb')
> > f.write(decoded)
> > f.close()
> > x = zipfile.ZipFile('data.zip', 'r')
> >
> > After looking at the preceding code, the provider of the web service
> > gave me this advice...
> > "Instead of trying to create a file, take the unzipped bytes and get a
> > Unicode string of text from it."
>
> Not terribly useful advice, but one presumes he she or it was trying to
> be helpful.
>
> > If so, I'm not sure how to do what he's suggesting, or if it's really
> > different from what I've done.
>
> Well, what you have done appears pretty wrong to me, but let's take a
> look. What's datum? You appear to be treating it as base64-encoded data;
> is that correct? Have you examined it?

It's data that has been compressed then base64 encoded by the web service. I'm supposed to download it, then decode, then unzip. They provide a C# example of how to do this on page 13 of
http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf

If you have a minute, see also this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d72d883409764559/5b9ec3e77dd4?hl=en&lnk=gst&q=webcomm#5b9ec3e77dd4
Re: distinction between unzipping bytes and unzipping a file
On Jan 9, 4:12 pm, "Chris Mellon" wrote:
> It would really help if you could post a sample file somewhere.

Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...

    f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus) on Windows XP, there is ostensibly nothing wrong with it. It looks like valid XML. But if I return it to my browser with python+django, there are bad characters every other character.

If I unzip it like this...

    popen("unzip data.zip")

...then the bad characters are 'FFFD' characters as described and pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#

If I unzip it like this...

    getzip('data.zip', ignoreable=3)

...using the function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.
Re: BadZipfile "file is not a zip file"
On Jan 8, 8:39 pm, "James Mills" wrote:
> Send us a sample of this file in question...

Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...

    f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus) on Windows XP, there is ostensibly nothing wrong with it. It looks like valid XML. But if I return it to my browser with python+django, there are bad characters every other character.

If I unzip it like this...

    popen("unzip data.zip")

...then the bad characters are 'FFFD' characters as described and pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

If I unzip it like this...

    getzip('data.zip', ignoreable=3)

...using Scott's function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.
Re: BadZipfile "file is not a zip file"
On Jan 9, 5:00 pm, webcomm wrote:
> If I unzip it like this...
> popen("unzip data.zip")
> ...then the bad characters are 'FFFD' characters as described and
> pictured
> here...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

trying again to post the link re: FFFD characters...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
Re: BadZipfile "file is not a zip file"
On Jan 9, 5:21 pm, John Machin wrote:
> Thanks. Would you mind spending a few minutes more on this so that we
> can see if it's a problem that can be fixed easily, like the one that
> Chris Mellon reported?

Don't mind at all. I'm now working with a zip file with some dummy data I downloaded from the web service. You'll notice it's a smaller archive than the one I was working with when I ran zip_susser.py, but it has the same problem (whatever the problem is). It's the one I uploaded to
http://webcomm.webfactional.com/htdocs/data.zip

Here's what I get when I run zip_susser_v2.py...

    archive size is 1092
    FileHeader at 0
    CentralDir at 844
    EndArchive at 894
    using posEndArchive = 894
    endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
    signature : 'PK\x05\x06'
    this_disk_num : 0
    central_dir_disk_num : 0
    central_dir_this_disk_num_entries : 1
    central_dir_overall_num_entries : 1
    central_dir_size : 50
    central_dir_offset : 844
    comment_size : 0

    expected_comment_size: 0
    actual_comment_size: 176
    comment is all spaces: False
    comment is all '\0': True
    comment (first 100 bytes):
    '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00'

Not sure if you've seen this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864?hl=en#

Thanks,
Ryan
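The output above says the end-of-archive record declares a comment of 0 bytes while 176 NUL bytes actually follow it (894 + 22 = 916, and 1092 - 916 = 176). One possible workaround, sketched here under the assumption of Python 3, is to cut the data off right after the record plus its declared comment before handing it to zipfile. This is not Scott's getzip() or anything else posted in the thread, and trim_after_eocd is a made-up name.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # marks the end-of-central-directory record
EOCD_FIXED_SIZE = 22            # size of that record without its comment


def trim_after_eocd(data):
    """Drop bytes beyond the EOCD record and its declared comment."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    # The declared comment length is the last 2 bytes of the fixed 22-byte record.
    (comment_size,) = struct.unpack("<H", data[pos + 20:pos + 22])
    return data[:pos + EOCD_FIXED_SIZE + comment_size]


# Build a small archive and pad it with NULs to mimic the file described above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data", b"dummy payload")
padded = buffer.getvalue() + b"\x00" * 176

trimmed = trim_after_eocd(padded)
with zipfile.ZipFile(io.BytesIO(trimmed)) as archive:
    print(archive.namelist())
```

Searching from the end with rfind is a simplification: it would pick the wrong spot if the trailing bytes themselves happened to contain the signature.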