how to remove 'FFFD' character

2009-01-09 Thread webcomm
Does anyone know a way to remove the 'FFFD' character with python?

You can see the browser output I'm dealing with here:
http://webcomm.webfactional.com/htdocs/fffd.JPG
I deleted a big chunk out of the middle of that JPG to protect
sensitive data.

I don't know what the character encoding of this data is and don't
know what the 'FFFD' represents.  I guess it is something that can't
be represented in whatever this particular encoding is, or maybe it is
something corrupt that can't be represented in any encoding.  I just
want to scrub it out.  I tried this...

clean = txt.encode('ascii','ignore')

...but the 'FFFD' still comes through.  Other ideas?
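
For context, U+FFFD is the Unicode replacement character, which decoders and browsers substitute for bytes they cannot interpret. The following is a minimal sketch in Python 3 (the thread itself uses Python 2), with made-up sample bytes, of how the marker gets introduced and two ways to deal with it. If the marker only shows up in the browser, the substitution may be happening at display time, in which case the bytes need a different decode step rather than scrubbing.

```python
# U+FFFD is the Unicode "replacement character". Decoders emit it in place of
# bytes they cannot interpret; it is not part of the original data.
REPLACEMENT = "\ufffd"

# Hypothetical sample: Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9"

# Decoding with the wrong codec and errors="replace" introduces U+FFFD.
decoded = raw.decode("utf-8", errors="replace")

# Option 1: scrub the marker out of the text after the fact.
scrubbed = decoded.replace(REPLACEMENT, "")

# Option 2 (usually better): decode with the codec the data was written in.
correct = raw.decode("latin-1")

print(repr(decoded), repr(scrubbed), repr(correct))
```

Option 1 loses whatever character the marker stood for, so it is only a cosmetic fix.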

Thanks,
Ryan
--
http://mail.python.org/mailman/listinfo/python-list


Re: distinction between unzipping bytes and unzipping a file

2009-01-10 Thread webcomm
On Jan 9, 6:07 pm, John Machin  wrote:
> Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
> God^H^H^HGates intended:
>
> >>> buff = open('data', 'rb').read()
> >>> buff[:100]
> '<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
> >>> buff[:100].decode('utf_16_le')

There it is.  Thanks.

> u'0.
>
>
> >  But if I return it to my browser with python+django,
> > there are bad characters every other character
>
> Please consider that we might have difficulty guessing what "return it
> to my browser with python+django" means. Show actual code.

I did stop and consider what code to show.  I tried to show only the
code that seemed relevant, as there are sometimes complaints on this
and other groups when someone shows more than the relevant code.  You
solved my problem with decode('utf_16_le').  I can't find any
description of that encoding on the WWW... and I thought *everything*
was on the WWW.  :)

I didn't know the data was utf_16_le-encoded because I'm getting it
from a service.  I don't even know if *they* know what encoding they
used.  I'm not sure how you knew what the encoding was.
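
One plausible way to spot this kind of encoding, sketched here in Python 3, is to look at where the NUL bytes fall: mostly-ASCII text encoded as UTF-16 has a `\x00` in every other byte, as in the `buff[:100]` output quoted above. The function name and the 0.8/0.2 thresholds are assumptions for illustration, and the heuristic only works for text that is largely ASCII.

```python
def guess_utf16_byte_order(buf):
    """Guess "utf_16_le" or "utf_16_be" from the position of NUL bytes, else None."""
    even = buf[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = buf[1::2]   # bytes at offsets 1, 3, 5, ...
    if not even or not odd:
        return None
    even_ratio = even.count(b"\x00") / len(even)
    odd_ratio = odd.count(b"\x00") / len(odd)
    # Little-endian ASCII text: the high (zero) byte comes second.
    if odd_ratio > 0.8 and even_ratio < 0.2:
        return "utf_16_le"
    # Big-endian ASCII text: the high (zero) byte comes first.
    if even_ratio > 0.8 and odd_ratio < 0.2:
        return "utf_16_be"
    return None


# Made-up sample resembling the start of the data shown in the quote.
sample = "<Registration><BalanceDue>0.0000</BalanceDue>".encode("utf_16_le")
print(guess_utf16_byte_order(sample))
```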

> Please consider reading the Unicode HOWTO at
> http://docs.python.org/howto/unicode.html

Probably wouldn't hurt, though reading that HOWTO wouldn't have given
me the encoding, I don't think.

-Ryan


> Cheers,
> John



Re: BadZipfile "file is not a zip file"

2009-01-10 Thread webcomm
On Jan 9, 7:33 pm, John Machin  wrote:
> It is not impossible for a file with dummy data to have been
> handcrafted or otherwise produced by a process different to that used
> for a real-data file.

I knew it was produced by the same process, or I wouldn't have shared
it. : )
But you couldn't have known that.


> > Not sure if you've seen this thread...
> > http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>
> Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
> with one thread ...

Thanks... I thought I was posting about separate issues and would
annoy people who were only interested in one of the issues if I put
them both in the same thread.  I guess all posts re: the same script
should go in one thread, even if the questions posed may be unrelated
and may be separate issues.  There are grey areas.

Problem solved in John Machin's post at
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864/03b8341539d87989?hl=en&lnk=raot#03b8341539d87989

I'll post the final code when it's prettier.

-Ryan



Re: BadZipfile "file is not a zip file"

2009-01-12 Thread webcomm
If anyone's interested, here are my django views...


from django.shortcuts import render_to_response
from django.http import HttpResponse
from xml.etree.ElementTree import ElementTree
import urllib, base64, subprocess

def get_data(request):
    service_url = 'http://www.something.com/webservices/someservice/etc?user=etc&pass=etc'
    xml = urllib.urlopen(service_url)
    #the base64-encoded string is in a one-element xml doc...
    tree = ElementTree()
    xml_doc = tree.parse(xml)
    datum = ""
    for node in xml_doc.getiterator():
        datum = "%s" % (node.text)
    decoded = base64.b64decode(datum)

    dir = '/path/to/data/'
    f = open(dir+'data.zip', 'wb')
    f.write(decoded)
    f.close()

    file = subprocess.call('unzip '+dir+'data.zip -d '+dir, shell=True)
    file = open(dir+'data', 'rb').read()
    txt = file.decode('utf_16_le')

    return render_to_response('output.html',{
        'output' : txt
    })

def read_xml(request):
    #page using the get_data view
    xml = urllib.urlopen('http://www.something.org/get_data/')
    xml = xml.read()
    xml = unicode(xml)
    xml = '\n'+xml+''

    f = open('/path/to/temp.txt','w')
    f.write(xml)
    f.close()

    tree = ElementTree()
    xml_doc = tree.parse('/path/to/temp.txt')
    datum = ""
    for node in xml_doc.getiterator():
        datum = "%s%s - %s" % (datum, node.tag, node.text)

    return render_to_response('output.html',{
        'output' : datum
    })
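
For readers on Python 3, where `getiterator()` was removed in 3.9, here is a rough sketch of the first step of `get_data`: pulling the base64 text out of a one-element XML document. The element name `string` and the sample payload are made up for illustration; the real service's element name is not shown in this thread.

```python
import base64
import xml.etree.ElementTree as ET


def extract_payload(xml_text):
    """Return the decoded bytes held in a one-element XML document."""
    root = ET.fromstring(xml_text)
    # With a single element, the root itself carries the base64 text,
    # so there is no need to loop over every node.
    return base64.b64decode(root.text.strip())


# Hypothetical document standing in for the web service response.
sample_xml = "<string>" + base64.b64encode(b"PK-demo").decode("ascii") + "</string>"
print(extract_payload(sample_xml))
```

For a document with several elements, `root.iter()` is the modern replacement for `getiterator()`.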




Re: BadZipfile "file is not a zip file"

2009-01-12 Thread webcomm
On Jan 12, 11:53 am, "Chris Mellon"  wrote:
> On Sat, Jan 10, 2009 at 1:32 PM, webcomm wrote:
> > On Jan 9, 7:33 pm, John Machin  wrote:
> >> It is not impossible for a file with dummy data to have been
> >> handcrafted or otherwise produced by a process different to that used
> >> for a real-data file.
>
> > I knew it was produced by the same process, or I wouldn't have shared
> > it. : )
> > But you couldn't have known that.
>
> >> > Not sure if you've seen this thread...
> >> > http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>
> >> Yeah, I've seen it ... (sigh) ... pax Steve Holden, but *please* stick
> >> with one thread ...
>
> > Thanks... I thought I was posting about separate issues and would
> > annoy people who were only interested in one of the issues if I put
> > them both in the same thread.  I guess all posts re: the same script
> > should go in one thread, even if the questions posed may be unrelated
> > and may be separate issues.  There are grey areas.
>
> > Problem solved in John Machin's post at
> > http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>
> It's worth pointing out (although the provider probably doesn't care)
> that this isn't really an XML document and this was a bad way of them
> to distribute the data. If they'd used a correctly formatted XML
> document (with the prelude and everything) with the correct encoding
> information, existing XML parsers should have just Done The Right
> Thing with the data, instead of you needing to know the encoding a
> priori to extract an XML fragment.

Agreed. I can't say I understand their rationale for doing it this way.


practical limits of urlopen()

2009-01-24 Thread webcomm
Hi,

Am I going to have problems if I use urlopen() in a loop to get data
from 3000+ URLs?  There will be about 2KB of data on average at each
URL.  I will probably run the script about twice per day.  Data from
each URL will be saved to my database.

I'm asking because I've never opened that many URLs before in a loop.
I'm just wondering if it will be particularly taxing for my server.
Is it very uncommon to get data from so many URLs in a script?  I
guess search spiders do it, so I should be able to as well?
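
As a rough sense of scale, 3000 URLs at about 2KB each is roughly 6MB per run. A minimal Python 3 sketch of such a loop follows; the function names, the 10-second timeout and the 0.1-second pause are assumptions chosen for illustration, not recommendations from the list. The main points are to set a timeout so one slow URL cannot stall the run, and to record failures instead of aborting.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, delay=0.1):
    """Fetch each URL in turn; return (results, failures) keyed by URL."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = fetch_one(url)
        except OSError as exc:
            # URLError and socket timeouts are both OSError subclasses.
            failures[url] = exc
        time.sleep(delay)  # small pause to be polite to the remote server
    return results, failures
```

Passing a different `fetch_one` makes it possible to try the loop without touching the network.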

Thank you,
Ryan


BadZipfile "file is not a zip file"

2009-01-08 Thread webcomm
The error...

>>> file = zipfile.ZipFile('data.zip', "r")
Traceback (most recent call last):
  File "", line 1, in 
    file = zipfile.ZipFile('data.zip', "r")
  File "C:\Python25\lib\zipfile.py", line 346, in __init__
    self._GetContents()
  File "C:\Python25\lib\zipfile.py", line 366, in _GetContents
    self._RealGetContents()
  File "C:\Python25\lib\zipfile.py", line 378, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file

When I look at data.zip in Windows, it appears to be a valid zip
file.  I am able to uncompress it in Windows XP, and can also
uncompress it with 7-Zip.  It looks like zipfile is not able to read a
"table of contents" in the zip file.  That's not a concept I'm
familiar with.

data.zip is created in this script...

decoded = base64.b64decode(datum)
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
file = zipfile.ZipFile('data.zip', "r")

datum is a base64 encoded zip file.  Again, I am able to open data.zip
as if it's a valid zip file.  Maybe there is something wrong with the
approach I've taken to writing the data to data.zip?  I'm not sure if
it matters, but the zipped data is Unicode.

What would cause a zip file to not have a table of contents?  Is there
some way I can add a table of contents to a zip file using python?
Maybe there is some more fundamental problem with the data that is
making it seem like there is no table of contents?
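
The "table of contents" here is the zip central directory, which sits near the end of the archive and lists every member. A small Python 3 sketch of reading it, using an archive built in memory as a stand-in for the real data.zip (the member name `data` and its contents are made up):

```python
import io
import zipfile


def list_members(data):
    """Return member names from the central directory, or None if unreadable."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.namelist()
    except zipfile.BadZipFile:
        # Raised when zipfile cannot locate or parse the directory records.
        return None


# Build a tiny valid archive in memory as stand-in data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data", "<Registration/>".encode("utf_16_le"))

print(list_members(buffer.getvalue()))
print(list_members(b"not a zip archive"))
```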

Thanks in advance for your help.
Ryan


Re: BadZipfile "file is not a zip file"

2009-01-08 Thread webcomm
On Jan 8, 8:02 pm, MRAB  wrote:
> You're just creating a file called "data.zip". That doesn't make it a
> zip file. A zip file has a specific format. If the file doesn't have
> that format then the zipfile module will complain.

Hmm.  When I open it in Windows or with 7-Zip, it contains a text file
that has the data I would expect it to have.  I guess that alone
doesn't necessarily prove it's a zip file?

datum is something I'm downloading via a web service.  The providers
of the service say it's a zip file, and have provided a code sample in
C# (which I know nothing about) that shows how to deal with it.  In
the code sample, the file is base64 decoded and then unzipped.  I'm
trying to write something in Python to decode and unzip the file.

I checked the file for comments and it has none.  At least, when I
view the properties in Windows, there are no comments.


Re: BadZipfile "file is not a zip file"

2009-01-08 Thread webcomm
On Jan 8, 8:39 pm, "James Mills"  wrote:
> Send us a sample of this file in question...

It contains data that I can't share publicly.  I could ask the
providers of the service if they have a dummy file I could use that
doesn't contain any real data, but I don't know how responsive they'll
be.  It's an event registration service called RegOnline.


Re: BadZipfile "file is not a zip file"

2009-01-08 Thread webcomm
On Jan 8, 8:54 pm, MRAB  wrote:
> Have you tried gzip instead?

There's no option to download the data in a gzipped format.  The files
are .zip archives.



Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 9, 3:16 am, Steven D'Aprano  wrote:
> The full signature of ZipFile is:
>
> ZipFile(file, mode="r", compression=ZIP_STORED, allowZip64=True)
>
> Try passing compression=zipfile.ZIP_DEFLATED and/or allowZip64=False and
> see if that makes any difference.

Those arguments didn't make a difference in my case.

> The zip format does support alternative compression methods, it's
> possible that this particular file uses a different sort of compression
> which Python doesn't deal with.
>
> > What would cause a zip file to not have a table of contents?
>
> What makes you think it doesn't have one?

Because when I search for the "file is not a zip file" error in
zipfile.py, there is a function that checks for a table of contents.
Though it looks like there are other ideas in this thread about what
might cause that error... I'll keep reading...



Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 9, 3:46 am, Carl Banks  wrote:
> The zipfile format is kind of brain dead, you can't tell where the end
> of the file is supposed to be by looking at the header.  If the end of
> file hasn't yet been reached there could be more data.  To make
> matters worse, somehow zip files came to have text comments simply
> appended to the end of them.  (Probably this was for the benefit of
> people who would cat them to the terminal.)
>
> Anyway, if you see something that doesn't adhere to the zipfile
> format, you don't have any foolproof way to know if it's because the
> file is corrupted or if it's just an appended comment.
>
> Most zipfile readers use a heuristic to distinguish.  Python's zipfile
> module just assumes it's corrupted.
>
> The following post from a while back gives a solution that tries to
> snip the comment off so that zipfile module can handle it.  It might
> help you out.
>
> http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
>
> Carl Banks

Thanks Carl.  I tried Scott's getzip() function yesterday... I
stumbled upon it in my searches.  It didn't seem to help in my case,
though it did produce a different error:  ValueError, substring not
found.  Not sure what that means.


Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 9, 5:42 am, John Machin  wrote:
> And here's a little gadget that might help the diagnostic effort; it
> shows the archive size and the position of all the "magic" PKnn
> markers. In a "normal" uncommented archive, EndArchive_pos + 22 ==
> archive_size.

I ran the diagnostic gadget...

archive size is 69888
FileHeader at 0
CentralDir at 43796
EndArchive at 43846
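
Applying the quoted rule of thumb (EndArchive_pos + 22 == archive_size) to these numbers shows the archive is longer than expected. A small sketch of the arithmetic in Python 3, assuming the comment-size field is zero, which this output does not show:

```python
END_RECORD_SIZE = 22  # fixed part of the end-of-central-directory record


def trailing_bytes(archive_size, end_archive_pos, comment_size=0):
    """Bytes left over after the end record and its declared comment."""
    return archive_size - (end_archive_pos + END_RECORD_SIZE + comment_size)


# Numbers from the diagnostic output above.
print(trailing_bytes(69888, 43846))
```

A result of zero would mean a "normal" uncommented archive; anything larger is data that the end record does not account for.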




Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 9, 10:14 am, "Chris Mellon"  wrote:
> This is a ticket about another issue or 2 with invalid zipfiles that
> the zipfile module won't load, but that other tools will compensate
> for:
>
> http://bugs.python.org/issue1757072

Hmm.  That's interesting.  Are there other tools I can use in a python
script that are more forgiving?  I am using the zipfile module only
because it seems to be the most widely used.  Are other options in
python likely to be just as unforgiving?  Guess I'll look and see...



Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 9, 10:14 am, "Chris Mellon"  wrote:
> This is a ticket about another issue or 2 with invalid zipfiles that
> the zipfile module won't load, but that other tools will compensate
> for:
>
> http://bugs.python.org/issue1757072

Looks like I just need to do this to unzip with unix...

from os import popen
popen("unzip data.zip")

That works for me.  No idea why I didn't think of that earlier.  I'm
new to python but should have realized I could run unix commands with
python.  I had blinders on.  Now I just need to get rid of some bad
characters in the unzipped file.  I'll start a new thread if I need
help with that...
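
A slightly more defensive variant of the same idea, sketched in Python 3 with `subprocess` instead of `os.popen`. The function names are made up for illustration. Passing the arguments as a list avoids shell quoting problems, and the exit status is checked; Info-ZIP's unzip documents status 1 as "completed with warnings", which this sketch tolerates.

```python
import subprocess


def unzip_command(zip_path, dest_dir):
    """Build the argument list for Info-ZIP's unzip."""
    # -o overwrites existing files without prompting; -d sets the target directory.
    return ["unzip", "-o", zip_path, "-d", dest_dir]


def unzip_with_cli(zip_path, dest_dir):
    """Run unzip and return its exit status; raise on a real failure."""
    status = subprocess.run(unzip_command(zip_path, dest_dir)).returncode
    if status not in (0, 1):  # 0 = success, 1 = finished with warnings
        raise RuntimeError("unzip failed with exit status %d" % status)
    return status
```

Calling `unzip_with_cli("data.zip", ".")` would correspond to the `popen` call above, assuming `unzip` is on the PATH.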



distinction between unzipping bytes and unzipping a file

2009-01-09 Thread webcomm
Hi,
In python, is there a distinction between unzipping bytes and
unzipping a binary file to which those bytes have been written?

The following code is, I think, an example of writing bytes to a file
and then unzipping...

decoded = base64.b64decode(datum)
#datum is a base64 encoded string of data downloaded from a web
#service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."

If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.
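
One reading of that advice is to skip the intermediate file and work on the bytes in memory. A minimal Python 3 sketch follows, using a stand-in archive built on the spot. The function name is made up, and the `utf_16_le` default comes from the diagnosis elsewhere in this archive rather than from the provider. Whether `zipfile` accepts the provider's real archive is a separate question.

```python
import base64
import io
import zipfile


def unzip_base64_text(datum, encoding="utf_16_le"):
    """Decode base64, unzip the first member in memory, and return text."""
    raw = base64.b64decode(datum)
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        first_member = archive.namelist()[0]
        return archive.read(first_member).decode(encoding)


# Stand-in for the web service payload: a one-member zip, base64 encoded.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data", "<Registration/>".encode("utf_16_le"))
datum = base64.b64encode(buffer.getvalue())

print(unzip_base64_text(datum))
```

The zip format itself is the same either way; the only difference is whether the bytes pass through the file system first.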

I find that I am able to unzip the resulting data.zip using the unix
unzip command, but the file inside contains some FFFD characters, as
described in this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
I don't know if the unwanted characters might be the result of my
trying to write and unzip a file, rather than unzipping the bytes.
The file does contain a semblance of what I ultimately want -- it's
not all garbage.

Apologies if it's not appropriate to start a new thread for this.  It
just seems like a different topic than how to deal with the resulting
FFFD characters.

Thanks for your help,
Ryan



Re: distinction between unzipping bytes and unzipping a file

2009-01-09 Thread webcomm
On Jan 9, 2:49 pm, webcomm  wrote:
> decoded = base64.b64decode(datum)
> #datum is a base64 encoded string of data downloaded from a web
> #service
> f = open('data.zip', 'wb')
> f.write(decoded)
> f.close()
> x = zipfile.ZipFile('data.zip', 'r')

Sorry, that code is not what I meant to paste.  This is what I
intended...

decoded = base64.b64decode(datum)
#datum is a base64 encoded string of data downloaded from a web
#service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = popen("unzip data.zip")


Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 9, 1:32 pm, Scott David Daniels  wrote:
> I'd certainly try to figure out if the archive was mis-handled
> somewhere along the way.  

Quite possible that I'm mishandling something, or the service provider
is mishandling something.  Probably the former.  Please see this more
recent thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864?hl=en#


Re: distinction between unzipping bytes and unzipping a file

2009-01-09 Thread webcomm
On Jan 9, 3:15 pm, Steve Holden  wrote:
> webcomm wrote:
> > Hi,
> > In python, is there a distinction between unzipping bytes and
> > unzipping a binary file to which those bytes have been written?
>
> > The following code is, I think, an example of writing bytes to a file
> > and then unzipping...
>
> > decoded = base64.b64decode(datum)
> > #datum is a base64 encoded string of data downloaded from a web
> > #service
> > f = open('data.zip', 'wb')
> > f.write(decoded)
> > f.close()
> > x = zipfile.ZipFile('data.zip', 'r')
>
> > After looking at the preceding code, the provider of the web service
> > gave me this advice...
> > "Instead of trying to create a file, take the unzipped bytes and get a
> > Unicode string of text from it."
>
> Not terribly useful advice, but one presumes he she or it was trying to
> be helpful.
>
> > If so, I'm not sure how to do what he's suggesting, or if it's really
> > different from what I've done.
>
> Well, what you have done appears pretty wrong to me, but let's take a
> look. What's datum? You appear to be treating it as base64-encoded data;
> is that correct? Have you examined it?

It's data that has been compressed then base64 encoded by the web
service.  I'm supposed to download it, then decode, then unzip.  They
provide a C# example of how to do this on page 13 of
http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf

If you have a minute, see also this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d72d883409764559/5b9ec3e77dd4?hl=en&lnk=gst&q=webcomm#5b9ec3e77dd4


Re: distinction between unzipping bytes and unzipping a file

2009-01-09 Thread webcomm
On Jan 9, 4:12 pm, "Chris Mellon"  wrote:
> It would really help if you could post a sample file somewhere.

Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it.  It looks
like valid XML.  But if I return it to my browser with python+django,
there is a bad character at every other position.

If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#

If I unzip it like this...
getzip('data.zip', ignoreable=3)
...using the function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.



Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 8, 8:39 pm, "James Mills"  wrote:
> Send us a sample of this file in question...

Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it.  It looks
like valid XML.  But if I return it to my browser with python+django,
there is a bad character at every other position.

If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

If I unzip it like this...
getzip('data.zip', ignoreable=3)
...using Scott's function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.



Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 9, 5:00 pm, webcomm  wrote:
> If I unzip it like this...
> popen("unzip data.zip")
> ...then the bad characters are 'FFFD' characters as described and
> pictured here...
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/...
>

trying again to post the link re: FFFD characters...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#


Re: BadZipfile "file is not a zip file"

2009-01-09 Thread webcomm
On Jan 9, 5:21 pm, John Machin  wrote:
> Thanks. Would you mind spending a few minutes more on this so that we
> can see if it's a problem that can be fixed easily, like the one that
> Chris Mellon reported?
>

Don't mind at all.  I'm now working with a zip file with some dummy
data I downloaded from the web service.  You'll notice it's a smaller
archive than the one I was working with when I ran zip_susser.py, but
it has the same problem (whatever the problem is).  It's the one I
uploaded to http://webcomm.webfactional.com/htdocs/data.zip

Here's what I get when I run zip_susser_v2.py...

archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
                        signature : 'PK\x05\x06'
                    this_disk_num : 0
             central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
  central_dir_overall_num_entries : 1
                 central_dir_size : 50
               central_dir_offset : 844
                     comment_size : 0

expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
comment (first 100 bytes):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00'
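
The output above says the end record declares a comment of 0 bytes, yet 176 NUL bytes follow it. One possible workaround, sketched in Python 3, is to cut the archive off right after the end record and its declared comment before handing it to `zipfile`. The function name is made up, and the stand-in archive is built in memory with the same 176 bytes of padding; this is not necessarily the fix that was posted on the list. A signature that happens to occur inside a genuine comment would fool the `rfind`, so treat it as a heuristic.

```python
import io
import struct
import zipfile

END_SIGNATURE = b"PK\x05\x06"  # end-of-central-directory marker
END_RECORD_SIZE = 22           # fixed part of that record


def strip_trailing_bytes(data):
    """Drop anything after the end record plus its declared comment."""
    pos = data.rfind(END_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    # The last two bytes of the fixed record hold the declared comment length.
    (comment_size,) = struct.unpack("<H", data[pos + 20:pos + 22])
    return data[:pos + END_RECORD_SIZE + comment_size]


# Stand-in archive, padded with 176 NUL bytes like the sample in this post.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data", "<Registration/>".encode("utf_16_le"))
clean = buffer.getvalue()
padded = clean + b"\x00" * 176

repaired = strip_trailing_bytes(padded)
with zipfile.ZipFile(io.BytesIO(repaired)) as archive:
    print(archive.namelist())
```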

Not sure if you've seen this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d84f42493fe81864?hl=en#

Thanks,
Ryan