Re: Python 2.7.1

2010-11-29 Thread Spider
> 2.7 includes many features that were first released in Python 3.1. The faster 
> io module ...

I understand that I/O in Python 3.0 was slower than 2.x (due to quite
a lot of the code being in Python rather than C, I gather), and that
this was fixed up in 3.1. So, io in 3.1 is faster than in 3.0.

Is it also true that io is faster in 2.7 than in 2.6? That's what the
release notes imply, but I wonder whether that comment was backported
from the 3.1 release notes and doesn't actually apply to 2.7.

Of course, I probably should benchmark it, but if someone who knows
the history of the io module can respond, that would be great. My
specific interest is in file read/write speeds.
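For the record, the kind of benchmark I have in mind is roughly this (the file size and iteration count are arbitrary); run it under 2.6 and 2.7 and compare:

```python
import os
import tempfile
import timeit

# Make a ~10 MB scratch file, then time repeated full reads of it.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"x" * (10 * 1024 * 1024))

def read_all():
    with open(path, "rb") as f:
        f.read()

seconds = timeit.timeit(read_all, number=20)
print("20 full reads: %.3f s" % seconds)
os.remove(path)
```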

Thanks
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python program that validates an url against w3c markup validator

2006-11-29 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "yaru22" <[EMAIL PROTECTED]> wrote:

> I'd like to create a program that validates bunch of urls against the
> w3c markup validator (http://validator.w3.org/) and store the result in
> a file.
> 
> Since I don't know network programming, I have no idea how to start
> coding this program.
> 
> I was looking at the python library and thought urllib or urllib2 may
> be used to make this program work.
> 
> But I don't know how to send my urls to the w3c validator and get the
> result.
> 
> Can anyone help me with this? Or at least give me a hint?
> 
> Thank you so much.

This article might be of help:
http://www.standards-schmandards.com/2005/massvalidate

"Periodically you may want to make sure your entire website validates. 
This can be a hassle if your site is big. In this article we introduce a 
few python scripts which will help us do mass validation from a list of 
links. We will also modify the W3C validator to work the way we want."

You might also be interested in my validating spider (see my sig) which 
will validate an entire site for you but that won't teach you a thing 
about Python programming.  ;)
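If it helps get you started, here's a rough sketch of querying the validator from Python. The X-W3C-Validator-Status response header name is from memory, so verify it against a real response before relying on it:

```python
try:
    from urllib.parse import urlencode   # Python 3
    from urllib.request import urlopen
except ImportError:
    from urllib import urlencode, urlopen  # Python 2

VALIDATOR = "http://validator.w3.org/check"

def build_check_url(page_url):
    # The validator accepts the address to check in a 'uri' parameter.
    return VALIDATOR + "?" + urlencode({"uri": page_url})

def validate(page_url):
    """Return the validator's verdict header for page_url.

    The X-W3C-Validator-Status header (reportedly 'Valid', 'Invalid'
    or 'Abort') is from memory -- check it before relying on it.
    """
    response = urlopen(build_check_url(page_url))
    return response.headers.get("X-W3C-Validator-Status")

# Example (hits the network):
# for url in open("urls.txt"):
#     print(url.strip(), validate(url.strip()))
```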

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Good script editor for Python on Mac OS 10.3

2006-11-29 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Lou Pecora <[EMAIL PROTECTED]> wrote:

> In article <[EMAIL PROTECTED]>,
>  "Scott_Davies" <[EMAIL PROTECTED]> wrote:
> 
> > Hi,
> > 
> > I have an old Mac with OS X Panther installed.  I also have the Python
> > language download file, but I haven't got a text/script editor to use
> > for it.  Does anyone have a recommendation for a good Python text
> > editor in OS 10.3?
> > 
> > Thanks,
> > 
> > Scott D.
> > 
> 
> Try TextWrangler.  It's free.  I use its big brother BBEdit and like it.

I second that recommendation.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: calling Postgresql stored procedure (written in plpython)

2007-06-01 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Alchemist <[EMAIL PROTECTED]> wrote:

> Thanks for your help.
> 
> My stored procedure is written in pythonpl.  I noticed that SELECT
> queries are executed correctly (results are returned to my script)
> whereas UPDATE queries are not being performed as the data is not
> updated.

Aha! So the problem is not really with how to call Postgres stored 
procs, but that you're not getting the results you expect from some 
calls. 

> I am using a database user with read/write access to the database.
> 
> Is there a commit statement in plpython?  (e.g. plpy.commit())

Did you try that? Did you check the documentation?

> Why are UPDATEs failing?

I'm not familiar with plpy but if it is compliant with the Python DBAPI 
(PEP 249) specification then, yes, it has a .commit() method and yes, 
you must call it after data-modifying statements such as UPDATE. 

From the PEP: "Note that if the database supports an auto-commit 
feature, this must be initially off."
http://www.python.org/dev/peps/pep-0249/

In short, either turn on autocommit or start calling .commit().
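To make the commit requirement concrete, here's a small demonstration with sqlite3 (in the standard library and also DBAPI-compliant); the same pattern applies to whatever adapter sits between your script and Postgres:

```python
import os
import sqlite3
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

writer = sqlite3.connect(path)
writer.execute("CREATE TABLE t (n INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")
writer.commit()

writer.execute("UPDATE t SET n = 2")    # changed, but not yet committed

reader = sqlite3.connect(path)
before = reader.execute("SELECT n FROM t").fetchall()[0][0]
reader.close()

writer.commit()                          # now other connections see it

reader = sqlite3.connect(path)
after = reader.execute("SELECT n FROM t").fetchall()[0][0]
reader.close()
writer.close()
os.remove(path)

print(before, after)    # 1 2
```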

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: open function fail after running a day

2007-06-07 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 alexteo21 <[EMAIL PROTECTED]> wrote:

> The script is working fine on the day of execution.
> It is able to process the data files every hour.  However, the
> processing fails one day later, i.e. the date increments by 1.
> 
> Traceback (most recent call last):
>   File "./alexCopy.py", line 459, in processRequestModule
> sanityTestSteps(reqId,model)
>   File "./alexCopy.py", line 699, in sanityTestSteps
> t = open(filename, 'rb')
> IOError: [Errno 24] Too many open files:
> 
> I have explicitly closed the file.  Is there something else I need to
> do?

Sounds like the .close() isn't getting executed as you think. Try using 
the logging module to log a line immediately before each open and close 
so that you can ensure you're really closing all the files. 
Alternatively, some other bit of code may be the guilty party. A utility 
like fstat or lsof can show you who has files open.
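Something like this is what I mean; the filename handling is simplified, but the try/finally guarantees the close and the log lines show whether it really ran:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("filetracker")

def process(filename):
    log.debug("opening %s", filename)
    t = open(filename, "rb")
    try:
        data = t.read()          # stand-in for the real sanity tests
        return len(data)
    finally:
        t.close()                # runs even if the processing raises
        log.debug("closed %s", filename)
```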

Good luck

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Html parser

2007-06-15 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Stephen R Laniel <[EMAIL PROTECTED]> wrote:

> On Fri, Jun 15, 2007 at 07:11:56AM -0700, HMS Surprise wrote:
> > Could you recommend  an html parser that works with python (jython
> > 2.2)?
> 
> I'm new here, but I believe BeautifulSoup is the canonical
> answer:
> http://www.crummy.com/software/BeautifulSoup/

It is, but personally I'm a fan of Connelly Barnes' htmldata module:
http://oregonstate.edu/~barnesc/htmldata/

Much easier to use than BeautifulSoup IMO.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Problem with Python's "robots.txt" file parser in module robotparser

2007-07-11 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 John Nagle <[EMAIL PROTECTED]> wrote:

>Python's "robots.txt" file parser may be misinterpreting a
> special case.  Given a robots.txt file like this:
> 
>   User-agent: *
>   Disallow: //
>   Disallow: /account/registration
>   Disallow: /account/mypro
>   Disallow: /account/myint
>   ...
> 
> the python library "robotparser.RobotFileParser()" considers all pages of the
> site to be disallowed.  Apparently  "Disallow: //" is being interpreted as
> "Disallow: /".  Even the home page of the site is locked out. This may be 
> incorrect.
> 
> This is the robots.txt file for "http://ibm.com".

Hi John,
Are you sure you're not confusing your sites? The robots.txt file at 
www.ibm.com contains the double slashed path. The robots.txt file at 
ibm.com  is different and contains this which would explain why you 
think all URLs are denied:
User-agent: *
Disallow: /

I don't see the bug to which you're referring:
>>> import robotparser
>>> r = robotparser.RobotFileParser()
>>> r.set_url("http://www.ibm.com/robots.txt")
>>> r.read()
>>> r.can_fetch("WhateverBot", "http://www.ibm.com/foo.html")
1
>>> r.can_fetch("WhateverBot", "http://www.ibm.com//foo.html")
0
>>> 

I'll use this opportunity to shamelessly plug an alternate robots.txt 
parser that I wrote to address some small bugs in the parser in the 
standard library. 
http://NikitaTheSpider.com/python/rerp/

Cheers

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Problem with Python's "robots.txt" file parser in module robotparser

2007-07-12 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 John Nagle <[EMAIL PROTECTED]> wrote:

> Nikita the Spider wrote:
> 
> > 
> > Hi John,
> > Are you sure you're not confusing your sites? The robots.txt file at 
> > www.ibm.com contains the double slashed path. The robots.txt file at 
> > ibm.com  is different and contains this which would explain why you 
> > think all URLs are denied:
> > User-agent: *
> > Disallow: /
> >
> Ah, that's it.  The problem is that "ibm.com" redirects to
> "http://www.ibm.com", but "ibm.com/robots.txt" does not
> redirect.  For comparison, try "microsoft.com/robots.txt",
> which does redirect.

Strange thing for them to do, isn't it? Especially with two such 
different robots.txt files.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Problem with Python's "robots.txt" file parser in module robotparser

2007-07-13 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 John Nagle <[EMAIL PROTECTED]> wrote:

>I asked over at Webmaster World, and over there, they recommend against
> using redirects on robots.txt files, because they questioned whether all of
> the major search engines understand that.  Does a redirect for 
> "foo.com/robots.txt" mean that the robots.txt file applies to the domain
> being redirected from, or the domain being redirected to?

Good question. I'd guess the latter, but it's a little ambiguous. I 
agree that redirecting a request for robots.txt is probably not a good 
idea. Given that the robots.txt standard isn't as standard as it could 
be, I think it's a good idea in general to apply the KISS principle when 
dealing with things robots.txt-y.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Domain Keys in Python

2007-04-20 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Andrew Veitch <[EMAIL PROTECTED]> wrote:

> I am trying to implement Domain Keys
> (http://domainkeys.sourceforge.net/) in Python.
> 
> In Perl I would just use Crypt:RSA which has a sign
> method with an armour option which generates exactly
> what I want but I can't find a way of doing this in
> Python.
> 
> I tried this:
> 
> from M2Crypto import RSA
> key = RSA.load_key('rsa.private')
> msg='Hello world'
> print key.sign(msg)
> 
> But the output isn't quite right because there isn't
> an armour option - I verified this by reading the
> source.
> 
> I'm not even sure if M2Crypto is the right library to
> be using or is it just that I need to use something
> else for the final step?

Hi Andrew,
There's also pycrypto for doing RSA encryption:
http://www.amk.ca/python/code/crypto

I messed around with this for a little while but decided I didn't need 
it. ISTR figuring out that it does not implement any padding; is this 
perhaps the armour option you're talking about? I'm not a cryptographer 
and I don't even play one on TV, so the accuracy of this is probably 
even less reliable than an average Usenet posting...
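If armouring is just the base64 step, the Python side is tiny; the signature bytes below are a stand-in for whatever key.sign() returns:

```python
import base64

# Stand-in bytes for whatever key.sign(msg) returns.
raw_signature = b"\x00\x01\x02 raw RSA signature bytes"

# "Armouring" (as I recall Crypt::RSA's term) is just base64 text
# encoding of those bytes. PKCS#1 padding, if that's what is actually
# missing, happens inside the sign operation itself and is a separate
# question.
armoured = base64.b64encode(raw_signature)
print(armoured)
```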

Good luck

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python "robots.txt" parser broken since 2003

2007-04-22 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 John Nagle <[EMAIL PROTECTED]> wrote:

> This bug, "[ 813986 ] robotparser interactively prompts for username and 
> password", has been open since 2003.  It killed a big batch job of ours
> last night.
> 
> Module "robotparser" naively uses "urlopen" to read "robots.txt" URLs.
> If the server asks for basic authentication on that file, "robotparser"
> prompts for the password on standard input.  Which is rarely what you
> want.  You can demonstrate this with:
> 
> import robotparser
> url = 'http://mueblesmoraleda.com' # this site is password-protected.
> parser = robotparser.RobotFileParser()
> parser.set_url(url)
> parser.read() # Prompts for password
> 
> That's the standard, although silly, "urllib" behavior.

John,
robotparser is (IMO) suboptimal in a few other ways, too. 
- It doesn't handle non-ASCII characters. (They're infrequent but when 
writing a spider which sees thousands of robots.txt files in a short 
time, "infrequent" can become "daily").
- It doesn't account for BOMs in robots.txt (which are rare).
- It ignores any Expires header sent with the robots.txt
- It handles some ambiguous return codes (e.g. 503) that it ought to 
pass up to the caller.

I wrote my own parser to address these problems. It probably suffers 
from the same urllib hang that you've found (I have not encountered it 
myself) and I appreciate you posting a fix. Here's the code & 
documentation in case you're interested:
http://NikitaTheSpider.com/python/rerp/

Cheers

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: mmap thoughts

2007-05-12 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "James T. Dennis" <[EMAIL PROTECTED]> wrote:

>* There don't seem to be any currently maintained SysV IPC
>  (shm, message, and semaphore) modules for Python.  I guess some
>  people have managed to hack something together using ctypes;
>  but I haven't actually read, much less tested, any of that code.

http://NikitaTheSpider.com/python/shm/


Enjoy =)

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: calling Postgresql stored procedure

2007-05-30 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Alchemist <[EMAIL PROTECTED]> wrote:

> I am using Python 2.4 and Postgresql 8.2 database server.
> 
> On the database I have created a stored function, example,
> CREATE OR REPLACE FUNCTION calculateaverage()
> 
> I created a new python script and would like to call my database
> stored function.
> 
> How can I call a database stored function/procedure in python?

You need a layer in between Python and Postgres so that they can talk to 
one another. If you don't have one, try this one (use version 2, not 
version 1.x):
http://www.initd.org/tracker/psycopg
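Once you have psycopg installed, calling the function is plain DBAPI code. A sketch (the connection setup below is a placeholder, and the function name must come from trusted code since it's interpolated into the SQL):

```python
def fetch_scalar(conn, func_name):
    """Run SELECT func_name() and return the single value it produces."""
    cur = conn.cursor()
    cur.execute("SELECT %s()" % func_name)   # func_name must be trusted!
    row = cur.fetchone()
    cur.close()
    return row[0]

# With psycopg (version 2) it would look something like:
# conn = psycopg2.connect("dbname=mydb user=me")
# print(fetch_scalar(conn, "calculateaverage"))
```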

Good luck

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to parse usenet urls?

2007-05-30 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> I'm trying to parse newsgroup messages, and I need to follow URLs in
> this format: news://some.server. I can past them into a newsreader
> with no problem, but I want to do it programatically.
> 
> I can't figure out how to follow these links - anyone have any ideas?

Are you aware of nntplib?

http://docs.python.org/lib/module-nntplib.html

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to parse usenet urls?

2007-05-31 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> > Are you aware of nntplib?
> >
> > http://docs.python.org/lib/module-nntplib.html
> 
> I am, but I once I got into the article itself, I couldn't figure out
> how to "call" a link inside the resulting message text:
> 
> >>> ... 'Castro: Bush desea mi muerte, pero las ideas no se matan', 
> >>> 'news://newsclip.ap.org/[EMAIL PROTECTED]', ...
> 
> How can I take the message link 'news://newsclip.ap.org/
> [EMAIL PROTECTED]' and follow it?

OK, gotcha. I misunderstood your original question. Perhaps this is just 
a synonym for "nntp:"? This sounds like a dangerous assumption and 
hopefully someone more knowledgeable will come along and shoot me down. 
=) But when I fire up Ethereal and paste that news: URL into my browser, 
Firefox launches my newsreader client and Ethereal reports that my 
client connects to the remote server at the NNTP port (119), sends an 
NNTP LIST command and Ethereal identifies the subsequent conversation as 
NNTP. 

If I were you I'd try handling news: URLs with nntplib. I bet it will 
work.
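To sketch the guess: pull the host and message-id out of the URL, then hand them to nntplib. The message-id below is a made-up placeholder:

```python
from urllib.parse import urlparse   # "from urlparse import urlparse" on 2.x

def split_news_url(url):
    """Split news://host/message-id into (host, '<message-id>')."""
    parts = urlparse(url)
    msg_id = parts.path.lstrip("/")
    if not msg_id.startswith("<"):
        msg_id = "<%s>" % msg_id    # NNTP commands want angle brackets
    return parts.hostname, msg_id

host, msg_id = split_news_url("news://newsclip.ap.org/made.up.id@ap.org")
print(host, msg_id)

# Untested guess -- fetching the article itself (network, port 119):
# import nntplib
# server = nntplib.NNTP(host)
# response, info = server.article(msg_id)
```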

Sorry I couldn't provide more than guesses. Good luck!

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Ipc mechanisms and designs.

2007-08-10 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 king kikapu <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> inspired of the topic "The Future of Python Threading", i started to
> realize that the only way to utilize the power of multiple cores using
> Python, is spawn processes and "communicate" with them.
> 
> If we have the scenario:
> 
> 1. Windows (mainly) development
> 2. Processes are running in the same machine
> 3. We just want to "pass" info from one process to another. Info may
> be simple data types or user defined Python objects.
> 
> what is the best solution (besides sockets) that someone can implement
> so to have 2 actually processes that interchanged data between them ?
> I looked at Pyro and it looks really good but i wanted to experiment
> with a simpler solution.

Hi King Kikapu
There's a shared memory module for Python, but it is *nix only, I'm 
afraid. I realize you said "mainly Windows" but this module seems to do 
what you want so maybe you can work out a creative solution.

http://NikitaTheSpider.com/python/shm/

Good luck with whatever you choose

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: LRU cache?

2007-08-12 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Paul Rubin  wrote:

> Anyone got a favorite LRU cache implementation?  I see a few in google
> but none look all that good.  I just want a dictionary indexed by
> strings, that remembers the last few thousand entries I put in it.
> 
> It actually looks like this is almost a FAQ.  A well-written
> implementation would probably make a good standard library module.

This one works for me:
http://www.webfast.com/~skip/python/Cache.py
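For the archives, here's the general shape of such a cache built on collections.OrderedDict (which postdates this thread, but it shows the idea):

```python
from collections import OrderedDict

class LRUCache(object):
    """Dict-ish cache that evicts the least recently used entry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def __setitem__(self, key, value):
        if key in self._data:
            del self._data[key]             # re-insert to mark as fresh
        elif len(self._data) >= self.capacity:
            self._data.popitem(last=False)  # drop the oldest entry
        self._data[key] = value

    def __getitem__(self, key):
        value = self._data.pop(key)         # KeyError if missing
        self._data[key] = value             # mark as most recently used
        return value

    def __contains__(self, key):
        return key in self._data

cache = LRUCache(2)
cache["a"] = 1
cache["b"] = 2
cache["a"]          # touch "a" so "b" becomes the oldest
cache["c"] = 3      # evicts "b"
print("a" in cache, "b" in cache, "c" in cache)  # True False True
```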

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Generating HTML

2007-09-13 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "Sebastian Bassi" <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> What are people using these days to generate HTML? I still use
> HTMLgen, but I want to know if there are new options. I don't
> want/need a web-framework a la Zope, just want to produce valid HTML
> from Python.
> Best,
> SB.

Spyce. Works fine server side (like PHP) and also runs from the command line.

http://spyce.sourceforge.net/

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How use XML parsing tools on this one specific URL?

2007-03-04 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
> 
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
> 
> I can't validate it and xml.dom.minidom.parseString won't work on it.
> 
> If this was just some teenager's web site I'd move on.  Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?

Valid XHTML is scarcer than hen's teeth. Luckily, someone else has 
already written the ugly regex parsing hacks for you. Try Connelly 
Barnes' HTMLData: 
http://oregonstate.edu/~barnesc/htmldata/ 

Or BeautifulSoup as others have suggested.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Platform-specific compile flags in setup.py?

2007-03-04 Thread Nikita the Spider
Hi all,
I'm a newbie when it comes to distributing C-based Python modules. I'm 
just now sharing my first with the rest of the world (it's actually V. 
Marangozov's shared memory module for IPC) and I've learned that the 
module needs a different set of compile flags for Linux than for my Mac. 
My question is this: is there a convention or standard that says where 
platform-specific compile flags should reside? I could put them in 
setup.py; OTOH I know that within the .c file I can add something like 
this:
#ifdef __FreeBSD__
#include   /* for system definition of PAGE_SIZE */
#endif

That works, but for maximum Python programmer-friendliness I should 
perhaps put the complexity in setup.py rather than in the .c file.

Opinions appreciated.

PS - The module in question is here; Linux users must remove the tuple 
"('HAVE_UNION_SEMUN', None)" from setup.py:
http://NikitaTheSpider.com/python/shm/
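For the setup.py route, the sketch below is what I have in mind; the platform test mirrors the HAVE_UNION_SEMUN note above, and everything else is illustrative:

```python
import sys

# Linux headers don't provide union semun, so (per the note above)
# Linux builds omit the flag; other platforms keep it.
define_macros = []
if not sys.platform.startswith("linux"):
    define_macros.append(("HAVE_UNION_SEMUN", None))

print(define_macros)

# The list then feeds distutils in setup.py:
# from distutils.core import setup, Extension
# setup(name="shm",
#       ext_modules=[Extension("shm", ["shmmodule.c"],
#                              define_macros=define_macros)])
```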

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python Feature Request: (?) Group all file-directory-related stdlib functions in one place

2007-04-15 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:

> > Currently file-directory-related functionality in the Python standard
> > library is scattered among various modules such as shutil, os,
> > dircache etc. So I request that the functions be gathered and
> > consolidated at one place. Some may need renaming to avoid conflicts
> > or for clarification.
> > 
> 
> Please see PEP 355.

Thanks for bringing this to my attention; I was not aware of this PEP. 
The organization of the stdlib's file- and path-related functions gives 
me headaches so I'd like to see it change. But note that GvR has 
pronounced that PEP 355 is dead:

http://mail.python.org/pipermail/python-dev/2006-September/069087.html

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

2007-10-04 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 John Nagle <[EMAIL PROTECTED]> wrote:

> Filip Salomonsson wrote:
> > On 02/10/2007, John Nagle <[EMAIL PROTECTED]> wrote:
> >> But there's something in there now that robotparser doesn't like.
> >> Any ideas?
> > 
> > Wikipedia denies _all_ access for the standard urllib user agent, and
> > when the robotparser gets a 401 or 403 response when trying to fetch
> > robots.txt, it is equivalent to "Disallow: *".
> > 
> > http://infix.se/2006/05/17/robotparser
> 
>  That explains it.  It's an undocumented feature of "robotparser",
> as is the 'errcode' variable.  The documentation of "robotparser" is
> silent on error handling (can it raise an exception?) and should be
> updated.

Hi John,
Robotparser is probably following the never-approved RFC for robots.txt 
which is the closest thing there is to a standard. It says, "On server 
response indicating access restrictions (HTTP Status Code 401 or 403) a 
robot should regard access to the site completely restricted."
http://www.robotstxt.org/wc/norobots-rfc.html
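The status-code rule is easy to apply by hand if you fetch robots.txt yourself; a sketch (the fetch itself is commented out since it needs a network, and the URL in it is a placeholder):

```python
def blanket_disallow(status_code):
    """The draft-RFC rule: 401 or 403 for robots.txt itself means the
    robot should treat the entire site as restricted."""
    return status_code in (401, 403)

print(blanket_disallow(403), blanket_disallow(404))  # True False

# In a spider this wraps the fetch, e.g. (Python 2 era urllib2):
# try:
#     urllib2.urlopen("http://example.com/robots.txt")
# except urllib2.HTTPError, e:
#     site_off_limits = blanket_disallow(e.code)
```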

If you're interested, I have a replacement for the robotparser module 
that works a little better (IMHO) and which you might also find better 
documented. I'm using it in production code:
http://nikitathespider.com/python/rerp/

Happy spidering

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Simple HTML template engine?

2007-10-15 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "allen.fowler" <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> Can anyone recommend a simple Python template engine for generating
> HTML that relies only on the Python core modules?
> 
> No need for caching, template compilation, etc.
> 
> Speed is not a major issue.
> 
> I just need looping and conditionals. Template inheritance would be a
> bonus.
> 
> I've seen Genshi and Cheetah, but they seem way too complex.

I use Spyce, but in addition to being a template system for mixing 
Python & HTML, it is also a Web server which drags in a bunch of extra 
cruft that you don't need. It's what I use to generate static reports, 
though.

Good luck

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: documenting exceptions in Python

2007-10-19 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 [EMAIL PROTECTED] wrote:

> In python, how do I know what exceptions a method could raise?  Do I
> need to look at the source?  I don't see this info in the API docs for
> any of the APIs I'm using.

Hi Dale,
Usually the docs for a method will list the likely exceptions, but 
there's no way to know the full list of possible exceptions except by 
testing, testing, testing. Obviously, there's always a chance that 
there's a test you didn't think of and therefore an exception 
unaccounted for. I think this is a less-than-ideal aspect of Python but 
I don't have a suggestion on how to improve it.

Good luck

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urllib2 and transfer-encoding = chunked

2007-01-20 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 [EMAIL PROTECTED] wrote:

> Haha!  My mistake.
> 
> The error is that when a web server is chunking a web page only the
> first chunk appears to be acquired by the urllib2.urlopen call.  If you
> check the headers, there is no 'Content-length' (as expected) and
> instead there is 'transfer-encoding' = 'chunked'.  I am getting about
> the first 30Kb, and then nothing else.
> 
> I don't get a ValueError like described at the following post:

Hi jdvolz,
What error *do* you get? Or is it that no error is raised; you're just 
not getting all of the data? If it is the latter, then the sending 
server might be at fault for not properly following the chunked transfer 
protocol. One way to find out would be to fire up Ethereal and see 
what's coming down the wire. 
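As a diagnostic, reading the response in a loop rather than with one .read() call will show exactly where the data stops; here's a sketch you can point at the problem server (the URL in the comment is a placeholder):

```python
def read_in_chunks(response, chunk_size=16384):
    """Drain a file-like HTTP response in pieces instead of one .read(),
    so you can see exactly where the data stops."""
    pieces = []
    while True:
        piece = response.read(chunk_size)
        if not piece:            # an empty read means end of body
            break
        pieces.append(piece)
    return b"".join(pieces)

# body = read_in_chunks(urllib2.urlopen("http://the.problem.server/page"))
```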

> I am having errors which appear to be linked to a previous bug in
> urllib2 (and urllib) for v2.4 and v2.5 of Python.  Has this been fixed?
>  Has anyone established a standard workaround?  I keep finding old
> posts about it, that basically give up and say "well it's a known bug."

Can you give us some pointers to some of these old posts? And tell us 
what version of Python you're using.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: spidering script

2007-01-20 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "David Waizer" <[EMAIL PROTECTED]> wrote:

> Hello..
> 
> I'm  looking for a script (perl, python, sh...)or program (such as wget) 
> that will help me get a list of ALL the links on a website.
> 
> For example ./magicscript.pl www.yahoo.com and outputs it to a file, it 
> would be kind of like a spidering software..

David,
In addition to others' suggestions about Beautiful Soup, you might also 
want to look at the HTMLData module:

http://oregonstate.edu/~barnesc/htmldata/
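If you'd rather stay with the standard library, the bundled HTMLParser can do the basic link-gathering part; a minimal sketch:

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                   # tag names arrive lowercased
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="/one">one</a> <a href="http://x.example/">x</a></p>')
print(collector.links)   # ['/one', 'http://x.example/']
```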

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Regex for URL extracting

2007-01-24 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "Johny" <[EMAIL PROTECTED]> wrote:

> Does anyone know about a good regular expression  for URL extracting?

Extracting URLs from what?

If it is HTML, then I'd look at some existing HTML parsing modules like 
Beautiful Soup and Barnes' HTMLData.
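If you do go the regex route despite the caveats, something this simple covers the common cases (it deliberately ignores unquoted attributes, entities and other HTML oddities):

```python
import re

# Grab quoted href/src attribute values. Good enough for quick scripts;
# a real parser is far more robust.
URL_RE = re.compile(r'''(?:href|src)\s*=\s*["']([^"']+)["']''',
                    re.IGNORECASE)

html = '<a HREF="http://example.com/a">a</a> <img src="/pix/b.png">'
print(URL_RE.findall(html))  # ['http://example.com/a', '/pix/b.png']
```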

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Help extracting info from HTML source ..

2007-01-26 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "Miki" <[EMAIL PROTECTED]> wrote:

> Hello Shelton,
> 
> >   I am learning Python, and have never worked with HTML.  However, I would
> > like to write a simple script to audit my 100+ Netware servers via their web
> > portal.
> Always use the right tool; BeautifulSoup
> (http://www.crummy.com/software/BeautifulSoup/) is best for web
> scraping (IMO).
> 
> from urllib import urlopen
> from BeautifulSoup import BeautifulSoup
> 
> html = urlopen("http://www.python.org").read()
> soup = BeautifulSoup(html)
> for link in soup("a"):
>   print link["href"], "-->", link.contents

Agreed. HTML scraping is really complicated once you get into it. It 
might be interesting to write such a library just for your own 
satisfaction, but if you want to get something done then use a module 
that already written, like BeautifulSoup. Another module that will do 
the same job but works differently (and more simply, IMO) is HTMLData by 
Connelly Barnes:
http://oregonstate.edu/~barnesc/htmldata/

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Marangozov's shmmodule (System V shared memory for Python IPC)

2007-02-01 Thread Nikita the Spider
Hi all,
In the late 90s Vladimir Marangozov wrote a module that provided an 
interface to System V shared memory on *nix platforms. I found a copy on 
the Net, dusted it off, compiled it, plugged a couple of memory leaks, 
integrated others' changes, etc. Vlad hasn't posted on Usenet since the 
summer of 2000 and I can't find a working email address for him, so I 
assume this module is orphaned; this is a last-ditch effort to track him 
down before I post the module with my changes (retaining his name as the 
author of 99.9% of the code, of course).

Does anyone know where I can contact Vladimir? The email addresses I can 
find for him don't work. If anyone knows him personally and would be 
kind enough to forward my email address ([EMAIL PROTECTED]) to 
him, I'd appreciate it.

Thanks

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: postgres backup script and popen2

2007-02-08 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:

> On 8 feb, 10:27, Maël Benjamin Mettler <[EMAIL PROTECTED]> wrote:
> 
> > flupke schrieb:
> > > i made a backup script to backup my postgres database.
> > > Problem is that it prompts for a password. I thought I
> > > could solve this by using popen2.
> >
> > Use pexpect:http://pexpect.sourceforge.net/
> 
> pexpect could work. But a better way would be to supply the password
> on the command line. I don't know how postgres does that things, but I
> hope there is some way to automate the backup process...

See the Postgres documentation for the .pgpass file.
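The file format, from memory (the values below are placeholders; check the Postgres docs for the exact rules, and note the file must not be group- or world-readable):

```
# ~/.pgpass -- one credential per line, fields colon-separated,
# * as a wildcard. Must be chmod 600.
hostname:port:database:username:password
localhost:5432:mydb:backup_user:s3cret
```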

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: threading and multicores, pros and cons

2007-02-14 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Maric Michaud <[EMAIL PROTECTED]> wrote:

> This is a recurrent problem I encounter when I try to sell python solutions 
> to 
> my customers. I'm aware that this problem is sometimes overlooked, but here 
> is the market's law.
> 
> I've heard of a bunch of arguments to defend python's choice of GIL, but I'm 
> not quite sure of their technical background, nor what is really important 
> and what is not. These discussions often end in a prudent "python has made a 
> choice among others"... which is not really convincing.
> 
> If some guru has made a good recipe, or want to resume the main points it 
> would be really appreciated.

When designing a new Python application I read a fair amount about the 
implications of multiple cores for using threads versus processes, and 
decided that using multiple processes was the way to go for me. On that 
note, there's a (sort of) new module available that allows interprocess 

communication via shared memory and semaphores with Python. You can find 
it here: 
http://NikitaTheSpider.com/python/shm/

Hope this helps

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Approaches of interprocess communication

2007-02-16 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "exhuma.twn" <[EMAIL PROTECTED]> wrote:

> Hi all,
> 
> Supposing you have two separate processes running on the same box,
> what approach would you suggest to communicate between those two
> processes.

Hi exhuma,
That would depend on what data I was exchanging between the processes. 
For instance, if process A spawns work process B and wants to be able 
monitor B's progress, a message-based protocol might be kind of chatty. 
In this situation shared memory is probably a better fit because B can 
write its progress to a chunk of shared memory and A can read that at 
its leisure. OTOH if the conversation is more event-driven, then a 
messaging protocol makes good sense.
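To make the progress-in-shared-memory idea concrete, here's a minimal stdlib-only sketch. It uses an anonymous mmap shared across fork() (so *nix only, parent/child processes); the percentages and sleeps are just stand-ins for real work:

```python
import mmap
import os
import struct
import time

# Process B publishes a progress counter; process A polls it at leisure.
# An anonymous mmap created before fork() is shared with the child.
buf = mmap.mmap(-1, 4)              # 4 bytes: one int progress counter
buf[:4] = struct.pack("i", 0)

pid = os.fork()
if pid == 0:
    # Worker: do the "work", updating the counter as it goes.
    for pct in (25, 50, 75, 100):
        time.sleep(0.01)
        buf[:4] = struct.pack("i", pct)
    os._exit(0)

# Monitor: no messages exchanged, just read the counter when curious.
deadline = time.time() + 10
progress = 0
while progress < 100 and time.time() < deadline:
    progress, = struct.unpack("i", buf[:4])
    time.sleep(0.01)
os.waitpid(pid, 0)
print("worker done, progress =", progress)
```

Unrelated (non-forked) processes can't share an anonymous mapping, which is where named shared-memory segments come in.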

FYI there's a Python module (with sample code) for using shared memory 
on most *nix systems here:
http://NikitaTheSpider.com/python/shm/

HTH

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: threading and multicores, pros and cons

2007-02-20 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Paul Rubin <http://[EMAIL PROTECTED]> wrote:

> Nikita the Spider <[EMAIL PROTECTED]> writes:
> > note, there a (sort of) new module available that allows interprocess 
> > communication via shared memory and semaphores with Python. You can find 
> > it here: 
> > http://NikitaTheSpider.com/python/shm/
> 
> This is from the old shm module that was floating around several years
> ago?  Cool, I remember trying to find it recently and it seemed to
> have disappeared--the original url was dead and it wasn't mirrored
> anywhere. 

Yes, this is almost certainly the one which you remember. I had a hard 
time finding it myself, but it's still shipped with a few Linux distros 
that have their SVN repository online and indexed by Google. 

FYI, I fixed a few bugs in the original, added some small features and a 
wrapper module. If you're compiling for Linux you might need to remove 
the HAVE_UNION_SEMUN definition from setup.py. (Just learned this 
yesterday thanks to Eric J. and I haven't updated the documentation yet.)

> How about putting it in CheeseShop or some other such repository?  

Hmmm, I hadn't thought about that since I've never used the Cheese Shop 
myself. What benefits does the Cheese Shop confer to someone looking for 
a package? I ask because from my perspective it just adds overhead to 
package maintenance. 

> Having it in the stdlib would be even better, of course.

That'd be fine with me!

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTML to dictionary

2007-02-27 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Tina I <[EMAIL PROTECTED]> wrote:

> Hi everyone,
> 
> I have a small, probably trivial even, problem. I have the following HTML:
> > 
> >  METAR:
> > 
> > ENBR 270920Z 0KT  FEW018 02/M01 Q1004 NOSIG
> > 
> > 
> >  short-TAF:
> > 
> > ENBR 270800Z 270918 VRB05KT  FEW020 SCT040
> > 
> > 
> >  long-TAF:
> > 
> > ENBR 271212 VRB05KT  FEW020 BKN030 TEMPO 2012 2000 SNRA VV010 BECMG 
> > 2124 15012KT
> > 
> 
> I need to make this into a dictionary like this:
> 
> dictionary = {"METAR:" : "ENBR 270920Z 0KT  FEW018 02/M01 Q1004 
> NOSIG" , "short-TAF:" : "ENBR 270800Z 270918 VRB05KT  FEW020 SCT040" 
> , "long-Taf:" : "ENBR 271212 VRB05KT  FEW020 BKN030 TEMPO 2012 2000 
> SNRA VV010 BECMG 2124 15012KT"}

Tina,
In addition to Beautiful Soup which others have mentioned, Connelly 
Barnes' HTMLData module will take (X)HTML and convert it into a 
dictionary for you:
http://oregonstate.edu/~barnesc/htmldata/

The dictionary won't have the exact format you want, but I think it 
would be fairly easy for you to convert to what you're looking for.

I use HTMLData a lot. Beautiful Soup is great for parsing iteratively, 
but if I just want to throw some HTML at a function and get data back, 
HTMLData is my tool of choice.
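If the page really is as regular as your snippet (a "label:" heading followed by the data line), you can even get away with the stdlib parser and no third-party module at all. A sketch, assuming labels always end in ':' — written with the Python 3 module spelling; in 2.x it's `HTMLParser.HTMLParser`:

```python
from html.parser import HTMLParser

class LabelValueParser(HTMLParser):
    """Pair text chunks ending in ':' with the next non-empty chunk.
    Assumes the page alternates label/value text, as in the METAR HTML."""
    def __init__(self):
        super().__init__()
        self.result = {}
        self._key = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if text.endswith(":"):
            self._key = text          # remember the label
        elif self._key:
            self.result[self._key] = text
            self._key = None

p = LabelValueParser()
p.feed("<div><b>METAR:</b><p>ENBR 270920Z 0KT FEW018 02/M01 Q1004 NOSIG</p>"
       "<b>short-TAF:</b><p>ENBR 270800Z 270918 VRB05KT FEW020 SCT040</p></div>")
print(p.result)
```

The tag names in the feed() call are invented for illustration; the parser only cares about the text, not the markup around it.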

Good luck with whatever you choose

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Inter-process communication, how?

2007-12-22 Thread Nikita the Spider
In article 
<[EMAIL PROTECTED]>,
 [EMAIL PROTECTED] wrote:

> Hi,
> let's say I have two scripts: one does some computations and the other
> one is a graphical front end for launching the first one. And both run
> in separate processes (front end runs and that it spawns a subprocess
> with the computation). Now, if the computation has a result I would
> like to display it in the front end. In another words, I would like to
> pass some data from one process to another. How to do that? I'm
> affraid I can't use a pipe since the computation could print out some
> logging (if I understant pipes correctly).

Others have given you good suggestions; there's also this option which 
may or may not be an appropriate tool for what you want to do:
http://NikitaTheSpider.com/python/shm/

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Detecting OS platform in Python

2008-01-11 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Mike Meyer <[EMAIL PROTECTED]> wrote:

> On Thu, 10 Jan 2008 18:37:59 -0800 (PST) Devraj <[EMAIL PROTECTED]> wrote:
> 
> > Hi everyone,
> > 
> > My Python program needs reliably detect which Operating System its
> > being run on, infact it even needs to know which distribution of say
> > Linux its running on. The reason being its a GTK application that
> > needs to adapt itself to be a Hildon application if run on devices
> > like the N800.
> 
> I don't think it can be done. 

[...]

> ...trying to figure out what features you have
> available by guessing based on the platform type is generally the
> wrong way to approach this kind of problem - only in part because you
> wind up reduced to a series of heuristics to figure out the
> platform. And once you've done that, you could wind up being wrong.
> 
> Generally, you're better of probing the platform to find out if it has
> the facilities you're looking for. For python, that generally means
> trying to import the modules you need, and catching failures; or
> possibly looking for attributes on modules if they adopt to the
> environment around them.


Much agreed. I just went through this with my SHM module. Compilation 
was failing because of a variation in ipc_perm in ipc.h on various 
platforms. I didn't feel confident at all that I could compile a list of 
all of the variations let alone keep it accurate and updated. The 
clincher was when I found that OS X >= 10.4 has two flavors of ipc_perm 
and which gets used depends on a compile flag, so identifying the OS 
would not have been useful in that case.

OP, I don't know what a Hildon or N800 is, but is it possible that the 
same OS fingerprint could show up on different devices? If so then 
you're really out of luck. I think you'll be much better off if you 
focus less on the OS and more on the features it offers.
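The usual Python idiom for that is to probe with an import and fall back, rather than fingerprinting the platform. A sketch ("hildon" is assumed here to be the importable name of the Maemo UI bindings):

```python
def have_module(name):
    """True if `name` imports cleanly, else False -- probe for the
    feature instead of guessing from the platform string."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False

# Switch to the Hildon-flavored UI only when its bindings exist:
if have_module("hildon"):
    ui_flavor = "hildon"
else:
    ui_flavor = "plain-gtk"
```

This keeps working even when the same OS fingerprint shows up on two different devices, because you're asking about the capability you actually need.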

Good luck

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: mmap and shared memory

2008-02-13 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Jeff Schwab <[EMAIL PROTECTED]> wrote:

> greg wrote:
> > Carl Banks wrote:
> >> In C you can use the mmap call to request a specific physical location
> >> in memory (whence I presume two different processes can mmap anonymous
> >> memory block in the same location)
> > 
> > Um, no, it lets you specify the *virtual* address in the process's
> > address space at which the object you specify is to be mapped.
> > 
> > As far as I know, the only way two unrelated processes can share
> > memory via mmap is by mapping a file. An anonymous block is known
> > only to the process that creates it -- being anonymous, there's
> > no way for another process to refer to it.
> 
> On POSIX systems, you can create a shared memory object without a file 
> using shm_open.  The function returns a file descriptor.

Sorry I missed the OP, but you might be interested in this shared memory 
module for Python:
http://NikitaTheSpider.com/python/shm/

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: mmap and shared memory

2008-02-14 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Jeff Schwab <[EMAIL PROTECTED]> wrote:

> Nikita the Spider wrote:
> > In article <[EMAIL PROTECTED]>,
> >  Jeff Schwab <[EMAIL PROTECTED]> wrote:
> > 
> >> greg wrote:
> >>> Carl Banks wrote:
> >>>> In C you can use the mmap call to request a specific physical location
> >>>> in memory (whence I presume two different processes can mmap anonymous
> >>>> memory block in the same location)
> >>> Um, no, it lets you specify the *virtual* address in the process's
> >>> address space at which the object you specify is to be mapped.
> >>>
> >>> As far as I know, the only way two unrelated processes can share
> >>> memory via mmap is by mapping a file. An anonymous block is known
> >>> only to the process that creates it -- being anonymous, there's
> >>> no way for another process to refer to it.
> >> On POSIX systems, you can create a shared memory object without a file 
> >> using shm_open.  The function returns a file descriptor.
> > 
> > Sorry I missed the OP, but you might be interested in this shared memory 
> > module for Python:
> > http://NikitaTheSpider.com/python/shm/
> 
> 
> Thanks; I just downloaded it.  It seems to be missing the INSTALL file; 
> any idea where I could find that, or should I write to the author?

The main author has been AWOL for some years now. I'm the current 
maintainer. Sorry about the missing INSTALL file. Looks like I made 
reference to it in the README but never created it. It installs with the 
normal setup.py:

sudo python setup.py install

I'll put out an updated package with an INSTALL file one of these days. 
Thanks for pointing that out.

Cheers

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: zope and python 2.5.1

2008-03-20 Thread Nikita the Spider
In article 
<[EMAIL PROTECTED]>,
 hberig <[EMAIL PROTECTED]> wrote:

> Hi,
> I'm sorry if this is an off-topic message, but I didn't found other
> group.
> I've read some articles about Zope, and since I like very much python,
> I would like to run a webserver application using Zope instead other
> web application server. I've difficulties to install zope 3 and zope 2
> on linux 2.6.20 (ubuntu distribution), python 2.5.1. What happen?? Is
> there a simple way to fix it?

I don't know much about installing Zope, but it's not the only choice 
for a Python Web stack. There's Django, Pylons and TurboGears, to name 
a few alternatives.

Have fun

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: is there a bug in urlunparse/urlunsplit

2008-05-19 Thread Nikita the Spider
In article 
<[EMAIL PROTECTED]>,
 Alex <[EMAIL PROTECTED]> wrote:

> Hi all.
> 
> Is there a bug in the urlunparse/urlunsplit functions?
> Look at this fragment (I know is quite silly):
> 
> urlunparse(urlparse('www.example.org','http'))
> ---> 'http:///www.example.org'
>^
> 
> There are too many slashes, isn't it? Is it a known bug or maybe I
> missed something...

Hi Alex,
For a few years now I've been using Fourthought's libraries for parsing 
URLs and they've performed beautifully. In the code comments, they state 
that urlparse() and friends exhibit some non-RFCish behavior, hence the 
inspiration for writing their own libraries. 

If I remember correctly, the file you want is uri.py and it is in 4Suite 
which you can download from here:
http://www.fourthought.com/
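For what it's worth, the root of the round-trip oddity is visible in urlparse() itself: with no '//' in the input, the host ends up in the path component and netloc stays empty, and urlunparse then rebuilds the URL around that empty netloc. A quick sketch (spelled with the Python 3 module name; in 2.x these live in the top-level urlparse module):

```python
from urllib.parse import urlparse

# 'www.example.org' has no '//', so the host lands in .path,
# not .netloc -- which is what urlunparse later trips over.
bare = urlparse("www.example.org", "http")
assert bare.netloc == ""
assert bare.path == "www.example.org"

# Spell the scheme out and the host is parsed as the netloc:
full = urlparse("http://www.example.org")
assert full.netloc == "www.example.org"
assert full.path == ""
```

So feeding urlparse a scheme-qualified URL in the first place sidesteps the problem.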

HTH

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
--
http://mail.python.org/mailman/listinfo/python-list


Re: Multi Threading Problem with Python + Django + PostgreSQL.

2008-03-31 Thread Nikita the Spider
In article 
<[EMAIL PROTECTED]>,
 Pradip <[EMAIL PROTECTED]> wrote:

> Hello every body. I am new to this forum and also in Python.
> Read many things about multi threading in python. But still having
> problem.
> 
> I am using Django Framework with Python having PostgreSQL as backend
> database with Linux OS. My applications are long running. I am using
> threading.
> The problem I am facing is that the connections that are being created
> for database(postgres) update are not getting closed even though my
> threads had returned and updated database successfully. It is not like
> that the connections are not being reused. They r being reused but
> after sometime new one is created. Like this it creates too many
> connections and hence exceeding MAX_CONNECTION limit of postgres conf.
> 
> ** I am using psycopg2 as adaptor for python to postgres connection.
> which itself handles the connections(open/close)

Hi Pradip,
A common problem that new users of Python encounter is that they expect 
database statements to COMMIT automatically. Psycopg2 follows the Python 
DB-API specification and does not autocommit transactions unless you ask 
it to do so. Perhaps your connections are not closing because they have 
open transactions? 

To enable autocommit, call this on your connection object:
connection.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
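In case it helps, here's how the pieces fit together, as a sketch only: the DSN, table and column names are placeholders, and the point is simply that with autocommit on (or with explicit commit()/rollback()) the session never sits "idle in transaction" holding the connection hostage.

```python
def update_done(job_id, dsn="dbname=mydb user=myuser"):
    """Placeholder DSN/table names; sketch of psycopg2 autocommit usage."""
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect(dsn)
    # Without this, the first statement opens a transaction that stays
    # open until commit()/rollback() -- "idle in transaction" sessions.
    conn.set_isolation_level(
        psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    try:
        cur = conn.cursor()
        cur.execute("UPDATE jobs SET done = true WHERE id = %s", (job_id,))
        cur.close()
    finally:
        conn.close()   # close explicitly when the thread is finished
```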

> Now the problem is with Django / Python / psycopg2 or any thing else??

Are you asking if there are bugs in this code that are responsible for 
your persistent connections? If so, then I'd say the answer is almost 
certainly no. Of course it's possible, but Django/Psycopg/Postgres is a 
pretty popular stack. The odds that there's a major bug in this popular 
code examined by many eyes versus a bug in your code are pretty low, I 
think. Don't take it personally, the same applies to me and my code. 
=)

Happy debugging

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: developing web spider

2008-04-03 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 John Nagle <[EMAIL PROTECTED]> wrote:

> abeen wrote:
> > Hello,
> > 
> > I would want to know which could be the best programming language for
> > developing web spider.
> > More information about the spider, much better,,
> 
>     As someone who actually runs a Python based web spider in production, I
> should comment.
> 
> You need a very robust parser to parse real world HTML.
> Even the stock version of BeautifulSoup isn't good enough.  We have a
> modified version of BeautifulSoup, plus other library patches, just to
> keep the parser from blowing up or swallowing the entire page into
> a malformed comment or tag.  Browsers are incredibly forgiving in this
> regard.
> 
> "urllib" needs extra robustness, too.  The stock timeout mechanism
> isn't good enough.  Some sites do weird things, like open TCP connections
> for HTTP but not send anything.
> 
> Python is on the slow side for this.  Python is about 60x
> slower than C, and for this application, you definitely see that.
> A Python based spider will go compute bound for seconds per page
> on big pages.  The C-based parsers for XML/HTML aren't robust enough for
> this application.  And then there's the Global Interpreter Lock; a multicore
> CPU won't help a multithreaded compute-bound process.
> 
> I'd recommend using Java or C# for new work in this area
> if you're doing this in volume.  Otherwise, you'll need to buy
> many, many extra racks of servers.  In practice, the big spiders
> are in C or C++.

I'll throw in an opinion from a different viewpoint. I'm really happy I 
used Python to develop my spider. I like the language, it has a good 
library and good community support and 3rd party modules. 

John, I don't know what your spider does, but you face some hurdles that 
I don't. For instance, since I'm focused on validation, if bizarre 
(invalid) HTML makes a page look like garbage, I just report the problem 
to the author. Performance isn't a big problem for me, either, since 
this is not a crawl-as-fast-as-you-can application. 

What you said sounds to me entirely correct for your application. The OP 
who asked for as much information as possible didn't give a whole lot to 
start with.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: urlparse http://site.com/../../../page.html

2008-04-08 Thread Nikita the Spider
In article 
<[EMAIL PROTECTED]>,
 "monk.e.boy" <[EMAIL PROTECTED]> wrote:

> I figured it out and blogged the answer:
> 
> http://teethgrinder.co.uk/blog/Normalize-URL-path-python/


Thanks for letting us know of a solution.

You might also be interested in Fourthought's URI library which contains 
a function called Absolutize() to do just what you're asking. It also 
claims to do a better job of parsing URIs than some of the functions in 
stock Python library. I've been using it heavily for quite a while and 
it has served me well.

http://4suite.org/
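If you'd rather stay stdlib-only, posixpath.normpath applied to just the path component does the '..' collapsing (with the caveat that it also squashes duplicate slashes and drops a trailing slash, which may or may not be what you want). A sketch, spelled with the Python 3 module names:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Collapse '.' and '..' segments in a URL's path component."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    if path:
        path = posixpath.normpath(path)
    return urlunsplit((scheme, netloc, path, query, fragment))

print(normalize("http://site.com/a/b/../../page.html"))
# -> http://site.com/page.html
```

Leading '..' segments that would climb above the root are simply discarded, same as a filesystem path.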

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: is Pylons alive?

2008-04-09 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 Gerhard Häring <[EMAIL PROTECTED]> wrote:

> - TurboGears 2.0 (I personally wouldn't bother with TurboGears 1.x at 
> this point)

Having investigated some of this myself recently, I agree with you that 
the TG 1.x series is a dead end, but there's no TG 2.0 yet. Last time I 
checked the developers hoped to have an alpha release at the end of 
March.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Stripping scripts from HTML with regular expressions

2008-04-10 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>,
 "Reedick, Andrew" <[EMAIL PROTECTED]> wrote:

> > -Original Message-
> > From: [EMAIL PROTECTED] [mailto:python-
> > [EMAIL PROTECTED] On Behalf Of Michel Bouwmans
> > Sent: Wednesday, April 09, 2008 3:38 PM
> > To: python-list@python.org
> > Subject: Stripping scripts from HTML with regular expressions
> > 
> > Hey everyone,
> > 
> > I'm trying to strip all script-blocks from a HTML-file using regex.
> > 
> 
> [Insert obligatory comment about using a html specific parser
> (HTMLParser) instead of regexes.]

Yah, seconded. To the OP - use BeautifulSoup or HtmlData unless you like 
to reinvent wheels.

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
-- 
http://mail.python.org/mailman/listinfo/python-list