Using a function for regular expression substitution

2010-08-29 Thread naugiedoggie
Hello,

I'm having a problem with using a function as the replacement in
re.sub().

Here is the function:

def normalize(s) :
    return urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))

The purpose of this function is to proper-case the words contained in
a URL query string parameter value.  I'm massaging data in web log
files.

In case it matters, the regex pattern looks like this:

provider_pattern = r'(?P<search>Search_Provider)=(?P<provider>[^&]+)'

The call looks like this:


re.sub(matcher,normalize,line)


Where line is the log line entry.

What I get back is first the entire line with the normalization of the
parameter value, but missing the parameter; then appended to that
string is the entire line again, with the query parameter back in
place pointing to the normalized string.
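A minimal reproduction of the symptom, in Python 3 terms (`urllib.parse` in place of the old `urllib` functions; the sample line is invented):

```python
import re
import string
from urllib.parse import quote, unquote

line = "log-entry?p=1&Search_Provider=john%20chen&q=2"
pattern = r'(?P<search>Search_Provider)=(?P<provider>[^&]+)'

def normalize(m):
    # Returns only the normalised value, so re.sub() substitutes it for
    # the WHOLE match -- the "Search_Provider=" prefix vanishes.
    return quote(string.capwords(unquote(m.group('provider'))))

print(re.sub(pattern, normalize, line))
# log-entry?p=1&John%20Chen&q=2
```

Note how the parameter name drops out of the result, exactly as described above.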


>>> fileReader = open(log,'r')
>>> lines = fileReader.readlines()
>>> for line in lines:
...     if line.find('Search_Type') != -1 and line.find('Search_Provider') != -1 :
...         re.sub(provider_matcher,normalize,line)
...         print line,'\n'


The output of the print is like this:


'log-entry parameter=value&normalized-string&parameter=value\n
log-entry parameter=value&parameter=normalized-string&parameter=value'


The goal is to massage the specified entries in the log files and
write the entire log back into a new file.  The new file has to be
exactly the same as the old one, with the exception of the entries
I've altered with my function.

No doubt I'm doing something trivially wrong, but I've tried to
reproduce the structure as defined in the documentation.

Thanks.

mp
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Using a function for regular expression substitution

2010-08-30 Thread naugiedoggie
On Aug 29, 1:14 pm, MRAB  wrote:
> On 29/08/2010 15:22, naugiedoggie wrote:

> > I'm having a problem with using a function as the replacement in
> > re.sub().

> > Here is the function:

> > def normalize(s) :
> >      return
> > urllib.quote(string.capwords(urllib.unquote(s.group('provider'
>
> This normalises the provider and returns only that, and none of the
> remainder of the string.
>
> I think you might want this:
>
> def normalize(s):
>      return s[ : s.start('provider')] + \
>          urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) + \
>          s[s.end('provider') : ]
>
> It returns the part before the provider, followed by the normalised
> provider, and then the part after the provider.

Hello,

Thanks for the reply.

There must be something basic about the re.sub() function that I'm
missing.  The documentation shows this example:


>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro--gram-files')
'pro gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'


According to the doc, the modifying function takes one parameter, the
MatchObject.  The re.sub function takes only a compiled regex object
or a pattern, generates a MatchObject from that object/pattern and
passes the MatchObject to the given function. Notice that in the
examples, the re.sub() returns the entire line, with the changes made.
But the function itself returns only the change.  What is happening
for me is that, if I have a line that contains
&Search_Provider=chen&p=value, the processed line ends up with
&Chen&p=value.

Now, I did follow up with your suggestion.  `s' is actually a
MatchObject (bad param naming on my part, I started out passing a
string into the function and then changed it to a MatchObject, but
didn't change the param name), so I made the following change:


return line[s.pos : s.start('provider')] + \
    urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) + \
    line[s.end('provider') : ]


In order to make this work (finally), I had to make the processing
function look like this:


def processLine(l) :
    global line
    line = l
    provider = getProvider(line)
    if provider == "No Provider" : return line
    scenario = getScenario(line)
    if filter(lambda a: a != None, [getOrg(s,scenario) for s in orgs]) == [] :
        line = re.sub(provider_pattern,normalize,line)
    else :
        # str.replace also returns a new string; the result must be kept
        line = line.replace(provider_parameter, org_parameter)
    return line


And then the call:


lines = fileReader.readlines()
[ fileWriter.write(l) for l in [processLine(l) for l in lines]]


Without this complicated gobbledygook, I could not get the correct
result.  I hate global vars and I completely do not understand why I
have to go through this twisting and turning to get the desired
result.

[ ... ]

> These can be replaced by:
>
>         if 'Search_Type' in line and 'Search_Provider' in line:
>
> >            re.sub(provider_matcher,normalize,line)
>
> re.sub is returning the result, which you're throwing away!
>
>                 line = re.sub(provider_matcher,normalize,line)

I can't count the number of times I have forgotten the meaning of
'returns a string' when reading docs about doing substitutions. In
this case, I had put the `line = ' in and taken it out.  And I should
know better, from years of programming in Java, where strings are
immutable and you _always_ get a new, returned string.  Should be
second nature.
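The point in one snippet (strings are immutable, so re.sub() hands back a new one and the original is untouched):

```python
import re

line = "Search_Provider=chen"
re.sub("chen", "Chen", line)         # result thrown away; line unchanged
assert line == "Search_Provider=chen"

line = re.sub("chen", "Chen", line)  # rebind to keep the change
assert line == "Search_Provider=Chen"
```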

Thanks for the help, much appreciated.

mp


Re: Using a function for regular expression substitution

2010-08-30 Thread naugiedoggie
On Aug 30, 8:52 am, naugiedoggie  wrote:
> On Aug 29, 1:14 pm, MRAB  wrote:
>
> [ ... full quote of the previous messages trimmed ... ]
>

Hello,

Well, that turned out to be still wrong.  I did start getting the
proper param=value back from my `normalize' function, but I got
"extra" data as well.

This works:


def normalize(s) :
    return s.group('search')+'='+urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))


Essentially, the pattern contained two groups, one capturing the
parameter name and one the value.  By concatenating the two back
together, I was able to achieve the desired result.

I suppose the lesson is: the replacement function's return value
replaces the entire match, not just the text captured by one group.
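In Python 3 terms (`urllib.parse` standing in for `urllib`, same two named groups as the pattern above), the working version looks like this:

```python
import re
import string
from urllib.parse import quote, unquote

pattern = r'(?P<search>Search_Provider)=(?P<provider>[^&]+)'

def normalize(m):
    # Re-emit the parameter name and '=' so the replacement covers
    # everything the pattern matched, not just the value group.
    return m.group('search') + '=' + quote(string.capwords(unquote(m.group('provider'))))

line = "p=1&Search_Provider=john%20chen&q=2"
print(re.sub(pattern, normalize, line))
# p=1&Search_Provider=John%20Chen&q=2
```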

Thanks.

mp


Installing Python as Scripting Language in IIS

2010-08-30 Thread naugiedoggie
Hello,

Windows 2003, 64-bit, standard edition server with IIS 6.0.  I
followed the MS instruction sheets on setting up CGI application with
Python as scripting engine.  I'm just getting 404 for the test script,
whereas an html file in the same virtual directory is properly
displayed.

Here:

Creating Applications in IIS 6.0 (IIS 6.0)
http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/bc0c4729-e892-4871-b8f3-fcbf489f2f09.mspx?mfr=true

Setting Application Mappings in IIS 6.0 (IIS 6.0)
http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/bc0c4729-e892-4871-b8f3-fcbf489f2f09.mspx?mfr=true

I mapped the exe thus:  c:\Python26\python.exe -u "%s %s"
to extension `py' for all verbs and checked the `script engine' box.

There are no errors in the script itself; I ran it from the command
line to be sure.  Further, I enabled ASP and tried using Python as the
scripting language.  That generates this error:


Active Server Pages error 'ASP 0129'
Unknown scripting language
/cgi/index.asp, line 1
The scripting language 'Python' is not found on the server.


I can't find any good references for dealing with this, either.

I've dicked around with this for so long, now I don't know which way
is up, anymore.

Any thoughts on where I might be going wrong, much appreciated.

Thanks.

mp


Trap Authentication Errors in HTTP Request

2010-09-10 Thread naugiedoggie
Hello,

I have a script that authenticates to a web service provider to
retrieve data.  This script provides an authentication header built in
a very basic way like this:


# Creates an authentication object with the credentials for a given URL
def createPasswordManager(headers) :
    passwordManager = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passwordManager.add_password(None,overview_url,headers[0],headers[1])
    return passwordManager

# Creates an authentication handler for the authentication object created above
def createAuthenticationHandler(passwordManager) :
    authenticationHandler = urllib2.HTTPBasicAuthHandler(passwordManager)
    return authenticationHandler

# Creates an opener that sets the credentials in the Request
def createOpener(authHandler) :
    return urllib2.build_opener(authHandler)


This script makes multiple calls for data.  I would like to trap an
exception for authentication failure so that it doesn't go through its
entire list of calls when there's a problem with the login.  The
assumption is that if there is a login failure, the script is using
incorrect authentication information.

I have the call for data retrieval wrapped in try/except, to catch
HTTPError, but apparently no '401' is explicitly thrown when
authentication fails.  And I don't see an explicit Exception that is
thrown in urllib2 for handling these failures.

How can I achieve my goal of trapping these exceptions and exiting
cleanly?
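One way to get that behaviour, sketched in Python 3 terms (`urllib.error` in place of urllib2's exceptions; `fetch_all` and the error message are my own names, and `opener` is whatever `createOpener()` built): wrap each call, and bail out on the first 401 before the rest of the list runs.

```python
import urllib.error

def fetch_all(urls, opener):
    # Walk the list of data calls, but abort on the first 401 -- if the
    # credentials are wrong once, they will be wrong for every call.
    results = []
    for url in urls:
        try:
            results.append(opener.open(url).read())
        except urllib.error.HTTPError as e:
            if e.code == 401:
                raise RuntimeError("login failed; skipping remaining calls")
            raise  # other HTTP errors are a different problem
    return results
```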

Thanks.

mp


SOLVED: Re: Trap Authentication Errors in HTTP Request

2010-09-11 Thread naugiedoggie
On Sep 10, 12:09 pm, naugiedoggie  wrote:
> Hello,
>
> I have a script that authenticates to a web service provider to
> retrieve data.  This script provides an authentication header built in
> a very basic way like this:

The answer is that there is something whacked in the Windoze
implementation for urllib2.

It turns out that the script works fine when run in a linux console.
'401' error is trapped as expected by an exception handler.  In
Winblows, the builtin handler for authentication is supposed to take a
dump after 5 retries, but this seems to not happen.  The retries
continue until a recursion exception is fired.  At this point the
script dumps back to the console.  An exception handler for Exception
will catch this.
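In Python 3 terms the runaway retry surfaces as RecursionError (in Python 2 it is a RuntimeError), so the two failure modes can be told apart like this; the helper name is my own:

```python
import urllib.error

def classify_failure(exc):
    # A clean 401 after the retry limit (seen on Linux) vs. the
    # runaway-retry recursion blowup (seen on Windows).
    if isinstance(exc, urllib.error.HTTPError) and exc.code == 401:
        return "auth-failed"
    if isinstance(exc, RecursionError):
        return "runaway-retry"
    return "other"

print(classify_failure(RecursionError()))
# runaway-retry
```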

Thanks.

mp


Bug: urllib2 basic authentication does not trap auth failure in windows?

2010-09-11 Thread naugiedoggie
Hello,

Is this a known issue?

I have a script that authenticates to a web service.  The urllib2
machinery for basic authentication is supposed to dump out after 5
retries.

Here's a note from the urllib2.py code:


def http_error_auth_reqed(self, auth_header, host, req, headers):
    authreq = headers.get(auth_header, None)
    if self.retried > 5:
        # Don't fail endlessly - if we failed once, we'll probably
        # fail a second time. Hm. Unless the Password Manager is
        # prompting for the information. Crap. This isn't great
        # but it's better than the current 'repeat until recursion
        # depth exceeded' approach
        raise HTTPError(req.get_full_url(), 401, "digest auth failed",
                        headers, None)


This note is from the digest handler but the basic handler is exactly
the same in the important respects.

What happens in a Windows console is that, in fact, the code dumps
with the message 'maximum recursion depth exceeded.'

Whereas, in the Linux console, the same script exits appropriately
with the trapped 401 error that authentication failed.

Thanks.

mp


Re: How to Convert IO Stream to XML Document

2010-09-11 Thread naugiedoggie
On Sep 10, 12:20 pm, jakecjacobson  wrote:
> I am trying to build a Python script that reads a Sitemap file and
> pushes the URLs to a Google Search Appliance.  I am able to fetch the
> XML document and parse it with regular expressions, but I want to move
> to using native XML tools to do this.  The problem I am getting is: if
> I use urllib.urlopen(url) I can convert the IO stream to an XML
> document, but if I use urllib2.urlopen and then read the response, I
> get the content, but when I use minidom.parse() I get an "IOError:
> [Errno 2] No such file or directory:" error.

Hello,

This may not be helpful, but I note that you are doing two different
things with your requests, and judging from the documentation,  the
objects returned by urllib and urllib2 openers do not appear to be the
same.  I don't know why you are calling urllib.urlopen(url) and
urllib2.urlopen(request), but I can tell you that I have used urllib2
opener to retrieve a web services document in XML and then parse it
with minidom.parse().


>
> THIS WORKS but will have issues if the IO Stream is a compressed file
> def GetPageGuts(net, url):
>         pageguts = urllib.urlopen(url)
>         xmldoc = minidom.parse(pageguts)
>         return xmldoc
>
> # THIS DOESN'T WORK, but I don't understand why
> def GetPageGuts(net, url):
>         request=getRequest_obj(net, url)
>         response = urllib2.urlopen(request)
>         response.headers.items()
>         pageguts = response.read()

Did you note the documentation says:

"One caveat: the read() method, if the size argument is omitted or
negative, may not read until the end of the data stream; there is no
good way to determine that the entire stream from a socket has been
read in the general case."

No EOF marker might be the cause of the parsing problem.
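For what it's worth, minidom.parse() expects a filename or a file-like object, while the string that response.read() returns needs parseString(). A small sketch (the sample XML is invented):

```python
import io
from xml.dom import minidom

xml_text = "<urlset><url><loc>http://example.com/</loc></url></urlset>"

# Handing parse() a plain string makes it treat the string as a *filename*,
# which is one way to end up with "IOError: No such file or directory".
doc = minidom.parseString(xml_text)                   # for the read() result
same = minidom.parse(io.BytesIO(xml_text.encode()))   # file-like object also works

print(doc.getElementsByTagName("loc")[0].firstChild.data)
# http://example.com/
```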

Thanks.

mp