string formatter %x and a class instance with __int__ or __long__ cannot handle long

2007-06-20 Thread Kenji Noguchi
Hi

I'm using Python 2.4.4 on 32-bit x86 Linux.  I have a problem printing a
hex string for a value of 0x80000000 or larger when the value is given to
the % operator via an instance of a class with __int__().  If I pass a long
value to the % operator directly, it works just fine.

Example 1 -- pass a long value directly.  This works.
>>> x=0x80000000
>>> x
2147483648L
>>> type(x)
<type 'long'>
>>> "%08x" % x
'80000000'

Example 2 -- pass an instance of a class with __int__()
>>> class X:
...     def __init__(self, v):
...         self.v = v
...     def __int__(self):
...         return self.v
...
>>> y = X(0x80000000)
>>> "%08x" % y
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: int argument required
>>>

The behavior looks inconsistent.  By the way, __int__ actually
returned a long value in Example 2.  "%08x" accepts either an int
or a long in Example 1, but it accepts only an int in Example 2.
Is this a bug, or is it expected?

By the way, the same thing happens on a 64-bit system with a
value of 0x8000000000000000.

Regards,
Kenji Noguchi
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: string formatter %x and a class instance with __int__ cannot handle long

2007-06-21 Thread Kenji Noguchi
2007/6/20, [EMAIL PROTECTED] <[EMAIL PROTECTED]>:
> In your second example y is an instance of class X...not an int.  y.v
> is an int.  Are you hoping it will cast it to an int as needed using
> your method?  If so, I think you need to do so explicitly...ie "%08x"
> % int(y)
>
> ~Sean

I confirmed that "%08x" % int(y) works. And yes, that is what I'm hoping for.
It actually works that way if v is less than or equal to 0x7fffffff.
Please try the test script below.  It's essentially the same test with
some more print statements.  Everything except test d-3 appears to be OK.
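
A minimal sketch of the explicit-conversion workaround confirmed above, written so that it also runs on Python 3. The helper name `to_hex` and its `width` parameter are hypothetical and not part of the thread; the class mirrors the `X` used in the examples.

```python
class X(object):
    """Wrapper whose only numeric hook is __int__, as in the examples above."""

    def __init__(self, v):
        self.v = v

    def __int__(self):
        return self.v


def to_hex(value, width=8):
    # Convert explicitly first, so the % operator only ever sees a plain
    # integer and never has to coerce the wrapper object itself.
    return "%0*x" % (width, int(value))


print(to_hex(X(0x7fffffff)))
print(to_hex(X(0x80000000)))
```

Because the conversion happens before formatting, the result should not depend on whether `__int__` returns an int or a long.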


2007/6/20, Gabriel Genellina <[EMAIL PROTECTED]>:
> It is a bug, at least for me, and I have half of a patch addressing it. As
> a workaround, convert explicitly to long before formatting.

I'm interested in your patch.  What's the other half still missing?

Thanks,
Kenji Noguchi


--->8>8--->8---
#!/usr/bin/env python

class X:
   def __init__(self, v):
      self.v = v
   def __int__(self):
      print "Hey! I'm waken up!"
      return self.v

def test(arg):
   print 1,type(int(arg))
   print 2,"%08x" % int(arg)
   print 3,"%08x" % arg

a = 0x7fffffff
b = X(0x7fffffff)
c = 0x80000000
d = X(0x80000000)

print "test a" ; test(a)
print "test b" ; test(b)
print "test c" ; test(c)
print "test d" ; test(d)
--->8>8--->8---

And here is the result
test a
1 <type 'int'>
2 7fffffff
3 7fffffff
test b
1 Hey! I'm waken up!
<type 'int'>
2 Hey! I'm waken up!
7fffffff
3 Hey! I'm waken up!
7fffffff
test c
1 <type 'long'>
2 80000000
3 80000000
test d
1 Hey! I'm waken up!
<type 'long'>
2 Hey! I'm waken up!
80000000
3 Hey! I'm waken up!
Traceback (most recent call last):
 File "<stdin>", line 23, in ?
 File "<stdin>", line 13, in test
TypeError: int argument required


Re: string formatter %x and a class instance with __int__ cannot handle long

2007-06-21 Thread Kenji Noguchi
I looked at the Python 2.5.1 source code.

I noticed that in Objects/stringobject.c, around line 4684, the
long type is handled as a special case, which is a hack, and
everything else falls through to formatint.  This explains why
explicitly converting to long before formatting fixes the problem.
I made a patch, but it is a hack on top of a hack.
I expect Python 3000 won't have this problem, since it unifies
int and long.

Thanks
Kenji Noguchi

--- stringobject.c.org  2007-06-21 13:57:54.745877000 -0700
+++ stringobject.c  2007-06-21 13:59:19.576646000 -0700
@@ -4684,6 +4684,15 @@
 		case 'X':
 			if (c == 'i')
 				c = 'd';
+			/* try to convert objects to number*/
+			PyNumberMethods *nb;
+			if ((nb = v->ob_type->tp_as_number) &&
+			    nb->nb_int) {
+				v = (*nb->nb_int) (v);
+				if(v == NULL)
+					goto error;
+			}
+
 			if (PyLong_Check(v)) {
 				int ilen;
 				temp = _PyString_FormatLong(v, flags,
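
For comparison, a small Python 3 sketch (an illustration added here, not part of the original message). It assumes a Python 3 interpreter, where int and long are a single type and, as far as the `%x` conversion is concerned, an object is accepted when it defines `__index__`. The class name `Y` is hypothetical.

```python
class Y:
    """Integer-like wrapper; __index__ marks it as losslessly convertible."""

    def __init__(self, v):
        self.v = v

    def __index__(self):
        return self.v

    # Also expose __int__ so that int(y) works the same way.
    __int__ = __index__


# No explicit int() call is needed here; %x asks the object for __index__.
small = "%08x" % Y(0x7fffffff)
large = "%08x" % Y(0x80000000)
huge = "%x" % Y(2 ** 64)
print(small, large, huge)
```

Values past the old 32-bit and 64-bit int limits format the same way as small ones in this sketch, which is the behavior the message above expects from the unified integer type.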

2007/6/21, Kenji Noguchi <[EMAIL PROTECTED]>:
> 2007/6/20, Gabriel Genellina <[EMAIL PROTECTED]>:
> > It is a bug, at least for me, and I have half of a patch addressing it. As
> > a workaround, convert explicitly to long before formatting.
>
> I'm interested in your patch.  What's the other half still missing?


Re: Tailing a log file?

2007-06-22 Thread Kenji Noguchi
Something like this? The Unix tail command does fancier things:
it waits with a timeout, checks whether the file has been truncated,
sleeps for some seconds depending on incoming data, and so on.

#!/usr/bin/env python
import sys, select

while True:
    ins, outs, errs = select.select([sys.stdin],[],[])
    for i in ins:
        print i.readline()
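
The polling approach described in the quoted message below (check the file size with fstat, then sleep) can be sketched roughly as follows. This is an illustration for Python 3, not the actual tail implementation; the function name `follow` and its parameters `poll_interval` and `from_start` are hypothetical.

```python
import os
import time


def follow(path, poll_interval=0.5, from_start=False):
    """Yield complete lines appended to `path`, polling when no data is ready."""
    with open(path, "rb") as f:
        if not from_start:
            f.seek(0, os.SEEK_END)  # start at the current end of the file
        pending = b""
        while True:
            chunk = f.readline()
            if chunk:
                pending += chunk
                # Only hand out a line once its trailing newline has arrived.
                if pending.endswith(b"\n"):
                    yield pending.decode("utf-8", "replace")
                    pending = b""
                continue
            # Nothing new. If the file is now shorter than our read position,
            # assume it was truncated and start over from the beginning.
            if os.fstat(f.fileno()).st_size < f.tell():
                f.seek(0)
                pending = b""
            time.sleep(poll_interval)
```

A caller would loop over it, for example `for line in follow(some_log_path): do_something(line)`, where `some_log_path` stands for whatever file is being watched. The generator never ends on its own, so the loop blocks until a new line appears.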


2007/6/22, Evan Klitzke <[EMAIL PROTECTED]>:
> On 6/22/07, Evan Klitzke <[EMAIL PROTECTED]> wrote:
> > Everyone,
> >
> > I'm interested in writing a python program that reads from a log file
> > and then executes actions based on the lines. I effectively want to
> > write a loop that does something like this:
> >
> > while True:
> >     log_line = log_file.readline()
> >     do_something(log_line)
> >
> > Where the readline() method blocks until a new line appears in the
> > file, unlike the standard readline() method which returns an empty
> > string on EOF. Does anyone have any suggestions on how to do this?
> > Thanks in advance!
>
> I checked the source code for tail and they actually poll the file by
> using fstat and sleep to check for changes in the file size. This
> didn't seem right so I thought about it more and realized I ought to
> be using inotify. So I guess I answered my own question.
>
> --
> Evan Klitzke <[EMAIL PROTECTED]>


Re: developing web spider

2008-04-04 Thread Kenji Noguchi
Attached is the essence of my crawler.  It collects the <a> tags in a given URL.

HTML parsing is not a big deal, as "tidy" does it all for you. It converts
broken HTML into valid XHTML.  From that point on there is a wealth of XML
libraries. Just write whatever you want, such as an <a> element handler.

I've extended it to be multi-threaded, to limit the number of threads for a
specific web host, to handle elements more flexibly, and so on. SQLite is
nice for building a URL db, by the way.

Kenji Noguchi
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys, urllib, urllib2, cookielib
import xml.dom.minidom, tidy
from xml.parsers.expat import ExpatError
from urlparse import urlparse, urljoin

_ua = "Mozilla/5.0 (Windows; U; Windows NT 6.0; ja; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12"

# I'm not sure if CookieJar() is thread safe
cj = cookielib.CookieJar()

class SingleCrawler:
    def __init__(self, seed_url=None):
        self.seed_url = seed_url
        self.urls = {}

    # static
    def _convert(self, html):
        if isinstance(html, unicode):
            html = html.encode('utf-8')
        options = dict(
            doctype='strict',
            drop_proprietary_attributes=True,
            enclose_text=True,
            output_xhtml=True,
            wrap=0,
            char_encoding='utf8',
            newline='LF',
            tidy_mark=False,
            )
        return str(tidy.parseString(html, **options))

    def _collect_urls(self, node, nest=0):
        if node.nodeType == 1 and node.nodeName == 'a':
            href = node.getAttribute('href')
            if not href.startswith('#'):
                p = urlparse(href)
                if p.scheme in ('', 'http', 'https'):
                    self.urls[node.getAttribute('href')] = True
                else:
                    # mailto, javascript
                    print p.scheme

        for i in node.childNodes:
            self._collect_urls(i, nest+1)

    def canonicalize(self):
        d = {}

        for url in self.urls:
            d[urljoin(self.seed_url, url).encode('ascii')] = True
        self.urls = d

    def crawl(self):
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.addheaders = [('User-agent', _ua)]
        try:
            html = opener.open(self.seed_url).read()
        except urllib2.HTTPError, e:
            return None
        except urllib2.URLError, e:
            print "URL Error:", self.seed_url
            return None
        if html.startswith('<?xml'):
            # Skip a leading XML declaration (this line was garbled in the
            # archive; the find() call is an assumed reconstruction).
            html = html[html.find('?>')+2:]

        html = self._convert(html)
        try:
            dom = xml.dom.minidom.parseString(html)
        except ExpatError, e:
            print "ExpatError:", html
            return None

        self._collect_urls(dom.childNodes[1])
        self.canonicalize()
        return self.urls.keys()

if __name__=='__main__':
    crawler = SingleCrawler()
    crawler.seed_url = 'http://www.python.org'
    next_urls = crawler.crawl()
    print next_urls
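
The link-collection step above can also be sketched with only the Python 3 standard library. This is a swapped-in technique, not the attached crawler's method: it uses `html.parser` in place of tidy plus minidom, and it does no fetching. The names `LinkCollector` and `collect_links` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or href.startswith("#"):
            return  # no target, or a same-page fragment
        # Keep relative and http(s) links; skip mailto:, javascript:, etc.
        if urlparse(href).scheme in ("", "http", "https"):
            self.urls.add(urljoin(self.base_url, href))


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return sorted(parser.urls)


sample = (
    '<p><a href="/about">About</a> <a href="#top">Top</a> '
    '<a href="mailto:me@example.com">Mail</a> '
    '<a href="http://example.org/x">X</a></p>'
)
print(collect_links(sample, "http://example.com/index.html"))
```

The filtering rules follow `_collect_urls` and `canonicalize` above: fragment-only links are dropped, only empty, http and https schemes are kept, and relative links are joined onto the page URL.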
   
