[ python-Bugs-1711800 ] SequenceMatcher bug with insert/delete block after "replace"

2007-05-03 Thread SourceForge.net
Bugs item #1711800, was opened at 2007-05-03 03:24
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1711800&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.6
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Christian Hammond (chipx86)
Assigned to: Nobody/Anonymous (nobody)
Summary: SequenceMatcher bug with insert/delete block after "replace"

Initial Comment:
difflib.SequenceMatcher fails to distinguish between a "replace" block and an 
"insert" or "delete" block when the "insert/delete" immediately follows a 
"replace". It will lump both changes together as one big "replace" block.

This happens due to how get_opcodes() works. get_opcodes() loops through the 
matching blocks, grouping them into tags and ranges. However, if a block of 
text is changed and then new text is immediately added, it can't see this. All 
it knows is that the next matching block is after the added text.

As an example, consider these strings:

"ABC"

"ABCD
EFG."

Any diffing program will show that the first line was replaced and the second 
was inserted. SequenceMatcher, however, just shows that there was one replace, 
and includes both lines in the range.

I've attached a testcase that reproduces this for both replace>insert and 
replace>delete blocks.
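The behavior is easy to see directly (a minimal sketch; the reporter's actual testcase attachment is not included in this digest):

```python
import difflib

# Compare the two texts line by line, as a diff program would.
a = ["ABC"]
b = ["ABCD", "EFG."]

sm = difflib.SequenceMatcher(None, a, b)
# One would expect a 'replace' for line 0 and a separate 'insert' for
# line 1, but get_opcodes() lumps both new lines into a single opcode:
print(sm.get_opcodes())  # [('replace', 0, 1, 0, 2)]
```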

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1711800&group_id=5470
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[ python-Bugs-1701389 ] utf-16 codec problems with multiple file append

2007-05-03 Thread SourceForge.net
Bugs item #1701389, was opened at 2007-04-16 18:05
Message generated for change (Comment added) made by iceberg4ever
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1701389&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Closed
Resolution: Remind
Priority: 5
Private: No
Submitted By: Iceberg Luo (iceberg4ever)
Assigned to: M.-A. Lemburg (lemburg)
Summary: utf-16 codec problems with multiple file append

Initial Comment:
This bug is similar but not exactly the same as bug215974.  
(http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail)

In my test, even multiple write() calls within a single open()~close()
lifespan do not cause the multiple-BOM phenomenon mentioned in bug 215974.
Maybe bug 215974 was somehow fixed during the past 7 years, although Lemburg
classified it as WontFix.

However, if a file is appended to more than once via
"codecs.open('file.txt', 'a', 'utf16')", the multiple BOMs appear.

At the same time, the claim in bug 215974 that "(extra unnecessary) BOM marks
are removed from the input stream by the Python UTF-16 codec" still does not
hold today, on Python 2.4.4 and Python 2.5.1c1 on Windows XP.

Iceberg
--

PS: Did not find the "File Upload" checkbox mentioned in this web page, so I 
think I'd better paste the code right here...

import codecs, os

filename = "test.utf-16"
if os.path.exists(filename): os.unlink(filename)  # reset

def myOpen():
  return codecs.open(filename, "a", 'UTF-16')
def readThemBack():
  return list( codecs.open(filename, "r", 'UTF-16') )
def clumsyPatch(raw): # you can read it after your first run of this program
  for line in raw:
    if line[0] in (u'\ufffe', u'\ufeff'): # get rid of the BOMs
      yield line[1:]
    else:
      yield line

fout = myOpen()
fout.write(u"ab\n") # to simplify the problem, I only use ASCII chars here
fout.write(u"cd\n")
fout.close()
print readThemBack()
assert readThemBack() == [ u'ab\n', u'cd\n' ]
assert os.stat(filename).st_size == 14  # Only one BOM in the file

fout = myOpen()
fout.write(u"ef\n")
fout.write(u"gh\n")
fout.close()
print readThemBack()
#print list( clumsyPatch( readThemBack() ) )  # later you can enable this fix
assert readThemBack() == [ u'ab\n', u'cd\n', u'ef\n', u'gh\n' ] # fails here
assert os.stat(filename).st_size == 26  # not to mention here: multi BOM appears


--

>Comment By: Iceberg Luo (iceberg4ever)
Date: 2007-05-03 22:08

Message:
Logged In: YES 
user_id=1770538
Originator: YES

The long-disputed ZWNBSP is deprecated nowadays (the
http://www.unicode.org/unicode/faq/utf_bom.html#24 suggests a "U+2060 WORD
JOINER" instead of ZWNBSP). However, I can understand that backwards
compatibility is always a valid concern, and that's why StreamReader seems
reluctant to change.

In practice, a ZWNBSP inside a file is rarely intended (please also refer
to the topic "Q: What should I do with U+FEFF in the middle of a file?" at
the same URL). IMHO, it is very likely caused by a multi-append file
operation or the like. At the least, the asymmetric "what you write is NOT
what you get/read" effect between "codecs.open(filename, 'a', 'UTF-16')"
and "codecs.open(filename, 'r', 'UTF-16')" is not elegant.

Aiming at that asymmetry, I finally came up with a wrapper function around
codecs.open(), which solves (or you may say "bypasses") the problem well
in my case. I'll post the code as an attachment.

BTW, even the official documentation of Python 2.4, chapter "7.3.2.1
Built-in Codecs", mentions that
   PyObject* PyUnicode_DecodeUTF16(const char *s, int size, const char
*errors, int *byteorder)
"switches according to all byte order marks (BOM) it finds in the input
data. BOMs are not copied into the resulting Unicode string". I don't know
whether it is the BOM-less decoder we have talked about for a long time.
//shrug

Hope the information above can serve as a recipe for those who encounter
the same problem. That's it. Thanks for your patience.

Best regards,
Iceberg
File Added: _codecs.py
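The attached _codecs.py is not included in this digest. A minimal sketch of such a wrapper (hypothetical names; it assumes any existing content was written by the plain 'utf-16' codec on the same machine, i.e. native byte order with a single leading BOM):

```python
import codecs
import os
import sys
import tempfile

def append_open(filename):
    # If the file already has content, append with the endian-specific
    # codec so no fresh BOM is written; otherwise let the BOM-writing
    # 'utf-16' codec create the file.
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        variant = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
        return codecs.open(filename, 'a', variant)
    return codecs.open(filename, 'a', 'utf-16')

filename = os.path.join(tempfile.mkdtemp(), 'test.utf-16')
for chunk in (u'ab\ncd\n', u'ef\ngh\n'):
    fout = append_open(filename)
    fout.write(chunk)
    fout.close()

lines = list(codecs.open(filename, 'r', 'utf-16'))
print(lines)  # four clean lines, no stray ZWNBSP, only one BOM on disk
```

With the plain 'utf-16' codec used for both append calls instead, the second open writes a second BOM, which the reader then hands back as a ZWNBSP at the start of the third line.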

--

Comment By: Walter Dörwald (doerwalter)
Date: 2007-04-23 18:56

Message:
Logged In: YES 
user_id=89016
Originator: NO

But BOMs *may* appear in normal content: then their meaning is that of
ZERO WIDTH NO-BREAK SPACE (see
http://docs.python.org/lib/encodings-overview.html for more info).


--

Comment By: Iceberg Luo (iceberg4ever)
Date: 2007-04-20 11:39

Message:
Logged In: YES 
user_id=1770538
Originator: YES

If such a bug would be fixed, either StreamWriter or StreamReader should
do something.

I can understand Doerwalter that it is somewhat

--

>Comment By: Walter Dörwald (doerwalter)
Date: 2007-05-03 17:03

Message:
Logged In: YES 
user_id=89016
Originator: NO

>BTW, even the official document of Python2.4, chapter "7.3.2.1 Built-in
> Codecs", mentions that the:
>   PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char
> *errors, int *byteorder)
> can "switches according to all byte order marks (BOM) it finds in the
> input data. BOMs are not copied into the resulting Unicode string".  I
> don't know whether it is the BOM-less decoder we talked for long time.

This seems to be wrong. Looking at the source code
(Objects/unicodeobjects.c) reveals that only the first BOM is skipped.


--

>Comment By: Walter Dörwald (doerwalter)
Date: 2007-05-03 19:12

Message:
Logged In: YES 
user_id=89016
Originator: NO

OK, I've updated the documentation (r55094, r55095)

--

[ python-Bugs-1712236 ] __getslice__ changes integer arguments

2007-05-03 Thread SourceForge.net
Bugs item #1712236, was opened at 2007-05-03 21:20
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712236&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Interpreter Core
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Imri Goldberg (lorgandon)
Assigned to: Nobody/Anonymous (nobody)
Summary: __getslice__ changes integer arguments

Initial Comment:
When using slicing on a sequence object with a user-defined __getslice__ 
method, the arguments passed to __getslice__ are changed.
This does not happen when __getslice__ is called directly.
Attached is some code that demonstrates the problem.

I checked it on various versions, including my
"Python 2.5.1 (r251:54863, May  2 2007, 16:56:35)", on my Ubuntu machine.

Although __getslice__ is deprecated, the function is still in use, and 
a fix would be useful.
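For contrast, the slice-object protocol (__getitem__, the only slicing hook left in current Python) passes the caller's arguments through untouched, which is what the reporter expected __getslice__ to do. A quick probe (sketch, not the report's attachment):

```python
# Python 2's __getslice__(i, j) received indices the interpreter had
# already adjusted: negative bounds had len(self) added and missing
# bounds became 0 or sys.maxint. The slice-object protocol, by
# contrast, hands the arguments over unchanged:

class Probe(object):
    def __len__(self):
        return 5

    def __getitem__(self, index):
        return index  # report exactly what the interpreter passed in

p = Probe()
print(p[-2:])  # slice(-2, None, None) -- the -2 is not rewritten to 3
```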

--

>Comment By: Imri Goldberg (lorgandon)
Date: 2007-05-03 21:23

Message:
Logged In: YES 
user_id=1715564
Originator: YES

This also seems to be the cause of bug "[ 908441 ] default index for
__getslice__ is not sys.maxint"

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712236&group_id=5470
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[ python-Bugs-1712419 ] Cannot use dict with unicode keys as keyword arguments

2007-05-03 Thread SourceForge.net
Bugs item #1712419, was opened at 2007-05-04 00:49
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712419&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Viktor Ferenczi (complex)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Cannot use dict with unicode keys as keyword arguments

Initial Comment:
Unicode strings cannot be used as keys in dictionaries passed as keyword 
arguments to a function. For example:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def fn(**kw):
...     print repr(kw)
...
>>> fn(**{u'x':1})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: fn() keywords must be strings
>>>

Unicode strings should be converted to str automatically, using the site-
specific default encoding or some similar solution.

This bug causes problems when decoding dictionaries from data formats such as 
XML or JSON: modules such as simplejson usually return unicode strings. 
Manually encoding every key from unicode to str costs performance if Python 
cannot do it automatically.
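A common Python 2 workaround (a sketch, not from any attachment) was to re-key the dict with byte strings before unpacking:

```python
def fn(**kw):
    return kw

data = {u'x': 1, u'y': 2}  # e.g. what a JSON decoder hands back
# Python 2 rejected fn(**data) with "keywords must be strings";
# re-keying with str() was the usual workaround. (In Python 3 every
# str is unicode and fn(**data) works directly.)
result = fn(**dict((str(k), v) for k, v in data.items()))
print(result)
```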

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712419&group_id=5470
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[ python-Feature Requests-1712419 ] Cannot use dict with unicode keys as keyword arguments

2007-05-03 Thread SourceForge.net
Feature Requests item #1712419, was opened at 2007-05-03 22:49
Message generated for change (Comment added) made by gbrandl
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1712419&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
>Category: Unicode
>Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Viktor Ferenczi (complex)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Cannot use dict with unicode keys as keyword arguments

--

>Comment By: Georg Brandl (gbrandl)
Date: 2007-05-04 04:10

Message:
Logged In: YES 
user_id=849994
Originator: NO

In any case, this is a feature request.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1712419&group_id=5470
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[ python-Bugs-1712522 ] urllib.quote throws exception on Unicode URL

2007-05-03 Thread SourceForge.net
Bugs item #1712522, was opened at 2007-05-04 06:11
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712522&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Nagle (nagle)
Assigned to: Nobody/Anonymous (nobody)
Summary: urllib.quote throws exception on Unicode URL

Initial Comment:
The code in urllib.quote fails on Unicode input, when
called by robotparser with a Unicode URL.

Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 415, in run
    pagetree = self.httpfetch() # fetch page
  File "./sitetruth/InfoSitePage.py", line 368, in httpfetch
    if not self.owner().checkrobotaccess(self.requestedurl) : # if access disallowed by robots.txt file
  File "./sitetruth/InfoSiteContent.py", line 446, in checkrobotaccess
    return(self.robotcheck.can_fetch(config.kuseragent, url)) # return can fetch
  File "/usr/local/lib/python2.5/robotparser.py", line 159, in can_fetch
    url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
  File "/usr/local/lib/python2.5/urllib.py", line 1197, in quote
    res = map(safe_map.__getitem__, s)
KeyError: u'\xe2'

That bit of code needs some attention:
- It still assumes ASCII goes up to 255, which hasn't been true in Python for 
a while now.
- The initialization may not be thread-safe; a table is being initialized on 
first use.

"robotparser" was trying to check if a URL with a Unicode character in it was 
allowed.  Note the "KeyError: u'\xe2'" 
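The usual workaround was to encode the unicode string to UTF-8 bytes before quoting, which is also what Python 3's urllib.parse.quote later came to do by default. A sketch using the Python 3 names (hypothetical path, not the reporter's URL):

```python
from urllib.parse import quote

url_path = u'/caf\xe9/page'  # path containing a non-ASCII character
# Python 2's urllib.quote indexed its ASCII-only safe_map with each
# character and raised a KeyError on input like this; encoding to
# UTF-8 first sidesteps that, and Python 3 does it automatically.
print(quote(url_path.encode('utf-8')))  # /caf%C3%A9/page
```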

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1712522&group_id=5470
___
Python-bugs-list mailing list 
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com