[issue40416] Calling TextIOWrapper.tell() in the middle of reading a gb2312-encoded file causes UnicodeDecodeError

2020-04-27 Thread Rob Malouf


New submission from Rob Malouf:

Calling TextIOWrapper.tell() while reading the attached gb2312-encoded file 
like this:

with open('udhr-gb2312.txt', encoding='GB2312') as f:
    while True:
        line = f.readline()
        t = f.tell()
        if not line:
            break

gives this result:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    t = f.tell()
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xb5 in position 0: illegal multibyte sequence

The file seems to be well-formed and can be read without any problem.  It's 
only the call to tell() that raises an issue.
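A possible workaround, sketched here under assumptions rather than taken from the report: read the file in binary mode, where tell() returns a plain byte offset, and decode each line separately. This relies on GB2312 multibyte sequences never containing the newline byte, so splitting on b'\n' before decoding is safe. The helper name read_lines_with_offsets and the sample text are hypothetical.

```python
import os
import tempfile


def read_lines_with_offsets(path, encoding="gb2312"):
    """Yield (decoded_line, byte_offset_after_line) pairs.

    The file is opened in binary mode, so tell() is a plain byte offset
    and never has to reconstruct decoder state.
    """
    with open(path, "rb") as f:
        while True:
            raw = f.readline()
            if not raw:
                break
            yield raw.decode(encoding), f.tell()


# Small stand-in for the attached file (hypothetical sample content).
sample = "世界人权宣言\n序言\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample.encode("gb2312"))
    demo_path = tmp.name

results = list(read_lines_with_offsets(demo_path))
os.remove(demo_path)

for line, offset in results:
    print(offset, repr(line))
```

The offsets are byte positions, so they can be passed to seek() on a binary file object but are not interchangeable with the opaque values that TextIOWrapper.tell() returns.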

--
components: IO, Unicode
files: udhr-gb2312.txt
messages: 367494
nosy: ezio.melotti, rmalouf, vstinner
priority: normal
severity: normal
status: open
title: Calling TextIOWrapper.tell() in the middle of reading a gb2312-encoded 
file causes UnicodeDecodeError
type: crash
versions: Python 3.7
Added file: https://bugs.python.org/file49096/udhr-gb2312.txt

___
Python tracker 
<https://bugs.python.org/issue40416>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40416] Calling TextIOWrapper.tell() in the middle of reading a gb2312-encoded file causes UnicodeDecodeError

2020-05-02 Thread Rob Malouf


Rob Malouf added the comment:

Same results on macOS 10.15.4 (both the system Python and the Intel/Anaconda 
version) and on CentOS 7.8.

Here's the output with print(...):

13
71
72
392
393
399
536
537
761
762
879
880
933
934
1146
1147
1254
1255
1359
1360
1760
1761
1772
1895
1897
1906
2105
2107
2338
2339
2348
2398
2399
2408
2509
2510
2519
2612
2614
2622
2682
2684
2693
2898
2900
2909
3050
3052
3061
3113
3115
3124
3295
3297
3309
3445
3632
3644
3814
3816
3828
3882
3967
3979
4048
4184
4196
4226
4308
4320
4492
4559
4641
4653
4728
4770
4782
4999
5001
5013
5202
5204
5216
5270
5318
5333
5411
5465
5672
5687
5953
5954
5969
6082
6137
6307
6373
6388
6494
6496
6511
6786
6913
6928
7148
7371
7447
7462
7569
7704
7719
7847
7848
7863
7972
8238
8342
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print(f.tell())
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xb5 in position 0: illegal multibyte sequence

--




[issue28340] [py2] TextIOWrapper.tell extremely slow

2017-05-22 Thread Rob Malouf

Changes by Rob Malouf:


--
pull_requests: +1832

___
Python tracker 
<http://bugs.python.org/issue28340>
___



[issue25535] collections.Counter methods return Counter objects

2015-11-02 Thread Rob Malouf

New submission from Rob Malouf:

Several collections.Counter methods return Counter objects, which leads to 
wrong or at least confusing behavior when Counter is subclassed.  For example, 
nltk.FreqDist is a subclass of Counter:

>>> x = nltk.FreqDist(['a','a','b','b','b'])
>>> y = nltk.FreqDist(['b','b','b','b','b'])
>>> z = x + y
>>> z.__class__
<class 'collections.Counter'>

This applies to __add__(), __sub__(), __or__(), __and__(), __pos__(), and 
__neg__().  
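Until the behavior changes, a subclass could re-wrap the result itself. The following is a minimal sketch for __add__() only, assuming the subclass constructor accepts a mapping the way Counter does; TypedCounter is a hypothetical name and is not part of nltk or the standard library.

```python
from collections import Counter


class TypedCounter(Counter):
    """Hypothetical Counter subclass whose + operator keeps the subclass type."""

    def __add__(self, other):
        result = super().__add__(other)
        # Counter.__add__ returns NotImplemented for non-Counter operands;
        # pass that through so Python can try the reflected operation.
        if result is NotImplemented:
            return result
        # Re-wrap the plain Counter in the caller's own class.
        return type(self)(result)


x = TypedCounter("aabbb")
y = TypedCounter("bbbbb")
z = x + y
print(type(z).__name__, dict(z))
```

The other operators listed above would need the same treatment, which is the repetitive part a fix in Counter itself would avoid.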

In contrast, the copy() method does (what I think is) the right thing:

>>> x.copy().__class__
<class 'nltk.probability.FreqDist'>

--
components: Library (Lib)
messages: 253930
nosy: rmalouf
priority: normal
severity: normal
status: open
title: collections.Counter methods return Counter objects
type: behavior
versions: Python 3.5

___
Python tracker 
<http://bugs.python.org/issue25535>
___



[issue28340] TextIOWrapper.tell extremely slow

2016-10-02 Thread Rob Malouf

New submission from Rob Malouf:

io.TextIOWrapper.tell() is unusably slow in Python 2.7.  This same problem was 
introduced in Python 3 and fixed in Python 3.3 (see Issue # 4).  Any chance 
of getting the fix backported into the Python 2.7 library? It would make it 
much easier to modernize Unicode handling in libraries that have to support 
both 2 and 3 using the same codebase.
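For anyone wanting to check the slowdown on their own interpreter, here is a rough timing sketch. It is not from the report: the helper name time_readline_loop, the payload, and the line count are arbitrary assumptions, and it only compares a readline() loop with and without a tell() call on an in-memory stream.

```python
import io
from timeit import default_timer


def time_readline_loop(data, call_tell):
    """Return seconds spent reading every line, optionally calling tell() too."""
    stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8")
    start = default_timer()
    while True:
        line = stream.readline()
        if call_tell:
            stream.tell()
        if not line:
            break
    return default_timer() - start


# Arbitrary test payload with some non-ASCII characters on each line.
payload = (u"some text \u00e9\u00e8 on a line\n" * 20000).encode("utf-8")

without_tell = time_readline_loop(payload, call_tell=False)
with_tell = time_readline_loop(payload, call_tell=True)
print("readline only:   %.4f s" % without_tell)
print("readline + tell: %.4f s" % with_tell)
```

The absolute numbers depend on the machine and Python version; the ratio between the two timings is the figure of interest.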

--
components: IO
messages: 277898
nosy: rmalouf
priority: normal
severity: normal
status: open
title: TextIOWrapper.tell extremely slow
type: performance
versions: Python 2.7
