On 1/22/2010 9:58 PM, Chris Jones wrote:
On Fri, Jan 22, 2010 at 08:46:35PM EST, Terry Reedy wrote:
Do you mean I should just read the file one character at a time?
Whoops, my misdirection (you can .read(1), but this is s l o w).
I meant to suggest processing it a char at a time.
1. If not too big,
    for c in open(x, 'rb').read():  # I had left .read() off
    # 'b' will get bytes, though ord(c) is the same for ascii chars whether
    # byte or unicode
2. If too big for that,
    for line in open(x, 'rb'):
        for c in line:  # or I had left off this part
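Spelled out a bit more, the two options above might look like the sketch below. It assumes Python 3, where iterating over bytes yields ints directly (so no ord() call is needed); the names codes_whole and codes_by_line are made up for illustration.

```python
def codes_whole(path):
    """Read the whole file at once, then yield each byte value."""
    with open(path, 'rb') as f:
        data = f.read()
    for code in data:        # iterating bytes yields ints in Python 3
        yield code


def codes_by_line(path):
    """Yield each byte value while holding only one line in memory."""
    with open(path, 'rb') as f:
        for line in f:       # a binary file still iterates line by line
            for code in line:
                yield code
```

Both generators produce the same sequence of values; the second just avoids loading the whole file into memory.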
To only count ascii chars, as should be the case for C code,
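    achars = [0]*95  # one counter per printable ascii char, codes 32..126
    for line in open('xxx', 'rb'):
        for c in line:
            i = ord(c) - 32
            if 0 <= i < 95:  # a negative index would wrap, not raise IndexError
                achars[i] += 1
    for i, n in enumerate(achars):
        print chr(i + 32), n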
or sum subsets as desired.
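As one way of summing subsets, here is a rough sketch, assuming Python 3 and an in-memory bytes sample rather than a file. The helper names count_printable_ascii and subset_total are hypothetical, and the grouping into letters, digits and punctuation is just one possible choice.

```python
import string


def count_printable_ascii(data):
    """Return 95 counters, one per printable ASCII code 32..126."""
    achars = [0] * 95
    for code in data:            # data is bytes, so each item is an int
        i = code - 32
        if 0 <= i < 95:          # skip control chars and non-ASCII bytes
            achars[i] += 1
    return achars


def subset_total(achars, chars):
    """Sum the counters for every character in `chars`."""
    return sum(achars[ord(ch) - 32] for ch in chars)


sample = b"int main(void) { return 0; }\n"
counts = count_printable_ascii(sample)
letters = subset_total(counts, string.ascii_letters)
digits = subset_total(counts, string.digits)
punct = subset_total(counts, string.punctuation)
```

The trailing newline falls outside 32..126, so it is not counted in any of the totals.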
Thanks much for the snippet. Let me play with it and see if I can come
up with a Unicode/utf-8 version, since while I'm at it I might as well
write something a bit more general than C code.
Since utf-8 is backward-compatible with 7-bit ASCII, this shouldn't be
a problem.
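A quick way to see that compatibility, sketched under the assumption of Python 3 (the sample strings are arbitrary): pure-ASCII text encodes to identical bytes either way, while a non-ASCII character becomes a multi-byte sequence whose bytes are all 128 or above.

```python
ascii_text = "int main(void)"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

mixed = "caf\u00e9"                  # the last char, e-acute, is outside ASCII
encoded = mixed.encode("utf-8")
ascii_part = encoded[:3]             # the bytes for 'c', 'a', 'f'
extra_part = encoded[3:]             # the bytes for the non-ASCII char
high_bit_set = all(b >= 128 for b in extra_part)
```

So a byte below 128 in a utf-8 file is always a plain ASCII character, never part of a longer sequence.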
For any extended ascii, use a larger array without decoding (until print,
if need be). For unicode, pass an encoding to open and 'c in line' will
yield unicode chars. Then use *one* dict or defaultdict. I think
something like
from collections import defaultdict
d = defaultdict(int)
...
d[c] += 1 # if c is new, d[c] defaults to int() == 0
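Filled out, that idea could look like the following sketch. It assumes Python 3, where the built-in open accepts an encoding argument; the function name count_chars and the utf-8 default are illustrative choices, not anything fixed by the thread.

```python
from collections import defaultdict


def count_chars(path, encoding="utf-8"):
    """Count every decoded character in a text file."""
    d = defaultdict(int)
    with open(path, encoding=encoding) as f:
        for line in f:           # text mode: each line is a str
            for c in line:       # each c is one unicode character
                d[c] += 1        # a new key starts at int() == 0
    return d
```

The result behaves like a plain dict, so something like sorted(d.items()) gives the counts in code point order.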
Terry Jan Reedy