On Apr 3, 12:37 pm, Neil Hodgson <nhodg...@iinet.net.au> wrote:
>    Reran the programs, taking a bit more care with the encoding of the
> file. This had no effect on the speeds. There are only a small number
> of paths that don't fit into ASCII:
>
> ASCII   1076101
> Latin1      218
> BMP         113
> Astral        0
>
> # encoding:utf-8
> import codecs, os, time
> from os.path import join, getsize
> with codecs.open("filelist.txt", "r", "utf-8") as f:
>     paths = f.read().split("\n")
> bucket = [0, 0, 0, 0]
> for p in paths:
>     if not p:   # skip the empty entry left by a trailing newline
>         continue
>     b = 0
>     maxChar = max([ord(ch) for ch in p])
>     if maxChar >= 65536:
>         b = 3
>     elif maxChar >= 256:
>         b = 2
>     elif maxChar >= 128:
>         b = 1
>     bucket[b] = bucket[b] + 1
> print("ASCII", bucket[0])
> print("Latin1", bucket[1])
> print("BMP", bucket[2])
> print("Astral", bucket[3])
>
> Neil
Can you please try one more experiment, Neil? Knock all non-ASCII
strings (paths) out of your dataset and try again.

[It should take little more than turning your code above into a filter:

    if b == 0: print
    if b > 0:  ignore
]
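Something along these lines should do, I think. A rough sketch only,
reusing your reading code; the output name "filelist_ascii.txt" is just
an illustrative choice:

    # encoding:utf-8
    import codecs

    with codecs.open("filelist.txt", "r", "utf-8") as f:
        paths = f.read().split("\n")

    # keep only paths whose every character is below 128 (your b == 0 bucket)
    ascii_paths = [p for p in paths
                   if p and max(ord(ch) for ch in p) < 128]

    with codecs.open("filelist_ascii.txt", "w", "utf-8") as out:
        out.write("\n".join(ascii_paths))

    print("kept", len(ascii_paths), "ASCII-only paths")

Then rerun the timings against filelist_ascii.txt and we can see whether
the handful of non-ASCII paths was masking anything.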