On Wed, Feb 5, 2014 at 10:00 PM, Steven D'Aprano
<steve+comp.lang.pyt...@pearwood.info> wrote:
>> where stopWords.txt is a file of size 4KB
>
> My guess is that if you split a 4K file into words, then put the words
> into a list, you'll probably end up with 6-8K in memory.
I'd guess rather more; Python strings have a fair bit of fixed
overhead, so with a whole lot of small strings, it gets more costly.

>>> import sys
>>> sys.version
'3.4.0b2 (v3.4.0b2:ba32913eb13e, Jan  5 2014, 16:23:43) [MSC v.1600 32 bit (Intel)]'
>>> sys.getsizeof("asdf")
29

"Stop words" tend to be short words rather than long ones, so I'd
expect an average of 2-3 letters per word. Assuming they're separated
by spaces or newlines, that means there'll be roughly a thousand of
them in a 4KB file, for about 25K of overhead. A bit less if the
words are longer, but still quite a bit. (Byte strings have slightly
less overhead, 17 bytes apiece, but it still adds up.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
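
[Editor's sketch, not part of the original post: to put rough numbers on
the estimate above on your own machine, something like the following
would do. It assumes the stopWords.txt from the thread is in the current
directory, and it only counts the string objects and the list itself,
ignoring any interning or sharing.]

import sys

# Read the stop-word file and split it into words, as in the thread.
with open("stopWords.txt") as f:
    words = f.read().split()

# sys.getsizeof() reports each string object's header plus its character
# data, so short ASCII strings are dominated by the fixed overhead.
string_bytes = sum(sys.getsizeof(w) for w in words)

# The list object itself also stores one pointer per element.
list_bytes = sys.getsizeof(words)

print(len(words), "words")
print("string objects:", string_bytes, "bytes")
print("list object:   ", list_bytes, "bytes")
print("total:         ", string_bytes + list_bytes, "bytes")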