On 26/05/2021 22:15, Tim Chase wrote:
> If you don't decode it upon reading it in, it should still be 100MB
> because it's a stream of encoded bytes.

I usually convert them to utf8.

> You don't specify what you then do with this humongous string,

Mainly I search for regex patterns which can span multiple lines
(a sketch of what I mean is below). I could chunk it up if memory
were an issue, but a single read is just more convenient. Up until
now it hasn't been a problem and, to be honest, I don't often hit
multi-byte characters, so mostly they will be single-byte character
strings.

They are mostly research papers and such from my university days,
written on a Commodore PET and various early DOS computers with
weird, long-lost word processors. Over the years they've been
exported/converted/reimported and then re-exported several times.
A very few have embedded text or "graphics"/equations which might
contain some Unicode characters, but it's not a big issue for me
in practice.

I was more just thinking of the kinds of scenario where big strings
might become a problem if they suddenly consume 4x the storage you
expect.
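First, a minimal sketch of the multi-line search I mentioned above
(the file name and the pattern are invented, purely for illustration):

    import re

    # Read and decode the whole file in one go.
    with open("paper.txt", encoding="utf-8") as f:
        text = f.read()

    # re.DOTALL lets '.' match newlines, so the pattern can
    # span several lines of the document.
    pattern = re.compile(r"Abstract\b.*?\bIntroduction", re.DOTALL)
    for m in pattern.finditer(text):
        print(m.start(), repr(m.group()[:60]))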
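And the 4x effect itself is easy to demonstrate: CPython stores a
str at 1, 2 or 4 bytes per character depending on the widest
character present (PEP 393), so a single astral-plane character
quadruples the whole string. Roughly (exact byte counts vary by
Python version):

    import sys

    narrow = "a" * 1000              # pure ASCII: 1 byte per char
    wide = "a" * 999 + "\U0001F600"  # one emoji forces 4 bytes per char

    print(sys.getsizeof(narrow))     # about 1049 bytes
    print(sys.getsizeof(wide))       # about 4076 bytes, roughly 4x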
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos