Hello Marc, I think you have gotten quite a few answers already, but I'll add my voice.
> I'm writting an application that saves historical state in a log
> file.

If I were in your shoes, I'd probably use the logging module rather
than saving state in my own log file. That allows the application to
send all historical state to the system log. Then, it could be
captured, recorded, analyzed and purged (or neglected) along with all
of the other logging. But, this may not be appropriate for your
setup. See also my final two questions at the bottom.

> I want to be really efficient in terms of used bytes.

It is good to want to be efficient. Don't cost your (future) self or
some other poor schlub future working or computational efficiency,
though! Somebody may one day want to extract utility from the
application's log data. So, don't make that data too hard to read.

> What I'm doing now is:
>
> 1) First use zlib.compress

... assuming you are going to write your own files, then, certainly.
If you also want better compression (quantified in a table below) at
a higher CPU cost, try bz2 or lzma (Python 3). Note that the CPU
costs of compression and decompression are not symmetric. Usually,
decompression is much cheaper.

  # compress = bz2.compress
  # compress = lzma.compress
  compress = zlib.compress

To read the logging data, the programmer, application analyst or
sysadmin will need to spend CPU to decompress. If that's rare, it's
probably a good tradeoff.

Here's my small comparison matrix showing the time it took to
transform a sample log file of roughly 33MB (in memory; no I/O costs
included in the timing data). The chart also shows the size of the
output data, in bytes and as a percentage of the raw size (to
demonstrate compression efficiency).
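A harness along the following lines can produce this kind of matrix
(a minimal sketch of the idea, not the exact script I ran; the
in-memory sample data and the codec list are illustrative):

```python
import base64
import bz2
import lzma
import time
import zlib

# Illustrative in-memory sample; substitute the bytes of a real logfile.
data = b"2015-11-12 12:34:56 INFO something happened\n" * 100000

codecs = [
    ("base64-encode", base64.b64encode),
    ("zlib-compress", zlib.compress),
    ("bz2-compress", bz2.compress),
    ("lzma-compress", lzma.compress),
]

print("%-16s %10s %8s %10s" % ("format", "bytes", "pct", "walltime"))
print("%-16s %10d %7.2f%% %9.5fs" % ("raw", len(data), 100.0, 0.0))
for name, func in codecs:
    start = time.time()
    out = func(data)
    elapsed = time.time() - start
    # Output size as a percentage of the raw input size.
    pct = 100.0 * len(out) / len(data)
    print("%-16s %10d %7.2f%% %9.5fs" % (name, len(out), pct, elapsed))
```

Timing the decompression direction works the same way: capture each
compressed result and time the matching decode/decompress function.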
format            bytes       pct      walltime
raw               34311602  100.00%    0.00000s
base64-encode     46350762  135.09%    0.43066s
zlib-compress      3585508   10.45%    0.54773s
bz2-compress       2704835    7.88%    4.15996s
lzma-compress      2243172    6.54%   15.89323s
base64-decode     34311602  100.00%    0.18933s
bz2-decompress    34311602  100.00%    0.62733s
lzma-decompress   34311602  100.00%    0.22761s
zlib-decompress   34311602  100.00%    0.07396s

The point of a sample matrix like this is to examine the tradeoff
between time (for compression and decompression) and to think about
how often you, your application or your users will decompress the
historical data. Also consider exactly how sensitive you are to bytes
on disk. (N.B. Data are from a single run of the code.) Finally,
simply make a choice of one of the compression algorithms.

> 2) And then remove all new lines using binascii.b2a_base64, so I
> have a log entry per line.

I'd also suggest that you resist the base64 temptation. As others
have pointed out, there's a benefit to keeping the logs compressed
with one of the standard compression tools (zgrep, zcat, bzgrep,
lzmagrep, xzgrep, etc.). Also, see the statistics above for
proof--base64 encoding is not compression. Rather, it usually expands
the input by about one third (see above: the base64-encoded string is
135% of the raw input). That's not compression, so don't do it. In
this case, it's expansion and obfuscation. If you don't need it,
don't choose it. In short, base64 is actively preventing you from
shrinking your storage requirement.

> but b2a_base64 is far from ideal: adds lots of bytes to the
> compressed log entry. So, I wonder if perhaps there is a better
> way to remove new lines from the zlib output? or maybe a different
> approach?

Suggestion: Don't worry about the single-byte newline terminator.
Look at a whole logfile and choose your best option.

Lastly, I have one other pair of questions for you to consider.

Question one: Will your application later read or use the logging
data?
If no, and it is intended only as a record for posterity, then I'd
suggest sending that data to the system logs (see the 'logging'
module and talk to your operational people).

If yes, then question two is: What about resilience? Suppose your
application crashes in the middle of writing a (compressed) logfile.
What does it do? Does it reopen the same file? (My personal answer is
always 'no.') Does it open a new file? When reading the older
logfiles, how does it know where to resume? Perhaps you can see my
line of thinking.

Anyway, best of luck,

-Martin

P.S. The exact compression ratio is dependent on the input. I have
rarely seen zlib at 10% or bz2 at 8%. I conclude that my sample log
data must have been more homogeneous than the data from which I
derived my mental bookmarks for textual compression efficiencies of
around 15% for zlib and 12% for bz2. I have no mental bookmark for
lzma yet, but 7% is an outrageously good compression ratio.

-- 
Martin A. Brown
http://linux-ip.net/
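P.P.S. On the resilience question: one approach is to append each
record to the logfile as a complete, self-contained gzip member.
Gzip-format files may consist of concatenated members, and both the
command-line tools and Python's gzip module read them as a single
stream, so a crash mid-write loses at most the final, truncated
record. A minimal sketch (the helper names append_record and
read_records are mine, purely illustrative):

```python
import gzip
import os
import tempfile

def append_record(path, line):
    """Append one log record as a self-contained gzip member.

    If the process dies mid-write, only the final (truncated)
    member is lost; every earlier record remains readable.
    """
    with open(path, "ab") as f:
        f.write(gzip.compress(line.encode("utf-8") + b"\n"))

def read_records(path):
    """gzip transparently reads a file of concatenated members."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read().splitlines()

path = os.path.join(tempfile.mkdtemp(), "history.log.gz")
append_record(path, "state: started")
append_record(path, "state: running")
print(read_records(path))
```

The tradeoff: each member carries its own gzip header, so per-record
compression is much worse than compressing one long stream; batching
several records per member recovers most of the efficiency.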