Hi Xin,

Thanks for responding.

There were several ideas in my mail, some of them contradictory, and at times 
not grouped properly.  I hope it was still intelligible enough.

I was mostly concerned about the (future) change of default value, and still 
am.  But I'm also surprised by the premises on which some of your choices 
(including the default value) are based.  To me, they look generally weak, and 
some do not even seem to make sense.  This is (also) what I would like to 
discuss.  I'm probably far from having the most stringent or intensive use of 
log files in this community, and I'm not an expert in SSD wear-leveling either.  
So maybe it's just me, but then I'd ask for the minimum education needed to 
understand your reasoning and learn from it.

> I am open to removing '-c'.

An alternative I developed later in my initial mail (it was not apparent at the 
point you responded to) is to have '-c' (on the command line) override 
<compress> (in some configuration file), and I think this is what you've done 
(and what you replied to Mike).  I'm fine with it, since I have the feeling it's 
the general rule for most utilities where the same behavior can be requested 
both on the command line and from configuration files (in other words, it 
respects POLA).  My main request here is that, if you keep '-c', you document 
it, as well as its relation to <compress>.  I'm saying this because, in some 
other message, you raised the possibility of deliberately not documenting it, 
which I think can't be justified here.
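
To make that precedence concrete, here is a minimal Python sketch of the rule 
(command line first, then configuration file, then built-in default).  It is 
only an illustration, not newsyslog(8)'s actual code: the function name, its 
parameters and the 'bzip2' built-in default used in the calls are assumptions 
made for the example.

```python
from typing import Optional


def effective_compression(cli_value: Optional[str],
                          config_value: Optional[str],
                          builtin_default: str) -> str:
    """Pick the compression method (hypothetical helper, for illustration).

    cli_value       -- value given with '-c' on the command line, or None
    config_value    -- value of <compress> in a configuration file, or None
    builtin_default -- what the program falls back to when neither is set
    """
    if cli_value is not None:
        # The command line always wins over the configuration file.
        return cli_value
    if config_value is not None:
        return config_value
    return builtin_default


print(effective_compression("zstd", "none", "bzip2"))  # '-c' given
print(effective_compression(None, "none", "bzip2"))    # only <compress> set
print(effective_compression(None, None, "bzip2"))      # nothing set
```

Documenting '-c' then mostly amounts to stating the first branch of this 
function in the manual page.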
 
> Could you please clarify what you mean by "make it enable compression" --
> did you mean that we mark all log files to be compressible?  (It's probably
> not a good idea as some "log" files may be binary and not really
> compressible).

Yes, I meant exactly that.  In this alternative, you simply ignore both the 
compression letters and their absence, and compress everything the same way.  I 
understand your point about binary files, but I would be surprised if even logs 
in a binary format weren't significantly compressible (albeit less than text) 
in most cases, and even if they aren't, it would only be a very minor annoyance 
(files are not going to get noticeably longer; for other (non-)annoyances, see 
below).  Moreover, all log files in base are text files, and that is also the 
case for all ports/applications I use, so I find it strange not to cater to 
what is probably the vast majority of use cases (or do you disagree with that?).

Doing so would also have the benefit that application writers no longer have 
to wonder whether their logs should be compressed or not.  What would that 
decision be based on?  Basing it on format (text or binary) is most probably 
flawed, as I've just said above.  I don't think it can be based on content 
either, which I suspect will always be compressible for log files (there will 
be redundancy, like timestamps, identifiers, etc.).  And I see this more as an 
administrative decision (e.g., do I have plenty of disk space or not?), which 
is independent of the application.  So shifting that decision to the 
administrator once and for all makes sense.  If you don't like this way of 
making it happen, I'm suggesting another one below.

> Changing the meaning of all four legacy compression type letters to "file
> is compressible" is part of the intention.  The goal is to discourage using
> them as a way to specify a compression type, in favor of using the
> administrator configured value.

As I've just explained, I see a lot of value in having the administrator 
decide on a global behavior.  I will most likely use this functionality.

I had been hesitating between preserving the current meaning of the compression 
letters, for POLA in general, and having the configuration directive override 
them.  That's why I mentioned an alternative where the override would have to 
be explicit, through an additional, different directive.  This idea could be 
reused like this: Have '<compress>' affect only files without compression 
letters, and have '<compress_override>' affect only those with them, and 
perhaps also have the specified value of one of them used as the default for 
the other (e.g., if '<compress_override>' is set, it also affects by default 
files without compression letters).  I'm mentioning this for completeness in 
case it fulfills the needs of others.  I probably won't use this refinement 
personally.  And, concerning POLA, there are different levels of it.  
Setting aside the change in default value for a moment, being able to override 
compression letters with a directive in the configuration file is a bit 
surprising, but after more pondering I no longer consider it terribly 
annoying, provided it is sufficiently publicized.
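
The two-directive refinement can be sketched as follows in Python.  This is a 
hypothetical illustration only: '<compress_override>' is just the name proposed 
above, the function is invented for the example, and two points are my own 
assumptions: a file with neither a letter nor any directive stays 
uncompressed, and only '<compress_override>' serves as a fallback for the other 
directive (the direction given in the example above).

```python
from typing import Optional

# Legacy per-entry flag letters and the compression method each one names.
LEGACY_LETTERS = {"J": "bzip2", "X": "xz", "Y": "zstd", "Z": "gzip"}


def method_for_entry(letter: Optional[str],
                     compress: Optional[str] = None,
                     compress_override: Optional[str] = None) -> str:
    """Resolve the method for one log file entry (hypothetical helper).

    letter            -- legacy compression letter of the entry, or None
    compress          -- value of <compress>, or None if not set
    compress_override -- value of <compress_override>, or None if not set
    """
    if letter is not None:
        # Entries with a letter keep its meaning unless explicitly overridden.
        if compress_override is not None:
            return compress_override
        return LEGACY_LETTERS[letter]
    # Entries without a letter follow <compress>, falling back to
    # <compress_override> when only that one is set.
    if compress is not None:
        return compress
    if compress_override is not None:
        return compress_override
    return "none"  # assumption: no letter and no directive, no compression


print(method_for_entry("J"))                            # letter kept
print(method_for_entry("J", compress="zstd"))           # letter still kept
print(method_for_entry("J", compress_override="zstd"))  # explicit override
print(method_for_entry(None, compress="xz"))            # unlettered entry
```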

> That's said, 'none' is a reasonable default in many ways as explained
> before (it makes grep'ing easier, compression is not really that helpful in
> the modern world because hard drives are larger than the 90's and it
> reduces the times data gets rewritten to SSDs and avoids hourly CPU load
> bursts for busy systems).

This is where my main disagreement is currently.  Most arguments have been 
addressed in my previous mails, so for each I'll do a small wrap-up and add a 
few new thoughts.

"it makes grep'ing easier": Our zgrep(1) works on any compressed file, and even 
on uncompressed ones, so it is a drop-in replacement for grep(1).  I fail to see 
anything hard about using it.  Scripts already using grep(1) don't even need to 
be modified, thanks to some PATH or symlink tweaking.  We could even go 
so far as having grep(1) itself behave like zgrep(1), which could be a great 
usability win for newcomers as well.
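
As an illustration of how little is needed to search compressed and 
uncompressed files alike, here is a minimal Python sketch that picks a 
decompressor from the file's first bytes.  It is not how zgrep(1) is 
implemented; the function names are made up for the example, and zstd is left 
out only because I'm assuming a Python without a standard module for it.

```python
import bz2
import gzip
import lzma
import re
import sys

# Leading "magic" bytes of each format and the matching opener.
MAGIC = [
    (b"\x1f\x8b", gzip.open),      # gzip
    (b"BZh", bz2.open),            # bzip2
    (b"\xfd7zXZ\x00", lzma.open),  # xz
]


def open_log(path):
    """Open a log file as text, decompressing transparently if needed."""
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, opener in MAGIC:
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")


def grep_log(pattern, path):
    """Return the lines of 'path' matching the regular expression."""
    rx = re.compile(pattern)
    with open_log(path) as f:
        return [line.rstrip("\n") for line in f if rx.search(line)]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: script.py PATTERN FILE
    for line in grep_log(sys.argv[1], sys.argv[2]):
        print(line)
```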

"compression is not really that helpful in the modern world because hard drives 
are larger than the 90's": I certainly don't think so.  I manipulate GBs of 
(text) log files.  On build logs, I typically see ratios of 1/10, which is 
huge.  The space I'm saving is not only used to save more logs, but also for 
unrelated purposes, and prevents me from having to buy or dedicate more hard 
disks to this use.  And I'm not even talking about embedded systems, which are 
much more constrained, or virtual machines.

"it reduces the times data gets rewritten to SSDs": Surely, but does it matter? 
I don't think so.  A single rewrite of log data in most use cases shouldn't 
have any visible effect on wear-leveling, except for SSDs where this is the 
only and continuous job, but then you can have your equivalent to 'syslog' 
compressing on the fly, or can use ZFS with compression.  If you really are 
reaching the disk I/O limits on your machine and can't afford the extra 
bandwidth for reading and compressing, shouldn't you be sending the logs over 
the network to another machine doing exactly that processing?  And is this use 
case common enough to warrant making non-compression newsyslog(8)'s new 
default?  I don't think so.

"avoids hourly CPU load bursts for busy systems": That can, and should, be 
solved by configuration.  You're free to choose a higher frequency, to avoid 
busy hours if there are less loaded ones, and to rotate logs on a smaller size 
limit, all of which will mitigate the problem to the point of almost 
non-existence.  And if the "almost" is still significant to your workload, then 
see the previous point.  Again, is this common or important enough?  For now, I 
doubt it.  And there is an advantage to having application-controlled 
compression: At least you can control exactly when the bursts occur, which you 
can't with ZFS (which also has to compress blocks).

> 'bzip2' could be a good second best default (because for most
> configurations it's how the log files are compressed with today's
> defaults), but if the administrator has already configured their systems to
> use a different method, this would break their configuration anyways.

Yes for 'bzip2' as a good default, for POLA.  If the administrator has already 
configured their system, then the best default would be 'legacy'.  That's why I 
was hesitating about always keeping the original meaning of the compression 
letters.

> There are other benefits of not compressing rotated logs.  For busy
> systems, the hourly newsyslog run would process larger logs and cause CPU
> workload bursts.
> 
> And when logs are compressed, the data is read back and compressed data is
> rewritten to disk / SSDs, causing additional wear of the flash storage, and
> all that comes with no significant benefit for modern hardware.
> 
> (I don't think it's common to have log files indexed after rotation; a more
> common use case would be to use [u]grep to look up for a certain pattern).

I think I've already addressed most of these points in the previous mail and 
above.

I've read and, I think, understood your points.  So please save us time and 
refrain from repeating them.  This is not going to change my current opinion 
that they are all weak at best.

On the other hand, please, after a careful reading of my objections, respond 
with comments, critiques or rebuttals as you see fit.  I may learn things in 
the process, and you might as well.
 
> Yes, and that's not a big concern.  Achieving the maximum compression ratio
> is probably never the goal for most scenarios (not limited to logs, but
> also other places) where compression is used, and one always has to balance
> between the cost and benefit.

We are talking about logs, or at least use cases for newsyslog(8).  A frequent 
use case for it (it's certainly the primary one for me) is long-term storage of 
old logs that are infrequently read/processed.  Achieving a high compression 
ratio is important here, to save space both in absolute terms *and* relative 
to the expected (in a statistical sense) utility of these logs (i.e., low).

> If the person is distributing a release image to many thousands of users
> over the Internet, it would make a lot of sense to try the best compression
> for an 5% reduction of size because that adds up to the bandwidth cost and
> optimizes the experience for users, but it doesn't make as much sense to
> save, let's say a few MBs of disk space at the expense of spending a few
> more minutes every hour, the added "bursts" of slower response time for a
> server, and that's usually undesirable for production.

Really, I don't see where these figures can come from.  Here is a very quick 
example on a typical (for me) build log file of about 70MB:

  Method            | Space saved (size ratio) | Elapsed time (s)
  ------------------+--------------------------+------------------
  gzip (default)    | 95.3% (1/21.2)           | 0.426
  xz (default)      | 96.9% (1/32.6)           | 5.619
  zstd (default)    | 95.6% (1/22.5)           | 0.088

I could run many more of these to convince you in a statistically more serious 
manner.  But already, I think you can agree that "a few MBs of disk space at 
the expense of spending a few more minutes" is way, way off, even if you're 
still using xz(1).
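
For anyone wanting to check this on their own logs, here is a minimal Python 
sketch of such a measurement.  It is not the procedure used for the figures 
above (those came from the command-line tools); the function name is made up, 
the levels are chosen to mirror the tools' defaults, and zstd is omitted only 
because I'm assuming a Python without a standard module for it.  Timings from 
the library bindings will differ somewhat from those of the utilities.

```python
import bz2
import gzip
import lzma
import sys
import time

# Compressors with levels mirroring the command-line defaults:
# gzip -6, bzip2 -9, xz -6.
COMPRESSORS = {
    "gzip": lambda data: gzip.compress(data, compresslevel=6),
    "bzip2": lambda data: bz2.compress(data, 9),
    "xz": lambda data: lzma.compress(data, preset=6),
}


def measure(data: bytes) -> dict:
    """Return {method: (space saved in %, size factor, seconds)}."""
    results = {}
    for name, compress in COMPRESSORS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        saved = 100.0 * (1.0 - len(packed) / len(data))
        factor = len(data) / len(packed)
        results[name] = (saved, factor, elapsed)
    return results


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: script.py LOGFILE
    with open(sys.argv[1], "rb") as f:
        content = f.read()
    for name, (saved, factor, elapsed) in measure(content).items():
        print(f"{name:6} | {saved:5.1f}% (1/{factor:.1f}) | {elapsed:.3f}")
```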

Thanks and regards.

-- 
Olivier Certner
