[EMAIL PROTECTED] writes:

>If your compression algorithm is tuned for normal ASCII text, then <UC letter>
><lc letter> may be considered more frequent than <UC letter><UC letter> for
>all combinations of values of <UC letter>, and thus pairs of uppercased
>letters may result in longer bit streams than pairs of lowercase letters or
>one uppercase letter followed by one lowercase letter.  In practice I have
>some trouble believing that this matters, but I don't even play a data
>compression expert on the net, so my lack of belief doesn't mean it doesn't
>make sense.

I can at least play a data compression expert, and can say that this will make
bugger-all difference in practice.  The second time an LZ compressor sees a
repeated substring (upper or lower case) it'll compress it to either an
(offset, length) pair or a dictionary index, an operation for which case has no
effect.  An order-1 or higher statistical compressor would be affected by UC+lc
vs UC+UC, but (a) the tags used in HTML will be caught by the LZ compressor
without ever being passed down to the statistical compression layer, and (b) no
widely-used compressor uses anything more than an order-0 statistical
compressor (running as a backend to an LZ compressor), which means it doesn't
care about what follows what.  Judging by the original comment I
suspect that whoever wrote it wasn't very familiar with data compression
technology (I also wonder where it would be applicable, given that compression
of HTML occurs only in very specialised situations).
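
As a rough illustration of the LZ point, here is a minimal Python sketch.  The
longest_match() helper, the upper_tags() helper and the sample markup are all
made up for this example; the helper is a naive LZ77-style search, not the code
of any real compressor.  It shows that the (offset, length) pair found for a
repeated tag sequence is the same whichever case the tags are in, and then uses
the standard zlib module (an LZ77 + Huffman compressor) to compare compressed
sizes for the same page with lowercase and uppercase tags.

```python
import re
import zlib


def longest_match(data: bytes, pos: int, window: int = 32768):
    """Naive LZ77-style search: return (offset, length) of the longest
    earlier match for data[pos:].  Illustrative only, not optimised."""
    best_offset, best_length = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_length:
            best_offset, best_length = pos - cand, length
    return best_offset, best_length


def upper_tags(html: bytes) -> bytes:
    """Uppercase tag names only, leaving the text content alone."""
    return re.sub(rb"</?[a-z]+", lambda m: m.group(0).upper(), html)


# Two similar fragments; the second one repeats the tags of the first.
frag_lower = (b"<table><tr><td>x</td></tr></table> "
              b"<table><tr><td>y</td></tr></table>")
frag_upper = upper_tags(frag_lower)

# Position of the second <table>, identical in both versions.
pos = frag_lower.index(b"<table", 1)
match_lower = longest_match(frag_lower, pos)
match_upper = longest_match(frag_upper, pos)
print("lowercase tags:", match_lower)
print("uppercase tags:", match_upper)

# A larger made-up page, compressed with zlib in both variants.
rows = b"".join(b"<tr><td>%d</td><td>%d</td></tr>\n" % (i, i * i)
                for i in range(200))
page_lower = b"<table>\n" + rows + b"</table>\n"
page_upper = upper_tags(page_lower)
z_lower = zlib.compress(page_lower, 9)
z_upper = zlib.compress(page_upper, 9)
print("raw size:", len(page_lower))
print("compressed, lowercase tags:", len(z_lower))
print("compressed, uppercase tags:", len(z_upper))
```

The exact compressed sizes depend on the zlib build, so none are quoted here;
the thing to look at is how close the two figures are compared to the raw
size.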

Peter.


