On 01/11/2016 17:54, Philip Homburg wrote:
> I find it hard to believe that after compression, the BSON encoded
> version of the DNS data would be a lot smaller than just the
> raw DNS data.
> There is not a lot of redundancy in the DNS encoding.

Certainly there is not a lot of redundancy in the DNS encoding of a
single packet, and there is a fair amount of poorly compressible data
in the transport headers in the PCAP.

What we're exploiting, though, is the redundancy in DNS encoding across
a stream of packets. We're building tables of data that is often
repeated in a stream (names, addresses, etc.) and storing references
instead of repeating the data. We can do this cheaply while writing the
CBOR output, because we know where the redundancy will be located. A
general purpose compression engine will end up doing much the same, but
will have to work harder to locate that specific redundancy. Also, by
writing e.g. all names into a table, we're both grouping data that is
likely to have significant internal redundancy that we're not
exploiting and making the input to the compressor much smaller, both
of which again make the general engine's job much easier.
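
To illustrate the table idea, here is a minimal Python sketch. It is
not the actual writer: the (name, address) record layout, the function
names and the newline-separated serialisation are all assumptions made
for the example. It only shows repeated values being replaced by
indexes into shared tables before a general compressor sees the data.

```python
import zlib

def build_tables(records):
    """Replace repeated names and addresses with indexes into shared tables."""
    names, addrs = {}, {}  # value -> index, in first-seen order
    rows = []
    for qname, client in records:
        n = names.setdefault(qname, len(names))
        a = addrs.setdefault(client, len(addrs))
        rows.append((n, a))
    return list(names), list(addrs), rows

def restore(name_table, addr_table, rows):
    """Rebuild the original records from the tables and index rows."""
    return [(name_table[n], addr_table[a]) for n, a in rows]

# Made-up traffic: a few names and addresses repeated many times.
records = [
    ("www.example.com.", "192.0.2.1"),
    ("mail.example.com.", "192.0.2.7"),
    ("www.example.com.", "192.0.2.1"),
    ("www.example.com.", "192.0.2.7"),
] * 250

name_table, addr_table, rows = build_tables(records)

# Hypothetical serialisations, used only to compare sizes.
flat = "\n".join(f"{q} {c}" for q, c in records).encode()
tabled = "\n".join(
    name_table + addr_table + [f"{n} {a}" for n, a in rows]
).encode()

print("compressor input:", len(flat), "vs", len(tabled))
print("after zlib:", len(zlib.compress(flat)), "vs", len(zlib.compress(tabled)))
```

The sizes printed depend entirely on the input; the sketch says nothing
about real traffic.
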

> So I don't think it follows from badly compressing pcaps that storing
> raw DNS would compress badly as well, unless I missed some trick that
> makes the CBOR version compress a lot better.

We did experiment with simple CBOR and Avro encodings of individual DNS
packets with minimal transport information, which I think would be
comparable. Our data showed that the size of the input to the
compressor was ~10x the size of our format, and the final size after
compression was still significantly greater (~25-30%) than our format
after compression. We did not take compression resource measurements in
that case, but given our experience I would be surprised if the
compression resources required were not also significantly greater.

> The downside of CBOR, certainly as used here, is that it uses integers
> to identify fields in what JSON calls objects.
> So anybody who writes a local extension is likely to just continue
> numbering fields, which leads to mutually incompatible extensions.
> In contrast, formats like XML and JSON, but also BSON, where fields
> have names make it less likely that people will pick the same
> identifier for completely different purposes.

CBOR does not have to use integers as key values. It can use strings in
exactly the same way as BSON. The reason for using integers is simply
one of space, and hence file size and minimising load on the final
compressor. Key values occur in the data stream in both CBOR and BSON
for every item with that key, so using strings as key values is not
consistent with a goal of minimum file size.
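
The size difference can be seen with a small hand-rolled encoder. This
is a sketch covering only unsigned integers, text strings and maps as
defined in RFC 7049; the key names and values are hypothetical and are
not taken from the format under discussion.

```python
import struct

def _head(major, n):
    """Initial byte(s) for a CBOR major type and unsigned argument."""
    if n < 24:
        return bytes([major << 5 | n])
    if n < 0x100:
        return bytes([major << 5 | 24, n])
    if n < 0x10000:
        return bytes([major << 5 | 25]) + struct.pack(">H", n)
    if n < 0x100000000:
        return bytes([major << 5 | 26]) + struct.pack(">I", n)
    return bytes([major << 5 | 27]) + struct.pack(">Q", n)

def encode(obj):
    """Encode a non-negative int, a str, or a dict of those as CBOR."""
    if isinstance(obj, int) and obj >= 0:
        return _head(0, obj)                      # major type 0
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return _head(3, len(raw)) + raw           # major type 3
    if isinstance(obj, dict):
        body = b"".join(encode(k) + encode(v) for k, v in obj.items())
        return _head(5, len(obj)) + body          # major type 5
    raise TypeError(f"unsupported type: {type(obj).__name__}")

# The same (hypothetical) item keyed by strings and by integers.
by_name = {"query-name-index": 3, "client-address-index": 1, "query-type": 28}
by_int = {0: 3, 1: 1, 2: 28}

print(len(encode(by_name)), "bytes with string keys")
print(len(encode(by_int)), "bytes with integer keys")
```

Every item in the stream carries its keys, so the per-item difference
is multiplied by the number of items written.
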

We expect it would be possible, given the CDDL specification of the
format, to use that specification to turn key values back into text
for, say, a conversion to JSON, but no such tool currently exists, as
far as we are aware.
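
A sketch of what such a conversion might look like, assuming the
key-to-name table has already been extracted from the CDDL by some
means. The table below is hand-written and hypothetical, and a real
format would most likely need one table per map type rather than the
single flat table used here.

```python
import json

# Hypothetical mapping; in practice this would come from the CDDL.
KEY_NAMES = {0: "query-name-index", 1: "client-address-index", 2: "query-type"}

def name_keys(obj, key_names):
    """Recursively replace integer map keys with their text names."""
    if isinstance(obj, dict):
        return {
            key_names.get(k, str(k)): name_keys(v, key_names)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [name_keys(v, key_names) for v in obj]
    return obj

# A decoded item with integer keys, as a CBOR decoder might return it.
item = {0: 3, 1: 1, 2: 28, 9: [{0: 4}]}

named = name_keys(item, KEY_NAMES)
print(json.dumps(named, indent=2))
```

Keys missing from the table (9 above) fall back to their decimal
string, since JSON object keys must be strings.
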
--
Jim Hague - j...@sinodun.com Never trust a computer you can't lift.
_______________________________________________
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop