On 01/11/2016 17:54, Philip Homburg wrote:
> I find it hard to believe that after compression, the BSON encoded
> version of the DNS data would be a lot smaller than just the
> raw DNS data.
>
> There is not a lot of redundancy in the DNS encoding.

Certainly there is not a lot of redundancy in the DNS encoding of a single packet, and there is a fair amount of poorly compressible data in the transport headers in the PCAP.

What we're exploiting, though, is the redundancy in DNS encoding across a stream of packets. We build tables of data that is often repeated in a stream (names, addresses, etc.) and store references instead of repeating the data. We can do this cheaply while writing the CBOR output, because we know where the redundancy will be located. A general-purpose compression engine will end up doing much the same, but will have to work harder to locate that specific redundancy.

Also, by writing e.g. all names into a table, we both group together data that is likely to have significant internal redundancy that we are not ourselves exploiting, and make the input to the compressor much smaller. Both of these again make the general-purpose engine's job much easier.
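
As a rough illustration of the table-and-reference idea (a minimal sketch only, not the actual writer code; the function names and the sample query names are made up for this example):

```python
# Hypothetical sketch: replace repeated values in a stream with
# references into a table of unique values. This is not the real
# file format, just the general technique.

def build_table(values):
    """Return (table, refs).

    table holds each distinct value once, in first-seen order.
    refs holds one table index per input value.
    """
    table = []
    index_of = {}
    refs = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(table)
            table.append(value)
        refs.append(index_of[value])
    return table, refs


def expand(table, refs):
    """Rebuild the original sequence from the table and the references."""
    return [table[i] for i in refs]


# Made-up query names as they might recur in a packet stream.
names = [
    "www.example.com.",
    "example.com.",
    "www.example.com.",
    "mail.example.com.",
    "www.example.com.",
    "example.com.",
]

table, refs = build_table(names)
print(table)  # each distinct name stored once
print(refs)   # small integers stand in for the repeated names
```

The table ends up as one contiguous block of similar strings, which is the "grouping" effect described above, while the per-record data shrinks to small integers.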

> So I don't think it follows from badly compressing pcaps that storing
> raw DNS would compress badly as well. Unless I missed some tricks
> why the CBOR version compresses a lot better.

We did experiment with simple CBOR and Avro encodings of individual DNS packets with minimal transport information, which I think would be comparable. Our data showed that the input to the compressor was ~10x the size of our format, and the final size after compression was still significantly greater (~25-30%) than our format after compression. We did not take compression resource measurements in that case, but given our experience I would be surprised if the compression resources required were not also significantly greater.

> The downside of CBOR, certainly as used here, is that it uses integers to
> identify fields in what JSON calls objects.
>
> So anybody who writes a local extension is likely to just continue numbering
> fields, which leads to mutually incompatible extensions.
>
> In contrast, formats like XML, JSON, but also BSON where fields have names
> make it less likely that people will pick the same identifier for
> completely different purposes.

CBOR does not have to use integers as map keys. It can use strings in exactly the same way as BSON. The reason for using integers is simply one of space, and hence file size and minimising the load on the final compressor. Keys occur in the data stream in both CBOR and BSON for every item with that key, so using strings as keys is not consistent with a goal of minimum file size.
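
To make the size difference concrete, here is a small sketch that hand-encodes the same two-entry map with integer keys and with text keys. It implements only the subset of CBOR needed here (unsigned integers, text strings and maps); the key names are invented for the example and are not taken from the format specification.

```python
import struct


def _head(major, n):
    """Encode a CBOR initial byte (plus argument bytes) for a major type."""
    if n < 24:
        return bytes([(major << 5) | n])
    for extra, fmt, limit in (
        (24, ">B", 1 << 8),
        (25, ">H", 1 << 16),
        (26, ">I", 1 << 32),
        (27, ">Q", 1 << 64),
    ):
        if n < limit:
            return bytes([(major << 5) | extra]) + struct.pack(fmt, n)
    raise ValueError("value too large for a CBOR head")


def encode(obj):
    """Encode non-negative ints, text strings and dicts only."""
    if isinstance(obj, int) and not isinstance(obj, bool) and obj >= 0:
        return _head(0, obj)              # major type 0: unsigned integer
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return _head(3, len(data)) + data  # major type 3: text string
    if isinstance(obj, dict):
        body = b"".join(encode(k) + encode(v) for k, v in obj.items())
        return _head(5, len(obj)) + body   # major type 5: map
    raise TypeError("unsupported type in this sketch: %r" % type(obj))


# Hypothetical record: two table references, keyed two different ways.
int_keyed = encode({0: 5, 1: 3})
text_keyed = encode({"query-name-index": 5, "client-address-index": 3})

print(len(int_keyed), int_keyed.hex())
print(len(text_keyed))
```

With integer keys each key costs one byte per occurrence; with text keys the full key string is repeated in every record, before any general-purpose compression is applied.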

We expect it would be possible, given the CDDL specification of the format, to use that specification to turn key values back into text for, say, a conversion to JSON, but as far as we are aware no such tool currently exists.
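
Purely as a sketch of what such a conversion might look like, assuming the integer-to-name mapping has already been extracted by some means: the mapping below is hand-written and hypothetical, nothing here parses CDDL, and a real tool would need a separate mapping for each map type rather than the single flat one used here.

```python
import json

# Hypothetical mapping from integer keys to names. In a real tool this
# would be derived from the CDDL specification; here it is hand-written.
KEY_NAMES = {0: "query-name-index", 1: "client-address-index"}


def name_keys(value, key_names):
    """Recursively replace integer map keys with text names.

    Integer keys missing from key_names are kept as decimal strings,
    since JSON object keys must be strings.
    """
    if isinstance(value, dict):
        return {
            (key_names.get(k, str(k)) if isinstance(k, int) else k):
                name_keys(v, key_names)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [name_keys(v, key_names) for v in value]
    return value


# A decoded record with integer keys, including one unknown key (9).
record = {0: 5, 1: 3, 9: [{0: 7}]}
print(json.dumps(name_keys(record, KEY_NAMES), sort_keys=True))
```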
--
Jim Hague - j...@sinodun.com          Never trust a computer you can't lift.

_______________________________________________
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop
