[ https://issues.apache.org/jira/browse/KUDU-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329676#comment-16329676 ]
Todd Lipcon commented on KUDU-2263:
-----------------------------------
bq. I don't want to have to go digging up a particular descriptor set and worry
about how the schema has evolved since this PB file was written, or whether the
descriptor set even still exists.
But isn't that the point of protobuf evolution rules? E.g. don't remove fields,
just mark them "obsolete", etc. In the rare case that we did remove a long-dead
field from the protobuf (e.g. to save space in the struct in RAM), it would still
show up in the dump as an unknown field and emit the field ID, which could then
easily be referenced against the file (or you could go back to an earlier
release's pbc dump tool if necessary).
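To illustrate why a removed field is still recoverable: the protobuf wire format tags every field with its number and wire type, so a reader can list field IDs without any descriptor at all. The following is a minimal sketch of that idea in plain Python. It is not the 'kudu pbc dump' code; the helper names and the sample bytes are made up for illustration.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def list_fields(buf):
    """Return (field_number, wire_type, raw_value) for each top-level field.

    No schema is consulted: the field number comes straight from the tag.
    """
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:    # 64-bit fixed
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited (bytes, string, sub-message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # 32-bit fixed
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, wire_type, value))
    return fields


# Hypothetical message: field 1 is a varint, field 7 is a length-delimited
# value standing in for a field that was later removed from the schema.
sample = bytes([0x08, 0x96, 0x01, 0x3A, 0x03]) + b"old"
for number, wire_type, value in list_fields(sample):
    print(number, wire_type, value)
```

A dump tool that does not know field 7 can still print its ID and raw bytes, which is what makes the lookup against an older schema possible.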
bq. How much of a cost savings would we realize for "small but numerous" PB
files (like cmeta and maybe tablet superblocks)? Would it be enough to get us
under 4k per file?
Not 100% sure without implementing the whole thing (which is non-trivial code),
but by locally pruning out most of the obviously-unrelated stuff and running
the 'protoc --descriptor_set_out' tool, it seems to reduce the descriptor from
12kb to 1kb. So, it's likely cmeta would fit under 4kb in most cases with just
that change... still seems a little wasteful to duplicate this data everywhere
though.
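For reference, the pruning I did by hand amounts to a reachability walk over type references: keep only the types transitively referenced by the PB type stored in the file. A rough sketch of that walk follows. The reference graph and every type name in it other than SchemaPB are hypothetical placeholders, not taken from the real .proto files, and a real implementation would read the references from the descriptors rather than from a dict.

```python
from collections import deque


def reachable_types(root, references):
    """Return the set of type names transitively referenced from root."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dep in references.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


# Hypothetical graph: RootPB is the type stored in the file. SchemaPB lives
# in the same descriptor set but is never referenced from RootPB.
references = {
    "RootPB": ["ConfigPB"],
    "ConfigPB": ["PeerPB"],
    "PeerPB": [],
    "SchemaPB": ["ColumnPB"],
    "ColumnPB": [],
}

needed = reachable_types("RootPB", references)
pruned = sorted(set(references) - needed)
print(sorted(needed))  # types to keep in the header
print(pruned)          # types that could be dropped
```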
> Consider removing PB descriptors from PBC header
> ------------------------------------------------
>
> Key: KUDU-2263
> URL: https://issues.apache.org/jira/browse/KUDU-2263
> Project: Kudu
> Issue Type: Improvement
> Components: util
> Affects Versions: 1.7.0
> Reporter: Todd Lipcon
> Priority: Major
>
> Looking at a cmeta file on disk, it seems the vast majority of the bytes are
> in the supplemental header. We currently serialize the entire descriptor set
> of the referenced file and its dependencies. This means that in each cmeta
> file, we end up serializing even things like the definition of SchemaPB –
> unnecessary to serialize the type at hand and quite large.
>
> At a minimum we can prune the descriptors serialized to only include those
> that are transitively referenced by the PB type in the file. I think we
> should also consider doing away with this information entirely and instead
> allow 'kudu pbc dump' to take a descriptor set as external input – it's easy
> enough to generate a descriptor set from any kudu version source tree using
> the protoc command line.
> One potential major improvement if we can get these files down to <4kb is
> that we could atomically rewrite them in a single disk IO using O_DIRECT
> rather than doing a rewrite-rename-fsync dance.
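For readers unfamiliar with the rewrite-rename-fsync dance mentioned in the description, here is a generic sketch of the pattern in Python. It is an illustration under POSIX assumptions (fsync on a directory descriptor), not Kudu's actual implementation, and the function name is made up. It does not demonstrate the O_DIRECT alternative.

```python
import os
import tempfile


def atomic_rewrite(path, data):
    """Replace path with data: write a temp file, fsync, rename, fsync the dir."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # make the new contents durable
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "cmeta")
    atomic_rewrite(target, b"version 1")
    atomic_rewrite(target, b"version 2")
    with open(target, "rb") as f:
        print(f.read())
```

Each rewrite here costs several syscalls and at least two syncs, which is the overhead a single small aligned write would avoid.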