[jira] [Commented] (KUDU-2263) Consider removing PB descriptors from PBC header

Todd Lipcon (JIRA) Wed, 17 Jan 2018 15:23:14 -0800

    [ https://issues.apache.org/jira/browse/KUDU-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329676#comment-16329676 ]


Todd Lipcon commented on KUDU-2263:
-----------------------------------

bq. I don't want to have to go digging up a particular descriptor set and worry 
about how the schema has evolved since this PB file was written, or whether the 
descriptor set even still exists.

But isn't that the point of the protobuf evolution rules? E.g. don't remove 
fields, just mark them "obsolete", and so on. In the rare case that we did 
remove a long-dead field from the protobuf (e.g. to save space in the struct 
in RAM), it would still show up in the dump as an unknown field with its field 
ID, which could then easily be cross-referenced against the .proto file (or 
you could go back to an earlier release's pbc dump tool if necessary).
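
To illustrate why a removed field still shows up with its field ID: the 
protobuf wire format tags every value with its field number, so a reader can 
list fields without any schema. The following is a minimal sketch of that idea 
in plain Python, not the actual 'kudu pbc dump' code. The helper names 
(read_varint, list_fields, dump) and the field name "term" are made up for 
illustration, and group wire types are not handled.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def list_fields(buf):
    """Return (field_number, wire_type, value) for each top-level field."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:    # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:    # length-delimited (string, bytes, submessage)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:    # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields.append((field_number, wire_type, value))
    return fields


def dump(buf, known_names):
    """Print known fields by name and unknown fields by their numeric ID."""
    lines = []
    for number, _wire_type, value in list_fields(buf):
        label = known_names.get(number, str(number))
        lines.append("%s: %r" % (label, value))
    return lines


# Field 1 (varint 150) is still in the hypothetical schema; field 7 (the
# string "old") stands in for a field that was later removed.
message = bytes.fromhex("0896013a036f6c64")
for line in dump(message, {1: "term"}):
    print(line)
```

The removed field is listed under its number, 7, which is enough to look it up 
in an older copy of the .proto file.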

bq. How much of a cost savings would we realize for "small but numerous" PB 
files (like cmeta and maybe tablet superblocks)? Would it be enough to get us 
under 4k per file?

Not 100% sure without implementing the whole thing (which is non-trivial code), 
but locally pruning out most of the obviously unrelated stuff and running 
'protoc --descriptor_set_out' seems to reduce the descriptor from 12KB to 1KB. 
So it's likely cmeta would fit under 4KB in most cases with just that 
change... it still seems a little wasteful to duplicate this data everywhere, 
though.
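
The pruning itself amounts to a reachability walk over type references 
starting from the PB type stored in the file. Below is a rough sketch of that 
walk in plain Python, under the assumption that the references between message 
types have already been extracted into a dict. The function names and the 
small reference graph are illustrative only and are not taken from the Kudu 
source.

```python
from collections import deque


def reachable_types(root, references):
    """Return the set of type names transitively referenced from root."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dep in references.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def prune(all_types, root, references):
    """Keep only the types that the root type actually needs."""
    keep = reachable_types(root, references)
    return [name for name in all_types if name in keep]


# Illustrative graph: which message types each type refers to.
references = {
    "ConsensusMetadataPB": ["RaftConfigPB"],
    "RaftConfigPB": ["RaftPeerPB"],
    "RaftPeerPB": ["HostPortPB"],
    "SchemaPB": ["ColumnSchemaPB"],
}
all_types = [
    "ConsensusMetadataPB", "RaftConfigPB", "RaftPeerPB",
    "HostPortPB", "SchemaPB", "ColumnSchemaPB",
]
print(prune(all_types, "ConsensusMetadataPB", references))
```

In this toy graph SchemaPB and ColumnSchemaPB are dropped because nothing 
reachable from the root type refers to them.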

> Consider removing PB descriptors from PBC header
> ------------------------------------------------
>
>                 Key: KUDU-2263
>                 URL: https://issues.apache.org/jira/browse/KUDU-2263
>             Project: Kudu
>          Issue Type: Improvement
>          Components: util
>    Affects Versions: 1.7.0
>            Reporter: Todd Lipcon
>            Priority: Major
>
> Looking at a cmeta file on disk, it seems the vast majority of the bytes are 
> in the supplemental header. We currently serialize the entire descriptor set 
> of the referenced file and its dependencies. This means that in each cmeta 
> file, we end up serializing even things like the definition of SchemaPB, 
> which is unnecessary for the type at hand and quite large.
>  
> At a minimum we can prune the descriptors serialized to only include those 
> that are transitively referenced by the PB type in the file. I think we 
> should also consider doing away with this information entirely and instead 
> allow 'kudu pbc dump' to take a descriptor set as external input – it's easy 
> enough to generate a descriptor set from any kudu version source tree using 
> the protoc command line.
> One potential major improvement if we can get these files down to <4kb is 
> that we could atomically rewrite them in a single disk IO using O_DIRECT 
> rather than doing a rewrite-rename-fsync dance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
