alamb commented on issue #13816: URL: https://github.com/apache/datafusion/issues/13816#issuecomment-2659471604
At a high level, I think this ticket has 2 parts:
1. Figure out what is contributing to the code size increase
2. Then perhaps figure out how to make it better

I think the most valuable (and hardest) part is 1 (figuring out what is contributing to the size).

To do so I recommend doing an ["Ablative Study"](https://en.wikipedia.org/wiki/Ablation_(artificial_intelligence))

> An ablation study aims to determine the contribution of a component to an AI system by removing the component, and then analyzing the resultant performance of the system.[[2]](https://en.wikipedia.org/wiki/Ablation_(artificial_intelligence)#cite_note-2)

This is a fancy way of saying "remove parts of the system and see how much impact it makes on binary size"

# Suggested things to try

I suggest initially simply building with different datafusion crate features and seeing how much extra code each contributes to the binary size.

A follow on idea would be to comment out some of the features that require lots of generic code, such as
* Remove all implementations of `GroupValues`: https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/group_values/mod.rs
* Remove all implementations of `GroupsAccumulator`: https://github.com/apache/datafusion/blob/469f18be1c594b07e4b235f3404419792ed3c24f/datafusion/expr-common/src/groups_accumulator.rs#L108

## Example of Ablation for the `parquet` feature

For example, to test the impact of the `parquet` feature, I measured the size of the binary with and without parquet support

```shell
cargo build --release -p datafusion-cli
# get size of datafusion-cli in kb
du -k target/release/datafusion-cli
```

Here is what I got

| type | size in MB | size in KB |
|--------|--------|--------|
| default | 58 | 58440 |
| without `parquet` feature | 53 | 54248 |

So I conclude that the parquet feature adds approximately 4 to 5 MB to the binary size (4192 KB going by the `du -k` figures).

To remove parquet support I hacked out the dependency on parquet.
Since this was just to test the impact, as long as it compiles that is good enough. No need to be pretty. Here is the PR:
- https://github.com/apache/datafusion/pull/14666
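
The size comparison in the `parquet` example can be scripted so that each ablation is measured the same way. This is only a minimal sketch: it assumes the binary from each build has been copied to its own path, and the function names (`size_kb`, `delta_kb`, `compare_builds`) are made up for illustration, not part of DataFusion or cargo.

```shell
#!/bin/sh
# Sketch: compare the on-disk size of two builds of the same binary.
# All function names here are hypothetical helpers.

# Print the size of a file in KB, as reported by `du -k`.
size_kb() {
    du -k "$1" | cut -f1
}

# Print the difference between two sizes in KB (baseline minus ablated).
delta_kb() {
    echo $(( $1 - $2 ))
}

# Usage: compare_builds BASELINE_BINARY ABLATED_BINARY
compare_builds() {
    base=$(size_kb "$1")
    ablated=$(size_kb "$2")
    echo "baseline: ${base} KB"
    echo "ablated:  ${ablated} KB"
    echo "delta:    $(delta_kb "$base" "$ablated") KB"
}
```

For instance, after copying `target/release/datafusion-cli` aside following each build (say to the hypothetical paths `/tmp/cli.default` and `/tmp/cli.no-parquet`), `compare_builds /tmp/cli.default /tmp/cli.no-parquet` prints both sizes and their difference. Applied to the numbers in the table above, `delta_kb 58440 54248` gives 4192.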