alamb commented on issue #13816:
URL: https://github.com/apache/datafusion/issues/13816#issuecomment-2659471604

   At a high level, I think this ticket has 2 parts:
   1. Figure out what is contributing to code size increase
   2. Then perhaps figure out how to make it better
   
   I think the most valuable (and hardest) part is 1 (figuring out what is contributing to the size increase)
   
   To do so I recommend doing an ["Ablative 
Study"](https://en.wikipedia.org/wiki/Ablation_(artificial_intelligence))
   
   > An ablation study aims to determine the contribution of a component to an 
AI system by removing the component, and then analyzing the resultant 
performance of the 
system.[[2]](https://en.wikipedia.org/wiki/Ablation_(artificial_intelligence)#cite_note-2)
   
   This is a fancy way of saying "remove parts of the system and see how much 
impact it makes on binary size"
   
   # Suggested things to try
   I suggest initially building with different datafusion crate features 
enabled and seeing how much extra code each contributes to the binary size.
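   
   A minimal sketch of such a measurement loop, in case it helps. The `PKG` and `BIN` defaults and any extra cargo flags passed to `measure` are assumptions for illustration; whether a given package actually builds with a particular feature set needs to be checked first.
   
   ```shell
   # Sketch only: PKG, BIN, and any flags passed to `measure` are assumptions to adapt.
   PKG="${PKG:-datafusion-cli}"
   BIN="${BIN:-target/release/datafusion-cli}"
   
   # Print the size of a file in KB, as reported by `du -k`.
   size_kb() {
     du -k "$1" | cut -f1
   }
   
   # Build $PKG in release mode with any extra cargo arguments, then print
   # "<label><TAB><size in KB>" for the resulting binary.
   measure() {
     label="$1"
     shift
     cargo build --release -p "$PKG" "$@" >&2 || return 1
     printf '%s\t%s\n' "$label" "$(size_kb "$BIN")"
   }
   ```
   
   Calling `measure default` and then, for example, `measure minimal --no-default-features` would print one line per configuration that can be pasted into a table. Build output goes to stderr so that only the results land on stdout.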
   
   A follow-on idea would be to comment out some of the features that require 
lots of generic code. For example:
   
   * Remove all implementations of `GroupValues`: 
https://github.com/apache/datafusion/blob/main/datafusion/physical-plan/src/aggregates/group_values/mod.rs
   * Remove all implementations of `GroupsAccumulator`: 
https://github.com/apache/datafusion/blob/469f18be1c594b07e4b235f3404419792ed3c24f/datafusion/expr-common/src/groups_accumulator.rs#L108
   
   
   ## Example of Ablation for the `parquet` feature
   For example, to test the impact of the `parquet` feature, I tested the size 
of the binary with and without parquet support
   
   ```shell
   cargo build --release -p datafusion-cli
   # get size of datafusion-cli in kb
   du -k target/release/datafusion-cli
   ```
   
   Here is what I got
   
   | type | size in MB | size in KB |
   |--------|--------|--------|
   | default | 58 | 58440 |
   | without `parquet` feature | 53 | 54248 |
   
   So I conclude that the parquet feature adds approximately 4 to 5 MB to the 
binary size (the `du -k` numbers differ by 4192 KB)
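   
   For reference, the difference can be computed directly from the two `du -k` numbers in the table. This is only a small helper sketch; `delta_kb` is a made-up name, and it assumes 1 MB = 1024 KB.
   
   ```shell
   # delta_kb BEFORE AFTER: print the size difference in KB and in MB (1 MB = 1024 KB).
   # `delta_kb` is a hypothetical helper name, not part of any existing tool.
   delta_kb() {
     awk -v a="$1" -v b="$2" 'BEGIN { d = a - b; printf "%d KB (%.1f MB)\n", d, d / 1024 }'
   }
   
   # Numbers taken from the table above (default vs. without parquet).
   delta_kb 58440 54248   # prints: 4192 KB (4.1 MB)
   ```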
   
   To remove parquet support I hacked out the dependency on parquet. Since this 
was just to test the impact, as long as it compiles that is good enough; no need 
for it to be pretty. Here is the PR:
   - https://github.com/apache/datafusion/pull/14666
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

