freemandealer opened a new issue, #17902:
URL: https://github.com/apache/doris/issues/17902

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
   ORC/Parquet files can be very heavily compressed: a compression ratio as low as 3% means a 3 GB file expands to 3 GB / 0.03 = 100 GB of raw data during loading. Users may wonder why loading such ‘small’ files takes hours, so we need to explain the actual data sizes to them.
   
   ### Solution
   
   In a loading job, there are three kinds of data size (see the sketch after this list):
   
   1) the original input size that we read. It may be compressed and compactly encoded.
   2) the in-memory size during processing, where each record is decompressed and decoded so that Doris can work with it.
   3) the size we write to disk, which is again compressed and encoded.
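   
   For illustration, here is a minimal sketch of how a load job could track the three sizes as per-job byte counters. All names here (`LoadSizeStats`, the accumulation points) are hypothetical and are not existing Doris identifiers:
   
   ```cpp
   #include <atomic>
   #include <cstdint>
   #include <cstdio>
   
   // Hypothetical per-job counters for the three data sizes above;
   // none of these names exist in Doris, they are illustrative only.
   struct LoadSizeStats {
       std::atomic<int64_t> read_bytes{0};      // 1) compressed/encoded input we read
       std::atomic<int64_t> in_memory_bytes{0}; // 2) decompressed/decoded size while processing
       std::atomic<int64_t> written_bytes{0};   // 3) compressed/encoded size written to disk
   };
   
   int main() {
       LoadSizeStats stats;
       // The example from the description: 3 GB of input expanding to
       // ~100 GB in memory, then recompressed on disk (made-up number).
       stats.read_bytes += 3LL << 30;
       stats.in_memory_bytes += 100LL << 30;
       stats.written_bytes += 4LL << 30;
       std::printf("read=%lld processed=%lld written=%lld bytes\n",
                   (long long)stats.read_bytes.load(),
                   (long long)stats.in_memory_bytes.load(),
                   (long long)stats.written_bytes.load());
       return 0;
   }
   ```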
   
   We should report the three sizes individually to the user, both during and after the load, by:
   
   1) enhancing the `show load`/`show stream load` statements (a sketch follows this list)
   2) improving the profile facilities
   3) any other place that can present the sizes to the user intuitively
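   
   As a rough sketch of point 1, the three counters could be serialized into the job detail string that `show load` displays. The key names below (`ReadBytes`, `ProcessedBytes`, `WrittenBytes`, `ExpansionRatio`) are made up for illustration, not an existing Doris schema:
   
   ```cpp
   #include <cstdint>
   #include <cstdio>
   #include <string>
   
   // Hypothetical rendering of the three sizes (plus a derived expansion
   // ratio, which explains why a "small" file can load slowly) for a
   // SHOW LOAD-style detail column. Key names are illustrative only.
   std::string format_load_sizes(int64_t read_bytes, int64_t processed_bytes,
                                 int64_t written_bytes) {
       char buf[192];
       double ratio = read_bytes > 0
                              ? static_cast<double>(processed_bytes) / read_bytes
                              : 0.0;
       std::snprintf(buf, sizeof(buf),
                     "{\"ReadBytes\": %lld, \"ProcessedBytes\": %lld, "
                     "\"WrittenBytes\": %lld, \"ExpansionRatio\": %.1f}",
                     (long long)read_bytes, (long long)processed_bytes,
                     (long long)written_bytes, ratio);
       return buf;
   }
   
   int main() {
       // 3 GB input, 100 GB decompressed, 4 GB (made up) back on disk.
       std::printf("%s\n",
                   format_load_sizes(3LL << 30, 100LL << 30, 4LL << 30).c_str());
       return 0;
   }
   ```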
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

