[ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:
------------------------------------
    Epic Link: IMPALA-13915

> Compress encodedFileDescriptors inside the same partition
> ---------------------------------------------------------
>
>                 Key: IMPALA-13177
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13177
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>              Labels: catalog-2024
>         Attachments: Selection_124.png
>
>
> File names under a table usually share substrings, e.g. query id, job id, 
> task id, etc. We can compress them to save memory. In particular, when a 
> table has many small files, the memory footprint of the metadata cache is 
> dominated by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
> on S3 takes 605MB, 80% of which is spent in encodedFileDescriptors. Each 
> encodedFileDescriptor is a byte array that takes 160B. Code:
> [https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
> Files of that table are created by Spark jobs. Here are some file names 
> inside the same partition:
> {noformat}
> part-00000-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00001-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00002-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00003-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00004-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00005-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00006-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00007-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00008-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00009-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00010-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00011-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00012-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00013-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00014-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> part-00015-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
> {noformat}
> By compressing the encodedFileDescriptors inside the same partition, we 
> should be able to save significant memory in this case. Compressing 
> all of them across the whole table might be even better, but it would hurt 
> performance when coordinators load specific partitions from catalogd.
> We could consider doing this only for partitions whose number of files 
> exceeds a threshold (e.g. 10).
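
To illustrate the per-partition idea described above, here is a minimal
sketch, not the actual Impala change. It assumes the descriptors of one
partition can be concatenated with a length prefix and deflated as a single
block using java.util.zip. The class name PartitionDescriptorCompression, the
constant MIN_FILES_TO_COMPRESS and the length-prefixed layout are all
hypothetical, and plain file names stand in for the real serialized
descriptors, so the sizes it prints say nothing about the savings on real
metadata.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper for illustration only; not part of Impala.
final class PartitionDescriptorCompression {
  // Assumed threshold, taken from the "e.g. 10" suggestion in the issue.
  static final int MIN_FILES_TO_COMPRESS = 10;

  // Compress only partitions whose file count exceeds the threshold.
  static boolean shouldCompress(int numFiles) {
    return numFiles > MIN_FILES_TO_COMPRESS;
  }

  // Concatenates the descriptors (4-byte length prefix each) and deflates
  // them as one block so that substrings shared across files are deduplicated.
  static byte[] compress(List<byte[]> descriptors) {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    for (byte[] d : descriptors) {
      raw.write(ByteBuffer.allocate(4).putInt(d.length).array(), 0, 4);
      raw.write(d, 0, d.length);
    }
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(raw.toByteArray());
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      int n = deflater.deflate(buf);
      out.write(buf, 0, n);
    }
    deflater.end();
    return out.toByteArray();
  }

  // Inflates the block and splits it back into the original descriptors.
  static List<byte[]> decompress(byte[] compressed) {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        if (n == 0 && !inflater.finished()
            && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IllegalStateException("Truncated or unsupported input");
        }
        raw.write(buf, 0, n);
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException("Corrupt compressed block", e);
    } finally {
      inflater.end();
    }
    ByteBuffer bb = ByteBuffer.wrap(raw.toByteArray());
    List<byte[]> result = new ArrayList<>();
    while (bb.hasRemaining()) {
      byte[] d = new byte[bb.getInt()];
      bb.get(d);
      result.add(d);
    }
    return result;
  }

  public static void main(String[] args) {
    // File names shaped like the Spark output listed in the issue.
    List<byte[]> names = new ArrayList<>();
    for (int i = 0; i < 16; i++) {
      String name = String.format("part-%05d-14015d2b-b534-4747-8c42-c83a7af0f006"
          + "-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000", i);
      names.add(name.getBytes(StandardCharsets.UTF_8));
    }
    byte[] packed = compress(names);
    int rawBytes = names.stream().mapToInt(n -> n.length).sum();
    System.out.println("raw=" + rawBytes + "B compressed=" + packed.length + "B");
  }
}
```

One trade-off of this layout is that reading a single descriptor requires
inflating the whole block of its partition. Keeping the block scoped to one
partition limits that cost, which matches the concern above about
coordinators loading specific partitions from catalogd.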



--
This message was sent by Atlassian Jira
(v8.20.10#820010)