[
https://issues.apache.org/jira/browse/HIVE-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
He Yongqiang updated HIVE-1797:
-------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
Committed! Thanks Liyin!
> Compress the hashtable dump file before putting it into the distributed cache
> -----------------------------------------------------------------------------
>
> Key: HIVE-1797
> URL: https://issues.apache.org/jira/browse/HIVE-1797
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 0.7.0
> Reporter: Liyin Tang
> Assignee: Liyin Tang
> Attachments: hive-1797.patch, hive-1797_3.patch
>
>
> Clearly, the size of the small table is the performance bottleneck for map join,
> because it determines both the memory usage and the size of the dumped
> hashtable file.
> That means there are two bounds on map join performance:
> 1) The memory usage of the local task and the mapred task
> 2) The dumped hashtable file size for the distributed cache
> The test case in the last email spends most of its execution time on
> initialization because it hits the second bound.
> Since we have already bounded the memory usage, one thing we can do is to
> make sure performance never hits the second bound before it hits the first
> one.
> Assuming the heap size is 1.6G and the small table file is 15M
> compressed (75M uncompressed),
> the local task can hold roughly 1.5M unique rows in memory.
> The dumped file will then be roughly 150M, which is too large to put into the
> distributed cache.
>
> From experiments, we can conclude that when the dumped file size is
> smaller than 30M,
> the distributed cache works well and all the mappers are initialized in
> a short time (less than 30 secs).
> One easy implementation is to compress the hashtable file.
> Using gzip to compress the hashtable file, the file size drops
> from 100M to 13M.
> After several tests, all the mappers were initialized in less than 23 secs.
> But this solution adds some decompression overhead to each mapper, and
> mappers on the same machine will do duplicate decompression work.
> Maybe in the future, we can have the distributed cache support this directly.
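
The idea described above can be sketched as follows: serialize the hashtable through a gzip stream when dumping it to disk, and wrap the input side in a gzip stream when each mapper reloads it. This is a minimal standalone sketch using `java.util.zip`, not the actual Hive MapJoin code; the class and method names here are hypothetical.

```java
import java.io.*;
import java.util.HashMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class HashTableDumpSketch {

    // Dump the hashtable through a gzip stream, so the on-disk file
    // that goes into the distributed cache stays small.
    static void dump(HashMap<String, String> table, File file)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(file)))) {
            out.writeObject(table);
        }
    }

    // Each mapper pays a small decompression cost when reloading
    // the table from the compressed dump file.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new FileInputStream(file)))) {
            return (HashMap<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, String> table = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            table.put("key" + i, "value" + i);
        }
        File dumpFile = File.createTempFile("hashtable-dump", ".gz");
        dump(table, dumpFile);
        HashMap<String, String> reloaded = load(dumpFile);
        // The round trip through gzip preserves the table contents.
        System.out.println(table.equals(reloaded));  // prints "true"
        dumpFile.delete();
    }
}
```

The trade-off noted in the comment shows up here directly: compression happens once at dump time, but the `GZIPInputStream` decompression cost is paid by every mapper that loads the file.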
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.