[jira] [Commented] (HIVE-13985) ORC improvements for reducing the file system calls in task side

Hive QA (JIRA) Sun, 19 Jun 2016 10:54:21 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15338661#comment-15338661
 ]


Hive QA commented on HIVE-13985:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12811504/HIVE-13985.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 11 failed/errored test(s), 10246 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_globallimit
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_constantPropagateForSubQuery
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_repair
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_table_nonprintable
org.apache.hadoop.hive.llap.tezplugins.TestLlapTaskSchedulerService.testDelayedLocalityNodeCommErrorImmediateAllocation
org.apache.hadoop.hive.ql.metadata.TestHiveMetaStoreChecker.testPartitionsCheck
org.apache.hadoop.hive.ql.metadata.TestHiveMetaStoreChecker.testTableCheck
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/179/testReport
Console output: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/179/console
Test logs: 
http://ec2-50-18-27-0.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-179/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 11 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12811504 - PreCommit-HIVE-MASTER-Build

> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
>                 Key: HIVE-13985
>                 URL: https://issues.apache.org/jira/browse/HIVE-13985
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC
>    Affects Versions: 1.3.0, 2.2.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>         Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, 
> HIVE-13985-branch-1.patch, HIVE-13985-branch-1.patch, 
> HIVE-13985-branch-2.1.patch, HIVE-13985.1.patch, HIVE-13985.2.patch, 
> HIVE-13985.3.patch, HIVE-13985.4.patch, HIVE-13985.5.patch
>
>
> HIVE-13840 fixed some issues with addition file system invocations during 
> split generation. Similarly, this jira will fix issues with additional file 
> system invocations on the task side. To avoid reading footers on the task 
> side, users can set hive.orc.splits.include.file.footer to true which will 
> serialize the orc footers on the splits. But this has issues with serializing 
> unwanted information like column statistics and other metadata which are not 
> really required for reading orc split on the task side. We can reduce the 
> payload on the orc splits by serializing only the minimum required 
> information (stripe information, types, compression details). This will 
> decrease the payload on the orc splits and can potentially avoid OOMs in 
> application master (AM) during split generation. This jira also address other 
> issues concerning the AM cache. The local cache used by AM is soft reference 
> cache. This can introduce unpredictability across multiple runs of the same 
> query. We can cache the serialized footer in the local cache and also use 
> strong reference cache which should avoid memory pressure and will have 
> better predictability.
> One other improvement that we can do is when 
> hive.orc.splits.include.file.footer is set to false, on the task side we make 
> one additional file system call to know the size of the file. If we can 
> serialize the file length in the orc split this can be avoided.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-13985) ORC improvements for reducing the file system calls in task side

Reply via email to