[ https://issues.apache.org/jira/browse/HIVE-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988399#comment-16988399 ]
Hive QA commented on HIVE-22579:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12987446/HIVE-22579.01.branch-2.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/19746/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/19746/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-19746/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2019-12-05 02:34:47.581
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-19746/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z branch-2 ]]
+ [[ -d apache-github-branch-2-source ]]
+ [[ ! -d apache-github-branch-2-source/.git ]]
+ [[ ! -d apache-github-branch-2-source ]]
+ date '+%Y-%m-%d %T.%3N'
2019-12-05 02:34:47.679
+ cd apache-github-branch-2-source
+ git fetch origin
From https://github.com/apache/hive
   7534f82..de0a7ec  branch-1           -> origin/branch-1
   9bcdb54..6002c51  branch-1.2         -> origin/branch-1.2
   292a98f..0359921  branch-2.1         -> origin/branch-2.1
   b148507..67f9139  branch-2.2         -> origin/branch-2.2
   f90975a..f55ee60  branch-3           -> origin/branch-3
   a354bed..0ecbd12  branch-3.0         -> origin/branch-3.0
   909c1dc..eb4d7c3  branch-3.1         -> origin/branch-3.1
   305e710..1ef05ef  master             -> origin/master
   e59fdf9..3638231  storage-branch-2.7 -> origin/storage-branch-2.7
 * [new tag]         rel/storage-release-2.7.1 -> rel/storage-release-2.7.1
+ git reset --hard HEAD
HEAD is now at a4a6101 HIVE-22249: Support Parquet through HCatalog (Jay Green-Stevens via Peter Vary)
+ git clean -f -d
+ git checkout branch-2
Already on 'branch-2'
Your branch is up-to-date with 'origin/branch-2'.
+ git reset --hard origin/branch-2
HEAD is now at a4a6101 HIVE-22249: Support Parquet through HCatalog (Jay Green-Stevens via Peter Vary)
+ git merge --ff-only origin/branch-2
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2019-12-05 02:35:11.141
+ rm -rf ../yetus_PreCommit-HIVE-Build-19746
+ mkdir ../yetus_PreCommit-HIVE-Build-19746
+ git gc
+ cp -R . ../yetus_PreCommit-HIVE-Build-19746
+ mkdir /data/hiveptest/logs/PreCommit-HIVE-Build-19746/yetus
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
Going to apply patch with: git apply -p0
+ [[ maven == \m\a\v\e\n ]]
+ rm -rf /data/hiveptest/working/maven/org/apache/hive
+ mvn -B clean install -DskipTests -T 4 -q -Dmaven.repo.local=/data/hiveptest/working/maven
ANTLR Parser Generator  Version 3.5.2
Output file /data/hiveptest/working/apache-github-branch-2-source/metastore/target/generated-sources/antlr3/org/apache/hadoop/hive/metastore/parser/FilterParser.java does not exist: must build /data/hiveptest/working/apache-github-branch-2-source/metastore/src/java/org/apache/hadoop/hive/metastore/parser/Filter.g
org/apache/hadoop/hive/metastore/parser/Filter.g
DataNucleus Enhancer (version 4.1.17) for API "JDO"
DataNucleus Enhancer : Classpath
>>  /usr/share/maven/boot/plexus-classworlds-2.x.jar
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MDatabase
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MFieldSchema
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MType
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTable
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MConstraint
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MSerDeInfo
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MOrder
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MColumnDescriptor
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MStringList
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MStorageDescriptor
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartition
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MIndex
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MRole
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MRoleMap
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MGlobalPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MDBPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTablePrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartitionPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTableColumnPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartitionColumnPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartitionEvent
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MMasterKey
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MDelegationToken
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTableColumnStatistics
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartitionColumnStatistics
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MVersionTable
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MResourceUri
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MFunction
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MNotificationLog
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MNotificationNextId
DataNucleus Enhancer completed with success for 30 classes. Timings : input=186 ms, enhance=197 ms, total=383 ms.
Consult the log for full details
ANTLR Parser Generator  Version 3.5.2
Output file /data/hiveptest/working/apache-github-branch-2-source/ql/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HiveLexer.java does not exist: must build /data/hiveptest/working/apache-github-branch-2-source/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g
org/apache/hadoop/hive/ql/parse/HiveLexer.g
Output file /data/hiveptest/working/apache-github-branch-2-source/ql/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HiveParser.java does not exist: must build /data/hiveptest/working/apache-github-branch-2-source/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g
org/apache/hadoop/hive/ql/parse/HiveParser.g
Output file /data/hiveptest/working/apache-github-branch-2-source/ql/target/generated-sources/antlr3/org/apache/hadoop/hive/ql/parse/HintParser.java does not exist: must build /data/hiveptest/working/apache-github-branch-2-source/ql/src/java/org/apache/hadoop/hive/ql/parse/HintParser.g
org/apache/hadoop/hive/ql/parse/HintParser.g
Generating vector expression code
Generating vector expression test code
[ERROR] Failed to execute goal on project hive-hbase-handler: Could not resolve dependencies for project org.apache.hive:hive-hbase-handler:jar:2.4.0-SNAPSHOT: Could not find artifact org.apache.hbase:hbase-procedure:jar:1.1.1 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :hive-hbase-handler
+ result=1
+ '[' 1 -ne 0 ']'
+ rm -rf yetus_PreCommit-HIVE-Build-19746
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12987446 - PreCommit-HIVE-Build

> ACID v1: covered delta-only splits (without base) should be marked as covered (branch-2)
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-22579
>                 URL: https://issues.apache.org/jira/browse/HIVE-22579
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: HIVE-22579.01.branch-2.patch
>
>
> There is a scenario in which different SplitGenerator instances try to cover the same delta-only buckets (buckets having no base file) more than once, so multiple OrcSplit instances can be generated for the same delta file. This causes multiple tasks to read the same delta file, which produces duplicate records in a simple select-star query.
> File structure for a 256 bucket table:
> {code}
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013
> -rw-r--r--   3 hive hadoop        353 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00012
> -rw-r--r--   3 hive hadoop       1642 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/base_0000013/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1635 2019-11-29 15:55 /apps/hive/warehouse/naresh.db/test1/delta_0000014_0000014_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00012
> -rwxrwxrwx   3 hive hadoop       1808 2019-11-29 16:04 /apps/hive/warehouse/naresh.db/test1/delta_0000015_0000015_0000/bucket_00140
> drwxrwxrwx   - hive hadoop          0 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000
> -rwxrwxrwx   3 hive hadoop        348 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00043
> -rwxrwxrwx   3 hive hadoop       1633 2019-11-29 16:06 /apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171
> {code}
> In this case, when the bucket_00171 file has a record and there is no base file for it, a select (*) with the ETL split strategy can generate 2 splits for the same delta bucket.
> The scenario of the issue:
> 1. ETLSplitStrategy contains a [covered[] array|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L763] which is [shared between the SplitInfo instances|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L824] to be created
> 2. A SplitInfo instance is created for [every base file (2 in this case)|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L809]
> 3. For every SplitInfo, [a SplitGenerator is created|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L925-L926], and in the constructor [the parent's getSplit is called|https://github.com/apache/hive/blob/298f749ec7be04abb797fb119f3f0b923c8a1b27/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1251], which tries to take care of the deltas
> I'm not sure at the moment what the intention of this is, but this way duplicate delta splits can be generated, which can cause duplicated reads later (note that both tasks read the same delta file, bucket_00171):
> {code}
> 2019-12-01T16:24:53,669 INFO [TezTR-127843_16_30_0_171_0 (1575040127843_0016_30_00_000171_0)] orc.ReaderImpl: Reading ORC rows from hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171 with {include: [true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807, schema: struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:53,672 INFO [TezTR-127843_16_30_0_171_0 (1575040127843_0016_30_00_000171_0)] lib.MRReaderMapred: Processing split: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1, start=171, length=0, isOriginal=false, fileLength=9223372036854775807, hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 16 stmtIds: [0] }]]
> 2019-12-01T16:24:55,807 INFO [TezTR-127843_16_30_0_425_0 (1575040127843_0016_30_00_000425_0)] orc.ReaderImpl: Reading ORC rows from hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1/delta_0000016_0000016_0000/bucket_00171 with {include: [true, true, true, true, true, true, true, true, true, true, true, true], offset: 0, length: 9223372036854775807, schema: struct<idp_warehouse_id:bigint,idp_audit_id:bigint,batch_id:decimal(9,0),source_system_cd:varchar(500),insert_time:timestamp,process_status_cd:varchar(20),business_date:date,last_update_time:timestamp,report_date:date,etl_run_time:timestamp,etl_run_nbr:bigint>}
> 2019-12-01T16:24:55,813 INFO [TezTR-127843_16_30_0_425_0 (1575040127843_0016_30_00_000425_0)] lib.MRReaderMapred: Processing split: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat:OrcSplit [hdfs://c3351-node2.squadron.support.hortonworks.com:8020/apps/hive/warehouse/naresh.db/test1, start=171, length=0, isOriginal=false, fileLength=9223372036854775807, hasFooter=false, hasBase=false, deltas=[{ minTxnId: 14 maxTxnId: 14 stmtIds: [0] }, { minTxnId: 15 maxTxnId: 15 stmtIds: [0] }, { minTxnId: 16 maxTxnId: 16 stmtIds: [0] }]]
> {code}
> It seems this issue doesn't affect ACID v2, as getSplits() returns an empty collection or throws an exception in case of unexpected deltas (the former applies here, since the deltas were not unexpected):
> https://github.com/apache/hive/blob/8ee3497f87f81fa84ee1023e891dc54087c2cd5e/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1178-L1197

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
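The "mark delta-only splits as covered" idea in the issue title can be sketched as follows. This is a minimal, hypothetical model, not Hive's actual API: CoveredBuckets and claim() are illustrative names standing in for the shared covered[] array in OrcInputFormat's ETLSplitStrategy. The point is that once the first generator claims a delta-only bucket, a second generator working from the same shared state skips it, so only one split is emitted for bucket_00171.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical, simplified model of the shared covered[] bookkeeping
// described in the scenario above (names are illustrative, not Hive's API).
public class CoveredBuckets {
  private final BitSet covered; // one bit per bucket id, shared by all generators

  public CoveredBuckets(int numBuckets) {
    this.covered = new BitSet(numBuckets);
  }

  // Returns true only for the first caller that claims the bucket;
  // marking it covered prevents a second split for the same delta file.
  public synchronized boolean claim(int bucket) {
    if (covered.get(bucket)) {
      return false;
    }
    covered.set(bucket);
    return true;
  }

  public static void main(String[] args) {
    CoveredBuckets shared = new CoveredBuckets(256);
    List<Integer> splits = new ArrayList<>();
    // Two SplitGenerator-like passes both encounter delta-only bucket 171,
    // as in the logs above; only the first may emit a split for it.
    for (int generator = 0; generator < 2; generator++) {
      if (shared.claim(171)) {
        splits.add(171);
      }
    }
    System.out.println(splits.size()); // prints 1: one split, not two
  }
}
```

Without the shared claim check (the buggy behavior), both passes would add bucket 171 and two tasks would later read the same delta file, producing the duplicate rows shown in the logs.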