[ https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marco Gaido updated HIVE-17280:
-------------------------------
    Description:
Hive concatenation causes data loss if the ORC files in the table were written by Spark.

Here are the steps to reproduce the problem:

 - create a table;
{code:java}
hive
hive> create table aa (a string, b int) stored as orc;
{code}
 - insert 2 rows using Spark;
{code:java}
spark-shell
scala> case class AA(a:String, b:Int)
scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
scala> df.write.insertInto("aa")
{code}
 - change table schema;
{code:java}
hive
hive> alter table aa add columns(aa string, bb int);
{code}
 - insert other 2 rows with Spark
{code:java}
spark-shell
scala> case class BB(a:String, b:Int, aa:String, bb:Int)
scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
scala> df.write.insertInto("aa")
{code}
 - at this point, running a select statement with Hive returns correctly *4 rows* in the table; then run the concatenation
{code:java}
hive
hive> alter table aa concatenate;
{code}
At this point, a select returns only *3 rows, ie. a row is missing*.

  was:
Hive concatenation causes data loss if the ORC files in the table were written by Spark.
Here are the steps to reproduce the problem:

 - create a table;
{code:java}
hive
hive> create table aa (a string, b int) stored as orc;
{code}
 - insert 2 rows using Spark;
{code:java}
spark-shell
scala> case class AA(a:String, b:Int)
scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
scala> df.write.insertInto("aa")
{code}
 - change table schema;
{code:java}
hive
hive> alter table aa add columns(aa string, bb int);
{code}
 - insert other 2 rows with Spark
{code:java}
spark-shell
scala> case class BB(a:String, b:Int, aa:String, bb:Int)
scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
scala> df.write.insertInto("aa")
{code}
 - at this point, running a select statement with Hive returns correctly 4 rows in the table; then run the concatenation
{code:java}
hive
hive> alter table aa concatenate;
{code}
At this point, a select returns only* 3 rows, ie. a row is missing*.


> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>
>                 Key: HIVE-17280
>                 URL: https://issues.apache.org/jira/browse/HIVE-17280
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Spark
>    Affects Versions: 1.2.1
>         Environment: Tested in HDP-2.6
>            Reporter: Marco Gaido
>
> Hive concatenation causes data loss if the ORC files in the table were
> written by Spark.