Hi, I'm trying to rename an ORC table (whether I do it in Hive or in Spark makes no difference). After the rename, all of the table's content becomes invisible in Spark, while it is still available in Hive. The problem can always be reproduced with these simple steps:
---------------------------- spark shell output ------------------------
scala> sql("select uid from pass_db_uc.uc_user limit 1")
res0: org.apache.spark.sql.DataFrame = [uid: bigint]

scala> .show
+---+
|uid|
+---+
| 12|
+---+

scala> sql("select uid from pass_db_uc.uc_user limit 1").write.format("orc").saveAsTable("yytest.orc1")
16/05/08 11:10:07 WARN HiveMetaStore: Location: hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc1 specified for non-external table:orc1

scala> sql("select * from yytest.orc1").count        <<<< content in table
res3: Long = 1

scala> sql("alter table yytest.orc1 rename to yytest.orc2")
res4: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("select * from yytest.orc2").count        <<<< after renaming, no content in table
res5: Long = 0

scala> sql("alter table yytest.orc2 rename to yytest.orc1")
res6: org.apache.spark.sql.DataFrame = [result: string]

scala> sql("select * from yytest.orc1").count        <<<< renaming it back recovers the content; I suspect some metadata error
res7: Long = 1
---------------------------- spark shell output end ------------------------

On the other hand, I tried the hive shell for some clues and found that the content is still available there, although the table schema contains a few weird things. Hive is configured to use MR instead of Spark as its execution engine.

---------------------------- hive output -------------------------
hive> select count(*) from yytest.orc2;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = root_20160508114940_c3c5454f-73a3-43fc-bdd9-f264414de88e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1461140858519_15796, Tracking URL = http://prod-hadoop-master01:8088/proxy/application_1461140858519_15796/
Kill Command = /home/hadoop/hadoop-2.6.3/bin/hadoop job -kill job_1461140858519_15796
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-05-08 11:52:15,326 Stage-1 map = 0%, reduce = 0%
2016-05-08 11:52:20,594 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.25 sec
2016-05-08 11:52:26,868 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.85 sec
MapReduce Total cumulative CPU time: 2 seconds 850 msec
Ended Job = job_1461140858519_15796
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1   Cumulative CPU: 2.85 sec   HDFS Read: 8310 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 850 msec
OK
1        <---- content in table
Time taken: 20.971 seconds, Fetched: 1 row(s)

hive> show create table yytest.orc2;
OK
CREATE TABLE `yytest.orc2`(
  `uid` bigint COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES (
  'path'='hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc1')        <---- serde property does not match the new table; correcting it manually does not take effect, still no content in spark
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc2'        <---- checked in HDFS, content is OK
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'EXTERNAL'='FALSE',
  'last_modified_by'='root',
  'last_modified_time'='1462679372',
  'numFiles'='1',
  'numRows'='-1',
  'rawDataSize'='-1',
  'spark.sql.sources.provider'='orc',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"uid\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}',
  'totalSize'='196',
  'transient_lastDdlTime'='1462679372')
Time taken: 1.258 seconds, Fetched: 25 row(s)
---------------------------- hive output end -------------------------
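For reference, this is roughly the kind of manual correction I attempted from the spark shell, along with two guesses at a workaround. The refreshTable call and the direct ORC read are only my assumptions about where the stale metadata might be cached, not confirmed fixes; the path value is copied from the LOCATION line above:

// guess: point the spark-specific 'path' serde property at the renamed
// location (new path copied from the LOCATION shown by hive above)
sql("alter table yytest.orc2 set serdeproperties (" +
  "'path'='hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc2')")

// assumption: spark 1.6 may be caching the old table metadata, so refresh it
sqlContext.refreshTable("yytest.orc2")
sql("select * from yytest.orc2").count

// sanity check that the files survived the rename: read the ORC data
// directly from the new HDFS location, bypassing the metastore entry
val direct = sqlContext.read.format("orc")
  .load("hdfs://prod-hadoop-master01:9000/user/hive/warehouse/yytest.db/orc2")
direct.count

If the direct read returns the row, that would confirm the data files are intact and the problem is purely in the metadata that spark reads.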
Here are the versions I'm using:

Hadoop 2.6.3
Hive 2.0.0
Spark 1.6 - built with Hive support, depends on Scala 2.10

Does anyone have an idea why the content goes missing after the rename, or any suggestion on how to deal with it? Thanks a lot.

Regards,
Xudong