(Cross posted from u...@spark.apache.org) Hello,
I am in the process of evaluating Spark (1.5.2) for a wide range of use cases. In particular I'm keen to understand the depth of the integration with HCatalog (aka the Hive Metastore). I am very encouraged when browsing the source contained within the org.apache.spark.sql.hive package. My goals are to evaluate how effectively Spark handles the following scenarios: 1. Reading from an unpartitioned HCatalog table. 2. Reading from a partitioned HCatalog table with partition pruning from filter pushdown. 3. Writing to a new unpartitioned HCatalog table. 4. Writing to a new partitioned HCatalog table. 5. Adding a partition to a partitioned HCatalog table. I found that the first three cases appear to function beautifully. However, I cannot seem to effectively create new HCatalog aware partitions either in a new table or on and existing table (cases 4 & 5). I suspect this may be due to my inexperience with Spark so wonder if you could advise me on what to try next. Here's what I have: *Case 4: Writing to a new partitioned HCatalog table* Create a source in Hive (could be plain data file also): hive (default)> create table foobar ( id int, name string ); hive (default)> insert into table foobar values (1, "xxx"), (2, "zzz"); Read the source with Spark, partition the data, and write to a new table: sqlContext.sql("select * from foobar").write.format("orc").partitionBy("id").saveAsTable("raboof") Check for the new table in Hive, it is partitioned correctly although the formats and schema are unexpected: hive (default)> show table extended like 'raboof'; OK tab_name tableName: raboof location:hdfs://host:port/user/hive/warehouse/raboof inputformat:org.apache.hadoop.mapred.SequenceFileInputFormat outputformat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat columns:struct columns { list<string> col} partitioned:true partitionColumns:struct partition_columns { i32 id} Check for correctly partitioned data on HDFS, it appears to be there: [me@host]$ hdfs dfs -ls -R /user/hive/warehouse/raboof /user/hive/warehouse/raboof/_SUCCESS /user/hive/warehouse/raboof/id=1 /user/hive/warehouse/raboof/id=1/part-r-00000-<uuid1>.orc /user/hive/warehouse/raboof/id=2 /user/hive/warehouse/raboof/id=2/part-r-00000-<uuid2>.orc Something is wrong however, no data is returned from this query and the column names appear incorrect: hive (default)> select * from default.raboof; OK col id HCatalog reports no partitions for the table: hive (default)> show partitions default.raboof; OK partition *Case 5: Adding a partition to a partitioned HCatalog table* Created partitioned source table in Hive: hive (default)> create table foobar ( name string ) > partitioned by ( id int ) > stored as orc; hive (default)> insert into table foobar PARTITION (id) > values ("xxx", 1), ("yyy", 2); Created a source for a new record to add to new_record_source: hive (default)> create table new_record_source ( id int, name string ) > stored as orc; hive (default)> insert into table new_record_source > values (3, "zzz"); Trying to add a partition with: sqlContext.sql("select * from new_record_source").write.mode("append").partitionBy("id").saveAsTable("foobar") This almost did what I wanted: hive (default)> show partitions default.foobar; partition id=1 id=2 id=__HIVE_DEFAULT_PARTITION__ hive (default)> select * from default.foobar; name id xxx 1 yyy 2 3 NULL Any assistance would be greatly appreciated. Many thanks - Elliot.