(Cross posted from u...@spark.apache.org)

Hello,

I am in the process of evaluating Spark (1.5.2) for a wide range of use
cases. In particular I'm keen to understand the depth of its integration
with HCatalog (aka the Hive Metastore). I was very encouraged when browsing
the source contained within the org.apache.spark.sql.hive package. My goals
are to evaluate how effectively Spark handles the following scenarios:

   1. Reading from an unpartitioned HCatalog table.
   2. Reading from a partitioned HCatalog table with partition pruning from
   filter pushdown.
   3. Writing to a new unpartitioned HCatalog table.
   4. Writing to a new partitioned HCatalog table.
   5. Adding a partition to a partitioned HCatalog table.

I found that the first three cases appear to function beautifully. However,
I cannot seem to effectively create new HCatalog-aware partitions, either in
a new table or on an existing table (cases 4 & 5). I suspect this may be
due to my inexperience with Spark, so I wonder if you could advise me on what
to try next. Here's what I have:

*Case 4: Writing to a new partitioned HCatalog table*

Create a source in Hive (could be plain data file also):


hive (default)> create table foobar ( id int, name string );
hive (default)> insert into table foobar values (1, "xxx"), (2, "zzz");

Read the source with Spark, partition the data, and write to a new table:

sqlContext.sql("select * from foobar").write.format("orc").partitionBy("id").saveAsTable("raboof")
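As an aside, I also wondered whether the recommended route is instead to create the partitioned table in Hive first and then use insertInto with dynamic partitioning. A sketch of what I mean (untested on my side; I'm assuming the dynamic-partition settings are required and that insertInto matches columns by position):

```scala
// Sketch only -- assumes the target table is created in Hive first:
//   create table raboof ( name string ) partitioned by ( id int ) stored as orc;

// Dynamic partitioning presumably needs enabling for the insert to work:
sqlContext.sql("set hive.exec.dynamic.partition=true")
sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")

// Select the columns in the table's order (data columns first,
// partition column last), since insertInto appears to be positional:
sqlContext.sql("select name, id from foobar")
  .write
  .insertInto("raboof")
```

Is that closer to the intended usage, or should saveAsTable work here?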


Check for the new table in Hive; it is partitioned correctly, although the
formats and schema are unexpected:

hive (default)> show table extended like 'raboof';
OK
tab_name
tableName: raboof
location:hdfs://host:port/user/hive/warehouse/raboof
inputformat:org.apache.hadoop.mapred.SequenceFileInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
columns:struct columns { list<string> col}
partitioned:true
partitionColumns:struct partition_columns { i32 id}


Check for correctly partitioned data on HDFS; it appears to be there:

[me@host]$ hdfs dfs -ls -R /user/hive/warehouse/raboof
/user/hive/warehouse/raboof/_SUCCESS
/user/hive/warehouse/raboof/id=1
/user/hive/warehouse/raboof/id=1/part-r-00000-<uuid1>.orc
/user/hive/warehouse/raboof/id=2
/user/hive/warehouse/raboof/id=2/part-r-00000-<uuid2>.orc


Something is wrong, however: no data is returned from this query and the
column names appear incorrect:

hive (default)> select * from default.raboof;
OK
col id

HCatalog reports no partitions for the table:

hive (default)> show partitions default.raboof;
OK
partition
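I did wonder whether the partitions simply need registering with the metastore, something like the following in Hive, though given the odd inputformat and columns reported above I suspect the table metadata itself is the problem (this is just a guess on my part):

```sql
-- Guesswork: ask Hive to scan the table location and register any
-- id=<value> directories it finds as partitions:
msck repair table raboof;

-- Or register a single partition explicitly:
alter table raboof add partition (id=1)
  location 'hdfs://host:port/user/hive/warehouse/raboof/id=1';
```

Even if that registers the partitions, I assume Hive still won't read the data until the table's inputformat/outputformat and column schema are corrected.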

*Case 5: Adding a partition to a partitioned HCatalog table*

Created a partitioned source table in Hive:

hive (default)> create table foobar ( name string )
              > partitioned by ( id int )
              > stored as orc;
hive (default)> insert into table foobar PARTITION (id)
              > values ("xxx", 1), ("yyy", 2);


Created a source table, new_record_source, holding a new record to add:

hive (default)> create table new_record_source ( id int, name string )
              > stored as orc;
hive (default)> insert into table new_record_source
              > values (3, "zzz");


Tried to add a partition with:

sqlContext.sql("select * from new_record_source").write.mode("append").partitionBy("id").saveAsTable("foobar")


This almost did what I wanted:

hive (default)> show partitions default.foobar;
partition
id=1
id=2
id=__HIVE_DEFAULT_PARTITION__

hive (default)> select * from default.foobar;
name id
xxx 1
yyy 2
3 NULL
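Looking at that result, I wonder if the append is matching columns by position rather than by name: new_record_source is (id, name) while foobar reads back as (name, id), so the id value 3 seems to have landed in the name column and "zzz" became an unparseable int, hence the NULL/default partition. If that's right, perhaps reordering the select would help (untested guess on my part):

```scala
// Guess: select the columns in the target table's order (data columns
// first, partition column last) so a positional append lines up:
sqlContext.sql("select name, id from new_record_source")
  .write
  .mode("append")
  .partitionBy("id")
  .saveAsTable("foobar")
```

Is positional matching the expected behaviour here, or should the append resolve columns by name?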

Any assistance would be greatly appreciated.

Many thanks - Elliot.
