Have you tried partitionBy? Something like:
hiveWindowsEvents.foreachRDD( rdd => {
  val eventsDataFrame = rdd.toDF()
  eventsDataFrame.write
    .mode(SaveMode.Append)
    .partitionBy("windows_event_time_bin")
    .saveAsTable("windows_event")
})

On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> Hello.
>
> I am working to get a simple solution running using Spark SQL. I am
> writing streaming data to persistent tables using a HiveContext. Writing
> to a persistent non-partitioned table works well: I update the table using
> Spark Streaming, and the output is available via Hive Thrift/JDBC.
>
> I create a table that looks like the following:
>
> 0: jdbc:hive2://localhost:10000> describe windows_event;
> describe windows_event;
> +--------------------------+---------------------+----------+
> |         col_name         |      data_type      | comment  |
> +--------------------------+---------------------+----------+
> | target_entity            | string              | NULL     |
> | target_entity_type       | string              | NULL     |
> | date_time_utc            | timestamp           | NULL     |
> | machine_ip               | string              | NULL     |
> | event_id                 | string              | NULL     |
> | event_data               | map<string,string>  | NULL     |
> | description              | string              | NULL     |
> | event_record_id          | string              | NULL     |
> | level                    | string              | NULL     |
> | machine_name             | string              | NULL     |
> | sequence_number          | string              | NULL     |
> | source                   | string              | NULL     |
> | source_machine_name      | string              | NULL     |
> | task_category            | string              | NULL     |
> | user                     | string              | NULL     |
> | additional_data          | map<string,string>  | NULL     |
> | windows_event_time_bin   | timestamp           | NULL     |
> | # Partition Information  |                     |          |
> | # col_name               | data_type           | comment  |
> | windows_event_time_bin   | timestamp           | NULL     |
> +--------------------------+---------------------+----------+
>
> However, when I create a partitioned table and write data using the
> following:
>
> hiveWindowsEvents.foreachRDD( rdd => {
>   val eventsDataFrame = rdd.toDF()
>   eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
> })
>
> the data is written as though the table were not partitioned (everything
> goes to /user/hive/warehouse/windows_event/file.gz.parquet). Because the
> data does not follow the partition schema, it is not accessible (and not
> partitioned).
>
> Is there a straightforward way to write to partitioned tables using Spark
> SQL? I understand that read performance on partitioned data is far
> better - are there other performance improvements that might be better to
> use instead of partitioning?
>
> Regards,
>
> Bryan Jeffrey
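
Since the table was created in Hive up front with a PARTITIONED BY clause (as
your describe output suggests), a dynamic-partition insert is another option.
Here is a minimal sketch - it assumes hiveContext is the HiveContext backing
toDF(), and that windows_event_time_bin is the last column of the DataFrame,
since insertInto resolves columns by position rather than by name:

import org.apache.spark.sql.SaveMode

// Hive requires these settings before an all-dynamic partition insert
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

hiveWindowsEvents.foreachRDD( rdd => {
  val eventsDataFrame = rdd.toDF()
  // Appends each row to the partition named by its
  // windows_event_time_bin value (assumed to be the last column)
  eventsDataFrame.write
    .mode(SaveMode.Append)
    .insertInto("windows_event")
})

With either approach, each micro-batch should end up under a
windows_event_time_bin=... subdirectory of the table location rather than at
the table root, and Hive should then be able to prune partitions on read.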