Have you tried partitionBy?

Something like

hiveWindowsEvents.foreachRDD( rdd => {
      val eventsDataFrame = rdd.toDF()
      eventsDataFrame.write
        .mode(SaveMode.Append)
        .partitionBy("windows_event_time_bin")
        .saveAsTable("windows_event")
    })
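
For that to compile you also need SaveMode imported and the HiveContext
implicits in scope for rdd.toDF(). Roughly this setup (a sketch; the app
name, batch interval, and variable names are placeholders, not your actual
code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc  = new SparkContext(new SparkConf().setAppName("WindowsEvents"))
    val ssc = new StreamingContext(sc, Seconds(30))  // batch interval: placeholder
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._                   // enables rdd.toDF() on RDDs of case classes

One thing to watch: the column you hand to partitionBy must be a column of
the DataFrame itself; Spark moves it out of the data files and into the
directory layout.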



On Wed, Oct 28, 2015 at 7:41 AM, Bryan Jeffrey <bryan.jeff...@gmail.com>
wrote:

> Hello.
>
> I am trying to get a simple solution working using Spark SQL.  I am
> writing streaming data to persistent tables using a HiveContext.  Writing
> to a persistent non-partitioned table works well - I update the table using
> Spark streaming, and the output is available via Hive Thrift/JDBC.
>
> I create a table that looks like the following:
>
> 0: jdbc:hive2://localhost:10000> describe windows_event;
> describe windows_event;
> +--------------------------+---------------------+----------+
> |         col_name         |      data_type      | comment  |
> +--------------------------+---------------------+----------+
> | target_entity            | string              | NULL     |
> | target_entity_type       | string              | NULL     |
> | date_time_utc            | timestamp           | NULL     |
> | machine_ip               | string              | NULL     |
> | event_id                 | string              | NULL     |
> | event_data               | map<string,string>  | NULL     |
> | description              | string              | NULL     |
> | event_record_id          | string              | NULL     |
> | level                    | string              | NULL     |
> | machine_name             | string              | NULL     |
> | sequence_number          | string              | NULL     |
> | source                   | string              | NULL     |
> | source_machine_name      | string              | NULL     |
> | task_category            | string              | NULL     |
> | user                     | string              | NULL     |
> | additional_data          | map<string,string>  | NULL     |
> | windows_event_time_bin   | timestamp           | NULL     |
> | # Partition Information  |                     |          |
> | # col_name               | data_type           | comment  |
> | windows_event_time_bin   | timestamp           | NULL     |
> +--------------------------+---------------------+----------+
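>
> For completeness, DDL along these lines reproduces that schema (a sketch
> issued through the HiveContext; my exact statement may have differed):
>
>     hiveContext.sql("""
>       CREATE TABLE IF NOT EXISTS windows_event (
>         target_entity        string,
>         target_entity_type   string,
>         date_time_utc        timestamp,
>         machine_ip           string,
>         event_id             string,
>         event_data           map<string,string>,
>         description          string,
>         event_record_id      string,
>         level                string,
>         machine_name         string,
>         sequence_number      string,
>         source               string,
>         source_machine_name  string,
>         task_category        string,
>         `user`               string,  -- backticked defensively against keyword clashes
>         additional_data      map<string,string>
>       )
>       PARTITIONED BY (windows_event_time_bin timestamp)
>       STORED AS PARQUET
>     """)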
>
>
> However, when I create a partitioned table and write data using the
> following:
>
>     hiveWindowsEvents.foreachRDD( rdd => {
>       val eventsDataFrame = rdd.toDF()
>       eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
>     })
>
> The data is written as though the table were not partitioned (so everything
> is written to /user/hive/warehouse/windows_event/file.gz.parquet).  Because
> the data does not follow the partition scheme, it is not accessible (and
> not partitioned).
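>
> With a working partitionBy I would expect per-partition subdirectories
> instead, something like (file name illustrative):
>
>     /user/hive/warehouse/windows_event/windows_event_time_bin=<bin>/part-r-00000.gz.parquet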
>
> Is there a straightforward way to write to partitioned tables using Spark
> SQL?  I understand that the read performance for partitioned data is far
> better - are there other performance improvements that might be better to
> use instead of partitioning?
>
> Regards,
>
> Bryan Jeffrey
>
