Hello.
I am trying to get a simple solution working using Spark SQL. I am
writing streaming data to persistent tables using a HiveContext. Writing
to a persistent non-partitioned table works well - I update the table using
Spark streaming, and the output is available via Hive Thrift/JDBC.
I create a table that looks like the following:
0: jdbc:hive2://localhost:10000> describe windows_event;
+--------------------------+---------------------+----------+
|         col_name         |      data_type      | comment  |
+--------------------------+---------------------+----------+
| target_entity            | string              | NULL     |
| target_entity_type       | string              | NULL     |
| date_time_utc            | timestamp           | NULL     |
| machine_ip               | string              | NULL     |
| event_id                 | string              | NULL     |
| event_data               | map<string,string>  | NULL     |
| description              | string              | NULL     |
| event_record_id          | string              | NULL     |
| level                    | string              | NULL     |
| machine_name             | string              | NULL     |
| sequence_number          | string              | NULL     |
| source                   | string              | NULL     |
| source_machine_name      | string              | NULL     |
| task_category            | string              | NULL     |
| user                     | string              | NULL     |
| additional_data          | map<string,string>  | NULL     |
| windows_event_time_bin   | timestamp           | NULL     |
| # Partition Information  |                     |          |
| # col_name               | data_type           | comment  |
| windows_event_time_bin   | timestamp           | NULL     |
+--------------------------+---------------------+----------+
However, when I create a partitioned table and write data using the
following:
hiveWindowsEvents.foreachRDD( rdd => {
  val eventsDataFrame = rdd.toDF()
  eventsDataFrame.write.mode(SaveMode.Append).saveAsTable("windows_event")
})
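For reference, I would have expected something like the following to work - a sketch assuming the DataFrameWriter.partitionBy method (Spark 1.4+) applies when appending to an existing Hive table, which I have not verified:

hiveWindowsEvents.foreachRDD( rdd => {
  val eventsDataFrame = rdd.toDF()
  // Sketch: declare the partition column before writing; whether this
  // respects the partitioning of the pre-existing table is the question.
  eventsDataFrame.write
    .mode(SaveMode.Append)
    .partitionBy("windows_event_time_bin")
    .saveAsTable("windows_event")
})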
The data is written as though the table is not partitioned (so everything
is written to /user/hive/warehouse/windows_event/file.gz.parquet). Because
the data does not follow the partition scheme, it is not accessible (and
not partitioned).
Is there a straightforward way to write to partitioned tables using Spark
SQL? I understand that the read performance for partitioned data is far
better - are there other performance improvements that might be preferable
to partitioning?
Regards,
Bryan Jeffrey