Hi,
I have data files (JSON in this example, but they could also be Avro) written
in a directory structure like this:
dataroot
+-- year=2015
    +-- month=06
    |   +-- day=01
    |   |   +-- data1.json
    |   |   +-- data2.json
    |   |   +-- data3.json
    |   +-- day=02
    |       +-- data1.json
    |       +-- data2.json
    |       +-- data3.json
    +-- month=07
        +-- day=20
        |   +-- data1.json
        |   +-- data2.json
        |   +-- data3.json
        +-- day=21
        |   +-- data1.json
        |   +-- data2.json
        |   +-- data3.json
        +-- day=22
            +-- data1.json
            +-- data2.json
Using spark-sql I create a temporary table:
CREATE TEMPORARY TABLE dataTable
USING org.apache.spark.sql.json
OPTIONS (
  path "dataroot/*"
)
Querying the table works well, but so far I have not been able to use the
directories for pruning.
Is there a way to register the directory structure as partitions (without
using Hive), so that a query does not have to scan the whole tree? For
example, I might want to compare data for the first day of each month.
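
To make the goal concrete, here is a minimal sketch in plain Python of the kind of pruning I mean. It is not Spark code and says nothing about how Spark would do it; the helper names partition_values and prune are hypothetical, made up for this illustration. It only shows that the key=value directory names carry enough information to pick the relevant files without opening any of them.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Return the key=value directory segments of a path as a dict."""
    values = {}
    # Skip the last part, which is the file name.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep:  # only segments that actually contain "="
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only the paths whose partition values match every wanted key."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# A few paths taken from the layout above.
paths = [
    "dataroot/year=2015/month=06/day=01/data1.json",
    "dataroot/year=2015/month=06/day=02/data1.json",
    "dataroot/year=2015/month=07/day=20/data1.json",
]

# Select only the files for the first day of a month.
first_days = prune(paths, day="01")
```

What I am looking for is the equivalent of this selection happening inside spark-sql when I filter on year, month or day.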
Thanks,
Johan