I'm assuming, given this:
CREATE TABLE IF NOT EXISTS db.mytable (
`item_id` string,
`timestamp` string,
`item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;
we'd have to organize the input Parquet files into subdirectories where
each subdirectory contains data for a single (date, content_type) partition?

>> properly split and partition your data before using LOAD if you want
>> Hive to be able to find it again.
If the destination table is defined as

CREATE TABLE IF NOT EXISTS db.mytable (
`item_id` string,
`timestamp` string,
`item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;
Thank you, Ryan and Furcy, for your detailed responses.

Our application doesn't necessarily have to have the data in CSV format. We
read data from "a source" and load it into memory (not all at once),
basically as a continuous stream of records. These records are meant to be
processed and written to Hive.
“If we represent our data as delimited files” ... the question is how you
plan on getting your data into these Parquet files, since it doesn’t sound
like your data is already in that format.

If your data is not already in Parquet format, you are going to need to run
*some* process to get it into that format.
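In practice, that process is usually a Hive INSERT ... SELECT that reads a
delimited staging table and rewrites the rows as Parquet. A sketch, assuming
a staging table named db.mytable_staging (a hypothetical name) that carries
the same data columns plus the two partition columns:

-- Dynamic partitioning must be enabled for a multi-partition insert:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hive converts the rows to Parquet and creates one partition per
-- distinct (date, content_type) value; the dynamic partition columns
-- must come last in the SELECT list, in partition-clause order:
INSERT OVERWRITE TABLE db.mytable
PARTITION (`date`, `content_type`)
SELECT item_id, `timestamp`, item_comments, `date`, content_type
FROM db.mytable_staging;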
Hi Dmitry,

If I understand what you said correctly:
at the beginning you have CSV files on HDFS,
and at the end you want a partitioned Hive table stored as Parquet.

And your question is: "can I do this using only one Hive table and a LOAD
statement?"

The answer to that question is "no".

The correct way requires two tables: a text-format table over the CSV files,
plus an INSERT ... SELECT into the final Parquet table, because LOAD only
moves files; it never converts their format or splits rows across
partitions.
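To make the first of those two tables concrete, here is a sketch of the
staging side (the table name, delimiter, and location are assumptions, not
details from this thread):

-- External text table laid over the existing CSV files on HDFS:
CREATE EXTERNAL TABLE IF NOT EXISTS db.mytable_staging (
`item_id` string,
`timestamp` string,
`item_comments` string,
`date` string,
`content_type` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/incoming/mytable';

The second step is the INSERT ... SELECT with dynamic partitioning sketched
above. Note that `date` and `content_type` are ordinary columns here, which
is what lets the dynamic-partition insert route each row to the right
partition of the final table.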
Thanks, Ryan.

I was actually more curious about scenario B. If we represent our data as
delimited files, why don't we just use LOAD DATA INPATH and load it right
into the final, Parquet, partitioned table in one step, skipping the temp
table entirely?

Are there any advantages to having a temp table?
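For reference, the one-step version being asked about would look roughly
like this (the path is hypothetical); the catch, per the replies above, is
that LOAD moves files verbatim, so the delimited bytes would land in a table
whose readers expect Parquet:

-- Tempting one-step load of a delimited file into the Parquet table.
-- LOAD performs no format conversion, so any later query against this
-- partition would fail trying to decode CSV bytes as Parquet.
LOAD DATA INPATH '/data/incoming/2016-01-15-comments.csv'
INTO TABLE db.mytable
PARTITION (`date` = '2016-01-15', `content_type` = 'comments');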