RE: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table?

Markovitz, Dudu Tue, 04 Apr 2017 11:22:50 -0700

“LOAD” is very misleading here. It is all done at the metadata level.
The data is not being touched. The data is not being verified. The “system” 
does not have any clue whether the file format matches the table definition or 
whether the files can actually be used.
The data files are being “moved” (again, a metadata operation) from their 
current HDFS location to the location defined for the table.
Later on, when you query the table, the files will be scanned. If they are in 
the right format you’ll get results. If not, then no.
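
For illustration, here is a minimal sketch of what such a LOAD could look like against the partitioned Parquet table from the original question. The HDFS path and the partition values are hypothetical. The statement only yields a usable partition if the files under that path are already Parquet files whose columns match the table definition, since nothing is checked at load time.

```sql
-- Hypothetical staging directory and partition values, for illustration only.
-- The files under the path must already be Parquet with columns matching
-- db.mytable; LOAD moves them as-is and does not convert or validate them.
LOAD DATA INPATH '/staging/mytable/date=2017-04-04/content_type=article'
INTO TABLE db.mytable
PARTITION (`date` = '2017-04-04', `content_type` = 'article');
```

Every partition column gets an explicit value here, as the Hive documentation quoted in the original question requires. Delimited files loaded this way would be moved without complaint and would only fail later, when the table is queried.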

From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Tuesday, April 04, 2017 8:54 PM
To: [email protected]
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH on a 
few tables to achieve the effect of actual insertion of records. Is it not the 
case that the LOAD causes the data to get inserted into Hive?

Based on that I'd like to understand whether we can get away with using LOAD 
INPATH instead of INSERT/SELECT FROM.

On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu 
<[email protected]<mailto:[email protected]>> wrote:
I just want to verify that you understand the following:

* LOAD DATA INPATH is just an HDFS file movement operation.
  You can achieve the same results by using hdfs dfs -mv …

* LOAD DATA LOCAL INPATH is just a file copying operation from the local 
  file system to HDFS.
  You can achieve the same results by using hdfs dfs -put …

From: Dmitry Goldenberg [mailto:[email protected]]
Sent: Tuesday, April 04, 2017 7:48 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Dudu,

This is still in design stages, so we have a way to get the data from its 
source. The data is *not* in the Parquet format.  It's up to us to format it 
the best and most efficient way.  We can roll with CSV or Parquet; ultimately 
the data must make it into a pre-defined PARQUET, PARTITIONED table in Hive.

Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu 
<[email protected]<mailto:[email protected]>> wrote:
Are your files already in Parquet format?

From: Dmitry Goldenberg 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Tuesday, April 04, 2017 7:03 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED 
AS PARQUET table?

Thanks, Dudu.

Just to reiterate: the way I'm reading your response is that yes, we can use 
LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the 
delimited file is properly formatted.  Then we can LOAD it into the table 
(mytable in my example) directly and avoid the creation of the temp table 
(origtable in my example).  Correct so far?

I did not quite follow the latter part of your response:
>> You should only create an external table which is an interface to read the 
>> files and use it in an INSERT operation.

My assumption was that we would LOAD INPATH and not have to use INSERT 
at all.  Am I missing something in grokking this latter part of your 
response?

Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu 
<[email protected]<mailto:[email protected]>> wrote:
Since LOAD DATA INPATH only moves files, the answer is very simple.
If your files are already in a format that matches the destination table 
(storage type, number and types of columns etc.) then – yes and if not, then – 
no.

But –
You don’t need to load the files into an intermediary table.
You should only create an external table which is an interface to read the 
files and use it in an INSERT operation.

Dudu
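
A minimal sketch of the approach described above, assuming the delimited files are comma-separated and already carry the two partition columns as their last fields. The table name db.mytable_staging, the LOCATION path, and the staging column layout are hypothetical and not taken from this thread.

```sql
-- Hypothetical external table: only an interface for reading the delimited
-- files where they already sit in HDFS. No data is copied by this statement.
CREATE EXTERNAL TABLE IF NOT EXISTS db.mytable_staging (
  `item_id` string,
  `timestamp` string,
  `item_comments` string,
  `date` string,
  `content_type` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/staging/mytable_csv';

-- Let Hive derive the target partitions from the selected rows.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The INSERT is the step that actually rewrites the rows as Parquet.
-- Partition columns must come last in the SELECT list, in partition order.
INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
SELECT `item_id`, `timestamp`, `item_comments`, `date`, `content_type`
FROM db.mytable_staging;
```

Unlike LOAD, the INSERT reads and rewrites the data, so the delimited input ends up converted to Parquet and placed in the matching partitions.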

From: Dmitry Goldenberg 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Tuesday, April 04, 2017 4:52 PM
To: [email protected]<mailto:[email protected]>
Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS 
PARQUET table?

We have a table such as the following defined:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
-- Hive requires a type for each partition column; string is assumed here.
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;

Currently we insert data into this PARQUET, PARTITIONED table as follows, using 
an intermediary table:

INSERT INTO TABLE db.mytable PARTITION(date, content_type)
SELECT itemid as item_id, itemts as timestamp, date, content_type
FROM db.origtable
WHERE date = "${SELECTED_DATE}"
GROUP BY item_id, date, content_type;

Our question is, would it be possible to use the LOAD DATA INPATH.. INTO TABLE 
syntax to load the data from delimited data files into 'mytable' rather than 
populating mytable from the intermediary table?

I see in the Hive documentation that:
* Load operations are currently pure copy/move operations that move datafiles 
into locations corresponding to Hive tables.
* If the table is partitioned, then one must specify a specific partition of 
the table by specifying values for all of the partitioning columns.

This seems to indicate that using LOAD is possible; however looking at this 
discussion: 
http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
 perhaps not?

We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED 
tables is possible and if so, then how does one go about using LOAD in that 
case?

Thanks,
- Dmitry
