Hi Dmitry,

If I understand what you said correctly:
At the beginning you have csv files on HDFS, and at the end you want a
partitioned Hive table stored as Parquet. And your question is: "can I do
this using only one Hive table and a LOAD statement?"

The answer to that question is "no". The correct way to do this is what you
are already doing:

- Create a table db.origtable STORED AS TEXTFILE and use LOAD DATA to put
  the csv data in it (the files will stay in csv).
- Create a table db.mytable STORED AS PARQUET and then use Hive to insert
  data into it from db.origtable.

Hive is only able to:
- query data from a Hive table
- write into another Hive table
- load data "as is" into a table, without changing its format

The LOAD DATA statement cannot transform your data, nor change its format.
If you want to transform your data and/or change its format, you need a
regular INSERT INTO|OVERWRITE TABLE ... SELECT query. By the way, I would
recommend using INSERT OVERWRITE to make your query idempotent and avoid
issues if you accidentally run it twice.

Now, if you really want to have only *one* Hive table, I guess you can
either:
- make table db.origtable temporary,
- directly use Spark to read the csv files and insert into Hive as Parquet,
- or, if you can, write your input data directly as Parquet instead of csv;
  but you would still need two tables if you also need to perform a GROUP BY.

Hope this helps.

On Thu, Apr 6, 2017 at 2:48 PM, Dmitry Goldenberg <dgoldenb...@hexastax.com> wrote:
> Thanks, Ryan.
>
> I was actually more curious about scenario B. If we represent our data as
> delimited files, why don't we just use LOAD DATA INPATH and load it right
> into the final, parquet, partitioned table in one step, bypassing the temp
> table?
>
> Are there any advantages to having a temp table besides the validation?
> One advantage could possibly be making it a transactional table and being
> able to run direct INSERTs into the temp table, avoiding having to deal
> with delimited files and LOAD DATA INPATH.
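For reference, the two-table flow recommended at the top of this thread can be sketched roughly as follows. This is only an illustrative sketch: the staging column names (itemid, itemts, itemcomments), the field delimiter, and the HDFS path are assumptions, loosely based on the example tables quoted later in the thread.

```sql
-- Staging table: a plain-text interface to the raw csv files.
CREATE TABLE IF NOT EXISTS db.origtable (
  itemid       STRING,
  itemts       STRING,
  itemcomments STRING,
  `date`       STRING,
  content_type STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Pure file move into the staging table's warehouse location;
-- the files stay csv, nothing is parsed or verified here.
LOAD DATA INPATH '/landing/items/2017-04-04' INTO TABLE db.origtable;

-- The actual conversion to Parquet (and the partitioning) happens
-- here, in the INSERT ... SELECT, not in the LOAD.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE db.mytable PARTITION (`date`, content_type)
SELECT itemid AS item_id,
       itemts AS `timestamp`,
       itemcomments AS item_comments,
       `date`,
       content_type
FROM db.origtable
WHERE `date` = '2017-04-04';
```

INSERT OVERWRITE (rather than INSERT INTO) keeps the step idempotent: re-running it for the same date replaces the affected partitions instead of duplicating rows.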
>
> If we go with route B, LOAD DATA INPATH directly into the parquet,
> partitioned table, would we have to:
>
> 1) represent the input files as Parquet? - it looks like the data is still
> delimited, and Hive represents it as Parquet internally
> 2) do anything specific in the input files / with the input files in order
> to make partitioning work, or does Hive just take the data and take full
> care of partitioning it?
>
> On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
> wrote:
>
>> For A) I'd recommend mapping an EXTERNAL table to the raw/original source
>> files…then you can just run a SELECT query from the EXTERNAL source and
>> INSERT into your destination.
>>
>> LOAD DATA can be very useful when you are trying to move data between two
>> tables that share the same schema but one table is partitioned and the
>> other is NOT partitioned…once the files have been inserted into the
>> unpartitioned table, the source files from the hive warehouse can be
>> added to the partitioned table using LOAD DATA. Another place I've
>> frequently used LOAD DATA is when synchronizing hive table data between
>> two clusters: the hive warehouse data files can be copied from one
>> cluster to the other with distcp, and then loaded on the duplicate
>> cluster using LOAD DATA to ensure the metadata is recorded in the hive
>> metastore.
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 3:31 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Right, that makes sense, Dudu.
>>
>> So basically, if we have our data in "some form", and a goal of loading
>> it into a parquet, partitioned table in Hive, we have two choices:
>>
>> A. Load this data into a temporary table first.
>> Presumably, for this we should be able to do a LOAD INPATH from
>> delimited data files. Perhaps we could designate the temp table as
>> transactional and then simply do direct INSERTs into this temp table - ?
>> Then, as the second step, we'd do an INSERT... SELECT to move the data
>> into the destination table, and then DROP the temp table.
>>
>> B. Represent the data as a delimited format and do a LOAD INPATH directly
>> into the destination table. Understandably, we lose the 'data
>> verification' this way. If we go this route, must the data in the input
>> files be in the PARQUET format or in a delimited format? I would guess
>> the former. And how does partitioning play into it? How would the input
>> data need to be organized and inserted so as to adhere to the partitions
>> (the 'date' and 'content-type' columns, in my example)?
>>
>> On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu <dmarkov...@paypal.com>
>> wrote:
>>
>> "LOAD" is very misleading here. It is all done at the metadata level.
>> The data is not being touched. The data is not being verified. The
>> "system" does not have any clue whether the files' format matches the
>> table definition and whether they can actually be used.
>>
>> The data files are being "moved" (again, a metadata operation) from
>> their current HDFS location to the location defined for the table.
>> Later on, when you query the table, the files will be scanned. If they
>> are in the right format you'll get results. If not, then no.
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 8:54 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH
>> on a few tables to achieve the effect of actual insertion of records.
>> Is it not the case that the LOAD causes the data to get inserted into
>> Hive? Based on that I'd like to understand whether we can get away with
>> using LOAD INPATH instead of INSERT/SELECT FROM.
>>
>> On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu <dmarkov...@paypal.com>
>> wrote:
>>
>> I just want to verify that you understand the following:
>>
>> - LOAD DATA INPATH is just an HDFS file movement operation.
>>   You can achieve the same results by using *hdfs dfs -mv …*
>>
>> - LOAD DATA LOCAL INPATH is just a file copying operation from the local
>>   file system to HDFS.
>>   You can achieve the same results by using *hdfs dfs -put …*
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 7:48 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Dudu,
>>
>> This is still in the design stages, so we have a way to get the data
>> from its source. The data is *not* in the Parquet format. It's up to us
>> to format it the best and most efficient way. We can roll with CSV or
>> Parquet; ultimately the data must make it into a pre-defined PARQUET,
>> PARTITIONED table in Hive.
>>
>> Thanks,
>> - Dmitry
>>
>> On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu <dmarkov...@paypal.com>
>> wrote:
>>
>> Are your files already in Parquet format?
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 7:03 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Thanks, Dudu.
>>
>> Just to reiterate; the way I'm reading your response is that yes, we can
>> use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data
>> in the delimited file is properly formatted.
>> Then we can LOAD it into the table (mytable in my example) directly and
>> avoid the creation of the temp table (origtable in my example). Correct
>> so far?
>>
>> I did not quite follow the latter part of your response:
>>
>>     You should only create an external table which is an interface to
>>     read the files and use it in an INSERT operation.
>>
>> My assumption was that we would LOAD INPATH and not have to use INSERT
>> altogether. Am I missing something in grokking this latter part of your
>> response?
>>
>> Thanks,
>> - Dmitry
>>
>> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu <dmarkov...@paypal.com>
>> wrote:
>>
>> Since LOAD DATA INPATH only moves files, the answer is very simple.
>> If your files are already in a format that matches the destination table
>> (storage type, number and types of columns, etc.) then yes; if not, then
>> no.
>>
>> But you don't need to load the files into an intermediary table.
>> You should only create an external table which is an interface to read
>> the files and use it in an INSERT operation.
>>
>> Dudu
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 4:52 PM
>> *To:* user@hive.apache.org
>> *Subject:* Is it possible to use LOAD DATA INPATH with a PARTITIONED,
>> STORED AS PARQUET table?
>>
>> We have a table such as the following defined:
>>
>> CREATE TABLE IF NOT EXISTS db.mytable (
>>   `item_id` string,
>>   `timestamp` string,
>>   `item_comments` string)
>> PARTITIONED BY (`date` string, `content_type` string)
>> STORED AS PARQUET;
>>
>> Currently we insert data into this PARQUET, PARTITIONED table as
>> follows, using an intermediary table:
>>
>> INSERT INTO TABLE db.mytable PARTITION(date, content_type)
>> SELECT itemid as item_id, itemts as timestamp, date, content_type
>> FROM db.origtable
>> WHERE date = "${SELECTED_DATE}"
>> GROUP BY item_id, date, content_type;
>>
>> Our question is, would it be possible to use the LOAD DATA INPATH.. INTO
>> TABLE syntax to load the data from delimited data files into 'mytable'
>> rather than populating mytable from the intermediary table?
>>
>> I see in the Hive documentation that:
>>
>> * Load operations are currently pure copy/move operations that move
>> datafiles into locations corresponding to Hive tables.
>> * If the table is partitioned, then one must specify a specific
>> partition of the table by specifying values for all of the partitioning
>> columns.
>>
>> This seems to indicate that using LOAD is possible; however, looking at
>> this discussion:
>> http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables
>> perhaps not?
>>
>> We'd like to understand if using LOAD in the case of PARQUET,
>> PARTITIONED tables is possible and, if so, how does one go about using
>> LOAD in that case?
>>
>> Thanks,
>> - Dmitry
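Pulling the thread's answer to the original Subject question into one sketch: LOAD DATA can target a partitioned Parquet table, but only if the files are already valid Parquet matching the table's schema, and only with a static partition spec, since (per the grokbase discussion linked above) LOAD DATA does not do dynamic partitioning. The path and partition values below are illustrative assumptions, not taken from the thread.

```sql
-- Only valid if the files under this path are already Parquet files whose
-- columns match db.mytable; LOAD DATA moves them as-is, with no validation,
-- and every partition column must be given a literal value.
LOAD DATA INPATH '/landing/items/2017-04-04/article'
INTO TABLE db.mytable
PARTITION (`date` = '2017-04-04', content_type = 'article');
```

One such statement is needed per (date, content_type) partition, which is why the thread converges on the external/staging table plus INSERT ... SELECT approach when the input is delimited text.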