For A) I’d recommend mapping an EXTERNAL table to the raw/original source files…then you can just run a SELECT query from the EXTERNAL source and INSERT into your destination.
LOAD DATA can be very useful when you are trying to move data between two tables that share the same schema but 1 table is partitioned and the other table is NOT partitioned…once the files have been inserted into the unpartitioned table the source files from the hive warehouse can be added to the partitioned table using LOAD DATA. Another place I’ve frequently used LOAD DATA is when synchronizing hive table data between two clusters, the hive warehouse data files can be copied from one cluster to the other with distcp and then loading the data flies to the duplicate cluster using LOAD DATA to ensure the metadata is recorded in hive metastore. From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com] Sent: Tuesday, April 04, 2017 3:31 PM To: user@hive.apache.org Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table? [External Email] ________________________________ Right, that makes sense, Dudu. So basically, if we have our data in "some form", and a goal of loading it into a parquet, partitioned table in Hive, we have two choices: A. Load this data into a temporary table first. Presumably, for this we should be able to do a LOAD INPATH, from delimited data files. Perhaps we could designate the temp table as transactional and then simply do direct INSERT's into this temp table - ? Then, as the second step, we'd do an INSERT... SELECT, to move the data into the destination table, and then DROP the temp table. B. Represent the data as a delimited format and do a LOAD INPATH directly into the destination table. Understandably, we lose the 'data verification' this way. If we go this route, must the data in the input files be in the PARQUET format or in a delimited format? I would guess, the former. And, how does partitioning play into it? How would the input data need to be organized and inserted so as to adhere to the partitions (the 'date' and 'content-type' columns, in my example)? On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote: “LOAD” is very misleading here. it is all in done the metadata level. The data is not being touched. The data in not being verified. The “system” does not have any clue if the flies format match the table definition and they can be actually used. The data files are being “moved” (again, a metadata operation) from their current HDFS location to the location defined for the table. Later on when you query the table the files will be scanned. If there are in the right format you’ll get results. If not, then no. From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com<mailto:dgoldenb...@hexastax.com>] Sent: Tuesday, April 04, 2017 8:54 PM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table? Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH on a few tables to achieve the effect of actual insertion of records. Is it not the case that the LOAD causes the data to get inserted into Hive? Based on that I'd like to understand whether we can get away with using LOAD INPATH instead of INSERT/SELECT FROM. On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote: I just want to verify that you understand the following: • LOAD DATA INPATH is just a HDFS file movement operation. You can achieve the same results by using hdfs dfs -mv … • LOAD DATA LOCAL INPATH is just a file copying operation from the shell to the HDFS. You can achieve the same results by using hdfs dfs -put … From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com] Sent: Tuesday, April 04, 2017 7:48 PM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table? Dudu, This is still in design stages, so we have a way to get the data from its source. The data is *not* in the Parquet format. It's up to us to format it the best and most efficient way. We can roll with CSV or Parquet; ultimately the data must make it into a pre-defined PARQUET, PARTITIONED table in Hive. Thanks, - Dmitry On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote: Are your files already in Parquet format? From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com<mailto:dgoldenb...@hexastax.com>] Sent: Tuesday, April 04, 2017 7:03 PM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table? Thanks, Dudu. Just to re-iterate; the way I'm reading your response is that yes, we can use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in the delimited file is properly formatted. Then we can LOAD it into the table (mytable in my example) directly and avoid the creation of the temp table (origtable in my example). Correct so far? I did not quite follow the latter part of your response: >> You should only create an external table which is an interface to read the >> files and use it in an INSERT operation. My assumption was that we would LOAD INPATH and not have to use INSERT altogether. Am I missing something in groking this latter part of your response? Thanks, - Dmitry On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu <dmarkov...@paypal.com<mailto:dmarkov...@paypal.com>> wrote: Since LOAD DATA INPATH only moves files the answer is very simple. If you’re files are already in a format that matches the destination table (storage type, number and types of columns etc.) then – yes and if not, then – no. But – You don’t need to load the files into intermediary table. You should only create an external table which is an interface to read the files and use it in an INSERT operation. Dudu From: Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com<mailto:dgoldenb...@hexastax.com>] Sent: Tuesday, April 04, 2017 4:52 PM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: Is it possible to use LOAD DATA INPATH with a PARTITIONED, STORED AS PARQUET table? We have a table such as the following defined: CREATE TABLE IF NOT EXISTS db.mytable ( `item_id` string, `timestamp` string, `item_comments` string) PARTITIONED BY (`date`, `content_type`) STORED AS PARQUET; Currently we insert data into this PARQUET, PARTITIONED table as follows, using an intermediary table: INSERT INTO TABLE db.mytable PARTITION(date, content_type) SELECT itemid as item_id, itemts as timestamp, date, content_type FROM db.origtable WHERE date = “${SELECTED_DATE}” GROUP BY item_id, date, content_type; Our question is, would it be possible to use the LOAD DATA INPATH.. INTO TABLE syntax to load the data from delimited data files into 'mytable' rather than populating mytable from the intermediary table? I see in the Hive documentation that: * Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables. * If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns. This seems to indicate that using LOAD is possible; however looking at this discussion: http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables<https://urldefense.proofpoint.com/v2/url?u=http-3A__grokbase.com_t_hive_user_114frbfg0y_can-2Di-2Duse-2Dhive-2Ddynamic-2Dpartition-2Dwhile-2Dloading-2Ddata-2Dinto-2Dtables&d=DwMFaQ&c=9WYoWBgz3TbmQlstBqb6LDRA8PY_DPmoAS0YWoTLU-g&r=_W3sXrqd7teXL8R6ey10dgFH1GT5KbehFX_EaUG41XM&m=w2-Xt3zXd67KWRPyy83l4Kn5EWquC767DmMpcE5RpgI&s=01kme5ZDH2EBjzLWRz6kJ5jQ9vxr-IzFeNepynsQ7-M&e=>, perhaps not? We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED tables is possible and if so, then how does one go about using LOAD in that case? Thanks, - Dmitry ====================================================================== THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL and may contain information that is privileged and exempt from disclosure under applicable law. If you are neither the intended recipient nor responsible for delivering the message to the intended recipient, please note that any dissemination, distribution, copying or the taking of any action in reliance upon the message is strictly prohibited. If you have received this communication in error, please notify the sender immediately. Thank you.