Hi Dmitry,

If I understand what you said correctly:
At the beginning you have csv files on HDFS, and at the end you want a
partitioned Hive table stored as Parquet. And your question is: "can I do
this using only one Hive table and a LOAD statement?"

The answer to that question is "no". The correct way to do this is what you
are already doing:

- Create a table db.origtable STORED AS TEXTFILE and use LOAD DATA to put
  the csv data in it (the files will stay in csv).
- Create a table db.mytable STORED AS PARQUET and then use Hive to insert
  data into it from db.origtable.

Hive is only able to:
- query data from a Hive table
- write into another Hive table
- load data "as is" into a table, without changing its format

The LOAD DATA statement cannot transform your data, nor change its format.
If you want to transform your data and/or change its format, you need a
regular INSERT INTO|OVERWRITE TABLE ... SELECT query. By the way, I would
recommend using INSERT OVERWRITE to make your query idempotent and avoid
issues if you accidentally run it twice.

Now, if you really want to have only *one* Hive table, I guess you can
either:
- make table db.origtable temporary,
- directly use Spark to read the csv files and insert into Hive as Parquet,
- or, if you can, write your input data directly as Parquet instead of csv;
  but you would still need two tables if you also need to perform a GROUP BY.

Hope this helps.

On Thu, Apr 6, 2017 at 2:48 PM, Dmitry Goldenberg <dgoldenb...@hexastax.com> wrote:
> Thanks, Ryan.
>
> I was actually more curious about scenario B. If we represent our data as
> delimited files, why don't we just use LOAD DATA INPATH and load it right
> into the final, parquet, partitioned table in one step, bypassing the temp
> table?
>
> Are there any advantages to having a temp table besides the validation?
> One advantage could possibly be making it a transactional table and being
> able to run direct INSERTs into the temp table, avoiding having to deal
> with delimited files and LOAD DATA INPATH.
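For reference, the two-table flow recommended at the top of this thread can be sketched roughly as follows. This is only an illustrative sketch: the staging column names (itemid, itemts, itemcomments), the field delimiter, and the HDFS path are assumptions, loosely based on the example tables quoted later in the thread.

```sql
-- Staging table: a plain-text interface to the raw csv files.
CREATE TABLE IF NOT EXISTS db.origtable (
  itemid       STRING,
  itemts       STRING,
  itemcomments STRING,
  `date`       STRING,
  content_type STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Pure file move into the staging table's warehouse location;
-- the files stay csv, nothing is parsed or verified here.
LOAD DATA INPATH '/landing/items/2017-04-04' INTO TABLE db.origtable;

-- The actual conversion to Parquet (and the partitioning) happens
-- here, in the INSERT ... SELECT, not in the LOAD.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE db.mytable PARTITION (`date`, content_type)
SELECT itemid AS item_id,
       itemts AS `timestamp`,
       itemcomments AS item_comments,
       `date`,
       content_type
FROM db.origtable
WHERE `date` = '2017-04-04';
```

INSERT OVERWRITE (rather than INSERT INTO) keeps the step idempotent: re-running it for the same date replaces the affected partitions instead of duplicating rows.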
>
> If we go with route B, LOAD DATA INPATH directly into the parquet,
> partitioned table, would we have to:
>
> 1) represent the input files as Parquet? - it looks like the data is still
> delimited, and Hive represents it as Parquet internally
> 2) do anything specific in the input files / with the input files in order
> to make partitioning work, or does Hive just take the data and take full
> care of partitioning it?
>
> On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris <ryan.har...@zionsbancorp.com>
> wrote:
>
>> For A) I'd recommend mapping an EXTERNAL table to the raw/original source
>> files…then you can just run a SELECT query from the EXTERNAL source and
>> INSERT into your destination.
>>
>> LOAD DATA can be very useful when you are trying to move data between two
>> tables that share the same schema but one table is partitioned and the
>> other is NOT partitioned…once the files have been inserted into the
>> unpartitioned table, the source files from the hive warehouse can be
>> added to the partitioned table using LOAD DATA. Another place I've
>> frequently used LOAD DATA is when synchronizing hive table data between
>> two clusters: the hive warehouse data files can be copied from one
>> cluster to the other with distcp, and then loaded on the duplicate
>> cluster using LOAD DATA to ensure the metadata is recorded in the hive
>> metastore.
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 3:31 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Right, that makes sense, Dudu.
>>
>> So basically, if we have our data in "some form", and a goal of loading
>> it into a parquet, partitioned table in Hive, we have two choices:
>>
>> A. Load this data into a temporary table first.
>> Presumably, for this we should be able to do a LOAD INPATH from
>> delimited data files. Perhaps we could designate the temp table as
>> transactional and then simply do direct INSERTs into this temp table - ?
>> Then, as the second step, we'd do an INSERT... SELECT to move the data
>> into the destination table, and then DROP the temp table.
>>
>> B. Represent the data as a delimited format and do a LOAD INPATH directly
>> into the destination table. Understandably, we lose the 'data
>> verification' this way. If we go this route, must the data in the input
>> files be in the PARQUET format or in a delimited format? I would guess
>> the former. And how does partitioning play into it? How would the input
>> data need to be organized and inserted so as to adhere to the partitions
>> (the 'date' and 'content-type' columns, in my example)?
>>
>> On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu <dmarkov...@paypal.com>
>> wrote:
>>
>> "LOAD" is very misleading here. It is all done at the metadata level.
>> The data is not being touched. The data is not being verified. The
>> "system" does not have any clue whether the files' format matches the
>> table definition and whether they can actually be used.
>>
>> The data files are being "moved" (again, a metadata operation) from
>> their current HDFS location to the location defined for the table.
>> Later on, when you query the table, the files will be scanned. If they
>> are in the right format you'll get results. If not, then no.
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 8:54 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH
>> on a few tables to achieve the effect of actual insertion of records.
>> Is it not the case that the LOAD causes the data to get inserted into
>> Hive? Based on that I'd like to understand whether we can get away with
>> using LOAD INPATH instead of INSERT/SELECT FROM.
>>
>> On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu <dmarkov...@paypal.com>
>> wrote:
>>
>> I just want to verify that you understand the following:
>>
>> - LOAD DATA INPATH is just an HDFS file movement operation.
>>   You can achieve the same results by using *hdfs dfs -mv …*
>>
>> - LOAD DATA LOCAL INPATH is just a file copying operation from the local
>>   file system to HDFS.
>>   You can achieve the same results by using *hdfs dfs -put …*
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 7:48 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Dudu,
>>
>> This is still in the design stages, so we have a way to get the data
>> from its source. The data is *not* in the Parquet format. It's up to us
>> to format it the best and most efficient way. We can roll with CSV or
>> Parquet; ultimately the data must make it into a pre-defined PARQUET,
>> PARTITIONED table in Hive.
>>
>> Thanks,
>> - Dmitry
>>
>> On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu <dmarkov...@paypal.com>
>> wrote:
>>
>> Are your files already in Parquet format?
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 7:03 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Thanks, Dudu.
>>
>> Just to reiterate; the way I'm reading your response is that yes, we can
>> use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data
>> in the delimited file is properly formatted.
>> Then we can LOAD it into the table (mytable in my example) directly and
>> avoid the creation of the temp table (origtable in my example). Correct
>> so far?
>>
>> I did not quite follow the latter part of your response:
>>
>>     You should only create an external table which is an interface to
>>     read the files and use it in an INSERT operation.
>>
>> My assumption was that we would LOAD INPATH and not have to use INSERT
>> altogether. Am I missing something in grokking this latter part of your
>> response?
>>
>> Thanks,
>> - Dmitry
>>
>> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu <dmarkov...@paypal.com>
>> wrote:
>>
>> Since LOAD DATA INPATH only moves files, the answer is very simple.
>> If your files are already in a format that matches the destination table
>> (storage type, number and types of columns, etc.) then yes; if not, then
>> no.
>>
>> But you don't need to load the files into an intermediary table.
>> You should only create an external table which is an interface to read
>> the files and use it in an INSERT operation.
>>
>> Dudu
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 4:52 PM
>> *To:* user@hive.apache.org
>> *Subject:* Is it possible to use LOAD DATA INPATH with a PARTITIONED,
>> STORED AS PARQUET table?
>>
>> We have a table such as the following defined:
>>
>> CREATE TABLE IF NOT EXISTS db.mytable (
>>   `item_id` string,
>>   `timestamp` string,
>>   `item_comments` string)
>> PARTITIONED BY (`date` string, `content_type` string)
>> STORED AS PARQUET;
>>
>> Currently we insert data into this PARQUET, PARTITIONED table as
>> follows, using an intermediary table:
>>
>> INSERT INTO TABLE db.mytable PARTITION(date, content_type)
>> SELECT itemid as item_id, itemts as timestamp, date, content_type
>> FROM db.origtable
>> WHERE date = "${SELECTED_DATE}"
>> GROUP BY item_id, date, content_type;
>>
>> Our question is, would it be possible to use the LOAD DATA INPATH.. INTO
>> TABLE syntax to load the data from delimited data files into 'mytable'
>> rather than populating mytable from the intermediary table?
>>
>> I see in the Hive documentation that:
>>
>> * Load operations are currently pure copy/move operations that move
>> datafiles into locations corresponding to Hive tables.
>> * If the table is partitioned, then one must specify a specific
>> partition of the table by specifying values for all of the partitioning
>> columns.
>>
>> This seems to indicate that using LOAD is possible; however, looking at
>> this discussion:
>> http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables
>> perhaps not?
>>
>> We'd like to understand if using LOAD in the case of PARQUET,
>> PARTITIONED tables is possible and, if so, how does one go about using
>> LOAD in that case?
>>
>> Thanks,
>> - Dmitry
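Pulling the thread's answer to the original Subject question into one sketch: LOAD DATA can target a partitioned Parquet table, but only if the files are already valid Parquet matching the table's schema, and only with a static partition spec, since (per the grokbase discussion linked above) LOAD DATA does not do dynamic partitioning. The path and partition values below are illustrative assumptions, not taken from the thread.

```sql
-- Only valid if the files under this path are already Parquet files whose
-- columns match db.mytable; LOAD DATA moves them as-is, with no validation,
-- and every partition column must be given a literal value.
LOAD DATA INPATH '/landing/items/2017-04-04/article'
INTO TABLE db.mytable
PARTITION (`date` = '2017-04-04', content_type = 'article');
```

One such statement is needed per (date, content_type) partition, which is why the thread converges on the external/staging table plus INSERT ... SELECT approach when the input is delimited text.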