I'm assuming, given this:

CREATE TABLE IF NOT EXISTS db.mytable (
  `item_id` string,
  `timestamp` string,
  `item_comments` string)
PARTITIONED BY (`date` string, `content_type` string)
STORED AS PARQUET;

we'd have to organize the input Parquet files into subdirectories, where each subdirectory contains the data just for the given 'date' (YYMMDD), and then within that subdirectory the content would be organized by content_type, one file per content_type value.
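For example, I'm picturing an input layout something like this (paths purely illustrative):

/data/input/170406/Presentation
/data/input/170406/Report
/data/input/170407/Presentation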
How does Hive make the association between a partition and a subdirectory name, or know to look for files for content_type? And how would it match content_type='Presentation' -- would the file just need to be named "Presentation"?

On Thu, Apr 6, 2017 at 5:05 PM, Dmitry Goldenberg <dgoldenb...@hexastax.com> wrote:

> >> properly split and partition your data before using LOAD if you want
> >> hive to be able to find it again.
>
> If the destination table is defined as
>
> CREATE TABLE IF NOT EXISTS db.mytable (
>   `item_id` string,
>   `timestamp` string,
>   `item_comments` string)
> PARTITIONED BY (`date` string, `content_type` string)
> STORED AS PARQUET;
>
> and supposing that we have the data "in hand" (in memory or as CSV files),
> how does one go about the 'proper split and partition' so that it adheres
> to PARTITIONED BY (`date`, `content_type`)?
>
> Thanks
>
> On Thu, Apr 6, 2017 at 12:29 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote:
>
>> “If we represent our data as delimited files” … the question is how you
>> plan on getting your data into these parquet files, since it doesn’t sound
>> like your data is already in that format.
>>
>> If your data is not already in parquet format, you are going to need to
>> run **some** process to get it into that format… why not just use hive
>> (running a query on an external table) to perform the conversion?
>>
>> “and Hive represents it as Parquet internally” -- that entirely depends
>> on the declared STORED AS format when you define the table. The files
>> backing the hive table **could** be TEXT, sequence, RC, ORC, Parquet… If
>> you declared the table to be backed by delimited text, you could format
>> your data into standard text files (not parquet) and then add the data to
>> the hive table using LOAD DATA.
>>
>> So, why NOT use text data for the table storage? Because there is no way
>> to optimize future queries against that data.
>>
>> One hypothetical workflow, assuming that your data is currently delimited:
>>
>> You could either have a hive managed table, with the table data stored as
>> TEXTFILE using some delimiter-based SerDe, and then use LOAD DATA to put
>> your original raw files into this table. OR, you could use an external
>> table (not managed by hive) to point to the data wherever it currently
>> resides. (The only difference between the two choices here is whether the
>> original raw files end up in ‘/user/hive/warehouse/tablename’ or stay in
>> the current HDFS path where they reside.)
>>
>> From there, you could query FROM that temp table, INSERT into your final
>> destination table, and the data will be formatted according to the data
>> definition of your destination table.
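>> For example, something along these lines (the staging table name, columns,
>> and path here are purely illustrative, assuming tab-delimited source files):
>>
>> -- external "interface" table over the delimited staging files (illustrative names)
>> CREATE EXTERNAL TABLE db.staging_raw (
>>   itemid string,
>>   itemts string,
>>   comments string,
>>   dt string,
>>   ctype string)
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>> STORED AS TEXTFILE
>> LOCATION '/data/staging/raw';
>>
>> -- allow fully dynamic partition values for the INSERT below
>> SET hive.exec.dynamic.partition=true;
>> SET hive.exec.dynamic.partition.mode=nonstrict;
>>
>> -- hive rewrites the rows as parquet and creates the partition directories itself
>> INSERT INTO TABLE db.mytable PARTITION (`date`, content_type)
>> SELECT itemid, itemts, comments, dt, ctype
>> FROM db.staging_raw;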
>> If you want to (for whatever reason) use LOAD DATA INPATH to shove the
>> original data directly into your final destination table, you must:
>>
>> 1) Ensure that the data is formatted into parquet files that are
>> compatible with the version of hive that you are running. The parquet
>> format has been used by a number of different projects; unfortunately,
>> there are different versions of parquet, and it cannot be taken for
>> granted that any parquet file will be compatible with the version of hive
>> you are using. Testing and validation is required… see
>> https://github.com/Parquet/parquet-compatibility
>>
>> 2) Parquet files have internal partitioning to them, but from hive’s
>> perspective, hive partitions will still be separated into individual
>> directories. You’ll need to ensure that you properly split and partition
>> your data before using LOAD if you want hive to be able to find it again.
>>
>> It doesn’t sound like your source data is currently formatted to match
>> your hive table formatting. If you are already processing the data with a
>> spark pipeline, and you just happened to set the output of that processing
>> to be delimited text, and you can just as easily change it to something
>> that is compatible with your hive table… then that may make sense to do.
>> However, if you are going to require a separate processing step to convert
>> the data from delimited text to hive-compatible parquet, I don’t see a
>> reason to use any tool OTHER than hive to perform that conversion.
>>
>> LOAD DATA is generally used in situations where you **know** that the
>> data format is already 100% exactly compatible with your destination
>> table… which most often occurs when the source of the data is the raw data
>> backing an existing hive managed table (possibly copied/moved from a
>> different cluster).
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Thursday, April 06, 2017 6:48 AM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Thanks, Ryan.
>>
>> I was actually more curious about scenario B. If we represent our data as
>> delimited files, why don't we just use LOAD DATA INPATH and load it right
>> into the final, parquet, partitioned table in one step, bypassing dealing
>> with the temp table?
>>
>> Are there any advantages to having a temp table besides the validation?
>> One advantage could possibly be making it a transactional table and being
>> able to run direct INSERTs into the temp table, avoiding having to deal
>> with delimited files and LOAD DATA INPATH.
>>
>> If we go with route B, LOAD DATA INPATH directly into the parquet,
>> partitioned table, would we have to:
>>
>> 1) represent the input files as Parquet? - it looks like the data is
>> still delimited, and Hive represents it as Parquet internally
>>
>> 2) do anything specific in the input files / with the input files in
>> order to make partitioning work, or does Hive just take the data and take
>> full care of partitioning it?
>>
>> On Tue, Apr 4, 2017 at 6:14 PM, Ryan Harris <ryan.har...@zionsbancorp.com> wrote:
>>
>> For A) I’d recommend mapping an EXTERNAL table to the raw/original source
>> files… then you can just run a SELECT query from the EXTERNAL source and
>> INSERT into your destination.
>>
>> LOAD DATA can be very useful when you are trying to move data between two
>> tables that share the same schema, but one table is partitioned and the
>> other table is NOT partitioned… once the files have been inserted into the
>> unpartitioned table, the source files from the hive warehouse can be added
>> to the partitioned table using LOAD DATA.
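>> For example, something like this (the table name, file path, and partition
>> values are purely illustrative):
>>
>> -- move a file that hive wrote for the unpartitioned table into a specific
>> -- partition of the partitioned, same-schema table (illustrative names)
>> LOAD DATA INPATH '/user/hive/warehouse/mytable_unpart/000000_0'
>> INTO TABLE db.mytable PARTITION (`date`='170406', content_type='Presentation');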
>> Another place I’ve frequently used LOAD DATA is when synchronizing hive
>> table data between two clusters: the hive warehouse data files can be
>> copied from one cluster to the other with distcp, and then the data files
>> are loaded on the duplicate cluster using LOAD DATA, to ensure the
>> metadata is recorded in the hive metastore.
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 3:31 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Right, that makes sense, Dudu.
>>
>> So basically, if we have our data in "some form", and a goal of loading
>> it into a parquet, partitioned table in Hive, we have two choices:
>>
>> A. Load this data into a temporary table first. Presumably, for this we
>> should be able to do a LOAD INPATH from delimited data files. Perhaps we
>> could designate the temp table as transactional and then simply do direct
>> INSERTs into this temp table? Then, as the second step, we'd do an
>> INSERT ... SELECT to move the data into the destination table, and then
>> DROP the temp table.
>>
>> B. Represent the data in a delimited format and do a LOAD INPATH directly
>> into the destination table. Understandably, we lose the 'data
>> verification' this way. If we go this route, must the data in the input
>> files be in the PARQUET format or in a delimited format? I would guess the
>> former. And how does partitioning play into it? How would the input data
>> need to be organized and inserted so as to adhere to the partitions (the
>> 'date' and 'content_type' columns, in my example)?
>>
>> On Tue, Apr 4, 2017 at 2:22 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>>
>> “LOAD” is very misleading here. It is all done at the metadata level.
>> The data is not being touched. The data is not being verified. The
>> “system” does not have any clue whether the files' format matches the
>> table definition and whether they can actually be used.
>> The data files are being “moved” (again, a metadata operation) from
>> their current HDFS location to the location defined for the table.
>> Later on, when you query the table, the files will be scanned. If they
>> are in the right format, you’ll get results. If not, then no.
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 8:54 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Thanks, Dudu. I think there's a disconnect here. We're using LOAD INPATH
>> on a few tables to achieve the effect of actual insertion of records. Is
>> it not the case that the LOAD causes the data to get inserted into Hive?
>>
>> Based on that, I'd like to understand whether we can get away with using
>> LOAD INPATH instead of INSERT/SELECT FROM.
>>
>> On Apr 4, 2017, at 1:43 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>>
>> I just want to verify that you understand the following:
>>
>> · LOAD DATA INPATH is just an HDFS file movement operation.
>> You can achieve the same results by using *hdfs dfs -mv …*
>>
>> · LOAD DATA LOCAL INPATH is just a file copying operation from
>> the shell to HDFS.
>> You can achieve the same results by using *hdfs dfs -put …*
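>> For example (all paths and partition values here purely illustrative),
>> this statement:
>>
>> -- illustrative paths and partition values
>> LOAD DATA INPATH '/staging/170406_presentation'
>> INTO TABLE db.mytable PARTITION (`date`='170406', content_type='Presentation');
>>
>> roughly boils down to:
>>
>> # illustrative: move the staged files under the partition's warehouse directory
>> hdfs dfs -mv /staging/170406_presentation/* \
>>   /user/hive/warehouse/db.db/mytable/date=170406/content_type=Presentation/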
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 7:48 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Dudu,
>>
>> This is still in the design stages, so we have a way to get the data from
>> its source. The data is *not* in the Parquet format. It's up to us to
>> format it in the best and most efficient way. We can roll with CSV or
>> Parquet; ultimately the data must make it into a pre-defined PARQUET,
>> PARTITIONED table in Hive.
>>
>> Thanks,
>> - Dmitry
>>
>> On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>>
>> Are your files already in Parquet format?
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 7:03 PM
>> *To:* user@hive.apache.org
>> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a
>> PARTITIONED, STORED AS PARQUET table?
>>
>> Thanks, Dudu.
>>
>> Just to reiterate: the way I'm reading your response is that yes, we can
>> use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data
>> in the delimited file is properly formatted. Then we can LOAD it into the
>> table (mytable in my example) directly and avoid the creation of the temp
>> table (origtable in my example). Correct so far?
>>
>> I did not quite follow the latter part of your response:
>>
>> You should only create an external table which is an interface to
>> read the files and use it in an INSERT operation.
>>
>> My assumption was that we would LOAD INPATH and not have to use INSERT
>> altogether. Am I missing something in grokking this latter part of your
>> response?
>>
>> Thanks,
>> - Dmitry
>>
>> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:
>>
>> Since LOAD DATA INPATH only moves files, the answer is very simple.
>> If your files are already in a format that matches the destination table
>> (storage type, number and types of columns, etc.) then – yes, and if not,
>> then – no.
>>
>> But –
>> You don’t need to load the files into an intermediary table.
>> You should only create an external table, which is an interface to read
>> the files, and use it in an INSERT operation.
>>
>> Dudu
>>
>> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
>> *Sent:* Tuesday, April 04, 2017 4:52 PM
>> *To:* user@hive.apache.org
>> *Subject:* Is it possible to use LOAD DATA INPATH with a PARTITIONED,
>> STORED AS PARQUET table?
>>
>> We have a table such as the following defined:
>>
>> CREATE TABLE IF NOT EXISTS db.mytable (
>>   `item_id` string,
>>   `timestamp` string,
>>   `item_comments` string)
>> PARTITIONED BY (`date` string, `content_type` string)
>> STORED AS PARQUET;
>>
>> Currently we insert data into this PARQUET, PARTITIONED table as follows,
>> using an intermediary table:
>>
>> INSERT INTO TABLE db.mytable PARTITION(`date`, content_type)
>> SELECT itemid AS item_id, itemts AS `timestamp`, `date`, content_type
>> FROM db.origtable
>> WHERE `date` = "${SELECTED_DATE}"
>> GROUP BY itemid, itemts, `date`, content_type;
>>
>> Our question is: would it be possible to use the LOAD DATA INPATH ... INTO
>> TABLE syntax to load the data from delimited data files into 'mytable'
>> rather than populating mytable from the intermediary table?
>>
>> I see in the Hive documentation that:
>>
>> * Load operations are currently pure copy/move operations that move
>> datafiles into locations corresponding to Hive tables.
>>
>> * If the table is partitioned, then one must specify a specific partition
>> of the table by specifying values for all of the partitioning columns.
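>> Presumably that would mean one LOAD per partition, something like the
>> following (path and partition values hypothetical):
>>
>> -- hypothetical: one LOAD per (date, content_type) partition
>> LOAD DATA INPATH '/tmp/staged/170406/presentations'
>> INTO TABLE db.mytable PARTITION (`date`='170406', content_type='Presentation');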
>> This seems to indicate that using LOAD is possible; however, looking at
>> this discussion:
>> http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables
>> perhaps not?
>>
>> We'd like to understand whether using LOAD in the case of PARQUET,
>> PARTITIONED tables is possible, and if so, how does one go about using
>> LOAD in that case?
>>
>> Thanks,
>> - Dmitry