Dudu,

This is still in the design stages, so we have a way to get the data from its source. The data is *not* in Parquet format. It's up to us to format it in the best and most efficient way. We can roll with CSV or Parquet; ultimately the data must make it into a pre-defined PARQUET, PARTITIONED table in Hive.
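A rough sketch of the two routes discussed in this thread, to make the trade-off concrete. The staging table name (db.mytable_staging), the HDFS paths, the comma delimiter, the string types of the partition columns, and the partition values are all hypothetical placeholders, not details of our actual setup.

```sql
-- Route 1: source files are delimited text (e.g. CSV).
-- An external table is only a read interface over the files; nothing is copied.
-- Table name, delimiter, and location are assumptions for illustration.
CREATE EXTERNAL TABLE IF NOT EXISTS db.mytable_staging (
  `item_id`       string,
  `timestamp`     string,
  `item_comments` string,
  `date`          string,   -- assumed type
  `content_type`  string)   -- assumed type
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/incoming/mytable';

-- Dynamic partitioning lets Hive derive the partitions from the data.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The INSERT rewrites the rows as Parquet; the partition columns must come
-- last in the SELECT, in the same order as in the PARTITION clause.
INSERT INTO TABLE db.mytable PARTITION (`date`, `content_type`)
SELECT `item_id`, `timestamp`, `item_comments`, `date`, `content_type`
FROM db.mytable_staging;

-- Route 2: source files are already Parquet with a matching schema.
-- LOAD only moves the files, so every partition value must be given
-- explicitly (values below are made up).
LOAD DATA INPATH '/data/incoming/parquet/2017-04-04/article'
INTO TABLE db.mytable
PARTITION (`date`='2017-04-04', `content_type`='article');
```

Route 1 costs one INSERT job but accepts any delimited input; Route 2 avoids the job but requires the upstream process to write correct Parquet files, one directory per partition.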
Thanks,
- Dmitry

On Tue, Apr 4, 2017 at 12:20 PM, Markovitz, Dudu <dmarkov...@paypal.com> wrote:

> Are your files already in Parquet format?
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 7:03 PM
> *To:* user@hive.apache.org
> *Subject:* Re: Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
> Thanks, Dudu.
>
> Just to re-iterate; the way I'm reading your response is that yes, we can
> use LOAD INPATH for a PARQUET, PARTITIONED table, provided that the data in
> the delimited file is properly formatted. Then we can LOAD it into the
> table (mytable in my example) directly and avoid the creation of the temp
> table (origtable in my example). Correct so far?
>
> I did not quite follow the latter part of your response:
>
> >> You should only create an external table which is an interface to read
> the files and use it in an INSERT operation.
>
> My assumption was that we would LOAD INPATH and not have to use INSERT
> at all. Am I missing something in grokking this latter part of your
> response?
>
> Thanks,
> - Dmitry
>
> On Tue, Apr 4, 2017 at 11:26 AM, Markovitz, Dudu <dmarkov...@paypal.com>
> wrote:
>
> Since LOAD DATA INPATH only moves files the answer is very simple.
>
> If your files are already in a format that matches the destination table
> (storage type, number and types of columns etc.) then – yes and if not,
> then – no.
>
> But –
>
> You don't need to load the files into an intermediary table.
>
> You should only create an external table which is an interface to read the
> files and use it in an INSERT operation.
>
> Dudu
>
> *From:* Dmitry Goldenberg [mailto:dgoldenb...@hexastax.com]
> *Sent:* Tuesday, April 04, 2017 4:52 PM
> *To:* user@hive.apache.org
> *Subject:* Is it possible to use LOAD DATA INPATH with a PARTITIONED,
> STORED AS PARQUET table?
>
> We have a table such as the following defined:
>
> CREATE TABLE IF NOT EXISTS db.mytable (
> `item_id` string,
> `timestamp` string,
> `item_comments` string)
> PARTITIONED BY (`date`, `content_type`)
> STORED AS PARQUET;
>
> Currently we insert data into this PARQUET, PARTITIONED table as follows,
> using an intermediary table:
>
> INSERT INTO TABLE db.mytable PARTITION(date, content_type)
> SELECT itemid as item_id, itemts as timestamp, date, content_type
> FROM db.origtable
> WHERE date = "${SELECTED_DATE}"
> GROUP BY item_id, date, content_type;
>
> Our question is, would it be possible to use the LOAD DATA INPATH.. INTO
> TABLE syntax to load the data from delimited data files into 'mytable'
> rather than populating mytable from the intermediary table?
>
> I see in the Hive documentation that:
>
> * Load operations are currently pure copy/move operations that move
> datafiles into locations corresponding to Hive tables.
>
> * If the table is partitioned, then one must specify a specific partition
> of the table by specifying values for all of the partitioning columns.
>
> This seems to indicate that using LOAD is possible; however looking at
> this discussion:
> http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables,
> perhaps not?
>
> We'd like to understand if using LOAD in the case of PARQUET, PARTITIONED
> tables is possible and if so, then how does one go about using LOAD in that
> case?
>
> Thanks,
> - Dmitry