Prasad and Jorn,

Prasad’s fix did the trick!
Creating an output table and using INSERT OVERWRITE worked!
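For anyone hitting the same issue, the result can be sanity-checked with parquet-tools (the same tool used to inspect the output further down this thread). The file name below is illustrative; Hive names its output files 000000_0, 000001_0, and so on:

```shell
# Check that the Parquet schema now carries the Hive column names
# (file name is illustrative; Hive writes files like 000000_0)
parquet-tools schema ${OUTPUT}/000000_0

# Spot-check a few records
parquet-tools head ${OUTPUT}/000000_0
```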

Up to this point, what I had been doing was:
- Create the initial Events table
- CREATE TABLE parquet_table ( …. ) STORED AS PARQUET
- INSERT INTO parquet_table SELECT * FROM Events
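In HiveQL, those steps looked roughly like this (a sketch only — the column list is elided here; the full Events DDL appears in the quoted thread below):

```sql
-- Sketch of the approach above: the Parquet table was created
-- directly (not as an external table) and without a LOCATION clause.
CREATE TABLE parquet_table (
    -- ... same columns as Events ...
)
STORED AS PARQUET;

INSERT INTO parquet_table
SELECT * FROM Events;
```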

I wasn’t using LOCATION for the Parquet table either.
Are you able to tell me why the above steps and my example below (where I 
use INSERT OVERWRITE DIRECTORY STORED AS PARQUET) did not work?

Thanks so much by the way!

-Brandon

> On Jan 26, 2018, at 1:09 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Drop the old parquet table first and then create it with explicit 
> statements. The above statement keeps using the old parquet table if it 
> already existed.
> 
> On 26. Jan 2018, at 17:35, Brandon Cooke <brandon.co...@engage.cx> wrote:
> 
>> Hi Prasad,
>> 
>> I actually have tried this and got the same result. 
>> Although I am certainly willing to try again.
>> 
>> Sincerely,
>> 
>> Brandon Cooke
>> 
>>> On Jan 26, 2018, at 11:29 AM, Prasad Nagaraj Subramanya 
>>> <prasadn...@gmail.com> wrote:
>>> 
>>> Hi Brandon,
>>> 
>>> Have you tried creating an external table with the required names for 
>>> parquet -
>>> 
>>> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS_PARQUET(
>>>     `release` STRING,
>>>     `customer` STRING,
>>>     `cookie` STRING,
>>>     `category` STRING,
>>>     `end_time` STRING,
>>>     `start_time` STRING,
>>>     `first_name` STRING,
>>>     `email` STRING,
>>>     `phone` STRING,
>>>     `last_name` STRING,
>>>     `site` STRING,
>>>     `source` STRING,
>>>     `subject` STRING,
>>>     `raw` STRING
>>> )
>>> STORED AS PARQUET
>>> LOCATION '${OUTPUT}';
>>> 
>>> And then inserting data into this table from your csv table -
>>> 
>>> INSERT OVERWRITE TABLE EVENTS_PARQUET SELECT * FROM EVENTS;
>>> 
>>> This will create a parquet file at the specified location (${OUTPUT})
>>> 
>>> Thanks,
>>> Prasad
>>> 
>>> On Fri, Jan 26, 2018 at 7:45 AM, Brandon Cooke 
>>> <brandon.co...@engage.cx> wrote:
>>> Hello,
>>> 
>>> I posted the following on a Cloudera forum but haven’t had much luck.
>>> I’m hoping someone here can tell me what step I have probably missed:
>>> 
>>> Hello,
>>>  
>>> I'm using HIVE (v1.2.1) to convert our data files from CSV into Parquet for 
>>> use in AWS Athena.
>>> However, no matter what I try, the resulting Parquet always has column 
>>> titles [_col0, _col1, ..., _colN]
>>>  
>>> After researching, I read that the line SET 
>>> parquet.column.index.access=false was supposed to allow Parquet to use 
>>> the column titles of my HIVE table; however, it has been unsuccessful so 
>>> far.
>>>  
>>> Below is an example script I use to create the Parquet from data
>>>  
>>> SET parquet.column.index.access=false;
>>> 
>>> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
>>>     `release` STRING,
>>>     `customer` STRING,
>>>     `cookie` STRING,
>>>     `category` STRING,
>>>     `end_time` STRING,
>>>     `start_time` STRING,
>>>     `first_name` STRING,
>>>     `email` STRING,
>>>     `phone` STRING,
>>>     `last_name` STRING,
>>>     `site` STRING,
>>>     `source` STRING,
>>>     `subject` STRING,
>>>     `raw` STRING
>>> )
>>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>>> LOCATION '${INPUT}';
>>> 
>>> INSERT OVERWRITE DIRECTORY '${OUTPUT}/parquet'
>>> STORED AS PARQUET
>>> SELECT *
>>> FROM EVENTS;
>>>  
>>> Using parquet-tools, I read the resulting file and below is an example 
>>> output:
>>>  
>>> _col0 = 0.1.2
>>> _col1 = customer1
>>> _col2 = NULL
>>> _col3 = api
>>> _col4 = 2018-01-21T06:57:57Z 
>>> _col5 = 2018-01-21T06:57:56Z 
>>> _col6 = Brandon
>>> _col7 = bran...@fakesite.com
>>> _col8 = 999-999-9999
>>> _col9 = Pompei
>>> _col10 = Boston
>>> _col11 = Wifi
>>> _col12 = NULL
>>> _col13 = 
>>> eyJlbmdhZ2VtZW50TWVkaXVtIjoibm9uZSIsImVudHJ5UG9pbnRJZCI6ImQ5YjYwN2UzLTFlN2QtNGY1YS1iZWQ4LWQ4Yjk3NmRkZTQ3MiIsIkVDWF9FVkVOVF9DQVRFR09SWV9BUElfTkFNRSI6IkVDWF9FQ19TSVRFVFJBQ0tfU0lURV9WSVNJVCIsIkVDWF9TSVRFX1JFR0lPTl9BUElfTkFNRSI
>>>  
>>> This is problematic because it is impossible to transfer it to an Athena 
>>> table (or even back to HIVE) without using these index-based column titles. 
>>> I need HIVE's column titles to transfer over to the Parquet file. 
>>>  
>>> I've searched for a very long time and have come up short. Am I doing 
>>> something wrong? 
>>> Please let me know if I can provide more information. Thank you!
>>> 
>>> I appreciate your time.
>>> Sincerely,
>>> 
>>> Brandon Cooke
>>> Software Engineer
>>> engage.cx
>>> 5500 Interstate N Parkway Suite 130
>> 
