Drop the old parquet table first, and then create it with explicit statements. 
Because of IF NOT EXISTS, the statement above keeps using the old parquet table if it already existed
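
A minimal sketch of that sequence, reusing the column list from the DDL quoted below (note that dropping an EXTERNAL table only removes the metadata, so any stale files under '${OUTPUT}' would also need to be cleared by hand):

DROP TABLE IF EXISTS EVENTS_PARQUET;

CREATE EXTERNAL TABLE EVENTS_PARQUET(
    `release` STRING,
    `customer` STRING,
    `cookie` STRING,
    `category` STRING,
    `end_time` STRING,
    `start_time` STRING,
    `first_name` STRING,
    `email` STRING,
    `phone` STRING,
    `last_name` STRING,
    `site` STRING,
    `source` STRING,
    `subject` STRING,
    `raw` STRING
)
STORED AS PARQUET
LOCATION '${OUTPUT}';

INSERT OVERWRITE TABLE EVENTS_PARQUET SELECT * FROM EVENTS;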

> On 26. Jan 2018, at 17:35, Brandon Cooke <brandon.co...@engage.cx> wrote:
> 
> Hi Prasad,
> 
> I actually have tried this and I had that same result. 
> Although I am certainly willing to try again.
> 
> Sincerely,
> 
> Brandon Cooke
> 
>> On Jan 26, 2018, at 11:29 AM, Prasad Nagaraj Subramanya 
>> <prasadn...@gmail.com> wrote:
>> 
>> Hi Brandon,
>> 
>> Have you tried creating an external table with the required names for 
>> parquet -
>> 
>> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS_PARQUET(
>>     `release` STRING,
>>     `customer` STRING,
>>     `cookie` STRING,
>>     `category` STRING,
>>     `end_time` STRING,
>>     `start_time` STRING,
>>     `first_name` STRING,
>>     `email` STRING,
>>     `phone` STRING,
>>     `last_name` STRING,
>>     `site` STRING,
>>     `source` STRING,
>>     `subject` STRING,
>>     `raw` STRING
>> )
>> STORED AS PARQUET
>> LOCATION '${OUTPUT}';
>> 
>> And then inserting data into this table from your csv table -
>> 
>> INSERT OVERWRITE TABLE EVENTS_PARQUET SELECT * FROM EVENTS;
>> 
>> This will create a parquet file at the specified location (${OUTPUT})
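>> 
>> As a quick sanity check (assuming parquet-tools is on your path, and that 
>> Hive used its default output file naming), you can confirm the column names 
>> made it into the Parquet schema by reading a file under ${OUTPUT}, e.g. -
>> 
>> parquet-tools schema ${OUTPUT}/000000_0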
>> 
>> Thanks,
>> Prasad
>> 
>>> On Fri, Jan 26, 2018 at 7:45 AM, Brandon Cooke <brandon.co...@engage.cx> 
>>> wrote:
>>> Hello,
>>> 
>>> I posted the following on a Cloudera forum but haven’t had much luck.
>>> I’m hoping someone here can tell me what step I have probably missed:
>>> 
>>> Hello,
>>>  
>>> I'm using HIVE (v1.2.1) to convert our data files from CSV into Parquet for 
>>> use in AWS Athena.
>>> However, no matter what I try, the resulting Parquet always has column 
>>> titles [_col0, _col1, ..., _colN]
>>>  
>>> After researching, I read that the line SET 
>>> parquet.column.index.access=false was supposed to allow Parquet to use 
>>> the column titles of my HIVE table; however, it has been unsuccessful so 
>>> far.
>>>  
>>> Below is an example script I use to create the Parquet from data
>>>  
>>> SET parquet.column.index.access=false;
>>> 
>>> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
>>>     `release` STRING,
>>>     `customer` STRING,
>>>     `cookie` STRING,
>>>     `category` STRING,
>>>     `end_time` STRING,
>>>     `start_time` STRING,
>>>     `first_name` STRING,
>>>     `email` STRING,
>>>     `phone` STRING,
>>>     `last_name` STRING,
>>>     `site` STRING,
>>>     `source` STRING,
>>>     `subject` STRING,
>>>     `raw` STRING
>>> )
>>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>>> LOCATION '${INPUT}';
>>> 
>>> INSERT OVERWRITE DIRECTORY '${OUTPUT}/parquet'
>>> STORED AS PARQUET
>>> SELECT *
>>> FROM EVENTS;
>>>  
>>> Using parquet-tools, I read the resulting file and below is an example 
>>> output:
>>>  
>>> _col0 = 0.1.2
>>> _col1 = customer1
>>> _col2 = NULL
>>> _col3 = api
>>> _col4 = 2018-01-21T06:57:57Z 
>>> _col5 = 2018-01-21T06:57:56Z 
>>> _col6 = Brandon
>>> _col7 = bran...@fakesite.com
>>> _col8 = 999-999-9999
>>> _col9 = Pompei
>>> _col10 = Boston
>>> _col11 = Wifi
>>> _col12 = NULL
>>> _col13 = 
>>> eyJlbmdhZ2VtZW50TWVkaXVtIjoibm9uZSIsImVudHJ5UG9pbnRJZCI6ImQ5YjYwN2UzLTFlN2QtNGY1YS1iZWQ4LWQ4Yjk3NmRkZTQ3MiIsIkVDWF9FVkVOVF9DQVRFR09SWV9BUElfTkFNRSI6IkVDWF9FQ19TSVRFVFJBQ0tfU0lURV9WSVNJVCIsIkVDWF9TSVRFX1JFR0lPTl9BUElfTkFNRSI
>>>  
>>> This is problematic because it is impossible to transfer it to an Athena 
>>> table (or even back to HIVE) without using these index-based column titles. 
>>> I need HIVE's column titles to transfer over to the Parquet file. 
>>>  
>>> I've searched for a very long time and have come up short. Am I doing 
>>> something wrong? 
>>> Please let me know if I can provide more information. Thank you!
>>> 
>>> I appreciate your time.
>>> Sincerely,
>>> 
>>> Brandon Cooke
>>> Software Engineer
>>> engage.cx
>>> 5500 Interstate N Parkway Suite 130
>> 
> 
