Hi Prasad,

I have actually tried this and had the same result, although I am certainly willing to try again.

Sincerely,

Brandon Cooke

> On Jan 26, 2018, at 11:29 AM, Prasad Nagaraj Subramanya 
> <prasadn...@gmail.com> wrote:
> 
> Hi Brandon,
> 
> Have you tried creating an external table with the required column names for Parquet -
> 
> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS_PARQUET(
>     `release` STRING,
>     `customer` STRING,
>     `cookie` STRING,
>     `category` STRING,
>     `end_time` STRING,
>     `start_time` STRING,
>     `first_name` STRING,
>     `email` STRING,
>     `phone` STRING,
>     `last_name` STRING,
>     `site` STRING,
>     `source` STRING,
>     `subject` STRING,
>     `raw` STRING
> )
> STORED AS PARQUET
> LOCATION '${OUTPUT}';
> 
> And then inserting data into this table from your CSV table -
> 
> INSERT OVERWRITE TABLE EVENTS_PARQUET SELECT * FROM EVENTS;
> 
> This will create a Parquet file at the specified location (${OUTPUT}).
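> 
> One small caveat: Hive maps an INSERT ... SELECT by column position, so if the column order of EVENTS and EVENTS_PARQUET ever differs, listing the columns explicitly keeps the mapping unambiguous. A minimal sketch, reusing the column names from your script:
> 
> -- explicit column list instead of SELECT *, so the intended order is visible
> INSERT OVERWRITE TABLE EVENTS_PARQUET
> SELECT
>     `release`, `customer`, `cookie`, `category`,
>     `end_time`, `start_time`, `first_name`, `email`,
>     `phone`, `last_name`, `site`, `source`,
>     `subject`, `raw`
> FROM EVENTS;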
> 
> Thanks,
> Prasad
> 
> On Fri, Jan 26, 2018 at 7:45 AM, Brandon Cooke <brandon.co...@engage.cx> wrote:
> Hello,
> 
> I posted the following on a Cloudera forum but haven’t had much luck.
> I’m hoping someone here can tell me what step I have probably missed:
> 
> Hello,
>  
> I'm using HIVE (v1.2.1) to convert our data files from CSV into Parquet for 
> use in AWS Athena.
> However, no matter what I try, the resulting Parquet always has column titles [_col0, _col1, ..., _colN].
>  
> After researching, I read that the line SET parquet.column.index.access=false was supposed to allow Parquet to use the column titles of my HIVE table; however, it has been unsuccessful so far.
>  
> Below is an example script I use to create the Parquet from the data:
>  
> SET parquet.column.index.access=false;
> 
> CREATE EXTERNAL TABLE IF NOT EXISTS EVENTS(
>     `release` STRING,
>     `customer` STRING,
>     `cookie` STRING,
>     `category` STRING,
>     `end_time` STRING,
>     `start_time` STRING,
>     `first_name` STRING,
>     `email` STRING,
>     `phone` STRING,
>     `last_name` STRING,
>     `site` STRING,
>     `source` STRING,
>     `subject` STRING,
>     `raw` STRING
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> LOCATION '${INPUT}';
> 
> INSERT OVERWRITE DIRECTORY '${OUTPUT}/parquet'
> STORED AS PARQUET
> SELECT *
> FROM EVENTS;
>  
> Using parquet-tools, I read the resulting file and below is an example output:
>  
> _col0 = 0.1.2
> _col1 = customer1
> _col2 = NULL
> _col3 = api
> _col4 = 2018-01-21T06:57:57Z 
> _col5 = 2018-01-21T06:57:56Z 
> _col6 = Brandon
> _col7 = bran...@fakesite.com
> _col8 = 999-999-9999
> _col9 = Pompei
> _col10 = Boston
> _col11 = Wifi
> _col12 = NULL
> _col13 = 
> eyJlbmdhZ2VtZW50TWVkaXVtIjoibm9uZSIsImVudHJ5UG9pbnRJZCI6ImQ5YjYwN2UzLTFlN2QtNGY1YS1iZWQ4LWQ4Yjk3NmRkZTQ3MiIsIkVDWF9FVkVOVF9DQVRFR09SWV9BUElfTkFNRSI6IkVDWF9FQ19TSVRFVFJBQ0tfU0lURV9WSVNJVCIsIkVDWF9TSVRFX1JFR0lPTl9BUElfTkFNRSI
>  
> This is problematic because it is impossible to map the data into an Athena table (or even back into HIVE) without using these index-based column titles. I need HIVE's column titles to carry over into the Parquet file.
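> 
> For reference, the kind of Athena table I ultimately want to point at the output looks roughly like the sketch below (the table name and S3 location are placeholders); as far as I understand, Athena can only resolve these columns if the Parquet files actually carry the same names:
> 
> -- hypothetical Athena DDL; the LOCATION is a placeholder for wherever the Parquet lands
> CREATE EXTERNAL TABLE IF NOT EXISTS events_athena(
>     `release` STRING,
>     `customer` STRING,
>     `cookie` STRING,
>     `category` STRING,
>     `end_time` STRING,
>     `start_time` STRING,
>     `first_name` STRING,
>     `email` STRING,
>     `phone` STRING,
>     `last_name` STRING,
>     `site` STRING,
>     `source` STRING,
>     `subject` STRING,
>     `raw` STRING
> )
> STORED AS PARQUET
> LOCATION 's3://my-bucket/events/parquet/';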
>  
> I've searched for a very long time and have come up short. Am I doing something wrong?
> Please let me know if I can provide more information. Thank you!
> 
> I appreciate your time.
> Sincerely,
> 
> Brandon Cooke
> Software Engineer
> engage.cx
> 5500 Interstate N Parkway Suite 130
