Re: Reading partitioned Parquet data into Pig

2018-08-31 Thread Michael Doo
Eyal, The Parquet Pig loader is fine if all the data is present, but if I've written out from Spark using `df.write.partitionBy('colA', 'colB').parquet('s3://path/to/output')`, the values of those two columns are encoded into the output path and removed from the data files: s3://path/to/output/colA=val1/colB=val2/...
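
For concreteness, a minimal sketch of what this looks like from the Pig side (paths and values are hypothetical, and the loader is the parquet-pig one Eyal suggests below):

  -- Illustrative layout written by Spark's partitionBy (values hypothetical):
  --   s3://path/to/output/colA=a1/colB=b1/part-00000.snappy.parquet
  --   s3://path/to/output/colA=a2/colB=b1/part-00001.snappy.parquet
  -- Loading the files directly recovers only the columns stored inside
  -- the Parquet files themselves; colA and colB are gone:
  raw = LOAD 's3://path/to/output' USING org.apache.parquet.pig.ParquetLoader();
  DESCRIBE raw; -- the reported schema will not include colA or colB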

Re: Reading partitioned Parquet data into Pig

2018-08-30 Thread Adam Szita
Hi Eyal, For just loading Parquet files the Parquet Pig loader is okay, although I don't think it lets you use the partition values in the dataset later. I know the plain old PigStorage has a trick with the -tagFile/-tagPath options, but I'm not sure that would be enough in Michael's case, or whether it's something the Parquet loader supports.
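
As a sketch of that PigStorage trick (it applies to delimited text input rather than Parquet; the field names and regex are hypothetical): -tagPath prepends each record with the full path of the file it came from, and the partition values can then be parsed back out of that path:

  raw = LOAD 's3://path/to/text-output' USING PigStorage(',', '-tagPath')
        AS (filepath:chararray, f1:chararray, f2:int);
  -- recover the partition values from the directory names in the path:
  with_parts = FOREACH raw GENERATE
        REGEX_EXTRACT(filepath, 'colA=([^/]+)', 1) AS colA,
        REGEX_EXTRACT(filepath, 'colB=([^/]+)', 1) AS colB,
        f1, f2;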

Re: Reading partitioned Parquet data into Pig

2018-08-30 Thread Eyal Allweil
Hi Michael, You can also use the Parquet Pig loader (especially if you're not working with Hive). Here's a link to the Maven repository for it: https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig/1.10.0 Regards, Eyal
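
For reference, a minimal usage sketch (the jar path is hypothetical; the parquet-pig-bundle jar from the same groupId works as well):

  REGISTER /path/to/parquet-pig-bundle-1.10.0.jar;
  -- ParquetLoader() can also be given a Pig schema string to project
  -- just the columns you need:
  events = LOAD 's3://path/to/output'
           USING org.apache.parquet.pig.ParquetLoader();
  DESCRIBE events;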

Re: Reading partitioned Parquet data into Pig

2018-08-28 Thread Adam Szita
Hi Michael, Yes, you can use HCatLoader to do this. The requirement is that you have a Hive table defined on top of your data (probably pointing to s3://path/to/files), with the Hive MetaStore holding all the relevant meta/schema information. If you do not have a Hive table yet, you can go ahead and define one first.
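
A minimal end-to-end sketch, with purely illustrative table and column names (the Hive DDL is shown as a comment; run the Pig script with `pig -useHCatalog` so the HCatalog jars are on the classpath):

  -- In Hive first (hypothetical names):
  --   CREATE EXTERNAL TABLE mydb.events (f1 STRING, f2 INT)
  --   PARTITIONED BY (colA STRING, colB STRING)
  --   STORED AS PARQUET
  --   LOCATION 's3://path/to/output';
  --   MSCK REPAIR TABLE mydb.events;  -- register the existing colA=/colB= dirs
  -- Then in Pig:
  events = LOAD 'mydb.events' USING org.apache.hive.hcatalog.pig.HCatLoader();
  -- partition columns come back as ordinary fields, and filters on them
  -- are pushed down so only the matching partitions are read:
  some = FILTER events BY colA == 'val1';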