Eyal,
The Parquet Pig loader is fine when all the data is present in the files
themselves, but if I've written out from Spark using `df.write.partitionBy('colA',
'colB').parquet('s3://path/to/output')`, the values of those two columns are
encoded in the output path and stripped from the data files:
s3://path/to/output/colA=va
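For example (with made-up values), the full layout looks something like:

s3://path/to/output/colA=<valueA>/colB=<valueB>/part-00000-....snappy.parquet

so the colA/colB values only exist in the directory names, not inside the
Parquet files themselves.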
Hi Eyal,
For just loading Parquet files the Parquet Pig loader is okay, although I
don't think it makes the partition values available in the loaded dataset.
I know plain old PigStorage has a trick with the -tagFile option, but I'm not
sure that would be enough in Michael's case, or whether that's something the
Parquet loader supports.
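For reference, this is roughly what that trick looks like on delimited text
input (the comma delimiter and the f1/f2 fields are made up; '-tagPath' is the
variant that prepends the full input path, while '-tagFile' only gives the
file name). It would not read Parquet files:

raw = LOAD 's3://path/to/output' USING PigStorage(',', '-tagPath')
      AS (filepath:chararray, f1:int, f2:chararray);
-- pull the partition values back out of the directory names in the path
with_parts = FOREACH raw GENERATE
      REGEX_EXTRACT(filepath, 'colA=([^/]+)', 1) AS colA,
      REGEX_EXTRACT(filepath, 'colB=([^/]+)', 1) AS colB,
      f1, f2;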
Hi Michael,
You can also use the Parquet Pig loader (especially if you're not working with
Hive). Here's a link to the Maven repository for it.
https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig/1.10.0
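A minimal sketch of using it (the jar name here is illustrative; you'd
REGISTER the parquet-pig bundle jar, or the jar plus its dependencies):

REGISTER parquet-pig-bundle-1.10.0.jar;
-- the schema is taken from the Parquet file footers
data = LOAD 's3://path/to/output' USING org.apache.parquet.pig.ParquetLoader();
DESCRIBE data;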
Regards,
Eyal
On Tuesday, August 28, 2018, 2:40:36 PM GMT+3, Adam Szita wrote:
Hi Michael,
Yes, you can use HCatLoader to do this.
The requirement is that you have a Hive table defined on top of your data
(probably pointing to s3://path/to/files), with the Hive MetaStore holding all
the relevant metadata/schema information.
If you do not have a Hive table yet, you can go ahead and define one.
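Roughly, the flow would look something like this (database/table/column names
are made up; the Hive side is shown as comments):

-- In Hive, define an external, partitioned table over the Spark output and
-- let the metastore discover the colA=/colB= directories, e.g.:
--   CREATE EXTERNAL TABLE mydb.events (f1 INT, f2 STRING)
--     PARTITIONED BY (colA STRING, colB STRING)
--     STORED AS PARQUET
--     LOCATION 's3://path/to/files';
--   MSCK REPAIR TABLE mydb.events;
-- Then in Pig, the partition columns come back as ordinary fields:
events = LOAD 'mydb.events' USING org.apache.hive.hcatalog.pig.HCatLoader();
filtered = FILTER events BY colA == 'someValue';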