Hi Rafi,

I have a similar use case: I want to read Parquet files into a DataSet, perform some transformations, and then write the result partitioned by year, month and day.
I am stuck at the first step: how to read and write Parquet files using hadoop-compatibility. Please help me with this, and also let me know if you have found a solution for writing the data into partitioned folders.

Thanks,
Anuj

On Thu, Oct 25, 2018 at 5:35 PM Andrey Zagrebin <and...@data-artisans.com> wrote:

> Hi Rafi,
>
> At the moment I do not see any support of Parquet in DataSet API except
> HadoopOutputFormat, mentioned in the stack overflow question. I have
> cc'ed Fabian and Aljoscha, maybe they could provide more information.
>
> Best,
> Andrey
>
> On 25 Oct 2018, at 13:08, Rafi Aroch <rafi.ar...@gmail.com> wrote:
>
> Hi,
>
> I'm writing a Batch job which reads Parquet, does some aggregations and
> writes back as Parquet files.
> I would like the output to be partitioned by year, month, day by event
> time. Similarly to the functionality of the BucketingSink.
>
> I was able to achieve the reading/writing to/from Parquet by using the
> hadoop-compatibility features.
> I couldn't find a way to partition the data by year, month, day to create
> a folder hierarchy accordingly. Everything is written to a single directory.
>
> I could find an unanswered question about this issue:
> https://stackoverflow.com/questions/52204034/apache-flink-does-dataset-api-support-writing-output-to-individual-file-partit
>
> Can anyone suggest a way to achieve this? Maybe there's a way to integrate
> the BucketingSink with the DataSet API? Another solution?
>
> Rafi

--
Thanks & Regards,
Anuj Jain
Mob. : +91- 8588817877
Skype : anuj.jain07
<http://www.oracle.com/>
<http://www.cse.iitm.ac.in/%7Eanujjain/>
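PS: For the reading/writing itself, the rough shape I have in mind with the hadoop-compatibility wrappers and parquet-avro is below (untested sketch; the paths and the schema are placeholders, and it assumes a parquet-avro version where AvroParquetInputFormat/AvroParquetOutputFormat take a type parameter):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetInputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class ParquetReadWriteSketch {

  // Placeholder: the Avro schema of the records, as a JSON string.
  private static final String SCHEMA_JSON = "{ ... }";

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Read: wrap parquet-avro's AvroParquetInputFormat in Flink's HadoopInputFormat.
    // ParquetInputFormat emits (Void, record) pairs, so the key type is Void.
    Job readJob = Job.getInstance();
    AvroParquetInputFormat.addInputPath(readJob, new Path("hdfs:///data/input")); // placeholder path
    DataSet<Tuple2<Void, GenericRecord>> input = env.createInput(
        new HadoopInputFormat<>(new AvroParquetInputFormat<GenericRecord>(),
            Void.class, GenericRecord.class, readJob));

    // ... transformations would go here; this sketch just passes the records through ...
    DataSet<Tuple2<Void, GenericRecord>> result = input;

    // Write: wrap AvroParquetOutputFormat in Flink's HadoopOutputFormat.
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    Job writeJob = Job.getInstance();
    AvroParquetOutputFormat.setSchema(writeJob, schema);
    AvroParquetOutputFormat.setOutputPath(writeJob, new Path("hdfs:///data/output")); // placeholder path
    result.output(new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), writeJob));

    env.execute("parquet read/write sketch");
  }
}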
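For the year/month/day folder hierarchy, the only workaround I can think of in the DataSet API is to collect the distinct dates and attach one sink per partition directory, continuing the sketch above (extractDate() is a hypothetical helper that builds a "year=.../month=.../day=..." string from the record's event-time field):

// Workaround sketch: one HadoopOutputFormat sink per (year, month, day) partition.
// collect() triggers a separate job, so the input is effectively read twice.
List<String> dates = result
    .map(t -> extractDate(t.f1))   // hypothetical helper, e.g. returns "year=2018/month=10/day=25"
    .distinct()
    .collect();

for (String date : dates) {
  DataSet<Tuple2<Void, GenericRecord>> partition =
      result.filter(t -> extractDate(t.f1).equals(date));

  Job job = Job.getInstance();
  AvroParquetOutputFormat.setSchema(job, schema);
  AvroParquetOutputFormat.setOutputPath(job, new Path("hdfs:///data/output/" + date));
  partition.output(new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), job));
}

env.execute("write partitioned parquet");

Each filter re-scans the full dataset, so this is probably only acceptable for a small number of partitions; a cleaner DataSet equivalent of the BucketingSink would still be very welcome.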