Re: processing avro data source using DataSet API and output to parquet

2019-08-19 Thread Zhenghua Gao
In the long run, the DataStream API should fully subsume the DataSet API (through bounded streams) [1]. You can also consider using the Table/SQL API in your project. [1] https://flink.apache.org/roadmap.html#analytics-applications-and-the-roles-of-datastream-dataset-and-table-api *Best Regards,* *Zhenghua Gao*

Re: processing avro data source using DataSet API and output to parquet

2019-08-16 Thread Lian Jiang
Thanks. Which API (DataSet or DataStream) is recommended for file handling (no window operation required)? We have a similar scenario for real-time processing. Might it make sense to use the DataStream API for both batch and real-time, for uniformity? Sent from my iPhone

Re: processing avro data source using DataSet API and output to parquet

2019-08-16 Thread Zhenghua Gao
Flink allows Hadoop (MapReduce) OutputFormats in Flink jobs [1]. You can try Parquet's OutputFormat [2]. And if you can switch to the DataStream API, StreamingFileSink + ParquetBulkWriter meets your requirement [3][4]. [1] https://github.com/apache/flink/blob/master/flink-connectors/flink-hadoop
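A minimal sketch of the DataStream route suggested above (StreamingFileSink plus a Parquet bulk writer), assuming Flink 1.8 with the flink-avro and flink-parquet dependencies on the classpath. `MyRecord` is a hypothetical Avro-generated specific record class, and the input/output paths are placeholders:

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class AvroToParquetJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // StreamingFileSink only finalizes part files on checkpoints,
        // so checkpointing must be enabled.
        env.enableCheckpointing(60_000);

        // Read the Avro input files as a (bounded) stream of records.
        DataStream<MyRecord> records = env.createInput(
            new AvroInputFormat<>(new Path("hdfs:///input/avro"), MyRecord.class));

        // Bulk-encode each part file as Parquet, deriving the schema
        // from the Avro specific record class.
        StreamingFileSink<MyRecord> sink = StreamingFileSink
            .forBulkFormat(new Path("hdfs:///output/parquet"),
                           ParquetAvroWriters.forSpecificRecord(MyRecord.class))
            .build();

        records.addSink(sink);
        env.execute("avro-to-parquet");
    }
}
```

Note that with a bulk format the sink rolls files on every checkpoint rather than on size or time, which is why the checkpoint interval effectively controls output file granularity.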

processing avro data source using DataSet API and output to parquet

2019-08-15 Thread Lian Jiang
Hi, I am using the Flink 1.8.1 DataSet API for batch processing. The data source is Avro files and I want to output the result as Parquet. https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/batch/ has no related information. What's the recommended way to do this? Do I need to wri
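One route offered in the replies above is to stay on the DataSet API and wrap a Hadoop Parquet OutputFormat via flink-hadoop-compatibility. A rough sketch under those assumptions: `MyRecord` is a hypothetical Avro-generated class, parquet-avro is assumed to be a version where `AvroParquetOutputFormat` is generic, and paths are placeholders:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class AvroDataSetToParquet {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the Avro files into a DataSet of specific records.
        DataSet<MyRecord> records = env.createInput(
            new AvroInputFormat<>(new Path("hdfs:///input/avro"), MyRecord.class));

        // Configure the Hadoop-side Parquet output format.
        Job job = Job.getInstance();
        AvroParquetOutputFormat.setSchema(job, MyRecord.getClassSchema());
        FileOutputFormat.setOutputPath(job,
            new org.apache.hadoop.fs.Path("hdfs:///output/parquet"));

        HadoopOutputFormat<Void, MyRecord> parquetOut =
            new HadoopOutputFormat<>(new AvroParquetOutputFormat<MyRecord>(), job);

        // Hadoop OutputFormats consume key/value pairs; the Parquet
        // writer ignores the key, so we pass null.
        records.map(new MapFunction<MyRecord, Tuple2<Void, MyRecord>>() {
                    @Override
                    public Tuple2<Void, MyRecord> map(MyRecord r) {
                        return Tuple2.of(null, r);
                    }
                })
               .output(parquetOut);

        env.execute("avro-dataset-to-parquet");
    }
}
```

This keeps the batch job on the DataSet API; the alternative in the same thread is to move to the DataStream API and use StreamingFileSink with a Parquet bulk writer.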