[ https://issues.apache.org/jira/browse/FLINK-9407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555420#comment-16555420 ]
ASF GitHub Bot commented on FLINK-9407:
---------------------------------------

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/6075

    @zhangminglei For your interest - there is a new bucketing sink in the Flink master
    (called `StreamingFileSink`) with a different design: it manages all state in Flink
    state (so it is consistent), uses a new file system writer abstraction to generalize
    across HDFS, POSIX, and S3 (S3 still WIP), and offers a more pluggable way to add
    encoders, like Parquet and ORC. As an example, we added a Parquet writer, which is
    quite straightforward and flexible with the new interface. It would be great to get
    your opinion on that and to see whether your ORC writer code also works with it.
    If it works out, the new StreamingFileSink could replace the current BucketingSink.

> Support orc rolling sink writer
> -------------------------------
>
>                 Key: FLINK-9407
>                 URL: https://issues.apache.org/jira/browse/FLINK-9407
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: zhangminglei
>            Assignee: zhangminglei
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, we only support {{StringWriter}}, {{SequenceFileWriter}} and
> {{AvroKeyValueSinkWriter}}. I would suggest adding an ORC writer for the
> rolling sink.
> Below, FYI.
> I tested the PR and verified the results with Spark SQL. Obviously, we can
> read back the results of what we had written before. But I will run more
> tests in the next couple of days, including performance under compression
> with short checkpoint intervals, and more UTs.
> {code:java}
> scala> spark.read.orc("hdfs://10.199.196.0:9000/data/hive/man/2018-07-06--21")
> res1: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
>
> scala> res1.registerTempTable("tablerice")
> warning: there was one deprecation warning; re-run with -deprecation for details
>
> scala> spark.sql("select * from tablerice")
> res3: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
>
> scala> res3.show(3)
> +-----+---+-------+
> | name|age|married|
> +-----+---+-------+
> |Sagar| 26|  false|
> |Sagar| 30|  false|
> |Sagar| 34|  false|
> +-----+---+-------+
> only showing top 3 rows
> {code}
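For context on the pluggable-encoder design Stephan describes, here is a minimal sketch of what a bulk-encoded sink might look like with the new interface, assuming the Flink 1.6 `StreamingFileSink` and `ParquetAvroWriters` APIs. The output path and the surrounding class are placeholders for illustration, not code from the PR.

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetSinkSketch {

    // Attach a bulk-format StreamingFileSink to a stream of Avro records.
    // With a bulk format, the sink rolls a new part file on every
    // checkpoint, and all bucket/file state is kept in Flink state, so
    // the output stays consistent across failures and restarts.
    public static void attachSink(DataStream<GenericRecord> records, Schema schema) {
        StreamingFileSink<GenericRecord> sink = StreamingFileSink
                .forBulkFormat(
                        new Path("hdfs:///tmp/parquet-out"),         // placeholder path
                        ParquetAvroWriters.forGenericRecord(schema)) // pluggable encoder
                .build();
        records.addSink(sink);
    }
}
{code}

An ORC writer would presumably plug in the same way, by supplying a `BulkWriter.Factory` in place of the Parquet one, which is what makes the encoder abstraction a natural home for the ORC code from this PR.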