[ https://issues.apache.org/jira/browse/FLINK-9407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16555420#comment-16555420 ]
ASF GitHub Bot commented on FLINK-9407:
---------------------------------------

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/6075

    @zhangminglei For your interest - there is a new bucketing sink in the Flink master
    (called `StreamingFileSink`) with a different design: it manages all state in Flink
    state (so it is consistent), uses a new file system writer abstraction to generalize
    across HDFS, POSIX, and S3 (S3 still WIP), and offers a more pluggable way to add
    encoders, like Parquet and ORC. As an example, we added a Parquet writer, which is
    quite straightforward and flexible with the new interface. It would be great to get
    your opinion on that and to see whether your ORC writer code also works with it.
    If it works out, the new StreamingFileSink could replace the current BucketingSink.

> Support orc rolling sink writer
> -------------------------------
>
>                 Key: FLINK-9407
>                 URL: https://issues.apache.org/jira/browse/FLINK-9407
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: zhangminglei
>            Assignee: zhangminglei
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, we only support {{StringWriter}}, {{SequenceFileWriter}} and
> {{AvroKeyValueSinkWriter}}. I would suggest adding an ORC writer for the
> rolling sink.
> Below, FYI.
> I tested the PR and verified the results with Spark SQL. Obviously, we can
> read back the results of what we had written before. But I will run more
> tests in the next couple of days, including performance under compression
> with short checkpoint intervals, and more UTs.
> {code:java}
> scala> spark.read.orc("hdfs://10.199.196.0:9000/data/hive/man/2018-07-06--21")
> res1: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
>
> scala> res1.registerTempTable("tablerice")
> warning: there was one deprecation warning; re-run with -deprecation for details
>
> scala> spark.sql("select * from tablerice")
> res3: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]
>
> scala> res3.show(3)
> +-----+---+-------+
> | name|age|married|
> +-----+---+-------+
> |Sagar| 26|  false|
> |Sagar| 30|  false|
> |Sagar| 34|  false|
> +-----+---+-------+
> only showing top 3 rows
> {code}
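For context on the pluggable-encoder design Stephan describes, here is a minimal sketch of what a bulk-encoded sink might look like with the new interface, assuming the Flink 1.6 `StreamingFileSink` and `ParquetAvroWriters` APIs. The output path and the surrounding class are placeholders for illustration, not code from the PR.

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetSinkSketch {

    // Attach a bulk-format StreamingFileSink to a stream of Avro records.
    // With a bulk format, the sink rolls a new part file on every
    // checkpoint, and all bucket/file state is kept in Flink state, so
    // the output stays consistent across failures and restarts.
    public static void attachSink(DataStream<GenericRecord> records, Schema schema) {
        StreamingFileSink<GenericRecord> sink = StreamingFileSink
                .forBulkFormat(
                        new Path("hdfs:///tmp/parquet-out"),         // placeholder path
                        ParquetAvroWriters.forGenericRecord(schema)) // pluggable encoder
                .build();
        records.addSink(sink);
    }
}
{code}

An ORC writer would presumably plug in the same way, by supplying a `BulkWriter.Factory` in place of the Parquet one, which is what makes the encoder abstraction a natural home for the ORC code from this PR.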