Re: [DISCUSS] FLIP-115: Filesystem connector in Table

Jingsong Li Sun, 15 Mar 2020 22:56:27 -0700

Thanks, Jinhai, for getting involved.

> we need to add 'connector.sink.username' for UserGroupInformation when data
is written to HDFS


Yes. I am not an expert on HDFS, but it seems we need to do this "doAs" in the
code to access an external HDFS. I will update the document.

Best,
Jingsong Lee

On Mon, Mar 16, 2020 at 12:01 PM Jingsong Li <jingsongl...@gmail.com> wrote:

> Thanks, Piotr and Yun, for getting involved.
>
> Hi Piotr and Yun, regarding the implementation:
>
> FLINK-14254 [1] introduces the batch sink for the table world; it deals with
> partitions, the metastore, and so on. It simply reuses the Dataset/Datastream
> FileInputFormat and FileOutputFormat. The filesystem connector cannot do
> without FileInputFormat, because it needs to deal with files and splits. For
> example, ORC and Parquet need to read the whole file and have different split
> logic.
>
> So, back to the filesystem connector:
> - It needs to introduce FilesystemTableFactory, FilesystemTableSource and
> FilesystemTableSink.
> - For sources, it reuses the Dataset/Datastream FileInputFormats; there is no
> other interface for reading files.
>
> For file sinks:
> - The batch sink uses FLINK-14254.
> - The streaming sink has two options.
>
> The first option is to reuse the batch sink from FLINK-14254, which already
> handles the partition and metastore logic well.
> - It unifies batch and streaming.
> - Using FileOutputFormat is consistent with FileInputFormat.
> - It needs added exactly-once logic, which is only 200+ lines of code.
> - It is natural to support more table features, such as partition commit,
> auto compaction, and so on.
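
As an illustration of what "exactly-once related logic" on top of a file output
could look like, here is a minimal sketch of the common stage-then-publish
pattern: files are written to a temporary directory, staged at each checkpoint,
and only moved to the final directory once that checkpoint is confirmed. The
class TwoPhaseFileWriter and its method names are hypothetical (they only
mirror the usual snapshot/notify callbacks) and are not taken from FLINK-14254
or the FLIP. Persisting the pending list in state for recovery is omitted.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not a Flink class: stage files per checkpoint, publish on completion. */
public class TwoPhaseFileWriter {
    private final Path tempDir;
    private final Path finalDir;
    // Files written since the last checkpoint barrier.
    private final List<Path> inProgress = new ArrayList<>();
    // Checkpoint id -> files staged by that checkpoint, awaiting publication.
    private final TreeMap<Long, List<Path>> pending = new TreeMap<>();
    private int counter = 0;

    public TwoPhaseFileWriter(Path tempDir, Path finalDir) throws IOException {
        this.tempDir = Files.createDirectories(tempDir);
        this.finalDir = Files.createDirectories(finalDir);
    }

    /** Writes data to a new temp file that readers of finalDir cannot see. */
    public void write(String data) throws IOException {
        Path file = tempDir.resolve("part-" + (counter++) + ".tmp");
        Files.write(file, data.getBytes(StandardCharsets.UTF_8));
        inProgress.add(file);
    }

    /** On a checkpoint barrier: stage the current files, but do not publish them. */
    public void snapshotState(long checkpointId) {
        pending.put(checkpointId, new ArrayList<>(inProgress));
        inProgress.clear();
    }

    /** Once the checkpoint is confirmed: publish everything staged up to it. */
    public void notifyCheckpointComplete(long checkpointId) throws IOException {
        Map<Long, List<Path>> ready = pending.headMap(checkpointId, true);
        for (List<Path> files : ready.values()) {
            for (Path file : files) {
                String name = file.getFileName().toString().replace(".tmp", "");
                Files.move(file, finalDir.resolve(name), StandardCopyOption.ATOMIC_MOVE);
            }
        }
        ready.clear(); // headMap is a view, so this removes the published entries
    }
}
```

A file written after snapshotState(1) stays invisible until a later checkpoint
completes, so a failure between checkpoints never exposes partial output.
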
>
> The second option is to reuse the Datastream StreamingFileSink:
> - It unifies the streaming sink between Table and Datastream.
> - It may be hard to introduce table-related features into StreamingFileSink.
>
> I slightly prefer the first option. What do you think?
>
> Hi Yun,
>
> > Watermark mechanism might not be enough.
>
> Watermarks of subtasks are the same in the "snapshotState".
>
> > we might need to also do some coordination between subtasks.
>
> Yes, the JobMaster is the role that controls subtasks. The metastore is a
> very fragile single point that cannot be accessed in a distributed way, so it
> is accessed uniformly through the JobMaster.
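
To make the coordination idea concrete, here is a minimal sketch of a single
coordinator that collects the watermark each subtask reports at a checkpoint
and treats a partition as committable only when every subtask has passed the
partition's end time. The class PartitionCommitCoordinator, its methods, and
the use of the minimum watermark are assumptions made for illustration; the
thread does not specify how the JobMaster-side logic would be implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not a Flink class: one place decides when a partition is complete. */
public class PartitionCommitCoordinator {
    private final int parallelism;
    // Subtask index -> highest watermark reported so far.
    private final Map<Integer, Long> watermarks = new HashMap<>();
    // Partition name -> end timestamp of the time range it covers.
    private final Map<String, Long> partitionEnd = new LinkedHashMap<>();

    public PartitionCommitCoordinator(int parallelism) {
        this.parallelism = parallelism;
    }

    /** Registers a partition that some subtask has started writing. */
    public void registerPartition(String name, long endTimestamp) {
        partitionEnd.put(name, endTimestamp);
    }

    /**
     * Records a subtask's watermark and returns the partitions that every
     * subtask has now passed. The caller would add these to the metastore
     * from this single place, so the metastore is never hit by all subtasks.
     */
    public List<String> reportWatermark(int subtask, long watermark) {
        watermarks.merge(subtask, watermark, Math::max);
        List<String> committable = new ArrayList<>();
        if (watermarks.size() < parallelism) {
            return committable; // some subtask has not reported yet
        }
        long minWatermark = Collections.min(watermarks.values());
        Iterator<Map.Entry<String, Long>> it = partitionEnd.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= minWatermark) {
                committable.add(entry.getKey());
                it.remove(); // each partition is returned at most once
            }
        }
        return committable;
    }
}
```

With two subtasks, a partition ending at time 100 is returned only after both
subtasks have reported a watermark of at least 100.
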
>
> [1]https://issues.apache.org/jira/browse/FLINK-14254
>
> Best,
> Jingsong Lee
>
> On Fri, Mar 13, 2020 at 6:43 PM Yun Gao <yungao...@aliyun.com> wrote:
>
>> Hi,
>>
>> Many thanks to Jingsong for bringing up this discussion! Enhancing the
>> FileSystem connector in Table should largely improve usability.
>>
>> I have the same question as Piotr. From my side, I think it would be better
>> to reuse the existing StreamingFileSink. I think we have begun enhancing the
>> supported file formats (e.g., ORC, Avro...), and reusing StreamingFileSink
>> should avoid repeated work in the Table library. Besides, the bucket concept
>> also seems to match the semantics of a partition.
>>
>> For the notification of added partitions, I wonder whether the watermark
>> mechanism is enough, since a Bucket/Partition might span multiple subtasks.
>> It depends on the level of notification: if we want to notify for the bucket
>> on each subtask, using the watermark to notify each subtask should be OK,
>> but if we want to notify for the whole Bucket/Partition, we might also need
>> some coordination between subtasks.
>>
>>
>> Best,
>> Yun
>>
>>
>>
>> ------------------------------------------------------------------
>> From:Piotr Nowojski <pi...@ververica.com>
>> Send Time:2020 Mar. 13 (Fri.) 18:03
>> To:dev <dev@flink.apache.org>
>> Cc:user <u...@flink.apache.org>; user-zh <user...@flink.apache.org>
>> Subject:Re: [DISCUSS] FLIP-115: Filesystem connector in Table
>>
>> Hi,
>>
>>
>> Which actual sinks/sources are you planning to use in this feature? Is it 
>> about exposing StreamingFileSink in the Table API? Or do you want to 
>> implement new Sinks/Sources?
>>
>> Piotrek
>>
>> > On 13 Mar 2020, at 10:04, jinhai wang <jinhai...@gmail.com> wrote:
>> >
>>
>> > Thanks for FLIP-115. It is a really useful feature for platform developers
>> > who manage hundreds of Flink-to-Hive jobs in production.
>> >
>> > I think we need to add 'connector.sink.username' for UserGroupInformation
>> > when data is written to HDFS.
>> >
>> >
>> >  On 2020/3/13 at 3:33 PM, "Jingsong Li" <jingsongl...@gmail.com> wrote:
>> >
>> >    Hi everyone,
>> >
>>
>> >    I'd like to start a discussion about FLIP-115: Filesystem connector in
>> >    Table [1].
>> >    This FLIP will:
>> >    - Introduce a Filesystem table factory in Table, supporting
>> >    csv/parquet/orc/json/avro formats.
>> >    - Introduce a streaming filesystem/Hive sink in Table.
>> >
>>
>> >    CC'ing the user mailing list: if you have any unmet needs, please feel
>> >    free to reply.
>> >
>> >    Look forward to hearing from you.
>> >
>> >    [1]
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-115%3A+Filesystem+connector+in+Table
>> >
>> >    Best,
>> >    Jingsong Lee
>> >
>> >
>> >
>>
>>
>>
>
> --
> Best, Jingsong Lee
>


-- 
Best, Jingsong Lee
