Thanks Jinhai for involving. > we need add 'connector.sink.username' for UserGroupInformation when data is written to HDFS
Yes, I am not an expert of HDFS, but it seems we need do this "doAs" in the code for access external HDFS. I will update document. Best, Jingsong Lee On Mon, Mar 16, 2020 at 12:01 PM Jingsong Li <jingsongl...@gmail.com> wrote: > Thanks Piotr and Yun for involving. > > Hi Piotr and Yun, for implementation, > > FLINK-14254 [1] introduce batch sink table world, it deals with partitions > thing, metastore thing and etc.. And it just reuse Dataset/Datastream > FileInputFormat and FileOutputFormat. Filesystem can not do without > FileInputFormat, because it need deal with file things, split things. Like > orc and parquet, they need read whole file and have different split logic. > > So back to file system connector: > - It needs introducing FilesystemTableFactory, FilesystemTableSource and > FilesystemTableSink. > - For sources, reusing Dataset/Datastream FileInputFormats, there are no > other interface to finish file reading. > > For file sinks: > - Batch sink use FLINK-14254 > - Streaming sink has two ways. > > First way is reusing Batch sink in FLINK-14254, It has handled the > partition and metastore logic well. > - unify batch and streaming > - Using FileOutputFormat is consistent with FileInputFormat. > - Add exactly-once related logic. Just 200+ lines code. > - It's natural to support more table features, like partition commit, auto > compact and etc.. > > Second way is reusing Datastream StreamingFileSink: > - unify streaming sink between table and Datastream. > - It maybe hard to introduce table related features to StreamingFileSink. > > I prefer the first way a little. What do you think? > > Hi Yun, > > > Watermark mechanism might not be enough. > > Watermarks of subtasks are the same in the "snapshotState". > > > we might need to also do some coordination between subtasks. > > Yes, JobMaster is the role to control subtasks. Metastore is a very > fragile single point, which can not be accessed by distributed, so it is > uniformly accessed by JobMaster. > > [1]https://issues.apache.org/jira/browse/FLINK-14254 > > Best, > Jingsong Lee > > On Fri, Mar 13, 2020 at 6:43 PM Yun Gao <yungao...@aliyun.com> wrote: > >> Hi, >> >> Very thanks for Jinsong to bring up this discussion! It should >> largely improve the usability after enhancing the FileSystem connector in >> Table. >> >> I have the same question with Piotr. From my side, I think it >> should be better to be able to reuse existing StreamingFileSink. I think We >> have began >> enhancing the supported FileFormat (e.g., ORC, Avro...), and >> reusing StreamFileSink should be able to avoid repeat work in the Table >> library. Besides, >> the bucket concept seems also matches the semantics of partition. >> >> For the notification of adding partitions, I'm a little wondering >> that the Watermark mechanism might not be enough since Bucket/Partition >> might spans >> multiple subtasks. It depends on the level of notification: if we >> want to notify for the bucket on each subtask, using watermark to notifying >> each subtask >> should be ok, but if we want to notifying for the whole >> Bucket/Partition, we might need to also do some coordination between >> subtasks. >> >> >> Best, >> Yun >> >> >> >> ------------------------------------------------------------------ >> From:Piotr Nowojski <pi...@ververica.com> >> Send Time:2020 Mar. 13 (Fri.) 18:03 >> To:dev <dev@flink.apache.org> >> Cc:user <u...@flink.apache.org>; user-zh <user...@flink.apache.org> >> Subject:Re: [DISCUSS] FLIP-115: Filesystem connector in Table >> >> Hi, >> >> >> Which actual sinks/sources are you planning to use in this feature? Is it >> about exposing StreamingFileSink in the Table API? Or do you want to >> implement new Sinks/Sources? >> >> Piotrek >> >> > On 13 Mar 2020, at 10:04, jinhai wang <jinhai...@gmail.com> wrote: >> > >> >> > Thanks for FLIP-115. It is really useful feature for platform developers >> > who manage hundreds of Flink to Hive jobs in production. >> >> > I think we need add 'connector.sink.username' for UserGroupInformation >> > when data is written to HDFS >> > >> > >> > 在 2020/3/13 下午3:33,“Jingsong Li”<jingsongl...@gmail.com> 写入: >> > >> > Hi everyone, >> > >> >> > I'd like to start a discussion about FLIP-115 Filesystem connector in >> > Table >> > [1]. >> > This FLIP will bring: >> > - Introduce Filesystem table factory in table, support >> > csv/parquet/orc/json/avro formats. >> > - Introduce streaming filesystem/hive sink in table >> > >> >> > CC to user mail list, if you have any unmet needs, please feel free to >> > reply~ >> > >> > Look forward to hearing from you. >> > >> > [1] >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-115%3A+Filesystem+connector+in+Table >> > >> > Best, >> > Jingsong Lee >> > >> > >> > >> >> >> > > -- > Best, Jingsong Lee > -- Best, Jingsong Lee