Thanks for Jacky and Yangze. 1) we will reorganize your design and move it to the flip 2) we will support lookup and DynamicSource. and current DynamicSource's optimize, such as supportLimitPushDown/supportLimitPushdown doesn't have achieved. so we will do it after it have done Thanks for your response.
Jark Wu-2 wrote > Thanks Jacky for starting this discussion. > > The requirement of ES source has been proposed in the community many > times. +1 for the feature from my side. > > Here are my thoughts: > > 1. streaming source > As we only support bounded source for JDBC and HBase, so I think it's fine > to have a bounded ES source. > > 2. lookup source > Have you ever thought about having ES as a lookup source just like an > HBase > table and lookup by index? > I'm not sure whether it works. But my gut feeling tells me it is an > interesting feature and there may be a need for it. > > 3. DDL options > I agree with Yangze, it is important to list what new options you want to > add. It would be nice to organize your design > doc according to FLIP template (to have "Public Interface" and "Proposed > Changes"). > > 4. Implement in new table source interface (FLIP-95) > Since 1.11, we proposed a new set of table connector interfaces (FLIP-95) > with more powerful features. > Old table source interface will be removed in the future. > > 5. DataStream source > It would be nicer to expose a DataStream source too, and share > implementations as much as possible. > > > Best, > Jark > > > On Thu, 4 Jun 2020 at 22:07, Etienne Chauchot < > echauchot@ > > wrote: > >> Hi, >> >> I made the Elasticsearch connector of Apache Beam and I was thinking >> about doing the same for Flink when I came by this discussion. I have >> some comments regarding the design doc: >> >> 1. Streaming source: >> >> ES has data streams features but only for time series data; the aim of >> this source is to read all kind of data. Apart from data streams, ES >> behaves like a database: you read the content of an index (similar to a >> table) corresponding to the given query (similar to SQL). So, regarding >> streaming changes, if there are changes between 2 read requests made by >> the source, at the second the whole index (containing the change) will >> be read another time. So, I see no way of having a regular flow of >> documents updates (insertion, deletion, update) as we would need for a >> streaming source. Regarding failover: I guess exactly once semantics >> cannot be guaranteed, only at least once semantics can. Indeed there is >> no ack mechanism on already read data. As a conclusion, IMO you are >> right to target only batch source. Also this answers Yangze Guo's >> question about streaming source. Question is: can a batch only source be >> accepted as a built in flink source ? >> >> 2. hadoop ecosystem >> >> Why not use RichParallelSourceFunction ? >> >> 3. Splitting >> >> Splitting with one split = one ES shard could lead to sub-parallelism. >> IMHO I think that what's important is the number of executors there are >> in the Flink cluster: it is better to use >> runtimeContext.getIndexOfThisSubtask() and >> runtimeContext.getMaxNumberOfParallelSubtasks() to split the input data >> using ES slice API. >> >> 4. Targeting ES 5, 6, 7 >> >> In Beam I used low level REST client because it is compatible with all >> ES versions so it allows to have the same code base for all versions. >> But this client is very low level (String based requests). Now, high >> level rest client exists (it was not available at the time), it is the >> one I would use. It is also available for ES 5 so you should use it for >> ES 5 instead of deprecated Transport client. >> >> Best >> >> Etienne Chauchot. >> >> >> On 04/06/2020 08:47, Yangze Guo wrote: >> > Hi, Jackey. >> > >> > Thanks for driving this discussion. I think this proposal should be a >> > FLIP[1] since it impacts the public interface. However, as we have >> > only some preliminary discussions atm, a design draft would be ok. But >> > it would be better to organize your document according to [2]. >> > >> > I've two basic questions: >> > - Could your summarize all the public API and configurations (DDL) of >> > the ElasticSearchTableSource? >> > - If we want to implement ElasticSearch DataStream Source at the same >> > time, do we need to do a lot of extra work apart from this? >> > >> > It also would be good if you could do some tests (performance and >> > correctness) to address Robert's comments. >> > >> > [1] >> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals >> > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template >> > >> > Best, >> > Yangze Guo >> > >> > On Wed, Jun 3, 2020 at 9:41 AM Jacky Lau < > liuyongvs@ > > wrote: >> >> Hi Robert Metzger: >> >> Thanks for your response. could you please read this docs. >> >> >> https://www.yuque.com/jackylau-sc7w6/bve18l/14a2ad5b7f86998433de83dd0f8ec067 >> >> . Any Is it any problem here? we are worried about >> >> we do not think throughly. thanks. >> >> >> >> >> >> >> >> -- >> >> Sent from: >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/ >> -- Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/