Hi,

I made the Elasticsearch connector of Apache Beam and I was thinking about doing the same for Flink when I came by this discussion. I have some comments regarding the design doc:

1. Streaming source:

ES has data streams features but only for time series data; the aim of this source is to read all kind of data. Apart from data streams,  ES behaves like a database: you read the content of an index (similar to a table) corresponding to the given query (similar to SQL). So, regarding streaming changes, if there are changes between 2 read requests made by the source, at the second the whole index (containing the change) will be read another time. So, I see no way of having a regular flow of documents updates (insertion, deletion, update) as we would need for a streaming source. Regarding failover: I guess exactly once semantics cannot be guaranteed, only at least once semantics can. Indeed there is no ack mechanism on already read data. As a conclusion, IMO you are right to target only batch source. Also this answers Yangze Guo's question about streaming source. Question is: can a batch only source be accepted as a built in flink source ?

2. hadoop ecosystem

Why not use RichParallelSourceFunction ?

3. Splitting

Splitting with one split = one ES shard could lead to sub-parallelism. IMHO I think that what's important is the number of executors there are in the Flink cluster: it is better to use runtimeContext.getIndexOfThisSubtask() and runtimeContext.getMaxNumberOfParallelSubtasks() to split the input data using ES slice API.

4. Targeting ES 5, 6, 7

In Beam I used low level REST client because it is compatible with all ES versions so it allows to have the same code base for all versions. But this client is very low level (String based requests). Now, high level rest client exists (it was not available at the time), it is the one I would use. It is also available for ES 5 so you should use it for ES 5 instead of deprecated Transport client.

Best

Etienne Chauchot.


On 04/06/2020 08:47, Yangze Guo wrote:
Hi, Jackey.

Thanks for driving this discussion. I think this proposal should be a
FLIP[1] since it impacts the public interface. However, as we have
only some preliminary discussions atm, a design draft would be ok. But
it would be better to organize your document according to [2].

I've two basic questions:
- Could your summarize all the public API and configurations (DDL) of
the ElasticSearchTableSource?
- If we want to implement ElasticSearch DataStream Source at the same
time, do we need to do a lot of extra work apart from this?

It also would be good if you could do some tests (performance and
correctness) to address Robert's comments.

[1] 
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template

Best,
Yangze Guo

On Wed, Jun 3, 2020 at 9:41 AM Jacky Lau <liuyon...@gmail.com> wrote:
Hi Robert Metzger:
     Thanks for your response. could you please read this docs.
https://www.yuque.com/jackylau-sc7w6/bve18l/14a2ad5b7f86998433de83dd0f8ec067
. Any Is it any problem here? we are worried about
we do not think  throughly. thanks.



--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/

Reply via email to