Hi,
I made the Elasticsearch connector of Apache Beam and I was thinking
about doing the same for Flink when I came by this discussion. I have
some comments regarding the design doc:
1. Streaming source:
ES has data streams features but only for time series data; the aim of
this source is to read all kind of data. Apart from data streams, ES
behaves like a database: you read the content of an index (similar to a
table) corresponding to the given query (similar to SQL). So, regarding
streaming changes, if there are changes between 2 read requests made by
the source, at the second the whole index (containing the change) will
be read another time. So, I see no way of having a regular flow of
documents updates (insertion, deletion, update) as we would need for a
streaming source. Regarding failover: I guess exactly once semantics
cannot be guaranteed, only at least once semantics can. Indeed there is
no ack mechanism on already read data. As a conclusion, IMO you are
right to target only batch source. Also this answers Yangze Guo's
question about streaming source. Question is: can a batch only source be
accepted as a built in flink source ?
2. hadoop ecosystem
Why not use RichParallelSourceFunction ?
3. Splitting
Splitting with one split = one ES shard could lead to sub-parallelism.
IMHO I think that what's important is the number of executors there are
in the Flink cluster: it is better to use
runtimeContext.getIndexOfThisSubtask() and
runtimeContext.getMaxNumberOfParallelSubtasks() to split the input data
using ES slice API.
4. Targeting ES 5, 6, 7
In Beam I used low level REST client because it is compatible with all
ES versions so it allows to have the same code base for all versions.
But this client is very low level (String based requests). Now, high
level rest client exists (it was not available at the time), it is the
one I would use. It is also available for ES 5 so you should use it for
ES 5 instead of deprecated Transport client.
Best
Etienne Chauchot.
On 04/06/2020 08:47, Yangze Guo wrote:
Hi, Jackey.
Thanks for driving this discussion. I think this proposal should be a
FLIP[1] since it impacts the public interface. However, as we have
only some preliminary discussions atm, a design draft would be ok. But
it would be better to organize your document according to [2].
I've two basic questions:
- Could your summarize all the public API and configurations (DDL) of
the ElasticSearchTableSource?
- If we want to implement ElasticSearch DataStream Source at the same
time, do we need to do a lot of extra work apart from this?
It also would be good if you could do some tests (performance and
correctness) to address Robert's comments.
[1]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP+Template
Best,
Yangze Guo
On Wed, Jun 3, 2020 at 9:41 AM Jacky Lau <liuyon...@gmail.com> wrote:
Hi Robert Metzger:
Thanks for your response. could you please read this docs.
https://www.yuque.com/jackylau-sc7w6/bve18l/14a2ad5b7f86998433de83dd0f8ec067
. Any Is it any problem here? we are worried about
we do not think throughly. thanks.
--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/