I second this. In my case i have a lot of stateful updates in my stream so it doesnt work at all. I ended up using Kafka between to mitigate the issue but it would be great with an IO connector that writes on keys directly.
I think a PR is always welcome. :) / Vilhelm On 15 Nov 2017 05:16, "Chet Aldrich" <[email protected]> wrote: > Hello all! > > So I’ve been using the ElasticSearchIO sink for a project (unfortunately > it’s Elasticsearch 5.x, and so I’ve been messing around with the latest RC) > and I’m finding that it doesn’t allow for changing the document ID, but > only lets you pass in a record, which means that the document ID is > auto-generated. See this line for what specifically is happening: > > https://github.com/apache/beam/blob/master/sdks/java/io/ > elasticsearch/src/main/java/org/apache/beam/sdk/io/ > elasticsearch/ElasticsearchIO.java#L838 > > Essentially the data part of the document is being placed but it doesn’t > allow for other properties, such as the document ID, to be set. > > This leads to two problems: > > 1. Beam doesn’t necessarily guarantee exactly-once execution for a given > item in a PCollection, as I understand it. This means that you may get more > than one record in Elastic for a given item in a PCollection that you pass > in. > > 2. You can’t do partial updates to an index. If you run a batch job once, > and then run the batch job again on the same index without clearing it, you > just double everything in there. > > Is there any good way around this? > > I’d be happy to try writing up a PR for this in theory, but not sure how > to best approach it. Also would like to figure out a way to get around this > in the meantime, if anyone has any ideas. > > Best, > > Chet > > P.S. CCed [email protected] because it seems like he’s been doing work > related to the elastic sink. > > >
