tensorflow-data-validation(with direct runner) failed to process large data because of grpc timeout on workers

2020-08-24 Thread Junjian Xu
Hi, I’m running into a problem using tensorflow-data-validation with the direct runner to generate statistics from some large datasets (over 400GB). All workers appear to stop working after the error message “Keepalive watchdog fired. Closing transport.”, which looks like a gRPC keepalive timeout.

Re: Out-of-orderness of window results when testing stateful operators with TextIO

2020-08-24 Thread Reuven Lax
Generally you should not rely on PCollection being ordered, though there have been discussions about adding some time-ordering semantics. On Sun, Aug 23, 2020 at 9:06 PM Rui Wang wrote: > Current Beam model does not guarantee an ordering after a GBK (i.e. > Combine.perKey() in your). So you ca

Re: tensorflow-data-validation(with direct runner) failed to process large data because of grpc timeout on workers

2020-08-24 Thread Luke Cwik
Another person reported something similar for Dataflow and it seemed as though in their scenario they were using locks and either got into a deadlock or starved processing for long enough that the watchdog also failed. Are you using locks and/or having really long single element processing times?

Re: Out-of-orderness of window results when testing stateful operators with TextIO

2020-08-24 Thread Robert Bradshaw
As for the question of writing tests in the face of non-determinism, you should look into TestStream. MyStatefulDoFn still needs to be updated to not assume an ordering. (This can be done by setting timers that provide guarantees that (modulo late data) one has seen all data up to a certain timesta
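A minimal sketch of the idea described above, in plain Python rather than the Beam API: instead of assuming elements arrive in order, buffer them and release them in timestamp order only once a timer-like signal says everything up to a given timestamp has been seen. The `OrderedBuffer` class and its `add` and `advance_to` methods are hypothetical names used for illustration; a real stateful DoFn would use Beam's state and timer facilities, and late data is ignored here.

```python
import heapq
from itertools import count


class OrderedBuffer:
    """Buffers (timestamp, value) pairs and releases them in timestamp order.

    Hypothetical helper for illustration only; not part of Apache Beam.
    """

    def __init__(self):
        self._heap = []
        # Tie-breaker so that equal timestamps come out in arrival order
        # and values themselves are never compared.
        self._arrival = count()

    def add(self, timestamp, value):
        """Store an element; arrival order does not matter."""
        heapq.heappush(self._heap, (timestamp, next(self._arrival), value))

    def advance_to(self, timestamp):
        """Release every buffered element with a timestamp <= `timestamp`.

        This plays the role of a timer firing: the caller asserts that all
        data up to `timestamp` has been seen (late data is not handled).
        """
        ready = []
        while self._heap and self._heap[0][0] <= timestamp:
            ts, _, value = heapq.heappop(self._heap)
            ready.append((ts, value))
        return ready


# Elements arrive out of order, as they may after a GroupByKey or Combine.
buf = OrderedBuffer()
for ts, value in [(3, "c"), (1, "a"), (5, "e"), (2, "b")]:
    buf.add(ts, value)

first = buf.advance_to(3)    # everything up to timestamp 3, in order
second = buf.advance_to(10)  # the remainder
```

The design choice is to pay for buffering in exchange for determinism: the output depends only on element timestamps, not on the order in which the runner happened to deliver them.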

How to integrate Beam SQL windowing query with KafkaIO?

2020-08-24 Thread Minreng Wu
Hi contributors, Sorry to bother you! I ran into a problem when trying to apply a windowing aggregation Beam SQL query to a Kafka input source. The details of the question are in the following link: https://stackoverflow.com/questions/63566057/how-to-integrate-beam-sql-windowing-query-with-kafka

Need Support for ElasticSearch 7.x for beam

2020-08-24 Thread Mohil Khare
Hello, Firstly, I am on Java SDK 2.23.0 and we heavily use Elasticsearch as one of our sinks. It's been a while since Beam was upgraded to support Elasticsearch versions greater than 6.x. Elasticsearch has now moved on to 7.x and we want to use some of their new security features. I want to know w

Re: Need Support for ElasticSearch 7.x for beam

2020-08-24 Thread Kyle Weaver
This ticket indicates Elasticsearch 7.x has been supported since Beam 2.19: https://issues.apache.org/jira/browse/BEAM-5192 Are there any specific features you need that aren't supported? On Mon, Aug 24, 2020 at 11:33 AM Mohil Khare wrote: > Hello, > > Firstly I am on java sdk 2.23.0 and we hea

Re: How to integrate Beam SQL windowing query with KafkaIO?

2020-08-24 Thread Rui Wang
Hi, I checked the query in your SO question and I think the SQL usage is correct. My current guess is that the problem lies in how the watermark is generated and advanced in KafkaIO. Either the watermark didn't pass the end of your SQL window for aggregation, or the data was lagging behind the

Re: Need Support for ElasticSearch 7.x for beam

2020-08-24 Thread Mohil Khare
Hello Kyle, Thanks a lot for your prompt reply. Hmm, strange. I think I was getting an error saying the version was not supported. I'm not sure whether I had updated my Beam version the last time I tried with ES 7.x. Let me try it again and let you know. Thanks and Regards Mohil On Mon, Aug 24, 2020 at 11:54 A

Re: Out-of-orderness of window results when testing stateful operators with TextIO

2020-08-24 Thread Dongwon Kim
Thanks Reuven for the input and Wang for CC'ing Reuven. “Generally you should not rely on PCollection being ordered” Is it because Beam splits PCollection into multiple input splits and tries to process it as efficiently as possible without considering times? This one is very confusing as I've b