Re: Apache Beam, version 2.2.0

2017-12-04 Thread Jean-Baptiste Onofré
My apologies, I thought we had a consensus already. Regards JB On 12/04/2017 11:22 PM, Eugene Kirpichov wrote: Thanks JB for sending the detailed notes about new stuff in 2.2.0! A lot of exciting things indeed. Regarding Java 8: I thought our consensus was to have the release notes say that

Re: Global sum of latest help

2017-12-04 Thread Lukasz Cwik
I believe you can provide ordering if you decide to put any unconsumed records into state. Each time, read state and check whether it's the next expected id. If so, emit the new sum; otherwise push it back onto state until you get the missing ids, allowing you to backfill all the prior
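The buffering idea above can be sketched outside Beam in plain Java (this is a hypothetical illustration of the logic, not Beam state API code — in a real pipeline the map below would live in per-key state):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch: hold records whose id is not yet the next expected one,
// and flush them in id order once the gaps are filled.
public class OrderedSum {
    private final TreeMap<Integer, Long> pending = new TreeMap<>(); // stand-in for Beam state
    private int nextId = 1;
    private long sum = 0;

    /** Accepts a record; returns the running sums emitted (possibly none). */
    public List<Long> accept(int id, long amount) {
        pending.put(id, amount);
        List<Long> emitted = new ArrayList<>();
        // Drain the buffer while the next expected id is available,
        // backfilling all prior sums in order.
        while (pending.containsKey(nextId)) {
            sum += pending.remove(nextId);
            emitted.add(sum);
            nextId++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        OrderedSum s = new OrderedSum();
        System.out.println(s.accept(2, 3)); // [] — id 1 still missing, so id 2 is buffered
        System.out.println(s.accept(1, 1)); // [1, 4] — id 1 arrives, then buffered id 2 flushes
    }
}
```

The same shape maps onto a stateful DoFn: the buffer and `nextId` become state cells, and the drain loop runs on each element (or on a timer).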

Re: Callbacks/other functions run after a PDone/output transform

2017-12-04 Thread Ben Chambers
This would be absolutely great! It seems somewhat similar to the changes that were made to the BigQuery sink to support WriteResult ( https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.java ). I find it helpfu

Re: Global sum of latest help

2017-12-04 Thread Kenneth Knowles
On Mon, Dec 4, 2017 at 3:22 PM, Lukasz Cwik wrote: > Since processing can happen out of order, for example if the input was: > ``` > {"id": "2", parent_id: "a", "timestamp": 2, "amount": 3} > {"id": "1", parent_id: "a", "timestamp": 1, "amount": 1} > {"id": "1", parent_id: "a", "timestamp": 3, "a

Re: Callbacks/other functions run after a PDone/output transform

2017-12-04 Thread Eugene Kirpichov
It makes sense to consider how this maps onto existing kinds of sinks. E.g.: - Something that just makes an RPC per record, e.g. MqttIO.write(): that will emit 1 result per bundle (either a bogus value or number of records written) that will be Combine'd into 1 result per pane of input. A user can

Re: Callbacks/other functions run after a PDone/output transform

2017-12-04 Thread Eugene Kirpichov
I agree that the proper API for enabling the use case "do something after the data has been written" is to return a PCollection of objects where each object represents the result of writing some identifiable subset of the data. Then one can apply a ParDo to this PCollection, in order to "do somethi

Re: Global sum of latest help

2017-12-04 Thread Lukasz Cwik
Since processing can happen out of order, for example if the input was: ``` {"id": "2", parent_id: "a", "timestamp": 2, "amount": 3} {"id": "1", parent_id: "a", "timestamp": 1, "amount": 1} {"id": "1", parent_id: "a", "timestamp": 3, "amount": 2} ``` would the output be 3 and then 5 or would you st
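One reading of "sum of latest" for the example above can be sketched in plain Java (a hypothetical illustration, not Beam code: per id, only the amount with the greatest timestamp counts toward the total, so late-arriving older records are ignored):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of latest-wins semantics: track, per id, the amount carried by
// the record with the greatest timestamp, and sum those latest amounts.
public class LatestSum {
    // id -> {timestamp, amount} of the latest record seen for that id.
    private final Map<String, long[]> latest = new HashMap<>();

    /** Folds in a record and returns the sum of the latest amount per id. */
    public long update(String id, long timestamp, long amount) {
        long[] cur = latest.get(id);
        if (cur == null || timestamp > cur[0]) {
            latest.put(id, new long[] {timestamp, amount});
        }
        long sum = 0;
        for (long[] v : latest.values()) sum += v[1];
        return sum;
    }

    public static void main(String[] args) {
        LatestSum s = new LatestSum();
        // The out-of-order input from the thread:
        System.out.println(s.update("2", 2, 3)); // 3
        System.out.println(s.update("1", 1, 1)); // 4
        System.out.println(s.update("1", 3, 2)); // 5 — timestamp 3 replaces id 1's earlier amount
    }
}
```

Under these semantics the outputs are 3, 4, 5 in arrival order; whether intermediate values like 4 should be emitted at all is exactly the ordering question raised in the thread.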

Re: Apache Beam, version 2.2.0

2017-12-04 Thread Lukasz Cwik
I also believe we were still in the investigatory phase for dropping support for Java 7. On Mon, Dec 4, 2017 at 2:22 PM, Eugene Kirpichov wrote: > Thanks JB for sending the detailed notes about new stuff in 2.2.0! A lot > of exciting things indeed. > > Regarding Java 8: I thought our consensus w

Re: Apache Beam, version 2.2.0

2017-12-04 Thread Eugene Kirpichov
Thanks JB for sending the detailed notes about new stuff in 2.2.0! A lot of exciting things indeed. Regarding Java 8: I thought our consensus was to have the release notes say that we're *considering* going Java8-only, and use that to get more opinions from the user community - but I can't find th

Re: Apache Beam, version 2.2.0

2017-12-04 Thread Vilhelm von Ehrenheim
I'm super excited about this release! Great work everyone involved! On Mon, Dec 4, 2017 at 10:58 AM, Jean-Baptiste Onofré wrote: > Just an important note that we forgot to mention. > > !! The 2.2.0 release will be the last one supporting Spark 1.x and Java 7 > !! > > Starting from Beam 2.3.0, th

Global sum of latest help

2017-12-04 Thread Vilhelm von Ehrenheim
Hi all! First of all, great work on the 2.2.0 release! Really excited to start using it. I have a problem with how to construct a pipeline that emits a sum of latest values, which I hope someone might have some ideas on how to solve. Here is what I have: I have a stateful stream of eve

Re: Callbacks/other functions run after a PDone/output transform

2017-12-04 Thread Robert Bradshaw
+1 At the very least an empty PCollection could be produced with no promises about its contents but the ability to be followed (e.g. as a side input), which is forward compatible with whatever actual metadata one may decide to produce in the future. On Mon, Dec 4, 2017 at 11:06 AM, Kenneth Knowle

Re: Spark Runner Issues with YARN

2017-12-04 Thread Jean-Baptiste Onofré
Hi, I second Luke here: you have to use Spark 1.x or use the PR supporting Spark 2.x. Regards JB On 12/04/2017 08:14 PM, Lukasz Cwik wrote: It seems like you're trying to use Spark 2.1.0. Apache Beam currently relies on users using Spark 1.6.3. There is an open pull request[1] to migrate to Spa

Re: Spark Runner Issues with YARN

2017-12-04 Thread Lukasz Cwik
It seems like you're trying to use Spark 2.1.0. Apache Beam currently relies on users using Spark 1.6.3. There is an open pull request[1] to migrate to Spark 2.2.0. 1: https://github.com/apache/beam/pull/4208/ On Mon, Dec 4, 2017 at 10:58 AM, Opitz, Daniel A wrote: > We are trying to submit a Spa
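Assuming a Maven build, a minimal sketch of the fix is to pin the Spark artifacts the job is launched with to what the Beam 2.2.0 Spark runner targets (Spark 1.6.3; the Scala 2.10 artifact suffix here is an assumption about the build setup, not something stated in the thread):

```xml
<!-- Hypothetical pom.xml fragment: match Spark 1.6.3, the version the
     Beam 2.2.0 Spark runner relies on, rather than a Spark 2.x default. -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark</artifactId>
  <version>2.2.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.3</version>
  <scope>provided</scope>
</dependency>
```

The `provided` scope assumes the cluster supplies Spark at runtime, so the cluster's Spark install must itself be 1.6.x for this to work.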

Re: Callbacks/other functions run after a PDone/output transform

2017-12-04 Thread Kenneth Knowles
+dev@ I am in complete agreement with Luke. Data dependencies are easy to understand and a good way for an IO to communicate and establish causal dependencies. Converting an IO from PDone to real output may spur further useful thoughts based on the design decisions about what sort of output is mos

Spark Runner Issues with YARN

2017-12-04 Thread Opitz, Daniel A
We are trying to submit a Spark job through YARN with the following command: spark-submit --conf spark.yarn.stagingDir=/path/to/stage --verbose --class com.my.class --jars /path/to/jar1,path/to/jar2 /path/to/main/jar/application.jar The application is being populated in the YARN scheduler howev

Re: Callbacks/other functions run after a PDone/output transform

2017-12-04 Thread Lukasz Cwik
I think all sinks actually do have valuable information to output which can be used after a write (file names, transaction/commit/row ids, table names, ...). In addition to this metadata, having a PCollection of all successful writes and all failed writes is useful for users so they can chain an ac

Re: Apache Beam, version 2.2.0

2017-12-04 Thread Jean-Baptiste Onofré
Just an important note that we forgot to mention. !! The 2.2.0 release will be the last one supporting Spark 1.x and Java 7 !! Starting from Beam 2.3.0, the Spark runner will work only with Spark 2.x and we will focus only on Java 8. Regards JB On 12/04/2017 10:15 AM, Jean-Baptiste Onofré wrote

Re: Apache Beam, version 2.2.0

2017-12-04 Thread Jean-Baptiste Onofré
Thanks Reuven! I would like to emphasize some highlights in the 2.2.0 release: - New IOs have been introduced: * TikaIO, leveraging Apache Tika, allowing users to deal with many different data formats * RedisIO, to read and write key/value pairs from a Redis server. This IO will soon be extende