Re: flink with kafka 0.7

2015-09-14 Thread Stephan Ewen
We tested it only with Kafka 0.8.1 upwards. I know of nobody who has used it with Kafka < 0.8 before. You can give it a try; let us know if it works... BTW: I think Kafka 0.7 has so many open problems that, if you can, it may be worth upgrading...

Re: intermediate result reuse

2015-09-14 Thread Stephan Ewen
toSet should be fine to use. By default, the iterators stream data (across memory, network, and disk), which allows you to work with very large groups (larger than memory). With toSet, your group naturally has to fit into memory, but in most cases it will ;-)

Re: Resolving dependencies when using sbt

2015-09-14 Thread Stephan Ewen
Answering your other mail: I think getting from import statements to dependencies is not straightforward. Usually, if you add flink-core and flink-java or flink-scala, plus maybe flink-streaming-core, you have what you need to program (transitive libraries will be resolved by Maven / sbt) an
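For what it's worth, a minimal build.sbt along the lines Stephan describes might look like this (the artifact names come from his message; the version number and scalaVersion are assumptions, not taken from the thread):

    // build.sbt -- hypothetical minimal setup for a Flink program
    scalaVersion := "2.10.5"

    libraryDependencies ++= Seq(
      "org.apache.flink" % "flink-core"           % "0.9.1",
      "org.apache.flink" % "flink-scala"          % "0.9.1", // or flink-java for the Java API
      "org.apache.flink" % "flink-streaming-core" % "0.9.1"  // only if you use the streaming API
    )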

Re: Resolving dependencies when using sbt

2015-09-14 Thread Daniel Blazevski
BTW, I previously had an already-built version of Flink; I am now building from scratch so that I can use an IDE -- I've got Scala 2.10.5, and building Flink from source will naturally make the whole process of getting the dependencies right more straightforward. I decided I will need to do this eventually

Re: intermediate result reuse

2015-09-14 Thread Fabian Hueske
Ah, sorry :-) toSet, toList, etc. are regular methods of Scala's Iterator API [1] and not part of Flink's API, although the concrete iterator is provided by Flink. I am not a Scala expert, but I think it will eagerly fetch the contents of the function's iterator into a set (or list). This call is pa
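As an illustration of the point (a hedged sketch, not code from the thread; the data and types are invented): toSet eagerly drains the Flink-provided iterator into a Scala Set, which can then be traversed any number of times, while the other side can stay a lazy, single-pass iterator.

    import org.apache.flink.api.scala._
    import org.apache.flink.util.Collector

    val env   = ExecutionEnvironment.getExecutionEnvironment
    val left  = env.fromElements((1, "a"), (2, "b"))
    val right = env.fromElements((1, "x"), (1, "y"))

    // Left outer join via coGroup: the right group is materialized with toSet
    // (eager, must fit in memory) so it can be traversed once per left element;
    // the left iterator is consumed lazily, in a single pass.
    val joined = left.coGroup(right).where(0).equalTo(0) {
      (ls, rs, out: Collector[(Int, String, String)]) =>
        val rightSet = rs.toSet
        for (l <- ls) {
          if (rightSet.isEmpty) out.collect((l._1, l._2, null)) // no match: emit left only
          else rightSet.foreach(r => out.collect((l._1, l._2, r._2)))
        }
    }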

Flink Streaming and Google Cloud Pub/Sub?

2015-09-14 Thread Martin Neumann
Hej, Has anyone tried to connect Flink Streaming to Google Cloud Pub/Sub and has a code example for me? If I have to implement my own sources and sinks, are there any good tutorials for that? cheers Martin
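As a starting point for a custom source (no Pub/Sub connector appears in the thread): a streaming source is a class implementing SourceFunction. The interface details differ between Flink versions, and fetchMessages() below is a hypothetical placeholder for whatever the Pub/Sub client actually offers, so treat this purely as a sketch.

    import org.apache.flink.streaming.api.functions.source.SourceFunction

    // Skeleton of a custom streaming source.
    class PubSubSource extends SourceFunction[String] {
      @volatile private var running = true

      override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
        while (running) {
          // fetchMessages() is a made-up stand-in for a real Pub/Sub pull call.
          fetchMessages().foreach(ctx.collect)
        }
      }

      override def cancel(): Unit = { running = false }

      private def fetchMessages(): Seq[String] = Seq.empty // stub
    }

A sink would analogously implement SinkFunction and be attached with addSink.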

flink with kafka 0.7

2015-09-14 Thread Martin Neumann
Hej, I want to connect Flink Streaming to a Kafka 0.7 cluster. Will this work with the latest release, or does the Kafka implementation rely on Kafka 0.8? cheers Martin

Re: Distribute DataSet to subset of nodes

2015-09-14 Thread Fabian Hueske
Hi Stefan, forcing the scheduling of tasks to certain nodes and reading files from the local file system in a multi-node setup is actually quite tricky and requires a bit of understanding of the internals. It is possible and I can help you with that, but I would recommend using a shared filesystem suc

Re: intermediate result reuse

2015-09-14 Thread Michele Bertoni
Sorry, I was not talking about that collect; I know what a collector is. I was talking about the outer-join case, where inside a coGroup you should do a toSet on the left or right side and collect it to be traversable multiple times. With a toSet it is transforming (something like) a lazy iterator to a l

Re: intermediate result reuse

2015-09-14 Thread Fabian Hueske
Hi Michele, collect on a DataSet and collect on a Collector within a function are two different things that have the same name by coincidence (actually, this is the first time I noticed that). DataSet.collect() fetches a DataSet which can be distributed across several TaskManagers (in a cluster) to
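To spell out the name clash (an illustrative sketch, not code from the thread):

    import org.apache.flink.api.scala._
    import org.apache.flink.util.Collector

    val env  = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements("a b", "c")

    // Collector.collect: emits one record at a time from inside a user function.
    val words = data.flatMap { (line, out: Collector[String]) =>
      line.split(" ").foreach(out.collect)
    }

    // DataSet.collect: pulls the (possibly distributed) result back to the client.
    val local: Seq[String] = words.collect()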

Re: Resolving dependencies when using sbt

2015-09-14 Thread Daniel Blazevski
Thanks for the feedback. I have another question about building using sbt: how can one go from import statements to figuring out library dependencies? It would be nice to write small programs w/o an IDE and be able to go from the import statements to appending to the library dependency list in a .sbt

Re: intermediate result reuse

2015-09-14 Thread Michele Bertoni
Hi Stephan, I have one more question: what happens when I do collect inside a coGroup (i.e. doing an outer join) or in a groupReduce? On 13 Sep 2015, at 02:13, Stephan Ewen <se...@apache.org> wrote: Hi! In most places where you use collect(), you should be able to use

Re: Distribute DataSet to subset of nodes

2015-09-14 Thread Stefan Bunk
Hi, actually, I am distributing my data before the program starts, without using broadcast sets. However, the approach should still work, under one condition:
> DataSet mapped1 = data.flatMap(yourMap).withBroadcastSet(smallData1, "data").setParallelism(5);
> DataSet mapped2 = data.flatMap(you

Re: Resolving dependencies when using sbt

2015-09-14 Thread Aljoscha Krettek
Hi, Giancarlo is correct about the Scala version. The provided Flink libraries should only work with 2.10. I'm actually wondering why it seems to be working for you with 2.9.x. Cheers, Aljoscha

Re: Simple streaming word count doesn't work (Scala)

2015-09-14 Thread Aljoscha Krettek
Hi Francis, I'm afraid this is a very strange bug that results from the interplay between pre-aggregation (an optimization that pre-aggregates the elements of a window as they arrive) and the window size/slide size you use. With some time values it works, but with others it doesn't, agai

Re: Resolving dependencies when using sbt

2015-09-14 Thread Giancarlo Pagano
If you want to use Flink 0.10-SNAPSHOT you can add the Apache Snapshot repository in sbt: resolvers += "apache-snapshot" at "https://repository.apache.org/content/repositories/snapshots/" It would probably be better to use
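Spelled out in a build.sbt (the resolver line is from the message; the SNAPSHOT version string below is an assumption):

    // build.sbt -- pull Flink snapshots from the Apache snapshot repository
    resolvers += "apache-snapshot" at "https://repository.apache.org/content/repositories/snapshots/"

    libraryDependencies += "org.apache.flink" % "flink-scala" % "0.10-SNAPSHOT"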

Re: Distribute DataSet to subset of nodes

2015-09-14 Thread Fabian Hueske
Hi Stefan, I agree with Sachin's approach. That should be the easiest solution and would look like:
env.setParallelism(10); // default is 10
DataSet data = env.read(...) // large data set
DataSet smallData1 = env.read(...) // read first part of small data
DataSet smallData2 = env.read(...) // re
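Putting the quoted fragments together, a hedged end-to-end sketch in Scala (the paths, the mapper logic, and the 5/5 split are invented for illustration):

    import org.apache.flink.api.common.functions.RichFlatMapFunction
    import org.apache.flink.api.scala._
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.util.Collector
    import scala.collection.JavaConverters._

    // A rich function so the broadcast set is available at runtime.
    class MyMapper extends RichFlatMapFunction[String, String] {
      private var small: Seq[String] = _

      override def open(parameters: Configuration): Unit = {
        small = getRuntimeContext.getBroadcastVariable[String]("data").asScala
      }

      override def flatMap(value: String, out: Collector[String]): Unit = {
        if (small.contains(value)) out.collect(value) // toy logic
      }
    }

    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(10) // default is 10

    val data       = env.readTextFile("hdfs:///big")    // large data set
    val smallData1 = env.readTextFile("hdfs:///small1") // first part of small data
    val smallData2 = env.readTextFile("hdfs:///small2") // second part of small data

    // Each flatMap gets one part of the small data as a broadcast set and
    // runs on 5 of the 10 slots, as in the quoted fragment.
    val mapped1 = data.flatMap(new MyMapper).withBroadcastSet(smallData1, "data").setParallelism(5)
    val mapped2 = data.flatMap(new MyMapper).withBroadcastSet(smallData2, "data").setParallelism(5)

    mapped1.union(mapped2).print()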

Simple streaming word count doesn't work (Scala)

2015-09-14 Thread Francis Aranda
Testing the Apache Flink streaming API, I found something weird with a simple example. This code counts the words every 5 seconds under a window of 10 seconds. During the first 10 seconds the counts look good; after that, every print shows a wrong count - one per word. There is something wrong in my
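For context, the job described is roughly the following (a sketch only: the streaming windowing API changed considerably between the 0.9/0.10 line and later releases, so the exact method names here, taken from the later API, are assumptions; the socket source and port are invented):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Count words over a 10-second window, emitting results every 5 seconds.
    env.socketTextStream("localhost", 9999)
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(10), Time.seconds(5))
      .sum(1)
      .print()

    env.execute("Windowed word count")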