Re: error message

2015-10-01 Thread Stephan Ewen
Hi! The most common cause for "NoSuchMethodError" in Java is a mix-up between code versions. You may have compiled against a different version of the code than you are running against. The fix is usually simply to make sure that the version you program against and the version you run are the same, and re

Re: kryo exception due to race condition

2015-10-01 Thread Stephan Ewen
This looks to me like a bug where type registrations are not properly forwarded to all Serializers. Can you open a JIRA ticket for this? On Thu, Oct 1, 2015 at 6:46 PM, Stefano Bortoli wrote: > Hi guys, > > I hit a Kryo exception while running a process 'crossing' POJOs datasets. > I am using t

Re: data flow example on cluster

2015-10-02 Thread Stephan Ewen
@Lydia Did you create your POM files for your job with an 0.8.x quickstart? Can you try to simply re-create your project's POM files with a new quickstart? I think that the POMS between 0.8-incubating-SNAPSHOT and 0.10-SNAPSHOT may not be quite compatible any more... On Fri, Oct 2, 2015 at 12:0

Re: JM/TM startup time

2015-10-02 Thread Stephan Ewen
The delay you see happens when the TaskManager allocates the memory for its memory manager. Allocating that much in a JVM can take a bit, although 40 seconds looks a lot to me... How do you start the JVM? Are Xmx and Xms set to the same value? If not, the JVM incrementally grows through multiple g
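Whether -Xms and -Xmx match can be checked from inside the JVM. The following is a minimal plain-Java sketch (not Flink-specific): when -Xms equals -Xmx, the committed heap already sits at the maximum on startup, so the JVM never pauses to grow the heap incrementally.

```java
public class HeapSettings {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();      // roughly corresponds to -Xmx
        long committed = rt.totalMemory(); // currently committed heap; starts near -Xms
        System.out.println("max=" + max + " committed=" + committed);
        // If committed is far below max, the heap will be grown in steps,
        // which can contribute to slow startup for very large heaps.
    }
}
```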

Re: JM/TM startup time

2015-10-02 Thread Stephan Ewen
Is that a new observation that it takes so long, or has it always taken so long? On Fri, Oct 2, 2015 at 5:40 PM, Robert Schmidtke wrote: > I figured the JM would be waiting for the TMs. Each of my nodes has 64G of > memory available. > > On Fri, Oct 2, 2015 at 5:38 PM, Maximilian Michels wrote:

Re: JM/TM startup time

2015-10-02 Thread Stephan Ewen
just looked into an old >> log (well, from last Friday) and it took about 1 minute for 31 TMs to >> connect to 1 JM. They each had -Xms and -Xmx6079m though. >> >> On Fri, Oct 2, 2015 at 5:44 PM, Stephan Ewen wrote: >> >>> Is that a new observation that i

Re: Destroy StreamExecutionEnv

2015-10-05 Thread Stephan Ewen
Matthias' solution should work in most cases. In cases where you do not control the source (or the source can never be finite, like the Kafka source), we often use a trick in the tests, which is throwing a special type of exception (a SuccessException). You can catch this exception on env.execute

Re: Error trying to access JM through proxy

2015-10-05 Thread Stephan Ewen
I think this is yet another problem caused by Akka's overly strict message routing. An actor system bound to a certain URL can only receive messages that are sent to that exact URL. All other messages are dropped. This has many problems: - Proxy routing (as described here, send to the proxy UR

Re: Reading from multiple input files with fewer task slots

2015-10-05 Thread Stephan Ewen
I assume this concerns the streaming API? Can you share your program and/or the custom input format code? On Mon, Oct 5, 2015 at 12:33 PM, Pieter Hameete wrote: > Hello Flinkers! > > I run into some strange behavior when reading from a folder of input files. > > When the number of input files i

Re: Reading from multiple input files with fewer task slots

2015-10-05 Thread Stephan Ewen
scala > The Custom Input Format at > https://github.com/PHameete/dawn-flink/blob/development/src/main/scala/wis/dawnflink/parsing/xml/XML2DawnInputFormat.java > > Cheers! > > 2015-10-05 12:38 GMT+02:00 Stephan Ewen : > >> I assume this concerns the streaming API? >> &g

Re: Reading from multiple input files with fewer task slots

2015-10-05 Thread Stephan Ewen
file no records were parsed. > > Thanks alot for your help! > > - Pieter > > 2015-10-05 12:50 GMT+02:00 Stephan Ewen : > >> If you have more files than task slots, then some tasks will get multiple >> files. That means that open() and close() are called multiple times o

Re: ExecutionEnvironment setConfiguration API

2015-10-06 Thread Stephan Ewen
Hi! Are you on the SNAPSHOT master version? You can pass the configuration to the constructor of the execution environment, or create one via ExecutionEnvironment.createLocalEnvironment(config) or via createRemoteEnvironment(host, port, configuration, jarFiles); The change of the signature was p

Re: ExecutionEnvironment setConfiguration API

2015-10-06 Thread Stephan Ewen
ck support! >> >> Best, >> Flavio >> >> On Tue, Oct 6, 2015 at 10:48 AM, Stephan Ewen wrote: >> >>> Hi! >>> >>> Are you on the SNAPSHOT master version? >>> >>> You can pass the configuration to the constructor of the

Re: Parallel file read in LocalEnvironment

2015-10-07 Thread Stephan Ewen
The split functionality is in the FileInputFormat and the functionality that takes care of lines across splits is in the DelimitedInputFormat. On Wed, Oct 7, 2015 at 3:24 PM, Fabian Hueske wrote: > I'm sorry there is no such documentation. > You need to look at the code :-( > > 2015-10-07 15:19

Re: data sink stops method

2015-10-08 Thread Stephan Ewen
Yes, sinks in Flink are lazy and do not trigger execution automatically. We made this choice to allow multiple concurrent sinks (splitting the streams and writing to many outputs concurrently). That requires explicit execution triggers (env.execute()). The exceptions are, as mentioned, the "eager"

Re: Debug OutOfMemory

2015-10-08 Thread Stephan Ewen
Can you give us a bit more background? What exactly is your program doing? - Are you running a DataSet program, or a DataStream program? - Is it one simple source that reads from S3, or are there multiple sources? - What operations do you apply on the CSV file? - Are you using Flink's S3

Re: Debug OutOfMemory

2015-10-08 Thread Stephan Ewen
ot Emr) as far as Flink's s3 connector > doesn't work at all. > > Currently I have 3 taskmanagers each 5k MB, but I tried different > configurations and all leads to the same exception > > *Sent from my ZenFone > On Oct 8, 2015 12:05 PM, "Stephan Ewen" wrote: >

Re: Debug OutOfMemory

2015-10-08 Thread Stephan Ewen
at > org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:453) > at > org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:176) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) > at java.lang.Threa

Re: Debug OutOfMemory

2015-10-08 Thread Stephan Ewen
> Konstantin Kudryavtsev > > On Thu, Oct 8, 2015 at 12:35 PM, Stephan Ewen wrote: > >> Ah, that makes sense! >> >> The problem is not in the core runtime, it is in the delimited input >> format. It probably looks for the line split character and never finds it,

Re: ExecutionEnvironment setConfiguration API

2015-10-14 Thread Stephan Ewen
On Tue, Oct 6, 2015 at 11:31 AM, Flavio Pompermaier > wrote: > >> That makes sense: what can be configured should be differentiated between >> local and remote envs (obviously this is a minor issue/improvement) >> >> Thanks again, >> Flavio >> >&

Re: Apache Flink and serious streaming stateful processing

2015-10-16 Thread Stephan Ewen
Hi! As Gyula mentioned an upcoming Pull Request will make the state backend pluggable. We would like to add the following state holders into Flink: (1) Small state in memory (local execution / debugging) : State maintained in a heap hash map, checkpoints to JobManager. This is in there now. (2)

Re: Processing S3 data with Apache Flink

2015-10-20 Thread Stephan Ewen
@Konstantin (2) : Can you try the workaround described by Robert, with the "s3n" file system scheme? We are removing the custom S3 connector now, simply reusing Hadoop's S3 connector for all cases. @Kostia: You are right, there should be no broken stuff that is not clearly marked as "beta". For t

Re:

2015-10-20 Thread Stephan Ewen
@Jakob: If you use Flink standalone (not through YARN), one thing to be aware of is that the relevant change is in the bash scripts that start the cluster, not the code. If you upgraded Flink by copying a newer JAR file, you missed the update of the bash scripts and missed the fix for that issue.

Re: Flink+avro integration

2015-10-20 Thread Stephan Ewen
Hi Andrew! TL;DR There is no out of the box (de)serializer for Flink with Kafka, but it should be not very hard to add. Here is a gist that basically does it. Let me know if that works for you, I'll add it to the Flink source then: https://gist.github.com/StephanEwen/d515e10dd1c609f70bed Greeti

Re: Flink Data Stream Union

2015-10-20 Thread Stephan Ewen
Hi! Two comments: (1) The iterate() statement is probably wrong, as noticed by Anwar. (2) Which version of Flink are you using? In 0.9.x, the Union operator is not lock-safe, in 0.10, it should work well. The 0.10 release is coming up shortly, you can try the 0.10-SNAPSHOT version already. Gree

Re: Flink+avro integration

2015-10-21 Thread Stephan Ewen
@Andrew Flink should work with Scala classes that follow the POJO style (public fields), so you should be able to use the Java Avro Library just like that. If that does not work in your case, please file a bug report! On Wed, Oct 21, 2015 at 9:41 AM, Till Rohrmann wrote: > What was your proble

Re: Flink io files

2015-10-21 Thread Stephan Ewen
Hey! These files are the spilled data from a sort, a hash table, or a cache, when memory runs short. If you have some very big files and some 0 sized, I would guess you are running a Hash Join, and have heavy skew in the distribution of the keys. Greetings, Stephan On Wed, Oct 21, 2015 at 12:5

Re: Flink Data Stream Union

2015-10-21 Thread Stephan Ewen
I think the most crucial question is still whether you are running 0.9.1 or 0.10-SNAPSHOT, because the 0.9.1 union has known issues... If you are running 0.9.1 there is not much you can do except upgrade the version ;-) On Wed, Oct 21, 2015 at 5:19 PM, Aljoscha Krettek wrote: > Hi, > first of al

Re: Flink+avro integration

2015-10-21 Thread Stephan Ewen
This is actually not a bug, or a POJO or Avro problem. It is simply a limitation in the functionality, as the exception message says: "Specifying fields by name is only supported on Case Classes (for now)." Try this with a regular reduce function that selects the max and it should work fine... Gr

Re: Question regarding parallelism

2015-10-22 Thread Stephan Ewen
Hi! The bottom of this page also has an illustration of task to task slots. https://ci.apache.org/projects/flink/flink-docs-release-0.9/setup/config.html There are two optimizations involved: (1) Chaining: Here sources, mappers, filters are chained together. This is pretty classic, most systems

Re: Multiple keys in reduceGroup ?

2015-10-22 Thread Stephan Ewen
Hi! You are checking for equality / inequality with "!=" - can you check with "equals()" ? The key objects will most certainly be different in each record (as they are deserialized individually), but they should be equal. Stephan On Thu, Oct 22, 2015 at 12:20 PM, LINZ, Arnaud wrote: > Hello,
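The pitfall can be reproduced with plain Java, independent of Flink: two boxed Long keys carrying the same value are equal but not identical, so a reference comparison fails. A minimal illustration (hypothetical values, not from the original program):

```java
public class BoxedKeyComparison {
    public static void main(String[] args) {
        // Keys deserialized individually arrive as distinct boxed objects.
        Long k1 = Long.valueOf(1024L);
        Long k2 = Long.valueOf(1024L);
        System.out.println(k1 == k2);      // false: different objects
        System.out.println(k1.equals(k2)); // true: equal values
        // Note: values in [-128, 127] are cached by the JVM, so == can
        // "accidentally" succeed for small keys and hide the bug.
        // Comparing a boxed Long against a primitive long unboxes it,
        // which is why such comparisons still work.
    }
}
```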

Re: Multiple keys in reduceGroup ?

2015-10-22 Thread Stephan Ewen
Aljoscha Krettek > wrote: > > Hi, > but he’s comparing it to a primitive long, so shouldn’t the Long key be > unboxed and the comparison still be valid? > > My question is whether you enabled object-reuse-mode on the > ExecutionEnvironment? > > Cheers, > Aljoscha >

Re: How best to deal with wide, structured tuples?

2015-10-29 Thread Stephan Ewen
Hi Johann! You can try and use the Table API, it has logical tuples that you program with, rather than tuple classes. Have a look here: https://ci.apache.org/projects/flink/flink-docs-master/libs/table.html Stephan On Thu, Oct 29, 2015 at 6:53 AM, Fabian Hueske wrote: > Hi Johann, > > I see

Re: Hi, question about orderBy two columns more

2015-11-01 Thread Stephan Ewen
Actually, sortPartition(col1).sortPartition(col2) results in a single sort that primarily sorts after col1 and secondarily sorts after col2, so it is the same as in SQL when you state "ORDER BY col1, col2". The SortPartitionOperator created with the first "sortPartition(col1)" call appends further
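The resulting primary/secondary ordering can be illustrated with plain Java comparator chaining (an analog for clarity, not the Flink sort implementation): one sort pass with col1 as the primary and col2 as the secondary key, like SQL's "ORDER BY col1, col2".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySort {
    public static void main(String[] args) {
        // Rows of (col1, col2), deliberately out of order.
        List<int[]> rows = new ArrayList<>(Arrays.asList(
            new int[]{2, 1}, new int[]{1, 2}, new int[]{1, 1}));
        // Primary sort on col1, secondary sort on col2 in a single pass.
        rows.sort(Comparator.<int[]>comparingInt(r -> r[0])
                            .thenComparingInt(r -> r[1]));
        for (int[] r : rows) {
            System.out.println(r[0] + "," + r[1]); // 1,1  1,2  2,1
        }
    }
}
```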

Re: Could not load the task's invokable class.

2015-11-02 Thread Stephan Ewen
Hi! You probably miss some jars in your classpath. Usually Maven/SBT resolve that automatically, so I assume you are manually constructing the classpath here? For this particular error, you probably miss the "flink-streaming-core" (0.9.1) / "flink-streaming-java" (0.10) in your classpath. I woul

Re: passing environment variables to flink program

2015-11-02 Thread Stephan Ewen
Hi! What kind of setup are you using, YARN or standalone? In both modes, you should be able to pass your flags via the config entry "env.java.opts" in the flink-conf.yaml file. See here https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#other We have never passed library pa

Re: passing environment variables to flink program

2015-11-02 Thread Stephan Ewen
Ah, okay, I confused the issue. The environment variables would need to be defined or exported in the environment that spawns TaskManager processes. I think there is nothing for that in Flink yet, but it should not be hard to add that. Can you open an issue for that in JIRA? Thanks, Stephan On

Re: [IE] Re: passing environment variables to flink program

2015-11-02 Thread Stephan Ewen
*From:* ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] *On Behalf > Of *Stephan Ewen > *Sent:* Monday, November 02, 2015 4:35 PM > *To:* user@flink.apache.org > *Subject:* [IE] Re: passing environment variables to flink program > > > > Ah, okay, I confused the issue. > &

Re: Create triggers

2015-11-02 Thread Stephan Ewen
You can also try and make the decision on the client. Imagine a program like this. long count = env.readFile(...).filter(...).count(); if (count > 5) { env.readFile(...).map().join(...).reduce(...); } else { env.readFile(...).filter().coGroup(...).map(...); } On Mon, Nov 2, 2015 at 1:35 AM

Re: Published test artifacts for flink streaming

2015-11-05 Thread Stephan Ewen
Hey! There is also a collect() sink in the "flink-streaming-contrib" project, see here: https://github.com/apache/flink/blob/master/flink-contrib/flink-streaming-contrib/src/main/java/org/apache/flink/contrib/streaming/DataStreamUtils.java It should work well locally for testing. In that case you

Re: Running continuously on yarn with kerberos

2015-11-07 Thread Stephan Ewen
Hi Niels! Usually, you simply build the binaries by invoking "mvn -DskipTests clean package" in the root flink directory. The resulting program should be in the "build-target" directory. If the program gets stuck, let us know where and what the last message on the command line is. Please be awar

Re: Running continuously on yarn with kerberos

2015-11-07 Thread Stephan Ewen
> -- Sachin Goel > Computer Science, IIT Delhi > m. +91-9871457685 > > On Sun, Nov 8, 2015 at 2:05 AM, Niels Basjes wrote: > >> How long should this take if you have HDD and about 8GB of RAM? >> Is that 10 minutes? 20? >> >> Niels >> >> On Sat, Nov 7,

Re: Mutable objects in Flink ?

2015-11-07 Thread Stephan Ewen
Hi! Sorry, I do not really understand your question. Can you paste the skeleton of your program code here? Thanks, Stephan On Fri, Nov 6, 2015 at 6:55 PM, Hajira Jabeen wrote: > Hi all, > > I am writing an evolutionary computing application with Flink. > Each object is a particle with multidim

Re: Creating a Source using an Akka actor

2015-11-08 Thread Stephan Ewen
Hi Hector! I know of users that use Camel to produce a stream into Apache Kafka and then use Flink to consume and process the Kafka stream. That pattern works well. Greetings, Stephan On Sun, Nov 8, 2015 at 1:33 PM, rmetzger0 wrote: > Hi Hector, > > I'm sorry that nobody replied to your messag

Re: finite subset of an infinite data stream

2015-11-09 Thread Stephan Ewen
Hi! If you want to work on subsets of streams, the answer is usually to use windows, "stream.keyBy(...).timeWindow(Time.of(1, MINUTE))". The transformations that you want to make, do they fit into a window function? There are thoughts to introduce something like global time windows across the en

Re: Cluster installation gives java.lang.NoClassDefFoundError for everything

2015-11-09 Thread Stephan Ewen
The distributed "start-cluster.sh" script works only, if the code is accessible under the same path on all machines, which must be the same path as on the machine where you invoke the script. Otherwise the paths for remote shell commands will be wrong, and the classpaths will be wrong as a result.

Re: Running continuously on yarn with kerberos

2015-11-09 Thread Stephan Ewen
> Kerberos secured cluster. > Cleaning up the patch so I can submit it in a few days. > > On Sat, Nov 7, 2015 at 10:01 PM, Stephan Ewen wrote: > >> The single shading step on my machine (SSD, 10 GB RAM) takes about 45 >> seconds. HDD may be significantly longer, but should rea

Re: Missing 0.10 SNAPSHOT Download

2015-11-09 Thread Stephan Ewen
Hi! Rather than taking an 0.10-SNAPSHOT, you could also take a 0.10 release candidate. The latest is for example in https://repository.apache.org/content/repositories/orgapacheflink-1053/ Greetings, Stephan On Mon, Nov 9, 2015 at 5:45 PM, Maximilian Michels wrote: > Hi Brian, > > We are curr

Re: Running on a firewalled Yarn cluster?

2015-11-10 Thread Stephan Ewen
Hi Cory! There is no flag to define the BlobServer port right now, but we should definitely add this: https://issues.apache.org/jira/browse/FLINK-2996 If your setup is such that the firewall problem is only between client and master node (and the workers can reach the master on all ports), then y

Re: Cluster installation gives java.lang.NoClassDefFoundError for everything

2015-11-11 Thread Stephan Ewen
Hi! Flink requires at least Java 1.7, so one of the reasons could also be that the classes are rejected for an incompatible version (class format 1.7, JVM does not understand it since it is only version 1.6). That could explain things... Greetings, Stephan On Wed, Nov 11, 2015 at 9:01 AM, Came

Re: Implementing samza table/stream join

2015-11-11 Thread Stephan Ewen
I would encourage you to use the 0.10 version of Flink. Streaming has made some major improvements there. The release is voted on now, you can refer to these repositories for the release candidate code: http://people.apache.org/~mxm/flink-0.10.0-rc8/ https://repository.apache.org/content/reposit

Re: Flink, Kappa and Lambda

2015-11-11 Thread Stephan Ewen
Hi! Can you explain a little more what you want to achieve? Maybe then we can give a few more comments... I briefly read through some of the articles you linked, but did not quite understand their train of thoughts. For example, letting Tomcat write to Cassandra directly, and to Kafka, might just

Re: Apache Flink Operator State as Query Cache

2015-11-11 Thread Stephan Ewen
Hi! In general, if you can keep state in Flink, you get better throughput/latency/consistency and have one less system to worry about (external k/v store). State outside means that the Flink processes can be slimmer and need fewer resources and as such recover a bit faster. There are use cases for

Re: Checkpoints in batch processing & JDBC Output Format

2015-11-11 Thread Stephan Ewen
Hi! You can use both the DataSet API or the DataStream API for that. In case of failures, they would behave slightly differently. DataSet: Fault tolerance for the DataSet API works by restarting the job and redoing all of the work. In some sense, that is similar to what happens in MapReduce, onl

Re: finite subset of an infinite data stream

2015-11-11 Thread Stephan Ewen
I wrong. > > Moreover it should be possible to link the streams by next request with > other filtering criteria. That is create new data transformation chain > after running of env.execute("WordCount Example"). Is it possible now? If > not, is it possible with minimal chan

Re: Apache Flink Operator State as Query Cache

2015-11-16 Thread Stephan Ewen
iced that OperatorState > is not implemented for ConnectedStream, which is quite the opposite of what > you said below. > > Or maybe I misunderstood your sentence here ? > > Thanks, > Anwar. > > > On Wed, Nov 11, 2015 at 10:49 AM, Stephan Ewen wrote: > >> Hi!

Re: Error handling

2015-11-16 Thread Stephan Ewen
Hi Nick! The errors outside your UDF (such as network problems) will be handled by Flink and cause the job to go into recovery. They should be transparently handled. Just make sure you activate checkpointing for your job! Stephan On Mon, Nov 16, 2015 at 6:18 PM, Nick Dimiduk wrote: > I have

Re: Error handling

2015-11-16 Thread Stephan Ewen
oscha > > On 16 Nov 2015, at 19:22, Stephan Ewen wrote: > > > > Hi Nick! > > > > The errors outside your UDF (such as network problems) will be handled > by Flink and cause the job to go into recovery. They should be > transparently handled. > > > > Just mak

Re: Different CoGroup behavior inside DeltaIteration

2015-11-16 Thread Stephan Ewen
It is actually very important that the co group in delta iterations works like that. If the CoGroup touched every element in the solution set, the "decreasing work" effect would not happen. The delta iterations are designed for cases where specific updates to the solution are made, driven by the w

Re: Issue with sharing state in CoFlatMapFunction

2015-11-17 Thread Stephan Ewen
Hi! Can you give us a bit more context? For example share the structure of the program (what streams get windowed and connected in what way)? I would guess that the following is the problem: When you connect one stream to another, then partition n of the first stream connects with partition n of

Re: Issue with sharing state in CoFlatMapFunction

2015-11-17 Thread Stephan Ewen
out.collect(new > Features(tuple.getField(0), tuple.getField(2), tuple.getField(1), start_ts, > end_ts, size, dwell_time, Boolean.FALSE)); > > > > On Tuesday, November 17, 2015 12:59 PM, Stephan Ewen > wrote: > > > Hi! > > Can you give us a bit more context? For

Re: Issue with sharing state in CoFlatMapFunction

2015-11-17 Thread Stephan Ewen
; > > > On Tuesday, November 17, 2015 1:49 PM, Vladimir Stoyak > wrote: > > > Perfect! It does explain my problem. > > Thanks a lot > > > > On Tuesday, November 17, 2015 1:43 PM, Stephan Ewen > wrote: > > > Is the CoFlatMapFunction intended to be exe

Re: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

2015-11-17 Thread Stephan Ewen
Hi Arnaud! Java direct-memory is tricky to debug. You can turn on the memory logging or check the TaskManager tab in the web dashboard - both report on direct memory consumption. One thing you can look for is forgetting to close streams. That means the streams consume native resources until the J
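The "forgetting to close streams" issue is easiest to avoid with try-with-resources, which guarantees close() even when an exception is thrown, so native resources are released immediately rather than lingering until garbage collection. A small stdlib sketch (plain Java, not Flink code):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCleanup {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("stream-demo", ".txt");
        tmp.deleteOnExit();
        // try-with-resources closes the stream on every exit path,
        // releasing the native file handle deterministically.
        try (OutputStream out = new FileOutputStream(tmp)) {
            out.write("hello".getBytes());
        }
        try (InputStream in = new FileInputStream(tmp)) {
            System.out.println(in.read() == 'h'); // true
        }
    }
}
```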

Re: Join Stream with big ref table

2015-11-17 Thread Stephan Ewen
I think this pattern may be common, so some tools that share such a table across multiple tasks may make sense. Would be nice to add a handler that you give an "initializer" which reads the data and build the shared lookup map. The first to acquire the handler actually initializes the data set (re

Re: Published test artifacts for flink streaming

2015-11-18 Thread Stephan Ewen
dicates to run >> over every stream event. Is the java client API rich enough to express such >> a flow, or should I examine something lower than DataStream? >> >> Thanks a lot, and sorry for all the newb questions. >> -n >> >> >> On Thursday, November 5,

Re: Checkpoints in batch processing & JDBC Output Format

2015-11-18 Thread Stephan Ewen
en to have further input on how this or a similar > approach (e.g. using a timestamp) could be automated, perhaps by > customizing the output format as well? > > Cheers, > Max > > Am 11.11.2015 um 11:35 schrieb Stephan Ewen : > > Hi! > > You can use both the Da

Re: Does 'DataStream.writeAsCsv' suppose to work like this?

2015-11-18 Thread Stephan Ewen
That makes sense... On Mon, Oct 26, 2015 at 12:31 PM, Márton Balassi wrote: > Hey Max, > > The solution I am proposing is not flushing on every record, but it makes > sure to forward the flushing from the sinkfunction to the outputformat > whenever it is triggered. Practically this means that th

Re: Specially introduced Flink to chinese users in CNCC(China National Computer Congress)

2015-11-18 Thread Stephan Ewen
Thank you indeed for presenting there. It looks like a very large audience! Greetings, Stephan On Mon, Oct 26, 2015 at 11:24 AM, Maximilian Michels wrote: > Hi Liang, > > We greatly appreciate you introduced Flink to the Chinese users at CNCC! > We would love to hear how people like Flink. >

Re: Reading null value from datasets

2015-11-18 Thread Stephan Ewen
Hi Guido! If you use Scala, I would use an Option to represent nullable fields. That is a very explicit solution that marks which fields can be null, and also forces the program to handle this carefully. We are looking to add support for Java 8's Optional type as well for exactly that purpose. G
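The same idea expressed with Java 8's Optional, as a sketch (field and method names here are made up for illustration): the nullable column is represented explicitly, so callers cannot forget to handle the missing case.

```java
import java.util.Optional;

public class NullableField {
    // Modeling a nullable column explicitly instead of passing null around.
    static Optional<Integer> parseAge(String raw) {
        return (raw == null || raw.isEmpty())
            ? Optional.empty()
            : Optional.of(Integer.parseInt(raw));
    }

    public static void main(String[] args) {
        // The caller must decide what a missing value means.
        System.out.println(parseAge("42").orElse(-1)); // 42
        System.out.println(parseAge("").orElse(-1));   // -1
    }
}
```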

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Stephan Ewen
s doesn't increase. > What is going on underneath? Is it normal? > > Thanks in advance, > Flavio > > > > On Wed, Oct 7, 2015 at 3:27 PM, Stephan Ewen wrote: > >> The split functionality is in the FileInputFormat and the functionality >> that takes care

Re: Published test artifacts for flink streaming

2015-11-18 Thread Stephan Ewen
before the next test begins? > > What is "sink in DOP 1"? > > Thanks, > Nick > > > On Wednesday, November 18, 2015, Stephan Ewen wrote: > >> There is no global order in parallel streams, it is something that >> applications need to work with. W

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Stephan Ewen
, Flavio Pompermaier wrote: > So why it takes so much to start the job?because in any case the job > manager has to read all the lines of the input files before generating the > splits? > On 18 Nov 2015 17:52, "Stephan Ewen" wrote: > >> Late answer, sorry: >

Re: FlinkKafkaConsumer and multiple topics

2015-11-18 Thread Stephan Ewen
The new KafkaConsumer for Kafka 0.9 should be able to handle this, as the Kafka Client Code itself has support for this then. For 0.8.x, we would need to implement support for recovery inside the consumer ourselves, which is why we decided to initially let the Job Recovery take care of that. If th

Re: Parallel file read in LocalEnvironment

2015-11-18 Thread Stephan Ewen
the local fs (ext4) > On 18 Nov 2015 19:17, "Stephan Ewen" wrote: > >> The JobManager does not read all files, but is has to query the HDFS for >> all file metadata (size, blocks, block locations), which can take a bit. >> There is a separate call to the HDFS Name

Re: Kafka + confluent schema registry Avro parsing

2015-11-19 Thread Stephan Ewen
The KafkaAvroDecoder is not serializable, and Flink uses serialization to distribute the code to the TaskManagers in the cluster. I think you need to "lazily" initialize the decoder, in the first invocation of "deserialize()". That should do it. Stephan On Thu, Nov 19, 2015 at 12:10 PM, Madhuka
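The lazy-initialization pattern can be sketched with plain Java serialization (names are illustrative; this is not the actual Kafka/Avro API): the non-serializable helper lives in a transient field and is built on the first call, after the surrounding object has been shipped to the worker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class LazyDeserializer implements Serializable {
    // transient: excluded from serialization; stand-in for a
    // non-serializable decoder such as KafkaAvroDecoder.
    private transient StringBuilder decoder;

    public String deserialize(byte[] message) {
        if (decoder == null) {             // first invocation on this worker
            decoder = new StringBuilder(); // build the real decoder here
        }
        decoder.setLength(0);
        return decoder.append(new String(message)).toString();
    }

    public static void main(String[] args) throws Exception {
        // Round-trip through Java serialization, similar to how Flink
        // ships user code to the TaskManagers.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new LazyDeserializer());
        oos.flush();
        LazyDeserializer shipped = (LazyDeserializer) new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray())).readObject();
        // The transient field arrives as null and is created lazily.
        System.out.println(shipped.deserialize("ok".getBytes())); // ok
    }
}
```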

Re: Fold vs Reduce in DataStream API

2015-11-19 Thread Stephan Ewen
Hi Ron! You are right, there is a copy/paste error in the docs, it should be a FoldFunction that is passed to fold(), not a ReduceFunction. In Flink-0.10, the FoldFunction is only available on - KeyedStream ( https://ci.apache.org/projects/flink/flink-docs-release-0.10/api/java/org/apache/flin

Re: Fold vs Reduce in DataStream API

2015-11-19 Thread Stephan Ewen
bit. Looks like another one of those > API changes that I'll be struggling with for a little bit. > > On Thu, Nov 19, 2015 at 10:40 AM, Stephan Ewen wrote: > >> Hi Ron! >> >> You are right, there is a copy/paste error in the docs, it should be a >>

Re: placement preferences for streaming jobs

2015-11-22 Thread Stephan Ewen
Hi Stefania! I think there is no hook for that right now. If I understand you correctly, assuming you run YARN or so, you want to give the sources a set of hostnames, and when scheduling, the sources have preferences for those nodes. Within a dataflow program (job), Flink will attempt to co-locat

Re: Flink Streaming Core 0.10 in maven repos

2015-11-23 Thread Stephan Ewen
Hi Arnaud! In 0.10, we renamed the dependency to "flink-streaming-java" (and "flink-streaming-scala"), to be more in line with the structure of the dependencies on the batch side. Just replace "flink-streaming-core" with "flink-streaming-java"... Greetings, Stephan On Mon, Nov 23, 2015 at 9:07

Re: ReduceByKeyAndWindow in Flink

2015-11-23 Thread Stephan Ewen
One addition: You can set the system to use "ingestion time", which gives you event time with auto-generated timestamps and watermarks, based on the time that the events are seen in the sources. That way you have the same simplicity as processing time, and you get the window alignment that Aljosch

Re: Using Hadoop Input/Output formats

2015-11-25 Thread Stephan Ewen
For streaming, I am a bit torn on whether reading a file should have so many prominent functions. Most streaming programs work on message queues, or on monitored directories. Not saying no, but not sure DataSet/DataStream parity is the main goal - they are for different use cases after all.

Re: Working with State example /flink streaming

2015-11-25 Thread Stephan Ewen
Hi Javier! You can solve this both using windows, or using manual state. What is better depends a bit on when you want to have the result (the sum). Do you want a result emitted after each update (or do some other operation with that value) or do you want only the final sum after a certain time?

Re: Working with State example /flink streaming

2015-11-26 Thread Stephan Ewen
text().getKeyValueState("myCounter", > Long.class, 0L); > } > > } > > > We are using a Tuple4 because we want to calculate the sum and the average > (So our Tuple is ID, SUM, Count, AVG). Do we need to add another step to > get a single value out of i

Re: graph problem to be solved

2015-11-27 Thread Stephan Ewen
Hi! Yes, looks like quite a graph problem. The best way to get started with that is to have a look at Gelly: https://ci.apache.org/projects/flink/flink-docs-release-0.10/libs/gelly_guide.html Beware: The problem you describe (all possible paths between all pairs of points) results in an exponenti

Re: Cleanup of OperatorStates?

2015-11-27 Thread Stephan Ewen
Hi Niels! Currently, state is released by setting the value for the key to null. If you are tracking web sessions, you can try and send a "end of session" element that sets the value to null. To be on the safe side, you probably want state that is automatically purged after a while. I would look
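The "automatically purged after a while" idea can be sketched independently of Flink's state API as a map of last-access timestamps plus a sweep that drops expired entries, mimicking session expiry (all names here are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class ExpiringSessions {
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final long timeoutMillis;

    ExpiringSessions(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record activity for a session; called on every event for the key.
    void touch(String sessionId, long now) {
        lastSeen.put(sessionId, now);
    }

    // Drop every session idle longer than the timeout.
    void sweep(long now) {
        lastSeen.values().removeIf(ts -> now - ts > timeoutMillis);
    }

    int size() {
        return lastSeen.size();
    }

    public static void main(String[] args) {
        ExpiringSessions sessions = new ExpiringSessions(1000);
        sessions.touch("a", 0);
        sessions.touch("b", 900);
        sessions.sweep(1500);              // "a" idle 1500ms > 1000ms, purged
        System.out.println(sessions.size()); // 1
    }
}
```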

Re: Cleanup of OperatorStates?

2015-11-27 Thread Stephan Ewen
gt; > Niels > > > > > > > On Fri, Nov 27, 2015 at 11:51 AM, Stephan Ewen wrote: > >> Hi Niels! >> >> Currently, state is released by setting the value for the key to null. If >> you are tracking web sessions, you can try and send a "end of se

Re: flink connectors

2015-11-27 Thread Stephan Ewen
The reason why the binary distribution does not contain all connectors is that this would add all libraries used by the connectors into the binary distribution jar. These libraries partly conflict with each other, and often conflict with the libraries used by the user's programs. Not including the

Re: Doubt about window and count trigger

2015-11-27 Thread Stephan Ewen
Hi! The reason why trigger state is purged right now with the window is to make sure that no memory is occupied any more after the purge. Otherwise, memory consumption would just grow indefinitely, holding state of old triggers. Greetings, Stephan On Fri, Nov 27, 2015 at 4:05 PM, Fabian Hueske

Re: Get an aggregator's value outside of an iteration

2015-11-30 Thread Stephan Ewen
We wanted to combine the accumulators and aggregators for a while, but have not gotten to it so far (there is a pending PR which needs some more work). You can currently work your way around this by using the accumulators together with the aggregators. - Aggregators: Within an iteration across s

Re: Watermarks as "process completion" flags

2015-11-30 Thread Stephan Ewen
Hi Anton! That you can do! You can look at the interfaces "Checkpointed" and "CheckpointNotifier". There you will get a call at every checkpoint (and can look at what records are before that checkpoint). You also get a call once the checkpoint is complete, which corresponds to the point when ever

Re: Watermarks as "process completion" flags

2015-11-30 Thread Stephan Ewen
kpoint and catch it at the end of the DAG that would solve my problem > (well, I also need to somehow distinguish my checkpoint from Flink's > auto-generated ones). > > Sorry for being too chatty, this is the topic where I need expert opinion, > can't find out the answer by

Re: Watermarks as "process completion" flags

2015-11-30 Thread Stephan Ewen
duced? > > On Mon, Nov 30, 2015 at 2:13 PM, Stephan Ewen wrote: > >> Hi! >> >> If you implement the "Checkpointed" interface, you get the function calls >> to "snapshotState()" at the point when the checkpoint barrier arrives at an >> operat

Re: Cleanup of OperatorStates?

2015-11-30 Thread Stephan Ewen
as recorded). > > Now in my case all events are in Kafka (for say 2 weeks). > When something goes wrong I want to be able to 'reprocess' everything from > the start of the queue. > Here the matter of 'event time' becomes a big question for me; In those > 'replay&

Re: Cleanup of OperatorStates?

2015-12-01 Thread Stephan Ewen
realtime version I currently break down the Window to get the single events > and after that I have to recreate the same Window again. > > I'm looking forward to the implementation direction you are referring to. > I hope you have a better way of doing this. > > Niels Basjes &g

Re: Cleanup of OperatorStates?

2015-12-01 Thread Stephan Ewen
Just for clarification: The real-time results should also contain the visitId, correct? On Tue, Dec 1, 2015 at 12:06 PM, Stephan Ewen wrote: > Hi Niels! > > If you want to use the built-in windowing, you probably need two window: > - One for ID assignment (that immediately pi

Re: Cleanup of OperatorStates?

2015-12-01 Thread Stephan Ewen
Hope you can work with this! Greetings, Stephan On Tue, Dec 1, 2015 at 1:00 PM, Stephan Ewen wrote: > Just for clarification: The real-time results should also contain the > visitId, correct? > > On Tue, Dec 1, 2015 at 12:06 PM, Stephan Ewen wrote: > >> Hi Niels! >>

Re: Cleanup of OperatorStates?

2015-12-01 Thread Stephan Ewen
perator.snapshotOperatorState(SessionizingOperator.java:162) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:440) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$1.onEvent(StreamTask.java:574) > ... 8 more > > > Nie

Re: Cleanup of OperatorStates?

2015-12-01 Thread Stephan Ewen
> Niels > > > > On Tue, Dec 1, 2015 at 4:41 PM, Niels Basjes wrote: > >> Thanks! >> I'm going to study this code closely! >> >> Niels >> >> On Tue, Dec 1, 2015 at 2:50 PM, Stephan Ewen wrote: >>

Re: Question about flink message processing guarantee

2015-12-02 Thread Stephan Ewen
There is an overview of what guarantees what sources can give you: https://ci.apache.org/projects/flink/flink-docs-master/apis/fault_tolerance.html#fault-tolerance-guarantees-of-data-sources-and-sinks On Wed, Dec 2, 2015 at 9:19 AM, Till Rohrmann wrote: > Just a small addition. Your sources have

Re: NPE with Flink Streaming from Kafka

2015-12-02 Thread Stephan Ewen
Mihail! The Flink windows are currently in-memory only. There are plans to relax that, but for the time being, having enough memory in the cluster is important. @Gyula: I think window state is currently also limited when using the SqlStateBackend, by the size of a row in the database (because win
