Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
String is perfectly fine as a key. Looks like SourceA / SourceB are not correctly identified as POJOs. 2016-02-09 14:25 GMT+01:00 Dominique Rondé : > Sorry, i was out for lunch. Maybe the problem is that sessionID is a > String? > > public abstract class Parent{ > private Date eventDate; > priv
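For reference, a minimal sketch of a class that passes Flink's POJO analysis; the field names follow the thread, everything else is illustrative. Flink treats a class as a POJO only if it is public, has a public no-argument constructor, and all fields are public or reachable via getters/setters.

```java
import java.util.Date;

public class SourceA {
    public Date eventDate;    // public field: visible to Flink's POJO analyzer
    public String sessionID;  // a String field is a perfectly valid key

    public SourceA() {}       // required public no-arg constructor
}
```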

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
Hi, glad you could resolve the POJO issue, but the new error doesn't look right. The CO_GROUP_RAW strategy should only be used for programs that are implemented against the Python DataSet API. I guess that's not the case since all code snippets were Java so far. Can you post the full stacktrace of

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
.lang.Thread.run(Thread.java:745) > Caused by: java.lang.Exception: Unsupported driver strategy for join > driver: CO_GROUP_RAW > at > org.apache.flink.runtime.operators.JoinDriver.prepare(JoinDriver.java:193) > at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:459)

Re: Dataset filter improvement

2016-02-10 Thread Fabian Hueske
Hi Flavio, I did not completely understand which objects should go where, but here are some general guidelines: - early filtering is mostly a good idea (unless evaluating the filter expression is very expensive) - you can use a flatMap function to combine a map and a filter - applying multiple fu
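A minimal sketch of combining the map and filter steps in a single flatMap, as suggested above; the CSV-style input is an assumption.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

DataSet<Tuple2<String, Integer>> parsed = lines
    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            String[] parts = line.split(",");        // the "map" step
            int value = Integer.parseInt(parts[1]);
            if (value > 0) {                         // the "filter" step, applied early
                out.collect(new Tuple2<>(parts[0], value));
            }
        }
    });
```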

Re: Dataset filter improvement

2016-02-10 Thread Fabian Hueske
dataset based. > Which one do you think is the best solution? > > On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske wrote: > >> Hi Flavio, >> >> I did not completely understand which objects should go where, but here >> are some general guidelines: >> >>

Re: How to convert List to flink DataSet

2016-02-10 Thread Fabian Hueske
Hi Subash, how is findOutliers implemented? It might be that you mix up local and cluster computation. All DataSets are processed in the cluster. Please note the following: - ExecutionEnvironment.fromCollection() transforms a client-local collection into a DataSet by serializing it and sending it
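A small sketch of that client-to-cluster handover:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// This List lives in the client JVM ...
List<Tuple2<String, Double>> local = Arrays.asList(
        Tuple2.of("a", 1.0), Tuple2.of("b", 2.0));

// ... fromCollection() serializes it and ships it to the cluster as a DataSet.
DataSet<Tuple2<String, Double>> ds = env.fromCollection(local);
// All subsequent transformations on ds run in the cluster, not in the client.
```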

Re: How to convert List to flink DataSet

2016-02-10 Thread Fabian Hueske
} > } > return finalElements; > } > > I have attached herewith the screenshot of my project structure and > KMeansOutlierDetection.java file for more clarity. > > > Best Regards, > Subash Basnet > > On Wed, Feb 10, 2016 at 12:26 PM, Fab

Re: TaskManager unable to register with JobManager

2016-02-10 Thread Fabian Hueske
Hi Ravinder, please have a look at the configuration documentation: --> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager Best, Fabian 2016-02-10 13:55 GMT+01:00 Ravinder Kaur : > Hello All, > > I need to know the range of ports that are being

Re: How to convert List to flink DataSet

2016-02-10 Thread Fabian Hueske
folder and files get created. But in case of > *outlierDetection.jar* no file or folder gets written inside kmeans. > > How is it that outlier class is able to write file via eclipse but outlier > jar not able to write via flink web submission client. > > > Best Regards,

Re: Merge or minus Dataset API missing

2016-02-12 Thread Fabian Hueske
Hi Flavio, If I got it right, you can use a FullOuterJoin. It will give you both elements on a match and otherwise a left or a right element and null. Best, Fabian 2016-02-12 16:48 GMT+01:00 Flavio Pompermaier : > Hi to all, > > I have a use case where I have to merge 2 datasets but I can't fin
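A sketch of the suggested full outer join, assuming Tuple2<Long, String> records joined on field 0:

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

DataSet<Tuple2<Long, String>> merged = ds1
    .fullOuterJoin(ds2)
    .where(0)
    .equalTo(0)
    .with(new JoinFunction<Tuple2<Long, String>, Tuple2<Long, String>, Tuple2<Long, String>>() {
        @Override
        public Tuple2<Long, String> join(Tuple2<Long, String> left, Tuple2<Long, String> right) {
            // On a match both sides are set; otherwise exactly one side is null.
            return left != null ? left : right;
        }
    });
```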

Re: writeAsCSV with partitionBy

2016-02-15 Thread Fabian Hueske
Hi Srikanth, DataSet.partitionBy() will partition the data on the declared partition fields. If you append a DataSink with the same parallelism as the partition operator, the data will be written out with the defined partitioning. It should be possible to achieve the behavior you described using D
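A sketch using the Java API's partitionByHash (the concrete method behind "partitionBy" here); the sink picks up the same parallelism by default, so the partitioning carries into the output files:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public static void writePartitioned(DataSet<Tuple2<Integer, String>> data) {
    data.partitionByHash(0)             // partition on field 0
        .writeAsCsv("hdfs:///output");  // sink inherits the same parallelism,
                                        // so each file holds one hash partition
}
```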

Re: Regarding Concurrent Modification Exception

2016-02-15 Thread Fabian Hueske
Hi, This stacktrace looks really suspicious. It includes classes from the submission client (CLIClient), optimizer (JobGraphGenerator), and runtime (KryoSerializer). Is it possible that you try to start a new Flink job inside another job? This would not work. Best, Fabian

Re: schedule tasks `inside` Flink

2016-02-15 Thread Fabian Hueske
Hi Michal, If I got your requirements right, you could try to solve this issue by serving the updates through a regular DataStream. You could add a SourceFunction which periodically emits a new version of the cache and a CoFlatMap operator which receives on the first input the regular streamed inp
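A sketch of that pattern with illustrative String/Map types; the periodic source emitting cache versions is assumed to exist elsewhere. Note the cache field is plain operator-local state here, not checkpointed.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public static DataStream<String> enrichWithCache(
        DataStream<String> events, DataStream<Map<String, String>> cacheUpdates) {
    return events.connect(cacheUpdates)
        .flatMap(new CoFlatMapFunction<String, Map<String, String>, String>() {
            private Map<String, String> cache = new HashMap<>();

            @Override
            public void flatMap1(String value, Collector<String> out) {
                // regular input: enrich with the current cache version
                String hit = cache.get(value);
                out.collect(hit != null ? hit : value);
            }

            @Override
            public void flatMap2(Map<String, String> newCache, Collector<String> out) {
                cache = newCache;  // second input: swap in the new cache version
            }
        });
}
```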

Re: Problem with KeyedStream 1.0-SNAPSHOT

2016-02-15 Thread Fabian Hueske
Hi Javier, Keys is an internal class and was recently moved to a different package. So it appears like your Flink dependencies are not aligned to the same version. We also added Scala version identifiers to all our dependencies which depend on Scala 2.10. For instance, flink-scala became flink-sc

Re: Read once input data?

2016-02-15 Thread Fabian Hueske
Hi, it looks like you are executing two distinct Flink jobs. DataSet.count() triggers the execution of a new job. If you have an execute() call in your program, this will lead to two Flink jobs being executed. It is not possible to share state among these jobs. Maybe you should add a custom count

Re: Read once input data?

2016-02-15 Thread Fabian Hueske
set (ds). Each map will have it's own reduction, so > is there a way to avoid creating two jobs for such scenario? > > The reason is, reading these binary matrices are expensive. In our current > MPI implementation, I am using memory maps for faster loading and reuse. > > Tha

Re: Read once input data?

2016-02-15 Thread Fabian Hueske
you might have an example on how to define a data flow with > Flink? > > > > On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske wrote: > >> It is not possible to "pin" data sets in memory, yet. >> However, you can stream the same data set through two different ma

Re: Read once input data?

2016-02-16 Thread Fabian Hueske
s > single job in Flink, so data read happens only once? > > Thanks, > Saliya > > On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske wrote: > >> It is not possible to "pin" data sets in memory, yet. >> However, you can stream the same data set through two diffe

Re: writeAsCSV with partitionBy

2016-02-16 Thread Fabian Hueske
older ./output/test/<1,2,3,4...> > > But what I was looking for is Hive style partitionBy that will output with > folder structure > >./output/field0=1/file >./output/field0=2/file >./output/field0=3/file >./output/field0=4/file > > Assuming fiel

Re: Read once input data?

2016-02-16 Thread Fabian Hueske
ultiple reads of the input data to create > the same dataset? > > Thank you, > saliya > > On Tue, Feb 16, 2016 at 12:03 PM, Fabian Hueske wrote: > >> Yes, if you implement both maps in a single job, data is read once. >> >> 2016-02-16 15:53 GMT+01:00 Saliya Ek

Re: Read once input data?

2016-02-16 Thread Fabian Hueske
> the average in the previous example. > > > On Tue, Feb 16, 2016 at 3:47 PM, Fabian Hueske wrote: > >> You can use so-called BroadcastSets to send any sufficiently small >> DataSet (such as a computed average) to any other function and use it there. >> However, in you

Re: Optimal Configuration for Cluster

2016-02-22 Thread Fabian Hueske
Hi Welly, sorry for the late response. The number of network buffers primarily depends on the maximum parallelism of your job. The given formula assumes a specific cluster configuration (1 task manager per machine, one parallel task per CPU). The formula can be translated to: taskmanager.network

Re: Optimal Configuration for Cluster

2016-02-22 Thread Fabian Hueske
possible for DataSet (batch) programs. Hope this helps, Fabian 2016-02-22 11:26 GMT+01:00 Fabian Hueske : > Hi Welly, > > sorry for the late response. > > The number of network buffers primarily depends on the maximum parallelism > of your job. > The given formula assum

Re: Iterations problem in command line

2016-02-29 Thread Fabian Hueske
Hi Marcela, do you run the algorithm in both setups with the same parallelism? Best, Fabian 2016-02-26 16:52 GMT+01:00 Marcela Charfuelan : > Hello, > > I implemented an algorithm that includes iterations (EM algorithm) and I > am getting different results when running in eclipse (Luna Release

Re: Iterations problem in command line

2016-03-01 Thread Fabian Hueske
> Regards, > MArcela. > > > On 29.02.2016 16:44, Fabian Hueske wrote: > >> Hi Marcela, >> >> do you run the algorithm in both setups with the same parallelism? >> >> Best, Fabian >> >> 2016-02-26 16:52 GMT+01:00 Marcela Charfuelan >>

Re: Multi-dimensional[more than 2] input for KMeans Clustering in Apache flink

2016-03-01 Thread Fabian Hueske
Hi Subash, the KMeans implementation in Flink is meant to be a simple toy example and should not be used for serious analysis tasks. It shows how the DataSet API works by implementing a well-known algorithm. Nonetheless, the example can be easily extended to work for three or more dimensions. You wo

Re: Flink CEP Pattern Matching

2016-03-02 Thread Fabian Hueske
Hi Jerry, I haven't used the CEP features yet, so I cannot comment on your requirements. In case you are looking for the CEP documentation, here it is: --> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html The CEP features will be included in the upcoming 1.0.0

Re: Setting taskmanager.network.numberOfBuffers and getting errors...

2016-03-03 Thread Fabian Hueske
Hi Sourigna, you are using the formula correctly: #cores should be translated into slots per taskmanager (TM), and #machines into the number of TMs. So 36 ^ 2 * 10 * 4 = 51840 appears to be right. The constant 4 refers to the total number of concurrently active full network shuffles (partitioning o
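With the numbers from this thread, a flink-conf.yaml sketch would look like this (10 machines running one TaskManager each, 36 slots per TaskManager):

```yaml
# numberOfBuffers = slots-per-TM^2 * #TMs * 4 = 36^2 * 10 * 4 = 51840
taskmanager.numberOfTaskSlots: 36
taskmanager.network.numberOfBuffers: 51840
```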

Re: Submit Flink Jobs to YARN running on AWS

2016-03-09 Thread Fabian Hueske
Hi Abhi, I have used Flink on EMR via YARN a couple of times without problems. I started a Flink YARN session like this: ./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096 This will start five YARN containers (1 JobManager with 1024MB, 4 Taskmanagers with 4096MB). See more config options in the docume

Re: Flink and Directory Monitors

2016-03-09 Thread Fabian Hueske
Hi Philippe, I am not aware of anybody using Directory Monitor with Flink. However, the application you described sounds reasonable and I think it should be possible to implement that with Flink. You would need to implement a SourceFunction that forwards events from DM to Flink or you push the DM

Re: protobuf messages from Kafka to elasticsearch using flink

2016-03-09 Thread Fabian Hueske
Hi, I haven't used protobuf to serialize Kafka events but this blog post (+ the linked repository) shows how to write data from Flink into Elasticsearch: --> https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana Hope this helps, Fabian

Re: Flink loading an S3 File out of order

2016-03-10 Thread Fabian Hueske
Hi Benjamin, Flink reads data usually in parallel. This is done by splitting the input (e.g., a file) into several input splits. Each input split is independently processed. Since splits are usually concurrently processed by more than one task, Flink does not care about the order by default. You

Re: External DB as sink - with processing guarantees

2016-03-12 Thread Fabian Hueske
Hi Josh, Flink can guarantee exactly-once processing within its data flow given that the data sources allow to replay data from a specific position in the stream. For example, Flink's Kafka Consumer supports exactly-once. Flink achieves exactly-once processing by resetting operator state to a con

Re: Accumulators checkpointed?

2016-03-15 Thread Fabian Hueske
Hi Zach, at the moment, accumulators are not checkpointed and are reset if a failed task is restarted. Best, Fabian 2016-03-15 17:27 GMT+01:00 Zach Cox : > Are accumulators stored in checkpoint state? If a job fails and restarts, > are all accumulator values lost, or are they restored from check

Re: Memory ran out PageRank

2016-03-16 Thread Fabian Hueske
Hi Ovidiu, putting the CompactingHashTable aside, all data structures and algorithms that use managed memory can spill to disk if data exceeds memory capacity. It was a conscious choice to not let the CompactingHashTable spill. Once the solution set hash table is spilled, (parts of) the hash tabl

Re: Using a POJO class wrapping an ArrayList

2016-03-16 Thread Fabian Hueske
Hi Mengqi, I did not completely understand your use case. If you would like to use a composite key (a key with multiple fields) there are two alternatives: - use a tuple as key type. This only works if all records have the same number of key fields. Tuple serialization and comparisons are very ef

Re: off-heap size feature request

2016-03-16 Thread Fabian Hueske
Hi Ovidiu, the parameters to configure the amount of managed memory (taskmanager.memory.size, taskmanager.memory.fraction) are valid for on and off-heap memory. Have you tried these parameters and didn't they work as expected? Best, Fabian 2016-03-16 11:43 GMT+01:00 Ovidiu-Cristian MARCU < ovi
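For reference, a flink-conf.yaml sketch of those parameters (values are illustrative; key names per the Flink 1.0 configuration docs):

```yaml
taskmanager.memory.off-heap: true   # allocate managed memory off-heap
taskmanager.memory.fraction: 0.7    # relative amount of managed memory, or:
# taskmanager.memory.size: 2048     # absolute amount in MB (overrides fraction)
```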

Re: Input on training exercises

2016-03-18 Thread Fabian Hueske
Hi Ken, you can open an issue on the Github repository or send a mail to me. Thanks, Fabian 2016-03-17 23:07 GMT+01:00 Ken Krugler : > Hi list, > > What's the right way to provide input on the training exercises > ? > > Thanks, > > -- Ken

Re: what happens to the aggregate in a fold on a KeyedStream if the events for a key stop coming ?

2016-03-18 Thread Fabian Hueske
n Deenen > bartvandee...@fastmail.fm > > > > On Fri, Mar 18, 2016, at 11:54, Fabian Hueske wrote: > > Hi Bart, > if you run a fold function on a keyed stream without a window, there is no > way to remove the key and the folded value. > You will eventually run out of memo

Re: off-heap size feature request

2016-03-19 Thread Fabian Hueske
se-1.0/setup/config.html#managed-memory > > Best, > Ovidiu > > On 16 Mar 2016, at 12:13, Fabian Hueske wrote: > > Hi Ovidiu, > > the parameters to configure the amount of managed memory > (taskmanager.memory.size, > taskmanager.memory.fraction) are valid for

Re: what happens to the aggregate in a fold on a KeyedStream if the events for a key stop coming ?

2016-03-19 Thread Fabian Hueske
Hi Bart, if you run a fold function on a keyed stream without a window, there is no way to remove the key and the folded value. You will eventually run out of memory if your key space is continuously growing. If you apply a fold function in a window on a keyed stream you can bound the "lifetime"
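A sketch of the windowed variant that bounds the state's lifetime; the Tuple2 input is illustrative. When a window fires, its folded value is emitted and the window state is purged, so keys that stop sending events do not hold memory forever.

```java
import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public static DataStream<Long> boundedFold(DataStream<Tuple2<String, Long>> events) {
    return events
        .keyBy(0)                          // fold state exists per key AND window
        .timeWindow(Time.minutes(10))
        .fold(0L, new FoldFunction<Tuple2<String, Long>, Long>() {
            @Override
            public Long fold(Long acc, Tuple2<String, Long> e) {
                return acc + e.f1;         // eager, single-element window state
            }
        });
}
```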

Re: what happens to the aggregate in a fold on a KeyedStream if the events for a key stop coming ?

2016-03-19 Thread Fabian Hueske
> Thanks > > Bart > > -- > Bart van Deenen > bartvandee...@fastmail.fm > > > > On Fri, Mar 18, 2016, at 12:16, Fabian Hueske wrote: > > Yes, that's possible. > > You have to implement a custom trigger for that. The Trigger.onElement() > me

Re: off-heap size feature request

2016-03-19 Thread Fabian Hueske
eInBytes), along with OS cache. > > I would like parameters like: > taskmanager.off-heap.size or taskmanager.off-heap.fraction > taskmanager.off-heap.enabled true or false > and same for heap. > > Thanks for clarification. > > Best, > Ovidiu > > > On 16 Mar 2016, at 13:43,

Re: degree of Parallelism

2016-03-19 Thread Fabian Hueske
Hi, did you find the documentation for configuring the parallelism [1]? It explains how to set the parallelism on different levels: Cluster, Job, Task. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#parallel-execution 2016-03-18 13:34 GMT+01:00 T
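A compact sketch of the job and task levels (the cluster default comes from parallelism.default in flink-conf.yaml, or the -p flag of ./bin/flink):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(8);                       // job-level default

DataSet<String> upper = env.fromElements("a", "b")
    .map(new MapFunction<String, String>() {
        @Override
        public String map(String s) { return s.toUpperCase(); }
    })
    .setParallelism(2);                      // task-level override for this operator
```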

Re: Setting taskmanager.network.numberOfBuffers does not seem to have an affect - Flink 0.10.2

2016-03-21 Thread Fabian Hueske
Hi, right now there is no way to sequentially execute the input tasks. Flink's FileInputFormat also does not support multiple paths out of the box. However, it is certainly possible to extend the FileInputFormat such that this is possible. You would need to override / extend the createInputSplits(

Re: Flink 1.0.0 reading files from multiple directory with wildcards

2016-03-21 Thread Fabian Hueske
Hi, no, this is currently not supported. However, I agree this would be a very valuable addition to the FileInputFormat. Would you mind opening a JIRA issue with your suggestions? Until this is added to Flink, it can be implemented as a custom InputFormat based on FileInputFormat by overriding th

Re: [DISCUSS] Improving Trigger/Window API and Semantics

2016-03-22 Thread Fabian Hueske
Thanks for the write-up Aljoscha. I think it is a really good idea to separate the different aspects (fire, purging, lateness) a bit. At the moment, all of these need to be handled in the Trigger, and a custom trigger is necessary whenever you want some of these aspects handled slightly differently

Re: Flink 1.0.0 reading files from multiple directory with wildcards

2016-03-22 Thread Fabian Hueske
>> Fabian, >> >> I'll try extending InputFormat as you suggested and will create a JIRA >> issue as well. >> >> I also have an AvroGenericRecordInput format class that I would like to >> contribute once I have time to clean it up and get it into your code

Re: Window Support in Flink

2016-03-28 Thread Fabian Hueske
Hopping windows is a term used on the Apache Calcite website [1]. In Flink terms, hopping windows are sliding windows. Cheers, Fabian [1] http://calcite.apache.org/docs/stream.html From: Ufuk Celebi Sent: Monday, March 28, 2016 12:40 To: user@flink.apache.org Subject: Re: Window Support in

Re: Find differences

2016-04-07 Thread Fabian Hueske
I would go with an outer join as Stefano suggested. Outer joins can be executed as hash joins, which will probably be more efficient than using a sort-based groupBy/reduceGroup. Also, outer joins are more intuitive and simpler, IMO. 2016-04-07 12:35 GMT+02:00 Stefano Baghino : > Perhaps an outer

Re: Powered by Flink

2016-04-12 Thread Fabian Hueske
k but the name is not yet listed there, let me know and I'll >>>>> add the name. >>>>> >>>>> Regards, >>>>> Robert >>>>> >>>>> [1] >>>>> https://cwiki.apache.org/confluence/display/

Re: Hash tables - joins, cogroup, deltaIteration

2016-04-19 Thread Fabian Hueske
Hi Ovidiu, Hash tables are currently used for joins (inner & outer) and the solution set of delta iterations. There is a pending PR that implements a hash table for partial aggregations (combiner) [1] which should be added soon. Joins (inner & outer) are already implemented as Hybrid Hash joins t

Re: Sink Parallelism

2016-04-20 Thread Fabian Hueske
Hi Ravinder, your drawing is pretty much correct (Flink will inject a combiner between flat map and reduce which locally combines records with the same key). The partitioning between flat map and reduce is done with hash partitioning by default. However, you can also define a custom partitioner to

Re: Operation of Windows and Triggers

2016-04-20 Thread Fabian Hueske
Hi Piyush, if you explicitly set a trigger, the default trigger of the window is replaced. In your example, the time trigger is replaced by the count trigger, i.e., the window is only evaluated after the 100th element was received. This blog post discusses windows and triggers [1]. Best, Fabian
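A sketch of that replacement behavior (Tuple2 input assumed):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

public static DataStream<Tuple2<String, Long>> countTriggered(
        DataStream<Tuple2<String, Long>> stream) {
    // The explicit CountTrigger REPLACES the time window's default trigger:
    // the window now fires only after 100 elements, never on time alone.
    // Firing on both time and count requires a custom combined Trigger.
    return stream
        .keyBy(0)
        .timeWindow(Time.minutes(1))
        .trigger(CountTrigger.of(100))
        .sum(1);
}
```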

Re: Sink Parallelism

2016-04-20 Thread Fabian Hueske
ve in the first window > are grouped/partitioned by keys and aggregated and so on until no more > streams left. The global result then has the aggregated key/value pairs. > > Kind Regards, > Ravinder Kaur > > > > On Wed, Apr 20, 2016 at 12:12 PM, Fabian Hueske wrote: >

Re: Operation of Windows and Triggers

2016-04-20 Thread Fabian Hueske
o this? > > Thanks and Regards, > Piyush Shrivastava > [image: WeboGraffiti] > http://webograffiti.com > > > On Wednesday, 20 April 2016 4:59 PM, Fabian Hueske > wrote: > > > Hi Piyush, > > if you explicitly set a trigger, the default trigger of the window i

Re: Set up Flink Cluster on Windows machines

2016-04-20 Thread Fabian Hueske
Hi Yifei, I think this has not been done before. At least I am not aware of anybody running Flink in cluster mode on Windows. In principle this should work. It is possible to start a local instance on Windows (start-local.bat) and to locally execute Flink programs on this instance using the flink.

Re: Explanation on limitations of the Flink Table API

2016-04-21 Thread Fabian Hueske
Hi Simone, in Flink 1.0.x, the Table API does not support reading external data, i.e., it is not possible to read a CSV file directly from the Table API. Tables can only be created from DataSet or DataStream which means that the data is already converted into "Flink types". However, the Table API

Re: implementing a continuous time window

2016-04-21 Thread Fabian Hueske
Yes, sliding windows are different. You want to evaluate the window whenever a new element arrives or an element leaves because 5 secs passed since it entered the window, right? I think that should be possible with a GlobalWindow, a custom Trigger which holds state about the time when each element

Re: implementing a continuous time window

2016-04-22 Thread Fabian Hueske
aged state to make sure that your operator can recover from failures. Cheers, Fabian 2016-04-21 23:16 GMT+02:00 Jonathan Yom-Tov : > Thanks. Any pointers on how to do that? Or code examples which do similar > things? > > On Thu, Apr 21, 2016 at 10:30 PM, Fabian Hueske wrote: >

Re: Flink program without a line of code

2016-04-22 Thread Fabian Hueske
Hi Alex, welcome to the Flink community! Right now, there is no way to specify a Flink program without writing code (Java, Scala, Python(beta)). In principle it is possible to put such functionality on top of the DataStream or DataSet APIs. This has been done before for other programming APIs (Fl

Re: java.lang.ClassCastException: org.apache.flink.streaming.api.watermark.Watermark cannot be cast to org.apache.flink.streaming.runtime.streamrecord.StreamRecord

2016-04-22 Thread Fabian Hueske
Hi Konstantin, this exception is thrown if you do not set the time characteristic to event time and assign timestamps. Please try to add > env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) after you obtained the StreamExecutionEnvironment. Best, Fabian 2016-04-22 15:47 GMT+02:00 Ko

Re: Flink first() operator

2016-04-25 Thread Fabian Hueske
Hi Biplob, you can also implement a generic IF that wraps another IF (such as a CsvInputFormat). The wrapping IF forwards all calls to the wrapped IF and in addition counts how many records were emitted (how often InputFormat.nextRecord() was called). Once the count arrives at the threshold, it re
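A minimal sketch of such a wrapping InputFormat; note the count is per parallel instance, so a global first-N needs the source parallelism set to 1.

```java
import java.io.IOException;
import org.apache.flink.api.common.io.InputFormat;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.InputSplit;
import org.apache.flink.core.io.InputSplitAssigner;

public class LimitingInputFormat<T> implements InputFormat<T, InputSplit> {

    private final InputFormat<T, InputSplit> wrapped;
    private final long limit;
    private long emitted;

    public LimitingInputFormat(InputFormat<T, InputSplit> wrapped, long limit) {
        this.wrapped = wrapped;
        this.limit = limit;
    }

    @Override
    public void configure(Configuration parameters) { wrapped.configure(parameters); }

    @Override
    public BaseStatistics getStatistics(BaseStatistics cached) throws IOException {
        return wrapped.getStatistics(cached);
    }

    @Override
    public InputSplit[] createInputSplits(int minNumSplits) throws IOException {
        return wrapped.createInputSplits(minNumSplits);
    }

    @Override
    public InputSplitAssigner getInputSplitAssigner(InputSplit[] splits) {
        return wrapped.getInputSplitAssigner(splits);
    }

    @Override
    public void open(InputSplit split) throws IOException { wrapped.open(split); }

    @Override
    public boolean reachedEnd() throws IOException {
        // Stop as soon as the limit is hit, even if more input is available.
        return emitted >= limit || wrapped.reachedEnd();
    }

    @Override
    public T nextRecord(T reuse) throws IOException {
        emitted++;
        return wrapped.nextRecord(reuse);
    }

    @Override
    public void close() throws IOException { wrapped.close(); }
}
```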

Re: Access to a shared resource within a mapper

2016-04-25 Thread Fabian Hueske
Hi Timur, a TaskManager may run as many subtasks of a Map operator as it has slots. Each subtask of an operator runs in a different thread. Each parallel subtask of a Map operator has its own MapFunction object, so it should be possible to use a lazy val. However, you should not use static variab
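A sketch of the per-subtask initialization pattern this implies; java.text.NumberFormat stands in for any non-thread-safe resource:

```java
import java.text.NumberFormat;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class FormatMapper extends RichMapFunction<Double, String> {

    // Not serializable, so it is created in open() rather than in the constructor.
    private transient NumberFormat format;

    @Override
    public void open(Configuration parameters) {
        // Called once per parallel subtask; each subtask (thread) gets its own
        // instance, so no synchronization is needed.
        format = NumberFormat.getInstance();
    }

    @Override
    public String map(Double value) {
        return format.format(value);
    }
}
```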

Re: Flink first() operator

2016-04-26 Thread Fabian Hueske
Actually, memory should not be a problem since the full data set would not be materialized in memory. Flink has a streaming runtime so most of the data would be immediately filtered out. However, reading the whole file causes of course a lot of unnecessary IO. 2016-04-26 17:09 GMT+02:00 Biplob Bis

Re: Return unique counter using groupReduceFunction

2016-04-26 Thread Fabian Hueske
Hi Biplob, Flink is a distributed, data parallel system which means that there are several instances of your ReduceFunction running in parallel, each with its own timestamp counter. If you want to have a unique timestamp, you have to set the parallelism of the reduce operator to 1, but then the pro

Re: Tuning parallelism in cascading-flink planner

2016-04-27 Thread Fabian Hueske
Hi Ken, at the moment, there are just two parameters to control the parallelism of Flink operators generated by the Cascading-Flink connector. The parameters are: - flink.num.sourceTasks to specify the parallelism of source tasks. - flink.num.shuffleTasks to specify the parallelism of all shuffli

Re: Job hangs

2016-04-27 Thread Fabian Hueske
Hi Timur, I had a look at the plan you shared. I could not find any flow that branches and merges again, a pattern which is prone to cause a deadlocks. However, I noticed that the plan performs a lot of partitioning steps. You might want to have a look at forwarded field annotations which can hel
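A sketch of a forwarded-field annotation: it declares that field f0 passes through unchanged, which lets the optimizer reuse an existing partitioning instead of shuffling again.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields;
import org.apache.flink.api.java.tuple.Tuple2;

@ForwardedFields("f0")  // the key field f0 is emitted unmodified
public class ScaleValue implements MapFunction<Tuple2<Long, Double>, Tuple2<Long, Double>> {
    @Override
    public Tuple2<Long, Double> map(Tuple2<Long, Double> v) {
        return new Tuple2<>(v.f0, v.f1 * 2.0);
    }
}
```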

Re: About flink stream table API

2016-04-28 Thread Fabian Hueske
Hi, Table API and SQL for streaming are work in progress. A first version which supports projection, filter, and union is merged to the master branch. Under the hood, Flink uses Calcite to optimize and translate Table API and SQL queries. Best, Fabian 2016-04-27 14:27 GMT+02:00 Zhangrucong : >

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
I checked the input format from your PR, but didn't see anything suspicious. It is definitely OK if the processing of an input split takes more than 10 seconds. That should not be the cause. It rather looks like the DataSourceTask fails to request a new split from the JobManager. 2016-04-28 9:37

Re: Configuring a RichFunction on a DataStream

2016-04-28 Thread Fabian Hueske
Hi Robert, Function configuration via a Configuration object and the open method is an artifact from the past. The recommended way is to configure the function object via the constructor. Flink serializes the function object and ships it to the workers for execution. So the state of a function i
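A sketch of the recommended constructor-based configuration:

```java
import org.apache.flink.api.common.functions.FilterFunction;

public class ThresholdFilter implements FilterFunction<Long> {

    private final long threshold;

    // Configured via the constructor; the field is serialized together with
    // the function object and shipped to the workers.
    public ThresholdFilter(long threshold) {
        this.threshold = threshold;
    }

    @Override
    public boolean filter(Long value) {
        return value > threshold;
    }
}
// usage: stream.filter(new ThresholdFilter(42L));
```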

Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
Hi Konstantin, if you do not need a deterministic grouping of elements you should not use a keyed stream or window. Instead you can do the lookups in a parallel flatMap function. The function would collect arriving elements and perform a lookup query after a certain number of elements arrived (can

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
ing sospicious? > > Thanks for the support, > Flavio > > On Thu, Apr 28, 2016 at 11:54 AM, Fabian Hueske wrote: > >> I checked the input format from your PR, but didn't see anything >> suspicious. >> >> It is definitely OK if the processing of an input

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
size...that is indeed 3Gb ( > taskmanager.heap.mb:512) > > Best, > Flavio > > On Thu, Apr 28, 2016 at 1:29 PM, Fabian Hueske wrote: > >> Is the problem reproducible? >> Maybe the SplitAssigner gets stuck somehow, but I've never observed >> something like that.

Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
> > > And if you can give a bit more info on why will I have latency issues in a > case of varying rate of arrival elements that would be perfect. Or point me > to a direction where I can read it. > > Thanks! > Konstantin. > > On Thu, Apr 28, 2016 at 7:26 AM, Fabian

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
anager heap to 3 GB and maybe change some > gc setting. > Do you think I should increase also the akka timeout or other things? > > On Thu, Apr 28, 2016 at 2:06 PM, Fabian Hueske wrote: > >> Hmm, 113k splits is quite a lot. >> However, the IF uses the DefaultInputSplitAssi

Re: Creating a custom operator

2016-04-29 Thread Fabian Hueske
Hi Simone, the GraphCreatingVisitor transforms the common operator plan into a representation that is translated by the optimizer. You have to implement an OptimizerNode and OperatorDescriptor to describe the operator. Depending on the semantics of the operator, there are a few more places to make

Re: EMR vCores and slot allocation

2016-05-02 Thread Fabian Hueske
The slot configuration should depend on the complexity of jobs. Since each slot runs a "slice" of a program, one slot might potentially execute many concurrent tasks. For complex jobs you should allocate more than one core for each slot. 2016-05-02 10:12 GMT+02:00 Robert Metzger : > Hi Ken, > s

Re: Count of Grouped DataSet

2016-05-02 Thread Fabian Hueske
Hi Nirmalya, the solution with List.size() won't use a combiner and won't be efficient for large data sets with large groups. I would recommend appending a 1 and using GroupedDataSet.sum(). 2016-05-01 12:48 GMT+02:00 nsengupta : > Hello all, > > This is how I have moved ahead with the implementation
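A sketch of the combiner-friendly count in the Java API:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public static DataSet<Tuple2<String, Integer>> countPerKey(DataSet<String> words) {
    return words
        .map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String w) {
                return new Tuple2<>(w, 1);  // append a 1 ...
            }
        })
        .groupBy(0)
        .sum(1);  // ... and sum: pre-aggregates locally with a combiner
}
```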

Re: Unable to write stream as csv

2016-05-02 Thread Fabian Hueske
Have you checked the log files as well? 2016-05-01 14:07 GMT+02:00 subash basnet : > Hello there, > > If anyone could help me know why the below *result* DataStream get's > written as text, but not as csv?. As it's in a tuple format I guess it > should be the same for both text and csv. It shows

Re: Any way for Flink Elasticsearch connector reflecting IP change of Elasticsearch cluster?

2016-05-02 Thread Fabian Hueske
Yes, it looks like the connector only creates the connection once when it starts and fails if the host is no longer reachable. It should be possible to catch that failure and try to re-open the connection. I opened a JIRA for this issue (FLINK-3857). Would you like to implement the improvement? 2

Re: Perform a groupBy on an already groupedDataset

2016-05-02 Thread Fabian Hueske
Grouping a grouped dataset is not supported. You can group on multiple keys: dataSet.groupBy(1,2). Can you describe your use case if that does not solve the problem? 2016-05-02 10:34 GMT+02:00 Punit Naik : > Hello > > I wanted to perform a groupBy on an already grouped dataset. How do I do > t

Re: S3 Checkpoint Storage

2016-05-02 Thread Fabian Hueske
Hi John, S3 keys are configured via Hadoop's configuration files. Check out the documentation for AWS setups [1]. Cheers, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html 2016-05-02 20:22 GMT+02:00 John Sherwood : > Hello all, > > I'm attempting to set up a

Re: Sink - writeAsText problem

2016-05-03 Thread Fabian Hueske
Did you specify a parallelism? The default parallelism of a Flink instance is 1 [1]. You can set a different default parallelism in ./conf/flink-conf.yaml or pass a job specific parallelism with ./bin/flink using the -p flag [2]. More options to define parallelism are in the docs [3]. [1] https:/

Re: Sink - writeAsText problem

2016-05-03 Thread Fabian Hueske
Yes, but be aware that your program runs with parallelism 1 if you do not configure the parallelism. 2016-05-03 11:07 GMT+02:00 Punit Naik : > Hi Stephen, Fabian > > setting "fs.output.always-create-directory" to true in flink-config.yml > worked! > > On Tue, May 3, 2016 at 2:27 PM, Stephan Ewen

Re: Creating a custom operator

2016-05-03 Thread Fabian Hueske
tors are not supposed to be implemented outside of > Flink. > > Thanks, > > Simone > > 2016-04-29 21:32 GMT+02:00 Fabian Hueske : > >> Hi Simone, >> >> the GraphCreatingVisitor transforms the common operator plan into a >> representation that is tr

Re: Discussion about a Flink DataSource repository

2016-05-04 Thread Fabian Hueske
Hi Flavio, I thought a bit about your proposal. I am not sure if it is actually necessary to integrate a central source repository into Flink. It should be possible to offer this as an external service which is based on the recently added TableSource interface. TableSources could be extended to be

Re: Any way for Flink Elasticsearch connector reflecting IP change of Elasticsearch cluster?

2016-05-04 Thread Fabian Hueske
I'm not so much familiar with the Kafka connector. Can you post your suggestion to the user or dev mailing list? Thanks, Fabian 2016-05-04 16:53 GMT+02:00 Sendoh : > Glad to see it's developing. > Can I ask would the same feature (reconnect) be useful for Kafka connector > ? > For example, if th

Re: Any way for Flink Elasticsearch connector reflecting IP change of Elasticsearch cluster?

2016-05-04 Thread Fabian Hueske
Sorry, I confused the mail threads. We're already on the user list :-) Thanks for the suggestion. 2016-05-04 17:35 GMT+02:00 Fabian Hueske : > I'm not so much familiar with the Kafka connector. > Can you post your suggestion to the user or dev mailing list? > > Thanks, Fa

Re: Prevent job/operator from spilling to disk

2016-05-04 Thread Fabian Hueske
Hi Max, it is not possible to deactivate spilling to disk at the moment. It might be possible to implement, but this would require a few more changes to make it feasible. For instance, we would need to add more fine-grained control about how memory is distributed among operators. This is currently

Re: Discussion about a Flink DataSource repository

2016-05-06 Thread Fabian Hueske
define a link from operators to TableEnvironment and then to TableSource > (using the lineage tag/source-id you said) and, finally to its metadata. I > don't know whether this is specific only to us, I just wanted to share our > needs and see if the table API development could b

Re: OutputFormat in streaming job

2016-05-06 Thread Fabian Hueske
Hi Andrea, you can use any OutputFormat to emit data from a DataStream using the writeUsingOutputFormat() method. However, this method does not guarantee exactly-once processing. In case of a failure, it might emit some records a second time. Hence the results will be written at least once. Hope
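For example, a minimal sketch with a TextOutputFormat (at-least-once, as noted above; records may be re-emitted after a failure):

```java
import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;

public static void writeStream(DataStream<String> stream) {
    // Any OutputFormat can be plugged in here.
    stream.writeUsingOutputFormat(new TextOutputFormat<String>(new Path("hdfs:///out")));
}
```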

Re: Where to put live model and business logic in Hadoop/Flink BigData system

2016-05-06 Thread Fabian Hueske
Hi Palle, this sounds indeed like a good use case for Flink. Depending on the complexity of the aggregated historical views, you can implement a Flink DataStream program which builds the views on the fly, i.e., you do not need to periodically trigger MR/Flink/Spark batch jobs to compute the views

Re: OutputFormat in streaming job

2016-05-06 Thread Fabian Hueske
iate into > RichSinkFunction's open method) > > Am I wrong? > > Thanks again, > Andrea > > 2016-05-06 13:47 GMT+02:00 Fabian Hueske : > >> Hi Andrea, >> >> you can use any OutputFormat to emit data from a DataStream using the >> writeUsingOut

Re: Discussion about a Flink DataSource repository

2016-05-06 Thread Fabian Hueske
> What do you think? Is there the possibility to open a broadcasted Dataset > as a Map instead of a List? > > Best, > Flavio > > > On Fri, May 6, 2016 at 12:06 PM, Fabian Hueske wrote: > >> Hi Flavio, >> >> I'll open a JIRA for de/

Re: How to make Flink read all files in HDFS folder and do transformations on th e data

2016-05-07 Thread Fabian Hueske
Hi Palle, you can recursively read all files in a folder as explained in the "Recursive Traversal of the Input Path Directory" section of the Data Source documentation [1]. The easiest way to read line-wise JSON objects is to use ExecutionEnvironment.readTextFile() which reads text files linewise
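A sketch combining both pointers; parsing each line as JSON is an assumption (e.g., with Jackson's ObjectMapper.readTree in a subsequent map()):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

Configuration parameters = new Configuration();
parameters.setBoolean("recursive.file.enumeration", true);  // descend into sub-folders

// Each element is one line, i.e., one JSON object.
DataSet<String> lines = env.readTextFile("hdfs:///data/json")
                           .withParameters(parameters);
```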

Re: Creating a custom operator

2016-05-09 Thread Fabian Hueske
id, I'm still not sure if it is really required to implement a > custom runtime operator but given the complexity of the integration of two > distribute systems, we assumed that more control would allow more > flexibility and possibilities to achieve an ideal solution. > > >

Re: Force triggering events on watermark

2016-05-10 Thread Fabian Hueske
Maybe the last example of this blog post is helpful [1]. Best, Fabian [1] https://www.mapr.com/blog/essential-guide-streaming-first-processing-apache-flink 2016-05-10 17:24 GMT+02:00 Srikanth : > Hi, > > I read the following in Flink doc "We can explicitly specify a Trigger to > overwrite the d

Re: Creating a custom operator

2016-05-11 Thread Fabian Hueske
ion or at > least support us in the development. We are waiting for them to start a > discussion and as soon as we will have a more clear idea on how to proceed, > we will validate it with the stuff you just said. Your confidence in > Flink's operators gives us hope to achieve a clean

Re: get start and end time stamp from time window

2016-05-12 Thread Fabian Hueske
Hi Martin, You can use a FoldFunction and a WindowFunction to process the same window. The FoldFunction is eagerly applied, so the window state is only one element. When the window is closed, the aggregated element is given to the WindowFunction where you can add start and end time. The iterator
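A sketch of that combination, folding into a Tuple3 whose first two fields are later filled with the window start and end; the input type is illustrative.

```java
import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public static DataStream<Tuple3<Long, Long, Long>> sumWithWindowTimes(
        DataStream<Tuple2<String, Long>> stream) {
    return stream
        .keyBy(0)
        .timeWindow(Time.minutes(5))
        .apply(new Tuple3<>(0L, 0L, 0L),
            new FoldFunction<Tuple2<String, Long>, Tuple3<Long, Long, Long>>() {
                @Override
                public Tuple3<Long, Long, Long> fold(
                        Tuple3<Long, Long, Long> acc, Tuple2<String, Long> in) {
                    acc.f2 += in.f1;  // eager aggregation: window state is one element
                    return acc;
                }
            },
            new WindowFunction<Tuple3<Long, Long, Long>, Tuple3<Long, Long, Long>, Tuple, TimeWindow>() {
                @Override
                public void apply(Tuple key, TimeWindow window,
                        Iterable<Tuple3<Long, Long, Long>> in,
                        Collector<Tuple3<Long, Long, Long>> out) {
                    Tuple3<Long, Long, Long> agg = in.iterator().next(); // single element
                    agg.f0 = window.getStart();  // attach window start ...
                    agg.f1 = window.getEnd();    // ... and end timestamps
                    out.collect(agg);
                }
            });
}
```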
