Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Fabian Hueske
Hi Pieter, cross is indeed too expensive for this task. If dataset A fits into memory, you can do the following: Use a RichMapPartitionFunction to process dataset B and add dataset A as a broadcastSet. In the open method of mapPartition, you can load the broadcasted set and sort it by a.propertyX
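A minimal sketch of this pattern (the data, element types, and the binary-search probe are illustrative, not from the thread):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.flink.api.common.functions.RichMapPartitionFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class BroadcastProbe {
      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Double> a = env.fromElements(0.1, 0.5, 0.9);       // small dataset A
        DataSet<Double> b = env.fromElements(0.2, 0.4, 0.6, 0.8);  // large dataset B

        b.mapPartition(new RichMapPartitionFunction<Double, String>() {
          private List<Double> sortedA;

          @Override
          public void open(Configuration parameters) {
            // load the broadcast set once per parallel task and sort it
            sortedA = new ArrayList<>(getRuntimeContext().<Double>getBroadcastVariable("A"));
            Collections.sort(sortedA);
          }

          @Override
          public void mapPartition(Iterable<Double> values, Collector<String> out) {
            // probe the sorted copy of A for each element of B
            for (Double v : values) {
              int pos = Collections.binarySearch(sortedA, v);
              out.collect(v + " -> " + pos);
            }
          }
        }).withBroadcastSet(a, "A").print();
      }
    }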

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Fabian Hueske
Any pointers for getting started with the 'tricky range > partitioning'? I am quite keen to get this working with large datasets ;-) > > Cheers, > > Pieter > > 2015-09-30 10:24 GMT+02:00 Fabian Hueske : > >> Hi Pieter, >> >> cross is indeed

Re: Config files content read

2015-10-05 Thread Fabian Hueske
Hi Flavio, I don't think this is a feature that needs to go into Flink core. To me it looks like this could be implemented as a utility method by anybody who needs it without major effort. Best, Fabian 2015-10-02 15:27 GMT+02:00 Flavio Pompermaier : > Hi to all, > > in many of my jobs I have to read

Re: For each element in a dataset, do something with another dataset

2015-10-05 Thread Fabian Hueske
use a RichMapPartitionFunction >>> on A: >>> In the open method of mapPartition, you sort B. Then, for each element >>> of A, you do a binary search in B, and look at the index found by the >>> binary search, which will be the count that you are looking for. >>

Re: source binary file

2015-10-06 Thread Fabian Hueske
Hi Lydia, you need to implement a custom InputFormat to read binary files. Usually you can extend the FileInputFormat. The implementation depends a lot on your use case, for example whether each binary file is read into a single record or multiple records and how records are delimited if there is more than one record per file.
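For illustration only, a minimal format for fixed-width two-byte records could look like the sketch below (it assumes splits are aligned to record boundaries, which a real implementation has to handle explicitly):

    import java.io.IOException;
    import org.apache.flink.api.common.io.FileInputFormat;

    // sketch: read a binary file as a stream of 2-byte big-endian shorts
    public class ShortInputFormat extends FileInputFormat<Short> {

      @Override
      public boolean reachedEnd() throws IOException {
        // stop at the end of this task's input split
        return stream.getPos() >= splitStart + splitLength;
      }

      @Override
      public Short nextRecord(Short reuse) throws IOException {
        int hi = stream.read();
        int lo = stream.read();
        if (lo < 0) {
          return null; // end of stream
        }
        return (short) ((hi << 8) | lo);
      }
    }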

Re: Parallel file read in LocalEnvironment

2015-10-07 Thread Fabian Hueske
Hi Flavio, it is not possible to split by line count because that would mean to read and parse the file just for splitting. Parallel processing of data sources depends on the input splits created by the InputFormat. Local files can be split just like files in HDFS. Usually, each file corresponds

Re: Parallel file read in LocalEnvironment

2015-10-07 Thread Fabian Hueske
description of > their internals? > > On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske wrote: > >> Hi Flavio, >> >> it is not possible to split by line count because that would mean to read >> and parse the file just for splitting. >> >> Parallel processing

Re: Debug OutOfMemory

2015-10-08 Thread Fabian Hueske
Hi Konstantin, Flink uses managed memory only for its internal processing (sorting, hash tables, etc.). If you allocate too much memory in your user code, it can still fail with an OOME. This can also happen for large broadcast sets. Can you check how much memory the JVM allocated and how much

Re: Debug OutOfMemory

2015-10-09 Thread Fabian Hueske
I think that's actually very use case specific. Your code will never see the malformed record because it is dropped by the input format. Other applications might rely on complete input and would prefer an exception to be notified about invalid input. Flink's CsvInputFormat has a parameter "lenient"
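With the CsvReader API, lenient parsing can be switched on roughly like this (path and field types are made up):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<String, Integer>> data = env
        .readCsvFile("hdfs:///path/to/input")  // illustrative path
        .ignoreInvalidLines()                  // drop malformed lines instead of failing
        .types(String.class, Integer.class);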

Re: Hi, Flink people, a question about translation from HIVE Query to Flink function by using Table API

2015-10-19 Thread Fabian Hueske
Hi Philip, here are a few additions to what Max said: - ORDER BY: As Max said, Flink's sortPartition() only sorts within a partition and does not produce a total order. You can either set the parallelism to 1 as Max suggested or use a custom partitioner to range partition the data. - SORT BY: From y

Re: Hi, Flink people, a question about translation from HIVE Query to Flink function by using Table API

2015-10-19 Thread Fabian Hueske
[Distribute > By] + [Sort By]. Therefore, according to your suggestion, should it be > partitionByHash() + sortGroup() instead of sortPartition()? > > Or probably I still did not get the difference between Partition and > scope within a reduce. > > Regards, > Philip > > On Mon,

Re: Powered by Flink

2015-10-19 Thread Fabian Hueske
Thanks for starting this Kostas. I think the list is quite hidden in the wiki. Should we link from flink.apache.org to that page? Cheers, Fabian 2015-10-19 14:50 GMT+02:00 Kostas Tzoumas : > Hi everyone, > > I started a "Powered by Flink" wiki page, listing some of the > organizations that are

Re: Powered by Flink

2015-10-19 Thread Fabian Hueske
Sounds good +1 2015-10-19 14:57 GMT+02:00 Márton Balassi : > Thanks for starting and big +1 for making it more prominent. > > On Mon, Oct 19, 2015 at 2:53 PM, Fabian Hueske wrote: > >> Thanks for starting this Kostas. >> >> I think the list is quite hidden in

Re: Powered by Flink

2015-10-19 Thread Fabian Hueske
difficult to answer to > interested users. > > > On 19.10.2015 15:08, Suneel Marthi wrote: > > +1 to this. > > On Mon, Oct 19, 2015 at 3:00 PM, Fabian Hueske wrote: > >> Sounds good +1 >> >> 2015-10-19 14:57 GMT+02:00 Márton Balassi < >> balassi.mar...

Re: ExecutionEnvironment setConfiguration API

2015-10-19 Thread Fabian Hueske
I think it's not a nice solution to check for the type of the returned execution environment to determine whether it is a local or a remote execution environment. Wouldn't it be better to add a method isLocal() to ExecutionEnvironment? Cheers, Fabian 2015-10-14 19:14 GMT+02:00 Flavio Pompermaier

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Fabian Hueske
It might even be materialized (to disk) if both derived data sets are joined. 2015-10-22 12:01 GMT+02:00 Till Rohrmann : > I fear that the filter operations are not chained because there are at > least two of them which have the same DataSet as input. However, it's true > that the intermediate result

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Fabian Hueske
> reading the entire input multiple times (we are talking 100GB+ on max 32 > workers) but I would have to run some experiments to confirm that. > > > > 2015-10-22 12:06 GMT+02:00 Fabian Hueske : > >> It might even be materialized (to disk) if both derived data sets are >

Re: reading csv file from null value

2015-10-26 Thread Fabian Hueske
Hi Philip, the CsvInputFormat does not support reading empty fields. I see two ways to achieve this functionality: - Use a TextInputFormat that returns each line as a String and do the parsing in a subsequent MapFunction - Extend the CsvInputFormat to support empty fields Cheers, Fabian 2015-10
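The first option could look roughly like this (the delimiter, field layout, and the default for empty numeric fields are assumptions; `env` is an ExecutionEnvironment):

    DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");

    DataSet<Tuple2<String, Integer>> parsed = lines.map(
        new MapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public Tuple2<String, Integer> map(String line) {
            String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
            // substitute a default for empty numeric fields instead of failing
            Integer value = fields[1].isEmpty() ? -1 : Integer.valueOf(fields[1]);
            return new Tuple2<>(fields[0], value);
          }
        });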

Re: Flink on EC"

2015-10-29 Thread Fabian Hueske
Hi Thomas, until recently, Flink provided its own implementation of an S3FileSystem, which wasn't fully tested and was buggy. We removed that implementation and now use (in 0.10-SNAPSHOT) Hadoop's S3 implementation by default. If you want to continue using 0.9.1 you can configure Flink to use Hadoop's

Re: How best to deal with wide, structured tuples?

2015-10-29 Thread Fabian Hueske
Hi Johann, I see three options for your use case. 1) Generate Pojo code at planning time, i.e., when the program is composed. This does not work when the program is already running. The benefit is that you can use key expressions, have typed fields, and type specific serializers and comparators.

Re: How groupBy work

2015-10-30 Thread Fabian Hueske
Hi Jeffery, Flink uses a (potentially external) merge-sort to group data. Combining is done using an in-memory sort. Because Flink uses pipelined data transfer, the execution of operators in a program can overlap. For example in WordCount, the sort of a groupBy will immediately start as soon as the

Re: Create triggers

2015-10-30 Thread Fabian Hueske
You refer to the DataSet (batch) API, right? In that case you can evaluate your condition in the program and fetch a DataSet back to the client using List<MyType> myData = myDataSet.collect(); Based on the result of the collect() call you can define and execute a new program. Note: collect() will immediately
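A compact sketch of that pattern (data, condition, and sink path are placeholders):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // first program: collect() triggers execution and returns the result to the client
    List<Integer> result = env.fromElements(1, 2, 3).collect();

    // a client-side condition decides whether to define and run a second program
    if (result.size() > 2) {
      env.fromElements("a", "b").writeAsText("/tmp/out"); // illustrative sink
      env.execute("follow-up job");
    }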

Re: Create triggers

2015-11-02 Thread Fabian Hueske
dataset. > > Some pseudocode from your solution: > DataSet A = env.readFile(...); > DataSet C = env.readFile(...); > > A.groupBy().reduce().filter(*Check conditions here and in case start > processing C*); > > > Thanks, > Giacomo > > > > > On Fri, Oct 30

Re: Hi, question about orderBy two columns more

2015-11-02 Thread Fabian Hueske
Hi Philip, thanks for reporting the issue. I just verified the problem. It is working correctly for the Java API, but is broken in Scala. I will work on a fix and include it in the next RC for 0.10.0. Thanks, Fabian 2015-11-02 12:58 GMT+01:00 Philip Lee : > Thanks for your reply, Stephan. > >

Re: Long ids from String

2015-11-03 Thread Fabian Hueske
Converting String ids into Long ids can be quite expensive, so you should make sure it pays off. The safe way to do it is to get all unique String ids (project, distinct), do zipWithUniqueId, and join all DataSets that have the String id with the new long id. So it is a full sort for the unique an
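In code, the dictionary approach could look like this (types and values are illustrative; DataSetUtils is org.apache.flink.api.java.utils.DataSetUtils):

    // records keyed by a String id
    DataSet<Tuple2<String, Integer>> data = env.fromElements(
        new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3));

    // 1) extract the unique String ids
    DataSet<String> uniqueIds = data.map(
        new MapFunction<Tuple2<String, Integer>, String>() {
          @Override
          public String map(Tuple2<String, Integer> t) { return t.f0; }
        }).distinct();

    // 2) assign a unique long id to each String id
    DataSet<Tuple2<Long, String>> dictionary = DataSetUtils.zipWithUniqueId(uniqueIds);

    // 3) join the dictionary back to replace the String ids by long ids
    DataSet<Tuple2<Long, Integer>> result = data
        .join(dictionary).where(0).equalTo(1)
        .with(new JoinFunction<Tuple2<String, Integer>, Tuple2<Long, String>,
                               Tuple2<Long, Integer>>() {
          @Override
          public Tuple2<Long, Integer> join(Tuple2<String, Integer> rec,
                                            Tuple2<Long, String> dict) {
            return new Tuple2<>(dict.f0, rec.f1);
          }
        });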

Re: Rich vs normal functions

2015-11-09 Thread Fabian Hueske
The reason is that the non-rich function interfaces are SAM (single abstract method) interfaces. In Java 8, SAM interfaces can be specified as concise lambda functions. Cheers, Fabian 2015-11-09 10:45 GMT+01:00 Flavio Pompermaier : > Hi flinkers, > I have a simple question for you that I want

Re: Hi, join with two columns of both tables

2015-11-09 Thread Fabian Hueske
Why don't you use a composite key for the Flink join (first.join(second).where(0,1).equalTo(2,3).with(...))? This would be more efficient and you can omit the check in the join function. Best, Fabian 2015-11-08 19:13 GMT+01:00 Philip Lee : > I want to join two tables with two columns like > > //
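For reference, a composite-key join over two fields of each input (tuple types and values are made up):

    DataSet<Tuple3<Integer, Integer, String>> first =
        env.fromElements(new Tuple3<>(1, 2, "x"));
    DataSet<Tuple4<Double, String, Integer, Integer>> second =
        env.fromElements(new Tuple4<>(0.5, "y", 1, 2));

    // fields 0 and 1 of `first` match fields 2 and 3 of `second`,
    // so no per-record key check is needed inside the join function
    first.join(second)
         .where(0, 1)
         .equalTo(2, 3)
         .with(new JoinFunction<Tuple3<Integer, Integer, String>,
                                Tuple4<Double, String, Integer, Integer>, String>() {
           @Override
           public String join(Tuple3<Integer, Integer, String> l,
                              Tuple4<Double, String, Integer, Integer> r) {
             return l.f2 + "/" + r.f1;
           }
         });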

Re: Mixing POJO and Tuples

2015-11-10 Thread Fabian Hueske
Hi Flavio, this will not work out of the box. If you extend a Flink tuple and add additional fields, the type will be recognized as tuple and the TupleSerializer will be used to serialize and deserialize the record. Since the TupleSerializer is not aware of your additional fields it will not serialize

Re: Implementing samza table/stream join

2015-11-10 Thread Fabian Hueske
Hi Nick, I think you can do this with Flink quite similar to how it is explained in the Samza documentation by using a stateful CoFlatMapFunction [1], [2]. Please have a look at this snippet [3]. This code implements an updateable stream filter. The first stream is filtered by words from the second
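The core of the snippet, reduced to a sketch (stream sources omitted; the per-instance HashSet is simplified, non-fault-tolerant state):

    // `data` carries the elements to filter, `filterWords` updates the filter
    data.connect(filterWords)
        .flatMap(new CoFlatMapFunction<String, String, String>() {
          private final Set<String> blocked = new HashSet<>();

          @Override
          public void flatMap1(String value, Collector<String> out) {
            if (!blocked.contains(value)) {
              out.collect(value); // forward elements that are not blocked
            }
          }

          @Override
          public void flatMap2(String word, Collector<String> out) {
            blocked.add(word); // the second stream updates the filter state
          }
        });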

Apache Flink 0.10.0 released

2015-11-16 Thread Fabian Hueske
Hi everybody, The Flink community is excited to announce that Apache Flink 0.10.0 has been released. Please find the release announcement here: --> http://flink.apache.org/news/2015/11/16/release-0.10.0.html Best, Fabian

Re: Different CoGroup behavior inside DeltaIteration

2015-11-16 Thread Fabian Hueske
Hi, this is an artifact of how the solution set is internally implemented. Usually, a CoGroup is executed using a sort-merge strategy, i.e., both inputs are sorted, merged, and handed to the CoGroup function in a streaming fashion. Both inputs are treated equally, and if one of both inputs does not

Re: Fold vs Reduce in DataStream API

2015-11-18 Thread Fabian Hueske
Hi Ron, Have you checked: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#transformations ? Fold is like reduce, except that you define a start element (of a different type than the input type) and the result type is the type of the initial value. In reduce,
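For instance (key selector and concatenation are arbitrary; `env` is a StreamExecutionEnvironment; note the String result type differs from the Integer input type):

    DataStream<Integer> nums = env.fromElements(1, 2, 3, 4);

    DataStream<String> folded = nums
        .keyBy(new KeySelector<Integer, Integer>() {
          @Override
          public Integer getKey(Integer v) { return v % 2; } // illustrative key
        })
        .fold("", new FoldFunction<Integer, String>() {
          @Override
          public String fold(String acc, Integer value) {
            return acc + value; // accumulator type != input type
          }
        });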

Re: ReduceByKeyAndWindow in Flink

2015-11-23 Thread Fabian Hueske
Hi Konstantin, let me first summarize to make sure I understood what you are looking for. You computed an aggregate over a keyed event-time window and you are looking for the maximum aggregate for each group of windows over the same period of time. So if you have (key: 1, w-time: 10, agg: 17) (key

Re: ReduceByKeyAndWindow in Flink

2015-11-23 Thread Fabian Hueske
1, w-time: 20, agg: 30) > (key: 1, w-time: 20, agg: 30) > > Because the reduce function is evaluated for every incoming event (i.e. > each key), right? > > Cheers, > > Konstantin > > On 23.11.2015 10:47, Fabian Hueske wrote: > > Hi Konstantin, > > > > let me firs

Re: Using Hadoop Input/Output formats

2015-11-24 Thread Fabian Hueske
Hi Nick, you can use Flink's HadoopInputFormat wrappers also for the DataStream API. However, DataStream does not offer as much "sugar" as DataSet because StreamEnvironment does not offer dedicated createHadoopInput or readHadoopFile methods. In DataStream Scala you can read from a Hadoop InputFormat

Re: Watermarks as "process completion" flags

2015-11-24 Thread Fabian Hueske
Hi Anton, If I got your requirements right, you are looking for a solution that continuously produces updated partial aggregates in a streaming fashion. When a special event (no more trades) is received, you would like to store the last update as a final result. Is that correct? You can compute

Re: Standalone Cluster vs YARN

2015-11-25 Thread Fabian Hueske
A strong argument for YARN mode can be the isolation of multiple users and jobs. You can easily start a new Flink cluster for each job or user. However, this comes at the price of resource (memory) fragmentation. YARN mode does not use memory as effectively as cluster mode. 2015-11-25 9:46 GMT+01:00

Re: Standalone Cluster vs YARN

2015-11-25 Thread Fabian Hueske
use only YARN without Hadoop? > > Currently we are using Cassandra and CFS ( cass file system ) > > > Cheers > > On Wed, Nov 25, 2015 at 3:51 PM, Fabian Hueske wrote: > >> A strong argument for YARN mode can be the isolation of multiple users >> and jobs. You can e

Re: flink connectors

2015-11-27 Thread Fabian Hueske
Hi Radu, the connectors are available in Maven Central. Just add them as a dependency in your project and they will be fetched and included. Best, Fabian 2015-11-27 14:38 GMT+01:00 Radu Tudoran : > Hi, > > > > I was trying to use flink connectors. However, when I tried to import this > > > > im

Re: Continuing from the stackoverflow post

2015-11-27 Thread Fabian Hueske
Hi Nirmalya, can you describe the semantics that you want to implement? Do you want to find the max temperature every 5 milliseconds or the max of every 5 records? Right now, you are using a non-keyed timeWindow of 5 milliseconds. This will create a window for the complete stream every 5 msecs. H

Re: Interpretation of Trigger and Eviction on a window

2015-11-27 Thread Fabian Hueske
Hi Nirmalya, it is correct that the evictor is called BEFORE the window function is applied because this is required to support certain types of sliding windows. If you want to remove all elements from the window after the window function was applied, you need a trigger that purges the window. The

Re: Doubt about window and count trigger

2015-11-27 Thread Fabian Hueske
Hi Anwar, your trigger looks good! I just want to make sure you know what exactly happens after a window is evaluated and purged. Once a window is purged, the whole window is cleared and removed. If a new element arrives that would have fit into the purged window, a new window with exactly

Re: POJO Dataset read and write

2015-11-27 Thread Fabian Hueske
If you are just looking for an exchange format between two Flink jobs, I would go for the TypeSerializerInput/OutputFormat. Note that these are binary formats. Best, Fabian 2015-11-27 15:28 GMT+01:00 Flavio Pompermaier : > Hi to all, > > I have a complex POJO (with nexted objects) that I'd like
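A sketch of the exchange (the path is a placeholder, and the exact TypeSerializerInputFormat constructor differs slightly across Flink versions):

    // job 1: write the result in Flink's internal binary format
    result.write(new TypeSerializerOutputFormat<Tuple2<String, Integer>>(),
                 "hdfs:///tmp/exchange");

    // job 2: read it back with the matching type information
    TypeInformation<Tuple2<String, Integer>> type = new TupleTypeInfo<>(
        BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO);
    DataSet<Tuple2<String, Integer>> readBack =
        env.createInput(new TypeSerializerInputFormat<>(type), type);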

Re: Doubt about window and count trigger

2015-11-27 Thread Fabian Hueske
completion. > In that 1 min window, is my modified count trigger still valid? Say, if > in that one minute window, I have 100 events after 30 s, will it still fire > at the 30th second? > > Cheers, > Anwar. > > > > On Fri, Nov 27, 2015 at 3:31 PM, Fabian Hueske wrote: >

Re: POJO Dataset read and write

2015-11-27 Thread Fabian Hueske
normal or TypeSerializer is > supposed to perform better than this? > > > On Fri, Nov 27, 2015 at 3:39 PM, Fabian Hueske wrote: > >> If you are just looking for an exchange format between two Flink jobs, I >> would go for the TypeSerializerInput/OutputFormat. >>

Re: Insufficient number of network buffers running on cluster

2015-11-27 Thread Fabian Hueske
Hi Guido, please check this section of the configuration documentation: https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html#configuring-the-network-buffers It should answer your questions. Please let us know, if not. Cheers, Fabian 2015-11-27 16:41 GMT+01:00 Guido :

Re: Continuing from the stackoverflow post

2015-11-30 Thread Fabian Hueske
Hi Nirmalya, thanks for the detailed description of your understanding of Flink's window semantics. Most of it is correct, but a few things need a bit of correction ;-) Please see my comments inline. 2015-11-28 4:36 GMT+01:00 Nirmalya Sengupta : > Hello Fabian, > > > A little long mail; please

Re: Continuing from the stackoverflow post

2015-11-30 Thread Fabian Hueske
Sorry, I have to correct myself. The windowing semantics are not easy ;-) 2015-11-30 15:34 GMT+01:00 Fabian Hueske : > Hi Nirmalya, > > thanks for the detailed description of your understanding of Flink's > window semantics. > Most of it is correct, but a few things need

Re: Iterative queries on Flink

2015-11-30 Thread Fabian Hueske
Hi Flavio, Flink does not support caching of data sets in memory yet. Best, Fabian 2015-11-30 16:45 GMT+01:00 Flavio Pompermaier : > Hi to all, > I was wondering if Flink could fit a use case where a user load a dataset > in memory and then he/she wants to explore it interactively. Let's say I

Re: key

2015-11-30 Thread Fabian Hueske
Hi Radu, with the corrected setters/getters, Flink accepts your data type as a POJO data type, which can automatically be used as a key (in contrast to the GenericType it was before). There is no need to implement the Key interface. Best, Fabian 2015-11-30 17:46 GMT+01:00 Radu Tudoran : > Thank you both for
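For illustration, a type along these lines qualifies as a POJO, so its fields can be referenced as keys by name (class and field names are examples):

    // public class, public no-arg constructor, fields accessible via getters/setters
    public class Event {
      private String id;
      private long count;

      public Event() {}

      public String getId() { return id; }
      public void setId(String id) { this.id = id; }
      public long getCount() { return count; }
      public void setCount(long count) { this.count = count; }
    }

    // key expression on the POJO field:
    events.groupBy("id").reduce(new ReduceFunction<Event>() {
      @Override
      public Event reduce(Event a, Event b) {
        a.setCount(a.getCount() + b.getCount());
        return a;
      }
    });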

Re: Interpretation of Trigger and Eviction on a window

2015-11-30 Thread Fabian Hueske
Yes, that is correct. The first element will be lost. In fact, you do neither need a trigger nor an evictor if you want to get the max element for each group of 5 elements. See my reply on your other mail. Cheers, Fabian 2015-11-30 18:47 GMT+01:00 Nirmalya Sengupta : > Hello Aljoscha , > > Many

Re: Continuing from the stackoverflow post

2015-11-30 Thread Fabian Hueske
Hi Nirmalya, please don't feel discouraged to write blog posts! There are many things you could write about Flink's support for windows, e.g., you could discuss use cases / applications that require advanced window semantics. Regarding my comment on how/when elements are removed from a window: Elements

Re: Material on Apache flink internals

2015-12-01 Thread Fabian Hueske
Hi Madhu, check out the following resources: - Apache Flink Blog: http://flink.apache.org/blog/index.html - Data Artisans Blog: http://data-artisans.com/blog/ - Flink Forward Conference website (Talk slides & recordings): http://flink-forward.org/?post_type=session - Flink Meetup talk recordings:

Re: Hello, the performance of apply function after join

2015-12-01 Thread Fabian Hueske
Hi Phil, an apply method after a join runs pipelined with the join, i.e., it starts processing when the first join result is emitted and finishes after it handled the last join result. Unless the logic in your apply function is terribly complex, this should be OK. If you do not specify an apply

Re: Running WebClient from Windows

2015-12-02 Thread Fabian Hueske
Hi Welly, at the moment we only provide native Windows .bat scripts for start-local and the CLI client. However, we check that the Unix scripts (including start-webclient.sh) work in a Windows Cygwin environment. I have to admit, I am not familiar with MinGW, so not sure what is happening there.

Re: Continuing from the stackoverflow post

2015-12-02 Thread Fabian Hueske
Hi Nirmalya, please find my answers inline. 2015-12-02 3:26 GMT+01:00 Nirmalya Sengupta : > Hello Fabian, > > Many thanks for your encouraging words about the blogs. I want to make a > sincere attempt. > > To summarise my understanding of the rule of removal of the elements from > the window

Re: NPE with Flink Streaming from Kafka

2015-12-02 Thread Fabian Hueske
Hi Mihail, not sure if I correctly got your requirements, but you can define windows on a keyed stream. This basically means that you partition the stream, for example by order-id, and compute windows over the keyed stream. This will create one (or more, depending on the window type) window for each

Re: Using Kryo for serializing lambda closures

2015-12-08 Thread Fabian Hueske
Hi Nick, thanks for pushing this and opening the JIRA issue. The issue came up a couple of times and is a known limitation (see FLINK-1256). So far the workaround of marking member variables as transient and initializing them in the open() method of a RichFunction has been good enough for all cases

Re: Taskmanager memory

2015-12-09 Thread Fabian Hueske
Hi Sebastian, There is no way to return memory from a Flink process except shutting the process down. I think YARN could help in your setup. In a YARN setup, you can flexibly start and stop Flink sessions with different configurations (memory, TMs, slots) or run a single job. When running a single

Re: Taskmanager memory

2015-12-09 Thread Fabian Hueske
memory lazily > - Set the memory to off-heap memory. That way the JVM heap is small. The > off-heap memory is deallocated when no longer used - this releases > memory much better than the JVM shrinking the heap. > > > > On Wed, Dec 9, 2015 at 10:06 AM, Fabian Hueske w

Re: Taskmanager memory

2015-12-09 Thread Fabian Hueske
memory consuming operators release the > memory. > > The Java process releases that memory then on the next GC, as far as I > know. > > On Wed, Dec 9, 2015 at 11:01 AM, Fabian Hueske wrote: > >> Streaming mode with on-heap memory won't help because the JVM allocates >

Re: Streaming time window

2015-12-10 Thread Fabian Hueske
Hi Martin, you can get the start and end time of a window from the TimeWindow object. The following Scala code snippet shows how to access the window end time (start time is equivalent):

    .timeWindow(Time.minutes(5))
    .trigger(new EarlyCountTrigger(earlyCountThreshold))
    .apply { (key: Int, win
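The Java counterpart would look roughly like this (element and key types are illustrative):

    keyedStream
        .timeWindow(Time.minutes(5))
        .apply(new WindowFunction<Event, String, Integer, TimeWindow>() {
          @Override
          public void apply(Integer key, TimeWindow window,
                            Iterable<Event> values, Collector<String> out) {
            // the TimeWindow argument exposes both boundaries
            out.collect("start=" + window.getStart() + ", end=" + window.getEnd());
          }
        });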

Re: Streaming time window

2015-12-10 Thread Fabian Hueske
MultiValuePoissonPreProcess()) > > How can I get access to the time window object in the fold function? > > > cheers Martin > > > On Thu, Dec 10, 2015 at 12:20 PM, Fabian Hueske wrote: > >> Hi Martin, >> >> you can get the start and end time of a windo

Re: Streaming time window

2015-12-10 Thread Fabian Hueske
processed (at least to my > understanding) this will lead to problems. Any idea how to get around this? > > cheers Martin > > > > On Thu, Dec 10, 2015 at 2:59 PM, Fabian Hueske wrote: > >> Sure. You don't need a trigger, but a WindowFunction instead of the >> F

Re: Apache Flink Web Dashboard - Completed Job history

2015-12-16 Thread Fabian Hueske
Hi Ovidiu, the job execution information is only held in memory and completely discarded when the JobManager is shut down. However, you can query all stats that are displayed by the dashboard via a REST API [1] while the JM is running and save them yourself. This way you can analyze the data also

Re: Size of a window without explicit trigger/evictor

2015-12-18 Thread Fabian Hueske
Hi Nirmalya, sorry for the delayed answer. First of all, Flink does not take care that your windows fit into memory. The default trigger depends on the way in which you define a window. Given a KeyedStream you can define a window in the following ways: KeyedStream s = ... s.timeWindow() // this w

Re: Serialisation problem

2015-12-21 Thread Fabian Hueske
Hi, In your program, you apply a distinct transformation on a data set that has a (nested) GValue[] type. Distinct requires that all fields are comparable with each other. Therefore all fields of the data set's type must be valid key types. However, Flink does not support object arrays as key types

Re: Explanation of the output of timeWindowAll(Time.milliseconds(3))

2015-12-28 Thread Fabian Hueske
Hi Nirmalya, event time events (such as an event time trigger to compute a window) are triggered when a watermark is received that is larger than the trigger's timestamp. By default, watermarks are emitted with a fixed time interval, i.e., every x milliseconds. When a new watermark is emitted, Flink

Re: Getting executionplan in the local mode inside IDE

2016-01-01 Thread Fabian Hueske
Hi, you can only get the execution plan for programs that have a data sink and haven't been executed. In your code print() defines the data sink; however, it also eagerly executes the program. After execution the program is "removed" from the execution environment. Therefore, Flink complains that no
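So a working sketch uses a non-eager sink (paths and data are placeholders):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> data = env.fromElements("a", "b", "c");

    data.writeAsText("/tmp/out");               // defines a sink without executing

    System.out.println(env.getExecutionPlan()); // JSON plan of the pending program
    env.execute("my job");                      // now actually run it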

Re: [ANNOUNCE] Introducing Apache Flink Taiwan User Group - Flink.tw

2016-01-02 Thread Fabian Hueske
Hi Gordon, this is great news! I am very happy to hear that there is a new local Flink community in Taiwan! Translating blog posts, slide sets, and other documentation is very valuable and makes the project known to a broader audience and accessible to more people. Please let us know, if you need

Re: RideCleansing example

2016-01-02 Thread Fabian Hueske
Hi Serkan, yes this is expected. The reference implementations use DataStream.print() to emit the resulting data stream to the standard output. If the program is executed in an IDE, the standard output goes to the IDE's console. If you start Flink in a local (or cluster) setup, the standard out

Re: SQL query support in Flink

2016-01-12 Thread Fabian Hueske
Hi Sourav, Flink does not support executing SQL queries on DataSets yet. We just started an effort to change that. See the discussion on the dev mailing list [1] and the corresponding design document [2]. For simple relational queries, you can use the Table API [3]. Best, Fabian [1] https://m

Re: What port should I choose to run the SocketTextStreamWordCount sample

2016-01-12 Thread Fabian Hueske
Hi, I answered your question on Stack Overflow. To summarize, the program reads data from a text socket. You need to open the socket before starting the program. You can open the socket on any (available) port. If you run nc -lk from a command line, you should use localhost and port .

Re: Accessing configuration in RichFunction

2016-01-13 Thread Fabian Hueske
Hi Christian, the open method is called by the Flink workers when the parallel tasks are initialized. The configuration parameter is the configuration object of the operator. You can set parameters in the operator config as follows: DataSet text = ... DataSet wc = text.flatMap(new Tokenizer()).ge
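Continuing that sketch with a made-up parameter key:

    Configuration config = new Configuration();
    config.setString("tokenizer.delimiter", " "); // illustrative parameter

    DataSet<String> tokens = text
        .flatMap(new RichFlatMapFunction<String, String>() {
          private String delimiter;

          @Override
          public void open(Configuration parameters) {
            // `parameters` is the operator config passed via withParameters()
            delimiter = parameters.getString("tokenizer.delimiter", ",");
          }

          @Override
          public void flatMap(String line, Collector<String> out) {
            for (String token : line.split(delimiter)) {
              out.collect(token);
            }
          }
        })
        .withParameters(config);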

Re: Flink Execution Plan

2016-01-14 Thread Fabian Hueske
@Christian: I don't think that is possible. There are quite a few things missing in the JSON including: - User function objects (Flink ships objects not class names) - Function configuration objects - Data types Best, Fabian 2016-01-14 16:02 GMT+01:00 lofifnc : > Hi Márton, > > Thanks for your

Re: Actual byte-streams in multiple-node pipelines

2016-01-21 Thread Fabian Hueske
Hi Tal, you said that most processing will be done in external processes. If these processes are stateful, this might be hard to integrate with Flink's fault-tolerance mechanism. In principle, Flink requires two things to achieve exactly-once processing: 1) A data source that can be replayed from

Re: Reading Binary Data (Matrix) with Flink

2016-01-25 Thread Fabian Hueske
Hi Saliya, yes that is possible, however the requirements for reading a binary file from local fs are basically the same as for reading it from HDFS. In order to be able to start reading different sections of a file in parallel, you need to know the different starting positions. This can be done b

Re: Reading Binary Data (Matrix) with Flink

2016-01-25 Thread Fabian Hueske
So, > the file is replicated across nodes and the reading (mapping) happens only > once. > > Thank you, > Saliya > > On Mon, Jan 25, 2016 at 4:38 AM, Fabian Hueske wrote: > >> Hi Saliya, >> >> yes that is possible, however the requirements for reading a bi

Re: Hello, a question about Dashboard in Flink

2016-01-25 Thread Fabian Hueske
You can start a job and then periodically request and store information about the running job and vertices by using the corresponding REST calls [1]. The data will be in JSON format. After the job finishes, you can stop requesting data. Next you parse the JSON, extract the information you need and g

Re: Task Manager metrics per job on Flink 0.9.1

2016-01-27 Thread Fabian Hueske
Hi, it is correct that the metrics are collected from the task managers. In Flink 0.9.1 the metrics are visualized as charts in the web dashboard. This visualization was removed when the dashboard was redesigned and updated for 0.10, but will hopefully be added again. For Flink 0.9.1, the metr

Re: Mixing Batch & Streaming

2016-01-28 Thread Fabian Hueske
Hi, this is currently not supported yet. However, this feature is on our roadmap and has been requested a few times. So I hope somebody will pick it up soon. If the static data set is small enough, you can read the full data set (e.g., as a file) in the open method of a FlatMapFunction, build a hash
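A sketch of that workaround (file path, format, and the joined output are illustrative; the file must be readable from every worker):

    stream.flatMap(new RichFlatMapFunction<String, String>() {
      private Map<String, String> lookup;

      @Override
      public void open(Configuration parameters) throws Exception {
        // build the hash table once when the operator starts
        lookup = new HashMap<>();
        try (BufferedReader reader =
                 new BufferedReader(new FileReader("/path/static-data.csv"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] kv = line.split(",");
            lookup.put(kv[0], kv[1]);
          }
        }
      }

      @Override
      public void flatMap(String key, Collector<String> out) {
        String match = lookup.get(key);
        if (match != null) {
          out.collect(key + "," + match); // emit the joined record
        }
      }
    });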

Re: Window stream using timestamp key for time

2016-01-28 Thread Fabian Hueske
Hi Emmanuel, the feature you are looking for is called event time processing in Flink. These blog posts should help you to become familiar with the concepts: 1) Event-Time concepts: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ 2) Windows in Flink: http://fl

Re: Hello, a question about Dashboard in Flink

2016-01-29 Thread Fabian Hueske
> I did not find the network usage metric in it. > > Best, > Phil > > On Mon, Jan 25, 2016 at 5:06 PM, Fabian Hueske wrote: > >> You can start a job and then periodically request and store information >> about the running job and vertices from using corresponding RE

Re: Flink stream data ordering/sequence

2016-01-29 Thread Fabian Hueske
Hi Sana, The feature you are looking for is called event time processing in Flink. These blog posts should help you to become familiar with the concepts: 1) Event-Time concepts: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ 2) Windows in Flink: http://flink.

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
Hi Flavio, using a default FileOutputFormat, Flink writes one output file for each data sink task, i.e., as many files as the defined parallelism. The size of these files depends on the total output size and the distribution. If you write to HDFS, a file consists of one or more HDFS blocks. Parquet

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
the InputSplits. Am I right? Or am I misunderstanding something? > > Best, > Flavio > > On Fri, Jan 29, 2016 at 11:14 AM, Fabian Hueske wrote: > >> Hi Flavio, >> >> using a default FileOutputFormat, Flink writes one output file for each >> data sink task, i.e.

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
file), right? > > On Fri, Jan 29, 2016 at 11:56 AM, Fabian Hueske wrote: > >> The number of input splits does not depend on the number of files but on >> the number of HDFS blocks of all files. >> Reading a single file with 100 HDFS blocks and reading of 100 files wit

Re: Where does Flink store intermediate state and triggering/scheduling a delayed event?

2016-02-03 Thread Fabian Hueske
Hi, 1) At the moment, state is kept on the JVM heap in a regular HashMap. However, we added an interface for pluggable state backends. State backends store the operator state (Flink's built-in window operators are based on operator state as well). A pull request to add a RocksDB backend (going to

Re: Build Flink for a specific tag

2016-02-03 Thread Fabian Hueske
Hi Flavio, we use tags to identify releases. The "release-0.10.1" tag refers to the code that has been released as Flink 0.10.1. The "release-0.10" branch is used to develop 0.10 releases. Currently, it contains Flink 0.10.1 and additionally a few more bug fix commits. We will fork off this branch

Re: Left join with unbalanced dataset

2016-02-03 Thread Fabian Hueske
Hi Arnaud, in a previous mail you said: "Note that I did not rebuild & reinstall flink, I just used a 0.10-SNAPSHOT compiled jar submitted as a batch job using the "0.10.0" flink installation" This will not fix the Netty version error. You need to install a new Flink version or submit the Flink

Re: Left join with unbalanced dataset

2016-02-03 Thread Fabian Hueske
> However, I’ve just recompiled everything and ran with a real 0.10.1 > snapshot and everything worked at an astounding speed with a reasonable > memory amount. > > Thanks for the great work and the help, as always, > > Arnaud > > > > *From:* Fabian Hueske [mailto:fh

Re: Possibility to get the line numbers?

2016-02-03 Thread Fabian Hueske
Hi Anastasiia, this is difficult because the input is usually read in parallel, i.e., an input file is split into several blocks which are independently read and processed by different threads (possibly on different machines). So it is difficult to have a sequential row number. If all rows have th
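If a consecutive (but not file-order) numbering is good enough, Flink's DataSetUtils can assign one despite the parallel read, e.g.:

    // assigns consecutive long indices across all parallel partitions;
    // the order reflects the partitioning, not the original line order
    DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");
    DataSet<Tuple2<Long, String>> numbered = DataSetUtils.zipWithIndex(lines);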

Re: Where does Flink store intermediate state and triggering/scheduling a delayed event?

2016-02-04 Thread Fabian Hueske
Soumya Simanta : > Fabian, > > Thank a lot for your response. Really appreciated. I've some additional > questions (please see inline) > > On Wed, Feb 3, 2016 at 2:42 PM, Fabian Hueske wrote: > >> Hi, >> >> 1) At the moment, state is kept on the JVM h

Re: Flink writeAsCsv

2016-02-04 Thread Fabian Hueske
You can get the end time of a window from the TimeWindow object which is passed to the AllWindowFunction. This is basically a window ID / index. I would go for a custom output sink which writes records to files based on their timestamp. IMO, this would be cleaner & easier than implementing the file

Re: Error while reading binary file

2016-02-08 Thread Fabian Hueske
Hi, please try to replace DataSet<ShortValue> ds = env.createInput(sif); by DataSet<ShortValue> ds = env.createInput(sif, ValueTypeInfo.SHORT_VALUE_TYPE_INFO); Best, Fabian 2016-02-08 19:33 GMT+01:00 Saliya Ekanayake : > Till, > > I am still having trouble getting this to work. Here's my code ( > https://github.com/es

Re: Error while reading binary file

2016-02-08 Thread Fabian Hueske
java:169) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) > at java.lang.Thread.run(Thread.java:745) > > On Mon, Feb 8, 2016 at 3:50 PM, Fabian Hueske wrote: > >> Hi, >> >> please try to replace >> DataSet ds = env.createInput(sif); >>

Re: Kafka partition alignment for event time

2016-02-09 Thread Fabian Hueske
Hi, where did you observe the duplicates, within Flink or in Kafka? Please be aware that the Flink Kafka Producer does not provide exactly-once consistency. This is not easily possible because Kafka does not support transactional writes yet. Flink's exactly-once guarantees are only valid within t

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
What is the type of sessionId? It must be a key type in order to be used as a key. If it is a generic class, it must implement Comparable to be used as a key. 2016-02-09 11:53 GMT+01:00 Dominique Rondé : > The fields in SourceA and SourceB are private but have public getters and > setters. The classe
