Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Fabian Hueske
Hi Pieter, cross is indeed too expensive for this task. If dataset A fits into memory, you can do the following: Use a RichMapPartitionFunction to process dataset B and add dataset A as a broadcastSet. In the open method of mapPartition, you can load the broadcasted set and sort it by a.propertyX
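A minimal sketch of this pattern (the data, element types, and the binary-search probe are illustrative, not from the thread):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.flink.api.common.functions.RichMapPartitionFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class BroadcastProbe {
      public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Double> a = env.fromElements(0.1, 0.5, 0.9);       // small dataset A
        DataSet<Double> b = env.fromElements(0.2, 0.4, 0.6, 0.8);  // large dataset B

        b.mapPartition(new RichMapPartitionFunction<Double, String>() {
          private List<Double> sortedA;

          @Override
          public void open(Configuration parameters) {
            // load the broadcast set once per parallel task and sort it
            sortedA = new ArrayList<>(getRuntimeContext().<Double>getBroadcastVariable("A"));
            Collections.sort(sortedA);
          }

          @Override
          public void mapPartition(Iterable<Double> values, Collector<String> out) {
            // probe the sorted copy of A for each element of B
            for (Double v : values) {
              int pos = Collections.binarySearch(sortedA, v);
              out.collect(v + " -> " + pos);
            }
          }
        }).withBroadcastSet(a, "A").print();
      }
    }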

Re: For each element in a dataset, do something with another dataset

2015-09-30 Thread Fabian Hueske
Any pointers for getting started with the 'tricky range > partitioning'? I am quite keen to get this working with large datasets ;-) > > Cheers, > > Pieter > > 2015-09-30 10:24 GMT+02:00 Fabian Hueske : > >> Hi Pieter, >> >> cross is indeed

Re: Config files content read

2015-10-05 Thread Fabian Hueske
Hi Flavio, I don't think this is a feature that needs to go into Flink core. To me it looks like this could be implemented as a utility method by anybody who needs it without major effort. Best, Fabian 2015-10-02 15:27 GMT+02:00 Flavio Pompermaier : > Hi to all, > > in many of my jobs I have to read

Re: For each element in a dataset, do something with another dataset

2015-10-05 Thread Fabian Hueske
use a RichMapPartitionFunction >>> on A: >>> In the open method of mapPartition, you sort B. Then, for each element >>> of A, you do a binary search in B, and look at the index found by the >>> binary search, which will be the count that you are looking for. >>

Re: source binary file

2015-10-06 Thread Fabian Hueske
Hi Lydia, you need to implement a custom InputFormat to read binary files. Usually you can extend the FileInputFormat. The implementation depends a lot on your use case, for example whether each binary file is read into a single record or multiple records and how records are delimited if there is more than one record per file.
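For illustration only, a minimal format for fixed-width two-byte records could look like the sketch below (it assumes splits are aligned to record boundaries, which a real implementation has to handle explicitly):

    import java.io.IOException;
    import org.apache.flink.api.common.io.FileInputFormat;

    // sketch: read a binary file as a stream of 2-byte big-endian shorts
    public class ShortInputFormat extends FileInputFormat<Short> {

      @Override
      public boolean reachedEnd() throws IOException {
        // stop at the end of this task's input split
        return stream.getPos() >= splitStart + splitLength;
      }

      @Override
      public Short nextRecord(Short reuse) throws IOException {
        int hi = stream.read();
        int lo = stream.read();
        if (lo < 0) {
          return null; // end of stream
        }
        return (short) ((hi << 8) | lo);
      }
    }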

Re: Parallel file read in LocalEnvironment

2015-10-07 Thread Fabian Hueske
Hi Flavio, it is not possible to split by line count because that would mean to read and parse the file just for splitting. Parallel processing of data sources depends on the input splits created by the InputFormat. Local files can be split just like files in HDFS. Usually, each file corresponds

Re: Parallel file read in LocalEnvironment

2015-10-07 Thread Fabian Hueske
description of > their internals? > > On Wed, Oct 7, 2015 at 3:09 PM, Fabian Hueske wrote: > >> Hi Flavio, >> >> it is not possible to split by line count because that would mean to read >> and parse the file just for splitting. >> >> Parallel processing

Re: Debug OutOfMemory

2015-10-08 Thread Fabian Hueske
Hi Konstantin, Flink uses managed memory only for its internal processing (sorting, hash tables, etc.). If you allocate too much memory in your user code, it can still fail with an OOME. This can also happen for large broadcast sets. Can you check how much memory the JVM allocated and how much

Re: Debug OutOfMemory

2015-10-09 Thread Fabian Hueske
I think that's actually very use case specific. Your code will never see the malformed record because it is dropped by the input format. Other applications might rely on complete input and would prefer an exception to be notified about invalid input. Flink's CsvInputFormat has a parameter "lenient"
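With the CsvReader API, lenient parsing can be switched on roughly like this (path and field types are made up):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Tuple2<String, Integer>> data = env
        .readCsvFile("hdfs:///path/to/input")  // illustrative path
        .ignoreInvalidLines()                  // drop malformed lines instead of failing
        .types(String.class, Integer.class);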

Re: Hi, Flink people, a question about translation from HIVE Query to Flink function by using Table API

2015-10-19 Thread Fabian Hueske
Hi Philip, here are a few additions to what Max said: - ORDER BY: As Max said, Flink's sortPartition() only sorts within a partition and does not produce a total order. You can either set the parallelism to 1 as Max suggested or use a custom partitioner to range partition the data. - SORT BY: From y

Re: Hi, Flink people, a question about translation from HIVE Query to Flink function by using Table API

2015-10-19 Thread Fabian Hueske
[Distribute > By] + [Sort By]. Therefore, according to your suggestion, should it be > partitionByHash() + sortGroup() instead of sortPartition()? > > Or probably I still did not get the difference between Partition and > scope within a reduce. > > Regards, > Philip > > On Mon,

Re: Powered by Flink

2015-10-19 Thread Fabian Hueske
Thanks for starting this Kostas. I think the list is quite hidden in the wiki. Should we link from flink.apache.org to that page? Cheers, Fabian 2015-10-19 14:50 GMT+02:00 Kostas Tzoumas : > Hi everyone, > > I started a "Powered by Flink" wiki page, listing some of the > organizations that are

Re: Powered by Flink

2015-10-19 Thread Fabian Hueske
Sounds good +1 2015-10-19 14:57 GMT+02:00 Márton Balassi : > Thanks for starting and big +1 for making it more prominent. > > On Mon, Oct 19, 2015 at 2:53 PM, Fabian Hueske wrote: > >> Thanks for starting this Kostas. >> >> I think the list is quite hidden in

Re: Powered by Flink

2015-10-19 Thread Fabian Hueske
difficult to answer to > interested users. > > > On 19.10.2015 15:08, Suneel Marthi wrote: > > +1 to this. > > On Mon, Oct 19, 2015 at 3:00 PM, Fabian Hueske wrote: > >> Sounds good +1 >> >> 2015-10-19 14:57 GMT+02:00 Márton Balassi < >> balassi.mar...

Re: ExecutionEnvironment setConfiguration API

2015-10-19 Thread Fabian Hueske
I think it's not a nice solution to check for the type of the returned execution environment to determine whether it is a local or a remote execution environment. Wouldn't it be better to add a method isLocal() to ExecutionEnvironment? Cheers, Fabian 2015-10-14 19:14 GMT+02:00 Flavio Pompermaier

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Fabian Hueske
It might even be materialized (to disk) if both derived data sets are joined. 2015-10-22 12:01 GMT+02:00 Till Rohrmann : > I fear that the filter operations are not chained because there are at > least two of them which have the same DataSet as input. However, it's true > that the intermediate result

Re: Reading multiple datasets with one read operation

2015-10-22 Thread Fabian Hueske
> reading the entire input multiple times (we are talking 100GB+ on max 32 > workers) but I would have to run some experiments to confirm that. > > > > 2015-10-22 12:06 GMT+02:00 Fabian Hueske : > >> It might even be materialized (to disk) if both derived data sets are >

Re: reading csv file from null value

2015-10-26 Thread Fabian Hueske
Hi Philip, the CsvInputFormat does not support reading empty fields. I see two ways to achieve this functionality: - Use a TextInputFormat that returns each line as a String and do the parsing in a subsequent MapFunction - Extend the CsvInputFormat to support empty fields Cheers, Fabian 2015-10
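The first option could look roughly like this (the delimiter, field layout, and the default for empty numeric fields are assumptions; `env` is an ExecutionEnvironment):

    DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");

    DataSet<Tuple2<String, Integer>> parsed = lines.map(
        new MapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public Tuple2<String, Integer> map(String line) {
            String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
            // substitute a default for empty numeric fields instead of failing
            Integer value = fields[1].isEmpty() ? -1 : Integer.valueOf(fields[1]);
            return new Tuple2<>(fields[0], value);
          }
        });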

Re: Flink on EC"

2015-10-29 Thread Fabian Hueske
Hi Thomas, until recently, Flink provided its own implementation of an S3FileSystem, which wasn't fully tested and was buggy. We removed that implementation and now use (in 0.10-SNAPSHOT) Hadoop's S3 implementation by default. If you want to continue using 0.9.1 you can configure Flink to use Hadoop's

Re: How best to deal with wide, structured tuples?

2015-10-29 Thread Fabian Hueske
Hi Johann, I see three options for your use case. 1) Generate Pojo code at planning time, i.e., when the program is composed. This does not work when the program is already running. The benefit is that you can use key expressions, have typed fields, and type specific serializers and comparators.

Re: How groupBy work

2015-10-30 Thread Fabian Hueske
Hi Jeffery, Flink uses a (potentially external) merge-sort to group data. Combining is done using an in-memory sort. Because Flink uses pipelined data transfer, the execution of operators in a program can overlap. For example in WordCount, the sort of a groupBy will immediately start as soon as the

Re: Create triggers

2015-10-30 Thread Fabian Hueske
You refer to the DataSet (batch) API, right? In that case you can evaluate your condition in the program and fetch a DataSet back to the client using List<MyType> myData = myDataSet.collect(); Based on the result of the collect() call you can define and execute a new program. Note: collect() will immediately
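A compact sketch of that pattern (data, condition, and sink path are placeholders):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // first program: collect() triggers execution and returns the result to the client
    List<Integer> result = env.fromElements(1, 2, 3).collect();

    // a client-side condition decides whether to define and run a second program
    if (result.size() > 2) {
      env.fromElements("a", "b").writeAsText("/tmp/out"); // illustrative sink
      env.execute("follow-up job");
    }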

Re: Create triggers

2015-11-02 Thread Fabian Hueske
dataset. > > Some pseudocode from your solution: > DataSet A = env.readFile(...); > DataSet C = env.readFile(...); > > A.groupBy().reduce().filter(*Check conditions here and in case start > processing C*); > > > Thanks, > Giacomo > > > > > On Fri, Oct 30

Re: Hi, question about orderBy two columns more

2015-11-02 Thread Fabian Hueske
Hi Philip, thanks for reporting the issue. I just verified the problem. It is working correctly for the Java API, but is broken in Scala. I will work on a fix and include it in the next RC for 0.10.0. Thanks, Fabian 2015-11-02 12:58 GMT+01:00 Philip Lee : > Thanks for your reply, Stephan. > >

Re: Long ids from String

2015-11-03 Thread Fabian Hueske
Converting String ids into Long ids can be quite expensive, so you should make sure it pays off. The safe way to do it is to get all unique String ids (project, distinct), do zipWithUniqueId, and join all DataSets that have the String id with the new long id. So it is a full sort for the unique an
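In code, the dictionary approach could look like this (types and values are illustrative; DataSetUtils is org.apache.flink.api.java.utils.DataSetUtils):

    // records keyed by a String id
    DataSet<Tuple2<String, Integer>> data = env.fromElements(
        new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3));

    // 1) extract the unique String ids
    DataSet<String> uniqueIds = data.map(
        new MapFunction<Tuple2<String, Integer>, String>() {
          @Override
          public String map(Tuple2<String, Integer> t) { return t.f0; }
        }).distinct();

    // 2) assign a unique long id to each String id
    DataSet<Tuple2<Long, String>> dictionary = DataSetUtils.zipWithUniqueId(uniqueIds);

    // 3) join the dictionary back to replace the String ids by long ids
    DataSet<Tuple2<Long, Integer>> result = data
        .join(dictionary).where(0).equalTo(1)
        .with(new JoinFunction<Tuple2<String, Integer>, Tuple2<Long, String>,
                               Tuple2<Long, Integer>>() {
          @Override
          public Tuple2<Long, Integer> join(Tuple2<String, Integer> rec,
                                            Tuple2<Long, String> dict) {
            return new Tuple2<>(dict.f0, rec.f1);
          }
        });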

Re: Rich vs normal functions

2015-11-09 Thread Fabian Hueske
The reason is that the non-rich function interfaces are SAM (single abstract method) interfaces. In Java 8, SAM interfaces can be specified as concise lambda functions. Cheers, Fabian 2015-11-09 10:45 GMT+01:00 Flavio Pompermaier : > Hi flinkers, > I have a simple question for you that I want

Re: Hi, join with two columns of both tables

2015-11-09 Thread Fabian Hueske
Why don't you use a composite key for the Flink join (first.join(second).where(0,1).equalTo(2,3).with(...))? This would be more efficient and you can omit the check in the join function. Best, Fabian 2015-11-08 19:13 GMT+01:00 Philip Lee : > I want to join two tables with two columns like > > //
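For reference, a composite-key join over two fields of each input (tuple types and values are made up):

    DataSet<Tuple3<Integer, Integer, String>> first =
        env.fromElements(new Tuple3<>(1, 2, "x"));
    DataSet<Tuple4<Double, String, Integer, Integer>> second =
        env.fromElements(new Tuple4<>(0.5, "y", 1, 2));

    // fields 0 and 1 of `first` match fields 2 and 3 of `second`,
    // so no per-record key check is needed inside the join function
    first.join(second)
         .where(0, 1)
         .equalTo(2, 3)
         .with(new JoinFunction<Tuple3<Integer, Integer, String>,
                                Tuple4<Double, String, Integer, Integer>, String>() {
           @Override
           public String join(Tuple3<Integer, Integer, String> l,
                              Tuple4<Double, String, Integer, Integer> r) {
             return l.f2 + "/" + r.f1;
           }
         });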

Re: Mixing POJO and Tuples

2015-11-10 Thread Fabian Hueske
Hi Flavio, this will not work out of the box. If you extend a Flink tuple and add additional fields, the type will be recognized as tuple and the TupleSerializer will be used to serialize and deserialize the record. Since the TupleSerializer is not aware of your additional fields it will not serialize

Re: Implementing samza table/stream join

2015-11-10 Thread Fabian Hueske
Hi Nick, I think you can do this with Flink quite similar to how it is explained in the Samza documentation by using a stateful CoFlatMapFunction [1], [2]. Please have a look at this snippet [3]. This code implements an updateable stream filter. The first stream is filtered by words from the second
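The core of the snippet, reduced to a sketch (stream sources omitted; the per-instance HashSet is simplified, non-fault-tolerant state):

    // `data` carries the elements to filter, `filterWords` updates the filter
    data.connect(filterWords)
        .flatMap(new CoFlatMapFunction<String, String, String>() {
          private final Set<String> blocked = new HashSet<>();

          @Override
          public void flatMap1(String value, Collector<String> out) {
            if (!blocked.contains(value)) {
              out.collect(value); // forward elements that are not blocked
            }
          }

          @Override
          public void flatMap2(String word, Collector<String> out) {
            blocked.add(word); // the second stream updates the filter state
          }
        });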

Apache Flink 0.10.0 released

2015-11-16 Thread Fabian Hueske
Hi everybody, The Flink community is excited to announce that Apache Flink 0.10.0 has been released. Please find the release announcement here: --> http://flink.apache.org/news/2015/11/16/release-0.10.0.html Best, Fabian

Re: Different CoGroup behavior inside DeltaIteration

2015-11-16 Thread Fabian Hueske
Hi, this is an artifact of how the solution set is internally implemented. Usually, a CoGroup is executed using a sort-merge strategy, i.e., both inputs are sorted, merged, and handed to the CoGroup function in a streaming fashion. Both inputs are treated equally, and if one of both inputs does not

Re: Fold vs Reduce in DataStream API

2015-11-18 Thread Fabian Hueske
Hi Ron, Have you checked: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#transformations ? Fold is like reduce, except that you define a start element (of a different type than the input type) and the result type is the type of the initial value. In reduce,
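For instance (key selector and concatenation are arbitrary; `env` is a StreamExecutionEnvironment; note the String result type differs from the Integer input type):

    DataStream<Integer> nums = env.fromElements(1, 2, 3, 4);

    DataStream<String> folded = nums
        .keyBy(new KeySelector<Integer, Integer>() {
          @Override
          public Integer getKey(Integer v) { return v % 2; } // illustrative key
        })
        .fold("", new FoldFunction<Integer, String>() {
          @Override
          public String fold(String acc, Integer value) {
            return acc + value; // accumulator type != input type
          }
        });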

Re: ReduceByKeyAndWindow in Flink

2015-11-23 Thread Fabian Hueske
Hi Konstantin, let me first summarize to make sure I understood what you are looking for. You computed an aggregate over a keyed event-time window and you are looking for the maximum aggregate for each group of windows over the same period of time. So if you have (key: 1, w-time: 10, agg: 17) (key

Re: ReduceByKeyAndWindow in Flink

2015-11-23 Thread Fabian Hueske
1, w-time: 20, agg: 30) > (key: 1, w-time: 20, agg: 30) > > Because the reduce function is evaluated for every incoming event (i.e. > each key), right? > > Cheers, > > Konstantin > > On 23.11.2015 10:47, Fabian Hueske wrote: > > Hi Konstantin, > > > > let me firs

Re: Using Hadoop Input/Output formats

2015-11-24 Thread Fabian Hueske
Hi Nick, you can use Flink's HadoopInputFormat wrappers also for the DataStream API. However, DataStream does not offer as much "sugar" as DataSet because StreamEnvironment does not offer dedicated createHadoopInput or readHadoopFile methods. In DataStream Scala you can read from a Hadoop InputFormat

Re: Watermarks as "process completion" flags

2015-11-24 Thread Fabian Hueske
Hi Anton, If I got your requirements right, you are looking for a solution that continuously produces updated partial aggregates in a streaming fashion. When a special event (no more trades) is received, you would like to store the last update as a final result. Is that correct? You can compute

Re: Standalone Cluster vs YARN

2015-11-25 Thread Fabian Hueske
A strong argument for YARN mode can be the isolation of multiple users and jobs. You can easily start a new Flink cluster for each job or user. However, this comes at the price of resource (memory) fragmentation. YARN mode does not use memory as effectively as cluster mode. 2015-11-25 9:46 GMT+01:00

Re: Standalone Cluster vs YARN

2015-11-25 Thread Fabian Hueske
use only YARN without Hadoop? > > Currently we are using Cassandra and CFS ( cass file system ) > > > Cheers > > On Wed, Nov 25, 2015 at 3:51 PM, Fabian Hueske wrote: > >> A strong argument for YARN mode can be the isolation of multiple users >> and jobs. You can e

Re: flink connectors

2015-11-27 Thread Fabian Hueske
Hi Radu, the connectors are available in Maven Central. Just add them as a dependency in your project and they will be fetched and included. Best, Fabian 2015-11-27 14:38 GMT+01:00 Radu Tudoran : > Hi, > > > > I was trying to use flink connectors. However, when I tried to import this > > > > im

Re: Continuing from the stackoverflow post

2015-11-27 Thread Fabian Hueske
Hi Nirmalya, can you describe the semantics that you want to implement? Do you want to find the max temperature every 5 milliseconds or the max of every 5 records? Right now, you are using a non-keyed timeWindow of 5 milliseconds. This will create a window for the complete stream every 5 msecs. H

Re: Interpretation of Trigger and Eviction on a window

2015-11-27 Thread Fabian Hueske
Hi Nirmalya, it is correct that the evictor is called BEFORE the window function is applied because this is required to support certain types of sliding windows. If you want to remove all elements from the window after the window function was applied, you need a trigger that purges the window. The

Re: Doubt about window and count trigger

2015-11-27 Thread Fabian Hueske
Hi Anwar, your trigger looks good! I just want to make sure you know what exactly happens after a window is evaluated and purged. Once a window is purged, the whole window is cleared and removed. If a new element arrives that would have fit into the purged window, a new window with exactly

Re: POJO Dataset read and write

2015-11-27 Thread Fabian Hueske
If you are just looking for an exchange format between two Flink jobs, I would go for the TypeSerializerInput/OutputFormat. Note that these are binary formats. Best, Fabian 2015-11-27 15:28 GMT+01:00 Flavio Pompermaier : > Hi to all, > > I have a complex POJO (with nexted objects) that I'd like
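A sketch of the exchange (the path is a placeholder, and the exact TypeSerializerInputFormat constructor differs slightly across Flink versions):

    // job 1: write the result in Flink's internal binary format
    result.write(new TypeSerializerOutputFormat<Tuple2<String, Integer>>(),
                 "hdfs:///tmp/exchange");

    // job 2: read it back with the matching type information
    TypeInformation<Tuple2<String, Integer>> type = new TupleTypeInfo<>(
        BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO);
    DataSet<Tuple2<String, Integer>> readBack =
        env.createInput(new TypeSerializerInputFormat<>(type), type);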

Re: Doubt about window and count trigger

2015-11-27 Thread Fabian Hueske
completion. > In that 1 min window, is my modified count trigger still valid? Say, if > in that one minute window, I have 100 events after 30 s, will it still fire > at the 30th second? > > Cheers, > Anwar. > > > > On Fri, Nov 27, 2015 at 3:31 PM, Fabian Hueske wrote: >

Re: POJO Dataset read and write

2015-11-27 Thread Fabian Hueske
normal or TypeSerializer is > supposed to perform better than this? > > > On Fri, Nov 27, 2015 at 3:39 PM, Fabian Hueske wrote: > >> If you are just looking for an exchange format between two Flink jobs, I >> would go for the TypeSerializerInput/OutputFormat. >>

Re: Insufficient number of network buffers running on cluster

2015-11-27 Thread Fabian Hueske
Hi Guido, please check this section of the configuration documentation: https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/config.html#configuring-the-network-buffers It should answer your questions. Please let us know, if not. Cheers, Fabian 2015-11-27 16:41 GMT+01:00 Guido :

Re: Continuing from the stackoverflow post

2015-11-30 Thread Fabian Hueske
Hi Nirmalya, thanks for the detailed description of your understanding of Flink's window semantics. Most of it is correct, but a few things need a bit of correction ;-) Please see my comments inline. 2015-11-28 4:36 GMT+01:00 Nirmalya Sengupta : > Hello Fabian, > > > A little long mail; please

Re: Continuing from the stackoverflow post

2015-11-30 Thread Fabian Hueske
Sorry, I have to correct myself. The windowing semantics are not easy ;-) 2015-11-30 15:34 GMT+01:00 Fabian Hueske : > Hi Nirmalya, > > thanks for the detailed description of your understanding of Flink's > window semantics. > Most of it is correct, but a few things need

Re: Iterative queries on Flink

2015-11-30 Thread Fabian Hueske
Hi Flavio, Flink does not support caching of data sets in memory yet. Best, Fabian 2015-11-30 16:45 GMT+01:00 Flavio Pompermaier : > Hi to all, > I was wondering if Flink could fit a use case where a user load a dataset > in memory and then he/she wants to explore it interactively. Let's say I

Re: key

2015-11-30 Thread Fabian Hueske
Hi Radu, with the corrected setters/getters, Flink accepts your data type as a POJO data type, which can automatically be used as a key (in contrast to the GenericType it was before). There is no need to implement the Key interface. Best, Fabian 2015-11-30 17:46 GMT+01:00 Radu Tudoran : > Thank you both for
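For illustration, a type along these lines qualifies as a POJO, so its fields can be referenced as keys by name (class and field names are examples):

    // public class, public no-arg constructor, fields accessible via getters/setters
    public class Event {
      private String id;
      private long count;

      public Event() {}

      public String getId() { return id; }
      public void setId(String id) { this.id = id; }
      public long getCount() { return count; }
      public void setCount(long count) { this.count = count; }
    }

    // key expression on the POJO field:
    events.groupBy("id").reduce(new ReduceFunction<Event>() {
      @Override
      public Event reduce(Event a, Event b) {
        a.setCount(a.getCount() + b.getCount());
        return a;
      }
    });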

Re: Interpretation of Trigger and Eviction on a window

2015-11-30 Thread Fabian Hueske
Yes, that is correct. The first element will be lost. In fact, you do neither need a trigger nor an evictor if you want to get the max element for each group of 5 elements. See my reply on your other mail. Cheers, Fabian 2015-11-30 18:47 GMT+01:00 Nirmalya Sengupta : > Hello Aljoscha , > > Many

Re: Continuing from the stackoverflow post

2015-11-30 Thread Fabian Hueske
Hi Nirmalya, please don't feel discouraged to write blog posts! There are many things you could write about Flink's support for windows, e.g., you could discuss use cases / applications that require advanced window semantics. Regarding my comment on how/when elements are removed from a window: Elements

Re: Material on Apache flink internals

2015-12-01 Thread Fabian Hueske
Hi Madhu, check out the following resources: - Apache Flink Blog: http://flink.apache.org/blog/index.html - Data Artisans Blog: http://data-artisans.com/blog/ - Flink Forward Conference website (Talk slides & recordings): http://flink-forward.org/?post_type=session - Flink Meetup talk recordings:

Re: Hello, the performance of apply function after join

2015-12-01 Thread Fabian Hueske
Hi Phil, an apply method after a join runs pipelined with the join, i.e., it starts processing when the first join result is emitted and finishes after it handled the last join result. Unless the logic in your apply function is terribly complex, this should be OK. If you do not specify an apply

Re: Running WebClient from Windows

2015-12-02 Thread Fabian Hueske
Hi Welly, at the moment we only provide native Windows .bat scripts for start-local and the CLI client. However, we check that the Unix scripts (including start-webclient.sh) work in a Windows Cygwin environment. I have to admit, I am not familiar with MinGW, so not sure what is happening there.

Re: Continuing from the stackoverflow post

2015-12-02 Thread Fabian Hueske
Hi Nirmalya, please find my answers inline. 2015-12-02 3:26 GMT+01:00 Nirmalya Sengupta : > Hello Fabian, > > Many thanks for your encouraging words about the blogs. I want to make a > sincere attempt. > > To summarise my understanding of the rule of removal of the elements from > the window

Re: NPE with Flink Streaming from Kafka

2015-12-02 Thread Fabian Hueske
Hi Mihail, not sure if I correctly got your requirements, but you can define windows on a keyed stream. This basically means that you partition the stream, for example by order-id, and compute windows over the keyed stream. This will create one (or more, depending on the window type) window for each

Re: Using Kryo for serializing lambda closures

2015-12-08 Thread Fabian Hueske
Hi Nick, thanks for pushing this and opening the JIRA issue. The issue came up a couple of times and is a known limitation (see FLINK-1256). So far the workaround of marking member variables as transient and initializing them in the open() method of a RichFunction has been good enough for all cases

Re: Taskmanager memory

2015-12-09 Thread Fabian Hueske
Hi Sebastian, There is no way to return memory from a Flink process except shutting the process down. I think YARN could help in your setup. In a YARN setup, you can flexibly start and stop Flink sessions with different configurations (memory, TMs, slots) or run a single job. When running a single

Re: Taskmanager memory

2015-12-09 Thread Fabian Hueske
memory lazily > - Set the memory to off-heap memory. That way the JVM heap is small. The > off-heap memory is deallocated when no longer used - this releases > memory much better than the JVM shrinking the heap. > > > > On Wed, Dec 9, 2015 at 10:06 AM, Fabian Hueske w

Re: Taskmanager memory

2015-12-09 Thread Fabian Hueske
memory consuming operators release the > memory. > > The Java process releases that memory then on the next GC, as far as I > know. > > On Wed, Dec 9, 2015 at 11:01 AM, Fabian Hueske wrote: > >> Streaming mode with on-heap memory won't help because the JVM allocates >

Re: Streaming time window

2015-12-10 Thread Fabian Hueske
Hi Martin, you can get the start and end time of a window from the TimeWindow object. The following Scala code snippet shows how to access the window end time (start time is equivalent):

    .timeWindow(Time.minutes(5))
    .trigger(new EarlyCountTrigger(earlyCountThreshold))
    .apply { (key: Int, win
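The Java counterpart would look roughly like this (element and key types are illustrative):

    keyedStream
        .timeWindow(Time.minutes(5))
        .apply(new WindowFunction<Event, String, Integer, TimeWindow>() {
          @Override
          public void apply(Integer key, TimeWindow window,
                            Iterable<Event> values, Collector<String> out) {
            // the TimeWindow argument exposes both boundaries
            out.collect("start=" + window.getStart() + ", end=" + window.getEnd());
          }
        });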

Re: Streaming time window

2015-12-10 Thread Fabian Hueske
MultiValuePoissonPreProcess()) > > How can I get access to the time window object in the fold function? > > > cheers Martin > > > On Thu, Dec 10, 2015 at 12:20 PM, Fabian Hueske wrote: > >> Hi Martin, >> >> you can get the start and end time of a windo

Re: Streaming time window

2015-12-10 Thread Fabian Hueske
processed (at least to my > understanding) this will lead to problems. Any idea how to get around this? > > cheers Martin > > > > On Thu, Dec 10, 2015 at 2:59 PM, Fabian Hueske wrote: > >> Sure. You don't need a trigger, but a WindowFunction instead of the >> F

Re: Apache Flink Web Dashboard - Completed Job history

2015-12-16 Thread Fabian Hueske
Hi Ovidiu, the job execution information is only held in memory and completely discarded when the JobManager is shut down. However, you can query all stats that are displayed by the dashboard via a REST API [1] while the JM is running and save them yourself. This way you can analyze the data also

Re: Size of a window without explicit trigger/evictor

2015-12-18 Thread Fabian Hueske
Hi Nirmalya, sorry for the delayed answer. First of all, Flink does not take care that your windows fit into memory. The default trigger depends on the way in which you define a window. Given a KeyedStream you can define a window in the following ways: KeyedStream s = ... s.timeWindow() // this w

Re: Serialisation problem

2015-12-21 Thread Fabian Hueske
Hi, In your program, you apply a distinct transformation on a data set that has a (nested) GValue[] type. Distinct requires that all fields are comparable with each other. Therefore all fields of the data set's type must be valid key types. However, Flink does not support object arrays as key types

Re: Explanation of the output of timeWindowAll(Time.milliseconds(3))

2015-12-28 Thread Fabian Hueske
Hi Nirmalya, event time events (such as an event time trigger to compute a window) are triggered when a watermark is received that is larger than the trigger's timestamp. By default, watermarks are emitted with a fixed time interval, i.e., every x milliseconds. When a new watermark is emitted, Flink

Re: Getting executionplan in the local mode inside IDE

2016-01-01 Thread Fabian Hueske
Hi, you can only get the execution plan for programs that have a data sink and haven't been executed. In your code print() defines the data sink; however, it also eagerly executes the program. After execution the program is "removed" from the execution environment. Therefore, Flink complains that no
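So a working sketch uses a non-eager sink (paths and data are placeholders):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<String> data = env.fromElements("a", "b", "c");

    data.writeAsText("/tmp/out");               // defines a sink without executing

    System.out.println(env.getExecutionPlan()); // JSON plan of the pending program
    env.execute("my job");                      // now actually run it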

Re: [ANNOUNCE] Introducing Apache Flink Taiwan User Group - Flink.tw

2016-01-02 Thread Fabian Hueske
Hi Gordon, this is great news! I am very happy to hear that there is a new local Flink community in Taiwan! Translating blog posts, slide sets, and other documentation is very valuable and makes the project known to a broader audience and accessible to more people. Please let us know, if you need

Re: RideCleansing example

2016-01-02 Thread Fabian Hueske
Hi Serkan, yes this is expected. The reference implementations use DataStream.print() to emit the resulting data stream to the standard output. If the program is executed in an IDE, the standard output goes to the IDE's console. If you start Flink in a local (or cluster) setup, the standard out

Re: SQL query support in Flink

2016-01-12 Thread Fabian Hueske
Hi Sourav, Flink does not support executing SQL queries on DataSets yet. We just started an effort to change that. See the discussion on the dev mailing list [1] and the corresponding design document [2]. For simple relational queries, you can use the Table API [3]. Best, Fabian [1] https://m

Re: What port should I choose to run the SocketTextStreamWordCount sample

2016-01-12 Thread Fabian Hueske
Hi, I answered your question on Stack Overflow. To summarize, the program reads data from a text socket. You need to open the socket before starting the program. You can open the socket on any (available) port. If you run nc -lk from a command line, you should use localhost and port .

Re: Accessing configuration in RichFunction

2016-01-13 Thread Fabian Hueske
Hi Christian, the open method is called by the Flink workers when the parallel tasks are initialized. The configuration parameter is the configuration object of the operator. You can set parameters in the operator config as follows: DataSet text = ... DataSet wc = text.flatMap(new Tokenizer()).ge
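Continuing that sketch with a made-up parameter key:

    Configuration config = new Configuration();
    config.setString("tokenizer.delimiter", " "); // illustrative parameter

    DataSet<String> tokens = text
        .flatMap(new RichFlatMapFunction<String, String>() {
          private String delimiter;

          @Override
          public void open(Configuration parameters) {
            // `parameters` is the operator config passed via withParameters()
            delimiter = parameters.getString("tokenizer.delimiter", ",");
          }

          @Override
          public void flatMap(String line, Collector<String> out) {
            for (String token : line.split(delimiter)) {
              out.collect(token);
            }
          }
        })
        .withParameters(config);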

Re: Flink Execution Plan

2016-01-14 Thread Fabian Hueske
@Christian: I don't think that is possible. There are quite a few things missing in the JSON including: - User function objects (Flink ships objects not class names) - Function configuration objects - Data types Best, Fabian 2016-01-14 16:02 GMT+01:00 lofifnc : > Hi Márton, > > Thanks for your

Re: Actual byte-streams in multiple-node pipelines

2016-01-21 Thread Fabian Hueske
Hi Tal, you said that most processing will be done in external processes. If these processes are stateful, this might be hard to integrate with Flink's fault-tolerance mechanism. In principle, Flink requires two things to achieve exactly-once processing: 1) A data source that can be replayed from

Re: Reading Binary Data (Matrix) with Flink

2016-01-25 Thread Fabian Hueske
Hi Saliya, yes that is possible, however the requirements for reading a binary file from local fs are basically the same as for reading it from HDFS. In order to be able to start reading different sections of a file in parallel, you need to know the different starting positions. This can be done b

Re: Reading Binary Data (Matrix) with Flink

2016-01-25 Thread Fabian Hueske
So, > the file is replicated across nodes and the reading (mapping) happens only > once. > > Thank you, > Saliya > > On Mon, Jan 25, 2016 at 4:38 AM, Fabian Hueske wrote: > >> Hi Saliya, >> >> yes that is possible, however the requirements for reading a bi

Re: Hello, a question about Dashboard in Flink

2016-01-25 Thread Fabian Hueske
You can start a job and then periodically request and store information about the running job and vertices by using the corresponding REST calls [1]. The data will be in JSON format. After the job finishes, you can stop requesting data. Next you parse the JSON, extract the information you need and g

Re: Task Manager metrics per job on Flink 0.9.1

2016-01-27 Thread Fabian Hueske
Hi, it is correct that the metrics are collected from the task managers. In Flink 0.9.1 the metrics are visualized as charts in the web dashboard. This visualization was removed when the dashboard was redesigned and updated for 0.10, but will hopefully be added again. For Flink 0.9.1, the metr

Re: Mixing Batch & Streaming

2016-01-28 Thread Fabian Hueske
Hi, this is currently not supported yet. However, this feature is on our roadmap and has been requested a few times. So I hope somebody will pick it up soon. If the static data set is small enough, you can read the full data set (e.g., as a file) in the open method of a FlatMapFunction, build a hash
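A sketch of that workaround (file path, format, and the joined output are illustrative; the file must be readable from every worker):

    stream.flatMap(new RichFlatMapFunction<String, String>() {
      private Map<String, String> lookup;

      @Override
      public void open(Configuration parameters) throws Exception {
        // build the hash table once when the operator starts
        lookup = new HashMap<>();
        try (BufferedReader reader =
                 new BufferedReader(new FileReader("/path/static-data.csv"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] kv = line.split(",");
            lookup.put(kv[0], kv[1]);
          }
        }
      }

      @Override
      public void flatMap(String key, Collector<String> out) {
        String match = lookup.get(key);
        if (match != null) {
          out.collect(key + "," + match); // emit the joined record
        }
      }
    });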

Re: Window stream using timestamp key for time

2016-01-28 Thread Fabian Hueske
Hi Emmanuel, the feature you are looking for is called event time processing in Flink. These blog posts should help you to become familiar with the concepts: 1) Event-Time concepts: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ 2) Windows in Flink: http://fl

Re: Hello, a question about Dashboard in Flink

2016-01-29 Thread Fabian Hueske
> I did not find the network usage metric in it. > > Best, > Phil > > On Mon, Jan 25, 2016 at 5:06 PM, Fabian Hueske wrote: > >> You can start a job and then periodically request and store information >> about the running job and vertices from using corresponding RE

Re: Flink stream data ordering/sequence

2016-01-29 Thread Fabian Hueske
Hi Sana, The feature you are looking for is called event time processing in Flink. These blog posts should help you to become familiar with the concepts: 1) Event-Time concepts: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ 2) Windows in Flink: http://flink.

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
Hi Flavio, using a default FileOutputFormat, Flink writes one output file for each data sink task, i.e., as many files as the defined parallelism. The size of these files depends on the total output size and the distribution. If you write to HDFS, a file consists of one or more HDFS blocks. Parquet

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
the InputSplits. Am I right? Or am I misunderstanding something? > > Best, > Flavio > > On Fri, Jan 29, 2016 at 11:14 AM, Fabian Hueske wrote: > >> Hi Flavio, >> >> using a default FileOutputFormat, Flink writes one output file for each >> data sink task, i.e.

Re: Writing Parquet files with Flink

2016-01-29 Thread Fabian Hueske
file), right? > > On Fri, Jan 29, 2016 at 11:56 AM, Fabian Hueske wrote: > >> The number of input splits does not depend on the number of files but on >> the number of HDFS blocks of all files. >> Reading a single file with 100 HDFS blocks and reading of 100 files wit

Re: Where does Flink store intermediate state and triggering/scheduling a delayed event?

2016-02-03 Thread Fabian Hueske
Hi, 1) At the moment, state is kept on the JVM heap in a regular HashMap. However, we added an interface for pluggable state backends. State backends store the operator state (Flink's built-in window operators are based on operator state as well). A pull request to add a RocksDB backend (going to

Re: Build Flink for a specific tag

2016-02-03 Thread Fabian Hueske
Hi Flavio, we use tags to identify releases. The "release-0.10.1" tag refers to the code that has been released as Flink 0.10.1. The "release-0.10" branch is used to develop 0.10 releases. Currently, it contains Flink 0.10.1 and additionally a few more bug fix commits. We will fork off this branch

Re: Left join with unbalanced dataset

2016-02-03 Thread Fabian Hueske
Hi Arnaud, in a previous mail you said: "Note that I did not rebuild & reinstall flink, I just used a 0.10-SNAPSHOT compiled jar submitted as a batch job using the "0.10.0" flink installation" This will not fix the Netty version error. You need to install a new Flink version or submit the Flink

Re: Left join with unbalanced dataset

2016-02-03 Thread Fabian Hueske
> However, I’ve just recompiled everything and ran with a real 0.10.1 > snapshot and everything worked at an astounding speed with a reasonable > memory amount. > > Thanks for the great work and the help, as always, > > Arnaud > > > > *From:* Fabian Hueske [mailto:fh

Re: Possibility to get the line numbers?

2016-02-03 Thread Fabian Hueske
Hi Anastasiia, this is difficult because the input is usually read in parallel, i.e., an input file is split into several blocks which are independently read and processed by different threads (possibly on different machines). So it is difficult to have a sequential row number. If all rows have th
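If a consecutive (but not file-order) numbering is good enough, Flink's DataSetUtils can assign one despite the parallel read, e.g.:

    // assigns consecutive long indices across all parallel partitions;
    // the order reflects the partitioning, not the original line order
    DataSet<String> lines = env.readTextFile("hdfs:///path/to/input");
    DataSet<Tuple2<Long, String>> numbered = DataSetUtils.zipWithIndex(lines);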

Re: Where does Flink store intermediate state and triggering/scheduling a delayed event?

2016-02-04 Thread Fabian Hueske
Soumya Simanta : > Fabian, > > Thank a lot for your response. Really appreciated. I've some additional > questions (please see inline) > > On Wed, Feb 3, 2016 at 2:42 PM, Fabian Hueske wrote: > >> Hi, >> >> 1) At the moment, state is kept on the JVM h

Re: Flink writeAsCsv

2016-02-04 Thread Fabian Hueske
You can get the end time of a window from the TimeWindow object which is passed to the AllWindowFunction. This is basically a window ID / index. I would go for a custom output sink which writes records to files based on their timestamp. IMO, this would be cleaner & easier than implementing the file

Re: Error while reading binary file

2016-02-08 Thread Fabian Hueske
Hi, please try to replace DataSet<ShortValue> ds = env.createInput(sif); by DataSet<ShortValue> ds = env.createInput(sif, ValueTypeInfo.SHORT_VALUE_TYPE_INFO); Best, Fabian 2016-02-08 19:33 GMT+01:00 Saliya Ekanayake : > Till, > > I am still having trouble getting this to work. Here's my code ( > https://github.com/es

Re: Error while reading binary file

2016-02-08 Thread Fabian Hueske
java:169) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) > at java.lang.Thread.run(Thread.java:745) > > On Mon, Feb 8, 2016 at 3:50 PM, Fabian Hueske wrote: > >> Hi, >> >> please try to replace >> DataSet ds = env.createInput(sif); >>

Re: Kafka partition alignment for event time

2016-02-09 Thread Fabian Hueske
Hi, where did you observe the duplicates, within Flink or in Kafka? Please be aware that the Flink Kafka Producer does not provide exactly-once consistency. This is not easily possible because Kafka does not support transactional writes yet. Flink's exactly-once guarantees are only valid within t

Re: Join two Datasets --> InvalidProgramException

2016-02-09 Thread Fabian Hueske
What is the type of sessionId? It must be a key type in order to be used as a key. If it is a generic class, it must implement Comparable to be used as a key. 2016-02-09 11:53 GMT+01:00 Dominique Rondé : > The fields in SourceA and SourceB are private but have public getters and > setters. The classe
