Hi,
how are you executing your code? From an IDE or on a running Flink instance?
If you execute it on a running Flink instance, you have to look into the
.out files of the task managers (located in ./log/).
Best, Fabian
2016-10-06 22:08 GMT+02:00 drystan mazur :
> Hello I am reading a csv file
Hi Greg,
print is only eagerly executed for DataSet programs.
In the DataStream API, print() just appends a print sink and execute() is
required to trigger an execution.
2016-10-06 22:40 GMT+02:00 Greg Hogan :
> The program executes when you call print (same for collect), which is why
> you are
Maybe this can be done by assigning the same window id to each of the N
local windows, and do a
.keyBy(windowId)
.countWindow(N)
This should create a new global window for each window id and collect all N
windows.
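A rough Scala sketch of that idea (the LocalResult type, its fields, and the
aggregation are assumptions for illustration, not taken from your program):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.util.Collector

case class LocalResult(windowId: Long, value: Long)

def mergeLocalWindows(local: DataStream[LocalResult], n: Long): DataStream[(Long, Long)] =
  local
    .keyBy(_.windowId)   // group the N local window results by window id
    .countWindow(n)      // fires once all N local results have arrived
    .apply { (windowId: Long, w: GlobalWindow, elems: Iterable[LocalResult],
              out: Collector[(Long, Long)]) =>
      out.collect((windowId, elems.map(_.value).sum))
    }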
Best, Fabian
2016-10-06 22:39 GMT+02:00 AJ Heller :
> The goal is:
> * to split
in
> the mapper that increments on every map call). It works, but by any chance
> is there a more succinct way to do it?
>
> On Thu, Oct 6, 2016 at 1:50 PM, Fabian Hueske wrote:
>
>> Maybe this can be done by assigning the same window id to each of the N
>> local windows, an
As the exception says, the class
org.apache.flink.api.scala.io.jdbc.JDBCInputFormat does not exist.
You have to do:
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
There is no Scala implementation of this class but you can also use Java
classes in Scala.
2016-10-07 21:38 GMT+02:00 Alber
> Your solution compiles without errors, but includedFields isn't working:
> [image: Inline image 1]
>
> The output is incorrect:
> [image: Inline image 2]
>
> The correct result must contain only the first column:
> (a,1)
> (aa,1)
>
> 2016-10-06 21:37 GMT+02:
","
> ,includedFields = Array(2))
> val counts4 = text3
> .map { (_, 1) }
> .groupBy(0)
> .sum(1)
> counts4.print()
>
> The result is:
> [image: Inline image 1]
>
> Can you see any bug in my code for reading only the first column?
>
>
> 2
Hi Rashmi,
as Marton said, you do not need to start a local Flink instance
(start-local.bat) if you want to run programs from your IDE.
Maybe running a local instance causes a conflict when starting an instance
from the IDE.
Developing and running Flink programs on Windows should work, both from the
I
Hi,
you can do it like this:
1) you have to split each label record of the main dataset into separate
records:
(0,List(a, b, c, d, e, f, g)) -> (0, a), (0, b), (0, c), ..., (0, g)
(1,List(b, c, f, a, g)) -> (1, b), (1, c), ..., (1, g)
2) join the word index dataset with the split main dataset:
Data
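A rough DataSet sketch of the two steps (the element types and values are
just for illustration):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val main = env.fromElements((0, List("a", "b", "c")), (1, List("b", "c")))
val wordIndex = env.fromElements(("a", 100), ("b", 101), ("c", 102))

// 1) split each label record into separate (id, label) records
val split = main.flatMap(rec => rec._2.map(label => (rec._1, label)))

// 2) join the word index with the split main dataset on the label / word
val joined = split
  .join(wordIndex).where(1).equalTo(0)
  .map(pair => (pair._1._1, pair._2._2))   // (id, word index)

joined.print()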
Hi Ken,
I think your solution should work.
You need to make sure though, that you properly manage the state of your
function, i.e., memorize all records which have been received but haven't
been emitted yet.
Otherwise records might get lost in case of a failure.
Alternatively, you can implement thi
rd = super.nextRecord(record);
> } catch (IOException e) {
> e.printStackTrace();
> }
> } while (returnRecord == null && !reachedEnd());
> return returnRecord;
> }
> }
>
> Thanks,
> Yassine
>
>
ead of (2016-08-31
>> 12:08:11.223, 000)
>>
>> DataSet> withReadCSV =
>> env.readCsvFile("C:\\Users\\yassine\\Desktop\\test.csv")
>> .ignoreFirstLine()
>> .fieldDelimiter(",")
>> .includeFields("101")
>>
Hi Shannon,
I tried to reproduce the problem in a unit test without success.
My test configures a HadoopOutputFormat object, serializes and deserializes
it, calls open(), and verifies that a configured String property is present
in the getRecordWriter() method.
Next I would try to reproduce the err
Hi Pedro,
support for window aggregations in SQL and Table API is currently work in
progress.
We have a pull request for the Table API and will add this feature for the
next release.
For SQL we depend on Apache Calcite to include the TUMBLE keyword in its
parser and optimizer.
At the moment the o
Hi Pedro,
the DataStream program would look like this:
val eventData: DataStream[?] = ???
val result = eventData
.filter("action = denied")
.keyBy("user", "ip")
.timeWindow(Time.hours(1))
.apply("window.end, user, ip, count(*)")
.filter("count > 5")
.map("windowEnd, user, ip")
Note, this
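For reference, a possible concrete Scala version of the sketch above (the
Event case class and its field types are assumptions, not your actual schema):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class Event(action: String, user: String, ip: String)

def deniedPerUserAndIp(events: DataStream[Event]): DataStream[(Long, String, String)] =
  events
    .filter(_.action == "denied")
    .keyBy(e => (e.user, e.ip))
    .timeWindow(Time.hours(1))
    // count per (user, ip) and window; emit only groups with more than 5 events
    .apply { (key: (String, String), window: TimeWindow, elems: Iterable[Event],
              out: Collector[(Long, String, String)]) =>
      if (elems.size > 5) out.collect((window.getEnd, key._1, key._2))
    }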
Hi Ken,
FYI: we just received a pull request for FLIP-12 [1].
Best, Fabian
[1] https://github.com/apache/flink/pull/2629
2016-10-11 9:35 GMT+02:00 Fabian Hueske :
> Hi Ken,
>
> I think your solution should work.
> You need to make sure though, that you properly manage the s
apply() accepts a WindowFunction which is essentially the same as a
GroupReduceFunction, i.e., you have an iterator over all events in the
window.
If you only want to count, you should have a look at incremental window
aggregation with a ReduceFunction or FoldFunction [1].
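A minimal sketch of the incremental variant (the key and window size are just
examples); only one running value is kept per key and window instead of all
events:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

def countPerKey(clicks: DataStream[(String, Long)]): DataStream[(String, Long)] =
  clicks
    .keyBy(0)
    .timeWindow(Time.minutes(1))
    .reduce((a, b) => (a._1, a._2 + b._2))  // incremental count per key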
Best, Fabian
[1]
https:
Hi everybody,
I would like to propose to deprecate the utility methods to read data with
Hadoop InputFormats from the (batch) ExecutionEnvironment.
The motivation for deprecating these methods is to reduce Flink's dependency
on Hadoop and instead have Hadoop as an optional dependency for users that
a
Hi Santlal,
I'm afraid I don't know what is going wrong here either.
Debugging and correctly configuring the Taps was one of the major obstacles
when implementing the connector.
Best, Fabian
2016-10-14 14:40 GMT+02:00 Aljoscha Krettek :
> +Fabian directly looping in Fabian since he worked on th
tInputFormat and CqlBulkOutputFormat in Flink (although we won't
> be using CqlBulkOutputFormat any longer because it doesn't seem to be
> reliable).
>
> -Shannon
>
> From: Fabian Hueske
> Date: Friday, October 14, 2016 at 4:29 AM
> To: , "d...@flink.apache
Hi Yassine,
the difference is the following:
1) The BoundedOutOfOrdernessTimestampExtractor is a built-in timestamp
extractor and watermark assigner.
A timestamp extractor tells Flink when an event happened, i.e., it extracts
a timestamp from the event. A watermark assigner tells Flink what the
c
evaluated and late events are processed alone, i.e.,
in my example <12:09, G> would be processed without [A, B, C, D].
When the allowed lateness is passed, all window state is purged regardless
of the trigger.
Best, Fabian
2016-10-17 16:24 GMT+02:00 Fabian Hueske :
> Hi Yassine,
>
>
Hi Pedro,
The sql() method calls the Calcite parser in line 129.
Best, Fabian
2016-10-17 16:43 GMT+02:00 PedroMrChaves :
> Hello,
>
> I am pretty new to Apache Flink.
>
> I am trying to figure out how Flink parses an Apache Calcite sql query
> to its own Streaming API in order to maybe ext
The translation is done in multiple stages.
1. Parsing (syntax check)
2. Validation (semantic check)
3. Query optimization (rule and cost based)
4. Generation of physical plan, incl. code generation (DataStream program)
The final translation happens in the DataStream nodes, e.g., DataStreamCalc
[
The error message suggests that Flink tries to resolve "D:" as a file
system scheme such as "file:" or "hdfs:".
Can you try to specify your path as "file:/D:/dir/myfile.csv"?
Best, Fabian
2016-10-20 14:41 GMT+02:00 Radu Tudoran :
> Hi,
>
>
>
> I know that Flink in general supports files als
Hi Robert,
it is certainly possible to feed the same DataStream into two (or more)
operators.
Both operators should then process the complete input stream.
What you describe is an unintended behavior.
Can you explain how you figure out that both window operators only receive
half of the events?
> EventTimeSessionWindow, or perhaps create a custom EventTrigger, to force a
> session to close after either X seconds of inactivity or Y seconds of
> duration (or perhaps after Z events).
>
>
>
>
>
>
>
> *From: *Fabian Hueske
> *Reply-To: *"user@f
d get rid of my heap issues.
>
>
>
> Thanks!
>
>
>
> *From: *Fabian Hueske
> *Reply-To: *"user@flink.apache.org"
> *Date: *Monday, October 24, 2016 at 2:27 PM
>
> *To: *"user@flink.apache.org"
> *Subject: *Re: multiple processing of st
The window is evaluated when a watermark arrives that is behind the
window's end time.
For instance, given the window definition in your example, there are windows that end at
1:00:00, 1:00:30, 1:01:00, 1:01:30, ... (every 30 seconds).
given the windows above, the window from 00:59:00 to 1:00:00 will be
evalua
hat is inserted into the window
> and when a previously registered timer times out"*
> Thanks !!
>
> 2016-10-24 20:45 GMT+02:00 Fabian Hueske :
>
>> The window is evaluated when a watermark arrives that is behind the
>> window's end time.
>>
>> For inst
Hi Yassine,
I thought I had fixed that bug a few weeks ago, but apparently the fix
did not catch all cases.
Can you please reopen FLINK-2662 and post the program to reproduce the bug
there?
Thanks,
Fabian
[1] https://issues.apache.org/jira/browse/FLINK-2662
2016-10-25 12:33 GMT+02:00 Yassine
Hi Paul,
Flink pushes the results of operators (including GroupReduce) to the next
operator or sink as soon as they are computed. So what you are asking for
is actually happening.
However, before the GroupReduceFunction can be applied, the whole data is
sorted in order to group the data. This step
Hi Radu,
I might not have complete understood your problem, but if you do
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
val ds = env.fromElements( (1, 1L, new Time(1,2,3)) )
val t = ds.toTable(tEnv, 'a, 'b, 'c)
val results = t
Hi,
a NoSuchMethod indicates that you are using incompatible versions.
You should check that the versions of your job dependencies and the version
of the cluster you want to run the job on are the same.
Best, Fabian
2016-10-27 7:13 GMT+02:00 NagaSaiPradeep :
> Hi,
> I am working on connecting Flink
Wouldn't that be orthogonal to adding it to the TypeInfoParser?
2016-10-27 15:22 GMT+02:00 Greg Hogan :
> Fabian,
>
> Should we instead add this as a registered TypeInfoFactory?
>
> Greg
>
> On Thu, Oct 27, 2016 at 3:55 AM, Fabian Hueske wrote:
>
>> Yes, I thi
Hi Luis,
these blogposts should help you with the periodic partial result trigger
[1] [2].
Regarding the second question:
Time windows are by default aligned to the epoch (1970-01-01 00:00:00).
So a 24 hour window will always start at 00:00.
Best, Fabian
[1] http://flink.apache.org/news/2015/12/04/Introduc
Hi Niklas,
I don't know exactly what is going wrong there, but I have a few pointers
for you:
1) in cluster setups, Flink redirects println() to ./log/*.out files, i.e.,
you have to search for the task manager that ran the DirReader and check
its ./log/*.out file
2) you are using Java's File class
Hi,
a MapFunction should be the way to go for this use case.
What exactly is not working? Do you get an exception? Is the map method not
called?
Best, Fabian
2016-11-03 0:00 GMT+01:00 Sandeep Vakacharla :
> Hi there,
>
>
>
> I have the following use case-
>
>
>
> I have data coming from Kafka w
Hi Dominik,
the discussion about the 1.2 release was started on the dev mailing list
[1] about 2 weeks ago.
So far the proposed timeline is to have a release in mid-December.
Best, Fabian
[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Schedule-and-Scope-for-Flink-1-2-tp
Yes, that is true for Flink 1.1.x.
The upcoming Flink 1.2.0 release will be able to restore jobs from
savepoints with different parallelism.
2016-11-04 11:24 GMT+01:00 Renjie Liu :
> Hi, all:
> It seems that flink's checkpoint mechanism saves state per partition.
> However, if I want to change c
will not lose any state.
Best, Fabian
2016-11-04 12:26 GMT+01:00 Renjie Liu :
> Hi, Fabian:
> Will the checkpointed state be restored between streaming jobs?
>
> On Fri, Nov 4, 2016 at 6:47 PM Fabian Hueske wrote:
>
>> Yes, that is true for Flink 1.1.x.
>>
>> The
First of all, the document only proposes semantics for Flink's support of
relational queries on streams.
It does not describe the implementation and in fact most of it is not
implemented.
How the queries will be executed would depend on the definition of the
table, i.e., whether the tables are der
> I'm try to find the video of: http://flink-forward.org/kb_se
> ssions/scaling-stream-processing-with-apache-flink-to-very-large-state/
>
> 2016-11-07 22:02 GMT+01:00 Fabian Hueske :
>
>> First of all, the document only proposes semantics for Flink's support of
Hi,
I encountered this issue before as well.
Which Maven version are you using?
Maven 3.3.x does not properly shade dependencies.
You have to use Maven 3.0.3 (see [1]).
Best, Fabian
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.1/setup/building.html
2016-11-08 11:05 GMT+01:00 T
/16, 7:18 AM, "Steffen Hausmann"
> wrote:
>
> Hi Fabian,
>
> I can confirm that the behaviour is reproducible with both, Maven
> 3.3.9 and Maven 3.0.5.
>
> Cheers,
> Steffen
>
> On 8 November 2016 at 11:11:19 CET, Fabian Hue
Hi Stephan,
I just wrote an answer to your SO question.
Best, Fabian
2016-11-10 11:01 GMT+01:00 Stephan Epping :
> Hello,
>
> I found this question in the Nabble archive (http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/Maintaining-
> watermarks-per-key-instead-of-per-oper
Hi Mich,
at the moment there is not much support for handling such data-driven exceptions
(badly formatted data, late data, ...).
However, there is a proposal to improve this: FLIP-13 [1]. So it is work in
progress.
It would be very helpful if you could check if the proposal would address
your use case
tioned design?
>
> kind regards,
> Stephan
>
>
> On 11 Nov 2016, at 00:39, Fabian Hueske wrote:
>
> Hi Stephan,
>
> I just wrote an answer to your SO question.
>
> Best, Fabian
>
> 2016-11-10 11:01 GMT+01:00 Stephan Epping :
>
>> Hello,
>
Hi,
there are no built-in methods to read JSON.
How this can be done depends on the formatting of the JSON.
If you have a collection of JSON objects which are separated by newline
(and there are no other newlines) you can read the file with
env.readTextFile().
This will give you one String per lin
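A rough sketch of that approach, assuming one JSON object per line and a
Jackson dependency on the classpath (the field name "id" is just an example):

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val ids = env
  .readTextFile("/path/to/data.json")  // one String per line
  .map { line =>
    // for production, create the mapper once in a RichMapFunction's open()
    val mapper = new ObjectMapper()
    mapper.readTree(line).get("id").asText()
  }
ids.print()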
Hi,
it is not possible to mix the DataSet and DataStream APIs at the moment.
If the DataSet is constant and not too big (which I assume, since otherwise
crossing would be extremely expensive), you can load the data into a
stateful MapFunction.
For that you can implement a RichFlatMapFunction and
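A rough sketch of such a function, assuming the constant data fits into memory
and can be read from a file that is readable from every worker (the path, the
CSV layout, and the field types are assumptions):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

class EnrichWithStaticData
    extends RichFlatMapFunction[(String, Long), (String, Long, String)] {

  private var lookup: Map[String, String] = _

  override def open(parameters: Configuration): Unit = {
    // load the constant data once per parallel task instance
    lookup = scala.io.Source.fromFile("/path/to/lookup.csv").getLines()
      .map { line => val Array(k, v) = line.split(","); k -> v }
      .toMap
  }

  override def flatMap(in: (String, Long),
                       out: Collector[(String, Long, String)]): Unit =
    lookup.get(in._1).foreach(v => out.collect((in._1, in._2, v)))
}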
Hi,
that does not sound like a time window problem because there is no
time-related condition to split the windows.
I think you can implement that with a GlobalWindow and a custom trigger.
The documentation about global windows, triggers, and evictors [1] and this
blogpost [2] might be helpful
O
A common choice is Apache Avro.
You can define a schema for your POJOs and generate serializers and
deserializers.
2016-11-18 5:11 GMT+01:00 Matt :
> Just to be clear, what I'm looking for is a way to serialize a POJO class
> for Kafka but also for Flink, I'm not sure the interface of both fra
Hi Gabor,
I don't think there is a way to tune the memory settings for specific
operators.
For that you would need to change the memory allocation in the optimizers,
which is possible but not a lightweight change either.
If you want to get something working, you could add a method to the API to
m
Hi Vinay,
not sure why it's not working, but maybe Timo (in CC) can help.
Best, Fabian
2016-11-18 17:41 GMT+01:00 Vinay Patil :
> Hi,
>
> According to JavaDoc if I use the below method
> *env.getConfig().setCodeAnalysisMode(CodeAnalysisMode.HINT);*
>
> ,it will print the program improvements to
Hi,
the result of a window operation on a KeyedStream is a regular DataStream.
So, you would need to call keyBy() on the result again if you'd like to
have a KeyedStream.
You can also key a stream by two or more attributes:
DataStream> windowedStream =
jsonToTuple.keyBy(0,1,3) /
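A minimal Scala sketch of both points (the tuple type, field positions, and
aggregation are just examples):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

def windowAndReKey(jsonToTuple: DataStream[(String, String, Long, Int)]) = {
  val windowed = jsonToTuple
    .keyBy(0, 1, 3)                                    // key by three attributes
    .timeWindow(Time.minutes(5))
    .reduce((a, b) => (a._1, a._2, a._3 + b._3, a._4)) // example aggregation
  // the window result is a regular DataStream again, so key it once more
  // if a KeyedStream is needed downstream
  windowed.keyBy(0, 1, 3)
}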
This looks rather like a version conflict. If Curator wasn't on the
classpath it should be a ClassNotFoundException.
Can you check if any of your jobs dependencies depends on a different
Curator version?
Best, Fabian
2016-11-22 12:06 GMT+01:00 Maximilian Michels :
> As far as I know we're shadi
Hi Lydia,
that is certainly possible, however you need to adapt the algorithm a bit.
The straight-forward approach would be to replicate the input data and
assign IDs for each k-means run.
If you have a data point (1, 2, 3) you could replicate it to three data
points (10, 1, 2, 3), (15, 1, 2, 3),
Hi Anastasios,
that's certainly possible. The most straight-forward approach would be a
synchronous call to the database.
Because only one request is active at the same time, you do not need a
thread pool.
You can establish the connection in the open() method of a RichMapFunction.
The problem with
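A rough sketch of the synchronous variant (the JDBC URL, query, and types are
hypothetical):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

class DbLookupMap extends RichMapFunction[String, (String, String)] {
  private var conn: java.sql.Connection = _
  private var stmt: java.sql.PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    // one connection per parallel task instance
    conn = java.sql.DriverManager.getConnection("jdbc:postgresql://host/db")
    stmt = conn.prepareStatement("SELECT v FROM lookup WHERE k = ?")
  }

  override def map(key: String): (String, String) = {
    // blocking call; only one request is in flight at a time
    stmt.setString(1, key)
    val rs = stmt.executeQuery()
    val value = if (rs.next()) rs.getString(1) else null
    rs.close()
    (key, value)
  }

  override def close(): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}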
Hi Pedro,
if I read you code correctly, you are not assigning timestamps and
watermarks to the rules stream.
Flink automatically derives watermarks from all streams involved.
If you do not assign a watermark, the default watermark is
Long.MIN_VALUE, which is exactly the value you are observing.
will the concept of generating timestamps/watermarks be applicable
> in this scenario ?
>
>
> On Fri, Nov 18, 2016 at 9:50 AM, Fabian Hueske wrote:
>
>> Hi,
>>
>> that does not sound like a time window problem because there is not
>> time-related condition
Hi Flavio,
I think the easiest solution is to read the CSV file with the
CsvInputFormat and use a subsequent MapPartition to batch 1000 rows
together.
In each partition, you might end up with an incomplete batch.
However, I don't see yet how you can feed these batches into the
JdbcInputFormat whic
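For the batching part, a minimal sketch (the path, field types, and batch size
are just examples; the last batch of each partition may be smaller):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val rows = env.readCsvFile[(String, String, Int)]("/path/to/data.csv")

// batch the rows of each partition into chunks of (at most) 1000 elements
val batches: DataSet[List[(String, String, Int)]] =
  rows.mapPartition { it => it.grouped(1000).map(_.toList) }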
pseudocode) of a mapPartition that groups together elements into chunks of
> size n?
>
> Best,
> Flavio
>
> On Mon, Nov 28, 2016 at 8:24 PM, Fabian Hueske wrote:
>
>> Hi Flavio,
>>
>> I think the easiest solution is to read the CSV file with the
>> CsvIn
Hi Diego,
If you want the data of all streams to be written to the same files, you
can also union the streams before sending them to the sink.
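A minimal sketch (the streams and the output path are just examples):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val s1 = env.fromElements("a", "b")
val s2 = env.fromElements("c", "d")
val s3 = env.fromElements("e")

// union the three streams and attach a single sink,
// so all records are written to the same files
s1.union(s2, s3).writeAsText("/path/to/output")
env.execute("union to one sink")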
Best, Fabian
2016-11-29 15:50 GMT+01:00 Kostas Kloudas :
> Hi Diego,
>
> You cannot prefix each stream with a different
> string so that the paths do no
Hi Miguel,
the exception does indeed indicate that the process ran out of available
disk space.
The quoted paragraph of the blog post describes the situation when you
receive the IOE.
By default, the system's default temp dir is used. I don't know which folder
that would be in a Docker setup.
You ca
Hi Konstantin,
Regarding 2): I've opened FLINK-5227 to update the documentation [1].
Regarding the Row type: The Row type was introduced for flink-table and was
later used by other modules. There is FLINK-5186 to move Row and all the
related TypeInfo (+serializer and comparator) to flink-core [2]
Hi Manu,
As far as I know, there are not plans to change the stand-alone deployment.
FLIP-6 is focusing on deployments via resource providers (YARN, Mesos,
etc.) which allow starting Flink processes per job.
Till (in CC) is more familiar with the FLIP-6 effort and might be able to
add more detail
Hi Miguel,
have you found a solution to your problem?
I'm not a docker expert but this forum thread looks like could be related
to your problem [1].
Best,
Fabian
[1] https://forums.docker.com/t/no-space-left-on-device-error/10894
2016-12-02 17:43 GMT+01:00 Miguel Coimbra :
> Hello Fabian,
>
>
Hi Max,
Tuples in Flink are of fixed length. You can define your own data types and
serializers, but this is not the easiest solution.
I would go for Array types, especially if your data can be primitive types
(long).
The serializer for primitive arrays should be almost as efficient as the
Tuple
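A minimal sketch with a primitive array type (the values are just examples):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
// variable-length records as primitive long arrays instead of fixed-length tuples
val ds: DataSet[Array[Long]] = env.fromElements(Array(1L, 2L), Array(3L, 4L, 5L))
ds.map(a => a.sum).print()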
ue(), new ArrayList());
> }
> commSizes.get(v.getValue()).add(v.getId());
> }
>
>
> System.out.println("#communities:\t" +
> commSizes.keySet().size() + "\n|result|:\t" + result.count() +
> "\n|col
Hi Arnaud,
Flink does not cache data at the moment.
What happens is that for every day, the complete program is executed, i.e.,
also the program that computes wholeSet.
Each execution should be independent from each other and all temporary data
be cleaned up.
Since Flink executes programs in a pip
Hi,
the heap mem usage should be available via Flink's metrics system.
Not sure if that also captures spilled data. Chesnay (in CC) should know
that.
If the spilled data is not available as a metric, you can try to write a
small script that monitors the directories to which Flink spills (Config
p
of encoded class
> org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage
> was 79885809 bytes.*
>
> Do you have any idea what this might be?
>
> Kind regards,
>
> Miguel E. Coimbra
> Email: miguel.e.coim...@gmail.com
> Skype: miguel.e.coimbra
>
>
Hi Gennady,
this bloom filter is actually not distributed and only used internally as
an optimization to reduce the amount of data spilled by a hash join.
So, it is not meant to be user facing and not integrated in any API.
You could of course use the code, but there might be better implementation
The system metrics [1] are only available on a system level, i.e. not for
an individual job.
The reason is that multiple jobs might run concurrently on the same task
manager JVM process. So it would not be possible to separate their heap
usage.
The same would be true for the approach that monitors t
Hi Matt,
the combination of a tumbling time window and a count window is one way to
define a sliding window.
In your example, a 30 secs tumbling window and a (3,1) count window
result in a sliding time window of 90 secs width and 30 secs slide.
You could define a time sliding window of 90 secs
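For example (the key and the aggregated field are just examples):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

def slidingSums(events: DataStream[(String, Long)]): DataStream[(String, Long)] =
  events
    .keyBy(0)
    .timeWindow(Time.seconds(90), Time.seconds(30)) // 90s windows, sliding by 30s
    .sum(1)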
Hi Yury,
Flink's operators start processing as soon as they receive data. If an
operator produces more data than its successor task can process, the data
is buffered in Flink's network stack, i.e., its network buffers.
The backpressure mechanism kicks in when all network buffers are in use and
no
Your functions do not need to implement RichFunction (although each
function can be a RichFunction and it should not be a problem to adapt the
job).
The system metrics are automatically collected. Metrics are exposed via a
Reporter [1].
So you do not need to take care of the collection but rather
Hi Yury,
your solution should exactly solve your problem.
An operator sends all outgoing records to all connected successor operators.
There should not be any non-deterministic behavior or splitting of records.
Can you share some example code that produces the non-deterministic
behavior?
Best, F
Hi Gwenhael,
The _SUCCESS files were originally generated by Hadoop for successful jobs.
AFAIK, Spark leverages Hadoop's Input and OutputFormats and seems to have
followed this approach as well to be compatible.
You could use Flink's HadoopOutputFormat which is a wrapper for Hadoop
OutputFormats
Hi Mäki,
some additions to Greg's answer: The flatMapWithState shortcut of the Scala
API uses Flink's key-value state while your TestCounters class uses the
Checkpointed interface.
As Greg said, the checkpointed interface operates on an operator level not
per key. The key-value state automatically
nhael Pasquiers <
gwenhael.pasqui...@ericsson.com>:
> Thanks, it is working properly now.
>
> NB: Had to delete the folder by code because Hadoop's OutputFormats will
> only overwrite file by file, not the whole folder.
>
>
>
> *From:* Fabian Hueske [mailto:fhue...@gmai
;Records In" values in successor operators
> were not exactly equal which made me think they receive different portions
> of the stream. I believe the inaccuracy is somewhat intrinsic to live
> stream sampling, so that's fine.
>
>
> 2016-12-20 14:35 GMT+03:00 Yury Ruchin
Hi Sandeep,
I'm sorry but I think I do not understand your question.
What do you mean by static or dynamic look ups? Do you want to access an
external data store and cache data?
Can you give a bit more detail about your use?
Best, Fabian
2016-12-20 23:07 GMT+01:00 Meghashyam Sandeep V :
> Hi t
Hi Abiy,
to which type of filesystem are you persisting your checkpoints?
We have seen problems with S3 and its consistency model. These issues have
been addressed in newer versions of Flink.
Not sure if the fix went into 1.1.3 already, but release 1.1.4 is currently
being voted on and has tons of other
:
> Hi,
>
> Any updates on this thread ?
>
> Regards,
> Vinay Patil
>
> On Fri, Nov 18, 2016 at 10:25 PM, Fabian Hueske-2 [via Apache Flink User
> Mailing List archive.] <[hidden email]> wrote:
Copying my reply from the other thread with the same issue to have the
discussion in one place.
--
Hi Abiy,
to which type of filesystem are you persisting your checkpoints?
We have seen problems with S3 and its consistency model. These issues have
been addressed in newer versions of Flink.
This issue was posted twice to the ML.
The discussion should be continued on the other thread with the
subject "Stateful
Stream Processing with RocksDB causing Job failure"
2016-12-21 9:44 GMT+01:00 Fabian Hueske :
> Hi Abiy,
>
> to which type of filesystem are you persisti
; Thanks,
> Sandeep
>
> On Dec 21, 2016 12:35 AM, "Fabian Hueske" wrote:
>
>> Hi Sandeep,
>>
>> I'm sorry but I think I do not understand your question.
>> What do you mean by static or dynamic look ups? Do you want to access an
>> exte
applying operators on stream from Kafka.
>
> Thanks,
> Sandeep
>
> On Wed, Dec 21, 2016 at 6:17 AM, Fabian Hueske wrote:
>
>> OK, I see. Yes, you can do that with Flink. It's actually a very common
>> use case.
>>
>> You can store the names in operator st
I have a flink streaming process with Kafka
>>> as a source. I only have ids coming from kafka messages. My look ups
>>> () which is a static map come from a different source. I would
>>> like to use those lookups while applying operators on stream from Kafka.
>>>
>
Hi Markus,
thanks for reporting this issue. This bug was introduced when the opt.xml
file was added to the repository a few days ago.
There are two open JIRAs, FLINK-5392 and FLINK-5396, each one with a pull
request to fix the problem.
Best, Fabian
2016-12-26 2:16 GMT+01:00 M. Dale :
> I cloned
Hi Kanagaraj,
I would assume that the issue is caused by this configuration parameter:
taskmanager.memory.segment-size: 131072
I think the maximum possible value given Netty's "writeBufferHighWaterMark"
parameter is 65536.
There might be a way to tune Netty's parameters but I don't know how to d
Hi,
no, broadcast sets are not available in the DataStream API.
There might be other ways to achieve similar functionality, but the optimal
solution depends on the use case.
If you give a few details about what you would like to do, we might be able
to suggest alternatives.
Best, Fabian
2016-12-
Hi Robert,
this is indeed a bit tricky to do. The problem is mostly with the
generation of the input splits, setup of Flink, and the scheduling of tasks.
1) you have to ensure that on each worker at least one DataSource task is
scheduled. The easiest way to do this is to have a bare metal setup (
Twitter]
> <http://twitter.com/verizon> [image: LinkedIn]
> <http://www.linkedin.com/company/verizon>
>
>
>
> *From:* Fabian Hueske [mailto:fhue...@gmail.com]
> *Sent:* Tuesday, December 27, 2016 10:01 AM
> *To:* user@flink.apache.org
> *Cc:* Ufuk Celebi
> *Sub
Hi,
Flink's batch join API only features a binary join.
However, if the join functions have semantic annotations [1], multiple
binary joins on the same attributes can be executed in a pipelined fashion
without additional shuffles or sorts.
Best, Fabian
[1]
https://ci.apache.org/projects/flink/fl
Hi Lawrence,
comparisons of binary data are mainly used by the DataSet API when sorting
large data sets or building and probing hash tables.
The DataStream API mainly benefits from Flink's custom and efficient
serialization when sending data over the wire or taking checkpoints.
There are also plan
Hi Henri,
can you express the logic of your FoldFunction (or WindowFunction) as a
combination of ReduceFunction and WindowFunction [1]?
ReduceFunction should be supported by a merging WindowAssigner and has the
same resource consumption as a FoldFunction, i.e., a single record per
window.
Best, F
Hi Konstantin,
the DataSet API tries to execute all operators as soon as possible.
I assume that in your case, Flink does not do this because it tries to
avoid a deadlock.
A dataflow which replicates data from the same source and joins it again
might get deadlocked because all pipelines need to m