+1 for dropping Hadoop 2.2.0
Regards,
Chiwan Park
> On Sep 4, 2015, at 5:58 AM, Ufuk Celebi wrote:
>
> +1 to what Robert said.
>
> On Thursday, September 3, 2015, Robert Metzger wrote:
> I think most cloud providers moved beyond Hadoop 2.2.0.
> Google's Click-To-Deploy is on 2.4.1
> AWS EMR is on 2.6.0
Hi Stephan,
That's good information to know. We will hit that throughput easily. Our
computation graph has a lot of chaining like this right now.
I think it's safe to minimize the chaining right now.
Thanks a lot for this, Stephan.
Cheers
On Thu, Sep 3, 2015 at 7:20 PM, Stephan Ewen wrote:
> In a set of benchmarks a while back, we found that the chaining mechanism
CoGroup is more generic than Join. You can perform a Join with CoGroup but
not do a CoGroup with a Join.
However, Join can be executed more efficiently than CoGroup.
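For illustration, here is a minimal sketch of both operators on the DataSet API (the
input data, types, and key positions are made up, not from the original question):

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class JoinVsCoGroup {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> users = env.fromElements(
                new Tuple2<>(1, "alice"), new Tuple2<>(2, "bob"));
        DataSet<Tuple2<Integer, String>> orders = env.fromElements(
                new Tuple2<>(1, "book"), new Tuple2<>(1, "pen"), new Tuple2<>(3, "cup"));

        // Join: one output record per matching pair; keys without a match are dropped.
        users.join(orders).where(0).equalTo(0).print();

        // CoGroup: called once per key with all elements of both sides,
        // including keys that appear on only one side.
        users.coGroup(orders).where(0).equalTo(0)
             .with(new CoGroupFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, String>() {
                 @Override
                 public void coGroup(Iterable<Tuple2<Integer, String>> us,
                                     Iterable<Tuple2<Integer, String>> os,
                                     Collector<String> out) {
                     int orderCount = 0;
                     for (Tuple2<Integer, String> ignored : os) {
                         orderCount++;
                     }
                     for (Tuple2<Integer, String> u : us) {
                         out.collect(u.f1 + " has " + orderCount + " orders");
                     }
                 }
             })
             .print();
    }
}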
2015-09-03 22:28 GMT+02:00 hagersaleh :
> what is the difference between join and coGroup in Flink?
+1 to what Robert said.
On Thursday, September 3, 2015, Robert Metzger wrote:
> I think most cloud providers moved beyond Hadoop 2.2.0.
> Google's Click-To-Deploy is on 2.4.1
> AWS EMR is on 2.6.0
>
> The situation for the distributions seems to be the following:
> MapR 4 uses Hadoop 2.4.0 (current is MapR 5)
Have a look at the class IOManager and IOManagerAsync, it is a good example
of how we use these hooks for cleanup.
The constructor usually installs them, and the shutdown logic removes them.
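A rough sketch of that pattern in plain Java (this is not the actual IOManagerAsync code;
the class and names are made up):

import java.io.File;

public class TempFileComponent {

    private final File tempDir;
    private final Thread shutdownHook;

    public TempFileComponent(File tempDir) {
        this.tempDir = tempDir;
        // Install the hook in the constructor, so temp files are removed even
        // if the JVM is stopped without a regular shutdown.
        this.shutdownHook = new Thread(this::deleteTempFiles);
        Runtime.getRuntime().addShutdownHook(shutdownHook);
    }

    public void shutdown() {
        deleteTempFiles();
        try {
            // Remove the hook again during a regular shutdown.
            Runtime.getRuntime().removeShutdownHook(shutdownHook);
        } catch (IllegalStateException e) {
            // JVM shutdown is already in progress; the hook will run anyway.
        }
    }

    private void deleteTempFiles() {
        File[] files = tempDir.listFiles();
        if (files != null) {
            for (File f : files) {
                f.delete();
            }
        }
    }
}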
On Thu, Sep 3, 2015 at 9:19 PM, Stephan Ewen wrote:
> Stopping the JVM process cleans up all resources, except temp files.
Stopping the JVM process cleans up all resources, except temp files.
Everything that creates temp files uses a shutdown hook to remove these:
IOManager, BlobManager, LibraryCache, ...
On Wed, Sep 2, 2015 at 7:40 PM, Sachin Goel
wrote:
> I'm not sure what you mean by "Crucial cleanup is in shutdo
Hi!
Yes, you can run Flink completely without HDFS.
Also the checkpointing can put state into any file system, like S3, or a
Unix file system (like a NAS or Amazon EBS), or even Tachyon.
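For example, a minimal sketch (the filesystem state backend API shown here is from newer
Flink releases, and the S3 path is made up):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NoHdfsCheckpointing {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds.
        env.enableCheckpointing(10000);

        // Any supported file system URI works here: s3://, file:// (NAS/EBS), etc.
        env.setStateBackend(new FsStateBackend("s3://my-bucket/flink-checkpoints"));

        env.fromElements(1, 2, 3).print();
        env.execute("checkpointing without HDFS");
    }
}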
Greetings,
Stephan
On Thu, Sep 3, 2015 at 8:57 PM, Jerry Peng
wrote:
> Hello,
>
> Does Flink require HDFS to run?
Hi Jerry,
Yes, you can.
Best,
Marton
On Thu, Sep 3, 2015 at 8:57 PM, Jerry Peng
wrote:
> Hello,
>
> Does Flink require HDFS to run? I know you can use HDFS to checkpoint and
> process files in a distributed fashion. So can Flink run standalone
> without HDFS?
>
Hello,
Does Flink require HDFS to run? I know you can use HDFS to checkpoint and
process files in a distributed fashion. So can Flink run standalone
without HDFS?
Hi Chiwan Park
I don't understand this solution. Please explain more.
Well hidden.
I have now added a link to the menu of http://data-artisans.com/. This material
is provided for free by data Artisans, but it is not part of the official
Apache Flink project.
On Thu, Sep 3, 2015 at 2:20 PM, Stefan Winterstein <
stefan.winterst...@dfki.de> wrote:
>
> > Answering to myself, I have found some nice training material at
I think most cloud providers moved beyond Hadoop 2.2.0.
Google's Click-To-Deploy is on 2.4.1
AWS EMR is on 2.6.0
The situation for the distributions seems to be the following:
MapR 4 uses Hadoop 2.4.0 (current is MapR 5)
CDH 5.0 uses 2.3.0 (the current CDH release is 5.4)
HDP 2.0 (October 2013)
Chiwan has a good point. Once the data that needs to be available to all
machines is too large for one machine, there is no good solution any more.
The best approach is an external store to which all nodes have access. It
is not going to be terribly fast, though.
If you are in the situation that y
The purpose of rebalance() should be to rebalance the partitions of a data
stream as evenly as possible, right?
If all senders start sending data to the same receiver and each partition
contains fewer elements than there are receivers, the partitions are not evenly rebalanced.
That is exactly the problem Arnaud r
In case of rebalance(), all sources start the round-robin partitioning at
index 0. Since each source emits only very few elements, only the first 15
mappers receive any input.
It would be better to let each source start the round-robin partitioning at
a different index, something like startIdx = (n
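To illustrate the idea outside of Flink (plain Java; the numbers mirror the example above,
and this is not the actual partitioner code):

public class OffsetRoundRobin {

    public static void main(String[] args) {
        int numReceivers = 100;
        int numSenders = 100;
        int elementsPerSender = 14;

        int[] received = new int[numReceivers];

        for (int sender = 0; sender < numSenders; sender++) {
            // Each sender starts at a different offset; with a common start
            // index of 0, only the first 14 receivers would ever get data.
            int startIdx = sender % numReceivers;
            for (int element = 0; element < elementsPerSender; element++) {
                int receiver = (startIdx + element) % numReceivers;
                received[receiver]++;
            }
        }

        for (int i = 0; i < numReceivers; i++) {
            System.out.println("receiver " + i + " got " + received[i] + " elements");
        }
    }
}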
Btw, it is working with a parallelism-1 source because only a single source
partitions the data (round-robin or random), so multiple sources do not all
assign work to the same few mappers.
2015-09-03 15:22 GMT+02:00 Matthias J. Sax :
> If it were only 14 elements, you would obviously be right. However, i
If it were only 14 elements, you would obviously be right. However, if I
understood Arnaud correctly, the problem is that there are more than 14
elements:
> Each of my 100 sources gives only a few lines (say 14 max)
That would be about 1,400 lines in total.
Using a non-parallel source, he is able to
Hi,
I don't think it's a bug. If there are 100 sources that each emit only 14
elements then only the first 14 mappers will ever receive data. The
round-robin distribution is not global, since the sources operate
independently from each other.
Cheers,
Aljoscha
On Wed, 2 Sep 2015 at 20:00 Matthias
The KafkaSink is the last step in my program, after the 2nd deduplication.
I have not yet been able to track down where the duplicates show up. That's a bit
difficult to find out... But I'm trying to find it...
> On 03.09.2015 at 14:14, Stephan Ewen wrote:
>
> Can you tell us where the KafkaSink comes into play? At what point do the
> duplicates come up?
In a set of benchmarks a while back, we found that the chaining mechanism
has some overhead right now, because of its abstraction. The abstraction
creates iterators for each element and makes it hard for the JIT to
specialize on the operators in the chain.
For purely local chains at full speed, th
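As a side note, chaining can also be switched off if it turns out to be a bottleneck; a
minimal sketch (method names as in the streaming API, assumed to be available in the
version at hand):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingControl {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Option 1: turn operator chaining off for the whole job.
        env.disableOperatorChaining();

        env.fromElements("a", "b", "c")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           // Option 2 (instead of the global switch): break the chain at one operator.
           .startNewChain()
           .print();

        env.execute("chaining control");
    }
}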
> Answering to myself, I have found some nice training material at
> http://dataartisans.github.io/flink-training.
Excellent resources! Somehow, I managed not to stumble over them by
myself - either I was blind, or they are well hidden... :)
Best,
-Stefan
Can you tell us where the KafkaSink comes into play? At what point do the
duplicates come up?
On Thu, Sep 3, 2015 at 2:09 PM, Rico Bergmann wrote:
> No. I mean the KafkaSink.
>
> A bit more insight into my program: I read from a Kafka topic with
> FlinkKafkaConsumer082, then hash-partition the data
No. I mean the KafkaSink.
A bit more insight into my program: I read from a Kafka topic with
FlinkKafkaConsumer082, then hash-partition the data, then I do a deduplication
(which does not eliminate all duplicates, though). Then some computation, and
afterwards deduplication again (group by message in a wind
Do you mean the KafkaSource?
Which KafkaSource are you using? The 0.9.1 FlinkKafkaConsumer082 or the
KafkaSource?
On Thu, Sep 3, 2015 at 1:10 PM, Rico Bergmann wrote:
> Hi!
>
> Testing it with the current 0.10 snapshot is not easily possible atm
>
> But I deactivated checkpointing in my program
@Till:
The fields are hidden inside the JSON string, so I have to deserialize it
first. Also, the classes do not have much in common. It might be possible
to do it with a hierarchy of group-bys. I'm not sure how complicated that
would be; additionally, I will have to send the raw String around for
Hi!
Testing it with the current 0.10 snapshot is not easily possible at the moment.
But I deactivated checkpointing in my program and still get duplicates in my
output. So it seems it does not come only from the checkpointing feature, right?
Maybe the KafkaSink is responsible for this? (Just my guess.)
Cheers, Rico
Hi Martin,
maybe this is what you are looking for:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#output-splitting
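A minimal sketch of that split/select pattern (class names as in later Flink releases, the
0.10-era names differ slightly; the JSON type handling is simplified to a plain string check):

import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SplitStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Collections;

public class SplitByType {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> jsonStream = env.fromElements(
                "{\"type\":\"click\"}", "{\"type\":\"view\"}");

        // Tag each record with the name of the output it belongs to.
        SplitStream<String> split = jsonStream.split(new OutputSelector<String>() {
            @Override
            public Iterable<String> select(String json) {
                // Simplified type extraction; a real job would parse the JSON.
                String type = json.contains("\"type\":\"click\"") ? "click" : "other";
                return Collections.singletonList(type);
            }
        });

        split.select("click").print();
        split.select("other").print();

        env.execute("split stream by type");
    }
}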
Regards,
Aljoscha
On Thu, 3 Sep 2015 at 12:02 Till Rohrmann wrote:
> Hi Martin,
>
> could grouping be a solution to your problem?
>
> Cheers,
> Till
Hi Martin,
could grouping be a solution to your problem?
Cheers,
Till
On Thu, Sep 3, 2015 at 11:56 AM, Martin Neumann wrote:
> Hej,
>
> I have a stream of JSON objects of several different types. I want to
> split this stream into several streams, each of them dealing with one type
> (so it's not partitioning).
Hej,
I have a stream of JSON objects of several different types. I want to split
this stream into several streams, each of them dealing with one type (so
it's not partitioning).
The only way I have found so far is writing a bunch of filters and connecting
them to the source directly. This way I will have a
Hi Greg!
That number should control the merge fan-in, yes. Maybe a bug was
introduced a while back that prevents this parameter from being properly
passed through the system. Have you modified the config value in the
cluster, on the client, or are you starting the job via the command line,
in whic
Thanks for clarifying the "eager serialization". By serializing and
deserializing explicitly (eagerly) we can raise better Exceptions to
notify the user of non-serializable classes.
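A rough sketch of such an eager round-trip check in plain Java (the helper name here is
made up, it is not Flink's actual utility):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class EagerSerializationCheck {

    /** Serializes and immediately deserializes an object to surface problems early. */
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (Exception e) {
            // Raise a descriptive exception instead of a late, obscure failure at runtime.
            throw new IllegalArgumentException(
                    "Object " + value + " is not serializable: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("works fine"));
    }
}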
> BTW: There is an opportunity to fix two problems with one patch: The
> framesize overflow for the input format, a
Hi Arnaud,
I think that's a bug ;)
I'll file a JIRA to fix it for the next release.
On Thu, Sep 3, 2015 at 10:26 AM, LINZ, Arnaud
wrote:
> Hi,
>
>
>
> I am wondering why, despite the fact that my Java main() method runs OK
> and exits with exit code 0, the YARN container status set by the enclosing
> Flink execution is FAILED with the diagnostic "Flink YARN Client requested
> shutdown."?
Hi hagersaleh,
Sorry for the late reply.
I think using an external system could be a solution for large-scale data. To
use an external system, you have to implement rich functions such as
RichFilterFunction, RichMapFunction, etc.
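A minimal sketch of that pattern (the external store client below is a made-up stand-in;
the point is only the rich function life cycle):

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExternalLookupFilter extends RichFilterFunction<String> {

    // Hypothetical client for an external store (e.g. a key-value service).
    private transient ExternalStoreClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Connect once per parallel task instead of broadcasting the full data set.
        client = ExternalStoreClient.connect("store-host:1234");
    }

    @Override
    public boolean filter(String value) throws Exception {
        return client.contains(value);
    }

    @Override
    public void close() throws Exception {
        if (client != null) {
            client.close();
        }
    }

    /** Stand-in for a real external store client, assumed for illustration. */
    static class ExternalStoreClient implements AutoCloseable {
        private final Set<String> data;

        private ExternalStoreClient(Set<String> data) {
            this.data = data;
        }

        static ExternalStoreClient connect(String address) {
            // A real implementation would open a connection to 'address' here.
            return new ExternalStoreClient(new HashSet<>(Arrays.asList("a", "b", "c")));
        }

        boolean contains(String value) {
            return data.contains(value);
        }

        @Override
        public void close() {
        }
    }
}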
Regards,
Chiwan Park
> On Aug 30, 2015, at 1:30 AM, hagersaleh
Hi,
I am wondering why, despite the fact that my Java main() method runs OK and
exits with exit code 0, the YARN container status set by the enclosing Flink
execution is FAILED with the diagnostic "Flink YARN Client requested shutdown."?
Command line:
flink run -m yarn-cluster -yn 20 -ytm 81
One more remark that just came to my mind. There is a storm-hdfs module
available: https://github.com/apache/storm/tree/master/external/storm-hdfs
Maybe you can use it. It would be great if you could give feedback on whether
this works for you.
-Matthias
On 09/02/2015 10:52 AM, Matthias J. Sax wrote:
>
Hi,
I get a new, similar bug when broadcasting a list of integers if the list
is made unmodifiable:
elements = Collections.unmodifiableList(elements);
I include this code to reproduce the result:
public class WordCountExample {
public static void main(String[] args) throws Exception
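Since the class above is cut off here, a minimal self-contained sketch of such a
reproduction (the data and filter logic are made up; the point is only the unmodifiable
list passed through a broadcast set):

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnmodifiableBroadcastExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        List<Integer> elements = Arrays.asList(1, 2, 3);
        // Wrapping the list like this triggered the reported problem.
        elements = Collections.unmodifiableList(elements);

        DataSet<Integer> toBroadcast = env.fromCollection(elements);

        env.fromElements(1, 2, 3, 4, 5)
           .filter(new RichFilterFunction<Integer>() {
               private List<Integer> allowed;

               @Override
               public void open(Configuration parameters) {
                   allowed = getRuntimeContext().getBroadcastVariable("allowed");
               }

               @Override
               public boolean filter(Integer value) {
                   return allowed.contains(value);
               }
           })
           .withBroadcastSet(toBroadcast, "allowed")
           .print();
    }
}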
I've filed a JIRA at INFRA:
https://issues.apache.org/jira/browse/INFRA-10239
On Wed, Sep 2, 2015 at 11:18 AM, Robert Metzger wrote:
> Hi Sachin,
>
> I also noticed that the GitHub integration is not working properly. I'll
> ask the Apache Infra team.
>
> On Wed, Sep 2, 2015 at 10:20 AM, Sachin
I'm sorry that we changed the method name between minor versions.
We'll soon put some infrastructure in place to a) mark the audience of
classes and b) ensure that public APIs are stable.
On Wed, Sep 2, 2015 at 9:04 PM, Ferenc Turi wrote:
> OK. As far as I can see, only the method name was changed. It was a