+1 for dropping Hadoop 2.2.0
Regards,
Chiwan Park
> On Sep 4, 2015, at 5:58 AM, Ufuk Celebi wrote:
>
> +1 to what Robert said.
>
> On Thursday, September 3, 2015, Robert Metzger wrote:
> I think most cloud providers moved beyond Hadoop 2.2.0.
> Google's Click-To-Deploy is on 2.4.1
> AWS EMR is on 2.6.0
Hi Stephan,
That's good information to know. We will hit that throughput easily. Our
computation graph has a lot of chaining like this right now.
I think it's safe to minimize the chaining right now.
Thanks a lot for this, Stephan.
Cheers
On Thu, Sep 3, 2015 at 7:20 PM, Stephan Ewen wrote:
> In a set of benchmarks a while back, we found that the chaining mechanism
CoGroup is more generic than Join. You can perform a Join with CoGroup but
not do a CoGroup with a Join.
However, Join can be executed more efficiently than CoGroup.
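For illustration, here is a minimal sketch of both operators on the DataSet API (the
input data, types, and key positions are made up, not from the original question):

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class JoinVsCoGroup {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Integer, String>> users = env.fromElements(
                new Tuple2<>(1, "alice"), new Tuple2<>(2, "bob"));
        DataSet<Tuple2<Integer, String>> orders = env.fromElements(
                new Tuple2<>(1, "book"), new Tuple2<>(1, "pen"), new Tuple2<>(3, "cup"));

        // Join: one output record per matching pair; keys without a match are dropped.
        users.join(orders).where(0).equalTo(0).print();

        // CoGroup: called once per key with all elements of both sides,
        // including keys that appear on only one side.
        users.coGroup(orders).where(0).equalTo(0)
             .with(new CoGroupFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, String>() {
                 @Override
                 public void coGroup(Iterable<Tuple2<Integer, String>> us,
                                     Iterable<Tuple2<Integer, String>> os,
                                     Collector<String> out) {
                     int orderCount = 0;
                     for (Tuple2<Integer, String> ignored : os) {
                         orderCount++;
                     }
                     for (Tuple2<Integer, String> u : us) {
                         out.collect(u.f1 + " has " + orderCount + " orders");
                     }
                 }
             })
             .print();
    }
}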
2015-09-03 22:28 GMT+02:00 hagersaleh :
> what is the difference between join and coGroup in Flink?
+1 to what Robert said.
On Thursday, September 3, 2015, Robert Metzger wrote:
> I think most cloud providers moved beyond Hadoop 2.2.0.
> Google's Click-To-Deploy is on 2.4.1
> AWS EMR is on 2.6.0
>
> The situation for the distributions seems to be the following:
> MapR 4 uses Hadoop 2.4.0 (current is MapR 5)
Have a look at the class IOManager and IOManagerAsync, it is a good example
of how we use these hooks for cleanup.
The constructor usually installs them, and the shutdown logic removes them.
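A rough sketch of that pattern in plain Java (this is not the actual IOManagerAsync code;
the class and names are made up):

import java.io.File;

public class TempFileComponent {

    private final File tempDir;
    private final Thread shutdownHook;

    public TempFileComponent(File tempDir) {
        this.tempDir = tempDir;
        // Install the hook in the constructor, so temp files are removed even
        // if the JVM is stopped without a regular shutdown.
        this.shutdownHook = new Thread(this::deleteTempFiles);
        Runtime.getRuntime().addShutdownHook(shutdownHook);
    }

    public void shutdown() {
        deleteTempFiles();
        try {
            // Remove the hook again during a regular shutdown.
            Runtime.getRuntime().removeShutdownHook(shutdownHook);
        } catch (IllegalStateException e) {
            // JVM shutdown is already in progress; the hook will run anyway.
        }
    }

    private void deleteTempFiles() {
        File[] files = tempDir.listFiles();
        if (files != null) {
            for (File f : files) {
                f.delete();
            }
        }
    }
}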
On Thu, Sep 3, 2015 at 9:19 PM, Stephan Ewen wrote:
> Stopping the JVM process cleans up all resources, except temp files.
Stopping the JVM process cleans up all resources, except temp files.
Everything that creates temp files uses a shutdown hook to remove these:
IOManager, BlobManager, LibraryCache, ...
On Wed, Sep 2, 2015 at 7:40 PM, Sachin Goel
wrote:
> I'm not sure what you mean by "Crucial cleanup is in shutdo
Hi!
Yes, you can run Flink completely without HDFS.
Also the checkpointing can put state into any file system, like S3, or a
Unix file system (like a NAS or Amazon EBS), or even Tachyon.
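For example, a minimal sketch (the filesystem state backend API shown here is from newer
Flink releases, and the S3 path is made up):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NoHdfsCheckpointing {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds.
        env.enableCheckpointing(10000);

        // Any supported file system URI works here: s3://, file:// (NAS/EBS), etc.
        env.setStateBackend(new FsStateBackend("s3://my-bucket/flink-checkpoints"));

        env.fromElements(1, 2, 3).print();
        env.execute("checkpointing without HDFS");
    }
}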
Greetings,
Stephan
On Thu, Sep 3, 2015 at 8:57 PM, Jerry Peng
wrote:
> Hello,
>
> Does Flink require HDFS to run?
Hi Jerry,
Yes, you can.
Best,
Marton
On Thu, Sep 3, 2015 at 8:57 PM, Jerry Peng
wrote:
> Hello,
>
> Does Flink require HDFS to run? I know you can use HDFS to checkpoint and
> process files in a distributed fashion. So can Flink run standalone
> without HDFS?
>
Hello,
Does Flink require HDFS to run? I know you can use HDFS to checkpoint and
process files in a distributed fashion. So can Flink run standalone
without HDFS?
Hi Chiwan Park
I don't understand this solution. Please explain more.
Well hidden.
I have now added a link to the menu of http://data-artisans.com/. This material
is provided for free by data Artisans, but it is not part of the official
Apache Flink project.
On Thu, Sep 3, 2015 at 2:20 PM, Stefan Winterstein <
stefan.winterst...@dfki.de> wrote:
>
> > Answering to myself, I have found some nice training material at
I think most cloud providers moved beyond Hadoop 2.2.0.
Google's Click-To-Deploy is on 2.4.1
AWS EMR is on 2.6.0
The situation for the distributions seems to be the following:
MapR 4 uses Hadoop 2.4.0 (current is MapR 5)
CDH 5.0 uses 2.3.0 (the current CDH release is 5.4)
HDP 2.0 (October 2013)
Chiwan has a good point. Once the data that needs to be available to all
machines is too large for one machine, there is no good solution any more.
The best approach is an external store to which all nodes have access. It
is not going to be terribly fast, though.
If you are in the situation that y
The purpose of rebalance() should be to rebalance the partitions of a data
stream as evenly as possible, right?
If all senders start sending data to the same receiver and each partition
contains fewer elements than there are receivers, the partitions are not evenly rebalanced.
That is exactly the problem Arnaud r
In case of rebalance(), all sources start the round-robin partitioning at
index 0. Since each source emits only very few elements, only the first 15
mappers receive any input.
It would be better to let each source start the round-robin partitioning at
a different index, something like startIdx = (n
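To illustrate the idea outside of Flink (plain Java; the numbers mirror the example above,
and this is not the actual partitioner code):

public class OffsetRoundRobin {

    public static void main(String[] args) {
        int numReceivers = 100;
        int numSenders = 100;
        int elementsPerSender = 14;

        int[] received = new int[numReceivers];

        for (int sender = 0; sender < numSenders; sender++) {
            // Each sender starts at a different offset; with a common start
            // index of 0, only the first 14 receivers would ever get data.
            int startIdx = sender % numReceivers;
            for (int element = 0; element < elementsPerSender; element++) {
                int receiver = (startIdx + element) % numReceivers;
                received[receiver]++;
            }
        }

        for (int i = 0; i < numReceivers; i++) {
            System.out.println("receiver " + i + " got " + received[i] + " elements");
        }
    }
}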
Btw, it is working with a parallelism-1 source because only a single source
partitions the data (round-robin or random), so multiple sources do not all
assign work to the same few mappers.
2015-09-03 15:22 GMT+02:00 Matthias J. Sax :
> If it were only 14 elements, you would obviously be right. However, i
If it were only 14 elements, you would obviously be right. However, if I
understood Arnaud correctly, the problem is that there are more than 14
elements:
> Each of my 100 sources gives only a few lines (say 14 max)
That would be about 1,400 lines in total.
Using a non-parallel source, he is able to
Hi,
I don't think it's a bug. If there are 100 sources that each emit only 14
elements then only the first 14 mappers will ever receive data. The
round-robin distribution is not global, since the sources operate
independently from each other.
Cheers,
Aljoscha
On Wed, 2 Sep 2015 at 20:00 Matthias
The KafkaSink is the last step in my program, after the 2nd deduplication.
I have not yet been able to track down where the duplicates show up. That's a bit
difficult to find out... But I'm trying to find it...
> On 03.09.2015 at 14:14, Stephan Ewen wrote:
>
> Can you tell us where the KafkaSink comes into play? At what point do the
> duplicates come up?
In a set of benchmarks a while back, we found that the chaining mechanism
has some overhead right now, because of its abstraction. The abstraction
creates iterators for each element and makes it hard for the JIT to
specialize on the operators in the chain.
For purely local chains at full speed, th
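As a side note, chaining can also be switched off if it turns out to be a bottleneck; a
minimal sketch (method names as in the streaming API, assumed to be available in the
version at hand):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingControl {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Option 1: turn operator chaining off for the whole job.
        env.disableOperatorChaining();

        env.fromElements("a", "b", "c")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           // Option 2 (instead of the global switch): break the chain at one operator.
           .startNewChain()
           .print();

        env.execute("chaining control");
    }
}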
> Answering to myself, I have found some nice training material at
> http://dataartisans.github.io/flink-training.
Excellent resources! Somehow, I managed not to stumble over them by
myself - either I was blind, or they are well hidden... :)
Best,
-Stefan
Can you tell us where the KafkaSink comes into play? At what point do the
duplicates come up?
On Thu, Sep 3, 2015 at 2:09 PM, Rico Bergmann wrote:
> No. I mean the KafkaSink.
>
> A bit more insight into my program: I read from a Kafka topic with
> FlinkKafkaConsumer082, then hash-partition the data
No. I mean the KafkaSink.
A bit more insight into my program: I read from a Kafka topic with
FlinkKafkaConsumer082, then hash-partition the data, then I do a deduplication
(which does not eliminate all duplicates, though). Then some computation, and
afterwards deduplication again (group by message in a wind
Do you mean the KafkaSource?
Which KafkaSource are you using? The 0.9.1 FlinkKafkaConsumer082 or the
KafkaSource?
On Thu, Sep 3, 2015 at 1:10 PM, Rico Bergmann wrote:
> Hi!
>
> Testing it with the current 0.10 snapshot is not easily possible atm
>
> But I deactivated checkpointing in my program
@Till:
The fields are hidden inside the JSON string, so I have to deserialize it
first. Also, the classes do not have much in common. It might be possible
to do it with a hierarchy of group-bys. I'm not sure how complicated that
would be; additionally, I will have to send the raw String around for
Hi!
Testing it with the current 0.10 snapshot is not easily possible at the moment.
But I deactivated checkpointing in my program and still get duplicates in my
output. So it seems it does not come only from the checkpointing feature, right?
Maybe the KafkaSink is responsible for this? (Just my guess.)
Cheers, Rico
Hi Martin,
maybe this is what you are looking for:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#output-splitting
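A minimal sketch of that split/select pattern (class names as in later Flink releases, the
0.10-era names differ slightly; the JSON type handling is simplified to a plain string check):

import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SplitStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Collections;

public class SplitByType {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> jsonStream = env.fromElements(
                "{\"type\":\"click\"}", "{\"type\":\"view\"}");

        // Tag each record with the name of the output it belongs to.
        SplitStream<String> split = jsonStream.split(new OutputSelector<String>() {
            @Override
            public Iterable<String> select(String json) {
                // Simplified type extraction; a real job would parse the JSON.
                String type = json.contains("\"type\":\"click\"") ? "click" : "other";
                return Collections.singletonList(type);
            }
        });

        split.select("click").print();
        split.select("other").print();

        env.execute("split stream by type");
    }
}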
Regards,
Aljoscha
On Thu, 3 Sep 2015 at 12:02 Till Rohrmann wrote:
> Hi Martin,
>
> could grouping be a solution to your problem?
>
> Cheers,
> Till
Hi Martin,
could grouping be a solution to your problem?
Cheers,
Till
On Thu, Sep 3, 2015 at 11:56 AM, Martin Neumann wrote:
> Hej,
>
> I have a stream of JSON objects of several different types. I want to
> split this stream into several streams, each of them dealing with one type
> (so it's not partitioning).
Hej,
I have a stream of JSON objects of several different types. I want to split
this stream into several streams, each of them dealing with one type (so
it's not partitioning).
The only way I have found so far is writing a bunch of filters and connecting
them to the source directly. This way I will have a
Hi Greg!
That number should control the merge fan-in, yes. Maybe a bug was
introduced a while back that prevents this parameter from being properly
passed through the system. Have you modified the config value in the
cluster, on the client, or are you starting the job via the command line,
in whic
Thanks for clarifying the "eager serialization". By serializing and
deserializing explicitly (eagerly) we can raise better Exceptions to
notify the user of non-serializable classes.
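A rough sketch of such an eager round-trip check in plain Java (the helper name here is
made up, it is not Flink's actual utility):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class EagerSerializationCheck {

    /** Serializes and immediately deserializes an object to surface problems early. */
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (Exception e) {
            // Raise a descriptive exception instead of a late, obscure failure at runtime.
            throw new IllegalArgumentException(
                    "Object " + value + " is not serializable: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("works fine"));
    }
}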
> BTW: There is an opportunity to fix two problems with one patch: The
> framesize overflow for the input format, a
Hi Arnaud,
I think that's a bug ;)
I'll file a JIRA to fix it for the next release.
On Thu, Sep 3, 2015 at 10:26 AM, LINZ, Arnaud
wrote:
> Hi,
>
>
>
> I am wondering why, despite the fact that my Java main() method runs OK
> and exits with exit code 0, the YARN container status set by the enclosing
> Flink execution is FAILED with the diagnostic "Flink YARN Client requested
> shutdown."?
Hi hagersaleh,
Sorry for the late reply.
I think using an external system could be a solution for large-scale data. To
use an external system, you have to implement rich functions such as
RichFilterFunction, RichMapFunction, etc.
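A minimal sketch of that pattern (the external store client below is a made-up stand-in;
the point is only the rich function life cycle):

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExternalLookupFilter extends RichFilterFunction<String> {

    // Hypothetical client for an external store (e.g. a key-value service).
    private transient ExternalStoreClient client;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Connect once per parallel task instead of broadcasting the full data set.
        client = ExternalStoreClient.connect("store-host:1234");
    }

    @Override
    public boolean filter(String value) throws Exception {
        return client.contains(value);
    }

    @Override
    public void close() throws Exception {
        if (client != null) {
            client.close();
        }
    }

    /** Stand-in for a real external store client, assumed for illustration. */
    static class ExternalStoreClient implements AutoCloseable {
        private final Set<String> data;

        private ExternalStoreClient(Set<String> data) {
            this.data = data;
        }

        static ExternalStoreClient connect(String address) {
            // A real implementation would open a connection to 'address' here.
            return new ExternalStoreClient(new HashSet<>(Arrays.asList("a", "b", "c")));
        }

        boolean contains(String value) {
            return data.contains(value);
        }

        @Override
        public void close() {
        }
    }
}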
Regards,
Chiwan Park
> On Aug 30, 2015, at 1:30 AM, hagersaleh
Hi,
I am wondering why, despite the fact that my Java main() method runs OK and
exits with exit code 0, the YARN container status set by the enclosing Flink
execution is FAILED with the diagnostic "Flink YARN Client requested shutdown."?
Command line:
flink run -m yarn-cluster -yn 20 -ytm 81
One more remark that just came to my mind. There is a storm-hdfs module
available: https://github.com/apache/storm/tree/master/external/storm-hdfs
Maybe you can use it. It would be great if you could give feedback on whether
this works for you.
-Matthias
On 09/02/2015 10:52 AM, Matthias J. Sax wrote:
>
Hi,
I get a new, similar bug when broadcasting a list of integers if the list
is made unmodifiable:
elements = Collections.unmodifiableList(elements);
I include this code to reproduce the result:
public class WordCountExample {
public static void main(String[] args) throws Exception
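Since the class above is cut off here, a minimal self-contained sketch of such a
reproduction (the data and filter logic are made up; the point is only the unmodifiable
list passed through a broadcast set):

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnmodifiableBroadcastExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        List<Integer> elements = Arrays.asList(1, 2, 3);
        // Wrapping the list like this triggered the reported problem.
        elements = Collections.unmodifiableList(elements);

        DataSet<Integer> toBroadcast = env.fromCollection(elements);

        env.fromElements(1, 2, 3, 4, 5)
           .filter(new RichFilterFunction<Integer>() {
               private List<Integer> allowed;

               @Override
               public void open(Configuration parameters) {
                   allowed = getRuntimeContext().getBroadcastVariable("allowed");
               }

               @Override
               public boolean filter(Integer value) {
                   return allowed.contains(value);
               }
           })
           .withBroadcastSet(toBroadcast, "allowed")
           .print();
    }
}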
I've filed a JIRA at INFRA:
https://issues.apache.org/jira/browse/INFRA-10239
On Wed, Sep 2, 2015 at 11:18 AM, Robert Metzger wrote:
> Hi Sachin,
>
> I also noticed that the GitHub integration is not working properly. I'll
> ask the Apache Infra team.
>
> On Wed, Sep 2, 2015 at 10:20 AM, Sachin
I'm sorry that we changed the method name between minor versions.
We'll soon put some infrastructure in place to a) mark the audience of
classes and b) ensure that public APIs are stable.
On Wed, Sep 2, 2015 at 9:04 PM, Ferenc Turi wrote:
> OK. As far as I can see, only the method name was changed. It was a