Re: Using Custom Partitioner in Streaming with parallelism 1 adding latency

2017-06-15 Thread sohimankotia
You are right Aljoscha. The job graph is split after introducing the partitioner. I was under the impression that if parallelism is set, everything will be chained together. Can you explain how data will flow for map -> partitioner -> flatmap if parallelism is 1? Or it would be great if you could point me to the right doc

Re: Guava version conflict

2017-06-15 Thread Flavio Pompermaier
Hi Gordon, any news on this? On Mon, Jun 12, 2017 at 9:54 AM, Tzu-Li (Gordon) Tai wrote: > This seems like a shading problem then. > I’ve tested this again with Maven 3.0.5; even without building against CDH > Hadoop binaries, the flink-dist jar contains non-shaded Guava dependencies. > > Let me

Re: Stateful streaming question

2017-06-15 Thread Flavio Pompermaier
Hi Aljoscha, we're still investigating possible solutions here. Yes, as you correctly said, there are links between data of different keys, so we can proceed with the next job only once we are 100% sure that all input data has been consumed and no other data will be read until this last jobs

Re: Streaming use case: Row enrichment

2017-06-15 Thread Flavio Pompermaier
Hi Aljoscha, thanks for the great suggestions, I wasn't aware of AsyncDataStream.unorderedWait and BucketingSink setBucketer(). Most probably that's exactly what I was looking for... (I just need to find the time to test it.) Just one last question: what are you referring to with "you could use a diffe

Re: Unable to use Flink RocksDB state backend due to endianness mismatch

2017-06-15 Thread Ziyad Muhammed
Hi Stefan I could solve the issue by building frocksdb with a patch for the ppc architecture. However, this patch has already been applied to the latest version of rocksdb, whereas frocksdb seems not to have been updated in a while. It would be nice to have it updated with this patch. Thanks Ziyad On Tue, Jun 6, 201

Re: Process event with last 1 hour, 1week and 1 Month data

2017-06-15 Thread shashank agarwal
Thanks Aljoscha Krettek, I will try the same. On Thu, Jun 15, 2017 at 3:11 PM, Aljoscha Krettek wrote: > Hi, > > How would you evaluate such a query? I think the answer could be that you > have to keep all that older data around so that you can evaluate when a new > event arrives. In Flink, you c

Re: coGroup exception or something else in Gelly job

2017-06-15 Thread Kaepke, Marc
Hi Greg, I wanted to ask if there is any news about the implementation or opportunities? Thanks and best regards, Marc On 12.06.2017 at 19:28, Kaepke, Marc wrote: I’m working on an implementation of SemiClustering [1]. I used two graph models (Pregel aka. vert

Re: Using Custom Partitioner in Streaming with parallelism 1 adding latency

2017-06-15 Thread Aljoscha Krettek
Hi, Before adding the partitioner, all your functions are chained together. That is, everything is executed in one thread and sending elements from one function to the next is essentially just a method call. By introducing a partitioner you break this chain, and therefore your job now has to send

Using Custom Partitioner in Streaming with parallelism 1 adding latency

2017-06-15 Thread sohimankotia
Hi, I have a streaming job which is running with parallelism 1 as of now. (This job will run with parallelism > 1 in the future.) So I have added a custom partitioner to partition the data based on one tuple field. The flow is: source -> map -> partitioner -> flatmap -> sink The partitioner is ad
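For illustration, a minimal sketch of what such a flow can look like with the DataStream API (the socket source and the inline functions are stand-ins for the actual job):

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.common.functions.Partitioner;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class PartitionerFlowSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(1);

            env.socketTextStream("localhost", 9999)  // stand-in source
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String line) {
                        return Tuple2.of(line, 1);
                    }
                })
                // The custom partitioner: inserting it breaks the operator chain,
                // so records now cross a channel between tasks instead of being
                // passed along via a plain method call.
                .partitionCustom(
                    new Partitioner<String>() {
                        @Override
                        public int partition(String key, int numPartitions) {
                            return Math.abs(key.hashCode()) % numPartitions;
                        }
                    },
                    new KeySelector<Tuple2<String, Integer>, String>() {
                        @Override
                        public String getKey(Tuple2<String, Integer> t) {
                            return t.f0;  // partition on the first tuple field
                        }
                    })
                .flatMap(new FlatMapFunction<Tuple2<String, Integer>, String>() {
                    @Override
                    public void flatMap(Tuple2<String, Integer> t, Collector<String> out) {
                        out.collect(t.f0);
                    }
                })
                .print();  // stand-in sink

            env.execute("custom-partitioner-flow");
        }
    }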

Re: User self resource file.

2017-06-15 Thread yunfan123
So your suggestion is that I create an archive of all the files in the resources, then get this file from the distributed cache and extract it to a path, and use this path as my resource path? But at what point should I clear the temp path?

Re: User self resource file.

2017-06-15 Thread Aljoscha Krettek
I think the code that submits the job can create an archive of all the files in the “resources”, thus making sure that they stay together. This file would then be placed in the distributed cache. When executing, the contents of the archive can be extracted again and used, since they still main
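A rough sketch of that workflow, assuming the archive lives on HDFS and leaving the actual unzipping out (paths and names here are made up):

    import java.io.File;
    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CachedResourcesSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Ship the (hypothetical) resource archive with the job.
            env.registerCachedFile("hdfs:///jobs/my-job/resources.zip", "resources");

            env.socketTextStream("localhost", 9999)
                .map(new ResourceUsingMap())
                .print();

            env.execute("cached-resources-sketch");
        }

        public static class ResourceUsingMap extends RichMapFunction<String, String> {
            private transient File archive;

            @Override
            public void open(Configuration parameters) throws Exception {
                // Flink has copied the registered file to this TaskManager;
                // extracting it (e.g. with java.util.zip) would happen here.
                archive = getRuntimeContext().getDistributedCache().getFile("resources");
            }

            @Override
            public String map(String value) {
                return value + " (resources at " + archive.getAbsolutePath() + ")";
            }
        }
    }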

Re: Streaming use case: Row enrichment

2017-06-15 Thread Aljoscha Krettek
Ok, just trying to make sure I understand everything: You have this: 1. A bunch of data in HDFS that you want to enrich 2. An external service (Solr/ES) that you query for enriching the data rows stored in 1. 3. You need to store the enriched rows in HDFS again I think you could just do this (ro
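A rough sketch of that enrichment step using async I/O (written against the ResultFuture-based AsyncFunction API of later Flink releases; the external-service call is a made-up stand-in):

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    public class EnrichFunction extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String row, ResultFuture<String> resultFuture) {
            // Fire the lookup without blocking the pipeline on each request.
            CompletableFuture
                .supplyAsync(() -> queryExternalService(row))
                .thenAccept(enriched -> resultFuture.complete(Collections.singleton(enriched)));
        }

        private String queryExternalService(String row) {
            return row + "|enriched";  // stand-in for the real Solr/ES lookup
        }
    }

    // Wiring inside the job, given a DataStream<String> rows read from HDFS:
    // at most 100 in-flight requests, 30 s timeout, results emitted as they
    // complete (order is not preserved).
    DataStream<String> enriched = AsyncDataStream.unorderedWait(
        rows, new EnrichFunction(), 30, TimeUnit.SECONDS, 100);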

Re: User self resource file.

2017-06-15 Thread yunfan123
Yes. My resource files are Python or other scripts that reference each other by relative paths. What I want is for all the resource files of one job to be placed in one directory, and the resource files of different jobs must not be placed in the same directory. The DistributedCache cannot guarantee this.

Re: Storm topology running in flink.

2017-06-15 Thread yunfan123
That returns a String specifying the resource path. Any suggestion about this? What I want is to copy the resources to a specific path on the task manager, and pass that specific path to my operator.

Re: User self resource file.

2017-06-15 Thread Aljoscha Krettek
I’m sensing this is related to your other question about adding a method to the RuntimeContext. Would it be possible to extract the resources from the Jar when submitting the program and place them in the distributed cache? Files can be registered using StreamExecutionEnvironment.registerCache

Re: Storm topology running in flink.

2017-06-15 Thread Aljoscha Krettek
What would getUserResourceDir() return? In general, Flink devs are very reluctant to add new methods to such central interfaces because it’s not easy to fix them if they’re broken or unneeded. Best, Aljoscha > On 15. Jun 2017, at 12:40, yunfan123 wrote: > > It seems ok, but flink-storm not

Re: union followed by timestamp assignment / watermark generation

2017-06-15 Thread Aljoscha Krettek
Hi, Yes, I can’t think of cases right now where placing the extractor after a union makes sense. In general, I think it’s always best to place the timestamp extractor as close to the sources (or in the sources, for Kafka) as possible. Right now it would be quite hard (and probably a bit hacky)
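To illustrate the recommended placement, a small sketch (MyEvent and the two sources are hypothetical; the extractor allows 10 s of out-of-orderness):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Assign timestamps/watermarks per source, close to where events originate...
    DataStream<MyEvent> a = sourceA.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(MyEvent e) {
                return e.getTimestampMillis();
            }
        });
    DataStream<MyEvent> b = sourceB.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(MyEvent e) {
                return e.getTimestampMillis();
            }
        });

    // ...and only then union the streams.
    DataStream<MyEvent> unioned = a.union(b);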

Re: union followed by timestamp assignment / watermark generation

2017-06-15 Thread Petr Novotnik
Hello Aljoscha, Fortunately, I found the program in Google's caches :) I've attached it below for reference. I'm stunned by how accurately you have hit the point given the few pieces of information I left in the original text. +1 Yes, it's exactly as you explained. Can you think of a scenario where

Re: Storm topology running in flink.

2017-06-15 Thread yunfan123
It seems ok, but flink-storm does not support Storm's codeDir. I'm working on making flink-storm support the codeDir. To support the code dir, I have to add a new function such as getUserResourceDir() (which may return null) to the Flink RuntimeContext. I know this may be a big change. What do you think of this

Re: How to divide streams on key basis and deliver them

2017-06-15 Thread AndreaKinn
Thank you a lot Carst, Flink runs at a higher level than I imagined. I will try with some experiments!

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-15 Thread Sebastian Neef
Hi, @Ted: > Is it possible to prune (unneeded) field(s) so that heap requirement is > lower ? The XmlInputFormat [0] splits the raw data into smaller chunks, which are then further processed. I don't think I can reduce the field's (Tuple2) sizes. The major difference to Mahout's XmlInputFormat i

Re: Stateful streaming question

2017-06-15 Thread Aljoscha Krettek
Hi, Trying to revive this somewhat older thread: have you made any progress? I think going with a ProcessFunction that keeps all your state internally and periodically outputs to, say, Elasticsearch using a sink seems like the way to go. You can do the periodic emission using timers in the Proc
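A minimal sketch of that pattern on a keyed stream (Event, Summary, and the accumulation logic are hypothetical; the emitted summaries would then go to e.g. an Elasticsearch sink):

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    public class PeriodicEmitFunction extends ProcessFunction<Event, Summary> {
        private static final long INTERVAL_MS = 60_000L;  // emit once per minute
        private transient ValueState<Summary> summary;

        @Override
        public void open(Configuration parameters) {
            summary = getRuntimeContext().getState(
                new ValueStateDescriptor<>("summary", Summary.class));
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<Summary> out) throws Exception {
            Summary s = summary.value();
            if (s == null) {
                s = new Summary();
                // First event for this key: schedule the first periodic emission.
                ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + INTERVAL_MS);
            }
            s.add(event);  // hypothetical accumulation
            summary.update(s);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Summary> out) throws Exception {
            Summary s = summary.value();
            if (s != null) {
                out.collect(s);  // picked up by the downstream sink
            }
            // Re-arm the timer for the next interval.
            ctx.timerService().registerProcessingTimeTimer(timestamp + INTERVAL_MS);
        }
    }

This would be applied as events.keyBy(...).process(new PeriodicEmitFunction()) so that state and timers are scoped per key.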

Re: Storm topology running in flink.

2017-06-15 Thread Aljoscha Krettek
Hi, I’m afraid the Storm compatibility layer has not been touched in a while and was never more than a pet project. Do you have an actual use case for running Storm topologies on top of Flink? To be honest, I think it might be easier to simply port spouts to Flink SourceFunctions and bolts to a
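As a rough sketch of such a port (fetchNextRecord() is a made-up stand-in for whatever the spout's nextTuple() did):

    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class PortedSpoutSource implements SourceFunction<String> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            // The spout's nextTuple() loop becomes the body of run(); ack/fail
            // handling is replaced by Flink's checkpointing model.
            while (running) {
                String record = fetchNextRecord();
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(record);
                }
            }
        }

        @Override
        public void cancel() {
            running = false;
        }

        private String fetchNextRecord() {
            return "record";  // stand-in for the spout's actual input
        }
    }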

Re: Process event with last 1 hour, 1week and 1 Month data

2017-06-15 Thread Aljoscha Krettek
Hi, How would you evaluate such a query? I think the answer could be that you have to keep all that older data around so that you can evaluate when a new event arrives. In Flink, you could use a ProcessFunction for that and use a MapState that keeps events bucketed into one-week intervals. This
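A sketch of how such bucketing might look on a keyed stream (Event, Result, and the evaluation itself are hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.TypeHint;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    public class BucketedHistoryFunction extends ProcessFunction<Event, Result> {
        private static final long WEEK_MS = 7L * 24 * 60 * 60 * 1000;
        private transient MapState<Long, List<Event>> buckets;

        @Override
        public void open(Configuration parameters) {
            buckets = getRuntimeContext().getMapState(new MapStateDescriptor<>(
                "weekly-buckets",
                TypeInformation.of(Long.class),
                TypeInformation.of(new TypeHint<List<Event>>() {})));
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<Result> out) throws Exception {
            long bucket = event.getTimestamp() / WEEK_MS;

            // Append the event to its weekly bucket.
            List<Event> list = buckets.get(bucket);
            if (list == null) {
                list = new ArrayList<>();
            }
            list.add(event);
            buckets.put(bucket, list);

            // Roughly five weekly buckets cover a month; older ones can go.
            buckets.remove(bucket - 5);

            // Evaluate the query against only the buckets it needs (hypothetical).
            out.collect(new Result());
        }
    }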

Re: Fink: KafkaProducer Data Loss

2017-06-15 Thread Aljoscha Krettek
Hi Ninad, I discussed a bit with Gordon and we have some follow-up questions and some theories as to what could be happening. Regarding your first case (the one where data loss is observed): How do you ensure that you only shut down the brokers once Flink has read all the data that you expect

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-15 Thread Kurt Young
I think the only way is adding more managed memory. The large record handler only takes effect on the reduce side, which is used by the merge sorter. According to the exception, it is thrown during the combining phase, which only uses an in-memory sorter that doesn't have the large-record handling mechanism. Be
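For reference, managed memory for batch jobs of that era was controlled via flink-conf.yaml; the values below are illustrative only, not recommendations:

    # flink-conf.yaml (Flink 1.x batch settings)
    # Larger TaskManager heap, and a larger share of it handed to the
    # memory manager that backs sorting/combining:
    taskmanager.heap.mb: 8192
    taskmanager.memory.fraction: 0.8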