Re: Why use Kafka after all?

2016-11-15 Thread Tzu-Li (Gordon) Tai
Hi Matt, Here’s an example of writing a DeserializationSchema for your POJOs: [1]. As for simply writing messages from WebSocket to Kafka using a Flink job, while it is absolutely viable, I would not recommend it, mainly because you’d never know if you might need to temporarily shut down Flink
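A minimal sketch of such a schema, assuming a hypothetical MyPojo class and JSON-encoded messages parsed with Jackson (neither is from the thread):

    import java.io.IOException;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.flink.streaming.util.serialization.AbstractDeserializationSchema;

    // Turns raw Kafka bytes into MyPojo instances.
    public class MyPojoSchema extends AbstractDeserializationSchema<MyPojo> {
        private transient ObjectMapper mapper;

        @Override
        public MyPojo deserialize(byte[] message) throws IOException {
            if (mapper == null) {
                mapper = new ObjectMapper(); // created lazily; ObjectMapper is not serializable
            }
            return mapper.readValue(message, MyPojo.class);
        }
    }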

Re: Subtask keeps on discovering new Kinesis shard when using Kinesalite

2016-11-15 Thread Tzu-Li (Gordon) Tai
Hi Philipp, When used against Kinesalite, can you tell if the connector is already reading data from the test shard before any of the shard discovery messages appear? If you have any spare time to test this, you can set a larger value for the `ConsumerConfigConstants.SHARD_DISCOVERY_INTERVAL_MILLIS` in
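A sketch of what raising that interval could look like, assuming the Flink 1.2 Kinesis connector (stream name, region, and interval are placeholders):

    import java.util.Properties;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
    import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    Properties consumerConfig = new Properties();
    consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
    // Only look for new shards every 60s instead of the default, to cut down the log noise.
    consumerConfig.setProperty(ConsumerConfigConstants.SHARD_DISCOVERY_INTERVAL_MILLIS, "60000");

    FlinkKinesisConsumer<String> consumer =
        new FlinkKinesisConsumer<>("test-stream", new SimpleStringSchema(), consumerConfig);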

Re: Why use Kafka after all?

2016-11-15 Thread Dromit
"So in your case I would directly ingest my messages into Kafka" I will do that through a custom SourceFunction that reads the messages from the WebSocket, creates simple java objects (POJOs) and sink them in a Kafka topic using a FlinkKafkaProducer, if that makes sense. The problem now is I need

Re: Flink job restart at checkpoint interval

2016-11-15 Thread Satish Chandra Gupta
Hi Ufuk and Till, Thanks a lot. Both these suggestions were useful. An older version of xerces was being loaded from one of the dependencies, and I also fixed the serialization glitch in my code, and now checkpointing works. I have 5 value states apart from a custom trigger. Is
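For reference, a keyed value state of the kind being checkpointed here looks roughly like this (Event and the state name are illustrative):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class CountingMapper extends RichFlatMapFunction<Event, Long> {
        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            // Registered state is snapshotted with every checkpoint and
            // restored on recovery, so its contents must serialize cleanly.
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class, 0L));
        }

        @Override
        public void flatMap(Event event, Collector<Long> out) throws Exception {
            count.update(count.value() + 1);
            out.collect(count.value());
        }
    }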

Re: Data Loss in HDFS after Job failure

2016-11-15 Thread Kostas Kloudas
Hi Dominique, I totally agree with you on opening a JIRA. I would suggest adding these “collision” checks also for other file types, for example in-progress and pending. Thanks for reporting it, Kostas > On Nov 15, 2016, at 10:34 PM, Dominique Rondé > wrote: > > Thanks Kostas for this fast and

AW: Re: Data Loss in HDFS after Job failure

2016-11-15 Thread Dominique Rondé
Thanks Kostas for this fast and helpful response! I was able to reproduce it, and changing the prefix and suffix really solved my problem. I would like to suggest checking both values and writing a warning into the logs. If there is no doubt, I would like to open a JIRA and add this message. Greets Dominique

Subtask keeps on discovering new Kinesis shard when using Kinesalite

2016-11-15 Thread Philipp Bussche
Hi there, I am looking into AWS Kinesis and wanted to test with a local install of Kinesalite. This is on Flink 1.2-SNAPSHOT. However it looks like my subtask keeps on discovering new shards, indicated by the following log message, which is constantly written: 21:45:42,867 INFO org.apache.flin

Re: Why are externalized checkpoints deleted on Job Manager exit?

2016-11-15 Thread Cliff Resnick
Hi, Anything keeping this from being merged into master? On Thu, Nov 3, 2016 at 10:56 AM, Ufuk Celebi wrote: > A fix is pending here: https://github.com/apache/flink/pull/2750 > > The behaviour on graceful shut down/suspension respects the > cancellation behaviour with this change. > > On Thu,

Re: Proper way of adding external jars

2016-11-15 Thread Till Rohrmann
Which version of Flink are you using? I tested exactly this scenario with the latest 1.2-SNAPSHOT on a local Flink cluster and it worked both ways (independent of which jar was specified by -C or provided as the user code jar). Can you maybe share the stack trace of the error? Then I could
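For reference, the two variants being tested presumably look like this (paths are illustrative; -C expects a URL and adds it to the classpath on all nodes):

    # dependency added via -C, user code as the program jar
    bin/flink run -C file:///path/to/jarWithMain.jar /path/to/userJar.jar

    # and the other way around
    bin/flink run -C file:///path/to/userJar.jar /path/to/jarWithMain.jar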

Re: Data Loss in HDFS after Job failure

2016-11-15 Thread Kostas Kloudas
Hi Dominique, Just wanted to add that the RollingSink is deprecated and will eventually be replaced by the BucketingSink, so it is worth migrating to that. Cheers, Kostas > On Nov 15, 2016, at 3:51 PM, Kostas Kloudas > wrote: > > Hello Dominique, > > I think the problem is that you set both
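A minimal migration sketch, assuming the BucketingSink from Flink 1.2's flink-connector-filesystem (path and settings are placeholders):

    import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
    import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

    BucketingSink<String> sink = new BucketingSink<>("/some/hdfs/directory");
    sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH"));
    // Keep pending prefix/suffix non-empty so in-flight files stay
    // distinguishable from committed ones.
    sink.setPendingPrefix("_");
    sink.setPendingSuffix(".pending");
    stream.addSink(sink);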

Re: Lot of RocksDB files in tmp Directory

2016-11-15 Thread Aljoscha Krettek
Ok, please let me know when you manage to reproduce. Cheers, Aljoscha On Tue, 15 Nov 2016 at 15:00 Dominique Rondé wrote: > Hi Aljoscha, > > I will try to reproduce this. The job is started only once and fills my /tmp > directory within minutes. > > Greets > > Dominique > > On 14.11.2016 at 16:31

Re: Retrieving values from a dataset of datasets

2016-11-15 Thread otherwise777
It seems what I tried did indeed not work. Can you explain to me why that doesn't work, though? -- View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Retrieving-values-from-a-dataset-of-datasets-tp10108p10128.html Sent from the Apache Flink User Mailin

Re: Data Loss in HDFS after Job failure

2016-11-15 Thread Kostas Kloudas
Hello Dominique, I think the problem is that you set both the pending prefix and suffix to “”. Doing this makes the “committed” or “finished” file paths indistinguishable from the pending ones. Thus they are cleaned up upon restoring. Could you undo this and use, for example, a suffix “pending” or s
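Concretely, something along these lines should keep the file states distinguishable (the marker values are just examples):

    RollingSink<String> sink = new RollingSink<>("/some/hdfs/directory");
    // Non-empty markers: files still being written can be told apart from
    // committed ones, so a restore no longer cleans up finished output.
    sink.setPendingPrefix("_");
    sink.setPendingSuffix(".pending");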

RDF/SPARQL and Flink

2016-11-15 Thread Tomas Knap
Good afternoon, Currently we are using UnifiedViews (unifiedviews.eu) for RDF data processing. You may define various RDF data processing tasks in UnifiedViews, e.g.: 1) extract data from a certain SPARQL endpoint A, 2) extract data from a certain folder and convert it to RDF, 3) merge RDF data out

Re: Too few memory segments provided. Hash Table needs at least 33 memory segments.

2016-11-15 Thread Vasiliki Kalavri
Hi Miguel, I'm sorry for the late reply; this e-mail got stuck in my spam folder. I'm glad that you've found a solution :) I've never used Flink with Docker, so I'm probably not the best person to advise you on this. However, if I understand correctly, you're changing the configuration before sub

Re: Proper way of adding external jars

2016-11-15 Thread Gyula Fóra
Hi Till, Sorry, I understand that this was confusing. The JarWithMain contains a class called Launcher with a main method that does something like: Class.forName("classFromUserJar").newInstance().launch() So it doesn't have any "static" dependency on the UserJar. JarWithMain has a lot of API cla

Data Loss in HDFS after Job failure

2016-11-15 Thread Dominique Rondé
Hi @all! I ran into a strange behavior with the Rolling HDFS-Sink. We consume events from a Kafka topic and write them into a HDFS filesystem. We use the RollingSink implementation in this way: RollingSink sink = new RollingSink("/some/hdfs/directory") // .setBucketer(new DateT

Re: Lot of RocksDB files in tmp Directory

2016-11-15 Thread Dominique Rondé
Hi Aljoscha, I will try to reproduce this. The job is started only once and fills my /tmp directory within minutes. Greets Dominique On 14.11.2016 at 16:31, Aljoscha Krettek wrote: > Hi, > could it be that the job has restarted due to failures a large number > of times? > > Cheers, > Aljoscha > > O

Re: Proper way of adding external jars

2016-11-15 Thread Till Rohrmann
Hi Gyula, did I understand it correctly that JarWithMain depends on UserJar because the former will execute classes from the latter, and UserJar depends on JarWithMain because it contains classes depending on classes from JarWithMain? This sounds like a cyclic dependency to me. The command line help f

Re: WindowOperator - element's timestamp

2016-11-15 Thread Aljoscha Krettek
Hi, I understand now. For early (speculative) firing I would suggest writing a custom trigger that repeatedly fires on processing time. We're also working on a Trigger DSL that will make such cases simpler; for example, you would be able to write: window.trigger(EventTime.pastEndOfWindow().withEa
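Flink already ships a trigger in this spirit; a sketch using it (window size and firing interval are placeholders, and note it replaces the default event-time trigger rather than supplementing it):

    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger;

    stream
        .keyBy(0)
        .window(TumblingEventTimeWindows.of(Time.minutes(10)))
        // Emit a speculative result every 30s of processing time.
        .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(30)))
        .sum(1);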

Re: Why use Kafka after all?

2016-11-15 Thread Till Rohrmann
Hi Matt, as you've stated, Flink is a stream processor and as such it needs to get its inputs from somewhere. Flink can provide you with up to exactly-once processing guarantees. But in order to do this, it requires a replayable source because in case of a failure you might have to reprocess parts of t
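For completeness, checkpointing, which drives those guarantees, is enabled like this (the interval is arbitrary):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Snapshot the job every 5s; on failure Flink rewinds the replayable
    // source (e.g. Kafka offsets) to the latest snapshot and reprocesses.
    env.enableCheckpointing(5000);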

Re: Flink job restart at checkpoint interval

2016-11-15 Thread Till Rohrmann
Hi Satish, your problem seems to be related to reading Hadoop's configuration. According to the internet [1,2,3], try to select a proper xerces version to resolve the problem. [1] http://stackoverflow.com/questions/26974067/org-apache-hadoop-conf-configuration-loadresource-error [2]
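One way to see which xerces jar is actually picked up at runtime (a generic reflection trick, not something from the thread):

    Class<?> clazz = Class.forName("org.apache.xerces.parsers.SAXParser");
    // Prints the jar the class was loaded from; note getCodeSource() can be
    // null for classes on the bootstrap classpath.
    System.out.println(clazz.getProtectionDomain().getCodeSource().getLocation());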

Re: Csv to windows?

2016-11-15 Thread Felix Neutatz
I found the solution here: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Regarding-Broadcast-of-datasets-in-streaming-context-td6456.html 2016-11-14 9:41 GMT+01:00 Ufuk Celebi : > I think this is independent of streaming. If you want to compute the > aggregate over all keys

Re: Retrieving values from a dataset of datasets

2016-11-15 Thread Gábor Gévay
Hello, How exactly do you represent the DataSet of DataSets? I'm asking because if you have something like a DataSet<DataSet<T>>, that unfortunately doesn't work in Flink. Best, Gábor 2016-11-14 20:44 GMT+01:00 otherwise777 : > Hey There, > > I'm trying to calculate the betweenness in a graph with Flin
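A common workaround, sketched under the assumption that every inner dataset can be tagged with an id (Vertex and PerGroupBetweenness are hypothetical): flatten everything into one DataSet of (groupId, element) pairs and group instead of nesting.

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;

    // Instead of a nested DataSet<DataSet<Vertex>>, keep one flat dataset in
    // which the first field identifies the logical inner dataset.
    DataSet<Tuple2<Long, Vertex>> flattened = ...; // built by the upstream operators
    flattened
        .groupBy(0) // all elements of one "inner dataset" come together
        .reduceGroup(new PerGroupBetweenness()); // hypothetical group-wise computation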

Re: Proper way of adding external jars

2016-11-15 Thread Gyula Fóra
Hi Scott, Thanks, I am familiar with the ways you suggested. Unfortunately packaging everything together is not really an option in our case; we specifically want to avoid having to do this, as many people will set up their own builds and they will inevitably fail to include everything necessary wi

Re: Maintaining watermarks per key, instead of per operator instance

2016-11-15 Thread Stephan Epping
Hey Aljoscha, that sounds very promising, awesome! Though, I still would need to implement my own window management logic (window assignment and window state purging), right? I was thinking about reusing some of the existing components (TimeWindow and WindowAssigner), but running my own WindowOpera