Re: Ask for SQL using kafka in Flink

2018-06-05 Thread Timo Walther
@Shuyi: Yes, a more advanced table example would be helpful anyway, and combining it with Kafka/Avro end-to-end would be even better. @Will: I totally agree that the current connector ecosystem could be improved. This is also on the mid-term roadmap. Contributors that could help here are very welcome …

Re: Output batch to Kafka

2018-06-05 Thread Stephan Ewen
You could go with Chesnay's suggestion, which might be the quickest fix. Creating a KafkaOutputFormat (possibly wrapping the KafkaProducer) would be a bit cleaner. Would be happy to have that as a contribution, actually ;-) If you care about producing "exactly once" using Kafka Transactions …
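A minimal sketch of what such a KafkaOutputFormat could look like. Note this is not an existing Flink class: the class name, the String record type, and the byte serialization are assumptions for illustration only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.api.common.io.OutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * Hypothetical batch OutputFormat wrapping a KafkaProducer, as suggested
 * in the thread. One producer is created per parallel task in open().
 */
public class KafkaOutputFormat implements OutputFormat<String> {

    private final String topic;
    private final Properties producerProps;
    private transient KafkaProducer<byte[], byte[]> producer;

    public KafkaOutputFormat(String topic, Properties producerProps) {
        this.topic = topic;
        this.producerProps = producerProps;
    }

    @Override
    public void configure(Configuration parameters) {}

    @Override
    public void open(int taskNumber, int numTasks) {
        producer = new KafkaProducer<>(producerProps);
    }

    @Override
    public void writeRecord(String record) {
        // fire-and-forget send; no transactional "exactly once" here
        producer.send(new ProducerRecord<>(topic, record.getBytes(StandardCharsets.UTF_8)));
    }

    @Override
    public void close() throws IOException {
        producer.close(); // flushes any buffered records
    }
}
```

The batch job would then call `dataSet.output(new KafkaOutputFormat("my-topic", props))`.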

Re: Checkpointing when reading from files?

2018-06-05 Thread Fabian Hueske
Hi, The continuous file source is split into two components: 1) a split generator that monitors a directory and generates splits when a new file is observed, and 2) reading tasks that receive splits and read the referenced files. I think this is the code that generates the input splits, which are distributed …
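The user-facing entry point for that two-component setup is `env.readFile` with `PROCESS_CONTINUOUSLY` — a sketch, with a placeholder directory path:

```java
// Single monitoring task generates splits; parallel readers consume them.
TextInputFormat format = new TextInputFormat(new Path("hdfs:///input"));

DataStream<String> lines = env.readFile(
    format,
    "hdfs:///input",                          // directory the split generator monitors
    FileProcessingMode.PROCESS_CONTINUOUSLY,  // keep scanning for new files
    10_000L);                                 // scan interval in milliseconds
```

With `PROCESS_CONTINUOUSLY`, a modified file is re-read in full, so the source provides exactly-once state semantics but may re-emit records for appended files.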

Re: Cannot determine simple type name - [FLINK-7490]

2018-06-05 Thread Timo Walther
This sounds similar to https://issues.apache.org/jira/browse/FLINK-9220. Can you explain the steps that I have to do to reproduce the error? Regards, Timo Am 05.06.18 um 08:06 schrieb Chesnay Schepler: Please re-open the issue. It would be great if you could also provide us with a reproducing

Writing stream to Hadoop

2018-06-05 Thread miki haiat
I'm trying to write some data to Hadoop using this code. The state backend is set without time: StateBackend sb = new FsStateBackend("hdfs://***:9000/flink/my_city/checkpoints"); env.setStateBackend(sb); BucketingSink<…> sink = new BucketingSink<>("hdfs://:9000/mycity/raw"); sink.setBu…
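As the replies in this thread point out, the BucketingSink relies on checkpointing to move part files from pending to finished. A sketch of a working variant of this setup — the hostname, element type, bucket format, and batch size are placeholders:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L);  // required so pending part files get finalized
env.setStateBackend(new FsStateBackend("hdfs://namenode:9000/flink/my_city/checkpoints"));

BucketingSink<String> sink = new BucketingSink<>("hdfs://namenode:9000/mycity/raw");
sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HH")); // one bucket per hour
sink.setBatchSize(128 * 1024 * 1024L);                      // roll part files at 128 MB

stream.addSink(sink);
```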

Re: Writing stream to Hadoop

2018-06-05 Thread Marvin777
I think you can look at this comment, thanks. * Part files can be in one of three states: {@code in-progress}, {@code pending} or {@code finished}. * The reason for this is how the sink works together with the checkpointing mechanism to provide exactly-once * semantics and fault-tolerance. The pa…

Re: NPE in flink sql over-window

2018-06-05 Thread Fabian Hueske
Hi Yan, Thanks for providing the logs and opening the JIRA issue! Let's continue the discussion there. Best, Fabian 2018-06-05 1:26 GMT+02:00 Yan Zhou [FDS Science] : > Hi Fabian, > > I added some trace logs in ProcTimeBoundedRangeOver and think it should > be a bug. The log should explain how

Checkpointing on cluster shutdown

2018-06-05 Thread Data Engineer
Hi, Suppose I have a working Flink cluster with 1 taskmanager and 1 jobmanager, and I have enabled checkpointing with, say, an interval of 1 minute. Now if I shut down the Flink cluster between checkpoints (say for some upgrade), will the JobManager automatically trigger a checkpoint before going down?

How to read from Cassandra using Apache Flink?

2018-06-05 Thread HarshithBolar
My Flink program should do a Cassandra lookup for each input record and, based on the results, do some further processing. But I'm currently stuck at reading data from Cassandra. This is the code snippet I've come up with so far. > ClusterBuilder secureCassandraSinkClusterBuilder = new Clu…

Re: Writing stream to Hadoop

2018-06-05 Thread Kostas Kloudas
Hi Miki, Have you enabled checkpointing? Kostas > On Jun 5, 2018, at 11:14 AM, miki haiat wrote: > > I'm trying to write some data to Hadoop using this code > > The state backend is set without time > StateBackend sb = new > FsStateBackend("hdfs://***:9000/flink/my_city/checkpoints"); …

Re: Writing stream to Hadoop

2018-06-05 Thread miki haiat
I saw the option of enabling checkpointing (enabling-and-configuring-checkpointing), but on 1.5 it says the method is deprecated, so I'm a bit confused. /** @deprecated …

Flink checkpoint failure issue

2018-06-05 Thread James (Jian Wu) [FDS Data Platform]
Hi: I am using a Flink streaming continuous query. Scenario: a Kafka connector consumes a topic, and the stream incrementally calculates 24 hours of window data, using processingTime as the TimeCharacteristic. I am using RocksDB as the StateBackend, the file system is HDFS, and the checkpoint interval is 5 minutes …
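For context, a sketch of the configuration described above — the HDFS path is a placeholder, and the `RocksDBStateBackend(String)` constructor declares `throws IOException`:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
env.enableCheckpointing(5 * 60 * 1000L);  // checkpoint every 5 minutes
// checkpoints go to HDFS, working state lives in local RocksDB instances
env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:9000/flink/checkpoints"));
```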

Re: Writing stream to Hadoop

2018-06-05 Thread Chesnay Schepler
This particular version of the method is deprecated; use enableCheckpointing(long checkpointingInterval) instead. On 05.06.2018 12:19, miki haiat wrote: I saw the option of enabling checkpoint enabling-and-configuring-checkpointing

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-06-05 Thread makeyang
Can anybody share any thoughts or insights about this issue? -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Writing stream to Hadoop

2018-06-05 Thread miki haiat
OMG, I missed it ... Thanks, Miki On Tue, Jun 5, 2018 at 1:30 PM Chesnay Schepler wrote: > This particular version of the method is deprecated, use > enableCheckpointing(long > checkpointingInterval) instead. > > On 05.06.2018 12:19, miki haiat wrote: > > I saw the option of enabling checkpoin…

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

2018-06-05 Thread Chesnay Schepler
Please look into high availability to make your cluster resilient to shutdowns. On 05.06.2018 12:31, makeyang wrote: can anybody share any thoughts, insights about this issue? -- Sent from: …

Re: Checkpointing on cluster shutdown

2018-06-05 Thread Chesnay Schepler
No checkpoint will be triggered when the cluster is shut down. For this case you will have to manually trigger a savepoint. If a TM goes down, it does not create a checkpoint. In these cases the job will be restarted from the last successful checkpoint. On 05.06.2018 12:01, Data Engineer wrote:
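In practice, the shutdown-for-upgrade flow described here looks roughly like this with the Flink CLI — a sketch; the job id, savepoint path, and jar name are placeholders:

```shell
# Take a savepoint and cancel the job in one step before the upgrade
bin/flink cancel --withSavepoint hdfs:///flink/savepoints <jobId>

# After the upgrade, resume the job from that savepoint
bin/flink run --fromSavepoint hdfs:///flink/savepoints/<savepoint-dir> my-job.jar
```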

Re: How to read from Cassandra using Apache Flink?

2018-06-05 Thread Chesnay Schepler
You are creating an entirely new sink for each fetch, which includes setting up a connection to Cassandra. It is not surprising that this is slow. The Cassandra input format was written for reading large amounts of data, not synchronous single-row fetches. You can try using the DataStax driver …
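A sketch of that suggestion: open one Cassandra session per parallel task in `open()` and reuse it for every lookup, instead of rebuilding a connection per record. The contact point, keyspace, table, and query are placeholders.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CassandraLookup extends RichFlatMapFunction<String, String> {

    private transient Cluster cluster;
    private transient Session session;

    @Override
    public void open(Configuration parameters) {
        // one connection per task, reused across all records
        cluster = Cluster.builder().addContactPoint("cassandra-host").build();
        session = cluster.connect("my_keyspace");
    }

    @Override
    public void flatMap(String key, Collector<String> out) {
        Row row = session.execute(
            "SELECT value FROM lookup_table WHERE id = ?", key).one();
        if (row != null) {
            out.collect(row.getString("value")); // emit only on a hit
        }
    }

    @Override
    public void close() {
        if (session != null) session.close();
        if (cluster != null) cluster.close();
    }
}
```

For higher throughput, the same session could be used from Flink's `AsyncDataStream` with the driver's `executeAsync`, so lookups don't block the pipeline.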

Re: Flink checkpoint failure issue

2018-06-05 Thread Chesnay Schepler
Can you provide us with the TaskManager logs? On 05.06.2018 12:30, James (Jian Wu) [FDS Data Platform] wrote: Hi: I am using Flink streaming continuous query. Scenario: Kafka-connector to consume a topic, and streaming incremental calculate 24 hours window data. And use processingTime a

Re: Checkpointing on cluster shutdown

2018-06-05 Thread Garvit Sharma
But the job should be terminated gracefully. Why is this behavior not there? On Tue, Jun 5, 2018 at 4:19 PM, Chesnay Schepler wrote: > No checkpoint will be triggered when the cluster is shutdown. For this > case you will have to manually trigger a savepoint. > > If a TM goes down it does not create …

Re: Checkpointing on cluster shutdown

2018-06-05 Thread Chesnay Schepler
If a TM goes down, any data generated after the last successful checkpoint cannot be guaranteed to be consistent across the cluster. Hence, this data is discarded and we go back to the last known consistent state, the last checkpoint that was successfully created. On 05.06.2018 13:06, Garvit Sha…

Implementing a “join” between a DataStream and a “set of rules”

2018-06-05 Thread Sandybayev, Turar (CAI - Atlanta)
Hi, What is the best practice recommendation for the following use case? We need to match a stream against a set of “rules”, which are essentially a Flink DataSet concept. Updates to this “rules set” are possible but not frequent. Each stream event must be checked against all the records in “ru…

Re: Implementing a “join” between a DataStream and a “set of rules”

2018-06-05 Thread Garvit Sharma
Hi, For the above use case, you should do the following: 1. Convert your DataStream into a KeyedStream by defining a key which would be validated against your rules. 2. Same as 1 for the rules stream. 3. Join the two keyed streams using Flink's connect operator. 4. Store the rules into …
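A sketch of steps 1–4, assuming hypothetical `DataEvent`/`RuleEvent` types that share a key field and a `matches()` method — wired up as `events.keyBy(e -> e.key).connect(rules.keyBy(r -> r.key)).process(new RuleMatcher())`:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class RuleMatcher extends CoProcessFunction<DataEvent, RuleEvent, String> {

    private transient ValueState<RuleEvent> ruleState;

    @Override
    public void open(Configuration parameters) {
        // keyed state: one rule per key, shared by both inputs
        ruleState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("rule", RuleEvent.class));
    }

    @Override
    public void processElement1(DataEvent event, Context ctx, Collector<String> out)
            throws Exception {
        RuleEvent rule = ruleState.value(); // rule previously stored for this key
        if (rule != null && rule.matches(event)) {
            out.collect("matched: " + event);
        }
    }

    @Override
    public void processElement2(RuleEvent rule, Context ctx, Collector<String> out)
            throws Exception {
        ruleState.update(rule); // latest rule for this key wins
    }
}
```

Note this keyed approach only matches an event against the rule(s) for its own key; matching every event against all rules needs a different pattern (see the broadcast-state suggestion later in this thread's replies).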

Re: Implementing a “join” between a DataStream and a “set of rules”

2018-06-05 Thread Amit Jain
Hi Sandybayev, In its current state, Flink does not provide a solution to the mentioned use case. However, there is an open FLIP [1][2] which has been created to address it. I can see that in your current approach you are not able to update the rule set data. I think you can update the rule set data b…

Re: PartitionNotFoundException after deployment

2018-06-05 Thread Nico Kruber
Hi Gyula, as a follow-up, you may be interested in https://issues.apache.org/jira/browse/FLINK-9413 Nico On 04/05/18 15:36, Gyula Fóra wrote: > Looks pretty clear that one operator takes too long to start (even on > the UI it shows it in the created state for far too long). Any idea what > might

Re: Cannot determine simple type name - [FLINK-7490]

2018-06-05 Thread rakeshchalasani
Thanks for the response, the issue does look similar to FLINK-9220. The code is part of our application, so I have to come up with an example. But the following are the steps that will likely reproduce the issue. 1. Define the UDF in a module, say in com.udf. 2. Create a topology using the above UDF in another …

Re: Implementing a “join” between a DataStream and a “set of rules”

2018-06-05 Thread Aljoscha Krettek
Hi, you might be interested in this newly-introduced feature: https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/state/broadcast_state.html Best, Aljoscha > On 5. Jun 2018,
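A condensed sketch of the linked broadcast-state pattern (Flink 1.5+), assuming hypothetical `Event`/`Rule` types: the rules stream is broadcast to all parallel instances, so every event can be checked against the full rule set.

```java
// descriptor for the broadcast state holding the rule set
MapStateDescriptor<String, Rule> ruleDesc =
        new MapStateDescriptor<>("rules", String.class, Rule.class);

BroadcastStream<Rule> ruleBroadcast = ruleStream.broadcast(ruleDesc);

DataStream<String> matches = eventStream
    .connect(ruleBroadcast)
    .process(new BroadcastProcessFunction<Event, Rule, String>() {

        @Override
        public void processElement(Event event, ReadOnlyContext ctx,
                                   Collector<String> out) throws Exception {
            // iterate over the (read-only) broadcast rule set for each event
            for (Map.Entry<String, Rule> entry :
                     ctx.getBroadcastState(ruleDesc).immutableEntries()) {
                if (entry.getValue().matches(event)) {
                    out.collect(entry.getKey());
                }
            }
        }

        @Override
        public void processBroadcastElement(Rule rule, Context ctx,
                                            Collector<String> out) throws Exception {
            ctx.getBroadcastState(ruleDesc).put(rule.getId(), rule); // update rule set
        }
    });
```

Rule updates arrive on the broadcast side and reach every parallel instance, which addresses the infrequent-updates requirement from the original question.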

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Aljoscha Krettek
Hi, what are you using as the FileSystem scheme? s3 or s3a? Also, could you also post the full stack trace, please? Best, Aljoscha > On 2. Jun 2018, at 07:34, Hao Sun wrote: > > I am trying to figure out how to use S3 as state storage. > The recommended way is > https://ci.apache.org/project

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Hao Sun
Thanks for picking up my question. I had s3a in the config; now I removed it. I will post a full trace soon, but want to get some questions answered to help me understand this better. 1. Can I use the presto lib with Flink 1.5 without the bundled hdp? Can I use this? http://www.apache.org/dyn/closer.lua/f

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Aljoscha Krettek
Hi, sorry, yes, you don't have to add any of the Hadoop dependencies. Everything that's needed comes in the presto s3 jar. You should use "s3:" as the prefix, the Presto S3 filesystem will not be used if you use s3a. And yes, you add config values to the flink config as s3.xxx. Best, Aljoscha

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Hao Sun
I do not have the S3A lib requirement anymore, but I got a new error. org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied Here are more logs: https://gist.github.com/zenhao/5f22ca084717bae3194555398dad332d Thanks On Tue, Jun 5, 2018 at 9:39 AM Al

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Hao Sun
also a follow up question. Can I use all properties here? Should I remove `hive.` for all the keys? https://prestodb.io/docs/current/connector/hive.html#hive-configuration-properties More specifically how I configure sse for s3? On Tue, Jun 5, 2018 at 11:33 AM Hao Sun wrote: > I do not have the

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Hao Sun
After I added these to my flink-conf.yml, everything works now. s3.sse.enabled: true s3.sse.type: S3 Thanks for the help! In general I also want to know what config keys for presto-s3 I can use. On Tue, Jun 5, 2018 at 11:43 AM Hao Sun wrote: > also a follow up question. Can I use all properti
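For reference, the Presto-backed S3 filesystem takes Presto's `hive.s3.*` options with the `hive.` prefix dropped, i.e. as `s3.*` keys in the Flink configuration. A sketch combining the settings from this thread — the commented KMS line is an assumed mapping, only relevant when the SSE type is KMS:

```yaml
# flink-conf.yaml — Presto S3 filesystem settings (sketch)
s3.sse.enabled: true
s3.sse.type: S3               # SSE-S3; "KMS" is the other option
# s3.sse.kms-key-id: <key-id> # assumed mapping of hive.s3.sse.kms-key-id
```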

IoT Use Case, Problem and Thoughts

2018-06-05 Thread ashish pok
Fabian, Stephan, All, I started a discussion a while back around having a form of event-based checkpointing policy that will help us with some of our high-volume data pipelines. Here is an effort to put this in front of the community and understand what capabilities can support these types of use cases

Re: Implementing a “join” between a DataStream and a “set of rules”

2018-06-05 Thread Sandybayev, Turar (CAI - Atlanta)
Thanks Garvit for your suggestion! From: Garvit Sharma Date: Tuesday, June 5, 2018 at 8:44 AM To: "Sandybayev, Turar (CAI - Atlanta)" Cc: "user@flink.apache.org" Subject: Re: Implementing a “join” between a DataStream and a “set of rules” Hi, For the above use case, you should do the followin

Re: Implementing a “join” between a DataStream and a “set of rules”

2018-06-05 Thread Sandybayev, Turar (CAI - Atlanta)
Hi Aljoscha, Thank you, this seems like a match for this use case. Am I understanding correctly that since only MemoryStateBackend is available for broadcast state, the max amount possible is 5MB? If I use Flink state mechanism for storing rules, I will still need to iterate through all rules

Re: Implementing a “join” between a DataStream and a “set of rules”

2018-06-05 Thread Sandybayev, Turar (CAI - Atlanta)
Hi Amit, In my current approach the idea for updating rule set data was to have some kind of a "control" stream that will trigger an update to a local data structure, or a "control" event within the main data stream that will trigger the same. Using external system like a cache or database is

Re: [DISCUSS] Flink 1.6 features

2018-06-05 Thread Hao Sun
adding my vote to K8S Job mode, maybe it is this? > Smoothen the integration in Container environment, like "Flink as a Library", and easier integration with Kubernetes services and other proxies. On Mon, Jun 4, 2018 at 11:01 PM Ben Yan wrote: > Hi Stephan, > > Will [ https://issues.apache.or

guidelines for setting parallelism in operations/job?

2018-06-05 Thread chrisr123
Hello, I'm trying to get some simple rules or guidelines for what values to set for operator or job parallelism. It would seem to me that it should be a number <= the number of available task slots? For example, suppose I have 2 task manager machines, each with 4 task slots. Assuming no other job
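The arithmetic behind that rule of thumb is simply that total slots bound the maximum useful job parallelism. A tiny illustration using the numbers from the question (2 TaskManagers with 4 slots each):

```java
public class SlotMath {

    // total task slots = TaskManagers x slots per TaskManager
    static int totalSlots(int taskManagers, int slotsPerTm) {
        return taskManagers * slotsPerTm;
    }

    public static void main(String[] args) {
        int slots = totalSlots(2, 4);
        // With no other jobs running, parallelism up to this value can be
        // scheduled; a higher setting leaves subtasks without slots.
        System.out.println(slots); // prints 8
    }
}
```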

Re: [DISCUSS] Flink 1.6 features

2018-06-05 Thread Rico Bergmann
+1 on K8s integration > Am 06.06.2018 um 00:01 schrieb Hao Sun : > > adding my vote to K8S Job mode, maybe it is this? > > Smoothen the integration in Container environment, like "Flink as a > > Library", and easier integration with Kubernetes services and other proxies. > > > >> On Mon, J

Re: [DISCUSS] Flink 1.6 features

2018-06-05 Thread Yan Zhou [FDS Science]
+1 on https://issues.apache.org/jira/browse/FLINK-5479 (“Per-partition watermarks in …”, issues.apache.org). Reported in ML: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kafka-topic-partition-skewness-causes-waterm

Can not run scala-shell in yarn mode in flink 1.5

2018-06-05 Thread Jeff Zhang
I tried to run the scala-shell in yarn mode in 1.5, but hit the following error. I can run it successfully in 1.4.2. It is the same even when I change the mode to legacy. Is this a known issue, or did something change in 1.5? Thanks. Command I use: bin/start-scala-shell.sh yarn -n 1 Starting Flink Shell:

Any remote opportunity to work on Flink project?

2018-06-05 Thread Deepak Sharma
Hi Flink Users, Sorry to spam your inbox, and GM all. I am looking for an opportunity to work on a Flink project, specifically if it's Flink ML over streaming. Please do let me know if anyone is looking for freelancers for any of their Flink projects. -- Thanks Deepak

Re: Any remote opportunity to work on Flink project?

2018-06-05 Thread Garvit Sharma
Flink is OpenSource!! On Wed, Jun 6, 2018 at 10:45 AM, Deepak Sharma wrote: > Hi Flink Users, > Sorry to spam your inbox and GM all. > I am looking for opportunity to work on Flink project , specifically if > its Flink ML over streaming > Please do let me know if anyone is looking for freelancer

Re: Any remote opportunity to work on Flink project?

2018-06-05 Thread Christophe Salperwyck
Still, some people are interested in paying people to build a product around Flink :-) Interested too in Flink and online ML! Cheers, Christophe On Wed, 6 Jun 2018 at 07:40, Garvit Sharma wrote: > Flink is OpenSource!! > > On Wed, Jun 6, 2018 at 10:45 AM, Deepak Sharma > wrote: > >> Hi Flink U

Re: Any remote opportunity to work on Flink project?

2018-06-05 Thread Garvit Sharma
If that's the case, then I too would be interested in building a product around Flink :). Please let me know. On Wed, Jun 6, 2018 at 11:21 AM, Christophe Salperwyck < christophe.salperw...@gmail.com> wrote: > Still some people are interested to pay people to build a product around > Flink :-) > > I