Re: Flink ML

2020-06-17 Thread Jark Wu
Currently, FLIP-39 is mainly driven by Becket and his team. I'm including him; maybe he can answer your question. Best, Jark On Wed, 17 Jun 2020 at 23:00, Piotr Nowojski wrote: > Hi, > > It looks like FLIP-39 is only partially implemented as of now [1], so I’m > not sure which features are alr

Re: Flink plugin File System for GCS

2020-06-17 Thread Yang Wang
Hi Alex, Building a fat jar is good practice for a Flink filesystem plugin, just like flink-s3-fs-hadoop, flink-s3-fs-presto, flink-azure-fs-hadoop and flink-oss-fs-hadoop. All the provided filesystem plugins are self-contained, which means you need to bundle Hadoop in your fat jar. The reason w

Re: Convert flink table with field of type RAW to datastream

2020-06-17 Thread Jark Wu
Flink SQL/Table needs to know the field data types explicitly. Maybe you can apply a MapFunction before `toTable` to convert/normalize the data and type. Best, Jark On Thu, 18 Jun 2020 at 14:12, YI wrote: > Hi Jark, > > Thank you for your suggestion. My current problem is that there are quit
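
As a rough illustration of that suggestion, here is a minimal sketch of normalizing a java.util.Date field to java.sql.Timestamp in a MapFunction before converting the stream to a Table. The UpstreamEvent/NormalizedEvent POJOs are hypothetical, and the Java tableEnv.fromDataStream call stands in for the Scala toTable:

import java.sql.Timestamp;

import org.apache.flink.api.common.functions.MapFunction;

public class NormalizeDates {

    // Hypothetical upstream type with a java.util.Date field.
    public static class UpstreamEvent {
        public String id;
        public java.util.Date createdAt;
    }

    // Normalized counterpart using java.sql.Timestamp, which the Table layer maps to a SQL time type.
    public static class NormalizedEvent {
        public String id;
        public Timestamp createdAt;
    }

    public static class ToSqlTimestamp implements MapFunction<UpstreamEvent, NormalizedEvent> {
        @Override
        public NormalizedEvent map(UpstreamEvent e) {
            NormalizedEvent out = new NormalizedEvent();
            out.id = e.id;
            out.createdAt = new Timestamp(e.createdAt.getTime()); // java.util.Date -> java.sql.Timestamp
            return out;
        }
    }

    // Usage (Java Table API; the Scala `toTable` has the same effect):
    //   Table t = tableEnv.fromDataStream(events.map(new ToSqlTimestamp()));
}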

Re: Convert flink table with field of type RAW to datastream

2020-06-17 Thread YI
Hi Jark, Thank you for your suggestion. My current problem is that there are quite a few data types. All these data types are defined upstream, and I have no control over them. I don't think I can easily change the type information of a specific field. Can I? Things become nasty when there are so many `j

Re: Writing to S3 parquet files in Blink batch mode. Flink 1.10

2020-06-17 Thread Dmytro Dragan
Hi Jingsong, Thank you for the detailed clarification. Best regards, Dmytro Dragan | dd...@softserveinc.com | Lead Big Data Engineer | Big Data & Analytics | SoftServe From: Jingsong Li Sent: Thursday, June 18, 2020 4:58:22 AM To: Dmytro Dragan Cc: user@flink.apac

Re: Trouble with large state

2020-06-17 Thread Yun Tang
Hi Jeff 1. "after around 50GB of state, I stop being able to reliably take checkpoints or savepoints." What is the exact reason that the job cannot complete checkpoints? Expired before completing, or declined by some tasks? The former one is mainly caused by high back-pressure and the latter one i
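
If the failures are checkpoints expiring under back-pressure, one knob to rule out is the checkpoint timeout. A minimal sketch, assuming the job is free to tune these values (the numbers are illustrative, not recommendations):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(5 * 60 * 1000L);                              // checkpoint every 5 minutes
        env.getCheckpointConfig().setCheckpointTimeout(30 * 60 * 1000L);      // let slow checkpoints finish instead of expiring
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000L);  // give the job room to catch up between checkpoints
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // ... build the pipeline and call env.execute(...) as usual
    }
}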

Flink plugin File System for GCS

2020-06-17 Thread Alexander Filipchik
Hello, I'm trying to implement a Flink-native FS for GCS which can be used with a streaming file sink. I used the S3 one as a reference and made it work locally. However, it fails to load when I deploy it to the cluster. If I put Hadoop in the fat jar I get: Caused by: java.lang.LinkageError: loader c
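
For context, the core of such a plugin is a FileSystemFactory for the gs:// scheme. The sketch below is only an assumption of how it might look (class names, wiring, and the use of the Hadoop GCS connector's GoogleHadoopFileSystem are illustrative, not the poster's actual code); the fat-jar/relocation concerns discussed in this thread apply to the Hadoop classes it pulls in:

import java.io.IOException;
import java.net.URI;

import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.FileSystemFactory;
import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;

import com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem;

// Hypothetical factory for a "gs://" plugin, wrapping the Hadoop GCS connector
// in Flink's HadoopFileSystem adapter, similar to how flink-s3-fs-hadoop wraps S3A.
public class GcsFileSystemFactory implements FileSystemFactory {

    @Override
    public String getScheme() {
        return "gs";
    }

    @Override
    public FileSystem create(URI fsUri) throws IOException {
        org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration();
        GoogleHadoopFileSystem gcs = new GoogleHadoopFileSystem();
        gcs.initialize(fsUri, hadoopConf);
        return new HadoopFileSystem(gcs);
    }

    // No-op configuration hook (kept explicit here; harmless if the interface
    // already provides a default implementation).
    public void configure(org.apache.flink.configuration.Configuration config) {
    }
}

// Registered via META-INF/services/org.apache.flink.core.fs.FileSystemFactory and deployed as a
// self-contained jar under plugins/<name>/ on every node, with the Hadoop and GCS connector
// dependencies bundled (and ideally relocated) inside it.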

Re: Convert flink table with field of type RAW to datastream

2020-06-17 Thread Jark Wu
Hi YI, Flink doesn't have a TypeInformation for `java.util.Date`, only SqlTimeTypeInfo.DATE for `java.sql.Date`. That's why TypeInformation.of(java.util.Date) is recognized as a RAW type. To resolve your problem, I think in `TypeInformation.of(..)` you should use a concrete type for
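
To make the distinction concrete, a small sketch of the two TypeInformation choices (the SQL time constants are the ones the Table layer understands):

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;

public class DateTypeInfoSketch {
    public static void main(String[] args) {
        // Treated as a RAW type by the Table layer: there is no SQL type mapping for java.util.Date here.
        TypeInformation<java.util.Date> utilDate = TypeInformation.of(java.util.Date.class);

        // SQL time types the Table layer does understand, backed by java.sql.Date / java.sql.Timestamp.
        TypeInformation<java.sql.Date> sqlDate = Types.SQL_DATE;
        TypeInformation<java.sql.Timestamp> sqlTimestamp = Types.SQL_TIMESTAMP;

        System.out.println(utilDate + " vs " + sqlDate + " / " + sqlTimestamp);
    }
}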

Re: Writing to S3 parquet files in Blink batch mode. Flink 1.10

2020-06-17 Thread Jingsong Li
Hi Dmytro, Yes, batch mode must disable checkpointing, so StreamingFileSink cannot be used in batch mode (StreamingFileSink requires checkpointing regardless of the format). We are refactoring it to be more generic so that it can be used in batch mode, but that is a future topic. Currently, in batch mode, for sink, we

Re: Implications on incremental checkpoint duration based on RocksDBStateBackend, Number of Slots per TM

2020-06-17 Thread Yun Tang
Hi Sameer, If you only have one disk per TM, 10 TMs can use at most 10 disks while 100 TMs can use at most 100 disks. The sync checkpoint phase of RocksDB needs to write to disk, and if you can distribute the write pressure over more disks, you get better performance, which is what
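
Related to spreading the write load: when a TM does have several local disks, the RocksDB backend can be pointed at all of them. A minimal sketch, with illustrative paths and checkpoint URI:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbDisksSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental checkpoints to a durable location (URI is an example).
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true);

        // Spread the local RocksDB working directories over several disks (paths are examples);
        // Flink spreads the local RocksDB instances across these directories.
        backend.setDbStoragePaths("/disk1/flink-rocksdb", "/disk2/flink-rocksdb");

        env.setStateBackend(backend);
        // ... build the pipeline and call env.execute(...) as usual
    }
}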

Convert flink table with field of type RAW to datastream

2020-06-17 Thread YI
Hi all, I am using Flink to process external data. The source format is JSON, and the underlying data types are defined in an external library. I generated the table schema with `TableSchema.fromTypeInfo` and `TypeInformation.of[_]`. From what I read, this method is deprecated. But I didn't find any

Re: Blink Planner Retracting Streams

2020-06-17 Thread Jark Wu
Hi John, Maybe I misunderstand something, but CRow doesn't have a `getSchema()` method. You can call getSchema() on the Table, and this also works if you convert the table into Tuple2. Actually, there is no big difference between CRow and Tuple2; they both wrap the change flag and the Row. Best, Jark
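
A short sketch of that pattern on the Blink planner (the query and table are made up): the schema comes from the Table itself, and the retract stream carries the change flag in f0:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class RetractStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(
                env, EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build());

        // Tiny in-memory source so the example is self-contained.
        DataStream<Tuple2<String, Integer>> clicks = env.fromElements(Tuple2.of("u1", 1), Tuple2.of("u2", 1));
        tableEnv.createTemporaryView("clicks", tableEnv.fromDataStream(clicks, "user_id, v"));

        Table result = tableEnv.sqlQuery("SELECT user_id, COUNT(*) AS cnt FROM clicks GROUP BY user_id");

        // Schema of the produced rows, independent of the change flag.
        TableSchema schema = result.getSchema();

        // f0 == true -> insert/accumulate message, f0 == false -> retraction of an earlier row.
        DataStream<Tuple2<Boolean, Row>> retractStream = tableEnv.toRetractStream(result, Row.class);

        retractStream.print();
        env.execute("retract-stream-sketch");
    }
}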

Interact with different S3 buckets from a shared Flink cluster

2020-06-17 Thread Ricardo Cardante
Hi! We are working on a use case where we have a shared Flink cluster to deploy multiple jobs from different teams. With this strategy, we are facing a challenge regarding the interaction with S3. Given that we already configured S3 for the state backend (through flink-conf.yaml), every tim

Re: Blink Planner Retracting Streams

2020-06-17 Thread John Mathews
Hello Godfrey, Thanks for the response! I think the problem with Tuple2 is that, if my understanding of how CRow worked is correct, CRow's getSchema() would return the underlying schema of the row it contained. Tuple2 doesn't do that, so it changes/breaks a lot of our downstr

Implications on incremental checkpoint duration based on RocksDBStateBackend, Number of Slots per TM

2020-06-17 Thread Sameer W
Hi, The number of RocksDB databases that Flink creates is equal to the number of operator states multiplied by the number of slots. Assuming a parallelism of 100 for a job which is executed on 100 TMs with 1 slot per TM, as opposed to 10 TMs with 10 slots per TM, I have noticed that the former con

Re: Unable to run flink job in dataproc cluster with jobmanager provided

2020-06-17 Thread Till Rohrmann
Hi Sourabh, do you have access to the cluster logs? They could be helpful for debugging the problem. Which version of Flink are you using? Cheers, Till On Wed, Jun 17, 2020 at 7:39 PM Sourabh Mehta wrote: > No, I am not. > > On Wed, 17 Jun 2020 at 10:48 PM, Chesnay Schepler > wrote: > >> Are

Trouble with large state

2020-06-17 Thread Jeff Henrikson
Hello Flink users, I have an application with around 10 enrichment joins. All events are read from Kafka and have event timestamps. The joins are built using .cogroup, with a global window, triggering on every event, plus a custom evictor that drops records once a newer record for the same I
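
For readers following along, a stripped-down sketch of that join shape (hypothetical Tuple2 event types keyed on f0; the custom evictor that drops superseded records is omitted here):

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.util.Collector;

public class EnrichmentJoinSketch {

    public static DataStream<Tuple2<String, String>> enrich(
            DataStream<Tuple2<String, String>> left,
            DataStream<Tuple2<String, String>> right) {

        return left
                .coGroup(right)
                .where(e -> e.f0, Types.STRING)
                .equalTo(e -> e.f0, Types.STRING)
                .window(GlobalWindows.create())
                // GlobalWindows never fires on its own, so fire on every incoming element;
                // without an evictor the window contents grow forever, hence the custom
                // evictor in the original job.
                .trigger(CountTrigger.of(1))
                .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, Tuple2<String, String>>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, String>> lhs,
                                        Iterable<Tuple2<String, String>> rhs,
                                        Collector<Tuple2<String, String>> out) {
                        for (Tuple2<String, String> l : lhs) {
                            for (Tuple2<String, String> r : rhs) {
                                out.collect(Tuple2.of(l.f0, l.f1 + "|" + r.f1));
                            }
                        }
                    }
                });
    }
}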

Re: Unable to run flink job in dataproc cluster with jobmanager provided

2020-06-17 Thread Sourabh Mehta
No, I am not. On Wed, 17 Jun 2020 at 10:48 PM, Chesnay Schepler wrote: > Are you by any chance creating a local environment via > (Stream)ExecutionEnvironment#createLocalEnvironment? > > On 17/06/2020 17:05, Sourabh Mehta wrote: > > Hi Team, > > I'm exploring Flink for one of my use cases, I'm f

Re: flink-s3-fs-hadoop retry configuration

2020-06-17 Thread Jeff Henrikson
Robert, Thanks for the tip! Before you replied, I did figure out how to put the keys in flink-conf.yaml, using btrace. I instrumented the methods org.apache.hadoop.conf.Configuration.get for the keys, and org.apache.hadoop.conf.Configuration.substituteVars for the effective values. (There is a btr

Re: Unable to run flink job in dataproc cluster with jobmanager provided

2020-06-17 Thread Chesnay Schepler
Are you by any chance creating a local environment via (Stream)ExecutionEnvironment#createLocalEnvironment? On 17/06/2020 17:05, Sourabh Mehta wrote: Hi Team, I'm exploring Flink for one of my use cases; I'm facing some issues while running a Flink job in cluster mode. Below are the steps I
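
For reference, a minimal sketch of the difference being asked about (nothing here is from the poster's job):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnvChoiceSketch {
    public static void main(String[] args) throws Exception {
        // getExecutionEnvironment() picks up the cluster context when the jar is
        // submitted with `flink run`, and only falls back to an embedded mini
        // cluster when the program is started standalone.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // createLocalEnvironment(), by contrast, always starts an embedded cluster
        // inside the client JVM and ignores the cluster the job was submitted to:
        //   StreamExecutionEnvironment local = StreamExecutionEnvironment.createLocalEnvironment();

        env.fromElements(1, 2, 3).print();
        env.execute("env-choice-sketch");
    }
}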

Unable to run flink job in dataproc cluster with jobmanager provided

2020-06-17 Thread Sourabh Mehta
Hi Team, I'm exploring Flink for one of my use cases; I'm facing some issues while running a Flink job in cluster mode. Below are the steps I followed to set up and run the job in cluster mode: 1. Set up Flink on Google Cloud Dataproc using https://github.com/GoogleCloudDataproc/initialization-actions/

Re: Flink ML

2020-06-17 Thread Piotr Nowojski
Hi, It looks like FLIP-39 is only partially implemented as of now [1], so I’m not sure which features are already done. I’m including Shaoxuan Wang in this thread; maybe he will be able to better answer your question. Piotrek [1] https://issues.apache.org/jira/browse/FLINK-12470

Re: Does Flink support reading files or CSV files from java.io.InputStream instead of file paths?

2020-06-17 Thread Marco Villalobos
While I still think it would be great for Flink to accept an InputStream and let the programmer decide whether it is a remote TCP call or a local file, for the sake of my demo I simply found the file path within Gradle and supplied it to the Gradle application run plugin like this: run { args
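
On the Flink side, a path handed over that way is consumed as an ordinary program argument. A small sketch under the assumption that the argument is named "--input" (hypothetical):

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CsvPathFromArgs {
    public static void main(String[] args) throws Exception {
        // The file path arrives as a program argument (name is hypothetical),
        // e.g. passed through the Gradle application plugin's `run { args ... }` block.
        ParameterTool params = ParameterTool.fromArgs(args);
        String csvPath = params.getRequired("input");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Flink reads from a path (local file or DFS URI) rather than an InputStream.
        env.readTextFile(csvPath).print();
        env.execute("csv-from-path");
    }
}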

Re: Running Kubernetes on Flink with Savepoint

2020-06-17 Thread Matt Magsombol
Yeah, our setup is a bit outdated (since Flink 1.7-ish) but we're effectively just using helm templates... when upgrading to 1.10, I just ended up looking at diffs and changelogs for changes... Anyway, thanks, I was hoping that Flink had a community-supported way of doing this, but I think

Re: Shared state between two process functions

2020-06-17 Thread Congxian Qiu
Hi, Maybe you can take a look at broadcast state [1]. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/state/broadcast_state.html Best, Congxian Robert Metzger wrote on Tue, Jun 16, 2020 at 2:18 AM: > Thanks for sharing some details on the use case: Are you able to move the > common
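
A condensed sketch of the broadcast state pattern the link describes, with made-up String types for both the main stream and the shared/config stream:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastStateSketch {

    // Descriptor for the state that is shared with every parallel instance.
    static final MapStateDescriptor<String, String> SHARED =
            new MapStateDescriptor<>("shared", Types.STRING, Types.STRING);

    public static DataStream<String> wire(DataStream<String> events, DataStream<String> updates) {
        // Broadcast the low-volume "shared" stream to all subtasks of the main stream.
        BroadcastStream<String> broadcastUpdates = updates.broadcast(SHARED);

        return events
                .connect(broadcastUpdates)
                .process(new BroadcastProcessFunction<String, String, String>() {
                    @Override
                    public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                        // Read-only view of the broadcast state on the main stream side.
                        String shared = ctx.getBroadcastState(SHARED).get("latest");
                        out.collect(value + " / " + shared);
                    }

                    @Override
                    public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
                        // Every subtask receives the update and writes it into the broadcast state.
                        ctx.getBroadcastState(SHARED).put("latest", value);
                    }
                });
    }
}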

Re: Is State TTL possible with event-time characteristics ?

2020-06-17 Thread Andrey Zagrebin
Hi Arti, Any program can use state with TTL, but at the moment the state can only expire in processing time, even if you configure event-time characteristics. As Congxian mentioned, event-time support for TTL is planned. The cleanup section says that it will not be removed 'by default'. The following sections
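
For completeness, a minimal sketch of enabling processing-time TTL on a state descriptor, with an explicit cleanup strategy (names and values are illustrative):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.Types;

public class StateTtlSketch {
    public static ValueStateDescriptor<String> descriptorWithTtl() {
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.hours(24))                                        // TTL measured in processing time
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .cleanupFullSnapshot()                                             // one of the explicit cleanup strategies
                .build();

        ValueStateDescriptor<String> descriptor = new ValueStateDescriptor<>("last-event", Types.STRING);
        descriptor.enableTimeToLive(ttl);
        return descriptor;
    }
}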

Re: Improved performance when using incremental checkpoints

2020-06-17 Thread Congxian Qiu
Hi Nick, The result is a bit weird. Did you compare the disk util/performance before and after enabling checkpoints? Best, Congxian Yun Tang wrote on Wed, Jun 17, 2020 at 8:56 PM: > Hi Nick > > I think this thread uses the same program as the one discussed in the > "MapState bad performance" thread. > Please provide a simple pr

Re: Improved performance when using incremental checkpoints

2020-06-17 Thread Yun Tang
Hi Nick, I think this thread uses the same program as the one discussed in the "MapState bad performance" thread. Please provide a simple program which can reproduce this so that we can help you more. Best Yun Tang From: Aljoscha Krettek Sent: Tuesday, June 16, 2020 19:53 To: us

Re: Is State TTL possible with event-time characteristics ?

2020-06-17 Thread Congxian Qiu
Hi, Currently, Flink does not support event-time TTL state; there is an issue [1] tracking this. [1] https://issues.apache.org/jira/browse/FLINK-12005 Best, Congxian Arti Pande wrote on Wed, Jun 17, 2020 at 7:37 PM: > Is state TTL supported for event-time characteristics with Flink 1.9? This > part >

Is State TTL possible with event-time characteristics ?

2020-06-17 Thread Arti Pande
Is state TTL supported for event-time characteristics with Flink 1.9? This part of the documentation says that - Only TTLs in reference to *processing time* are currently suppor

Re: [ANNOUNCE] Yu Li is now part of the Flink PMC

2020-06-17 Thread Fabian Hueske
Congrats Yu! Cheers, Fabian On Wed., June 17, 2020 at 10:20 AM, Till Rohrmann <trohrm...@apache.org> wrote: > Congratulations Yu! > > Cheers, > Till > > On Wed, Jun 17, 2020 at 7:53 AM Jingsong Li > wrote: > > > Congratulations Yu, well deserved! > > > > Best, > > Jingsong > > > > On Wed, Jun

Re: [ANNOUNCE] Yu Li is now part of the Flink PMC

2020-06-17 Thread Till Rohrmann
Congratulations Yu! Cheers, Till On Wed, Jun 17, 2020 at 7:53 AM Jingsong Li wrote: > Congratulations Yu, well deserved! > > Best, > Jingsong > > On Wed, Jun 17, 2020 at 1:42 PM Yuan Mei wrote: > >> Congrats, Yu! >> >> GXGX & well deserved!! >> >> Best Regards, >> >> Yuan >> >> On Wed, Jun 17,

Re: Kinesis ProvisionedThroughputExceededException

2020-06-17 Thread Roman Grebennikov
Hi, It will occur if your job reaches SHARD_GETRECORDS_RETRIES consecutive failed attempts to pull data from Kinesis. So if you scale up the topic in Kinesis and tune the backoff parameters a bit, you will lower the probability of this exception almost to zero (but with increased costs and w
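
As an illustration of the backoff parameters being referred to, a sketch of a FlinkKinesisConsumer configured with more forgiving getRecords settings (stream name, region, and values are made up):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class KinesisBackoffSketch {
    public static FlinkKinesisConsumer<String> consumer() {
        Properties props = new Properties();
        props.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");

        // Retry more often before giving up with ProvisionedThroughputExceededException ...
        props.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_RETRIES, "10");
        // ... and back off between getRecords attempts.
        props.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_BASE, "500");
        props.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_MAX, "10000");
        // Poll each shard less aggressively to stay under the per-shard read limit.
        props.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_INTERVAL_MILLIS, "500");

        return new FlinkKinesisConsumer<>("my-stream", new SimpleStringSchema(), props);
    }
}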