RE: EXT: Re: Issues in Batch Jobs Submission for a Session Cluster

2021-12-02 Thread Ghiya, Jay (GE Healthcare)
Thanks for prompt response. Understood @David Morávek. Will record cpu and mem usage from Kubernetes metrics Grafana dashboard of job managers and task managers when this happens and share here. If there is anything abnormal then we can get the jvm metrics for e

Re: Pyflink/Flink Java parquet streaming file sink for a dynamic schema stream

2021-12-02 Thread Georg Heiler
Do the JSONs have the same schema overall? Or is each potentially structured differently? Best, Georg Am Fr., 3. Dez. 2021 um 00:12 Uhr schrieb Kamil ty : > Hello, > > I'm wondering if there is a possibility to create a parquet streaming file > sink in Pyflink (in Table API) or in Java Flink (in

Pyflink/Flink Java parquet streaming file sink for a dynamic schema stream

2021-12-02 Thread Kamil ty
Hello, I'm wondering if there is a possibility to create a parquet streaming file sink in Pyflink (in Table API) or in Java Flink (in Datastream api). To give an example of the expected behaviour. Each element of the stream is going to contain a json string. I want to save this stream to parquet

Cannot consum from Kinesalite using FlinkKinesisConsumer

2021-12-02 Thread jonas eyob
Hi all, I have a really simple pipeline to consume events from a local kinesis (kinesalite) and print them out to stdout. But struggling to make sense of why it's failing almost immediately The pipeline code: /* Added this to verify it wasn't a problem with AWS CBOR which needs to be disabled */

Re: GCS/Object Storage Rate Limiting

2021-12-02 Thread Kevin Lam
HI David, Thanks for your response. What's the DefaultScheduler you're referring to? Is that available in Flink 1.13.1 (the version we are using)? How large is the state you're restoring from / how many TMs does the job > consume / what is the parallelism? Our checkpoint is about 900GB, and we

Re: custom metrics in elasticsearch ActionRequestFailureHandler

2021-12-02 Thread Lars Bachmann
Hi David, Thanks for the reply. I think especially in an error/failure handler metrics are important in order to have proper monitoring/alerting in such cases. Would be awesome if this could be added to Flink at some point :). Regards, Lars > Am 02.12.2021 um 18:13 schrieb David Morávek : >

Re: Buffering when connecting streams

2021-12-02 Thread David Morávek
I think this could happen, but I have a very limited knowledge about how the input gates work internally. @Piotr could definitely provide some more insight here. D. On Thu, Dec 2, 2021 at 5:54 PM Alexis Sarda-Espinosa < alexis.sarda-espin...@microfocus.com> wrote: > I do have some logic with tim

Re: custom metrics in elasticsearch ActionRequestFailureHandler

2021-12-02 Thread David Morávek
Hi Lars, quickly looking at the ES connector code, I think you're right and there is no way to do that :( In general I'd say that being able to expose metrics is a valid request. I can imagine having some kind of `RichActionRequestFailureHandler` with `{get|set}RuntimeContext` methods. More or l

Re: Watermark behavior when connecting streams

2021-12-02 Thread David Morávek
The input one are ignored and *replaced* by the new one you've calculated based on incoming data and their timestamps. Any downstream windowing operations will trigger based on this newly calculated watermark. In general, I'd say that if you don't have any special use-case that requires this (I ca

RE: Watermark behavior when connecting streams

2021-12-02 Thread Alexis Sarda-Espinosa
Hi David, If watermarks are ignored how do Consecutive windowed operations [1] work? I’m just trying to understand in which scenarios I need to assign timestamps and watermarks, or if it’s enough if I do it once near the beginning of the DAG (assuming the source doesn’t do it). https://nightli

RE: Buffering when connecting streams

2021-12-02 Thread Alexis Sarda-Espinosa
I do have some logic with timers today, but it’s indeed not ideal. I guess I’ll have a look at TwoInputStreamOperator, but I do have related questions. You mentioned a sample scenario of "processing backlog" where windows fire very quickly; could it happen that, in such a situation, the framewor

custom metrics in elasticsearch ActionRequestFailureHandler

2021-12-02 Thread lars . bachmann
Hi, is there a way to expose custom metrics within an elasticsearch failure handler (ActionRequestFailureHandler)? To register custom metrics I need access to the runtime context but I don't see a way to access the context in the failure handler. Thanks and regards, Lars

Re: Watermark behavior when connecting streams

2021-12-02 Thread David Morávek
Hi Alexis, please take a look at AbstractStreamOperator [1] for details how the watermark is calculate for TwoInputOperator. It uses pretty much the same approach as for with the single input one (it simply takes a minimum). For watermark re-assignment, we ignore input watermark unless it's Long.

Re: Buffering when connecting streams

2021-12-02 Thread David Morávek
Even with the TwoInputStreamOperator you can not "halt" the processing. You need to buffer these elements for example in the ListState for later processing. At the time the watermark of the second stream arrives, you can process all buffered elements that satisfy the condition. You could probably

Re: Broadcast and watermark

2021-12-02 Thread David Morávek
One more thought, if you're "broadcasting" the output of the KafkaSource, it may as well be the case that some partition is empty? Best, D. On Thu, Dec 2, 2021 at 5:11 PM David Morávek wrote: > Hi Sweta, > > the output timestamp seems reasonable to me. I guess you're concerned > about watermark

Re: Broadcast and watermark

2021-12-02 Thread David Morávek
Hi Sweta, the output timestamp seems reasonable to me. I guess you're concerned about watermarks you're seeing, is that correct? final Instant min = Instant.ofEpochMilli(Long.MIN_VALUE); final Instant max = Instant.ofEpochMilli(Long.MAX_VALUE); System.out.printf("Min: %s, Max: %s%n", min, max);

RE: Buffering when connecting streams

2021-12-02 Thread Alexis Sarda-Espinosa
Yes, that sounds right, but with my current KeyedCoProcessFunction I can’t tell Flink to "halt" processElement1 and switch to the other stream depending on watermarks. I could look into TwoInputStreamOperator if you think that’s the best approach. Regards, Alexis. From: David Morávek Sent: Do

Re: Buffering when connecting streams

2021-12-02 Thread David Morávek
I think this would require using lower level API and implementing a custom `TwoInputStreamOperator`. Then you can hook to `processWatemark{1,2}` methods. Let's also make sure we're on the same page on what the watermark is. You can think of the watermark as event time clock. It basically gives you

Re: GCS/Object Storage Rate Limiting

2021-12-02 Thread David Morávek
Hi Kevin, this happens only when the pipeline is started up from savepoint / retained checkpoint right? Guessing from the "path" you've shared it seems like a RockDB based retained checkpoint. In this case all task managers need to pull state files from the object storage in order to restore. This

GCS/Object Storage Rate Limiting

2021-12-02 Thread Kevin Lam
Hi all, We're running a large (256 task managers with 4 task slots each) Flink Cluster with High Availability enabled, on Kubernetes, and use Google Cloud Storage (GCS) as our object storage for the HA metadata. In addition, our Flink application writes out to GCS from one of its sinks via streami

Re: Unable to create new native thread error

2021-12-02 Thread David Morávek
Hi Ilan, we are aware of multiple issues when web-submission can result in classloader / thread local leaks, which could potentially result in the behavior you're describing. We're working on addressing them. FLINK-25022 [1]: The most critical one leaking thread locals. FLINK-25027 [2]: Is only a

RE: Buffering when connecting streams

2021-12-02 Thread Alexis Sarda-Espinosa
Could you elaborate on what you mean with synchronize? Buffering in the state would be fine, but I haven’t been able to come up with a good way of ensuring that all data from the side stream for a given minute is processed by processElement2 before all data for the same (windowed) minute reaches

Broadcast and watermark

2021-12-02 Thread Sweta Kalakuntla
Hi, I am using a broadcast pattern for publishing rules and aggregating the data(https://flink.apache.org/news/2020/01/15/demo-fraud-detection.html). My use case is similar and also the code. One thing I wanted to capture is to figure out any latevents if any and send them to a sink. But when ther

Unable to create new native thread error

2021-12-02 Thread Ilan Huchansky
Hi Flink mailing list, I am Ilan from Start.io data platform team, need some guidance. We have a flow with the following use case: * We read files from AWS S3 buckets process them on our cluster and sink the data into files using Flink file sink. * The jobs use always the same jar, we

Re: Buffering when connecting streams

2021-12-02 Thread David Morávek
You can not rely on order of the two streams that easily. In case you are for example processing backlog and the windows fire quickly, it can happen that it's actually faster than the second branch which has less work to do. This will make the pipeline non-deterministic. What you can do is to "syn

Stateful functions module configurations (module.yaml) per deployment environment

2021-12-02 Thread Deniz Koçak
Hi, We have a simple stateful-function job, consuming from Kafka, calling an HTTP endpoint (on AWS via an Elastic Load Balancer) and publishing the result back via Kafka again. * We created a jar file to be deployed on a standalone cluster (it's not a docker Image), therefore we add `statefun-fli

RE: Buffering when connecting streams

2021-12-02 Thread Alexis Sarda-Espinosa
Hi David, A watermark step simply refers to assigning timestamps and watermarks, my source doesn’t do that. I have a test system with only a couple of data points per day, so there’s definitely no back pressure. I basically have a browser where I can see the results from the sink, and I found

Re: Buffering when connecting streams

2021-12-02 Thread David Morávek
Hi Alexis, I'm not sure what "watermark" step refers to in you graph, but in general I'd say your intuition is correct. For the "buffering" part, each sub-task needs to send data via data exchange (last operator in chain) has an output buffer and the operator that consumes this data (maybe on dif

Re: Question about relationship between operator instances and keys

2021-12-02 Thread David Morávek
Hi haocheng, in short it works as follows: - Each parallel instance of an operator is responsible for one to N key groups. - Each parallel instance belongs to a slot, which is tied with a single thread (slot may actually introduce multiple subtasks) - # of keygroups for each operator = max parall

Question about relationship between operator instances and keys

2021-12-02 Thread haocheng
Hi community! So many keys are computed separately in each operator parallel, how does Flink achieve this? Let an operator instance processes 100 keys, is there 100 thread that are managed by the operator instance thread? I know that 1. an operator many have many parallel instances at the same

Re: REST service for flinkSQL

2021-12-02 Thread Martijn Visser
Hi Lu, Thanks for clarifying, much appreciated! Let's hope there are also others in the community who would value such a REST service so we could (re)start discussions and contributions around this topic. Best regards, Martijn On Mon, 29 Nov 2021 at 19:50, Lu Niu wrote: > Sure. The requiremen

Re: KafkaSink.builder setDeliveryGuarantee is not a member

2021-12-02 Thread Hang Ruan
Hi, It seems to be an error in documents. `setDeliverGuarantee` is the method of class `KafkaSinkBuilder`, . It could be used like this : KafkaSink.builder().setDeliverGuarantee(xxx) Lars Skjærven 于2021年12月2日周四 19:34写道: > Hello, > upgrading to 1.14 I bumped into an issue with the kafka sink bui

Re: KafkaSink.builder setDeliveryGuarantee is not a member

2021-12-02 Thread Fabian Paul
Hi Lars, Unfortunately, there is at the moment a small bug in our documentation [1]. You can set the DeliveryGuarantee on the builder object and not on the serialization schema. Sorry for the inconvenience. Best, Fabian [1] https://github.com/apache/flink/pull/17971 On Thu, Dec 2, 2021 at 12:34

KafkaSink.builder setDeliveryGuarantee is not a member

2021-12-02 Thread Lars Skjærven
Hello, upgrading to 1.14 I bumped into an issue with the kafka sink builder when defining delivery guarantee: value setDeliveryGuarantee is not a member of org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchemaBuilder[] Seems to be working with the default value (i.e. without m

Buffering when connecting streams

2021-12-02 Thread Alexis Sarda-Espinosa
Hello, I have a use case with event-time processing that ends up with a DAG roughly like this: source -> filter -> keyBy -> watermark -> window -> process -> keyBy_2 -> connect (KeyedCoProcessFunction) -> sink | / (s

Re: Issues in Batch Jobs Submission for a Session Cluster

2021-12-02 Thread David Morávek
Hi Jay, It's hard to say what going on here. My best guess is that you're running out of memory for your process (eg. hitting ulimit). Can you please start with checking the ulimits memory usage of your container? For the cleanup, right now it may happen in some failover scenarios that we don't c

Re: Re: how to run streaming process after batch process is completed?

2021-12-02 Thread Yun Gao
Hi Vtygoss, Very thanks for sharing the scenarios! Currently for batch mode checkpoint is not support, thus it could not create a snapshot after the job is finished. However, there might be some alternative solutions: 1. Hybrid source [1] targets at allowing first read from a bounded source, the