Re: Flink parallelism with Kafka source

2025-01-02 Thread David Anderson
Flink handles its parallelism independently from the number of partitions in the topic(s) being read. The parallelism comes from whatever is set in the cluster configuration, without any concern for the source's native parallelism. If there are fewer Kafka partitions than the Flink parallelism, the

Re: TTL in pyflink does not seem to work

2024-03-09 Thread David Anderson
My guess is that this only fails when pyflink is used with the heap state backend, in which case one possible workaround is to use the RocksDB state backend instead. Another workaround would be to rely on timers in the process function, and clear the state yourself. David On Fri, Mar 8, 2024 at 1

Re: The fault tolerance and recovery mechanism in batch mode within Apache Flink.

2024-02-16 Thread David Anderson
With streaming execution, the entire pipeline is always running, which is necessary so that results can be continuously produced. But with batch execution, the job graph can be segmented into separate pipelined stages that can be executed sequentially, each running to completion before the next beg

Re: Request to provide sample codes on Data stream using flink, Spring Boot & kafka

2024-02-08 Thread David Anderson
For a collection of several complete sample applications using Flink with Kafka, see https://github.com/confluentinc/flink-cookbook. And I agree with Marco -- in fact, I would go farther, and say that using Spring Boot with Flink is an anti-pattern. David On Wed, Feb 7, 2024 at 4:37 PM Marco Vil

Re: Re: Case for a Broadcast join + hint in Flink Streaming SQL

2024-02-02 Thread David Anderson
I've seen enough demand for a streaming broadcast join in the community to justify a FLIP -- I think it's a good idea, and look forward to the discussion. David On Fri, Feb 2, 2024 at 6:55 AM Feng Jin wrote: > +1 a FLIP for this topic. > > > Best, > Feng > > On Fri, Feb 2, 2024 at 10:26 PM Mart

Re: Redis as a State Backend

2024-01-31 Thread David Anderson
When it comes to decoupling the state store from Flink, I suggest taking a look at FlinkNDB, which is an experimental state backend for Flink that puts the state into an external distributed database. There's a Flink Forward talk [1] and a master's thesis [2] available. [1] https://www.youtube.com

Re: Continuous Reading of File using FileSource does not process the existing files in version 1.17

2024-01-08 Thread David Anderson
While the readFile method would monitor changes to existing files, it would completely re-ingest each changed file after every change. This behavior wasn't very user friendly. David On Fri, Jan 5, 2024 at 2:22 AM Martijn Visser wrote: > Hi Prasanna, > > I think this is as expected. There is no

Re: [PyFlink] Collect multiple elements in CoProcessFunction

2023-11-18 Thread David Anderson
Hi, Alex! Yes, in PyFlink the various flatmap and process functions are implemented as generator functions, so they use yield to emit results. David On Tue, Nov 7, 2023 at 1:16 PM Alexander Fedulov < alexander.fedu...@gmail.com> wrote: > Java ProcessFunction API defines a clear way to collect d
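To make the yield pattern concrete, here is a minimal sketch in plain Python. It deliberately does not import PyFlink so that it runs anywhere; the method names mirror those of a PyFlink CoProcessFunction, but the class `PairEmitter` and its logic are hypothetical and only illustrate how a generator emits zero, one, or many results.

```python
class PairEmitter:
    """Hypothetical stand-in with the method shape of a PyFlink CoProcessFunction."""

    def process_element1(self, value, ctx=None):
        # Emit several results for one input by yielding each of them in turn.
        for token in value.split():
            yield ("left", token)

    def process_element2(self, value, ctx=None):
        # A single yield emits a single result.
        yield ("right", value)


emitter = PairEmitter()
left_results = list(emitter.process_element1("a b c"))
right_results = list(emitter.process_element2("x"))
print(left_results)
print(right_results)
```

In a real job the runtime, not your code, iterates over the generator and forwards each yielded value downstream.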

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-11-17 Thread David Anderson
Rui, I don't have any direct experience with this topic, but given the motivation you shared, the proposal makes sense to me. Given that the new default feels more complex than the current behavior, if we decide to do this I think it will be important to include the rationale you've shared in the

Re: Flink custom parallel data source

2023-11-03 Thread David Anderson
> As you suggested message broker below then how it is feasible in this case? To my mind, the idea would be to use something like a socket source for Kafka Connect. This would give you a simple, reliable way to get the data stored into a replayable data store. You'd then be able to start, stop, an

Re: Bloom Filter for Rocksdb

2023-10-29 Thread David Anderson
I believe bloom filters are off by default because they add overhead and aren't always helpful. I.e., in workloads that are write heavy and have few reads, bloom filters aren't worth the overhead. David On Fri, Oct 20, 2023 at 11:31 AM Mate Czagany wrote: > Hi, > > There have been no reports ab

Re: Order of Execution in KeyedBroadcastProcessFunction

2023-09-10 Thread David Anderson
In Flink, all user functions, including KeyedBroadcastProcessFunction, are (effectively) single threaded, so the processBroadcastElement method will run to completion before any further messages are processed in the processElement method. (I said "effectively" because in the case of processing time

Re: Job graph

2023-09-01 Thread David Anderson
This may or may not help, but you can get the execution plan from inside the client, by doing something like this (I printed the plan to stderr): ... System.err.println(env.getExecutionPlan()); env.execute("my job"); The result is a JSON-encoded representation of the job graph, which

Re: Blue green deployment with Flink Apache Operator

2023-09-01 Thread David Anderson
Back in 2020, there was a Flink Forward talk [1] about how Lyft was doing blue green deployments. Earlier (all the way back in 2017) Drivetribe described [2] how they were doing so as well. David [1] https://www.youtube.com/watch?v=Hyt3YrtKQAM [2] https://www.ververica.com/blog/drivetribe-cqrs-ap

Re: Streaming join performance

2023-08-08 Thread David Anderson
This join optimization sounds promising, but I'm wondering why Flink SQL isn't taking advantage of the N-Ary Stream Operator introduced in FLIP-92 [1][2] to implement an n-way join in a single operator. Is there something that makes this impossible/impractical? [1] https://cwiki.apache.org/confluen

Re: Investigating use of Custom Trigger to smooth CPU usage

2023-08-03 Thread David Anderson
There's already a built-in concept of WindowStagger that provides an interface for accomplishing this. It's not as well integrated (or documented) as it might be, but the basic mechanism exists. To use it, I believe you would do something like this: assigner = new TumblingEventTimeWindows(Time.se

Re: High Start-Delay And Aligned Checkpointing Causing Timeout.

2023-06-02 Thread David Anderson
I'm not 100% certain what "alignment duration" is measuring exactly in the context of unaligned checkpoints -- however, even with unaligned checkpointing each operator still has to wait until all of the barriers are present in the operator's input queues. It doesn't have to wait for the barriers to

Re: Watermarks lagging behind events that generate them

2023-03-16 Thread David Anderson
UE if the subtask is >>> restarted and no event from source is processed. >>> >>> Best, >>> Shammon FY >>> >>> On Tue, Mar 14, 2023 at 4:58 PM Alexis Sarda-Espinosa >>> wrote: >>>> >>>> H

Re: is there any detrimental side-effect if i set the max parallelism as 32768

2023-03-13 Thread David Anderson
I believe there is some noticeable overhead if you are using the heap-based state backend, but with RocksDB I think the difference is negligible. David On Tue, Mar 7, 2023 at 11:10 PM Hangxiang Yu wrote: > > Hi, Tony. > "be detrimental to performance" means that some extra space overhead of the

Re: Watermarks lagging behind events that generate them

2023-03-13 Thread David Anderson
Watermarks always follow the corresponding event(s). I'm not sure why they were designed that way, but that is how they are implemented. Windows maintain this contract by emitting all of their results before forwarding the watermark that triggered the results. David On Mon, Mar 13, 2023 at 5:28 P
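The contract described above can be sketched with a toy tumbling-window counter in plain Python. This is not Flink code: `run_window_operator` is a hypothetical name, and the firing rule (a window fires once the watermark reaches its end minus one) is my assumption about how event-time windows behave. What it shows is the ordering: results first, then the watermark that triggered them.

```python
def run_window_operator(stream, size):
    """Toy tumbling-window count; stream items are ("event", ts) or ("watermark", ts)."""
    counts = {}  # window start -> number of events seen so far
    out = []
    for kind, ts in stream:
        if kind == "event":
            start = ts - ts % size
            counts[start] = counts.get(start, 0) + 1
        else:
            # Emit every window the watermark has completed...
            for start in sorted(s for s in counts if s + size - 1 <= ts):
                out.append(("result", start, counts.pop(start)))
            # ...and only then forward the watermark itself.
            out.append(("watermark", ts))
    return out


stream = [("event", 1), ("event", 4), ("event", 12), ("watermark", 9), ("watermark", 19)]
output = run_window_operator(stream, size=10)
print(output)
```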

Re: Reusing the same OutputTag in multiple ProcessFunctions

2023-02-14 Thread David Anderson
I can't respond to the python-specific aspects of this situation, but I don't believe you need to use the same OutputTag instance. It should be enough that the various tag instances involved all have the same String id. (That's why the id exists.) David On Tue, Feb 14, 2023 at 11:51 AM Andrew Ott

Re: Non-temporal watermarks

2023-02-03 Thread David Anderson
DataStream time windows and Flink SQL make assumptions about the timestamps and watermarks being milliseconds since the epoch. But the underlying machinery does not. So if you limit yourself to process functions (for example), then nothing will assign any semantics to the time values. David On Th

Re: Failing to build Flink 1.9 using Scala 2.12

2022-12-24 Thread David Anderson
Flink only officially supports Scala 2.12 up to 2.12.7 -- you are running into the binary compatibility check, intended to keep you from unknowingly running into problems. You can disable japicmp, and everything will hopefully work: mvn clean install -DskipTests -Djapicmp.skip -Dscala-2.12 -Dscala

Re: Support for higher-than-millisecond resolution event-time timestamps

2022-11-25 Thread David Anderson
When it comes to event time processing and watermarks, I believe that if you stick to the lower level APIs, then the milliseconds assumption is indeed arbitrary, but at higher levels that assumption is baked in. In other words, that rules out using Flink SQL, or things like TumblingEventTimeWindow

Re: question about Async IO

2022-11-04 Thread David Anderson
Yes, that will work as you expect. So long as you don't put another shuffle or rebalance in between, the keyed partitioning that's already in place will carry through the async i/o operator, and beyond. In most cases you can even use reinterpretAsKeyedStream on the output (so long as you haven't do

Re: Can temporal table functions only be registered using the table API?

2022-10-06 Thread David Anderson
I was wrong about this. The AS OF style processing join has been disabled at a higher level, in org.apache.flink.table.planner.plan.nodes.exec.stream.StreamExecTemporalJoin#createJoinOperator David On Thu, Oct 6, 2022 at 9:59 AM David Anderson wrote: > Salva, > > Have you tried doing

Re: Can temporal table functions only be registered using the table API?

2022-10-06 Thread David Anderson
As for your original question, the documentation states that a temporal table function can only be registered via the Table API, and I believe this is true. David On Thu, Oct 6, 2022 at 9:59 AM David Anderson wrote: > Salva, > > Have you tried doing an AS OF style processing time temp

Re: Can temporal table functions only be registered using the table API?

2022-10-06 Thread David Anderson
Salva, Have you tried doing an AS OF style processing time temporal join? I know the documentation leads one to believe this isn't supported, but I think it actually works. I'm basing this on this comment [1] in the code for the TemporalProcessTimeJoinOperator: The operator to temporal join a str

Re: [DISCUSS] FLIP-265 Deprecate and remove Scala API support

2022-10-05 Thread David Anderson
I want to clarify one point here, which is that modifying jobs written in Scala to use Flink's Java API does not require porting them to Java. I can readily understand why folks using Scala might rather use Java 17 than Java 11, but sticking to Scala will remain an option even if Flink's Scala API

Re: Loading broadcast state on BroadcastProcessFunction instantiation or open method

2022-09-27 Thread David Anderson
Logically it would make sense to be able to initialize BroadcastState in the open method of a BroadcastProcessFunction, but in practice I don't believe it can be done -- because the necessary Context isn't made available. Perhaps you could use the State Processor API to bootstrap some state into t

Re: A question about restoring state with an additional variable with kryo

2022-09-18 Thread David Anderson
Vishal, If you decide you can't live with dropping that state, [1] is a complete example showing how to migrate from Kryo by using the state processor API. David [1] https://www.docs.immerok.cloud/docs/cookbook/migrating-state-away-from-kryo/ On Fri, Sep 16, 2022 at 8:32 AM Vishal Santoshi wr

Re: Mixed up session aggregations for same key

2022-09-07 Thread David Anderson
The way that Flink handles session windows is that every new event is initially assigned to its own session window, and then overlapping sessions are merged. I imagine this is why you are seeing so many calls to createAccumulator. This implementation choice is deeply embedded in the code; I don't
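As a rough illustration of the assign-then-merge approach described above, here is a plain-Python sketch. The function name `merge_sessions` is hypothetical, and treating windows that merely touch as overlapping is my assumption about Flink's behavior, not something stated in this thread.

```python
def merge_sessions(timestamps, gap):
    """Give each event its own window [t, t + gap), then merge overlapping windows."""
    windows = sorted((t, t + gap) for t in timestamps)
    merged = []
    for start, end in windows:
        # Assumption: a window that starts exactly where the previous one ends also merges.
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


sessions = merge_sessions([1, 3, 4, 20, 22], gap=5)
print(sessions)
```

Five events start out as five windows and end up as two sessions, which is consistent with seeing many more accumulators created than sessions emitted.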

Re: StreamingFileSink question

2022-08-31 Thread David Anderson
If I remember correctly, there's a fix for this in Flink 1.14 (but the feature is disabled by default in 1.14, and enabled by default in 1.15). (I'm thinking that execution.checkpointing.checkpoints-after-tasks-finish.enabled [1] takes care of this.) With Flink 1.13 I believe you'll have to handle

Re: Why this example does not save anything to file?

2022-08-31 Thread David Anderson
> What's that? > > > > Sent: Monday, August 01, 2022 at 2:49 PM > From: "Martijn Visser" <martijnvis...@apache.org> > To: pod...@gmx.com

Re: How Flink knows that CREATE TABLE is sometimes about creating table, sometimes about creating file?

2022-08-29 Thread David Anderson
The role of CREATE TABLE is to provide the necessary metadata for the table -- the location of the data, its format, etc. Executing CREATE TABLE creates an entry in the catalog, but otherwise doesn't do anything. In a case like this one CREATE TABLE Table1 (column_name1 STRING, column_name2 DOUBL

Re: Overwriting watermarks in DataStream

2022-08-21 Thread David Anderson
If you have two watermark strategies in your job, the downstream TimestampsAndWatermarksOperator will absorb incoming watermarks and not forward them downstream, but it will have no effect upstream. The only exception to this is that watermarks equal to Long.MAX_VALUE are forwarded downstream, sin

Re: flink sink kafka exactly once plz help me

2022-08-17 Thread David Anderson
You can keep the same transaction ID if you are restarting the job as a continuation of what was running before. You need distinct IDs for different jobs that will be running against the same kafka brokers. I think of the transaction ID as an application identifier. See [1] for a complete list of

Re: Problem with KafkaSource and watermark idleness

2022-08-15 Thread David Anderson
. The problem was quite obvious when I > enabled idleness and data flowed through much faster with different results > even though the topics were not idle. > > Regards. > > On Mon, Aug 15, 2022 at 12:12 AM David Anderson > wrote: > >> Although I'm not very familiar wi

Re: Problem with KafkaSource and watermark idleness

2022-08-14 Thread David Anderson
Although I'm not very familiar with the design of the code involved, I also looked at the code, and I'm inclined to agree with you that this is a bug. Please do raise an issue. I'm wondering how you noticed this. I was thinking about how to write a failing test, and I'm wondering if this has some

Re: RichFunctions, streaming, and configuration (it's always empty)

2022-08-07 Thread David Anderson
The configuration parameter passed to the open method is a legacy holdover that has been retained to avoid breaking a public API, but is no longer used. Your options are to either get the global job parameters from the execution context as described in [1], or to pass the configuration to a constr

Re: Why this example does not save anything to file?

2022-07-30 Thread David Anderson
You need to add 'csv.field-delimiter'=';' to the definition of Table1 so that the input from test4.txt can be correctly parsed: tEnv.executeSql("CREATE TABLE Table1 (column_name1 STRING, column_name2 DOUBLE) WITH ('connector.type' = 'filesystem', 'connector.path' = 'file:///C:/temp/test4

Re: Did the semantics of Kafka's earliest offset change with the new source API?

2022-07-15 Thread David Anderson
What did change was the default starting position when not starting from a checkpoint. With FlinkKafkaConsumer, it starts from the committed offsets by default. With KafkaSource, it starts from the earliest offset. David On Fri, Jul 15, 2022 at 5:57 AM Chesnay Schepler wrote: > I'm not sure abo

Re: Data is lost in the ListState

2022-07-11 Thread David Anderson
This is, in fact, the expected behavior. Let me explain why: In order for Flink to provide exactly-once guarantees, the input sources must be able to rewind and then replay any events since the last checkpoint. In the scenario you shared, the last checkpoint was checkpoint 2, which occurred befor

[ANNOUNCE] Apache Flink 1.15.1 released

2022-07-07 Thread David Anderson
available in Jira: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351546 We would like to thank all contributors of the Apache Flink community who made this release possible! Regards, David Anderson

Re: Fink 15: InvalidProgramException: Table program cannot be compiled. This is a bug

2022-06-09 Thread David Anderson
ption: > org.apache.flink.api.common.InvalidProgramException: Table program cannot > be compiled. This is a bug. Please file an issue. > > at > org.apache.flink.table.runtime.generated.CompileUtils.compile(CompileUtils.java:94) > ~[flink-table-runtime-1.15.0.jar:1.15.0] >

Re: Fink 15: InvalidProgramException: Table program cannot be compiled. This is a bug

2022-06-09 Thread David Anderson
A Table can have at most one time attribute. In your Table the proc_time column is a processing time attribute, and when you define a watermark on the event_time column then that column becomes an event-time attribute. If you want to combine event time and processing time, you can use the PROCTIME

Re: Recover watermark from savepoint

2022-06-09 Thread David Anderson
Sweta, Flink does not include watermarks in savepoints, nor are they included in aligned checkpoints. For what it's worth, I believe that with unaligned checkpoints in-flight watermarks are included in checkpoints, but I don't believe that would solve the problem, since the watermark strategy's st

Re: Request for Review: FLINK-27507 and FLINK-27509

2022-05-23 Thread David Anderson
I've taken care of this. David On Sun, May 22, 2022 at 4:12 AM Shubham Bansal wrote: > Hi Everyone, > > I am not sure who to reach out for the reviews of these changesets, so I > am putting this on the mailing list here. > > I have raised the review for > FLINK-27507 - https://github.com/apache

Re: Checkpointing - Job Manager S3 Head Requests are 404s

2022-05-20 Thread David Anderson
> Great, that all makes sense to me. Thanks again. > > On Thu, May 19, 2022 at 11:42 AM David Anderson > wrote: > > > > Sure, happy to try to help. > > > > What's happening with the hadoop filesystem is that before it writes > each key it checks to see if the

Re: Checkpointing - Job Manager S3 Head Requests are 404s

2022-05-19 Thread David Anderson
sto plugin with S3 I would > conclude that this increases the size of the cluster that would > require entropy injection, yes? But that it doesn't necessarily get > rid of the need because one could have a large enough cluster and say > a lifecycle policy that could still end up requi

Re: Checkpointing - Job Manager S3 Head Requests are 404s

2022-05-19 Thread David Anderson
Aeden, this is probably happening because you are using the Hadoop implementation of S3. The Hadoop S3 filesystem tries to imitate a filesystem on top of S3. In so doing it makes a lot of HEAD requests. These are expensive, and they violate read-after-create visibility, which is what you seem to b

Re: Confusing S3 Entropy Injection Behavior

2022-05-19 Thread David Anderson
ths of any and all checkpoint data files. With presto the metadata path won't include "_entropy_" at all (it will disappear, rather than being replaced by something specific). For point 2, I'm not sure. David On Thu, May 19, 2022 at 2:37 PM David Anderson wrote: > This so

Re: Confusing S3 Entropy Injection Behavior

2022-05-19 Thread David Anderson
This sounds like it could be FLINK-17359 [1]. What version of Flink are you using? Another likely explanation arises from the fact that only the checkpoint data files (the ones created and written by the task managers) will have the _entropy_ replaced. The job manager does not inject entropy into

Re: sharing data between 2 pipelines

2022-05-10 Thread David Anderson
This sounds like it might be a use case for something like a KeyedCoProcessFunction (or possibly a KeyedBroadcastProcessFunction, depending on the details). These operators can receive inputs from two different sources, and share state between them. The rides and fares exercise [1] from the flink-

Re: [Discuss] Creating an Apache Flink slack workspace

2022-05-10 Thread David Anderson
pared >>> >> to the user-zh@ ML, which I'd attribute to the improvement of >>> interaction >>> >> experiences. Admittedly, there are questions being repeatedly asked & >>> >> answered, but TBH I don't think that compares to the benefit

Re: [Discuss] Creating an Apache Flink slack workspace

2022-05-06 Thread David Anderson
I have mixed feelings about this. I have been rather visible on stack overflow, and as a result I get a lot of DMs asking for help. I enjoy helping, but want to do it on a platform where the responses can be searched and shared. It is currently the case that good questions on stack overflow frequ

Re: RocksDB's state size discrepancy with what's seen with state processor API

2022-04-22 Thread David Anderson
Alexis, Compaction isn't an all-at-once procedure. RocksDB is organized as a series of levels, each 10x larger than the one below. There are a few different compaction algorithms available, and they are tunable, but what's typically happening during compaction is that one SST file at level n is be
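To put numbers on the 10x growth between levels, here is a small arithmetic sketch. The 256 MiB base size is an assumption on my part (I believe it is the RocksDB default for the first sized level, but check your configuration), and `level_targets` is a hypothetical helper, not a RocksDB API.

```python
def level_targets(base_mib, multiplier, levels):
    """Target size in MiB of each successive level, growing by `multiplier` each time."""
    return [base_mib * multiplier ** i for i in range(levels)]


# Assumed values: 256 MiB base, 10x multiplier, four levels.
targets = level_targets(256, 10, 4)
print(targets)
```

Because compaction moves data between levels one file at a time, the on-disk size at any moment can sit well above the size of the live data.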

Re: Flink batch mode does not sort by event timestamp

2022-04-22 Thread David Anderson
The DataStream API's BATCH execution mode first sorts by key, and within each key, it sorts by timestamp. By operating this way, it only needs to keep state for one key at a time, so this keeps the runtime simple and efficient. Regards, David P.S. I see you also asked this question on stack overf
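A tiny plain-Python sketch of the resulting processing order. The alphabetical key order is only for illustration; what matters is that all events for one key are contiguous and sorted by timestamp, not the order in which the keys themselves appear.

```python
# (key, timestamp) pairs in arrival order.
events = [("b", 3), ("a", 2), ("b", 1), ("a", 1)]

# Group by key first, then order by timestamp within each key.
ordered = sorted(events, key=lambda e: (e[0], e[1]))
print(ordered)
```

Note that the result is not globally sorted by timestamp, which is the behavior the original question ran into.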

Re: How to reprocess historical data with event-time windowing?

2022-04-13 Thread David Anderson
Ty, Usually what's done is to run a separate instance of the app to handle the re-ingestion of the historic data while another instance is processing live data. That way the backfill job won't be confused by observing events with recent timestamps -- it will only see the historic data. But you wil

Re: Parallel processing in a 2 node cluster apparently not working

2022-03-29 Thread David Anderson
In this situation, changing your configuration [1] to include cluster.evenly-spread-out-slots: true should change the scheduling behavior to what you are looking for. [1] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#cluster-evenly-spread-out-slots Regar

Re: Using Amazon EC2 Spot instances with Flink

2022-03-24 Thread David Anderson
I remember a Flink Forward talk several years ago where the speaker shared how they were running on spot instances. They were catching the notification that the instance was being shut down, taking a savepoint, and relaunching. They were also proactively monitoring spot instance prices around the wo

Re: How to flatten ARRAY in Table API

2022-02-20 Thread David Anderson
Matthias, You can use a CROSS JOIN UNNEST, as mentioned very briefly in the docs [1]. Something like this should work: SELECT id, customerid, productid, quantity, ... FROM orders CROSS JOIN UNNEST(entries) AS items (productid, quantity, unit_price, discount); [1] https://nightlies.apache.or

Re: Implement watermark buffering with Process Function

2022-02-16 Thread David Anderson
ts will be dramatically smaller. David On Wed, Feb 16, 2022 at 10:17 PM David Anderson wrote: > I'm afraid not. The DataStream window implementation uses internal APIs to > manipulate the state backend namespace, which isn't possible to do with the > public-facing API. And without this

Re: Implement watermark buffering with Process Function

2022-02-16 Thread David Anderson
I'm afraid not. The DataStream window implementation uses internal APIs to manipulate the state backend namespace, which isn't possible to do with the public-facing API. And without this, you can't implement this as efficiently. David On Wed, Feb 16, 2022 at 12:04 PM Ruibin Xing wrote: > Hi, >

Re: Illegal reflective access by org.apache.flink.api.java.ClosureCleaner

2022-02-12 Thread David Anderson
You are probably running with Java 11 (with Java 8 these messages aren't produced). The Flink docs [1] say These warnings are considered harmless and will be addressed in future Flink releases. [1] https://nightlies.apache.org/flink/flink-docs-release-1.14/release-notes/flink-1.10/#java-11-su

Re: Apache Flink - Will later events from a table with watermark be propagated and when can CURRENT_WATERMARK be null in that situation

2022-02-12 Thread David Anderson
Flink uses watermarks to indicate when a stream has become complete up through some point in time. Various operations on streams wait for watermarks in order to know when they can safely stop waiting for further input, and so go ahead and produce their results. These operations include event-time w

Re: KafkaSource vs FlinkKafkaConsumer010

2022-02-01 Thread David Anderson
Before Kafka introduced their universal client, Flink had version-specific connectors, e.g., for versions 0.8, 0.9, 0.10, and 0.11. Those were eventually removed in favor of FlinkKafkaConsumer, which is/was backward compatible back to Kafka version 0.10. FlinkKafkaConsumer itself was deprecated in

Re: [DISCUSS] Deprecate/remove Twitter connector

2022-01-30 Thread David Anderson
I agree. The Twitter connector is used in a few (unofficial) tutorials, so if we remove it that will make it more difficult for those tutorials to be maintained. On the other hand, if I recall correctly, that connector uses V1 of the Twitter API, which has been deprecated, so it's really not very

Re: Stack Overflow Question - Deserialization schema for multiple topics

2022-01-30 Thread David Anderson
Hussein, To use a JsonRowDeserializationSchema you'll need to use the Table API, and not DataStream. You'll want to use a JsonRowSchemaConverter to convert your json schema into the TypeInformation needed by Flink, which is done for you by the JsonRowDeserializationSchema builder: json_row_s

Re: Stack Overflow Question - Deserialization schema for multiple topics

2022-01-28 Thread David Anderson
For questions like this one, please address them to either Stack Overflow or the user mailing list, but not both at once. Those two forums are appropriate places to get help with using Flink's APIs. And once you've asked a question, please allow some days for folks to respond before trying again.

Re: Serving Machine Learning models

2022-01-10 Thread David Anderson
Another approach that I find quite natural is to use Flink's Stateful Functions API [1] for model serving, and this has some nice advantages, such as zero-downtime deployments of new models, and the ease with which you can use Python. [2] is an example of this approach. [1] https://flink.apache.or

Re: adding elapsed times to events that form a transaction

2022-01-07 Thread David Anderson
One way to solve this with Flink SQL would be to use MATCH_RECOGNIZE. [1] is an example illustrating a very similar use case. [1] https://stackoverflow.com/a/62122751/2000823 On Fri, Jan 7, 2022 at 11:32 AM Ali Bahadir Zeybek wrote: > Hello Hans, > > If you would like to see some hands-on examp

Re: [DISCUSS] Drop Gelly

2022-01-03 Thread David Anderson
Most of the inquiries I've had about Gelly in recent memory have been from folks looking for a streaming solution, and it's only been a handful. +1 for dropping Gelly David On Mon, Jan 3, 2022 at 2:41 PM Till Rohrmann wrote: > I haven't seen any changes or requests to/for Gelly in ages. Hence,

Re: Order of events in Broadcast State

2021-12-06 Thread David Anderson
Event ordering in Flink is only maintained between pairs of events that take exactly the same path through the execution graph. So if you have multiple instances of A (let's call them A1 and A2), each broadcasting a partition of the total rule space, then one instance of B (B1) might receive rule1

Re: behavior change with idle partitions and the new KafkaSource?

2021-11-22 Thread David Anderson
org/confluence/display/FLINK/FLIP-180%3A+Adjust+StreamStatus+and+Idleness+definition > > On Mon, Nov 22, 2021 at 10:37 AM David Anderson > wrote: > >> I've seen a few questions recently from folks migrating from >> FlinkKafkaConsumer to KafkaSource that make me suspec

behavior change with idle partitions and the new KafkaSource?

2021-11-22 Thread David Anderson
I've seen a few questions recently from folks migrating from FlinkKafkaConsumer to KafkaSource that make me suspect that something has changed. In FlinkKafkaConsumerBase we have this code which sets a source subtask to idle if all of its partitions are empty when the subtask starts: // ma

Re: s3.entropy working locally but not in production

2021-11-13 Thread David Anderson
> > It seems to work for some jobs and not for others. Maybe jobs with little > or empty state don't have _entropy_ swapped out correctly? This is done by design. As the documentation explains: The Flink runtime currently passes the option to inject entropy only to > checkpoint data files. All

Re: Custom partitioning of keys with keyBy

2021-11-03 Thread David Anderson
Another possibility, if you know in advance the values of the keys, is to find a mapping that transforms the original keys into new keys that will, in fact, end up in disjoint key groups that will, in turn, be assigned to different slots (given a specific parallelism). This is ugly, but feasible.
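For the second half of that mapping (key group to parallel subtask), here is a sketch in plain Python. The formula is what I understand Flink to use, so treat it as an assumption and verify it against your Flink version. The first half (key to key group, which involves hashing the key) is not modelled, and both function names are hypothetical.

```python
def operator_index(key_group, parallelism, max_parallelism):
    # Assumed formula: key groups are split into contiguous ranges, one per subtask.
    return key_group * parallelism // max_parallelism


def key_groups_per_subtask(parallelism, max_parallelism):
    """Map each subtask index to the list of key groups it would own."""
    assignment = {}
    for key_group in range(max_parallelism):
        index = operator_index(key_group, parallelism, max_parallelism)
        assignment.setdefault(index, []).append(key_group)
    return assignment


assignment = key_groups_per_subtask(parallelism=4, max_parallelism=128)
print({index: (groups[0], groups[-1]) for index, groups in assignment.items()})
```

With a table like this, you would search for replacement keys whose key groups fall into different ranges, which is why the trick only holds for a specific parallelism.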

Re: Flink application question

2021-08-09 Thread David Anderson
FYI, I've responded to this on stack overflow: https://stackoverflow.com/questions/68715430/apache-flink-accessing-keyed-state-from-late-window On Mon, Aug 9, 2021 at 3:16 AM suman shil wrote: > I am writing a Flink application which consumes time series data from > kafka topic. Time series dat

Re: [ANNOUNCE] RocksDB Version Upgrade and Performance

2021-08-04 Thread David Anderson
I am hearing quite often from users who are struggling to manage memory usage, and these are all users using RocksDB. While I don't know for certain that RocksDB is the cause in every case, from my perspective, getting the better memory stability of version 6.20 in place is critical. Regards, Davi

Re: StreamingFileSink only writes data to MINIO during savepoint

2021-05-31 Thread David Anderson
The StreamingFileSink requires that you have checkpointing enabled. I'm guessing that you don't have checkpointing enabled, since that would explain the behavior you are seeing. The relevant section of the docs [1] explains: Checkpointing needs to be enabled when using the StreamingFileSink. Part

Re: Nested match_recognize query not supported in SQL ?

2021-05-13 Thread David Anderson
By the way, views that use MATCH_RECOGNIZE don't work in Flink 1.11. [1] [1] https://issues.apache.org/jira/browse/FLINK-20077 On Thu, May 13, 2021 at 11:06 AM David Anderson wrote: > I was able to get something like this working, but only by introducing a > view: > > CREATE T

Re: Nested match_recognize query not supported in SQL ?

2021-05-13 Thread David Anderson
I was able to get something like this working, but only by introducing a view: CREATE TEMPORARY VIEW mmm AS SELECT id FROM events MATCH_RECOGNIZE (...); SELECT * FROM event WHERE id IN (SELECT id FROM mmm); Regards, David On Tue, May 11, 2021 at 9:22 PM Tejas wrote: > Hi, > I am using flink 1

Re: Flink auto-scaling feature and documentation suggestions

2021-05-05 Thread David Anderson
Well, I was thinking you could have avoided overwhelming your internal services by using something like Flink's async i/o operator, tuned to limit the total number of concurrent requests. That way the pipeline could have uniform parallelism without overwhelming those services, and then you'd rely o

Re: Flink auto-scaling feature and documentation suggestions

2021-05-05 Thread David Anderson
Interesting. So if I understand correctly, basically you limited the parallelism of the sources in order to avoid running the job with constant backpressure, and then scaled up the windows to maximize throughput. On Tue, May 4, 2021 at 11:23 PM vishalovercome wrote: > In one of my jobs, windowin

Re: Flink auto-scaling feature and documentation suggestions

2021-05-04 Thread David Anderson
Could you describe a situation in which hand-tuning the parallelism of individual operators produces significantly better throughput than the default approach? I think it would help this discussion if we could have a specific use case in mind where this is clearly better. Regards, David On Tue, M

Re: Use State query to dump state into datalake

2021-05-03 Thread David Anderson
I think you'd be better off using the State Processor API [1] instead. The State Processor API has cleaner semantics -- as you'll be seeing a self-consistent snapshot of all the state -- and it's also much more performant. Note also that the Queryable State API is "approaching end of life" [2]. Th

Re: Question about snapshot file

2021-04-30 Thread David Anderson
eprecated and isn't supported by the savepoint API. David On Fri, Apr 30, 2021 at 5:42 PM Abdullah bin Omar < abdullahbinoma...@gmail.com> wrote: > Hi, > > So, can't we extract all previous savepoint data by using > ExistingSavepoint? > > > Thank you > > >

Re: "myuid" in snapshot.readingstate

2021-04-30 Thread David Anderson
> > I am trying to *import* > org.apache.flink.training.exercises.common.sources.TaxiFareGenerator; > However, it can not resolve. > > What is dependency (in pom.xml) for the org.apache.flink.training? > > > Thank you > > On Fri, Apr 30, 2021 at 10:12 AM David Anderso

Re: Question about snapshot file

2021-04-30 Thread David Anderson
file location will be the location > that set up in the flink conf and FileSystemBackend will have to use > instead of MemoryStateBackend. *is this correct?* > > > > > [1] > https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html > > [2] >

Re: "myuid" in snapshot.readingstate

2021-04-30 Thread David Anderson
You can read about assigning unique IDs to stateful operators in the docs [1][2]. What the uid() method does is to establish a stable and unique identifier for a stateful operator. Then as you evolve your application, this helps ensure that future versions of your job will be able to restore savepo

Re: Flink Metric isBackPressured not available

2021-04-26 Thread David Anderson
The isBackPressured metric is a Boolean -- it reports true or false, rather than 1 or 0. The Flink web UI cannot display it (it shows NaN); perhaps the same is true for Datadog. https://issues.apache.org/jira/browse/FLINK-15753 relates to this. Regards, David On Tue, Apr 13, 2021 at 12:13 PM Cl

Re: Approaches for external state for Flink

2021-04-25 Thread David Anderson
Apr 25, 2021 at 12:25 PM Omngr wrote: > Thank you David. That's perfect. > > Now, I'm just worried about the state size. State size will grow forever. > There is no TTL. > > 24 Nis 2021 Cmt 17:42 tarihinde David Anderson > şunu yazdı: > >&g

Re: Approaches for external state for Flink

2021-04-24 Thread David Anderson
for bootstrapping rocksdb state? > > David Anderson , 24 Nis 2021 Cmt, 15:43 tarihinde > şunu yazdı: > >> Oguzhan, >> >> Note, the state size is very large and I have to feed the state from >>> batch flow firstly. Thus I can not use the internal state like ro

Re: Approaches for external state for Flink

2021-04-24 Thread David Anderson
Oguzhan, Note, the state size is very large and I have to feed the state from batch > flow firstly. Thus I can not use the internal state like rocksdb. How large is "very large"? Using RocksDB, several users have reported working with jobs using many TBs of state. And there are techniques for b

Re: Question about snapshot file

2021-04-23 Thread David Anderson
Abdullah, ReadRidesAndFaresSnapshot [1] is an example that shows how to use the State Processor API to display the contents of a snapshot taken while running RidesAndFaresSolution [2]. Hopefully that will help you get started. [1] https://github.com/ververica/flink-training/blob/master/state-pro

Re: WindowFunction is stuck until next message is processed although Watermark with idle timeout is applied.

2021-04-15 Thread David Anderson
The withIdleness option does not attempt to handle situations where all of the sources are idle. Flink operators with multiple input channels keep track of the current watermark from each channel, and use the minimum of these watermarks as their own watermark. withIdleness marks idle channels as i
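The watermark rule described in this message can be sketched in a few lines of plain Python. This is a simplified model, not Flink code: `InputChannel` and `operator_watermark` are hypothetical names, and the sketch assumes that channels marked idle are simply left out of the minimum.

```python
from typing import List, Optional


class InputChannel:
    """Hypothetical model of one input channel of an operator."""

    def __init__(self) -> None:
        self.watermark = float("-inf")  # no watermark received yet
        self.idle = False               # set when the channel is marked idle


def operator_watermark(channels: List[InputChannel]) -> Optional[float]:
    """Minimum watermark over the channels that are not idle.

    Returns None when every channel is idle: in that case there is
    nothing to base a new watermark on, so the operator cannot advance.
    """
    active = [c.watermark for c in channels if not c.idle]
    if not active:
        return None
    return min(active)
```

With two channels at watermarks 10 and 5, the operator's watermark is 5; if the slower channel is marked idle it becomes 10; if both are idle the function returns None, which corresponds to the case where a window stays open until the next message arrives.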

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-29 Thread David Anderson
Yes, since the two streams have the same type, you can union the two streams, key the resulting stream, and then apply something like a RichFlatMapFunction. Or you can connect the two streams (again, they'll need to be keyed so you can use state), and apply a RichCoFlatMapFunction. You can use whic

Re: How to visualize the results of Flink processing or aggregation?

2021-03-26 Thread David Anderson
Prometheus is a metrics system; you can use Flink's Prometheus metrics reporter to send metrics to Prometheus. Grafana can also be connected to InfluxDB, and to databases like MySQL and PostgreSQL, for which sinks are available. And the Elasticsearch sink can be used to create visualizations with

Re: Flink SQL client: SELECT 'hello world' throws [ERROR] Could not execute SQL statement

2021-03-26 Thread David Anderson
There needs to be a Flink session cluster available to the SQL client on which it can run the jobs created by your queries. See the Getting Started [1] section of the SQL Client documentation for more information: The SQL Client is bundled in the regular Flink distribution and thus runnable out-of
