Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread David Anderson
I assume that along with DataStream#fold you would also remove WindowedStream#fold. I'm in favor of going ahead with all of these. David On Mon, Aug 17, 2020 at 10:53 AM Dawid Wysakowicz wrote: > Hi devs and users, > > I wanted to ask you what do you think about removing some of the > deprecat

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-17 Thread David Anderson
Kostas, I'm pleased to see some concrete details in this FLIP. I wonder if the current proposal goes far enough in the direction of recognizing the need some users may have for "batch" and "bounded streaming" to be treated differently. If I've understood it correctly, the section on scheduling al

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-18 Thread David Anderson
ution or be ignored (but not throw an exception). I > could also see ignoring the timers in batch as the default, if this > makes more sense. > > By the way, do you have any usecases in mind that will help us better > shape our processing time timer handling? > > Kostas > >

Re: How to trigger only once for each window when using TumblingProcessingTimeWindows?

2020-08-19 Thread David Anderson
The purpose of the reduce() and aggregate() methods on windows is to allow for incremental computation of window results. This has two principal advantages: (1) the computation of the results is spread out, rather than occurring all in one go at the end of each window, thereby reducing the likeliho

Re: Ververica Flink training resources

2020-08-23 Thread David Anderson
Piper, 1. Thanks for reporting the problem with the broken links. I've just fixed this. 2. The exercises were recently rewritten so that they no longer use the old file-based datasets. Now they use data generators that are included in the project. As part of this update, the schema was modified s

Re: Ververica Flink training resources

2020-08-24 Thread David Anderson
this feature was removed in the > new repo. > > Best, > Piper > > On Sun, Aug 23, 2020 at 10:15 AM David Anderson > wrote: > >> Piper, >> >> 1. Thanks for reporting the problem with the broken links. I've just >> fixed this. >> >> 2.

Re: Failures due to inevitable high backpressure

2020-08-26 Thread David Anderson
You should begin by trying to identify the cause of the backpressure, because the appropriate fix depends on the details. Possible causes that I have seen include: - the job is inadequately provisioned - blocking i/o is being done in a user function - a huge number of timers are firing simultaneo

Re: Failures due to inevitable high backpressure

2020-08-26 Thread David Anderson
d On Wed, Aug 26, 2020 at 7:41 PM David Anderson wrote: > You should begin by trying to identify the cause of the backpressure, > because the appropriate fix depends on the details. > > Possible causes that I have seen include: > > - the job is inadequately provisioned > - blo

Re: Flink not outputting windows before all data is seen

2020-08-29 Thread David Anderson
Teodor, This is happening because of the way that readTextFile works when it is executing in parallel, which is to divide the input file into a bunch of splits, which are consumed in parallel. This is making it so that the watermark isn't able to move forward until much or perhaps all of the file

Re: Flink not outputting windows before all data is seen

2020-09-01 Thread David Anderson
t I had set the parallelism to 1 > >job wide, but when I reran the task, I added your line. Are there any > >circumstances where despite having the global level set to 1, you > >still need to set the level on individual operators? > > > >PS: I sent this to you direc

Re: How to get Latency Tracking results?

2020-09-09 Thread David Anderson
Pankaj, The Flink web UI doesn't do any visualizations of histogram metrics, so the only way to access the latency metrics is either through the REST api or a metrics reporter. The REST endpoint you tried is the correct place to find these metrics in all recent versions of Flink, but somewhere ba

Re: How to get Latency Tracking results?

2020-09-09 Thread David Anderson
ommit ID: 7eb514a. > Is it possible that the default SocketWindowWordCount job is too simple to > generate Latency metrics? Or that the latency metrics disappear from the > output JSON when the data ingestion is zero? > > Thanks, > > Pankaj > > > On Wed, Sep 9, 2020 at 6:27 AM David An

Re: Watermark generation issues with File sources in Flink 1.11.1

2020-09-09 Thread David Anderson
Arti, The problem with watermarks and the File source operator will be fixed in 1.11.2 [1]. This bug was introduced in 1.10.0, and isn't related to the new WatermarkStrategy api. [1] https://issues.apache.org/jira/browse/FLINK-19109 David On Wed, Sep 9, 2020 at 2:52 PM Arti Pande wrote: > Hi

Re: Slow Performance inquiry

2020-09-09 Thread David Anderson
Heidy, which state backend are you using? With RocksDB Flink will have to do ser/de on every access and update, but with the FsStateBackend, your sparse matrix will sit in memory, and only have to be serialized during checkpointing. David On Wed, Sep 9, 2020 at 2:41 PM Heidi Hazem Mohamed wrote:

Re: How to get Latency Tracking results?

2020-09-10 Thread David Anderson
s, which is not happening in my case. >> >> [1] >> https://ci.apache.org/projects/flink/flink-docs-release-1.11/monitoring/metrics.html#latency-tracking >> >> On Wed, Sep 9, 2020 at 10:08 AM David Anderson >> wrote: >> >>> Pankaj, >>> >

Re: Connecting two streams and order of their processing

2020-09-16 Thread David Anderson
The details of what can go wrong will vary depending on the precise scenario, but no, Flink is unable to provide any such guarantee. Doing so would require being able to control the scheduling of various threads running on different machines, which isn't possible. Of course, if event A becomes ava

Re: BufferTimeout throughput vs latency.

2020-09-16 Thread David Anderson
The performance loss being referred to there is reduced throughput. There's a blog post by Nico Kruber [1] that covers Flink's network stack in considerable detail. The last section on latency vs throughput gives some more insight on this point. In the experiment reported on there, the difference

Re: Automatically restore from checkpoint

2020-09-18 Thread David Anderson
If your job crashes, Flink will automatically restart from the latest checkpoint, without any manual intervention. JobManager HA is only needed for automatic recovery after the failure of the Job Manager. You only need externalized checkpoints and "-s :checkpointPath" if you want to use checkpoint

Re: Efficiently processing sparse events in a time windows

2020-09-24 Thread David Anderson
Steven, I'm pretty sure this is a scenario that doesn't have an obvious good solution. As you have discovered, the window API isn't much help; using a process function does make sense. The challenge is finding a data structure to use in keyed state that can be efficiently accessed and updated. On

Re: Experiencing different watermarking behaviour in local/standalone modes when reading from kafka topic with idle partitions

2020-10-01 Thread David Anderson
If you were to use per-partition watermarking, which you can do by calling assignTimestampsAndWatermarks directly on the Flink Kafka consumer [1], then I believe the idle partition(s) would consistently hold back the overall watermark. With per-partition watermarking, each Kafka source task will a

Re: need help about "incremental checkpoint",Thanks

2020-10-02 Thread David Anderson
It looks like you were trying to resume from a checkpoint taken with the FsStateBackend into a revised version of the job that uses the RocksDbStateBackend. Switching state backends in this way is not supported: checkpoints and savepoints are written in a state-backend-specific format, and can only

Re: need help about "incremental checkpoint",Thanks

2020-10-02 Thread David Anderson
sStateBackend.* > *It's NOT a match.* > So I'm wrong in step 5? > Is my above understanding right? > > Thanks for your help. > > -- 原始邮件 -- > *发件人:* "David Anderson" ; > *发送时间:* 2020年10月2日(星期五) 晚上10:35 > *收件人:* "大森林&

Re: need help about "incremental checkpoint",Thanks

2020-10-02 Thread David Anderson
kEnd)? > > Much Thanks~! > > > -- 原始邮件 -- > 发件人: "大森林" ; > 发送时间: 2020年10月2日(星期五) 晚上11:41 > 收件人: "David Anderson"; > 抄送: "user"; > 主题: 回复: need help about "incremental checkpoint",Thanks > >

Re: need help about "incremental checkpoint",Thanks

2020-10-06 Thread David Anderson
. Best, David On Mon, Oct 5, 2020 at 2:38 PM 大森林 wrote: > Could you give more details? > Thanks > > > -- 原始邮件 -- > *发件人:* "大森林" ; > *发送时间:* 2020年10月3日(星期六) 上午9:30 > *收件人:* "David Anderson"; > *抄送:* "user"; &g

Re: S3 StreamingFileSink issues

2020-10-07 Thread David Anderson
Dan, The first point you've raised is a known issue: When a job is stopped, the unfinished part files are not transitioned to the finished state. This is mentioned in the docs as Important Note 2 [1], and fixing this is waiting on FLIP-46 [2]. That section of the docs also includes some S3-specifi

Re: S3 StreamingFileSink issues

2020-10-07 Thread David Anderson
ers here on what should happen, happy to submit a > patch. > > > > > On Wed, Oct 7, 2020 at 1:37 AM David Anderson > wrote: > >> Dan, >> >> The first point you've raised is a known issue: When a job is stopped, >> the unfinished part files are

Re: how to simulate the scene "back pressure" in flink?Thanks~!

2020-10-09 Thread David Anderson
ould you tell me where to set RestOption.POPT?in configuration >> what's the value should I set for RestOption.PORT? >> >> Thanks. >> >> >> -- 原始邮件 -- >> *发件人:* "Arvid Heise" ; >> *发送时间:* 2020年10月9日(星期五) 下午3

Re: how to simulate the scene "back pressure" in flink?Thanks~!

2020-10-10 Thread David Anderson
erstand both: > > ---- > Dear David Anderson: > Is the whole command like this? > flink run *--backpressure* -c wordcount_increstate > d

Re: how to simulate the scene "back pressure" in flink?Thanks~!

2020-10-10 Thread David Anderson
020 at 5:32 PM 大森林 wrote: > > Could I use your command with no docker? > > -- 原始邮件 ------ > *发件人:* "David Anderson" ; > *发送时间:* 2020年10月10日(星期六) 晚上10:30 > *收件人:* "大森林"; > *抄送:* "Arvid Heise";"user"; >

Re: [DISCUSS] Remove flink-connector-filesystem module.

2020-10-13 Thread David Anderson
I think the pertinent question is whether there are interesting cases where the BucketingSink is still a better choice. One case I'm not sure about is the situation described in docs for the StreamingFileSink under Important Note 2 [1]: ... upon normal termination of a job, the last in-progres

Re: what's the datasets used in flink sql document?

2020-10-15 Thread David Anderson
For a dockerized playground that includes a dataset, many working examples, and training slides, see [1]. [1] https://github.com/ververica/sql-training David On Thu, Oct 15, 2020 at 10:18 AM Piotr Nowojski wrote: > Hi, > > The examples in the documentation are not fully functional. They assume

Re: Feature request: Removing state from operators

2020-10-30 Thread David Anderson
It seems like another option here would be to occasionally use the state processor API to purge a savepoint of all unnecessary state. On Fri, Oct 30, 2020 at 6:57 PM Steven Wu wrote: > not a solution, but a potential workaround. Maybe rename the operator uid > so that you can continue to leverag

Re: Running flink in a Local Execution Environment for Production Workloads

2020-10-30 Thread David Anderson
Another factor to be aware of is that there's no cluster configuration file to load (the mini-cluster doesn't know where to find flink-conf.yaml), and by default you won't have a REST API endpoint or web UI available. But you can do the configuration in the code, and/or provide configuration on the

Re: Print on screen DataStream content

2020-11-24 Thread David Anderson
When Flink is running on a cluster, `DataStream#print()` prints to files in the log directory. Regards, David On Tue, Nov 24, 2020 at 6:03 AM Pankaj Chand wrote: > Please correct me if I am wrong. `DataStream#print()` only prints to the > screen when running from the IDE, but does not work (pri

Re: Print on screen DataStream content

2020-11-24 Thread David Anderson
rin wrote: > > I tried to `DataStream#print()` but I don't quite understand how to > > implement it. Could you please give me an example? I'm using Intellij so > > what I would need is just to see the data on my screen. > > > > Thanks > > > >

Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-05 Thread David Anderson
taskmanager.cpu.cores is intended for internal use only -- you aren't meant to set this option. What happens if you leave it alone? Regards, David On Sat, Dec 5, 2020 at 8:04 AM Rex Fenley wrote: > We're running this in a local environment so that may be contributing to > what we're seeing. >

Re: Spill RocksDB to external Storage

2020-12-12 Thread David Anderson
RocksDB can not be configured to spill to another filesystem or object store. It is designed as an embedded database, and each task manager needs to have sufficient disk space for its state on the host disk. You might be tempted to use network attached storage for the working state, but that's usua

Re: Disk usage during savepoints

2020-12-12 Thread David Anderson
RocksDB does do compaction in the background, and incremental checkpoints simply mirror to S3 the set of RocksDB SST files needed by the current set of checkpoints. However, unlike checkpoints, which can be incremental, savepoints are always full snapshots. As for why one host would have much more

Re: How to implement a WindowableTask(similar to samza) in Apache flink?

2020-12-23 Thread David Anderson
Please note that I responded to this question on Stack Overflow: https://stackoverflow.com/questions/65414125/how-to-implement-a-windowabletask-similar-to-samza-in-apache-flink Regards, David On Wed, Dec 23, 2020 at 7:08 AM Debraj Manna wrote: > I am new to flink and this is my first post in th

Re: Issue in WordCount Example with DataStream API in BATCH RuntimeExecutionMode

2020-12-23 Thread David Anderson
I did a little experiment, and I was able to reproduce this if I use the sum aggregator on KeyedStream to do the counting. However, if I implement my own counting in a KeyedProcessFunction, or if I use the Table API, I get correct results with RuntimeExecutionMode.BATCH -- though the results are p

Re: Tumbling Time Window

2021-01-04 Thread David Anderson
For straightforward tumbling windows, the regular DSL windowing performs noticeably better than a custom process function because it takes advantage of an internal API to avoid some serialization overhead. There's a simple example of a ProcessWindowFunction in [1], and an example of using a KeyedP

Re: What is the best way to have a cache of an external database in Flink?

2021-01-22 Thread David Anderson
I provided an answer on stackoverflow, where I said the following: A few different mechanisms in Flink may be relevant to this use case, depending on your detailed requirements. *Broadcast State* Jaya Ananthram has already covered the idea

Re: RocksDB

2020-03-10 Thread David Anderson
The State Processor API goes a bit in the direction you asking about, by making it possible to query savepoints. https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html

Re: Do I need to set assignTimestampsAndWatermarks if I set my time characteristic to IngestionTime?

2020-03-10 Thread David Anderson
Watermarks are a tool for handling out-of-orderness when working with event time timestamps. They provide a mechanism for managing the tradeoff between latency and completeness, allowing you to manage how long to wait for any out-of-orderness to resolve itself. Note the way that Flink uses these te

Re: what is the difference between map vs process on a datastream?

2020-03-17 Thread David Anderson
Map applies a MapFunction (or a RichMapFunction) to a DataStream and does a one-to-one transformation of the stream elements. Process applies a ProcessFunction, which can produce zero, one, or many events in response to each event. And when used on a keyed stream, a KeyedProcessFunction can use Ti

Re: FlinkCEP - Detect absence of a certain event

2020-03-19 Thread David Anderson
Humberto, Although Flink CEP lacks notFollowedBy at the end of a Pattern, there is a way to implement this by exploiting the timeout feature. The Flink training includes an exercise [1] where the objective is to identify taxi rides with a START event that is not followed by an END event within tw

Re: Can't create a savepoint with State Processor API

2020-03-19 Thread David Anderson
github.com/alpinegizmo/ff3d2e748287853c88f21259830b29cf for my version. *David Anderson* | Training Coordinator Follow us @VervericaData -- Join Flink Forward - The Apache Flink Conference Stream Processing | Event Driven | Real Time On Thu, Mar 19, 2020 at 2:54 AM Dmitry Minaev wrote: >

Re: How to debug checkpoints failing to complete

2020-03-25 Thread David Anderson
seeksst has already covered many of the relevant points, but a few more thoughts: I would start by checking to see if the checkpoints are failing because they timeout, or for some other reason. Assuming they are timing out, then a good place to start is to look and see if this can be explained by

Re: Ask for reason for choice of S3 plugins

2020-03-27 Thread David Anderson
rectory - all these existence requests are S3 HEAD requests, which have very low request rate limits *David Anderson* | Training Coordinator Follow us @VervericaData -- Join Flink Forward - The Apache Flink Conference Stream Processing | Event Driven | Real Time On Fri, Mar 27, 2020 at 7

Re: Sync two DataStreams

2020-04-04 Thread David Anderson
There are a few ways to pre-ingest data from a side input before beginning to process another stream. One is to use the State Processor API [1] to create a savepoint that has the data from that side input in its state. For a simple example of bootstrapping state into a savepoint, see [2]. Another

[PROPOSAL] Contribute training materials to Apache Flink

2020-04-09 Thread David Anderson
Dear Flink Community! For some time now Ververica has been hosting some freely available Apache Flink training materials at https://training.ververica.com. This includes tutorial content covering the core concepts of the DataStream API, and hands-on exercises that accompany those explanations. We

Re: Is it possible to emulate keyed state with operator state?

2020-04-10 Thread David Anderson
Hypothetically, yes, I think this is possible to some extent. You would have to give up all the things that require a KeyedStream, such as timers, and the RocksDB state backend. And performance would suffer. As for the question of determining which key groups (and ultimately, which keys) are handl

Re: [PROPOSAL] Contribute training materials to Apache Flink

2020-04-10 Thread David Anderson
e that flink.apache.org and flink playgrounds respectively are the >> best places to host this content. >> >> On Thu, Apr 9, 2020 at 2:56 PM Niels Basjes wrote: >> >>> Hi, >>> >>> Sounds like a very nice thing to have as part of the pro

Re: Upgrading Flink

2020-04-14 Thread David Anderson
@Chesnay Flink doesn't seem to guarantee client-jobmanager compability, even for bug-fix releases. For example, some jobs compiled with 1.9.0 don't work with a cluster running 1.9.2. See https://github.com/ververica/sql-training/issues/8#issuecomment-590966210 for an example of a case when recompil

Re: [PROPOSAL] Contribute training materials to Apache Flink

2020-04-15 Thread David Anderson
gt; > One thing the community should be aware of is that we also need to > maintain the training material. Maybe you could share with us how much > effort this usually is when updating the training material for a new Flink > version, David? > > Cheers, > Till > > On

Re: "Fill in" notification messages based on event time watermark

2020-04-27 Thread David Anderson
Following up on Piotr's outline, there's an example in the documentation of how to use a KeyedProcessFunction to implement an event-time tumbling window [1]. Perhaps that can help you get started. Regards, David [1] https://ci.apache.org/projects/flink/flink-docs-master/tutorials/event_driven.htm

Re: Autoscaling flink application

2020-05-05 Thread David Anderson
There's no explicit, out-of-the-box support for autoscaling, but it can be done. Autoscaling came up a few times at the recent Virtual Flink Forward, including a talk on Autoscaling Flink at Netflix [1] by Timothy Farkas. [1] https://www.youtube.com/watch?v=NV0jvA5ZDNc Regards, David On Mon,

Re: checkpointing opening too many file

2020-05-07 Thread David Anderson
With the FsStateBackend you could also try increasing the value of state.backend.fs.memory-threshold [1]. Only those state chunks that are larger than this value are stored in separate files; smaller chunks go into the checkpoint metadata file. The default is 1KB, increasing this should reduce file

Re: Restore from savepoint through Java API

2020-06-12 Thread David Anderson
You can study LocalStreamingFileSinkTest [1] for an example of how to approach this. You can use the test harnesses [2], keeping in mind that - initializeState is called during instance creation - the provided context indicates if state is being restored from a snapshot - snapshot is called when t

Re: Question about Watermarks within a KeyedProcessFunction

2020-06-27 Thread David Anderson
With an AscendingTimestampExtractor, watermarks are not created for every event, and as your job starts up, some events will be processed before the first watermark is generated. The impossible value you see is an initial value that's in place until the first real watermark is available. On the ot

Re: CEP use case ?

2020-07-17 Thread David Anderson
If the rules can be implemented by examining events in isolation (e.g., temperature > 25), then the DataStream API is all you need. But if you want rules that are looking for temporal patterns that play across multiple events, then CEP or MATCH_RECOGNIZE (part of Flink SQL) will simplify the implem

Re: How to write junit testcases for KeyedBroadCastProcess Function

2020-07-17 Thread David Anderson
You could approach testing this in the same way that Flink has implemented its unit tests for KeyedBroadcastProcessFunctions, which is to use a KeyedTwoInputStreamOperatorTestHarness with a CoBroadcastWithKeyedOperator. To learn how to use Flink's test harnesses, see [1], and for an example of test

Re: Backpressure on Window step

2020-07-17 Thread David Anderson
Backpressure is typically caused by something like one of these things: * problems relating to i/o to external services (e.g., enrichment via an API or database lookup, or a sink) * data skew (e.g., a hot key) * under-provisioning, or competition for resources * spikes in traffic * timer storms I

Re: Backpressure on Window step

2020-07-18 Thread David Anderson
Subtasks has > between 136kb and 2.45mb of state, with a checkpoint duration of 280ms to 2 > seconds. Each of the 32 subtasks appear to be handling 1,700-50,000 records > an hour with a bytes received of 7mb and 170mb > > > > Am I barking up the wrong tree? > > > &

Re: Flink FsStatebackend is giving better performance than RocksDB

2020-07-18 Thread David Anderson
You should be able to tune your setup to avoid the OOM problem you have run into with RocksDB. It will grow to use all of the memory available to it, but shouldn't leak. Perhaps something is misconfigured. As for performance, with the FSStateBackend you can expect: * much better throughput and av

Re: Flink Sinks

2020-07-18 Thread David Anderson
Prasanna, The Flink project does not have an SQS connector, and a quick google search hasn't found one. Nor does Flink have an HTTP sink, but with a bit of googling you can find that various folks have implemented this themselves. As for implementing SQS as a custom sink, if you need exactly once

Re: How do I trigger clear custom state in ProcessWindowsFunction

2020-07-18 Thread David Anderson
ProcessWindowFunction#process is passed a Context object that contains public abstract KeyedStateStore windowState(); public abstract KeyedStateStore globalState(); which are available for you to use for custom state. Whatever you store in windowState is scoped to a window, and is cleared whe

Re: Is it possible to do state migration with checkpoints?

2020-07-23 Thread David Anderson
I believe this should work, with a couple of caveats: - You can't do this with unaligned checkpoints - If you have dropped some state, you must specify --allowNonRestoredState when you restart the job David On Wed, Jul 22, 2020 at 4:06 PM Sivaprasanna wrote: > Hi, > > We are trying out state s

Re: Count of records coming into the Flink App from KDS and at each step through Execution Graph

2020-07-25 Thread David Anderson
Setting up a Flink metrics dashboard in Grafana requires setting up and configuring one of Flink's metrics reporters [1] that is supported by Grafana as a data source. That means your options for a metrics reporter are Graphite, InfluxDB, Prometheus, or the Prometheus push reporter. If you want re

Re: Is there a way to use stream API with this program?

2020-07-25 Thread David Anderson
In this use case, couldn't the custom trigger register an event time timer for MAX_WATERMARK, which would be triggered when the bounded input reaches its end? David On Mon, Jul 20, 2020 at 5:47 PM Piotr Nowojski wrote: > Hi, > > I'm afraid that there is not out of the box way of doing this. I'v

Re: Is outputting from components other than sinks or side outputs a no-no ?

2020-07-26 Thread David Anderson
Every job is required to have a sink, but there's no requirement that all output be done via sinks. It's not uncommon, and doesn't have to cause problems, to have other operators that do I/O. What can be problematic, however, is doing blocking I/O. While your user function is blocked, the function

Re: Is there a way to use stream API with this program?

2020-07-28 Thread David Anderson
cally by the > WatermarkAssignerOperator. > > Piotrek > > pon., 27 lip 2020 o 09:16 Flavio Pompermaier > napisał(a): > >> Yes it could..where should I emit the MAX_WATERMARK and how do I detect >> that the input reached its end? >> >> On Sat, Jul 25, 202

Re: Removing stream in a job having state

2020-07-28 Thread David Anderson
When you modify a job by removing a stateful operator, then during a restart when Flink tries to restore the state, it will complain that the snapshot contains state that can not be restored. The solution to this is to restart from the savepoint (or retained checkpoint), specifying that you want t

Re: Support for Event time clock specific to each stream in parallel streams

2020-07-31 Thread David Anderson
It sounds like you would like to have something like event-time-based windowing, but with independent watermarking for every key. An approach that can work, but it is somewhat cumbersome, is to not use watermarks or windows, but instead put all of the logic in a KeyedProcessFunction (or RichFlatMap

Re: [Flink Unit Tests] Unit test for Flink streaming codes

2020-08-02 Thread David Anderson
Vijay, There's a section of the docs that describes some strategies for writing tests of various types, and it includes some Scala examples [1]. There are also some nice examples from Konstantin Knauf in [2], though they are mostly in Java. [1] https://ci.apache.org/projects/flink/flink-docs-sta

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-03 Thread David Anderson
Jincheng, One thing that I like about the way that the documentation is currently organized is that it's relatively straightforward to compare the Python API with the Java and Scala versions. I'm concerned that if the PyFlink docs are more independent, it will be challenging to respond to question

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-04 Thread David Anderson
merce. So I prefer to > keep the current structure. > > >> it's relatively straightforward to compare the Python API with the > Java and Scala versions. > > Regarding the comparison between Python API and Java/Scala API, I think > the majority of users, especially the b

Re: Submit Flink 1.11 job from java

2020-08-07 Thread David Anderson
Flavio, Have you looked at application mode [1] [2] [3], added in 1.11? It offers at least some of what you are looking for -- the application jar and its dependencies can be pre-uploaded to HDFS, and the main() method runs on the job manager, so none of the classes have to be loaded in the client

Re: Watermarks on map operator

2021-02-05 Thread David Anderson
Basically the only thing that Watermarks do is to trigger event time timers. Event time timers are used explicitly in KeyedProcessFunctions, but are also used internally by time windows, CEP (to sort the event stream), in various time-based join operations, and within the Table/SQL API. If you wan

Re: question on checkpointing

2021-02-05 Thread David Anderson
I've seen checkpoints timeout when using the RocksDB state backend with very large objects. The issue is that updating a ValueState stored in RocksDB requires deserializing, updating, and then re-serializing that object -- and if that's some enormous collection type, that will be slow. In such case

Re: Which type serializer is being used?

2021-02-18 Thread David Anderson
You can use TypeInformation#of(MyClass.class).createSerializer() to determine which serializer Flink will use for a given type. Best, David On Wed, Feb 17, 2021 at 7:12 PM Sudharsan R wrote: > Hi, > I would like to find out what types are being serialized with which > serializer. Is there an ea

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-25 Thread David Anderson
Rion, What you want isn't really achievable with the APIs you are using. Without some sort of per-key (per-tenant) watermarking -- which Flink doesn't offer -- the watermarks and windows for one tenant can be held up by the failure of another tenant's events to arrive in a timely manner. However,

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-26 Thread David Anderson
gt; // tenant/source combination in this window > > } > > > > companion object { > > private val watermark = ValueStateDescriptor( > > "watermark", > > Long::class.java > > ) > >

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-02-27 Thread David Anderson
ogies I’ve > used in the past. I’m totally sold on it and am looking forward to doing > more incredible things with it. > > Best regards, > > Rion > > On Feb 26, 2021, at 4:36 AM, David Anderson wrote: > >  > Yes indeed, Timo is correct -- I am proposing that you not u

Re: Savepoint documentation

2021-03-02 Thread David Anderson
You are correct in thinking that the documentation wasn't updated. If you look at the master docs [1] you will see that they now say Can I move the Savepoint files on stable storage? #

Re: Savepoint documentation

2021-03-03 Thread David Anderson
r versions (so > downgrade is impossible)? > > Le mar. 2 mars 2021 à 12:25, David Anderson a > écrit : > >> You are correct in thinking that the documentation wasn't updated. If you >> look at the master docs [1] you will see that they now say >> >> Can

Re: Handling Data Separation / Watermarking from Kafka in Flink

2021-03-03 Thread David Anderson
dow >> after some event time and allowed lateness, I could set a timer or just >> explicitly keep the window open for some additional time to allow for out >> of order data to make its way into the window. >> >> Either way - I think the keying is probably the right appro

Re: Broadcasting to multiple operators

2021-03-05 Thread David Anderson
This is a watermarking issue. Whenever an operator has two or more input streams, its watermark is the minimum of watermarks of the incoming streams. In this case your broadcast stream doesn't have a watermark generator, so it is preventing the watermarks from advancing. This in turn is preventing

Re: Dynamic JDBC Sink Support

2021-03-05 Thread David Anderson
Rion, A given JdbcSink can only write to one table, but if the number of tables involved isn't unreasonable, you could use a separate sink for each table, and use side outputs [1] from a process function to steer each record to the appropriate sink. I suggest you avoid trying to implement a sink.

Re: Broadcasting to multiple operators

2021-03-05 Thread David Anderson
sense now. It was >> confusing because I was able to use the Broadcast stream prior to adding >> the second stream. However, now I realize that this part of the pipeline >> occurs after the windowing so I'm not affected the same way. This is >> definitely going to help

Re: Flink application has slightly data loss using Processing Time

2021-03-08 Thread David Anderson
Rainie, Were there any failures/restarts, or is this discrepancy observed without any disruption to the processing? Regards, David On Mon, Mar 8, 2021 at 10:14 AM Rainie Li wrote: > Thanks for the quick response, Smile. > I don't use window operators or flatmap. > Here is the core logic of my

Re: Flink application has slightly data loss using Processing Time

2021-03-08 Thread David Anderson
case? David On Mon, Mar 8, 2021 at 7:53 PM Rainie Li wrote: > Thanks Yun and David. > There were some tasks that got restarted. We configured the restart policy > and the job didn't fail. > Will task restart cause data loss? > > Thanks > Rainie > > > On Mon

Re: Does WatermarkStrategy.withIdleness work?

2021-03-12 Thread David Anderson
WatermarkStrategy.withIdleness works by marking idle streams as idle, so that downstream operators will ignore those streams and allow the watermarks to progress based only on the advancement of the watermarks of the still active streams. As you suspected, this mechanism does not provide for the wa

Re: Why does Flink FileSystem sink splits into multiple files

2021-03-15 Thread David Anderson
The first time you ran it without having specified the parallelism, and so you got the default parallelism -- which is greater than 1 (probably 4 or 8, depending on how many cores your computer has). Flink is designed to be scalable, and to achieve that, parallel instances of an operator, such as

Re: Flink application has slightly data loss using Processing Time

2021-03-20 Thread David Anderson
passed since last >>>> append >>>> 2021-02-10 01:19:27,216 INFO >>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job >>>> (7dab4c1a1c6984e70732b8e3f218020f) switched from state RUNNING to FAILING. >>>> org.apache.flink.util.Fl

Re: Approaches to customize the parallelism in SQL generated operators

2021-03-20 Thread David Anderson
No, there is no mechanism available for individually tuning the parallelism of the generated operators in a SQL job. Moreover, such fine-tuning is often counter-productive. In most cases you are better off simply setting the overall parallelism to whatever is needed by the busiest operator(s). Unne

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-24 Thread David Anderson
For an example of a similar join implemented as a RichCoFlatMap, see [1]. For more background, the Flink docs have a tutorial [2] on how to work with connected streams. [1] https://github.com/apache/flink-training/tree/master/rides-and-fares [2] https://ci.apache.org/projects/flink/flink-docs-stab

Re: Flink SQL client: SELECT 'hello world' throws [ERROR] Could not execute SQL statement

2021-03-26 Thread David Anderson
There needs to be a Flink session cluster available to the SQL client on which it can run the jobs created by your queries. See the Getting Started [1] section of the SQL Client documentation for more information: The SQL Client is bundled in the regular Flink distribution and thus runnable out-of

Re: How to visualize the results of Flink processing or aggregation?

2021-03-26 Thread David Anderson
Prometheus is a metrics system; you can use Flink's Prometheus metrics reporter to send metrics to Prometheus. Grafana can also be connected to influxdb, and to databases like mysql and postgresql, for which sinks are available. And the Elasticsearch sink can be used to create visualizations with

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-29 Thread David Anderson
Yes, since the two streams have the same type, you can union the two streams, key the resulting stream, and then apply something like a RichFlatMapFunction. Or you can connect the two streams (again, they'll need to be keyed so you can use state), and apply a RichCoFlatMapFunction. You can use whic

  1   2   3   >