Re: how to expose the current in-flight async i/o requests as metrics?

2021-11-08 Thread Dongwon Kim
Hi David, There are currently no metrics for the async work-queue size (you should be > able to see the queue stats with debug logs enabled though [1]). Thanks for the input but scraping DEBUG messages into, for example, ElasticSearch for monitoring on Grafana is not possible in my current enviro

Dependency injection for TypeSerializer?

2021-11-08 Thread Thomas Weise
Hi, I was looking into a problem that requires a configurable type serializer for communication with a schema registry. The service endpoint can change, so I would not want to make it part of the serializer snapshot but rather resolve it at graph construction time (similar to how a Kafka bootstrap

Re: Getting mini-cluster logs when debugging pyflink from IDE

2021-11-08 Thread Роман VVvKamper
Thanks for help, I’ll look into it :) Two more questions: Is there a way to configure this path? For example to write logs to a file in the working dir? And is there a way to redirect logst from file to stdout? On 9 Nov 2021, at 09:00, Dian Fu mailto:dian0511...@gmail.com>> wrote: Hi, The l

Question about using Calcite for fetching SQL column lineage

2021-11-08 Thread kangqi
Hi community, I’m trying to extract column lineage from Flink SQL jobs (all of them are single INSERT statements). Here’s what I have done: 1. From `SqlToOperationConverter#convertSqlInsert()`, get the `PlannerQueryOperation` generated by the INSERT statement. 2. Get the corresponding `RelNod

Re: Getting mini-cluster logs when debugging pyflink from IDE

2021-11-08 Thread Dian Fu
Hi, The logs should appear in the log file of the TaskManger and you could find it under directory $PYTHON_INSTALLATION_DIR/site-packages/pyflink/log/ Regards, Dian On Mon, Nov 8, 2021 at 10:53 PM Роман VVvKamper wrote: > Hello, > > I'm trying to debug flink and pyflink job from IDE using mini

Beginner: guidance on long term event stream persistence and replaying

2021-11-08 Thread Simon Paradis
Hi, We have an event processing pipeline that populates various reports from different Kafka topics and would like to centralize processing in Flink. My team is new to Flink but we did some prototyping using Kinesis. To enable new reporting based on past events, we'd like the ability to replay th

Re: Running a performance benchmark load test of Flink on new CPUs

2021-11-08 Thread Vijay Balakrishnan
Austin -the flink benchmark is for testing Flink on single machines and not a cluster. I did see this https://oceanrep.geomar.de/50729/1/bsc_nico_biernat_thesis.pdf but it is more for testing the Scaling of Flink instead of testing throughput and latency. On Mon, Nov 8, 2021 at 10:54 AM Vijay Bal

Re: Running a performance benchmark load test of Flink on new CPUs

2021-11-08 Thread Vijay Balakrishnan
Thx, Austin. I was hoping there might be a newer benchmark run similar to done by dataArtisans on Flink in 2016(old). https://www.ververica.com/blog/extending-the-yahoo-streaming-benchmark Looks like Yahoo Streaming benchmark was an initial standard in 2016. Hoping to see something updated for lat

Re: Restarting a job with drain flag set to true

2021-11-08 Thread David Morávek
It's not a recommended approach, but if you're able to handle this "side-effect" downstream (eg. ignore the "wrong / incomplete" results), then you should be OK On Mon, Nov 8, 2021 at 4:36 PM Pedro Facal wrote: > Hi David, > > Thanks for the quick response. Say our app writes to disk every 10

Re: Restarting a job with drain flag set to true

2021-11-08 Thread Pedro Facal
Hi David, Thanks for the quick response. Say our app writes to disk every 10 minutes: the difference is, in one case a single window is emitted and thus a single file is written, while if we drain and restart the pipeline we will end up with two files (because we will have two windows emitted fo

Re: Restarting a job with drain flag set to true

2021-11-08 Thread David Morávek
Hi Pedro, draining basically means that all of the sources will finish and progress their watermark to end of the global window, which will fire all of the triggers as a result. In other words, it will trigger the _ON_TIME_ results from all of the unfinished windows, even though they might not hav

Getting mini-cluster logs when debugging pyflink from IDE

2021-11-08 Thread Роман VVvKamper
Hello, I'm trying to debug flink and pyflink job from IDE using mini cluster (local mode). When i doing it in the java flink, everything works like a charm - i can see flink mini-cluster logs in the console. But when i run pyflink job in local mode (through the IDE of by simply calling python

Restarting a job with drain flag set to true

2021-11-08 Thread Pedro Facal
Hello, We have an apache beam streaming application, running under flink native kubernetes. It consolidates aws kinesis records into parquet files every few minutes. To manage the lifecycle of this app, we use the rest api to stop the job with a savepoint and then restart the cluster/job fro

Re: How to express the datatype of sparksql collect_list(named_struct(...)) in flinksql?

2021-11-08 Thread JING ZHANG
Hi Vtygoss, You could try the following SQL: ``` select COLLECT(ROW(id, name)) as info from table group by ...; ``` In the above sql, the result type of `COLLECT(ROW(id, name))` is MULTISET. `CollectAggFunction` would store the data in a MapState. key is element type, represent the row value.

Re: to join or not to join, that is the question...

2021-11-08 Thread Seth Wiesman
There is no such restriction on connected streams; either input may modify the keyed state. Regarding performance, the difference between the two should be negligible and I would go with the option with the cleanest semantics. If both streams are the same type *and* you do not care which input an e

Re: Elasticsearch6 connector in flink stand alone

2021-11-08 Thread David Morávek
Hi Ravi, I'm moving this thread to the user@flink mailing list, which is designed for these type of questions. For your issue, I don't think it's related to the elasticsearch integration. It seems like there is something wrong with your log4j setup. Either you have a conflicting log4j jars on the

Re: how to expose the current in-flight async i/o requests as metrics?

2021-11-08 Thread David Morávek
Hi Dongwon, There are currently no metrics for the async work-queue size (you should be able to see the queue stats with debug logs enabled though [1]). As far as I can tell from looking at the code, the async operator is able to checkpoint even if the work-queue is exhausted. Arvid can you pleas

Re: Fetch data from Rest API and sink to Kafka topic

2021-11-08 Thread Shuiqiang Chen
Hi Sharma, >From your description, it seem that you need to implement a custom source to fetch data from an Http server. Please refer to data sources [1] to learn how to develop a data source. And FYI, there is a si

Re: unsubscribe

2021-11-08 Thread David Morávek
Hi Peter, to unsubscribe, please send an email to user-unsubscr...@flink.apache.org Best, D. On Fri, Nov 5, 2021 at 9:28 AM Peter Schrott wrote: > unsubscribe >

Fetch data from Rest API and sink to Kafka topic

2021-11-08 Thread Manjusha Sharma
Hi I am new to Flink and just getting started. I've watched quite a few Flink Forward videos and excited to get started. I have a need where I need to pull data from a RESTFul API endpoint that is authenticated using username and password and send this data to a Kafka Topic. I would like to pull

How to express the datatype of sparksql collect_list(named_struct(...)) in flinksql?

2021-11-08 Thread vtygoss
Hi, flink community! I am working on migrating data production pipeline from SparkSQL to FlinkSQL(1.12.0). And i meet a problem about MULTISET>. ``` Spark SQL select COLLECT_LIST(named_struct('id', id, 'name', name)) as info from table group by ...; ``` - 1. how to express and store th

Re: New blog post published - Sort-Based Blocking Shuffle Implementation in Flink

2021-11-08 Thread Zhilong Hong
Thank you for writing this blog post, Daisy and Kevin! It helps me to understand what sort-based shuffle is and how to use it. Looking forward to your future improvements! On Wed, Nov 3, 2021 at 6:32 PM Yuxin Tan wrote: > Thanks Daisy and Kevin! The IO scheduling idea of the sequential reading >