Re: Pyflink 1.20.1 packages can not be installed

2025-05-12 Thread Xingbo Huang
Hi Janus, Which Python version are you using? Flink 1.19 removed support for Python 3.7. Best, Xingbo. janusgraph wrote on Mon, 12 May 2025 at 15:03: > Sorry about that, the snapshot is broken. > > The errors are as follows. > ``` > # pip3 install apache-flink==1.20.1 > Collecting apache-flink==1.20.1 >

Re: PyFlink on EMR on EKS

2024-09-03 Thread Ahmed Hamdy
Hi Alexandre, This seems to be complaining about loading the Python script. It looks like the local file system uses the `file` scheme prefix, not `local` [1]. FYI, inside your Python script you can add more dependencies, such as connectors, using Python dependency management [2], which differs from Jav

Re: Pyflink w Nessie and Iceberg in S3 Jars

2024-04-16 Thread Robert Prat
From: Péter Váry Sent: Tuesday, April 16, 2024 9:56 PM To: Robert Prat Cc: Oscar Perez via user Subject: Re: Pyflink w Nessie and Iceberg in S3 Jars Is it intentional that you are using iceberg-flink-runtime-1.16-1.3.1.jar with 1.18.0 PyFlink? This might cause issues later. I would

Re: Pyflink w Nessie and Iceberg in S3 Jars

2024-04-16 Thread Péter Váry
Is it intentional that you are using iceberg-flink-runtime-1.16-1.3.1.jar with 1.18.0 PyFlink? This might cause issues later. I would try to synchronize the Flink versions across all the dependencies. On Tue, Apr 16, 2024, 11:23 Robert Prat wrote: > I finally managed to make it work followi

Re: Pyflink Performance and Benchmark

2024-04-16 Thread Chase Zhang
On Mon, Apr 15, 2024 at 16:17 Niklas Wilcke wrote: > Hi Flink Community, > > I wanted to reach out to you to get some input about Pyflink performance. > Are there any resources available about Pyflink benchmarks and maybe a > comparison with the Java API? I wasn't able to find something valuabl

Re: Pyflink w Nessie and Iceberg in S3 Jars

2024-04-16 Thread Robert Prat
I finally managed to make it work following the advice of Robin Moffat who replied to the earlier email: There's a lot of permutations that you've described, so it's hard to take one reproducible test case here to try and identify the error :) It certainly looks JAR related. You could try adding

Re: [EXTERNAL]Re: Pyflink Performance and Benchmark

2024-04-15 Thread Niklas Wilcke
Hi Zhanghao Chen, thanks for sharing the link. This looks quite interesting! Regards, Niklas > On 15. Apr 2024, at 12:43, Zhanghao Chen wrote: > > When it comes down to the actual runtime, what really matters is the plan > optimization and the operator impl & shuffling. You might be intereste

Re: Pyflink Performance and Benchmark

2024-04-15 Thread Zhanghao Chen
When it comes down to the actual runtime, what really matters is the plan optimization and the operator impl & shuffling. You might be interested in this blog: https://flink.apache.org/2022/05/06/exploring-the-thread-mode-in-pyflink/, which did a benchmark on the latter with the common JSON

Re: [PyFlink] Collect multiple elements in CoProcessFunction

2023-11-22 Thread Alexander Fedulov
Hi David, Thanks for the confirmation. Let's fix the docs: https://github.com/apache/flink/pull/23776 Thanks, Alex On Sun, 19 Nov 2023 at 01:55, David Anderson wrote: > Hi, Alex! > > Yes, in PyFlink the various flatmap and process functions are implemented > as generator functions, so they us

Re: [PyFlink] Collect multiple elements in CoProcessFunction

2023-11-18 Thread David Anderson
Hi, Alex! Yes, in PyFlink the various flatmap and process functions are implemented as generator functions, so they use yield to emit results. David On Tue, Nov 7, 2023 at 1:16 PM Alexander Fedulov < alexander.fedu...@gmail.com> wrote: > Java ProcessFunction API defines a clear way to collect d
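The generator style David describes can be sketched in plain Python, without any PyFlink imports (`my_flat_map` here is a hypothetical user function, not part of the pyflink API):

```python
# Plain-Python sketch of the pattern described above: instead of a
# Java-style collector.collect(out), a PyFlink flatmap/process function
# is written as a generator and yields each output element.
# `my_flat_map` is a hypothetical user function, not part of pyflink.
def my_flat_map(value):
    yield value          # emit the element itself
    yield value * 10     # emit a second, derived element

# Driving the generator by hand shows one input producing many outputs:
results = [out for v in (1, 2) for out in my_flat_map(v)]
# results == [1, 10, 2, 20]
```

Inside a real job, the framework iterates the generator for you; the point is simply that `yield` replaces `out.collect(...)`.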

Re: PyFlink MapState with Types.ROW() throws exception

2023-10-05 Thread Elkhan Dadashov
After digging into the flink-python code, it seems that if `PYFLINK_GATEWAY_DISABLED` is set to false in an environment variable, then using Types.LIST(Types.ROW([...])) works without any issue once the Java gateway is launched. It was unexpected that a Flink local run sets this flag to false explicitly.

Re: Pyflink unittest cases

2023-10-04 Thread Perez
On Mon, Oct 2, 2023 at 9:21 PM joshua perez wrote: > Hello folks, > > Any help is appreciated. > > J. > > On Sat, Sep 30, 2023 at 1:47 PM joshua perez wrote: > >> Hi Team, >> >> We recently have started a use case where there would be involvement of >> Kafka and Flink's low level APIs like ma

Re: Pyflink unittest cases

2023-10-02 Thread joshua perez
Hello folks, Any help is appreciated. J. On Sat, Sep 30, 2023 at 1:47 PM joshua perez wrote: > Hi Team, > > We recently have started a use case involving Kafka and Flink's low-level APIs like map and process functions, and since I am entirely new to this stuff, I

Re: PyFlink SQL from Kafka to Iceberg issues

2023-07-17 Thread Martijn Visser
Hi Dani, Plugins need to be placed in a folder inside the plugins directory, I think that might be the problem. Best regards, Martijn On Sun, Jul 9, 2023 at 7:00 PM Dániel Pálma wrote: > Thanks for the tips Martijn! > > I've fixed the library versions to 1.16 everywhere and also decided to >

Re: PyFlink SQL from Kafka to Iceberg issues

2023-07-09 Thread Dániel Pálma
Thanks for the tips Martijn! I've fixed the library versions to 1.16 everywhere and also decided to scrap pyflink and go for the sql-client instead to keep things simpler for now. This is the Dockerfile I am using for both the *jobmanager* and the *sql-client* FROM flink:1.16.2-scala_2.12-java11

Re: PyFlink SQL from Kafka to Iceberg issues

2023-06-29 Thread Martijn Visser
Hi Dani, There are two things that I notice: 1. You're mixing different Flink versions (1.16 and 1.17): all Flink artifacts should be from the same Flink version 2. S3 plugins need to be added to the plugins folder of Flink, because they are loaded via the plugin mechanism. See https://nightlies.

Re: PyFlink Error JAR files

2023-06-08 Thread Leo
Hi Kadiyala, I think there is a typo in your email: Flink version 1.7.1, maybe 1.17.1? About the error: the reason your code can't run successfully is that the class `org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer` has been obsolete since Flink 1

Re: Pyflink Side Output Question and/or suggested documentation change

2023-02-13 Thread Andrew Otto
Thank you! On Mon, Feb 13, 2023 at 5:55 AM Dian Fu wrote: > Thanks Andrew, I think this is valid advice. I will update the > documentation~ > > Regards, > Dian > > On Fri, Feb 10, 2023 at 10:08 PM Andrew Otto wrote: > >> Question about side outputs and OutputTags in pyflink. The docs >

Re: Pyflink Side Output Question and/or suggested documentation change

2023-02-13 Thread Dian Fu
Thanks Andrew, I think this is valid advice. I will update the documentation~ Regards, Dian On Fri, Feb 10, 2023 at 10:08 PM Andrew Otto wrote: > Question about side outputs and OutputTags in pyflink. The docs >

Re: PyFlink job in kubernetes operator

2023-01-25 Thread Evgeniy Lyutikov
Sent: 25 January 2023 21:03:40 To: Evgeniy Lyutikov Cc: user@flink.apache.org Subject: Re: PyFlink job in kubernetes operator Did you check the Python example? https://github.com/apache/flink-kubernetes-operator/tree/main/examples/flink-python-example

Re: PyFlink job in kubernetes operator

2023-01-25 Thread Gyula Fóra
Did you check the Python example? https://github.com/apache/flink-kubernetes-operator/tree/main/examples/flink-python-example Gyula On Wed, Jan 25, 2023 at 2:54 PM Evgeniy Lyutikov wrote: > Hello > > Is there a way to run PyFlink jobs in k8s with flink kubernetes operator? > And if not, is it p

Re: Pyflink :: Conversion from DataStream to TableAPI

2022-08-17 Thread yu'an huang
Thank you for Dian's explanation. I thought pyflink supported non-keyed streams because I saw "If key_by(...) is not called, your stream is not keyed." in the document lol. Sorry for the confusion, Ramana. On Thu, 18 Aug 2022 at 9:36 AM, Dian Fu wrote: > Hey Ramana, > > Non-keyed window will be s

Re: Pyflink :: Conversion from DataStream to TableAPI

2022-08-17 Thread Dian Fu
Hey Ramana, Non-keyed window will be supported in the coming Flink 1.16. See https://issues.apache.org/jira/browse/FLINK-26480 for more details. In releases prior to 1.16, you could work around it as following: ``` data_stream = xxx data_stream.key_by(lambda x: 'key').xxx().force_non_parallel() `
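The idea behind the workaround quoted above, minus the PyFlink-specific calls, is that keying every record by the same constant routes the whole stream into one logical group. A plain-Python sketch of that routing (record values are made up for illustration):

```python
# Plain-Python sketch of the constant-key workaround above: keying every
# record with the same constant value routes the entire stream into a
# single group, which is what lets a "non-keyed" window be emulated.
# (The real job would follow this with force_non_parallel(), per the thread.)
records = [("sensor-1", 5), ("sensor-2", 7), ("sensor-3", 9)]
key_fn = lambda rec: "key"   # constant key, as in Dian's snippet

groups = {}
for rec in records:
    groups.setdefault(key_fn(rec), []).append(rec)

# Every record lands under the single constant key:
assert list(groups) == ["key"] and len(groups["key"]) == 3
```

The price of this trick is that all records flow through one key, hence the `force_non_parallel()` call in the original snippet.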

Re: Pyflink :: Conversion from DataStream to TableAPI

2022-08-16 Thread Ramana
Hi Yuan - Thanks for your response. Wondering if the window api supports non-keyed streams? On Wed, Aug 17, 2022, 06:43 yu'an huang wrote: > Hi, > > > Pyflink should support window api. You can read this document. > > https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/python/dat

Re: Pyflink :: Conversion from DataStream to TableAPI

2022-08-16 Thread yu'an huang
Hi, Pyflink should support window api. You can read this document. https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/python/datastream/operators/windows/ Hope this helps. Best, Yuan On Tue, 16 Aug 2022 at 3:11 PM, Ramana wrote: > Hi All - > > Trying to achieve the following

Re: PyFlink SQL: force maximum use of slots

2022-07-24 Thread Dian Fu
*Sent:* 20 July 2022 05:19 > *To:* John Tipper > *Cc:* user@flink.apache.org > *Subject:* Re: PyFlink SQL: force maximum use of slots > > Hi John, > > All the operators in the same slot sharing group will be put in one slot. > The slot sharing group is only configurable

Re: PyFlink SQL: force maximum use of slots

2022-07-20 Thread John Tipper
copy all events to all slots, in which case I don't understand how parallelism assists? Many thanks, John From: Dian Fu Sent: 20 July 2022 05:19 To: John Tipper Cc: user@flink.apache.org Subject: Re: PyFlink SQL: force maximum use of slots Hi John

Re: PyFlink SQL: force maximum use of slots

2022-07-19 Thread Dian Fu
Hi John, All the operators in the same slot sharing group will be put in one slot. The slot sharing group is only configurable in DataStream API [1]. Usually you don't need to set the slot sharing group explicitly [2] and this is good to share the resource between the operators running in the same

Re: PyFlink and parallelism

2022-07-18 Thread John Tipper
s broken if the tablestream contains a timestamp, which I reported a little while ago and Dian filed as FLINK-28253. Kind regards, John From: Juntao Hu Sent: 18 July 2022 04:13 To: John Tipper Cc: user@flink.apache.org Subject: Re: PyFlink and parallelism It&

Re: PyFlink and parallelism

2022-07-17 Thread Dian Fu
Hi John, You could try to use `filtered_stream._j_data_stream.getTransformation().setParallelism(int(table_env.get_config().get('my.custom.parallelism', 1)))`. Usually you don't need to do this, however, when converting Table to DataStream, it returns a Java DataStream object which has no setPara

Re: PyFlink and parallelism

2022-07-17 Thread Juntao Hu
It's not an issue with Python-Java object conversion: after calling `to_data_stream`, the object underlying the Python DataStream wrapper is a DataStream rather than a SingleOutputStreamOperator, and `setParallelism` is only available on SingleOutputStreamOperator. To work around this, change `set_parallel

Re: PyFlink and parallelism

2022-07-16 Thread John Tipper
I've tried this and can see there appears to be a bigger problem with PyFlink and a call to set_parallelism(): events_table = table_env.from_path(MY_SOURCE_TABLE) filtered_table = events_table.filter( col("event_type") == "event_of_interest" ) filtered_stream = table_env.to_data_stream(filt

Re: PyFlink: restoring from savepoint

2022-07-08 Thread John Tipper
nks, John From: Dian Fu Sent: 08 July 2022 02:27 To: John Tipper Cc: user@flink.apache.org Subject: Re: PyFlink: restoring from savepoint Hi John, Could you provide more information, e.g. the exact command submitting the job, the logs file, the PyFlink ve

Re: PyFlink: restoring from savepoint

2022-07-07 Thread Dian Fu
Hi John, Could you provide more information, e.g. the exact command submitting the job, the logs file, the PyFlink version, etc? Regards, Dian On Thu, Jul 7, 2022 at 7:53 PM John Tipper wrote: > Hi all, > > I have a PyFlink job running in Kubernetes. Savepoints and checkpoints are > being suc

Re: PyFlink - java code packaging

2022-05-08 Thread aryan m
Hi ! Eventually, I came up with a solution following the instructions here https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/dependency_management/ module_home = os.path.dirname(find_spec("python_lib_internally_containing_java_lib").origin) jar_file = 'file:///' + module_ho
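The locate-the-bundled-JAR pattern above can be sketched with a stdlib package standing in for the package that ships the JAR; the package name (`json`) and JAR filename (`my-connector.jar`) below are stand-ins, and in a real job the resulting URL would then be handed to the environment, e.g. via the `pipeline.jars` option:

```python
# Sketch of the pattern above: locate an installed package on disk with
# importlib, then build a file:// URL to a JAR shipped inside it.
# "json" stands in for the real package; "my-connector.jar" is a
# hypothetical JAR name.
import os
from importlib.util import find_spec

module_home = os.path.dirname(find_spec("json").origin)
jar_url = "file:///" + os.path.join(module_home, "my-connector.jar")
```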

Re: Pyflink -> Redshift/S3/Firehose

2022-05-04 Thread Dian Fu
It uses connectors to send data to external storage systems. Note that the connector implementations are shared between the Java API and the Python API, so if you can find a Java connector, it can usually also be used in PyFlink. For Firehose, a firehose sink connector is provided in F

Re: Pyflink elastic search connectors

2022-03-30 Thread Sandeep Sharat
Thank you for your reply. Now I have a better understanding of it. On Wed, 30 Mar, 2022, 5:29 pm LuNing Wang, wrote: > Hi, > > The principle of the python datastream connector is interprocess > communication via py4j. I blocked in a class loading problem, so I haven't > achieved the PR about the

Re: Pyflink elastic search connectors

2022-03-30 Thread LuNing Wang
Hi, The principle of the Python DataStream connector is interprocess communication via py4j. I got blocked on a class-loading problem, so I haven't finished the PR for the Python ES DataStream connector yet. Compared with other connectors, the ES one is a little more troublesome. Because implementing of

Re: Pyflink elastic search connectors

2022-03-30 Thread Sandeep Sharat
Hi, I am pretty much a novice in Python, so writing an entire wrapper in Python may be a tough nut to crack for me. But just out of curiosity, I want to ask why the connectors were not implemented in the Python API. Is it because of a much smaller number of use cases? Or most

Re: Pyflink elastic search connectors

2022-03-29 Thread Sandeep Sharat
Hi, Thank you for the quick responses. We are using the datastream api for pyflink. We are trying to implement a wrapper in python for the same as we speak. Hopefully it will work out. 😊 On Wed, 30 Mar, 2022, 8:02 am Xingbo Huang, wrote: > Hi, > > Are you using datastream api or table api?If yo

Re: Pyflink elastic search connectors

2022-03-29 Thread Xingbo Huang
Hi, Are you using the DataStream API or the Table API? If you are using the Table API, you can use the connector by executing SQL [1]. If you are using the DataStream API, there is really no ES connector API provided; you need to write Python wrapper code, but the wrapper code is very simple. The underlying

Re: PyFlink : submission via rest

2022-03-08 Thread aryan m
Thanks Dian! That worked ! On Sun, Mar 6, 2022 at 10:47 PM Dian Fu wrote: > The dedicated REST API is still not supported. However, you could try to > use PythonDriver just like you said and just submit it like a Java Flink > job. > > Regards, > Dian > > On Sun, Mar 6, 2022 at 3:38 AM aryan m w

Re: PyFlink : submission via rest

2022-03-06 Thread Dian Fu
The dedicated REST API is still not supported. However, you could try to use PythonDriver just like you said and just submit it like a Java Flink job. Regards, Dian On Sun, Mar 6, 2022 at 3:38 AM aryan m wrote: > Thanks Zhilong for taking a look! > > Primarily I am looking for ways to start it

Re: PyFlink : submission via rest

2022-03-05 Thread aryan m
Thanks Zhilong for taking a look! Primarily I am looking for ways to start it through a REST api [1] . For Java, I pass along entry-class pointing to a main class in the jar which constructs the job graph and triggers the execute(). How do we accomplish this for pyflink jobs? The closest I enco

Re: PyFlink : submission via rest

2022-03-05 Thread Zhilong Hong
Hi, Aryan: You could refer to the official docs [1] for how to submit PyFlink jobs. $ ./bin/flink run \ --target yarn-per-job --python examples/python/table/word_count.py With this command you can submit a per-job application to YARN. The docs [2] and [3] describe how to submit jobs

Re: pyflink object to java object

2022-02-28 Thread Francis Conroy
Hi Xingbo, I think that might work for me, I'll give it a try On Tue, 1 Mar 2022 at 15:06, Xingbo Huang wrote: > Hi, > With py4j, you can call any Java method. On how to create a Java Row, you > can call the `createRowWithNamedPositions` method of `RowUtils`[1]. > > [1] > https://github.com/apa

Re: pyflink object to java object

2022-02-28 Thread Xingbo Huang
Hi, With py4j, you can call any Java method. For how to create a Java Row, you can call the `createRowWithNamedPositions` method of `RowUtils` [1]. [1] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/types/RowUtils.java# Best, Xingbo. Francis Conroy wrote on 25 Feb 2022

Re: Pyflink with pulsar

2022-02-22 Thread Dian Fu
ng > *Date: *Friday, 18 February 2022 at 6:12 pm > *To: *user@flink.apache.org , Ananth Gundabattula < > agundabatt...@darwinium.com> > *Subject: *Re: Pyflink with pulsar > > The Pulsar python source connector will be released in 1.15 version. > if you want to use it right now,

Re: Pyflink with pulsar

2022-02-19 Thread Ananth Gundabattula
, Ananth Gundabattula Subject: Re: Pyflink with pulsar The Pulsar python source connector will be released in 1.15 version. if you want to use it right now, you could compile the master branch. When I completed the python connector code, I only tested the native pulsar protocol without KOP

Re: Pyflink with pulsar

2022-02-17 Thread Luning Wong
The Pulsar python source connector will be released in 1.15 version. if you want to use it right now, you could compile the master branch. When I completed the python connector code, I only tested the native pulsar protocol without KOP. Usage examples are in comments of the PulsarSource class and

Re: pyflink datastream job

2022-02-07 Thread nagi data monkey
ah, thanks, those doc pages are what I missed! On Mon, Feb 7, 2022 at 2:48 AM Dian Fu wrote: > The following code snippet should just work: > > ``` > from pyflink.datastream import StreamExecutionEnvironment > env = StreamExecutionEnvironment.get_execution_environment() > ``` > > It works both i

Re: pyflink datastream job

2022-02-06 Thread Dian Fu
The following code snippet should just work: ``` from pyflink.datastream import StreamExecutionEnvironment env = StreamExecutionEnvironment.get_execution_environment() ``` It works both in local deployment and in flink clusters. You could refer to [1] on how to submit PyFlink jobs to a remote cl

Re: pyflink mixed with Java operators

2022-01-10 Thread Francis Conroy
Thanks for this Dian. I'll give that a try. On Mon, 10 Jan 2022 at 22:51, Dian Fu wrote: > Hi, > > You could try the following method: > > ``` > from pyflink.java_gateway import get_gateway > > jvm = get_gateway().jvm > ds = ( > DataStream(ds._j_data_stream.map(jvm.com.example.MyJavaMapFunct

Re: pyflink mixed with Java operators

2022-01-10 Thread Dian Fu
Hi, You could try the following method: ``` from pyflink.java_gateway import get_gateway jvm = get_gateway().jvm ds = ( DataStream(ds._j_data_stream.map(jvm.com.example.MyJavaMapFunction())) ) ``` Regards, Dian On Fri, Jan 7, 2022 at 1:00 PM Francis Conroy wrote: > Hi all, > > Does anyon

Re: PyFlink Perfomance

2021-12-21 Thread Francis Conroy
Hi Dian, I'll build up something similar and post it, my current test code contains proprietary information. On Wed, 22 Dec 2021 at 14:49, Dian Fu wrote: > Hi Francis, > > Could you share the benchmark code you use? > > Regards, > Dian > > On Wed, Dec 22, 2021 at 11:31 AM Francis Conroy < > fran

Re: PyFlink Perfomance

2021-12-21 Thread Dian Fu
Hi Francis, Could you share the benchmark code you use? Regards, Dian On Wed, Dec 22, 2021 at 11:31 AM Francis Conroy < francis.con...@switchdin.com> wrote: > I've just run an analysis using a similar example which involves a single > python flatmap operator and we're getting 100x less through

Re: PyFlink Perfomance

2021-12-21 Thread Francis Conroy
I've just run an analysis using a similar example which involves a single Python flatmap operator, and we're getting 100x less throughput using Python than Java. I'm interested to know if you can do such a comparison. I'm using Flink 1.14.0. Thanks, Francis On Thu, 18 Nov 2021 at 02:20, Thomas Portu

Re: pyFlink + asyncio

2021-12-16 Thread Alexander Preuß
Hi Михаил, From looking at https://nightlies.apache.org/flink/flink-docs-master/api/python//pyflink.datastream.html there is currently no AsyncFunction / RichAsyncFunction implementation in pyFlink, so you are bound to synchronous interaction. Best regards, Alexander On Thu, Dec 16, 2021 at 1

RE: PyFlink convert data from ds to table

2021-12-08 Thread Королькевич Михаил
I found the problem. In the data stream I had an empty list, but not None (null). On 2021/12/08 13:11:31 Королькевич Михаил wrote: > Hello, Flink team! > > 1) Is it possible to save a python list to a table from a datastream? > > 2) and then save the accumulated data to an avro file? > > For example, my data strea

Re: PyFlink import internal packages

2021-12-05 Thread Королькевич Михаил
Hi, thank you! It was very helpful! On 03.12.2021, 12:48, "Shuiqiang Chen" wrote: Hi, Actually, you are able to develop your app in the clean python way. It's fine to split the code into multiple files and there is no need to call `env.add_python_file()` explicitly. When submitting the PyFlink job you can specify python files and the entry main module with the options --pyFiles and --p

Re: PyFlink import internal packages

2021-12-03 Thread Shuiqiang Chen
Hi, Actually, you are able to develop your app in the clean python way. It's fine to split the code into multiple files and there is no need to call `env.add_python_file()` explicitly. When submitting the PyFlink job you can specify python files and entry main module with option --pyFiles and --p
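A sketch of the submission Shuiqiang describes, using the documented `--pyFiles`/`--pyModule` options of `flink run` (the project path and module name below are placeholders):

```
# Ship a multi-file Python project and name its entry module, so the
# code itself needs no env.add_python_file() calls. Paths and names
# below are placeholders.
./bin/flink run \
  --pyFiles /path/to/myapp \
  --pyModule myapp.main
```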

Re: Pyflink/Flink Java parquet streaming file sink for a dynamic schema stream

2021-12-03 Thread Georg Heiler
Hi, the schema of the "after" part depends on each table, i.e. it holds different columns for each table. So do you receive debezium changelog statements for more than one table? I.e. is the schema in the "after" part different? Best, Georg On Fri, 3 Dec 2021 at 08:35, Kamil ty wrote: > Yes the gener

Re: Pyflink/Flink Java parquet streaming file sink for a dynamic schema stream

2021-12-02 Thread Georg Heiler
Do the JSONs have the same schema overall? Or is each potentially structured differently? Best, Georg On Fri, 3 Dec 2021 at 00:12, Kamil ty wrote: > Hello, > > I'm wondering if there is a possibility to create a parquet streaming file > sink in Pyflink (in Table API) or in Java Flink (in

Re: Pyflink 1.13.2 convert datastream into table BIG_INT type

2021-11-28 Thread Dian Fu
Hi Kamil, You need to use Types.LONG() for bigint type in Python DataStream API. See [1] for more details. Regards, Dian [1] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/python/datastream/data_types/#supported-data-types On Sun, Nov 28, 2021 at 11:26 PM Kamil ty wrote:

Re: PyFlink SQL window aggregation data not written out to file

2021-11-22 Thread Guoqin Zheng
Hi Roman, Thanks for the update and testing locally. This is very informative. -Guoqin On Mon, Nov 22, 2021 at 10:55 AM Roman Khachatryan wrote: > Hi Guoqin, > > I was able to reproduce the problem locally. I can see that at the > time of window firing the services are already closed because o

Re: PyFlink SQL window aggregation data not written out to file

2021-11-22 Thread Roman Khachatryan
Hi Guoqin, I was able to reproduce the problem locally. I can see that at the time of window firing the services are already closed because of the emitted MAX_WATERMARK. Previously, there were some discussions around waiting for all timers to complete [1], but AFAIK there was not much demand to i

Re: PyFlink Perfomance

2021-11-17 Thread Dian Fu
Hi, Is it possible to perform some benchmark for the first map (not the whole job)? Then you could get a basic understanding of whether the map implementation is a problem. Besides the map implementation, there is also some overhead introduced by the framework, e.g. the Java and Python process com

Re: PyFlink SQL window aggregation data not written out to file

2021-11-17 Thread Guoqin Zheng
Hi Roman, Thanks for the detailed explanation. I did try 1.13 and 1.14, but it still didn't work. I explicitly enabled the checkpoint with: `env.enable_checkpointing(10)`. Any other configurations I need to set? Thanks, -Guoqin On Wed, Nov 17, 2021 at 4:30 AM Roman Khachatryan wrote: > Hi Gu

Re: PyFlink SQL window aggregation data not written out to file

2021-11-17 Thread Roman Khachatryan
Hi Guoqin, Thanks for the clarification. Processing time windows actually don't need watermarks: they fire when window end time comes. But the job will likely finish earlier because of the bounded input. Handling of this case was improved in 1.14 as part of FLIP-147, as well as in previous versi

Re: PyFlink / Table API Exception: class "BatchExecCalc$7746" grows beyond 64 KB

2021-11-15 Thread Caizhi Weng
Hi! In Flink 1.14 we've introduced the new code-splitting mechanism which solves this issue. Please consider upgrading to Flink 1.14. Schmid Christian wrote on Tue, 16 Nov 2021 at 12:43 AM: > Hi all > > > > I ran into an exception while trying to execute a large sql query, which > basically runs "over a

Re: PyFlink SQL window aggregation data not written out to file

2021-11-15 Thread Guoqin Zheng
Hi Roman, Thanks for your quick response! Yes, it does seem to be the window closing problem. So if I change the tumble window on eventTime, which is column 'requestTime', it works fine. I guess the EOF of the test data file kicks in a watermark of Long.MAX_VALUE. But my application code needs

Re: PyFlink SQL window aggregation data not written out to file

2021-11-15 Thread Roman Khachatryan
Hi Guoqin, I think the problem might be related to watermarks and checkpointing: - if the file is too small, the only watermark will be the one after fully reading the file - in exactly once mode, sink waits for a checkpoint completion before committing the files Recently, there were some improve

Re: Pyflink PyPi build - scala 2.12 compatibility

2021-11-10 Thread Kamil ty
Thank you for the clarification. This was very helpful! Kind regards Kamil On Wed, 10 Nov 2021, 02:26 Dian Fu, wrote: > Hi Kamil, > > You are right that it comes with JAR packages of scala 2.11 in the PyFlink > package as it has to select one version of JARs to bundle, either 2.11 or > 2.12. Wh

Re: Pyflink PyPi build - scala 2.12 compatibility

2021-11-09 Thread Dian Fu
Hi Kamil, You are right that it comes with JAR packages of scala 2.11 in the PyFlink package as it has to select one version of JARs to bundle, either 2.11 or 2.12. Whether it works with scala 2.12 depends on how you submit your job. - If you execute the job locally, then it will use the JARs bund

Re: pyflink keyed stream checkpoint error

2021-10-14 Thread Dian Fu
Hi Curt, Could you try if it works by reducing python.fn-execution.bundle.size to 1000 or 100? Regards, Dian On Thu, Oct 14, 2021 at 2:47 AM Curt Buechter wrote: > Hi guys, > I'm still running into this problem. I checked the logs, and there is no > evidence that the python process crashed. I
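Dian's suggestion maps onto the documented `python.fn-execution.bundle.size` option; a minimal config sketch (1000 is just the first value suggested in the thread):

```
# flink-conf.yaml sketch: reduce the Python execution bundle size,
# per the advice above (try 1000, then 100 if the error persists).
python.fn-execution.bundle.size: 1000
```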

Re: pyflink keyed stream checkpoint error

2021-10-13 Thread Curt Buechter
Hi guys, I'm still running into this problem. I checked the logs, and there is no evidence that the python process crashed. I checked the process IDs and they are still active after the error. No `killed process` messages in /var/log/messages. I don't think it's necessarily related to checkpointin

Re: PyFlink JDBC SQL Connector for SQL Server

2021-10-11 Thread Dian Fu
Hi, Currently it only supports the derby, mysql and postgresql dialects. The 'sqlserver' dialect is still not supported. There is a ticket https://issues.apache.org/jira/browse/FLINK-14101 for this. Regards, Dian On Mon, Oct 11, 2021 at 9:43 PM Schmid Christian wrote: > Hi all > > > > According to the

Re: Pyflink data stream API to Table API conversion with multiple sinks.

2021-10-08 Thread Dian Fu
Hi Kamil, I guess `statement_set.execute()` should be enough. You could also check whether the job graph is expected via one of the following ways: - Call `print(statement_set.explain())` - Check the Flink web ui to see the job graph of the running job For your problems, could you double check wh

Re: pyflink keyed stream checkpoint error

2021-09-23 Thread Curt Buechter
Guess my last reply didn't go through, so here goes again... Possibly, but I don't think so. Since I submitted this, I have done some more testing. It works fine with file system or memory state backends, but not with rocksdb. I will try again and check the logs, though. I've also tested rocksdb c

Re: pyflink keyed stream checkpoint error

2021-09-23 Thread Dian Fu
PS: there are more information about this configuration in https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/python/python_config/#python-fn-execution-bundle-size > 2021年9月24日 上午10:07,Dian Fu 写道: > > I agree with Roman that it seems that the Python process has crashed. > > Be

Re: pyflink keyed stream checkpoint error

2021-09-23 Thread Dian Fu
I agree with Roman that it seems that the Python process has crashed. Besides the suggestions from Roman, I guess you could also try to configure the bundle size to smaller value via “python.fn-execution.bundle.size”. Regards, Dian > 2021年9月24日 上午3:48,Roman Khachatryan 写道: > > Hi, > > Is it

Re: pyflink keyed stream checkpoint error

2021-09-23 Thread Roman Khachatryan
Hi, Is it possible that the python process crashed or hung up? (probably performing a snapshot) Could you validate this by checking the OS logs for OOM killer messages or process status? Regards, Roman On Wed, Sep 22, 2021 at 6:30 PM Curt Buechter wrote: > > Hi, > I'm getting an error after ena

Re: pyflink table to datastream

2021-09-05 Thread Caizhi Weng
Hi! I don't quite understand this question, but I suppose you first run the table program and then run the data stream program and you want the results of the two programs to be identical? If this is the case, the job will run twice as Flink will not cache the result of a job, so in each run the

Re: PyFlink StreamingFileSink bulk-encoded format (Avro)

2021-08-17 Thread Dian Fu
Hi Kamil, AFAIK, the Avro format is still not supported in the Python StreamingFileSink in the Python DataStream API. However, I guess you could convert the DataStream to a Table [1] and then use all the connectors supported in Table & SQL. In this case, you could use the FileSystem connect

Re: PyFlink performance and deployment issues

2021-08-15 Thread Dian Fu
ateflow-evaluation/pyflink_runtime.py \ > --jarfile ~/Documents/stateflow-evaluation/benchmark/bin/combined.jar > > I hope someone can help me with this because it is a blocker for me. > > Thanks in advance, > Wouter > -- Forwarded message - > From: Woute

Re: PyFlink performance and deployment issues

2021-07-08 Thread Xingbo Huang
Hi Wouter, The JIRA is https://issues.apache.org/jira/browse/FLINK-23309. `bundle time` is from the perspective of your e2e latency. Regarding the `bundle size`, generally larger value will provide better throughput, but it should not be set too large, which may cause no output to be seen downstre

Re: PyFlink performance and deployment issues

2021-07-08 Thread Wouter Zorgdrager
Hi Xingbo, all, That is good to know, thank you. Is there any Jira issue I can track? I'm curious to follow this progress! Do you have any recommendations with regard to these two configuration values, to get somewhat reasonable performance? Thanks a lot! Wouter On Thu, 8 Jul 2021 at 10:26, Xing

Re: PyFlink performance and deployment issues

2021-07-08 Thread Xingbo Huang
Hi Wouter, In fact, our users have encountered the same problem. Whenever the `bundle size` or `bundle time` is reached, the data in the buffer needs to be sent from the jvm to the pvm, and then waits to be processed by the pvm and sent back to the jvm to send all the results to the downstream op
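The buffering behaviour described here (flush when either the size or the time threshold is hit) can be sketched in plain Python. This is a simplified illustration of the concept, not Flink's actual implementation; the class and names are made up:

```python
import time

class BundleBuffer:
    """Toy model of the JVM-side buffer in front of the Python VM:
    elements are flushed when bundle_size is reached or bundle_time elapses."""

    def __init__(self, bundle_size, bundle_time_ms, flush_fn):
        self.bundle_size = bundle_size
        self.bundle_time = bundle_time_ms / 1000.0
        self.flush_fn = flush_fn          # downstream callback
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, element):
        self.buffer.append(element)
        now = time.monotonic()
        if len(self.buffer) >= self.bundle_size or now - self.last_flush >= self.bundle_time:
            self.flush(now)

    def flush(self, now=None):
        if self.buffer:
            self.flush_fn(list(self.buffer))  # hand the bundle downstream
            self.buffer.clear()
        self.last_flush = now if now is not None else time.monotonic()

# A large bundle size with a slow source is exactly the case where results
# appear to "stall" downstream until the time threshold finally fires.
out = []
buf = BundleBuffer(bundle_size=3, bundle_time_ms=60_000, flush_fn=out.append)
buf.add(1); buf.add(2)        # below both thresholds: nothing emitted yet
assert out == []
buf.add(3)                    # size threshold reached: bundle is flushed
assert out == [[1, 2, 3]]
```

This also shows why e2e latency is bounded by `bundle time`: an element sitting in a partially filled buffer is not visible downstream until one of the two thresholds triggers a flush.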

Re: PyFlink performance and deployment issues

2021-07-08 Thread Wouter Zorgdrager
Hi Dian, all, I will come back to the other points asap. However, I’m still confused about this performance. Is this what I can expect in PyFlink in terms of performance? ~ 1000ms latency for single events? I also had a very simple setup where I send 1000 events to Kafka per second and response t

Re: PyFlink performance and deployment issues

2021-07-07 Thread Dian Fu
Hi Wouter, 1) Regarding the performance difference between Beam and PyFlink, I guess it’s because you are using an in-memory runner when running it locally in Beam. In that case, the code path is totally different compared to running in a remote cluster. 2) Regarding `flink run`, I’m surpr

Re: PyFlink performance and deployment issues

2021-07-07 Thread Xingbo Huang
Hi Wouter, Sorry for the late reply. I will try to answer your questions in detail. 1. >>> Performance problem. When running a udf job locally, beam will use a loopback way to connect back to the python process used by the compilation job, so the time of starting up the job will come faster than pyflin

Re: PyFlink kafka producer topic override

2021-06-24 Thread Curt Buechter
Thanks. I will reconsider my architecture. On Thu, Jun 24, 2021 at 1:37 AM Arvid Heise wrote: > Hi, > > getTargetTopic can really be used as Curt needs it. So it would be good to > add to PyFlink as well. > > However, I'm a bit skeptical that Kafka can really handle that model well. > It's usual

Re: PyFlink kafka producer topic override

2021-06-24 Thread Arvid Heise
Hi Curt, Upon rechecking the code, you actually don't set the topic through KafkaContextAware but just directly on the KafkaRecord returned by the KafkaSerializationSchema. Sorry for the confusion Arvid On Thu, Jun 24, 2021 at 8:36 AM Arvid Heise wrote: > Hi, > > getTargetTopic can really be

Re: PyFlink kafka producer topic override

2021-06-23 Thread Arvid Heise
Hi, getTargetTopic can really be used as Curt needs it. So it would be good to add to PyFlink as well. However, I'm a bit skeptical that Kafka can really handle that model well. It's usually encouraged to use fewer, larger topics, and you'd rather use partitions here instead of topics. Ther

Re: PyFlink kafka producer topic override

2021-06-23 Thread Dian Fu
OK, got it. Then it seems that splitting streams is also not quite suitable to address your requirements, as you still need to iterate over each of the side outputs corresponding to each topic. Regarding getTargetTopic [1], per my understanding, it’s not designed to dynamically assign the topic f

Re: PyFlink kafka producer topic override

2021-06-23 Thread Curt Buechter
Hi Dian, Thanks for the reply. I don't think a filter function makes sense here. I have 2,000 tenants in the source database, and I want all records for a single tenant in a tenant-specific topic. So, with a filter function, if I understand it correctly, I would need 2,000 different filters, which

Re: PyFlink kafka producer topic override

2021-06-23 Thread Dian Fu
You are right that split is still not supported. Does it make sense for you to split the stream using a filter function? There is some overhead compared to the built-in stream.split, as you need to provide a filter function for each sub-stream and so a record will be evaluated multiple times. > 2021年6
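The trade-off discussed in this thread can be sketched in plain Python: a filter-based split means one predicate (and one sub-stream) per tenant, with every record evaluated by every filter, while a single routing function derives the target topic from the record, the pattern getTargetTopic enables on the Java side. The names `make_tenant_filter` and `tenant_topic` are made up for illustration:

```python
# Filter-based split: one predicate per tenant. With 2,000 tenants this
# means 2,000 sub-streams, and each record passes through every filter.
def make_tenant_filter(tenant_id):
    return lambda record: record["tenant_id"] == tenant_id

# Routing-based approach: one function computes the target topic per record.
def tenant_topic(record):
    return f"tenant-{record['tenant_id']}"

records = [{"tenant_id": 7, "value": "a"}, {"tenant_id": 42, "value": "b"}]

# Filtering keeps only one tenant's records per sub-stream.
only_7 = [r for r in records if make_tenant_filter(7)(r)]
assert only_7 == [{"tenant_id": 7, "value": "a"}]

# Routing maps every record to its tenant-specific topic in one pass.
assert [tenant_topic(r) for r in records] == ["tenant-7", "tenant-42"]
```

With routing, the per-record work stays constant as the tenant count grows, which is why a dynamic-topic serialization schema scales better than thousands of filters.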

Re: PyFlink LIST type problem

2021-06-18 Thread Dian Fu
Hi Laszlo, It seems to be because the json format supports the object array type and doesn’t support the list type. However, the object array type is still not provided in the PyFlink DataStream API [1]. I have created a ticket as a follow-up. For now, I guess you could implement it yourself and could take a

Re: PyFlink DataStream API problem

2021-06-15 Thread Dian Fu
It seems that there is something wrong during starting up the Python process. Have you installed Python 3 and also PyFlink in the docker image? Besides, you could take a look at the TaskManager log to see whether there are entries about why the Python process failed to start up. R

Re: PyFlink: Upload resource files to Flink cluster

2021-06-14 Thread Dian Fu
Hi Sumeet, The archive files will be uploaded to the blob server. This is the same no matter whether the archives are specified via the command line option `--pyArchives` or via `add_python_archive`. > And when I try to programmatically do this by calling add_python_archive(), > the job gets submitted but f
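For reference, the two equivalent ways of shipping an archive that this thread contrasts; the paths and target directory name are illustrative:

```
# Via the command line when submitting the job:
flink run --pyArchives file:///path/to/venv.zip#venv -py job.py

# Or programmatically inside the job (PyFlink API):
#   env.add_python_archive("file:///path/to/venv.zip", "venv")
```

In both cases the archive is extracted into the given target directory (`venv` here) inside the working directory of the Python workers.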
