Re: Flink Python API and HADOOP_CLASSPATH

2021-05-18 Thread Eduard Tudenhoefner
Hi Dian, thanks a lot for the explanation and help. Option 2) is what I needed and it works. Regards Eduard On Tue, May 18, 2021 at 6:21 AM Dian Fu wrote: > Hi, > > 1) The cause of the exception: > The dependencies added via pipeline.jars / pipeline.classpaths will be > used to construct user
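The thread discusses dependencies added via pipeline.jars / pipeline.classpaths. A minimal PyFlink sketch of those two keys, for readers landing here from the archive (the jar paths are placeholder assumptions, not from the original mail):

```python
# Configuration keys discussed in the thread; paths below are examples only.
DEPENDENCY_CONFIG = {
    # jars uploaded with the job and added to the user classloader
    "pipeline.jars": "file:///path/to/flink-sql-connector-kafka.jar",
    # classpath URLs that must already be accessible on the cluster nodes
    "pipeline.classpaths": "file:///path/to/extra-dependency.jar",
}

try:
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.new_instance().build())
    for key, value in DEPENDENCY_CONFIG.items():
        t_env.get_config().get_configuration().set_string(key, value)
except Exception:
    # PyFlink or a JVM may be unavailable when running this sketch;
    # the dict above still documents the keys.
    pass
```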

Re: SIGSEGV error

2021-05-18 Thread Joshua Fan
Hi Till, I also tried the job without gzip; it ran into the same error. But the problem is solved now. I was about to give up on solving it when I found the mail at http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/JVM-crash-SIGSEGV-in-ZIP-GetEntry-td17326.html. So I think maybe it was

Re: Root Exception can not be shown on Web UI in Flink 1.13.0

2021-05-18 Thread Matthias Pohl
Sorry for not getting back earlier. I missed that thread. It looks like some wrong assumption on our end. Hence, Yangze and Guowei are right. I'm going to look into the issue. Matthias On Fri, May 14, 2021 at 4:21 AM Guowei Ma wrote: > Hi, Gary > > I think it might be a bug. So would you like to

PyFlink UDF: No match found for function signature XXX

2021-05-18 Thread Yik San Chan
Hi, I have a PyFlink script that fails to use a simple UDF. The full script can be found below: ```python from pyflink.datastream import StreamExecutionEnvironment from pyflink.table import ( DataTypes, EnvironmentSettings, SqlDialect, StreamTableEnvironment, ) from pyflink.table.udf import udf

Re: PyFlink UDF: No match found for function signature XXX

2021-05-18 Thread Dian Fu
Hi Yik San, The expected input types for add are DataTypes.INT, however, the schema of aiinfra.mysource is: a bigint and b bigint. Regards, Dian > On May 18, 2021, at 5:38 PM, Yik San Chan wrote: > > Hi, > > I have a PyFlink script that fails to use a simple UDF. The full script can > be found below: >

Re: Getting error in pod template

2021-05-18 Thread Yang Wang
Could you share how you are starting the Flink native K8s application with a pod template? Usually it looks like the following commands. And you need to have the Flink binary on your local machine. Please note that the pod template works with native K8s mode only, and you cannot use the "kubectl"

Re: two questions about flink stream processing: kafka sources and TimerService

2021-05-18 Thread Ingo Bürk
Hi Jin, 1) As far as I know the order is only guaranteed for events from the same partition. If you want events across partitions to remain in order you may need to use parallelism 1. I'll attach some links here which might be useful: https://stackoverflow.com/questions/50340107/order-of-events-w

Re: PyFlink UDF: No match found for function signature XXX

2021-05-18 Thread Yik San Chan
Hi Dian, I changed the udf to: ```python @udf( input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()], result_type=DataTypes.BIGINT(), ) def add(i, j): return i + j ``` But I still get the same error. On Tue, May 18, 2021 at 5:47 PM Dian Fu wrote: > Hi Yik San, > > The expected input types for

Stop command failure

2021-05-18 Thread V N, Suchithra (Nokia - IN/Bangalore)
Hi, The stop command is failing with the below error on Apache Flink 1.12.3. Could you please help. log":"[Flink-RestClusterClient-IO-thread-2] org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel Force-closing a channel whose registration task was not accepted by an event loop: [id:

Flink upgraded from 1.10.0 to 1.12.0

2021-05-18 Thread 王炳焱
When I upgraded from Flink 1.10.0 to Flink 1.12.0, I was unable to restore the savepoint, and it prompted the following error: 2021-05-14 22:02:44,716 WARN org.apache.flink.metrics.MetricGroup [] - The operator name Calc(select=[((CAST((log_info get_json_object2 _UTF-16LE'eventTime')

Re: PyFlink UDF: No match found for function signature XXX

2021-05-18 Thread Yik San Chan
With the help from Dian and friends, it turns out the root cause is: when it `create_temporary_function`, it is in the default catalog. However, when it `execute_sql(TRANSFORM)`, it is in the "hive" catalog. A function defined as a temporary function in catalog "default" is not accessible from catalog "hive".
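A sketch of one way around the catalog scoping described above, using PyFlink's `create_temporary_system_function`, which registers a function so it resolves regardless of the current catalog (the function and its types are illustrative, matching the earlier messages in this thread):

```python
def add(i, j):
    # plain Python implementation; wrapped as a PyFlink UDF below when available
    return i + j

try:
    from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
    from pyflink.table.udf import udf

    add_udf = udf(add,
                  input_types=[DataTypes.BIGINT(), DataTypes.BIGINT()],
                  result_type=DataTypes.BIGINT())

    t_env = TableEnvironment.create(EnvironmentSettings.new_instance().build())
    # A temporary *system* function is resolved independently of the current
    # catalog, so it stays visible after switching to the "hive" catalog.
    t_env.create_temporary_system_function("add", add_udf)
    # Alternatively, switch catalogs first and then register it there:
    # t_env.use_catalog("hive"); t_env.create_temporary_function("add", add_udf)
except Exception:
    # PyFlink or a JVM may be unavailable; `add` still works as a plain function.
    pass
```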

Re: SIGSEGV error

2021-05-18 Thread Till Rohrmann
Great to hear that you fixed the problem by specifying an explicit serializer for the state. Cheers, Till On Tue, May 18, 2021 at 9:43 AM Joshua Fan wrote: > Hi Till, > I also tried the job without gzip, it came into the same error. > But the problem is solved now. I was about to give up to sol

Re: Helm chart for Flink

2021-05-18 Thread Austin Cawley-Edwards
Hey all, Yeah, I'd be interested to see the Helm pre-upgrade hook setup, though I'd agree with you, Alexey, that it does not provide enough control to be a stable solution. @Pedro Silva I don't know if there are talks for an official operator yet, but Kubernetes support is important to the commu

Fastest way for decent lookup JOIN?

2021-05-18 Thread Theo Diefenthal
Hi there, I have the following (probably very common) use case: I have some lookup data (100 million records) which changes only slowly (in the range of some thousands of updates per day). My event stream is in the order of tens of billions of events per day and each event needs to be enriched from the 100
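One common pattern for this kind of slowly-changing enrichment is to key both streams by the lookup id and keep the lookup rows in keyed state, so the hot path is a per-key state read rather than a scan. A pure-Python sketch of the idea (in Flink this maps onto a KeyedCoProcessFunction over two connected, keyed streams; the class and field names below are illustrative, not from the thread):

```python
class EnrichmentState:
    """Stand-in for per-key Flink state shared by two keyed inputs."""

    def __init__(self):
        self._state = {}  # in Flink: a ValueState per key, e.g. in RocksDB

    def on_lookup_update(self, key, row):
        # lookup side: a few thousand updates per day simply overwrite state
        self._state[key] = row

    def on_event(self, key, event):
        # event side: tens of billions per day do a cheap state lookup
        row = self._state.get(key)
        return {**event, "enrichment": row} if row is not None else event
```

With 100 million lookup records spread across parallel subtasks, a RocksDB state backend keeps the per-instance state comfortably off-heap; the trade-off versus a broadcast join is that events must be shuffled by the lookup key.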

Issue reading from S3

2021-05-18 Thread Angelo G.
Hi, I'm trying to read from and write to S3 with Flink 1.12.2. I'm submitting the job to local cluster (tar.gz distribution). I do not have a Hadoop installation running in the same machine. S3 (not Amazon) is running in a remote location and I have access to it via endpoint and access/secret keys
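For a non-Amazon S3 service reached via an endpoint plus access/secret keys, the relevant Flink options are typically the following, shown here as a plain mapping (the endpoint and credential values are placeholder assumptions):

```python
# Flink S3 filesystem options; values are examples only, not from the mail.
S3_CONFIG = {
    "s3.endpoint": "https://s3.example.internal:9000",
    "s3.access-key": "<access-key>",
    "s3.secret-key": "<secret-key>",
    # non-AWS S3 services usually require path-style access
    "s3.path.style.access": "true",
}
```

These keys go into flink-conf.yaml, and the flink-s3-fs-hadoop or flink-s3-fs-presto jar must be placed under the distribution's plugins/ directory (not lib/) for the s3:// scheme to resolve.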

Re: After upgrade from 1.11.2 to 1.13.0 parameter taskmanager.numberOfTaskSlots set to 1.

2021-05-18 Thread Alexey Trenikhun
If flink-conf.yaml is read-only, will Flink complain but work fine? From: Chesnay Schepler Sent: Wednesday, May 12, 2021 5:38 AM To: Alex Drobinsky Cc: user@flink.apache.org Subject: Re: After upgrade from 1.11.2 to 1.13.0 parameter taskmanager.numberOfTaskSlots

Prometheus Reporter Enhancement

2021-05-18 Thread Mason Chen
Hi all, Would people appreciate enhancements to the Prometheus reporter to include extra labels via a configuration, as a contribution to Flink? I can see it being useful for adding labels that are not job specific, but infra specific. The change would be nicely integrated with Flink’s Conf

Re: Prometheus Reporter Enhancement

2021-05-18 Thread Andrew Otto
Sounds useful! On Tue, May 18, 2021 at 2:02 PM Mason Chen wrote: > Hi all, > > Would people appreciate enhancements to the prometheus reporter to include > extra labels via a configuration, as a contribution to Flink? I can see it > being useful for adding labels that are not job specific, but i

Re: Prometheus Reporter Enhancement

2021-05-18 Thread Chesnay Schepler
There is already a ticket for this. Note that this functionality should be implemented in a generic fashion to be usable for all reporters. https://issues.apache.org/jira/browse/FLINK-17495 On 5/18/2021 8:16 PM, Andrew Otto wrote: Sounds useful! On Tue, May 18, 2021 at 2:02 PM Mason Chen

Re: RabbitMQ source does not stop unless message arrives in queue

2021-05-18 Thread Austin Cawley-Edwards
Hey all, Thanks for the details, John! Hmm, that doesn't look too good either 😬 but probably a different issue with the RMQ source/ sink. Hopefully, the new FLIP-27 sources will help you guys out there! The upcoming HybridSource in FLIP-150 [1] might also be interesting to you in finely controllin

Guidance for Integration Tests with External Technologies

2021-05-18 Thread Rion Williams
Hey all, I’ve been taking a very TDD-oriented approach to developing many of the Flink apps I’ve worked on, but recently I’ve encountered a problem that has me scratching my head. A majority of my integration tests leverage a few external technologies such as Kafka and typically a relational d

DataStream Batch Execution Mode and large files.

2021-05-18 Thread Marco Villalobos
Hi, I am using the DataStream API in Batch Execution Mode, and my "source" is an S3 bucket with about 500 GB of data spread across many files. Where does Flink store the results of processed/produced data between tasks? There is no way that 500GB will fit in memory, so I am very curious how

DataStream API Batch Execution Mode restarting...

2021-05-18 Thread Marco Villalobos
I have a DataStream job running in Batch Execution mode within YARN on EMR. My job failed an hour in, two times in a row, because the task manager heartbeat timed out. Can somebody point me to how to restart a job in this situation? I can't find that section of the documentation. Thank you.

Re: Root Exception can not be shown on Web UI in Flink 1.13.0

2021-05-18 Thread Gary Wu
Thanks! I have updated the detail and task manager log in https://issues.apache.org/jira/browse/FLINK-22688. Regards, -Gary On Tue, 18 May 2021 at 16:22, Matthias Pohl wrote: > Sorry, for not getting back earlier. I missed that thread. It looks like > some wrong assumption on our end. Hence, Ya

Re: Re: Handling "Global" Updating State

2021-05-18 Thread Yun Gao
Hi Rion, Sorry for the late reply. Another, simpler method might indeed be to have the operator, in initializeState, directly read the data from Kafka to initialize the state. Best, Yun --Original Mail -- Sender:Rion Williams Send Date:Mon May 17 19:53:35 2021 Reci

Re: DataStream API Batch Execution Mode restarting...

2021-05-18 Thread Yun Gao
Hi Marco, Have you configured the restart strategy? If the restart-strategy [1] is configured to some strategy other than none, Flink should be able to restart the job automatically on failover. The restart strategy could also be configured via StreamExecutionEnvironment#setRestartSt
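A minimal PyFlink sketch of setting the strategy explicitly (the attempt count and delay are assumed example values, not from the thread):

```python
ATTEMPTS, DELAY_MS = 3, 10_000  # assumed example values

try:
    from pyflink.common.restart_strategy import RestartStrategies
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    # Restart up to ATTEMPTS times, waiting DELAY_MS between attempts.
    # Setting this explicitly avoids depending on the cluster default.
    env.set_restart_strategy(
        RestartStrategies.fixed_delay_restart(ATTEMPTS, DELAY_MS))
except Exception:
    # PyFlink or a JVM may be unavailable when running this sketch.
    pass
```

The equivalent flink-conf.yaml keys are `restart-strategy: fixed-delay`, `restart-strategy.fixed-delay.attempts`, and `restart-strategy.fixed-delay.delay`.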

Questions Flink DataStream in BATCH execution mode scalability advice

2021-05-18 Thread Marco Villalobos
Questions Flink DataStream in BATCH execution mode scalability advice. Here is the problem that I am trying to solve. Input is an S3 bucket directory with about 500 GB of data across many files. The instance that I am running on only has 50GB of EBS storage. The nature of this data is time series

Re: DataStream API Batch Execution Mode restarting...

2021-05-18 Thread Marco Villalobos
Thank you. I used the default restart strategy. I'll change that. On Tue, May 18, 2021 at 11:02 PM Yun Gao wrote: > Hi Marco, > > Have you configured the restart strategy? If the restart-strategy [1] is > configured to some strategy other than none, Flink should be able to restart th

Re: DataStream Batch Execution Mode and large files.

2021-05-18 Thread Yun Gao
Hi Marco, With BATCH mode, all the ALL_TO_ALL edges are marked as blocking and use intermediate files to transfer data. Flink now supports hash shuffle and sort shuffle for blocking edges [1]; both of them store the intermediate files in the directories configured by io.tmp.dirs [2].
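As a concrete sketch of the options mentioned in the reply, shown as a plain mapping (the path and threshold values are assumed examples):

```python
# Flink options for batch shuffle spill locations; values are examples only.
BATCH_SHUFFLE_CONFIG = {
    # where blocking-edge intermediate files land; point this at a volume
    # large enough for the full shuffled dataset (e.g. EBS on EMR)
    "io.tmp.dirs": "/mnt/flink-tmp",
    # enable sort shuffle for blocking edges from this parallelism upward
    "taskmanager.network.sort-shuffle.min-parallelism": "1",
}
```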

flink all events getting dropped as late

2021-05-18 Thread Debraj Manna
Crossposting from Stack Overflow. My Flink pipeline looks like below: WatermarkStrategy watermarkStrategy = WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(900)).withTimest
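A PyFlink sketch of the same watermark setup, highlighting the usual culprits when every event is dropped as late: a timestamp assigner that does not return epoch milliseconds, or an idle partition stalling the combined watermark. Field and class names below are illustrative, not from the original post:

```python
MS_PER_S = 1000

def to_epoch_millis(ts_seconds):
    # Timestamps must be epoch *milliseconds*: events stamped in seconds look
    # ancient next to a millisecond watermark, so every record is dropped late.
    return ts_seconds * MS_PER_S

try:
    from pyflink.common import Duration, WatermarkStrategy
    from pyflink.common.watermark_strategy import TimestampAssigner

    class EventTimestampAssigner(TimestampAssigner):
        def extract_timestamp(self, value, record_timestamp):
            return to_epoch_millis(value["ts_seconds"])  # illustrative field

    watermark_strategy = (
        WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_seconds(900))
        .with_timestamp_assigner(EventTimestampAssigner())
        # an idle Kafka partition otherwise holds back the combined watermark
        .with_idleness(Duration.of_seconds(60)))
except Exception:
    # PyFlink or a JVM may be unavailable; the unit check above still applies.
    pass
```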

Re: DataStream Batch Execution Mode and large files.

2021-05-18 Thread Marco Villalobos
Thank you very much. You've been very helpful. Since my intermediate results are large, I suspect that io.tmp.dirs must literally be on the local file system. Thus, since I use EMR, I'll need to configure EBS to support more data. On Tue, May 18, 2021 at 11:08 PM Yun Gao wrote: > Hi Marco, > >