Hi Lei,
Have you tried enabling these Flink configuration properties?
Configuration - nightlies.apache.org
Sent from my iPhone
On Apr 7, 2024, at 6:03 PM, Lei Wang wrote:
I want to enable it only for specified jobs; how can I specify the configurations on the cmd line when submitting a job?
Thanks,
Lei
On
Zhanghao is correct. You can use what is called "keyed state". It's like a
cache.
https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/fault-tolerance/state/
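A minimal sketch of the keyed-state-as-cache idea, assuming a KeyedProcessFunction over a keyed stream of (key, value) tuples; the class name, state name, and lookupExternally helper are illustrative placeholders, not code from this thread:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed-state-as-cache sketch: the first element seen for a key pays for an
// external lookup; later elements for that key read the cached value.
public class CachingEnricher extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>> {

    private transient ValueState<String> cached;

    @Override
    public void open(Configuration parameters) {
        cached = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cached-enrichment", String.class));
    }

    @Override
    public void processElement(Tuple2<String, String> in, Context ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        String value = cached.value();
        if (value == null) {
            value = lookupExternally(in.f0); // hypothetical expensive call, done once per key
            cached.update(value);
        }
        out.collect(Tuple2.of(in.f0, value));
    }

    // Placeholder for a real database or service lookup.
    private String lookupExternally(String key) {
        return "enriched-" + key;
    }
}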
> On Mar 28, 2024, at 7:48 PM, Zhanghao Chen wrote:
>
> Hi,
>
> You can maintain a cache manually in your
Hi Ganesh,
I disagree. I don’t think Flink needs a dependency injection framework. I have
implemented many complex jobs without one. Can you please articulate why you
think it needs a dependency injection framework, along with some use cases that
will show its benefit?
I would rather see more
Hi Nida,
I request that you read
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/docker/
in order to learn how to Dockerize your Flink job.
You're welcome. Regards,
Marco A. Villalobos
> On Feb 13, 2024, at 12:00 AM, Fidea Lidea wr
Hi Nida,
You can find sample code for using Kafka here:
https://kafka.apache.org/documentation/
You can find sample code for using Flink here:
https://nightlies.apache.org/flink/flink-docs-stable/
You can find sample code for using Flink with Kafka here:
https://nightlies.apache.org/flink/fl
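In the same spirit, a small hedged sketch of wiring a Kafka topic into a Flink job with the KafkaSource builder (recent Flink versions); the broker address, topic, and group id are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToStdout {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")      // placeholder broker
                .setTopics("input-topic")                   // placeholder topic
                .setGroupId("example-group")                // placeholder group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Read the topic as a stream of strings and print it.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .print();

        env.execute("kafka-to-stdout");
    }
}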
rves well as a bridge between a Flink Streaming job and
>>>>> micro-services.
>>>>
>>>> This is essentially how I use it as well, and I would also be sad to see
>>>> it sunsetted. It works well; I don't know that there is a lot of new
>
ame result.
3. You can also run the stream in batch mode.
Remember, a stream does not end (unless it is run in batch mode).
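A minimal sketch of that batch-mode point, assuming a bounded input; with RuntimeExecutionMode.BATCH the same pipeline runs as a batch job and does end:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // With bounded inputs, the same pipeline can be executed in BATCH mode,
        // in which case the "stream" does end.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(1, 2, 3, 4, 5)
           .map(n -> n * n)
           .print();

        env.execute("batch-mode-example");
    }
}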
> On Apr 29, 2023, at 9:11 AM, Marco Villalobos
> wrote:
>
> Hi Luke,
>
> A batch has a beginning and an end. Although a stream has a beginning,
I am currently using Stateful Functions in my application.
I use Apache Flink for stream processing, and StateFun as a hand-off point for
the rest of the application.
It serves well as a bridge between a Flink Streaming job and micro-services.
I would be disappointed if StateFun was sunsetted.
Hello,
We are deploying a Flink application cluster in Kubernetes, with 2 pods: one for
the JM and the other for the TM.
The problem is that when we launch load tests, we see that task manager memory
usage increases; after the tests are finished and Flink stops processing
data, the memory usage never comes d
Hello,
Can someone explain to me the point of using HA when deploying an
application cluster with a single JM and checkpoints are not activated?
AFAIK, when the pod of the JM goes down, Kubernetes will restart it anyway, so
we don't need to activate HA in this case.
Maybe there's som
Hello,
Does anyone know when version 1.16.0 will be released, please?
Regards.
Did this list receive my email?
I’m only asking because my last few questions have gone unanswered and maybe
the list server is blocking me.
Anybody, please let me know.
> On Sep 26, 2022, at 8:41 PM, Marco Villalobos
> wrote:
>
> I indeed see the value of Flink Statef
k you.
Marco A. Villalobos
I might need more details, but conceptually, streams can be thought of as
never-ending tables
and our code as functions applied to them.
JOIN is a concept supported in the SQL API and DataStream API.
However, the SQL API is more succinct (unlike my writing ;).
So, how about the "fast stream" ma
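A minimal hedged sketch of such a join in the Table/SQL API (recent Flink versions); the table names, columns, and datagen connector settings are invented purely so the example is self-contained:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Two unbounded tables, registered with the datagen connector purely so
        // the example runs on its own; real jobs would use Kafka, JDBC, etc.
        tEnv.executeSql(
                "CREATE TABLE orders (order_id BIGINT, customer_id BIGINT) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '1')");
        tEnv.executeSql(
                "CREATE TABLE customers (customer_id BIGINT, name STRING) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '1')");

        // A regular join over two never-ending tables.
        tEnv.executeSql(
                "SELECT o.order_id, c.name " +
                "FROM orders AS o JOIN customers AS c " +
                "ON o.customer_id = c.customer_id")
            .print();
    }
}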
wrote:
> Hi!
> You should check out the Flink Kubernetes Operator. I think that covers
> all your needs .
>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
>
> Cheers,
> Gyula
>
> On Sat, 3 Sep 2022 at 13:45, marco andreas
> wrote:
>
&
We are deploying a Flink application cluster on K8s. Following the official
documentation, the JM is deployed as a Job resource; however, we are
deploying a long-running Flink job that is not supposed to be terminated,
and we also need to update the image of the Flink job.
The problem is that the j
Thus, Flink 1.12.2 was using the version of RocksDB with a known bug.
> On Sep 2, 2022, at 10:49 AM, Marco Villalobos
> wrote:
>
> What is the recommended solution for this error of too many files open during
> a checkpoint?
>
> 2022-09-02 10:04:56 java.io.IOExcepti
What is the recommended solution for this error of too many files open during a
checkpoint?
2022-09-02 10:04:56 java.io.IOException: Could not perform checkpoint 119366
for operator tag enrichment (3/4)#104. at
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(Strea
sInputFormat is reused, but
> DeserializationSchemaAdapter#Reader only does a shallow copy of the produced
> data, so the final result will always be the last row value.
>
> Could you please help create a JIRA issue to track it?
>
> Best regards,
> Yuxia
>
> - Original Message -
hat the transient field:
private transient RecordCollector collector;
in
org.apache.flink.connector.file.table.DeserializationSchemaAdapter.LineBytesInputFormat
becomes empty on each iteration, as though it failed to serialize correctly.
Regardless, I don't know what's wrong. Any advice would be greatly appreciated.
Marco A. Villalobos
Hello everybody,
When I perform this simple set of queries, a unique line from the source file
shows up many times.
I have verified many times that a unique line in the source shows up as many as
100 times in the SELECT statement.
Is this the correct behavior for Flink 1.15.1?
FYI, it does s
Is it possible in Flink SQL to tumble a window by row size instead of time?
Let's say that I want a window for every N rows, for example, using the Flink
SQL API.
Is that possible?
I can't find any documentation on how to do that, and I don't know if it is
supported.
If the performance of a stateful function (FaaS) is very slow, how does this
impact performance on the Flink StateFun Cluster?
I am trying to figure out what is too slow for a FaaS. I expect the Flink
StateFun Cluster to receive about 2,000 events per minute, but some, though not all,
FaaS might take
You're right. I didn't notice that the ports were different. That was very
subtle.
Thank you for pointing this out to me. I was stuck on it for quite a while.
> On Apr 16, 2022, at 6:17 PM, Marco Villalobos
> wrote:
>
> I'm sorry, I accidentally hit send before
16, 2022, at 6:12 PM, Marco Villalobos
> wrote:
>
> IK
>
> If what you're saying is true, then why do most of the examples in the
> flink-statefun-playground repository use HTTP as an alternative entry point?
>
> Here is the greeter example:
>
> https://githu
he/flink-statefun-playground/tree/main/java/greeter>
> On Apr 16, 2022, at 6:06 PM, Tymur Yarosh wrote:
>
> Hi Marco,
> The problem is that you’re trying to call the function directly via HTTP.
> This is not how it's supposed to work. Instead, check out how to define y
I'm trying to write a very simple echo app with Stateful Functions to prove it as
a technology for some of my use cases.
I have not been able to accept different content types, though. Here is an
example of my code for a simple
echo function:
My Echo stateful function class.
package statefun.impl
Hello,
Does anyone have the same issue, or an idea why the JobManager fails to
renew its leadership when using the Kubernetes HA service?
Configuration :
kubernetes.namespace: flink-ps-flink-dev
high-availability.kubernetes.leader-election.lease-duration: 200 s
high-availability.kubernetes.leader
Hello everyone,
I am deploying a Flink application cluster using K8s HA.
I notice this message in the log:
@timestamp":"2022-03-21T17:11:39.436+01:00","@version":"1","message":"Renew
deadline reached after 200 seconds while renewing lock ConfigMapLock:
flink-pushavoo-flink-rec -
elifibre-000
eid...@ververica.com> wrote:
> Hi Marco,
>
> I'm no expert on the Kafka producer, but I will try to help. [1] seems to
> have a decent explanation of possible error causes for the error you
> encountered.
> Which leads me to two questions:
>
I keep on receiving this exception during the execution of a simple job that
receives time series data via Kafka, transforms it into Avro format, and then
sends it into a Kafka topic consumed by Druid.
Any advice on how to resolve this type of error would be appreciated.
I'm using Apache Kafka
Hello flink community,
I am deploying a Flink application cluster using a Helm chart. The problem
is that the JobManager component type is a "Job", and with Helm I can't
upgrade the chart in order to change the application image version
because Helm is unable to upgrade the Docker image
KafkaSource and KafkaSourceBuilder? Should I
revert to using FlinkKafkaSource?
Any advice or insight would be very helpful.
Thank you.
Marco A. Villalobos
solution.
Again, thank you for your input.
> On Jan 26, 2022, at 1:32 PM, Alexander Fedulov
> wrote:
>
>
> Hi Marco,
>
> Not sure if I get your problem correctly, but you can process those windows
> on data "split" from the same input within the same Flink
, if
there are tens of thousands of time series names / windows?
Any help or advice would be appreciated. Thank you.
Marco A. Villalobos
Thank you. One last question. What is an RP? Where can I read it?
Marco
> On Nov 30, 2021, at 11:06 PM, Hang Ruan wrote:
>
> Hi,
>
> In versions 1.12.0-1.12.5, committing offsets to Kafka when checkpointing is
> enabled is the default behavior in KafkaSourceBuilder. And it
ther additional properties could be found here :
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/kafka/#additional-properties
>
> Marco Villalobos wrote on Tue, Nov 30, 2021 at 11:08 AM:
>
>> Thank you for the information. That still does not answer my
that I am using KafkaSourceBuilder, how do I configure that
behavior so that offsets get committed on checkpoints? Or is that the
default behavior with checkpoints?
-Marco
On Mon, Nov 29, 2021 at 5:58 PM Caizhi Weng wrote:
> Hi!
>
> Flink 1.14 release note states about this. See [1
kpoint.
> The interval of drawing checkpoints therefore defines how much the program
> may have to go back at most, in case of a failure. To use fault tolerant
> Kafka Consumers, checkpointing of the topology needs to be enabled in the
> job.
> If checkpointing is disabled, the Kafka consumer will periodically commit
> the offsets to Zookeeper.
Thank you.
Marco
The FlinkKafkaConsumer that will be deprecated has the
"setCommitOffsetsOnCheckpoints(boolean)" method.
However, that functionality is not in the new KafkaSource class.
How is this behavior / functionality configured in the new API?
-Marco A. Villalobos
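Based on the replies above, a hedged sketch of how that behavior is typically configured with the new KafkaSource: enable checkpointing on the environment and, if desired, set the commit.offsets.on.checkpoint property explicitly. The broker, topic, and group id are placeholders.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OffsetsOnCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // offsets are committed as part of each checkpoint

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")   // placeholder broker
                .setTopics("input-topic")                 // placeholder topic
                .setGroupId("example-group")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // Defaults to true in recent versions; shown here to make the behavior explicit.
                .setProperty("commit.offsets.on.checkpoint", "true")
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source").print();
        env.execute("offsets-on-checkpoint-sketch");
    }
}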
Dario Heinisch wrote:
> > Union creates a new stream containing all elements of the unioned
> > streams:
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/#union
> >
> >
> > On 05.11.21 14:25, Marco Villalobo
Can two different streams flow to the same operator (an operator with the
same name, uid, and implementation) and then share keyed state or will that
require joining the streams first?
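Following the union suggestion quoted above, a hedged sketch: when the two streams share a type, union() plus keyBy() routes equal keys from either input to the same parallel instance, so keyed state is effectively shared by both inputs.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedKeyedStateViaUnion {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Long>> streamA =
                env.fromElements(Tuple2.of("k1", 1L), Tuple2.of("k2", 2L));
        DataStream<Tuple2<String, Long>> streamB =
                env.fromElements(Tuple2.of("k1", 10L), Tuple2.of("k3", 3L));

        // union() merges the two streams; keyBy() then sends equal keys from either
        // input to the same parallel instance, so keyed state is naturally shared.
        streamA.union(streamB)
               .keyBy(t -> t.f0)
               .sum(1)
               .print();

        env.execute("shared-keyed-state-via-union");
    }
}

For streams of different types, connect() with a KeyedCoProcessFunction keyed the same way on both inputs gives the same sharing without a union.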
n either
> try to fix the API server accessibility or increase the leadership renew
> timeout (high-availability.kubernetes.leader-election.renew-deadline).
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Oct 25, 2021 at 5:08 PM marco wrote:
>
Any suggestions would be appreciated.
On 2021/10/20 16:18:39, marco wrote:
>
>
> Hello Flink community:
>
> I am deploying a Flink application cluster in standalone mode on Kubernetes, but I
> am facing some problems.
>
> The job starts normally and it continues to
appreciated.
-Marco
On Mon, Oct 18, 2021 at 8:03 PM Qingsheng Ren wrote:
> Hi Marco,
>
> Sorry I forgot to cc the user mailing list just now.
>
> From the exception message it looks like a versioning issue. Could you
> provide some additional information, such as Flink & Kaf
Hello Flink community:
I am deploying a Flink application cluster in standalone mode on Kubernetes, but I
am facing some problems.
The job starts normally and continues to run, but at some point in time it
crashes and gets restarted.
Is anyone facing the same problem, or does anyone know how to resolve
I have the simplest Flink job: it simply dequeues from a Kafka topic and
writes to another Kafka topic, but with headers, and manually copies the
event time into the Kafka sink.
It works as intended, but sometimes I am getting this error:
org.apache.kafka.common.errors.UnsupportedVersionExcepti
In my Flink Job, I am using event time to process time-series data.
Due to our business requirements, I need to verify that a specific subset
of data written to a JDBC sink has been written before I send an ActiveMQ
message to another component.
My job flows like this:
1. Kafka Source
2. Split s
Today, I kept on receiving a timeout exception when stopping my job with a
savepoint.
This happened with Flink version 1.12.2 running in EMR.
I had to use the deprecated cancel with savepoint feature instead.
In fact, stopping with a savepoint, creating a savepoint, and cancelling
with a savepoin
never to use -d again for this situation.
Again, thank you.
-Marco
On Thu, Sep 23, 2021 at 11:01 PM JING ZHANG wrote:
> Hi Macro,
> Did you specify the drain flag when stopping the job with a savepoint?
> If the --drain flag is specified, then a MAX_WATERMARK will be emitted
> before the las
Something strange happened today.
When we tried to shut down a job with a savepoint, the watermarks became
equal to 2^63 - 1.
This caused timers to fire indefinitely and crash downstream systems by
overloading them with invalid data.
We are using event time processing with Kafka as our source.
It seems imp
I'll test my hypothesis later today.
> On Sep 7, 2021, at 2:07 AM, JING ZHANG wrote:
>
> Hi Marco,
> I'm not sure which API or SQL query you use.
> If you use the Windowed Stream API in DataStream [1], the input data would be
> assigned to a Window based on which Window A
t triggers the timer firing. Must I create a special window assigner
that accepts late data?
On Tue, Sep 7, 2021 at 2:07 AM JING ZHANG wrote:
> Hi Marco,
> I'm not sure which API or SQL query you use.
> If you use the Windowed Stream API in DataStream [1], the input data would be
>
If an event time timer is registered to fire exactly every 15 minutes,
starting from exactly at the top of the hour (exactly 00:00, 00:15, 00:30,
00:45 for example), and within that timer it produces an element in the
stream, what event time will that element have, and what window will it
belong to
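A hedged sketch of the setup being described: a KeyedProcessFunction that registers event-time timers aligned to 15-minute boundaries. To my understanding, records emitted from onTimer carry the firing timer's timestamp as their event time, which then determines which downstream window they fall into. The class and state names are placeholders.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class QuarterHourTimer extends KeyedProcessFunction<String, Tuple2<String, Double>, Tuple2<String, Double>> {

    private static final long FIFTEEN_MINUTES = 15 * 60 * 1000L;

    private transient ValueState<Double> lastValue;

    @Override
    public void open(Configuration parameters) {
        lastValue = getRuntimeContext().getState(
                new ValueStateDescriptor<>("last-value", Double.class));
    }

    @Override
    public void processElement(Tuple2<String, Double> in, Context ctx,
                               Collector<Tuple2<String, Double>> out) throws Exception {
        lastValue.update(in.f1);
        Long ts = ctx.timestamp(); // assumes event timestamps are assigned upstream
        if (ts == null) {
            return;
        }
        // Align the next timer to the top of the next 15-minute boundary.
        long nextBoundary = ts - (ts % FIFTEEN_MINUTES) + FIFTEEN_MINUTES;
        ctx.timerService().registerEventTimeTimer(nextBoundary);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Double>> out) throws Exception {
        Double value = lastValue.value();
        if (value != null) {
            // The emitted record carries `timestamp` (the timer's time) as its event time.
            out.collect(Tuple2.of(ctx.getCurrentKey(), value));
        }
        // Re-register so the timer keeps firing every 15 minutes.
        ctx.timerService().registerEventTimeTimer(timestamp + FIFTEEN_MINUTES);
    }
}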
I use event time, with Kafka as my source.
The system that I am developing requires data to be aggregated every 15
minutes, thus
I am using a Tumbling Event Time window. However, my system is also
required to take
action every 15 minutes even if there is no activity.
I need the elements collected in t
Let's say that a job has operators with UIDs: W, X, Y, and Z, and uses
RocksDB as a backend with the checkpoint data URI "s3://checkpoints".
Then I stop the job with a savepoint at s3://savepoint-1.
I assume that all the data within the checkpoint are stored within the
given Savepoint. Is that assumpti
It was not clear to me that JdbcInputFormat was part of the DataSet API.
Now I understand.
Thank you.
On Fri, Jun 18, 2021 at 5:23 AM Timo Walther wrote:
> Hi Marco,
>
> as Robert already mentioned, the BatchTableEnvironment is simply built
> on top of the DataSet API,
bleEnvImpl.scala:555)
at
org.apache.flink.table.api.internal.BatchTableEnvImpl.translate(BatchTableEnvImpl.scala:537)
at
org.apache.flink.table.api.bridge.java.internal.BatchTableEnvironmentImpl.toDataSet(BatchTableEnvironmentImpl.scala:101)
On Thu, Jun 17, 2021 at 12:51 AM Timo Walther wrote:
> Hi Marco,
>
> which oper
the DataSet API with the Table SQL API in Flink
1.12.1?
On Wed, Jun 16, 2021 at 4:55 AM Robert Metzger wrote:
> Hi Marco,
>
> The DataSet API will not run out of memory, as it spills to disk if the
> data doesn't fit anymore.
> Load is distributed by partitioning data.
>
&
I must bootstrap state from Postgres (approximately 200 GB of data), and I
notice that the State Processor API requires the DataSet API in order to
bootstrap state for the Stream API.
I wish there was a way to use the SQL API and use a partitioned scan, but I
don't know if that is even possible wit
ening an issue for the missing documentation?
>
> [1]
> https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/test/java/org/apache/flink/connector/file/src/FileSourceTextLinesITCase.java
> Best,
> Guowei
>
>
> On Tue, Jun 8, 2021 at 5:59 AM Marc
How do I use a hierarchical directory structure as a file source in S3 when
using the DataStream API in Batch Execution mode?
I have been trying to find out if the API supports that, because currently
our data is organized by years, halves, quarters, and months, but before I
launch the job, I flat
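A hedged sketch of pointing a FileSource at the root of such a nested S3 layout (class names per recent Flink versions; the text reader class was named differently around 1.12/1.13). To my understanding the default file enumerator discovers files recursively, so the directories should not need flattening; the bucket and path are placeholders.

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NestedS3Source {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // Pointing at the root of the year/half/quarter/month hierarchy; the default
        // enumerator walks nested directories, so the layout need not be flattened.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(),
                                       new Path("s3://my-bucket/data/"))   // placeholder path
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-files")
           .print();

        env.execute("nested-s3-source");
    }
}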
this operator, it only goes to one task manager node.
I need state, but I don't really need it keyed.
On Sat, Jun 5, 2021 at 4:56 AM Marco Villalobos
wrote:
> Does that work in the DataStream API in Batch Execution Mode?
>
> On Sat, Jun 5, 2021 at 12:04 AM JING ZHANG wrote:
>
> Best regards,
> JING ZHANG
>
>
> Marco Villalobos 于2021年6月5日周六 下午1:55写道:
>
>> Is it possible to use OperatorState, when NOT implementing a source or
>> sink function?
>>
>> If yes, then how?
>>
>
Is it possible to use OperatorState, when NOT implementing a source or sink
function?
If yes, then how?
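To my knowledge, the usual way to get operator (non-keyed) state outside of a source or sink is to implement CheckpointedFunction on a regular function; a minimal hedged sketch:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// A map function (neither source nor sink) that keeps a per-subtask counter in
// operator ListState via the CheckpointedFunction interface.
public class CountingMapper extends RichMapFunction<String, String> implements CheckpointedFunction {

    private transient ListState<Long> checkpointedCount;
    private long count;

    @Override
    public String map(String value) {
        count++;
        return value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedCount.clear();
        checkpointedCount.add(count);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedCount = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("count", Long.class));
        // Restore (and merge) any previously checkpointed counts on recovery/rescale.
        for (Long c : checkpointedCount.get()) {
            count += c;
        }
    }
}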
I am running with one job manager and three task managers.
Each task manager is receiving at most 8 GB of data, but the job is timing
out.
What parameters must I adjust?
Sink: back fill db sink) (15/32) (50626268d1f0d4c0833c5fa548863abd)
switched from SCHEDULED to FAILED on [unassigned resource]
Hi,
Stream one has one element.
Stream two has 2 elements.
Both streams derive from side-outputs. I am using the DataStream API in
batch execution mode.
I want to join them on a common key and window.
I am certain the keys match, but the flat join does not seem to be working.
I deduce that t
I found the problem. I tried to assign timestamps to the operator (I don't
know why), and when I did that, because I chained the Flink API calls fluently, I
was no longer referencing the operator that contained the side-outputs.
Disregard my question.
On Sat, May 22, 2021 at 9:28 PM Marco Villa
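A hedged sketch of the chaining pitfall described above: getSideOutput must be called on the SingleOutputStreamOperator that declared the side output, so that reference has to be kept rather than chained past (for example through assignTimestampsAndWatermarks):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SideOutputReference {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        final OutputTag<String> sideTag = new OutputTag<String>("side") {};

        // Keep the reference to the operator that emits the side output...
        SingleOutputStreamOperator<String> main = env.fromElements("a", "b", "c")
                .process(new ProcessFunction<String, String>() {
                    @Override
                    public void processElement(String value, Context ctx, Collector<String> out) {
                        ctx.output(sideTag, value); // to the side output
                        out.collect(value);         // to the main output
                    }
                });

        // ...because chaining further returns a new operator that no longer
        // exposes this side output.
        DataStream<String> side = main.getSideOutput(sideTag);
        side.print();

        env.execute("side-output-reference");
    }
}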
I have been struggling for two days with an issue using the DataStream API
in Batch Execution mode.
It seems as though my side-output has no elements available to downstream
operators.
However, I am certain that the downstream operator received events.
I logged the side-output element just before
Hello. I am using Flink 1.12.1 in EMR.
I am processing historical time-series data with the DataStream
API in Batch execution mode.
I must average time series data into fifteen-minute intervals
and forward-fill missing values.
For example, this input:
name, timestamp, value
a,2019-06-23T00:07:
> On May 19, 2021, at 7:26 AM, Yun Gao wrote:
>
> Hi Marco,
>
> For the remaining issues,
>
> 1. For the aggregation, the 500 GB of files are not required to fit into
> memory.
> Roughly speaking, for the keyed().window().reduce(), the input records would be
Thank you very much. You've been very helpful.
Since my intermediate results are large, I suspect that io.tmp.dirs must
literally be on the local file system. Thus, since I use EMR, I'll need to
configure EBS to support more data.
On Tue, May 18, 2021 at 11:08 PM Yun Gao wrote:
Thank you. I used the default restart strategy. I'll change that.
On Tue, May 18, 2021 at 11:02 PM Yun Gao wrote:
> Hi Marco,
>
> Have you configured the restart strategy? If the restart-strategy [1] is
> configured
> to a strategy other than none, Flink shoul
I have questions about Flink DataStream scalability in BATCH execution mode and would appreciate advice.
Here is the problem that I am trying to solve.
Input is an S3 bucket directory with about 500 GB of data across many
files. The instance that I am running on only has 50 GB of EBS storage. The
nature of this data is time series
I have a DataStream running in Batch Execution mode within YARN on EMR.
My job failed an hour into the run, two times in a row, because the task
manager heartbeat timed out.
Can somebody point out how to restart a job in this situation? I can't
find that section of the documentation.
thank you.
ry curious how
that happens.
Can somebody please explain?
Thank you.
Marco A. Villalobos
I have a tumbling window that aggregates into a process window function.
Downstream there is a keyed process function.
[window aggregate into process function] -> keyed process function
I am not quite sure how the keyed process knows which elements are at the
boundary of the window. Is there a m
eam to a stateful function can
> emit a message that
> in turn will be routed to that function using the data stream integration.
>
>
> On Wed, Apr 7, 2021 at 7:16 PM Marco Villalobos
> wrote:
>
>> Thank you for the clarification.
>>
>> BUT there was o
Thank you for the clarification.
BUT there was one question not addressed:
Can a stateful function be called by a process function?
On Wed, Apr 7, 2021 at 8:19 AM Igal Shilman wrote:
> Hello Marco!
>
> Your understanding is correct, but in addition
> You can also use State
another datastream within the data stream job.
Is my understanding correct?
I was hoping that a stateful function could be called by a process
function. Is that possible? (I am guessing not.)
Please let me know.
Thank you. Sincerely,
Marco A. Villalobos
Is it possible for an operator to receive two different kinds of
broadcasts?
Is it possible for an operator to receive two different types of streams
and a broadcast? For example, I know there is a KeyedCoProcessFunction, but
is there a version of that which can also receive broadcasts?
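To my understanding, one keyed stream plus one broadcast stream can be combined with connect() and a KeyedBroadcastProcessFunction (for two data streams plus a broadcast, the two data streams would typically be unioned or connected first). A hedged sketch; the types and descriptor name are invented for illustration:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class KeyedPlusBroadcast {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        final MapStateDescriptor<String, String> rulesDesc =
                new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

        DataStream<Tuple2<String, Long>> events =
                env.fromElements(Tuple2.of("k1", 1L), Tuple2.of("k2", 2L));
        BroadcastStream<Tuple2<String, String>> rules =
                env.fromElements(Tuple2.of("rule-1", "v")).broadcast(rulesDesc);

        events.keyBy(t -> t.f0)
              .connect(rules)
              .process(new KeyedBroadcastProcessFunction<String, Tuple2<String, Long>, Tuple2<String, String>, String>() {
                  @Override
                  public void processElement(Tuple2<String, Long> value, ReadOnlyContext ctx,
                                             Collector<String> out) {
                      out.collect(value.f0 + ":" + value.f1);
                  }

                  @Override
                  public void processBroadcastElement(Tuple2<String, String> rule, Context ctx,
                                                      Collector<String> out) throws Exception {
                      // Every parallel instance receives the broadcast element and stores it.
                      ctx.getBroadcastState(rulesDesc).put(rule.f0, rule.f1);
                  }
              })
              .print();

        env.execute("keyed-plus-broadcast");
    }
}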
Given:
[source] -> [operator 1] -> [operator 2] -> [sink].
If, within the dashboard, operator 1 shows that it has backpressure, does
that mean I need to improve the performance of operator 2 in order to
alleviate backpressure upon operator 1?
Hi,
I am having a difficult time distinguishing the difference between
RuntimeContext state and global state when using a ProcessWindowFunction.
A ProcessWindowFunction has access to three different kinds of state.
1. RuntimeContext state.
2. ProcessWindowFunction.Context global state
3. ProcessWind
.
On Fri, Feb 5, 2021 at 3:06 AM Marco Villalobos
wrote:
> as data flows from a source through a pipeline of operators and finally
> sinks, is there a means to control how many threads are used within an
> operator, and how an operator is distributed across the network?
>
> Where c
Is it possible to use different state backends for different operators?
There are certain situations where I want the state to reside completely in
memory, and other situations where I want it stored in RocksDB.
as data flows from a source through a pipeline of operators and finally
sinks, is there a means to control how many threads are used within an
operator, and how an operator is distributed across the network?
Where can I read up on these types of details specifically?
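A small hedged sketch of the usual knobs: a job-wide default parallelism plus per-operator setParallelism calls, which control how many parallel subtasks (and thus slots across TaskManagers) each operator gets:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismKnobs {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // default parallelism for every operator in the job

        env.fromElements(1, 2, 3, 4, 5, 6, 7, 8)
           .map(n -> n * 2).setParallelism(8)     // this operator runs with 8 parallel subtasks
           .filter(n -> n % 4 == 0).setParallelism(2)
           .print().setParallelism(1);

        env.execute("parallelism-knobs");
    }
}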
Oh, I found the solution. I simply need to not use TRACE log level for
Flink.
On Wed, Feb 3, 2021 at 7:07 PM Marco Villalobos
wrote:
>
> Please advise me. I don't know what I am doing wrong.
>
> After I added the Blink table planner to my dependency managemen
Please advise me. I don't know what I am doing wrong.
After I added the Blink table planner to my dependency management:
dependency
"org.apache.flink:flink-table-planner-blink_${scalaVersion}:${flinkVersion}"
and added it as a dependency:
implementation "org.apache.flink:flink-table-planner-
our deployment topology for now) and package
it with my Data Stream Job?
Thank you.
> On Feb 2, 2021, at 10:08 PM, Tzu-Li (Gordon) Tai wrote:
>
> Hi Marco,
>
> In the ideal setup, enrichment data existing in external databases is
> bootstrapped into the streaming job via Fli
Hi everybody,
I am brainstorming how it might be possible to perform database enrichment
with the DataStream API, use keyed state for caching, and also utilize
Async I/O.
Since Async I/O does not support keyed state, is it possible to use an
iterative stream that uses keyed state for caching in
gt; checkpoint timeout.
>
> 2) What kind of effects are you worried about?
>
> On 1/28/2021 8:05 PM, Marco Villalobos wrote:
> > Is it possible that checkpointing times out due to an operator taking
> > too long?
> >
> > Also, does windowing affect the checkpoint barriers?
>
>
>
s working. You would need to analyse what's working slower than
> expected. Checkpointing times? (Async duration? Sync duration? Start
> delay/back pressure?) Throughput? Recovery/startup? Are you being rate
> limited by Amazon?
>
> Piotrek
>
> czw., 28 sty 2021 o 03:46 M
wrote:
> Hi Marco
> You need to figure out why the checkpoint timed out (you can see the
> time consumed by each stage of a checkpoint in the UI). If it indeed needs
> such a long time to complete the checkpoint, then you need to configure a
> longer timeout.
> If there are
errors during our onTimer calls, I don't know if
that's related.
Marco A. Villalobos
Is it possible that checkpointing times out due to an operator taking too
long?
Also, does windowing affect the checkpoint barriers?
Is it possible to use an environmental credentials provider?
On Thu, Jan 28, 2021 at 8:35 AM Arvid Heise wrote:
> Hi Marco,
>
> afaik you don't need HADOOP_HOME or core-site.xml.
>
> I'm also not sure from where you got your config keys. (I guess from the
> Presto p
.
Any advice would be appreciated.
-Marco Villalobos
at 1:17 AM Arvid Heise wrote:
> Hi Marco,
>
> In general, sending a compressed log to ML is totally fine. You can
> further minimize the log by disabling restarts.
> I looked into the logs that you provided.
>
> 2021-01-26 04:37:43,280 INFO org.apache
Just curious, has anybody had success with Amazon EMR with RocksDB and
checkpointing in S3?
That's the configuration I am trying to setup, but my system is running
more slowly than expected.
When I try to stop with a savepoint, I usually get the error below. I have
not been able to create a single savepoint. Please advise.
I am using Flink 1.11.0
Draining job "ed51084378323a7d9fb1c4c97c2657df" with a savepoint.
The progr