Hi Jörn,
Thanks. I had missed that EMRFS strong consistency configuration. Will try
that now.
We also had a backup solution - using Kinesis instead of S3 (I don't see
Kinesis in your suggestion, but I hope it would be alright).
"/The small size and high rate is not suitable for S3 or HDFS/" <<<
Hi Darshan,
The join implementation in SQL / Table API does what is demanded by the SQL
semantics.
Hence, which results to emit and which data to store (as state) to compute
these results is pretty much given.
You can think of the semantics of the join as writing both streams into a
relational DBM
Just an update: I tried to enable the "EMRFS Consistent View" option, but it
didn't help. I'm not sure whether that's what you recommended, or something
else.
Thanks!
How often is the product db updated? Based on that, you can store product
metadata as state in Flink, maybe set up the state on cluster startup and then
update it daily, etc.
Also, just based on this feature, Flink doesn't seem to add a lot of value on
top of Kafka. As Jörn said below, you can very w
Dear All,
I have questions regarding keys. In general, the questions are:
what happens if I do keyBy on an unlimited number of keys? How does Flink
manage each KeyedStream under the hood? Will I get a memory overflow, for
example, if every KeyedStream associated with a specific key is t
Sure, Kinesis is another way.
Can you try read after write consistency (assuming the files are not modified)?
In any case, it looks like you would be better suited with a NoSQL store or
Kinesis (I don't know your exact use case, so I can't provide more details).
> On 24. Jul 2018, at 09:51, Averell
Could you please explain in more detail what you mean by "try read after write
consistency (assuming the files are not modified)"?
I guess the problem I got comes from inconsistency in S3 file listing.
Otherwise, I would have got file-not-found exceptions.
My use case is to read output files
I'm trying to get a list of late elements in my Tumbling Windows application
and I noticed
that I need to use SingleOutputStreamOperator instead of DataStream to get
access to the .sideOutputLateData(...) method.
Can someone explain what the difference is between
SingleOutputStreamOperator and
One option (which I haven't tried myself) would be to somehow get the model
into PMML format, and then use https://github.com/FlinkML/flink-jpmml to
score the model. You could either use another machine learning framework to
train the model (i.e., a framework that directly supports PMML export), or
Hi everyone,
We are using a simple Flink setup with one jobmanager and one taskmanager
running inside a docker container. We are having issues enabling the
taskmanager.debug.memory.startLogThread setting. We added
taskmanager.debug.memory.startLogThread: true
taskmanager.debug.memory.logInter
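For completeness, the settings we intended to put in flink-conf.yaml look like
this (the interval value is just an example, and the key names are typed from
memory, so please double-check them against the docs for your Flink version):

    taskmanager.debug.memory.startLogThread: true
    taskmanager.debug.memory.logIntervalMs: 5000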
Hi Gerard,
the first log snippet from the client does not show anything suspicious.
The warning just says that you cannot use the Yarn CLI because it lacks the
Hadoop dependencies in the classpath.
The second snippet is indeed more interesting. If the TaskExecutors are not
notified about the chan
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html
You will find there a passage about the consistency model.
Probably the system is putting them to the folder and Flink is triggered before
they are consistent.
What happens after Flink puts them on S3? Are they reused by another s
Hi,
Thanks for your responses.
There is no fixed interval for the data being updated. It's more that
whenever you onboard a new product, or any mandate changes, the
reference data will change.
It’s not just the enrichment we are doing here. Once we have enriched the
data
Hi Alex,
I'm not entirely sure what causes this problem because it is the first time
I have seen it.
The first question would be whether the problem also arises when using a
different Hadoop version.
Are you using the same Java versions on the client as well as on the server?
Could you provide us with the clust
Dear Cederic,
I did something similar to yours a while ago as part of this work [1], but I've
always been working within the batch context. I'm also the co-author of
flink-jpmml and, since a flink2pmml model saver library doesn't currently
exist, I'd suggest a twofold strategy to tackle this problem:
Hi Syed,
could you check whether this class is actually contained in the twitter
example jar? If not, then you have to build an uber jar containing all
required dependencies.
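For example, with Maven a minimal maven-shade-plugin setup (the plugin version
below is only an example) that bundles all dependencies into a single jar looks
roughly like this:

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.1.1</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>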
Cheers,
Till
On Tue, Jul 24, 2018 at 5:11 AM syed wrote:
> I am facing the *java.lang.NoClassDefFoundError:
> com/twitt
Hi Chang Liu,
if you are dealing with an unlimited number of keys and keep state around
for every key, then your state size will keep growing with the number of
keys. If you are using the FileStateBackend which keeps state in memory,
you will eventually run into an OutOfMemoryException. One way to
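For example, the RocksDB state backend keeps keyed state on disk instead of on
the JVM heap, which bounds memory usage. A minimal sketch (the checkpoint URI
is a placeholder, and you need the flink-statebackend-rocksdb dependency):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // keyed state lives in local RocksDB instances (on disk) and is only
    // checkpointed to the URI below, so state size is bounded by disk space
    // rather than by the JVM heap
    env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints"))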
Hi Chris,
a `DataStream` represents a stream of events which have the same type. A
`SingleOutputStreamOperator` is a subclass of `DataStream` and represents a
user defined transformation applied to an input `DataStream` and producing
an output `DataStream` (represented by itself). Since you can on
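A quick sketch of how this plays out for late data (the types, tag name, and
window size are made up; timestamp/watermark assignment is omitted for brevity):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val input: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 2))

    // tag identifying the late-data side output
    val lateTag = OutputTag[(String, Int)]("late-events")

    // the result of this chain is backed by a SingleOutputStreamOperator,
    // which is what exposes sideOutputLateData(...) and getSideOutput(...)
    val counts = input
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.minutes(1)))
      .sideOutputLateData(lateTag)
      .sum(1)

    val lateEvents: DataStream[(String, Int)] = counts.getSideOutput(lateTag)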
Hi Oliver,
which Flink image are you using? If you are using the docker image from
docker hub [1], then the memory logging will go to stdout and not to a log
file. The reason for this behavior is that the docker image configures the
logger to print to stdout such that one can easily access the log
Hello Jörn.
Thanks for your help.
"/Probably the system is putting them to the folder and Flink is triggered
before they are consistent./" <<< yes, I also guess so. However, if Flink is
triggered before they are consistent, either (a) there should be some error
messages, or (b) Flink should be abl
Hi,
The problem is that Flink tracks which files it has read by remembering the
modification time of the file that was added (or modified) last.
We use the modification time to avoid having to remember the names of all
files that were ever consumed, which would be expensive to check and
sto
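For context, this is the kind of continuously monitored file source I am
describing (a sketch; the path and scan interval are placeholders):

    import org.apache.flink.api.java.io.TextInputFormat
    import org.apache.flink.core.fs.Path
    import org.apache.flink.streaming.api.functions.source.FileProcessingMode
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val inputPath = "s3://my-bucket/output/" // placeholder

    // the source lists the directory every second and only picks up files whose
    // modification time is newer than the latest one it has seen so far
    val lines: DataStream[String] = env.readFile(
      new TextInputFormat(new Path(inputPath)),
      inputPath,
      FileProcessingMode.PROCESS_CONTINUOUSLY,
      1000L)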
Hi Harshvardhan,
I agree with Ankit that this problem could actually be solved quite
elegantly with Flink's state. If you can ingest the product/account
information changes as a stream, you can keep the latest version of it in
Flink state by using a co-map function [1, 2]. One input of the co-map
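A rough sketch of that pattern (the case classes and names are made up for
illustration; the function keeps the latest reference record per key in Flink
state and enriches positions with it):

    import org.apache.flink.api.common.state.ValueStateDescriptor
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.util.Collector

    case class Position(productId: String, quantity: Long)
    case class ProductInfo(productId: String, name: String)

    class EnrichmentFunction
        extends RichCoFlatMapFunction[Position, ProductInfo, (Position, ProductInfo)] {

      // latest product info for the current key, kept in Flink's keyed state
      private lazy val productState = getRuntimeContext.getState(
        new ValueStateDescriptor[ProductInfo]("product", classOf[ProductInfo]))

      override def flatMap1(pos: Position, out: Collector[(Position, ProductInfo)]): Unit = {
        val product = productState.value()
        if (product != null) out.collect((pos, product)) // emit only if we know the product
      }

      override def flatMap2(product: ProductInfo, out: Collector[(Position, ProductInfo)]): Unit =
        productState.update(product) // remember the newest version of the reference data
    }

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val positions: DataStream[Position] = env.fromElements(Position("p1", 10L))
    val products: DataStream[ProductInfo] = env.fromElements(ProductInfo("p1", "Widget"))

    val enriched: DataStream[(Position, ProductInfo)] = positions
      .connect(products)
      .keyBy(_.productId, _.productId)
      .flatMap(new EnrichmentFunction)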
Dear community,
this is the weekly community update thread #30. Please post any news and
updates you want to share with the community to this thread.
# First RC for Flink 1.6.0
The community has published the first release candidate for Flink 1.6.0 [1].
Please help the community by trying the RC
Hi Till,
How would we do the initial hydration of the Product and Account data, since
it's currently in a relational DB? Do we have to copy the data over to Kafka
and then consume it from there?
Regards,
Harsh
On Tue, Jul 24, 2018 at 09:22 Till Rohrmann wrote:
> Hi Harshvardhan,
>
> I agree with Ankit that this
Yes, that could be a solution: initialize a Kafka topic with the initial
values, then feed changes into that topic and consume from it.
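A sketch of the consuming side (topic name, servers, and the Kafka connector
version are placeholders; pick the connector matching your broker):

    import java.util.Properties
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "kafka:9092")
    props.setProperty("group.id", "enrichment-job")

    // read the reference-data topic from the beginning so the job first sees the
    // full initial snapshot and then keeps receiving subsequent change events
    val consumer = new FlinkKafkaConsumer011[String]("products", new SimpleStringSchema(), props)
    consumer.setStartFromEarliest()

    val productUpdates: DataStream[String] = env.addSource(consumer)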
On Tue, Jul 24, 2018 at 3:58 PM Harshvardhan Agrawal <
harshvardhan.ag...@gmail.com> wrote:
> Hi Till,
>
> How would we do the initial hydration of th
Hi Till,
Thanks for responding. Below are the entrypoint logs. One thing I noticed is
"Hadoop version: 2.7.3". My fat jar is built with the Hadoop 2.8 client. Could
that be the reason for the error? If so, how can I use the same Hadoop version
2.8 on the Flink server side? BTW, the job runs fine locally reading from the
Hi Alex,
Based on your log information, the potential cause is the Hadoop version. To
troubleshoot whether the exception comes from the differing Hadoop versions,
I suggest you match the Hadoop version on both sides.
You can:
1. Upgrade the Hadoop version which the Flink cluster depends on; Flink's
official website
Alas, this suffers from the bootstrap problem. At the moment Flink does not
allow you to pause a source (the positions), so you can't fully consume
and preload the accounts or products to perform the join before the
positions start flowing. Additionally, Flink SQL does not support
materializin
I am using a custom LoggingFactory. Is there a way to provide access to
this custom LoggingFactory to all the operators other than adding it to all
constructors?
This is somewhat related to:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Adding-Context-To-Logs-td7351.html
Ja
BTW,
We got around the bootstrap problem for a similar use case using a "nohup"
topic as the input stream. Our CI/CD pipeline currently passes an initialize
option to the app IF there is a need to bootstrap, waits for X minutes before
taking a savepoint, and then restarts the app normally, listening to the right
topic(s).
What happens when one of your workers dies? Say the machine is dead and not
recoverable. How do you recover from that situation? Will the pipeline die,
and do you go through the entire bootstrap process again?
On Tue, Jul 24, 2018 at 11:56 ashish pok wrote:
> BTW,
>
> We got around bootstrap problem for simil
Hi Jayant,
I think you should be able to implement your own StaticLoggerBinder which
returns your own LoggerFactory. That is quite similar to how the different
logging backends (log4j, logback) integrate with slf4j.
Cheers,
Till
On Tue, Jul 24, 2018 at 5:41 PM Jayant Ameta wrote:
> I am using
The app is checkpointing, so it will pick up if an operation fails. I suppose
you mean a TM completely crashes; even in that case another TM would spin up
and it "should" pick up from the checkpoint. We are running YARN, but I would
assume TM recovery would be possible in any other cluster. I havent test
Hi there,
I am using the Avro format and writing data to S3. I recently upgraded from
Flink 1.3.2 to 1.5 and noticed the errors below.
I am using RocksDB, and checkpointDataUri is an S3 location.
My writer looks something like this:
val writer = new AvroKeyValueSinkWriter[String, R](proper
We started out with a sample project from an earlier version of flink-java.
The sample project's pom.xml contained a long list of exclusion elements
for building the fat jar. The fat jar size is slightly over 100MB in our
case.
We are looking to upgrade to Flink 1.5 so we updated the pom.xml using one
gene
The previous list excluded a number of dependencies to prevent clashes
with Flink (for example Netty), which is no longer required.
If you could provide the output of "mvn dependency:tree" we might be
able to figure out why the jar is larger.
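For example, to limit the output to the Flink artifacts in the tree:

    mvn dependency:tree -Dincludes=org.apache.flink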
On 24.07.2018 20:49, jlist9 wrote:
We started out wi
Dear
I am working on a streaming prediction model for which I want to try the
flink-jpmml extension (https://github.com/FlinkML/flink-jpmml).
Unfortunately, it supports only the 0.7.0-SNAPSHOT and 0.6.1 versions
of Flink, and I am using the 1.7-SNAPSHOT version of Flink.
How can I downg
Vino,
Upgraded flink to Hadoop 2.8.1
$ docker exec -it flink-jobmanager cat /var/log/flink/flink.log | grep
entrypoint | grep 'Hadoop version'
2018-07-25T00:19:46.142+
[org.apache.flink.runtime.entrypoint.ClusterEntrypoint] INFO Hadoop
version: 2.8.1
but the job still fails to start.
Ideas?
C
Hi Alex,
Is it possible that the data has been corrupted?
Or have you confirmed that the Avro version is consistent across the different
Flink versions?
Also, if you don't upgrade Flink and still use version 1.3.1, can it be
recovered?
Thanks, vino.
2018-07-25 8:32 GMT+08:00 Alex Vinnik :
> Vino,
>
Hi Cederic,
I just read the project you referenced; it includes the following statement in
its README file:
*“flink-jpmml is tested with the latest Flink (i.e. 1.3.2), but any working
Apache Flink version (repo) should work properly.”*
This project was born a year ago and should not rely on versions
Thank you Fabian.
I tried to implement a quick test based on what you suggested: having an
offset from the system time, and I did get an improvement: with an offset of
500ms the problem was completely gone. With an offset of 50ms, I still got
around 3-5 files missed out of 10,000. This number might come from
Hi all,
I have a standalone cluster with 3 JobManagers, with high-availability set to
ZooKeeper. Our client submits jobs via the REST API (POST /jars/:jarid/run),
which means we need to know the host of any of the currently alive JobManagers.
The problem is: how can we know which JobManager is
Hello Fabian,
I created the JIRA bug https://issues.apache.org/jira/browse/FLINK-9940
BTW, I have one more question: Is it worthwhile to checkpoint that list of
processed files? Does the current implementation of the file source guarantee
exactly-once?
Thanks for your support.
Hi Yuan Youjun,
Actually, RestClusterClient has a method named getWebMonitorBaseUrl which
retrieves the WebMonitor's leader address automatically when you submit a
job. [1]
Ideally, you do not need to retrieve the JM yourself. Currently, the
WebMonitor is bound to the JobManager, so maybe if the JM fail
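A rough sketch of what I mean (please treat the exact constructor and method
signatures as assumptions to verify against your Flink version; the config path
and cluster id are placeholders):

    import org.apache.flink.client.program.rest.RestClusterClient
    import org.apache.flink.configuration.GlobalConfiguration

    // load the flink-conf.yaml that contains the ZooKeeper HA settings
    val config = GlobalConfiguration.loadConfiguration("/path/to/flink/conf")

    // the client resolves the current leading JobManager / REST endpoint through
    // the HA services, so you do not need to hard-code a host yourself
    val client = new RestClusterClient[String](config, "my-cluster-id")
    // job submission through this client internally calls getWebMonitorBaseUrl()
    // to look up the leader's REST address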