I would like to create a custom aggregator function for a windowed KeyedStream,
one over which I have complete control - i.e. instead of implementing an
AggregateFunction, I would like to control the lifecycle of the Flink state by
implementing the CheckpointedFunction interface, though I still want…
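To make the idea concrete, this is the general lifecycle I mean, sketched on a
simple flat map rather than a window function (RunningSum and its state name are
made-up placeholders, and I am not sure the window operator even honours this
interface - that is part of my question):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: the function owns its state lifecycle via
// CheckpointedFunction instead of delegating to an AggregateFunction.
public class RunningSum extends RichFlatMapFunction<Long, Long>
        implements CheckpointedFunction {

    private transient ListState<Long> checkpointedSum; // checkpointed handle
    private long sum;                                  // local working copy

    @Override
    public void flatMap(Long value, Collector<Long> out) {
        sum += value;
        out.collect(sum);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointedSum.clear();
        checkpointedSum.add(sum); // persist the working copy on each checkpoint
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedSum = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("sum", Long.class));
        for (Long restored : checkpointedSum.get()) { // non-empty only on restore
            sum = restored;
        }
    }
}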
Forgot to mention that my target Kafka version is 0.11.x, with the aim of
upgrading to 1.0 when a 1.0.x fixpack is released.
Thanks
Hayden
From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org]
Sent: Thursday, February 08, 2018 8:05 PM
To: user@flink.apache.org; Marchant, Hayden [ICG-IT]
Subject: RE: Kafka as source for batch job
Hi Marchant,
Yes, I agree. In general, the isEndOfStream method has a very ill-defined
semantic; each connector version has its own KafkaFetcher.runFetchLoop with
slightly different logic for changing running to false.
What would you recommend in this case?
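To make the question concrete, the kind of schema I was experimenting with looks
roughly like this (a hypothetical sketch; it naively uses one end offset for all
partitions, and the runFetchLoop caveat above applies):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

// Hypothetical sketch: carry the Kafka offset in the produced record so
// that isEndOfStream can compare it against a configured end offset.
public class BoundedSchema implements KeyedDeserializationSchema<Tuple2<Long, String>> {

    private final long endOffset; // exclusive end of the range (assumption)

    public BoundedSchema(long endOffset) {
        this.endOffset = endOffset;
    }

    @Override
    public Tuple2<Long, String> deserialize(byte[] messageKey, byte[] message,
            String topic, int partition, long offset) throws IOException {
        return Tuple2.of(offset, new String(message, StandardCharsets.UTF_8));
    }

    @Override
    public boolean isEndOfStream(Tuple2<Long, String> nextElement) {
        return nextElement.f0 >= endOffset; // stop once we pass offset Y
    }

    @Override
    public TypeInformation<Tuple2<Long, String>> getProducedType() {
        return TypeInformation.of(new TypeHint<Tuple2<Long, String>>() {});
    }
}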
From: Tzu-Li (Gordon) Tai [mailto:tzuli...@apache.org]
Sent: Thursday, February 08, 2018 12:24 PM
To: user@flink.apache.org; Marchant, Hayden [ICG-IT]
Subject: Re
I know that traditionally Kafka is used as a source for a streaming job. In our
particular case, we are looking at extracting records from a Kafka topic from a
particular well-defined offset range (per partition) - i.e. from offset X to
offset Y. In this case, we'd somehow want the application…
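For pinning the start of the range, setStartFromSpecificOffsets looks applicable
(a sketch inside the job's main method; topic, brokers and offsets are
placeholders, and the end offset would still have to be enforced separately,
e.g. through isEndOfStream):

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092"); // placeholder

// Offset X per partition; the consumer starts reading from exactly here.
Map<KafkaTopicPartition, Long> startOffsets = new HashMap<>();
startOffsets.put(new KafkaTopicPartition("my-topic", 0), 23L);
startOffsets.put(new KafkaTopicPartition("my-topic", 1), 31L);

FlinkKafkaConsumer011<String> consumer =
        new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props);
consumer.setStartFromSpecificOffsets(startOffsets); // start of the range only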
We actually got it working. Essentially, it's an implementation of
HadoopFileSystem, and was written with the idea that it can be used with Spark
(since it has broader adoption than Flink as of now). We managed to get it
configured, and found the latency to be much lower than by using the S3
con…
Thanks for the info!
-----Original Message-----
From: Piotr Nowojski [mailto:pi...@data-artisans.com]
Sent: Friday, February 02, 2018 4:37 PM
To: Marchant, Hayden [ICG-IT]
Cc: user@flink.apache.org
Subject: Re: Latest version of Kafka
Hi,
Flink as for now provides only a connector for Kafka…
Thanks for all the ideas!!
From: Steven Wu [mailto:stevenz...@gmail.com]
Sent: Tuesday, February 06, 2018 3:46 AM
To: Stefan Richter
Cc: Marchant, Hayden [ICG-IT] ;
user@flink.apache.org; Aljoscha Krettek
Subject: Re: Joining data in Streaming
There is also a discussion of side input
https…
What is the newest version of Kafka that is compatible with Flink 1.4.0? I see
from the documentation that the last version of Kafka supported is 0.11, but
has any testing been done with Kafka 1.0?
Hayden Marchant
I have 2 datasets that I need to join together in a Flink batch job. One of the
datasets needs to be created dynamically by completely 'draining' a Kafka topic
in an offset range (start and end), creating a file containing all messages
in that range. I know that in Flink streaming I can specify…
Edward,
We are using Object Storage for checkpointing. I'd like to point out that we
were seeing performance problems using the S3 protocol. Btw, we had quite a few
problems using the flink-s3-fs-hadoop jar with Object Storage and had to do
some ugly hacking to get it working all over. We recently…
Stefan,
So are we essentially saying that in this case, for now, I should stick to
DataSet / Batch Table API?
Thanks,
Hayden
-----Original Message-----
From: Stefan Richter [mailto:s.rich...@data-artisans.com]
Sent: Tuesday, January 30, 2018 4:18 PM
To: Marchant, Hayden [ICG-IT]
Cc: user
We have a use case where we have 2 data sets - One reasonable large data set (a
few million entities), and a smaller set of data. We want to do a join between
these data sets. We will be doing this join after both data sets are available.
In the world of batch processing, this is pretty straightforward…
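The shape I have in mind on the batch side is a plain DataSet join with a
broadcast hint for the small side (toy data; field 0 stands in for the real
join key):

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<String, Integer>> large = env.fromElements(
        Tuple2.of("a", 1), Tuple2.of("b", 2));             // millions of rows in reality
DataSet<Tuple2<String, String>> small = env.fromElements(
        Tuple2.of("a", "ref-a"), Tuple2.of("b", "ref-b")); // the small data set

// BROADCAST_HASH_SECOND ships the small side to every parallel instance,
// avoiding a shuffle of the large side.
large.join(small, JoinHint.BROADCAST_HASH_SECOND)
     .where(0)
     .equalTo(0)
     .projectFirst(0, 1).projectSecond(1)
     .print();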
…a native protocol called Stocator that interfaces directly with IBM Object
Storage and can be configured through hdfs-site.xml, which might speed things
up.
-----Original Message-----
From: Aljoscha Krettek [mailto:aljos...@apache.org]
Sent: Thursday, January 25, 2018 6:30 PM
To: Marchant, Hayde
Hi,
We have a Flink Streaming application that uses S3 for storing checkpoints. We
are not using 'regular' S3, but rather IBM Object Storage, which has an
S3-compatible connector. We had quite some challenges in overriding the endpoint
from the default s3.amazonaws.com to our internal IBM Object…
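For reference, the kind of override we needed looks roughly like this in
flink-conf.yaml (my best understanding of the shaded flink-s3-fs-hadoop keys;
host and credentials are placeholders):

s3.endpoint: http://objectstorage.internal.example:9000   # instead of s3.amazonaws.com
s3.path.style.access: true                                # often required by non-AWS endpoints
s3.access-key: ACCESS_KEY_PLACEHOLDER
s3.secret-key: SECRET_KEY_PLACEHOLDER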
Hi,
I'm looking for guidelines for Reference architecture for Hardware for a
small/medium Flink cluster - we'll be installing on in-house bare-metal
servers. I'm looking for guidance for:
1. Number and spec of CPUs
2. RAM
3. Disks
4. Network
5. Proximity of servers to each other
(Most likely,…
Hi,
We are currently starting to test Flink running on YARN. Till now, we've been
testing on a Standalone Cluster. One thing lacking in standalone is that we have
to manually restart a Task Manager if it dies. I looked at
https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_h
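(For reference, the kind of YARN session we are experimenting with; flags per
the Flink 1.3 CLI, numbers illustrative:)

./bin/yarn-session.sh -n 4 -s 4 -jm 2048 -tm 8192
# -n TaskManager containers, -s slots each, -jm/-tm memory in MB.
# YARN re-requests containers for failed TaskManagers, which is exactly
# the restart behaviour we are missing in standalone mode.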
I read in the Flink documentation that the TaskManager runs all tasks within
its own JVM, and that the recommendation is to set the taskmanager.heap.mb to
be as much as is available on the server. I have a very large server with 192GB,
so I am thinking of giving most of it to the Task Manager.
I recall…
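(What I currently have in flink-conf.yaml; the numbers are just what I am
experimenting with, not a recommendation:)

taskmanager.heap.mb: 49152        # deliberately leaving headroom for OS cache / native memory
taskmanager.numberOfTaskSlots: 16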
I am attempting to run Flink 1.3.2 in HA mode with zookeeper.
When I run the start-cluster.sh, the job manager is not started, even though
the task manager is started. When I delved into this, I saw that the command:
ssh -n $FLINK_SSH_OPTS $master -- "nohup /bin/bash -l
\"${FLINK_BIN_DIR}/jobm
Sent: Monday, October 02, 2017 2:24 PM
To: Marchant, Hayden [ICG-IT]
Cc: user@flink.apache.org
Subject: Re: In-memory cache
How about connecting two streams of data, one from the reference data and one
from the main data (I assume using keyed streams, since you mention
QueryableState), and keeping the state locally within…
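A rough sketch of that pattern (toy Tuple2 types stand in for the real reference
and main records; field 0 is the shared key):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, String>> refStream =
        env.fromElements(Tuple2.of("k1", "ref-1"));  // (key, reference value)
DataStream<Tuple2<String, Long>> mainStream =
        env.fromElements(Tuple2.of("k1", 42L));      // (key, payload)

refStream.connect(mainStream)
    .keyBy(0, 0)  // same key on both sides, so state is co-partitioned
    .flatMap(new RichCoFlatMapFunction<Tuple2<String, String>, Tuple2<String, Long>, String>() {
        private transient ValueState<String> ref;

        @Override
        public void open(Configuration parameters) {
            ref = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("ref", String.class));
        }

        @Override
        public void flatMap1(Tuple2<String, String> update, Collector<String> out) throws Exception {
            ref.update(update.f1);        // reference side refreshes the local cache
        }

        @Override
        public void flatMap2(Tuple2<String, Long> event, Collector<String> out) throws Exception {
            String current = ref.value(); // main side reads the cached entry
            if (current != null) {
                out.collect(event.f0 + " -> " + event.f1 + " / " + current);
            }
        }
    })
    .print();

env.execute("reference-data-join-sketch");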
We have an operator in our streaming application that needs to access
'reference data' that is updated by another Flink streaming application. This
reference data has ~10,000 entries and a small footprint. This reference data
needs to be updated ~ every 100 ms. The required latency for…
I'm a newbie to Flink and am trying to understand how recovery works using
state backends. I've read the documentation and am now trying to run a simple
test to demonstrate the abilities - I'd like to test the recovery of a Flink
job and how the state is recovered from where it left off when…
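(The checkpointing setup I am using for the test; paths and intervals are just
what I picked for the experiment:)

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(1000);                 // checkpoint every second
env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000)); // 3 retries, 5 s apart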
I can see the job running in the Flink UI, and I explicitly specified the port
for the Job Manager. When I provided a different port, I got an Akka exception.
Here, it seems that the code is getting further. I think that it might be
connected with how I am creating the StateDescriptor…
I am trying to use queryable state, and am encountering issues when querying
the state from the client. I get the following exception:
Exception in thread "main"
org.apache.flink.runtime.query.UnknownKvStateLocation: No KvStateLocation found
for KvState instance with name 'word_sums'.
…
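On the job side I am registering the state roughly like this (a sketch; the
Long value type is a stand-in for my actual state):

import org.apache.flink.api.common.state.ValueStateDescriptor;

ValueStateDescriptor<Long> descriptor =
        new ValueStateDescriptor<>("word_sums", Long.class);
descriptor.setQueryable("word_sums"); // must match the name the client queries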
I have a streaming application that has a keyBy operator followed by an
operator working on the keyed values (a custom sum operator). If the map
operator and the aggregate operator are running on the same Task Manager, will
Flink always serialize and deserialize the tuples, or is there an optimization…
We're about to get started on a 9-person-month PoC using Flink Streaming.
Before we get started, I am interested to know how low a latency I can expect
for my end-to-end flow for a single event (from source to sink).
Here is a very high-level description of our Flink design:
We need at least once…
I didn’t think about NFS. That would save me the hassle of installing an HDFS
cluster just for that, especially if my organization already has an NFS ‘handy’.
Thanks
Hayden
From: Tony Wei [mailto:tony19920...@gmail.com]
Sent: Thursday, August 31, 2017 12:12 PM
To: Marchant, Hayden [ICG-IT]
Cc
Whether I use the RocksDB or FS state backend, if my requirements are
fault tolerance and the ability to recover with 'at-least once' semantics for my
Flink job, is there still a valid case for using a backing local FS for storing
state? I.e., if a Flink node is invalidated, I would have thought…
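(The configuration I am weighing, sketched in flink-conf.yaml terms with
illustrative paths; the node-local variant is the one I am questioning, since
its files die with the node:)

state.backend: filesystem
# state.backend.fs.checkpointdir: file:///tmp/flink-checkpoints          # node-local FS
state.backend.fs.checkpointdir: hdfs://namenode:8020/flink/checkpoints   # shared/remote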