You are working in a distributed system, so ordering events by time may not be
sufficient (most likely it is not). Due to network delays, devices being offline etc.,
it can happen that an event arrives much later although it happened earlier. Check
out watermarks in Flink and read up on at-least-once and at-most-once semantics.
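For illustration only, a minimal sketch (not from the original thread) of assigning event-time timestamps and bounded-out-of-orderness watermarks with the Flink 1.11+ API; the event type, field and 10-second bound are placeholder assumptions:

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WatermarkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Dummy source: (deviceId, eventTimeMillis) pairs standing in for the real events.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("deviceA", 1_000L), Tuple2.of("deviceB", 2_000L));
        // Tolerate events that arrive up to 10 seconds out of order before the watermark advances.
        events.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        .withTimestampAssigner((event, recordTimestamp) -> event.f1))
              .print();
        env.execute("watermark sketch");
    }
}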
Why don’t you get an S3 notification on SQS and trigger the actions from there?
You will probably need to write the content of the files to a NoSQL database.
Alternatively, send the S3 notification to Kafka and read it from there with Flink.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo
You are missing additional dependencies
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/avro.html
> On 11.07.2020 at 04:16, Lian Jiang wrote:
>
>
> Hi,
>
> According to
> https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/conne
I would implement them directly in Flink / the Flink Table API.
I don’t think Drools plays well in this distributed scenario. It expects a
centralized rule store and evaluation.
> On 23.06.2020 at 21:03, Jaswin Shah wrote:
>
>
> Hi I am thinking of using some rule engine like DROOLS with flink t
I think S3 is the wrong storage backend for this volume of small messages.
Try to use a NoSQL database, or write multiple messages into one file in S3
(1 or 10).
If you still want to go with your scenario, then try a network-optimized
instance, use s3a in Flink, and configure S3 entropy.
You can create a fat jar (also called an uber jar) that includes all dependencies
in your application jar.
I would avoid putting things into the Flink lib directory, as it can make
maintenance difficult: deployment becomes challenging, as do upgrading Flink,
providing the jars on new nodes, etc.
> On 30.10.2019
Flink is purely stream processing. I would not use it in a synchronous web call.
In general, I would not make any complex analytic function available as a
synchronous web service. I would deploy a messaging bus (RabbitMQ, ZeroMQ etc.)
and send the request there (if the source is a web app potentially
Btw. Why don’t you use the max method?
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/KeyedStream.html#max-java.lang.String-
See here about the state solution:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/stat
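As an illustration of the max() suggestion, a minimal sketch (the tuple schema and field name are placeholder assumptions):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyedMaxSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // (key, value) pairs; max("f1") keeps a running maximum of field f1 per key.
        env.fromElements(Tuple2.of("sensor1", 3.0), Tuple2.of("sensor1", 7.5), Tuple2.of("sensor1", 5.0))
           .keyBy(t -> t.f0)
           .max("f1")
           .print();   // emits the current running maximum after each element
        env.execute("keyed max sketch");
    }
}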
You cannot compare doubles for equality in Java (and other languages). The reason
is that a double has limited precision and is rounded. See here for some examples
and discussion:
https://howtodoinjava.com/java/basics/correctly-compare-float-double/
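A small self-contained illustration of the rounding issue and an epsilon-based comparison (the tolerance value is just an assumption):

public class DoubleCompareSketch {
    public static void main(String[] args) {
        double a = 0.1 + 0.2;
        double b = 0.3;
        System.out.println(a == b);                     // false: a is 0.30000000000000004
        // Compare with a tolerance (epsilon) instead of exact equality.
        double epsilon = 1e-9;
        System.out.println(Math.abs(a - b) < epsilon);  // true
    }
}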
> On 03.10.2019 at 08:26, Komal Mariam wrote:
>
>
> Hello a
Increase the replication factor and/or use the HDFS cache:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
Try to reduce the size of the jar, e.g. the Flink libraries do not need to be
included.
> On 30.08.2019 at 01:09, Elkhan Dadashov wrote:
>
> De
I don’t know dylibs in detail, but can you call a static method that checks
whether the library has already been loaded and, if not, loads it (singleton
pattern)?
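A minimal sketch of that singleton idea (the library name is a placeholder): the native library is loaded at most once per JVM, no matter how many tasks call it.

public final class NativeLibLoader {
    private static boolean loaded = false;

    private NativeLibLoader() {}

    public static synchronized void ensureLoaded() {
        if (!loaded) {
            // Placeholder name: expects libmylib.dylib (macOS) / libmylib.so (Linux) on java.library.path.
            System.loadLibrary("mylib");
            loaded = true;
        }
    }
}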
> On 27.08.2019 at 06:39, Vishwas Siravara wrote:
>
> Hi guys,
> I have a flink application that loads a dylib like this
>
still figuring out some details, but hope that it can go live soon.
>
> Would be great to have your repositories listed there as well.
>
> Cheers,
> Fabian
>
>> On Mon, 29 Jul 2019 at 23:39, Jörn Franke
>> wrote:
>> Hi,
>>
>> I wrote som
Hi,
I wrote several connectors for Apache Flink some time ago that are open
sourced under the Apache License:
* HadoopOffice: process Excel files (.xls, .xlsx) for reading/writing in Flink
using the data source or Table API:
https://github.com/ZuInnoTe/hadoopoffice/wiki/Using-Apache-Flink-to-read-wri
Just send a dummy event from the source system every minute.
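For illustration only, a minimal in-Flink sketch of that heartbeat idea (in the suggestion above the dummy event comes from the source system itself; the interval and payload are placeholders):

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Emits a dummy/heartbeat element every minute so downstream tumbling windows
// always have at least one record and therefore fire.
public class HeartbeatSource implements SourceFunction<String> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            ctx.collect("HEARTBEAT");
            Thread.sleep(60_000L); // one dummy event per minute
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}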
> On 24.05.2019 at 10:20, "wangl...@geekplus.com.cn"
> wrote:
>
>
> I want to do something every one minute.
>
> Using TumblingWindow, the function will not be triigged if there's no message
> received during this minute. But i st
It would help to understand the current issues that you have with this
approach. I used a similar approach (not with Flink, but with a similar big data
technology) some years ago.
> On 04.03.2019 at 11:32, Wouter Zorgdrager wrote:
>
> Hi all,
>
> I'm working on a setup to use Apache Flink in an as
Can you check the Flink log files? You should find a better description of
the error there.
> On 08.12.2018 at 18:15, sohimankotia wrote:
>
> Hi ,
>
> I have installed flink-1.7.0 Hadoop 2.7 scala 2.11 . We are using
> hortonworks hadoop distribution.(hdp/2.6.1.0-129/)
>
> *Flink lib folder
Did you configure the IAM access roles correctly? Are those two machines
allowed to communicate?
> On 23.10.2018 at 12:55, madan wrote:
>
> Hi,
>
> I am trying to setup cluster with 2 EC2 instances. Able to bring up the
> cluser with 1 master and 2 slaves. But I am not able to connect to
You have to file an issue. One workaround to see if this really fixes your
problem could be to use reflection to mark this method as public and then call
it (this is of course not something for production code). You can also try a newer
Flink version.
> On 13.10.2018 at 18:02, Seye Jin wrote:
>
> I
Thank you, very nice. I fully agree with that.
> On 11.10.2018 at 19:31, Zhang, Xuefu wrote:
>
> Hi Jörn,
>
> Thanks for your feedback. Yes, I think Hive on Flink makes sense and in fact
> it is one of the two approaches that I named in the beginning of the thread.
> As also pointed out the
Would it maybe make sense to provide Flink as an engine on Hive
(“Flink-on-Hive”)? E.g. to address 4, 5, 6, 8, 9, 10. This could be more loosely
coupled than integrating Hive into all possible Flink core modules and thus
introducing a very tight dependency on Hive in the core.
1, 2, 3 could be achieved via
You don’t need to include the Flink libraries themselves in the fat jar! You
can mark them as provided, and this reduces the jar size! They are already
provided by your Flink installation. One exception is the Table API, but I
simply recommend putting it in your Flink distribution folder (if your f
What do the log files say?
What does the source code look like?
Is it really necessary to checkpoint every 30 seconds?
> On 19. Sep 2018, at 08:25, yuvraj singh <19yuvrajsing...@gmail.com> wrote:
>
> Hi ,
>
> I am doing checkpointing using s3 and rocksdb ,
> i am doing checkpointing per 30
If you have a window larger than hours, then you need to rethink your
architecture - this is not streaming anymore. Just because you receive events
in a streamed fashion does not mean you have to do all the processing in a
streamed fashion.
Can you store the events in a file or a database and then do aft
It causes more overhead (processes etc.), which might make it slower. Furthermore,
if you have them stored on HDFS, then the bottleneck is the NameNode, which will
have to answer millions of requests.
The latter point will change in future Hadoop versions with
http://ozone.hadoop.apache.org/
> On 1
Or you write a custom file system for Flink... (for the tar part).
Unfortunately, gz files can only be processed single-threaded (there are some
multi-threaded implementations, but they don’t bring a big gain).
> On 10. Aug 2018, at 07:07, vino yang wrote:
>
> Hi Averell,
>
> In this case, I
(At the end of your code)
> On 8. Aug 2018, at 00:29, Jörn Franke wrote:
>
> Hi Mich,
>
> Would it be possible to share the full source code ?
> I am missing a call to streamExecEnvironment.execute
>
> Best regards
>
>> On 8. Aug 2018, at 00:02, Mich Ta
Hi Mich,
Would it be possible to share the full source code?
I am missing a call to streamExecEnvironment.execute
Best regards
> On 8. Aug 2018, at 00:02, Mich Talebzadeh wrote:
>
> Hi Fabian,
>
> Reading your notes above I have converted the table back to DataStream.
>
> val tableEnv
What does your build.sbt look like, especially the dependencies?
> On 2. Aug 2018, at 00:44, Mich Talebzadeh wrote:
>
> Changed as suggested
>
>val streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment
> val dataStream = streamExecEnv
>.addSource(new FlinkKafkaConsumer011[
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html
You will find there a passage about the consistency model.
Probably the system is putting them into the folder and Flink is triggered before
they are consistent.
What happens after Flink puts them on S3? Are they reused by another s
Sure, Kinesis is another way.
Can you try read-after-write consistency (assuming the files are not modified)?
In any case, it looks like you would be better suited with a NoSQL store or Kinesis
(I don’t know your exact use case, so I cannot give you more details).
> On 24. Jul 2018, at 09:51, Averell
It could be related to S3, which seems to be configured for eventual consistency.
Maybe it helps to configure strong consistency.
However, I recommend replacing S3 with a NoSQL database (since you are on Amazon,
DynamoDB would help, plus DynamoDB Streams; alternatively SNS or SQS). The small size
and high
For the first one (lookup of single entries) you could use a NoSQL db (e.g. a
key-value store) - a relational database will not scale.
Depending on when you need to do the enrichment, you could also first store the
data and enrich it later as part of a batch process.
> On 24. Jul 2018, at 05:25, Ha
You can run them in a local environment. I do it for my integration tests
every time:
flinkEnvironment = ExecutionEnvironment.createLocalEnvironment(1)
E.g. (even together with a local HDFS cluster):
https://github.com/ZuInnoTe/hadoopcryptoledger/blob/master/examples/scala-flink-ethereumblock/src/it/
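As a small self-contained illustration of a local environment run (the data and job are placeholders, not taken from the linked example):

import org.apache.flink.api.java.ExecutionEnvironment;

public class LocalEnvSketch {
    public static void main(String[] args) throws Exception {
        // Local environment with parallelism 1, e.g. inside an integration test.
        ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(1);
        long count = env.fromElements("a", "b", "c").count(); // count() triggers execution itself
        System.out.println("elements: " + count);
    }
}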
TextInputFormat defines the format of the data; it could also be something other
than text, e.g. ORC, Parquet etc.
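For illustration, a minimal sketch of continuously monitoring a directory with TextInputFormat (directory path and scan interval are placeholder assumptions; another InputFormat could be plugged in the same way):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class MonitorDirSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String dir = "file:///tmp/input";                 // placeholder directory
        TextInputFormat format = new TextInputFormat(new Path(dir));
        DataStream<String> lines = env.readFile(
                format, dir, FileProcessingMode.PROCESS_CONTINUOUSLY, 10_000L); // rescan every 10s
        lines.print();
        env.execute("monitor directory sketch");
    }
}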
> On 14. Jul 2018, at 19:15, chrisr123 wrote:
>
> I'm building a streaming app that continuously monitors a directory for new
> files and I'm confused about why I have to specify a TextInp
That they are loosely coupled does not mean they are independent. For instance,
you would not be able to replace Kafka with ZeroMQ in your scenario.
Unfortunately, Kafka also sometimes needs to introduce breaking changes, and the
dependent applications need to upgrade.
You will not be able to avo
I think it is a little bit overkill to use Flink for such a simple system.
> On 4. Jul 2018, at 18:55, Yersinia Ruckeri wrote:
>
> Hi all,
>
> I am working on a prototype which should include Flink in a reactive systems
> software. The easiest use-case with a traditional bank system where I ha
import org.apache.flink.core.fs.FileSystem
> On 3. Jul 2018, at 08:13, Mich Talebzadeh wrote:
>
> thanks Hequn.
>
> When I use as suggested, I am getting this error
>
> error]
> /home/hduser/dba/bin/flink/md_streaming/src/main/scala/myPackage/md_streaming.scala:30:
> not found: value FileSy
Looks like a version issue. Have you made sure that the Kafka version is
compatible?
> On 2. Jul 2018, at 18:35, Mich Talebzadeh wrote:
>
> Have you seen this error by any chance in flink streaming with Kafka please?
>
> org.apache.flink.client.program.ProgramInvocationException:
> java.lan
Shouldn’t it be 1.5.0 instead of 1.5?
> On 1. Jul 2018, at 18:10, Mich Talebzadeh wrote:
>
> Ok some clumsy work by me not creating the pom.xml file in flink
> sub-directory (it was putting it in spark)!
>
> Anyhow this is the current issue I am facing
>
> [INFO]
> --
is distributed to different TM and the performance worse than the low
> parallelism case. Is this something expected? The more I scale the less I get?
>
> From: Siew Wai Yow
> Sent: Saturday, June 16, 2018 5:09 PM
> To: Jörn Franke
> Cc: user@flink.apache.org
> Subject: Re:
How large is the input data? If the input data is very small, then it does not
make sense to scale it even more. The larger the data is, the more parallelism
you will get. You can of course modify this behavior by changing the partitioning
of the DataSet.
> On 16. Jun 2018, at 10:41, Siew Wai Yow
Don’t use the JDBC driver to write to Hive. The performance of JDBC in general
for large volumes is suboptimal.
Write it to a file in HDFS in a format supported by Hive and point the table
definition in Hive to it.
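A minimal sketch of that approach (paths, schema and format are placeholder assumptions; in practice you would pick a Hive-friendly format such as ORC or Parquet):

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;

public class WriteForHiveSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Write the data as files into a directory that an EXTERNAL Hive table points at,
        // e.g. CREATE EXTERNAL TABLE ... LOCATION 'hdfs:///warehouse/mydb/mytable/';
        env.fromElements(Tuple2.of(1, "alice"), Tuple2.of(2, "bob"))
           .writeAsCsv("hdfs:///warehouse/mydb/mytable/", FileSystem.WriteMode.OVERWRITE);
        env.execute("write files for Hive sketch");
    }
}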
> On 11. Jun 2018, at 04:47, sagar loke wrote:
>
> I am trying to Sink data to
Do you have the complete source?
I am missing an env.execute() at the end.
> On 27. May 2018, at 18:55, chrisr123 wrote:
>
> I'm using Flink 1.4.0
>
> I'm trying to save the results of a Table API query to a CSV file, but I'm
> getting an error.
> Here are the details:
>
> My Input file looks lik
Just some advice - do not use sleep to simulate a heavy task. Use real or
generated data for the simulation. Such a sleep is garbage from a software
quality point of view; furthermore, it is often forgotten etc.
> On 16. May 2018, at 22:32, Vijay Balakrishnan wrote:
>
> Hi,
> Newbie question - Wha
If you want to write in batches from a streaming source, you will always need
some state, i.e. a state database (here a NoSQL database such as a key-value store
makes sense). Then you can grab the data at certain points in time and convert
it to Avro. You need to make sure that the state is logicall
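To illustrate the buffering idea inside Flink itself, a minimal sketch using keyed ListState (assuming Flink 1.5+; batch size, types and the downstream Avro/file sink are placeholder assumptions):

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Buffers records per key in state and emits them as one batch every 100 elements;
// the emitted batch could then be converted to Avro and written out.
public class BatchingFunction extends KeyedProcessFunction<String, String, List<String>> {
    private transient ListState<String> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", String.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<List<String>> out) throws Exception {
        buffer.add(value);
        List<String> batch = new ArrayList<>();
        for (String element : buffer.get()) {
            batch.add(element);
        }
        if (batch.size() >= 100) { // placeholder batch size
            out.collect(batch);
            buffer.clear();
        }
    }
}

Usage would then look like stream.keyBy(...).process(new BatchingFunction()).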
The problem may be that it is still static. How will the parser use this HashMap?
> On 26. Apr 2018, at 06:42, Soheil Pourbafrani wrote:
>
> I run a code using Flink Java API that gets some bytes from Kafka and parses
> it following by inserting into Cassandra database using another library
> s
Have you tried with a fat jar to see if it works in general?
> On 25. Apr 2018, at 15:32, Gyula Fóra wrote:
>
> Hey,
> Is there somewhere an end to end guide how to run a simple beam-on-flink
> application (preferrably using Gradle)? I want to run it using the standard
> per-job yarn cluster setup bu
I would disable it if possible and use the Flink parallelism. The threading
might work, but it can create operational issues depending on how you configure
your resource manager.
> On 23. Apr 2018, at 11:54, Alexander Smirnov
> wrote:
>
> Hi,
>
> I have a co-flatmap function which reads data from
You can use the corresponding HadoopInputFormat within Flink.
> On 18. Apr 2018, at 07:23, sohimankotia wrote:
>
> Hi ..
>
> I have file in hdfs in format file.snappy.parquet . Can someone please
> point/help with code example of reading parquet files .
>
>
> -Sohi
>
>
>
> --
> Sent from:
easily it is to get this information ?
>
> Best, Esa
>
> -----Original Message-
> From: Jörn Franke
> Sent: Saturday, April 14, 2018 1:43 PM
> To: Esa Heikkinen
> Cc: user@flink.apache.org
> Subject: Re: Complexity of Flink
>
> I think this always depend
I think this always depends. I found Flink cleaner compared to other big
data platforms, and with some experience it is rather easy to deploy.
However, how do you measure complexity? How do you plan to cater for other
components (e.g. deploying in the cloud, deploying locally in a Hadoop cluster etc.)?
Why don’t you parse the response from curl and use it to trigger the second
request?
That is easily automatable using Bash commands - or am I overlooking something here?
> On 9. Apr 2018, at 18:49, Pavel Ciorba wrote:
>
> Hi everyone
>
> I make 2 cURL POST requests to upload and run a Flink job.
>
Have you checked the JanusGraph source code? It also uses HBase as a storage
backend:
http://janusgraph.org/
It combines it with Elasticsearch for indexing. Maybe you can take inspiration
from the architecture there.
Generally, with HBase it depends a lot on how the data is written to regions, the
order of data
What was the input format, the size, and the program that you tried to execute?
> On 28. Mar 2018, at 08:18, Data Engineer wrote:
>
> I went through the explanation on MaxParallelism in the official docs here:
> https://ci.apache.org/projects/flink/flink-docs-master/ops/production_ready.html#set-m
How would you start implementing it? Where are you stuck?
Did you already try to implement this?
> On 18. Mar 2018, at 04:10, Dhruv Kumar wrote:
>
> Hi
>
> I am a CS PhD student at UMN Twin Cities. I am trying to use Flink for
> implementing some very specific use-cases: (They may not seem re
Alternatively, the static method FileSystem.get.
> On 10. Mar 2018, at 10:36, flinkuser101 wrote:
>
> Where does Flink expose filesystem? Is it from env? or inputstream?
>
> TextInputFormat format = new TextInputFormat(new
> org.apache.flink.core.fs.Path(url.toString()));
> DataStream inputStream =
>
Path has a method getFileSystem
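A small illustration combining both suggestions (the path is a placeholder):

import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class FileSystemSketch {
    public static void main(String[] args) throws Exception {
        Path path = new Path("file:///tmp/some-file.csv");  // placeholder path
        FileSystem fs = path.getFileSystem();               // FileSystem matching the path's scheme
        // FileSystem.get(path.toUri()) would be the static alternative mentioned above.
        System.out.println("exists: " + fs.exists(path));
    }
}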
> On 10. Mar 2018, at 10:36, flinkuser101 wrote:
>
> Where does Flink expose filesystem? Is it from env? or inputstream?
>
> TextInputFormat format = new TextInputFormat(new
> org.apache.flink.core.fs.Path(url.toString()));
> DataStream inputStream =
> env.readFi
The FileSystem class is provided by Flink:
https://ci.apache.org/projects/flink/flink-docs-master/api/java/
> On 10. Mar 2018, at 00:44, flinkuser101 wrote:
>
> Is there any way to do that? I have been searching for way to do that but in
> vain.
>
>
>
> --
> Sent from:
> http://apache-flink-user-m
Why don’t you let your Flink job move them once it’s done?
> On 9. Mar 2018, at 03:12, flinkuser101 wrote:
>
> I am reading files from a folder suppose
>
> /files/*
>
> Files are pushed into that folder.
>
> /files/file1_2018_03_09.csv
> /files/file2_2018_03_09.csv
>
> Flink is reading fil
You need to put flink-hadoop-compatibility*.jar in the lib folder of your Flink
distribution or on the classpath of your cluster nodes.
> On 19. Dec 2017, at 12:38, shashank agarwal wrote:
>
> yes, it's working fine. now not getting compile time error.
>
> But when i trying to run this on cluster
If you really want to learn, then I recommend starting with a Flink project
that contains unit tests and integration tests (maybe augmented with
https://wiki.apache.org/hadoop/HowToDevelopUnitTests to simulate an HDFS cluster
during unit tests). It should also include coverage reporting. These
Be careful though with race conditions.
> On 12. Nov 2017, at 02:47, Kien Truong wrote:
>
> Hi Mans,
>
> They're not executed in the same thread, but the methods that called them are
> synchronized[1] and therefore thread-safe.
>
> Best regards,
>
> Kien
>
> [1]
> https://github.com/apa
Well, you can only performance test it beforehand in different scenarios with
different configurations.
I am not sure what exactly your state holds (e.g. how many objects etc.), but if
it is Java objects then 3 times might be a little bit low (it also depends on how
you initially tested the state size) - ho
Amazon EMR already has a Flink package; you just need to check the checkbox. I
would not install it on your own.
I think you can find it in the advanced options.
> On 26. Sep 2017, at 07:14, Navneeth Krishnan wrote:
>
> Hello All,
>
> I'm trying to deploy flink on AWS EMR and I'm very new to
If you really need latencies that low, something else might be more suitable. Given
the times, a custom solution might be necessary. Flink is a generic, powerful
framework - hence it does not address these latencies.
> On 31. Aug 2017, at 14:50, Marchant, Hayden wrote:
>
> We're about to get starte
It looks like everything should be serializable in your case. An
alternative would be to mark certain non-serializable things as transient, but
as far as I can see this is not possible in your case.
> On 27. Aug 2017, at 11:02, Federico D'Ambrosio
> wrote:
>
> Hi,
>
> I'm trying to write on
One would need to look at your code and possibly at some heap statistics. Maybe
something goes wrong when you cache them (do you use a 3rd-party library or
your own implementation?). Do you use a stable version of your protobuf library
(not necessarily the most recent)? You may also want to l
loaded by Flink
into the cluster, so there is no data locality as with HDFS).
> On 6. Aug 2017, at 11:20, Kaepke, Marc wrote:
>
> Thanks Jörn!
>
> I expected Flink will schedule the input file to all workers.
>
>> On 05.08.2017 at 16:25, Jörn Franke wrote:
>>
Probably you need to refer to the file on HDFS or manually make it available on
each node as a local file. HDFS is recommended.
If it is already on HDFS then you need to provide an HDFS URL to the file.
> On 5. Aug 2017, at 14:27, Kaepke, Marc wrote:
>
> Hi there,
>
> my really small test job
What do you mean by "consistent"?
Of course you can do this only at the time the timestamp is defined (e.g. using
NTP). However, this is never perfect.
Then it is unrealistic that they always end up in the same window because of
network delays etc. You will need a global state here that is defi
Putting a configuration file on every node does not sound like a good idea.
What about ZooKeeper?
> On 13. Jul 2017, at 17:10, Guy Harmach wrote:
>
> Hi,
>
> I’m running a flink job on YARN. I’d like to pass yaml configuration files to
> the job.
> I tried to use the flink cli –yarnship
The error that you mentioned seems to indicate that some certification authority
certificates could not be found. You may want to add them to the trust store of
the application.
> On 26. Jun 2017, at 16:55, ani.desh1512 wrote:
>
> As Stephan pointed out, this seems more like a MapR libs me
Hello,
It is possible, but with some caveats: Flink is a distributed system, but in Drools
the facts are only available locally. This may lead to strange effects when
rules update the fact base.
Best regards
> On 23. Jun 2017, at 12:49, Sridhar Chellappa wrote:
>
> Folks,
>
> I am new to Flink.
Why not use a Parquet-only format? Not sure why you need an Avro-Parquet format.
> On 24. Apr 2017, at 18:19, Lukas Kircher
> wrote:
>
> Hello,
>
> I am trying to read Parquet files from HDFS and having problems. I use Avro
> for schema. Here is a basic example:
>
> public static void main(Str
I would use Flume to import these sources into HDFS and then use Flink or Hadoop
or whatever to process them. While it is possible to do it in Flink, you do not
want your processing to fail because the web service is not available etc.
Via Flume, which is suitable for this kind of task, it is mor