Re: Get last element of a DataSet

2018-09-25 Thread Fabian Hueske
Hi, Can you post the full stacktrace? Thanks, Fabian Am Di., 25. Sep. 2018 um 12:55 Uhr schrieb Alejandro Alcalde < algu...@gmail.com>: > Hi, > > I am trying to improve the efficiency of this code: > > discretized.map(_._2) > .name("Map V") > .reduce((_, b) ⇒ b) >

Re: How to join stream and batch data in Flink?

2018-09-25 Thread Fabian Hueske
Hi, I don't think that using the current join implementation in the Table API / SQL will work. The non-windowed join fully materializes *both* input tables in state. This is necessary, because the join needs to be able to process updates on either side. While this is not a problem for the fixed si

Re: Flink 1.5.4 -- issues w/ TaskManager connecting to ResourceManager

2018-09-26 Thread Fabian Hueske
Should we add a warning to the release announcements? Fabian Am Mi., 26. Sep. 2018 um 10:22 Uhr schrieb Robert Metzger < rmetz...@apache.org>: > Hey Jamie, > > we've been facing the same issue with dA Platform, when running Flink > 1.6.1. > I assume a lot of people will be affected by this. > >

Re: Can't get the FlinkKinesisProducer to work against Kinesalite for tests

2018-10-01 Thread Fabian Hueske
Hi Bruno, Thanks for sharing your approach! Best, Fabian Am Do., 27. Sep. 2018 um 18:11 Uhr schrieb Bruno Aranda : > Hi again, > > We managed at the end to get data into Kinesalite using the > FlinkKinesisProducer, but to do so, we had to use different configuration, > such as ignoring the 'aws

Re: Streaming to Parquet Files in HDFS

2018-10-01 Thread Fabian Hueske
Hi Bill, Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the previously mentioned StreamingFileSink [1], [2]. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-9753 [2] https://issues.apache.org/jira/browse/FLINK-9750 Am Fr., 28. Sep. 2018 um 23:36 Uhr schrieb

Re: [DISCUSS] Dropping flink-storm?

2018-10-01 Thread Fabian Hueske
+1 to drop it. Thanks, Fabian Am Sa., 29. Sep. 2018 um 12:05 Uhr schrieb Niels Basjes : > I would drop it. > > Niels Basjes > > On Sat, 29 Sep 2018, 10:38 Kostas Kloudas, > wrote: > > > +1 to drop it as nobody seems to be willing to maintain it and it also > > stands in the way for future deve

Re: Flink Scheduler Customization

2018-10-01 Thread Fabian Hueske
Hi Ananth, You can certainly do this with Flink, but there are no built-in operators for this. What you probably want to do is to compare the timestamp of the event with the current processing time and drop the record if it is too old. If the timestamp is encoded in the record, you can do this in
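
A minimal sketch of the idea described above, assuming the records are Tuple2<String, Long> with the event timestamp in epoch milliseconds in field f1 (a simple filter is enough; a ProcessFunction would work the same way):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

public class DropOldRecords {
    // Drop records whose timestamp is older than maxAgeMillis compared to the current processing time.
    public static DataStream<Tuple2<String, Long>> dropOld(
            DataStream<Tuple2<String, Long>> events, long maxAgeMillis) {
        return events.filter(e -> System.currentTimeMillis() - e.f1 <= maxAgeMillis);
    }
}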

Re: In-Memory Lookup in Flink Operators

2018-10-01 Thread Fabian Hueske
Hi Chirag, Flink 1.5.0 added support for BroadcastState which should address your requirement of replicating the data. [1] The replicated data is stored in the configured state backend which can also be RocksDB. Regarding the reload, I would recommend Lasse's approach of having a custom source t
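
A rough sketch of the broadcast state pattern mentioned above (Flink 1.5+), assuming a hypothetical lookup-update stream whose entries ("key=value" strings) should be replicated to all parallel instances enriching a main stream:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastLookupSketch {
    // Descriptor for the replicated ("lookup") state: key -> value.
    static final MapStateDescriptor<String, String> LOOKUP = new MapStateDescriptor<>(
            "lookup", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

    public static DataStream<String> enrich(DataStream<String> events, DataStream<String> lookupUpdates) {
        BroadcastStream<String> broadcast = lookupUpdates.broadcast(LOOKUP);
        return events.connect(broadcast).process(
            new BroadcastProcessFunction<String, String, String>() {
                @Override
                public void processElement(String event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                    // Read-only access to the replicated lookup data.
                    String v = ctx.getBroadcastState(LOOKUP).get(event);
                    out.collect(event + " -> " + v);
                }
                @Override
                public void processBroadcastElement(String update, Context ctx, Collector<String> out) throws Exception {
                    // Assumed update format "key=value"; update the replicated state on every instance.
                    String[] kv = update.split("=", 2);
                    ctx.getBroadcastState(LOOKUP).put(kv[0], kv.length > 1 ? kv[1] : "");
                }
            });
    }
}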

Re: Regarding implementation of aggregate function using a ProcessFunction

2018-10-01 Thread Fabian Hueske
Hi, There are basically three options: 1) Use an AggregateFunction and store everything that you would put into state into the Accumulator. This can become quite expensive because the Accumulator is de/serialized for every function call if you use RocksDB. The advantage is that you don't have to s
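
A minimal sketch of option 1, assuming a windowed average over Long values; everything the function needs lives in the accumulator, which the state backend de/serializes on access:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// Accumulator = (sum, count); it is the only state the function keeps.
public class AverageAggregate implements AggregateFunction<Long, Tuple2<Long, Long>, Double> {
    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return Tuple2.of(0L, 0L);
    }
    @Override
    public Tuple2<Long, Long> add(Long value, Tuple2<Long, Long> acc) {
        return Tuple2.of(acc.f0 + value, acc.f1 + 1);
    }
    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return acc.f1 == 0 ? 0.0 : ((double) acc.f0) / acc.f1;
    }
    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
    }
}

It would be applied as stream.keyBy(...).window(...).aggregate(new AverageAggregate()).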

Re: I want run flink program in ubuntu x64 Mult Node Cluster what is configuration?

2018-10-01 Thread Fabian Hueske
Hi, these issues are not related to Flink but rather generic Linux / bash issues. Ensure that the start scripts are executable (can be changed with chmod) and that your user has the right permissions to execute the start scripts. Also, you have to use the right path to the scripts. If you are in the base

Re: flink-1.6.1-bin-scala_2.11.tgz fails signture and hash verification

2018-10-01 Thread Fabian Hueske
Hi Gianluca, I tried to validate the issue but hash and signature are OK for me. Do you remember which mirror you used to download the binaries? Best, Fabian Am Sa., 29. Sep. 2018 um 17:24 Uhr schrieb vino yang : > Hi Gianluca, > > This is very strange, Till may be able to give an explanation

Re: About the retract of the calculation result of flink sql

2018-10-01 Thread Fabian Hueske
Hi Clay, If you do env.setParallelism(1), the query won't be executed in parallel. However, looking at your screenshot the message order does not seem to be the problem here (given that you printed the content of the topic). Are you sure that it is not possible that the result decreases if some r

Re: Does Flink SQL "in" operation has length limit?

2018-10-01 Thread Fabian Hueske
Hi, I had a look into the code. From what I saw, we are translating the values into Rows. The problem here is that the IN clause is translated into a join and that the join result contains a time attribute field. This is a safety restriction to ensure that time attributes do not lose their waterm

Re: Deserialization of serializer errored

2018-10-02 Thread Fabian Hueske
Hi Elias, I am not familiar with the recovery code, but Flink might read (some of) the savepoint data even though it is not needed and loaded into operators. That would explain why you see an exception when the case class is modified or completely removed. Maybe Stefan or Gordon can help here.

Re: Flink Python streaming

2018-10-03 Thread Fabian Hueske
Hi, AFAIK it's not that easy. Flink's Python support is based on Jython which translates Python code into JVM byte code. Therefore, native libs are not supported. Chesnay (in CC) knows the details here. Best, Fabian Hequn Cheng schrieb am Mi., 3. Okt. 2018, 04:30: > Hi Bing, > > I'm not famil

Re: Scala case class state evolution

2018-10-03 Thread Fabian Hueske
I know that Gordon (in CC) has looked closer into this problem. He should be able to share restrictions and maybe even workarounds. Best, Fabian Hequn Cheng schrieb am Mi., 3. Okt. 2018, 05:09: > Hi Elias, > > From my understanding, you can't do this since the state will no longer be > compatib

Re: Duplicates in self join

2018-10-08 Thread Fabian Hueske
Did you check the new interval join that was added with Flink 1.6.0 [1]? It might be better suited because each record has its own boundaries based on its timestamp and the join window interval. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/joi
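
A rough sketch of that interval join (Flink 1.6+), assuming two streams of Tuple2<String, Long> with f0 as the join key and event-time timestamps/watermarks already assigned; each left record joins right records within ±5 seconds of its own timestamp:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinSketch {
    public static DataStream<String> join(DataStream<Tuple2<String, Long>> left,
                                           DataStream<Tuple2<String, Long>> right) {
        return left.keyBy(0)
            .intervalJoin(right.keyBy(0))
            // Each left record defines its own boundaries: [ts - 5s, ts + 5s].
            .between(Time.seconds(-5), Time.seconds(5))
            .process(new ProcessJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
                @Override
                public void processElement(Tuple2<String, Long> l, Tuple2<String, Long> r,
                                           Context ctx, Collector<String> out) {
                    out.collect(l.f0 + ": " + l.f1 + " joined " + r.f1);
                }
            });
    }
}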

Re: [DISCUSS] Dropping flink-storm?

2018-10-09 Thread Fabian Hueske
Yes, let's do it this way. The wrapper classes are probably not too complex and can be easily tested. We have the same for the Hadoop interfaces, although I think only the Input- and OutputFormatWrappers are actually used. Am Di., 9. Okt. 2018 um 09:46 Uhr schrieb Chesnay Schepler < ches...@apach

Re: getRuntimeContext(): The runtime context has not been initialized.

2018-10-10 Thread Fabian Hueske
Yes, it would be good to post your code. Are you using a FoldFunction in a window (if yes, what window) or as a running aggregate? In general, collecting state in a FoldFunction is usually not something that you should do. Did you consider using an AggregateFunction? Fabian Am Mi., 10. Okt. 2018

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-10 Thread Fabian Hueske
Hi Xuefu, Welcome to the Flink community and thanks for starting this discussion! Better Hive integration would be really great! Can you go into details of what you are proposing? I can think of a couple ways to improve Flink in that regard: * Support for Hive UDFs * Support for Hive metadata cat

Re: [deserialization schema] skip data, that couldn't be properly deserialized

2018-10-10 Thread Fabian Hueske
Hi Rinat, Thanks for discussing this idea. Yes, I think this would be a good feature. Can you open a Jira issue and describe the feature? Thanks, Fabian Am Do., 4. Okt. 2018 um 19:28 Uhr schrieb Rinat : > Hi mates, in accordance with the contract of > org.apache.flink.formats.avro.Deserializati

Re: Identifying missing events in keyed streams

2018-10-10 Thread Fabian Hueske
Hi Averell, I'd go with approach 2). As of Flink 1.6.0 you can delete timers. But even if you are on a pre-1.6 version, a ProcessFunction would be the way to go, IMO. You don't need to register a timer for each event. Instead, you can register the first timer with the first event and have a state
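
A rough sketch of approach 2), assuming events are Tuple2<String, Long> keyed by f0 with event-time timestamps in f1; a single timer per key is moved forward as events arrive (timer deletion requires Flink 1.6+) and fires only if no event shows up within a gap:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class MissingEventDetector extends KeyedProcessFunction<String, Tuple2<String, Long>, String> {
    private static final long GAP = 60_000L; // alert if no event arrives for 60s
    private transient ValueState<Long> nextTimer;

    @Override
    public void open(Configuration parameters) {
        nextTimer = getRuntimeContext().getState(new ValueStateDescriptor<>("nextTimer", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx, Collector<String> out) throws Exception {
        Long previous = nextTimer.value();
        if (previous != null) {
            // Move the deadline forward instead of keeping one timer per event.
            ctx.timerService().deleteEventTimeTimer(previous);
        }
        long deadline = event.f1 + GAP;
        ctx.timerService().registerEventTimeTimer(deadline);
        nextTimer.update(deadline);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // No event arrived for this key within GAP milliseconds.
        out.collect("Missing events for key " + ctx.getCurrentKey());
        nextTimer.clear();
    }
}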

Re: Partitions vs. Subpartitions

2018-10-11 Thread Fabian Hueske
Hi Chris, The terminology in the docs and code is not always consistent. It depends on the context. Both could also mean the same if they are used in different places. Can you point to the place(s) that refer to partition and subpartition? Fabian Am Do., 11. Okt. 2018 um 04:50 Uhr schrieb Kurt

Re: Identifying missing events in keyed streams

2018-10-11 Thread Fabian Hueske
I'd go with 2) because the logic is simple and it is (IMO) much easier to understand what is going on and what state is kept. Am Do., 11. Okt. 2018 um 12:42 Uhr schrieb Averell : > Hi Fabian, > > Thanks for the suggestion. > I will try with that support of removing timers. > > I have also tried a

Re: When does Trigger.clear() get called?

2018-10-12 Thread Fabian Hueske
Hi Andrew, The PURGE action of a window removes the window state (i.e., the collected events or computed aggregate) but the window meta data including the Trigger remain. The Trigger.clear() method is called when the window is completely (i.e., all meta data) discarded. This happens when the tim

Re: Not all files are processed? Stream source with ContinuousFileMonitoringFunction

2018-10-12 Thread Fabian Hueske
Hi, Which file system are you reading from? If you are reading from S3, this might be caused by S3's eventual consistency property. Have a look at FLINK-9940 [1] for a more detailed discussion. There is also an open PR [2] that you could try to patch the source operator with. Best, Fabian [1] ht

Re: When does Trigger.clear() get called?

2018-10-15 Thread Fabian Hueske
Hi, Re Q1: The main purpose of the Trigger.clear() method is to remove all custom state of the Trigger. State must be explicitly removed, otherwise the program leaks memory. Re Q3: If you are using a keyed stream, you need to manually clean up the state by calling State.clear(). If you are using a

Re: Need help to understand memory consumption

2018-10-17 Thread Fabian Hueske
Hi, As was said before, managed memory (as described in the blog post [1]) is only used for batch jobs. By default, managed memory is only lazily allocated, i.e., when a batch job is executed. Streaming jobs maintain state in state backends. Flink provides state backends that store the state on t

Re: Checkpointing when one of the sources has completed

2018-10-17 Thread Fabian Hueske
Hi Niels, Checkpoints can only complete if all sources are running. That's because the checkpoint mechanism relies on injecting checkpoint barriers into the stream at the sources. Best, Fabian Am Mi., 17. Okt. 2018 um 11:11 Uhr schrieb Paul Lam : > Hi Niels, > > Please see https://issues.apache

Re: Understand Broadcast State in Node Failure Case

2018-10-22 Thread Fabian Hueske
Hi Chengzhi, Broadcast State is checkpointed like any other state and will be restored in all failure cases (including the ones you mentioned). We added the warning to inform users that Broadcast state will also be stored in the JVM memory, even if the RocksDB StateBackend was configured (which st

Re: Need help to understand memory consumption

2018-10-22 Thread Fabian Hueske
If yes, does that mean that I have to purge old state > backend in RocksDB ? > > Thanks a lot ! > > Regards, > Julien. > > - Mail original - > De: "Fabian Hueske" > À: "wangzhijiang999" > Cc: "Paul Lam" , jpreis...@fre

Re: Java Table API and external catalog bug?

2018-10-25 Thread Fabian Hueske
IIRC, that was recently fixed. Might come out with 1.6.2 / 1.7.0. Cheers, Fabian Flavio Pompermaier schrieb am Do., 25. Okt. 2018, 14:09: > Ok thanks! I wasn't aware of this..that would be undoubtedly useful ;) > On Thu, Oct 25, 2018 at 2:00 PM Timo Walther wrote: > >> Hi Flavio, >> >> the ex

Re: Non deterministic result with Table API SQL

2018-11-05 Thread Fabian Hueske
Thanks Flavio for reporting the error and helping to debug it. A job to reproduce the error is very valuable :-) Best, Fabian Am Mo., 5. Nov. 2018 um 14:38 Uhr schrieb Flavio Pompermaier < pomperma...@okkam.it>: > Here it is the JIRA ticket and, attached to if, the Flink (Java) job to > reproduce th

Re: Split one dataset into multiple

2018-11-06 Thread Fabian Hueske
You have to define a common type, like an n-ary Either type, and return that from your source / operator. The resulting DataSet can be consumed by multiple FlatMapFunctions, each extracting and forwarding one of the result types. Cheers, Fabian Am Di., 6. Nov. 2018 um 10:40 Uhr schrieb madan :
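
A minimal sketch of that pattern, using Flink's built-in binary Either type for two result types (for more types, a custom tagged wrapper POJO would play the same role):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.types.Either;
import org.apache.flink.util.Collector;

public class SplitSketch {
    // Each consumer extracts and forwards only one of the wrapped types.
    public static DataSet<String> lefts(DataSet<Either<String, Integer>> mixed) {
        return mixed.flatMap(new FlatMapFunction<Either<String, Integer>, String>() {
            @Override
            public void flatMap(Either<String, Integer> e, Collector<String> out) {
                if (e.isLeft()) out.collect(e.left());
            }
        });
    }
    public static DataSet<Integer> rights(DataSet<Either<String, Integer>> mixed) {
        return mixed.flatMap(new FlatMapFunction<Either<String, Integer>, Integer>() {
            @Override
            public void flatMap(Either<String, Integer> e, Collector<Integer> out) {
                if (e.isRight()) out.collect(e.right());
            }
        });
    }
}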

Re: Counting DataSet in DataFlow

2018-11-07 Thread Fabian Hueske
Hi, Counting always requires a job to be executed. Not sure if this is what you want to do, but if you want to prevent to get an empty result due to an empty cross input, you can use a mapPartition() with parallelism 1 to emit a special record, in case the MapPartitionFunction didn't see any data.

Re: Counting DataSet in DataFlow

2018-11-07 Thread Fabian Hueske
r(if cnt != 0).withBroadcastSet("cnt", count).doSomethingElse Fabian [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#broadcast-variables Am Mi., 7. Nov. 2018 um 14:10 Uhr schrieb Fabian Hueske : > Hi, > > Counting always requires a job to be executed. > N
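
A rough sketch of the broadcast-variable variant referenced above [1], assuming the count was computed as a one-element DataSet<Long> and is read in open() via getBroadcastVariable; the branch inside map() is only illustrative:

import java.util.List;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.configuration.Configuration;

public class CountBroadcastSketch {
    public static DataSet<String> withCount(DataSet<String> data, DataSet<Long> count) {
        return data.map(new RichMapFunction<String, String>() {
            private long cnt;
            @Override
            public void open(Configuration parameters) {
                // The broadcast variable is a one-element list holding the count of the other input.
                List<Long> bc = getRuntimeContext().getBroadcastVariable("cnt");
                cnt = bc.isEmpty() ? 0L : bc.get(0);
            }
            @Override
            public String map(String value) {
                // React to an empty input, e.g. flag records whose cross/join partner was missing.
                return cnt == 0 ? value + " (other input was empty)" : value;
            }
        }).withBroadcastSet(count, "cnt");
    }
}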

Re: Manually clean SQL keyed state

2018-11-09 Thread Fabian Hueske
Hi Shahar, That's not possible at the moment. The SQL API does not provide any knobs to control state size besides the idle state retention. The reason is that it aims to be as accurate as possible. In the future it might be possible to provide more information to the system (like constraints in

Re: Always trigger calculation of a tumble window in Flink SQL

2018-11-09 Thread Fabian Hueske
Hi, SQL does not support any custom triggers or timers. In general, computations are performed when they are complete with respect to the watermarks (this applies to GROUP BY windows, OVER windows, windowed and time-versioned joins, etc.). Best, Fabian Am Fr., 9. Nov. 2018 um 05:08 Uhr schrieb yinhua.

Re: Stopping a streaming app from its own code : behaviour change from 1.3 to 1.6

2018-11-09 Thread Fabian Hueske
Hi Arnaud, Thanks for reporting the issue! Best, Fabian Am Do., 8. Nov. 2018 um 20:50 Uhr schrieb LINZ, Arnaud < al...@bouyguestelecom.fr>: > 1.FLINK-10832 > > Created (with heavy difficulties as typing java code in a jira description > wa

Re: Flink Serialization and case class fields limit

2018-11-16 Thread Fabian Hueske
Hi Andrea, I wrote the post 2.5 years ago. Sorry, I don't think that I kept the code somewhere, but the mechanics in Flink should still be the same. Best, Fabian Am Fr., 16. Nov. 2018, 20:06 hat Andrea Sella geschrieben: > Hi Andrey, > > My bad, I forgot to say that I am using Scala 2.11, that

Re: Group by with null keys

2018-11-20 Thread Fabian Hueske
Hi Flavio, Whether groupBy with null values works or not depends on the type of the key, or more specifically on the TypeComparator and TypeSerializer that are used to serialize, compare, and hash the key type. The processing engine supports null values if the comparator and serializer can handle

Re: Flink Table Duplicate Evaluation

2018-11-20 Thread Fabian Hueske
Hi Niklas, The workaround that you described should work fine. However, you don't need a custom sink. Converting the Table into a DataSet and registering the DataSet again as a Table is currently the way to solve this issue. Best, Fabian Am Di., 20. Nov. 2018 um 17:13 Uhr schrieb Niklas Teichman

[ANNOUNCE] Flink Forward San Francisco Call for Presentations closes soon

2018-11-20 Thread Fabian Hueske
Hi Everyone, Flink Forward San Francisco will *take place on April 1st and 2nd 2019*. Flink Forward is a community conference organized by data Artisans and gathers many members of the Flink community, including users, contributors, and committers. It is the perfect event to get in touch and conne

Re: TaskManager & task slots

2018-11-21 Thread Fabian Hueske
Yes, this hasn't changed. Best, Fabian Am Mi., 21. Nov. 2018 um 08:18 Uhr schrieb yinhua.dai < yinhua.2...@outlook.com>: > Hi Fabian, > > Is below description still remain the same in Flink 1.6? > > Slots do not guard CPU time, IO, or JVM memory. At the moment they only > isolate managed memory

Re: Can JDBCSinkFunction support exectly once?

2018-11-21 Thread Fabian Hueske
Hi, JDBCSinkFunction is a simple wrapper around the JDBCOutputFormat (the DataSet / Batch API output interface). Dominik is right, that JDBCSinkFunction does not support exactly-once output. It is not strictly required that an exactly-once sink implements TwoPhaseCommitFunction. TPCF is a conveni

Re: Flink streaming automatic scaling (1.6.1)

2018-11-21 Thread Fabian Hueske
Hi, Flink 1.6 does not support automatic scaling. However, there is a REST call to trigger the rescaling of a job. You need to call it manually though. Have a look at the */jobs/:jobid/rescaling *call in the REST docs [1]. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-

Re: Deadlock happens when sink to mysql

2018-11-22 Thread Fabian Hueske
Hi, Which TableSource and TableSink do you use? Best, Fabian Am Mo., 19. Nov. 2018 um 15:39 Uhr schrieb miki haiat : > can you share your entire code please > > On Mon, Nov 19, 2018 at 4:03 PM 徐涛 wrote: > >> Hi Experts, >> I use the following sql, and sink to mysql, >> select >> >> alb

Re: Flink JSON (string) to Pojo (and vice versa) example

2018-11-22 Thread Fabian Hueske
Thanks Flavio. This looks very useful. AFAIK, the Calcite community is also working on functions for JSON <-> Text conversions which are part of SQL:2016. Hopefully, we can leverage their implementations in Flink's SQL support. Best, Fabian Am Di., 20. Nov. 2018 um 18:27 Uhr schrieb Flavio Pompe

Re: CEP Dynamic Patterns

2018-11-26 Thread Fabian Hueske
Hi Steve, No this feature has not been contributed yet. Best, Fabian Am Fr., 23. Nov. 2018 um 20:58 Uhr schrieb Steve Bistline < srbistline.t...@gmail.com>: > Have dynamic patterns been introduced yet? > > Steve >

Re: Joining more than 2 streams

2018-11-26 Thread Fabian Hueske
Hi, Yes, your reasoning is correct. If you use two binary joins, the data of the first two streams will be buffered twice. Unioning all three streams and joining them in a custom ProcessFunction would reduce the amount of required state. Best, Fabian Am Sa., 24. Nov. 2018 um 14:08 Uhr schrieb Ga

Re: understadning kafka connector - rebalance

2018-11-26 Thread Fabian Hueske
Hi, DataStream x = ... x.rebalance().keyBy() is not a good idea. It will first distribute the records round-robin (over the network) and subsequently partition them by hash. The first shuffle is unnecessary. It does not have any effect because it is undone by the second partitioning. Btw. any m

Re: your advice please regarding state

2018-11-27 Thread Fabian Hueske
Hi Avi, I'd definitely go for approach #1. Flink will hash partition the records across all nodes. This is basically the same as a distributed key-value store sharding keys. I would not try to fine tune the partitioning. You should try to use as many keys as possible to ensure an even distribution

Re: OutOfMemoryError while doing join operation in flink

2018-11-28 Thread Fabian Hueske
Hi, Flink handles large data volumes quite well, large records are a bit more tricky to tune. You could try to reduce the number of parallel tasks per machine (#slots per TM, #TMs per machine) and/or increase the amount of available JVM memory (possible in exchange for managed memory as Zhijiang s

Re: Dataset column statistics

2018-11-29 Thread Fabian Hueske
Hi, You could try to enable object reuse. Alternatively you can give more heap memory or fine tune the GC parameters. I would not consider it a bug in Flink, but it might be something that could be improved. Fabian Am Mi., 28. Nov. 2018 um 18:19 Uhr schrieb Flavio Pompermaier < pomperma...@okkam.

Re: Dataset column statistics

2018-11-29 Thread Fabian Hueske
ve or > Spark) in Flink Table API? > > Best, > Flavio > > On Thu, Nov 29, 2018 at 9:45 AM Fabian Hueske wrote: > >> Hi, >> >> You could try to enable object reuse. >> Alternatively you can give more heap memory or fine tune the GC >> parameters. &

Re: Looking for relevant sources related to connecting Apache Flink and Edgent.

2018-11-30 Thread Fabian Hueske
Hi Felipe, You can define TableSources (for SQL, Table API) that support filter push-down. The optimizer will figure out this opportunity and hand filters to a custom TableSource. I should add that AFAIK this feature is not used very often (expect some rough edges) and that the API is likely to c

Re: [Table API example] Table program cannot be compiled. This is a bug. Please file an issue

2018-11-30 Thread Fabian Hueske
Hi Marvin, Can you post the query (+ schema of tables) that lead to this exception? Thank you, Fabian Am Fr., 30. Nov. 2018 um 10:55 Uhr schrieb Marvin777 < xymaqingxiang...@gmail.com>: > Hi all, > > I have a simple test for looking at Flink Table API and hit an exception > reported as a bug.

Re: How to apply watermark on datastream and then do join operation on it

2018-11-30 Thread Fabian Hueske
Hi, Welcome to the mailing list. What exactly is your problem? Do you receive an error message? Is the program not compiling? Do you receive no output? Regardless of that, I would recommend to provide the timestamp extractors to the Kafka source functions. Also, I would have a close look at the w

[ANNOUNCE] Call for Presentations Extended for Flink Forward San Francisco 2019

2018-12-01 Thread Fabian Hueske
Flink Forward San Francisco Am Di., 20. Nov. 2018 um 17:57 Uhr schrieb Fabian Hueske : > Hi Everyone, > > Flink Forward San Francisco will *take place on April 1st and 2nd 2019*. > Flink Forward is a community conference organized by data Artisans and > gathers many members of the

Re: Need the way to create custom metrics

2018-12-18 Thread Fabian Hueske
Hi, AFAIK it is not possible to collect metrics for an AggregateFunction. You can open a feature request by creating a Jira issue. Thanks, Fabian Am Mo., 17. Dez. 2018 um 21:29 Uhr schrieb Gaurav Luthra < gauravluthra6...@gmail.com>: > Hi, > > I need to know the way to implement custom metrics

[ANNOUNCE] Berlin Buzzwords 2019 - Call for Presentations open until Febr. 17th

2018-12-21 Thread Fabian Hueske
Hi everyone, The Call for Presentations for the Berlin Buzzwords 2019 conference is open until Febr. 17th. Berlin Buzzwords [1] is an amazing conference on all things Scale, Search, and Stream(!) with a great and open minded community and a strong focus on open source. Next year's edition takes p

Re: Flink SQL client always cost 2 minutes to submit job to a local cluster

2019-01-03 Thread Fabian Hueske
Hi, You can try to build a JAR file with all runtime dependencies of Flink SQL (Calcite, Janino, +transitive dependencies), add it to the lib folder, and exclude the dependencies from the JAR file that is sent to the cluster when you submit a job. It would also be good to figure out what takes so

Re: [DISCUSS] Dropping flink-storm?

2019-01-10 Thread Fabian Hueske
>> > >> https://github.com/apache/flink/blob/master/flink-contrib/flink-storm/src/main/java/org/apache/flink/storm/wrappers/FlinkTopologyContext.java >> > >> > Cheers, >> > Till >> > >> > On Tue, Oct 9, 2018 at 10:08 AM Fabia

Re: Reducing runtime of Flink planner

2019-01-10 Thread Fabian Hueske
Hi Niklas, The planning time of a job does not depend on the data size. It would be the same whether you process 5MB or 5PB. FLINK-10566 (as pointed to by Timo) fixed a problem for plans with many branching and joining nodes. Looking at your plan, there are some, but (IMO) not enough to be problem

Re: Multiple select single result

2019-01-13 Thread Fabian Hueske
Hi Dhanuka, The important error message here is "AppendStreamTableSink requires that Table has only insert changes". This is because you use UNION instead of UNION ALL, which implies duplicate elimination. Unfortunately, UNION is currently internally implemented as a regular aggregation which pro

Re: Multiple select single result

2019-01-14 Thread Fabian Hueske
t;> On Mon, Jan 14, 2019 at 9:54 AM dhanuka ranasinghe < >>> dhanuka.priyan...@gmail.com> wrote: >>> >>>> Hi Fabian, >>>> >>>> Thanks for the prompt reply and its working 🤗. >>>> >>>> I am trying to depl

Re: Multiple select single result

2019-01-14 Thread Fabian Hueske
"datastreamcalcrule" grows beyond 64 kb > > Cheers > Dhanuka > > > On Mon, 14 Jan 2019, 20:30 Fabian Hueske >> Hi, >> >> you should avoid the UNION ALL approach because the query will scan the >> (identical?) Kafka topic 200 times which is highly ine

Re: delete all available flink timers on app start

2019-01-17 Thread Fabian Hueske
Hi Vipul, I'm not aware of a way to do this. You could have a list of all registered timers per key as state to be able to delete them. However, the problem is to identify in user code when an application was restarted, i.e., to know when to delete timers. Also, timer deletion would need to be don

Re: Should the entire cluster be restarted if a single Task Manager crashes?

2019-01-18 Thread Fabian Hueske
Hi Harshith, No, you don't need to restart the whole cluster. Flink only needs enough processing slots to recover the job. If you have a standby TM, the job should restart immediately (according to its restart policy). Otherwise, you have to start a new TM to provide more slots. Once the slots are

Re: [DISCUSS] Towards a leaner flink-dist

2019-01-18 Thread Fabian Hueske
Hi Chesnay, Thank you for the proposal. I think this is a good idea. We follow a similar approach already for Hadoop dependencies and connectors (although in application space). +1 Fabian Am Fr., 18. Jan. 2019 um 10:59 Uhr schrieb Chesnay Schepler < ches...@apache.org>: > Hello, > > the binary

Re: Temporal tables not behaving as expected

2019-01-22 Thread Fabian Hueske
Hi, The problem is that you are using processing time which is non-deterministic. Both inputs are consumed at the same time and joined based on which record arrived first. The result depends on a race condition. If you change the input table to have event time attributes and use these to register

Re: Kafka stream fed in batches throughout the day

2019-01-22 Thread Fabian Hueske
Hi Jonny, I think this is good use case for event time stream processing. The idea of taking a savepoint, stopping and later resuming the job is good as it frees the resources that would otherwise be occupied by the idle job. In that sense it would behave like a batch job. However, in contrast to

Re: Flink Jdbc streaming source support in 1.7.1 or in future?

2019-01-23 Thread Fabian Hueske
I think this is very hard to build in a generic way. The common approach here would be to get access to the changelog stream of the table, writing it to a message queue / event log (like Kafka, Pulsar, Kinesis, ...) and ingesting the changes from the event log into a Flink application. You can of

Re: Select feilds in Table API

2019-01-29 Thread Fabian Hueske
The problem is that the table "lineitem" does not have a field "l_returnflag". The fields in "lineitem" are named [TMP_2, TMP_5, TMP_1, TMP_0, TMP_4, TMP_6, TMP_3]. I guess it depends on how you obtained lineitem. Best, Fabian Am Di., 29. Jan. 2019 um 16:38 Uhr schrieb Soheil Pourbafrani < soheil

Re: How to load multiple same-format files with single batch job?

2019-01-29 Thread Fabian Hueske
Hi, You can point a file-based input format to a directory and the input format should read all files in that directory. That works as well for TableSources that internally use file-based input formats. Is that what you are looking for? Best, Fabian Am Mo., 28. Jan. 2019 um 17:22 Uhr schrieb
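
A minimal sketch of this, assuming a directory of text files under a hypothetical HDFS path; pointing readTextFile at the directory reads all files in it, and nested directories can additionally be included with the recursive enumeration flag:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class ReadDirectorySketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Also descend into nested directories of the input path.
        Configuration params = new Configuration();
        params.setBoolean("recursive.file.enumeration", true);

        DataSet<String> lines = env
            .readTextFile("hdfs:///data/geojson/")   // hypothetical directory path
            .withParameters(params);

        System.out.println(lines.count());
    }
}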

Re: How to load multiple same-format files with single batch job?

2019-02-04 Thread Fabian Hueske
> before to be processed or will all be streamed? > > All the best > François > > Le mar. 29 janv. 2019 à 22:20, Fabian Hueske a écrit : > >> Hi, >> >> You can point a file-based input format to a directory and the input >> format should read all fil

Re: Add header to a file produced using the writeAsFormattedText method

2019-02-04 Thread Fabian Hueske
Hi Konstantinos, Writing headers to files is currently not supported by the underlying TextOutputFormat. You can implement a custom OutputFormat by extending TextOutputFormat to add this functionality. Best, Fabian Am Fr., 1. Feb. 2019 um 16:04 Uhr schrieb Papadopoulos, Konstantinos < konstantin
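
A rough sketch of such a custom OutputFormat, assuming String records (as produced by a TextFormatter); note that the header would be written once per parallel output file:

import java.io.IOException;
import org.apache.flink.api.java.io.TextOutputFormat;
import org.apache.flink.core.fs.Path;

public class HeaderTextOutputFormat extends TextOutputFormat<String> {
    private final String header;

    public HeaderTextOutputFormat(Path outputPath, String header) {
        super(outputPath);
        this.header = header;
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        super.open(taskNumber, numTasks);
        // Write the header as the first line of each produced file.
        writeRecord(header);
    }
}

It would be used via dataSet.output(new HeaderTextOutputFormat(new Path("hdfs:///out"), "col1;col2")) instead of writeAsFormattedText.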

Re: Is the order guaranteed with Windowall

2019-02-04 Thread Fabian Hueske
Hi, A WindowAll is executed in a single task. If you sort the data before the window, the sorting must also happen in a single task, i.e., with parallelism 1. The reason is that an operator somewhat randomly merges multiple input partitions. So even if each input partition is sorted, the merging

Re: Test harness for CoProcessFunction outputting Protobuf messages

2019-02-04 Thread Fabian Hueske
Hi Alexey, I think you are right. It does not seem to be possible to provide a TypeInformation for side outputs to a TestHarness. This sounds like a useful addition. Would you mind creating a Jira issue for that? Thank you, Fabian Am So., 3. Feb. 2019 um 19:13 Uhr schrieb Alexey Trenikhun : >

Re: Reverse of KeyBy

2019-02-04 Thread Fabian Hueske
Hi, Calling keyBy twice will not work, because the second call overrides the first. You can keyBy on a composite key (MyKey, LargeMessageId). You can do the following InputStream .keyBy ( (MyKey, LargeMessageId) ) \\ use composite key .flatMap(new MyReassemblyFunction()) .keyBy(MyKey) .?
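
A rough sketch of that two-step keying, assuming chunks arrive as Tuple3<String, Long, String> = (myKey, largeMessageId, chunk); the reassembly logic itself is only a placeholder here:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.util.Collector;

public class ReKeySketch {
    // Input: (myKey, largeMessageId, chunk); output: (myKey, reassembledMessage).
    public static DataStream<Tuple2<String, String>> reassembleAndKey(
            DataStream<Tuple3<String, Long, String>> chunks) {
        return chunks
            // composite key: all chunks of one large message meet in the same task
            .keyBy(0, 1)
            // placeholder for the stateful reassembly logic
            .flatMap(new FlatMapFunction<Tuple3<String, Long, String>, Tuple2<String, String>>() {
                @Override
                public void flatMap(Tuple3<String, Long, String> chunk, Collector<Tuple2<String, String>> out) {
                    out.collect(Tuple2.of(chunk.f0, chunk.f2));
                }
            })
            // re-key by the original key for the downstream, per-key processing
            .keyBy(0);
    }
}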

Re: Add header to a file produced using the writeAsFormattedText method

2019-02-04 Thread Fabian Hueske
nos.papadopou...@iriworldwide.com>: > Hi Fabian, > > > > Do you know if there is any plan Flink core framework to support such > functionality? > > > > Best, > > Konstantinos > > > > *From:* Fabian Hueske > *Sent:* Δευτέρα, 4 Φεβρουαρίου 2019 3:49 μμ > *To:*

Re: Reverse of KeyBy

2019-02-04 Thread Fabian Hueske
Key)? My understanding is that this will further partition the > already partitioned input stream (from 1 above) and will not help me, as I > need to process all LargeMessages for a given MyKey in order. > > > > Is there an implicit assumption here that the flatMap operation (2) a

Re: Is the order guaranteed with Windowall

2019-02-04 Thread Fabian Hueske
rtStreamKeyed.window(TumblingProcessingTimeWindowsetParallelism(1).name("Aggregate > events"); > > Thanks > David > > On 2019/02/04 13:54:14, Fabian Hueske wrote: > > Hi, > > > > A WindowAll is executed in a single task. If you sort the data bef

Re: How to add caching to async function?

2019-02-04 Thread Fabian Hueske
Hi William, Does the cache need to be fault tolerant? If not, you could use a regular in-memory map as cache (+some LRU cleaning). Or do you expect the cache to grow too large for the memory? Best, Fabian Am Mo., 4. Feb. 2019 um 18:00 Uhr schrieb William Saar : > Hi, > I am trying to implement
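
A rough sketch of such a non-fault-tolerant, per-instance cache inside a RichAsyncFunction, using an access-ordered LinkedHashMap with a size cap as a simple LRU; the external lookup is a hypothetical placeholder:

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class CachedAsyncLookup extends RichAsyncFunction<String, String> {
    private static final int MAX_ENTRIES = 10_000;
    private transient Map<String, String> cache;

    @Override
    public void open(Configuration parameters) {
        // Access-ordered LinkedHashMap + removeEldestEntry = simple LRU;
        // synchronized because completion callbacks may run on other threads.
        cache = Collections.synchronizedMap(new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX_ENTRIES;
            }
        });
    }

    @Override
    public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
        String cached = cache.get(key);
        if (cached != null) {
            resultFuture.complete(Collections.singleton(cached));
            return;
        }
        // Hypothetical asynchronous external lookup.
        CompletableFuture.supplyAsync(() -> externalLookup(key))
            .thenAccept(value -> {
                cache.put(key, value);
                resultFuture.complete(Collections.singleton(value));
            });
    }

    private String externalLookup(String key) {
        return "value-for-" + key; // placeholder
    }
}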

Re: Reduce one event under multiple keys

2019-02-11 Thread Fabian Hueske
Hi Stephen, A window is created with the first record that is assigned to it. If the windows are based on time and a key, than no window will be created (and not space be occupied) if there is not a first record for a key and time interval. Anyway, if tracking the number of open files & average o

Re: Is there a windowing strategy that allows a different offset per key?

2019-02-11 Thread Fabian Hueske
Hi Stephen, First of all, yes, windows computing and emitting at the same time can cause pressure on the downstream system. There are a few ways how you can achieve this: * use a custom window assigner. A window assigner decides into which window a record is assigned. This is the approach you sug

Re: fllink 1.7.1 and RollingFileSink

2019-02-11 Thread Fabian Hueske
Hi Vishal, Kostas (in CC) should be able to help here. Best, Fabian Am Mo., 11. Feb. 2019 um 00:05 Uhr schrieb Vishal Santoshi < vishal.santo...@gmail.com>: > Any one ? > > On Sun, Feb 10, 2019 at 2:07 PM Vishal Santoshi > wrote: > >> You don't have to. Thank you for the input. >> >> On Sun, F

Re: Limit in batch flink sql job

2019-02-12 Thread Fabian Hueske
Hi, It's as the error message says. LIMIT 10 without ORDER BY would pick 10 random rows and hence lead to non-deterministic results. That's why it is not supported yet. Best, Fabian Am Di., 12. Feb. 2019 um 07:02 Uhr schrieb yinhua.dai < yinhua.2...@outlook.com>: > Why flink said "Limiting the

[ANNOUNCE] New Flink PMC member Thomas Weise

2019-02-12 Thread Fabian Hueske
Hi everyone, On behalf of the Flink PMC I am happy to announce Thomas Weise as a new member of the Apache Flink PMC. Thomas is a long time contributor and member of our community. He is starting and participating in lots of discussions on our mailing lists, working on topics that are of joint int

Program of Flink Forward San Francisco 2019 announced

2019-02-12 Thread Fabian Hueske
Hi everyone, We announced the program of Flink Forward San Francisco 2019. The conference takes place at the Hotel Nikko in San Francisco on April 1st and 2nd. On the first day we offer three training sessions [1]: * Introduction to Streaming with Apache Flink * Analyzing Streaming Data with Flin

Re: Program of Flink Forward San Francisco 2019 announced

2019-02-12 Thread Fabian Hueske
Am Di., 12. Feb. 2019 um 16:26 Uhr schrieb Fabian Hueske : > Hi everyone, > > We announced the program of Flink Forward San Francisco 2019. > The conference takes place at the Hotel Nikko in San Francisco on April > 1st and 2nd. > > On the first day we offer three

Re: [DISCUSS] Adding a mid-term roadmap to the Flink website

2019-02-14 Thread Fabian Hueske
Hi, I like the idea of putting the roadmap on the website because it is much more visible (and IMO more credible, obligatory) there. However, I share the concerns about frequent updates. I think it would be great to update the "official" roadmap on the website once per release (-bugfix releases)

Re: Window elements for certain period for delayed processing

2019-02-14 Thread Fabian Hueske
Hi, I would not use a window for that. Implementing the logic with a ProcessFunction seems more straight-forward. The function simply collects all events between 00:00 and 01:00 in a ListState and emits them when the time passes 01:00. All other records are simply forwarded. Best, Fabian Am Fr.,
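
A rough sketch of that ProcessFunction, assuming a keyed stream of Tuple2<String, Long> with event-time timestamps; events falling into the held-back interval (here hypothetically 00:00-01:00 of each day, in epoch-millis arithmetic) are buffered in ListState and released by an event-time timer, all others are forwarded directly:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class HoldBackFunction extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {
    private static final long DAY = 24 * 60 * 60 * 1000L;
    private static final long HOLD_UNTIL = 60 * 60 * 1000L; // 01:00
    private transient ListState<Tuple2<String, Long>> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(new ListStateDescriptor<>(
                "buffer", TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {})));
    }

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx, Collector<Tuple2<String, Long>> out) throws Exception {
        long timeOfDay = event.f1 % DAY;
        if (timeOfDay < HOLD_UNTIL) {
            // Inside the blocked interval: buffer and register a timer at 01:00 of that day.
            buffer.add(event);
            ctx.timerService().registerEventTimeTimer(event.f1 - timeOfDay + HOLD_UNTIL);
        } else {
            out.collect(event); // outside the interval: forward directly
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out) throws Exception {
        // Time passed 01:00: emit and clear the buffered events.
        for (Tuple2<String, Long> e : buffer.get()) {
            out.collect(e);
        }
        buffer.clear();
    }
}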

Re: How to load multiple same-format files with single batch job?

2019-02-15 Thread Fabian Hueske
roduce a DataSet > from a single geojson file. > This doesn't sound compatible with a custom InputFormat, don't you? > > Thanks in advance for any addition hint, all the best > > François > > Le lun. 4 févr. 2019 à 12:10, Fabian Hueske a écrit : > >> Hi

Re: [Meetup] Apache Flink+Beam+others in Seattle. Feb 21.

2019-02-18 Thread Fabian Hueske
Thank you Pablo! Am Fr., 15. Feb. 2019 um 20:42 Uhr schrieb Pablo Estrada : > Hello everyone, > There is an upcoming meetup happening in the Google Seattle office, on > February 21st, starting at 5:30pm: > https://www.meetup.com/seattle-apache-flink/events/258723322/ > > People will be chatting a

Re: [Table] Types of query result and tablesink do not match error

2019-02-18 Thread Fabian Hueske
Hi François, I had a look at the code and the GenericTypeInfo checks equality by comparing the classes they represent (Class == Class). Class does not override the default implementation of equals, so this is an instance equality check. The check can evaluate to false if Map was loaded by two diff

Re: Reduce one event under multiple keys

2019-02-18 Thread Fabian Hueske
Am Mo., 11. Feb. 2019 um 11:29 Uhr schrieb Stephen Connolly < stephen.alan.conno...@gmail.com>: > > > On Mon, 11 Feb 2019 at 09:42, Fabian Hueske wrote: > >> Hi Stephen, >> >> A window is created with the first record that is assigned to it. >> If the wind

Re: Limit in batch flink sql job

2019-02-18 Thread Fabian Hueske
Thanks for pointing this out! This is indeed a bug in the documentation. I'll fix that. Thank you, Fabian Am Mi., 13. Feb. 2019 um 02:04 Uhr schrieb yinhua.dai < yinhua.2...@outlook.com>: > OK, thanks. > It might be better to update the document which has the following example > that confused m

Re: KafkaTopicPartition internal class treated as generic type serialization

2019-02-18 Thread Fabian Hueske
Hi Eric, I did a quick search in our Jira to check if this is a known issue but didn't find anything. Maybe Gordon (in CC) knows a bit more about this problem. Best, Fabian Am Fr., 15. Feb. 2019 um 11:08 Uhr schrieb Eric Troies : > Hi, I'm having the exact same issue with flink 1.4.0 using scal

Re: Test FileUtilsTest.testDeleteDirectory failed when building Flink

2019-02-18 Thread Fabian Hueske
Hi Paul, Which components (Flink, JDK, Docker base image, ...) are you upgrading and which versions do you come from? I think it would be good to check how (and with which options) the JVM in the container is started. Best, Fabian Am Fr., 15. Feb. 2019 um 09:50 Uhr schrieb Paul Lam : > Hi all,
