Re: [Table API/SQL] Finding missing spots to resolve why 'no watermark' is presented in Flink UI

2018-07-04 Thread Fabian Hueske
in SQL might be tricky (as the semantic of SQL query is not for > multiple outputs). > > Thanks, > Jungtaek Lim (HeartSaVioR) > > 2018년 7월 4일 (수) 오후 5:49, Fabian Hueske 님이 작성: > >> Hi Jungtaek, >> >> Flink 1.5.0 features a TableSource for Kafka and JSON [1], incl. >

Re: Let BucketingSink roll file on each checkpoint

2018-07-04 Thread Fabian Hueske
Hi Xilang, I thought about this again. The bucketing sink would need to roll on event-time intervals (similar to the current processing time rolling) which are triggered by watermarks in order to support consistency. However, it would also need to maintain a write ahead log of all received rows an

Re: [Table API/SQL] Finding missing spots to resolve why 'no watermark' is presented in Flink UI

2018-07-04 Thread Fabian Hueske
ngAwareExistingField("eventTime"), > new BoundedOutOfOrderTimestamps(Time.minutes(1).toMilliseconds) > ) > .build() > > Thanks again! > Jungtaek Lim (HeartSaVioR) > > 2018년 7월 4일 (수) 오후 8:18, Fabian Hueske 님이 작성: > >> Hi Jungtaek, >> >> I

Re: A use-case for Flink and reactive systems

2018-07-05 Thread Fabian Hueske
Hi Yersinia, The main idea of an event-driven application is to hold the state (i.e., the account data) in the streaming application and not in an external database like Couchbase. This design is very scalable (state is partitioned) and avoids look-ups from the external database because all state

Re: [Table API/SQL] Finding missing spots to resolve why 'no watermark' is presented in Flink UI

2018-07-05 Thread Fabian Hueske
e.getSchema.getColumnNames, > outTable.getSchema.getTypes) > tableEnv.toRetractStream[Row](outTable).print() > > > Thanks again, > Jungtaek Lim (HeartSaVioR) > > [1] https://issues.apache.org/jira/browse/FLINK-9742 > > 2018년 7월 4일 (수) 오후 10:03, Fabian Hueske 님이 작성: >

Re: Dynamic Rule Evaluation in Flink

2018-07-05 Thread Fabian Hueske
Hi, > Flink doesn't support connecting multiple streams with heterogeneous schema This is not correct. Flink is very well able to connect streams with different schema. However, you cannot union two streams with different schema. In order to reconfigure an operator with changing rules, you can us

Re: Description of Flink event time processing

2018-07-05 Thread Fabian Hueske
Hi Elias, Thanks for the great document! I made a pass over it and left a few comments. I think we should definitely add this to the documentation. Thanks, Fabian 2018-07-04 10:30 GMT+02:00 Fabian Hueske : > Hi Elias, > > I agree, the docs lack a coherent discussion of event time

Re: Slide Window Compute Optimization

2018-07-06 Thread Fabian Hueske
Hi Yennie, You might want to have a look at the OVER windows of Flink's Table API or SQL [1]. An OVER window computes an aggregate (such as a count) for each incoming record over a range of previous events. For example the query: SELECT ip, successful, COUNT(*) OVER (PARTITION BY ip, successful

Re: A use-case for Flink and reactive systems

2018-07-06 Thread Fabian Hueske
a problem cause you have two separate events). >>> >>> Moreover, a second PoC I was considering is related to Flink CEP. Let's >>> say I am elaborating sensor data, I want to have a rule which is working on >>> the following principle: >>> - If the

Re: State sharing across trigger and evictor

2018-07-16 Thread Fabian Hueske
Hi, I don't think that is possible. The Evictor interface does not provide access to a state store, so there is no way to access state. Best, Fabian 2018-07-10 13:26 GMT+02:00 Jayant Ameta : > Hi, > I'm using the GlobalWindow with a custom CountTrigger (similar to the > CountTrigger provided by

Re: Flink job hangs/deadlocks (possibly related to out of memory)

2018-07-16 Thread Fabian Hueske
Hi Gerard, Thanks for reporting this issue. I'm pulling in Nico and Piotr who have been working on the networking stack lately and might have some ideas regarding your issue. Best, Fabian 2018-07-13 13:00 GMT+02:00 Zhijiang(wangzhijiang999) < wangzhijiang...@aliyun.com>: > Hi Gerard, > > I thou

Re: Flink on Mesos: containers question

2018-07-16 Thread Fabian Hueske
Hi Alexei, Till (in CC) is familiar with Flink's Mesos support in 1.4.x. Best, Fabian 2018-07-13 15:07 GMT+02:00 NEKRASSOV, ALEXEI : > Can someone please clarify how Flink on Mesos in containerized? > > > > On 5-node Mesos cluster I started Flink (1.4.2) with two Task Managers. > Mesos shows “f

[ANNOUNCE] Program for Flink Forward Berlin 2018 has been announced

2018-07-17 Thread Fabian Hueske
Hi everyone, I'd like to announce the program for Flink Forward Berlin 2018. The program committee [1] assembled a program of about 50 talks on use cases, operations, ecosystem, tech deep dive, and research topics. The conference will host speakers from Airbnb, Amazon, Google, ING, Lyft, Microsof

Re: Race between window assignment and same window timeout

2018-07-19 Thread Fabian Hueske
Hi Shay, This sounds very much like the off-by-one bug described by FLINK-9857 [1]. The problem was identified in another recent user ml thread and fixed for Flink 1.5.2 and 1.6.0. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-9857 2018-07-18 19:00 GMT+02:00 Andrey Zagrebin : >

Re: Why data didn't enter the time window in EventTime mode

2018-07-19 Thread Fabian Hueske
Hi Soheil, Hequn is right. This might be an issue with advancing event-time. You can monitor that by checking the watermarks in the web dashboard or print-debug it with a ProcessFunction which can lookup the current watermark. Best, Fabian 2018-07-19 3:30 GMT+02:00 Hequn Cheng : > Hi Soheil, >

Re: Production readiness of Flink Job Stop Service

2018-07-19 Thread Fabian Hueske
Hi Chirag, Stop with savepoint is not mentioned in the 1.5.0 release notes [1]. Since its a frequently requested feature, I'm pretty sure that it would have been mentioned if it was added. Best, Fabian [1] http://flink.apache.org/news/2018/05/25/release-1.5.0.html 2018-07-19 8:39 GMT+02:00 vin

Re: Description of Flink event time processing

2018-07-19 Thread Fabian Hueske
ument. >>> I think you did not to enable access for comments for the link. Would >>> you mind enabling comments for the google doc? >>> >>> Thanks, >>> Rong >>> >>> >>> On Thu, Jul 5, 2018 at 8:39 AM Fabian Hueske wrote: >&g

Re: Parallel stream partitions

2018-07-19 Thread Fabian Hueske
Hi Nick, What Ken said is correct, but let me add two more things. 1) State Usually, you only need to partition (keyBy()) the data if you want to process tuples with the same same key together. Therefore, it is necessary to hold some tuples or intermediate results (like partial or running aggrega

Re: Keeping only latest row by key?

2018-07-19 Thread Fabian Hueske
HI James, Yes, that should also do the trick. Best, Fabian 2018-07-19 16:06 GMT+02:00 Porritt, James : > It looks like the following gives me the result I’m interested in: > > > > batchEnv > > .createInput(dataset) > > .groupBy("id") > > .sortGrou

Re: Parallelism and keyed streams

2018-07-23 Thread Fabian Hueske
Hi, Flink guarantees order only within a partition. For example, if you have the program map_1 -> map_2 and both map functions run with parallelism 4, the order of records in each of the 4 partitions is not changed.. In case of a shuffle (such as a keyBy or change in parallelism) records are shipp

Re: Re: Bootstrapping the state

2018-07-23 Thread Fabian Hueske
Hi Henkka, You might want to consider implementing a dedicated job for state bootstrapping that uses the same operator UUID and state names. That might be easier than integrating the logic into your regular job. I think you have to use the monitoring file source because AFAIK it won't be possible

Re: IoT Use Case, Problem and Thoughts

2018-07-23 Thread Fabian Hueske
disabling > checkpointing). > > I can see the point of making the checkpoint triggering more flexible and > giving some control to the user. In contrast to savepoints, checkpoints are > considered for recovery. My question here would be, what would be the > triggering condition in

Re: Question regarding State in full outer join

2018-07-24 Thread Fabian Hueske
Hi Darshan, The join implementation in SQL / Table API does what is demanded by the SQL semantics. Hence, what results to emit and also what data to store (state) to compute these results is pretty much given. You can think of the semantics of the join as writing both streams into a relational DBM

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Fabian Hueske
Hi, The problem is that Flink tracks which files it has read by remembering the modification time of the file that was added (or modified) last. We use the modification time, to avoid that we have to remember the names of all files that were ever consumed, which would be expensive to check and sto

Re: S3 file source - continuous monitoring - many files missed

2018-07-25 Thread Fabian Hueske
Hi, Thanks for creating the Jira issue. I'm not sure if I would consider this a blocker but it is certainly an important problem to fix. Anyway, in the original version Flink checkpoints the modification timestamp up to which all files have been read (or at least up to which point it *thinks* to

Re: S3 file source - continuous monitoring - many files missed

2018-07-25 Thread Fabian Hueske
Hi, First of all, the ticket reports a bug (or improvement or feature suggestion) such that others are aware of the problem and understand its cause. At some point it might be picked up and implemented. In general, there is no guarantee whether or when this happens, but the Flink community is of

Re: Working out through individual messages in Flink

2018-07-30 Thread Fabian Hueske
Hi, Flink processes streams record by record, instead of micro-batching records together. Since every record comes by itself, there is no for-each. Simple record-by-record transformations can be done with a MapFunction, filtering out records with a FilterFunction. You can also implement a FlatMapF

Re: Working out through individual messages in Flink

2018-07-30 Thread Fabian Hueske
p://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author wi

Re: Questions on Unbounded number of keys

2018-07-30 Thread Fabian Hueske
Hi Chang, The state handle objects are not created per key but just once per function instance. Instead they route state accesses to the backend (JVM heap or RocksDB) for the currently active key. Best, Fabian 2018-07-30 12:19 GMT+02:00 Chang Liu : > Hi Andrey, > > Thanks for your reply. My que

Re: watermark VS window trigger

2018-07-31 Thread Fabian Hueske
Hi, Watermarks are not holding back records. Instead they define the event-time at an operator (as Vino said) and can trigger the processing of data if the logic of an operator is based on time. For example, a window operator can emit complete results for a window once the time passed the window's

Re: Small-files source - partitioning based on prefix of file

2018-07-31 Thread Fabian Hueske
Hi Averell, The records emitted by the monitoring tasks are "just" file splits, i.e., meta information that defines which data to read from where. The reader tasks receive these splits and process them by reading the corresponding files. You could of course partition the splits based on the file

Re: Event time didn't advance because of some idle slots

2018-07-31 Thread Fabian Hueske
Hi, If you are using a custom source, you can call SourceContext.markAsTemporarilyIdle() to indicate that a task is currently not producing new records [1]. Best, Fabian 2018-07-31 8:50 GMT+02:00 Reza Sameei : > It's not a real solution; but why you don't change the parallelism for > your `Sour

Re: Description of Flink event time processing

2018-07-31 Thread Fabian Hueske
Levy : > Fabian, > > You have any time to review the changes? > > On Thu, Jul 19, 2018 at 2:19 AM Fabian Hueske wrote: > >> Hi Elias, >> >> Thanks for the update! >> I'll try to have another look soon. >> >> Best, Fabian >> >> 2018

Re: Small-files source - partitioning based on prefix of file

2018-07-31 Thread Fabian Hueske
Hi Averell, please find my answers inlined. Best, Fabian 2018-07-31 13:52 GMT+02:00 Averell : > Hi Fabian, > > Thanks for the information. I will try to look at the change to that > complex > logic that you mentioned when I have time. That would save one more shuffle > (from 1 to 0), wouldn't t

Re: Converting a DataStream into a Table throws error

2018-08-01 Thread Fabian Hueske
Hi I think you are mixing Java and Scala dependencies. org.apache.flink.streaming.api.datastream.DataStream is the DataStream of the Java DataStream API. You should use the DataStream of the Scala DataStream API. Best, Fabian 2018-08-01 14:01 GMT+02:00 Mich Talebzadeh : > Hi, > > I believed I t

Re: Converting a DataStream into a Table throws error

2018-08-01 Thread Fabian Hueske
LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own r

Re: Multiple output operations in a job vs multiple jobs

2018-08-02 Thread Fabian Hueske
Hi, Paul is right. Which and how much data is stored in state for a window depends on the type of the function that is applied on the windows: - ReduceFunction: Only the reduced value is stored - AggregateFunction: Only the accumulator value is stored - WindowFunction or ProcessWindowFunction: Al

Re: Description of Flink event time processing

2018-08-02 Thread Fabian Hueske
discussion of your document. Elias, do you want to put your document into Markdown and open a PR for the documentation? Thanks, Fabian 2018-07-31 18:16 GMT+02:00 Fabian Hueske : > Hi Elias, > > Sorry for the delay. I just made a pass over the document. > I think it is very good. > >

Re: Event Time Session Window does not trigger..

2018-08-06 Thread Fabian Hueske
Hi, By setting the time characteristic to EventTime, you enable the internal handling of record timestamps and watermarks. In contrast to EventTime, ProcessingTime does not require any additional data. You can use both, EventTime and ProcessingTime in the same application and StreamExecutionEnvir

Re: Converting a DataStream into a Table throws error

2018-08-06 Thread Fabian Hueske
)Unit >>>> [error] cannot be applied to (String, org.apache.flink.streaming. >>>> api.environment.StreamExecutionEnvironment, Symbol, Symbol, Symbol, >>>> Symbol) >>>> [error] tableEnv.registerDataStream("table1", streamExecEnv, 'key, &g

Re: Accessing source table data from hive/Presto

2018-08-07 Thread Fabian Hueske
Hi Mugunthan, this depends on the type of your job. Is it a batch or a streaming job? Some queries could be ported to Flink's SQL API as suggested by the link that Hequn posted. In that case, the query would be executed in Flink. Other options are to use a JDBC InputFormat or persisting the resul

Re: Small-files source - partitioning based on prefix of file

2018-08-08 Thread Fabian Hueske
Hi Averall, As Vino said, checkpoints store the state of all operators of an application. The state of a monitoring source function is the position in the currently read split and all splits that have been received and are currently pending. In case of a recovery, the splits are recovered and the

Re: Filter-Join Ordering Issue

2018-08-08 Thread Fabian Hueske
Hi Dylan, Yes, that's a bug. As you can see from the plan, the partitioning step is pushed past the Filter. This is possible, because the optimizer knows that a Filter function cannot modify the data (it only removes records). A workaround should be to implement the filter as a FlatMapFunction. A

Re: Accessing source table data from hive/Presto

2018-08-08 Thread Fabian Hueske
om>: > Thanks for the reply. I was mainly thinking of the usecase of streaming > job. > In the approach to port to Flink's SQL API, is it possible to read parquet > data from S3 and register table in flink? > > > On Tue, Aug 7, 2018 at 1:05 PM, Fabian Hueske wrote: &g

Re: Filter-Join Ordering Issue

2018-08-08 Thread Fabian Hueske
I've created FLINK-10100 [1] to track the problem and suggest a solution and workaround. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-10100 2018-08-08 10:39 GMT+02:00 Fabian Hueske : > Hi Dylan, > > Yes, that's a bug. > As you can see from the plan, th

Re: Need help regarding Flink Batch Application

2018-08-08 Thread Fabian Hueske
The code or the execution plan (ExecutionEnvironment.getExecutionPlan()) of the job would be interesting. 2018-08-08 10:26 GMT+02:00 Chesnay Schepler : > What have you tried so far to increase performance? (Did you try different > combinations of -yn and -ys?) > > Can you provide us with your app

Re: JDBCInputFormat and SplitDataProperties

2018-08-08 Thread Fabian Hueske
Hi Alexis, First of all, I think you leverage the partitioning and sorting properties of the data returned by the database using SplitDataProperties. However, please be aware that SplitDataProperties are a rather experimental feature. If used without query parameters, the JDBCInputFormat generate

Listing in Powered By Flink directory

2018-08-08 Thread Fabian Hueske
Hi everybody, The Flink community maintains a directory of organizations and projects that use Apache Flink [1]. Please reply to this thread if you'd like to add an entry to this list. Thanks, Fabian [1] https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink

Re: Listing in Powered By Flink directory

2018-08-08 Thread Fabian Hueske
Thanks Amit! I've added Limeroad to the list with your description. Best, Fabian 2018-08-08 14:12 GMT+02:00 amit.jain : > Hi Fabian, > > We at Limeroad, are using Flink for multiple use-cases ranging from ETL > jobs, ClickStream data processing, real-time dashboard to CEP. Could you > list us on

Re: State in the Scala DataStream API

2018-08-08 Thread Fabian Hueske
Hi Juan, The state will be purged if you return None instead of a Some. However, this only happens when the function is called for a specific key, i.e., state won't be automatically removed after some time. If this is your use case, you have to implement a ProcessFunction and use timers to manuall

Re: UTF-16 support for TextInputFormat

2018-08-09 Thread Fabian Hueske
Hi David, Did you try to set the encoding on the TextInputFormat with TextInputFormat tif = ... tif.setCharsetName("UTF-16"); Best, Fabian 2018-08-08 17:45 GMT+02:00 David Dreyfus : > Hello - > > It does not appear that Flink supports a charset encoding of "UTF-16". It > particular, it doesn't

Re: JDBCInputFormat and SplitDataProperties

2018-08-09 Thread Fabian Hueske
nericInputSplit takes two parameters: > partitionNumber and totalNumberOfPartitions. Should I assume that there are > 2 splits divided into 24 partitions? > > Regards, > Alexis. > > > > On Wed, Aug 8, 2018 at 11:57 AM Fabian Hueske wrote: > >> Hi Alexis, >&

Re: Table API, custom window

2018-08-09 Thread Fabian Hueske
Hi, regarding the plans. There are no plans to support custom window assigners and evictors. There were some thoughts about supporting different result update strategies that could be used to return early results or update results in case of late data. However, these features are currently not on

Re: Small-files source - partitioning based on prefix of file

2018-08-10 Thread Fabian Hueske
Hi Averell, One comment regarding what you said: > As my files are small, I think there would not be much benefit in checkpointing file offset state. Checkpointing is not about efficiency but about consistency. If the position in a split is not checkpointed, your application won't operate with e

Re: UTF-16 support for TextInputFormat

2018-08-10 Thread Fabian Hueske
e BOM indicates Little Endian and the caller indicates > UTF-16BE, Flink should rewrite the charsetName as UTF-16LE. > > I hope this makes sense and that I haven't been testing incorrectly or > misreading the code. > > Thank you, > David > > On Thu, Aug 9, 2018

Re: Dataset.distinct - Question on deterministic results

2018-08-10 Thread Fabian Hueske
Hi Will, The distinct operator is implemented as a groupBy(distinctKeys) and a ReduceFunction that returns the first argument. Hence, it depends on the order in which the records are processed by the ReduceFunction. Flink does not maintain a deterministic order because it is quite expensive in di

Re: Flink Rebalance

2018-08-10 Thread Fabian Hueske
Hi, Elias and Paul have good points. I think the performance degradation is mostly to the lack of function chaining in the rebalance case. If all steps are just map functions, they can be chained in the no-rebalance case. That means, records are passed via function calls. If you add rebalancing,

Re: JDBCInputFormat and SplitDataProperties

2018-08-10 Thread Fabian Hueske
upBy(0, 1) >> .reduceGroup(groupReducer) >> .withForwardedFields("_1") >> .output(outputFormat) >> >> It seems to work well, and the semantic annotation does remove a hash >> partition from the execution plan. >> >> Regards, >> Ale

Re: Small-files source - partitioning based on prefix of file

2018-08-10 Thread Fabian Hueske
Hi Averell, Conceptually, you are right. Checkpoints are taken at every operator at the same "logical" time. It is not important, that each operator checkpoints at the same wallclock time. Instead, the need to take a checkpoint when they have processed the same input. This is implemented with so-c

Re: flink requires table key when insert into upsert table sink

2018-08-10 Thread Fabian Hueske
Hi Henry, The problem is that the table that results from the query does not have a unique key. You can only use an upsert sink if the table has a (composite) unique key. Since this is not the case, you cannot use upsert sink. However, you can implement a StreamRetractionTableSink which allows to

Re: Flink socketTextStream UDP connection

2018-08-13 Thread Fabian Hueske
Hi, ExecutionEnvironment.socketTextStream is deprecated and it is very likely that it will be removed because of its limited use. I would recommend to have at the implementation of the SourceFunction [1] and adapt it to your needs. Best, Fabian [1] https://github.com/apache/flink/blob/master/fli

Re: JDBCInputFormat and SplitDataProperties

2018-08-13 Thread Fabian Hueske
ong configuration of the cluster; there was only 1 > task manager with 1 slot. > > If I submit a job with "flink run -p 24 ...", will the job hang until at > least 24 slots are available? > > Regards, > Alexis. > > On Fri, 10 Aug 2018, 14:01 Fabian Hueske wrote: >

Re: Tuning checkpoint

2018-08-13 Thread Fabian Hueske
Hi Mingliang, let me answer your second question first: > Another question is about the alignment buffer, I thought it was only used for multiple input stream cases. But for keyed process function , what is actually aligned? When a task sends records to multiple downstream tasks (task not operat

Re: Introduce Barriers in stream source

2018-08-13 Thread Fabian Hueske
Hi, It is sufficient to implement the CheckpointedFunction interface. Since SourceFunctions emit records in a separate thread, you need to ensure that not record is emitted while the shapshotState method is called. Flink provides a lock to synchronize data emission and state snapshotting. See the

Re: Kerberos Configuration Does Not Apply To Krb5LoginModule

2018-08-13 Thread Fabian Hueske
Hi Paul, Maybe Aljoscha (in CC) can help you with this question. AFAIK, he has some experience with Flink and Kerberos. Best, Fabian 2018-08-13 14:51 GMT+02:00 Paul Lam : > Hi, > > I built Flink from the latest 1.5.x source code, and got some strange > outputs from the command line when submitt

Re: Managed Keyed state update

2018-08-14 Thread Fabian Hueske
Hi, It is recommended to always call update(). State modifications by modifying objects is only possible because the heap based backends do not serialize or copy records to avoid additional costs. Hence, this is rather a side effect than a provided API. As soon as you change the state backend, st

Re: Limit on number of files to read for Dataset

2018-08-14 Thread Fabian Hueske
Hi, Flink InputFormats generate their InputSplits sequentially on the JobManager. These splits are stored in the heap of the JM process and handed out to SourceTasks when they request them lazily. Split assignment is done by a InputSplitAssigner, that can be customized. FileInputFormats typically

Re: how to assign issue to someone

2018-08-14 Thread Fabian Hueske
Hi, I've given you Contributor permissions for Jira and assigned the issue to you. You can now also assign other issue to you. Looking forward to your contribution. Best, Fabian 2018-08-14 19:45 GMT+02:00 Guibo Pan : > Hello, I am a new user for flink jira. I reported an issue and would like >

Re: watermark does not progress

2018-08-15 Thread Fabian Hueske
Hi John, Watermarks cannot make progress if you have stream partitions that do not carry any data. What kind of source are you using? Best, Fabian 2018-08-15 4:25 GMT+02:00 vino yang : > Hi Johe, > > In local mode, it should also work. > When you debug, you can set a breakpoint in the getCurren

Re: Limit on number of files to read for Dataset

2018-08-15 Thread Fabian Hueske
tionManager$1.get(PoolingHttpClientConnectionMan > ager.java:263) > at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke( > DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at com.amazon.ws.emr.hadoop.

Re: What's the advantage of using BroadcastState?

2018-08-20 Thread Fabian Hueske
Hi, I've recently published a blog post about Broadcast State [1]. Cheers, Fabian [1] https://data-artisans.com/blog/a-practical-guide-to-broadcast-state-in-apache-flink 2018-08-20 3:58 GMT+02:00 Paul Lam : > Hi Rong, Hequn > > Your answers are very helpful! Thank you! > > Best Regards, > Paul

Re: Will idle state retention trigger retract in dynamic table?

2018-08-21 Thread Fabian Hueske
Hi, No, it won't. I will simply remove state that has not been accessed for the configured time but not change the result. For example, if you have a GROUP BY aggregation and the state for a grouping key is removed, the operator will start a new aggregation if a record with the removed grouping ke

Re: UTF-16 support for TextInputFormat

2018-08-21 Thread Fabian Hueske
blocker or that I've identified the right component. > I'm afraid I don't have the bandwidth or knowledge to make the kind of > pull request you really need. I do hope my suggestions prove a little > useful. > > Thank you, > David > > On Fri, Aug 10, 2018 at

Re: Will idle state retention trigger retract in dynamic table?

2018-08-21 Thread Fabian Hueske
the dynamic table. > > Best, > Henry > > > 在 2018年8月21日,下午4:16,Fabian Hueske 写道: > > Hi, > > No, it won't. I will simply remove state that has not been accessed for > the configured time but not change the result. > For example, if you have a GROUP BY agg

Re: Will idle state retention trigger retract in dynamic table?

2018-08-21 Thread Fabian Hueske
rtain threshold. In that case, the query would only operate on tail of the stream, e.g., the last day or week. Best, Fabian 2018-08-21 12:03 GMT+02:00 徐涛 : > Hi Fabian, > Is the behavior a bit weird? Because it leads to data inconsistency. > > Best, > Henry > > > 在 2018年8月21日,下午5

Re: Semantic when table joins table from window

2018-08-21 Thread Fabian Hueske
Hi, The semantics of a query do not depend on the way that it is used. praiseAggr is a table that grows by one row per second and article_id. If you use that table in a join, the join will fully materialize the table. This is a special case because the same row is added multiple times, so the stat

Re: lack of function and low usability of provided function

2018-08-23 Thread Fabian Hueske
Hi Henry, Flink is an open source project. New build-in functions are constantly contributed to Flink. Right now, there are more than 5 PRs open to add or improve various functions. If you find that some functions are not working correctly or could be improved, you can open a Jira issue. The same

Re: would you join a Slack workspace for Flink?

2018-08-27 Thread Fabian Hueske
Hi, I don't think that recommending Gists is a good idea. Sure, well formatted and highlighted code is nice and much better than posting screenshots but Gists can be deleted. Deleting a Gist would make an archived thread useless. I would definitely support instructions on how to add code to a mail

Re: [DISCUSS] Remove the slides under "Community & Project Info"

2018-08-27 Thread Fabian Hueske
I agree to remove the slides section. A lot of the content is out-dated and hence not only useless but might sometimes even cause confusion. Best, Fabian Am Mo., 27. Aug. 2018 um 08:29 Uhr schrieb Renjie Liu < liurenjie2...@gmail.com>: > Hi, Stephan: > Can we put project wiki in some place? I

Re: Raising a bug in Flink's unit test scripts

2018-08-27 Thread Fabian Hueske
Hi Averell, If this is a more general error, I'd prefer a separate issue & PR. Thanks, Fabian Am Fr., 24. Aug. 2018 um 13:15 Uhr schrieb Averell : > Good day everyone, > > I'm writing unit test for the bug fix FLINK-9940, and found that in some > existing tests in flink-fs-tests cannot detect t

Re: would you join a Slack workspace for Flink?

2018-08-27 Thread Fabian Hueske
paste text instead of screenshots of text > 3. you keep formatting when pasting code in order to keep the code readable > 4. there are enough import statements to avoid ambiguities > > > > On Mon, Aug 27, 2018 at 10:51 AM Fabian Hueske wrote: > >> Hi, >> >> I

Re: What's the advantage of using BroadcastState?

2018-08-28 Thread Fabian Hueske
B 56063, > Geschäftsführer: Bo PENG, Qiuen Peng, Shengli Wang > This e-mail and its attachments contain confidential information from > HUAWEI, which is intended only for the person or entity whose address is > listed above. Any use of the information contained herein in any way > (including

Re: Small-files source - partitioning based on prefix of file

2018-08-28 Thread Fabian Hueske
Hi Averell, Barriers are injected into the regular data flow by source functions. In case of a file monitoring source, the barriers are injected into the stream of file splits that are passed to the ContinuousFileMonitoringFunction. The CFMF puts the splits into a queue and processes them with a d

Re: Semantic when table joins table from window

2018-08-28 Thread Fabian Hueske
GROUP BY >> article_id", the answer is "101,102,103" >> 2. if you change your sql to s"SELECT last_value(article_id) FROM >> praise", the answer is "100" >> >> Best, Hequn >> >> On Tue, Aug 21, 2018 at 8:52 PM, 徐涛 wrote:

Re: Small-files source - partitioning based on prefix of file

2018-08-28 Thread Fabian Hueske
Hi, CMCF is not a source, only the file monitoring function is. Barriers are injected by the FMF when the JM sends a checkpoint message. The barriers then travel to the CMCF and trigger the Checkpoint ING. Fabian Averell schrieb am Di., 28. Aug. 2018, 12:02: > Hello Fabian, > > Thanks for t

Re: Queryable state and state TTL

2018-08-29 Thread Fabian Hueske
Hi, I guess that this is not a fundamental problem but just a limitation in the current implementation. Andrey (in CC) who implemented the TTL support should be able to give more insight on this issue. Best, Fabian Am Mi., 29. Aug. 2018 um 04:06 Uhr schrieb vino yang : > Hi Elias, > > From the

Re: Why don't operations on KeyedStream return KeyedStream?

2018-08-29 Thread Fabian Hueske
Hi Elias, Your assumption is correct. An operation on a KeyedStream results in a regular DataStream because the operation might change the data type or the key field. Hence, it is not guaranteed that the same keys can be extracted from the output of the keyed operation. However, there is a way to

Re: Missing Calcite SQL functions in table API

2018-09-05 Thread Fabian Hueske
Hi You are using SQL syntax in a Table API query. You have to stick to Table API syntax or use SQL as tEnv.sqlQuery("SELECT col1,CONCAT('field1:',col2,',field2:',CAST(col3 AS string)) FROM csvTable") The Flink documentation lists all supported functions for Table API [1] and SQL [2]. Best, Fabi

Re: How does flink read a DataSet?

2018-09-12 Thread Fabian Hueske
Actually, some parts of Flink's batch engine are similar to streaming as well. If the data does not need to be sorted or put into a hash-table, the data is pipelined (like in many relational database systems). For example, if you have a job that joins two inputs with a HashJoin, only the build side

Re: How does flink read a DataSet?

2018-09-12 Thread Fabian Hueske
> is pushed from operators to operators in both stream and batch > > On Wed 12 Sep, 2018, 4:28 PM Fabian Hueske, wrote: > >> Actually, some parts of Flink's batch engine are similar to streaming as >> well. If the data does not need to be sorted or put into a hash

Re: Weird behaviour after change sources in a job.

2018-09-13 Thread Fabian Hueske
Hi, The problem is that Flink SQL does not expose the UIDs of the generated operators. We've met that issue before, but it is still not fully clear what would be the best way to this accessible. Best, Fabian 2018-09-13 5:15 GMT-04:00 Dawid Wysakowicz : > Hi Oleksandr, > > The mapping of state t

Re: Potential bug in Flink SQL HOP and TUMBLE operators

2018-09-17 Thread Fabian Hueske
Hi John, Are you sure that the first rows of the first window are dropped? When a query with processing time windows is terminated, the last window is not computed. This in fact intentional and does not apply to event-time windows. Best, Fabian 2018-09-17 17:21 GMT+02:00 John Stone : > Hello,

Re: Expire records in FLINK dynamic tables?

2018-09-17 Thread Fabian Hueske
Hi Chen, Yes, this is possible. Have a look at the configuration of idle state retention time [1]. Best, Fabian [1] https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/table/streaming.html#idle-state-retention-time 2018-09-17 20:10 GMT+02:00 burgesschen : > Hi everyone, > I'm tryin

Re: Potential bug in Flink SQL HOP and TUMBLE operators

2018-09-17 Thread Fabian Hueske
Hmm, that's interesting. HOP and TUMBLE window aggregations are directly translated into their corresponding DataStream counterparts (Sliding, Tumble). There should be no filtering of records. I assume you tried a simple query like "SELECT * FROM MyEventTable" and received all expected data? Fabi

Re: CombinableGroupReducer says The Iterable can be iterated over only once

2018-09-17 Thread Fabian Hueske
Hi Alejandro, asScala calls iterator() the first time and reduce() another time. These iterators can only be iterated once because they are possibly backed by multiple sorted files which have been spilled to disk and are merge-sorted while iterating. I'm actually surprised that you found this cod

Re: In which case the StreamNode has multiple output edges?

2018-09-18 Thread Fabian Hueske
Hi, Any operator can have multiple out-going edges. If you implement something like: DataStream instream = ... DataStream outstream1 = instream.map(new MapFunc1()); DataStream outstream2 = instream.map(new MapFunc2()); The node representing instream will have two outgoing edges that lead to the

Re: Migration to Flink 1.6.0, issues with StreamExecutionEnvironment.registerCachedFile

2018-09-18 Thread Fabian Hueske
Hi, The functionality of the SQL ScalarFunction is backed by Flink's distributed cache and just passes on the function call. I tried it locally on my machine and it works for me. What is your setup? Are you running on Yarn? Maybe Chesnay or Dawid (added to CC) can help to track the problem down.

Re: Add operator ids for an already running job

2018-09-18 Thread Fabian Hueske
The auto-generated ids are included in the savepoint data. So, it should be possible to them from the savepoint. However, AFAIK, there is no tool to do that. You'd need to manually dig into the serialized data. Cheers, Fabian 2018-09-18 13:30 GMT+02:00 vino yang : > Hi Paul, > > Referring to the

Re: Add operator ids for an already running job

2018-09-18 Thread Fabian Hueske
, > Paul Lam > > > 在 2018年9月18日,20:09,Fabian Hueske 写道: > > The auto-generated ids are included in the savepoint data. So, it should > be possible to them from the savepoint. > However, AFAIK, there is no tool to do that. You'd need to manually dig > into the seriali

Re: Potential bug in Flink SQL HOP and TUMBLE operators

2018-09-18 Thread Fabian Hueske
Hi John, Just to clarify, this missing data is due to the starting overhead and not due to a bug? Best, Fabian 2018-09-18 15:35 GMT+02:00 John Stone : > Thank you all for your assistance. I believe I've found the root cause if > the behavior I am seeing. > > If I just use "SELECT * FROM MyEven

<    1   2   3   4   5   6   7   8   9   10   >