Hi,
We use Flink to process our streaming data and need to convert categorical
variables to a binary encoding for input to an ML algorithm. Are there any
references on how to do this in Flink?
Regards,
Adarsh
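Flink did not ship a ready-made categorical encoder at the time, so one common approach is to build a dictionary of category values first and apply the encoding inside a map function. A minimal sketch of the encoding logic itself in plain Java; the class and method names are illustrative, not a Flink API:

```java
import java.util.Arrays;
import java.util.List;

public class OneHotEncoder {
    private final List<String> categories; // known category values, collected beforehand

    public OneHotEncoder(List<String> categories) {
        this.categories = categories;
    }

    // Returns a binary vector with a 1 at the category's index; all zeros if unknown.
    public double[] encode(String value) {
        double[] vec = new double[categories.size()];
        int idx = categories.indexOf(value);
        if (idx >= 0) {
            vec[idx] = 1.0;
        }
        return vec;
    }

    public static void main(String[] args) {
        OneHotEncoder enc = new OneHotEncoder(Arrays.asList("red", "green", "blue"));
        System.out.println(Arrays.toString(enc.encode("green"))); // [0.0, 1.0, 0.0]
    }
}
```

In a Flink job this `encode` call would typically live inside a `MapFunction`, with the dictionary either broadcast or known up front.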
Hi,
Using Flink 1.2.0, I ran into this issue:
https://issues.apache.org/jira/browse/FLINK-6117
The issue is fixed in version 1.3.0, but I have some reasons to try to
find a workaround.
What I did:
1. Changed the source according to
https://github.com/apache
I am still struggling to solve this problem.
I have no doubt that the JOB should automatically restart after restarting
the TASK MANAGER in YARN MODE. Is that a misunderstanding on my part?
The problem seems to be that *the JOB MANAGER still tries to connect to the
old TASK MANAGER even after a new TASK MANAGER container has been created.*
Hello,
Has anybody experienced the following error on AWS EMR 5.8.0 with Flink 1.3.1?
java.lang.ClassCastException:
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
cannot be cast to com.google.protobuf.Message
Thanks,
Hi All,
I looked into an earlier email thread about broadcasting config through a
connected stream, and I couldn't find the conclusion.
I can't use the approach below, since I need the config to be published to
all operator instances, but I also need keyed state for external querying.
streamToBeConfigured.
We have the same issue. We are finding that we cannot express the data flow in
a natural way because of unnecessary spilling. Instead, we're building our own
operators which combine multiple steps together and essentially hide it from
Flink, or sometimes we even have to read an input dataset once p
Thanks Gordon for your response. I have around 80 parallel flatmap operator
instances, and each instance requires 3 states. Of these, one is user
state, in which each operator will have unique users' data, and I need this
data to be queryable. The other two states are kind of static states which
ar
Thanks a lot everyone. I have the user data ingested from kafka and it is
keyed by userid. There are around 80 parallel flatmap operator instances
after keyby, and there are around a few million users. The map state includes
userid as the key and some value. I guess I will try the approach that
Aljosc
Thank you, Kien!
On Tue, Sep 5, 2017 at 8:01 AM, Kien Truong wrote:
> Hi,
>
> In my experience, RocksDB uses very little CPU, and doesn't need a
> dedicated CPU.
>
> However, it's quite disk intensive. You'd need fast, ideally dedicated
> SSDs to achieve the best performance.
>
> Regards,
>
> Ki
Hi,
This is mostly correct, but you cannot register a timer in open() because we
don't have an active key there. You can only register a timer in process()
and onTimer().
In your case, I would suggest somehow clamping the timestamp to the nearest 2
minute (or whatever) interval, or keeping an e
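The clamping suggested here is simple modular arithmetic: round each timestamp up to the next interval boundary so that all registrations within the same window coalesce onto one timer. A small sketch, using the 2-minute interval from this thread as the example:

```java
public class TimerClamp {
    static final long INTERVAL_MS = 2 * 60 * 1000; // 2 minutes

    // Rounds a timestamp up to the next interval boundary, so repeated
    // registrations within the same window land on the same timer.
    static long clampToNextInterval(long timestampMs) {
        return (timestampMs / INTERVAL_MS + 1) * INTERVAL_MS;
    }

    public static void main(String[] args) {
        System.out.println(clampToNextInterval(125_000)); // 240000 (next 2-min boundary)
    }
}
```

Registering a timer for `clampToNextInterval(timestamp)` instead of the raw timestamp keeps the number of distinct timers per key bounded.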
Hi Gábor,
thank you very much for your explanation, that makes a lot of sense.
Best regards,
Urs
On 05.09.2017 14:32, Gábor Gévay wrote:
> Hi Urs,
>
> Yes, the 1/10th ratio is just a very loose rule of thumb. I would
> suggest to try both the SORT and HASH strategies with a workload that
> is a
Hi,
In my experience, RocksDB uses very little CPU, and doesn't need a
dedicated CPU.
However, it's quite disk intensive. You'd need fast, ideally dedicated
SSDs to achieve the best performance.
Regards,
Kien
On 9/5/2017 1:15 PM, Bowen Li wrote:
Hi guys,
Does RocksDB need a dedicated C
Hi,
You can register a processing-time timer inside the onTimer and the open
functions to have a timer that runs periodically.
Pseudo-code example:

    ValueState lastRuntime;

    void open() {
        ctx.timerService().registerProcessingTimeTimer(current.timestamp + 6);
    }

    void onTimer() {
        // Run the p
Hi Navneeth,
Currently, I don't think there is any built-in functionality to trigger
onTimer periodically.
As for the second part of your question, do you mean that you want to query
on which key the fired timer was registered from? I think this also isn't
possible right now.
I'm looping in Aljos
I have a streaming application that has a keyBy operator followed by an
operator working on the keyed values (a custom sum operator). If the map
operator and the aggregate operator are running on the same Task Manager, will
Flink always serialize and deserialize the tuples, or is there an optimization
Hi,
I'm running a Flink batch job that reads almost 1 TB of data from S3 and
then performs operations on it. A list of filenames is distributed among
the TM's and each subset of files is read from S3 from each TM. This job
errors out at the read step due to the following error:
java.lang.Excepti
Hi!
Yes, backpressure should also increase the latency value calculated from
LatencyMarkers.
LatencyMarkers are special events that flow along with the actual stream
records, so they should also be affected by backpressure.
Are you asking because you observed otherwise?
Cheers,
Gordon
Hi Navneeth,
Answering your three questions separately:
1. Yes. Your MapState will be backed by RocksDB, so when removing an entry
from the map state, the state will be removed from the local RocksDB as
well.
2. If state classes are not POJOs, they will be serialized by Kryo, unless a
custom ser
Hi Urs,
Yes, the 1/10th ratio is just a very loose rule of thumb. I would
suggest to try both the SORT and HASH strategies with a workload that
is as similar as possible to your production workload (similar data,
similar parallelism, etc.), and see which one is faster for your
specific use case.
Hi Tony,
Currently, the functionality that you described does not exist in the
consumer. When a topic is deleted, as far as I know, the consumer would
simply consider the partitions as unreachable and continue to try fetching
records from them until they are up again.
I'm not entirely sure if a re
Hello,
There is a Flink Improvement Proposal to redesign the iterations:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=66853132
This will address the termination issue.
Best,
Gábor
On Mon, Sep 4, 2017 at 11:00 AM, Xingcan Cui wrote:
> Hi Peter,
>
> That's a good idea, but
Hi all,
we have a DataSet pipeline which reads CSV input data and then
essentially does a combinable GroupReduce via first(n).
In our first iteration (readCsvFile -> groupBy(0) -> sortGroup(0) ->
first(n)), we got a jobgraph like this:
source --[Forward]--> combine --[Hash Partition on 0, Sort]-
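For context, the logical result of the groupBy -> sortGroup -> first(n) chain is: for each key, sort the group and keep its first n records. A plain-Java sketch of that semantics only, not the Flink DataSet API; here the sort is on the value field for illustration:

```java
import java.util.*;
import java.util.stream.*;

public class FirstNPerGroup {
    // For (key, value) pairs: group by key, sort each group, keep the first n —
    // the logical outcome of groupBy(...).sortGroup(...).first(n).
    static Map<String, List<String>> firstN(List<String[]> records, int n) {
        return records.stream()
                .collect(Collectors.groupingBy(
                        r -> r[0],
                        Collectors.collectingAndThen(
                                Collectors.mapping(r -> r[1], Collectors.toList()),
                                vals -> {
                                    Collections.sort(vals);
                                    return vals.subList(0, Math.min(n, vals.size()));
                                })));
    }

    public static void main(String[] args) {
        List<String[]> recs = Arrays.asList(
                new String[]{"a", "3"}, new String[]{"a", "1"},
                new String[]{"a", "2"}, new String[]{"b", "9"});
        System.out.println(firstN(recs, 2)); // {a=[1, 2], b=[9]}
    }
}
```

Flink can run this as a combinable group-reduce, which is why the combine step shows up in the job graph above.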
How are you determining that your data is stale? Also, if you want to know the
key, why don't you store the key in your state as well?
--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/