RE: Kyro Intermittent Exception for Large Data

2016-02-18 Thread Ken Krugler
ok at ? As i think before fink use Java serialization ? > > Thanks a lot. > > ERROR 1 > > > > ERROR 2 > > > > > -- > Welly Tambunan > Triplelands > > http://weltam.wordpress.com > http://www.triplelands.com -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Javadoc version

2016-03-15 Thread Ken Krugler
Perusing the docs, and noticed this... https://ci.apache.org/projects/flink/flink-docs-release-1.0/api/java/ says "flink 1.0-SNAPSHOT API" I assume this shouldn't be called the snapshot version. -- Ken ---------- Ken Krugler +1 530-210-6378 http://www.sca

Input on training exercises

2016-03-18 Thread Ken Krugler
Hi list, What's the right way to provide input on the training exercises? Thanks, -- Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

RE: Javadoc version

2016-03-18 Thread Ken Krugler
that the docs / javadocs are slightly out of > sync, but usually we only merge fixes to the "release-x.y." branches, so > there should never be inconsistencies, just fixes. > > > > On Tue, Mar 15, 2016 at 5:34 PM, Ken Krugler > wrote: > Perusing the docs

RE: S3 Timeouts with lots of Files Using Flink 0.10.2

2016-03-19 Thread Ken Krugler
on.io.FileInputFormat$InputSplitOpenThread.run(FileInputFormat.java:841) > > at > org.apache.flink.api.common.io.FileInputFormat$InputSplitOpenThread.waitForCompletion(FileInputFormat.java:890) > > at > org.apache.flink.api.common.io.FileInputFormat.open(FileInputFormat.

Re: Integrate Flink with S3 on EMR cluster

2016-04-04 Thread Ken Krugler
ileInputFormat.java:450) > > at > org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:57) > > at > org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:156) > > ... 23 more > > > > Somehow, it's still tries to use EMRFS (which may be a valid thing?), but it > is failing to initialize. I don't know enough about EMRFS/S3 interop so don't > know how diagnose it further. > > I run Flink 1.0.0 compiled for Scala 2.11. > > Any advice on how to make it work is highly appreciated. > > > > Thanks, > > Timur > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Processing millions of messages in milliseconds real time -- Architecture guide required

2016-04-21 Thread Ken Krugler
essor. > I thought about using data cache as well for serving the data > The data cache should have the capability to serve the historical data in > milliseconds (may be upto 30 days of data) > > > -- > Thanks > Deepak -- Ken Krugler +1 530-

Command line arguments getting munged with CLI?

2016-04-26 Thread Ken Krugler
through as is. Any ideas? Thanks, — Ken PS - this is with flink 1.0.2 -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com <http://www.scaleunlimited.com/> custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: "No more bytes left" at deserialization

2016-04-26 Thread Ken Krugler
rs.CoGroupDriver.prepare(CoGroupDriver.java:97) > >>>>> at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450) > >>>>> ... 3 more > >>>>> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' > >>>>&

Tuning parallelism in cascading-flink planner

2016-04-26 Thread Ken Krugler
Hi all, I’m busy tuning up a workflow (defined w/Cascading, planned with Flink) that runs on a 5 slave EMR cluster. The default parallelism (from the Flink planner) is set to 40, since I’ve got 5 task managers (one per node) and 8 slots/TM. But this seems to jam things up, as I see simultaneou

Wildcards with --classpath parameter in CLI

2016-04-26 Thread Ken Krugler
Hi all, If I want to include all of the jars in a directory, I thought I could do --classpath file://http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Command line arguments getting munged with CLI?

2016-04-27 Thread Ken Krugler
gt;) it seems that double vs > single dash could make a difference, so you could try it. > > > On Tue, Apr 26, 2016 at 11:45 AM, Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > Hi all, > > I’m running this command, on the master in an EMR cluster: > >

Reducing parallelism leads to NoResourceAvailableException

2016-04-27 Thread Ken Krugler
: NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism… But the change was to reduce parallelism - why would that now cause this problem? Thanks, — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ken Krugler
Cascading-Flink planner. Fabian would know best. — Ken > > Cheers, > Aljoscha > > On Thu, 28 Apr 2016 at 00:52 Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > Hi all, > > In trying out different settings for performance, I run into a job failure &g

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ken Krugler
ob in the web frontend). Is it really 15? Same as above, the Flink web UI is gone once the job has failed. Any suggestions for how to check the actual parallelism in this type of transient YARN environment? Thanks, — Ken > On Thu, Apr 28, 2016 at 12:52 AM, Ken Krugler > wrote: >&

Anyone going to ApacheCon Big Data in Vancouver?

2016-04-28 Thread Ken Krugler
ter-way-for-faster-workflows-ken-krugler-scale-unlimited> on my experience with using the Flink planner for Cascading.

Re: Anyone going to ApacheCon Big Data in Vancouver?

2016-04-28 Thread Ken Krugler
rson at the event... — Ken > > On Thu, Apr 28, 2016 at 6:34 PM, Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > Hi all, > > Is anyone else from the community going? > > It would be fun to meet up with other Flink users during the event. > > I

EMR vCores and slot allocation

2016-04-28 Thread Ken Krugler
/config.html <https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html>, the recommended configuration would then be 4 slots/TaskManager, yes? Thanks, — Ken ------ Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions &

Checking actual config values used by TaskManager

2016-04-28 Thread Ken Krugler
makes me want to see the exact values being used, versus assuming I know what’s been set :) Thanks, — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Checking actual config values used by TaskManager

2016-04-29 Thread Ken Krugler
y deployed with tasks is critical. — Ken > > On Thu, Apr 28, 2016 at 9:00 PM, Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > Hi all, > > I’m running jobs on EMR via YARN, and wondering how to check exactly what > configuration settings are actually being u

Re: Checking actual config values used by TaskManager

2016-05-04 Thread Ken Krugler
s an example. Thanks, — Ken > On Fri, Apr 29, 2016 at 3:18 PM, Ken Krugler > wrote: >> Hi Timur, >> >> On Apr 28, 2016, at 10:40pm, Timur Fayruzov >> wrote: >> >> If you're talking about parameters that were set on JVM startup then `ps >>

Re: How to measure Flink performance

2016-05-13 Thread Ken Krugler
ek >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-measure-Flink-performance-tp6741p6863.html >>> >>> <http://apache-flink-user-mailin

Use of slot sharing groups causing workflow to hang

2020-09-02 Thread Ken Krugler
e that issue is still open, so wondering what Til and Konstantin have to say about it. ------ Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Bug: Kafka producer always writes to partition 0, because KafkaSerializationSchemaWrapper does not call open() method of FlinkKafkaPartitioner

2020-09-03 Thread Ken Krugler
KafkaSerializationSchemaWrapper#serialize is later called by >> the FlinkKafkaProducer, FlinkFiexedPartitioner#partition would always >> return 0, because parallelInstanceId is not properly initialized. >> >> >> Eventually, all of the data would go only to partition 0 of the given Kafka >> topic, creating severe data skew in the sink. >> >> >> -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Speeding up CoGroup in batch job

2020-09-04 Thread Ken Krugler
improving the performance of a CoGroup? Thanks! — Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Should the StreamingFileSink mark the files "finished" when all bounded input sources are depleted?

2020-09-07 Thread Ken Krugler
e sender know and delete the message > immediately. > - -- Ken Krugler http://www.scaleunlimited.com <http://www.scaleunlimited.com/> custom big data solutions & training Hadoop, Cascading, Cassandra & Solr -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Use of slot sharing groups causing workflow to hang

2020-09-09 Thread Ken Krugler
esources. If you put part of the workflow to a specific slot sharing > group, it may require more slots to run the workflow than before. > Could you share logs of the ResourceManager and SlotManager, I think > there are more clues in it. > > Best, > Yangze Guo > > On Thu,

Re: Speeding up CoGroup in batch job

2020-09-17 Thread Ken Krugler
r each SSD, so that we > can spread the load across all SSDs. > - If you are saying the CPU load is 40% on a TM, we have to assume we are IO > bound: Is it the network or the disk(s)? > > I hope this is some helpful inspiration for improving the performance. > > >

Re: SocketException: Too many open files

2020-09-25 Thread Ken Krugler
nment. > > I have updated the limits.conf and also set the value of file-max > (fs.file-max = 2097152) on the master node as well as on all worker nodes > and still getting the same issue. > > Thanks > Sateesh > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Restore from savepoint with Iterations

2020-05-04 Thread Ken Krugler
Hi Ashish, Wondering if you’re running into the gridlock problem I mention on slide #25 here: https://www.slideshare.net/FlinkForward/flink-forward-san-francisco-2018-ken-krugler-building-a-scalable-focused-web-crawler-with-flink <https://www.slideshare.net/FlinkForward/flink-forward-

Re: Restore from savepoint with Iterations

2020-05-04 Thread Ken Krugler
e not light (around half billion every day) :) > >> On May 4, 2020, at 10:13 PM, Ken Krugler > <mailto:kkrugler_li...@transpac.com>> wrote: >> >> Hi Ashish, >> >> Wondering if you’re running into the gridlock problem I mention on slide #25 >> h

Re: Flink restart strategy on specific exception

2020-05-12 Thread Ken Krugler
gt; > > Your Personal Data: We may collect and process information about you that may > be subject to data protection laws. For more information about how we use and > disclose your personal data, how we protect your information, our legal basis > to use your information, your r

Re: Flink restart strategy on specific exception

2020-05-12 Thread Ken Krugler
> > > > Your Personal Data: We may collect and process information about you that may > be subject to data protection laws. For more information about how we use and > disclose your personal data, how we protect your information, our legal basis > to use your information, you

Re: Simple stateful polling source

2020-06-07 Thread Ken Krugler
rom; >>> > } >>> > } >>> > >>> > @Override >>> > public void run(SourceContext ctx) throws Exception { >>> >final Object lock = ctx.getCheckpointLock(); >>> >Client httpClient = getHttpClient(); >>> >try { >>> > pollingThread = new MyPollingThread.Builder(baseUrl, >>> > httpClient)// >>> > .setStartDate(startDateStr, datePatternStr)// >>> > .build(); >>> > // start the polling thread >>> > new Thread(pr).start(); >>> > (etc) >>> > } >>> > >>> > Is this the correct approach or did I misunderstood how stateful >>> > source functions work? >>> > >>> > Best, >>> > Flavio >>> >>> >>> >> >> > -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Flink Logging on EMR

2020-07-02 Thread Ken Krugler
nf/yarn-site.xml > > >yarn.log-aggregation-enable >true > > > It might be the case that i can only see the logs through yarn once the > application completes/finishes/fails > > Thanks > Sateesh > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Issue with single job yarn flink cluster HA

2020-08-05 Thread Ken Krugler
played. And this stays the > same even after 30 minutes. If I leave the job without yarn kill, it stays > the same forever. > Based on your suggestions till now, I guess it might be some zookeeper > problem. If that is the case, what can I lookout for in zookeeper to figure > ou

Change in sub-task id assignment from 1.9 to 1.10?

2020-08-06 Thread Ken Krugler
post 1.8 Thanks, — Ken -- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Cassandra & Solr

Re: Flink auto-scaling feature and documentation suggestions

2021-05-05 Thread Ken Krugler
wastage of resources, even with operator chaining in > place. > > That's why I think more toggles are needed to make current auto-scaling > truly shine. > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.

Error with extracted type from custom partitioner key

2021-06-04 Thread Ken Krugler
uctor, and have Flink treat it as a non-POJO? This seemed to be working fine. • Is it a bug in Flink that the extracted field from the key is being used as the expected type for partitioning? Thanks! — Ken ------ Ken Krugler http://www.scaleunlimited.com Custo

Re: Error with extracted type from custom partitioner key

2021-06-12 Thread Ken Krugler
xample how you would > like to cann partitionCustom()? > > Regards, > Timo > > On 04.06.21 15:38, Ken Krugler wrote: >> Hi all, >> I'm using Flink 1.12 and a custom partitioner/partitioning key (batch mode, >> with a DataSet) to do a better job of distributing data

Upcoming Meetup talk on using Flink & Pinot - Pinot vs Elasticsearch, a Tale of Two PoCs

2021-06-18 Thread Ken Krugler
Flink batch, not streaming. ------ Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: Memory Usage - Total Memory Usage on UI and Metric

2021-07-02 Thread Ken Krugler
g this with 3 yarn containers(2 tasks in each > container), total parallelism as 6. As soon as one container fails with this > error, the job re-starts. However, within minutes other 2 containers also > fail with the same error one by one. > > Thanks, > Hemant -- Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Best approach for S3 integration testing without using deprecated ConfigConstants.HDFS_SITE_CONFIG?

2019-08-28 Thread Ken Krugler
HDFS_SITE_CONFIG? The Javadoc says to use HADOOP_CONF_DIR, but setting up an environment variable from inside a unit test is…messy. Thanks, — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Casca

Potential block size issue with S3 binary files

2019-08-28 Thread Ken Krugler
it applies to all output files in the job. — Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Batch mode with Flink 1.8 unstable?

2019-09-18 Thread Ken Krugler
1.8, that is why it is > not an advertised feature (it only works for streaming so far). > > The goal is that this works in the 1.9 release (aka the batch fixup release) > > (3) Hang in Processing > > I think a thread dump (jstack) from the TMs would be helpful to diagnos

How to assign a UID to a KeyedStream?

2020-01-09 Thread Ken Krugler
is a PartitionTransformation. I don’t see a way to set the UID following the keyBy(), as a KeyedStream creates the PartitionTransformation without a UID. Any insight into setting the UID properly here? Or should StreamGraphGenerator.transform() skip the no-uid check for PartitionT

How to assign a UID to a KeyedStream?

2020-01-09 Thread Ken Krugler
ollowing the keyBy(), as a KeyedStream creates the PartitionTransformation without a UID. Any insight into setting the UID properly here? Or should StreamGraphGenerator.transform() skip the no-uid check for PartitionTransformation, since that’s not an operator with state? Thanks, — Ken -

Re: StreamingFileSink doesn't close multipart uploads to s3?

2020-01-10 Thread Ken Krugler
uot;2019-12-06--21/part-0-1" > >> }, > >> { > >> "Initiated": "2019-12-06T21:04:15.000Z", > >> "Key": "2019-12-06--21/part-0-2" > >> }, > >> {

Status of FLINK-12692 (Support disk spilling in HeapKeyedStateBackend)

2020-01-29 Thread Ken Krugler
Hi Yu Li, It looks like this stalled out a bit, from May of last year, and won’t make it into 1.10. I’m wondering if there’s a version in Blink (as a completely separate state backend?) that could be tried out? Thanks, — Ken -- Ken Krugler http

Using retained checkpoints as savepoints

2020-01-29 Thread Ken Krugler
-savepoint-how-is-a-savepoint-different-from-a-checkpoint> implies not: "Optimizations towards those goals can exploit certain properties, e.g. that the job code doesn’t change between the execution attempts" Thanks, — Ken -- Ken Krugler http://www.scaleunlimit

Re: Parallel stream partitions

2018-07-17 Thread Ken Krugler
> I need to remove the crossover in the middle box, so [1] -> [1] -> [1] and > [2] -> [2] -> [2], instead of [1] -> [1] -> [1 or 2] . > > Nick -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Flink keyed stream windows

2018-08-13 Thread Ken Krugler
e users every 30 minutes. > > Since I have a lot of unique users(rpm 1.5 million), how to use Flink's timed > windows on keyed stream to solve this problem. > > Please help! > > Thanks, > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

OneInputStreamOperatorTestHarness.snapshot doesn't include timers?

2018-08-15 Thread Ken Krugler
p = context.timerService().currentProcessingTime() + _period; LOGGER.info("Setting initial timer for {}", firstTimestamp); context.timerService().registerProcessingTimeTimer(firstTimestamp); } out.c

Re: OneInputStreamOperatorTestHarness.snapshot doesn't include timers?

2018-08-16 Thread Ken Krugler
org/jira/browse/FLINK-10159 > <https://issues.apache.org/jira/browse/FLINK-10159> > > Piotrek > >> On 16 Aug 2018, at 00:24, Ken Krugler > <mailto:kkrugler_li...@transpac.com>> wrote: >> >> Hi all, >> >> It looks to me like the OperatorSu

Re: Question regarding Streaming Resources

2018-09-12 Thread Ken Krugler
Flink internal task manager is having > some strategy to utilize them for other new streams that are coming? > Regards > Bhaskar -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com <http://www.scaleunlimited.com/> Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Test harness for validating proper checkpointing of custom SourceFunction

2018-09-12 Thread Ken Krugler
). What’s the recommended approach for this? We can of course fire up the workflow with checkpointing, and add some additional logic that kills the job after a checkpoint has happened, etc. But it seems like there should be a better way. Thanks, — Ken -- Ken Krugler +1 530-210

Re: Question regarding Streaming Resources

2018-09-12 Thread Ken Krugler
Hi Bhaskar, > On 2018/09/12 20:42:22, Ken Krugler wrote: >> Hi Bhaskar, >> >> I assume you don’t have 1000 streams, but rather one (keyed) stream with >> 1000 different key values, yes? >> >> If so, then this one stream is physically partitioned based

Errors in QueryableState sample code?

2018-09-19 Thread Ken Krugler
thing seemed OK, but this doesn’t feel like the right way to solve that problem :) 2. The call to response.get() returns a ValueState>, not the Tuple2 itself. So it seems like there’s a missing “.value()”. Regards, — Ken -- Ken Krugler +1 530-210-6378 http://www.scale

Re: Regarding implementation of aggregate function using a ProcessFunction

2018-09-29 Thread Ken Krugler
des a >> RuntimeContext that allows you to access the state API). >> >> Thanks, vino. >> >> Gaurav Luthra > <mailto:gauravluthra6...@gmail.com>> 于2018年9月28日周五 下午1:38写道: >> Hi, >> >> As we are aware, Currently we cannot use RichAggregateFunction in >> aggregate() method upon windowed stream. So, To access the state in your >> customAggregateFunction, you can implement it using a ProcessFuntion. >> This issue is faced by many developers. >> So, someone must have implemented or tried to implement it. So, kindly share >> your feedback on this. >> As I need to implement this. >> >> Thanks & Regards >> Gaurav Luthra -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: In-Memory Lookup in Flink Operators

2018-09-29 Thread Ken Krugler
>> state backend(RocksDB in my case). >> 2) Having a dedicated refresh thread for each subtask instance(possibly, >> every Task Manager having multiple refresh thread) >> >> Am i thinking in the right direction? Or missing something very obvious? It >> confusing. >> >> Any leads are much appreciated. Thanks in advance. >> >> Cheers, >> Chirag -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: CsvInputFormat - read header line first

2018-10-31 Thread Ken Krugler
2 times. Is there any better way to do > this? Please suggest. > > -- > Thank you. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Join Dataset in stream

2018-11-15 Thread Ken Krugler
(mutable). I could do this on a batch job, but i will > have to triger it each time and the input are more like a slow stream, but > the computing need to be fast can i do this on a stream way? is there any > better solution ? > Thx -- Ken

Re: Writing to S3

2018-11-15 Thread Ken Krugler
at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:296) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:703) > at java.lang.Thread.run(Thread.java:748) -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Null Pointer Exception

2018-11-16 Thread Ken Krugler
ansport.ProtocolStateActor.aroundReceive(AkkaProtocolTransport.scala:285) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at akka.actor.ActorCell.invoke(ActorCell.scala:495) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at akka.dispatch.Mailbox.run(Mailbox.

Re: About the issue caused by flink's dependency jar package submission method

2018-11-20 Thread Ken Krugler
.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:224) > ... 4 more > > is there anyone know about this? > > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ > <http://apache-fl

Re: OutOfMemoryError while doing join operation in flink

2018-11-23 Thread Ken Krugler
(RecordWriter.java:131) > at > org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107) > at > org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65) > > > From the exception view in flink jo

Re: Running Flink Dataset jobs Sequentially

2021-07-09 Thread Ken Krugler
there a way to tell Flink > to run it sequentially? I tried moving the execution environment inside the > loop but it seems like it only runs the job on the first directory. I'm > running this on AWS Kinesis Data Analytics, so it's a bit hard for me to > submit n

Re: Running Flink Dataset jobs Sequentially

2021-07-14 Thread Ken Krugler
ou said write a > custom mapPartition that writes to files, did you actually write the file > inside the mapPartition function itself? So the Flink DAG ends at > mapPartition? Did you notice any performance issues as a result of this? > > Thanks again, > Jason > > On

Exception thrown during batch job execution on YARN even though job succeeded

2021-09-30 Thread Ken Krugler
blame didn’t show most lines as being modified by “Rufus Refactor”…sigh) ------ Ken Krugler http://www.scaleunlimited.com Custom big data solutions Flink, Pinot, Solr, Elasticsearch

Re: S3A AWSS3IOException from Flink's BucketingSink to S3

2018-12-09 Thread Ken Krugler
k job does not restart and resume/recover >> from the last successful checkpoint. >> >> What is the cause for this and how can it be resolved? Also, how can the job >> be configured to restart/recover from the last successful checkpoint instead >> of staying in the FAILING state? > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: How many times Flink initialize an operator?

2018-12-11 Thread Ken Krugler
y element? > stream.map(new MapFunction[I,O]).addSink(discard) > > Hao Sun > Team Lead > 1019 Market St. 7F > San Francisco, CA 94103 -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Question about key group / key state & parallelism

2018-12-12 Thread Ken Krugler
ixed ? I mean, are the state always managed by the same > instance, or does this depends on the available instance at the moment ? > > "During execution each parallel instance of a keyed operator works with the > keys for one or more Key Groups." > -> this is related

Re: Iterations and back pressure problem

2018-12-24 Thread Ken Krugler
Hi Sergey, As Andrey noted, it’s a known issue with (currently) no good solution. I talk a bit about how we worked around it on slide 26 of my Flink Forward talk <https://www.slideshare.net/FlinkForward/flink-forward-san-francisco-2018-ken-krugler-building-a-scalable-focused-web-crawler-w

Re: windowAll and AggregateFunction

2019-01-09 Thread Ken Krugler
e function). Instead, it sends all data to single instance and > > call add function there. > > > > Is here any way to make flink behave like this? I mean calculate partial > > results after consuming from kafka with paralelism of sources without > > shuffling(so

Re: windowAll and AggregateFunction

2019-01-09 Thread Ken Krugler
operator gets a slice of the unique values. — Ken > On Wed, Jan 9, 2019, 10:22 PM Ken Krugler <mailto:kkrugler_li...@transpac.com> wrote: > Hi there, > > You should be able to use a regular time-based window(), and emit the > HyperLogLog binary data as your result, which the

Re: Issue with counter metrics for large number of keys

2019-01-16 Thread Ken Krugler
er of references of counter metrics you have heard > from anyone using metrics? > > Thanks & Regards > Gaurav Luthra > Mob:- +91-9901945206 > > > On Thu, Jan 17, 2019 at 9:04 AM Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > I think trying to

Re: Best pattern for achieving stream enrichment (side-input) from a large static source

2019-01-26 Thread Ken Krugler
I'm attaching a simple chart for convenience, > > Thanks you very much, > > Nimrod. > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Best pattern for achieving stream enrichment (side-input) from a large static source

2019-01-26 Thread Ken Krugler
’d worked around a similar issue with a UnionedSources <https://github.com/ScaleUnlimited/flink-streaming-kmeans/blob/master/src/main/java/com/scaleunlimited/flinksources/UnionedSources.java> source function, but I haven’t validated that it handles checkpointing correctly. — K

Re: How to load Avro file in a Dataset

2019-01-27 Thread Ken Krugler
ed class. My > question is can Flink detect the Avro file schema automatically? How can I > load Avro file without any predefined class? -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Regarding json/xml/csv file splitting

2019-02-04 Thread Ken Krugler
emp2 rec and reads emp3 and emp4 > # Operator3 ignores partial emp4 and reads emp5 > Record delimiter is used to skip partial record and identifying new record. > > > > > > -- > Thank you, > Madan. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Per-workflow configurations for an S3-related property

2019-02-08 Thread Ken Krugler
Hi all, When running in EMR, we’re encountering the oh-so-common HTTP timeout that’s caused by the connection pool being too small (see below) I’d found one SO answer that said to bump fs.s3.maxConnections for the EMR S3 filesystem implementation.

Re: ElasticSearchSink - retrying doesn't work in ActionRequestFailureHandler

2019-02-13 Thread Ken Krugler
Averell > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: ElasticSearchSink - retrying doesn't work in ActionRequestFailureHandler

2019-02-13 Thread Ken Krugler
ass. > Maybe Gordon meant 1.7.2 rc2? > > Thanks and regards, > Averell > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Starting Flink cluster and running a job

2019-02-19 Thread Ken Krugler
ho "config file: " && grep '^[^\n#]' >> "$FLINK_HOME/conf/flink-conf.yaml" >> exec $(drop_privs_cmd) "$FLINK_HOME/bin/taskmanager.sh" >> start-foreground >> else >> sed -i -e "s/jobmanager.rpc.addres

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-03-01 Thread Ken Krugler
> The problem does not seem to be new, but I was unable to find any practical > solution in the documentation. > > Best regards, > Arnaud > > > > > > > L'intégrité de ce message n'étant pas assurée sur internet, la société > expéditrice ne

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-03-04 Thread Ken Krugler
INZ, Arnaud wrote: > > Hi, > > My source checkpoint is actually the file list. But it's not trivially small > as I may have hundreds of thousand of files, with long filenames. > My sink checkpoint is a smaller hdfs file list with current size. > > Message d

Re: Set partition number of Flink DataSet

2019-03-11 Thread Ken Krugler
her way we can achieve similar result? Thank you! > > Qi -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Side Output from AsyncFunction

2019-03-11 Thread Ken Krugler
ms to be overcomplicated. > > Am I missing something? Any help/ideas are much appreciated! > > Cheers, > Mike Pryakhin > -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Set partition number of Flink DataSet

2019-03-12 Thread Ken Krugler
o use > setParallelism() to set the output partition/file number, but when the > partition number is too large (~100K), the parallelism would be too high. Is > there any other way to achieve this? > > Thanks, > Qi > >> On Mar 11, 2019, at 11:22 PM, Ken Krugler > <

Re: Set partition number of Flink DataSet

2019-03-13 Thread Ken Krugler
t of all possible bucket values. I’m actually dealing with something similar now, so I might have a solution to share soon. — Ken > I will check Blink and give it a try anyway. > > Thank you, > Qi > >> On Mar 12, 2019, at 11:58 PM, Ken Krugler > <mailto:kkrugler_l

Re: Batch jobs stalling after initial progress

2019-03-13 Thread Ken Krugler
" > * the task earlier in the DAG from the parquet output task shows the back > pressure status as "OK", the one earlier is shown with back pressure status > "High" > > Are there any specific logs I should enable to get more information on this? > Has a

Re: Set partition number of Flink DataSet

2019-03-14 Thread Ken Krugler
reducer the number of parallel sinks, and > may also try sortPartition() so each sink could write files one by one. > Looking forward to your solution. :) > > Thanks, > Qi > >> On Mar 14, 2019, at 2:54 AM, Ken Krugler > <mailto:kkrugler_li...@transpac.com>> wrote:

Re: Async Function Not Generating Backpressure

2019-03-19 Thread Ken Krugler
y backpressured but was able to run at 14k/s. > > Is the AsyncFunction somehow not reporting the backpressure correctly? > > Thanks, > Seed -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Async Function Not Generating Backpressure

2019-03-19 Thread Ken Krugler
; >> Job 1 - without async function in front of Cassandra >> Job 2 - with async function in front of Cassandra >> >> Job 1 is backpressured because Cassandra cannot handle all the writes and >> eventually slows down the source rate to 6.5k/s. >> Job 2

Re: Async Function Not Generating Backpressure

2019-03-21 Thread Ken Krugler
ine previous steps like Kafka fetching and > Cassandra IO but I am also not sure about this explanation. > > Best, > Andrey > > > On Tue, Mar 19, 2019 at 8:05 PM Ken Krugler <mailto:kkrugler_li...@transpac.com>> wrote: > Hi Seed, > > I was assuming the Cassa

Re: Use different functions for different signal values

2019-04-02 Thread Ken Krugler
would use for each > signal set a special function. E.g. Signal1, Signal2 ==> function1, Signal3, > Signal4 ==> function2. > What is the recommended way to implement this pattern? > > Thanks! > -- Ken Krugler +1 530-210-6378 http://www.scaleunl

Re: Partitioning key range

2019-04-08 Thread Ken Krugler
00 and 1001, only one partition receives all of the >> upstream data in ksA. >> Is there any way to get information about key ranges for each downstream >> partitions? >> Or is there any way to overcome this issue? >> We can assume that I know all possible keys (in this case

Re: Write simple text into hdfs

2019-04-29 Thread Ken Krugler
t; thing ? > > Many thanks. -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com Custom big data solutions & training Flink, Solr, Hadoop, Cascading & Cassandra

Re: Error While Initializing S3A FileSystem

2019-05-15 Thread Ken Krugler
:/opt/flink/conf: > org.apache.flink.runtime.taskexecutor.TaskManagerRunner --configDir > /opt/flink/conf > > and i see that `FluentPropertyBeanIntrospector` is contained within the > following two jars: > > flink-s3-fs-hadoop-1.7.2.jar:org/apache/flink/fs/shaded/hadoop3

Re: Error While Initializing S3A FileSystem

2019-05-15 Thread Ken Krugler
t; >> /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djavax.net >> <http://djavax.net/>.ssl.trustStore=/etc/ssl/java-certs/cacerts -XX:+UseG1GC >> -Xms5530M -Xmx5530M -XX:MaxDirectMemorySize=8388607T >> -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties >>

  1   2   >