Hi Seed, when you create `AsyncDataStream.(un)orderedWait` which capacity do you pass in or you use the default one (100)?
Best, Andrey On Thu, Mar 21, 2019 at 2:49 AM Seed Zeng <[email protected]> wrote: > Hey Andrey and Ken, > Sorry about the late reply. I might not have been clear in my question > The performance of writing to Cassandra is the same in both cases, only > that the source rate was higher in the case of the async function is > present. > Something is "buffering" and not propagating backpressure to slow down the > source speed from Kafka. > > In our use case, we prefer the backpressure to slow down the source so > that the write to Cassandra is not delayed while the source is consuming > fast. > > Thanks, > Seed > > On Wed, Mar 20, 2019 at 9:38 AM Andrey Zagrebin <[email protected]> > wrote: > >> Hi Seed, >> >> Sorry for confusion, I see now it is separate. Back pressure should still >> be created because internal async queue has capacity >> but not sure about reporting problem, Ken and Till probably have better >> idea. >> >> As for consumption speed up, async operator creates another thread to >> collect the result and Cassandra sink probably uses that thread to write >> data. >> This might parallelize and pipeline previous steps like Kafka fetching >> and Cassandra IO but I am also not sure about this explanation. >> >> Best, >> Andrey >> >> >> On Tue, Mar 19, 2019 at 8:05 PM Ken Krugler <[email protected]> >> wrote: >> >>> Hi Seed, >>> >>> I was assuming the Cassandra sink was separate from and after your async >>> function. >>> >>> I was trying to come up for an explanation as to why adding the async >>> function would improve your performance. >>> >>> The only very unlikely reason I thought of was that the async function >>> somehow caused data arriving at the sink to be more “batchy”, which (if the >>> Cassandra sink had an “every x seconds do a write” batch mode) could >>> improve performance. >>> >>> — Ken >>> >>> On Mar 19, 2019, at 11:35 AM, Seed Zeng <[email protected]> wrote: >>> >>> Hi Ken and Andrey, >>> >>> Thanks for the response. I think there is a confusion that the writes to >>> Cassandra are happening within the Async function. >>> In my test, the async function is just a pass-through without doing any >>> work. >>> >>> So any Cassandra related batching or buffering should not be the cause >>> for this. >>> >>> Thanks, >>> >>> Seed >>> >>> On Tue, Mar 19, 2019 at 12:35 PM Ken Krugler < >>> [email protected]> wrote: >>> >>>> Hi Seed, >>>> >>>> It’s a known issue that Flink doesn’t report back pressure properly for >>>> AsyncFunctions, due to how it monitors the output collector to gather back >>>> pressure statistics. >>>> >>>> But that wouldn’t explain how you get a faster processing with the >>>> AsyncFunction inserted into your workflow. >>>> >>>> I haven’t looked at how the Cassandra sink handles batching, but if the >>>> AsyncFunction somehow caused fewer, bigger Cassandra writes to happen then >>>> that’s one (serious hand waving) explanation. >>>> >>>> — Ken >>>> >>>> On Mar 18, 2019, at 7:48 PM, Seed Zeng <[email protected]> wrote: >>>> >>>> Flink Version - 1.6.1 >>>> >>>> In our application, we consume from Kafka and sink to Cassandra in the >>>> end. We are trying to introduce a custom async function in front of the >>>> Sink to carry out some customized operations. In our testing, it appears >>>> that the Async function is not generating backpressure to slow down our >>>> Kafka Source when Cassandra becomes unhappy. Essentially compared to an >>>> almost identical job where the only difference is the lack of the Async >>>> function, Kafka source consumption speed is much higher under the same >>>> settings and identical Cassandra cluster. The experiment is like this. >>>> >>>> Job 1 - without async function in front of Cassandra >>>> Job 2 - with async function in front of Cassandra >>>> >>>> Job 1 is backpressured because Cassandra cannot handle all the writes >>>> and eventually slows down the source rate to 6.5k/s. >>>> Job 2 is slightly backpressured but was able to run at 14k/s. >>>> >>>> Is the AsyncFunction somehow not reporting the backpressure correctly? >>>> >>>> Thanks, >>>> Seed >>>> >>>> >>>> -------------------------- >>>> Ken Krugler >>>> +1 530-210-6378 >>>> http://www.scaleunlimited.com >>>> Custom big data solutions & training >>>> Flink, Solr, Hadoop, Cascading & Cassandra >>>> >>>> >>> -------------------------- >>> Ken Krugler >>> +1 530-210-6378 >>> http://www.scaleunlimited.com >>> Custom big data solutions & training >>> Flink, Solr, Hadoop, Cascading & Cassandra >>> >>>
