Do you mean "nodetool settraceprobability"? That's not exactly new; I
remember it being available in Cassandra 2.x.
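For reference, the probability is set per node, with a value between 0
and 1; for example:

    nodetool settraceprobability 0.001

traces roughly 0.1% of the requests coordinated by that node.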
On 07/08/2022 20:43, Stéphane Alleaume wrote:
I think perhaps you already know, but I read that you can now trace
only a percentage of all queries; I will try to retrieve the name of
this functionality (in a recent Cassandra release).
Hope it helps
Kind regards
Stéphane
On Sun, Aug 7, 2022 at 20:26, Raphael Mazelier <r...@futomaki.net> wrote:
> "Read repair is in the blocking read path for the query, yep"
OK, interesting. This is not what I understood from the
documentation. And I use the LOCAL_ONE consistency level.
I enabled tracing (see the attachment in my first message), but I
didn't see read repair in the trace (and BTW I tried to completely
disable it on my table by setting both read_repair_chance and
dclocal_read_repair_chance to 0).
The problem with enabling tracing in cqlsh is that I only see slow
results; to get fast answers I need to issue my queries in quick
succession.
I can provide traces again for analysis. I got something more
readable in Python.
Best,
--
Raphael
On 07/08/2022 19:30, C. Scott Andreas wrote:
> but still, as I understand the documentation, read repair
should not be in the blocking path of a query?
Read repair is in the blocking read path for the query, yep. At
quorum consistency levels, the read repair must complete before
returning a result to the client to ensure the data returned
would be visible on subsequent reads that address the remainder
of the quorum.
If you enable tracing - either for a single CQL statement that is
expected to be slow, or probabilistically from the server side to
catch a slow query in the act - that will help identify what's
happening.
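With the Python driver, for example, a single statement can be traced
like this (a minimal sketch; the contact point, keyspace, table, and
query are hypothetical):

    from cassandra.cluster import Cluster

    cluster = Cluster(['10.0.0.1'])           # hypothetical contact point
    session = cluster.connect('my_keyspace')  # hypothetical keyspace

    # trace=True asks the coordinator to record a trace for this request
    result = session.execute("SELECT * FROM my_table WHERE id = 42",
                             trace=True)

    trace = result.get_query_trace()
    print(trace.coordinator, trace.duration)
    for event in trace.events:
        # source_elapsed: time since the request reached that node
        print(event.source_elapsed, event.source, event.description)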
- Scott
On Aug 7, 2022, at 10:25 AM, Raphael Mazelier <r...@futomaki.net> wrote:
Nope. And what really puzzles me is that the traces clearly show
the difference between queries: the fast queries only read from
one replica, while the slow queries read from multiple replicas
(and not only ones local to the DC).
On 07/08/2022 14:02, Stéphane Alleaume wrote:
Hi
Is there some GC that could affect the coordinator node?
Kind regards
Stéphane
On Sun, Aug 7, 2022 at 13:41, Raphael Mazelier
<r...@futomaki.net> wrote:
Thanks for the answer, but I was well aware of this. I use
LOCAL_ONE as the consistency level.
My client connects to a local seed, then chooses a local
coordinator (as far as I can understand the trace log).
Then, for a batch of requests, I get approximately 98% of the
requests handled in 2-3 ms in the local DC with one read
request, and 2% handled by many nodes (according to the trace)
and consequently taking much longer (250 ms).
?
On 06/08/2022 14:30, Bowen Song via user wrote:
See the diagram below. Your problem almost certainly
arises from step 4, in which an incorrect consistency
level set by the client causes the coordinator node to
send READ commands to nodes in other DCs.
The load balancing policy only affects steps 2 and 3, not
steps 1 or 4.
You should change the consistency level to
LOCAL_ONE/LOCAL_QUORUM/etc. to fix the problem.
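With the Python driver, for example, the fix could look like this
(a sketch; the contact point, DC name, and keyspace are
hypothetical):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster, ExecutionProfile, \
        EXEC_PROFILE_DEFAULT
    from cassandra.policies import DCAwareRoundRobinPolicy, \
        TokenAwarePolicy

    profile = ExecutionProfile(
        # steps 2/3: pick coordinators in the local DC
        load_balancing_policy=TokenAwarePolicy(
            DCAwareRoundRobinPolicy(local_dc='eu-west-1')),
        # step 4: only wait for replicas in the local DC
        consistency_level=ConsistencyLevel.LOCAL_ONE,
    )
    cluster = Cluster(['10.0.0.1'],
                      execution_profiles={EXEC_PROFILE_DEFAULT: profile})
    session = cluster.connect('my_keyspace')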
On 05/08/2022 22:54, Bowen Song wrote:
The DCAwareRoundRobinPolicy/TokenAwareHostPolicy
controls which Cassandra coordinator node the client
sends queries to, not the nodes it connects to, nor the
nodes that perform the actual read.
A client sends a CQL read query to a coordinator node;
the coordinator parses the CQL query and sends READ
requests to other nodes in the cluster based on the
consistency level.
Have you checked the consistency level of the session
(and the query if applicable)? Is it prefixed with
"LOCAL_"? If not, the coordinator will send the READ
requests to non-local DCs.
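With the Python driver, for example, the level can also be set per
statement (a sketch; the query and table are hypothetical, and
"session" is a connected driver Session):

    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    # Overrides the session default for this one query
    stmt = SimpleStatement("SELECT * FROM my_table WHERE id = 42",
                           consistency_level=ConsistencyLevel.LOCAL_ONE)
    session.execute(stmt)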
On 05/08/2022 19:40, Raphael Mazelier wrote:
Hi Cassandra Users,
I'm relatively new to Cassandra, and first I have to say
I'm really impressed by the technology.
Good design, and a lot of material for understanding the
internals (the O'Reilly book helped a lot, as did the
thelastpickle blog posts).
I have a multi-datacenter C* cluster (US, Europe,
Singapore) with eight nodes in each (two seeds in each
region); two racks in EU and Singapore, three in the US.
Everything is deployed in AWS.
We have a keyspace configured with NetworkTopologyStrategy
and two replicas in every region, like this: {'class':
'NetworkTopologyStrategy', 'ap-southeast-1': '2',
'eu-west-1': '2', 'us-east-1': '2'}
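(For reference, creating such a keyspace from the Python driver could
look like this; the keyspace name is hypothetical:)

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS my_keyspace
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'ap-southeast-1': '2',
                            'eu-west-1': '2',
                            'us-east-1': '2'}
    """)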
Investigating a performance issue, I noticed strange
things in my experiments.
What we expect is very low latency, 3-5 ms max, for this
specific SELECT query, so we want every read to be local
to each datacenter.
We configure DCAwareRoundRobinPolicy(local_dc=DC) in
Python, and the same in Go:
gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("DC"))
Testing a bit with two short programs (I can provide
them) in Go and Python, I noticed very strange results.
Basically, I run the same query over and over with a
very limited set of ids.
The first results were surprising, because the very
first queries always took more than 250 ms; afterwards,
by stressing C* (playing with the sleep between
queries), I can achieve a good ratio of queries at
3-4 ms (what I expected).
My guess was that the long queries were somehow executed
non-locally (or at least involved multi-datacenter
reads), while the short ones were not.
Activating tracing in my programs (like enabling TRACING
in cqlsh) kind of confirms my suspicion.
(I will provide traces as an attachment.)
My question is: why does C* sometimes try to read
non-locally? How can we disable that? What are the
criteria for it?
(BTW, I'm really not a fan of this multi-region design,
precisely because of this specific kind of issue...)
Also, a side question: why is C* so slow to connect?
It's as if it's trying to reach every node in each DC
(we only provide local seeds, however). Sometimes it
takes more than 20 seconds...
Any help appreciated.
Best,
--
Raphael Mazelier