On 2018-04-23 17:59, Ben Bromhead wrote:

    >> This doesn't work without additional changes, for RF>1. The
    token ring could place two replicas of the same token range on the
    same physical server, even though those are two separate cores of
    the same server. You could add another element to the hierarchy
    (cluster -> datacenter -> rack -> node -> core/shard), but that
    generates unneeded range movements when a node is added.
    > I have seen rack awareness used/abused to solve this.
    >

    But then you lose real rack awareness. It's fine for a quick hack,
    but
    not a long-term solution.

    (it also creates a lot more tokens, something nobody needs)
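
To spell out the RF>1 placement concern quoted above, here is a minimal sketch (hypothetical and deliberately simplified: a SimpleStrategy-style ring walk, not actual Cassandra code) of how two shards of one box can end up holding both replicas of a range once every shard registers as its own ring member:

    import java.util.*;

    // Hypothetical sketch: each shard advertises itself as "host:port" on the ring.
    class ShardPlacementSketch {
        public static void main(String[] args) {
            NavigableMap<Long, String> ring = new TreeMap<>();
            ring.put(100L, "10.0.0.1:9042");   // shard 0 of server A
            ring.put(200L, "10.0.0.1:9043");   // shard 1 of server A
            ring.put(300L, "10.0.0.2:9042");   // server B

            int rf = 2;
            long token = 50L;                  // some partition's token
            List<String> replicas = new ArrayList<>();
            // walk the ring clockwise, taking the next rf distinct endpoints
            // (wrap-around omitted for brevity)
            for (String endpoint : ring.tailMap(token, true).values()) {
                if (!replicas.contains(endpoint)) replicas.add(endpoint);
                if (replicas.size() == rf) break;
            }
            // prints [10.0.0.1:9042, 10.0.0.1:9043]: both "replicas" are cores of
            // the same physical server, because the ring only distinguishes
            // host:port, not the underlying host.
            System.out.println(replicas);
        }
    }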


I'm having trouble understanding how you lose "real" rack awareness, as these shards are in the same rack anyway: the address-and-port pairs all belong to the same server, in the same rack, so it behaves as expected. Could you explain a situation where the shards on a single server would be in different racks (or fault domains)?

You're right - it continues to work.


If you wanted to support a single rack per DC for simple deployments, it would be straightforward to extend NetworkTopologyStrategy to behave as it did before https://issues.apache.org/jira/browse/CASSANDRA-7544, i.e. to treat InetAddresses as servers rather than address-and-port pairs. Both that implementation in Apache Cassandra and the corresponding load-balancing classes in the drivers are explicitly designed to be pluggable, so that would be an easier integration point for you.
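
For illustration only, a rough sketch of the kind of address-level (rather than address-and-port-level) replica filtering such a pluggable strategy might apply; the class and method names here are hypothetical, and this is not the pre-7544 code:

    import java.net.InetSocketAddress;
    import java.util.*;

    // Hypothetical sketch: walk candidate shards in ring order, but treat the bare
    // InetAddress (ignoring the port) as "the server", so two shards of one box can
    // never both be chosen as replicas of the same range.
    class AddressAwareReplicaChooser {
        static List<InetSocketAddress> chooseReplicas(List<InetSocketAddress> ringOrder, int rf) {
            Set<String> usedServers = new HashSet<>();        // keyed by address only
            List<InetSocketAddress> replicas = new ArrayList<>();
            for (InetSocketAddress shard : ringOrder) {
                if (usedServers.add(shard.getAddress().getHostAddress())) {
                    replicas.add(shard);
                    if (replicas.size() == rf) break;
                }
            }
            return replicas;
        }

        public static void main(String[] args) {
            List<InetSocketAddress> ringOrder = Arrays.asList(
                    new InetSocketAddress("10.0.0.1", 9042),   // shard 0, server A
                    new InetSocketAddress("10.0.0.1", 9043),   // shard 1, server A -- skipped
                    new InetSocketAddress("10.0.0.2", 9042));  // server B
            // picks one replica on 10.0.0.1 and one on 10.0.0.2
            System.out.println(chooseReplicas(ringOrder, 2));
        }
    }

Since every shard of a node shares that node's rack and DC, filtering on the bare address like this doesn't give up any real rack awareness.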

I'm not sure how it creates more tokens? If a server normally owns 256 tokens, each shard on a different port would just advertise ownership of 256/# of cores (e.g. 4 tokens if you had 64 cores).

Having just 4 tokens results in imbalance. CASSANDRA-7032 mitigates it, but only for one replication factor, and doesn't work for decommission.

(and if you have 60 lcores then you get between 4 and 5 tokens per lcore, which is a 20% imbalance right there)
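
To make the arithmetic behind that figure explicit (assuming the 256 tokens are split as evenly as integer division allows):

    // Back-of-the-envelope: splitting 256 vnodes across 60 logical cores.
    class TokenImbalanceSketch {
        public static void main(String[] args) {
            int vnodes = 256, shards = 60;
            int low  = vnodes / shards;             // 4 tokens on most shards
            int high = low + 1;                     // 5 tokens on the remainder shards
            int shardsWithHigh = vnodes % shards;   // 16 shards end up with 5 tokens
            double imbalance = 1.0 - (double) low / high;   // 1 - 4/5 = 0.20
            System.out.printf("%d shards own %d tokens, %d own %d -> ~%.0f%% spread%n",
                    shards - shardsWithHigh, low, shardsWithHigh, high, imbalance * 100);
        }
    }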


    > Regards,
    > Ariel
    >
    >> On Apr 22, 2018, at 8:26 AM, Avi Kivity <a...@scylladb.com> wrote:
    >>
    >>
    >>
    >>> On 2018-04-19 21:15, Ben Bromhead wrote:
    >>> Re #3:
    >>>
    >>> Yup I was thinking each shard/port would appear as a discrete
    server to the
    >>> client.
    >> This doesn't work without additional changes, for RF>1. The
    token ring could place two replicas of the same token range on the
    same physical server, even though those are two separate cores of
    the same server. You could add another element to the hierarchy
    (cluster -> datacenter -> rack -> node -> core/shard), but that
    generates unneeded range movements when a node is added.
    >>
    >>> If the per port suggestion is unacceptable due to hardware
    requirements,
    >>> remembering that Cassandra is built with the concept of scaling
    *commodity*
    >>> hardware horizontally, you'll have to spend your time and
    energy convincing
    >>> the community to support a protocol feature it has no
    (current) use for or
    >>> find another interim solution.
    >> Those servers are commodity servers (not x86, but still
    commodity). In any case 60+ logical cores are common now (hello
    AWS i3.16xlarge or even i3.metal), and we can only expect logical
    core count to continue to increase (there are 48-core ARM
    processors now).
    >>
    >>> Another way would be to build support and consensus around a
    clear
    >>> technical need in the Apache Cassandra project as it stands today.
    >>>
    >>> One way to build community support might be to contribute an
    Apache
    >>> licensed thread per core implementation in Java that matches
    the protocol
    >>> change and shard concept you are looking for ;P
    >> I doubt I'll survive the egregious top-posting that is going on
    in this list.
    >>
    >>>
    >>>> On Thu, Apr 19, 2018 at 1:43 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
    >>>>
    >>>> Hi,
    >>>>
    >>>> So at a technical level I don't understand this yet.
    >>>>
    >>>> So you have a database consisting of single threaded shards
    and a socket
    >>>> for accept that is generating TCP connections and in advance
    you don't know
    >>>> which connection is going to send messages to which shard.
    >>>>
    >>>> What is the mechanism by which you get the packets for a
    given TCP
    >>>> connection delivered to a specific core? I know that a given
    TCP connection
    >>>> will normally have all of its packets delivered to the same
    queue from the
    >>>> NIC because the tuple of source address + port and
    destination address +
    >>>> port is typically hashed to pick one of the queues the NIC
    presents. I
    >>>> might have the contents of the tuple slightly wrong, but it
    always includes
    >>>> a component you don't get to control.
    >>>>
    >>>> Since it's hashing how do you manipulate which queue packets
    for a TCP
    >>>> connection go to and how is it made worse by having an accept
    socket per
    >>>> shard?
    >>>>
    >>>> You also mention 160 ports as bad, but it doesn't sound like
    a big number
    >>>> resource wise. Is it an operational headache?
    >>>>
    >>>> RE tokens distributed amongst shards. The way that would work
    right now is
    >>>> that each port number appears to be a discrete instance of
    the server. So
    >>>> you could have shards be actual shards that are simply
    colocated on the
    >>>> same box, run in the same process, and share resources. I
    know this pushes
    >>>> more of the complexity into the server vs the driver as the
    server expects
    >>>> all shards to share some client-visible state like system tables
    and certain
    >>>> identifiers.
    >>>>
    >>>> Ariel
    >>>>> On Thu, Apr 19, 2018, at 12:59 PM, Avi Kivity wrote:
    >>>>> Port-per-shard is likely the easiest option but it's too ugly to
    >>>>> contemplate. We run on machines with 160 shards (IBM POWER
    2s20c160t
    IIRC); it will be just horrible to have 160 open ports.
    >>>>>
    >>>>>
    >>>>> It also doesn't fit well with the NIC's ability to automatically
    >>>>> distribute packets among cores using multiple queues, so the
    kernel
    >>>>> would have to shuffle those packets around. Much better to
    have those
    >>>>> packets delivered directly to the core that will service them.
    >>>>>
    >>>>>
    >>>>> (also, some protocol changes are needed so the driver knows
    how tokens
    >>>>> are distributed among shards)
    >>>>>
    >>>>>> On 2018-04-19 19:46, Ben Bromhead wrote:
    >>>>>> WRT to #3
    >>>>>> To fit in the existing protocol, could you have each shard
    listen on a
    >>>>>> different port? Drivers are likely going to support this due to
    >>>>>> https://issues.apache.org/jira/browse/CASSANDRA-7544 (
    >>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11596). I'm
    not super
    >>>>>> familiar with the ticket so there might be something I'm
    missing but it
    >>>>>> sounds like a potential approach.
    >>>>>>
    >>>>>> This would give you a path forward at least for the short term.
    >>>>>>
    >>>>>>
    >>>>>> On Thu, Apr 19, 2018 at 12:10 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
    >>>>>>> Hi,
    >>>>>>>
    >>>>>>> I think that updating the protocol spec to Cassandra puts
    the onus on
    >>>> the
    >>>>>>> party changing the protocol specification to have an
    implementation
    >>>> of the
    >>>>>>> spec in Cassandra as well as the Java and Python driver
    (those are
    >>>> both
    >>>>>>> used in the Cassandra repo). Until it's implemented in
    Cassandra we
    >>>> haven't
    >>>>>>> fully evaluated the specification change. There is no
    substitute for
    >>>> trying
    >>>>>>> to make it work.
    >>>>>>>
    >>>>>>> There are also realities to consider as to what the
    maintainers of the
    >>>>>>> drivers are willing to commit.
    >>>>>>>
    >>>>>>> RE #1,
    >>>>>>>
    >>>>>>> I am +1 on the fact that we shouldn't require an extra hop
    for range
    >>>> scans.
    >>>>>>> In JIRA Jeremiah made the point that you can still do this
    from the
    >>>> client
    >>>>>>> by breaking up the token ranges, but it's a leaky
    abstraction to have
    >>>> a
    >>>>>>> paging interface that isn't a vanilla ResultSet interface.
    Serial vs.
    >>>>>>> parallel is kind of orthogonal as the driver can do either.
    >>>>>>>
    >>>>>>> I agree it looks like the current specification doesn't
    make what
    >>>> should
    >>>>>>> be simple as simple as it could be for driver implementers.
    >>>>>>>
    >>>>>>> RE #2,
    >>>>>>>
    >>>>>>> +1 on this change assuming an implementation in Cassandra
    and the
    >>>> Java and
    >>>>>>> Python drivers.
    >>>>>>>
    >>>>>>> RE #3,
    >>>>>>>
    >>>>>>> It's hard to be +1 on this because we don't benefit by boxing
    >>>> ourselves in
    >>>>>>> by defining a spec we haven't implemented, tested, and
    decided we are
    >>>>>>> satisfied with. Having it in ScyllaDB de-risks it to a certain
    >>>> extent, but
    >>>>>>> what if Cassandra decides to go a different direction in
    some way?
    >>>>>>>
    >>>>>>> I don't think there is much discussion to be had without
    an example
    >>>> of
    >>>>>>> the changes to the CQL specification to look at, but even
    then if it
    >>>> looks
    >>>>>>> risky I am not likely to be in favor of it.
    >>>>>>>
    >>>>>>> Regards,
    >>>>>>> Ariel
    >>>>>>>
    >>>>>>>> On Thu, Apr 19, 2018, at 9:33 AM, glom...@scylladb.com wrote:
    >>>>>>>>> On 2018/04/19 07:19:27, kurt greaves <k...@instaclustr.com> wrote:
    >>>>>>>>>> 1. The protocol change is developed using the Cassandra
    process in
    >>>>>>>>>>      a JIRA ticket, culminating in a patch to
    >>>>>>>>>> doc/native_protocol*.spec when consensus is achieved.
    >>>>>>>>> I don't think forking would be desirable (for anyone) so
    this seems
    >>>>>>>>> the most reasonable to me. For 1 and 2 it certainly
    makes sense but
    >>>>>>>>> can't say I know enough about sharding to comment on 3 -
    seems to me
    >>>>>>>>> like it could be locking in a design before anyone truly
    knows what
    >>>>>>>>> sharding in C* looks like. But hopefully I'm wrong and
    there are
    >>>>>>>>> devs out there that have already thought that through.
    >>>>>>>> Thanks. That is our view and is great to hear.
    >>>>>>>>
    >>>>>>>> About our proposal number 3: In my view, good protocol
    designs are
    >>>>>>>> future proof and flexible. We certainly don't want to
    propose a
    >>>> design
    >>>>>>>> that works just for Scylla, but would support reasonable
    >>>>>>>> implementations, regardless of what they look like.
    >>>>>>>>
    >>>>>>>>> Do we have driver authors who wish to support both projects?
    >>>>>>>>>
    >>>>>>>>> Surely, but I imagine it would be a minority.
    >>>>>>>>>
    >>>>>>>>
    >>>>>>>
    >>>>>>> --
    >>>>>> Ben Bromhead
    >>>>>> CTO | Instaclustr <https://www.instaclustr.com/>
    >>>>>> +1 650 284 9692
    >>>>>> Reliability at Scale
    >>>>>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and
    Softlayer
    >>>>>>
    >>>>>
    >>>>
    >>>> --
    >>> Ben Bromhead
    >>> CTO | Instaclustr <https://www.instaclustr.com/>
    >>> +1 650 284 9692
    >>> Reliability at Scale
    >>> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
    >>>
    >>
    >>
    >

--
Ben Bromhead
CTO | Instaclustr <https://www.instaclustr.com/>
+1 650 284 9692
Reliability at Scale
Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
