Hi,

> This doesn't work without additional changes, for RF>1. The token ring could
> place two replicas of the same token range on the same physical server, even
> though those are two separate cores of the same server. You could add another
> element to the hierarchy (cluster -> datacenter -> rack -> node ->
> core/shard), but that generates unneeded range movements when a node is added.

I have seen rack awareness used/abused to solve this.
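For illustration, a minimal toy model of the failure mode described above (not Cassandra's actual replication code; all names are made up): a plain ring walk that takes the first RF distinct endpoints will happily place both replicas on two shards of the same host once shards join the ring as peers.

import java.util.*;

// Toy model of replica placement when every shard registers as its own
// ring member: the ring walk picks the first RF *distinct endpoints*, and
// two shards of one host are distinct endpoints even though they share
// hardware, so RF=2 can land entirely on one physical server.
public class ShardRingDemo {
    record Endpoint(String host, int shard) {}

    static List<Endpoint> replicasFor(long token,
                                      NavigableMap<Long, Endpoint> ring,
                                      int rf) {
        List<Endpoint> replicas = new ArrayList<>();
        // Walk clockwise from the token's successor, wrapping around.
        for (Endpoint e : ring.tailMap(token, true).values())
            if (replicas.size() < rf && !replicas.contains(e)) replicas.add(e);
        for (Endpoint e : ring.headMap(token, false).values())
            if (replicas.size() < rf && !replicas.contains(e)) replicas.add(e);
        return replicas;
    }

    public static void main(String[] args) {
        NavigableMap<Long, Endpoint> ring = new TreeMap<>();
        ring.put(100L, new Endpoint("hostA", 0));  // hostA, shard 0
        ring.put(200L, new Endpoint("hostA", 1));  // hostA, shard 1
        ring.put(300L, new Endpoint("hostB", 0));
        // Prints two endpoints that are both on hostA:
        System.out.println(replicasFor(50L, ring, 2));
    }
}

The extra hierarchy level Avi mentions would repair the placement, at the cost of the range movements he describes.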
Regards,
Ariel

> On Apr 22, 2018, at 8:26 AM, Avi Kivity <a...@scylladb.com> wrote:
>
>> On 2018-04-19 21:15, Ben Bromhead wrote:
>> Re #3:
>>
>> Yup, I was thinking each shard/port would appear as a discrete server to
>> the client.
>
> This doesn't work without additional changes, for RF>1. The token ring could
> place two replicas of the same token range on the same physical server, even
> though those are two separate cores of the same server. You could add another
> element to the hierarchy (cluster -> datacenter -> rack -> node ->
> core/shard), but that generates unneeded range movements when a node is added.
>
>> If the per-port suggestion is unacceptable due to hardware requirements,
>> remembering that Cassandra is built with the concept of scaling *commodity*
>> hardware horizontally, you'll have to spend your time and energy convincing
>> the community to support a protocol feature it has no (current) use for, or
>> find another interim solution.
>
> Those servers are commodity servers (not x86, but still commodity). In any
> case, 60+ logical cores are common now (hello AWS i3.16xlarge or even
> i3.metal), and we can only expect logical core counts to continue to
> increase (there are 48-core ARM processors now).
>
>> Another way would be to build support and consensus around a clear
>> technical need in the Apache Cassandra project as it stands today.
>>
>> One way to build community support might be to contribute an Apache
>> licensed thread-per-core implementation in Java that matches the protocol
>> change and shard concept you are looking for ;P
>
> I doubt I'll survive the egregious top-posting that is going on in this list.
>
>>> On Thu, Apr 19, 2018 at 1:43 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>> Hi,
>>>
>>> So at a technical level I don't understand this yet.
>>>
>>> So you have a database consisting of single-threaded shards and an accept
>>> socket that is generating TCP connections, and in advance you don't know
>>> which connection is going to send messages to which shard.
>>>
>>> What is the mechanism by which you get the packets for a given TCP
>>> connection delivered to a specific core? I know that a given TCP
>>> connection will normally have all of its packets delivered to the same
>>> queue from the NIC, because the tuple of source address + port and
>>> destination address + port is typically hashed to pick one of the queues
>>> the NIC presents. I might have the contents of the tuple slightly wrong,
>>> but it always includes a component you don't get to control.
>>>
>>> Since it's hashing, how do you manipulate which queue the packets for a
>>> TCP connection go to, and how is it made worse by having an accept socket
>>> per shard?
>>>
>>> You also mention 160 ports as bad, but that doesn't sound like a big
>>> number resource-wise. Is it an operational headache?
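A rough sketch of the queue-selection behavior in question (real NICs use a Toeplitz hash over the connection 4-tuple; Objects.hash below is only a stand-in to show the idea, and all addresses and counts are made up):

import java.util.Objects;

// Rough sketch of NIC receive-side scaling (RSS): the hardware hashes the
// connection 4-tuple and uses the result to pick a receive queue, so all
// packets of one TCP connection land on the same queue. Real NICs use a
// Toeplitz hash; Objects.hash here is an illustrative stand-in only.
public class RssSketch {
    static int pickQueue(String srcAddr, int srcPort,
                         String dstAddr, int dstPort, int numQueues) {
        int hash = Objects.hash(srcAddr, srcPort, dstAddr, dstPort);
        return Math.floorMod(hash, numQueues);
    }

    public static void main(String[] args) {
        // The client's ephemeral source port is the component the server
        // doesn't control, so the server can't steer a connection to a
        // chosen queue just by picking the address and port it listens on.
        for (int srcPort : new int[] {49152, 49153, 49154})
            System.out.println(pickQueue("10.0.0.7", srcPort,
                                         "10.0.0.1", 9042, 8));
    }
}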
>>> RE tokens distributed amongst shards: the way that would work right now
>>> is that each port number appears to be a discrete instance of the server.
>>> So you could have shards be actual shards that are simply colocated on
>>> the same box, run in the same process, and share resources. I know this
>>> pushes more of the complexity into the server vs. the driver, as the
>>> server expects all shards to share some client-visible state like system
>>> tables and certain identifiers.
>>>
>>> Ariel
>>>
>>>> On Thu, Apr 19, 2018, at 12:59 PM, Avi Kivity wrote:
>>>> Port-per-shard is likely the easiest option, but it's too ugly to
>>>> contemplate. We run on machines with 160 shards (IBM POWER 2s20c160t
>>>> IIRC); it would be just horrible to have 160 open ports.
>>>>
>>>> It also doesn't fit well with the NIC's ability to automatically
>>>> distribute packets among cores using multiple queues, so the kernel
>>>> would have to shuffle those packets around. Much better to have those
>>>> packets delivered directly to the core that will service them.
>>>>
>>>> (Also, some protocol changes are needed so the driver knows how tokens
>>>> are distributed among shards.)
>>>>
>>>>> On 2018-04-19 19:46, Ben Bromhead wrote:
>>>>> WRT #3:
>>>>> To fit in the existing protocol, could you have each shard listen on a
>>>>> different port? Drivers are likely going to support this due to
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-7544
>>>>> (https://issues.apache.org/jira/browse/CASSANDRA-11596). I'm not super
>>>>> familiar with the ticket, so there might be something I'm missing, but
>>>>> it sounds like a potential approach.
>>>>>
>>>>> This would give you a path forward, at least for the short term.
>>>>>
>>>>>> On Thu, Apr 19, 2018 at 12:10 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I think that updating the protocol spec in Cassandra puts the onus on
>>>>>> the party changing the protocol specification to have an implementation
>>>>>> of the spec in Cassandra as well as in the Java and Python drivers
>>>>>> (those are both used in the Cassandra repo). Until it's implemented in
>>>>>> Cassandra we haven't fully evaluated the specification change. There is
>>>>>> no substitute for trying to make it work.
>>>>>>
>>>>>> There are also realities to consider as to what the maintainers of the
>>>>>> drivers are willing to commit to.
>>>>>>
>>>>>> RE #1,
>>>>>>
>>>>>> I am +1 on the fact that we shouldn't require an extra hop for range
>>>>>> scans. In JIRA, Jeremiah made the point that you can still do this from
>>>>>> the client by breaking up the token ranges, but it's a leaky abstraction
>>>>>> to have a paging interface that isn't a vanilla ResultSet interface.
>>>>>> Serial vs. parallel is kind of orthogonal, as the driver can do either.
>>>>>>
>>>>>> I agree it looks like the current specification doesn't make what should
>>>>>> be simple as simple as it could be for driver implementers.
>>>>>>
>>>>>> RE #2,
>>>>>>
>>>>>> +1 on this change, assuming an implementation in Cassandra and in the
>>>>>> Java and Python drivers.
>>>>>>
>>>>>> RE #3,
>>>>>>
>>>>>> It's hard to be +1 on this, because we don't benefit by boxing ourselves
>>>>>> in by defining a spec we haven't implemented, tested, and decided we are
>>>>>> satisfied with. Having it in ScyllaDB de-risks it to a certain extent,
>>>>>> but what if Cassandra decides to go a different direction in some way?
>>>>>>
>>>>>> I don't think there is much discussion to be had without an example of
>>>>>> the changes to the CQL specification to look at, but even then, if it
>>>>>> looks risky I am not likely to be in favor of it.
>>>>>>
>>>>>> Regards,
>>>>>> Ariel
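As an aside on "breaking up the token ranges" client-side: a minimal sketch (a hypothetical helper, not the Java driver's API; ks.tbl and pk are made-up names) that carves the Murmur3 token space into subranges a client could scan in parallel with token()-bounded queries:

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Sketch of client-side range splitting: carve the Murmur3 token space
// [-2^63, 2^63-1] into n subranges, each scannable independently (and in
// parallel) with a token()-bounded query. BigInteger avoids overflow in
// the interval arithmetic.
public class TokenSplitter {
    static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    static final BigInteger SPAN =
        BigInteger.valueOf(Long.MAX_VALUE).subtract(MIN);

    static List<long[]> split(int n) {
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < n; i++)
            ranges.add(new long[] {bound(i, n), bound(i + 1, n)});
        return ranges;
    }

    static long bound(int i, int n) {
        return MIN.add(SPAN.multiply(BigInteger.valueOf(i))
                           .divide(BigInteger.valueOf(n))).longValueExact();
    }

    public static void main(String[] args) {
        // Each subrange becomes one independent query; the very first token
        // (-2^63) is excluded by the strict lower bound, which a complete
        // implementation would special-case.
        for (long[] r : split(4))
            System.out.printf(
                "SELECT * FROM ks.tbl WHERE token(pk) > %d AND token(pk) <= %d%n",
                r[0], r[1]);
    }
}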
>>>>>>
>>>>>>> On Thu, Apr 19, 2018, at 9:33 AM, glom...@scylladb.com wrote:
>>>>>>> On 2018/04/19 07:19:27, kurt greaves <k...@instaclustr.com> wrote:
>>>>>>>>> 1. The protocol change is developed using the Cassandra process in
>>>>>>>>> a JIRA ticket, culminating in a patch to
>>>>>>>>> doc/native_protocol*.spec when consensus is achieved.
>>>>>>>> I don't think forking would be desirable (for anyone), so this seems
>>>>>>>> the most reasonable to me. For #1 and #2 it certainly makes sense,
>>>>>>>> but I can't say I know enough about sharding to comment on #3 - it
>>>>>>>> seems to me like it could be locking in a design before anyone truly
>>>>>>>> knows what sharding in C* looks like. But hopefully I'm wrong and
>>>>>>>> there are devs out there that have already thought that through.
>>>>>>> Thanks. That is our view as well, and it is great to hear.
>>>>>>>
>>>>>>> About our proposal number 3: in my view, good protocol designs are
>>>>>>> future-proof and flexible. We certainly don't want to propose a design
>>>>>>> that works just for Scylla, but one that would support reasonable
>>>>>>> implementations regardless of how they may look.
>>>>>>>
>>>>>>>> Do we have driver authors who wish to support both projects?
>>>>>>>>
>>>>>>>> Surely, but I imagine it would be a minority.
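To make proposal #3 concrete, a sketch of the kind of token-to-shard mapping a driver could apply if the server advertised its shard count (illustrative only; the exact function would be part of the protocol change, and Scylla's real mapping may differ):

import java.math.BigInteger;

// Illustrative sketch (not a ratified protocol) of mapping a Murmur3 token
// to a shard given an advertised shard count: bias the token into the
// unsigned range [0, 2^64), multiply by the shard count, and keep the top
// 64 bits. This spreads the ring evenly across shards.
public class TokenToShard {
    static final BigInteger TWO_64 = BigInteger.ONE.shiftLeft(64);

    static int shardOf(long token, int shards) {
        // token + 2^63, as an unsigned 64-bit value in [0, 2^64)
        BigInteger biased = BigInteger.valueOf(token)
                .add(BigInteger.ONE.shiftLeft(63));
        return biased.multiply(BigInteger.valueOf(shards))
                .divide(TWO_64).intValueExact();
    }

    public static void main(String[] args) {
        // Tokens from across the ring map onto shards 0, 3, 4, and 7 of 8.
        for (long t : new long[] {Long.MIN_VALUE, -1L, 0L, Long.MAX_VALUE})
            System.out.println(shardOf(t, 8));
    }
}

With a mapping like this plus a per-connection shard identifier, a driver could route a request for a given token straight to the owning core, which is the routing the proposal is after.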