Sounds interesting. Could 80% of what we gain with a “shard” approach be achieved by using Mesos to wrap a stateful service? Technically that is “sharding” the whole machine and making better use of its resources.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation

On Apr 19, 2018, 1:23 PM -0500, sankalp kohli <kohlisank...@gmail.com>, wrote:
> If you donate Thread per core to C*, I am sure someone can help you review it and get it committed.
>
> On Thu, Apr 19, 2018 at 11:15 AM, Ben Bromhead <b...@instaclustr.com> wrote:
> > Re #3:
> >
> > Yup, I was thinking each shard/port would appear as a discrete server to the client.
> >
> > If the per-port suggestion is unacceptable due to hardware requirements, remembering that Cassandra is built around the concept of scaling *commodity* hardware horizontally, you'll have to spend your time and energy convincing the community to support a protocol feature it has no (current) use for, or find another interim solution.
> >
> > Another way would be to build support and consensus around a clear technical need in the Apache Cassandra project as it stands today.
> >
> > One way to build community support might be to contribute an Apache-licensed thread-per-core implementation in Java that matches the protocol change and shard concept you are looking for ;P
> >
> > On Thu, Apr 19, 2018 at 1:43 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
> > > Hi,
> > >
> > > So at a technical level I don't understand this yet.
> > >
> > > So you have a database consisting of single-threaded shards and a socket for accept that is generating TCP connections, and in advance you don't know which connection is going to send messages to which shard.
> > >
> > > What is the mechanism by which you get the packets for a given TCP connection delivered to a specific core? I know that a given TCP connection will normally have all of its packets delivered to the same queue from the NIC, because the tuple of source address + port and destination address + port is typically hashed to pick one of the queues the NIC presents. I might have the contents of the tuple slightly wrong, but it always includes a component you don't get to control.
> > >
> > > Since it's hashing, how do you manipulate which queue the packets for a TCP connection go to, and how is it made worse by having an accept socket per shard?
> > >
> > > You also mention 160 ports as bad, but it doesn't sound like a big number resource-wise. Is it an operational headache?
> > >
> > > RE tokens distributed amongst shards: the way that would work right now is that each port number appears to be a discrete instance of the server. So you could have shards be actual shards that are simply colocated on the same box, run in the same process, and share resources. I know this pushes more of the complexity into the server vs. the driver, as the server expects all shards to share some client-visible state like system tables and certain identifiers.
> > >
> > > Ariel
> > >
> > > On Thu, Apr 19, 2018, at 12:59 PM, Avi Kivity wrote:
> > > > Port-per-shard is likely the easiest option but it's too ugly to contemplate. We run on machines with 160 shards (IBM POWER 2s20c160t IIRC); it will be just horrible to have 160 open ports.
> > > >
> > > > It also doesn't fit well with the NIC's ability to automatically distribute packets among cores using multiple queues, so the kernel would have to shuffle those packets around. Much better to have those packets delivered directly to the core that will service them.
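(Aside, as a rough illustration of the queue selection Ariel and Avi describe above: a minimal Java sketch, assuming a NIC that picks a receive queue by hashing the connection 4-tuple. Real hardware uses a Toeplitz hash with a device-configured key; Objects.hash and the 8-queue count below are simplified stand-ins, not the actual algorithm.)

    import java.util.Objects;

    public class RssQueueSketch {

        // Assumed queue count; real NICs expose a fixed number of RX queues,
        // typically serviced by specific cores or groups of cores.
        static final int NIC_QUEUE_COUNT = 8;

        /** Toy stand-in for the NIC's hash over the connection 4-tuple. */
        static int queueFor(String srcIp, int srcPort, String dstIp, int dstPort) {
            int hash = Objects.hash(srcIp, srcPort, dstIp, dstPort);
            return Math.floorMod(hash, NIC_QUEUE_COUNT);
        }

        public static void main(String[] args) {
            // The client's kernel picks the ephemeral source port, so the server
            // cannot steer which queue (and therefore which core) a connection
            // hashes to, whether it accepts on one port or on one port per shard.
            for (int ephemeralPort : new int[] {49152, 49153, 49154, 49155}) {
                System.out.printf("source port %d -> queue %d%n",
                        ephemeralPort,
                        queueFor("10.0.0.5", ephemeralPort, "10.0.0.9", 9042));
            }
        }
    }

Because the ephemeral source port is outside the server's control, delivering each per-shard port's packets to a specific core would need either kernel-side shuffling (the overhead Avi points at) or explicit per-port steering rules on the NIC.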
> > > > (Also, some protocol changes are needed so the driver knows how tokens are distributed among shards.)
> > > >
> > > > On 2018-04-19 19:46, Ben Bromhead wrote:
> > > > > WRT #3:
> > > > >
> > > > > To fit in the existing protocol, could you have each shard listen on a different port? Drivers are likely going to support this due to https://issues.apache.org/jira/browse/CASSANDRA-7544 (https://issues.apache.org/jira/browse/CASSANDRA-11596). I'm not super familiar with the ticket, so there might be something I'm missing, but it sounds like a potential approach.
> > > > >
> > > > > This would give you a path forward, at least for the short term.
> > > > >
> > > > > On Thu, Apr 19, 2018 at 12:10 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I think that updating the protocol spec in Cassandra puts the onus on the party changing the protocol specification to have an implementation of the spec in Cassandra as well as the Java and Python drivers (those are both used in the Cassandra repo). Until it's implemented in Cassandra we haven't fully evaluated the specification change. There is no substitute for trying to make it work.
> > > > > >
> > > > > > There are also realities to consider as to what the maintainers of the drivers are willing to commit.
> > > > > >
> > > > > > RE #1,
> > > > > >
> > > > > > I am +1 on the fact that we shouldn't require an extra hop for range scans.
> > > > > >
> > > > > > In JIRA Jeremiah made the point that you can still do this from the client by breaking up the token ranges, but it's a leaky abstraction to have a paging interface that isn't a vanilla ResultSet interface. Serial vs. parallel is kind of orthogonal, as the driver can do either. [A client-side sketch of this token-range splitting appears at the end of this thread.]
> > > > > >
> > > > > > I agree it looks like the current specification doesn't make what should be simple as simple as it could be for driver implementers.
> > > > > >
> > > > > > RE #2,
> > > > > >
> > > > > > +1 on this change, assuming an implementation in Cassandra and the Java and Python drivers.
> > > > > >
> > > > > > RE #3,
> > > > > >
> > > > > > It's hard to be +1 on this because we don't benefit by boxing ourselves in by defining a spec we haven't implemented, tested, and decided we are satisfied with. Having it in ScyllaDB de-risks it to a certain extent, but what if Cassandra decides to go a different direction in some way?
> > > > > >
> > > > > > I don't think there is much discussion to be had without an example of the changes to the CQL specification to look at, but even then, if it looks risky I am not likely to be in favor of it.
> > > > > >
> > > > > > Regards,
> > > > > > Ariel
> > > > > >
> > > > > > On Thu, Apr 19, 2018, at 9:33 AM, glom...@scylladb.com wrote:
> > > > > > > On 2018/04/19 07:19:27, kurt greaves <k...@instaclustr.com> wrote:
> > > > > > > > > 1. The protocol change is developed using the Cassandra process in a JIRA ticket, culminating in a patch to doc/native_protocol*.spec when consensus is achieved.
> > > > > > > > I don't think forking would be desirable (for anyone), so this seems the most reasonable to me. For 1 and 2 it certainly makes sense, but I can't say I know enough about sharding to comment on 3 - it seems to me like it could be locking in a design before anyone truly knows what sharding in C* looks like. But hopefully I'm wrong and there are devs out there that have already thought that through.
> > > > > > >
> > > > > > > Thanks. That is our view and it is great to hear.
> > > > > > >
> > > > > > > About our proposal number 3: in my view, good protocol designs are future-proof and flexible. We certainly don't want to propose a design that works just for Scylla, but one that would support reasonable implementations regardless of what they look like.
> > > > > > >
> > > > > > > > Do we have driver authors who wish to support both projects?
> > > > > > >
> > > > > > > Surely, but I imagine it would be a minority.
> >
> > --
> > Ben Bromhead
> > CTO | Instaclustr <https://www.instaclustr.com/>
> > +1 650 284 9692
> > Reliability at Scale
> > Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
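(Aside, a minimal sketch of the client-side token-range splitting Jeremiah and Ariel refer to above: issue one token-restricted query per sub-range of the ring instead of a single paged scan through one coordinator. The table and column names and the split count are placeholders, and a real client would take the actual token ranges from driver metadata so each sub-query can be sent to a replica that owns the range.)

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    public class TokenRangeScanSketch {

        // Murmur3Partitioner reserves Long.MIN_VALUE as the minimum token, so the
        // interval (Long.MIN_VALUE, Long.MAX_VALUE] covers every row.
        private static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
        private static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

        /** Split the Murmur3 token space into `splits` contiguous (start, end] sub-ranges. */
        static List<long[]> splitFullRing(int splits) {
            BigInteger span = MAX.subtract(MIN);
            List<long[]> ranges = new ArrayList<>();
            BigInteger start = MIN;
            for (int i = 1; i <= splits; i++) {
                BigInteger end = (i == splits)
                        ? MAX
                        : MIN.add(span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(splits)));
                ranges.add(new long[] { start.longValueExact(), end.longValueExact() });
                start = end;
            }
            return ranges;
        }

        public static void main(String[] args) {
            // "my_table" and "pk" are placeholder names; a real client would execute
            // each statement (serially or in parallel) against a node owning that range.
            for (long[] r : splitFullRing(4)) {
                System.out.printf(
                        "SELECT pk, value FROM my_table WHERE token(pk) > %d AND token(pk) <= %d;%n",
                        r[0], r[1]);
            }
        }
    }

As Ariel notes, this works without any protocol change, but it leaks the abstraction: the client ends up managing many result sets instead of a single vanilla ResultSet.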