Hi,

> This doesn't work without additional changes, for RF>1. The token ring could 
> place two replicas of the same token range on the same physical server, even 
> though those are two separate cores of the same server. You could add another 
> element to the hierarchy (cluster -> datacenter -> rack -> node -> 
> core/shard), but that generates unneeded range movements when a node is added.

I have seen rack awareness used/abused to solve this.
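
The way the abuse usually works (a sketch, assuming each core gossips as
its own endpoint and GossipingPropertyFileSnitch is in use): give every
endpoint on the same physical box the same rack name. NetworkTopologyStrategy
places replicas in distinct racks when it can, so no two cores of one box
hold the same range as long as the number of physical hosts per DC is >= RF.

    # cassandra-rackdc.properties, identical for every shard-endpoint
    # on one physical host ("db-07" is a hypothetical host name)
    dc=DC1
    rack=db-07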

Regards,
Ariel

> On Apr 22, 2018, at 8:26 AM, Avi Kivity <a...@scylladb.com> wrote:
> 
> 
> 
>> On 2018-04-19 21:15, Ben Bromhead wrote:
>> Re #3:
>> 
>> Yup, I was thinking each shard/port would appear as a discrete server to the
>> client.
> 
> This doesn't work without additional changes, for RF>1. The token ring could 
> place two replicas of the same token range on the same physical server, even 
> though those are two separate cores of the same server. You could add another 
> element to the hierarchy (cluster -> datacenter -> rack -> node -> 
> core/shard), but that generates unneeded range movements when a node is added.
> 
>> If the per-port suggestion is unacceptable due to hardware requirements,
>> remembering that Cassandra is built with the concept of scaling *commodity*
>> hardware horizontally, you'll have to spend your time and energy convincing
>> the community to support a protocol feature it has no (current) use for or
>> find another interim solution.
> 
> Those servers are commodity servers (not x86, but still commodity). In any 
> case, 60+ logical cores are common now (hello AWS i3.16xlarge or even 
> i3.metal), and we can only expect logical core counts to continue to increase 
> (there are 48-core ARM processors now).
> 
>> 
>> Another way would be to build support and consensus around a clear
>> technical need in the Apache Cassandra project as it stands today.
>> 
>> One way to build community support might be to contribute an
>> Apache-licensed thread-per-core implementation in Java that matches the
>> protocol change and shard concept you are looking for ;P
> 
> I doubt I'll survive the egregious top-posting that is going on in this list.
> 
>> 
>> 
>>> On Thu, Apr 19, 2018 at 1:43 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>> 
>>> Hi,
>>> 
>>> So at a technical level I don't understand this yet.
>>> 
>>> So you have a database consisting of single-threaded shards and an accept
>>> socket that is producing TCP connections, and in advance you don't know
>>> which connection is going to send messages to which shard.
>>> 
>>> What is the mechanism by which you get the packets for a given TCP
>>> connection delivered to a specific core? I know that a given TCP connection
>>> will normally have all of its packets delivered to the same queue from the
>>> NIC because the tuple of source address + port and destination address +
>>> port is typically hashed to pick one of the queues the NIC presents. I
>>> might have the contents of the tuple slightly wrong, but it always includes
>>> a component you don't get to control.
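>>> 
>>> A minimal sketch of my mental model, assuming the NIC hashes the TCP
>>> 4-tuple and takes it modulo the queue count (real NICs use a Toeplitz
>>> hash with a configurable key; Objects.hash stands in just to show the
>>> shape of the problem):
>>> 
>>>     import java.util.Objects;
>>> 
>>>     final class RssSketch {
>>>         // The NIC hashes the connection tuple and picks a queue.
>>>         static int queueFor(String srcIp, int srcPort,
>>>                             String dstIp, int dstPort, int numQueues) {
>>>             int hash = Objects.hash(srcIp, srcPort, dstIp, dstPort);
>>>             return Math.floorMod(hash, numQueues);
>>>         }
>>> 
>>>         public static void main(String[] args) {
>>>             // The client's ephemeral source port is in the tuple and
>>>             // under nobody's control, so which queue (and core) a
>>>             // connection lands on is effectively random.
>>>             System.out.println(queueFor("10.0.0.5", 54321, "10.0.0.9", 9042, 8));
>>>         }
>>>     }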
>>> 
>>> Since it's hashing how do you manipulate which queue packets for a TCP
>>> connection go to and how is it made worse by having an accept socket per
>>> shard?
>>> 
>>> You also mention 160 ports as bad, but it doesn't sound like a big number
>>> resource-wise. Is it an operational headache?
>>> 
>>> RE tokens distributed amongst shards: the way that would work right now is
>>> that each port number appears to be a discrete instance of the server. So
>>> you could have shards be actual shards that are simply colocated on the
>>> same box, run in the same process, and share resources. I know this pushes
>>> more of the complexity into the server vs. the driver, as the driver expects
>>> all shards to share some client-visible state like system tables and certain
>>> identifiers.
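>>> 
>>> To make that concrete, a sketch of the "each port is a discrete
>>> instance" model from the driver's side (basePort and shardCount are
>>> illustrative names, not from any actual driver API):
>>> 
>>>     import java.net.InetSocketAddress;
>>>     import java.util.ArrayList;
>>>     import java.util.List;
>>> 
>>>     final class ShardEndpoints {
>>>         // Treat each (host, basePort + shard) pair as an independent
>>>         // node; existing topology and metadata code needs no changes.
>>>         static List<InetSocketAddress> endpointsFor(String host,
>>>                                                     int basePort,
>>>                                                     int shardCount) {
>>>             List<InetSocketAddress> endpoints = new ArrayList<>();
>>>             for (int shard = 0; shard < shardCount; shard++) {
>>>                 endpoints.add(new InetSocketAddress(host, basePort + shard));
>>>             }
>>>             return endpoints;
>>>         }
>>>     }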
>>> 
>>> Ariel
>>>> On Thu, Apr 19, 2018, at 12:59 PM, Avi Kivity wrote:
>>>> Port-per-shard is likely the easiest option, but it's too ugly to
>>>> contemplate. We run on machines with 160 shards (IBM POWER 2s20c160t
>>>> IIRC); it would be just horrible to have 160 open ports.
>>>> 
>>>> 
>>>> It also doesn't fit well with the NIC's ability to automatically
>>>> distribute packets among cores using multiple queues, so the kernel
>>>> would have to shuffle those packets around. Much better to have those
>>>> packets delivered directly to the core that will service them.
>>>> 
>>>> 
>>>> (also, some protocol changes are needed so the driver knows how tokens
>>>> are distributed among shards)
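>>>> 
>>>> (A sketch of what the driver would have to compute, under one plausible
>>>> mapping -- purely illustrative, not necessarily what any implementation
>>>> uses -- that divides the Murmur3 token space evenly among shards:
>>>> 
>>>>     final class TokenToShard {
>>>>         static int shardFor(long token, int shardCount) {
>>>>             // Map the signed token into [0, 1), then scale to a
>>>>             // shard index; the clamp guards the top edge.
>>>>             double normalized = (token / Math.pow(2, 64)) + 0.5;
>>>>             return Math.min((int) (normalized * shardCount), shardCount - 1);
>>>>         }
>>>>     }
>>>> 
>>>> The protocol change would need to advertise the real mapping and the
>>>> shard count, e.g. during the OPTIONS/SUPPORTED exchange.)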
>>>> 
>>>>> On 2018-04-19 19:46, Ben Bromhead wrote:
>>>>> WRT to #3
>>>>> To fit in the existing protocol, could you have each shard listen on a
>>>>> different port? Drivers are likely going to support this due to
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-7544 (
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-11596). I'm not super
>>>>> familiar with the ticket so there might be something I'm missing, but it
>>>>> sounds like a potential approach.
>>>>> 
>>>>> This would give you a path forward at least for the short term.
>>>>> 
>>>>> 
>>>>> On Thu, Apr 19, 2018 at 12:10 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> I think that updating the Cassandra protocol spec puts the onus on the
>>>>>> party changing the specification to have an implementation of the spec
>>>>>> in Cassandra as well as in the Java and Python drivers (those are both
>>>>>> used in the Cassandra repo). Until it's implemented in Cassandra we
>>>>>> haven't fully evaluated the specification change. There is no substitute
>>>>>> for trying to make it work.
>>>>>> 
>>>>>> There are also realities to consider as to what the maintainers of the
>>>>>> drivers are willing to commit.
>>>>>> 
>>>>>> RE #1,
>>>>>> 
>>>>>> I am +1 on the fact that we shouldn't require an extra hop for range
>>>>>> scans. In JIRA Jeremiah made the point that you can still do this from
>>>>>> the client by breaking up the token ranges, but it's a leaky abstraction
>>>>>> to have a paging interface that isn't a vanilla ResultSet interface.
>>>>>> Serial vs. parallel is kind of orthogonal as the driver can do either.
>>>>>> 
>>>>>> I agree it looks like the current specification doesn't make what should
>>>>>> be simple as simple as it could be for driver implementers.
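>>>>>> 
>>>>>> For reference, the client-side workaround looks roughly like this (a
>>>>>> sketch that splits the full Murmur3 ring into N slices and issues one
>>>>>> token-bounded query per slice, serially or in parallel; keyspace,
>>>>>> table, and key names are placeholders):
>>>>>> 
>>>>>>     import java.math.BigInteger;
>>>>>>     import java.util.ArrayList;
>>>>>>     import java.util.List;
>>>>>> 
>>>>>>     final class RangeSplit {
>>>>>>         static List<String> splitQueries(int splits) {
>>>>>>             BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
>>>>>>             BigInteger span = BigInteger.valueOf(Long.MAX_VALUE).subtract(min);
>>>>>>             BigInteger n = BigInteger.valueOf(splits);
>>>>>>             List<String> queries = new ArrayList<>();
>>>>>>             for (int i = 0; i < splits; i++) {
>>>>>>                 BigInteger lo = min.add(span.multiply(BigInteger.valueOf(i)).divide(n));
>>>>>>                 BigInteger hi = min.add(span.multiply(BigInteger.valueOf(i + 1)).divide(n));
>>>>>>                 // Murmur3 never assigns a key the minimum token, so
>>>>>>                 // (lo, hi] slices cover the whole ring without overlap.
>>>>>>                 queries.add(String.format(
>>>>>>                     "SELECT * FROM ks.tbl WHERE token(pk) > %s AND token(pk) <= %s",
>>>>>>                     lo, hi));
>>>>>>             }
>>>>>>             return queries;
>>>>>>         }
>>>>>>     }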
>>>>>> 
>>>>>> RE #2,
>>>>>> 
>>>>>> +1 on this change assuming an implementation in Cassandra and the Java
>>>>>> and Python drivers.
>>>>>> 
>>>>>> RE #3,
>>>>>> 
>>>>>> It's hard to be +1 on this because we don't benefit by boxing ourselves
>>>>>> in by defining a spec we haven't implemented, tested, and decided we are
>>>>>> satisfied with. Having it in ScyllaDB de-risks it to a certain extent,
>>>>>> but what if Cassandra decides to go a different direction in some way?
>>>>>> 
>>>>>> I don't think there is much discussion to be had without an example of
>>>>>> the changes to the CQL specification to look at, but even then if it
>>>>>> looks risky I am not likely to be in favor of it.
>>>>>> 
>>>>>> Regards,
>>>>>> Ariel
>>>>>> 
>>>>>>> On Thu, Apr 19, 2018, at 9:33 AM, glom...@scylladb.com wrote:
>>>>>>> On 2018/04/19 07:19:27, kurt greaves <k...@instaclustr.com> wrote:
>>>>>>>>> 1. The protocol change is developed using the Cassandra process in
>>>>>>>>>     a JIRA ticket, culminating in a patch to
>>>>>>>>>     doc/native_protocol*.spec when consensus is achieved.
>>>>>>>> I don't think forking would be desirable (for anyone) so this seems
>>>>>>>> the most reasonable to me. For 1 and 2 it certainly makes sense but
>>>>>>>> can't say I know enough about sharding to comment on 3 - seems to me
>>>>>>>> like it could be locking in a design before anyone truly knows what
>>>>>>>> sharding in C* looks like. But hopefully I'm wrong and there are
>>>>>>>> devs out there that have already thought that through.
>>>>>>> Thanks. That is our view as well, and it is great to hear.
>>>>>>> 
>>>>>>> About our proposal number 3: in my view, good protocol designs are
>>>>>>> future-proof and flexible. We certainly don't want to propose a design
>>>>>>> that works just for Scylla, but one that would support reasonable
>>>>>>> implementations regardless of how they may look.
>>>>>>> 
>>>>>>>> Do we have driver authors who wish to support both projects?
>>>>>>>> 
>>>>>>> Surely, but I imagine it would be a minority.
>>>>>>>> 
>> 
> 
> 

