Re: OPP and controlling partitioning

Adi Mon, 15 Nov 2010 12:17:20 -0800

>>1) "So if your node tokens are set as "vertexid_" all keys with the same
prefix will be in the same range."
Adding to Aaron's comment:
This holds only if you use OrderPreservingPartitioner. RandomPartitioner
(the default) hashes each key to produce its token, so keys that share a
prefix are scattered across nodes.
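
To make the difference concrete, here is a rough Python sketch of the two
behaviours. It is not Cassandra code: the function names, the three-node ring
and the "v1_"-style keys are made up for illustration, plain string comparison
stands in for OPP's ordering, and MD5 reduced modulo 2**127 only approximates
how RandomPartitioner derives a token.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed token space for the hashed case


def owner_index(node_tokens, token):
    """Index of the node owning `token`: the first node token >= token,
    wrapping around to node 0 past the last token."""
    return bisect.bisect_left(node_tokens, token) % len(node_tokens)


def opp_owner(node_tokens, key):
    # Order-preserving: the key itself acts as the token.
    return owner_index(node_tokens, key)


def rp_owner(node_tokens, key):
    # Hash-based: the token is derived from an MD5 digest of the key.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return owner_index(node_tokens, int(digest, 16) % RING_SIZE)


prefix_tokens = ["v1_", "v2_", "v3_"]   # tokens placed on prefix boundaries
split_tokens = ["v1_m", "v2_", "v3_"]   # a token in the middle of a prefix
hash_tokens = [i * RING_SIZE // 3 for i in range(3)]

keys = ["v1_age", "v1_name", "v1_zip"]
print([opp_owner(prefix_tokens, k) for k in keys])  # [1, 1, 1]
print([opp_owner(split_tokens, k) for k in keys])   # [0, 1, 1]
print([rp_owner(hash_tokens, k) for k in keys])     # depends on the hashes
```

With tokens on the prefix boundaries every "v1_" key falls in one range; with
a token in the middle of the prefix the same keys are split across two nodes,
which is the case Claudio asks about.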





On Mon, Nov 15, 2010 at 2:47 PM, Aaron Morton <aa...@thelastpickle.com> wrote:

> Rows are distributed around the cluster according to the ordering from the
> Partitioner used, and the Replication Strategy. All data for the same key
> will be stored together, and then replicated RF times.
>
> To answer your questions...
> 1) Each node is responsible for the keys between the previous node's token
> and its own. So if your node tokens are set as "vertexid_" all keys with
> the same prefix will be in the same range. Note that the row data will be
> stored on RF replicas, and not just on the node with the appropriate token.
>
> 2) I *think* you want to look at
> o.a.c.s.StorageService.getNaturalEndpoints(); this is not exposed to the
> outside world though. However *every* read or write request is sent to all
> replicas, even those at CL ONE. There is no concept of one node being the
> only place that a row is stored.
>
> FWIW it sounds like you want to disable some of the fine work cassandra
> does to ensure your data is replicated and available. By deciding that one
> machine will be responsible for a portion of the data you are introducing a
> single point of failure. Try writing your app against a cluster and let
> cassandra take care of things, then dive into the details. For example I
> cannot remember anyone on the list having serious issues with network
> overhead.
>
> You may also want to consider FlockDB from Twitter; it sits on top of a
> sharded MySQL db: https://github.com/twitter/flockdb
>
> Hope that helps.
> Aaron
>
>
> On 16 Nov, 2010, at 03:53 AM, Claudio Martella <claudio.marte...@tis.bz.it>
> wrote:
>
> Hello list,
>
> I'm in the process of writing an application which uses cassandra as a
> "storage" backend. The application is a graph database and it's supposed
> to be a baseline application for further development in the field.
>
> The idea is to implement a property graph: a multigraph (multiple edges
> connecting two vertices are possible) with properties in the form of
> name/value for edges and vertices. The idea is to traverse the graph
> with queries like "give me all the women that are liked by men i know",
> something like:
>
> Vertex[name=claudio]=>outgoingEdge[type=knows]=>Vertex[gender=male]=>outgoingEdge[type=likes]=>Vertex[gender=female].
> This is basically a step by step expansion/filtering based on properties.
>
> In my architecture my application-logic node is coupled with the
> cassandra node storing its data. I'd like to have some kind of "atomic
> set" of data that is guaranteed to be stored on the same cassandra node
> (in my case the vertex, its adj list, its properties, its edges and
> their properties), so that i can send the required filtering and
> expansion to a particular node, which will run the logic behind it (and
> i can route such a request with the same logic cassandra uses to route its
> requests).
> This is in an effort to (a) minimize network i/o (i'd be able to send
> the query token to the application node which would issue a local get to
> its local cassandra) and (b) distribute computation (i'd be able to
> distribute filtering between all the nodes storing for example the
> node's neighborhood). This is still not optimal, but it would be a good
> start.
>
> For this reason i thought about a datamodel that has composite keys:
>
> vertexid and edgeid are uuids while propertyname is a string.
>
> CF vertices {
>     vertexid_propertyname {
>         propertyvalue: null
>     }
> }
>
> CF edges {
>     vertexid_[in|out]_propertyname_edgeid {
>         propertyvalue: othervertexid
>     }
> }
>
> With this datamodel i could easily and efficiently issue slices and
> ranges to cassandra with the equality predicates on properties i need.
> What i need now is to partition my data on the prefix "vertexid_". Such
> a datamodel does have a concept of "ascending ordering", so i thought
> about OPP, but to my understanding OPP does not guarantee that all the data
> starting with the same prefix will end up on the same cassandra node,
> only that some of it will. My set of data about a vertex could still be
> split between two cassandra nodes in case a token ends up being a key in
> the middle of the set, right?
>
> What i require exactly is:
>
> (1) to have all the rows belonging to the same vertexid (which is a
> uuid) on the same cassandra node. Can i achieve this?
> (2) given this partitioning, know the IP of the cassandra node storing
> that vertex data, from outside of cassandra. This is the logic cassandra
> uses to route requests for keys and i have to access it from outside.
>
> Can anybody comment about these?
>
>
> Thanks
>
>
> Claudio
>
>
> Unit Research & Development - Analyst
>
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax +39 0471 068 129
> claudio.marte...@tis.bz.it http://www.tis.bz.it
>
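For question (2) in the quoted thread, the lookup Aaron points at can be
approximated from outside Cassandra if you know the ring. The sketch below is
only an illustration under stated assumptions: `natural_endpoints` is a
hypothetical helper named after the method he mentions, not its real
signature; the ring, tokens and IP addresses are invented; keys are compared
as plain strings as with OPP; and replicas are assumed to be the owning node
plus the next RF-1 nodes around the ring, which ignores rack- or
datacenter-aware placement.

```python
import bisect


def natural_endpoints(ring, key, rf):
    """Return the addresses assumed to hold `key`.

    `ring` is a list of (token, address) pairs sorted by token. The first
    node whose token is >= key owns it; the following nodes (wrapping
    around) are taken as the additional replicas.
    """
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, key) % len(ring)
    count = min(rf, len(ring))  # cannot have more replicas than nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical three-node ring with tokens on vertex-prefix boundaries.
ring = [("v1_", "10.0.0.1"), ("v2_", "10.0.0.2"), ("v3_", "10.0.0.3")]

print(natural_endpoints(ring, "v1_name", 2))  # ['10.0.0.2', '10.0.0.3']
print(natural_endpoints(ring, "v3_name", 2))  # ['10.0.0.1', '10.0.0.2']
```

Even in this simplified form the result is a list rather than a single
address, which matches the point that a row never lives on only one node when
RF is greater than 1.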
