Re: Correct model

Hiller, Dean Mon, 24 Sep 2012 08:24:42 -0700

I am confused.  In this email you say you want "get all requests for a user" 
and in a previous one you said "Select all the users which have new requests, 
since date D" so let me answer both…


For the latter, you make ONE query into the latest partition (ONE partition) of the 
GlobalRequestsCF which gives you the most recent requests ALONG with the user 
ids of those requests.  If you queried all partitions, you would most likely 
blow out your JVM memory.

For the former, you make ONE query to the UserRequestsCF with userid = <your 
user id> to get all the requests for that user.
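
A minimal in-memory sketch of those two lookups, assuming plain Python dicts 
stand in for the column families.  This is not the Hector, Astyanax, or PlayOrm 
API; the names global_requests_cf, user_requests_cf, and both functions are 
made up for illustration, and it assumes date D falls inside the latest 
partition.

```python
# In-memory stand-ins for the two column families (illustrative only).
# GlobalRequestsCF: partition (here a month) -> list of (timestamp, user_id)
global_requests_cf = {
    "2012-08": [(1345000000, "alice")],
    "2012-09": [(1348000000, "bob"), (1348300000, "alice"), (1348400000, "carol")],
}
# UserRequestsCF: user_id -> list of (timestamp, request payload)
user_requests_cf = {
    "alice": [(1345000000, "req-1"), (1348300000, "req-3")],
    "bob": [(1348000000, "req-2")],
    "carol": [(1348400000, "req-4")],
}


def users_with_new_requests(since_ts):
    """ONE query: read only the latest partition, never all of them."""
    latest = max(global_requests_cf)  # "YYYY-MM" keys sort chronologically
    return sorted({user for ts, user in global_requests_cf[latest] if ts >= since_ts})


def all_requests_for_user(user_id):
    """ONE query: a single lookup by user id."""
    return user_requests_cf.get(user_id, [])
```

Each function touches exactly one key of its dict, which mirrors the "ONE 
query" point above.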

You mean too many rows, not a row too long, right? I am assuming each request 
will be a different row, not a new column. Is having billions of ROWS something 
that performs poorly in Cassandra? I know Cassandra allows up to 2 billion columns 
for a CF, but I am not aware of a limitation for rows…

Sorry, I was skipping some context.  A lot of the backing indexing is sometimes 
done as a long row, so in PlayOrm, too many rows in a partition means too many 
columns in the indexing row for that partition.  I believe the same is true in 
Cassandra for its indexing.

If I understood it correctly, if I don't specify partitions, Cassandra will 
store all my data in a single node?

Cassandra spreads all your data out on all nodes with or without partitions.  A 
single partition does have its data co-located though.

If 99.999% of my users will have less than 100k requests, would it make sense to 
partition by user?

If you are at 100k (and the requests are rather small), you could embed all the 
requests in the user or go with Aaron's suggestion below of a UserRequestsCF.  
If your requests are rather large, you probably don't want to embed them in the 
User.  Either way, it's one query or one row key lookup.

That's cool! :D So if I need to query data split in 10 partitions, for 
instance, I can perform the query in parallel by using a multiget, right?

Multiget ignores partitions…you feed it a LIST of keys and it gets them.  It 
just so happens that partitionId had to be part of your row key.
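
A small sketch of that point, assuming a dict as a stand-in for the column 
family.  The multiget function here is a hypothetical helper, not the Hector 
or Astyanax call, and the "<user_id>:<partition_id>" key format is only an 
example.

```python
def row_key(user_id, partition_id):
    # The partition id is just one part of the row key string.
    return f"{user_id}:{partition_id}"


def multiget(column_family, keys):
    """Hypothetical stand-in: take a LIST of keys, return the rows that exist."""
    return {k: column_family[k] for k in keys if k in column_family}


# Illustrative data: row key -> list of requests
user_requests_cf = {
    "42:2012-07": ["req-a"],
    "42:2012-09": ["req-b", "req-c"],
    "7:2012-09": ["req-d"],
}

# The caller, not multiget, decides which partitions to ask for.
keys = [row_key(42, p) for p in ("2012-07", "2012-08", "2012-09")]
rows = multiget(user_requests_cf, keys)
```

The helper knows nothing about partitions; it only sees the finished keys, and 
a key with no row ("42:2012-08") is simply absent from the result.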

Out of curiosity, if each get will occur on a different node, I would need to 
connect to each of the nodes? Or would I query 1 node and it would communicate 
to others?

I have used Hector and now use Astyanax, I don't worry much about that layer, 
but I feed astyanax 3 nodes and I believe it discovers some of the other ones.  
I believe the latter is true but am not 100% sure as I have not looked at that 
code.

As an analogy on the above, if you happen to have used PlayOrm, you would ONLY 
need one Requests table, partitioned by user AND by time (two views into the 
same data, partitioned two different ways), and you can do exactly the same 
thing as Aaron's example.  PlayOrm doesn't embed the partition ids in the key, 
leaving it free to partition twice like in your case.  With the partition id in 
the key, a refactor has to map/reduce A LOT more rows, because other rows hold 
the FK of <partitionid><subrowkey>, whereas if you don't have the partition id 
in the key, you only map/reduce the partitioned table in a redesign/refactor.  
That said, we will be adding support for CQL partitioning in addition to 
PlayOrm partitioning even though it can be a little less flexible sometimes.

Also, CQL locates all the data for a partition on one node.  We have found it 
can "sometimes" be faster, thanks to the parallelized disks, when the 
partitions are NOT all on one node, so PlayOrm partitions are virtual only and 
do not relate to where the rows are stored.  For example, on our 6 nodes a join 
query on a partition with 1,000,000 rows took 60ms (of course I can't compare 
to CQL here since it doesn't do joins).  It really depends on how much data is 
going to come back in the query, though.  There are tradeoffs between parallel 
disks across nodes and having your data all on one node, of course.

Later,
Dean



From: Marcelo Elias Del Valle <mvall...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, September 24, 2012 7:45 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Correct model



2012/9/23 Hiller, Dean <dean.hil...@nrel.gov>
You need to split data among partitions or your query won't scale as more and 
more data is added to the table.  Having the partition means you are querying 
far fewer rows.
This will happen in case I can query just one partition. But if I need to query 
things in multiple partitions, wouldn't it be slower?

He means determine the ONE partition key and query that partition.  I.e., if you 
want just latest user requests, figure out the partition key based on which 
month you are in and query it.  If you want the latest independent of user, 
query the correct single partition for GlobalRequests CF.
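
A short sketch of working out that one partition key on the client, assuming 
monthly partitions named "YYYY-MM" and a "<user_id>:<partition>" row key.  Both 
formats and all function names are assumptions for illustration, not something 
fixed by Cassandra.

```python
from datetime import datetime, timezone


def partition_start(ts):
    """Name of the monthly partition that contains ts, e.g. '2012-09'."""
    return ts.strftime("%Y-%m")


def user_requests_key(user_id, ts):
    # Latest requests for ONE user: the user id plus the current month.
    return f"{user_id}:{partition_start(ts)}"


def global_requests_key(ts):
    # Latest requests independent of user: just the current month.
    return partition_start(ts)


now = datetime(2012, 9, 24, 8, 24, tzinfo=timezone.utc)
print(user_requests_key("u1", now))   # u1:2012-09
print(global_requests_key(now))       # 2012-09
```

Either way the client computes one key and issues one query, rather than 
scanning every partition.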

But in this case, I didn't understand Aaron's model then. My first query is to 
get all requests for a user. If I partition by time, I will need to query 
all partitions to get the results, right? In his answer it was said I would 
query ONE partition...

If I want all the requests for the user, couldn't I just select all UserRequest 
records which start with "userId"?
He designed it so the user requests table was completely scalable so he has 
partitions there.  If you don't have partitions, you could run into a row that 
is toooo long.  You don't need to design it this way if you know none of your 
users are going to go into the millions as far as number of requests.  In his 
design then, you need to pick the correct partition and query into that 
partition.
You mean too many rows, not a row too long, right? I am assuming each request 
will be a different row, not a new column. Is having billions of ROWS something 
that performs poorly in Cassandra? I know Cassandra allows up to 2 billion columns 
for a CF, but I am not aware of a limitation for rows...

I really didn't understand why to use partitions.
Partitions are a way, if you know your rows will go into the trillions, of 
breaking them up so each partition has 100k rows or so, or even 1 million, but 
most likely maxes out in the millions.  Without partitions, you hit a limit in 
the millions.  With partitions, you can keep scaling past that as you can have 
as many partitions as you want.

If I understood it correctly, if I don't specify partitions, Cassandra will 
store all my data in a single node? I thought Cassandra would automatically 
distribute my data among nodes as I insert rows into a CF. Of course if I use 
partitions I understand I could query just one partition (node) to get the 
data, if I have the partition field, but to the best of my knowledge, this is 
not what happens in my case, right? In the first query I would have to query 
all the partitions...
Or are you saying partitions have nothing to do with nodes?? If 99.999% of my 
users will have less than 100k requests, would it make sense to partition by 
user?

A multi-get is a query that finds IN PARALLEL all the rows with the matching 
keys you send to cassandra.  If you do 1000 gets(instead of a multi-get) with 
1ms latency, you will find, it takes 1 second+processing time.  If you do ONE 
multi-get, you only have 1 request and therefore 1ms latency.  The other 
solution is you could send 1000 "async" gets but I have a feeling that would 
be slower with all the marshalling/unmarshalling of the envelope…..really 
depends on the envelope size like if we were using http, you would get killed 
doing 1000 requests instead of 1 with 1000 keys in it.
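
The arithmetic above as a back-of-the-envelope sketch.  It assumes each round 
trip costs a fixed latency and ignores server-side processing time and response 
size; the function names are made up.

```python
def sequential_gets_ms(n_keys, round_trip_ms):
    """n separate gets, one after another: latency is paid n times."""
    return n_keys * round_trip_ms


def single_multiget_ms(round_trip_ms):
    """One multi-get carrying all the keys: latency is paid once."""
    return round_trip_ms


print(sequential_gets_ms(1000, 1))  # 1000 ms, i.e. 1 second before processing
print(single_multiget_ms(1))        # 1 ms before processing
```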
That's cool! :D So if I need to query data split in 10 partitions, for 
instance, I can perform the query in parallel by using a multiget, right? Out 
of curiosity, if each get will occur on a different node, I would need to 
connect to each of the nodes? Or would I query 1 node and it would communicate 
to others?


Later,
Dean

From: Marcelo Elias Del Valle <mvall...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Sunday, September 23, 2012 10:23 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Correct model


2012/9/20 aaron morton <aa...@thelastpickle.com>
I would consider:

# User CF
* row_key: user_id
* columns: user properties, key=value

# UserRequests CF
* row_key: <user_id : partition_start> where partition_start is the start of a 
time partition that makes sense in your domain, e.g. partition monthly. 
Generally you want to avoid rows that grow forever; as a rule of thumb, avoid 
rows of more than a few tens of MB.
* columns: two possible approaches:
1) If the requests are immutable and you generally want all of the data, store 
the request in a single column using JSON or similar, with the column name a 
timestamp.
2) Otherwise use a composite column name of <timestamp : request_property> to 
store the request in many columns.
* In either case consider using Reversed comparators so the most recent columns 
are first; see http://thelastpickle.com/2011/10/03/Reverse-Comparators/

# GlobalRequests CF
* row_key: partition_start - time partition as above. It may be easier to use 
the same partition scheme.
* column name: <timestamp : user_id>
* column value: empty
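
A rough in-memory sketch of the UserRequests and GlobalRequests layout above, 
using approach 1 (the whole request as JSON in one column).  Nested dicts stand 
in for rows and columns, the reversed comparator is only emulated by sorting, 
and record_request and latest_columns are hypothetical names.

```python
import json
from collections import defaultdict

# row key -> {column name -> column value}
user_requests = defaultdict(dict)    # row key: "<user_id>:<partition_start>"
global_requests = defaultdict(dict)  # row key: "<partition_start>"


def record_request(user_id, partition, timestamp, request):
    """Write one immutable request to both column families."""
    # UserRequests: column name is the timestamp, value is the JSON request.
    user_requests[f"{user_id}:{partition}"][timestamp] = json.dumps(request)
    # GlobalRequests: column name is <timestamp : user_id>, value is empty.
    global_requests[partition][(timestamp, user_id)] = ""


def latest_columns(row, n):
    """Emulate a reversed comparator: newest column names come first."""
    return sorted(row.items(), key=lambda kv: kv[0], reverse=True)[:n]


record_request("u1", "2012-09", 100, {"path": "/a"})
record_request("u2", "2012-09", 200, {"path": "/b"})
record_request("u1", "2012-09", 300, {"path": "/c"})
```

Reading the first N columns of a row then returns the most recent requests 
without scanning the older ones.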

Ok, I think I understood your suggestion... But the only advantage in this 
solution is to split data among partitions? I understood how it would work, but 
I didn't understand why it's better than the other solution, without the 
GlobalRequests CF

- Select all the requests for a user
Work out the current partition client side, get the first N columns. Then page.

What do you mean here by current partition? You mean I would perform a query 
for each partition? If I want all the requests for the user, couldn't I just 
select all UserRequest records which start with "userId"? I might be missing 
something here, but in my understanding if I use Hector to query a column 
family I can do that and Cassandra servers will automatically communicate with 
each other to get the data I need, right? Is it bad? I really didn't understand 
why to use partitions.


- Select all the users which have new requests, since date D
Work out the current partition client side, get the first N columns from 
GlobalRequests, make a multi get call to UserRequests
NOTE: Assuming the size of the global requests space is not huge.
Hope that helps.
 For sure it is helping a lot. However, I don't know what a multiget is... I 
saw the Hector API reference and found this method, but I am not sure what 
Cassandra would do internally if I do a multiget... Is this expensive in terms 
of performance and latency?

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr



