Dean,

In the playOrm data model, if I understood it correctly, every CF has its own id, right? For instance, User would have its own id, Activities would have its own id, etc. What if I have a trillion activities? Wouldn't it be a problem to have one row id for each activity? Cassandra always indexes by row id, right? If I have too many row ids without using composite keys, will it scale the same way? Wouldn't each insert of an activity take longer and longer because I have so many activities?
Best regards,
Marcelo Valle.

2012/9/25 Hiller, Dean <dean.hil...@nrel.gov>

> If you need anything added/fixed, just let PlayOrm know. PlayOrm has been able to add things quickly so far. That may change as more and more requests come in, but so far PlayOrm seems to have managed to keep up.
>
> We are already using it live, by the way. It has worked out very well for us so far (we have 5000 column families, obviously created dynamically instead of by hand, a very interesting use case of Cassandra). In our live environment we configured Astyanax with LocalQuorum on reads AND writes, so CP style: we can afford one node out of 3 going down, but if two go down it stops working. THOUGH there is a patch in Astyanax to auto-switch from LocalQuorum to ONE for reads/writes when two nodes go down, which we would like to pull in eventually so it is always live (I don't think Hector has that, and it is a really NICE feature, i.e. fail the LocalQuorum read/write and then try again with a consistency level of ONE).
>
> Later,
> Dean
>
> From: Marcelo Elias Del Valle <mvall...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, September 24, 2012 1:54 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
>
> Dean, this sounds like magic :D
> I don't know the details of the performance of the index implementations you chose, but it would pay off to use it in my case, as I don't need the best performance in the world when reading, but I need to ensure scalability and have a simple model to maintain. I liked the playOrm concept regarding this.
> I have more questions, but I will ask them on Stack Overflow from now on.
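The retry behaviour Dean describes above (try at LocalQuorum, then fall back to ONE) can be sketched in a few lines of Java. This is only an illustration of the idea, not the Astyanax patch: the class QuorumFallback, the Level enum and the withFallback method are hypothetical names, and the operation is passed in as a plain function rather than a real client call.

```java
import java.util.function.Function;

public class QuorumFallback {
    // Hypothetical stand-in for a client's consistency levels.
    public enum Level { LOCAL_QUORUM, ONE }

    // Run the operation at LOCAL_QUORUM first; if that fails, retry once at ONE.
    // Assumption: a failed quorum surfaces as a RuntimeException.
    public static <R> R withFallback(Function<Level, R> operation) {
        try {
            return operation.apply(Level.LOCAL_QUORUM);
        } catch (RuntimeException quorumFailure) {
            return operation.apply(Level.ONE);
        }
    }

    public static void main(String[] args) {
        // Simulate two of three replicas being down: quorum reads fail.
        String result = withFallback(level -> {
            if (level == Level.LOCAL_QUORUM) {
                throw new IllegalStateException("not enough replicas for quorum");
            }
            return "read served at " + level;
        });
        System.out.println(result);
    }
}
```

Note the trade-off: a read or write that only succeeds at ONE stays available but no longer has the quorum guarantee, so this fits cases where staying live matters more than strict consistency.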
>
> 2012/9/24 Hiller, Dean <dean.hil...@nrel.gov>
> PlayOrm will automatically create a CF to index my CF?
>
> It creates 3 CFs for all indices (IntegerIndice, DecimalIndice, and StringIndice) so that the ad-hoc tool that is in development can display the indices: it knows the prefix of the composite column name is an Integer, Decimal or String, and it knows the postfix type as well, so it can translate back from bytes to the types and display them properly in a GUI (i.e. on top of SELECT, the ad-hoc tool is adding a way to view the index rows so you can check whether they got corrupted).
>
> Will it auto-manage it, like Cassandra's secondary indexes?
>
> YES
>
> Further detail…
>
> You annotate fields with @NoSqlIndexed, and PlayOrm adds to/removes from the index as you add/modify/remove the entity. A modify removes the old value from the index and inserts the new value into the index.
>
> As an example, PlayOrm stores all long, int, short, and byte values in a type that uses the least amount of space, so IF you have a long OR BigInteger between -128 and 127 it only ends up storing 1 byte in Cassandra (SAVING tons of space!!!). Then, if you are indexing a field of one of those types, PlayOrm creates an IntegerIndice table.
>
> Right now, another guy is working on playorm-server, a web GUI that allows ad-hoc access to all your data, so you can run ad-hoc queries to see data, and instead of showing hex it shows the real values by translating the bytes to Strings, for the portions of the schema that it is aware of, that is.
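The space-saving integer storage mentioned above can be illustrated with the JDK alone. This is a sketch under an assumption, not PlayOrm's actual converter: it assumes the stored form is the shortest two's-complement encoding, which is what BigInteger.toByteArray() returns. The class name CompactInt and its methods are hypothetical.

```java
import java.math.BigInteger;

public class CompactInt {
    // Encode a value in the fewest two's-complement bytes.
    // Assumption: this mirrors the "least amount of space" storage described above.
    public static byte[] encode(long value) {
        return BigInteger.valueOf(value).toByteArray();
    }

    // Decode the bytes back into a long; throws if the value does not fit.
    public static long decode(byte[] bytes) {
        return new BigInteger(bytes).longValueExact();
    }

    public static void main(String[] args) {
        long[] samples = {0L, 127L, -128L, 128L, 1_000_000L, Long.MAX_VALUE};
        for (long v : samples) {
            byte[] b = encode(v);
            System.out.println(v + " -> " + b.length + " byte(s), round trip = " + decode(b));
        }
    }
}
```

With this encoding, 127 and -128 each take a single byte, 128 needs two, and only values near the limits of a long use all eight.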
>
> Later,
> Dean
>
> From: Marcelo Elias Del Valle <mvall...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, September 24, 2012 12:09 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
>
> Dean,
>
> There is one last thing I would like to ask about playOrm on this list; the next questions will go to Stack Overflow. Because of the context, I prefer asking this here:
> When you say playOrm indexes a table (which would be a CF behind the scenes), what do you mean? Will PlayOrm automatically create a CF to index my CF? Will it auto-manage it, like Cassandra's secondary indexes?
> In Cassandra, the application is responsible for maintaining the index, right? I might be wrong, but unless I am using secondary indexes I need to update index values manually, right?
> I got confused when you said "PlayOrm indexes the columns you choose". How do I choose, and what exactly does it mean?
>
> Best regards,
> Marcelo Valle.
>
> 2012/9/24 Hiller, Dean <dean.hil...@nrel.gov>
> Oh, ok, you were talking about the wide row pattern, right?
>
> yes
>
> But playORM is compatible with Aaron's model, isn't it?
>
> Not yet. PlayOrm supports partitioning one table multiple ways, as it indexes the columns (in your case, the userid FK column and the time column).
>
> Can I map exactly this using playORM?
>
> Not yet, but the plan is to map these typical Cassandra scenarios as well.
>
> Can I ask playOrm questions in this list?
>
> The best place to ask PlayOrm questions is on Stack Overflow, tagged with PlayOrm, though I monitor both this list and Stack Overflow for questions (there are already a few questions on Stack Overflow).
>
> The examples directory is empty for now, I would like to see how to set up the connection with it.
>
> Running build or build.bat is always kept working, and all 62 tests pass (or we don't merge to master), so to see how to make a connection or run an example:
>
> 1. Run build.bat or build, which generates the parsing code.
> 2. Import into Eclipse (it already has the .classpath and .project files there for you).
> 3. In FactorySingleton.java you can change IN_MEMORY to CASSANDRA (or not) and run any of the tests in memory or against localhost (we also run the test suite against a 6-node cluster and everything passes).
> 4. FactorySingleton probably has the code you are looking for, plus you need a class called nosql.Persistence or it won't scan your jar file (a class file, not an XML file like JPA).
>
> Do you mean I need to load all the keys in memory to do a multi get?
>
> No, you batch. I am not sure about CQL, but PlayOrm returns a Cursor, not the results, so you can loop through every key while behind the scenes it does batch requests: you can load up 100 keys and make one multiget request for those 100 keys, then load up the next 100 keys, and so on. I need to look more into the APIs and protocol of CQL to see if it allows this style of batching. PlayOrm does support this style of batching today. Aaron would know if CQL does.
>
> Why did you move?
> Hector is being considered for being the "official" client for Cassandra, isn't it?
>
> At the time, I wanted the file streaming feature. Also, Hector seemed a bit cumbersome compared to Astyanax, at least if you were building a platform and had no use for typing the columns. Just personal preference here, really.
>
> I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?
>
> PlayOrm indexes the columns you choose (i.e. the ones you want to use in the where clause) and partitions by the columns you choose, not based on the key, so in PlayOrm the key is typically a TimeUUID or something cluster-unique. Any tables referencing that TimeUUID never have to change. With Cassandra partitioning, if you repartition that table a different way or go for some kind of major change (usually done with map/reduce), all your foreign keys "may" have to change. It really depends on the situation, though. Maybe you get the design right and never have to change.
>
> @NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
> "INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),
>
> What would happen behind the scenes when I execute this query?
>
> In this case, t or TABLE is a partitioned table, since a partition is defined. And t.activityTypeInfo refers to the ActivityTypeInfo table, which is not partitioned (AND ActivityTypeInfo won't scale to billions of rows because there is no partitioning, but maybe you don't need it!!!). Behind the scenes, when you call getResult, it returns a cursor that has NOT done anything yet.
> When you start looping through the cursor, behind the scenes it batches requests, asking for the next 500 matches (configurable), so you never run out of memory. It is EXACTLY like a database cursor. You can even use the cursor to show a user the first set of results and, when the user clicks next, pick up right where the cursor left off (if you saved it to the HttpSession).
>
> You can only use joins with partition keys, right?
>
> Nope, joins work on anything. You only need to specify the partitionId when you have a partitioned table in the list of join tables. (That is what the PARTITIONS clause is for, to identify partitionId = what?) It was put BEFORE the SQL instead of within it. CQL took the opposite approach, but PlayOrm can also join different partitions together as well ;)
>
> In this case, is partId the row id of TABLE CF?
>
> Nope, partId is one of the columns. There is a test case on this class in PlayOrm (notice the annotation NoSqlPartitionByThisField on the column/field in the entity):
>
> https://github.com/deanhiller/playorm/blob/master/input/javasrc/com/alvazan/test/db/PartitionedSingleTrade.java
>
> PlayOrm allows partitioned tables AND non-partitioned tables (non-partitioned tables won't scale, but maybe you will never have that many rows). You can join any combination of the two (non-partitioned with partitioned, non-partitioned with non-partitioned, partition with another partition).
>
> I only prefer Stack Overflow because I like referencing links/questions by their URLs. Referencing this email later on is very hard, as I have to find it, so in general I HATE email lists ;) but it seems Cassandra prefers them. Any questions on PlayOrm you can put there; I am not sure how many people on this list are interested, so it creates less noise on this list too.
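The cursor behaviour described earlier in this reply (nothing is fetched until you iterate, then rows arrive in configurable batches) can be sketched as a plain Java iterator. This is a minimal illustration, not PlayOrm's Cursor API: the class BatchingCursor, the (offset, limit) page fetcher and the overList helper are all hypothetical, and an in-memory list stands in for the index rows.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

public class BatchingCursor<T> implements Iterator<T> {
    private final BiFunction<Integer, Integer, List<T>> fetchPage; // (offset, limit) -> rows
    private final int batchSize;
    private List<T> page = new ArrayList<>();
    private int posInPage = 0;
    private int offset = 0;
    private boolean exhausted = false;
    private int fetchCount = 0;

    public BatchingCursor(BiFunction<Integer, Integer, List<T>> fetchPage, int batchSize) {
        this.fetchPage = fetchPage;
        this.batchSize = batchSize;
    }

    // Hypothetical helper: a cursor over an in-memory list, standing in for a remote store.
    public static <T> BatchingCursor<T> overList(List<T> rows, int batchSize) {
        return new BatchingCursor<>((off, lim) -> {
            int from = Math.min(off, rows.size());
            int to = Math.min(off + lim, rows.size());
            return new ArrayList<>(rows.subList(from, to));
        }, batchSize);
    }

    @Override
    public boolean hasNext() {
        if (posInPage < page.size()) return true;
        if (exhausted) return false;
        // Only now is the next batch requested; at most batchSize rows are held at a time.
        page = fetchPage.apply(offset, batchSize);
        fetchCount++;
        offset += page.size();
        posInPage = 0;
        if (page.size() < batchSize) exhausted = true; // a short page means no more data
        return !page.isEmpty();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return page.get(posInPage++);
    }

    public int fetchCount() { return fetchCount; }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1200; i++) rows.add(i);
        BatchingCursor<Integer> cursor = overList(rows, 500);
        int seen = 0;
        while (cursor.hasNext()) { cursor.next(); seen++; }
        System.out.println(seen + " rows read in " + cursor.fetchCount() + " batches");
    }
}
```

Because the cursor keeps only its offset and the current page, it could in principle be stored between requests and resumed later, which is the "pick up where the user left off" use mentioned above.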
>
> Later,
> Dean
>
> From: Marcelo Elias Del Valle <mvall...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Monday, September 24, 2012 11:07 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Correct model
>
> 2012/9/24 Hiller, Dean <dean.hil...@nrel.gov>
> I am confused. In this email you say you want "get all requests for a user" and in a previous one you said "Select all the users which has new requests, since date D", so let me answer both…
>
> I have both needs.
> These are the two queries I need to perform on the model.
>
> For the latter, you make ONE query into the latest partition (ONE partition) of the GlobalRequestsCF, which gives you the most recent requests ALONG with the user ids of those requests. If you queried all partitions, you would most likely blow out your JVM memory.
>
> For the former, you make ONE query to the UserRequestsCF with userid = <your user id> to get all the requests for that user.
>
> Now I think I got the main idea! This answered a lot!
>
> Sorry, I was skipping some context. A lot of the backing indexing is sometimes done as a long row, so in playOrm too many rows in a partition means too many columns in the indexing row for that partition. I believe the same is true in Cassandra for its indexing.
>
> Oh, ok, you were talking about the wide row pattern, right? But playORM is compatible with Aaron's model, isn't it? Can I map exactly this using playORM? The hardest thing for me in using playORM right now is that I don't know Cassandra well yet, and I know playORM even less. Can I ask playOrm questions in this list? I will try to create a POC here!
> Only now am I starting to understand what it does ;-) The examples directory is empty for now, I would like to see how to set up the connection with it.
>
> Cassandra spreads all your data out on all nodes with or without partitions. A single partition does have its data co-located, though.
>
> Now I see. The main advantage of using partitions is keeping the indexes small enough. It has nothing to do with the nodes. Thanks!
>
> If you are at 100k (and the requests are rather small), you could embed all the requests in the user or go with Aaron's suggestion below of a UserRequestsCF. If your requests are rather large, you probably don't want to embed them in the User. Either way, it's one query or one row key lookup.
>
> I see it now.
>
> Multiget ignores partitions…you feed it a LIST of keys and it gets them.
> It just so happens that partitionId had to be part of your row key.
>
> Do you mean I need to load all the keys in memory to do a multiget?
>
> I have used Hector and now use Astyanax. I don't worry much about that layer, but I feed Astyanax 3 nodes and I believe it discovers some of the other ones. I believe the latter is true but am not 100% sure, as I have not looked at that code.
>
> Why did you move? Hector is being considered for being the "official" client for Cassandra, isn't it? I looked at the Astyanax API and it seemed much more high level, though.
>
> As an analogy on the above, if you happened to use PlayOrm, you would ONLY need one Requests table, partitioned by user AND time (two views into the same data, partitioned two different ways), and you can do exactly the same thing as Aaron's example. PlayOrm doesn't embed the partition ids in the key, leaving it free to partition twice as in your case. With the partition id in the key, a refactor has to map/reduce A LOT more rows, because other rows hold the FK of <partitionid><subrowkey>, whereas if you don't have the partition id in the key, you only map/reduce the partitioned table in a redesign/refactor. That said, we will be adding support for CQL partitioning in addition to PlayOrm partitioning, even though it can be a little less flexible sometimes.
>
> I am not sure I understood this part. If I need to refactor, having the partition id in the key would be a bad thing? What would be the alternative? In my case, as I use userId : partitionId as row key, this might be a problem, right?
>
> Also, CQL locates all the data for a partition on one node. We have found that, with the parallelized disks, it can "sometimes" be faster when the partitions are NOT all on one node, so PlayOrm partitions are virtual only and do not relate to where the rows are stored.
> An example on our 6 nodes: a join query on a partition with 1,000,000 rows took 60ms (of course I can't compare to CQL here, since it doesn't do joins). It really depends on how much data is going to come back in the query, too. There are tradeoffs between parallel disks across nodes and having your data all on one node, of course.
>
> I guess I am still not ready for this level of info. :D
> In the playORM readme, we have the following:
>
> @NoSqlQuery(name="findWithJoinQuery", query="PARTITIONS t(:partId) SELECT t FROM TABLE as t "+
> "INNER JOIN t.activityTypeInfo as i WHERE i.type = :type and t.numShares < :shares"),
>
> What would happen behind the scenes when I execute this query? You can only use joins with partition keys, right?
> In this case, is partId the row id of TABLE CF?
>
> Thanks a lot for the answers
>
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr

--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr