Hi Shao-Chuan,

I understand everything you said above except for how we can estimate the number of rows using the index interval. I understand that the index interval is a setting that controls how often samples from an SSTable's partition index are kept in memory, correct? I was under the impression that this is a property set in cassandra.yaml and would not change as we add rows to or delete rows from a table.
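For context, my rough mental model of the estimate is below. This is a hedged sketch with illustrative names (RowEstimateSketch, estimatePartitions), not the actual server code: since the in-memory index summary keeps roughly one entry per index_interval partition-index entries, an SSTable's partition count can be back-estimated from the summary size.

```java
// Hedged sketch of the index-interval row estimate, under the assumption
// that the index summary samples one of every index_interval index entries.
// Names are illustrative, not Cassandra's actual classes.
public class RowEstimateSketch {

    /** Rough partition-count estimate for one SSTable. */
    static long estimatePartitions(long indexSummaryEntries, int indexInterval) {
        // summary entries * sampling interval ~= total indexed partitions
        return indexSummaryEntries * (long) indexInterval;
    }

    public static void main(String[] args) {
        // e.g. 1,000 summary entries with the default interval of 128
        System.out.println(estimatePartitions(1000, 128)); // 128000
    }
}
```

If that model is right, the interval itself is static, but the number of summary entries grows and shrinks as SSTables are flushed and compacted, which would be how the estimate tracks the data.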
BTW, please let me know if this conversation belongs on the users list. I don't want to spam the dev list, but this seems like something that is kind of on the border between use and development. :)

Best regards,
Clint

On Mon, Mar 24, 2014 at 4:13 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:

> Tyler mentioned that client.describe_ring(myKeyspace); can be replaced by
> a query of the system.peers table, which has the ring information. The
> challenge here is describe_splits_ex, which needs to estimate the number
> of rows in each sub token range (as you mentioned).
>
> From what I understand, and from trial and error so far, I don't think
> the DataStax Java driver can do describe_splits_ex via a simple API call.
> If you look at the implementation of CassandraServer.describe_splits_ex()
> and StorageService.instance.getSplits(), what they do is split a token
> range into several sub token ranges, with an estimated row count for each
> sub token range. Inside the StorageService.instance.getSplits() call, the
> split count is also adjusted based on an estimated row count.
> StorageService.instance.getSplits() is only publicly exported via thrift,
> and it would be non-trivial to re-build the same logic that lives inside
> it.
>
> That said, it looks like we could implement the splits logic in
> AbstractColumnFamilyInputFormat.getSubSplits by querying
> system.schema_columnfamilies and using CFMetaData.fromSchema to construct
> a CFMetaData. CFMetaData has the indexInterval, which can be used to
> estimate the row count; the next step is to mimic the logic in
> StorageService.instance.getSplits() to divide a token range into several
> sub token ranges, using the TokenFactory (obtained from the partitioner)
> to construct the sub token ranges in
> AbstractColumnFamilyInputFormat.getSubSplits. Basically, this moves the
> splitting code from the server side to the client side.
>
> Any thoughts?
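The range-division part of that plan could be mimicked client-side roughly as follows. This is a hedged sketch (the class and method names are mine, not Cassandra's) showing only the token arithmetic for a Murmur3-style signed 64-bit token space; BigInteger avoids overflow across the full range.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: divide a token range into N evenly sized sub token ranges,
// in the spirit of (but not identical to) StorageService.instance.getSplits().
public class TokenSplitSketch {

    /** Returns splitCount [start, end] pairs covering [startToken, endToken]. */
    static List<long[]> split(long startToken, long endToken, int splitCount) {
        BigInteger start = BigInteger.valueOf(startToken);
        BigInteger width = BigInteger.valueOf(endToken).subtract(start);
        List<long[]> splits = new ArrayList<>();
        BigInteger left = start;
        for (int i = 1; i <= splitCount; i++) {
            // i-th boundary: start + width * i / splitCount
            BigInteger right = start.add(
                width.multiply(BigInteger.valueOf(i))
                     .divide(BigInteger.valueOf(splitCount)));
            splits.add(new long[] { left.longValueExact(), right.longValueExact() });
            left = right;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Split the full Murmur3 token space into 4 sub ranges.
        for (long[] s : split(Long.MIN_VALUE, Long.MAX_VALUE, 4)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

The real getSplits() also adjusts the split count from the row estimate, which this sketch deliberately leaves out.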
>
> Shao-Chuan
>
>
> On Mon, Mar 24, 2014 at 11:54 AM, Clint Kelly <clint.ke...@gmail.com> wrote:
>
> > I just saw this question about thrift in the Hadoop / Cassandra
> > integration in the discussion on the user list about freezing thrift.
> > I have been working on a project to integrate Hadoop 2 and Cassandra 2
> > and have been trying to move all of the way over to the Java driver and
> > away from thrift.
> >
> > I have finished most of the driver. It is still pretty rough, but I
> > have been using it for testing a prototype of the Kiji platform
> > (www.kiji.org) that uses Cassandra instead of HBase.
> >
> > One thing I have not been able to figure out is how to calculate input
> > splits without thrift. I am currently doing the following:
> >
> >     map = client.describe_ring(myKeyspace);
> >
> > (where client is of type Cassandra.Client).
> >
> > This call returns a list of token ranges (max and min token values) for
> > different nodes in the cluster. We then use this information, along
> > with another thrift call,
> >
> >     client.describe_splits_ex(cfName, range.start_token, range.end_token, splitSize);
> >
> > to estimate the number of rows in each token range, etc.
> >
> > I have looked all over the Java driver documentation and pinged the
> > user list, and have not gotten any proposals that work for the Java
> > driver. Does anyone here have any suggestions?
> >
> > Thanks!
> >
> > Best regards,
> > Clint
> >
> >
> > On Tue, Mar 11, 2014 at 12:41 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
> >
> > > Hi,
> > >
> > > I just received this email from Jonathan regarding the deprecation of
> > > thrift in 2.1 on the dev mailing list.
> > >
> > > In fact, we migrated from the thrift client to the native one several
> > > months ago; however, in Cassandra.hadoop there are still a lot of
> > > dependencies on the thrift interface, for example describe_splits_ex
> > > in org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.
> > >
> > > Therefore, we had to keep both thrift and native in our server, but
> > > the CRUD queries mainly go through the native protocol. However,
> > > Jonathan says "*I don't know of any use cases for Thrift that can't
> > > be done in CQL*". This statement makes me wonder whether there is
> > > something I don't know about the native protocol yet.
> > >
> > > So, does anyone know how to do "describing the splits" and
> > > "describing the local rings" using the native protocol?
> > >
> > > Also, cqlsh uses a python client, which talks via the thrift protocol
> > > too. Does that mean it will be migrated to the native protocol soon
> > > as well?
> > >
> > > Comments, pointers, and suggestions are much appreciated.
> > >
> > > Many thanks,
> > >
> > > Shao-Chuan
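To round out the thread: once a per-range row estimate exists, turning it into a describe_splits_ex-style split count is just a ceiling division by the requested split size. A hedged sketch with an illustrative name (SplitCountSketch), not the server's actual logic:

```java
// Hedged sketch: pick a split count so each split holds roughly splitSize
// rows, the way callers of describe_splits_ex use the splitSize argument.
public class SplitCountSketch {

    static int splitCount(long estimatedRows, int splitSize) {
        if (estimatedRows <= 0) {
            return 1; // always produce at least one split
        }
        // ceiling division: ceil(estimatedRows / splitSize)
        return (int) ((estimatedRows + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        // ~1M estimated rows, targeting ~64k rows per split
        System.out.println(splitCount(1_000_000, 64_000)); // 16
    }
}
```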