Hi Shao-Chuan,

I understand everything you said above except for how we can estimate the number of rows using the index interval. I understand that the index interval is a setting that controls how often samples from an SSTable's partition index are kept in memory, correct? I was under the impression that this is a property set in cassandra.yaml and would not change as we add rows to or delete rows from a table.
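For context, my rough mental model of the estimate is below. This is a hedged sketch with illustrative names (RowEstimateSketch, estimatePartitions), not the actual server code: since the in-memory index summary keeps roughly one entry per index_interval partition-index entries, an SSTable's partition count can be back-estimated from the summary size.

```java
// Hedged sketch of the index-interval row estimate, under the assumption
// that the index summary samples one of every index_interval index entries.
// Names are illustrative, not Cassandra's actual classes.
public class RowEstimateSketch {

    /** Rough partition-count estimate for one SSTable. */
    static long estimatePartitions(long indexSummaryEntries, int indexInterval) {
        // summary entries * sampling interval ~= total indexed partitions
        return indexSummaryEntries * (long) indexInterval;
    }

    public static void main(String[] args) {
        // e.g. 1,000 summary entries with the default interval of 128
        System.out.println(estimatePartitions(1000, 128)); // 128000
    }
}
```

If that model is right, the interval itself is static, but the number of summary entries grows and shrinks as SSTables are flushed and compacted, which would be how the estimate tracks the data.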
BTW, please let me know if this conversation belongs on the users list. I don't want to spam the dev list, but this seems like something that is kind of on the border between use and development. :)

Best regards,
Clint

On Mon, Mar 24, 2014 at 4:13 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:

> Tyler mentioned that client.describe_ring(myKeyspace); can be replaced by
> a query of the system.peers table, which has the ring information. The
> challenge here is describe_splits_ex, which needs to estimate the number
> of rows in each sub token range (as you mentioned).
>
> From what I understand, and from trial and error so far, I don't think
> the DataStax Java driver can do describe_splits_ex via a simple API call.
> If you look at the implementation of CassandraServer.describe_splits_ex()
> and StorageService.instance.getSplits(), what they do is split a token
> range into several sub token ranges, with an estimated row count for each
> sub token range. Inside the StorageService.instance.getSplits() call, the
> split count is also adjusted based on an estimated row count.
> StorageService.instance.getSplits() is only publicly exported via thrift,
> and it would be non-trivial to re-build the same logic that lives inside
> it.
>
> That said, it looks like we could implement the splits logic in
> AbstractColumnFamilyInputFormat.getSubSplits by querying
> system.schema_columnfamilies and using CFMetaData.fromSchema to construct
> a CFMetaData. CFMetaData has the indexInterval, which can be used to
> estimate the row count; the next step is to mimic the logic in
> StorageService.instance.getSplits() to divide a token range into several
> sub token ranges, using the TokenFactory (obtained from the partitioner)
> to construct the sub token ranges in
> AbstractColumnFamilyInputFormat.getSubSplits. Basically, this moves the
> splitting code from the server side to the client side.
>
> Any thoughts?
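The range-division part of that plan could be mimicked client-side roughly as follows. This is a hedged sketch (the class and method names are mine, not Cassandra's) showing only the token arithmetic for a Murmur3-style signed 64-bit token space; BigInteger avoids overflow across the full range.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: divide a token range into N evenly sized sub token ranges,
// in the spirit of (but not identical to) StorageService.instance.getSplits().
public class TokenSplitSketch {

    /** Returns splitCount [start, end] pairs covering [startToken, endToken]. */
    static List<long[]> split(long startToken, long endToken, int splitCount) {
        BigInteger start = BigInteger.valueOf(startToken);
        BigInteger width = BigInteger.valueOf(endToken).subtract(start);
        List<long[]> splits = new ArrayList<>();
        BigInteger left = start;
        for (int i = 1; i <= splitCount; i++) {
            // i-th boundary: start + width * i / splitCount
            BigInteger right = start.add(
                width.multiply(BigInteger.valueOf(i))
                     .divide(BigInteger.valueOf(splitCount)));
            splits.add(new long[] { left.longValueExact(), right.longValueExact() });
            left = right;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Split the full Murmur3 token space into 4 sub ranges.
        for (long[] s : split(Long.MIN_VALUE, Long.MAX_VALUE, 4)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

The real getSplits() also adjusts the split count from the row estimate, which this sketch deliberately leaves out.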
>
> Shao-Chuan
>
>
> On Mon, Mar 24, 2014 at 11:54 AM, Clint Kelly <clint.ke...@gmail.com> wrote:
>
> > I just saw this question about thrift in the Hadoop / Cassandra
> > integration in the discussion on the user list about freezing thrift.
> > I have been working on a project to integrate Hadoop 2 and Cassandra 2
> > and have been trying to move all of the way over to the Java driver and
> > away from thrift.
> >
> > I have finished most of the driver. It is still pretty rough, but I
> > have been using it for testing a prototype of the Kiji platform
> > (www.kiji.org) that uses Cassandra instead of HBase.
> >
> > One thing I have not been able to figure out is how to calculate input
> > splits without thrift. I am currently doing the following:
> >
> >     map = client.describe_ring(myKeyspace);
> >
> > (where client is of type Cassandra.Client).
> >
> > This call returns a list of token ranges (max and min token values) for
> > different nodes in the cluster. We then use this information, along
> > with another thrift call,
> >
> >     client.describe_splits_ex(cfName, range.start_token, range.end_token, splitSize);
> >
> > to estimate the number of rows in each token range, etc.
> >
> > I have looked all over the Java driver documentation and pinged the
> > user list, and have not gotten any proposals that work for the Java
> > driver. Does anyone here have any suggestions?
> >
> > Thanks!
> >
> > Best regards,
> > Clint
> >
> >
> > On Tue, Mar 11, 2014 at 12:41 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
> >
> > > Hi,
> > >
> > > I just received this email from Jonathan regarding the deprecation of
> > > thrift in 2.1 on the dev mailing list.
> > >
> > > In fact, we migrated from the thrift client to the native one several
> > > months ago; however, in Cassandra.hadoop there are still a lot of
> > > dependencies on the thrift interface, for example describe_splits_ex
> > > in org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.
> > >
> > > Therefore, we had to keep both thrift and native in our server, but
> > > the CRUD queries mainly go through the native protocol. However,
> > > Jonathan says "*I don't know of any use cases for Thrift that can't
> > > be done in CQL*". This statement makes me wonder whether there is
> > > something I don't know about the native protocol yet.
> > >
> > > So, does anyone know how to do "describing the splits" and
> > > "describing the local rings" using the native protocol?
> > >
> > > Also, cqlsh uses a python client, which talks via the thrift protocol
> > > too. Does that mean it will be migrated to the native protocol soon
> > > as well?
> > >
> > > Comments, pointers, and suggestions are much appreciated.
> > >
> > > Many thanks,
> > >
> > > Shao-Chuan
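To round out the thread: once a per-range row estimate exists, turning it into a describe_splits_ex-style split count is just a ceiling division by the requested split size. A hedged sketch with an illustrative name (SplitCountSketch), not the server's actual logic:

```java
// Hedged sketch: pick a split count so each split holds roughly splitSize
// rows, the way callers of describe_splits_ex use the splitSize argument.
public class SplitCountSketch {

    static int splitCount(long estimatedRows, int splitSize) {
        if (estimatedRows <= 0) {
            return 1; // always produce at least one split
        }
        // ceiling division: ceil(estimatedRows / splitSize)
        return (int) ((estimatedRows + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        // ~1M estimated rows, targeting ~64k rows per split
        System.out.println(splitCount(1_000_000, 64_000)); // 16
    }
}
```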