Split calculation can't be done client-side because it requires key sampling (which requires reading the index summary). This would have to be added to CQL.

Since I can't see any alternatives and this is required for good Hadoop support, would you mind opening a ticket to add support for this?

On Sun, Mar 30, 2014 at 8:31 PM, Clint Kelly <clint.ke...@gmail.com> wrote:

> Hi Shao-Chuan,
>
> I understand everything you said above except for how we can estimate the number of rows using the index interval. I understand that the index interval is a setting that controls how often samples from an SSTable index are stored in memory, correct? I was under the impression that this is a property set in cassandra.yaml and would not change as we add rows to or delete rows from a table.
>
> BTW, please let me know if this conversation belongs on the users list. I don't want to spam the dev list, but this seems like something that is kind of on the border between use and development. :)
>
> Best regards,
> Clint
>
> On Mon, Mar 24, 2014 at 4:13 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
>
> > Tyler mentioned that client.describe_ring(myKeyspace) can be replaced by a query of the system.peers table, which has the ring information. The challenge here is describe_splits_ex, which needs to estimate the number of rows in each sub token range (as you mentioned).
>
> > From what I understand, and from trial and error so far, I don't think the DataStax Java driver can do describe_splits_ex via a simple API call. If you look at the implementations of CassandraServer.describe_splits_ex() and StorageService.instance.getSplits(), what they do is split a token range into several sub token ranges, with an estimated row count for each sub token range. The StorageService.instance.getSplits() call also adjusts the split count based on an estimated row count. StorageService.instance.getSplits() is only publicly exposed via Thrift, and it would be non-trivial to rebuild the same logic ourselves.
>
> > That said, it looks like we could implement the splits logic in AbstractColumnFamilyInputFormat.getSubSplits by querying system.schema_columnfamilies and using CFMetaData.fromSchema to construct a CFMetaData. The CFMetaData carries the indexInterval, which can be used to estimate the row count; the next step would be to mimic the logic in StorageService.instance.getSplits() to divide a token range into several sub token ranges, using the TokenFactory (obtained from the partitioner) to construct the sub token ranges in AbstractColumnFamilyInputFormat.getSubSplits. Basically, it is moving the splitting code from the server side to the client side.
>
> > Any thoughts?
>
> > Shao-Chuan
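A rough sketch of the ring-discovery half of the idea quoted above: reading each node's tokens out of system.peers and system.local with the DataStax Java driver. This is a minimal illustration, assuming a 2.x-era driver and a local contact point; it covers only ring discovery, not the row-count estimation:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import java.util.Set;

    public class RingDiscovery {
        public static void main(String[] args) {
            // Contact point is a placeholder; point it at any node in the cluster.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            try {
                // Each remote node advertises the tokens it owns in system.peers.
                for (Row row : session.execute("SELECT peer, tokens FROM system.peers")) {
                    Set<String> tokens = row.getSet("tokens", String.class);
                    System.out.println(row.getInet("peer") + " owns " + tokens.size() + " tokens");
                }
                // The node we are connected to reports its own tokens in system.local.
                Row local = session.execute("SELECT tokens FROM system.local").one();
                System.out.println("local node owns "
                        + local.getSet("tokens", String.class).size() + " tokens");
            } finally {
                cluster.close();
            }
        }
    }

Sorting the union of all of these tokens yields the same ring layout that describe_ring reports, minus the row-count estimates.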
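The other half, a simplified stand-in for the splitting arithmetic inside StorageService.instance.getSplits(): evenly dividing one Murmur3Partitioner token range into sub-ranges. The real server-side code sizes splits by estimated row count (roughly the number of index-summary samples in the range times indexInterval) and handles ranges that wrap around the ring; this sketch assumes a non-wrapping range and a fixed split count:

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    public class TokenSplitter {
        // Evenly divide the token range (start, end] into `count` sub-ranges.
        // BigInteger avoids overflow, since Murmur3 tokens span almost the
        // entire range of a 64-bit long.
        public static List<long[]> subdivide(long start, long end, int count) {
            BigInteger lo = BigInteger.valueOf(start);
            BigInteger width = BigInteger.valueOf(end).subtract(lo);
            List<long[]> ranges = new ArrayList<long[]>();
            for (int i = 0; i < count; i++) {
                long s = lo.add(width.multiply(BigInteger.valueOf(i))
                        .divide(BigInteger.valueOf(count))).longValue();
                long e = lo.add(width.multiply(BigInteger.valueOf(i + 1))
                        .divide(BigInteger.valueOf(count))).longValue();
                ranges.add(new long[] { s, e });
            }
            return ranges;
        }

        public static void main(String[] args) {
            // Split the full Murmur3 ring into four equal sub-ranges.
            for (long[] r : subdivide(Long.MIN_VALUE, Long.MAX_VALUE, 4)) {
                System.out.println(r[0] + " .. " + r[1]);
            }
        }
    }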
> > On Mon, Mar 24, 2014 at 11:54 AM, Clint Kelly <clint.ke...@gmail.com> wrote:
>
> > > I just saw this question about Thrift in the Hadoop/Cassandra integration come up in the discussion on the user list about freezing Thrift. I have been working on a project to integrate Hadoop 2 and Cassandra 2 and have been trying to move all of the way over to the Java driver and away from Thrift.
>
> > > I have finished most of the driver. It is still pretty rough, but I have been using it for testing a prototype of the Kiji platform (www.kiji.org) that uses Cassandra instead of HBase.
>
> > > One thing I have not been able to figure out is how to calculate input splits without Thrift. I am currently doing the following:
>
> > >     map = client.describe_ring(myKeyspace);
>
> > > (where client is of type Cassandra.Client). This call returns a list of token ranges (max and min token values) for the different nodes in the cluster. We then use this information, along with another Thrift call,
>
> > >     client.describe_splits_ex(cfName, range.start_token, range.end_token, splitSize);
>
> > > to estimate the number of rows in each token range, etc.
>
> > > I have looked all over the Java driver documentation and pinged the user list, and have not gotten any proposals that work with the Java driver. Does anyone here have any suggestions?
>
> > > Thanks!
>
> > > Best regards,
> > > Clint
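For reference, a condensed sketch of how those two Thrift calls fit together (roughly what AbstractColumnFamilyInputFormat does when computing input splits). The contact point, keyspace, and table names are placeholders:

    import java.util.List;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.CfSplit;
    import org.apache.cassandra.thrift.TokenRange;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;

    public class ThriftSplits {
        public static void main(String[] args) throws Exception {
            TFramedTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
            transport.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            client.set_keyspace("myKeyspace");  // placeholder keyspace

            // One TokenRange per range of the ring, with its replica endpoints.
            for (TokenRange range : client.describe_ring("myKeyspace")) {
                // Ask the server to cut this range into sub-splits of roughly
                // 64k rows each; each CfSplit carries an estimated row count.
                List<CfSplit> splits = client.describe_splits_ex(
                        "myTable", range.start_token, range.end_token, 64 * 1024);
                for (CfSplit split : splits) {
                    System.out.println(split.start_token + " .. " + split.end_token
                            + " (~" + split.row_count + " rows)");
                }
            }
            transport.close();
        }
    }

It is exactly this server-side row-count estimation step that has no equivalent in the native protocol, which is the gap discussed in this thread.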
> > > On Tue, Mar 11, 2014 at 12:41 PM, Shao-Chuan Wang <shaochuan.w...@bloomreach.com> wrote:
>
> > > > Hi,
>
> > > > I just received the email from Jonathan on the dev mailing list regarding the deprecation of Thrift in 2.1.
>
> > > > In fact, we migrated from the Thrift client to the native one several months ago; however, in the cassandra.hadoop package there are still a lot of dependencies on the Thrift interface, for example describe_splits_ex in org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.
>
> > > > Therefore, we have had to keep both Thrift and native enabled on our servers, although the CRUD queries mainly go through the native protocol. However, Jonathan says "I don't know of any use cases for Thrift that can't be done in CQL". This statement makes me wonder whether there is something I don't know about the native protocol yet.
>
> > > > So, does anyone know how to do "describing the splits" and "describing the local rings" using the native protocol?
>
> > > > Also, cqlsh uses the Python client, which talks via the Thrift protocol too. Does that mean it will be migrated to the native protocol soon as well?
>
> > > > Comments, pointers, and suggestions are much appreciated.
>
> > > > Many thanks,
>
> > > > Shao-Chuan

--
Tyler Hobbs
DataStax <http://datastax.com/>