Hi all,

FWIW, the HBase Hadoop InputFormat does not even do this kind of estimation of data density over various ranges; it just creates one split for every region between the start and stop keys of the scan. I'll probably just do something similar by combining token ranges for virtual nodes that share hosts and creating input splits that way. I think the previous approach I had taken was overengineering this somewhat.
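Roughly, I'm picturing something like the sketch below (just an illustration of the grouping, with made-up types like VnodeRange rather than the actual driver or Hadoop API):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical holder for one vnode's token range and its replica hosts.
    class VnodeRange {
        final String startToken;
        final String endToken;
        final Set<String> replicaHosts;

        VnodeRange(String startToken, String endToken, Set<String> replicaHosts) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.replicaHosts = replicaHosts;
        }
    }

    class SplitGrouper {
        // Group vnode ranges that are replicated on the same set of hosts;
        // each group would then become a single Hadoop input split whose
        // preferred locations are that replica set. Relies on Set equality
        // (HashSet implements hashCode/equals) for the map key.
        static Map<Set<String>, List<VnodeRange>> groupByReplicaSet(List<VnodeRange> ranges) {
            Map<Set<String>, List<VnodeRange>> groups =
                new HashMap<Set<String>, List<VnodeRange>>();
            for (VnodeRange range : ranges) {
                List<VnodeRange> group = groups.get(range.replicaHosts);
                if (group == null) {
                    group = new ArrayList<VnodeRange>();
                    groups.put(range.replicaHosts, group);
                }
                group.add(range);
            }
            return groups;
        }
    }

The per-vnode ranges and their replica hosts would come from wherever the ring information is available (e.g. the system.peers approach mentioned below).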
Best regards,
Clint

On Tue, Apr 1, 2014 at 2:08 PM, Aleksey Yeschenko <alek...@yeschenko.com> wrote:

> This doesn't belong to CQL-the-language.
>
> However, this could be implemented as a virtual system column family -
> sooner or later we'd need something like this anyway. Then you'd just
> run SELECTs against it as if it were a regular column family.
>
> --
> AY
>
> On Wednesday, April 2, 2014 at 00:03 AM, Tyler Hobbs wrote:
>
> > Split calculation can't be done client-side because it requires key
> > sampling (which requires reading the index summary). This would have
> > to be added to CQL.
> >
> > Since I can't see any alternatives and this is required for good
> > Hadoop support, would you mind opening a ticket to add support for
> > this?
> >
> > On Sun, Mar 30, 2014 at 8:31 PM, Clint Kelly <clint.ke...@gmail.com> wrote:
> >
> > > Hi Shao-Chuan,
> > >
> > > I understand everything you said above except for how we can
> > > estimate the number of rows using the index interval. I understand
> > > that the index interval is a setting that controls how often
> > > samples from an SSTable index are stored in memory, correct? I was
> > > under the impression that this is a property set in
> > > configuration.yaml and would not change as we add rows to or delete
> > > rows from a table.
> > >
> > > BTW please let me know if this conversation belongs on the users
> > > list. I don't want to spam the dev list, but this seems like
> > > something that is kind of on the border between use and
> > > development. :)
> > >
> > > Best regards,
> > > Clint
> > >
> > > On Mon, Mar 24, 2014 at 4:13 PM, Shao-Chuan Wang
> > > <shaochuan.w...@bloomreach.com> wrote:
> > >
> > > > Tyler mentioned that client.describe_ring(myKeyspace); can be
> > > > replaced by a query of the system.peers table, which has the ring
> > > > information. The challenge here is describe_splits_ex, which
> > > > needs to estimate the number of rows in each sub token range (as
> > > > you mentioned).
> > > >
> > > > From what I understand, and from trial and error so far, I don't
> > > > think the DataStax Java driver is able to do describe_splits_ex
> > > > via a simple API call. If you look at the implementation of
> > > > CassandraServer.describe_splits_ex() and
> > > > StorageService.instance.getSplits(), what it does is split a
> > > > token range into several sub token ranges, with an estimated row
> > > > count in each sub token range. Inside the
> > > > StorageService.instance.getSplits() call, it is adjusting the
> > > > split count based on an estimated row count, too.
> > > > StorageService.instance.getSplits() is only publicly exported by
> > > > Thrift. It would be non-trivial to re-build the same logic that
> > > > is inside StorageService.instance.getSplits().
> > > >
> > > > That said, it looks like we could implement the splits logic in
> > > > AbstractColumnFamilyInputFormat.getSubSplits by querying
> > > > system.schema_columnfamilies and using CFMetaData.fromSchema to
> > > > construct a CFMetaData. CFMetaData has the indexInterval, which
> > > > can be used to estimate the row count, and the next thing is to
> > > > mimic the logic in StorageService.instance.getSplits() to divide
> > > > a token range into several sub token ranges and use the
> > > > TokenFactory (obtained from the partitioner) to construct the sub
> > > > token ranges in AbstractColumnFamilyInputFormat.getSubSplits.
> > > > Basically, it is moving the splitting code from the server side
> > > > to the client side.
> > > >
> > > > Any thoughts?
> > > >
> > > > Shao-Chuan
> > > >
> > > > On Mon, Mar 24, 2014 at 11:54 AM, Clint Kelly
> > > > <clint.ke...@gmail.com> wrote:
> > > >
> > > > > I just saw this question about thrift in the Hadoop / Cassandra
> > > > > integration in the discussion on the user list about freezing
> > > > > thrift. I have been working on a project to integrate Hadoop 2
> > > > > and Cassandra 2 and have been trying to move all of the way
> > > > > over to the Java driver and away from thrift.
> > > > >
> > > > > I have finished most of the driver. It is still pretty rough,
> > > > > but I have been using it for testing a prototype of the Kiji
> > > > > platform (www.kiji.org) that uses Cassandra instead of HBase.
> > > > >
> > > > > One thing I have not been able to figure out is how to
> > > > > calculate input splits without thrift. I am currently doing the
> > > > > following:
> > > > >
> > > > >     map = client.describe_ring(myKeyspace);
> > > > >
> > > > > (where client is of type Cassandra.Client).
> > > > >
> > > > > This call returns a list of token ranges (max and min token
> > > > > values) for different nodes in the cluster. We then use this
> > > > > information, along with another thrift call,
> > > > >
> > > > >     client.describe_splits_ex(cfName, range.start_token,
> > > > >         range.end_token, splitSize);
> > > > >
> > > > > to estimate the number of rows in each token range, etc.
> > > > >
> > > > > I have looked all over the Java driver documentation and pinged
> > > > > the user list and have not gotten any proposals that work for
> > > > > the Java driver. Does anyone here have any suggestions?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Best regards,
> > > > > Clint
> > > > >
> > > > > On Tue, Mar 11, 2014 at 12:41 PM, Shao-Chuan Wang
> > > > > <shaochuan.w...@bloomreach.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I just received the email from Jonathan regarding the
> > > > > > deprecation of thrift in 2.1 on the dev mailing list.
> > > > > >
> > > > > > In fact, we migrated from the thrift client to the native one
> > > > > > several months ago; however, in cassandra.hadoop there are
> > > > > > still a lot of dependencies on the thrift interface, for
> > > > > > example describe_splits_ex in
> > > > > > org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.
> > > > > >
> > > > > > Therefore, we had to keep both thrift and native enabled on
> > > > > > our server, but the CRUD queries are mainly through the
> > > > > > native protocol.
> > > > > >
> > > > > > However, Jonathan says "I don't know of any use cases for
> > > > > > Thrift that can't be done in CQL". This statement makes me
> > > > > > wonder whether there is something I don't know about the
> > > > > > native protocol yet.
> > > > > >
> > > > > > So, does anyone know how to do "describing the splits" and
> > > > > > "describing the local rings" using the native protocol?
> > > > > >
> > > > > > Also, cqlsh uses the python client, which talks via the
> > > > > > thrift protocol too. Does that mean it will be migrated to
> > > > > > the native protocol soon as well?
> > > > > >
> > > > > > Comments, pointers, suggestions are much appreciated.
> > > > > >
> > > > > > Many thanks,
> > > > > >
> > > > > > Shao-Chuan
> >
> > --
> > Tyler Hobbs
> > DataStax <http://datastax.com/>
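PS - re: the point above that describe_ring() can be replaced by reading the system tables, this is the kind of thing I had in mind for pulling the raw ring information through the Java driver (rough, untested sketch; class name and contact point are placeholders, and assembling the actual token ranges and their replicas is omitted):

    import java.net.InetAddress;
    import java.util.Set;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class RingInfo {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // Tokens owned by the node we are connected to.
            Row local = session.execute(
                "SELECT tokens FROM system.local WHERE key = 'local'").one();
            Set<String> localTokens = local.getSet("tokens", String.class);
            System.out.println("local: " + localTokens.size() + " tokens");

            // Tokens owned by every other node in the ring.
            ResultSet peers = session.execute("SELECT peer, tokens FROM system.peers");
            for (Row row : peers) {
                InetAddress peer = row.getInet("peer");
                Set<String> tokens = row.getSet("tokens", String.class);
                System.out.println(peer + ": " + tokens.size() + " tokens");
            }

            // Sorting all tokens and pairing adjacent ones gives the vnode
            // ranges; mapping each range back to the hosts that own it (and
            // grouping ranges with the same hosts into splits) is the part
            // that still needs to be written.
            cluster.close();
        }
    }

The row-count estimation that describe_splits_ex provides is the piece this does not cover, which is why I'm leaning toward the simpler one-split-per-range-group approach above.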