Re: 1000's of column families

Hiller, Dean Tue, 02 Oct 2012 12:26:38 -0700

So you're saying that you can access the primary index with a key range, but to 
access the secondary index, you first need to get all keys and follow up with a 
multiget, which would use the secondary index to speed the lookup of the 
matching rows?


Yes, that is how I "believe" it works.  I am by no means an expert.
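
For anyone following along, here is a toy sketch of that two-step access pattern as I understand it. It is plain Python over in-memory dicts, not the Cassandra or Thrift API, and every name in it (rows, key_range, build_index, multiget) is made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy in-memory model; none of these names come from Cassandra.
rows = {
    "a1": {"state": "CO"},
    "a2": {"state": "NY"},
    "b1": {"state": "CO"},
    "b2": {"state": "TX"},
}
sorted_keys = sorted(rows)


def key_range(start, end):
    """Primary access: keys in [start, end] sit next to each other in sorted order."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


def build_index(column):
    """Secondary index: column value -> row keys, which may be scattered."""
    index = {}
    for key, cols in rows.items():
        index.setdefault(cols[column], []).append(key)
    return index


def multiget(keys):
    """Fetch each row by key individually; the keys need not be contiguous."""
    return {k: rows[k] for k in keys}


state_index = build_index("state")
range_hits = key_range("a1", "a2")        # one contiguous slice of keys
index_hits = multiget(state_index["CO"])  # index lookup, then per-key fetches
```

The point of the sketch is only that the range read touches adjacent keys, while the index read ends in a multiget over keys that can live anywhere.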

I also wanted to fire off an MR job to process matching rows in the "virtual" CF, 
ideally running on the nodes where it reads the data in.  In 0.7, I thought the M/R 
jobs did not run locally with the data like Hadoop does.  Does anyone know if that 
is still true, or does it run locally to the data now?

Thanks,
Dean

From: Ben Hood <0x6e6...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, October 2, 2012 1:01 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: 1000's of column families

Dean,

On Tuesday, October 2, 2012 at 18:52, Hiller, Dean wrote:

Because the data for an index is not all together (i.e., you need a multiget to 
get the data). It is not contiguous.

With a prefix, they keep the data in a partition, so from what I understand all 
the data for a prefix is contiguous.





QUESTION: What I don't get in the comment: I assume you are referring to CQL, 
in which case we would need to specify the partition (in addition to the 
index), which means all that data is on one node, correct? Or did I miss 
something there?

Maybe my question was just silly - I wasn't referring to CQL.

As for the locality of the data, I was hoping to be able to fire off an MR job 
to process all matching rows in the CF - I was assuming that this job would get 
executed on the same node as the data.

But I think the real confusion in my question has to do with the way the 
ColumnFamilyInputFormat has been implemented, since it would appear that it 
ingests the entire (non-OPP) CF into Hadoop, such that the predicate needs to 
be applied in the job rather than up front in the Cassandra query.
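
To make the concern concrete, here is a rough sketch of what "predicate applied in the job" would mean, assuming the input format really does hand every row to the job. It is a plain Python stand-in, not the ColumnFamilyInputFormat or Hadoop API; full_scan, mapper and run_job are hypothetical names.

```python
def full_scan(column_family):
    """Stand-in for an input format that hands every row to the job."""
    for key, columns in column_family.items():
        yield key, columns


def mapper(key, columns, predicate):
    """Map step: the predicate is checked here, after the row was already read."""
    if predicate(columns):
        yield key, columns


def run_job(column_family, predicate):
    """Run the map step over all rows and count how many rows were read."""
    rows_read = 0
    matched = {}
    for key, columns in full_scan(column_family):
        rows_read += 1
        for k, v in mapper(key, columns, predicate):
            matched[k] = v
    return rows_read, matched


# Made-up sample data for illustration only.
cf = {
    "a1": {"state": "CO"},
    "a2": {"state": "NY"},
    "b1": {"state": "CO"},
    "b2": {"state": "TX"},
}
rows_read, matched = run_job(cf, lambda cols: cols["state"] == "CO")
```

In this toy model the job reads all four rows to find two matches, which is the cost I would hope to avoid by applying the predicate up front in the query.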

Cheers,

Ben
