Looking at the debug log, I see:

[2015-06-29 23:38:11] [main] DEBUG CqlRecordReader - cqlQuery SELECT "wpid","value" FROM "qarth_catalog_dev"."product_v1" WHERE token("wpid")>? AND token("wpid")<=? LIMIT 10
[2015-06-29 23:38:11] [main] DEBUG CqlRecordReader - created org.apache.cassandra.hadoop.cql3.CqlRecordReader$RowIterator@11963225
[2015-06-29 23:38:11] [main] DEBUG CqlRecordReader - Finished scanning 6 rows (estimate was: 0)
I know the split has about 1000 rows, so why is the record reader not paging through the whole thing? I guess I am missing something very fundamental, and I cannot figure it out from the manuals or the source code for CqlInputFormat and CqlRecordReader. Does anyone have working sample code they can share? (To make the question concrete, I have put a simplified sketch of the kind of job setup I mean at the bottom of this mail.)

————————————————————————————————————

Venky Kandaswamy
925-200-7124


On 6/29/15, 8:46 PM, "Venkatesh Kandaswamy" <ve...@walmartlabs.com> wrote:

>Apologies, I meant version C* 2.0.16.
>The latest 2.1.7 source has a different WordCount example, and it does
>not use CqlPagingInputFormat. I am comparing the differences to
>understand why the change was made, but if you can shed some light on the
>reasoning, it is much appreciated (and will save me a few hours of digging
>through the code).
>————————————————————————————————————
>
>Venky Kandaswamy
>925-200-7124
>
>
>On 6/29/15, 8:40 PM, "Venkatesh Kandaswamy" <ve...@walmartlabs.com> wrote:
>
>>I was going through the WordCount example in the latest 2.1.7 Apache C*
>>source, and there is a reference to
>>org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat, but it is not in
>>the source tree or in the compiled binary. It looks like we really cannot
>>use C* with Hadoop without a paging input format. Is there a reason why it
>>was removed while the example still references it? Please shed some light
>>if you know the answer.
>>
>>————————————————————————————————————
>>
>>Venky Kandaswamy
>>925-200-7124
>>
>>
>>On 6/29/15, 1:15 PM, "Venkatesh Kandaswamy" <ve...@walmartlabs.com>
>>wrote:
>>
>>>All,
>>>   I converted one of my C* programs to Hadoop 2.x and the C* DataStax
>>>drivers for 2.1.0. The original program (Hadoop 1.x) worked fine when we
>>>specified InputCQLPageRowSize and InputSplitSize to reasonable values.
>>>For example, with 60K rows, a page row size of 100 and a split size of
>>>10000 would run 6 mappers and return all 60K rows. When we switched to
>>>the 2.1.x version of the DataStax drivers, the same program now returns
>>>only 600 rows.
>>>
>>>   It looks like the paging logic has changed and only the first page of
>>>100 rows is being read from each split. How do we get all the rows?
>>>
>>>————————————————————————————————————
>>>
>>>Venky Kandaswamy
>>>925-200-7124
>>>
>>
>
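
PS: Here is the simplified sketch mentioned above: roughly the shape of a map-only job reading through CqlInputFormat on C* 2.1, where CqlRecordReader hands the mapper a Long key and a com.datastax.driver.core.Row value. This is not my actual code; the keyspace, table and columns are the ones from the debug log, while the contact point, class names and output path are placeholders, and the comments about page row size vs. LIMIT only reflect my reading of the 2.1 CqlRecordReader, so please correct me if that is wrong.

import java.io.IOException;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.datastax.driver.core.Row;

public class CqlReadSketch
{
    // CqlRecordReader (C* 2.1) hands the mapper a Long key and the DataStax
    // driver Row for each CQL row in the split's token range.
    public static class RowMapper extends Mapper<Long, Row, Text, IntWritable>
    {
        @Override
        protected void map(Long key, Row row, Context context)
                throws IOException, InterruptedException
        {
            // Assumes "wpid" is a text column, as in the query from the debug log.
            context.write(new Text(row.getString("wpid")), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception
    {
        Job job = Job.getInstance(new Configuration(), "cql-read-sketch");
        job.setJarByClass(CqlReadSketch.class);
        job.setMapperClass(RowMapper.class);
        job.setInputFormatClass(CqlInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);                                   // map-only: just read the rows back

        Configuration conf = job.getConfiguration();
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");     // placeholder contact point
        ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(conf, "qarth_catalog_dev", "product_v1");
        ConfigHelper.setInputSplitSize(conf, 10000);                // target rows per split / per mapper

        // Page row size is passed to the driver as the fetch size (rows per
        // round trip); as I understand it, it should not cap the rows a split
        // returns. A LIMIT inside a user-supplied input CQL, by contrast,
        // caps the rows returned for each token range.
        CqlConfigHelper.setInputCQLPageRowSize(conf, "100");
        CqlConfigHelper.setInputColumns(conf, "wpid,value");

        FileOutputFormat.setOutputPath(job, new Path("/tmp/cql-read-sketch-out"));  // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If that wiring is roughly right, I would expect setInputSplitSize to control how many mappers run and the page row size to only affect fetch batching, which is why getting exactly 600 rows (6 mappers x 100) surprises me.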