Apologies. I've been too stupid to realize that I had placed the pagination statement in a ridiculous place :-( The get_range_slices call that fetches the next batch sits inside the for loop in the code quoted below, so it fires once for every row processed instead of once per batch.
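For the record, here is roughly what that loop should have looked like: the re-query moved out of the for loop, with each new batch starting from the last key of the previous one. This is only a sketch against the same surrounding class as the quoted code below (client, KEYSPACE, EVENT_CF, ENCODING, logger), so treat it as untested:

    public void findDurationMismatches() {
        long count = 0;
        long totalCount = 0;
        int rowCount = 500;
        try {
            List<byte[]> columns = new ArrayList<byte[]>();
            columns.add("DUR_SRC1".getBytes(ENCODING));
            columns.add("DUR_SRC2".getBytes(ENCODING));
            SlicePredicate slicePredicate = new SlicePredicate();
            slicePredicate.setColumn_names(columns);
            ColumnParent columnParent = new ColumnParent(EVENT_CF);

            KeyRange keyRange = new KeyRange();
            keyRange.setStart_key("");
            keyRange.setEnd_key("");
            keyRange.setCount(rowCount);

            String lastKey = null;
            while (true) {
                // one RPC per batch, not one per row
                List<KeySlice> keySlices = client.get_range_slices(KEYSPACE,
                        columnParent, slicePredicate, keyRange, ConsistencyLevel.ONE);
                for (KeySlice keySlice : keySlices) {
                    // start_key is inclusive, so skip the row we already processed
                    if (keySlice.getKey().equals(lastKey)) {
                        continue;
                    }
                    totalCount++;
                    List<ColumnOrSuperColumn> result = keySlice.getColumns();
                    if (result.size() == 2) {
                        String src1 = new String(result.get(0).getColumn().getValue(), ENCODING);
                        String src2 = new String(result.get(1).getColumn().getValue(), ENCODING);
                        if (!src1.equals(src2)) {
                            count++;
                        }
                    }
                }
                if (keySlices.size() < rowCount) {
                    break; // a short batch means we have reached the end
                }
                // page forward: the next batch starts at the last key we saw
                lastKey = keySlices.get(keySlices.size() - 1).getKey();
                keyRange.setStart_key(lastKey);
            }
            logger.debug("Found " + count + " records with mismatched duration fields.");
            logger.debug("Total number of records processed: " + totalCount);
        } catch (Exception exception) {
            logger.error("Exception: " + exception.getMessage(), exception);
        }
    }

Note that start_key is inclusive, so the first row of every batch after the first is a repeat of the previous batch's last row and has to be skipped.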
One question, though: OPP (the OrderPreservingPartitioner) is mandatory for this kind of pagination, isn't it? But I've read elsewhere on this list that OPP has its drawbacks. So how do you iterate over all records, or find all records matching a certain criterion? Is the Hadoop approach the only alternative?

Arijit

On 7 December 2010 15:41, Arijit Mukherjee <ariji...@gmail.com> wrote:
> Hi All
>
> I was building an application which stores some telecom call records
> in a Cassandra store and later performs some analysis on them. I
> created two versions: (1) where the key is of the form "A|B", where A
> and B are two mobile numbers and A calls B, and (2) where the key is
> of the form "A|B|TS", where A and B are as before and TS is a
> timestamp (the time when the call was made). In both cases, the
> record structure is [key, DURATION_S1, DURATION_S2], where S1 and S2
> are two different sources for the same information (two network
> elements).
>
> Basically I have two files from two network elements, and I parse the
> files and store the records in Cassandra. In both versions there is a
> column family (<ColumnFamily CompareWith="UTF8Type" Name="Event"/>)
> which is used to store the records. In the first version, as (A|B)
> may occur multiple times within the files, the DURATION_S* fields are
> updated every time a duplicate key is encountered. In the second
> case, (A|B|TS) is unique, so there is no need to update the
> DURATION_S* fields. Thus, in the second case, the number of records
> in the Event CF is slightly higher: 9590, compared to 8378 in the
> first case.
>
> In both versions, the records are processed and sent to Cassandra
> within a reasonable period of time. The problem is with a
> range_slice query: I am trying to find all the records for which
> DURATION_S1 != DURATION_S2.
> And in fact, the code to do this is almost the same in both versions:
>
>     /**
>      * Selects all entries from the Event CF and finds all those
>      * entries where DUR_SRC1 != DUR_SRC2.
>      */
>     public void findDurationMismatches() {
>         boolean iterate = true;
>         long count = 0;
>         long totalCount = 0;
>         int rowCount = 500;
>         try {
>             KeyRange keyRange = new KeyRange();
>             keyRange.setStart_key("");
>             keyRange.setEnd_key("");
>             // use a count of 500 - iterate over 500 records at a
>             // time until all keys are considered
>             keyRange.setCount(rowCount);
>             List<byte[]> columns = new ArrayList<byte[]>();
>             columns.add("DUR_SRC1".getBytes(ENCODING));
>             columns.add("DUR_SRC2".getBytes(ENCODING));
>
>             SlicePredicate slicePredicate = new SlicePredicate();
>             slicePredicate.setColumn_names(columns);
>             ColumnParent columnParent = new ColumnParent(EVENT_CF);
>             List<KeySlice> keySlices = client.get_range_slices(KEYSPACE,
>                     columnParent, slicePredicate, keyRange, ConsistencyLevel.ONE);
>
>             while (iterate) {
>                 //logger.debug("Number of rows retrieved: " + keySlices.size());
>                 totalCount = totalCount + keySlices.size();
>                 if (keySlices.size() < rowCount) {
>                     // this is the last set
>                     iterate = false;
>                 }
>                 for (KeySlice keySlice : keySlices) {
>                     List<ColumnOrSuperColumn> result = keySlice.getColumns();
>                     if (result.size() == 2) {
>                         String count_src1 = new String(result.get(0).getColumn().getValue(), ENCODING);
>                         String count_src2 = new String(result.get(1).getColumn().getValue(), ENCODING);
>                         if (!count_src1.equals(count_src2)) {
>                             count++;
>                             //printToConsole(keySlice.getKey(), keySlice.getColumns());
>                         }
>                         keyRange.setStart_key(keySlice.getKey());
>                     }
>                     keySlices = client.get_range_slices(KEYSPACE, columnParent,
>                             slicePredicate, keyRange, ConsistencyLevel.ONE);
>                 }
>             }
>             logger.debug("Found " + count + " records with mismatched duration fields.");
>             logger.debug("Total number of records processed: " + totalCount);
>
>         } catch (Exception exception) {
>             exception.printStackTrace();
>             logger.error("Exception: " + exception.getMessage());
>         }
>     }
>
> The trouble is that the same code takes more than 5 minutes to
> iterate over the 9590 records in the 2nd version, whereas it takes
> about 2-3 seconds to iterate over the 8300 records in the 1st
> version, on the same machine.
>
> I can't think of any reason why the performance would change so
> drastically. What am I doing wrong here?
>
> Regards
> Arijit
>
> --
> "And when the night is cloudy,
> There is still a light that shines on me,
> Shine on until tomorrow, let it be."

--
"And when the night is cloudy,
There is still a light that shines on me,
Shine on until tomorrow, let it be."
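P.S. To put rough numbers on the slowdown: with the re-query inside the for loop, every one of the 500 rows in a batch triggers another get_range_slices call that itself returns up to 500 rows, while the start key only advances by about one batch per pass. Over 9590 rows that is on the order of 9600 RPCs shipping close to five million rows in total, against roughly 9590 / 500, i.e. about 20 calls, once the re-query is moved after the loop. (Approximate figures, assuming full batches throughout.)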