Hi All,

I am building an application which stores some telecom call records in a Cassandra store and later performs some analysis on them. I have created two versions: (1) where the key is of the form "A|B", where A and B are two mobile numbers and A calls B; and (2) where the key is of the form "A|B|TS", where A and B are the same as before and TS is a timestamp (the time when the call was made). In both cases, the record structure is [key, DURATION_S1, DURATION_S2], where S1 and S2 are two different sources of the same information (two network elements).
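For concreteness, here is roughly how a key is built and a record written in both versions (a simplified sketch, not my actual code: buildKey and storeDuration are illustrative names, while EVENT_CF, KEYSPACE, ENCODING and client are the same fields that appear in the query code further down). The only difference between the two versions is whether the timestamp is appended to the key:

// Illustrative only: builds the row key for a call record.
// Version 1 uses "A|B"; version 2 appends the call timestamp, "A|B|TS".
private String buildKey(String numberA, String numberB, long callTimestamp,
        boolean includeTimestamp) {
    return includeTimestamp
            ? numberA + "|" + numberB + "|" + callTimestamp
            : numberA + "|" + numberB;
}

// Illustrative only: writes a single duration column under the given row key.
private void storeDuration(String key, String columnName, String duration)
        throws Exception {
    ColumnPath path = new ColumnPath(EVENT_CF);
    path.setColumn(columnName.getBytes(ENCODING));
    client.insert(KEYSPACE, key, path, duration.getBytes(ENCODING),
            System.currentTimeMillis(), ConsistencyLevel.ONE);
}

(In version 1, inserting again under the same key and column simply replaces the old value.)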
Basically I have two files from two network elements, and I parse the files and store the records in Cassandra. In both versions, there is a column family (<ColumnFamily CompareWith="UTF8Type" Name="Event"/>) which is used to store the records. In the first version, as (A|B) may occur multiple times within the files, the DURATION_S* fields are updated every time a duplicate key is encountered. In the second version, (A|B|TS) is unique, so there is no need to update the DURATION_S* fields. Thus, in the second version, the Event CF holds slightly more records: 9590, compared to 8378 in the first.

In both versions, the records are processed and sent to Cassandra within a reasonable period of time. The problem is with a range-slice query (get_range_slices). I am trying to find all the records for which DURATION_S1 != DURATION_S2, and the code to do this is almost identical in both versions:

/**
 * Selects all entries from the Event CF and finds those entries where
 * DUR_SRC1 != DUR_SRC2.
 */
public void findDurationMismatches() {
    boolean iterate = true;
    long count = 0;
    long totalCount = 0;
    int rowCount = 500;
    try {
        // open-ended key range; fetch up to rowCount (500) rows per call
        // and keep going until all keys have been covered
        KeyRange keyRange = new KeyRange();
        keyRange.setStart_key("");
        keyRange.setEnd_key("");
        keyRange.setCount(rowCount);

        // only the two duration columns are needed for the comparison
        List<byte[]> columns = new ArrayList<byte[]>();
        columns.add("DUR_SRC1".getBytes(ENCODING));
        columns.add("DUR_SRC2".getBytes(ENCODING));

        SlicePredicate slicePredicate = new SlicePredicate();
        slicePredicate.setColumn_names(columns);
        ColumnParent columnParent = new ColumnParent(EVENT_CF);

        List<KeySlice> keySlices = client.get_range_slices(KEYSPACE,
                columnParent, slicePredicate, keyRange, ConsistencyLevel.ONE);
        while (iterate) {
            //logger.debug("Number of rows retrieved: " + keySlices.size());
            totalCount = totalCount + keySlices.size();
            if (keySlices.size() < rowCount) {
                // a short batch means this is the last one
                iterate = false;
            }
            for (KeySlice keySlice : keySlices) {
                List<ColumnOrSuperColumn> result = keySlice.getColumns();
                if (result.size() == 2) {
                    String count_src1 = new String(result.get(0).getColumn().getValue(), ENCODING);
                    String count_src2 = new String(result.get(1).getColumn().getValue(), ENCODING);
                    if (!count_src1.equals(count_src2)) {
                        count++;
                        //printToConsole(keySlice.getKey(), keySlice.getColumns());
                    }
                    // advance the range to start at the last key seen
                    keyRange.setStart_key(keySlice.getKey());
                }
                // re-query with the updated start key
                keySlices = client.get_range_slices(KEYSPACE,
                        columnParent, slicePredicate, keyRange, ConsistencyLevel.ONE);
            }
        }
        logger.debug("Found " + count + " records with mismatched duration fields.");
        logger.debug("Total number of records processed: " + totalCount);
    } catch (Exception exception) {
        exception.printStackTrace();
        logger.error("Exception: " + exception.getMessage());
    }
}

The trouble is: the same code takes more than 5 minutes to iterate over the 9590 records in the 2nd version, whereas it takes about 2-3 seconds to iterate over the ~8300 records in the 1st version, on the same machine. I can't think of any reason why the performance would change so drastically. What am I doing wrong here?

Regards,
Arijit

--
"And when the night is cloudy, There is still a light that shines on me, Shine on until tomorrow, let it be."