Hi All
I was building an application which stores some telecom call records
in a Cassandra store and later performs some analysis on them. I
created two versions, (1) - where the key is of the form "A|B" where A
and B are two mobile numbers and A calls B, and (2) - where the key is
of the form "A|B|TS" where A and B are same as before and TS is a time
stamp (the time when the call was made). In both cases, the record
structure is as follows: [key, DURATION_S1, DURATION_S2], where S1 and
S2 are two different sources for the same information (two network
elements).
Basically I have two files from two network elements, and I parse the
files and store the records in Cassandra. In both versions, there is a
column family (<ColumnFamily CompareWith="UTF8Type" Name="Event"/>)
which is used to store the records. In the first version, as (A|B) may
occur multiple times within the files, the DURATION_S* fields are
updated every time a duplicate key is encountered. In the second case,
(A|B|TS) is unique - so there is no need for updating the DURATION_S*
fields. Thus, in the second case, the number of records in the Event
CF is slightly more - 9590, compared to 8378 in the 1st case.
In both versions, the records are processed and sent to Cassandra
within a reasonable period of time. The problem is in a range_slice
query. I am trying to find all the records for which DURATION_S1 !=
DURATION_S2. And in fact, the code to do this is almost the same in
both versions:
/**
* Selects all entries from the Event CF and finds out all those
entries where
* DUR_SRC1 != DUR_SRC2
*/
public void findDurationMismatches() {
boolean iterate = true;
long count = 0;
long totalCount = 0;
int rowCount = 500;
try {
KeyRange keyRange = new KeyRange();
keyRange.setStart_key("");
keyRange.setEnd_key("");
// use a keyCount of 500 - means iterate over 500 records
until all keys are considered
keyRange.setCount(rowCount);
List<byte[]> columns = new ArrayList<byte[]>();
columns.add("DUR_SRC1".getBytes(ENCODING));
columns.add("DUR_SRC2".getBytes(ENCODING));
SlicePredicate slicePredicate = new SlicePredicate();
slicePredicate.setColumn_names(columns);
ColumnParent columnParent = new ColumnParent(EVENT_CF);
List<KeySlice> keySlices =
client.get_range_slices(KEYSPACE, columnParent,
slicePredicate, keyRange, ConsistencyLevel.ONE);
while (iterate) {
//logger.debug("Number of rows retrieved: " + keySlices.size());
totalCount = totalCount + keySlices.size();
if (keySlices.size() < rowCount) {
// this is the last set
iterate = false;
}
for (KeySlice keySlice : keySlices) {
List<ColumnOrSuperColumn> result = keySlice.getColumns();
if (result.size() == 2) {
String count_src1 = new
String(result.get(0).getColumn().getValue(), ENCODING);
String count_src2 = new
String(result.get(1).getColumn().getValue(), ENCODING);
if (!count_src1.equals(count_src2)) {
count++;
//printToConsole(keySlice.getKey(),
keySlice.getColumns());
}
keyRange.setStart_key(keySlice.getKey());
}
keySlices = client.get_range_slices(KEYSPACE,
columnParent, slicePredicate,
keyRange, ConsistencyLevel.ONE);
}
}
logger.debug("Found " + count + " records with mismatched
duration fields.");
logger.debug("Total number of records processed: " + totalCount);
} catch (Exception exception) {
exception.printStackTrace();
logger.error("Exception: " + exception.getMessage());
}
}
The trouble is - the same code, takes more than 5 mins to iterate over
9590 records in the 2nd version, whereas it takes about 2-3 seconds to
iterate over 8300 records in the 1st version - and on the same
machine.
I can't think of any reason why the performance would change so
drastically. What am I doing wrong here?
Regards
Arijit
--
"And when the night is cloudy,
There is still a light that shines on me,
Shine on until tomorrow, let it be."