Hi all,

I am building an application that stores telecom call records in a
Cassandra store and later performs some analysis on them. I created
two versions: (1) where the key is of the form "A|B", where A and B
are two mobile numbers and A calls B, and (2) where the key is of the
form "A|B|TS", where A and B are the same as before and TS is a
timestamp (the time when the call was made). In both cases the record
structure is [key, DURATION_S1, DURATION_S2], where S1 and S2 are two
different sources for the same information (two network elements).
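
To make the two key layouts concrete, here is a minimal sketch of how
such keys could be built. The names EventKeys, pairKey and callKey are
made up for illustration, and representing TS as epoch milliseconds is
an assumption; the real application may format it differently.

```java
// Illustrative only: EventKeys, pairKey and callKey are hypothetical names.
final class EventKeys {
    private static final char SEP = '|';

    private EventKeys() {}

    // Version 1 key: "A|B" (caller, then callee).
    static String pairKey(String caller, String callee) {
        return caller + SEP + callee;
    }

    // Version 2 key: "A|B|TS". TS is assumed to be epoch milliseconds here.
    static String callKey(String caller, String callee, long timestampMillis) {
        return pairKey(caller, callee) + SEP + timestampMillis;
    }
}
```

With version 1, two calls between the same pair of numbers map to the
same row; with version 2, each call gets its own row.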

Basically, I have two files from two network elements; I parse the
files and store the records in Cassandra. In both versions, a single
column family (<ColumnFamily CompareWith="UTF8Type" Name="Event"/>)
stores the records. In the first version, since (A|B) may occur
multiple times within the files, the DURATION_S* fields are updated
every time a duplicate key is encountered. In the second version,
(A|B|TS) is unique, so the DURATION_S* fields never need updating.
As a result, the Event CF holds somewhat more records in the second
version: 9590, compared to 8378 in the first.

In both versions, the records are processed and sent to Cassandra
within a reasonable period of time. The problem is with a
get_range_slices query. I am trying to find all the records for which
DURATION_S1 != DURATION_S2 (stored as the columns DUR_SRC1 and
DUR_SRC2). The code to do this is almost the same in both versions:

    /**
     * Selects all entries from the Event CF and finds all those entries
     * where DUR_SRC1 != DUR_SRC2.
     */
    public void findDurationMismatches() {
        boolean iterate = true;
        long count = 0;
        long totalCount = 0;
        int rowCount = 500;
        try {
            KeyRange keyRange = new KeyRange();
            keyRange.setStart_key("");
            keyRange.setEnd_key("");
            // use a keyCount of 500 - means iterate over 500 records
            // until all keys are considered
            keyRange.setCount(rowCount);
            List<byte[]> columns = new ArrayList<byte[]>();
            columns.add("DUR_SRC1".getBytes(ENCODING));
            columns.add("DUR_SRC2".getBytes(ENCODING));

            SlicePredicate slicePredicate = new SlicePredicate();
            slicePredicate.setColumn_names(columns);
            ColumnParent columnParent = new ColumnParent(EVENT_CF);
            List<KeySlice> keySlices = client.get_range_slices(KEYSPACE,
                    columnParent, slicePredicate, keyRange,
                    ConsistencyLevel.ONE);

            while (iterate) {
                //logger.debug("Number of rows retrieved: " + keySlices.size());
                totalCount = totalCount + keySlices.size();
                if (keySlices.size() < rowCount) {
                    // this is the last set
                    iterate = false;
                }
                for (KeySlice keySlice : keySlices) {
                    List<ColumnOrSuperColumn> result = keySlice.getColumns();
                    if (result.size() == 2) {
                        String count_src1 = new String(
                                result.get(0).getColumn().getValue(), ENCODING);
                        String count_src2 = new String(
                                result.get(1).getColumn().getValue(), ENCODING);
                        if (!count_src1.equals(count_src2)) {
                            count++;
                            //printToConsole(keySlice.getKey(), keySlice.getColumns());
                        }
                        keyRange.setStart_key(keySlice.getKey());
                    }
                    keySlices = client.get_range_slices(KEYSPACE,
                            columnParent, slicePredicate,
                            keyRange, ConsistencyLevel.ONE);
                }
            }
            logger.debug("Found " + count
                    + " records with mismatched duration fields.");
            logger.debug("Total number of records processed: " + totalCount);

        } catch (Exception exception) {
            exception.printStackTrace();
            logger.error("Exception: " + exception.getMessage());
        }
    }
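
For reference, the paging pattern I am aiming for is: fetch one page,
process every row in it, then use the last key of that page as the
start key of the next fetch. The start key is inclusive, so each later
page begins with a row that was already seen. Below is a minimal,
self-contained sketch of that pattern. It is not the Thrift client
code: a TreeMap stands in for the Event CF, and PagingSketch, fetchPage
and countMismatches are hypothetical names. The TreeMap returns keys in
lexical order, which is a simplification of how a real cluster orders
rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: a TreeMap stands in for the Event CF so that the paging
// logic can be run without a cluster. All names here are hypothetical.
final class PagingSketch {

    // Stand-in for a range query: up to `count` rows with key >= startKey,
    // in key order (the start key is inclusive).
    static List<Map.Entry<String, String[]>> fetchPage(
            TreeMap<String, String[]> store, String startKey, int count) {
        List<Map.Entry<String, String[]>> page = new ArrayList<>();
        for (Map.Entry<String, String[]> e : store.tailMap(startKey, true).entrySet()) {
            if (page.size() == count) break;
            page.add(e);
        }
        return page;
    }

    // Returns {rows visited, rows whose two duration values differ}.
    static long[] countMismatches(TreeMap<String, String[]> store, int pageSize) {
        if (pageSize < 2) {
            // With an inclusive start key, a page of 1 could never advance.
            throw new IllegalArgumentException("pageSize must be at least 2");
        }
        long total = 0;
        long mismatches = 0;
        String startKey = "";
        boolean firstPage = true;
        while (true) {
            // One fetch per page, issued outside the per-row loop.
            List<Map.Entry<String, String[]>> page = fetchPage(store, startKey, pageSize);
            // Later pages start with the last key of the previous page; skip it.
            int from = firstPage ? 0 : 1;
            for (int i = from; i < page.size(); i++) {
                String[] durations = page.get(i).getValue();
                total++;
                if (!durations[0].equals(durations[1])) mismatches++;
            }
            if (page.size() < pageSize) break; // short page: nothing left
            startKey = page.get(page.size() - 1).getKey();
            firstPage = false;
        }
        return new long[] {total, mismatches};
    }
}
```

In this sketch the number of fetches grows with the number of pages
(roughly rows / pageSize), not with the number of rows.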

The trouble is that the same code takes more than 5 minutes to iterate
over the 9590 records in the second version, whereas it takes about
2-3 seconds to iterate over the 8378 records in the first version, on
the same machine.

I can't think of any reason why the performance would change so
drastically. What am I doing wrong here?

Regards
Arijit



-- 
"And when the night is cloudy,
There is still a light that shines on me,
Shine on until tomorrow, let it be."
