Cassandra has to deserialize all of the columns in the row for get_count().
So from Cassandra's perspective, it's almost as much work as fetching the
entire row; it just doesn't have to send everything back over the network.

If you're frequently counting 8 million columns (or really, any significant
number), you should use counters instead.  If this is a rare occurrence,
you can do the count in multiple chunks by setting a start and end column
in the SlicePredicate for each chunk, but this requires some rough
knowledge of the distribution of the column names in the row.
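To illustrate the chunked approach, here is a rough, self-contained sketch in
Scala. The names are hypothetical: `fetchPage` stands in for a get_slice call
whose SlicePredicate carries a start column and a count; here it pages over an
in-memory sorted column list so the logic can be shown without a live cluster.

```scala
// Sketch of counting a very wide row in pages, assuming an ordered
// comparator (e.g. CompareWith="UTF8Type") so column names sort lexically.
object ChunkedCount {
  // Count all columns by repeatedly fetching pages of `pageSize`,
  // using the last column name of each page as the next start column.
  def countColumns(fetchPage: (String, Int) => Seq[String],
                   pageSize: Int): Long = {
    var total = 0L
    var start = ""            // "" means "from the beginning" in a slice
    var done = false
    while (!done) {
      val page = fetchPage(start, pageSize)
      // The slice range is inclusive of the start column, so drop the
      // duplicate first column on every page after the first.
      val fresh = if (start == "") page else page.filterNot(_ == start)
      total += fresh.size
      if (page.size < pageSize) done = true
      else start = page.last
    }
    total
  }

  def main(args: Array[String]): Unit = {
    // Simulated row: zero-padded names so lexical order matches numeric order.
    val columns = (0 until 100000).map(i => f"col$i%08d").toVector
    val fetch = (start: String, limit: Int) =>
      columns.dropWhile(_ < start).take(limit)
    println(countColumns(fetch, 1000))  // prints 100000
  }
}
```

In real usage each page would be a separate Thrift call, so no single request
forces the server to deserialize all 8 million columns at once.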

- Tyler

On Sun, Dec 12, 2010 at 1:26 AM, Dave Martin <moyesys...@googlemail.com> wrote:

> Hi there,
>
> I see the following:
>
> 1) Add 8,000,000 columns to a single row. Each column name is a UUID.
> 2) Use cassandra-cli to run count keyspace.cf['myGUID']
>
> The following is reported in the logs:
>
> ERROR [DroppedMessagesLogger] 2010-12-12 18:17:36,046 CassandraDaemon.java
> (line 87) Uncaught exception in thread Thread[DroppedMessagesLogger,5,main]
> java.lang.OutOfMemoryError: Java heap space
> ERROR [pool-1-thread-2] 2010-12-12 18:17:36,046 Cassandra.java (line 1407)
> Internal error processing get_count
> java.lang.OutOfMemoryError: Java heap space
>
> and Cassandra falls over. I see the same behaviour with 0.6.6.
>
> Increasing the memory allocation with the -Xmx & -Xms args to 4GB allows
> the count to return in this particular example (i.e. no OutOfMemory is
> thrown).
>
> Here's the Scala code that was run to load the columns, which uses the Akka
> persistence API:
>
> object ColumnTest {
>        def main(args : Array[String]) : Unit = {
>                println("Super column test starting")
>                val hosts = Array{"localhost"}
>                val sessions = new
> CassandraSessionPool("occurrence",StackPool(SocketProvider("localhost",
> 9160)),Protocol.Binary,ConsistencyLevel.ONE)
>                val session = sessions.newSession
>                loadRow("myGUID", 8000000, session)
>                session.close
>        }
>
>        def loadRow(key:String, noOfColumns:Int, session:CassandraSession){
>                print("loading: "+key+", with columns: "+noOfColumns)
>                val start = System.currentTimeMillis
>                val rawPath = new ColumnPath("dr")
>                for(i <- 0 until noOfColumns){
>                        val recordUuid = UUID.randomUUID.toString
>                        session ++| (key,
> rawPath.setColumn(recordUuid.getBytes), "1".getBytes,
> System.currentTimeMillis)
>                        session.flush
>                }
>                val finish = System.currentTimeMillis
>                print(", Time taken (secs) :" +((finish-start)/1000) + "
> seconds.\n")
>        }
> }
>
> Here's the configuration used:
>
> # Arguments to pass to the JVM
> JVM_OPTS=" \
>        -ea \
>        -Xms1G \
>        -Xmx2G \
>        -XX:+UseParNewGC \
>        -XX:+UseConcMarkSweepGC \
>        -XX:+CMSParallelRemarkEnabled \
>        -XX:SurvivorRatio=8 \
>        -XX:MaxTenuringThreshold=1 \
>        -XX:CMSInitiatingOccupancyFraction=75 \
>        -XX:+UseCMSInitiatingOccupancyOnly \
>        -XX:+HeapDumpOnOutOfMemoryError \
>        -Dcom.sun.management.jmxremote.port=8080 \
>        -Dcom.sun.management.jmxremote.ssl=false \
>        -Dcom.sun.management.jmxremote.authenticate=false"
>
> Admittedly the resource allocation is small, but I wondered if there should
> be some configuration guidelines (e.g. memory allocation vs number of
> columns supported).
>
> I'm running this on my MBP with a single node; the Java version is as follows:
>
> $ java -version
> java version "1.6.0_22"
> Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
> Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)
>
> Here's the CF definition:
>
>    <Keyspace Name="occurrence">
>      <ColumnFamily Name="dr"
>                    CompareWith="UTF8Type"
>                    Comment="The column family for dataset tracking"/>
>
> <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
>     <ReplicationFactor>1</ReplicationFactor>
>
> <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>    </Keyspace>
>
> Apologies in advance if this is a known issue or a known limitation of
> 0.6.x.
> I had wondered if I was hitting the 2GB row limit for 0.6.x releases, but
> 8 million columns is only about 300MB in this particular case.
> I guess it may also be a result of the limitations of Thrift (i.e. no
> streaming capabilities).
>
> Any thoughts appreciated,
>
> Dave
>
