Sounds like a similar case to mine. The files are definitely extremely big; a 10x space overhead would be a good case if you are just putting values into it.
I'm currently testing CASSANDRA-674 and hope the better SSTable can solve the space overhead problem. Please follow my e-mails today; I'll be working on it continuously. If your values are integers and floats, with column names of ~4 characters, then estimating from my case it will cost you 1~2 TB of disk space.

Best,
Steve

On Aug 16, 2011, at 4:20 PM, aaron morton wrote:

> Are you planning to create 500,000 Super Column Families or 500,000 rows in a
> single Super Column Family?
>
> The former is somewhat crazy. Cassandra schemas typically have up to a few
> tens of Column Families. Each column family involves a certain amount of
> memory overhead; this is now automatically managed in Cassandra 0.8 (see
> http://thelastpickle.com/2011/05/04/How-are-Memtables-measured/)
>
> If I understand correctly you have 500K entities with 6k columns each. A
> simple first approach to modelling this would be to use a Standard CF with a
> row for each entity. However the best model is the one that serves your read
> requests best.
>
> Also for background the sub columns in a super column are not indexed, see
> http://wiki.apache.org/cassandra/CassandraLimitations . You would probably
> run into this problem if you had 6000 sub columns in a super column.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 17/08/2011, at 12:53 AM, Renato Bacelar da Silveira wrote:
>
>> I am wondering about a certain volume situation.
>>
>> I currently load a Keyspace with a certain amount of SCFs.
>>
>> Each SCF (Super Column Family) represents an entity.
>>
>> Each Entity may have up to 6000 values.
>>
>> I am planning to have 500,000 Entities (SCF) with
>> 6000 Columns (within Super Columns - number of Super Columns
>> unknown), and was wondering how much resources something
>> like this would require?
>>
>> I am struggling to have 10,000 SCF with 30 Columns (within SuperColumns),
>> I get very large files, and reach a 4Gb heapspace limit very quickly on
>> a single node. I use Garbage Collection where needed.
>>
>> Is there some secret to load 500,000 Super Column Families?
>>
>> Regards.
>> --
>> Renato da Silveira
>> Senior Developer
>
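
For reference, here is a rough back-of-envelope sketch of the numbers in this thread. The entity and column counts come from the messages above; the 8-byte value size, the 1-byte-per-character column names, and decimal terabytes are assumptions made only for illustration, and the function names are hypothetical. It is not a model of how Cassandra actually lays out data on disk.

```python
# Back-of-envelope sizing for the workload discussed in this thread.
# Figures marked "assumption" are NOT from the thread.

ENTITIES = 500_000           # from the thread
COLUMNS_PER_ENTITY = 6_000   # from the thread
NAME_BYTES = 4               # "~4 characters"; assumption: 1 byte per character
VALUE_BYTES = 8              # assumption: one 8-byte integer or float per column
OVERHEAD_FACTOR = 10         # the "10x" good-case overhead mentioned above


def total_columns(entities=ENTITIES, columns=COLUMNS_PER_ENTITY):
    """Total number of columns across all entities."""
    return entities * columns


def raw_bytes(name_bytes=NAME_BYTES, value_bytes=VALUE_BYTES):
    """Payload size if only column names and values were stored."""
    return total_columns() * (name_bytes + value_bytes)


def on_disk_bytes(overhead=OVERHEAD_FACTOR):
    """Payload size multiplied by a flat overhead factor."""
    return raw_bytes() * overhead


def implied_bytes_per_column(disk_bytes):
    """Per-column cost implied by a given total disk estimate."""
    return disk_bytes / total_columns()


if __name__ == "__main__":
    tb = 10**12  # decimal terabyte (assumption)
    print("columns:", total_columns())
    print("raw payload (GB):", raw_bytes() / 10**9)
    print("at 10x overhead (GB):", on_disk_bytes() / 10**9)
    print("bytes/column at 1 TB:", implied_bytes_per_column(1 * tb))
    print("bytes/column at 2 TB:", implied_bytes_per_column(2 * tb))
```

Under these assumptions the workload is 3 billion columns with about 36 GB of raw payload, or about 360 GB at a flat 10x overhead. The 1~2 TB estimate above corresponds to roughly 333 to 667 bytes per column, so it implies a larger per-column cost than the 10x good case.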