It's possible we have a paging bug in sstable2json.

On Fri, Sep 30, 2011 at 10:29 AM, Scott Fines <scott.fi...@nisc.coop> wrote:
> Hi all,
> I've been messing with sstable2json as a means of mass-exporting some data
> (mainly for backups, but also for some convenience trickery on an individual
> node's data). However, I've run into a situation where sstable2json appears
> to be dumping out TONS of duplicate columns for a single row.
> For example, for a single key, I did
> $CASSANDRA_HOME/bin/sstable2json <sstable> -k <key> > output.file
> which ran until I killed it manually. Then I executed
> cat output.file | sed 's/]/\n/g' | wc -l
> which gave me 40 million and some change. On the other hand,
> cat output.file | sed 's/]/\n/g' | sort -n | uniq | wc -l
> gave me around 10K (much closer to reality).
> For my particular data set, the total size of any given row cannot exceed
> 80K columns. So I'm wondering: Is this normal behavior for sstable2json?
> Assuming that it is, is there any way in which I can massage sstable2json
> into not emitting duplicates? These duplicates eat a great deal of disk
> space and processing power to manipulate, which I'd like to avoid.
>
> Thanks for your help,
> Scott
--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
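[Editorial sketch of the duplicate-counting check discussed above. The file contents here are synthetic stand-ins for sstable2json output, not real data from the thread; the point is only that `uniq` collapses duplicates correctly when the input is sorted first, since `uniq` only removes *adjacent* identical lines.]

```shell
# Synthetic sample standing in for lines split out of output.file;
# "colA" appears three times to simulate duplicate columns.
printf 'colA\ncolB\ncolA\ncolA\ncolC\n' > /tmp/output.sample

# Raw line count includes every duplicate:
wc -l < /tmp/output.sample
# -> 5

# Sort first, then uniq, to count distinct lines (as in the thread's
# second pipeline):
sort /tmp/output.sample | uniq | wc -l
# -> 3
```

The gap between the two counts (5 vs. 3 here; ~40 million vs. ~10K in Scott's data) is the measure of how many duplicates the export emitted.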