Duh. I totally forgot that I take a snapshot just before the daily rsync backup.
k.z.

On Wed, Nov 5, 2014 at 3:13 PM, Robert Coli <rc...@eventbrite.com> wrote:
> On Wed, Nov 5, 2014 at 12:08 PM, KZ Win <kz...@pelotoncycle.com> wrote:
>>
>> I have cassandra nodes with long uptime. Disk footprint for the
>> cassandra data folder is different when I copy to a different folder.
>
>>
>> I am talking about as much as 100% different for 25-40GB of data. On
>> copying they grow to double that.
>
> 1) Cassandra automatically "snapshots" SSTables when one does certain
> operations.
> 2) One can also manually create snapshots.
> 3) Snapshots are hard links to files.
> 4) Hard links to files generally become duplicate files when copied to
> another partition, unless rsync or cp is configured to maintain the hard
> link relationship.
> 5) Snapshots are kept in a subdirectory of the data directory for the
> columnfamily.
> 6) This all has the pathological-seeming outcome that snapshots become
> effectively larger as time passes (because the hard links they contain
> become the only copy of the file when the "original" is deleted from the
> data directory via compaction) and might grow significantly when copied.
>
> tl;dr : modify your rsync to include --exclude=snapshots/
>
> =Rob
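
For anyone who finds this thread later, here is a rough Python sketch of the effect Rob describes in points 3, 4 and 6. It is only an illustration, not anything Cassandra or rsync actually runs. The file name `sstable.db`, the snapshot name `1`, and the helpers `disk_usage` and `demo` are made up for the example. `shutil.copytree` stands in for a copy that does not preserve hard links, and its `ignore` argument stands in for rsync's `--exclude=snapshots/`.

```python
import os
import shutil
import tempfile


def disk_usage(root):
    """Sum file sizes under root, counting each inode only once (as du does)."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                total += st.st_size
    return total


def demo():
    """Return (original, naive_copy, copy_without_snapshots) sizes in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: a data directory with a snapshots/ subdirectory.
        data = os.path.join(tmp, "data")
        snap = os.path.join(data, "snapshots", "1")
        os.makedirs(snap)

        # One 1 KiB "SSTable" plus a hard link to it inside the snapshot.
        sstable = os.path.join(data, "sstable.db")
        with open(sstable, "wb") as f:
            f.write(b"x" * 1024)
        os.link(sstable, os.path.join(snap, "sstable.db"))
        original = disk_usage(data)

        # A plain copy writes every directory entry out as a separate file.
        naive = os.path.join(tmp, "naive")
        shutil.copytree(data, naive)
        naive_copy = disk_usage(naive)

        # Skipping the snapshots directory, in the spirit of --exclude=snapshots/.
        slim = os.path.join(tmp, "slim")
        shutil.copytree(data, slim, ignore=shutil.ignore_patterns("snapshots"))
        copy_without_snapshots = disk_usage(slim)

        return original, naive_copy, copy_without_snapshots


if __name__ == "__main__":
    print(demo())
```

On a filesystem that supports hard links, the naive copy comes out at twice the size of the original, which matches the doubling reported above, while the copy that skips `snapshots` stays the same size as the original.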