On Wed, Nov 5, 2014 at 12:08 PM, KZ Win <kz...@pelotoncycle.com> wrote:
> I have cassandra nodes with long uptime. Disk footprint for the
> cassandra data folder is different when I copy to a different folder.
>
> I am talking about as much as 100% different for 25-40GB of data. On
> copying they grow to double that.

1) Cassandra automatically "snapshots" SSTables when one does certain
   operations.
2) One can also manually create snapshots.
3) Snapshots are hard links to files.
4) Hard links to files generally become duplicate files when copied to
   another partition, unless rsync or cp is configured to maintain the
   hard link relationship.
5) Snapshots are kept in a subdirectory of the data directory for the
   columnfamily.
6) This all has the pathological-seeming outcome that snapshots become
   effectively larger as time passes (because the hard links they
   contain become the only copy of the file when the "original" is
   deleted from the data directory via compaction) and might grow
   significantly when copied.

tl;dr : modify your rsync to include --exclude=snapshots/

=Rob
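
To see how much of a data directory is hard-link duplication before copying it, something like the sketch below may help. It is only an illustration under assumptions: the function names `disk_usage` and `demo`, the file name `demo-Data.db`, and the `snapshots/example/` layout are all made up for the example, not taken from Cassandra. It compares the "apparent" size (every path counted, which is roughly what a copy that ignores hard links would write) with the "unique" size (each inode counted once, which is what the source disk actually holds).

```python
import os
import stat
import tempfile


def disk_usage(root):
    """Return (apparent_bytes, unique_bytes) for regular files under root.

    apparent_bytes counts every path, as a copy that does not preserve
    hard links would. unique_bytes counts each (device, inode) pair once.
    """
    apparent = 0
    unique = 0
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and so on
            apparent += st.st_size
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                unique += st.st_size
    return apparent, unique


def demo():
    """Build a tiny hypothetical layout: one data file plus a hard link
    to it under a snapshots/ subdirectory, then measure it."""
    with tempfile.TemporaryDirectory() as root:
        data_file = os.path.join(root, "demo-Data.db")  # hypothetical name
        with open(data_file, "wb") as fh:
            fh.write(b"\0" * 1024)
        snap_dir = os.path.join(root, "snapshots", "example")
        os.makedirs(snap_dir)
        # A snapshot entry is a hard link, not a second copy of the data.
        os.link(data_file, os.path.join(snap_dir, "demo-Data.db"))
        return disk_usage(root)


if __name__ == "__main__":
    apparent, unique = demo()
    print("apparent bytes:", apparent)
    print("unique bytes:  ", unique)
```

If the apparent figure is well above the unique one on a real data directory, the difference is probably what shows up as growth after a copy. Apart from excluding snapshots/, rsync's -H (--hard-links) option is the other way to keep the links intact, provided both paths land in the same transfer.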