Thanks, Rob. We use spark-cassandra-connector to read data from the table and then repartition it. Tasks on nodes with large data files run too slowly, sometimes taking several hours, which is unacceptable, while tasks on nodes with small files finish quickly. So I think that if sstableloader could split the data into smaller files and balance them across all nodes, our Spark job could run quickly.
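Roughly, our job looks like the following sketch (the keyspace, table name, connection host, and partition count below are placeholders, not our real settings):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Connect to the cluster; the contact host here is a placeholder.
val conf = new SparkConf()
  .setAppName("repartition-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read the table as an RDD of CassandraRow. The connector builds Spark
// partitions from the token ranges each node owns, so nodes holding very
// large data files end up with large, slow partitions.
val rows = sc.cassandraTable("my_keyspace", "my_table")

// Repartition to spread rows evenly across executors before further processing.
val balanced = rows.repartition(200)
println(balanced.count())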
Thanks, qihuang.zheng

-------- Original Message --------
From: Robert Coli <rc...@eventbrite.com>
To: user@cassandra.apache.org
Sent: Friday, November 13, 2015 04:04
Subject: Re: Data.db too large and after sstableloader still large

On Thu, Nov 12, 2015 at 6:44 AM, qihuang.zheng <qihuang.zh...@fraudmetrix.cn> wrote:

> question is: why can't sstableloader balance data file size?

Because it streams ranges from the source SSTable to a distributed set of ranges, especially if you are using vnodes. It is a general property of Cassandra's streaming that it results in SSTables that are likely different in size from those that result from flush.

Why are you preoccupied with the sizes of files in the hundreds of megabytes? Why do you care about this amount of variance in file size?

=Rob