Thanks, Rob. We use spark-cassandra-connector to read data from the table, then do a repartition.
The tasks reading from nodes with large data files run very slowly, sometimes taking
several hours, which is unacceptable, while the tasks on nodes with small files finish quickly.
So I think if sstableloader could split the data into smaller files and balance them across all
nodes, our Spark job could run quickly.
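
For reference, a rough sketch of the pattern our job follows (the keyspace, table, contact
point, and partition count below are placeholders, not our real configuration):

// Minimal sketch: read a Cassandra table with spark-cassandra-connector, then repartition.
// All names here are hypothetical placeholders.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cassandra-repartition-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point

    val sc = new SparkContext(conf)

    // Each Spark partition of the scan maps to a token range on a Cassandra node;
    // if one node holds a disproportionately large data file, its tasks run much longer.
    val rows = sc.cassandraTable("my_keyspace", "my_table")

    // Repartitioning rebalances the data for downstream stages,
    // but the initial read is still skewed by the on-disk data distribution.
    val balanced = rows.repartition(200)

    println(balanced.count())
    sc.stop()
  }
}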




Thanks, qihuang.zheng


Original Message
From: Robert Coli <rc...@eventbrite.com>
To: user@cassandra.apache.org <u...@cassandra.apache.org>
Sent: Friday, November 13, 2015, 04:04
Subject: Re: Data.db too large and after sstableloader still large


On Thu, Nov 12, 2015 at 6:44 AM, qihuang.zheng <qihuang.zh...@fraudmetrix.cn>
wrote:

The question is: why can't sstableloader balance data file sizes?


Because it streams ranges from the source SSTable to a distributed set of
ranges, especially if you are using vnodes.


It is a general property of Cassandra's streaming that it results in SSTables
that are likely different in size than those that result from flush.


Why are you preoccupied with the sizes of files in the hundreds of
megabytes? Why do you care about this amount of variance in file size?


=Rob
