We use Spark2Cassandra (this fork works with C* 3.0: https://github.com/leoromanovsky/Spark2Cassandra ).
SSTables are streamed to Cassandra by Spark2Cassandra, so you need to open port 7000 accordingly. During the benchmark we used 25 EMR nodes, but in production we use fewer nodes to be gentler with Cassandra.
Best,
Romain

    On Tuesday, 6 February 2018 at 16:05:16 UTC+1, Julien Moumne <jmou...@deezer.com> wrote:
This does look like a very viable solution. Thanks.
Could you give us some pointers/documentation on:
- How can we build such SSTables using Spark jobs? Maybe https://github.com/Netflix/sstable-adaptor ?
- How do we send these tables to Cassandra? Does a simple SCP work?
- What is the recommended size for SSTables when they do not fit in a single executor?
On 5 February 2018 at 18:40, Romain Hardouin <romainh...@yahoo.fr.invalid> 
wrote:

  Hi Julien,
We have such a use case on some clusters. If you want to insert big batches at a fast pace, the only viable solution is to generate SSTables on the Spark side and stream them to C*. Last time we benchmarked such a job we achieved 1.3 million partitions inserted per second on a 3-node C* test cluster, which is impossible with regular inserts.
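For illustration, a rough sketch of how SSTables can be generated on the Spark side with Cassandra's own CQLSSTableWriter (the schema, keyspace and table below are invented, not the exact code we run). The resulting files then have to be streamed to the cluster, e.g. with sstableloader or a library that wraps it:

    // Sketch only: write one partition of rows to local SSTable files using
    // Cassandra's CQLSSTableWriter. Schema and names are hypothetical.
    import java.io.File
    import org.apache.cassandra.io.sstable.CQLSSTableWriter

    def writePartitionAsSSTables(rows: Iterator[(String, Long)], outDir: File): Unit = {
      val writer = CQLSSTableWriter.builder()
        .inDirectory(outDir)                                                  // local output dir for SSTable files
        .forTable("CREATE TABLE ks.events (id text PRIMARY KEY, ts bigint)")  // schema of the target table
        .using("INSERT INTO ks.events (id, ts) VALUES (?, ?)")                // insert statement used by addRow
        .build()

      rows.foreach { case (id, ts) =>
        writer.addRow(id, java.lang.Long.valueOf(ts))   // values in the same order as the INSERT
      }
      writer.close()   // flush and finalize the SSTable files in outDir
    }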
Best,
Romain
    On Monday, 5 February 2018 at 03:54:09 UTC+1, kurt greaves <k...@instaclustr.com> wrote:
 
 
> Would you know if there is evidence that inserting skinny rows in sorted order (no batching) helps C*?
This won't have any effect, as each insert will be handled separately by the coordinator (or a different coordinator, even). Sorting is also very unlikely to help even if you did batch.

> Also, in the case of wide rows, is there evidence that sorting clustering keys within partition batches helps ease C*'s job?
No evidence, seems very unlikely.



-- 
Julien MOUMNÉ
Software Engineering - Data Science
Mail: jmoumne@deezer.com
12 rue d'Athènes, 75009 Paris - France
