We use Spark2Cassandra (this fork works with C* 3.0: https://github.com/leoromanovsky/Spark2Cassandra ). SSTables are streamed to Cassandra by Spark2Cassandra, so you need to open port 7000 accordingly. During the benchmark we used 25 EMR nodes, but in production we use fewer nodes to be gentler with Cassandra.

Best,
Romain
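For reference, a minimal sketch of what such a job can look like, assuming the RDD API of the upstream jparkie/Spark2Cassandra project (bulkLoadToCass); the C* 3.0 fork may differ slightly, and the keyspace, table, and contact point below are placeholders:

    // Minimal sketch, assuming the jparkie/Spark2Cassandra RDD API (bulkLoadToCass);
    // the C* 3.0 fork may differ slightly. Keyspace, table and host are placeholders.
    import com.github.jparkie.spark.cassandra.rdd._
    import org.apache.spark.{SparkConf, SparkContext}

    object BulkLoadSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("spark2cassandra-bulk-load")
          // Contact point for the connector; the storage port (7000) must also be
          // reachable from the executors, since that is where SSTables are streamed.
          .set("spark.cassandra.connection.host", "cassandra-node-1")

        val sc = new SparkContext(conf)

        // Hypothetical table: CREATE TABLE ks.events (id uuid PRIMARY KEY, payload text)
        val rows = sc.parallelize(Seq(
          (java.util.UUID.randomUUID(), "payload-1"),
          (java.util.UUID.randomUUID(), "payload-2")
        ))

        // Each RDD partition is written to local SSTables on the executor and then
        // streamed directly to the Cassandra nodes (no regular INSERTs involved).
        rows.bulkLoadToCass(keyspaceName = "ks", tableName = "events")

        sc.stop()
      }
    }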
On Tuesday, February 6, 2018 at 16:05:16 UTC+1, Julien Moumne <jmou...@deezer.com> wrote:

This does look like a very viable solution. Thanks.

Could you give us some pointers/documentation on:
- how we can build such SSTables using Spark jobs, maybe https://github.com/Netflix/sstable-adaptor ?
- how we send these tables to Cassandra? Does a simple SCP work?
- what the recommended SSTable size is when the data does not fit on a single executor?

On 5 February 2018 at 18:40, Romain Hardouin <romainh...@yahoo.fr.invalid> wrote:

Hi Julien,

We have such a use case on some clusters. If you want to insert big batches at a fast pace, the only viable solution is to generate SSTables on the Spark side and stream them to C*. The last time we benchmarked such a job, we achieved 1.3 million partitions inserted per second on a 3-node C* test cluster, which is impossible with regular inserts.

Best,
Romain

On Monday, February 5, 2018 at 03:54:09 UTC+1, kurt greaves <k...@instaclustr.com> wrote:

> Would you know if there is evidence that inserting skinny rows in sorted order (no batching) helps C*?

This won't have any effect, as each insert will be handled separately by the coordinator (or even a different coordinator). Sorting is also very unlikely to help even if you did batch.

> Also, in the case of wide rows, is there evidence that sorting clustering keys within partition batches helps ease C*'s job?

No evidence; it seems very unlikely.

--
Julien MOUMNÉ
Software Engineering - Data Science
Mail: jmoumne@deezer.com
12 rue d'Athènes, 75009 Paris - France
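As a rough illustration of what "generate SSTables on the Spark side" means under the hood, here is a minimal Scala sketch using Cassandra's CQLSSTableWriter (from the cassandra-all artifact). The keyspace, table, schema, and output directory are made up for the example; libraries like Spark2Cassandra run roughly this kind of logic per RDD partition and then stream the resulting files to the cluster instead of copying them with SCP:

    // Minimal sketch, not production code: writes a few rows to local SSTable files
    // with Cassandra's CQLSSTableWriter. Keyspace/table/schema are illustrative only.
    import java.io.File
    import org.apache.cassandra.io.sstable.CQLSSTableWriter

    object SSTableWriterSketch {
      def main(args: Array[String]): Unit = {
        val schema = "CREATE TABLE ks.events (id uuid PRIMARY KEY, payload text)"
        val insert = "INSERT INTO ks.events (id, payload) VALUES (?, ?)"

        // CQLSSTableWriter expects the output directory to exist.
        val outDir = new File("/tmp/sstables/ks/events")
        outDir.mkdirs()

        val writer = CQLSSTableWriter.builder()
          .inDirectory(outDir)
          .forTable(schema)
          .using(insert)
          .build()

        writer.addRow(java.util.UUID.randomUUID(), "payload-1")
        writer.addRow(java.util.UUID.randomUUID(), "payload-2")
        writer.close()

        // The resulting SSTable files are then streamed to the cluster over the
        // storage port (e.g. via sstableloader or a library that drives the
        // streaming itself), not copied with SCP.
      }
    }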