I changed it a little to spark.sql, extracted a partitioning key table as you 
did with the userid, and joined this back to my table to copy and save it to 
Cassandra. In a first test this seemed to utilize every bit of performance the 
cluster can provide. I don't yet know why the first code did not perform as 
expected.
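
For the archives, here is a minimal sketch of that approach. Keyspace, table 
and column names are illustrative only, and it assumes the 
spark-cassandra-connector is on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("table-copy")
      .getOrCreate()

    // Read the badly partitioned source table.
    val source = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "source_table"))
      .load()
    source.createOrReplaceTempView("source_table")

    // Extract the distinct partition keys into their own view ...
    spark.sql("SELECT DISTINCT userid FROM source_table")
      .createOrReplaceTempView("keys")

    // ... and join them back so the copy is driven by the key table.
    val joined = spark.sql(
      "SELECT s.* FROM source_table s JOIN keys k ON s.userid = k.userid")

    // Write into the re-partitioned target table.
    joined.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "target_table"))
      .mode("append")
      .save()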

Sent from my iPhone

> On 01.02.2018 at 22:09, kurt greaves <k...@instaclustr.com> wrote:
> 
> That extra code is not necessary; it's just there to retrieve only a sampling 
> of keys. You don't want it if you're copying the whole table. It sounds like 
> you're taking the right approach, you probably just need some more tuning. It 
> might be on the Cassandra side as well (concurrent_reads/concurrent_writes).
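> 
> (Those settings live in cassandra.yaml on each node. The values below are 
> only illustrative starting points, not recommendations; tune them against 
> your disk and core counts:)
> 
>     # cassandra.yaml (per node) - illustrative values only
>     concurrent_reads: 32     # rule of thumb: 16 x number of data disks
>     concurrent_writes: 64    # rule of thumb: 8 x number of cores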
> 
> 
> On 1 Feb. 2018 19:06, "Jürgen Albersdorfer" <jalbersdor...@gmail.com> wrote:
> Hi Kurt, thanks for your response.
> I did indeed use Spark - which I forgot to mention - and I did it nearly the 
> same way as in the example you gave me, 
> just without that .select(PK).sample(false, 0.1) instruction - I don't 
> actually get what it's useful for, and maybe that's the key to the castle.
> 
> I already found out that I require some more Spark executors - really lots of 
> them.
> It was also a bad idea in the first place to ./spark-submit without any 
> parameters for executor-memory, total-executor-cores and especially 
> executor-cores.
> I have now submitted it with --executor-cores 1 --total-executor-cores 100 
> --executor-memory 8G to get more executors out of my cluster.
> Without those limits, a Spark executor will utilize all of the available 
> cores. With the limits, the Spark worker is able to start more executors in 
> parallel, which gives a boost in my example,
> but it is still way too slow and far from needing to be throttled - which is 
> what I actually expected once 100 processes start hammering the database 
> cluster.
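> 
> For completeness, the full submit line now looks roughly like this (the 
> master URL, class and jar names are placeholders):
> 
>     ./bin/spark-submit \
>       --master spark://<master-host>:7077 \
>       --class com.example.CopyJob \
>       --executor-cores 1 \
>       --total-executor-cores 100 \
>       --executor-memory 8G \
>       copy-job.jar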
> 
> I'll definitely give your code a try.
> 
> 2018-02-01 6:36 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>> How are you copying? With CQLSH COPY or your own script? If you've got Spark 
>> already, it's quite simple to copy between tables, and it should be pretty 
>> much as fast as you can get it (you may even need to throttle). There's 
>> some sample code here (albeit copying between clusters, but it's easily 
>> tailored to copy between tables): 
>> https://www.instaclustr.com/support/documentation/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
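>> 
>> For a same-cluster copy, the core of it is just this (untested sketch, 
>> keyspace/table names are placeholders):
>> 
>>     import com.datastax.spark.connector._
>>     import org.apache.spark.{SparkConf, SparkContext}
>> 
>>     val sc = new SparkContext(new SparkConf().setAppName("table-copy"))
>>     // Read every row of the source table and write it straight into the
>>     // target table; columns are matched by name.
>>     sc.cassandraTable("my_ks", "source_table")
>>       .saveToCassandra("my_ks", "target_table")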
>> 
>>> On 30 January 2018 at 21:05, Jürgen Albersdorfer <jalbersdor...@gmail.com> 
>>> wrote:
>>> Hi, we are using C* 3.11.1 with a 9-node cluster built on CentOS servers, 
>>> each having 2x quad-core Xeon, 128 GB of RAM and two separate 2 TB spinning 
>>> disks, one for the log and one for the data, with Spark on top.
>>> 
>>> Due to a bad schema (partitions of about 4 to 8 GB) I need to copy a whole 
>>> table into another one having the same fields but different partitioning.
>>> 
>>> I expected glowing iron when I started the copy job, but instead I cannot 
>>> even see any impact on CPU, memory or disks - yet the job does copy the 
>>> data over, veeerry slowly, at about a MB or two per minute.
>>> 
>>> Any suggestion where to start investigation?
>>> 
>>> Thanks already
>>> 
>>> Sent from my iPhone
>>> 
>> 
> 
> 
