I changed it a little to spark.sql, extracted such a partitioning-key table as you did with the userid, and joined it to my table to copy; saving this to Cassandra seemed, in a first test, to utilize every bit of performance the cluster can provide. I don't yet know why the first code did not perform as expected.
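A minimal sketch of that approach, for reference - the keyspace and table names (ks, big_table, big_table_v2) and the userid key column are placeholders, not taken from this thread:

    // Sketch: copy a table by extracting its distinct partitioning keys and
    // joining them back, as described above. All names are hypothetical.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("copy-table")
      .getOrCreate()

    // Read the source table via the spark-cassandra-connector DataFrame API.
    val source = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "big_table"))
      .load()
    source.createOrReplaceTempView("source")

    // Build the partitioning-key table and join it back to the source, so the
    // copy is driven by keys rather than one monolithic scan.
    spark.sql("SELECT DISTINCT userid FROM source").createOrReplaceTempView("keys")
    val joined = spark.sql(
      "SELECT s.* FROM source s JOIN keys k ON s.userid = k.userid")

    // Append into the re-partitioned target table (which must already exist).
    joined.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "big_table_v2"))
      .mode("append")
      .save()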
Sent from my iPhone

> On 01.02.2018 at 22:09, kurt greaves <k...@instaclustr.com> wrote:
>
> That extra code is not necessary, it's just there to retrieve a sampling of
> keys. You don't want it if you're copying the whole table. It sounds like
> you're taking the right approach, you probably just need some more tuning.
> It might be on the Cassandra side as well (concurrent_reads/writes).
>
> On 1 Feb. 2018 19:06, "Jürgen Albersdorfer" <jalbersdor...@gmail.com> wrote:
> Hi Kurt, thanks for your response.
> I did indeed use Spark - which I forgot to mention - and I did it nearly the
> same way as in the example you gave me, just without that
> .select(PK).sample(false, 0.1) instruction, whose purpose I don't actually
> understand - and maybe that's the key to the castle.
>
> I already found out that I require more Spark executors - really lots of
> them. It was a bad idea in the first place to ./spark-submit without any
> parameters for executor-memory, total-executor-cores and especially
> executor-cores. I now submit it with --executor-cores 1
> --total-executor-cores 100 --executor-memory 8G to get more executors out of
> my cluster. Without those limits, a single Spark executor will grab all of
> the available cores; with them, the Spark workers can start more executors
> in parallel, which helps in my case, but it is still way too slow and far
> from needing to be throttled - which is what I actually expected once 100
> processes started hammering the database cluster.
>
> I'll definitely give your code a try.
>
> 2018-02-01 6:36 GMT+01:00 kurt greaves <k...@instaclustr.com>:
>> How are you copying? With CQLSH COPY or your own script? If you've got
>> Spark already, it's quite simple to copy between tables, and it should be
>> pretty much as fast as you can get it (you may even need to throttle).
>> There's some sample code here (albeit it copies between clusters, it's
>> easily tailored to copy between tables):
>> https://www.instaclustr.com/support/documentation/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
>>
>>> On 30 January 2018 at 21:05, Jürgen Albersdorfer <jalbersdor...@gmail.com> wrote:
>>> Hi, we are using C* 3.11.1 with a 9-node cluster built on CentOS servers,
>>> each having 2x quad-core Xeon, 128 GB of RAM and two separate 2 TB
>>> spinning disks - one for the log, one for the data - with Spark on top.
>>>
>>> Due to a bad schema (partitions of about 4 to 8 GB) I need to copy a
>>> whole table into another one with the same fields but different
>>> partitioning.
>>>
>>> I expected glowing iron when I started the copy job, but instead I cannot
>>> even see any impact on CPU, memory or disks - yet the job does copy the
>>> data over, very slowly, at about a MB or two per minute.
>>>
>>> Any suggestion where to start investigating?
>>>
>>> Thanks already
>>>
>>> Sent from my iPhone
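For reference, the table-to-table copy from the linked example boils down to roughly the following RDD-based sketch; the keyspace/table names and the contact point are placeholders, not from this thread:

    // Sketch of a direct copy with the spark-cassandra-connector RDD API.
    // Names and the contact point are hypothetical.
    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("copy-table")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point

    val sc = new SparkContext(conf)

    sc.cassandraTable("ks", "big_table")
      // .sample(false, 0.1) // only to copy a ~10% sample; omit for a full copy
      .saveToCassandra("ks", "big_table_v2")

The commented-out .sample(false, 0.1) line is the sampling step discussed above; leaving it out copies every row. Submitting with the flags mentioned in the thread (--executor-cores 1 --total-executor-cores 100 --executor-memory 8G) spreads the copy across many small executors.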