Thanks a lot for the answers, folks.
It turned out that Spark was just IOPS-starved. Using better disks solved
my issue, so it was nothing related to Kubernetes at all.
Have a nice weekend, everyone.
On Fri, Sep 30, 2022 at 4:27 PM Artemis User wrote:
The reduce phase is always more resource-intensive than the map phase.
A couple of suggestions you may want to consider:
1. Setting the number of partitions to 18K may be way too high (the
default is only 200). You may want to just use the default
and let the scheduler automatically …
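As a rough illustration of why 18K partitions may be excessive (the shuffle size below is an assumed example, not a figure from this thread), a common rule of thumb is to size shuffle partitions at roughly 100-200 MB each:

```python
# Rough sizing sketch: pick a shuffle partition count so that each
# partition handles a reasonable chunk (~128 MiB is a common target).
# The total shuffle size used here is an assumption for illustration.

def suggested_partitions(shuffle_bytes, target_bytes=128 * 1024**2):
    """Return a partition count giving roughly target_bytes per partition."""
    return max(1, shuffle_bytes // target_bytes)

total_shuffle = 500 * 1024**3  # assume 500 GiB of shuffle data
print(suggested_partitions(total_shuffle))  # 4000 partitions, far below 18K
```

With an assumed 500 GiB shuffle this lands around 4,000 partitions, well under 18K, though the right number ultimately depends on the shuffle size actually observed per stage.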
Hi Sungwoo,
I tend to agree - for a new system, I would probably not go that route, as
Spark on Kubernetes is getting there and can do a lot already. The issue I
mentioned before can be fixed with proper node fencing - it is a typical
StatefulSet problem Kubernetes has without fencing - a node goes down …
Hi Leszek,
For running YARN on Kubernetes and then running Spark on YARN, is there a
lot of overhead for maintaining YARN on Kubernetes? I thought people
usually want to move from YARN to Kubernetes because of the overhead of
maintaining Hadoop.
Thanks,
--- Sungwoo
On Fri, Sep 30, 2022 at 1:37
Hi Leszek,
Spot on - the fact that EMR clusters are created, dynamically scaled up and
down, and are ephemeral proves that there is actually no advantage to using
containers for large jobs.
It is utterly pointless, and I have attended interviews and workshops where
no one has ever been able to prove its …
Hi Everyone,
To add my 2 cents here:
The advantage of containers, to me, is that they leave the host system
pristine and clean, allowing standardized DevOps deployment of hardware for
any purpose. Way back before, when using bare metal / Ansible, reusing
hardware always involved a full reformat of the base system …
Hi,
don't containers ultimately run on systems, and isn't the only advantage of
containers that you can get better utilisation of system resources through
micro-management of the jobs running in them? Some say that containers have
their own binaries which isolate the environment, but that is a lie,
because in a Kubernetes …
> What's the total number of partitions that you have?
18k
> What machines are you using? Are you using an SSD?
Using a family of r5.4xlarge nodes. Yes, I'm using five GP3 disks, which
gives me about 625 MB/s of sustained throughput (which is what I see when
writing the shuffle data).
> can you …
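As a sanity check on the disk numbers above (the 125 MB/s per-volume figure is the documented baseline throughput for AWS gp3 volumes; the volume count is from the thread), the aggregate matches the observed rate:

```python
# Aggregate EBS throughput check. Volume count comes from the thread;
# 125 MB/s per volume is the documented gp3 baseline throughput.
GP3_BASELINE_MB_S = 125
num_volumes = 5

aggregate = GP3_BASELINE_MB_S * num_volumes
print(aggregate, "MB/s")  # 625 MB/s, matching the reported sustained rate
```

This suggests the disks were running flat out at their baseline, consistent with the IOPS/throughput starvation diagnosis at the top of the thread.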
Igor,
what exact instance types do you use? Unless you use local instance storage
and have actually configured your Kubernetes and Spark to use instance
storage, your 30x30 exchange can run into EBS IOPS limits. You can
investigate that by going to an instance, then to the volume, and checking
the monitoring charts …
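For reference, pointing Spark's shuffle and spill directories at local instance storage on Kubernetes is typically done by mounting a volume whose name starts with `spark-local-dir-`. A sketch of the relevant configuration (the hostPath and mount paths are illustrative, not from this thread - adjust them to wherever the NVMe instance store is mounted):

```properties
# Illustrative spark-defaults-style fragment; paths are assumptions.
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/data/spark-local
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/mnt/nvme0
```

Volumes named `spark-local-dir-*` are used by Spark on Kubernetes as executor scratch space, so shuffle writes land on the instance store rather than on an EBS volume subject to IOPS limits.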
That's total nonsense - EMR is total crap; use Kubernetes, I will help
you.
Can you please provide the size of the shuffle file that is getting
generated in each task.
What's the total number of partitions that you have?
What machines are you using? Are you using an SSD?
Best,
Tufan
On
Hi,
why not use EMR or Dataproc? Kubernetes does not provide any benefit at
all for work of this scale. It is a classic case of over-engineering and
over-complication just for the heck of it.
Also, I think that if you are in AWS, Redshift Spectrum or Athena for
90% of use cases are way opt…