Thanks a lot for the answers, folks.
It turned out that Spark was just IOPS-starved. Using better disks solved
my issue, so it was nothing related to Kubernetes at all.
Have a nice weekend, everyone.
On Fri, Sep 30, 2022 at 4:27 PM Artemis User wrote:
The reduce phase is always more resource-intensive than the map phase.
A couple of suggestions you may want to consider:
1. Setting the number of partitions to 18K may be way too high (the
default is only 200). You may want to just use the default
and let the scheduler automatically …
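As a rough illustration of why 18K partitions may be excessive (the shuffle size below is an assumed example, not a figure from this thread), a common rule of thumb is to size shuffle partitions at roughly 100-200 MB each:

```python
# Rough sizing sketch: pick a shuffle partition count so that each
# partition handles a reasonable chunk (~128 MiB is a common target).
# The total shuffle size used here is an assumption for illustration.

def suggested_partitions(shuffle_bytes, target_bytes=128 * 1024**2):
    """Return a partition count giving roughly target_bytes per partition."""
    return max(1, shuffle_bytes // target_bytes)

total_shuffle = 500 * 1024**3  # assume 500 GiB of shuffle data
print(suggested_partitions(total_shuffle))  # 4000 partitions, far below 18K
```

With an assumed 500 GiB shuffle this lands around 4,000 partitions, well under 18K, though the right number ultimately depends on the shuffle size actually observed per stage.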
Hi Sungwoo,
I tend to agree - for a new system, I would probably not go that route, as
Spark on Kubernetes is getting there and can do a lot already. The issue I
mentioned before can be fixed with proper node fencing - it is a typical
StatefulSet problem Kubernetes has without fencing - a node goes down …
Hi Leszek,
For running YARN on Kubernetes and then running Spark on YARN, is there a
lot of overhead for maintaining YARN on Kubernetes? I thought people
usually want to move from YARN to Kubernetes because of the overhead of
maintaining Hadoop.
Thanks,
--- Sungwoo
On Fri, Sep 30, 2022 at 1:37
Hi Leszek,
Spot on - the fact that EMR clusters are created, dynamically scaled up and
down, and are ephemeral proves that there is actually no advantage to using
containers for large jobs.
It is utterly pointless, and I have attended interviews and workshops where
no one has ever been able to prove its …
Hi Everyone,
To add my 2 cents here:
The advantage of containers, to me, is that they leave the host system
pristine and clean, allowing standardized DevOps deployment of hardware for
any purpose. Way back before, when using bare metal / Ansible, reusing
hardware always involved a full reformat of the base system …
Hi,
don't containers ultimately run on systems, and isn't the only advantage of
containers that you can get better utilisation of system resources through
micro-management of the jobs running in them? Some say that containers have
their own binaries which isolate the environment, but that is a lie,
because in a Kubernetes …
> What's the total number of partitions that you have?
18k
> What machines are you using? Are you using an SSD?
Using a family of r5.4xlarge nodes. Yes, I'm using five GP3 disks, which
gives me about 625 MB/s of sustained throughput (which is what I see when
writing the shuffle data).
> can you …
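As a sanity check on the disk numbers above (the 125 MB/s per-volume figure is the documented baseline throughput for AWS gp3 volumes; the volume count is from the thread), the aggregate matches the observed rate:

```python
# Aggregate EBS throughput check. Volume count comes from the thread;
# 125 MB/s per volume is the documented gp3 baseline throughput.
GP3_BASELINE_MB_S = 125
num_volumes = 5

aggregate = GP3_BASELINE_MB_S * num_volumes
print(aggregate, "MB/s")  # 625 MB/s, matching the reported sustained rate
```

This suggests the disks were running flat out at their baseline, consistent with the IOPS/throughput starvation diagnosis at the top of the thread.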
Igor,
what exact instance types do you use? Unless you use local instance storage
and have actually configured your Kubernetes and Spark to use instance
storage, your 30x30 exchange can run into EBS IOPS limits. You can
investigate that by going to an instance, then to the volume, and checking
the monitoring charts …
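For reference, pointing Spark's shuffle and spill directories at local instance storage on Kubernetes is typically done by mounting a volume whose name starts with `spark-local-dir-`. A sketch of the relevant configuration (the hostPath and mount paths are illustrative, not from this thread - adjust them to wherever the NVMe instance store is mounted):

```properties
# Illustrative spark-defaults-style fragment; paths are assumptions.
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/data/spark-local
spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/mnt/nvme0
```

Volumes named `spark-local-dir-*` are used by Spark on Kubernetes as executor scratch space, so shuffle writes land on the instance store rather than on an EBS volume subject to IOPS limits.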
That's total nonsense - EMR is total crap; use Kubernetes, I will help
you.
Can you please provide the size of the shuffle file that is getting
generated in each task.
What's the total number of partitions that you have?
What machines are you using? Are you using an SSD?
Best,
Tufan
On
Hi,
why not use EMR or Dataproc? Kubernetes does not provide any benefit at
all for work of this scale. It is a classic case of over-engineering and
over-complication just for the heck of it.
Also, I think that if you are in AWS, Redshift Spectrum or Athena for
90% of use cases are way opt…