Pool.map requires two arguments: the first is a function and the second an iterable, e.g. a list or set.
Check out the examples in the official docs for how to use it:
https://docs.python.org/3/library/multiprocessing.html
On Thu, 21 Jul 2022, 21:25 Bjørn Jørgensen,
wrote:
Thank you.
The reason for using Spark local is to test the code, and, as in this case, to
find the bottlenecks and fix them before I spin up a K8S cluster.
I did test it now with
16 cores and 10 files
import time
tic = time.perf_counter()
json_to_norm_with_null("/home/jovyan/notebooks/falk/test",
One quick observation is that you allocate all your local CPUs to Spark and
then execute that app with 10 threads, i.e. 10 Spark apps, so you would need
160 cores in total since each needs 16 CPUs, IMHO. Wouldn't that create a
CPU bottleneck?
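The core-budget arithmetic behind that observation can be sketched as follows (the variable names are illustrative, not from the original code):

```python
import os

total_cores = os.cpu_count() or 16  # e.g. 16 on the machine described
cores_per_spark_app = 16            # each Spark local app grabs all 16 CPUs
concurrent_apps = 10                # 10 threads, each driving a Spark app

demanded = cores_per_spark_app * concurrent_apps
print(demanded)  # 160 cores demanded, far more than 16 available

# One way to avoid oversubscription: size the pool so the product
# of workers and per-app cores stays within the machine's budget.
max_workers = max(1, total_cores // cores_per_spark_app)
```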
Also, as a side note, why do you need Spark if you use it in local mode?
So now I have tried to run this function in a ThreadPool. But it doesn't
seem to work.
---------- Forwarded message ---------
From: Sean Owen
Date: Wed, 20 Jul 2022 at 22:43
Subject: Re: Pyspark and multiprocessing
To: Bjørn Jørgensen
I don't think you eve
I have 400k JSON files, each between 10 kB and 500 kB in size.
They don't have the same schema, so I have to loop over them one at a time.
This works, but it's very slow: the process takes 5 days!
So now I have tried to run this function in a ThreadPool. But it doesn't
seem to work.
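One common way to parallelise a per-file loop like this is a process pool rather than a thread pool, since JSON parsing is CPU-bound and threads share the GIL. A hedged sketch, where `process_file` stands in for whatever per-file normalisation the original code does (it is hypothetical, not the author's function):

```python
import json
import os
from multiprocessing import Pool

def process_file(path):
    # Hypothetical per-file work: parse one JSON file and return a
    # simple summary; real normalisation logic would go here.
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return path, len(data)

def process_all(root, workers=None):
    paths = [os.path.join(root, name)
             for name in os.listdir(root) if name.endswith(".json")]
    with Pool(processes=workers) as pool:
        # imap_unordered yields results as each worker finishes, which
        # suits files of very different sizes (10 kB to 500 kB).
        return dict(pool.imap_unordered(process_file, paths))
```

Because each file is independent, this scales with the number of worker processes without any shared state between them.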
*St