Hello Spark experts - I’m running Spark jobs in cluster mode using a
dedicated cluster for each job. Is there a way to see how much compute time
each job takes via Spark APIs, metrics, etc.? In case it makes a
difference, I’m using AWS EMR - I’d ultimately like to be able to say that this
job costs $X…
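One way to approach this (a sketch, not a definitive recipe): the Spark monitoring REST API served by the History Server reports per-executor task time, which you can sum per application and multiply by your per-core-hour price. The History Server URL, port, and the price figure below are assumptions to adjust for your EMR setup.

```python
import requests

# Assumption: the Spark History Server is reachable here (on EMR it typically
# listens on port 18080 of the primary node -- adjust for your environment).
HISTORY_SERVER = "http://localhost:18080/api/v1"
PRICE_PER_CORE_HOUR = 0.05  # hypothetical $/core-hour; plug in your own rate


def task_seconds(app_id: str) -> float:
    """Sum of task time (ms -> s) across all executors of one application."""
    executors = requests.get(
        f"{HISTORY_SERVER}/applications/{app_id}/allexecutors", timeout=30
    ).json()
    return sum(e.get("totalDuration", 0) for e in executors if e["id"] != "driver") / 1000.0


if __name__ == "__main__":
    for app in requests.get(f"{HISTORY_SERVER}/applications", timeout=30).json():
        secs = task_seconds(app["id"])
        # Task time is accumulated per core, so seconds / 3600 approximates core-hours.
        cost = secs / 3600.0 * PRICE_PER_CORE_HOUR
        print(f'{app["id"]}  {app["name"]}: {secs:.0f} task-seconds, ~${cost:.2f}')
```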
Hi all,
Running on Dataproc 2.0/1.3/1.4, we use the INSERT OVERWRITE command to insert
new (time) partitions into existing Hive tables, but we see frequent failures
coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles. This is where
the driver moves the successfully written files from the staging directory…
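For context, a minimal PySpark sketch of the partition-overwrite pattern described above, with assumed table, column, and path names; the two config keys are the standard dynamic-partition-overwrite settings, and whether they change the replaceFiles behaviour depends on your setup.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-overwrite-sketch")
    # Replace only the partitions present in the incoming data, not the whole table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical input location and schema: id, payload, dt (the time partition column).
incoming = spark.read.parquet("gs://my-bucket/incoming/")
incoming.createOrReplaceTempView("incoming")

spark.sql("""
    INSERT OVERWRITE TABLE events PARTITION (dt)
    SELECT id, payload, dt FROM incoming
""")
```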
--
Sergei Boitsov
JetBrains GmbH
Hey Mich,
Thanks for the detailed response. I get most of these options.
However, what we are trying to do is avoid having to upload the source
configs and pyspark.zip files to the cluster every time we execute the job
using spark-submit. Here is the code that does it:
https://github.com/apache/s
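For reference, one common way to avoid re-uploading the same artifacts on every submit is to stage them once on HDFS (or S3/GCS) and point spark-submit at those locations: spark.yarn.archive covers the Spark jars themselves, and --py-files / --files accept remote URIs. All paths below are hypothetical placeholders, and this is only a sketch of the idea, not the approach in the linked code.

```python
import subprocess

# Sketch of a submit wrapper that reuses artifacts already staged on HDFS,
# so nothing is re-uploaded from the client on each run.
cmd = [
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    # Archive of Spark jars, built once and pushed to HDFS, reused via YARN's cache.
    "--conf", "spark.yarn.archive=hdfs:///apps/spark/spark-libs.zip",
    # Python dependencies and configuration files, also pre-staged.
    "--py-files", "hdfs:///apps/myjob/deps.zip",
    "--files", "hdfs:///apps/myjob/conf/app.conf",
    "job_main.py",
]
subprocess.run(cmd, check=True)
```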
Hey Enrico, that does help me understand it, thanks for explaining.
Regarding this comment
> PySpark and Scala should behave identically here
Is it OK that Scala and PySpark optimization work differently in this case?
Tue, 5 Dec 2023 at 20:08, Enrico Minack:
> Hi Michail,
>
> with spark.conf
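Not an answer to the question itself, but a quick way to check whether the two APIs end up with different plans for a given query is to compare the optimized plans side by side. The snippet below uses an arbitrary example query (substitute the DataFrame under discussion); the same explain call in spark-shell should print an identical plan if the optimization really is the same.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan-check").getOrCreate()

# Arbitrary example query, purely for illustration.
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").count()

# Prints the optimized/physical plan; running the equivalent code in
# spark-shell (Scala) and diffing the output shows whether the two APIs
# are optimized identically for this query.
agg.explain("formatted")
```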
Hi Eugene,
With regard to your points:
What are the PYTHONPATH and SPARK_HOME env variables in your script?
OK, let us look at a typical layout of my Spark project structure:
- project_root
|-- README.md
|-- __init__.py
|-- conf
| |-- (configuration files for Spark)
|-- deployment
| |-- d
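To make the env-variable question concrete, here is a small sketch of how SPARK_HOME and PYTHONPATH are typically wired for a layout like the one above; the install location and the py4j version in the zip name are assumptions that vary by environment.

```python
import os
import sys
from pathlib import Path

# Hypothetical values mirroring the layout above; adjust to your environment.
project_root = Path(__file__).resolve().parent
spark_home = os.environ.setdefault("SPARK_HOME", "/usr/lib/spark")

# PYTHONPATH only affects processes launched after it is set (e.g. workers
# started via spark-submit); for the current driver process, extend sys.path too.
extra_paths = [
    str(project_root),                         # project packages (conf, deployment, ...)
    os.path.join(spark_home, "python"),        # the pyspark package
    os.path.join(spark_home, "python", "lib", "py4j-0.10.9.7-src.zip"),  # py4j, version varies
]
os.environ["PYTHONPATH"] = os.pathsep.join(
    extra_paths + [os.environ.get("PYTHONPATH", "")]
).rstrip(os.pathsep)
sys.path[:0] = extra_paths

print("SPARK_HOME =", spark_home)
print("PYTHONPATH =", os.environ["PYTHONPATH"])
```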