Just checked from where the script is submitted, i.e. wrt the driver: the Python
environments are different. The Jupyter one is running within a virtual environment
which is Python 2.7.5, and the spark-submit one uses 2.6.6. But the
executors have the same Python version, right? I tried doing a spark-submit
from j
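A minimal sketch of one way to check the executor side (assuming an active
SparkSession named spark; PYSPARK_PYTHON / spark.pyspark.python determine which
interpreter the executors launch, so it can differ from the driver's):

import sys

print("driver python:", sys.version)

def report_python(_):
    import sys
    return [sys.version]

# one task per partition; collect the interpreter version each executor reports
executor_versions = set(
    spark.sparkContext.parallelize(range(4), 4)
         .mapPartitions(report_python)
         .collect()
)
print("executor python:", executor_versions)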
Maybe you can also try running it in a plain Python shell or
jupyter-console/ipython instead of spark-submit and check how much time it takes.
Compare the env variables to check that no additional env configuration is
present in either environment.
Also, is the Python environment for both the exact sam
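One way to do that comparison (a minimal sketch, assuming both drivers can run a
few lines of Python): dump the environment from inside each and diff the
resulting files.

import os

# Run this once in the Jupyter kernel and once in the spark-submit driver,
# then diff the two dumps.
with open("env_dump.txt", "w") as f:
    for key in sorted(os.environ):
        f.write("%s=%s\n" % (key, os.environ[key]))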
Ok. Can't think of why that would happen.
On Tue, 10 Sept 2019 at 20:26, Dhrubajyoti Hati <
dhruba.w...@gmail.com> wrote:
> As mentioned in the very first mail:
> * it is submitted to the same cluster.
> * both are submitted from the same machine and by the same user
> * each of them has 128 execu
As mentioned in the very first mail:
* it is submitted to the same cluster.
* both are submitted from the same machine and by the same user.
* each of them has 128 executors and 2 cores per executor with 8 gigs of
memory each, and both of them are getting that while running.
To clarify more, let me quote what
Sounds like you have done your homework to properly compare. I'm
guessing the answer to the following is yes, but in any case: are they
both running against the same Spark cluster with the same configuration
parameters, especially executor memory and number of workers?
On Tue, 10 Sept 2019
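A minimal sketch for double-checking those parameters from inside each running
application (assuming an active SparkSession named spark; the property names
are the standard Spark ones, and unset keys simply fall back to defaults):

conf = spark.sparkContext.getConf()
for key in ("spark.executor.instances",
            "spark.executor.memory",
            "spark.executor.cores",
            "spark.dynamicAllocation.enabled"):
    print(key, "=", conf.get(key, "<not set>"))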
No, I checked for that, hence I wrote "brand new" Jupyter notebook. Also,
the times taken by the two are ~30 mins and ~3 hrs, as I am reading 500 gigs of
compressed, base64-encoded text data from a Hive table and decompressing and
decoding it in one of the UDFs. Also, the time compared is from the Spark UI, not
how
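For reference, a decompress-and-decode UDF of that shape might look roughly
like the sketch below (the gzip codec here is an assumption; the thread does
not say which compression is actually used):

import base64
import zlib
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def decode_and_decompress(value):
    if value is None:
        return None
    raw = base64.b64decode(value)
    # 32 + zlib.MAX_WBITS lets zlib accept a gzip header
    return zlib.decompress(raw, 32 + zlib.MAX_WBITS).decode("utf-8")

decode_udf = udf(decode_and_decompress, StringType())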
Hello,
Is there a way to access all of the custom listeners that have been registered
on a Spark session? I want to remove the listeners that I am no longer using,
but I don't know what they were saved as; I just see testing output messages
on my streaming query. I created a Stack Overflow
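For what it's worth, removeListener needs the original listener reference, so
the usual pattern is to keep that reference when registering (a minimal sketch;
the Python StreamingQueryListener API shown here assumes PySpark 3.4+, earlier
versions only expose the equivalent on the Scala/Java side):

from pyspark.sql.streaming import StreamingQueryListener

class LoggingListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print("query started:", event.id)
    def onQueryProgress(self, event):
        print("progress, batch:", event.progress.batchId)
    def onQueryTerminated(self, event):
        print("query terminated:", event.id)

# keep the reference so the listener can be removed later
listener = LoggingListener()
spark.streams.addListener(listener)

# ... later, once it is no longer needed:
spark.streams.removeListener(listener)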
Hi Artem,
I don't believe this is currently possible, but it could be a great
addition to PySpark since this would offer a convenient and efficient way
to parallelize nested column data. I created the JIRA
https://issues.apache.org/jira/browse/SPARK-29040 for this.
On Tue, Aug 27, 2019 at 7:55 PM
It's not obvious from what you pasted, but perhaps the Jupyter notebook is
already connected to a running Spark context, while spark-submit needs
to get a new spot in the (YARN?) queue.
I would check the cluster job IDs for both to ensure you're getting new
cluster tasks for each.
On Tue, Sep 10,
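A minimal way to compare the two (assuming an active SparkSession named spark
in both cases): print the application ID from each run and confirm that every
submission really gets its own.

# A fresh submission reports its own application ID; a reused, already-running
# context keeps reporting the same one.
print("application id:", spark.sparkContext.applicationId)
print("spark master:  ", spark.sparkContext.master)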
Hi,
I am facing a weird behaviour while running a Python script. Here is roughly
what the code looks like:

from pyspark.sql.functions import udf

def fn1(ip):
    # some code...
    ...

def fn2(row):
    # ...
    # some operations
    # ...
    return row1

udf_fn1 = udf(fn1)
cdf = spark.read.table("")  # hive table is of size > 500 Gigs
I'm using barrier execution in my Spark job but am occasionally seeing
deadlocks where the task scheduler is unable to place all the tasks. The
failure is logged, but the job hangs indefinitely. I have 2 executors with 16
cores each, using standalone mode (I think? I'm using Databricks). The
dataset
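For context, a barrier stage only starts when every one of its tasks can be
scheduled at the same time, so a barrier stage with more tasks than free slots
(here 2 x 16 = 32) can never launch. A minimal sketch of the pattern (assuming
PySpark 2.4+ and an active SparkSession named spark):

from pyspark import BarrierTaskContext

def process_partition(iterator):
    ctx = BarrierTaskContext.get()
    # blocks until every task in the barrier stage reaches this point
    ctx.barrier()
    return [(ctx.partitionId(), sum(iterator))]

# keep the partition count at or below the total number of cores,
# otherwise the barrier stage cannot be scheduled
rdd = spark.sparkContext.parallelize(range(1000), 32)
result = rdd.barrier().mapPartitions(process_partition).collect()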
Hi,
I am using an org.apache.spark.sql.Encoder to serialize a custom object.
I now want to pass this column to a UDF so it can do some operations on it,
but this gives me the error:
Caused by: java.lang.ClassCastException: [B cannot be cast to
The code included at the problem demonstrates the is