Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Dhrubajyoti Hati
FYI we are using Spark 2.2.0. Should the change be present in this Spark version? I wanted to check before opening a JIRA ticket. *Regards, Dhrubajyoti Hati.* On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote: > This looks like a bug that path filter doesn't work for hive table

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Dhrubajyoti Hati
Just wondering if anyone could help me out on this. Thank you! *Regards, Dhrubajyoti Hati.* On Wed, Apr 22, 2020 at 7:15 PM Dhrubajyoti Hati wrote: > Hi, > > Is there any way to discard files starting with a dot (.) or ending with .tmp > in the hive partition while reading fro
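The rule asked about in this thread (skip files whose name starts with a dot or ends with .tmp) can be sketched as a plain Python predicate. This is only an illustration of the rule itself, not Spark's or Hive's own path filter; the function names and the example paths are hypothetical.

```python
import posixpath


def is_hidden_or_tmp(path: str) -> bool:
    """Return True if the file name starts with '.' or ends with '.tmp'."""
    # Only the last path component is checked, so a dot in a parent
    # directory name does not mark the file as hidden.
    name = posixpath.basename(path.rstrip("/"))
    return name.startswith(".") or name.endswith(".tmp")


def keep_data_files(paths):
    """Drop hidden and temporary files from a listing of partition files."""
    return [p for p in paths if not is_hidden_or_tmp(p)]


# Hypothetical partition listing, for illustration only.
listing = [
    "/warehouse/tbl/dt=2020-04-22/part-00000.parquet",
    "/warehouse/tbl/dt=2020-04-22/.part-00001.parquet.crc",
    "/warehouse/tbl/dt=2020-04-22/part-00002.parquet.tmp",
]
print(keep_data_files(listing))
```

A filtered listing like this could then be passed to a reader that accepts explicit file paths, assuming the table layout allows reading files directly.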

Re: Collections passed from driver to executors

2019-09-23 Thread Dhrubajyoti Hati
I was wondering if anyone could help with this question. On Fri, 20 Sep, 2019, 11:52 AM Dhrubajyoti Hati, wrote: > Hi, > > I have a question regarding passing a dictionary from driver to executors > in Spark on YARN. This dictionary is needed in a UDF. I am using PySpark. >

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
Also the performance remains identical when running the same script from a Jupyter terminal instead of a normal terminal. In the script the spark context is created by spark = SparkSession \ .builder \ .. .. getOrCreate() command On Wed, Sep 11, 2019 at 10:28 PM Dhrubajyoti Hati wrote: >

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
e you creating the Spark Session in Jupyter? > > > On Wed, Sep 11, 2019 at 7:33 PM Dhrubajyoti Hati > wrote: > >> But would it be the case for multiple tasks running on the same worker >> and also both the tasks are running in client mode, so the one true is true

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
eight > minutes. > > On Wed, Sep 11, 2019 at 3:17 AM Dhrubajyoti Hati > wrote: > >> Hi, >> >> I just ran the same script in a shell in jupyter notebook and find the >> performance to be similar. So I can confirm this is because the libraries >> used jupyter

Re: script running in jupyter 6-7x faster than spark submit

2019-09-11 Thread Dhrubajyoti Hati
. *Regards, Dhrubajyoti Hati. Mob No: 9886428028/9652029028* On Wed, Sep 11, 2019 at 9:45 AM Dhrubajyoti Hati wrote: > Just checked from where the script is submitted i.e. wrt Driver, the > python envs are different. The Jupyter one is running within the virtual > environment which is Python

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
but in any case: are they >> both running against the same Spark cluster with the same configuration >> parameters, especially executor memory and number of workers? >> >> On Tue, 10 Sept. 2019 at 20:05, Dhrubajyoti Hati < >> dhruba.w...@gmail.com> wrote

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
> > On Tue, 10 Sept. 2019 at 20:05, Dhrubajyoti Hati < > dhruba.w...@gmail.com> wrote: > >> No, I checked for that, hence I wrote "brand new" Jupyter notebook. Also >> the times taken by the two are 30 mins and ~3hrs, as I am reading a 500 gigs >> co

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Dhrubajyoti Hati
sks for each. > > On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati > wrote: > >> Hi, >> >> I am facing a weird behaviour while running a python script. Here is what >> the code looks like mostly: >> >> def fn1(ip): >>some code.

Re: Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-29 Thread Dhrubajyoti Hati
sually > also requires more memory for the executor, but fewer executors. Similarly > the executor instances might be too many and they may not have enough heap. > You can also increase the memory of the executor. > > On 29.07.2019 at 08:22, Dhrubajyoti Hati wrote: > > Hi, >

Logistic Regression Iterations causing High GC in Spark 2.3

2019-07-28 Thread Dhrubajyoti Hati
Hi, We were running Logistic Regression in Spark 2.2.X and then we tried to see how it does in Spark 2.3.X. Now we are facing an issue while running a Logistic Regression Model in Spark 2.3.X on top of YARN (GCP Dataproc). In the TreeAggregate method it takes a huge time due to very High GC Acti