Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Sanjeev Mishra
Can you reduce maxFilesPerTrigger further and see if the OOM still persists? If it does, then the problem may be somewhere else. > On Jul 19, 2020, at 5:37 AM, Jungtaek Lim wrote: > Please provide logs and a dump file for the OOM case - otherwise no one can say what the cause is. > Add
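For reference, a minimal sketch of lowering maxFilesPerTrigger on a Structured Streaming file source reading from S3. The bucket paths, schema, and the value of 10 are placeholders I have assumed for illustration, not details from the thread.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("s3-stream-sketch").getOrCreate()

    # Hypothetical schema and S3 paths, for illustration only.
    schema = StructType([
        StructField("id", StringType()),
        StructField("payload", StringType()),
    ])

    stream = (spark.readStream
              .schema(schema)
              .option("maxFilesPerTrigger", 10)  # lower this to bound how many files each micro-batch reads
              .json("s3a://example-bucket/input/"))

    query = (stream.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/output/")
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/")
             .start())

    query.awaitTermination()

If memory pressure drops as maxFilesPerTrigger shrinks, the per-batch input volume is the likely culprit; if not, as the reply suggests, the leak is probably elsewhere.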

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
data can be loaded? > It should be simple: just open the notebook and see why the exact code you have given does not work and shows only 11 records. > Regards, > Gourav Sengupta > On Tue, Jun 30, 2020 at 4:15 PM Sanjeev Mishra wrote: >> Hi Gourav

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
Sengupta > On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra wrote: > There are a total of 11 files in the tar. You will have to untar it to get to the actual files (.json.gz). > No, I am getting > Count: 33447

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
> Hi Sanjeev, > that just gives 11 records from the sample that you have loaded to the JIRA ticket, is that correct? > Regards, > Gourav Sengupta > On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra wrote: > There is no

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Sanjeev Mishra
the Databricks engineers will find an answer or bug fix soon. > -- ND > On 6/29/20 12:27 PM, Sanjeev Mishra wrote: >> The tar file that I have attached has a bunch of json.zip files, and these are the files being processed. Each line is a self-contained JSON record, as shown
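As a hedged illustration of the layout described here (one self-contained JSON object per line, compressed), Spark reads such files directly and decompresses them on the fly. The path and glob below are placeholders, not the actual sample attached to the thread.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-gz-sketch").getOrCreate()

    # spark.read.json expects newline-delimited JSON by default and
    # handles gzip-compressed files (.json.gz) transparently.
    df = spark.read.json("/tmp/sample/*.json.gz")

    print(df.count())   # the thread reports a count of 33447 for the attached sample
    df.printSchema()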

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
the JSON files there (or samples, or code which generates JSON files)? > Maxim Gekk > Software Engineer > Databricks, Inc. > On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra wrote: >> It has read everything. As you notice, the timing

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
Are you sure your Spark 2.4 cluster had indeed read anything? Looks like the Input size field is empty under 2.4. > -- ND > On 6/27/20 7:58 PM, Sanjeev Mishra wrote: > I have a large amount of JSON files that Spark 2.4 can read in 36 seconds but Spark 3.0 takes almost 33 minutes

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Sanjeev Mishra
Regards, > Gourav > On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra wrote: >> I have a large amount of JSON files that Spark 2.4 can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it looks like Spark 3.0 is choosing

Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-27 Thread Sanjeev Mishra
I have a large amount of JSON files that Spark 2.4 can read in 36 seconds, but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it looks like Spark 3.0 is choosing a different DAG than Spark 2.4. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0?
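One commonly suggested check in this situation (an assumption on my part, not a confirmed diagnosis from the thread) is whether schema inference is what differs between the two versions, since inference requires an extra pass over the input. Supplying an explicit schema removes that variable; a minimal sketch, with a hypothetical schema and placeholder path:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    spark = SparkSession.builder.appName("json-schema-sketch").getOrCreate()

    # Hypothetical schema: the real field names and types would come from the actual data.
    schema = StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
        StructField("ts", StringType()),
    ])

    # With an explicit schema, Spark skips the inference pass over the files.
    df = spark.read.schema(schema).json("/path/to/json/*.json.gz")

    df.explain()   # compare the physical plans produced by Spark 2.4 and 3.0

If the read is fast with an explicit schema on both versions, the slowdown is in inference rather than in the scan itself.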

Spark 3.0.0 spark.read.json never completes

2020-06-27 Thread Sanjeev Mishra
Hi all, I have a huge amount of JSON files that Spark 2.4 can easily finish reading, but Spark 3.0.0 never completes. I am running both Spark 2 and Spark 3 on Mac

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Sanjeev Mishra
You can use the catalog APIs; see the following: https://stackoverflow.com/questions/54268845/how-to-check-the-number-of-partitions-of-a-spark-dataframe-without-incurring-the/54270537 On Thu, Jun 25, 2020 at 6:19 AM Tzahi File wrote: > I don't want to query with a distinct on the partitioned columns, the
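A small sketch of the two approaches the linked answer discusses: checking a DataFrame's in-memory partition count without an extra scan, and asking the metastore for a table's partitions. The table name and path below are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Partition count of a DataFrame, taken from its underlying RDD (no extra job over the data).
    df = spark.read.parquet("s3a://example-bucket/events/")
    print(df.rdd.getNumPartitions())

    # For a partitioned catalog (Hive/Glue) table, the metastore can be queried directly,
    # which returns the partition values without touching the data files.
    spark.sql("SHOW PARTITIONS example_db.events").show(truncate=False)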