Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-29 Thread Gourav Sengupta
Hi, can you please share the SPARK code? Regards, Gourav On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra wrote: > > I have a large number of json files that Spark 2.4 can read in 36 seconds but > Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, > looks like Spark 3.0 is choo

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi Sanjeev, that just gives 11 records from the sample that you have loaded to the JIRA ticket, is that correct? Regards, Gourav Sengupta On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra wrote: > There is not much code, I am just using spark-shell and reading the data > like so > >

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
Hi Sanjeev, I think that I did precisely that. Can you please download my IPython notebook and have a look, and let me know where I am going wrong? It is attached to the JIRA ticket. Regards, Gourav Sengupta On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra wrote: > There are total 11 files

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

2020-06-30 Thread Gourav Sengupta
, and shows only 11 records. Regards, Gourav Sengupta On Tue, Jun 30, 2020 at 4:15 PM Sanjeev Mishra wrote: > Hi Gourav, > > Please check the comments of the ticket, looks like the performance > degradation is attributed to inferTimestamp option that is true by default > (I have

Re: java.lang.ClassNotFoundException for s3a committer

2020-07-21 Thread Gourav Sengupta
Hi, I am not sure about this but is there any requirement to use S3a at all ? Regards, Gourav On Tue, Jul 21, 2020 at 12:07 PM Steve Loughran wrote: > > > On Tue, 7 Jul 2020 at 03:42, Stephen Coy > wrote: > >> Hi Steve, >> >> While I understand your point regarding the mixing of Hadoop jars,

Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-26 Thread Gourav Sengupta
Hi, are you using s3a, which is not using EMRFS? In that case, these results do not make sense to me. Regards, Gourav Sengupta On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) < abhishek@nokia.com> wrote: > Hi All, > > > > We’re doing some pe

Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-26 Thread Gourav Sengupta
Hi, So the results do not make sense. Regards, Gourav On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) < abhishek@nokia.com> wrote: > Hi Gourav, > > > > Yes. We’re using s3a. > > > > Thanks and Regards, > > Abhishek &

Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-26 Thread Gourav Sengupta
> > > Thanks and Regards, > > Abhishek > > > > *From:* Gourav Sengupta > *Sent:* Wednesday, August 26, 2020 2:35 PM > *To:* Rao, Abhishek (Nokia - IN/Bangalore) > *Cc:* user@spark.apache.org > *Subject:* Re: Spark 3.0 using S3 taking long time for some set

Re: Edge AI with Spark

2020-09-24 Thread Gourav Sengupta
hi, its better to use lighter frameworks over edge. Some of the edge devices I work on run at over 40 to 50 degree celsius, therefore using lighter frameworks will be useful for the health of the device. Regards, Gourav On Thu, Sep 24, 2020 at 8:42 AM ayan guha wrote: > Too broad a question 😀

Re: Spark : Very simple query failing [Needed help please]

2020-09-26 Thread Gourav Sengupta
Hi, How did you set up your environment? And can you print the schema of your table as well? It looks like you are using Hive tables? Regards Gourav On Fri, 18 Sep 2020, 14:11 Debabrata Ghosh, wrote: > Hi, > I needed some help from you on the attached Spark problem > please. I am runn

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
What is the use case? Unless you have unlimited funding and time to waste you would usually start with that. Regards, Gourav On Fri, Oct 9, 2020 at 10:29 PM Russell Spitzer wrote: > Spark in Scala (or java) Is much more performant if you are using RDD's, > those operations basically force you t

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Gourav Sengupta
Not quite sure how meaningful this discussion is, but in case someone is really faced with this query the question still is 'what is the use case'? I am just a bit confused with the one-size-fits-all deterministic approach here; I thought those days were over almost 10 years ago. Regards Gourav

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Gourav Sengupta
t;>>> and so heavy investment from spark dev community on making pandas first >>>> class citizen including Udfs. >>>> >>>> As I work with multiple clients, my exp is org culture and available >>>> people are most imp driver for this choice reg

Re: Count distinct and driver memory

2020-10-18 Thread Gourav Sengupta
Hi, 6 billion rows is quite small, I can do it in my laptop with around 4 GB RAM. What is the version of SPARK you are using and what is the effective memory that you have per executor? Regards, Gourav Sengupta On Mon, Oct 19, 2020 at 4:24 AM Lalwani, Jayesh wrote: > I have a Dataframe w

Re: mission statement : unified

2020-10-18 Thread Gourav Sengupta
several other frameworks as well now so not quite sure how unified creates a unique brand value. Regards, Gourav Sengupta On Sun, Oct 18, 2020 at 6:40 PM Hulio andres wrote: > > Apache Spark's mission statement is *Apache Spark™* is a unified > analytics engine for large-scale d

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Gourav Sengupta
advantage. Regards, Gourav Sengupta On Thu, Oct 22, 2020 at 5:13 PM Mich Talebzadeh wrote: > Today I had a discussion with a lead developer on a client site regarding > Scala or PySpark. with Spark. > > They were not doing data science and reluctantly agreed that PySpark was &g

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Gourav Sengupta
Hi, I may be wrong, but this looks like a massively complicated solution for what could have been a simple SQL. It always seems better to me to first reduce the complexity and then solve the problem, rather than solve a problem which should not even exist in the first instance. Regards, Gourav On Sun, Jan

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Gourav Sengupta
Hi John, as always I would start by asking what it is that you are trying to achieve here. What is the exact security requirement? We can then start looking at the options available. Regards, Gourav Sengupta On Thu, Jan 21, 2021 at 1:59 PM Mich Talebzadeh wrote: > Most enterprise databa

Re: Connection to Presto via Spark

2021-01-21 Thread Gourav Sengupta
Terribly fascinating. Any insights into why we are not trying to use Spark itself? Regards Gourav On Wed, 13 Jan 2021, 12:46 Vineet Mishra, wrote: > Hi, > > I am trying to connect to Presto via Spark shell using the following > connection string, however ending up with exception > > *-bash-4.2$

Re: Facing memory leak with Pyarrow enabled and toPandas()

2021-01-22 Thread Gourav Sengupta
Hi, Can you please mention the Spark version, give us the code for setting up the Spark session, and the operation you are talking about? It will be good to know the amount of memory that your system has as well, and the number of executors you are using per system. In general I have faced issues when doing g

Re: Apache Spark

2021-01-26 Thread Gourav Sengupta
Hi, why do you want to buy paid SPARK? Regards, Gourav On Tue, Jan 26, 2021 at 1:22 PM Pasha Finkelshteyn < pavel.finkelsht...@gmail.com> wrote: > Hi Andrey, > > It looks like you may contact Databricks for that. > Also it would be easier for non-Russian speakers to respond to you if your > name w

Re: S3a Committer

2021-02-03 Thread Gourav Sengupta
Why s3a? Regards, Gourav Sengupta On Wed, Feb 3, 2021 at 7:35 AM YoungKun Min wrote: > Hi, > > I have almost the same problem with Ceph RGW, and currently do research > about Apache Iceberg and Databricks Delta(opensource version). > I think these libraries can address the probl

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Gourav Sengupta
coalesce and writing out to the files is very large, then the issue is coalesce. Otherwise the issue is the chain of transformations before coalesce. Anyway, it's 2021, and I always get confused when people use RDDs. Any particular reason why dataframes would not work? Regards, Gourav Sengupt

Re: How to control count / size of output files for

2021-02-25 Thread Gourav Sengupta
Hi Ivan, sorry but it always helps to know the version of SPARK you are using, its environment, and the format that you are writing out your files to, and any other details if possible. Regards, Gourav Sengupta On Wed, Feb 24, 2021 at 3:43 PM Ivan Petrov wrote: > Hi, I'm trying to

Re: Structured Streaming With Kafka - processing each event

2021-03-02 Thread Gourav Sengupta
Hi, Are you using structured streaming, which is the Spark version and Kafka version, and where are you fetching the data from? Semantically speaking, if your data in Kafka represents an action to be performed then it should actually be a queue like RabbitMQ or SQS. If it is simply data then it shou

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Gourav Sengupta
Hi, it would help a lot if you could at least format the message before asking people to go through it. Also I am pretty sure that the error is mentioned in the first line itself. Any ideas regarding the SPARK version, and environment that you are using? Thanks and Regards, Gourav Sengupta

Re: How to control count / size of output files for

2021-03-08 Thread Gourav Sengupta
property: spark.sql.files.maxRecordsPerFile unless there is skew in the data things will work out fine. Regards, Gourav Sengupta On Mon, Mar 8, 2021 at 4:01 PM m li wrote: > Hi Ivan, > > > > If the error you are referring to is that the data is out of order, it may > be that

Re: Spark performance over S3

2021-04-06 Thread Gourav Sengupta
point. Regards, Gourav Sengupta On Tue, Apr 6, 2021 at 7:46 PM Tzahi File wrote: > Hi All, > > We have a spark cluster on aws ec2 that has 60 X i3.4xlarge. > > The spark job running on that cluster reads from an S3 bucket and writes > to that bucket. > > the bucket and

Re: Tasks are skewed to one executor

2021-04-12 Thread Gourav Sengupta
Hi, looks like you have answered some questions which I generally ask. Another thing, can you please let me know the environment? Is it AWS, GCP, Azure, Databricks, HDP, etc? Regards, Gourav On Sun, Apr 11, 2021 at 8:39 AM András Kolbert wrote: > Hi, > > Sure! > > Application: > - Spark versio

Re: GPU job in Spark 3

2021-04-15 Thread Gourav Sengupta
Hi, completely agree with Hao. In case you are using YARN try to see the EMR documentation on how to enable GPU as resource in YARN before trying to use that in SPARK. This is one of the most exciting features of SPARK 3, and you can reap huge benefits out of it :) Regards, Gourav Sengupta On

Graceful shutdown SPARK Structured Streaming

2021-04-21 Thread Gourav Sengupta
advance for all your kind help. Regards, Gourav Sengupta

Fwd: Graceful shutdown SPARK Structured Streaming

2021-05-05 Thread Gourav Sengupta
, Gourav Sengupta -- Forwarded message - From: Gourav Sengupta Date: Wed, Apr 21, 2021 at 10:06 AM Subject: Graceful shutdown SPARK Structured Streaming To: Dear friends, is there any documentation available for gracefully stopping SPARK Structured Streaming in 3.1.x? I am

Re: Graceful shutdown SPARK Structured Streaming

2021-05-06 Thread Gourav Sengupta
Hi Mich, thanks a ton for your kind response, looks like we are still using the earlier methodologies for stopping a spark streaming program gracefully. Regards, Gourav Sengupta On Wed, May 5, 2021 at 6:04 PM Mich Talebzadeh wrote: > > Hi, > > > I believe I discussed this i

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-14 Thread Gourav Sengupta
Hi, once again let's start with the requirement. Why are you trying to pass xml and json files to SPARK instead of reading them in SPARK? Generally when people pass on files they are Python or JAR files. Regards, Gourav On Sat, May 15, 2021 at 5:03 AM Amit Joshi wrote: > Hi KhajaAsmath, > > Cli

Re: Question on spark on Kubernetes

2021-05-20 Thread Gourav Sengupta
Hi Mithalee, let's start with the why. Why are you using Kubernetes and not just EMR on EC2? Do you have extremely bespoke library dependencies and requirements? Or do your workloads fail in case the clusters do not scale up or down in a few minutes? Regards, Gourav Sengupta On Thu, May 20, 2021

Petastorm vs horovod vs tensorflowonspark vs spark_tensorflow_distributor

2021-06-01 Thread Gourav Sengupta
, Gourav Sengupta

Re: Reading Large File in Pyspark

2021-06-03 Thread Gourav Sengupta
Hi, could not agree more with Molotch :) Regards, Gourav Sengupta On Thu, May 27, 2021 at 7:08 PM Molotch wrote: > You can specify the line separator to make spark split your records into > separate rows. > > df = spark.read.option("lineSep","^^^").text("

Re: Petastorm vs horovod vs tensorflowonspark vs spark_tensorflow_distributor

2021-06-07 Thread Gourav Sengupta
Hi Sean, thank you so much for your kind response :) Regards, Gourav Sengupta On Sat, Jun 5, 2021 at 8:00 PM Sean Owen wrote: > All of these tools are reasonable choices. I don't think the Spark project > itself has a view on what works best. These things do different things. Fo

addPyFile error: NotADirectoryError: [Errno 20] Not a directory

2021-06-07 Thread Gourav Sengupta
I do see the following files there under sparknlp_display folder: > VERSION > __init__.py > __pycache__ > assertion.py > dep_updates.py > dependency_parser.py > entity_resolution.py > fonts > label_colors > ner.py > re_updates.py > relation_extraction.py > retemp.py > style.css > style_utils.py I will be grateful if someone could kindly let me know what am I doing wrong here. Regards, Gourav Sengupta

Re: Structuring a PySpark Application

2021-06-30 Thread Gourav Sengupta
Hi, I think that reading Matei Zaharia's book "Spark: The Definitive Guide" will be a very good starting point. Regards, Gourav Sengupta On Wed, Jun 30, 2021 at 3:47 PM Kartik Ohri wrote: > Hi all! > > I am working on a Pyspark application and would like suggestio

Re: Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-30 Thread Gourav Sengupta
nd nothing better than to ride on the success of SPARK. But I may be wrong, and SPARK community may still be developing those integrations. Regards, Gourav Sengupta On Fri, Jul 30, 2021 at 2:46 AM Artemis User wrote: > Has anyone had any experience with running Spark-Rapids on a GPU-powered >

Re: Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-30 Thread Gourav Sengupta
. I am sure we will all find help that we seek, but the help will most likely come from those as well who are paid and supported by companies towards whom you are being so unkind Regards, Gourav Sengupta On Fri, Jul 30, 2021 at 4:02 PM Artemis User wrote: > Thanks Gourav for the i

Re: Running Spark Rapids on GPU-Powered Spark Cluster

2021-07-31 Thread Gourav Sengupta
Hi Artemis, please do not insult people here, and give your personal opinions as well. Your comments are insulting to all big corporations which pay salaries and provide platforms for a lot of people here. Best of luck with your endeavors. Regards, Gourav Sengupta

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

2021-08-01 Thread Gourav Sengupta
Hi Andreas, just to understand the question first, what is it you want to achieve by breaking the map operations across the GPU and CPU? Also it will be wonderful to understand the version of SPARK you are using, and your GPU details a bit more. Regards, Gourav On Sat, Jul 31, 2021 at 9:57 AM

Re: [Spark Core, PySpark] Separate stage level scheduling for consecutive map functions

2021-08-01 Thread Gourav Sengupta
executes the CPU > task. > > Do you have any idea, if resource assignment based scheduling for > functions is a planned feature for the future? > > Best > Andreas > > > On Sun, Aug 1, 2021 at 6:53 PM Gourav Sengupta > wrote: > >> Hi Andreas, >> >&g

Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

2021-08-05 Thread Gourav Sengupta
2.4 2. when in the data lake some partitions have parquet files written in SPARK 2.4.x and some are in SPARK 3.1.x. Please note that there are no changes in schema, but later on we might end up adding or removing some columns. I will be really grateful for your kind help on this. Regards, Gourav

Re: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

2021-08-12 Thread Gourav Sengupta
Hi Saurabh, a very big note of thanks from Gourav :) Regards, Gourav Sengupta On Thu, Aug 12, 2021 at 4:16 PM Saurabh Gulati wrote: > We had issues with this migration mainly because of changes in spark date > calendars. See > <https://www.waitingforcode.com/apache-spark-sql/what

Re: How can I use sparkContext.addFile

2021-08-22 Thread Gourav Sengupta
Hi, why are you using add file for a json file? Cant you just read it as a dataframe? Regards, Gourav Sengupta On Fri, Aug 20, 2021 at 4:50 PM igyu wrote: > in spark-shell > I can run > > val url = "hdfs://nameservice1/user/jztwk/config.json" > Spark.sparkContext.ad

AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Gourav Sengupta
making transition to SPARK 3.1.1 expensive I think. Regards, Gourav Sengupta

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-23 Thread Gourav Sengupta
Hi, the query still gives the same error if we write "SELECT * FROM table_name WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS". Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0. Thanks and Regards, Gourav Sengupta On Mon, Aug 23, 2021 at 1:16 PM Sean Owen w

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-24 Thread Gourav Sengupta
Hi, I received a response from AWS, this is an issue with EMR, and they are working on resolving the issue I believe. Thanks and Regards, Gourav Sengupta On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta < gourav.sengupta.develo...@gmail.com> wrote: > Hi, > > the query still gives

Re: Processing Multiple Streams in a Single Job

2021-08-24 Thread Gourav Sengupta
Hi, can you please give more details around this? What is the requirement? What is the SPARK version you are using? What do you mean by multiple sources? What are these sources? Regards, Gourav Sengupta On Wed, Aug 25, 2021 at 3:51 AM Artemis User wrote: > Thanks Daniel. I guess you w

Re: AWS EMR SPARK 3.1.1 date issues

2021-08-29 Thread Gourav Sengupta
Hi Nicolas, thanks a ton for your kind response, I will surely try this out. Regards, Gourav Sengupta On Sun, Aug 29, 2021 at 11:01 PM Nicolas Paris wrote: > as a workaround turn off pruning : > > spark.sql.hive.metastorePartitionPruning false > spark.sql.hive.convertMetastoreP
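The workaround quoted above can go into `spark-defaults.conf`. The first key is verbatim from the thread; the second is truncated in the archive, and `spark.sql.hive.convertMetastoreParquet` is only the most plausible completion, so treat it as an assumption:

```properties
spark.sql.hive.metastorePartitionPruning   false
spark.sql.hive.convertMetastoreParquet     false
```

Note the trade-off: disabling metastore partition pruning makes Spark fetch all partition metadata and filter on its own side, which can be slow for tables with very many partitions.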

Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Gourav Sengupta
Hi Holden, This is such a wonderful opportunity. Sadly when I click on the link it says event not found. Regards, Gourav On Tue, Sep 14, 2021 at 12:13 AM Holden Karau wrote: > Hi Folks, > > I'm going to experiment with a drop-in virtual half-hour office hour type > thing next Monday, if you've

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Gourav Sengupta
memory objects. Regards, Gourav Sengupta On Wed, Nov 3, 2021 at 10:09 PM Sergey Ivanychev wrote: > I want to further clarify the use case I have: an ML engineer collects > data so as to use it for training an ML model. The driver is created within > Jupiter notebook and has 64G of ram f

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Gourav Sengupta
Hi, did you get to read the excerpts from the book of Dr. Zaharia? Regards, Gourav On Thu, Nov 4, 2021 at 4:11 PM Sergey Ivanychev wrote: > I’m sure that its running in client mode. I don’t want to have the same > amount of RAM on drivers and executors since there’s no point in giving 64G > of

Re: Read file from local

2021-11-05 Thread Gourav Sengupta
Hi, can you please try file://? If you are using a cluster try to ensure that the location you mention is accessible across all the executors. Regards, Gourav Sengupta On Fri, Nov 5, 2021 at 4:16 AM Lynx Du wrote: > Hi experts, > > I am just get started using spark and scala.

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread Gourav Sengupta
Hi Martin, just to confirm, you are taking the output of SPARKNLP, and then trying to feed it to SPARK ML for running algorithms on the output of NER generated by SPARKNLP, right? Regards, Gourav Sengupta On Thu, Nov 11, 2021 at 8:00 AM wrote: > Hi Sean, > > Apologies for the dela

Re: Using MulticlassClassificationEvaluator for NER evaluation

2021-11-11 Thread Gourav Sengupta
answer Sean's question, explaining what you are trying to achieve and how, always helps. Regards, Gourav Sengupta On Thu, Nov 11, 2021 at 11:03 AM Martin Wunderlich wrote: > Hi Gourav, > > Mostly correct. The output of SparNLP here is a trained > pipeline/model/transformer.

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Gourav Sengupta
Hi Sergey, Please read the excerpts from the book of Dr. Zaharia that I had sent, they explain these fundamentals clearly. Regards, Gourav Sengupta On Thu, Nov 11, 2021 at 9:40 PM Sergey Ivanychev wrote: > Yes, in fact those are the settings that cause this behaviour. If set to >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Gourav Sengupta
Hi, Sorry Regards, Gourav Sengupta On Fri, Nov 12, 2021 at 6:48 AM Sergey Ivanychev wrote: > Hi Gourav, > > Please, read my question thoroughly. My problem is with the plan of the > execution and with the fact that toPandas collects all the data not on the > driver but on a

Re: Apache Spark 3.2.0 | Pyspark | Pycharm Setup

2021-11-17 Thread Gourav Sengupta
Hi Anil, I generally create an anaconda environment, and then install pyspark in it, and then configure the interpreter to point to that particular environment. Never faced an issue with my approach. Regards, Gourav Sengupta On Wed, Nov 17, 2021 at 7:39 AM Anil Kulkarni wrote: > Hi Sp

Re: Exploding huge array elements in spark

2021-12-03 Thread Gourav Sengupta
Hi Srikanth, what is the spark version that you are using? Can you tell us the data dictionary and the PK? Also if possible the data volumes that you are dealing with? Thanks and Regards, Gourav Sengupta On Thu, Dec 2, 2021 at 4:33 PM Shrikanth J R wrote: > Hi, > > I am facing an i

Re: Conda Python Env in K8S

2021-12-04 Thread Gourav Sengupta
Hi, also building entire environments in containers may increase their sizes massively. Regards, Gourav Sengupta On Sat, Dec 4, 2021 at 7:52 AM Bode, Meikel, NMA-CFD < meikel.b...@bertelsmann.de> wrote: > Hi Mich, > > > > sure thats possible. But distributing the compl

Re: spark 3.2.0 the different dataframe createOrReplaceTempView the same name TempView

2021-12-19 Thread Gourav Sengupta
Hi, I am pretty sure that Sean already answered the question. Also I do not think that creating iterative table definitions or data frame definitions is best practice. Regards, Gourav On Mon, Dec 13, 2021 at 4:00 PM Sean Owen wrote: > ... but the error is not "because that already exists". See

Re: question about data skew and memory issues

2021-12-19 Thread Gourav Sengupta
Hi, also if you are using SPARK 3.2.x please try to see the documentation on handling skew using SPARK settings. Regards, Gourav Sengupta On Tue, Dec 14, 2021 at 6:01 PM David Diebold wrote: > Hello all, > > I was wondering if it possible to encounter out of memory exceptions o

Re: measure running time

2021-12-23 Thread Gourav Sengupta
steps. For example, please look at the SPARK UI to see how timings are calculated in distributed computing mode, there are several well written papers on this. Thanks and Regards, Gourav Sengupta On Thu, Dec 23, 2021 at 10:57 AM wrote: > hello community, > > In pyspark how can I me

Re: How to estimate the executor memory size according by the data

2021-12-23 Thread Gourav Sengupta
Hi, just trying to understand: 1. Are you using JDBC to consume data from HIVE? 2. Or are you reading data directly from S3 and just using HIVE Metastore in SPARK just to find out where the table is stored and its metadata? Regards, Gourav Sengupta On Thu, Dec 23, 2021 at 2:13 PM Arthur Li

Re: measure running time

2021-12-24 Thread Gourav Sengupta
. Then perhaps based on that you may want to look at different options. Regards, Gourav Sengupta On Fri, Dec 24, 2021 at 10:42 AM wrote: > As you see below: > > $ pip install sparkmeasure > Collecting sparkmeasure >Using cached > > https://files.pythonhost

Re: Dataframe's storage size

2021-12-24 Thread Gourav Sengupta
. Regards, Gourav Sengupta On Fri, Dec 24, 2021 at 2:04 AM wrote: > Hello > > Is it possible to know a dataframe's total storage size in bytes? such > as: > > >>> df.size() > Traceback (most recent call last): >File "", line 1, in >File

Re: About some Spark technical help

2021-12-24 Thread Gourav Sengupta
Hi, out of sheer and utter curiosity, why JAVA? Regards, Gourav Sengupta On Thu, Dec 23, 2021 at 5:10 PM sam smith wrote: > Hi Andrew, > > Thanks, here's the Github repo to the code and the publication : > https://github.com/SamSmithDevs10/paperReplicationForReview > >

Re: Unable to use WriteStream to write to delta file.

2021-12-24 Thread Gourav Sengupta
Hi, also please ensure that you have read all the required documentation to understand whether you need to do any metadata migration or not. Regards, Gourav Sengupta On Sun, Dec 19, 2021 at 11:55 AM Alex Ott wrote: > Make sure that you're using compatible version of Delta Lake libr

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
Hi, please note that using SQL is much more performant, and easier to manage these kinds of issues. You might want to look at the SPARK UI to see the advantage of using SQL over the dataframes API. Regards, Gourav Sengupta On Sat, Dec 18, 2021 at 5:40 PM Andrew Davidson wrote: > Thanks Nicho

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
visible. Regards, Gourav Sengupta On Fri, Dec 24, 2021 at 2:48 PM Sean Owen wrote: > Nah, it's going to translate to the same plan as the equivalent SQL. > > On Fri, Dec 24, 2021, 5:09 AM Gourav Sengupta > wrote: > >> Hi, >> >> please note that using SQL

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
entire processing at one single go. Can you please write down the end to end SQL and share without the 16000 iterations? Regards, Gourav Sengupta On Fri, Dec 24, 2021 at 5:16 PM Andrew Davidson wrote: > Hi Sean and Gourav > > > > Thanks for the suggestions. I thought that b

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
dataframes may apply and RDD are used, but for UDF's I prefer SQL as well, but that may be a personal idiosyncrasy. The O'Reilly book on data algorithms using SPARK, pyspark uses dataframes and RDD API's :) Regards, Gourav Sengupta On Fri, Dec 24, 2021 at 6:11 PM Sean Owen wrote:

Re: Dataframe's storage size

2021-12-24 Thread Gourav Sengupta
> On Fri, Dec 24, 2021, 4:54 AM Gourav Sengupta > wrote: > >> Hi, >> >> This question, once again like the last one, does not make much sense at >> all. Where are you trying to store the data frame, and how? >> >> Are you just trying to write a blog, as

Re: Pyspark debugging best practices

2021-12-28 Thread Gourav Sengupta
Hi Andrew, Any chance you might give Databricks a try in GCP? The above transformations look complicated to me, why are you adding dataframes to a list? Regards, Gourav Sengupta On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson wrote: > Hi > > > > I am having trouble debuggin

Re: How to make batch filter

2022-01-02 Thread Gourav Sengupta
.rdd.getNumPartitions() 10 Please do refer to the following page for adaptive sql execution in SPARK 3, it will be of massive help particularly in case you are handling skewed joins, https://spark.apache.org/docs/latest

Re: pyspark

2022-01-06 Thread Gourav Sengupta
Hi, I am not sure at all that we need to use SQLContext and HiveContext anymore. Can you please check your JAVA_HOME, and SPARK_HOME? I use findspark library to enable all environment variables for me regarding spark, or use conda to install pyspark using conda-forge Regards, Gourav Sengupta

Re: hive table with large column data size

2022-01-10 Thread Gourav Sengupta
-ref-datatypes.html. Parquet is definitely a columnar format, and if I am not entirely wrong, it definitely supports columnar reading of data by default in SPARK. Regards, Gourav Sengupta On Sun, Jan 9, 2022 at 2:34 PM weoccc wrote: > Hi , > > I want to store binary data (such as images)

Re: How to add a row number column with out reordering my data frame

2022-01-10 Thread Gourav Sengupta
Hi, I am a bit confused here; it is not entirely clear to me why you are creating the row numbers, and how creating the row numbers helps you with the joins. Can you please explain with some sample data? Regards, Gourav On Fri, Jan 7, 2022 at 1:14 AM Andrew Davidson wrote: > Hi > > > > I am

Re: How to add a row number column with out reordering my data frame

2022-01-11 Thread Gourav Sengupta
art *=* i *** numRows > > end *=* start *+* numRows > > print("\ni:{} start:{} end:{}"*.*format(i, start,end)) > > df *=* trainDF*.*iloc[ start:end ] > > > > There does not seem to be an easy way to do this. > > > https://spark.apache.org/docs/lates

Re: pyspark loop optimization

2022-01-11 Thread Gourav Sengupta
dataframe in each iteration to understand the effect of your loops on the explain plan - that should give some details. Regards, Gourav Sengupta On Mon, Jan 10, 2022 at 10:49 PM Ramesh Natarajan wrote: > I want to compute cume_dist on a bunch of columns in a spark dataframe, > but want to

Re: [Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-12 Thread Gourav Sengupta
Hi, I may not have much time, but can you please add some inline comments in your code to explain what you are trying to do? Regards, Gourav Sengupta On Tue, Jan 11, 2022 at 5:29 PM Alana Young wrote: > I am experimenting with creating and persisting ML pipelines using custom > transf

Re: Small optimization questions

2022-01-28 Thread Gourav Sengupta
tasks to take care of memory. We do not have any other data regarding your clusters or environments therefore it is difficult to imagine things and provide more information. Regards, Gourav Sengupta On Thu, Jan 27, 2022 at 12:58 PM Aki Riisiö wrote: > Ah, sorry for spamming, I found the ans

Re: Kafka to spark streaming

2022-01-30 Thread Gourav Sengupta
Hi Amit, before answering your question, I am just trying to understand it. I am not exactly clear how the Akka application, Kafka, and the SPARK Streaming application fit together, and what you are exactly trying to achieve. Can you please elaborate? Regards, Gourav On Fri, Jan 28, 2022 at 10:

Re: how can I remove the warning message

2022-01-30 Thread Gourav Sengupta
warnings in spark-shell using the Logger.getLogger("akka").setLevel(Level.OFF) in case I have not completely forgotten. Other details are mentioned here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.setLogLevel.html Regards, Gourav Sengupta On Fri, Ja

Re: How to delete the record

2022-01-30 Thread Gourav Sengupta
third option, which is akin to the second option that Mich was mentioning, and that is basically a database transaction log, which gets very large, very expensive to store and query over a period of time. Are you creating a database transaction log? Thanks and Regards, Gourav Sengupta On Thu, Jan 27

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Gourav Sengupta
read the difference between repartition and coalesce before making any kind of assumptions. Regards, Gourav Sengupta On Sun, Jan 30, 2022 at 8:52 AM Sebastian Piu wrote: > It's probably the repartitioning and deserialising the df that you are > seeing take time. Try doing this
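The difference is easy to model in plain Python (a toy sketch, not Spark's implementation): coalesce only glues existing partitions together, so rows never move between groupings and no shuffle is needed, while repartition redistributes every row (a full shuffle).

```python
def coalesce(parts, n):
    # Merge contiguous partitions; no row leaves its original grouping.
    size = -(-len(parts) // n)  # ceiling division
    return [sum(parts[i:i + size], []) for i in range(0, len(parts), size)]

def repartition(parts, n):
    # Full shuffle: every row is redistributed (round-robin here).
    rows = [r for p in parts for r in p]
    out = [[] for _ in range(n)]
    for i, r in enumerate(rows):
        out[i % n].append(r)
    return out

parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))     # [[1, 2, 3], [4, 5, 6]]
print(repartition(parts, 2))  # [[1, 3, 5], [2, 4, 6]]
```

This is also why coalesce can leave partitions skewed while repartition evens them out at the cost of moving all the data.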

Re: [Spark UDF]: Where does UDF stores temporary Arrays/Sets

2022-01-30 Thread Gourav Sengupta
e not actually solving the problem and just addressing the issue. Regards, Gourav Sengupta On Wed, Jan 26, 2022 at 4:07 PM Sean Owen wrote: > Really depends on what your UDF is doing. You could read 2GB of XML into > much more than that as a DOM representation in memory. > Remember 15

Re: A Persisted Spark DataFrame is computed twice

2022-02-01 Thread Gourav Sengupta
data of the filters first. Regards, Gourav Sengupta On Mon, Jan 31, 2022 at 8:00 AM Benjamin Du wrote: > I don't think coalesce (by repartitioning I assume you mean coalesce) > itself and deserialising takes that much time. To add a little bit more > context, the computation of

Re: add an auto_increment column

2022-02-07 Thread Gourav Sengupta
nsert records multiple times in a table, and still have different values? I think without knowing the requirements all the above responses, like everything else where solutions are reached before understanding the problem, has high chances of being wrong. Regards, Gourav Sengupta On Mon, Feb 7, 20

Re: add an auto_increment column

2022-02-07 Thread Gourav Sengupta
are trying to achieve by the rankings? Regards, Gourav Sengupta On Tue, Feb 8, 2022 at 4:22 AM ayan guha wrote: > For this req you can rank or dense rank. > > On Tue, 8 Feb 2022 at 1:12 pm, wrote: > >> Hello, >> >> For this query: >> >> >>&

Re: add an auto_increment column

2022-02-08 Thread Gourav Sengupta
Hi, so do you want to rank apple and tomato both as 2? Not quite clear on the use case here though. Regards, Gourav Sengupta On Tue, Feb 8, 2022 at 7:10 AM wrote: > > Hello Gourav > > > As you see here orderBy has already give the solution for "equal &
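For what it's worth, the two ranking behaviours under discussion can be sketched in plain Python: `rank()` leaves gaps after ties, while `dense_rank()` does not, so two tied items (say apple and tomato) would both get 2, and with `dense_rank()` the next distinct value gets 3 rather than 4.

```python
def rank(values):
    # SQL rank(): ties share a rank; the next rank skips ahead.
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def dense_rank(values):
    # SQL dense_rank(): ties share a rank; no gaps afterwards.
    distinct = sorted(set(values))
    return [distinct.index(v) + 1 for v in values]

amounts = [100, 200, 200, 300]
print(rank(amounts))        # [1, 2, 2, 4]
print(dense_rank(amounts))  # [1, 2, 2, 3]
```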

Re: data size exceeds the total ram

2022-02-11 Thread Gourav Sengupta
Hi, just so that we understand the problem first? What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the SPARK version that you are using? Thanks and Regards, Gourav Sengupta On Fri

Re: data size exceeds the total ram

2022-02-11 Thread Gourav Sengupta
there are different ways to manage that depending on the SPARK version. Thanks and Regards, Gourav Sengupta On Fri, Feb 11, 2022 at 11:09 AM frakass wrote: > Hello list > > I have imported the data into spark and I found there is disk IO in > every node. The memory didn't get

Re: Using Avro file format with SparkSQL

2022-02-11 Thread Gourav Sengupta
Hi Anna, Avro libraries should be inbuilt in SPARK in case I am not wrong. Any particular reason why you are using a deprecated or soon to be deprecated version of SPARK? SPARK 3.2.1 is fantastic. Please do let us know about your set up if possible. Regards, Gourav Sengupta On Thu, Feb 10

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-11 Thread Gourav Sengupta
eading its settings. Regards, Gourav Sengupta On Fri, Feb 11, 2022 at 6:00 PM Adam Binford wrote: > Writing to Delta might not support the write.option method. We set > spark.hadoop.parquet.block.size in our spark config for writing to Delta. > > Adam > > On Fri, Feb 11, 2022
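As a rough back-of-the-envelope model (assuming `parquet.block.size` is given in bytes and caps the target row-group size, which defaults to 128 MiB), you can estimate how many row groups a partition of a given size will produce:

```python
def row_groups(partition_bytes, block_size_bytes):
    # Ceiling division: a writer targeting block_size_bytes per
    # row group emits roughly this many groups for the partition.
    return -(-partition_bytes // block_size_bytes)

# A 1 GiB partition with the 128 MiB default vs. an 8 MiB override:
print(row_groups(1 << 30, 128 << 20))  # 8
print(row_groups(1 << 30, 8 << 20))    # 128
```

Shrinking the block size is how the cluster-config approach above yields many small row groups (and hence smaller readable units) without repartitioning the stream.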

Re: Unable to force small partitions in streaming job without repartitioning

2022-02-12 Thread Gourav Sengupta
hi, Did you try sorting while writing out the data? All of this engineering may not be required in that case. Regards, Gourav Sengupta On Sat, Feb 12, 2022 at 8:42 PM Chris Coutinho wrote: > Setting the option in the cluster configuration solved the issue, and now > we'
