Hi,
can you please share the SPARK code?
Regards,
Gourav
On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra
wrote:
>
> I have a large number of JSON files that Spark can read in 36 seconds, but
> Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, it
> looks like Spark 3.0 is choo
Hi Sanjeev,
that just gives 11 records from the sample that you have loaded to the JIRA
ticket; is that correct?
Regards,
Gourav Sengupta
On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra
wrote:
> There is not much code, I am just using spark-shell and reading the data
> like so
>
>
Hi, Sanjeev,
I think that I did precisely that. Can you please download my IPython
notebook, have a look, and let me know where I am going wrong? It is
attached to the JIRA ticket.
Regards,
Gourav Sengupta
On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra
wrote:
> There are total 11 files
, and shows only 11 records.
Regards,
Gourav Sengupta
On Tue, Jun 30, 2020 at 4:15 PM Sanjeev Mishra
wrote:
> Hi Gourav,
>
> Please check the comments of the ticket, looks like the performance
> degradation is attributed to inferTimestamp option that is true by default
> (I have
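A minimal sketch of how one might turn that option off when reading JSON in
Spark 3 (PySpark; the input path is hypothetical):

    # skip per-value timestamp inference during schema inference
    df = (spark.read
          .option("inferTimestamp", "false")
          .json("/data/sample/*.json"))
    df.printSchema()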
Hi,
I am not sure about this, but is there any requirement to use S3a at all?
Regards,
Gourav
On Tue, Jul 21, 2020 at 12:07 PM Steve Loughran
wrote:
>
>
> On Tue, 7 Jul 2020 at 03:42, Stephen Coy
> wrote:
>
>> Hi Steve,
>>
>> While I understand your point regarding the mixing of Hadoop jars,
Hi,
are you using s3a, which is not using EMRFS? In that case, these results
do not make sense to me.
Regards,
Gourav Sengupta
On Mon, Aug 24, 2020 at 12:52 PM Rao, Abhishek (Nokia - IN/Bangalore) <
abhishek@nokia.com> wrote:
> Hi All,
>
>
>
> We’re doing some pe
Hi,
So the results do not make sense.
Regards,
Gourav
On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) <
abhishek@nokia.com> wrote:
> Hi Gourav,
>
>
>
> Yes. We’re using s3a.
>
>
>
> Thanks and Regards,
>
> Abhishek
>
>
> Thanks and Regards,
>
> Abhishek
>
>
>
> *From:* Gourav Sengupta
> *Sent:* Wednesday, August 26, 2020 2:35 PM
> *To:* Rao, Abhishek (Nokia - IN/Bangalore)
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark 3.0 using S3 taking long time for some set
hi,
it is better to use lighter frameworks on the edge. Some of the edge devices I
work on run at 40 to 50 degrees Celsius, therefore using lighter
frameworks will be useful for the health of the device.
Regards,
Gourav
On Thu, Sep 24, 2020 at 8:42 AM ayan guha wrote:
> Too broad a question 😀
Hi
How did you set up your environment? And can you print the schema of your
table as well?
It looks like you are using hive tables?
Regards
Gourav
On Fri, 18 Sep 2020, 14:11 Debabrata Ghosh, wrote:
> Hi,
> I needed some help from you on the attached Spark problem
> please. I am runn
What is the use case?
Unless you have unlimited funding and time to waste, you would usually start
with that.
Regards,
Gourav
On Fri, Oct 9, 2020 at 10:29 PM Russell Spitzer
wrote:
> Spark in Scala (or Java) is much more performant if you are using RDDs;
> those operations basically force you t
Not quite sure how meaningful this discussion is, but in case someone is
really faced with this query, the question still is 'what is the use case?'
I am just a bit confused by the one-size-fits-all deterministic approach
here; I thought those days were over almost 10 years ago.
Regards
Gourav
>>>> and so heavy investment from spark dev community on making pandas first
>>>> class citizen including Udfs.
>>>>
>>>> As I work with multiple clients, my exp is org culture and available
>>>> people are most imp driver for this choice reg
Hi,
6 billion rows is quite small; I can do it on my laptop with around 4 GB of
RAM. What is the version of SPARK you are using, and what is the effective
memory that you have per executor?
Regards,
Gourav Sengupta
On Mon, Oct 19, 2020 at 4:24 AM Lalwani, Jayesh
wrote:
> I have a Dataframe w
several other frameworks as well now so not quite
sure how unified creates a unique brand value.
Regards,
Gourav Sengupta
On Sun, Oct 18, 2020 at 6:40 PM Hulio andres wrote:
>
> Apache Spark's mission statement is *Apache Spark™* is a unified
> analytics engine for large-scale d
advantage.
Regards,
Gourav Sengupta
On Thu, Oct 22, 2020 at 5:13 PM Mich Talebzadeh
wrote:
> Today I had a discussion with a lead developer on a client site regarding
> Scala or PySpark with Spark.
>
> They were not doing data science and reluctantly agreed that PySpark was
>
Hi,
I may be wrong, but this looks like a massively complicated solution for
what could have been a simple SQL query.
It always seems better to me to first reduce the complexity and then solve
the problem, rather than solve a problem which should not even exist in the
first place.
Regards,
Gourav
On Sun, Jan
Hi John,
as always, I would start by asking what it is that you are trying to achieve
here. What is the exact security requirement?
We can then start looking at the options available.
Regards,
Gourav Sengupta
On Thu, Jan 21, 2021 at 1:59 PM Mich Talebzadeh
wrote:
> Most enterprise databa
Terribly fascinating. Any insights into why we are not trying to use spark
itself?
Regards
Gourav
On Wed, 13 Jan 2021, 12:46 Vineet Mishra, wrote:
> Hi,
>
> I am trying to connect to Presto via Spark shell using the following
> connection string, however ending up with exception
>
> *-bash-4.2$
Hi
Can you please mention the spark version, give us the code for setting up
the spark session, and the operation you are talking about? It will be good to
know the amount of memory that your system has, as well as the number of
executors you are using per system.
In general I have faced issues when doing g
Hi,
why do you want to buy paid SPARK?
Regards,
Gourav
On Tue, Jan 26, 2021 at 1:22 PM Pasha Finkelshteyn <
pavel.finkelsht...@gmail.com> wrote:
> Hi Andrey,
>
> It looks like you may contact Databricks for that.
> Also it would be easier for non-Russian speakers to respond to you if your
> name w
Why s3a?
Regards,
Gourav Sengupta
On Wed, Feb 3, 2021 at 7:35 AM YoungKun Min wrote:
> Hi,
>
> I have almost the same problem with Ceph RGW, and am currently doing research
> about Apache Iceberg and Databricks Delta (open-source version).
> I think these libraries can address the probl
coalesce and writing out to the
files is very large, then the issue is coalesce. Otherwise the issue is the
chain of transformations before coalesce.
Anyway, it's 2021, and I always get confused when people use RDDs. Any
particular reason why dataframes would not work?
Regards,
Gourav Sengupt
Hi Ivan,
sorry but it always helps to know the version of SPARK you are using, its
environment, and the format that you are writing out your files to, and any
other details if possible.
Regards,
Gourav Sengupta
On Wed, Feb 24, 2021 at 3:43 PM Ivan Petrov wrote:
> Hi, I'm trying to
Hi,
Are you using structured streaming, what is the spark version and Kafka
version, and where are you fetching the data from?
Semantically speaking, if your data in Kafka represents an action to be
performed then it should actually be in a queue like rabbitmq or SQS. If it is
simply data then it shou
Hi,
it would help a lot if you could at least format the message before asking
people to go through it. Also I am pretty sure that the error is mentioned
in the first line itself.
Any ideas regarding the SPARK version, and environment that you are using?
Thanks and Regards,
Gourav Sengupta
property: spark.sql.files.maxRecordsPerFile; unless there is skew in the
data, things will work out fine.
Regards,
Gourav Sengupta
On Mon, Mar 8, 2021 at 4:01 PM m li wrote:
> Hi Ivan,
>
>
>
> If the error you are referring to is that the data is out of order, it may
> be that
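For reference, a hedged sketch of setting the property mentioned above
(PySpark; the value and output path are only illustrative):

    # cap the number of records written per output file
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000000)
    df.write.mode("overwrite").parquet("/output/path")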
point.
Regards,
Gourav Sengupta
On Tue, Apr 6, 2021 at 7:46 PM Tzahi File wrote:
> Hi All,
>
> We have a spark cluster on aws ec2 that has 60 X i3.4xlarge.
>
> The spark job running on that cluster reads from an S3 bucket and writes
> to that bucket.
>
> the bucket and
Hi,
looks like you have answered some questions which I generally ask. Another
thing, can you please let me know the environment? Is it AWS, GCP, Azure,
Databricks, HDP, etc?
Regards,
Gourav
On Sun, Apr 11, 2021 at 8:39 AM András Kolbert
wrote:
> Hi,
>
> Sure!
>
> Application:
> - Spark versio
Hi,
completely agree with Hao. In case you are using YARN, please see the EMR
documentation on how to enable GPU as a resource in YARN before trying to use
it in SPARK.
This is one of the most exciting features of SPARK 3, and you can reap huge
benefits out of it :)
Regards,
Gourav Sengupta
On
advance for all your kind help.
Regards,
Gourav Sengupta
,
Gourav Sengupta
-- Forwarded message -
From: Gourav Sengupta
Date: Wed, Apr 21, 2021 at 10:06 AM
Subject: Graceful shutdown SPARK Structured Streaming
To:
Dear friends,
is there any documentation available for gracefully stopping SPARK
Structured Streaming in 3.1.x?
I am
Hi Mich,
thanks a ton for your kind response; it looks like we are still using the
earlier methodologies for stopping a spark streaming program gracefully.
Regards,
Gourav Sengupta
On Wed, May 5, 2021 at 6:04 PM Mich Talebzadeh
wrote:
>
> Hi,
>
>
> I believe I discussed this i
Hi,
once again, let's start with the requirement. Why are you trying to pass XML
and JSON files to SPARK instead of reading them in SPARK?
Generally, when people pass on files they are Python or jar files.
Regards,
Gourav
On Sat, May 15, 2021 at 5:03 AM Amit Joshi
wrote:
> Hi KhajaAsmath,
>
> Cli
Hi Mithalee,
let's start with why: why are you using Kubernetes and not just EMR on EC2?
Do you have extremely bespoke library dependencies and requirements? Or
do your workloads fail in case the clusters do not scale up or down in a
few minutes?
Regards,
Gourav Sengupta
On Thu, May 20, 2021
,
Gourav Sengupta
Hi,
could not agree more with Molotch :)
Regards,
Gourav Sengupta
On Thu, May 27, 2021 at 7:08 PM Molotch wrote:
> You can specify the line separator to make spark split your records into
> separate rows.
>
> df = spark.read.option("lineSep","^^^").text("
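A completed sketch of the snippet quoted above, assuming a hypothetical
input path:

    # "^^^" is the custom record separator from the quoted example
    df = (spark.read
          .option("lineSep", "^^^")
          .text("/data/input.txt"))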
Hi Sean,
thank you so much for your kind response :)
Regards,
Gourav Sengupta
On Sat, Jun 5, 2021 at 8:00 PM Sean Owen wrote:
> All of these tools are reasonable choices. I don't think the Spark project
> itself has a view on what works best. These things do different things. Fo
I do see the following files there under
sparknlp_display folder:
> VERSION
> __init__.py
> __pycache__
> assertion.py
> dep_updates.py
> dependency_parser.py
> entity_resolution.py
> fonts
> label_colors
> ner.py
> re_updates.py
> relation_extraction.py
> retemp.py
> style.css
> style_utils.py
I will be grateful if someone could kindly let me know what I am doing
wrong here.
Regards,
Gourav Sengupta
Hi,
I think that reading Matei Zaharia's book "Spark: The Definitive Guide" will
be a good starting point.
Regards,
Gourav Sengupta
On Wed, Jun 30, 2021 at 3:47 PM Kartik Ohri wrote:
> Hi all!
>
> I am working on a Pyspark application and would like suggestio
nd nothing better than
to ride on the success of SPARK. But I may be wrong, and the SPARK community
may still be developing those integrations.
Regards,
Gourav Sengupta
On Fri, Jul 30, 2021 at 2:46 AM Artemis User wrote:
> Has anyone had any experience with running Spark-Rapids on a GPU-powered
>
. I am
sure we will all find the help that we seek, but the help will most likely
also come from those who are paid and supported by companies towards whom
you are being so unkind
Regards,
Gourav Sengupta
On Fri, Jul 30, 2021 at 4:02 PM Artemis User wrote:
> Thanks Gourav for the i
Hi Artemis,
please do not insult people here, and give your personal opinions as well.
Your comments are insulting to all big corporations which pay salaries and
provide platforms for a lot of people here.
Best of luck with your endeavors.
Regards,
Gourav Sengupta
Hi Andreas,
just to understand the question first, what is it you want to achieve by
breaking the map operations across the GPU and CPU?
Also it will be wonderful to understand the version of SPARK you are using,
and your GPU details a bit more.
Regards,
Gourav
On Sat, Jul 31, 2021 at 9:57 AM
executes the CPU
> task.
>
> Do you have any idea, if resource assignment based scheduling for
> functions is a planned feature for the future?
>
> Best
> Andreas
>
>
> On Sun, Aug 1, 2021 at 6:53 PM Gourav Sengupta
> wrote:
>
>> Hi Andreas,
>>
>&g
2.4
2. when in the data lake some partitions have parquet files written in
SPARK 2.4.x and some are in SPARK 3.1.x.
Please note that there are no changes in schema, but later on we might end
up adding or removing some columns.
I will be really grateful for your kind help on this.
Regards,
Gourav
Hi Saurabh,
a very big note of thanks from Gourav :)
Regards,
Gourav Sengupta
On Thu, Aug 12, 2021 at 4:16 PM Saurabh Gulati
wrote:
> We had issues with this migration mainly because of changes in spark date
> calendars. See
> <https://www.waitingforcode.com/apache-spark-sql/what
Hi,
why are you using add file for a JSON file? Can't you just read it as a
dataframe?
Regards,
Gourav Sengupta
On Fri, Aug 20, 2021 at 4:50 PM igyu wrote:
> in spark-shell
> I can run
>
> val url = "hdfs://nameservice1/user/jztwk/config.json"
> Spark.sparkContext.ad
making the transition to
SPARK 3.1.1 expensive, I think.
Regards,
Gourav Sengupta
Hi,
the query still gives the same error if we write "SELECT * FROM table_name
WHERE data_partition > CURRENT_DATE() - INTERVAL 10 DAYS".
Also the queries work fine in SPARK 3.0.x, or in EMR 6.2.0.
Thanks and Regards,
Gourav Sengupta
On Mon, Aug 23, 2021 at 1:16 PM Sean Owen w
Hi,
I received a response from AWS, this is an issue with EMR, and they are
working on resolving the issue I believe.
Thanks and Regards,
Gourav Sengupta
On Mon, Aug 23, 2021 at 1:35 PM Gourav Sengupta <
gourav.sengupta.develo...@gmail.com> wrote:
> Hi,
>
> the query still gives
Hi,
can you please give more details around this? What is the requirement? What
is the SPARK version you are using? What do you mean by multiple sources?
What are these sources?
Regards,
Gourav Sengupta
On Wed, Aug 25, 2021 at 3:51 AM Artemis User wrote:
> Thanks Daniel. I guess you w
Hi Nicolas,
thanks a ton for your kind response, I will surely try this out.
Regards,
Gourav Sengupta
On Sun, Aug 29, 2021 at 11:01 PM Nicolas Paris
wrote:
> as a workaround turn off pruning :
>
> spark.sql.hive.metastorePartitionPruning false
> spark.sql.hive.convertMetastoreP
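A hedged sketch of applying that workaround at session build time, assuming
the second, truncated property is spark.sql.hive.convertMetastoreParquet:

    from pyspark.sql import SparkSession

    # disable metastore partition pruning and metastore Parquet conversion
    spark = (SparkSession.builder
             .config("spark.sql.hive.metastorePartitionPruning", "false")
             .config("spark.sql.hive.convertMetastoreParquet", "false")
             .getOrCreate())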
Hi Holden,
This is such a wonderful opportunity. Sadly when I click on the link it
says event not found.
Regards,
Gourav
On Tue, Sep 14, 2021 at 12:13 AM Holden Karau wrote:
> Hi Folks,
>
> I'm going to experiment with a drop-in virtual half-hour office hour type
> thing next Monday, if you've
memory
objects.
Regards,
Gourav Sengupta
On Wed, Nov 3, 2021 at 10:09 PM Sergey Ivanychev
wrote:
> I want to further clarify the use case I have: an ML engineer collects
> data so as to use it for training an ML model. The driver is created within
> Jupyter notebook and has 64G of ram f
Hi,
did you get to read the excerpts from the book of Dr. Zaharia?
Regards,
Gourav
On Thu, Nov 4, 2021 at 4:11 PM Sergey Ivanychev
wrote:
> I’m sure that its running in client mode. I don’t want to have the same
> amount of RAM on drivers and executors since there’s no point in giving 64G
> of
Hi,
can you please try file://?
If you are using a cluster try to ensure that the location you mention is
accessible across all the executors.
Regards,
Gourav Sengupta
On Fri, Nov 5, 2021 at 4:16 AM Lynx Du wrote:
> Hi experts,
>
> I am just getting started using spark and scala.
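A small PySpark sketch of the file:// suggestion above, with a hypothetical
path; the file must exist at the same location on the driver and on every
executor:

    # local filesystem path, prefixed with file://
    df = spark.read.text("file:///opt/data/input.txt")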
Hi Martin,
just to confirm, you are taking the output of SPARKNLP, and then trying to
feed it to SPARK ML for running algorithms on the output of the NER generated
by SPARKNLP, right?
Regards,
Gourav Sengupta
On Thu, Nov 11, 2021 at 8:00 AM wrote:
> Hi Sean,
>
> Apologies for the dela
answer Sean's question, explaining what
you are trying to achieve and how, always helps.
Regards,
Gourav Sengupta
On Thu, Nov 11, 2021 at 11:03 AM Martin Wunderlich
wrote:
> Hi Gourav,
>
> Mostly correct. The output of SparkNLP here is a trained
> pipeline/model/transformer.
Hi Sergey,
Please read the excerpts from the book of Dr. Zaharia that I had sent; they
explain these fundamentals clearly.
Regards,
Gourav Sengupta
On Thu, Nov 11, 2021 at 9:40 PM Sergey Ivanychev
wrote:
> Yes, in fact those are the settings that cause this behaviour. If set to
>
Hi,
Sorry
Regards,
Gourav Sengupta
On Fri, Nov 12, 2021 at 6:48 AM Sergey Ivanychev
wrote:
> Hi Gourav,
>
> Please, read my question thoroughly. My problem is with the plan of the
> execution and with the fact that toPandas collects all the data not on the
> driver but on a
Hi Anil,
I generally create an anaconda environment, install pyspark in it, and then
configure the interpreter to point to that particular environment.
I have never faced an issue with this approach.
Regards,
Gourav Sengupta
On Wed, Nov 17, 2021 at 7:39 AM Anil Kulkarni wrote:
> Hi Sp
Hi Srikanth,
what is the spark version that you are using?
Can you tell us the data dictionary and the PK? Also if possible the data
volumes that you are dealing with?
Thanks and Regards,
Gourav Sengupta
On Thu, Dec 2, 2021 at 4:33 PM Shrikanth J R
wrote:
> Hi,
>
> I am facing an i
Hi,
also building entire environments in containers may increase their sizes
massively.
Regards,
Gourav Sengupta
On Sat, Dec 4, 2021 at 7:52 AM Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:
> Hi Mich,
>
>
>
> sure thats possible. But distributing the compl
Hi,
I am pretty sure that Sean already answered the question. Also, I do not
think that creating iterative table definitions or dataframe definitions
is best practice.
Regards,
Gourav
On Mon, Dec 13, 2021 at 4:00 PM Sean Owen wrote:
> ... but the error is not "because that already exists". See
Hi,
also if you are using SPARK 3.2.x please try to see the documentation on
handling skew using SPARK settings.
Regards,
Gourav Sengupta
On Tue, Dec 14, 2021 at 6:01 PM David Diebold
wrote:
> Hello all,
>
> I was wondering if it possible to encounter out of memory exceptions o
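For reference, a hedged sketch of the Spark 3.2 skew-handling settings
referred to above (the values are illustrative; defaults vary by version):

    # adaptive query execution settings related to skewed joins
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")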
steps. For
example, please look at the SPARK UI to see how timings are calculated in
distributed computing mode; there are several well-written papers on this.
Thanks and Regards,
Gourav Sengupta
On Thu, Dec 23, 2021 at 10:57 AM wrote:
> hello community,
>
> In pyspark how can I me
Hi,
just trying to understand:
1. Are you using JDBC to consume data from HIVE?
2. Or are you reading data directly from S3 and just using HIVE Metastore
in SPARK just to find out where the table is stored and its metadata?
Regards,
Gourav Sengupta
On Thu, Dec 23, 2021 at 2:13 PM Arthur Li
.
Then perhaps based on that you may want to look at different options.
Regards,
Gourav Sengupta
On Fri, Dec 24, 2021 at 10:42 AM wrote:
> As you see below:
>
> $ pip install sparkmeasure
> Collecting sparkmeasure
>Using cached
>
> https://files.pythonhost
.
Regards,
Gourav Sengupta
On Fri, Dec 24, 2021 at 2:04 AM wrote:
> Hello
>
> Is it possible to know a dataframe's total storage size in bytes? such
> as:
>
> >>> df.size()
> Traceback (most recent call last):
>File "", line 1, in
>File
Hi,
out of sheer and utter curiosity, why JAVA?
Regards,
Gourav Sengupta
On Thu, Dec 23, 2021 at 5:10 PM sam smith
wrote:
> Hi Andrew,
>
> Thanks, here's the Github repo to the code and the publication :
> https://github.com/SamSmithDevs10/paperReplicationForReview
>
>
Hi,
also please ensure that you have read all the required documentation to
understand whether you need to do any metadata migration or not.
Regards,
Gourav Sengupta
On Sun, Dec 19, 2021 at 11:55 AM Alex Ott wrote:
> Make sure that you're using compatible version of Delta Lake libr
Hi,
please note that using SQL is much more performant, and it is easier to manage
these kinds of issues there. You might want to look at the SPARK UI to see the
advantage of using SQL over the dataframes API.
Regards,
Gourav Sengupta
On Sat, Dec 18, 2021 at 5:40 PM Andrew Davidson
wrote:
> Thanks Nicho
visible.
Regards,
Gourav Sengupta
On Fri, Dec 24, 2021 at 2:48 PM Sean Owen wrote:
> Nah, it's going to translate to the same plan as the equivalent SQL.
>
> On Fri, Dec 24, 2021, 5:09 AM Gourav Sengupta
> wrote:
>
>> Hi,
>>
>> please note that using SQL
entire processing in one single go.
Can you please write down the end-to-end SQL and share it without the 16000
iterations?
Regards,
Gourav Sengupta
On Fri, Dec 24, 2021 at 5:16 PM Andrew Davidson wrote:
> Hi Sean and Gourav
>
>
>
> Thanks for the suggestions. I thought that b
dataframes may apply and RDDs
are used, but for UDFs I prefer SQL as well, though that may be a
personal idiosyncrasy. The O'Reilly book on data algorithms using SPARK and
pyspark uses the dataframes and RDD APIs :)
Regards,
Gourav Sengupta
On Fri, Dec 24, 2021 at 6:11 PM Sean Owen wrote:
> On Fri, Dec 24, 2021, 4:54 AM Gourav Sengupta
> wrote:
>
>> Hi,
>>
>> This question, once again like the last one, does not make much sense at
>> all. Where are you trying to store the data frame, and how?
>>
>> Are you just trying to write a blog, as
Hi Andrew,
Any chance you might give Databricks a try on GCP?
The above transformations look complicated to me; why are you adding
dataframes to a list?
Regards,
Gourav Sengupta
On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson
wrote:
> Hi
>
>
>
> I am having trouble debuggin
.rdd.getNumPartitions()
10
Please do refer to the following page for adaptive SQL execution in SPARK
3; it will be of massive help, particularly in case you are handling skewed
joins: https://spark.apache.org/docs/latest
Hi,
I am not sure at all that we need to use SQLContext and HiveContext
anymore.
Can you please check your JAVA_HOME and SPARK_HOME? I use the findspark
library to set all the spark-related environment variables for me, or use
conda to install pyspark from conda-forge.
Regards,
Gourav Sengupta
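A minimal sketch of the findspark approach mentioned above, assuming
findspark is installed and SPARK_HOME is set or discoverable:

    import findspark
    findspark.init()  # locates SPARK_HOME and adds pyspark to sys.path

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()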
-ref-datatypes.html. Parquet is
definitely a columnar format, and if I am not entirely wrong, it
supports columnar reading of data by default in SPARK.
Regards,
Gourav Sengupta
On Sun, Jan 9, 2022 at 2:34 PM weoccc wrote:
> Hi ,
>
> I want to store binary data (such as images)
Hi,
I am a bit confused here; it is not entirely clear to me why you are
creating the row numbers, and how creating the row numbers helps you with
the joins.
Can you please explain with some sample data?
Regards,
Gourav
On Fri, Jan 7, 2022 at 1:14 AM Andrew Davidson
wrote:
> Hi
>
>
>
> I am
start = i * numRows
>
> end = start + numRows
>
> print("\ni:{} start:{} end:{}".format(i, start, end))
>
> df = trainDF.iloc[start:end]
>
>
>
> There does not seem to be an easy way to do this.
>
>
> https://spark.apache.org/docs/lates
dataframe
in each iteration to understand the effect of your loops on the explain
plan - that should give some details.
Regards,
Gourav Sengupta
On Mon, Jan 10, 2022 at 10:49 PM Ramesh Natarajan
wrote:
> I want to compute cume_dist on a bunch of columns in a spark dataframe,
> but want to
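A hedged sketch of inspecting the plan of the accumulated dataframe inside
a loop, as suggested above (the columns, window, and starting dataframe df
are hypothetical):

    from pyspark.sql import functions as F, Window

    w = Window.orderBy("value")
    for col_name in ["a", "b", "c"]:
        df = df.withColumn(col_name + "_cume_dist", F.cume_dist().over(w))
        df.explain(mode="formatted")  # Spark 3.x; use df.explain(True) on older versions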
Hi,
it may be that I have less time, but can you please add some inline comments in
your code to explain what you are trying to do?
Regards,
Gourav Sengupta
On Tue, Jan 11, 2022 at 5:29 PM Alana Young wrote:
> I am experimenting with creating and persisting ML pipelines using custom
> transf
tasks to take care of
memory.
We do not have any other data regarding your clusters or environments;
therefore it is difficult to imagine things and provide more information.
Regards,
Gourav Sengupta
On Thu, Jan 27, 2022 at 12:58 PM Aki Riisiö wrote:
> Ah, sorry for spamming, I found the ans
Hi Amit,
before answering your question, I am just trying to understand it.
I am not exactly clear how the Akka application, Kafka, and the SPARK
Streaming application sit together, or what exactly you are trying to
achieve.
Can you please elaborate?
Regards,
Gourav
On Fri, Jan 28, 2022 at 10:
warnings in spark-shell using the
Logger.getLogger("akka").setLevel(Level.OFF) in case I have not completely
forgotten. Other details are mentioned here:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.setLogLevel.html
Regards,
Gourav Sengupta
On Fri, Ja
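For completeness, a minimal sketch of the PySpark equivalent of the log4j
call mentioned above:

    # reduce logging noise; accepts levels such as WARN, ERROR, OFF
    spark.sparkContext.setLogLevel("WARN")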
third option, which is akin to the second option that Mich was
mentioning, and that is basically a database transaction log, which gets
very large and very expensive to store and query over a period of time. Are
you creating a database transaction log?
Thanks and Regards,
Gourav Sengupta
On Thu, Jan 27
read the difference between repartition
and coalesce before making any kind of assumptions.
Regards,
Gourav Sengupta
On Sun, Jan 30, 2022 at 8:52 AM Sebastian Piu
wrote:
> It's probably the repartitioning and deserialising the df that you are
> seeing take time. Try doing this
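A short sketch of the difference being referred to above (the partition
counts are illustrative):

    # coalesce() only merges existing partitions and avoids a full shuffle;
    # repartition() always triggers a full shuffle and can increase the count
    df_fewer = df.coalesce(10)
    df_reshuffled = df.repartition(200)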
e not actually solving the problem and just addressing the issue.
Regards,
Gourav Sengupta
On Wed, Jan 26, 2022 at 4:07 PM Sean Owen wrote:
> Really depends on what your UDF is doing. You could read 2GB of XML into
> much more than that as a DOM representation in memory.
> Remember 15
data of the filters first.
Regards,
Gourav Sengupta
On Mon, Jan 31, 2022 at 8:00 AM Benjamin Du wrote:
> I don't think coalesce (by repartitioning I assume you mean coalesce)
> itself and deserialising takes that much time. To add a little bit more
> context, the computation of
nsert records multiple times in a table, and still
have different values?
I think without knowing the requirements all the above responses, like
everything else where solutions are reached before understanding the
problem, have high chances of being wrong.
Regards,
Gourav Sengupta
On Mon, Feb 7, 20
are trying to achieve by the rankings?
Regards,
Gourav Sengupta
On Tue, Feb 8, 2022 at 4:22 AM ayan guha wrote:
> For this req you can rank or dense rank.
>
> On Tue, 8 Feb 2022 at 1:12 pm, wrote:
>
>> Hello,
>>
>> For this query:
>>
>> >>&
Hi,
so do you want to rank apple and tomato both as 2? Not quite clear on the
use case here though.
Regards,
Gourav Sengupta
On Tue, Feb 8, 2022 at 7:10 AM wrote:
>
> Hello Gourav
>
>
> As you see here orderBy has already given the solution for "equal
&
Hi,
just so that we understand the problem first:
What is the source data (is it JSON, CSV, Parquet, etc)? Where are you
reading it from (JDBC, file, etc)? What is the compression format (GZ,
BZIP, etc)? What is the SPARK version that you are using?
Thanks and Regards,
Gourav Sengupta
On Fri
there are different ways to manage that
depending on the SPARK version.
Thanks and Regards,
Gourav Sengupta
On Fri, Feb 11, 2022 at 11:09 AM frakass wrote:
> Hello list
>
> I have imported the data into spark and I found there is disk IO in
> every node. The memory didn't get
Hi Anna,
Avro libraries should be built into SPARK, in case I am not wrong. Any
particular reason why you are using a deprecated or soon-to-be-deprecated
version of SPARK?
SPARK 3.2.1 is fantastic.
Please do let us know about your setup if possible.
Regards,
Gourav Sengupta
On Thu, Feb 10
eading its settings.
Regards,
Gourav Sengupta
On Fri, Feb 11, 2022 at 6:00 PM Adam Binford wrote:
> Writing to Delta might not support the write.option method. We set
> spark.hadoop.parquet.block.size in our spark config for writing to Delta.
>
> Adam
>
> On Fri, Feb 11, 2022
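A hedged sketch of setting that Hadoop-level property in the Spark
configuration, as Adam describes (the value is illustrative):

    from pyspark.sql import SparkSession

    # the spark.hadoop. prefix forwards the setting to the Hadoop/Parquet layer
    spark = (SparkSession.builder
             .config("spark.hadoop.parquet.block.size", str(128 * 1024 * 1024))
             .getOrCreate())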
hi,
Did you try sorting while writing out the data? All of this engineering
may not be required in that case.
Regards,
Gourav Sengupta
On Sat, Feb 12, 2022 at 8:42 PM Chris Coutinho
wrote:
> Setting the option in the cluster configuration solved the issue, and now
> we'
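A minimal sketch of the sort-while-writing suggestion above (the column and
output path are hypothetical):

    # sort rows within each partition before writing, avoiding a global sort
    (df.sortWithinPartitions("event_time")
       .write.mode("overwrite")
       .parquet("/output/sorted"))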