Hi Kushagra,
I believe you are referring to this warning below
WARN window.WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation.
I don't know an easy way around it. If the operation is only done once, you may be able to live with it.
The generation of row_number() has to be performed through a window call, and I don't think there is any way around it without orderBy():
df1 = df1.select(F.row_number().over(Window.partitionBy().orderBy(df1['amount_6m'])).alias("row_num"), "amount_6m")
The problem is that without a partitionBy() clause all the data is moved to a single partition, hence the warning.
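For context, a minimal self-contained sketch of that pattern, assuming a local SparkSession and a small made-up amount_6m column; this is only an illustration of row_number() over an un-partitioned window, not the original poster's code:

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("row_number_example").getOrCreate()

# Hypothetical data standing in for the real amount_6m column
df1 = spark.createDataFrame([(100.0,), (250.5,), (75.2,)], ["amount_6m"])

# No partitionBy() columns: the whole dataset is shuffled into one partition,
# which is exactly what the WindowExec warning complains about.
w = Window.partitionBy().orderBy(df1["amount_6m"])
df1 = df1.select(F.row_number().over(w).alias("row_num"), "amount_6m")
df1.show()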
Hi Kushagra
I still think this is a bad idea. By definition, data in a DataFrame or RDD is unordered; you are imposing an order where there is none, and if it works it will be by chance. For example, a simple repartition may disrupt the row ordering. It is just too unpredictable.
I would suggest y
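(The suggestion above is cut off in the archive.) Purely as an illustration of why the ordering is unreliable, and not as the original poster's recommendation, IDs derived from the physical layout change as soon as the data is repartitioned:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("ordering_is_not_stable").getOrCreate()
df = spark.range(10)

# The generated ids encode the current partition layout ...
df.withColumn("id1", F.monotonically_increasing_id()).show()

# ... so after a repartition the same rows can receive completely different
# ids, i.e. any row order inferred from them was never guaranteed.
df.repartition(4).withColumn("id2", F.monotonically_increasing_id()).show()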
Hello,
I have got the following first testing setup:
Kubernetes cluster 1.20 (4 nodes, each node with 120 GB hard disk, 4 CPUs, 40 GB memory)
Spark installation via the Bitnami Helm chart
https://artifacthub.io/packages/helm/bitnami/spark (chart version 5.4.2 / Spark 3.1.1)
using GeoSpark version
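For reference, a minimal sketch of the Bitnami-based install itself, assuming the standard Bitnami chart repository; the release name my-spark is just a placeholder, not taken from the original setup:

# Add the Bitnami chart repository and install the Spark chart at the
# version mentioned above (5.4.2 / Spark 3.1.1).
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install my-spark bitnami/spark --version 5.4.2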
Hi experts:
I tried the example as shown on this page, and it is not working for me:
https://spark-packages.org/package/graphframes/graphframes
Please advise how to proceed. I also tried to unzip the zip file, ran 'sbt
assembly', and got an error of 'sbt-spark-package;0.2.6: not found'. Is there
I think it's because the bintray repo has gone away. Did you see the recent
email about the new repo for these packages?
On Wed, May 19, 2021 at 12:42 PM Wensheng Deng
wrote:
> Hi experts:
>
> I tried the example as shown on this page, and it is not working for me:
> https://spark-packages.org/p
Thanks Sean. You are right! Yes, it works when replacing the bintray repo with repos.spark-packages.org.
On Wednesday, May 19, 2021, 02:03:14 PM EDT, Sean Owen
wrote:
I think it's because the bintray repo has gone away. Did you see the recent
email about the new repo for these pack
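For anyone hitting the same error, a hedged sketch of what the fix looks like in practice; the exact graphframes coordinate below is an assumption and should be matched to your Spark and Scala versions:

# Point --repositories at repos.spark-packages.org instead of the retired
# Bintray repository; the package coordinate is illustrative only.
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
        --repositories https://repos.spark-packages.org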
Hello all,
I'm hoping someone can give me some direction for troubleshooting this issue. I'm trying to write from Spark on a HortonWorks (Cloudera) HDP cluster. I ssh directly to the first datanode and run PySpark with the following command; however, it always fails no matter what size I set.
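Exit code 143 corresponds to a SIGTERM (128 + 15); on YARN it usually means the container was killed, most often for exceeding its memory allowance. A hedged sketch of the settings typically tuned in that situation, with placeholder values rather than recommendations for this cluster:

# Raise executor memory and, importantly, the off-heap overhead that counts
# towards the YARN container limit; the values below are illustrative only.
pyspark --master yarn \
        --executor-memory 8g \
        --conf spark.executor.memoryOverhead=2g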
Hi Clay,
Those parameters you are passing are not valid
pyspark --conf queue=default --conf executory-memory=24G
Python 3.7.3 (default, Apr 3 2021, 20:42:31)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Warning: Ignoring
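For contrast, a sketch of what the intended settings would normally look like on YARN; the 24G figure is carried over from the original command, not a recommendation:

# "queue" and "executory-memory" are not recognised Spark settings; the usual
# spellings are the --queue / --executor-memory options or their conf keys.
pyspark --master yarn --queue default --executor-memory 24G

# Equivalent form using --conf keys:
pyspark --master yarn \
        --conf spark.yarn.queue=default \
        --conf spark.executor.memory=24g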
How so?
From: Mich Talebzadeh
Sent: Wednesday, May 19, 2021 5:45 PM
To: Clay McDonald
Cc: user@spark.apache.org
Subject: Re: PySpark Write File Container exited with a non-zero exit code 143
*** EXTERNAL EMAIL ***
Hi Clay,
Those parameters you are passing are not valid
pyspark --conf qu
Hi -- notice the additional "y" in "executory-memory" below (as Mich mentioned):
pyspark --conf queue=default --conf executory-memory=24G
On Thu, May 20, 2021 at 12:02 PM Clay McDonald <
stuart.mcdon...@bateswhite.com> wrote:
> How so?
>
>
>
> *From:* Mich Talebzadeh
> *Sent:* Wednesday, May 19, 2021 5:45 PM
> *To:*