() function
takes a numPartitions column. What other options can I explore?
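For reference, a minimal sketch of the repartition-before-aggregation idea
(column names and sizes are hypothetical, not the real job):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per (gene, value) measurement.
df = spark.createDataFrame(
    [("g1", 1.0), ("g1", 2.0), ("g2", 3.0)], ["gene", "value"])

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return v.mean()

# More shuffle partitions (or an explicit repartition on the grouping key)
# means fewer rows per task, which can help avoid PyArrow memory errors
# in the grouped aggregation.
spark.conf.set("spark.sql.shuffle.partitions", "400")
result = df.repartition(400, "gene").groupBy("gene").agg(mean_udf(df["value"]))
result.show()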
--gautham
-----Original Message-----
From: ZHANG Wei
Sent: Thursday, May 7, 2020 1:34 AM
To: Gautham Acharya
Cc: user@spark.apache.org
Subject: Re: PyArrow Exception in Pandas UDF GROUPEDAGG()
and 64GB of memory each.
I've attached the full error log here as well. What workarounds can I try to
get this job running? Unfortunately, we are coming up on a production release
and this is becoming a severe blocker.
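For context, these are the kinds of knobs I'm aware of (values are purely
illustrative for the 64GB nodes, not what we run in production):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Leave headroom on the 64GB nodes for the Python workers.
    .config("spark.executor.memory", "40g")
    # PyArrow/pandas allocate outside the JVM heap, so raise the overhead.
    .config("spark.executor.memoryOverhead", "8g")
    # More shuffle partitions -> fewer rows per grouped-aggregation task.
    .config("spark.sql.shuffle.partitions", "2000")
    .getOrCreate()
)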
Thanks,
Gautham
I have a process in Apache Spark that writes HFiles to S3 in batches. I want
the resulting HFiles in the same directory, as they belong to the same column
family. However, I'm getting a 'directory already exists' error when I try to
run this on AWS EMR. How can I write HFiles v
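To frame the question, here is a rough sketch of the workaround I'm
considering (bucket and paths are hypothetical): write each batch's HFiles to
its own staging directory, then move the files into the shared column-family
directory using the Hadoop FileSystem API through Spark's JVM gateway:

import uuid
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

staging = "s3://my-bucket/hfiles/_staging/{}".format(uuid.uuid4())
final_dir = "s3://my-bucket/hfiles/cf1"  # hypothetical column-family directory

# ... write this batch's HFiles to `staging` here ...

# Hadoop's FileOutputFormat refuses to write into an existing directory,
# so each batch lands in its own staging path and is then moved over.
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = jvm.org.apache.hadoop.fs.Path
fs = Path(final_dir).getFileSystem(hadoop_conf)
for status in fs.listStatus(Path(staging)):
    src = status.getPath()
    fs.rename(src, Path(final_dir + "/" + src.getName()))

On S3 the rename is a copy-and-delete rather than an atomic move, so this is
only a sketch, not a drop-in solution.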
OpenMP than for Spark.
On Wed, Jul 17, 2019 at 3:42 PM Gautham Acharya
<gauth...@alleninstitute.org> wrote:
As I said in my initial message, precomputing is not an option.
Retrieving only the top/bottom N most correlated is an option – would that
speed up the results?
Our SLAs are soft – slight variations (±15 seconds) will not cause issues.
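To make the top/bottom-N idea concrete, a small NumPy sketch of what the
per-row retrieval could look like once a correlation row is in hand (names
and sizes are made up):

import numpy as np

# Hypothetical: one row of the correlation matrix, i.e. one feature's
# correlation against all 50,000 others.
corr_row = np.random.rand(50000)

N = 100
# argpartition selects the N largest values in O(n) without a full sort.
top_idx = np.argpartition(-corr_row, N)[:N]
top_idx = top_idx[np.argsort(-corr_row[top_idx])]  # order the selected N
print(top_idx[:10], corr_row[top_idx][:10])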
--gautham
From: Patrick McCarthy [mailto:pmccar
easily perform this computation in
Spark?
--gautham
From: Bobby Evans [mailto:reva...@gmail.com]
Sent: Wednesday, July 17, 2019 7:06 AM
To: Steven Stetzler
Cc: Gautham Acharya ; user@spark.apache.org
Subject: Re: [Beginner] Run compute on large matrices and return the result in
seconds?
Ping? I would really appreciate advice on this! Thank you!
From: Gautham Acharya
Sent: Tuesday, July 9, 2019 4:22 PM
To: user@spark.apache.org
Subject: [Beginner] Run compute on large matrices and return the result in
seconds?
This is my first email to this mailing list, so I apologize if I
base for a different scenario (random access of rows/columns). I've
naively tried loading a dataframe from the CSV using a Spark instance hosted on
AWS EMR, but getting the results for even a single correlation takes over 20
seconds.
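A minimal sketch of the kind of full-matrix correlation call I mean (toy data
standing in for the real CSV):

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real matrix: one vector of features per row.
rows = [(Vectors.dense([1.0, 0.0, 3.0]),),
        (Vectors.dense([4.0, 5.0, 6.0]),),
        (Vectors.dense([7.0, 8.0, 9.0]),)]
df = spark.createDataFrame(rows, ["features"])

# One distributed job computes the whole Pearson correlation matrix,
# instead of paying job-launch overhead for every pairwise correlation.
corr = Correlation.corr(df, "features").head()[0]
print(corr.toArray())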
Thank you!
--gautham
Operations on new_rdd should utilize all the cores in your cluster.
>
> Nick
>
>
> On Wed Dec 17 2014 at 1:42:16 AM Sun, Rui wrote:
>>
>> Gautham,
>>
>> How many gz files do you have? Maybe the reason is that gz files are
>> compressed and can't be split
I am having an issue with pyspark launched in ec2 (using spark-ec2) with 5
r3.4xlarge machines where each has 32 threads and 240GB of RAM. When I do
sc.textFile to load data from a number of gz files, it does not progress as
fast as expected. When I log in to a child node and run top, I see only 4
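Since gzip is not splittable, each .gz file is read by a single task; a
minimal sketch of the load-then-repartition pattern mentioned above (paths
are hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName="gz-load")  # hypothetical app name

raw = sc.textFile("s3://my-bucket/logs/*.gz")  # each .gz becomes one partition
# Repartition right after the load so downstream work uses all cores.
new_rdd = raw.repartition(sc.defaultParallelism * 3)
print(new_rdd.count())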