RE: PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-07 Thread Gautham Acharya
…() function takes a numPartitions column. What other options can I explore? --gautham -----Original Message----- From: ZHANG Wei Sent: Thursday, May 7, 2020 1:34 AM To: Gautham Acharya Cc: user@spark.apache.org Subject: Re: PyArrow Exception in Pandas UDF GROUPEDAGG()…
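The snippet cuts off the function name, but DataFrame.repartition() is the Spark method that accepts a partition count together with columns. A minimal sketch, not from the thread itself, of repartitioning on the grouping key ahead of a GROUPED_AGG pandas UDF; the column names and the count of 200 are hypothetical:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    # Each group's column arrives as a single pandas Series.
    return v.mean()

# df is assumed to be an existing DataFrame with group_id and value columns.
# repartition(numPartitions, col) hash-partitions on the key, so each task
# holds fewer groups and builds smaller Arrow buffers.
result = (df.repartition(200, "group_id")
            .groupBy("group_id")
            .agg(mean_udf(F.col("value")).alias("mean_value")))
```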

PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-05 Thread Gautham Acharya
…and 64GB of memory each. I've attached the full error log here as well. What are some workarounds that I can try to get this job running? Unfortunately, we are coming up on a production release and this is becoming a severe blocker. Thanks, Gautham
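One workaround in the spirit of the question, sketched here as an assumption rather than the thread's resolution: express the aggregation with built-in SQL functions, which never serialize whole groups through PyArrow. Column names are made up:

```python
from pyspark.sql import functions as F

# df is an existing DataFrame; built-in aggregates run in the JVM, so no
# per-group pandas/Arrow conversion takes place.
result = df.groupBy("group_id").agg(
    F.mean("value").alias("mean_value"),
    F.stddev("value").alias("stddev_value"),
)
```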

[PySpark] How to write HFiles as an 'append' to the same directory?

2020-03-14 Thread Gautham Acharya
I have a process in Apache Spark that writes HFiles to S3 in batches. I want the resulting HFiles in the same directory, since they belong to the same column family. However, I get a 'directory already exists' error when I try to run this on AWS EMR. How can I write HFiles…
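Hadoop's FileOutputFormat refuses to write into an existing directory, so one common pattern, sketched here as an assumption (the path layout and helper name are hypothetical), is to give each batch its own subdirectory under the column-family path:

```python
import uuid

def unique_batch_path(cf_base_path: str) -> str:
    """Return a fresh output directory for one batch, e.g.
    s3://bucket/hfiles/cf1/batch-<hex>/."""
    return f"{cf_base_path.rstrip('/')}/batch-{uuid.uuid4().hex}"

# Each batch writes to its own directory; the HFiles can then be bulk-loaded
# per batch, or moved into a single directory after the write succeeds.
output_path = unique_batch_path("s3://my-bucket/hfiles/cf1")
```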

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
…OpenMP than for Spark. On Wed, Jul 17, 2019 at 3:42 PM Gautham Acharya <gauth...@alleninstitute.org> wrote: As I said in my initial message, precomputing is not an option. Retrieving only the top/bottom N most correlated is an option – would that speed up the results? Our S…

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
As I said in my initial message, precomputing is not an option. Retrieving only the top/bottom N most correlated is an option – would that speed up the results? Our SLAs are soft – slight variations (±15 seconds) will not cause issues. --gautham From: Patrick McCarthy [mailto:pmccar…
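Retrieving only the top/bottom N means the full sorted correlation ranking never has to be materialized. A NumPy sketch of that idea, offered as one possible approach rather than the thread's conclusion (shapes and N are hypothetical):

```python
import numpy as np

def top_bottom_n_correlated(matrix: np.ndarray, row_idx: int, n: int = 50):
    """Pearson-correlate one row against all rows and return the indices of
    the n most and n least correlated rows, without a full sort."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    t = centered[row_idx]
    corr = centered @ t / (np.linalg.norm(centered, axis=1) * np.linalg.norm(t))
    top = np.argpartition(corr, -n)[-n:]    # n highest, in arbitrary order
    bottom = np.argpartition(corr, n)[:n]   # n lowest, in arbitrary order
    return top[np.argsort(corr[top])[::-1]], bottom[np.argsort(corr[bottom])]
```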

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Gautham Acharya
…easily perform this computation in Spark? --gautham From: Bobby Evans [mailto:reva...@gmail.com] Sent: Wednesday, July 17, 2019 7:06 AM To: Steven Stetzler Cc: Gautham Acharya; user@spark.apache.org Subject: Re: [Beginner] Run compute on large matrices and return the result in seconds?…
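For running the correlation inside Spark itself, pyspark.ml.stat.Correlation computes the full matrix from a vector column. A small sketch, assuming an existing SparkSession named spark, with toy data:

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

# Toy data: each row holds one observation as a dense vector.
data = [(Vectors.dense([1.0, 0.0, 3.0]),),
        (Vectors.dense([2.0, 1.0, 1.0]),),
        (Vectors.dense([4.0, 2.0, 0.0]),)]
df = spark.createDataFrame(data, ["features"])

# Returns a one-row DataFrame whose single cell is the Pearson correlation
# matrix, so this only scales while the matrix fits on the driver.
pearson_matrix = Correlation.corr(df, "features").head()[0]
```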

RE: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-11 Thread Gautham Acharya
Ping? I would really appreciate advice on this! Thank you! From: Gautham Acharya Sent: Tuesday, July 9, 2019 4:22 PM To: user@spark.apache.org Subject: [Beginner] Run compute on large matrices and return the result in seconds? This is my first email to this mailing list, so I apologize if I…

[Beginner] Run compute on large matrices and return the result in seconds?

2019-07-09 Thread Gautham Acharya
…base for a different scenario (random access of rows/columns). I've naively tried loading a dataframe from the CSV using a Spark instance hosted on AWS EMR, but getting the results for even a single correlation takes over 20 seconds. Thank you! --gautham
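A standard first mitigation for slow repeated reads, offered here as a general assumption rather than the list's answer: convert the CSV to Parquet once and cache the result, so later queries skip CSV parsing entirely. Paths are placeholders:

```python
# One-time conversion: Parquet is columnar, so later jobs read only the
# columns a query touches instead of re-parsing the whole CSV.
df = spark.read.csv("s3://my-bucket/matrix.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("s3://my-bucket/matrix.parquet")

# Later queries hit the Parquet copy and keep it pinned in executor memory.
matrix = spark.read.parquet("s3://my-bucket/matrix.parquet").cache()
```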

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2015-01-16 Thread Gautham Anil
…ns on new_rdd should utilize all the cores in your cluster. Nick. On Wed, Dec 17, 2014 at 1:42:16 AM, Sun, Rui wrote: Gautham, how many gz files do you have? Maybe the reason is that the gz files are compressed and can't be…
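The fix being quoted relies on the fact that gzip streams are not splittable, so each .gz file loads as exactly one partition; repartitioning afterwards spreads the records across all cores. A sketch with a placeholder path (new_rdd matches the name in the quoted reply):

```python
# One partition per .gz file, because gzip can't be split for parallel reads.
rdd = sc.textFile("s3://my-bucket/logs/*.gz")

# Redistribute across the cluster; work on new_rdd can then use every core.
new_rdd = rdd.repartition(sc.defaultParallelism)
```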

pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-09 Thread Gautham
I am having an issue with pyspark launched in EC2 (using spark-ec2) with 5 r3.4xlarge machines, each with 32 threads and 240GB of RAM. When I do sc.textFile to load data from a number of gz files, it does not progress as fast as expected. When I log in to a child node and run top, I see only 4…
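A quick diagnostic for the symptom described, assuming the standard explanation that gzip inputs aren't splittable: if the partition count equals the number of .gz files, only that many tasks can run at once, however many cores exist. The path is a placeholder:

```python
rdd = sc.textFile("s3://my-bucket/logs/*.gz")
# With non-splittable gzip input this prints the number of files, not cores.
print(rdd.getNumPartitions())
```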