Re: Salting technique doubt

2022-07-31 Thread Jacob Lynn
The key is this line from Amit's email (emphasis added): > Change the join_col to *all possible values* of the salt. The two tables are treated asymmetrically: 1. The skewed table gets random salts appended to the join key. 2. The other table gets all possible salts appended to the join key (e.g
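The asymmetry described above can be sketched in plain Python, without Spark, to show why the salted join returns the same rows as the unsalted one. This is only an illustration under assumptions: the salt count `NUM_SALTS`, the helper names, and the toy tables are all hypothetical and do not come from the thread.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # hypothetical number of salt values


def salt_skewed(rows, num_salts, rng):
    """Skewed side: append ONE random salt to each row's join key."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_other(rows, num_salts):
    """Other side: replicate each row once for EVERY possible salt."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Inner join on the (key, salt) pair; the salt is dropped from the output."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key[0], lv, rv) for key, lv in left for rv in index.get(key, [])]


# Toy data (hypothetical): "hot" is the skewed key.
skewed = [("hot", i) for i in range(6)] + [("cold", 99)]
other = [("hot", "H"), ("cold", "C")]

rng = random.Random(0)
joined = hash_join(salt_skewed(skewed, NUM_SALTS, rng),
                   explode_other(other, NUM_SALTS))

# Reference result: the same join with a single constant salt, i.e. no salting.
plain = hash_join([((k, 0), v) for k, v in skewed],
                  [((k, 0), v) for k, v in other])
```

Every skewed row finds exactly one partner because the other side carries a copy for each salt it could have drawn. The cost is that the other table grows by a factor of `NUM_SALTS`, which is why it should be the smaller of the two.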

Re: PySpark cores

2022-07-29 Thread Jacob Lynn
I think you are looking for the spark.task.cpus configuration parameter. On Fri 29 Jul 2022 at 07:41, Andrew Melo wrote: > Hello, > > Is there a way to tell Spark that PySpark (arrow) functions use > multiple cores? If we have an executor with 8 cores, we would like to > have a single PySpark f
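As a rough sketch of what this setting does: Spark reserves `spark.task.cpus` cores for each task (the default is 1), so the number of tasks an executor runs concurrently is its core count divided by that value. The helper name below is hypothetical, and the 8-core figure is taken from the question.

```python
def task_slots_per_executor(executor_cores: int, task_cpus: int = 1) -> int:
    """Concurrent tasks per executor when each task reserves task_cpus cores."""
    if task_cpus < 1:
        raise ValueError("task_cpus must be at least 1")
    return executor_cores // task_cpus


# 8-core executor with the default spark.task.cpus=1: eight tasks at once.
default_slots = task_slots_per_executor(8)

# Same executor with spark.task.cpus=8: a single task owns all eight cores.
exclusive_slots = task_slots_per_executor(8, task_cpus=8)
```

The value can be passed at submit time, for example `--conf spark.task.cpus=8`. Note that it applies to every task in the application, not only to the stages that run the multi-core function.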

Re: RDD which was checkpointed is not checkpointed

2020-08-19 Thread Jacob Lynn
Hi Ivan, Unlike cache/persist, checkpoint does not operate in-place but requires the result to be assigned to a new variable. In your case: `val recordsRDD = convertToRecords(anotherRDD).checkpoint()` Best, Jacob On Wed 19 Aug 2020 at 14:39, Ivan Petrov wrote: > Hi! > Seems like I do smth wron

Re: Issue with UDF Int Conversion - Str to Int

2020-03-23 Thread Jacob Lynn
You are overflowing the integer type, which goes up to a max value of 2147483647 (2^31 - 1). Change the return type of `sha2Int2` to `LongType()` and it works as expected. On Mon, Mar 23, 2020 at 6:15 AM ayan guha wrote: > Hi > > I am trying to implement simple hashing/checksum logic. The key lo
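The range limits can be checked in plain Python. This is a sketch under assumptions: the body of `sha2Int2` is not shown in the thread, so `hash_to_int` and its `hex_digits` parameter are hypothetical stand-ins for "turn part of a SHA-2 hex digest into an integer".

```python
import hashlib

INT32_MAX = 2**31 - 1   # 2147483647, upper bound of IntegerType
INT64_MAX = 2**63 - 1   # upper bound of LongType


def fits_int32(value: int) -> bool:
    """True if value is representable as a signed 32-bit integer."""
    return -2**31 <= value <= INT32_MAX


def fits_int64(value: int) -> bool:
    """True if value is representable as a signed 64-bit integer."""
    return -2**63 <= value <= INT64_MAX


def hash_to_int(text: str, hex_digits: int) -> int:
    """Hypothetical stand-in: integer value of the first hex_digits of SHA-256."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return int(digest[:hex_digits], 16)


# 8 hex digits span 32 bits, so roughly half of all inputs exceed INT32_MAX.
value_32_bits = hash_to_int("abc", 8)

# 15 hex digits span 60 bits, which always fits in a signed 64-bit integer.
value_60_bits = hash_to_int("abc", 15)
```

Python integers are arbitrary precision, so the overflow only shows up when the value is converted to the declared Spark return type. That is why changing the declared type to `LongType()` is enough, as long as the hash is truncated to at most 63 bits.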

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-12 Thread Jacob Lynn
issue: https://issues.apache.org/jira/browse/SPARK-1239. On Mon, Nov 11, 2019 at 4:43 PM Vadim Semenov wrote: > There's an umbrella ticket for various 2GB limitations > https://issues.apache.org/jira/browse/SPARK-6235 > > On Fri, Nov 8, 2019 at 4:11 PM Jacob Lynn wrote: > > >

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Jacob Lynn
Sorry for the noise, folks! I understand that reducing the number of partitions works around the issue (at the scale I'm working at, anyway) -- as I mentioned in my initial email -- and I understand the root cause. I'm not looking for advice on how to resolve my issue. I'm just pointing out that th

Re: Driver OutOfMemoryError in MapOutputTracker$.serializeMapStatuses for 40 TB shuffle.

2019-11-08 Thread Jacob Lynn
File system is HDFS. Executors are 2 cores, 14GB RAM. But I don't think either of these relate to the problem -- this is a memory allocation issue on the driver side, and happens in an intermediate stage that has no HDFS read/write. On Fri, Nov 8, 2019 at 10:01 AM Spico Florin wrote: > Hi! > Wha