Hi Pedro,

> Anyway, maybe the behavior is weird; I would expect repartitioning to
> zero partitions to be disallowed, or at least warned about, instead of just
> discarding all the data.

Interesting...

scala> spark.version
res3: String = 3.1.2

scala> spark.range(5).repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of
partitions (0) must be positive.
  at scala.Predef$.require(Predef.scala:281)
  at
org.apache.spark.sql.catalyst.plans.logical.Repartition.<init>(basicLogicalOperators.scala:1032)
  at org.apache.spark.sql.Dataset.repartition(Dataset.scala:3016)
  ... 47 elided

How are the above different from yours?
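
For comparison, the RDD API doesn't seem to guard against zero at all: HashPartitioner only rejects negative partition counts, so aggregateByKey with numPartitions = 0 produces an RDD with zero partitions and the result is silently empty. A minimal, untested sketch (the class name and sample values are mine, just for illustration):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ZeroPartitionsSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setMaster("local[1]").setAppName("zero-partitions-sketch"));

    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
        Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

    // numPartitions = 0 is accepted (HashPartitioner only requires >= 0),
    // so the aggregation yields an RDD with zero partitions.
    JavaPairRDD<String, Integer> aggregated =
        pairs.aggregateByKey(0, 0, Integer::sum, Integer::sum);

    System.out.println(aggregated.collect()); // expected: [] -- all data silently discarded

    sc.stop();
  }
}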

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski



On Thu, Aug 19, 2021 at 5:43 PM Pedro Tuero <tuerope...@gmail.com> wrote:

> Hi, I'm sorry, the problem was really silly: in the test the number of
> partitions was zero (it was the result of dividing the original number of
> partitions of the source RDD, and in the test that number was just one),
> and that's why the test was failing.
> Anyway, maybe the behavior is weird; I would expect repartitioning to
> zero partitions to be disallowed, or at least warned about, instead of just
> discarding all the data.
>
> Thanks for your time!
> Regards,
> Pedro
>
> On Thu, Aug 19, 2021 at 7:42 AM, Jacek Laskowski (ja...@japila.pl)
> wrote:
>
>> Hi Pedro,
>>
>> No idea what might be causing it. Do you perhaps have some code to
>> reproduce it locally?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
>>
>>
>>
>> On Tue, Aug 17, 2021 at 4:14 PM Pedro Tuero <tuerope...@gmail.com> wrote:
>>
>>>
>>> Context: spark-core_2.12-3.1.1
>>> Testing with Maven and Eclipse.
>>>
>>> I'm modifying a project and a test has stopped working as expected.
>>> The difference is in the parameters passed to the aggregateByKey
>>> function of JavaPairRDD.
>>>
>>> The JavaSparkContext is created this way:
>>>   new JavaSparkContext(new SparkConf()
>>>       .setMaster("local[1]")
>>>       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"));
>>> Then I construct a JavaPairRDD using sparkContext.parallelizePairs, call
>>> a method that performs an aggregateByKey over the input JavaPairRDD, and
>>> check that the result is the expected one.
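>>>
>>> A rough sketch of the test skeleton (methodUnderTest, expected, and the
>>> sample values below are just illustrative, assuming the usual Spark and
>>> JUnit imports):
>>>
>>>   JavaPairRDD<String, Long> input = sparkContext.parallelizePairs(
>>>       Arrays.asList(new Tuple2<>("a", 1L), new Tuple2<>("b", 2L)));
>>>   // the method under test internally calls aggregateByKey on the input
>>>   JavaPairRDD<String, Long> result = methodUnderTest(input);
>>>   assertEquals(expected, result.collectAsMap());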
>>>
>>> When I use the overload at JavaPairRDD line 369 (calling
>>> .aggregateByKey(zeroValue, combiner, merger)):
>>>
>>>   def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U],
>>>       combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
>>>     implicit val ctag: ClassTag[U] = fakeClassTag
>>>     fromRDD(rdd.aggregateByKey(zeroValue)(seqFunc, combFunc))
>>>   }
>>>
>>> The test works as expected.
>>> But when I use the overload at JavaPairRDD line 355 (calling
>>> .aggregateByKey(zeroValue, *partitions*, combiner, merger)):
>>>
>>>   def aggregateByKey[U](zeroValue: U, *numPartitions: Int*, seqFunc:
>>>       JFunction2[U, V, U],
>>>       combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
>>>     implicit val ctag: ClassTag[U] = fakeClassTag
>>>     fromRDD(rdd.aggregateByKey(zeroValue, *numPartitions*)(seqFunc, combFunc))
>>>   }
>>> The result is always empty. It looks like there is a problem with the
>>> HashPartitioner created in PairRDDFunctions:
>>>
>>>   def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(
>>>       seqOp: (U, V) => U,
>>>       combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
>>>     aggregateByKey(zeroValue, *new HashPartitioner(numPartitions)*)(seqOp, combOp)
>>>   }
>>>
>>> vs:
>>>
>>>   def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
>>>       combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
>>>     aggregateByKey(zeroValue, *defaultPartitioner*(self))(seqOp, combOp)
>>>   }
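>>>
>>> As a quick (untested) way to narrow it down, comparing the partition
>>> counts the two overloads produce should show which partitioner ends up
>>> being used (variable names as above):
>>>
>>>   System.out.println(input.aggregateByKey(zeroValue, combiner, merger)
>>>       .getNumPartitions());  // typically the parent RDD's count (defaultPartitioner)
>>>   System.out.println(input.aggregateByKey(zeroValue, partitions, combiner, merger)
>>>       .getNumPartitions());  // exactly `partitions` (new HashPartitioner(partitions))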
>>> I can't debug it properly with Eclipse, and the error occurs while the
>>> threads are in Spark code (the system editor can only open file-based
>>> resources).
>>>
>>> Does anyone know how to resolve this issue?
>>>
>>> Thanks in advance,
>>> Pedro.
>>>
