Ah, I’m not saying println is bad; it’s just that you need to go to the right place to find its output. For example, you can check the stdout of any executor from the Web UI.
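For instance (a rough, untested sketch building on the snippet from earlier in this thread), a println inside the reduceByKey closure runs on the executors, so its output shows up under each executor's stdout link in the Web UI rather than in the driver console:

    d5.keyBy(_.split(" ")(0))
      .mapValues(_.split(" ")(1).toInt)
      .reduceByKey { (v1, v2) =>
        // In cluster mode this line lands in the executor's stdout, not the driver console.
        println("reducing " + v1 + " and " + v2)
        v1 + v2
      }
      .collect()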
On Fri, Apr 18, 2014 at 9:48 AM, 诺铁 <noty...@gmail.com> wrote:

> Hi Cheng,
>
> Thank you for letting me know this. So what do you think is a better way
> to debug?
>
>
> On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>> A tip: using println is only convenient when you are working in local
>> mode. When running Spark in cluster mode (standalone/YARN/Mesos), the
>> output of println goes to executor stdout.
>>
>>
>> On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 <noty...@gmail.com> wrote:
>>
>>> Yeah, I got it!
>>> Using println to debug is great for me to explore Spark.
>>> Thank you very much for your kind help.
>>>
>>>
>>> On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:
>>>
>>>> Here's a way to debug something like this:
>>>>
>>>> scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1, v2) => {
>>>>   println("v1: " + v1)
>>>>   println("v2: " + v2)
>>>>   (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString
>>>> }).collect
>>>>
>>>> You get:
>>>> v1: 1 2 3 4 5
>>>> v2: 1 2 3 4 5
>>>> v1: 4
>>>> v2: 1 2 3 4 5
>>>> java.lang.ArrayIndexOutOfBoundsException: 1
>>>>
>>>> reduceByKey() works much like regular Scala reduce(): it calls the
>>>> function on the first two values, then on the result of that and the
>>>> next value, then on the result of that and the next value, and so on.
>>>> First you add 2 + 2 and get 4. Then your function is called with
>>>> v1 = "4" and v2 set to the third line.
>>>>
>>>> What you could do instead:
>>>>
>>>> scala> d5.keyBy(_.split(" ")(0)).mapValues(_.split(" ")(1).toInt).reduceByKey((v1, v2) => v1 + v2).collect
>>>>
>>>>
>>>> On Thu, Apr 17, 2014 at 6:29 PM, 诺铁 <noty...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am new to Spark. When trying to write some simple tests in the
>>>>> Spark shell, I ran into the following problem.
>>>>>
>>>>> I created a very small text file named 5.txt:
>>>>> 1 2 3 4 5
>>>>> 1 2 3 4 5
>>>>> 1 2 3 4 5
>>>>>
>>>>> and experimented in the Spark shell:
>>>>>
>>>>> scala> val d5 = sc.textFile("5.txt").cache()
>>>>> d5: org.apache.spark.rdd.RDD[String] = MappedRDD[91] at textFile at <console>:12
>>>>>
>>>>> scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1, v2) => (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString).first
>>>>>
>>>>> Then this error occurs:
>>>>> 14/04/18 00:20:11 ERROR Executor: Exception in task ID 36
>>>>> java.lang.ArrayIndexOutOfBoundsException: 1
>>>>>         at $line60.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:15)
>>>>>         at $line60.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:15)
>>>>>         at org.apache.spark.util.collection.ExternalAppendOnlyMap$$anonfun$2.apply(ExternalAppendOnlyMap.scala:120)
>>>>>
>>>>> When I delete one line from the file, making it two lines, the result
>>>>> is correct. I don't understand what the problem is. Please help me,
>>>>> thanks.
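As a rough standalone illustration (plain Scala, no Spark needed) of the reduce behaviour Daniel describes above, and of why parsing the values before reducing avoids the exception:

    val lines = Seq("1 2 3 4 5", "1 2 3 4 5", "1 2 3 4 5")

    // Naive version, analogous to the failing job: reduce feeds the previous
    // result back in as v1, so after the first step v1 is "4" and
    // split(" ")(1) throws ArrayIndexOutOfBoundsException.
    // lines.reduce((v1, v2) => (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString)

    // Parse each line into the value you want to accumulate *before* reducing,
    // as mapValues does in the Spark version; then the inputs and the output
    // of the reduce function have the same shape.
    val total = lines.map(_.split(" ")(1).toInt).reduce(_ + _)
    // total == 6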