Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi,

I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 
0.98, 

My steps:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
tar -vxf spark-1.0.2.tgz
cd spark-1.0.2

edit project/SparkBuild.scala, set HBASE_VERSION
  // HBase version; set as appropriate.
  val HBASE_VERSION = "0.98.2"


edit pom.xml with following values
<hadoop.version>2.4.1</hadoop.version>
<protobuf.version>2.5.0</protobuf.version>
<yarn.version>${hadoop.version}</yarn.version>
<hbase.version>0.98.5</hbase.version>
<zookeeper.version>3.4.6</zookeeper.version>
<hive.version>0.13.1</hive.version>


SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2"

Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I 
set HBASE_VERSION back to “0.94.6"?

Regards
Arthur




[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: org.apache.hbase#hbase;0.98.2: not found
[warn]  ::

sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not 
found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at 
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
at sbt.IvySbt.withIvy(Ivy.scala:101)
at sbt.IvySbt.withIvy(Ivy.scala:97)
at sbt.IvySbt$Module.withModule(Ivy.scala:116)
at sbt.IvyActions$.update(IvyActions.scala:125)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
[error] (examples/*:update) sbt.ResolveException: unresolved dependency: 
org.apache.hbase#hbase;0.98.2: not found
[error] Total time: 270 s, completed Aug 28, 2014 9:42:05 AM





Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
(correction: "Compilation Error:  Spark 1.0.2 with HBase 0.98” , please ignore 
if duplicated) 


Hi,

I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 
0.98, 

My steps:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
tar -vxf spark-1.0.2.tgz
cd spark-1.0.2

edit project/SparkBuild.scala, set HBASE_VERSION
  // HBase version; set as appropriate.
  val HBASE_VERSION = "0.98.2"


edit pom.xml with following values
<hadoop.version>2.4.1</hadoop.version>
<protobuf.version>2.5.0</protobuf.version>
<yarn.version>${hadoop.version}</yarn.version>
<hbase.version>0.98.5</hbase.version>
<zookeeper.version>3.4.6</zookeeper.version>
<hive.version>0.13.1</hive.version>


SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2"

Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I 
set HBASE_VERSION back to “0.94.6"?

Regards
Arthur




[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: org.apache.hbase#hbase;0.98.2: not found
[warn]  ::

sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not 
found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at 
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
at sbt.IvySbt.withIvy(Ivy.scala:101)
at sbt.IvySbt.withIvy(Ivy.scala:97)
at sbt.IvySbt$Module.withModule(Ivy.scala:116)
at sbt.IvyActions$.update(IvyActions.scala:125)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
[error] (examples/*:update) sbt.ResolveException: unresolved dependency: 
org.apache.hbase#hbase;0.98.2: not found
[error] Total time: 270 s, completed Aug 28, 2014 9:42:05 AM
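
For context on the unresolved dependency above: the single org.apache.hbase:hbase jar stops at the 0.94 line, and 0.98 releases are published as split modules (hbase-client, hbase-common, hbase-server, ...) under versions like 0.98.2-hadoop1 / 0.98.2-hadoop2, which appears to be what the SPARK-1297 change discussed later in this thread moves to. A rough way to confirm against Maven Central (repository URL assumed; adjust if you resolve through a mirror):

curl -sI https://repo1.maven.org/maven2/org/apache/hbase/hbase/0.98.2/hbase-0.98.2.jar | head -1
# expect 404: there is no plain "0.98.2" version of a monolithic hbase jar to resolve
curl -sI https://repo1.maven.org/maven2/org/apache/hbase/hbase-client/0.98.2-hadoop2/hbase-client-0.98.2-hadoop2.jar | head -1
# expect 200: 0.98 ships per-module jars with a -hadoop1 / -hadoop2 suffix
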
--

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted,

Thank you so much!!

As I am new to Spark, can you please advise the steps about how to apply this 
patch to my spark-1.0.2 source folder?

Regards
Arthur


On 28 Aug, 2014, at 10:13 am, Ted Yu  wrote:

> See SPARK-1297
> 
> The pull request is here:
> https://github.com/apache/spark/pull/1893
> 
> 
> On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
>  wrote:
> (correction: "Compilation Error:  Spark 1.0.2 with HBase 0.98” , please 
> ignore if duplicated)
> 
> 
> Hi,
> 
> I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
> HBase 0.98,
> 
> My steps:
> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
> tar -vxf spark-1.0.2.tgz
> cd spark-1.0.2
> 
> edit project/SparkBuild.scala, set HBASE_VERSION
>   // HBase version; set as appropriate.
>   val HBASE_VERSION = "0.98.2"
> 
> 
> edit pom.xml with following values
> <hadoop.version>2.4.1</hadoop.version>
> <protobuf.version>2.5.0</protobuf.version>
> <yarn.version>${hadoop.version}</yarn.version>
> <hbase.version>0.98.5</hbase.version>
> <zookeeper.version>3.4.6</zookeeper.version>
> <hive.version>0.13.1</hive.version>
> 
> 
> SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
> but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2"
> 
> Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I 
> set HBASE_VERSION back to “0.94.6"?
> 
> Regards
> Arthur
> 
> 
> 
> 
> [warn]  ::
> [warn]  ::  UNRESOLVED DEPENDENCIES ::
> [warn]  ::
> [warn]  :: org.apache.hbase#hbase;0.98.2: not found
> [warn]  ::
> 
> sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: 
> not found
> at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
> at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
> at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
> at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
> at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
> at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
> at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
> at 
> xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
> at 
> xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
> at xsbt.boot.Using$.withResource(Using.scala:11)
> at xsbt.boot.Using$.apply(Using.scala:10)
> at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
> at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
> at xsbt.boot.Locks$.apply0(Locks.scala:31)
> at xsbt.boot.Locks$.apply(Locks.scala:28)
> at sbt.IvySbt.withDefaultLogger(Ivy.scala:60)
> at sbt.IvySbt.withIvy(Ivy.scala:101)
> at sbt.IvySbt.withIvy(Ivy.scala:97)
> at sbt.IvySbt$Module.withModule(Ivy.scala:116)
> at sbt.IvyActions$.update(IvyActions.scala:125)
> at 
> sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170)
> at 
> sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168)
> at 
> sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191)
> at 
> sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189)
> at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
> at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193)
> at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188)
> at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
> at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196)
> at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161)
> at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139)
> at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
> at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
> at sbt.std.Transform$$anon$4.work(System.scala:64)
> at 
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
> at 
> sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
> at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
> at sbt.Execute.work(Execute.scala:244)
> at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
> at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
> at 
> sbt.ConcurrentRestriction

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted,

I tried the following steps to apply the patch 1893 but got Hunk FAILED, can 
you please advise how to get thru this error? or is my spark-1.0.2 source not 
the correct one?

Regards
Arthur
 
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
tar -vxf spark-1.0.2.tgz
cd spark-1.0.2
wget https://github.com/apache/spark/pull/1893.patch
patch  < 1893.patch
patching file pom.xml
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 110.
2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
patching file pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 FAILED at 171.
3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
can't find file to patch at input line 267
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--
|
|From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
|From: tedyu 
|Date: Mon, 11 Aug 2014 15:57:46 -0700
|Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
| description to building-with-maven.md
|
|---
| docs/building-with-maven.md | 3 +++
| 1 file changed, 3 insertions(+)
|
|diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- a/docs/building-with-maven.md
|+++ b/docs/building-with-maven.md
--
File to patch:
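
For what it is worth, the failed hunks and the "File to patch:" prompt above are what GNU patch prints when the strip level does not match a git-format patch, whose paths carry a/ and b/ prefixes. A small sketch (assuming GNU patch) to preview the apply before touching the tree:

cd spark-1.0.2
wget https://github.com/apache/spark/pull/1893.patch
patch -p1 --dry-run < 1893.patch   # -p1 strips the leading a/ and b/ path component
patch -p1 < 1893.patch             # apply only once the dry run looks right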



On 28 Aug, 2014, at 10:24 am, Ted Yu  wrote:

> You can get the patch from this URL:
> https://github.com/apache/spark/pull/1893.patch
> 
> BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml
> 
> Cheers
> 
> 
> On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi Ted,
> 
> Thank you so much!!
> 
> As I am new to Spark, can you please advise the steps about how to apply this 
> patch to my spark-1.0.2 source folder?
> 
> Regards
> Arthur
> 
> 
> On 28 Aug, 2014, at 10:13 am, Ted Yu  wrote:
> 
>> See SPARK-1297
>> 
>> The pull request is here:
>> https://github.com/apache/spark/pull/1893
>> 
>> 
>> On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
>>  wrote:
>> (correction: "Compilation Error:  Spark 1.0.2 with HBase 0.98” , please 
>> ignore if duplicated)
>> 
>> 
>> Hi,
>> 
>> I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
>> HBase 0.98,
>> 
>> My steps:
>> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
>> tar -vxf spark-1.0.2.tgz
>> cd spark-1.0.2
>> 
>> edit project/SparkBuild.scala, set HBASE_VERSION
>>   // HBase version; set as appropriate.
>>   val HBASE_VERSION = "0.98.2"
>> 
>> 
>> edit pom.xml with following values
>> <hadoop.version>2.4.1</hadoop.version>
>> <protobuf.version>2.5.0</protobuf.version>
>> <yarn.version>${hadoop.version}</yarn.version>
>> <hbase.version>0.98.5</hbase.version>
>> <zookeeper.version>3.4.6</zookeeper.version>
>> <hive.version>0.13.1</hive.version>
>> 
>> 
>> SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
>> but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2"
>> 
>> Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should 
>> I set HBASE_VERSION back to “0.94.6"?
>> 
>> Regards
>> Arthur
>> 
>> 
>> 
>> 
>> [warn]  ::
>> [warn]  ::  UNRESOLVED DEPENDENCIES ::
>> [warn]  ::
>> [warn]  :: org.apache.hbase#hbase;0.98.2: not found
>> [warn]  ::
>> 
>> sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: 
>> not found
>> at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
>> at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
>> at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
>> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
>> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116)
>> at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104)
>> at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51)
>> at sbt.IvySbt$$anon$3.call(Ivy.scala:60)
>> at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
>> at 
>> xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
>> at 
>> xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
>> at xsbt.boot.Using$.withResource(Using.scala:11)
>> at xsbt.boot.Using$.apply(Using.scala:10)
>> at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
>> at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:5

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, 

Thanks. 

Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.)
Is this normal?

Regards
Arthur


patch -p1 -i 1893.patch
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 succeeded at 94 (offset -16 lines).
1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file examples/pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 succeeded at 122 (offset -49 lines).
2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 succeeded at 122 (offset -40 lines).
Hunk #2 succeeded at 195 (offset -40 lines).
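
The remaining FAILED hunks are what you would expect when the pull request was generated against a newer tree than the 1.0.2 release, so the context lines no longer match. One alternative route, sketched here under the assumption that git and network access are available, is to work from the v1.0.2 tag and let git attempt a 3-way merge, which reports conflicts instead of silently dropping hunks into .rej files:

git clone https://github.com/apache/spark.git
cd spark
git checkout v1.0.2                                       # same code as the 1.0.2 tarball
curl -L -o 1893.patch https://github.com/apache/spark/pull/1893.patch
git apply --3way 1893.patch                               # or: git am -3 1893.patch to keep the commits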


On 28 Aug, 2014, at 10:53 am, Ted Yu  wrote:

> Can you use this command ?
> 
> patch -p1 -i 1893.patch
> 
> Cheers
> 
> 
> On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi Ted,
> 
> I tried the following steps to apply the patch 1893 but got Hunk FAILED, can 
> you please advise how to get thru this error? or is my spark-1.0.2 source not 
> the correct one?
> 
> Regards
> Arthur
>  
> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
> tar -vxf spark-1.0.2.tgz
> cd spark-1.0.2
> wget https://github.com/apache/spark/pull/1893.patch
> patch  < 1893.patch
> patching file pom.xml
> Hunk #1 FAILED at 45.
> Hunk #2 FAILED at 110.
> 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
> patching file pom.xml
> Hunk #1 FAILED at 54.
> Hunk #2 FAILED at 72.
> Hunk #3 FAILED at 171.
> 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
> can't find file to patch at input line 267
> Perhaps you should have used the -p or --strip option?
> The text leading up to this was:
> --
> |
> |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
> |From: tedyu 
> |Date: Mon, 11 Aug 2014 15:57:46 -0700
> |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
> | description to building-with-maven.md
> |
> |---
> | docs/building-with-maven.md | 3 +++
> | 1 file changed, 3 insertions(+)
> |
> |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
> |index 672d0ef..f8bcd2b 100644
> |--- a/docs/building-with-maven.md
> |+++ b/docs/building-with-maven.md
> --
> File to patch:
> 
> 
> 
> On 28 Aug, 2014, at 10:24 am, Ted Yu  wrote:
> 
>> You can get the patch from this URL:
>> https://github.com/apache/spark/pull/1893.patch
>> 
>> BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml
>> 
>> Cheers
>> 
>> 
>> On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com 
>>  wrote:
>> Hi Ted,
>> 
>> Thank you so much!!
>> 
>> As I am new to Spark, can you please advise the steps about how to apply 
>> this patch to my spark-1.0.2 source folder?
>> 
>> Regards
>> Arthur
>> 
>> 
>> On 28 Aug, 2014, at 10:13 am, Ted Yu  wrote:
>> 
>>> See SPARK-1297
>>> 
>>> The pull request is here:
>>> https://github.com/apache/spark/pull/1893
>>> 
>>> 
>>> On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com 
>>>  wrote:
>>> (correction: "Compilation Error:  Spark 1.0.2 with HBase 0.98” , please 
>>> ignore if duplicated)
>>> 
>>> 
>>> Hi,
>>> 
>>> I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with 
>>> HBase 0.98,
>>> 
>>> My steps:
>>> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
>>> tar -vxf spark-1.0.2.tgz
>>> cd spark-1.0.2
>>> 
>>> edit project/SparkBuild.scala, set HBASE_VERSION
>>>   // HBase version; set as appropriate.
>>>   val HBASE_VERSION = "0.98.2"
>>> 
>>> 
>>> edit pom.xml with following values
>>> <hadoop.version>2.4.1</hadoop.version>
>>> <protobuf.version>2.5.0</protobuf.version>
>>> <yarn.version>${hadoop.version}</yarn.version>
>>> <hbase.version>0.98.5</hbase.version>
>>> <zookeeper.version>3.4.6</zookeeper.version>
>>> <hive.version>0.13.1</hive.version>
>>> 
>>> 
>>> SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly
>>> but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2"
>>> 
>>> Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should 
>>> I set HBASE_VERSION back to “0.94.6"?
>>> 
>>> Regards
>>> Arthur
>>> 
>>> 
>>> 
>>> 
>>> [warn]  ::
>>> [warn]  ::  UNRESOLVED DEPENDENCIES ::
>>> [warn]  

Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi,

I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1.

I just tried to compile Spark 1.0.2 but got an error on "Spark Project Hive". Can 
you please advise which repository has 
"org.spark-project.hive:hive-metastore:jar:0.13.1"?


FYI, below is my repository setting in Maven (which may be outdated):

  <repository>
    <id>nexus</id>
    <name>local private nexus</name>
    <url>http://maven.oschina.net/content/groups/public/</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
  </repository>


export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package

[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [1.892s]
[INFO] Spark Project Core  SUCCESS [1:33.698s]
[INFO] Spark Project Bagel ... SUCCESS [12.270s]
[INFO] Spark Project GraphX .. SUCCESS [2:16.343s]
[INFO] Spark Project ML Library .. SUCCESS [4:18.495s]
[INFO] Spark Project Streaming ... SUCCESS [39.765s]
[INFO] Spark Project Tools ... SUCCESS [9.173s]
[INFO] Spark Project Catalyst  SUCCESS [35.462s]
[INFO] Spark Project SQL . SUCCESS [1:16.118s]
[INFO] Spark Project Hive  FAILURE [1:36.816s]
[INFO] Spark Project REPL  SKIPPED
[INFO] Spark Project YARN Parent POM . SKIPPED
[INFO] Spark Project YARN Stable API . SKIPPED
[INFO] Spark Project Assembly  SKIPPED
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 12:40.616s
[INFO] Finished at: Thu Aug 28 11:41:07 HKT 2014
[INFO] Final Memory: 37M/826M
[INFO] 
[ERROR] Failed to execute goal on project spark-hive_2.10: Could not resolve 
dependencies for project org.apache.spark:spark-hive_2.10:jar:1.0.2: The 
following artifacts could not be resolved: 
org.spark-project.hive:hive-metastore:jar:0.13.1, 
org.spark-project.hive:hive-exec:jar:0.13.1, 
org.spark-project.hive:hive-serde:jar:0.13.1: Could not find artifact 
org.spark-project.hive:hive-metastore:jar:0.13.1 in nexus-osc 
(http://maven.oschina.net/content/groups/public/) -> [Help 1]


Regards
Arthur
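
One likely explanation, offered here as an assumption rather than something confirmed in the thread: Spark 1.0.2 builds its Hive module against Hive 0.12.0 via the repackaged org.spark-project.hive artifacts, and a 0.13.1 version of those artifacts may simply not exist in any public repository, so the mirror is probably not at fault. Two quick checks against Maven Central (URLs assumed):

curl -sI https://repo1.maven.org/maven2/org/spark-project/hive/hive-metastore/0.12.0/hive-metastore-0.12.0.pom | head -1
curl -sI https://repo1.maven.org/maven2/org/spark-project/hive/hive-metastore/0.13.1/hive-metastore-0.13.1.pom | head -1
# if only 0.12.0 resolves, reverting hive.version in pom.xml to 0.12.0 should let Spark Project Hive build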



Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
.2:compile
[INFO] | \- javax.activation:activation:jar:1.1:compile
[INFO] +- org.apache.hbase:hbase-hadoop-compat:jar:0.98.5-hadoop1:compile
[INFO] |  \- org.apache.commons:commons-math:jar:2.1:compile
[INFO] +- 
org.apache.hbase:hbase-hadoop-compat:test-jar:tests:0.98.5-hadoop1:test
[INFO] +- com.twitter:algebird-core_2.10:jar:0.1.11:compile
[INFO] |  \- com.googlecode.javaewah:JavaEWAH:jar:0.6.6:compile
[INFO] +- org.scalatest:scalatest_2.10:jar:2.1.5:test
[INFO] |  \- org.scala-lang:scala-reflect:jar:2.10.4:compile
[INFO] +- org.scalacheck:scalacheck_2.10:jar:1.11.3:test
[INFO] |  \- org.scala-sbt:test-interface:jar:1.0:test
[INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
[INFO] |  +- net.jpountz.lz4:lz4:jar:1.1.0:compile
[INFO] |  +- org.antlr:antlr:jar:3.2:compile
[INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1:compile
[INFO] |  +- org.yaml:snakeyaml:jar:1.6:compile
[INFO] |  +- edu.stanford.ppl:snaptree:jar:0.1:compile
[INFO] |  +- org.mindrot:jbcrypt:jar:0.3m:compile
[INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
[INFO] |  |  \- javax.servlet:servlet-api:jar:2.5:compile
[INFO] |  +- org.apache.cassandra:cassandra-thrift:jar:1.2.6:compile
[INFO] |  \- com.github.stephenc:jamm:jar:0.2.5:compile
[INFO] \- com.github.scopt:scopt_2.10:jar:3.2.0:compile
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 2.151s
[INFO] Finished at: Thu Aug 28 18:12:45 HKT 2014
[INFO] Final Memory: 17M/479M
[INFO] 

Regards
Arthur

On 28 Aug, 2014, at 12:22 pm, Ted Yu <yuzhih...@gmail.com> wrote:

I forgot to include '-Dhadoop.version=2.4.1' in the command below.
The modified command passed.

You can verify the dependence on hbase 0.98 through this command:
mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt

Cheers

On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Looks like the patch given by that URL only had the last commit.
I have attached pom.xml for spark-1.0.2 to SPARK-1297
You can download it and replace examples/pom.xml with the downloaded pom

I am running this command locally:
mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package

Cheers

On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com <arthur.hk.c...@gmail.com> wrote:

Hi Ted,

Thanks.

Tried [patch -p1 -i 1893.patch] (Hunk #1 FAILED at 45.)
Is this normal?

Regards
Arthur

patch -p1 -i 1893.patch
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 succeeded at 94 (offset -16 lines).
1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file examples/pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 succeeded at 122 (offset -49 lines).
2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 succeeded at 122 (offset -40 lines).
Hunk #2 succeeded at 195 (offset -40 lines).

On 28 Aug, 2014, at 10:53 am, Ted Yu <yuzhih...@gmail.com> wrote:

Can you use this command ?

patch -p1 -i 1893.patch

Cheers

On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com <arthur.hk.c...@gmail.com> wrote:

Hi Ted,

I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one?

Regards
Arthur

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
tar -vxf spark-1.0.2.tgz
cd spark-1.0.2
wget https://github.com/apache/spark/pull/1893.patch
patch  < 1893.patch
patching file pom.xml
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 110.
2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
patching file pom.xml
Hunk #1 FAILED at 54.
Hunk #2 FAILED at 72.
Hunk #3 FAILED at 171.
3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
can't find file to patch at input line 267
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--
|
|From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
|From: tedyu <yuzhih...@gmail.com>
|Date: Mon, 11 Aug 2014 15:57:46 -0700
|Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
| description to building-with-maven.md
|
|---
| docs/building-with-maven.md | 3 +++
| 1 file changed, 3 insertions(+)
|
|diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- a/docs/building-with-maven.md
|+++ b/docs/building-with-maven.md
--
File to patch:

On 28 Aug, 2014, at 10:24 am, Ted Yu <yuzhih...@gmail.com> wrote:

You can get the patch from this URL:
https://github.com/apache/spark/pull/1893.patch

BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

I tried to start Spark but failed:

$ ./sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to 
/mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out
failed to launch org.apache.spark.deploy.master.Master:
  Failed to find Spark assembly in 
/mnt/hadoop/spark-1.0.2/assembly/target/scala-2.10/

$ ll assembly/
total 20
-rw-rw-r--. 1 hduser hadoop 11795 Jul 26 05:50 pom.xml
-rw-rw-r--. 1 hduser hadoop   507 Jul 26 05:50 README
drwxrwxr-x. 4 hduser hadoop  4096 Jul 26 05:50 src



Regards
Arthur
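
The start scripts only look for a spark-assembly-*.jar under assembly/target/scala-2.10/, which exists only after a full package or assembly build has completed in that exact source tree; the listing above shows no target/ directory at all. A sketch of the check and rebuild, using the Maven command from this thread:

find /mnt/hadoop/spark-1.0.2 -name 'spark-assembly-*.jar'
cd /mnt/hadoop/spark-1.0.2
mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package
./sbin/start-all.sh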



On 28 Aug, 2014, at 6:19 pm, Ted Yu  wrote:

> I see 0.98.5 in dep.txt
> 
> You should be good to go.
> 
> 
> On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> tried 
> mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests 
> dependency:tree > dep.txt
> 
> Attached the dep. txt for your information. 
> 
> 
> Regards
> Arthur
> 
> On 28 Aug, 2014, at 12:22 pm, Ted Yu  wrote:
> 
>> I forgot to include '-Dhadoop.version=2.4.1' in the command below.
>> 
>> The modified command passed.
>> 
>> You can verify the dependence on hbase 0.98 through this command:
>> 
>> mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests 
>> dependency:tree > dep.txt
>> 
>> Cheers
>> 
>> 
>> On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu  wrote:
>> Looks like the patch given by that URL only had the last commit.
>> 
>> I have attached pom.xml for spark-1.0.2 to SPARK-1297
>> You can download it and replace examples/pom.xml with the downloaded pom
>> 
>> I am running this command locally:
>> 
>> mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package
>> 
>> Cheers
>> 
>> 
>> On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com 
>>  wrote:
>> Hi Ted, 
>> 
>> Thanks. 
>> 
>> Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.)
>> Is this normal?
>> 
>> Regards
>> Arthur
>> 
>> 
>> patch -p1 -i 1893.patch
>> patching file examples/pom.xml
>> Hunk #1 FAILED at 45.
>> Hunk #2 succeeded at 94 (offset -16 lines).
>> 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
>> patching file examples/pom.xml
>> Hunk #1 FAILED at 54.
>> Hunk #2 FAILED at 72.
>> Hunk #3 succeeded at 122 (offset -49 lines).
>> 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej
>> patching file docs/building-with-maven.md
>> patching file examples/pom.xml
>> Hunk #1 succeeded at 122 (offset -40 lines).
>> Hunk #2 succeeded at 195 (offset -40 lines).
>> 
>> 
>> On 28 Aug, 2014, at 10:53 am, Ted Yu  wrote:
>> 
>>> Can you use this command ?
>>> 
>>> patch -p1 -i 1893.patch
>>> 
>>> Cheers
>>> 
>>> 
>>> On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com 
>>>  wrote:
>>> Hi Ted,
>>> 
>>> I tried the following steps to apply the patch 1893 but got Hunk FAILED, 
>>> can you please advise how to get thru this error? or is my spark-1.0.2 
>>> source not the correct one?
>>> 
>>> Regards
>>> Arthur
>>>  
>>> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz
>>> tar -vxf spark-1.0.2.tgz
>>> cd spark-1.0.2
>>> wget https://github.com/apache/spark/pull/1893.patch
>>> patch  < 1893.patch
>>> patching file pom.xml
>>> Hunk #1 FAILED at 45.
>>> Hunk #2 FAILED at 110.
>>> 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej
>>> patching file pom.xml
>>> Hunk #1 FAILED at 54.
>>> Hunk #2 FAILED at 72.
>>> Hunk #3 FAILED at 171.
>>> 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej
>>> can't find file to patch at input line 267
>>> Perhaps you should have used the -p or --strip option?
>>> The text leading up to this was:
>>> --
>>> |
>>> |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001
>>> |From: tedyu 
>>> |Date: Mon, 11 Aug 2014 15:57:46 -0700
>>> |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add
>>> | description to building-with-maven.md
>>> |
>>> |---
>>> | docs/building-with-maven.md | 3 +++
>>> | 1 file changed, 3 insertions(+)
>>> |
>>> |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
>>> |index 672d0ef..f8bcd2b 100644
>>

SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

I have just tried to apply the patch of SPARK-1297: 
https://issues.apache.org/jira/browse/SPARK-1297

There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt 
respectively.

When applying the 2nd one, I got "Hunk #1 FAILED at 45"

Can you please advise how to fix it in order to make the compilation of Spark 
Project Examples success?
(Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2)

Regards
Arthur



patch -p1 -i spark-1297-v4.txt 
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 succeeded at 94 (offset -16 lines).
1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

below is the content of examples/pom.xml.rej:
+++ examples/pom.xml
@@ -45,6 +45,39 @@
 
   
 
+
+  hbase-hadoop2
+  
+
+  hbase.profile
+  hadoop2
+
+  
+  
+2.5.0
+0.98.4-hadoop2
+  
+  
+
+
+  
+
+
+  hbase-hadoop1
+  
+
+  !hbase.profile
+
+  
+  
+0.98.4-hadoop1
+  
+  
+
+
+  
+
+
   
   
   


This caused the related compilation failed:
[INFO] Spark Project Examples  FAILURE [0.102s]




Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

patch -p1 -i spark-1297-v5.txt 
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git docs/building-with-maven.md docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|--- docs/building-with-maven.md
|+++ docs/building-with-maven.md
--
File to patch: 

Please advise
Regards
Arthur
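
The diff header quoted above ("diff --git docs/building-with-maven.md docs/building-with-maven.md") carries no a/ or b/ prefix, so this particular file applies at strip level 0 rather than with -p1. A minimal sketch:

patch -p0 --dry-run -i spark-1297-v5.txt   # preview; -p0 keeps the paths exactly as written
patch -p0 -i spark-1297-v5.txt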



On 29 Aug, 2014, at 12:50 am, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> I have just tried to apply the patch of SPARK-1297: 
> https://issues.apache.org/jira/browse/SPARK-1297
> 
> There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt 
> respectively.
> 
> When applying the 2nd one, I got "Hunk #1 FAILED at 45"
> 
> Can you please advise how to fix it in order to make the compilation of Spark 
> Project Examples success?
> (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2)
> 
> Regards
> Arthur
> 
> 
> 
> patch -p1 -i spark-1297-v4.txt 
> patching file examples/pom.xml
> Hunk #1 FAILED at 45.
> Hunk #2 succeeded at 94 (offset -16 lines).
> 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
> 
> below is the content of examples/pom.xml.rej:
> +++ examples/pom.xml
> @@ -45,6 +45,39 @@
>  
>
>  
> +
> +  hbase-hadoop2
> +  
> +
> +  hbase.profile
> +  hadoop2
> +
> +  
> +  
> +2.5.0
> +0.98.4-hadoop2
> +  
> +  
> +
> +
> +  
> +
> +
> +  hbase-hadoop1
> +  
> +
> +  !hbase.profile
> +
> +  
> +  
> +0.98.4-hadoop1
> +  
> +  
> +
> +
> +  
> +
> +
>
>
>
> 
> 
> This caused the related compilation failed:
> [INFO] Spark Project Examples  FAILURE [0.102s]
> 
> 



Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi Ted,

I downloaded pom.xml to examples directory.
It works, thanks!!

Regards
Arthur


[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [2.119s]
[INFO] Spark Project Core  SUCCESS [1:27.100s]
[INFO] Spark Project Bagel ... SUCCESS [10.261s]
[INFO] Spark Project GraphX .. SUCCESS [31.332s]
[INFO] Spark Project ML Library .. SUCCESS [35.226s]
[INFO] Spark Project Streaming ... SUCCESS [39.135s]
[INFO] Spark Project Tools ... SUCCESS [6.469s]
[INFO] Spark Project Catalyst  SUCCESS [36.521s]
[INFO] Spark Project SQL . SUCCESS [35.488s]
[INFO] Spark Project Hive  SUCCESS [35.296s]
[INFO] Spark Project REPL  SUCCESS [18.668s]
[INFO] Spark Project YARN Parent POM . SUCCESS [0.583s]
[INFO] Spark Project YARN Stable API . SUCCESS [15.989s]
[INFO] Spark Project Assembly  SUCCESS [11.497s]
[INFO] Spark Project External Twitter  SUCCESS [8.777s]
[INFO] Spark Project External Kafka .. SUCCESS [9.688s]
[INFO] Spark Project External Flume .. SUCCESS [10.411s]
[INFO] Spark Project External ZeroMQ . SUCCESS [9.511s]
[INFO] Spark Project External MQTT ... SUCCESS [8.451s]
[INFO] Spark Project Examples  SUCCESS [1:40.240s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 8:33.350s
[INFO] Finished at: Fri Aug 29 01:58:00 HKT 2014
[INFO] Final Memory: 82M/1086M
[INFO] 


On 29 Aug, 2014, at 1:36 am, Ted Yu  wrote:

> bq.  Spark 1.0.2
> 
> For the above release, you can download pom.xml attached to the JIRA and 
> place it in examples directory
> 
> I verified that the build against 0.98.4 worked using this command:
> 
> mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 
> -DskipTests clean package
> 
> Patch v5 is @ level 0 - you don't need to use -p1 in the patch command.
> 
> Cheers
> 
> 
> On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> I have just tried to apply the patch of SPARK-1297: 
> https://issues.apache.org/jira/browse/SPARK-1297
> 
> There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt 
> respectively.
> 
> When applying the 2nd one, I got "Hunk #1 FAILED at 45"
> 
> Can you please advise how to fix it in order to make the compilation of Spark 
> Project Examples success?
> (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2)
> 
> Regards
> Arthur
> 
> 
> 
> patch -p1 -i spark-1297-v4.txt 
> patching file examples/pom.xml
> Hunk #1 FAILED at 45.
> Hunk #2 succeeded at 94 (offset -16 lines).
> 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
> 
> below is the content of examples/pom.xml.rej:
> +++ examples/pom.xml
> @@ -45,6 +45,39 @@
>  
>
>  
> +
> +  hbase-hadoop2
> +  
> +
> +  hbase.profile
> +  hadoop2
> +
> +  
> +  
> +2.5.0
> +0.98.4-hadoop2
> +  
> +  
> +
> +
> +  
> +
> +
> +  hbase-hadoop1
> +  
> +
> +  !hbase.profile
> +
> +  
> +  
> +0.98.4-hadoop1
> +  
> +  
> +
> +
> +  
> +
> +
>
>
>
> 
> 
> This caused the related compilation failed:
> [INFO] Spark Project Examples  FAILURE [0.102s]
> 
> 
> 



org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and 
HBase.
With default setting in Spark 1.0.2, when trying to load a file I got "Class 
org.apache.hadoop.io.compress.SnappyCodec not found"

Can you please advise how to enable snappy in Spark?

Regards
Arthur


scala> inFILE.first()
java.lang.RuntimeException: Error in configuring object
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.RDD.take(RDD.scala:983)
at org.apache.spark.rdd.RDD.first(RDD.scala:1015)
at $iwC$$iwC$$iwC$$iwC.(:15)
at $iwC$$iwC$$iwC.(:20)
at $iwC$$iwC.(:22)
at $iwC.(:24)
at (:26)
at .(:30)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 55 more
Caused by: java.lang.IllegalArgumentException: Compression codec   
org.apache.hadoop.io.compress.SnappyCodec not found.
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.(CompressionCodecFactory.java:175)
at 
org.apache.hadoop.mapred.TextInpu

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,

my check native result:

hadoop checknative
14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize 
native-bzip2 library system-native, will use pure-Java version
14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded & initialized 
native-zlib library
Native library checking:
hadoop: true 
/mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libhadoop.so
zlib:   true /lib64/libz.so.1
snappy: true 
/mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so.1
lz4:true revision:99
bzip2:  false

Any idea how to enable or disable  snappy in Spark?

Regards
Arthur
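
Beyond the classpath, actually reading Snappy-compressed input needs libsnappy on the JVM's library path, i.e. the same native directory that hadoop checknative reports above. A sketch for Spark 1.0.x, assuming spark-shell forwards these spark-submit options and that the property name below is honoured in 1.0.x (adjust the path to your layout):

# conf/spark-defaults.conf (for executors)
#   spark.executor.extraLibraryPath   /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64
# and for the driver/shell JVM:
./bin/spark-shell --driver-library-path /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64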


On 29 Aug, 2014, at 2:39 am, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and 
> HBase.
> With default setting in Spark 1.0.2, when trying to load a file I got "Class 
> org.apache.hadoop.io.compress.SnappyCodec not found"
> 
> Can you please advise how to enable snappy in Spark?
> 
> Regards
> Arthur
> 
> 
> scala> inFILE.first()
> java.lang.RuntimeException: Error in configuring object
>   at 
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:983)
>   at org.apache.spark.rdd.RDD.first(RDD.scala:1015)
>   at $iwC$$iwC$$iwC$$iwC.(:15)
>   at $iwC$$iwC$$iwC.(:20)
>   at $iwC$$iwC.(:22)
>   at $iwC.(:24)
>   at (:26)
>   at .(:30)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
ore
Caused by: java.lang.IllegalArgumentException: Compression codec   
org.apache.hadoop.io.compress.GzipCodec not found.
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.(CompressionCodecFactory.java:175)
at 
org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
... 60 more
Caused by: java.lang.ClassNotFoundException: Class   
org.apache.hadoop.io.compress.GzipCodec not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
at 
org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)
... 62 more


Any idea to fix this issue?
Regards
Arthur


On 29 Aug, 2014, at 2:58 am, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> my check native result:
> 
> hadoop checknative
> 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize 
> native-bzip2 library system-native, will use pure-Java version
> 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> Native library checking:
> hadoop: true 
> /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libhadoop.so
> zlib:   true /lib64/libz.so.1
> snappy: true 
> /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so.1
> lz4:true revision:99
> bzip2:  false
> 
> Any idea how to enable or disable  snappy in Spark?
> 
> Regards
> Arthur
> 
> 
> On 29 Aug, 2014, at 2:39 am, arthur.hk.c...@gmail.com 
>  wrote:
> 
>> Hi,
>> 
>> I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and 
>> HBase.
>> With default setting in Spark 1.0.2, when trying to load a file I got "Class 
>> org.apache.hadoop.io.compress.SnappyCodec not found"
>> 
>> Can you please advise how to enable snappy in Spark?
>> 
>> Regards
>> Arthur
>> 
>> 
>> scala> inFILE.first()
>> java.lang.RuntimeException: Error in configuring object
>>  at 
>> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>>  at 
>> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>>  at 
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>>  at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158)
>>  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>>  at scala.Option.getOrElse(Option.scala:120)
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>>  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>>  at scala.Option.getOrElse(Option.scala:120)
>>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>>  at org.apache.spark.rdd.RDD.take(RDD.scala:983)
>>  at org.apache.spark.rdd.RDD.first(RDD.scala:1015)
>>  at $iwC$$iwC$$iwC$$iwC.(:15)
>>  at $iwC$$iwC$$iwC.(:20)
>>  at $iwC$$iwC.(:22)
>>  at $iwC.(:24)
>>  at (:26)
>>  at .(:30)
>>  at .()
>>  at .(:7)
>>  at .()
>>  at $print()
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>  at java.lang.reflect.Method.invoke(Method.java:597)
>>  at 
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
>>  at 
>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
>>  at 
>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
>>  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
>>  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
>>  at 
>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
>>  at 
>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
>>  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
>>  at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
>>  at org.apache.spark.repl.SparkIL

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi,
 
I fixed the issue by copying libsnappy.so into the Java JRE's library directory.

Regards
Arthur
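
One observation, stated as a guess rather than a confirmed diagnosis: in both failures the class name is printed with extra spaces ("Compression codec   org.apache.hadoop.io.compress...Codec not found"), which is what you would see if the io.compression.codecs value were read with the leading newlines and indentation from the XML, so the lookup is for " org.apache..." instead of the bare class name. Keeping the value on a single comma-separated line is a cheap thing to rule out; the effective value can be checked with:

hdfs getconf -confKey io.compression.codecs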

On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> If I change my etc/hadoop/core-site.xml 
> 
> from 
>   <property>
>     <name>io.compression.codecs</name>
>     <value>
>       org.apache.hadoop.io.compress.SnappyCodec,
>       org.apache.hadoop.io.compress.GzipCodec,
>       org.apache.hadoop.io.compress.DefaultCodec,
>       org.apache.hadoop.io.compress.BZip2Codec
>     </value>
>   </property>
> 
> to 
>   <property>
>     <name>io.compression.codecs</name>
>     <value>
>       org.apache.hadoop.io.compress.GzipCodec,
>       org.apache.hadoop.io.compress.SnappyCodec,
>       org.apache.hadoop.io.compress.DefaultCodec,
>       org.apache.hadoop.io.compress.BZip2Codec
>     </value>
>   </property>
> 
> 
> 
> and run the test again, I found this time it cannot find 
> "org.apache.hadoop.io.compress.GzipCodec"
> 
> scala> inFILE.first()
> java.lang.RuntimeException: Error in configuring object
>   at 
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
>   at org.apache.spark.rdd.RDD.take(RDD.scala:983)
>   at org.apache.spark.rdd.RDD.first(RDD.scala:1015)
>   at $iwC$$iwC$$iwC$$iwC.(:15)
>   at $iwC$$iwC$$iwC.(:20)
>   at $iwC$$iwC.(:22)
>   at $iwC.(:24)
>   at (:26)
>   at .(:30)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>   at org.apache.sp

Spark Hive max key length is 767 bytes

2014-08-28 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated) 


Hi,

I use Spark 1.0.2 with Hive 0.13.1

I have already set the hive mysql database character set to latin1;

mysql:
alter database hive character set latin1;

Spark:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.hql("create table test_datatype1 (testbigint bigint )")
scala> hiveContext.hql("drop table test_datatype1")


14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
14/08/29 12:31:59 ERROR DataNucleus.Datastore: An exception was thrown while 
adding/validating class(es) : Specified key was too long; max key length is 767 
bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
too long; max key length is 767 bytes
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:408)
at com.mysql.jdbc.Util.getInstance(Util.java:383)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1062)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4226)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4158)

Can you please advise what would be wrong?

Regards
Arthur

Re: Spark Hive max key length is 767 bytes

2014-08-29 Thread arthur.hk.c...@gmail.com
Hi,


Tried the same thing in HIVE directly without issue:

HIVE:
hive> create table test_datatype2 (testbigint bigint );
OK
Time taken: 0.708 seconds

hive> drop table test_datatype2;
OK
Time taken: 23.272 seconds



Then tried again in SPARK:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
14/08/29 19:33:52 INFO Configuration.deprecation: 
mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
mapreduce.reduce.speculative
hiveContext: org.apache.spark.sql.hive.HiveContext = 
org.apache.spark.sql.hive.HiveContext@395c7b94

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
res0: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==


scala> hiveContext.hql("drop table test_datatype3")

14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while 
adding/validating class(es) : Specified key was too long; max key length is 767 
bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
too long; max key length is 767 bytes
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in no 
possible candidates
Error(s) were found while auto-creating/validating the datastore for classes. 
The errors are printed in the log, and are attached to this exception.
org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found while 
auto-creating/validating the datastore for classes. The errors are printed in 
the log, and are attached to this exception.
at 
org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609)


Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified 
key was too long; max key length is 767 bytes
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
14/08/29 19:34:25 ERROR DataNucleus.Datastore: An exception was thrown while 
adding/validating class(es) : Specified key was too long; max key length is 767 
bytes
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
too long; max key length is 767 bytes
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)


Can anyone please help?

Regards
Arthur


On 29 Aug, 2014, at 12:47 pm, arthur.hk.c...@gmail.com 
 wrote:

> (Please ignore if duplicated) 
> 
> 
> Hi,
> 
> I use Spark 1.0.2 with Hive 0.13.1
> 
> I have already set the Hive MySQL database to latin1:
> 
> mysql:
> alter database hive character set latin1;
> 
> Spark:
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> scala> hiveContext.hql("create table test_datatype1 (testbigint bigint )")
> scala> hiveContext.hql("drop table test_datatype1")
> 
> 
> 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> &

Re: Spark Hive max key length is 767 bytes

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi Michael,

Thank you so much!!

I have tried to change the following key lengths from 256 to 255 and from 767 to 
766, but it still didn't work (see the note after the list):
alter table COLUMNS_V2 modify column COMMENT VARCHAR(255);
alter table INDEX_PARAMS modify column PARAM_KEY VARCHAR(255);
alter table SD_PARAMS modify column PARAM_KEY VARCHAR(255);
alter table SERDE_PARAMS modify column PARAM_KEY VARCHAR(255);
alter table TABLE_PARAMS modify column PARAM_KEY VARCHAR(255);
alter table TBLS modify column OWNER VARCHAR(766);
alter table PART_COL_STATS modify column PARTITION_NAME VARCHAR(766);
alter table PARTITION_KEYS modify column PKEY_TYPE VARCHAR(766);
alter table PARTITIONS modify column PART_NAME VARCHAR(766);
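
One possibility to check (an assumption, not something confirmed in this thread): 
ALTER DATABASE only changes the default character set for tables created afterwards, 
so metastore tables that already exist may still be using a multi-byte charset such 
as UTF-8. Converting an existing table explicitly would look roughly like this, run 
against the Hive metastore database (COLUMNS_V2 is just one of the tables listed 
above):

mysql> ALTER TABLE COLUMNS_V2 CONVERT TO CHARACTER SET latin1;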

I use Hadoop 2.4.1, HBase 0.98.5 and Hive 0.13, and am trying Spark 1.0.2 and Shark 
0.9.2 with JDK1.6_45.

Some questions:
Which Hive version is shark-0.9.2 based on? Is HBase 0.98.x OK? Is Hive 0.13.1 OK? 
And which Java? (I use JDK 1.6 at the moment; it does not seem to work.)
Which Hive version is spark-1.0.2 based on? Is HBase 0.98.x OK?

Regards
Arthur 


On 30 Aug, 2014, at 1:40 am, Michael Armbrust  wrote:

> Spark SQL is based on Hive 12.  They must have changed the maximum key size 
> between 12 and 13.
> 
> 
> On Fri, Aug 29, 2014 at 4:38 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> 
> Tried the same thing in HIVE directly without issue:
> 
> HIVE:
> hive> create table test_datatype2 (testbigint bigint );
> OK
> Time taken: 0.708 seconds
> 
> hive> drop table test_datatype2;
> OK
> Time taken: 23.272 seconds
> 
> 
> 
> Then tried again in SPARK:
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> 14/08/29 19:33:52 INFO Configuration.deprecation: 
> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
> mapreduce.reduce.speculative
> hiveContext: org.apache.spark.sql.hive.HiveContext = 
> org.apache.spark.sql.hive.HiveContext@395c7b94
> 
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> res0: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[0] at RDD at SchemaRDD.scala:104
> == Query Plan ==
> 
> 
> scala> hiveContext.hql("drop table test_datatype3")
> 
> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while 
> adding/validating class(es) : Specified key was too long; max key length is 
> 767 bytes
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
> too long; max key length is 767 bytes
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> 
> 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
> org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in 
> no possible candidates
> Error(s) were found while auto-creating/validating the datastore for classes. 
> The errors are printed in the log, and are attached to this exception.
> org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found 
> while auto-creating/validating the datastore for classes. The errors are 
> printed in the log, and are attached to this exception.
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609)
> 
> 
> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: 
> Specified key was too long; max key length is 767 bytes
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> 
> 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class 
> "org.apache.hadoop.hive.me

Re: Spark Hive max key length is 767 bytes

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi,

Already done but still get the same error:

(I use HIVE 0.13.1 Spark 1.0.2, Hadoop 2.4.1)

Steps:
Step 1) mysql:
> 
>> alter database hive character set latin1;
Step 2) HIVE:
>> hive> create table test_datatype2 (testbigint bigint );
>> OK
>> Time taken: 0.708 seconds
>> 
>> hive> drop table test_datatype2;
>> OK
>> Time taken: 23.272 seconds
Step 3) scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> 14/08/29 19:33:52 INFO Configuration.deprecation: 
>> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
>> mapreduce.reduce.speculative
>> hiveContext: org.apache.spark.sql.hive.HiveContext = 
>> org.apache.spark.sql.hive.HiveContext@395c7b94
>> scala> hiveContext.hql(“create table test_datatype3 (testbigint bigint)”)
>> res0: org.apache.spark.sql.SchemaRDD = 
>> SchemaRDD[0] at RDD at SchemaRDD.scala:104
>> == Query Plan ==
>> 
>> scala> hiveContext.hql("drop table test_datatype3")
>> 
>> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while 
>> adding/validating class(es) : Specified key was too long; max key length is 
>> 767 bytes
>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
>> too long; max key length is 767 bytes
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> at 
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>> 
>> 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
>> org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in 
>> no possible candidates
>> Error(s) were found while auto-creating/validating the datastore for 
>> classes. The errors are printed in the log, and are attached to this 
>> exception.
>> org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found 
>> while auto-creating/validating the datastore for classes. The errors are 
>> printed in the log, and are attached to this exception.
>> at 
>> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609)
>> 
>> 
>> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: 
>> Specified key was too long; max key length is 767 bytes
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)



Should I use HIVE 0.12.0 instead of HIVE 0.13.1?

Regards
Arthur

On 31 Aug, 2014, at 6:01 am, Denny Lee  wrote:

> Oh, you may be running into an issue with your MySQL setup actually, try 
> running
> 
> alter database metastore_db character set latin1
> 
> so that way Hive (and the Spark HiveContext) can execute properly against the 
> metastore.
> 
> 
> On August 29, 2014 at 04:39:01, arthur.hk.c...@gmail.com 
> (arthur.hk.c...@gmail.com) wrote:
> 
>> Hi,
>> 
>> 
>> Tried the same thing in HIVE directly without issue:
>> 
>> HIVE:
>> hive> create table test_datatype2 (testbigint bigint );
>> OK
>> Time taken: 0.708 seconds
>> 
>> hive> drop table test_datatype2;
>> OK
>> Time taken: 23.272 seconds
>> 
>> 
>> 
>> Then tried again in SPARK:
>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> 14/08/29 19:33:52 INFO Configuration.deprecation: 
>> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
>> mapreduce.reduce.speculative
>> hiveContext: org.apache.spark.sql.hive.HiveContext = 
>> org.apache.spark.sql.hive.HiveContext@395c7b94
>> 
>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> res0: org.apache.spark.sql.SchemaRDD = 
>> SchemaRDD[0] at RDD at SchemaRDD.scala:104
>> == Query Plan ==
>> 
>> 
>> scala> hiveContext.hql("drop table test_datatype3")
>> 
>> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while 
>> adding/validating class(es) : Specified key was too long; max key length is 
>> 767 bytes
>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was 
>> too long; max key length is 767 bytes
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> at 
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>> 
>> 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
>> org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in 
>> no possible candidates
>> Error(s) were found while auto-creating/validating the data

Spark Master/Slave and HA

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi,

I have a few questions about the Spark Master and Slave setup:

Here I have 5 Hadoop nodes (n1, n2, n3, n4 and n5); at the moment I run Spark on 
these nodes:
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave

Questions:
Q1: If I set n1 as both Spark Master and Spark Slave, I cannot start the Spark 
cluster. Does this mean that, unlike Hadoop, I cannot use the same machine as both 
MASTER and SLAVE in Spark?
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master, Spark Slave (failed to start Spark)
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave

Q2: I am planning Spark HA. What if I use n2 as "Spark Standby Master and Spark 
Slave"? Is Spark allowed to run a Standby Master and a Slave on the same machine?
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Standby Master, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave

Q3: Does the Spark Master node do actual computation work like a worker, or is it 
purely a monitoring/scheduling node?
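
Regarding Q2, for reference: a minimal sketch of a ZooKeeper-backed standby-master 
setup as described in the standalone-mode documentation. The hostnames and the 
ZooKeeper quorum below are placeholders, not values from this cluster.

# spark-env.sh on every machine that runs a Master (n1 and n2 here)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"

# start one Master on n1 and another on n2, then point workers and applications
# at both, e.g.:
#   spark://n1:7077,n2:7077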

Regards
Arthur

Spark and Shark Node: RAM Allocation

2014-08-30 Thread arthur.hk.c...@gmail.com
Hi,

Is there any formula to calculate proper RAM allocation values for Spark and 
Shark based on Physical RAM, HADOOP and HBASE RAM usage?
e.g. if a node has 32GB physical RAM


spark-defaults.conf
spark.executor.memory   ?g

spark-env.sh
export SPARK_WORKER_MEMORY=?
export HADOOP_HEAPSIZE=?


shark-env.sh
export SPARK_MEM=?g
export SHARK_MASTER_MEM=?g



Regards
Arthur
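
There is no single formula; the split depends on how much heap the HBase 
RegionServer and the Hadoop daemons on the same node actually need. A purely 
illustrative sketch for a 32 GB node follows; every number is an assumption to be 
tuned, not a recommendation from this list:

# spark-env.sh (illustrative values)
export HADOOP_HEAPSIZE=2000        # ~2 GB for the Hadoop daemons on this node
export SPARK_WORKER_MEMORY=8g      # total memory the worker may hand out to executors

# spark-defaults.conf (illustrative)
spark.executor.memory   6g         # per-executor heap; must fit inside SPARK_WORKER_MEMORY

# shark-env.sh (illustrative)
export SPARK_MEM=6g
export SHARK_MASTER_MEM=2g

# leave the remainder (roughly 20 GB here) for the OS page cache, the HBase
# RegionServer heap and the other Hadoop/HBase daemons on the node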




Spark and Shark

2014-09-01 Thread arthur.hk.c...@gmail.com
Hi,

I have installed Spark 1.0.2 and Shark 0.9.2 on Hadoop 2.4.1 (by compiling from 
source).

spark: 1.0.2
shark: 0.9.2
hadoop: 2.4.1
java: java version “1.7.0_67”
protobuf: 2.5.0


I have tried the smoke test in Shark but got a 
“java.util.NoSuchElementException” error; can you please advise how to fix 
this?

shark> create table x1 (a INT);
FAILED: Hive Internal Error: java.util.NoSuchElementException(null)
14/09/01 23:04:24 [main]: ERROR shark.SharkDriver: FAILED: Hive Internal Error: 
java.util.NoSuchElementException(null)
java.util.NoSuchElementException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:925)
at java.util.HashMap$ValueIterator.next(HashMap.java:950)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:8117)
at 
shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:150)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
at shark.SharkDriver.compile(SharkDriver.scala:215)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:340)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at shark.SharkCliDriver$.main(SharkCliDriver.scala:237)
at shark.SharkCliDriver.main(SharkCliDriver.scala)


spark-env.sh
#!/usr/bin/env bash
export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar"
export CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar"
export JAVA_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64"
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
export 
SPARK_CLASSPATH="$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar"
export SPARK_WORKER_MEMORY=2g
export HADOOP_HEAPSIZE=2000

spark-defaults.conf
spark.executor.memory   2048m
spark.shuffle.spill.compressfalse

shark-env.sh
#!/usr/bin/env bash
export SPARK_MEM=2g
export SHARK_MASTER_MEM=2g
SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export SHARK_EXEC_MODE=yarn
export 
SPARK_ASSEMBLY_JAR="$SCALA_HOME/assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar"
export SHARK_ASSEMBLY_JAR="target/scala-2.10/shark_2.10-0.9.2.jar"
export HIVE_CONF_DIR="$HIVE_HOME/conf"
export SPARK_LIBPATH=$HADOOP_HOME/lib/native/
export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native/
export 
SPARK_CLASSPATH="$SHARK_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar:$SHARK_HOME/lib/protobuf-java-2.5.0.jar"


Regards
Arthur



Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi, 

What is your SBT command and the parameters?

Arthur


On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler  wrote:

> Hello,
> 
> I am writing a Spark App which is already working so far.
> Now I started to build also some UnitTests, but I am running into some 
> dependecy problems and I cannot find a solution right now. Perhaps someone 
> could help me.
> 
> I build my Spark Project with SBT and it seems to be configured well, because 
> compiling, assembling and running the built jar with spark-submit are working 
> well.
> 
> Now I started with the UnitTests, which I located under /src/test/scala.
> 
> When I call "test" in sbt, I get the following:
> 
> 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered BlockManager
> 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
> [trace] Stack trace suppressed: run last test:test for the full output.
> [error] Could not run test test.scala.SetSuite: 
> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
> [info] Run completed in 626 milliseconds.
> [info] Total number of tests run: 0
> [info] Suites: completed 0, aborted 0
> [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
> [info] All tests passed.
> [error] Error during tests:
> [error] test.scala.SetSuite
> [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 3 s, completed 10.09.2014 12:22:06
> 
> last test:test gives me the following:
> 
> > last test:test
> [debug] Running TaskDef(test.scala.SetSuite, 
> org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>at 
> org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
>at 
> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
>at 
> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
>at 
> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
>at 
> org.apache.spark.broadcast.BroadcastManager.<init>(BroadcastManager.scala:35)
>at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
>at test.scala.SetSuite.<init>(SparkTest.scala:16)
> 
> I also noticed right now, that sbt run is also not working:
> 
> 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
> [error] (run-main-2) java.lang.NoClassDefFoundError: 
> javax/servlet/http/HttpServletResponse
> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>at 
> org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156)
>at 
> org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127)
>at 
> org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31)
>at 
> org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48)
>at 
> org.apache.spark.broadcast.BroadcastManager.<init>(BroadcastManager.scala:35)
>at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
>at 
> main.scala.PartialDuplicateScanner$.main(PartialDuplicateScanner.scala:29)
>at main.scala.PartialDuplicateScanner.main(PartialDuplicateScanner.scala)
> 
> Here is my Testprojekt.sbt file:
> 
> name := "Testprojekt"
> 
> version := "1.0"
> 
> scalaVersion := "2.10.4"
> 
> libraryDependencies ++= {
>  Seq(
>"org.apache.lucene" % "lucene-core" % "4.9.0",
>"org.apache.lucene" % "lucene-analyzers-common" % "4.9.0",
>"org.apache.lucene" % "lucene-queryparser" % "4.9.0",
>("org.apache.spark" %% "spark-core" % "1.0.2").
>exclude("org.mortbay.jetty", "servlet-api").
>exclude("commons-beanutils", "commons-beanutils-core").
>exclude("commons-collections", "commons-collections").
>exclude("commons-collections", "commons-collections").
>exclude("com.esotericsoftware.minlog", "minlog").
>exclude("org.eclipse.jetty.orbit", "javax.mail.glassfish").
>exclude("org.eclipse.jetty.orbit", "javax.transaction").
>exclude("org.eclipse.jetty.orbit", "javax.servlet")
>  )
> }
> 
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
> 
> 
> 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
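
A possible cause worth checking (an assumption based on the build file above, not 
verified against this project): the spark-core dependency excludes the Jetty-orbit 
"javax.servlet" artifact, so the servlet API classes are missing from the test/run 
classpath even though the assembled jar works under spark-submit. One workaround 
sketch is to add a servlet API jar back to the build, e.g.:

libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1"

After reloading the project, "sbt test" and "sbt run" should then be able to resolve 
javax.servlet.http.HttpServletResponse (untested here).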



Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi,

Maybe you can take a look at the following.

http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html

Good luck.
Arthur

On 10 Sep, 2014, at 9:09 pm, arunshell87  wrote:

> 
> Hi,
> 
> I too had tried SQL queries with joins, MINUS , subqueries etc but they did
> not work in Spark Sql. 
> 
> I did not find any documentation on what queries work and what do not work
> in Spark SQL, may be we have to wait for the Spark book to be released in
> Feb-2015.
> 
> I believe you can try HiveQL in Spark for your requirement.
> 
> Thanks,
> Arun
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL -- more than two tables for join

2014-09-10 Thread arthur.hk.c...@gmail.com
Hi 

Some findings (a rough workaround for item 1 is sketched after the list):

1)  Spark SQL does not support joins over more than two tables in one statement
2)  Spark SQL's left join has a performance issue
3)  Spark SQL's cached tables do not support two-tier (nested) queries
4)  Spark SQL does not support repartition
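
For what it is worth, a rough, untested sketch of working around item 1 by staging 
the query as two pairwise joins and registering the intermediate result as a 
temporary table. This assumes Spark 1.1's registerTempTable (registerAsTable on 
1.0.x); the table and column names are illustrative, borrowed from the TPC-H-style 
schema discussed later in this archive:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// first join: SUPPLIER with LINEITEM
val supplierLineitem = hiveContext.hql(
  "SELECT S_NAME, L1.L_ORDERKEY AS L_ORDERKEY " +
  "FROM SUPPLIER JOIN LINEITEM L1 ON S_SUPPKEY = L1.L_SUPPKEY")
supplierLineitem.registerTempTable("supplier_lineitem")

// second join: the intermediate result with ORDERS
val result = hiveContext.hql(
  "SELECT S_NAME, COUNT(*) AS NUMWAIT " +
  "FROM supplier_lineitem JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY " +
  "WHERE O_ORDERSTATUS = 'F' GROUP BY S_NAME")
result.collect().foreach(println)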

Arthur

On 10 Sep, 2014, at 10:22 pm, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> Maybe you can take a look at the following.
> 
> http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
> 
> Good luck.
> Arthur
> 
> On 10 Sep, 2014, at 9:09 pm, arunshell87  wrote:
> 
>> 
>> Hi,
>> 
>> I too had tried SQL queries with joins, MINUS , subqueries etc but they did
>> not work in Spark Sql. 
>> 
>> I did not find any documentation on what queries work and what do not work
>> in Spark SQL, may be we have to wait for the Spark book to be released in
>> Feb-2015.
>> 
>> I believe you can try HiveQL in Spark for your requirement.
>> 
>> Thanks,
>> Arun
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



unable to create new native thread

2014-09-11 Thread arthur.hk.c...@gmail.com
Hi 

I am trying the Spark sample program “SparkPi” and got an "unable to 
create new native thread" error; how can I resolve this?

14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 644)
14/09/11 21:36:16 INFO scheduler.TaskSetManager: Finished TID 643 in 43 ms on 
node1 (progress: 636/10)
14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 643)
14/09/11 21:36:16 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread 
[spark-akka.actor.default-dispatcher-16] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at 
scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672)
at 
scala.concurrent.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966)
at 
scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829)
at 
scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(AbstractDispatcher.scala:374)
at 
akka.dispatch.ExecutorServiceDelegate$class.execute(ThreadPoolBuilder.scala:212)
at 
akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:43)
at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:118)
at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:59)
at akka.actor.dungeon.Dispatch$class.sendMessage(Dispatch.scala:120)
at akka.actor.ActorCell.sendMessage(ActorCell.scala:338)
at akka.actor.Cell$class.sendMessage(ActorCell.scala:259)
at akka.actor.ActorCell.sendMessage(ActorCell.scala:338)
at akka.actor.LocalActorRef.$bang(ActorRef.scala:389)
at akka.actor.ActorRef.tell(ActorRef.scala:125)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:489)
at akka.actor.ActorCell.invoke(ActorCell.scala:455)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
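
For reference, this particular OutOfMemoryError usually points at the operating 
system's per-user process/thread limit (or at native memory exhausted by thread 
stacks) rather than at the JVM heap. A sketch of how one might check and raise the 
limit follows; the user name and the value 65536 are illustrative assumptions:

# check the current limit for the account that runs Spark
ulimit -u

# raise it, e.g. in /etc/security/limits.conf
sparkuser  soft  nproc  65536
sparkuser  hard  nproc  65536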


Regards
Arthur



object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, 

I have tried to run HBaseTest.scala, but I got the following errors; any ideas 
how to fix them?

Q1) 
scala> package org.apache.spark.examples
:1: error: illegal start of definition
   package org.apache.spark.examples


Q2) 
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
:31: error: object hbase is not a member of package org.apache.hadoop
   import org.apache.hadoop.hbase.mapreduce.TableInputFormat



Regards
Arthur
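
One sketch for the import failure in Q2 (the jar names below are placeholders for 
whatever an HBase 0.98 distribution actually ships): the HBase classes are simply 
not on the spark-shell classpath by default, so they can be added when launching 
the shell, e.g.:

./bin/spark-shell --driver-class-path "$HBASE_HOME/lib/hbase-client-0.98.5-hadoop2.jar:$HBASE_HOME/lib/hbase-common-0.98.5-hadoop2.jar:$HBASE_HOME/lib/hbase-protocol-0.98.5-hadoop2.jar"

The executors would also need the same jars (e.g. via SPARK_CLASSPATH) for any job 
that actually reads the table.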

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
If you want to run against 0.98, see: SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297

Cheers

On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com <arthur.hk.c...@gmail.com> wrote:

> Hi,
> 
> I have tried to run HBaseTest.scala, but I got the following errors; any ideas how to fix them?
> 
> Q1)
> scala> package org.apache.spark.examples
> :1: error: illegal start of definition
>        package org.apache.spark.examples
> 
> Q2)
> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> :31: error: object hbase is not a member of package org.apache.hadoop
>        import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> 
> Regards
> Arthur


Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

Thanks!

patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml
Hunk #1 FAILED at 45.
Hunk #2 FAILED at 110.
2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej

Still got errors.

Regards
Arthur

On 14 Sep, 2014, at 11:33 pm, Ted Yu  wrote:

> spark-1297-v5.txt is level 0 patch
> 
> Please use spark-1297-v5.txt
> 
> Cheers
> 
> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> Thanks!!
> 
> I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
> are good here,  but not spark-1297-v5.txt:
> 
> 
> $ patch -p1 -i spark-1297-v4.txt
> patching file examples/pom.xml
> 
> $ patch -p1 -i spark-1297-v5.txt
> can't find file to patch at input line 5
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
> --
> |diff --git docs/building-with-maven.md docs/building-with-maven.md
> |index 672d0ef..f8bcd2b 100644
> |--- docs/building-with-maven.md
> |+++ docs/building-with-maven.md
> --
> File to patch: 
> 
> 
> 
> 
> 
> 
> Please advise.
> Regards
> Arthur
> 
> 
> 
> On 14 Sep, 2014, at 10:48 pm, Ted Yu  wrote:
> 
>> Spark examples builds against hbase 0.94 by default.
>> 
>> If you want to run against 0.98, see:
>> SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
>> 
>> Cheers
>> 
>> On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
>>  wrote:
>> Hi, 
>> 
>> I have tried to to run HBaseTest.scala, but I  got following errors, any 
>> ideas to how to fix them?
>> 
>> Q1) 
>> scala> package org.apache.spark.examples
>> :1: error: illegal start of definition
>>package org.apache.spark.examples
>> 
>> 
>> Q2) 
>> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>> :31: error: object hbase is not a member of package 
>> org.apache.hadoop
>>import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>> 
>> 
>> 
>> Regards
>> Arthur
>> 
> 
> 
> 



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

My bad.  Tried again, worked.


patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml


Thanks!
Arthur

On 14 Sep, 2014, at 11:38 pm, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> Thanks!
> 
> patch -p0 -i spark-1297-v5.txt
> patching file docs/building-with-maven.md
> patching file examples/pom.xml
> Hunk #1 FAILED at 45.
> Hunk #2 FAILED at 110.
> 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
> 
> Still got errors.
> 
> Regards
> Arthur
> 
> On 14 Sep, 2014, at 11:33 pm, Ted Yu  wrote:
> 
>> spark-1297-v5.txt is level 0 patch
>> 
>> Please use spark-1297-v5.txt
>> 
>> Cheers
>> 
>> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com 
>>  wrote:
>> Hi,
>> 
>> Thanks!!
>> 
>> I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt 
>> are good here,  but not spark-1297-v5.txt:
>> 
>> 
>> $ patch -p1 -i spark-1297-v4.txt
>> patching file examples/pom.xml
>> 
>> $ patch -p1 -i spark-1297-v5.txt
>> can't find file to patch at input line 5
>> Perhaps you used the wrong -p or --strip option?
>> The text leading up to this was:
>> --
>> |diff --git docs/building-with-maven.md docs/building-with-maven.md
>> |index 672d0ef..f8bcd2b 100644
>> |--- docs/building-with-maven.md
>> |+++ docs/building-with-maven.md
>> --
>> File to patch: 
>> 
>> 
>> 
>> 
>> 
>> 
>> Please advise.
>> Regards
>> Arthur
>> 
>> 
>> 
>> On 14 Sep, 2014, at 10:48 pm, Ted Yu  wrote:
>> 
>>> Spark examples builds against hbase 0.94 by default.
>>> 
>>> If you want to run against 0.98, see:
>>> SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297
>>> 
>>> Cheers
>>> 
>>> On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com 
>>>  wrote:
>>> Hi, 
>>> 
>>> I have tried to to run HBaseTest.scala, but I  got following errors, any 
>>> ideas to how to fix them?
>>> 
>>> Q1) 
>>> scala> package org.apache.spark.examples
>>> :1: error: illegal start of definition
>>>package org.apache.spark.examples
>>> 
>>> 
>>> Q2) 
>>> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>> :31: error: object hbase is not a member of package 
>>> org.apache.hadoop
>>>import org.apache.hadoop.hbase.mapreduce.TableInputFormat
>>> 
>>> 
>>> 
>>> Regards
>>> Arthur
>>> 
>> 
>> 
>> 
> 



Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi,

I applied the patch.

1) patched

$ patch -p0 -i spark-1297-v5.txt
patching file docs/building-with-maven.md
patching file examples/pom.xml


2) Compilation result
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM .. SUCCESS [1.550s]
[INFO] Spark Project Core  SUCCESS [1:32.175s]
[INFO] Spark Project Bagel ... SUCCESS [10.809s]
[INFO] Spark Project GraphX .. SUCCESS [31.435s]
[INFO] Spark Project Streaming ... SUCCESS [44.518s]
[INFO] Spark Project ML Library .. SUCCESS [48.992s]
[INFO] Spark Project Tools ... SUCCESS [7.028s]
[INFO] Spark Project Catalyst  SUCCESS [40.365s]
[INFO] Spark Project SQL . SUCCESS [43.305s]
[INFO] Spark Project Hive  SUCCESS [36.464s]
[INFO] Spark Project REPL  SUCCESS [20.319s]
[INFO] Spark Project YARN Parent POM . SUCCESS [1.032s]
[INFO] Spark Project YARN Stable API . SUCCESS [19.379s]
[INFO] Spark Project Hive Thrift Server .. SUCCESS [12.470s]
[INFO] Spark Project Assembly  SUCCESS [13.822s]
[INFO] Spark Project External Twitter  SUCCESS [9.566s]
[INFO] Spark Project External Kafka .. SUCCESS [12.848s]
[INFO] Spark Project External Flume Sink . SUCCESS [10.437s]
[INFO] Spark Project External Flume .. SUCCESS [14.554s]
[INFO] Spark Project External ZeroMQ . SUCCESS [9.994s]
[INFO] Spark Project External MQTT ... SUCCESS [8.684s]
[INFO] Spark Project Examples  SUCCESS [1:31.610s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 9:41.700s
[INFO] Finished at: Sun Sep 14 23:51:56 HKT 2014
[INFO] Final Memory: 83M/1071M
[INFO] 



3) testing:  
scala> package org.apache.spark.examples
:1: error: illegal start of definition
   package org.apache.spark.examples
   ^


scala> import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HBaseAdmin

scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}

scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

scala> import org.apache.spark._
import org.apache.spark._

scala> object HBaseTest {
 | def main(args: Array[String]) {
 | val sparkConf = new SparkConf().setAppName("HBaseTest")
 | val sc = new SparkContext(sparkConf)
 | val conf = HBaseConfiguration.create()
 | // Other options for configuring scan behavior are available. More 
information available at
 | // 
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html
 | conf.set(TableInputFormat.INPUT_TABLE, args(0))
 | // Initialize hBase table if necessary
 | val admin = new HBaseAdmin(conf)
 | if (!admin.isTableAvailable(args(0))) {
 | val tableDesc = new HTableDescriptor(args(0))
 | admin.createTable(tableDesc)
 | }
 | val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
 | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
 | classOf[org.apache.hadoop.hbase.client.Result])
 | hBaseRDD.count()
 | sc.stop()
 | }
 | }
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
defined module HBaseTest



Now I only get an error when trying to run "package org.apache.spark.examples”.
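
For what it is worth, a package declaration is not legal input in the Scala REPL, so 
that first line of HBaseTest.scala can simply be dropped when pasting into 
spark-shell. To run the packaged example itself, one option is the run-example 
script; this is only a sketch, the table name is a placeholder, and it assumes the 
HBase jars were built into the examples assembly via the patched pom.xml above:

./bin/run-example org.apache.spark.examples.HBaseTest some_table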

Please advise.
Regards
Arthur



On 14 Sep, 2014, at 11:41 pm, Ted Yu  wrote:

> I applied the patch on master branch without rejects.
> 
> If you use spark 1.0.2, use pom.xml attached to the JIRA.
> 
> On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> Thanks!
> 
> patch -p0 -i spark-1297-v5.txt
> patching file docs/building-with-maven.md
> patching file examples/pom.xml
> Hunk #1 FAILED at 45.
> Hunk #2 FAILED at 110.
> 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej
> 
> Still got errors.
> 
> Regards
> Arthur
> 
> On 14 Sep, 2014, at 11:33 pm, Ted Yu  wrote:
> 
>> spark-1297-v5.txt is level 0 patch
>> 
>> Please use spark-1297-v5.txt
>> 
>> Cheers
>> 
>> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.c

Re: Spark Hive max key length is 767 bytes

2014-09-25 Thread arthur.hk.c...@gmail.com
Hi,

Fixed the issue by downgrading Hive from 0.13.1 to 0.12.0; it works well now.

Regards
 

On 31 Aug, 2014, at 7:28 am, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> Already done but still get the same error:
> 
> (I use HIVE 0.13.1 Spark 1.0.2, Hadoop 2.4.1)
> 
> Steps:
> Step 1) mysql:
>> 
>>> alter database hive character set latin1;
> Step 2) HIVE:
>>> hive> create table test_datatype2 (testbigint bigint );
>>> OK
>>> Time taken: 0.708 seconds
>>> 
>>> hive> drop table test_datatype2;
>>> OK
>>> Time taken: 23.272 seconds
> Step 3) scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>> 14/08/29 19:33:52 INFO Configuration.deprecation: 
>>> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
>>> mapreduce.reduce.speculative
>>> hiveContext: org.apache.spark.sql.hive.HiveContext = 
>>> org.apache.spark.sql.hive.HiveContext@395c7b94
>>> scala> hiveContext.hql(“create table test_datatype3 (testbigint bigint)”)
>>> res0: org.apache.spark.sql.SchemaRDD = 
>>> SchemaRDD[0] at RDD at SchemaRDD.scala:104
>>> == Query Plan ==
>>> 
>>> scala> hiveContext.hql("drop table test_datatype3")
>>> 
>>> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown 
>>> while adding/validating class(es) : Specified key was too long; max key 
>>> length is 767 bytes
>>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key 
>>> was too long; max key length is 767 bytes
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at 
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>> 
>>> 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of 
>>> org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted 
>>> in no possible candidates
>>> Error(s) were found while auto-creating/validating the datastore for 
>>> classes. The errors are printed in the log, and are attached to this 
>>> exception.
>>> org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found 
>>> while auto-creating/validating the datastore for classes. The errors are 
>>> printed in the log, and are attached to this exception.
>>> at 
>>> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609)
>>> 
>>> 
>>> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: 
>>> Specified key was too long; max key length is 767 bytes
>>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 
> 
> 
> Should I use HIVE 0.12.0 instead of HIVE 0.13.1?
> 
> Regards
> Arthur
> 
> On 31 Aug, 2014, at 6:01 am, Denny Lee  wrote:
> 
>> Oh, you may be running into an issue with your MySQL setup actually, try 
>> running
>> 
>> alter database metastore_db character set latin1
>> 
>> so that way Hive (and the Spark HiveContext) can execute properly against 
>> the metastore.
>> 
>> 
>> On August 29, 2014 at 04:39:01, arthur.hk.c...@gmail.com 
>> (arthur.hk.c...@gmail.com) wrote:
>> 
>>> Hi,
>>> 
>>> 
>>> Tried the same thing in HIVE directly without issue:
>>> 
>>> HIVE:
>>> hive> create table test_datatype2 (testbigint bigint );
>>> OK
>>> Time taken: 0.708 seconds
>>> 
>>> hive> drop table test_datatype2;
>>> OK
>>> Time taken: 23.272 seconds
>>> 
>>> 
>>> 
>>> Then tried again in SPARK:
>>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>> 14/08/29 19:33:52 INFO Configuration.deprecation: 
>>> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use 
>>> mapreduce.reduce.speculative
>>> hiveContext: org.apache.spark.sql.hive.HiveContext = 
>>> org.apache.spark.sql.hive.HiveContext@395c7b94
>>> 
>>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>> res0: org.apache.spark.sql.SchemaRDD = 
>>> SchemaRDD[0] at RDD at SchemaRDD.scala:104
>>> == Query Plan ==
>>> 
>>> 
>>> scala> hiveContext.hql("drop table test_datatype3")
>>> 
>>> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown 
>>> while adding/validating class(es) : Specified key was too long; max key 
&g

Re: SparkSQL on Hive error

2014-10-03 Thread arthur.hk.c...@gmail.com
Hi,

I have just tested the same command and it works here; can you please provide your 
CREATE TABLE command?

regards
Arthur

scala> hiveContext.hql("show tables")
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
2014-10-03 17:14:33,575 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(179)) - Parsing command: show tables
2014-10-03 17:14:33,575 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(197)) - Parse Completed
2014-10-03 17:14:33,600 INFO  [main] Configuration.deprecation 
(Configuration.java:warnOnceIfDeprecated(1009)) - mapred.input.dir.recursive is 
deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
2014-10-03 17:14:33,600 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,600 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,600 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,600 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,600 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(179)) - Parsing command: show tables
2014-10-03 17:14:33,600 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(197)) - Parse Completed
2014-10-03 17:14:33,601 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,601 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,603 INFO  [main] ql.Driver (Driver.java:compile(450)) - 
Semantic Analysis Completed
2014-10-03 17:14:33,603 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,604 INFO  [main] exec.ListSinkOperator 
(Operator.java:initialize(338)) - Initializing Self 0 OP
2014-10-03 17:14:33,604 INFO  [main] exec.ListSinkOperator 
(Operator.java:initializeChildren(403)) - Operator 0 OP initialized
2014-10-03 17:14:33,604 INFO  [main] exec.ListSinkOperator 
(Operator.java:initialize(378)) - Initialization Done 0 OP
2014-10-03 17:14:33,604 INFO  [main] ql.Driver (Driver.java:getSchema(264)) - 
Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, 
type:string, comment:from deserializer)], properties:null)
2014-10-03 17:14:33,605 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,605 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,605 INFO  [main] ql.Driver (Driver.java:execute(1117)) - 
Starting command: show tables
2014-10-03 17:14:33,605 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,605 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,605 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,606 INFO  [main] metastore.HiveMetaStore 
(HiveMetaStore.java:logInfo(454)) - 0: get_database: default
2014-10-03 17:14:33,606 INFO  [main] HiveMetaStore.audit 
(HiveMetaStore.java:logAuditEvent(239)) - ugi=edhuser  ip=unknown-ip-addr  
cmd=get_database: default   
2014-10-03 17:14:33,609 INFO  [main] metastore.HiveMetaStore 
(HiveMetaStore.java:logInfo(454)) - 0: get_tables: db=default pat=.*
2014-10-03 17:14:33,609 INFO  [main] HiveMetaStore.audit 
(HiveMetaStore.java:logAuditEvent(239)) - ugi=edhuser  ip=unknown-ip-addr  
cmd=get_tables: db=default pat=.*   
2014-10-03 17:14:33,612 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,613 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,613 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,613 INFO  [main] ql.Driver 
(SessionState.java:printInfo(410)) - OK
2014-10-03 17:14:33,613 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,613 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,613 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 17:14:33,614 INFO  [main] mapred.FileInputFormat 
(FileInputFormat.java:listStatus(247)) - Total input paths to process : 1
2014-10-03 17:14:33,615 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 17:14:33,615 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
res4: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[3] at RDD at SchemaRDD.scala:103
== Query Plan ==


scala> 



On 3 Oct, 2014, at 4:52 pm, Michael Armbrust  wrote:

> Are you running master?  There was briefly a regression here that is 
> hopefully fixed by spark#2635.
> 
> On Fri, Oct 3, 2014 at 1:43 AM, Kevin Paul  wrote:
> Hi all, I tried to launch my application with spark-submit, the command I use 
> is:
> 
> bin/spark-submit --class ${MY_CLASS} --jars ${MY_JARS} --master local 
> myApplicationJar.jar
> 
> I've buillt spark with SPARK_HIVE=true, and was able to start HiveContext, 
> and was able to run command like, 
> hiveContext.sql("select * from myTable")
>  or hiveContext.sql("select count(*) from myTable")
> myTable is a table on my hive dat

How to save Spark log into file

2014-10-03 Thread arthur.hk.c...@gmail.com
Hi,

How can the Spark log be saved to a file instead of being shown on the console?

Below is my conf/log4j.properties

conf/log4j.properties
###
# Root logger option
log4j.rootLogger=INFO, file

# Direct log messages to a log file
log4j.appender.file=org.apache.log4j.RollingFileAppender

# Redirect output to the Spark log file below
log4j.appender.file.File=/hadoop_logs/spark/spark_logging.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=10
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
###


I tried to stop and start Spark again, but it still shows INFO/WARN logs on the 
console. Any ideas?
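
If the file under conf/ is not being picked up by the driver, one way to force a 
specific log4j configuration is to pass it explicitly when launching the shell; this 
is only a sketch and the path below is a placeholder:

./bin/spark-shell --driver-java-options "-Dlog4j.configuration=file:/path/to/conf/log4j.properties"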

Regards
Arthur






scala> hiveContext.hql("show tables")
warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details
2014-10-03 19:35:01,554 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(179)) - Parsing command: show tables
2014-10-03 19:35:01,715 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(197)) - Parse Completed
2014-10-03 19:35:01,845 INFO  [main] Configuration.deprecation 
(Configuration.java:warnOnceIfDeprecated(1009)) - mapred.input.dir.recursive is 
deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
2014-10-03 19:35:01,847 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 19:35:01,847 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 19:35:01,847 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 19:35:01,863 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 19:35:01,863 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(179)) - Parsing command: show tables
2014-10-03 19:35:01,863 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(197)) - Parse Completed
2014-10-03 19:35:01,863 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 19:35:01,863 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 19:35:01,941 INFO  [main] ql.Driver (Driver.java:compile(450)) - 
Semantic Analysis Completed
2014-10-03 19:35:01,941 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 19:35:01,979 INFO  [main] exec.ListSinkOperator 
(Operator.java:initialize(338)) - Initializing Self 0 OP
2014-10-03 19:35:01,980 INFO  [main] exec.ListSinkOperator 
(Operator.java:initializeChildren(403)) - Operator 0 OP initialized
2014-10-03 19:35:01,980 INFO  [main] exec.ListSinkOperator 
(Operator.java:initialize(378)) - Initialization Done 0 OP
2014-10-03 19:35:01,985 INFO  [main] ql.Driver (Driver.java:getSchema(264)) - 
Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, 
type:string, comment:from deserializer)], properties:null)
2014-10-03 19:35:01,985 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 19:35:01,985 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 19:35:01,985 INFO  [main] Configuration.deprecation 
(Configuration.java:warnOnceIfDeprecated(1009)) - mapred.job.name is 
deprecated. Instead, use mapreduce.job.name
2014-10-03 19:35:01,986 INFO  [main] ql.Driver (Driver.java:execute(1117)) - 
Starting command: show tables
2014-10-03 19:35:01,994 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogEnd(124)) - 
2014-10-03 19:35:01,994 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 19:35:01,994 INFO  [main] ql.Driver 
(PerfLogger.java:PerfLogBegin(97)) - 
2014-10-03 19:35:02,019 INFO  [main] metastore.HiveMetaStore 
(HiveMetaStore.java:newRawStore(411)) - 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
2014-10-03 19:35:02,034 INFO  [main] metastore.ObjectStore 
(ObjectStore.java:initialize(232)) - ObjectStore, initialize called
2014-10-03 19:35:02,084 WARN  [main] DataNucleus.General 
(Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus.api.jdo" is 
already registered. Ensure you dont have multiple JAR versions of the same 
plugin in the classpath. The URL 
"file://hadoop/spark/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file://hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar."
2014-10-03 19:35:02,108 WARN  [main] DataNucleus.General 
(Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus" is already 
registered. Ensure you dont have multiple JAR versions of the same plugin in 
the classpath. The URL 
"file://hadoop/spark/lib_managed/jars/datanucleus-core-3.2.2.jar" is already 
registered, and you are trying to register an identical plugin located at URL 
"file://hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-core-3.2.2.jar."
2014-10-03 19:35:02,137 WARN  [main] DataNucleus.General 
(Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus.store.rdbms" is 
already registered. Ensure you dont have multiple JAR versions of the sa

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread arthur.hk.c...@gmail.com
Wonderful !!

On 11 Oct, 2014, at 12:00 am, Nan Zhu  wrote:

> Great! Congratulations!
> 
> -- 
> Nan Zhu
> On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote:
> 
>> Brilliant stuff ! Congrats all :-)
>> This is indeed really heartening news !
>> 
>> Regards,
>> Mridul
>> 
>> 
>> On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia  
>> wrote:
>>> Hi folks,
>>> 
>>> I interrupt your regularly scheduled user / dev list to bring you some 
>>> pretty cool news for the project, which is that we've been able to use 
>>> Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x 
>>> faster on 10x fewer nodes. There's a detailed writeup at 
>>> http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
>>>  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
>>> sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 
>>> 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
>>> 
>>> I want to thank Reynold Xin for leading this effort over the past few 
>>> weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali 
>>> Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for 
>>> providing the machines to make this possible. Finally, this result would of 
>>> course not be possible without the many many other contributions, testing 
>>> and feature requests from throughout the community.
>>> 
>>> For an engine to scale from these multi-hour petabyte batch jobs down to 
>>> 100-millisecond streaming and interactive queries is quite uncommon, and 
>>> it's thanks to all of you folks that we are able to make this happen.
>>> 
>>> Matei
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 



How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread arthur.hk.c...@gmail.com
Hi,

My Spark version is 1.1.0 and Hive is 0.12.0. I need to use more than one 
subquery in my Spark SQL; below are my sample table structures and a SQL statement 
that contains more than one subquery.

Question 1:  How do I load a Hive table into Scala/Spark?
Question 2:  How do I implement a SQL statement with more than one subquery in Scala/Spark?
Question 3:  Is there a DATEADD function in Scala/Spark, or how can I implement 
"DATEADD(MONTH, 3, '2013-07-01')" and "DATEADD(YEAR, 1, '2014-01-01')" in Spark 
or Hive? 
I can find Hive's date_add(string startdate, int days), but it works in days, not 
in months or years.
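
A minimal spark-shell sketch for Question 1 and Question 3 (assuming a 
Hive-enabled Spark 1.1 build and that the tables already exist in the Hive 
metastore; the dateAdd helper below is illustrative, not a built-in function):

import java.text.SimpleDateFormat
import java.util.Calendar

// Question 1: querying a Hive table through HiveContext returns it as a SchemaRDD.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val lineitem = hiveContext.sql("SELECT * FROM LINEITEM")

// Question 3: compute the month/year offset on the driver and splice the result into
// the SQL string, since Hive 0.12's date_add() only shifts by whole days.
def dateAdd(field: Int, amount: Int, start: String): String = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  val cal = Calendar.getInstance()
  cal.setTime(fmt.parse(start))
  cal.add(field, amount)                      // Calendar.MONTH or Calendar.YEAR
  fmt.format(cal.getTime)
}

val cutoff = dateAdd(Calendar.MONTH, 3, "2013-07-01")   // "2013-10-01"
hiveContext.sql(s"SELECT COUNT(1) FROM LINEITEM WHERE L_SHIPDATE < '$cutoff'")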

Thanks.

Regards
Arthur

===
My sample SQL with more than 1 subquery: 
SELECT S_NAME, 
   COUNT(*) AS NUMWAIT 
FROM   SUPPLIER, 
   LINEITEM L1, 
   ORDERS
WHERE  S_SUPPKEY = L1.L_SUPPKEY 
   AND O_ORDERKEY = L1.L_ORDERKEY 
   AND O_ORDERSTATUS = 'F' 
   AND L1.L_RECEIPTDATE > L1.L_COMMITDATE 
   AND EXISTS (SELECT * 
   FROM   LINEITEM L2 
   WHERE  L2.L_ORDERKEY = L1.L_ORDERKEY 
  AND L2.L_SUPPKEY <> L1.L_SUPPKEY) 
   AND NOT EXISTS (SELECT * 
   FROM   LINEITEM L3 
   WHERE  L3.L_ORDERKEY = L1.L_ORDERKEY 
  AND L3.L_SUPPKEY <> L1.L_SUPPKEY 
  AND L3.L_RECEIPTDATE > L3.L_COMMITDATE) 
GROUP  BY S_NAME 
ORDER  BY NUMWAIT DESC, S_NAME
limit 100;  


===
Supplier Table:
CREATE TABLE IF NOT EXISTS SUPPLIER (
S_SUPPKEY   INTEGER PRIMARY KEY,
S_NAME  CHAR(25),
S_ADDRESS   VARCHAR(40),
S_NATIONKEY BIGINT NOT NULL, 
S_PHONE CHAR(15),
S_ACCTBAL   DECIMAL,
S_COMMENT   VARCHAR(101)
) 

===
Order Table:
CREATE TABLE IF NOT EXISTS ORDERS (
O_ORDERKEY  INTEGER PRIMARY KEY,
O_CUSTKEY   BIGINT NOT NULL, 
O_ORDERSTATUS   CHAR(1),
O_TOTALPRICEDECIMAL,
O_ORDERDATE CHAR(10),
O_ORDERPRIORITY CHAR(15),
O_CLERK CHAR(15),
O_SHIPPRIORITY  INTEGER,
O_COMMENT   VARCHAR(79)

===
LineItem Table:
CREATE TABLE IF NOT EXISTS LINEITEM (
L_ORDERKEY  BIGINT not null,
L_PARTKEY   BIGINT,
L_SUPPKEY   BIGINT,
L_LINENUMBERINTEGER not null,
L_QUANTITY  DECIMAL,
L_EXTENDEDPRICE DECIMAL,
L_DISCOUNT  DECIMAL,
L_TAX   DECIMAL,
L_SHIPDATE  CHAR(10),
L_COMMITDATECHAR(10),
L_RECEIPTDATE   CHAR(10),
L_RETURNFLAGCHAR(1),
L_LINESTATUSCHAR(1),
L_SHIPINSTRUCT  CHAR(25),
L_SHIPMODE  CHAR(10),
L_COMMENT   VARCHAR(44),
CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER )
)



Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-13 Thread arthur.hk.c...@gmail.com
Hi,

Thank you so much!

By the way, is there a DATEADD function in Scala/Spark, or how can I implement 
"DATEADD(MONTH, 3, '2013-07-01')" and "DATEADD(YEAR, 1, '2014-01-01')" in Spark 
or Hive? 

Regards
Arthur


On 12 Oct, 2014, at 12:03 pm, Ilya Ganelin  wrote:

> Because of how closures work in Scala, there is no support for nested 
> map/rdd-based operations. Specifically, if you have
> 
> Context a {
> Context b {
> 
> }
> }
> 
> Operations within context b, when distributed across nodes, will no longer 
> have visibility of variables specific to context a because that context is 
> not distributed alongside that operation!
> 
> To get around this you need to serialize your operations. For example, run a 
> map job, take the output of that, and run a second map job to filter. Another 
> option is to run two separate map jobs and join their results. Keep in mind 
> that another useful technique is to execute the groupByKey routine, 
> particularly if you want to operate on a particular variable.
> 
> On Oct 11, 2014 11:09 AM, "arthur.hk.c...@gmail.com" 
>  wrote:
> Hi,
> 
> My Spark version is v1.1.0 and Hive is 0.12.0, I need to use more than 1 
> subquery in my Spark SQL, below are my sample table structures and a SQL that 
> contains more than 1 subquery. 
> 
> Question 1:  How to load a HIVE table into Scala/Spark?
> Question 2:  How to implement a SQL_WITH_MORE_THAN_ONE_SUBQUERY  in 
> SCALA/SPARK?
> Question 3:  What is the DATEADD function in Scala/Spark? or how to implement 
>  "DATEADD(MONTH, 3, '2013-07-01')” and "DATEADD(YEAR, 1, '2014-01-01')” in 
> Spark or Hive? 
> I can find HIVE (date_add(string startdate, int days)) but it is in days not 
> MONTH / YEAR.
> 
> Thanks.
> 
> Regards
> Arthur
> 
> ===
> My sample SQL with more than 1 subquery: 
> SELECT S_NAME, 
>COUNT(*) AS NUMWAIT 
> FROM   SUPPLIER, 
>LINEITEM L1, 
>ORDERS
> WHERE  S_SUPPKEY = L1.L_SUPPKEY 
>AND O_ORDERKEY = L1.L_ORDERKEY 
>AND O_ORDERSTATUS = 'F' 
>AND L1.L_RECEIPTDATE > L1.L_COMMITDATE 
>AND EXISTS (SELECT * 
>FROM   LINEITEM L2 
>WHERE  L2.L_ORDERKEY = L1.L_ORDERKEY 
>   AND L2.L_SUPPKEY <> L1.L_SUPPKEY) 
>AND NOT EXISTS (SELECT * 
>FROM   LINEITEM L3 
>WHERE  L3.L_ORDERKEY = L1.L_ORDERKEY 
>   AND L3.L_SUPPKEY <> L1.L_SUPPKEY 
>   AND L3.L_RECEIPTDATE > L3.L_COMMITDATE) 
> GROUP  BY S_NAME 
> ORDER  BY NUMWAIT DESC, S_NAME
> limit 100;  
> 
> 
> ===
> Supplier Table:
> CREATE TABLE IF NOT EXISTS SUPPLIER (
> S_SUPPKEY INTEGER PRIMARY KEY,
> S_NAMECHAR(25),
> S_ADDRESS VARCHAR(40),
> S_NATIONKEY   BIGINT NOT NULL, 
> S_PHONE   CHAR(15),
> S_ACCTBAL DECIMAL,
> S_COMMENT VARCHAR(101)
> ) 
> 
> ===
> Order Table:
> CREATE TABLE IF NOT EXISTS ORDERS (
> O_ORDERKEYINTEGER PRIMARY KEY,
> O_CUSTKEY BIGINT NOT NULL, 
> O_ORDERSTATUS CHAR(1),
> O_TOTALPRICE  DECIMAL,
> O_ORDERDATE   CHAR(10),
> O_ORDERPRIORITY   CHAR(15),
> O_CLERK   CHAR(15),
> O_SHIPPRIORITYINTEGER,
> O_COMMENT VARCHAR(79)
> 
> ===
> LineItem Table:
> CREATE TABLE IF NOT EXISTS LINEITEM (
> L_ORDERKEY  BIGINT not null,
> L_PARTKEY   BIGINT,
> L_SUPPKEY   BIGINT,
> L_LINENUMBERINTEGER not null,
> L_QUANTITY  DECIMAL,
> L_EXTENDEDPRICE DECIMAL,
> L_DISCOUNT  DECIMAL,
> L_TAX   DECIMAL,
> L_SHIPDATE  CHAR(10),
> L_COMMITDATECHAR(10),
> L_RECEIPTDATE   CHAR(10),
> L_RETURNFLAGCHAR(1),
> L_LINESTATUSCHAR(1),
> L_SHIPINSTRUCT  CHAR(25),
> L_SHIPMODE  CHAR(10),
> L_COMMENT   VARCHAR(44),
> CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER )
> )
> 



Spark Hive Snappy Error

2014-10-16 Thread arthur.hk.c...@gmail.com
Hi,

When trying Spark with a Hive table, I got the "java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I" error:


val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql(“select count(1) from q8_national_market_share
sqlContext.sql("select count(1) from 
q8_national_market_share").collect().foreach(println)
java.lang.UnsatisfiedLinkError: 
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at 
org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79)
at 
org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
at 
org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:68)
at 
org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at $iwC$$iwC$$iwC$$iwC.(:15)
at $iwC$$iwC$$iwC.(:20)
at $iwC$$iwC.(:22)
at $iwC.(:24)
at (:26)
at .(:30)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at

Spark/HIVE Insert Into values Error

2014-10-17 Thread arthur.hk.c...@gmail.com
Hi,

When trying to insert records into Hive, I got an error.

My Spark is 1.1.0 and Hive is 0.12.0.

Any idea what would be wrong?
Regards
Arthur



hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);  
OK

hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
NoViableAltException(26@[])
at 
org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at 
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
at 
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '(' 
''fred flintstone'' in select clause
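
For context: INSERT INTO ... VALUES was only added in Hive 0.14, so the Hive 0.12 
parser stops at the VALUES clause exactly as shown above. A hedged sketch of one 
common workaround is to select the literal row from an existing non-empty table 
("any_nonempty_table" below is a hypothetical placeholder); the same statement can 
also be run directly at the hive> prompt, and LOAD DATA from a delimited file 
avoids the parser issue entirely.

// Workaround sketch: Hive 0.12 accepts INSERT INTO TABLE ... SELECT, so the literals
// are selected from any table that has at least one row ("any_nonempty_table" is
// hypothetical).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql(
  "INSERT INTO TABLE students SELECT 'fred flintstone', 35, 1 FROM any_nonempty_table LIMIT 1")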




Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
hread.java:107)
2014-10-22 20:23:17,038 INFO  [sparkDriver-akka.actor.default-dispatcher-14] 
remote.RemoteActorRefProvider$RemotingTerminator 
(Slf4jLogger.scala:apply$mcV$sp(74)) - Shutting down remote daemon.
2014-10-22 20:23:17,039 INFO  [sparkDriver-akka.actor.default-dispatcher-14] 
remote.RemoteActorRefProvider$RemotingTerminator 
(Slf4jLogger.scala:apply$mcV$sp(74)) - Remote daemon shut down; proceeding with 
flushing remote transports.

 
Regards
Arthur

On 17 Oct, 2014, at 9:33 am, Shao, Saisai  wrote:

> Hi Arthur,
>  
> I think this is a known issue in Spark; you can check 
> https://issues.apache.org/jira/browse/SPARK-3958. I’m curious about it: can 
> you always reproduce this issue? Is it related to some specific data 
> sets? Would you mind giving me some information about your workload, Spark 
> configuration, JDK version and OS version?
>  
> Thanks
> Jerry
>  
> From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] 
> Sent: Friday, October 17, 2014 7:13 AM
> To: user
> Cc: arthur.hk.c...@gmail.com
> Subject: Spark Hive Snappy Error
>  
> Hi,
>  
> When trying Spark with Hive table, I got the “java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I” error,
>  
>  
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql(“select count(1) from q8_national_market_share
> sqlContext.sql("select count(1) from 
> q8_national_market_share").collect().foreach(println)
> java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>  at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>  at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>  at 
> org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79)
>  at 
> org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68)
>  at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
>  at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>  at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>  at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
>  at 
> org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:68)
>  at 
> org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68)
>  at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
>  at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
>  at 
> org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
>  at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
>  at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>  at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>  at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>  at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>  at 
> org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146)
>  at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>  at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>  at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>  at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
>  at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
>  at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
>  at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
>  at org.apache.spark.sql.SchemaRD

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

FYI, I use snappy-java-1.0.4.1.jar

Regards
Arthur


On 22 Oct, 2014, at 8:59 pm, Shao, Saisai  wrote:

> Thanks a lot. I will try to reproduce this in my local settings and dig into 
> the details. Thanks for your information.
>  
>  
> BR
> Jerry
>  
> From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] 
> Sent: Wednesday, October 22, 2014 8:35 PM
> To: Shao, Saisai
> Cc: arthur.hk.c...@gmail.com; user
> Subject: Re: Spark Hive Snappy Error
>  
> Hi,
>  
> Yes, I can always reproduce the issue:
>  
> about you workload, Spark configuration, JDK version and OS version?
>  
> I ran SparkPI 1000
>  
> java -version
> java version "1.7.0_67"
> Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
>  
> cat /etc/centos-release
> CentOS release 6.5 (Final)
>  
> My Spark’s hive-site.xml has the following:
> <property>
>   <name>hive.exec.compress.output</name>
>   <value>true</value>
> </property>
> <property>
>   <name>mapred.output.compression.codec</name>
>   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
> <property>
>   <name>mapred.output.compression.type</name>
>   <value>BLOCK</value>
> </property>
> 
> e.g.
> MASTER=spark://m1:7077,m2:7077 ./bin/run-example SparkPi 1000
> 2014-10-22 20:23:17,033 ERROR [sparkDriver-akka.actor.default-dispatcher-18] 
> actor.ActorSystemImpl (Slf4jLogger.scala:apply$mcV$sp(66)) - Uncaught fatal 
> error from thread [sparkDriver-akka.actor.default-dispatcher-2] shutting down 
> ActorSystem [sparkDriver]
> java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
>  at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
>  at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
>  at 
> org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79)
>  at 
> org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68)
>  at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
>  at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
>  at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
>  at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:829)
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:769)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:753)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1360)
>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>  at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>  at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2014-10-22 20:23:17,036 INFO  [main] scheduler.DAGScheduler 
> (Logging.scala:logInfo(59)) - Failed to run reduce at SparkPi.scala:35
> Exception in thread "main" org.apache.spark.SparkException: Job cancelled 
> because SparkContext was shut down
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:694)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:693)
>  at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>  at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:693)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1399)
>  at 
> akka.

ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

I just tried the sample Pi calculation on my Spark cluster; after returning the Pi 
result, it shows "ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m37,35662) not found".

./bin/spark-submit   --class org.apache.spark.examples.SparkPi   --master 
spark://m33:7077   --executor-memory 512m  --total-executor-cores 40  
examples/target/spark-examples_2.10-1.1.0.jar 100

14/10/23 05:09:03 INFO TaskSetManager: Finished task 87.0 in stage 0.0 (TID 87) 
in 346 ms on m134.emblocsoft.net (99/100)
14/10/23 05:09:03 INFO TaskSetManager: Finished task 98.0 in stage 0.0 (TID 98) 
in 262 ms on m134.emblocsoft.net (100/100)
14/10/23 05:09:03 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have 
all completed, from pool 
14/10/23 05:09:03 INFO DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) 
finished in 2.597 s
14/10/23 05:09:03 INFO SparkContext: Job finished: reduce at SparkPi.scala:35, 
took 2.725328861 s
Pi is roughly 3.1414948
14/10/23 05:09:03 INFO SparkUI: Stopped Spark web UI at http://m33:4040
14/10/23 05:09:03 INFO DAGScheduler: Stopping DAGScheduler
14/10/23 05:09:03 INFO SparkDeploySchedulerBackend: Shutting down all executors
14/10/23 05:09:03 INFO SparkDeploySchedulerBackend: Asking each executor to 
shut down
14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? 
sun.nio.ch.SelectionKeyImpl@37852165
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m37,35662)
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m37,35662)
14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m37,35662) not found
14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@37852165
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at 
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m36,34230)
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m35,50371)
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m36,34230)
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m34,41562)
14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m36,34230) not found
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m35,50371)
14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m35,50371) not found
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m33,39517)
14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to 
ConnectionManagerId(m33,39517)
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m34,41562)
14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to 
ConnectionManagerId(m33,39517) not found
14/10/23 05:09:04 ERROR SendingConnection: Exception while reading 
SendingConnection to ConnectionManagerId(m34,41562)
java.nio.channels.ClosedChannelException
at 
sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
at 
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/10/23 05:09:04 INFO ConnectionManager: Handling connection error on 
connection to ConnectionManagerId(m34,41562)
14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to 
ConnectionManagerId(m34,41562)
14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? 
sun.nio.ch.SelectionKeyImpl@2e0b5c4a
14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@2e0b5c4a
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at 
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? 
sun.nio.ch.SelectionKeyImpl@653f8844
14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? 
sun.nio.ch.SelectionKeyImpl@653f8844
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386)
at 
org.apache.spark.network.ConnectionManager$$anon$4.run(Connecti

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

I have managed to resolve it; it was caused by a wrong setting. Please ignore this.

Regards
Arthur

On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com 
 wrote:

> 
> 14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up
> 



Spark: Order by Failed, java.lang.NullPointerException

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

I got a java.lang.NullPointerException. Please help!


sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, 
l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit 
10").collect().foreach(println);

2014-10-23 08:20:12,024 INFO  [sparkDriver-akka.actor.default-dispatcher-31] 
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 41 (runJob at 
basicOperators.scala:136) finished in 0.086 s
2014-10-23 08:20:12,024 INFO  [Result resolver thread-1] 
scheduler.TaskSchedulerImpl (Logging.scala:logInfo(59)) - Removed TaskSet 41.0, 
whose tasks have all completed, from pool 
2014-10-23 08:20:12,024 INFO  [main] spark.SparkContext 
(Logging.scala:logInfo(59)) - Job finished: runJob at basicOperators.scala:136, 
took 0.090129332 s
[9001,6,-4584121,17,1997-01-04,N,O]
[9002,1,-2818574,23,1996-02-16,N,O]
[9002,2,-2449102,21,1993-12-12,A,F]
[9002,3,-5810699,26,1994-04-06,A,F]
[9002,4,-489283,18,1994-11-11,R,F]
[9002,5,2169683,15,1997-09-14,N,O]
[9002,6,2405081,4,1992-08-03,R,F]
[9002,7,3835341,40,1998-04-28,N,O]
[9003,1,1900071,4,1994-05-05,R,F]
[9004,1,-2614665,41,1993-06-13,A,F]


If "order by L_LINESTATUS” is added then error:
sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, 
l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS 
limit 10").collect().foreach(println);

2014-10-23 08:22:08,524 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(179)) - Parsing command: select l_orderkey, 
l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS 
from lineitem order by L_LINESTATUS limit 10
2014-10-23 08:22:08,525 INFO  [main] parse.ParseDriver 
(ParseDriver.java:parse(197)) - Parse Completed
2014-10-23 08:22:08,526 INFO  [main] metastore.HiveMetaStore 
(HiveMetaStore.java:logInfo(454)) - 0: get_table : db=boc_12 tbl=lineitem
2014-10-23 08:22:08,526 INFO  [main] HiveMetaStore.audit 
(HiveMetaStore.java:logAuditEvent(239)) - ugi=hd   ip=unknown-ip-addr  
cmd=get_table : db=boc_12 tbl=lineitem  
java.lang.NullPointerException
at 
org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
at 
org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
at 
org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:63)
at 
org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at 
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at 
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at 
org.apache.spark.sql.execution.SparkStrategies$TakeOrdered$.apply(SparkStrategies.scala:191)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at $iwC$$iwC$$iwC$$iwC.(:15)
at $iwC$$iwC$$iwC.(:20)
at $iwC$$iwC.(:22)
at $iwC.(:24)
at (:26)
at .(:30)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.ap

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

Please find the attached file.

lsof -p 16459 (Master)
COMMAND   PIDUSER   FD   TYPE DEVICE  SIZE/OFF NODE NAME\
java16459 tester  cwdDIR  253,2  4096  6039786 /hadoop/spark-1.1.0_patched\
java16459 tester  rtdDIR  253,0  40962 /\
java16459 tester  txtREG  253,0 12150  2780995 /usr/lib/jvm/jdk1.7.0_67/bin/java\
java16459 tester  memREG  253,0156928  2228230 /lib64/ld-2.12.so\
java16459 tester  memREG  253,0   1926680  2228250 /lib64/libc-2.12.so\
java16459 tester  memREG  253,0145896  2228251 /lib64/libpthread-2.12.so\
java16459 tester  memREG  253,0 22536  2228254 /lib64/libdl-2.12.so\
java16459 tester  memREG  253,0109006  2759278 /usr/lib/jvm/jdk1.7.0_67/lib/amd64/jli/libjli.so\
java16459 tester  memREG  253,0599384  2228264 /lib64/libm-2.12.so\
java16459 tester  memREG  253,0 47064  2228295 /lib64/librt-2.12.so\
java16459 tester  memREG  253,0113952  2228328 /lib64/libresolv-2.12.so\
java16459 tester  memREG  253,0  99158576  2388225 /usr/lib/locale/locale-archive\
java16459 tester  memREG  253,0 27424  2228249 /lib64/libnss_dns-2.12.so\
java16459 tester  memREG  253,2 138832345  6555616 /hadoop/spark-1.1.0_patched/assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.4.1.jar\
java16459 tester  memREG  253,0580624  2893171 /usr/lib/jvm/jdk1.7.0_67/jre/lib/jsse.jar\
java16459 tester  memREG  253,0114742  2893221 /usr/lib/jvm/jdk1.7.0_67/jre/lib/amd64/libnet.so\
java16459 tester  memREG  253,0 91178  2893222 /usr/lib/jvm/jdk1.7.0_67/jre/lib/amd64/libnio.so\
java16459 tester  memREG  253,2   1769726  6816963 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-rdbms-3.2.1.jar\
java16459 tester  memREG  253,2337012  6816961 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar\
java16459 tester  memREG  253,2   1801810  6816962 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-core-3.2.2.jar\
java16459 tester  memREG  253,2 25153  7079998 /hadoop/hive-0.12.0-bin/csv-serde-1.1.2-0.11.0-all.jar\
java16459 tester  memREG  253,2 21817  6032989 /hadoop/hbase-0.98.5-hadoop2/lib/gmbal-api-only-3.0.0-b023.jar\
java16459 tester  memREG  253,2177131  6032940 /hadoop/hbase-0.98.5-hadoop2/lib/jetty-util-6.1.26.jar\
java16459 tester  memREG  253,2 32677  6032915 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-hadoop-compat-0.98.5-hadoop2.jar\
java16459 tester  memREG  253,2143602  6032959 /hadoop/hbase-0.98.5-hadoop2/lib/commons-digester-1.8.jar\
java16459 tester  memREG  253,2 97738  6032917 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-prefix-tree-0.98.5-hadoop2.jar\
java16459 tester  memREG  253,2 17884  6032949 /hadoop/hbase-0.98.5-hadoop2/lib/jackson-jaxrs-1.8.8.jar\
java16459 tester  memREG  253,2253086  6032987 /hadoop/hbase-0.98.5-hadoop2/lib/grizzly-http-2.1.2.jar\
java16459 tester  memREG  253,2 73778  6032916 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-hadoop2-compat-0.98.5-hadoop2.jar\
java16459 tester  memREG  253,2336904  6032993 /hadoop/hbase-0.98.5-hadoop2/lib/grizzly-http-servlet-2.1.2.jar\
java16459 tester  memREG  253,2927415  6032914 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-client-0.98.5-hadoop2.jar\
java16459 tester  memREG  253,2125740  6033008 /hadoop/hbase-0.98.5-hadoop2/lib/hadoop-yarn-server-applicationhistoryservice-2.4.1.jar\
java16459 tester  memREG  253,2 15010  6032936 /hadoop/hbase-0.98.5-hadoop2/lib/xmlenc-0.52.jar\
java16459 tester  memREG  253,2 60686  6032926 /hadoop/hbase-0.98.5-hadoop2/lib/commons-logging-1.1.1.jar\
java16459 tester  memREG  253,2259600  6032927 /hadoop/hbase-0.98.5-hadoop2/lib/commons-codec-1.7.jar\
java16459 tester  memREG  253,2321806  6032957 /hadoop/hbase-0.98.5-hadoop2/lib/jets3t-0.6.1.jar\
java16459 tester  memREG  253,2 85353  6032982 /hadoop/hbase-0.98.5-hadoop2/lib/javax.servlet-api-3.0.1.jar\
java16459 tester  memREG

Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi

May I know where to configure Spark to load libhadoop.so?

Regards
Arthur
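
A quick hedged check from the spark-shell of where the JVM will look for native 
libraries such as libhadoop.so and libsnappy.so (on this setup the directories 
come from JAVA_LIBRARY_PATH / SPARK_LIBRARY_PATH in the spark-env.sh quoted below):

// Prints the JVM's native library search path; libhadoop.so must live in one of
// these directories for the native Hadoop/Snappy bindings to load.
println(System.getProperty("java.library.path"))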

On 23 Oct, 2014, at 11:31 am, arthur.hk.c...@gmail.com 
 wrote:

> Hi,
> 
> Please find the attached file.
> 
> 
> 
> 
> my spark-default.xml
> # Default system properties included when running spark-submit.
> # This is useful for setting default environmental settings.
> #
> # Example:
> # spark.master                 spark://master:7077
> # spark.eventLog.enabled       true
> # spark.eventLog.dir           hdfs://namenode:8021/directory
> # spark.serializer             org.apache.spark.serializer.KryoSerializer
> #
> spark.executor.memory          2048m
> spark.shuffle.spill.compress   false
> spark.io.compression.codec     org.apache.spark.io.SnappyCompressionCodec
> 
> 
> 
> my spark-env.sh
> #!/usr/bin/env bash
> export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar"
> export 
> CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar"
> export JAVA_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64"
> export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
> export SPARK_WORKER_DIR="/edh/hadoop_data/spark_work/"
> export SPARK_LOG_DIR="/edh/hadoop_logs/spark"
> export SPARK_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64"
> export 
> SPARK_CLASSPATH="$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar"
> export 
> SPARK_CLASSPATH="$SPARK_CLASSPATH:$HBASE_HOME/lib/*:$HIVE_HOME/csv-serde-1.1.2-0.11.0-all.jar:"
> export SPARK_WORKER_MEMORY=2g
> export HADOOP_HEAPSIZE=2000
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=m35:2181,m33:2181,m37:2181"
> export SPARK_JAVA_OPTS=" -XX:+UseConcMarkSweepGC"
> 
> 
> ll $HADOOP_HOME/lib/native/Linux-amd64-64
> -rw-rw-r--. 1 tester tester50523 Aug 27 14:12 hadoop-auth-2.4.1.jar
> -rw-rw-r--. 1 tester tester  1062640 Aug 27 12:19 libhadoop.a
> -rw-rw-r--. 1 tester tester  1487564 Aug 27 11:14 libhadooppipes.a
> lrwxrwxrwx. 1 tester tester   24 Aug 27 07:08 libhadoopsnappy.so -> 
> libhadoopsnappy.so.0.0.1
> lrwxrwxrwx. 1 tester tester   24 Aug 27 07:08 libhadoopsnappy.so.0 -> 
> libhadoopsnappy.so.0.0.1
> -rwxr-xr-x. 1 tester tester54961 Aug 27 07:08 libhadoopsnappy.so.0.0.1
> -rwxrwxr-x. 1 tester tester   630328 Aug 27 12:19 libhadoop.so
> -rwxrwxr-x. 1 tester tester   630328 Aug 27 12:19 libhadoop.so.1.0.0
> -rw-rw-r--. 1 tester tester   582472 Aug 27 11:14 libhadooputils.a
> -rw-rw-r--. 1 tester tester   298626 Aug 27 11:14 libhdfs.a
> -rwxrwxr-x. 1 tester tester   200370 Aug 27 11:14 libhdfs.so
> -rwxrwxr-x. 1 tester tester   200370 Aug 27 11:14 libhdfs.so.0.0.0
> lrwxrwxrwx. 1 tester tester 55 Aug 27 07:08 libjvm.so -> 
> /usr/lib/jvm/jdk1.6.0_45/jre/lib/amd64/server/libjvm.so
> lrwxrwxrwx. 1 tester tester   25 Aug 27 07:08 libprotobuf-lite.so -> 
> libprotobuf-lite.so.8.0.0
> lrwxrwxrwx. 1 tester tester   25 Aug 27 07:08 libprotobuf-lite.so.8 
> -> libprotobuf-lite.so.8.0.0
> -rwxr-xr-x. 1 tester tester   964689 Aug 27 07:08 
> libprotobuf-lite.so.8.0.0
> lrwxrwxrwx. 1 tester tester   20 Aug 27 07:08 libprotobuf.so -> 
> libprotobuf.so.8.0.0
> lrwxrwxrwx. 1 tester tester   20 Aug 27 07:08 libprotobuf.so.8 -> 
> libprotobuf.so.8.0.0
> -rwxr-xr-x. 1 tester tester  8300050 Aug 27 07:08 libprotobuf.so.8.0.0
> lrwxrwxrwx. 1 tester tester   18 Aug 27 07:08 libprotoc.so -> 
> libprotoc.so.8.0.0
> lrwxrwxrwx. 1 tester tester   18 Aug 27 07:08 libprotoc.so.8 -> 
> libprotoc.so.8.0.0
> -rwxr-xr-x. 1 tester tester  9935810 Aug 27 07:08 libprotoc.so.8.0.0
> -rw-r--r--. 1 tester tester   233554 Aug 27 15:19 libsnappy.a
> lrwxrwxrwx. 1 tester tester   23 Aug 27 11:32 libsnappy.so -> 
> /usr/lib64/libsnappy.so
> lrwxrwxrwx. 1 tester tester   23 Aug 27 11:33 libsnappy.so.1 -> 
> /usr/lib64/libsnappy.so
> -rwxr-xr-x. 1 tester tester   147726 Aug 27 07:08 libsnappy.so.1.2.0
> drwxr-xr-x. 2 tester tester 4096 Aug 27 07:08 pkgconfig
> 
> 
> Regards
> Arthur
> 
> 
> On 23 Oct, 2014, at 10:57 am, Shao, Saisai  wrote:
> 
>> Hi Arthur,
>>  
>> I think your problem might be different from what 
>> SPARK-3958 (https://issues.apache.org/jira/browse/SPARK-3958) mentioned; it 
>> seems your problem is more likely to be a library link problem. Would you 
>> mind checking your Spark runtime to see if the snappy .so is loaded or not 
>> (through lsof -p)?
>>  
>> I guess your problem is more likely to be a library not found problem.
>>  
>>  
>> Thanks
>> Jerry
>>  
>> 



Re: Spark Hive Snappy Error

2014-10-22 Thread arthur.hk.c...@gmail.com
Hi,

Removed export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar" from spark-env.sh.

It works, THANK YOU!!

Regards 
Arthur
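
For anyone hitting the same conflict, a quick hedged check from the spark-shell 
shows which jar the Snappy binding is actually loaded from (the class is the one 
named in the stack traces above):

// Prints the code source (jar) from which the driver JVM loaded the snappy-java class;
// useful for spotting which of several Snappy-related jars on the classpath actually won.
val snappyJar = classOf[org.xerial.snappy.Snappy].getProtectionDomain.getCodeSource.getLocation
println(snappyJar)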
 

On 23 Oct, 2014, at 1:00 pm, Shao, Saisai  wrote:

> It seems you just added the snappy library into your classpath:
>  
> export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar"
>  
> But for Spark itself, it depends on snappy-0.2.jar. Is there any possibility 
> that this problem is caused by a different version of snappy?
>  
> Thanks
> Jerry
>  
> From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] 
> Sent: Thursday, October 23, 2014 11:32 AM
> To: Shao, Saisai
> Cc: arthur.hk.c...@gmail.com; user
> Subject: Re: Spark Hive Snappy Error
>  
> Hi,
>  
> Please find the attached file.
>  
>  
>  
> my spark-default.xml
> # Default system properties included when running spark-submit.
> # This is useful for setting default environmental settings.
> #
> # Example:
> # spark.master                 spark://master:7077
> # spark.eventLog.enabled       true
> # spark.eventLog.dir           hdfs://namenode:8021/directory
> # spark.serializer             org.apache.spark.serializer.KryoSerializer
> #
> spark.executor.memory          2048m
> spark.shuffle.spill.compress   false
> spark.io.compression.codec     org.apache.spark.io.SnappyCompressionCodec
>  
>  
>  
> my spark-env.sh
> #!/usr/bin/env bash
> export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar"
> export 
> CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar"
> export JAVA_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64"
> export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
> export SPARK_WORKER_DIR="/edh/hadoop_data/spark_work/"
> export SPARK_LOG_DIR="/edh/hadoop_logs/spark"
> export SPARK_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64"
> export 
> SPARK_CLASSPATH="$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar"
> export 
> SPARK_CLASSPATH="$SPARK_CLASSPATH:$HBASE_HOME/lib/*:$HIVE_HOME/csv-serde-1.1.2-0.11.0-all.jar:"
> export SPARK_WORKER_MEMORY=2g
> export HADOOP_HEAPSIZE=2000
> export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER 
> -Dspark.deploy.zookeeper.url=m35:2181,m33:2181,m37:2181"
> export SPARK_JAVA_OPTS=" -XX:+UseConcMarkSweepGC"
>  
>  
> ll $HADOOP_HOME/lib/native/Linux-amd64-64
> -rw-rw-r--. 1 tester tester50523 Aug 27 14:12 hadoop-auth-2.4.1.jar
> -rw-rw-r--. 1 tester tester  1062640 Aug 27 12:19 libhadoop.a
> -rw-rw-r--. 1 tester tester  1487564 Aug 27 11:14 libhadooppipes.a
> lrwxrwxrwx. 1 tester tester   24 Aug 27 07:08 libhadoopsnappy.so -> 
> libhadoopsnappy.so.0.0.1
> lrwxrwxrwx. 1 tester tester   24 Aug 27 07:08 libhadoopsnappy.so.0 -> 
> libhadoopsnappy.so.0.0.1
> -rwxr-xr-x. 1 tester tester54961 Aug 27 07:08 libhadoopsnappy.so.0.0.1
> -rwxrwxr-x. 1 tester tester   630328 Aug 27 12:19 libhadoop.so
> -rwxrwxr-x. 1 tester tester   630328 Aug 27 12:19 libhadoop.so.1.0.0
> -rw-rw-r--. 1 tester tester   582472 Aug 27 11:14 libhadooputils.a
> -rw-rw-r--. 1 tester tester   298626 Aug 27 11:14 libhdfs.a
> -rwxrwxr-x. 1 tester tester   200370 Aug 27 11:14 libhdfs.so
> -rwxrwxr-x. 1 tester tester   200370 Aug 27 11:14 libhdfs.so.0.0.0
> lrwxrwxrwx. 1 tester tester 55 Aug 27 07:08 libjvm.so 
> ->/usr/lib/jvm/jdk1.6.0_45/jre/lib/amd64/server/libjvm.so
> lrwxrwxrwx. 1 tester tester   25 Aug 27 07:08 libprotobuf-lite.so -> 
> libprotobuf-lite.so.8.0.0
> lrwxrwxrwx. 1 tester tester   25 Aug 27 07:08 libprotobuf-lite.so.8 
> -> libprotobuf-lite.so.8.0.0
> -rwxr-xr-x. 1 tester tester   964689 Aug 27 07:08 
> libprotobuf-lite.so.8.0.0
> lrwxrwxrwx. 1 tester tester   20 Aug 27 07:08 libprotobuf.so -> 
> libprotobuf.so.8.0.0
> lrwxrwxrwx. 1 tester tester   20 Aug 27 07:08 libprotobuf.so.8 -> 
> libprotobuf.so.8.0.0
> -rwxr-xr-x. 1 tester tester  8300050 Aug 27 07:08 libprotobuf.so.8.0.0
> lrwxrwxrwx. 1 tester tester   18 Aug 27 07:08 libprotoc.so -> 
> libprotoc.so.8.0.0
> lrwxrwxrwx. 1 tester tester   18 Aug 27 07:08 libprotoc.so.8 -> 
> libprotoc.so.8.0.0
> -rwxr-xr-x. 1 tester tester  9935810 Aug 27 07:08 libprotoc.so.8.0.0
> -rw-r--r--. 1 tester tester   233554 Aug 27 15:19 libsnappy.a
> lrwxrwxrwx. 1 tester tester   23 Aug 27 11:32 libsnappy.so -> 
> /usr/lib64/libsnappy.so
> lrwxrwxrwx. 1 tester tester   23 Aug 27 11:33 libsnappy.so.1 -> 
> /usr/lib64/libsnappy.so
> -rwxr-xr-x. 1 tester tester   147726 Aug 27 07:08 libsnappy.so.1.2.0

Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi,

I got a $TreeNodeException, and have a few questions:
Q1) How should I do aggregation in Spark? Can I use aggregation directly in 
SQL? or
Q2) Should I use SQL to load the data to form an RDD and then use Scala to do the 
aggregation?

Regards
Arthur


My SQL (good one, without aggregation): 
sqlContext.sql("SELECT L_RETURNFLAG FROM LINEITEM WHERE  
L_SHIPDATE<='1998-09-02'  GROUP  BY L_RETURNFLAG, L_LINESTATUS ORDER  BY 
L_RETURNFLAG, L_LINESTATUS").collect().foreach(println);
[A]
[N]
[N]
[R]


My SQL (problem SQL, with aggregation):
sqlContext.sql("SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS SUM_QTY, 
SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT 
)) AS SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT ) * ( 1 + L_TAX )) 
AS SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) AS AVG_PRICE, 
AVG(L_DISCOUNT) AS AVG_DISC, COUNT(*) AS COUNT_ORDER  FROM LINEITEM WHERE  
L_SHIPDATE<='1998-09-02'  GROUP  BY L_RETURNFLAG, L_LINESTATUS ORDER  BY 
L_RETURNFLAG, L_LINESTATUS").collect().foreach(println);

14/10/23 20:38:31 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks 
have all completed, from pool 
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [l_returnflag#200 ASC,l_linestatus#201 ASC], true
 Exchange (RangePartitioning [l_returnflag#200 ASC,l_linestatus#201 ASC], 200)
  Aggregate false, [l_returnflag#200,l_linestatus#201], 
[l_returnflag#200,l_linestatus#201,SUM(PartialSum#216) AS 
sum_qty#181,SUM(PartialSum#217) AS sum_base_price#182,SUM(PartialSum#218) AS 
sum_disc_price#183,SUM(PartialSum#219) AS 
sum_charge#184,(CAST(SUM(PartialSum#220), DoubleType) / 
CAST(SUM(PartialCount#221L), DoubleType)) AS 
avg_qty#185,(CAST(SUM(PartialSum#222), DoubleType) / 
CAST(SUM(PartialCount#223L), DoubleType)) AS 
avg_price#186,(CAST(SUM(PartialSum#224), DoubleType) / 
CAST(SUM(PartialCount#225L), DoubleType)) AS 
avg_disc#187,SUM(PartialCount#226L) AS count_order#188L]
   Exchange (HashPartitioning [l_returnflag#200,l_linestatus#201], 200)
Aggregate true, [l_returnflag#200,l_linestatus#201], 
[l_returnflag#200,l_linestatus#201,COUNT(l_discount#195) AS 
PartialCount#225L,SUM(l_discount#195) AS PartialSum#224,COUNT(1) AS 
PartialCount#226L,SUM((l_extendedprice#194 * (1.0 - l_discount#195))) AS 
PartialSum#218,SUM(l_extendedprice#194) AS PartialSum#217,COUNT(l_quantity#193) 
AS PartialCount#221L,SUM(l_quantity#193) AS 
PartialSum#220,COUNT(l_extendedprice#194) AS 
PartialCount#223L,SUM(l_extendedprice#194) AS 
PartialSum#222,SUM(((l_extendedprice#194 * (1.0 - l_discount#195)) * (1.0 + 
l_tax#196))) AS PartialSum#219,SUM(l_quantity#193) AS PartialSum#216]
 Project 
[l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193]
  Filter (l_shipdate#197 <= 1998-09-02)
   HiveTableScan 
[l_shipdate#197,l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193],
 (MetastoreRelation boc_12, lineitem, None), None

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:191)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
at $iwC$$iwC$$iwC$$iwC.(:15)
at $iwC$$iwC$$iwC.(:20)
at $iwC$$iwC.(:22)
at $iwC.(:24)
at (:26)
at .(:30)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at 
org.apache.spark.repl.SparkILoop$$an
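
Regarding Q2 above, the same aggregation can also be expressed directly on an RDD 
with reduceByKey; a minimal sketch, assuming the LINEITEM rows have first been 
mapped into a small case class (the parsing step, the dummy rows and the field 
subset are hypothetical):

import org.apache.spark.SparkContext._

// Hypothetical row type covering only the columns the aggregation needs.
case class Line(returnFlag: String, lineStatus: String, qty: Double,
                price: Double, disc: Double, tax: Double, shipDate: String)

// In practice this would come from parsing the lineitem files; dummy rows here.
val lines = sc.parallelize(Seq(
  Line("A", "F", 17.0, 100.0, 0.05, 0.02, "1994-01-01"),
  Line("N", "O", 36.0, 200.0, 0.10, 0.06, "1998-01-01")))

val sums = lines
  .filter(_.shipDate <= "1998-09-02")
  .map(l => ((l.returnFlag, l.lineStatus),
             (l.qty, l.price, l.price * (1 - l.disc), l.price * (1 - l.disc) * (1 + l.tax), 1L)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3, a._4 + b._4, a._5 + b._5))

// ((returnflag, linestatus), (sum_qty, sum_base_price, sum_disc_price, sum_charge, count))
sums.collect().foreach(println)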

Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi,


My steps to create LINEITEM:
$HADOOP_HOME/bin/hadoop fs -mkdir /tpch/lineitem
$HADOOP_HOME/bin/hadoop fs -copyFromLocal lineitem.tbl /tpch/lineitem/

Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, 
L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, 
L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, 
L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE 
STRING, L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED 
AS TEXTFILE LOCATION '/tpch/lineitem';

Regards
Arthur


On 23 Oct, 2014, at 9:36 pm, Yin Huai  wrote:

> Hello Arthur,
> 
> You can do aggregations in SQL. How did you create LINEITEM?
> 
> Thanks,
> 
> Yin
> 
> On Thu, Oct 23, 2014 at 8:54 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> I got $TreeNodeException, few questions:
> Q1) How should I do aggregation in SparK? Can I use aggregation directly in 
> SQL? or
> Q1) Should I use SQL to load the data to form RDD then use scala to do the 
> aggregation?
> 
> Regards
> Arthur
> 
> 
> MySQL (good one, without aggregation): 
> sqlContext.sql("SELECT L_RETURNFLAG FROM LINEITEM WHERE  
> L_SHIPDATE<='1998-09-02'  GROUP  BY L_RETURNFLAG, L_LINESTATUS ORDER  BY 
> L_RETURNFLAG, L_LINESTATUS").collect().foreach(println);
> [A]
> [N]
> [N]
> [R]
> 
> 
> My SQL (problem SQL, with aggregation):
> sqlContext.sql("SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS 
> SUM_QTY, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - 
> L_DISCOUNT )) AS SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT ) * ( 
> 1 + L_TAX )) AS SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) 
> AS AVG_PRICE, AVG(L_DISCOUNT) AS AVG_DISC, COUNT(*) AS COUNT_ORDER  FROM 
> LINEITEM WHERE  L_SHIPDATE<='1998-09-02'  GROUP  BY L_RETURNFLAG, 
> L_LINESTATUS ORDER  BY L_RETURNFLAG, 
> L_LINESTATUS").collect().foreach(println);
> 
> 14/10/23 20:38:31 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks 
> have all completed, from pool 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
> Sort [l_returnflag#200 ASC,l_linestatus#201 ASC], true
>  Exchange (RangePartitioning [l_returnflag#200 ASC,l_linestatus#201 ASC], 200)
>   Aggregate false, [l_returnflag#200,l_linestatus#201], 
> [l_returnflag#200,l_linestatus#201,SUM(PartialSum#216) AS 
> sum_qty#181,SUM(PartialSum#217) AS sum_base_price#182,SUM(PartialSum#218) AS 
> sum_disc_price#183,SUM(PartialSum#219) AS 
> sum_charge#184,(CAST(SUM(PartialSum#220), DoubleType) / 
> CAST(SUM(PartialCount#221L), DoubleType)) AS 
> avg_qty#185,(CAST(SUM(PartialSum#222), DoubleType) / 
> CAST(SUM(PartialCount#223L), DoubleType)) AS 
> avg_price#186,(CAST(SUM(PartialSum#224), DoubleType) / 
> CAST(SUM(PartialCount#225L), DoubleType)) AS 
> avg_disc#187,SUM(PartialCount#226L) AS count_order#188L]
>Exchange (HashPartitioning [l_returnflag#200,l_linestatus#201], 200)
> Aggregate true, [l_returnflag#200,l_linestatus#201], 
> [l_returnflag#200,l_linestatus#201,COUNT(l_discount#195) AS 
> PartialCount#225L,SUM(l_discount#195) AS PartialSum#224,COUNT(1) AS 
> PartialCount#226L,SUM((l_extendedprice#194 * (1.0 - l_discount#195))) AS 
> PartialSum#218,SUM(l_extendedprice#194) AS 
> PartialSum#217,COUNT(l_quantity#193) AS PartialCount#221L,SUM(l_quantity#193) 
> AS PartialSum#220,COUNT(l_extendedprice#194) AS 
> PartialCount#223L,SUM(l_extendedprice#194) AS 
> PartialSum#222,SUM(((l_extendedprice#194 * (1.0 - l_discount#195)) * (1.0 + 
> l_tax#196))) AS PartialSum#219,SUM(l_quantity#193) AS PartialSum#216]
>  Project 
> [l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193]
>   Filter (l_shipdate#197 <= 1998-09-02)
>HiveTableScan 
> [l_shipdate#197,l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193],
>  (MetastoreRelation boc_12, lineitem, None), None
> 
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
>   at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:191)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
>   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
>   at $iwC$$iwC$$iwC$$iwC.(:15)
>   at $iwC$$iwC$$iwC.(:20)
>   at $iwC$$iwC.(:22)
>   at $iwC.(:24)
>   at (:26)
>   at .(:30)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMetho

Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated)


Hi,

My Spark is 1.1.0 and Hive is 0.12.0. I tried to run the same query in both 
Hive 0.12.0 and Spark 1.1.0: the HiveQL works while the Spark SQL fails. 

hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, 
o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment = 
'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on l.l_orderkey = 
o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' 
group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, 
o_orderdate limit 10;
Ended Job = job_1414067367860_0011
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 2.0 sec   HDFS Read: 261 HDFS Write: 
96 SUCCESS
Job 1: Map: 1  Reduce: 1   Cumulative CPU: 0.88 sec   HDFS Read: 458 HDFS 
Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 880 msec
OK
Time taken: 38.771 seconds


scala> sqlContext.sql("""select l_orderkey, sum(l_extendedprice*(1-l_discount)) 
as revenue, o_orderdate, o_shippriority from customer c join orders o on 
c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on 
l.l_orderkey = o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > 
'1995-03-15' group by l_orderkey, o_orderdate, o_shippriority order by revenue 
desc, o_orderdate limit 10""").collect().foreach(println);
org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in 
stage 5.0 failed 4 times, most recent failure: Lost task 14.3 in stage 5.0 (TID 
568, m34): java.lang.ClassCastException: java.lang.String cannot be cast to 
scala.math.BigDecimal
scala.math.Numeric$BigDecimalIsFractional$.minus(Numeric.scala:182)

org.apache.spark.sql.catalyst.expressions.Subtract$$anonfun$eval$3.apply(arithmetic.scala:64)

org.apache.spark.sql.catalyst.expressions.Subtract$$anonfun$eval$3.apply(arithmetic.scala:64)

org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:114)

org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:64)

org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)

org.apache.spark.sql.catalyst.expressions.Multiply.eval(arithmetic.scala:70)

org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:47)

org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)
org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:58)

org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:69)

org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:433)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 

Re: Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-24 Thread arthur.hk.c...@gmail.com
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 7.3 in stage 5.0 (TID 575, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 24.2 in stage 5.0 (TID 560, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 22.2 in stage 5.0 (TID 561, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 20.2 in stage 5.0 (TID 564, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 13.2 in stage 5.0 (TID 562, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 27.2 in stage 5.0 (TID 565, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 WARN TaskSetManager: Lost task 34.2 in stage 5.0 (TID 568, 
m137.emblocsoft.net): TaskKilled (killed intentionally)
14/10/25 06:50:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have 
all completed, from pool 
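
A possible workaround to try (assuming the SerDe really is reporting DECIMAL while handing 
back strings, as suggested in the reply quoted below) would be to check the schema Spark SQL 
sees and to cast the two price columns explicitly so the arithmetic avoids the decimal path. 
A rough sketch:

sqlContext.sql("describe lineitem").collect().foreach(println)

sqlContext.sql("""select l_orderkey,
  sum(cast(l_extendedprice as double) * (1 - cast(l_discount as double))) as revenue,
  o_orderdate, o_shippriority
from customer c join orders o on c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey
join lineitem l on l.l_orderkey = o.o_orderkey
where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 10""").collect().foreach(println)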

Regards
Arthur


On 24 Oct, 2014, at 6:56 am, Michael Armbrust  wrote:

> Can you show the DDL for the table?  It looks like the SerDe might be saying 
> it will produce a decimal type but is actually producing a string.
> 
> On Thu, Oct 23, 2014 at 3:17 PM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi
> 
> My Spark is 1.1.0 and my Hive is 0.12.0. I tried to run the same query in both 
> Hive 0.12.0 and Spark 1.1.0: the HiveQL run works, while the Spark SQL run fails. 
> 
> 
> hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, 
> o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment = 
> 'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on l.l_orderkey = 
> o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' 
> group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, 
> o_orderdate limit 10;
> Ended Job = job_1414067367860_0011
> MapReduce Jobs Launched: 
> Job 0: Map: 1  Reduce: 1   Cumulative CPU: 2.0 sec   HDFS Read: 261 HDFS 
> Write: 96 SUCCESS
> Job 1: Map: 1  Reduce: 1   Cumulative CPU: 0.88 sec   HDFS Read: 458 HDFS 
> Write: 0 SUCCESS
> Total MapReduce CPU Time Spent: 2 seconds 880 msec
> OK
> Time taken: 38.771 seconds
> 
> 
> scala> sqlContext.sql("""select l_orderkey, 
> sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority 
>

Re: Spark: Order by Failed, java.lang.NullPointerException

2014-10-24 Thread arthur.hk.c...@gmail.com
Hi, 

Added “l_linestatus” and now it works, THANK YOU!!

sqlContext.sql("select l_linestatus, l_orderkey, l_linenumber, l_partkey, 
l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by 
L_LINESTATUS limit 10").collect().foreach(println);
14/10/25 07:03:24 INFO DAGScheduler: Stage 12 (takeOrdered at 
basicOperators.scala:171) finished in 54.358 s
14/10/25 07:03:24 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks 
have all completed, from pool 
14/10/25 07:03:24 INFO SparkContext: Job finished: takeOrdered at 
basicOperators.scala:171, took 54.374629175 s
[F,71769288,2,-5859884,13,1993-12-13,R,F]
[F,71769319,4,-4098165,19,1992-10-12,R,F]
[F,71769288,3,2903707,44,1994-10-08,R,F]
[F,71769285,2,-741439,42,1994-04-22,R,F]
[F,71769313,5,-1276467,12,1992-08-15,R,F]
[F,71769314,7,-5595080,13,1992-03-28,A,F]
[F,71769316,1,-1766622,16,1993-12-05,R,F]
[F,71769287,2,-767340,50,1993-06-21,A,F]
[F,71769317,2,665847,15,1992-05-03,A,F]
[F,71769286,1,-5667701,15,1994-04-17,A,F]
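
For completeness, a quick way to confirm which columns the table actually exposes (in line 
with the suggestion quoted below) is to print the schema Spark SQL sees for the table, e.g.:

sqlContext.sql("select * from lineitem limit 1").printSchema()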


Regards 
Arthur




On 24 Oct, 2014, at 2:58 pm, Akhil Das  wrote:

> Not sure if this would help, but make sure you have the column 
> l_linestatus in the data.
> 
> Thanks
> Best Regards
> 
> On Thu, Oct 23, 2014 at 5:59 AM, arthur.hk.c...@gmail.com 
>  wrote:
> Hi,
> 
> I got java.lang.NullPointerException. Please help!
> 
> 
> sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, 
> l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit 
> 10").collect().foreach(println);
> 
> 2014-10-23 08:20:12,024 INFO  [sparkDriver-akka.actor.default-dispatcher-31] 
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 41 (runJob at 
> basicOperators.scala:136) finished in 0.086 s
> 2014-10-23 08:20:12,024 INFO  [Result resolver thread-1] 
> scheduler.TaskSchedulerImpl (Logging.scala:logInfo(59)) - Removed TaskSet 
> 41.0, whose tasks have all completed, from pool 
> 2014-10-23 08:20:12,024 INFO  [main] spark.SparkContext 
> (Logging.scala:logInfo(59)) - Job finished: runJob at 
> basicOperators.scala:136, took 0.090129332 s
> [9001,6,-4584121,17,1997-01-04,N,O]
> [9002,1,-2818574,23,1996-02-16,N,O]
> [9002,2,-2449102,21,1993-12-12,A,F]
> [9002,3,-5810699,26,1994-04-06,A,F]
> [9002,4,-489283,18,1994-11-11,R,F]
> [9002,5,2169683,15,1997-09-14,N,O]
> [9002,6,2405081,4,1992-08-03,R,F]
> [9002,7,3835341,40,1998-04-28,N,O]
> [9003,1,1900071,4,1994-05-05,R,F]
> [9004,1,-2614665,41,1993-06-13,A,F]
> 
> 
> If "order by L_LINESTATUS” is added then error:
> sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, 
> l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS 
> limit 10").collect().foreach(println);
> 
> 2014-10-23 08:22:08,524 INFO  [main] parse.ParseDriver 
> (ParseDriver.java:parse(179)) - Parsing command: select l_orderkey, 
> l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS 
> from lineitem order by L_LINESTATUS limit 10
> 2014-10-23 08:22:08,525 INFO  [main] parse.ParseDriver 
> (ParseDriver.java:parse(197)) - Parse Completed
> 2014-10-23 08:22:08,526 INFO  [main] metastore.HiveMetaStore 
> (HiveMetaStore.java:logInfo(454)) - 0: get_table : db=boc_12 tbl=lineitem
> 2014-10-23 08:22:08,526 INFO  [main] HiveMetaStore.audit 
> (HiveMetaStore.java:logAuditEvent(239)) - ugi=hd ip=unknown-ip-addr  
> cmd=get_table : db=boc_12 tbl=lineitem  
> java.lang.NullPointerException
>   at 
> org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
>   at 
> org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader.<init>(TableReader.scala:63)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:68)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
>   at 
> org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
>   at 
> org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   a

Re: Spark/HIVE Insert Into values Error

2014-10-25 Thread arthur.hk.c...@gmail.com
Hi,

I have already found a way to do “insert into HIVE_TABLE values (…..)”.
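
For example, one workaround that is known to work on Hive 0.12 (a sketch only, not 
necessarily the approach referred to above; the helper table dual and the file path are just 
placeholders) is to feed the literal values through a SELECT over a one-row table:

hive> CREATE TABLE IF NOT EXISTS dual (dummy STRING);
hive> LOAD DATA LOCAL INPATH '/tmp/one_line.txt' INTO TABLE dual;   -- any file containing a single line
hive> INSERT INTO TABLE students SELECT 'fred flintstone', 35, 1 FROM dual LIMIT 1;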

Regards
Arthur

On 18 Oct, 2014, at 10:09 pm, Cheng Lian  wrote:

> Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT INTO 
> ... VALUES ... syntax.
> 
> On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote:
>> Hi,
>> 
>> When trying to insert records into HIVE, I got error,
>> 
>> My Spark is 1.1.0 and Hive 0.12.0
>> 
>> Any idea what would be wrong?
>> Regards
>> Arthur
>> 
>> 
>> 
>> hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);  
>> OK
>> 
>> hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
>> NoViableAltException(26@[])
>>  at 
>> org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693)
>>  at 
>> org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374)
>>  at 
>> org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083)
>>  at 
>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
>>  at 
>> org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
>>  at 
>> org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
>>  at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
>>  at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
>>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
>>  at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
>>  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
>>  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
>>  at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
>>  at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
>>  at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
>>  at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
>>  at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
>>  at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:606)
>>  at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>> FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '(' 
>> ''fred flintstone'' in select clause
>> 
>> 
> 



Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread arthur.hk.c...@gmail.com
Hi,

My Hive is 0.13.1. How can I make Spark 1.1.0 run on Hive 0.13.1? Please advise.

Or is there any news on when Spark 1.1.0 with Hive 0.13.1 support will be available?

Regards
Arthur
 




Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread arthur.hk.c...@gmail.com
Hi,

Thanks for your update. Any idea when Spark 1.2 will be GA?

Regards
Arthur


On 29 Oct, 2014, at 8:22 pm, Cheng Lian  wrote:

> Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and 
> related PRs are already merged or being merged to the master branch.
> 
> On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote:
>> Hi,
>> 
>> My Hive is 0.13.1. How can I make Spark 1.1.0 run on Hive 0.13.1? Please advise.
>> 
>> Or is there any news on when Spark 1.1.0 with Hive 0.13.1 support will be available?
>> 
>> Regards
>> Arthur
>>  
>> 
> 





Re: OOM with groupBy + saveAsTextFile

2014-11-01 Thread arthur.hk.c...@gmail.com

Hi,

FYI as follows. Could you post your heap size settings as well as your Spark app 
code?

Regards
Arthur

3.1.3 Detail Message: Requested array size exceeds VM limit

“The detail message Requested array size exceeds VM limit indicates that the 
application (or APIs used by that application) attempted to allocate an array 
that is larger than the heap size. For example, if an application attempts to 
allocate an array of 512MB but the maximum heap size is 256MB then 
OutOfMemoryError will be thrown with the reason Requested array size exceeds VM 
limit. In most cases the problem is either a configuration issue (heap size too 
small), or a bug that results in an application attempting to create a huge 
array, for example, when the number of elements in the array are computed using 
an algorithm that computes an incorrect size.”
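
In addition, if the only goal of the groupBy is to write the records out keyed by group, one 
way to avoid materialising a huge Iterable per key (a rough sketch with placeholder paths and 
a toy key function; the real keyFn would be whatever function was passed to groupBy) is to 
shuffle with a partitioner and save directly:

import org.apache.spark.SparkContext._                     // pair-RDD functions (implicit in the shell)
import org.apache.spark.HashPartitioner

val records = sc.textFile("hdfs:///path/to/input")         // placeholder input path
def keyFn(line: String): Int = line.length % 1061          // placeholder for the real grouping function
val keyed = records.map(line => (keyFn(line), line))       // key each record instead of grouping
val byKey = keyed.partitionBy(new HashPartitioner(1061))   // shuffle by key, no per-key Iterable is built
byKey.values.saveAsTextFile("hdfs:///path/to/output")      // placeholder output path

Each output partition may still hold more than one key, but no single key's values ever have 
to fit in memory as one collection.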




On 2 Nov, 2014, at 12:25 pm, Bharath Ravi Kumar  wrote:

> Resurfacing the thread. Surely OOM shouldn't be the norm for a common groupBy / sort 
> use case in a framework that is leading in sorting benchmarks? Or is there 
> something fundamentally wrong in the usage?
> 
> On 02-Nov-2014 1:06 am, "Bharath Ravi Kumar"  wrote:
> Hi,
> 
> I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of 
> count ~ 100 million. The data size is 20GB and groupBy results in an RDD of 
> 1061 keys with values being Iterable<…>. The job runs on 3 hosts in a standalone setup with each host's 
> executor having 100G RAM and 24 cores dedicated to it. While the groupBy 
> stage completes successfully with ~24GB of shuffle write, the saveAsTextFile 
> fails after repeated retries with each attempt failing due to an out of 
> memory error [1]. I understand that a few partitions may be overloaded as a 
> result of the groupBy and I've tried the following config combinations 
> unsuccessfully:
> 
> 1) Repartition the initial rdd (44 input partitions but 1061 keys) across 
> 1061 partitions and have max cores = 3 so that each key is a "logical" 
> partition (though many partitions will end up on very few hosts), and each 
> host likely runs saveAsTextFile on a single key at a time due to max cores = 
> 3 with 3 hosts in the cluster. The level of parallelism is unspecified.
> 
> 2) Leave max cores unspecified, set the level of parallelism to 72, and leave 
> number of partitions unspecified (in which case the # input partitions was 
> used, which is 44)
> Since I do not intend to cache RDD's, I have set 
> spark.storage.memoryFraction=0.2 in both cases.
> 
> My understanding is that if each host is processing a single logical 
> partition to saveAsTextFile and is reading from other hosts to write out the 
> RDD, it is unlikely that it would run out of memory. My interpretation of the 
> spark tuning guide is that the degree of parallelism has little impact in 
> case (1) above since max cores = number of hosts. Can someone explain why 
> there are still OOM's with 100G being available? On a related note, 
> intuitively (though I haven't read the source), it appears that an entire 
> key-value pair needn't fit into memory of a single host for saveAsTextFile 
> since a single shuffle read from a remote can be written to HDFS before the 
> next remote read is carried out. This way, not all data needs to be collected 
> at the same time. 
> 
> Lastly, if an OOM is (but shouldn't be) a common occurrence (as per the 
> tuning guide and even as per Datastax's spark introduction), there may need 
> to be more documentation around the internals of spark to help users take 
> better informed tuning decisions with parallelism, max cores, number 
> partitions and other tunables. Is there any ongoing effort on that front?
> 
> Thanks,
> Bharath
> 
> 
> [1] OOM stack trace and logs
> 14/11/01 12:26:37 WARN TaskSetManager: Lost task 61.0 in stage 36.0 (TID 
> 1264, proc1.foo.bar.com): java.lang.OutOfMemoryError: Requested array size 
> exceeds VM limit
> java.util.Arrays.copyOf(Arrays.java:3326)
> 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> java.lang.StringBuilder.append(StringBuilder.java:136)
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
> scala.Tuple2.toString(Tuple2.scala:22)
> 
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
> 
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.