Compilation Error: Spark 1.0.2 with HBase 0.98
Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98, My steps: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 edit project/SparkBuild.scala, set HBASE_VERSION // HBase version; set as appropriate. val HBASE_VERSION = "0.98.2" edit pom.xml with following values 2.4.1 2.5.0 ${hadoop.version} 0.98.5 3.4.6 0.13.1 SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2" Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I set HBASE_VERSION back to “0.94.6"? Regards Arthur [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] :: [warn] :: org.apache.hbase#hbase;0.98.2: not found [warn] :: sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not found at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217) at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126) at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104) at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51) at sbt.IvySbt$$anon$3.call(Ivy.scala:60) at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98) at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81) at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102) at xsbt.boot.Using$.withResource(Using.scala:11) at xsbt.boot.Using$.apply(Using.scala:10) at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62) at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52) at xsbt.boot.Locks$.apply0(Locks.scala:31) at xsbt.boot.Locks$.apply(Locks.scala:28) at sbt.IvySbt.withDefaultLogger(Ivy.scala:60) at sbt.IvySbt.withIvy(Ivy.scala:101) at sbt.IvySbt.withIvy(Ivy.scala:97) at sbt.IvySbt$Module.withModule(Ivy.scala:116) at sbt.IvyActions$.update(IvyActions.scala:125) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189) at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188) at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45) at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139) at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42) at sbt.std.Transform$$anon$4.work(System.scala:64) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18) at sbt.Execute.work(Execute.scala:244) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160) at sbt.CompletionService$$anon$2.call(CompletionService.scala:30) at 
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) [error] (examples/*:update) sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not found [error] Total time: 270 s, completed Aug 28, 2014 9:42:05 AM
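Note: the bare values quoted above (2.4.1, 2.5.0, ${hadoop.version}, 0.98.5, 3.4.6, 0.13.1) lost their XML tags in the archive. Matched against the <properties> section of the Spark 1.0.2 root pom.xml, the edit presumably looked roughly like the sketch below; the property names are inferred from the values, so treat this as an approximation rather than the poster's exact change.

    <properties>
      <hadoop.version>2.4.1</hadoop.version>
      <protobuf.version>2.5.0</protobuf.version>
      <yarn.version>${hadoop.version}</yarn.version>
      <hbase.version>0.98.5</hbase.version>
      <zookeeper.version>3.4.6</zookeeper.version>
      <hive.version>0.13.1</hive.version>
      <!-- other properties unchanged -->
    </properties>

The sbt build reads project/SparkBuild.scala rather than pom.xml, which is why the unresolved artifact is hbase;0.98.2: HBase was split into hbase-client, hbase-common, hbase-server and friends as of 0.96, and 0.98 artifacts are versioned as 0.98.x-hadoop1 / 0.98.x-hadoop2, so the single org.apache.hbase:hbase jar that SparkBuild.scala requests no longer exists for 0.98.x. That is what SPARK-1297, referenced in the replies below, addresses.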
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur On 28 Aug, 2014, at 10:13 am, Ted Yu wrote: > See SPARK-1297 > > The pull request is here: > https://github.com/apache/spark/pull/1893 > > > On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com > wrote: > (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98” , please > ignore if duplicated) > > > Hi, > > I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with > HBase 0.98, > > My steps: > wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz > tar -vxf spark-1.0.2.tgz > cd spark-1.0.2 > > edit project/SparkBuild.scala, set HBASE_VERSION > // HBase version; set as appropriate. > val HBASE_VERSION = "0.98.2" > > > edit pom.xml with following values > 2.4.1 > 2.5.0 > ${hadoop.version} > 0.98.5 > 3.4.6 > 0.13.1 > > > SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly > but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2" > > Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I > set HBASE_VERSION back to “0.94.6"? > > Regards > Arthur > > > > > [warn] :: > [warn] :: UNRESOLVED DEPENDENCIES :: > [warn] :: > [warn] :: org.apache.hbase#hbase;0.98.2: not found > [warn] :: > > sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: > not found > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217) > at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126) > at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125) > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104) > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51) > at sbt.IvySbt$$anon$3.call(Ivy.scala:60) > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98) > at > xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81) > at > xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102) > at xsbt.boot.Using$.withResource(Using.scala:11) > at xsbt.boot.Using$.apply(Using.scala:10) > at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62) > at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52) > at xsbt.boot.Locks$.apply0(Locks.scala:31) > at xsbt.boot.Locks$.apply(Locks.scala:28) > at sbt.IvySbt.withDefaultLogger(Ivy.scala:60) > at sbt.IvySbt.withIvy(Ivy.scala:101) > at sbt.IvySbt.withIvy(Ivy.scala:97) > at sbt.IvySbt$Module.withModule(Ivy.scala:116) > at sbt.IvyActions$.update(IvyActions.scala:125) > at > sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170) > at > sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168) > at > sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191) > at > sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189) > at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35) > at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193) > at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188) > at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45) > at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196) > at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161) > at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139) > at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) > at 
sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42) > at sbt.std.Transform$$anon$4.work(System.scala:64) > at > sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) > at > sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) > at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18) > at sbt.Execute.work(Execute.scala:244) > at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) > at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) > at > sbt.ConcurrentRestriction
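Note: for an extracted release tarball, applying a GitHub pull-request patch usually looks like the sketch below. The -p1 strips the a/ and b/ prefixes that git puts on file paths inside the patch, and a --dry-run first shows whether the hunks apply cleanly (against the 1.0.2 tarball some of them do not, as the following messages show).

    cd spark-1.0.2
    wget https://github.com/apache/spark/pull/1893.patch
    patch -p1 --dry-run -i 1893.patch   # preview only
    patch -p1 -i 1893.patch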
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 wget https://github.com/apache/spark/pull/1893.patch patch < 1893.patch patching file pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej patching file pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 FAILED at 171. 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej can't find file to patch at input line 267 Perhaps you should have used the -p or --strip option? The text leading up to this was: -- | |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001 |From: tedyu |Date: Mon, 11 Aug 2014 15:57:46 -0700 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add | description to building-with-maven.md | |--- | docs/building-with-maven.md | 3 +++ | 1 file changed, 3 insertions(+) | |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- a/docs/building-with-maven.md |+++ b/docs/building-with-maven.md -- File to patch: On 28 Aug, 2014, at 10:24 am, Ted Yu wrote: > You can get the patch from this URL: > https://github.com/apache/spark/pull/1893.patch > > BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml > > Cheers > > > On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com > wrote: > Hi Ted, > > Thank you so much!! > > As I am new to Spark, can you please advise the steps about how to apply this > patch to my spark-1.0.2 source folder? > > Regards > Arthur > > > On 28 Aug, 2014, at 10:13 am, Ted Yu wrote: > >> See SPARK-1297 >> >> The pull request is here: >> https://github.com/apache/spark/pull/1893 >> >> >> On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com >> wrote: >> (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98” , please >> ignore if duplicated) >> >> >> Hi, >> >> I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with >> HBase 0.98, >> >> My steps: >> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz >> tar -vxf spark-1.0.2.tgz >> cd spark-1.0.2 >> >> edit project/SparkBuild.scala, set HBASE_VERSION >> // HBase version; set as appropriate. >> val HBASE_VERSION = "0.98.2" >> >> >> edit pom.xml with following values >> 2.4.1 >> 2.5.0 >> ${hadoop.version} >> 0.98.5 >> 3.4.6 >> 0.13.1 >> >> >> SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly >> but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2" >> >> Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should >> I set HBASE_VERSION back to “0.94.6"? 
>> >> Regards >> Arthur >> >> >> >> >> [warn] :: >> [warn] :: UNRESOLVED DEPENDENCIES :: >> [warn] :: >> [warn] :: org.apache.hbase#hbase;0.98.2: not found >> [warn] :: >> >> sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: >> not found >> at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217) >> at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126) >> at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125) >> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) >> at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) >> at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104) >> at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51) >> at sbt.IvySbt$$anon$3.call(Ivy.scala:60) >> at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98) >> at >> xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81) >> at >> xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102) >> at xsbt.boot.Using$.withResource(Using.scala:11) >> at xsbt.boot.Using$.apply(Using.scala:10) >> at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62) >> at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:5
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file examples/pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 succeeded at 122 (offset -49 lines). 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 succeeded at 122 (offset -40 lines). Hunk #2 succeeded at 195 (offset -40 lines). On 28 Aug, 2014, at 10:53 am, Ted Yu wrote: > Can you use this command ? > > patch -p1 -i 1893.patch > > Cheers > > > On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com > wrote: > Hi Ted, > > I tried the following steps to apply the patch 1893 but got Hunk FAILED, can > you please advise how to get thru this error? or is my spark-1.0.2 source not > the correct one? > > Regards > Arthur > > wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz > tar -vxf spark-1.0.2.tgz > cd spark-1.0.2 > wget https://github.com/apache/spark/pull/1893.patch > patch < 1893.patch > patching file pom.xml > Hunk #1 FAILED at 45. > Hunk #2 FAILED at 110. > 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej > patching file pom.xml > Hunk #1 FAILED at 54. > Hunk #2 FAILED at 72. > Hunk #3 FAILED at 171. > 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej > can't find file to patch at input line 267 > Perhaps you should have used the -p or --strip option? > The text leading up to this was: > -- > | > |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001 > |From: tedyu > |Date: Mon, 11 Aug 2014 15:57:46 -0700 > |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add > | description to building-with-maven.md > | > |--- > | docs/building-with-maven.md | 3 +++ > | 1 file changed, 3 insertions(+) > | > |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md > |index 672d0ef..f8bcd2b 100644 > |--- a/docs/building-with-maven.md > |+++ b/docs/building-with-maven.md > -- > File to patch: > > > > On 28 Aug, 2014, at 10:24 am, Ted Yu wrote: > >> You can get the patch from this URL: >> https://github.com/apache/spark/pull/1893.patch >> >> BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml >> >> Cheers >> >> >> On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com >> wrote: >> Hi Ted, >> >> Thank you so much!! >> >> As I am new to Spark, can you please advise the steps about how to apply >> this patch to my spark-1.0.2 source folder? >> >> Regards >> Arthur >> >> >> On 28 Aug, 2014, at 10:13 am, Ted Yu wrote: >> >>> See SPARK-1297 >>> >>> The pull request is here: >>> https://github.com/apache/spark/pull/1893 >>> >>> >>> On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com >>> wrote: >>> (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98” , please >>> ignore if duplicated) >>> >>> >>> Hi, >>> >>> I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with >>> HBase 0.98, >>> >>> My steps: >>> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz >>> tar -vxf spark-1.0.2.tgz >>> cd spark-1.0.2 >>> >>> edit project/SparkBuild.scala, set HBASE_VERSION >>> // HBase version; set as appropriate. 
>>> val HBASE_VERSION = "0.98.2" >>> >>> >>> edit pom.xml with following values >>> 2.4.1 >>> 2.5.0 >>> ${hadoop.version} >>> 0.98.5 >>> 3.4.6 >>> 0.13.1 >>> >>> >>> SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly >>> but it fails because of UNRESOLVED DEPENDENCIES "hbase;0.98.2" >>> >>> Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should >>> I set HBASE_VERSION back to “0.94.6"? >>> >>> Regards >>> Arthur >>> >>> >>> >>> >>> [warn] :: >>> [warn] :: UNRESOLVED DEPENDENCIES :: >>> [warn]
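Note: the hunks that patch could not place are saved next to the target file, so the leftover work can be reviewed and merged by hand:

    cat examples/pom.xml.rej   # the <profile> blocks that still have to be added manually

Alternatively, as Ted suggests below, examples/pom.xml can simply be replaced with the version attached to SPARK-1297.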
Compilation FAILURE: Spark 1.0.2 / Project Hive (0.13.1)
Hi, I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1. I just tried to compile Spark 1.0.2, but got error on "Spark Project Hive", can you please advise which repository has "org.spark-project.hive:hive-metastore:jar:0.13.1"? FYI, below is my repository setting in maven which would be old: nexus local private nexus http://maven.oschina.net/content/groups/public/ true false export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -DskipTests clean package [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [1.892s] [INFO] Spark Project Core SUCCESS [1:33.698s] [INFO] Spark Project Bagel ... SUCCESS [12.270s] [INFO] Spark Project GraphX .. SUCCESS [2:16.343s] [INFO] Spark Project ML Library .. SUCCESS [4:18.495s] [INFO] Spark Project Streaming ... SUCCESS [39.765s] [INFO] Spark Project Tools ... SUCCESS [9.173s] [INFO] Spark Project Catalyst SUCCESS [35.462s] [INFO] Spark Project SQL . SUCCESS [1:16.118s] [INFO] Spark Project Hive FAILURE [1:36.816s] [INFO] Spark Project REPL SKIPPED [INFO] Spark Project YARN Parent POM . SKIPPED [INFO] Spark Project YARN Stable API . SKIPPED [INFO] Spark Project Assembly SKIPPED [INFO] Spark Project External Twitter SKIPPED [INFO] Spark Project External Kafka .. SKIPPED [INFO] Spark Project External Flume .. SKIPPED [INFO] Spark Project External ZeroMQ . SKIPPED [INFO] Spark Project External MQTT ... SKIPPED [INFO] Spark Project Examples SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 12:40.616s [INFO] Finished at: Thu Aug 28 11:41:07 HKT 2014 [INFO] Final Memory: 37M/826M [INFO] [ERROR] Failed to execute goal on project spark-hive_2.10: Could not resolve dependencies for project org.apache.spark:spark-hive_2.10:jar:1.0.2: The following artifacts could not be resolved: org.spark-project.hive:hive-metastore:jar:0.13.1, org.spark-project.hive:hive-exec:jar:0.13.1, org.spark-project.hive:hive-serde:jar:0.13.1: Could not find artifact org.spark-project.hive:hive-metastore:jar:0.13.1 in nexus-osc (http://maven.oschina.net/content/groups/public/) -> [Help 1] Regards Arthur
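Note: the three missing artifacts are a consequence of the earlier pom.xml edit that set hive.version to 0.13.1. Spark 1.0.2's Hive module depends on the forked org.spark-project.hive artifacts, which at that time were published only for Hive 0.12.0, so no 0.13.1 jar exists in any public repository and pointing at a different mirror is unlikely to help. Reverting the property in the root pom.xml, roughly as below, should let the Hive module resolve again; the Hive version running on the cluster is a separate matter from the version Spark compiles against.

    <hive.version>0.12.0</hive.version>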
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
.2:compile [INFO] | \- javax.activation:activation:jar:1.1:compile [INFO] +- org.apache.hbase:hbase-hadoop-compat:jar:0.98.5-hadoop1:compile [INFO] | \- org.apache.commons:commons-math:jar:2.1:compile [INFO] +- org.apache.hbase:hbase-hadoop-compat:test-jar:tests:0.98.5-hadoop1:test [INFO] +- com.twitter:algebird-core_2.10:jar:0.1.11:compile [INFO] | \- com.googlecode.javaewah:JavaEWAH:jar:0.6.6:compile [INFO] +- org.scalatest:scalatest_2.10:jar:2.1.5:test [INFO] | \- org.scala-lang:scala-reflect:jar:2.10.4:compile [INFO] +- org.scalacheck:scalacheck_2.10:jar:1.11.3:test [INFO] | \- org.scala-sbt:test-interface:jar:1.0:test [INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile [INFO] | +- net.jpountz.lz4:lz4:jar:1.1.0:compile [INFO] | +- org.antlr:antlr:jar:3.2:compile [INFO] | +- com.googlecode.json-simple:json-simple:jar:1.1:compile [INFO] | +- org.yaml:snakeyaml:jar:1.6:compile [INFO] | +- edu.stanford.ppl:snaptree:jar:0.1:compile [INFO] | +- org.mindrot:jbcrypt:jar:0.3m:compile [INFO] | +- org.apache.thrift:libthrift:jar:0.7.0:compile [INFO] | | \- javax.servlet:servlet-api:jar:2.5:compile [INFO] | +- org.apache.cassandra:cassandra-thrift:jar:1.2.6:compile [INFO] | \- com.github.stephenc:jamm:jar:0.2.5:compile [INFO] \- com.github.scopt:scopt_2.10:jar:3.2.0:compile [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 2.151s [INFO] Finished at: Thu Aug 28 18:12:45 HKT 2014 [INFO] Final Memory: 17M/479M [INFO] RegardsArthurOn 28 Aug, 2014, at 12:22 pm, Ted Yu <yuzhih...@gmail.com> wrote:I forgot to include '-Dhadoop.version=2.4.1' in the command below.The modified command passed.You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txtCheersOn Wed, Aug 27, 2014 at 8:58 PM, Ted Yu <yuzhih...@gmail.com> wrote: Looks like the patch given by that URL only had the last commit.I have attached pom.xml for spark-1.0.2 to SPARK-1297 You can download it and replace examples/pom.xml with the downloaded pom I am running this command locally:mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean packageCheers On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com <arthur.hk.c...@gmail.com> wrote: Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch] (Hunk #1 FAILED at 45.) Is this normal?RegardsArthurpatch -p1 -i 1893.patch patching file examples/pom.xmlHunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines).1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file examples/pom.xmlHunk #1 FAILED at 54.Hunk #2 FAILED at 72. Hunk #3 succeeded at 122 (offset -49 lines).2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file docs/building-with-maven.md patching file examples/pom.xmlHunk #1 succeeded at 122 (offset -40 lines).Hunk #2 succeeded at 195 (offset -40 lines). On 28 Aug, 2014, at 10:53 am, Ted Yu <yuzhih...@gmail.com> wrote: Can you use this command ?patch -p1 -i 1893.patchCheersOn Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com <arthur.hk.c...@gmail.com> wrote: Hi Ted,I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? RegardsArthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgzcd spark-1.0.2 wget https://github.com/apache/spark/pull/1893.patch patch < 1893.patchpatching file pom.xml Hunk #1 FAILED at 45.Hunk #2 FAILED at 110. 
2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rejpatching file pom.xml Hunk #1 FAILED at 54.Hunk #2 FAILED at 72. Hunk #3 FAILED at 171.3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej can't find file to patch at input line 267Perhaps you should have used the -p or --strip option? The text leading up to this was:-- ||From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001 |From: tedyu <yuzhih...@gmail.com>|Date: Mon, 11 Aug 2014 15:57:46 -0700 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add| description to building-with-maven.md ||---| docs/building-with-maven.md | 3 +++ | 1 file changed, 3 insertions(+)| |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644|--- a/docs/building-with-maven.md |+++ b/docs/building-with-maven.md-- File to patch: On 28 Aug, 2014, at 10:24 am, Ted Yu <yuzhih...@gmail.com> wrote: You can get the patch from this URL:https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2
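Note: the dep.txt being passed around here is just the saved output of Maven's dependency report; the relevant lines can be pulled out directly, e.g.:

    mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt
    grep 'org.apache.hbase' dep.txt   # should show the 0.98.x-hadoop2 artifacts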
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi, I tried to start Spark but failed: $ ./sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out failed to launch org.apache.spark.deploy.master.Master: Failed to find Spark assembly in /mnt/hadoop/spark-1.0.2/assembly/target/scala-2.10/ $ ll assembly/ total 20 -rw-rw-r--. 1 hduser hadoop 11795 Jul 26 05:50 pom.xml -rw-rw-r--. 1 hduser hadoop 507 Jul 26 05:50 README drwxrwxr-x. 4 hduser hadoop 4096 Jul 26 05:50 src Regards Arthur On 28 Aug, 2014, at 6:19 pm, Ted Yu wrote: > I see 0.98.5 in dep.txt > > You should be good to go. > > > On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com > wrote: > Hi, > > tried > mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests > dependency:tree > dep.txt > > Attached the dep. txt for your information. > > > Regards > Arthur > > On 28 Aug, 2014, at 12:22 pm, Ted Yu wrote: > >> I forgot to include '-Dhadoop.version=2.4.1' in the command below. >> >> The modified command passed. >> >> You can verify the dependence on hbase 0.98 through this command: >> >> mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests >> dependency:tree > dep.txt >> >> Cheers >> >> >> On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu wrote: >> Looks like the patch given by that URL only had the last commit. >> >> I have attached pom.xml for spark-1.0.2 to SPARK-1297 >> You can download it and replace examples/pom.xml with the downloaded pom >> >> I am running this command locally: >> >> mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package >> >> Cheers >> >> >> On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com >> wrote: >> Hi Ted, >> >> Thanks. >> >> Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) >> Is this normal? >> >> Regards >> Arthur >> >> >> patch -p1 -i 1893.patch >> patching file examples/pom.xml >> Hunk #1 FAILED at 45. >> Hunk #2 succeeded at 94 (offset -16 lines). >> 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej >> patching file examples/pom.xml >> Hunk #1 FAILED at 54. >> Hunk #2 FAILED at 72. >> Hunk #3 succeeded at 122 (offset -49 lines). >> 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej >> patching file docs/building-with-maven.md >> patching file examples/pom.xml >> Hunk #1 succeeded at 122 (offset -40 lines). >> Hunk #2 succeeded at 195 (offset -40 lines). >> >> >> On 28 Aug, 2014, at 10:53 am, Ted Yu wrote: >> >>> Can you use this command ? >>> >>> patch -p1 -i 1893.patch >>> >>> Cheers >>> >>> >>> On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com >>> wrote: >>> Hi Ted, >>> >>> I tried the following steps to apply the patch 1893 but got Hunk FAILED, >>> can you please advise how to get thru this error? or is my spark-1.0.2 >>> source not the correct one? >>> >>> Regards >>> Arthur >>> >>> wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz >>> tar -vxf spark-1.0.2.tgz >>> cd spark-1.0.2 >>> wget https://github.com/apache/spark/pull/1893.patch >>> patch < 1893.patch >>> patching file pom.xml >>> Hunk #1 FAILED at 45. >>> Hunk #2 FAILED at 110. >>> 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej >>> patching file pom.xml >>> Hunk #1 FAILED at 54. >>> Hunk #2 FAILED at 72. >>> Hunk #3 FAILED at 171. >>> 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej >>> can't find file to patch at input line 267 >>> Perhaps you should have used the -p or --strip option? 
>>> The text leading up to this was: >>> -- >>> | >>> |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001 >>> |From: tedyu >>> |Date: Mon, 11 Aug 2014 15:57:46 -0700 >>> |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add >>> | description to building-with-maven.md >>> | >>> |--- >>> | docs/building-with-maven.md | 3 +++ >>> | 1 file changed, 3 insertions(+) >>> | >>> |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md >>> |index 672d0ef..f8bcd2b 100644 >>
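Note: the launcher scripts look for a built assembly jar under assembly/target/scala-2.10/, and an assembly/ directory containing only pom.xml, README and src means no successful build has produced one yet. After the Maven (or sbt assembly) build completes, something like the following should exist before start-all.sh will work; the exact jar name depends on the Hadoop version the build was run with.

    ls assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar
    ./sbin/start-all.sh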
SPARK-1297 patch error (spark-1297-v4.txt)
Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got "Hunk #1 FAILED at 45" Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ + + hbase-hadoop2 + + + hbase.profile + hadoop2 + + + +2.5.0 +0.98.4-hadoop2 + + + + + + + + hbase-hadoop1 + + + !hbase.profile + + + +0.98.4-hadoop1 + + + + + + + This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
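Note: the pasted examples/pom.xml.rej lost its XML markup in the archive; judging from the ids and versions that survive, the rejected hunk adds two HBase profiles to examples/pom.xml roughly as follows (a reconstruction, not the literal patch text):

    <profile>
      <id>hbase-hadoop2</id>
      <activation>
        <property>
          <name>hbase.profile</name>
          <value>hadoop2</value>
        </property>
      </activation>
      <properties>
        <protobuf.version>2.5.0</protobuf.version>
        <hbase.version>0.98.4-hadoop2</hbase.version>
      </properties>
    </profile>
    <profile>
      <id>hbase-hadoop1</id>
      <activation>
        <property>
          <name>!hbase.profile</name>
        </property>
      </activation>
      <properties>
        <hbase.version>0.98.4-hadoop1</hbase.version>
      </properties>
    </profile>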
Re: SPARK-1297 patch error (spark-1297-v4.txt)
Hi, patch -p1 -i spark-1297-v5.txt can't find file to patch at input line 5 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -- |diff --git docs/building-with-maven.md docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- docs/building-with-maven.md |+++ docs/building-with-maven.md -- File to patch: Please advise Regards Arthur On 29 Aug, 2014, at 12:50 am, arthur.hk.c...@gmail.com wrote: > Hi, > > I have just tried to apply the patch of SPARK-1297: > https://issues.apache.org/jira/browse/SPARK-1297 > > There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt > respectively. > > When applying the 2nd one, I got "Hunk #1 FAILED at 45" > > Can you please advise how to fix it in order to make the compilation of Spark > Project Examples success? > (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) > > Regards > Arthur > > > > patch -p1 -i spark-1297-v4.txt > patching file examples/pom.xml > Hunk #1 FAILED at 45. > Hunk #2 succeeded at 94 (offset -16 lines). > 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej > > below is the content of examples/pom.xml.rej: > +++ examples/pom.xml > @@ -45,6 +45,39 @@ > > > > + > + hbase-hadoop2 > + > + > + hbase.profile > + hadoop2 > + > + > + > +2.5.0 > +0.98.4-hadoop2 > + > + > + > + > + > + > + > + hbase-hadoop1 > + > + > + !hbase.profile > + > + > + > +0.98.4-hadoop1 > + > + > + > + > + > + > + > > > > > > This caused the related compilation failed: > [INFO] Spark Project Examples FAILURE [0.102s] > >
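Note: the diff header in this version of the patch ("diff --git docs/building-with-maven.md docs/building-with-maven.md") carries no a/ and b/ prefixes, so the paths are already at strip level 0; as Ted points out in the next message, it should be applied without -p1:

    patch -p0 -i spark-1297-v5.txt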
Re: SPARK-1297 patch error (spark-1297-v4.txt)
Hi Ted, I downloaded pom.xml to examples directory. It works, thanks!! Regards Arthur [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.119s] [INFO] Spark Project Core SUCCESS [1:27.100s] [INFO] Spark Project Bagel ... SUCCESS [10.261s] [INFO] Spark Project GraphX .. SUCCESS [31.332s] [INFO] Spark Project ML Library .. SUCCESS [35.226s] [INFO] Spark Project Streaming ... SUCCESS [39.135s] [INFO] Spark Project Tools ... SUCCESS [6.469s] [INFO] Spark Project Catalyst SUCCESS [36.521s] [INFO] Spark Project SQL . SUCCESS [35.488s] [INFO] Spark Project Hive SUCCESS [35.296s] [INFO] Spark Project REPL SUCCESS [18.668s] [INFO] Spark Project YARN Parent POM . SUCCESS [0.583s] [INFO] Spark Project YARN Stable API . SUCCESS [15.989s] [INFO] Spark Project Assembly SUCCESS [11.497s] [INFO] Spark Project External Twitter SUCCESS [8.777s] [INFO] Spark Project External Kafka .. SUCCESS [9.688s] [INFO] Spark Project External Flume .. SUCCESS [10.411s] [INFO] Spark Project External ZeroMQ . SUCCESS [9.511s] [INFO] Spark Project External MQTT ... SUCCESS [8.451s] [INFO] Spark Project Examples SUCCESS [1:40.240s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 8:33.350s [INFO] Finished at: Fri Aug 29 01:58:00 HKT 2014 [INFO] Final Memory: 82M/1086M [INFO] On 29 Aug, 2014, at 1:36 am, Ted Yu wrote: > bq. Spark 1.0.2 > > For the above release, you can download pom.xml attached to the JIRA and > place it in examples directory > > I verified that the build against 0.98.4 worked using this command: > > mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 > -DskipTests clean package > > Patch v5 is @ level 0 - you don't need to use -p1 in the patch command. > > Cheers > > > On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com > wrote: > Hi, > > I have just tried to apply the patch of SPARK-1297: > https://issues.apache.org/jira/browse/SPARK-1297 > > There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt > respectively. > > When applying the 2nd one, I got "Hunk #1 FAILED at 45" > > Can you please advise how to fix it in order to make the compilation of Spark > Project Examples success? > (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) > > Regards > Arthur > > > > patch -p1 -i spark-1297-v4.txt > patching file examples/pom.xml > Hunk #1 FAILED at 45. > Hunk #2 succeeded at 94 (offset -16 lines). > 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej > > below is the content of examples/pom.xml.rej: > +++ examples/pom.xml > @@ -45,6 +45,39 @@ > > > > + > + hbase-hadoop2 > + > + > + hbase.profile > + hadoop2 > + > + > + > +2.5.0 > +0.98.4-hadoop2 > + > + > + > + > + > + > + > + hbase-hadoop1 > + > + > + !hbase.profile > + > + > + > +0.98.4-hadoop1 > + > + > + > + > + > + > + > > > > > > This caused the related compilation failed: > [INFO] Spark Project Examples FAILURE [0.102s] > > >
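Note: a quick way to double-check that the successful build really pulled HBase 0.98 into the examples module is to list the classes inside the examples assembly jar (the jar name and location below follow the usual Maven layout and may differ slightly):

    EXAMPLES_JAR=$(find examples/target -name 'spark-examples-*.jar' | head -1)
    jar tf "$EXAMPLES_JAR" | grep -c 'org/apache/hadoop/hbase/'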
org.apache.hadoop.io.compress.SnappyCodec not found
Hi, I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and HBase. With default setting in Spark 1.0.2, when trying to load a file I got "Class org.apache.hadoop.io.compress.SnappyCodec not found" Can you please advise how to enable snappy in Spark? Regards Arthur scala> inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.(:15) at $iwC$$iwC$$iwC.(:20) at $iwC$$iwC.(:22) at $iwC.(:24) at (:26) at .(:30) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at 
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106) ... 55 more Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.SnappyCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135) at org.apache.hadoop.io.compress.CompressionCodecFactory.(CompressionCodecFactory.java:175) at org.apache.hadoop.mapred.TextInpu
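Note: since the exception is thrown while HadoopRDD configures its TextInputFormat, a first step is to check what the Spark driver JVM itself can see, straight from the same spark-shell session (a diagnostic sketch, not a fix):

    scala> Class.forName("org.apache.hadoop.io.compress.SnappyCodec")   // is the codec class on the driver classpath at all?
    scala> System.getProperty("java.library.path")                      // where would libsnappy.so be loaded from?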
Re: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, my check native result: hadoop checknative 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library Native library checking: hadoop: true /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libhadoop.so zlib: true /lib64/libz.so.1 snappy: true /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so.1 lz4:true revision:99 bzip2: false Any idea how to enable or disable snappy in Spark? Regards Arthur On 29 Aug, 2014, at 2:39 am, arthur.hk.c...@gmail.com wrote: > Hi, > > I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and > HBase. > With default setting in Spark 1.0.2, when trying to load a file I got "Class > org.apache.hadoop.io.compress.SnappyCodec not found" > > Can you please advise how to enable snappy in Spark? > > Regards > Arthur > > > scala> inFILE.first() > java.lang.RuntimeException: Error in configuring object > at > org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.RDD.take(RDD.scala:983) > at org.apache.spark.rdd.RDD.first(RDD.scala:1015) > at $iwC$$iwC$$iwC$$iwC.(:15) > at $iwC$$iwC$$iwC.(:20) > at $iwC$$iwC.(:22) > at $iwC.(:24) > at (:26) > at .(:30) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) > at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) > at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) > at > 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
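Note: hadoop checknative only proves that the hadoop command's own JVM can load libsnappy; the Spark driver and executors are separate JVMs and need the same directory on their library path. For a standalone cluster, one hedged way to do that is via conf/spark-env.sh, reusing the path reported by checknative above:

    # conf/spark-env.sh
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64

This covers the native-library side; the ClassNotFoundException itself concerns the codec class name, which the next messages narrow down.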
Re: org.apache.hadoop.io.compress.SnappyCodec not found
ore Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.GzipCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135) at org.apache.hadoop.io.compress.CompressionCodecFactory.(CompressionCodecFactory.java:175) at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45) ... 60 more Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.io.compress.GzipCodec not found at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801) at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128) ... 62 more Any idea to fix this issue? Regards Arthur On 29 Aug, 2014, at 2:58 am, arthur.hk.c...@gmail.com wrote: > Hi, > > my check native result: > > hadoop checknative > 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize > native-bzip2 library system-native, will use pure-Java version > 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded & initialized > native-zlib library > Native library checking: > hadoop: true > /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libhadoop.so > zlib: true /lib64/libz.so.1 > snappy: true > /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so.1 > lz4:true revision:99 > bzip2: false > > Any idea how to enable or disable snappy in Spark? > > Regards > Arthur > > > On 29 Aug, 2014, at 2:39 am, arthur.hk.c...@gmail.com > wrote: > >> Hi, >> >> I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and >> HBase. >> With default setting in Spark 1.0.2, when trying to load a file I got "Class >> org.apache.hadoop.io.compress.SnappyCodec not found" >> >> Can you please advise how to enable snappy in Spark? 
>> >> Regards >> Arthur >> >> >> scala> inFILE.first() >> java.lang.RuntimeException: Error in configuring object >> at >> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) >> at >> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) >> at >> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) >> at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) >> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) >> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) >> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) >> at scala.Option.getOrElse(Option.scala:120) >> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) >> at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) >> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) >> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) >> at scala.Option.getOrElse(Option.scala:120) >> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) >> at org.apache.spark.rdd.RDD.take(RDD.scala:983) >> at org.apache.spark.rdd.RDD.first(RDD.scala:1015) >> at $iwC$$iwC$$iwC$$iwC.(:15) >> at $iwC$$iwC$$iwC.(:20) >> at $iwC$$iwC.(:22) >> at $iwC.(:24) >> at (:26) >> at .(:30) >> at .() >> at .(:7) >> at .() >> at $print() >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> at java.lang.reflect.Method.invoke(Method.java:597) >> at >> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) >> at >> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) >> at >> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) >> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) >> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) >> at >> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) >> at >> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) >> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) >> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) >> at org.apache.spark.repl.SparkIL
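Note: it is telling that whichever codec is listed first in io.compression.codecs is the one reported as "not found". That points at how the property value is being read rather than at snappy itself; stray line breaks or spaces inside the value are a common culprit, so it is worth keeping the whole list on one line, roughly as below (reconstructed from the quoted core-site.xml, which lost its tags in the archive):

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
    </property>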
Re: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, I fixed the issue by copying libsnappy.so to Java ire. Regards Arthur On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com wrote: > Hi, > > If change my etc/hadoop/core-site.xml > > from > > io.compression.codecs > > org.apache.hadoop.io.compress.SnappyCodec, > org.apache.hadoop.io.compress.GzipCodec, > org.apache.hadoop.io.compress.DefaultCodec, > org.apache.hadoop.io.compress.BZip2Codec > > > > to > > io.compression.codecs > > org.apache.hadoop.io.compress.GzipCodec, > org.apache.hadoop.io.compress.SnappyCodec, > org.apache.hadoop.io.compress.DefaultCodec, > org.apache.hadoop.io.compress.BZip2Codec > > > > > > and run the test again, I found this time it cannot find > "org.apache.hadoop.io.compress.GzipCodec" > > scala> inFILE.first() > java.lang.RuntimeException: Error in configuring object > at > org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) > at org.apache.spark.rdd.RDD.take(RDD.scala:983) > at org.apache.spark.rdd.RDD.first(RDD.scala:1015) > at $iwC$$iwC$$iwC$$iwC.(:15) > at $iwC$$iwC$$iwC.(:20) > at $iwC$$iwC.(:22) > at $iwC.(:24) > at (:26) > at .(:30) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) > at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) > at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at 
org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) > at org.apache.sp
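Note: the fix described above, copying libsnappy.so into the JRE, works because the JRE's library directory is on every JVM's default native library path. As a sketch, with an assumed destination path:

    cp /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so* $JAVA_HOME/jre/lib/amd64/

Extending LD_LIBRARY_PATH (or spark.executor.extraLibraryPath), as in the earlier sketch, is a less invasive alternative that survives JRE upgrades.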
Spark Hive max key length is 767 bytes
(Please ignore if duplicated) Hi, I use Spark 1.0.2 with Hive 0.13.1 I have already set the hive mysql database to latine1; mysql: alter database hive character set latin1; Spark: scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> hiveContext.hql("create table test_datatype1 (testbigint bigint )") scala> hiveContext.hql("drop table test_datatype1") 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 12:31:59 ERROR DataNucleus.Datastore: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at com.mysql.jdbc.Util.handleNewInstance(Util.java:408) at com.mysql.jdbc.Util.getInstance(Util.java:383) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1062) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4226) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4158) Can you please advise what would be wrong? Regards Arthur
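Note: "Specified key was too long; max key length is 767 bytes" is MySQL refusing to index a long VARCHAR column in tables that DataNucleus is auto-creating as UTF-8. Changing the database default character set, as above, does not touch tables that already exist, so the usual workaround also converts those (a sketch; the metastore database is assumed to be named hive, as in the ALTER DATABASE above):

    ALTER DATABASE hive CHARACTER SET latin1 COLLATE latin1_swedish_ci;
    -- tables auto-created before the change keep their original character set, e.g.:
    -- ALTER TABLE <metastore_table> CONVERT TO CHARACTER SET latin1;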
Re: Spark Hive max key length is 767 bytes
Hi, Tried the same thing in HIVE directly without issue: HIVE: hive> create table test_datatype2 (testbigint bigint ); OK Time taken: 0.708 seconds hive> drop table test_datatype2; OK Time taken: 23.272 seconds Then tried again in SPARK: scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) 14/08/29 19:33:52 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@395c7b94 scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) res0: org.apache.spark.sql.SchemaRDD = SchemaRDD[0] at RDD at SchemaRDD.scala:104 == Query Plan == scala> hiveContext.hql("drop table test_datatype3") 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in no possible candidates Error(s) were found while auto-creating/validating the datastore for classes. The errors are printed in the log, and are attached to this exception. org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found while auto-creating/validating the datastore for classes. The errors are printed in the log, and are attached to this exception. at org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609) Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table. 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table. 
14/08/29 19:34:25 ERROR DataNucleus.Datastore: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) Can anyone please help? Regards Arthur On 29 Aug, 2014, at 12:47 pm, arthur.hk.c...@gmail.com wrote: > (Please ignore if duplicated) > > > Hi, > > I use Spark 1.0.2 with Hive 0.13.1 > > I have already set the hive mysql database to latine1; > > mysql: > alter database hive character set latin1; > > Spark: > scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) > scala> hiveContext.hql("create table test_datatype1 (testbigint bigint )") > scala> hiveContext.hql("drop table test_datatype1") > > > 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" > so does not have its own datastore table. > 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > &
Re: Spark Hive max key length is 767 bytes
Hi Michael, Thank you so much!! I have tried to change the following key length from 256 to 255 and from 767 to 766, it still didn’t work alter table COLUMNS_V2 modify column COMMENT VARCHAR(255); alter table INDEX_PARAMS modify column PARAM_KEY VARCHAR(255); alter table SD_PARAMS modify column PARAM_KEY VARCHAR(255); alter table SERDE_PARAMS modify column PARAM_KEY VARCHAR(255); alter table TABLE_PARAMS modify column PARAM_KEY VARCHAR(255); alter table TBLS modify column OWNER VARCHAR(766); alter table PART_COL_STATS modify column PARTITION_NAME VARCHAR(766); alter table PARTITION_KEYS modify column PKEY_TYPE VARCHAR(766); alter table PARTITIONS modify column PART_NAME VARCHAR(766); I use Hadoop 2.4.1 HBase 0.98.5 Hive 0.13, trying Spark 1.0.2 and Shark 0.9.2, and JDK1.6_45. Some questions: shark-0.9.2 is based on which Hive version? is HBase 0.98.x OK? is Hive 0.13.1 OK? and which Java? (I use JDK1.6 at the moment, it seems not working) spark-1.0.2 is based on which Hive version? is HBase 0.98.x OK? Regards Arthur On 30 Aug, 2014, at 1:40 am, Michael Armbrust wrote: > Spark SQL is based on Hive 12. They must have changed the maximum key size > between 12 and 13. > > > On Fri, Aug 29, 2014 at 4:38 AM, arthur.hk.c...@gmail.com > wrote: > Hi, > > > Tried the same thing in HIVE directly without issue: > > HIVE: > hive> create table test_datatype2 (testbigint bigint ); > OK > Time taken: 0.708 seconds > > hive> drop table test_datatype2; > OK > Time taken: 23.272 seconds > > > > Then tried again in SPARK: > scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) > 14/08/29 19:33:52 INFO Configuration.deprecation: > mapred.reduce.tasks.speculative.execution is deprecated. Instead, use > mapreduce.reduce.speculative > hiveContext: org.apache.spark.sql.hive.HiveContext = > org.apache.spark.sql.hive.HiveContext@395c7b94 > > scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) > res0: org.apache.spark.sql.SchemaRDD = > SchemaRDD[0] at RDD at SchemaRDD.scala:104 > == Query Plan == > > > scala> hiveContext.hql("drop table test_datatype3") > > 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while > adding/validating class(es) : Specified key was too long; max key length is > 767 bytes > com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was > too long; max key length is 767 bytes > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) > > 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of > org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in > no possible candidates > Error(s) were found while auto-creating/validating the datastore for classes. > The errors are printed in the log, and are attached to this exception. > org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found > while auto-creating/validating the datastore for classes. The errors are > printed in the log, and are attached to this exception. 
> at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609) > > > Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: > Specified key was too long; max key length is 767 bytes > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) > > 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" > so does not have its own datastore table. > 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" > so does not have its own datastore table. > 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as > "embedded-only" so does not have its own datastore table. > 14/08/29 19:34:17 INFO DataNucleus.Datastore: The class > "org.apache.hadoop.hive.me
Re: Spark Hive max key length is 767 bytes
Hi, Already done but still get the same error: (I use HIVE 0.13.1 Spark 1.0.2, Hadoop 2.4.1) Steps: Step 1) mysql: > >> alter database hive character set latin1; Step 2) HIVE: >> hive> create table test_datatype2 (testbigint bigint ); >> OK >> Time taken: 0.708 seconds >> >> hive> drop table test_datatype2; >> OK >> Time taken: 23.272 seconds Step 3) scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) >> 14/08/29 19:33:52 INFO Configuration.deprecation: >> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use >> mapreduce.reduce.speculative >> hiveContext: org.apache.spark.sql.hive.HiveContext = >> org.apache.spark.sql.hive.HiveContext@395c7b94 >> scala> hiveContext.hql(“create table test_datatype3 (testbigint bigint)”) >> res0: org.apache.spark.sql.SchemaRDD = >> SchemaRDD[0] at RDD at SchemaRDD.scala:104 >> == Query Plan == >> >> scala> hiveContext.hql("drop table test_datatype3") >> >> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while >> adding/validating class(es) : Specified key was too long; max key length is >> 767 bytes >> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was >> too long; max key length is 767 bytes >> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >> at >> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) >> >> 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of >> org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in >> no possible candidates >> Error(s) were found while auto-creating/validating the datastore for >> classes. The errors are printed in the log, and are attached to this >> exception. >> org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found >> while auto-creating/validating the datastore for classes. The errors are >> printed in the log, and are attached to this exception. >> at >> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609) >> >> >> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: >> Specified key was too long; max key length is 767 bytes >> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) Should I use HIVE 0.12.0 instead of HIVE 0.13.1? Regards Arthur On 31 Aug, 2014, at 6:01 am, Denny Lee wrote: > Oh, you may be running into an issue with your MySQL setup actually, try > running > > alter database metastore_db character set latin1 > > so that way Hive (and the Spark HiveContext) can execute properly against the > metastore. > > > On August 29, 2014 at 04:39:01, arthur.hk.c...@gmail.com > (arthur.hk.c...@gmail.com) wrote: > >> Hi, >> >> >> Tried the same thing in HIVE directly without issue: >> >> HIVE: >> hive> create table test_datatype2 (testbigint bigint ); >> OK >> Time taken: 0.708 seconds >> >> hive> drop table test_datatype2; >> OK >> Time taken: 23.272 seconds >> >> >> >> Then tried again in SPARK: >> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) >> 14/08/29 19:33:52 INFO Configuration.deprecation: >> mapred.reduce.tasks.speculative.execution is deprecated. 
Instead, use >> mapreduce.reduce.speculative >> hiveContext: org.apache.spark.sql.hive.HiveContext = >> org.apache.spark.sql.hive.HiveContext@395c7b94 >> >> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) >> res0: org.apache.spark.sql.SchemaRDD = >> SchemaRDD[0] at RDD at SchemaRDD.scala:104 >> == Query Plan == >> >> >> scala> hiveContext.hql("drop table test_datatype3") >> >> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown while >> adding/validating class(es) : Specified key was too long; max key length is >> 767 bytes >> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was >> too long; max key length is 767 bytes >> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >> at >> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) >> >> 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of >> org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted in >> no possible candidates >> Error(s) were found while auto-creating/validating the data
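A note for readers of this thread: the database named in the ALTER statement has to be whichever MySQL schema actually backs the Hive metastore, which is called hive earlier in the thread and metastore_db in Denny's suggestion; the right name is whatever appears in javax.jdo.option.ConnectionURL in hive-site.xml. A hedged sketch, assuming the metastore database is called hive:

-- run on the MySQL instance that hosts the Hive metastore (database name is an assumption)
ALTER DATABASE hive CHARACTER SET latin1;
SHOW CREATE DATABASE hive;  -- verify the default character set afterwards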
Spark Master/Slave and HA
Hi, I have a few questions about Spark Master and Slave setup. I have 5 Hadoop nodes (n1 to n5), and at the moment I run Spark on them as follows:
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave
Questions:
Q1: If I set n1 as both Spark Master and Spark Slave, I cannot start the Spark cluster. Does it mean that, unlike Hadoop, I cannot use the same machine as both MASTER and SLAVE in Spark?
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master, Spark Slave (failed to start Spark)
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave
Q2: I am planning Spark HA. What if I use n2 as "Spark Standby Master and Spark Slave"? Is Spark allowed to run a Standby Master and a Slave on the same machine?
n1: Hadoop Active Name Node, Hadoop Slave, Spark Active Master
n2: Hadoop Standby Name Node, Hadoop Slave, Spark Standby Master, Spark Slave
n3: Hadoop Slave, Spark Slave
n4: Hadoop Slave, Spark Slave
n5: Hadoop Slave, Spark Slave
Q3: Does the Spark Master node do actual computation work like a worker, or is it just a pure monitoring node?
Regards Arthur
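Regarding Q2, Spark's standalone mode does support standby masters via ZooKeeper leader election, configured through SPARK_DAEMON_JAVA_OPTS on every host that may run a Master. A minimal sketch assuming a ZooKeeper quorum on n1, n2 and n3 (the hostnames, ports and znode path below are placeholders, not values from this thread):

# conf/spark-env.sh on each candidate Master host (e.g. n1 and n2)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=n1:2181,n2:2181,n3:2181 -Dspark.deploy.zookeeper.dir=/spark"
# start a Master on n1 and a standby Master on n2, then point workers and applications at both:
#   ./sbin/start-master.sh                                  (run on n1 and on n2)
#   ./bin/spark-shell --master spark://n1:7077,n2:7077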
Spark and Shark Node: RAM Allocation
Hi, Is there any formula to calculate proper RAM allocation values for Spark and Shark based on Physical RAM, HADOOP and HBASE RAM usage? e.g. if a node has 32GB physical RAM:
spark-defaults.conf
spark.executor.memory ?g
spark-env.sh
export SPARK_WORKER_MEMORY=?
export HADOOP_HEAPSIZE=?
shark-env.sh
export SPARK_MEM=?g
export SHARK_MASTER_MEM=?g
Regards Arthur
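There is no exact formula, but a common starting point is to subtract the OS plus the co-located Hadoop/HBase daemons (DataNode, NodeManager, RegionServer) from physical RAM and give most of the remainder to the Spark worker, keeping spark.executor.memory within SPARK_WORKER_MEMORY. A purely illustrative split for a 32 GB node (every number below is an assumption, not a recommendation from this thread):

# spark-env.sh   (leaves roughly 8-12 GB for the OS, DataNode and RegionServer)
export SPARK_WORKER_MEMORY=16g
export HADOOP_HEAPSIZE=4000          # in MB, heap for Hadoop daemons launched with this env
# spark-defaults.conf
spark.executor.memory 12g            # must fit inside SPARK_WORKER_MEMORY
# shark-env.sh
export SPARK_MEM=12g
export SHARK_MASTER_MEM=2g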
Spark and Shark
Hi, I have installed Spark 1.0.2 and Shark 0.9.2 on Hadoop 2.4.1 (by compiling from source). spark: 1.0.2 shark: 0.9.2 hadoop: 2.4.1 java: java version “1.7.0_67” protobuf: 2.5.0 I have tried the smoke test in shark but got “java.util.NoSuchElementException” error, can you please advise how to fix this? shark> create table x1 (a INT); FAILED: Hive Internal Error: java.util.NoSuchElementException(null) 14/09/01 23:04:24 [main]: ERROR shark.SharkDriver: FAILED: Hive Internal Error: java.util.NoSuchElementException(null) java.util.NoSuchElementException at java.util.HashMap$HashIterator.nextEntry(HashMap.java:925) at java.util.HashMap$ValueIterator.next(HashMap.java:950) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:8117) at shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:150) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284) at shark.SharkDriver.compile(SharkDriver.scala:215) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:340) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at shark.SharkCliDriver$.main(SharkCliDriver.scala:237) at shark.SharkCliDriver.main(SharkCliDriver.scala) spark-env.sh #!/usr/bin/env bash export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar" export CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar" export JAVA_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64" export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"} export SPARK_CLASSPATH="$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar" export SPARK_WORKER_MEMORY=2g export HADOOP_HEAPSIZE=2000 spark-defaults.conf spark.executor.memory 2048m spark.shuffle.spill.compressfalse shark-env.sh #!/usr/bin/env bash export SPARK_MEM=2g export SHARK_MASTER_MEM=2g SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp " SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 " SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps " export SPARK_JAVA_OPTS export SHARK_EXEC_MODE=yarn export SPARK_ASSEMBLY_JAR="$SCALA_HOME/assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar" export SHARK_ASSEMBLY_JAR="target/scala-2.10/shark_2.10-0.9.2.jar" export HIVE_CONF_DIR="$HIVE_HOME/conf" export SPARK_LIBPATH=$HADOOP_HOME/lib/native/ export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native/ export SPARK_CLASSPATH="$SHARK_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar:$SHARK_HOME/lib/protobuf-java-2.5.0.jar" Regards Arthur
Re: Dependency Problem with Spark / ScalaTest / SBT
Hi, What is your SBT command and the parameters? Arthur On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler wrote: > Hello, > > I am writing a Spark App which is already working so far. > Now I started to build also some UnitTests, but I am running into some > dependecy problems and I cannot find a solution right now. Perhaps someone > could help me. > > I build my Spark Project with SBT and it seems to be configured well, because > compiling, assembling and running the built jar with spark-submit are working > well. > > Now I started with the UnitTests, which I located under /src/test/scala. > > When I call "test" in sbt, I get the following: > > 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered BlockManager > 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server > [trace] Stack trace suppressed: run last test:test for the full output. > [error] Could not run test test.scala.SetSuite: > java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse > [info] Run completed in 626 milliseconds. > [info] Total number of tests run: 0 > [info] Suites: completed 0, aborted 0 > [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 > [info] All tests passed. > [error] Error during tests: > [error] test.scala.SetSuite > [error] (test:test) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 3 s, completed 10.09.2014 12:22:06 > > last test:test gives me the following: > > > last test:test > [debug] Running TaskDef(test.scala.SetSuite, > org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector]) > java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse >at org.apache.spark.HttpServer.start(HttpServer.scala:54) >at > org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156) >at > org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127) >at > org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31) >at > org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48) >at > org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35) >at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218) >at org.apache.spark.SparkContext.(SparkContext.scala:202) >at test.scala.SetSuite.(SparkTest.scala:16) > > I also noticed right now, that sbt run is also not working: > > 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server > [error] (run-main-2) java.lang.NoClassDefFoundError: > javax/servlet/http/HttpServletResponse > java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse >at org.apache.spark.HttpServer.start(HttpServer.scala:54) >at > org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:156) >at > org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127) >at > org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31) >at > org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48) >at > org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35) >at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218) >at org.apache.spark.SparkContext.(SparkContext.scala:202) >at > main.scala.PartialDuplicateScanner$.main(PartialDuplicateScanner.scala:29) >at main.scala.PartialDuplicateScanner.main(PartialDuplicateScanner.scala) > > Here is my Testprojekt.sbt file: > > name := "Testprojekt" > > version := "1.0" > > scalaVersion := "2.10.4" > > libraryDependencies ++= { > Seq( >"org.apache.lucene" % "lucene-core" % 
"4.9.0", >"org.apache.lucene" % "lucene-analyzers-common" % "4.9.0", >"org.apache.lucene" % "lucene-queryparser" % "4.9.0", >("org.apache.spark" %% "spark-core" % "1.0.2"). >exclude("org.mortbay.jetty", "servlet-api"). >exclude("commons-beanutils", "commons-beanutils-core"). >exclude("commons-collections", "commons-collections"). >exclude("commons-collections", "commons-collections"). >exclude("com.esotericsoftware.minlog", "minlog"). >exclude("org.eclipse.jetty.orbit", "javax.mail.glassfish"). >exclude("org.eclipse.jetty.orbit", "javax.transaction"). >exclude("org.eclipse.jetty.orbit", "javax.servlet") > ) > } > > resolvers += "Akka Repository" at "http://repo.akka.io/releases/"; > > > > > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL -- more than two tables for join
Hi, Maybe you can take a look at the following: http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html Good luck. Arthur On 10 Sep, 2014, at 9:09 pm, arunshell87 wrote: > > Hi, > > I too had tried SQL queries with joins, MINUS , subqueries etc but they did > not work in Spark Sql. > > I did not find any documentation on what queries work and what do not work > in Spark SQL, may be we have to wait for the Spark book to be released in > Feb-2015. > > I believe you can try HiveQL in Spark for your requirement. > > Thanks, > Arun > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL -- more than two tables for join
Hi, Some findings: 1) Spark SQL does not support joining more than two tables in one query 2) Spark left join has a performance issue 3) Spark SQL's cache table does not support two-tier queries 4) Spark SQL does not support repartition Arthur On 10 Sep, 2014, at 10:22 pm, arthur.hk.c...@gmail.com wrote: > Hi, > > Maybe you can take a look at the following: > > http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html > > Good luck. > Arthur > > On 10 Sep, 2014, at 9:09 pm, arunshell87 wrote: > >> >> Hi, >> >> I too had tried SQL queries with joins, MINUS , subqueries etc but they did >> not work in Spark Sql. >> >> I did not find any documentation on what queries work and what do not work >> in Spark SQL, may be we have to wait for the Spark book to be released in >> Feb-2015. >> >> I believe you can try HiveQL in Spark for your requirement. >> >> Thanks, >> Arun >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p13877.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> > - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
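As a hedged illustration of the workaround discussed above (chain two-table joins and register the intermediate result, rather than writing a single multi-table query), assuming three tables t1, t2 and t3 already visible to the HiveContext and joined on an id column — the table and column names are placeholders:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// first join two tables and register the intermediate SchemaRDD
val t1t2 = hiveContext.hql("SELECT t1.id, t1.x, t2.y FROM t1 JOIN t2 ON t1.id = t2.id")
t1t2.registerTempTable("t1t2")
// then join the intermediate result with the third table
val all3 = hiveContext.hql("SELECT t1t2.id, t1t2.x, t1t2.y, t3.z FROM t1t2 JOIN t3 ON t1t2.id = t3.id")
all3.take(10).foreach(println)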
unable to create new native thread
Hi I am trying the Spark sample program “SparkPi”, I got an error "unable to create new native thread", how to resolve this? 14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 644) 14/09/11 21:36:16 INFO scheduler.TaskSetManager: Finished TID 643 in 43 ms on node1 (progress: 636/10) 14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 643) 14/09/11 21:36:16 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-16] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672) at scala.concurrent.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966) at scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829) at scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(AbstractDispatcher.scala:374) at akka.dispatch.ExecutorServiceDelegate$class.execute(ThreadPoolBuilder.scala:212) at akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:43) at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:118) at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:59) at akka.actor.dungeon.Dispatch$class.sendMessage(Dispatch.scala:120) at akka.actor.ActorCell.sendMessage(ActorCell.scala:338) at akka.actor.Cell$class.sendMessage(ActorCell.scala:259) at akka.actor.ActorCell.sendMessage(ActorCell.scala:338) at akka.actor.LocalActorRef.$bang(ActorRef.scala:389) at akka.actor.ActorRef.tell(ActorRef.scala:125) at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:489) at akka.actor.ActorCell.invoke(ActorCell.scala:455) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Regards Arthur
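java.lang.OutOfMemoryError: unable to create new native thread is normally an operating-system limit on the number of processes/threads allowed for the user running the driver or worker (or too little memory left for native thread stacks), not a JVM heap problem. A hedged checklist on Linux, with illustrative values only:

# as the user that runs Spark
ulimit -u        # max user processes (threads count against this)
ulimit -n        # max open files
# raise the limits, e.g. in /etc/security/limits.conf, then log in again and restart the daemons:
#   sparkuser  soft  nproc   32768
#   sparkuser  hard  nproc   32768
#   sparkuser  soft  nofile  65536
#   sparkuser  hard  nofile  65536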
object hbase is not a member of package org.apache.hadoop
Hi, I have tried to run HBaseTest.scala, but I got the following errors, any ideas how to fix them? Q1) scala> package org.apache.spark.examples :1: error: illegal start of definition package org.apache.spark.examples Q2) scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat :31: error: object hbase is not a member of package org.apache.hadoop import org.apache.hadoop.hbase.mapreduce.TableInputFormat Regards Arthur
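Before patching the build, a quick way to make the HBase imports resolve in spark-shell is to put the HBase client jars on the shell's classpath with --jars; the paths and exact jar names below are assumptions for an HBase 0.98.5 install and will differ on your cluster:

# --jars takes one comma-separated list; the backslashes only continue the line
bin/spark-shell --jars \
$HBASE_HOME/lib/hbase-common-0.98.5-hadoop2.jar,\
$HBASE_HOME/lib/hbase-client-0.98.5-hadoop2.jar,\
$HBASE_HOME/lib/hbase-protocol-0.98.5-hadoop2.jar,\
$HBASE_HOME/lib/hbase-server-0.98.5-hadoop2.jar,\
$HBASE_HOME/lib/htrace-core-2.04.jar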
Re: object hbase is not a member of package org.apache.hadoop
Spark examples builds against hbase 0.94 by default. If you want to run against 0.98, see SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 Cheers On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com <arthur.hk.c...@gmail.com> wrote: > Hi, I have tried to run HBaseTest.scala, but I got the following errors, any ideas how to fix them? > Q1) scala> package org.apache.spark.examples :1: error: illegal start of definition package org.apache.spark.examples > Q2) scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat :31: error: object hbase is not a member of package org.apache.hadoop import org.apache.hadoop.hbase.mapreduce.TableInputFormat > Regards Arthur
Re: object hbase is not a member of package org.apache.hadoop
Hi, Thanks! patch -p0 -i spark-1297-v5.txt patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej Still got errors. Regards Arthur On 14 Sep, 2014, at 11:33 pm, Ted Yu wrote: > spark-1297-v5.txt is level 0 patch > > Please use spark-1297-v5.txt > > Cheers > > On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com > wrote: > Hi, > > Thanks!! > > I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt > are good here, but not spark-1297-v5.txt: > > > $ patch -p1 -i spark-1297-v4.txt > patching file examples/pom.xml > > $ patch -p1 -i spark-1297-v5.txt > can't find file to patch at input line 5 > Perhaps you used the wrong -p or --strip option? > The text leading up to this was: > -- > |diff --git docs/building-with-maven.md docs/building-with-maven.md > |index 672d0ef..f8bcd2b 100644 > |--- docs/building-with-maven.md > |+++ docs/building-with-maven.md > -- > File to patch: > > > > > > > Please advise. > Regards > Arthur > > > > On 14 Sep, 2014, at 10:48 pm, Ted Yu wrote: > >> Spark examples builds against hbase 0.94 by default. >> >> If you want to run against 0.98, see: >> SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297 >> >> Cheers >> >> On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com >> wrote: >> Hi, >> >> I have tried to to run HBaseTest.scala, but I got following errors, any >> ideas to how to fix them? >> >> Q1) >> scala> package org.apache.spark.examples >> :1: error: illegal start of definition >>package org.apache.spark.examples >> >> >> Q2) >> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat >> :31: error: object hbase is not a member of package >> org.apache.hadoop >>import org.apache.hadoop.hbase.mapreduce.TableInputFormat >> >> >> >> Regards >> Arthur >> > > >
Re: object hbase is not a member of package org.apache.hadoop
Hi, My bad. Tried again, worked. patch -p0 -i spark-1297-v5.txt patching file docs/building-with-maven.md patching file examples/pom.xml Thanks! Arthur On 14 Sep, 2014, at 11:38 pm, arthur.hk.c...@gmail.com wrote: > Hi, > > Thanks! > > patch -p0 -i spark-1297-v5.txt > patching file docs/building-with-maven.md > patching file examples/pom.xml > Hunk #1 FAILED at 45. > Hunk #2 FAILED at 110. > 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej > > Still got errors. > > Regards > Arthur > > On 14 Sep, 2014, at 11:33 pm, Ted Yu wrote: > >> spark-1297-v5.txt is level 0 patch >> >> Please use spark-1297-v5.txt >> >> Cheers >> >> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com >> wrote: >> Hi, >> >> Thanks!! >> >> I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt >> are good here, but not spark-1297-v5.txt: >> >> >> $ patch -p1 -i spark-1297-v4.txt >> patching file examples/pom.xml >> >> $ patch -p1 -i spark-1297-v5.txt >> can't find file to patch at input line 5 >> Perhaps you used the wrong -p or --strip option? >> The text leading up to this was: >> -- >> |diff --git docs/building-with-maven.md docs/building-with-maven.md >> |index 672d0ef..f8bcd2b 100644 >> |--- docs/building-with-maven.md >> |+++ docs/building-with-maven.md >> -- >> File to patch: >> >> >> >> >> >> >> Please advise. >> Regards >> Arthur >> >> >> >> On 14 Sep, 2014, at 10:48 pm, Ted Yu wrote: >> >>> Spark examples builds against hbase 0.94 by default. >>> >>> If you want to run against 0.98, see: >>> SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297 >>> >>> Cheers >>> >>> On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com >>> wrote: >>> Hi, >>> >>> I have tried to to run HBaseTest.scala, but I got following errors, any >>> ideas to how to fix them? >>> >>> Q1) >>> scala> package org.apache.spark.examples >>> :1: error: illegal start of definition >>>package org.apache.spark.examples >>> >>> >>> Q2) >>> scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat >>> :31: error: object hbase is not a member of package >>> org.apache.hadoop >>>import org.apache.hadoop.hbase.mapreduce.TableInputFormat >>> >>> >>> >>> Regards >>> Arthur >>> >> >> >> >
Re: object hbase is not a member of package org.apache.hadoop
Hi, I applied the patch. 1) patched $ patch -p0 -i spark-1297-v5.txt patching file docs/building-with-maven.md patching file examples/pom.xml 2) Compilation result [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [1.550s] [INFO] Spark Project Core SUCCESS [1:32.175s] [INFO] Spark Project Bagel ... SUCCESS [10.809s] [INFO] Spark Project GraphX .. SUCCESS [31.435s] [INFO] Spark Project Streaming ... SUCCESS [44.518s] [INFO] Spark Project ML Library .. SUCCESS [48.992s] [INFO] Spark Project Tools ... SUCCESS [7.028s] [INFO] Spark Project Catalyst SUCCESS [40.365s] [INFO] Spark Project SQL . SUCCESS [43.305s] [INFO] Spark Project Hive SUCCESS [36.464s] [INFO] Spark Project REPL SUCCESS [20.319s] [INFO] Spark Project YARN Parent POM . SUCCESS [1.032s] [INFO] Spark Project YARN Stable API . SUCCESS [19.379s] [INFO] Spark Project Hive Thrift Server .. SUCCESS [12.470s] [INFO] Spark Project Assembly SUCCESS [13.822s] [INFO] Spark Project External Twitter SUCCESS [9.566s] [INFO] Spark Project External Kafka .. SUCCESS [12.848s] [INFO] Spark Project External Flume Sink . SUCCESS [10.437s] [INFO] Spark Project External Flume .. SUCCESS [14.554s] [INFO] Spark Project External ZeroMQ . SUCCESS [9.994s] [INFO] Spark Project External MQTT ... SUCCESS [8.684s] [INFO] Spark Project Examples SUCCESS [1:31.610s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 9:41.700s [INFO] Finished at: Sun Sep 14 23:51:56 HKT 2014 [INFO] Final Memory: 83M/1071M [INFO] 3) testing: scala> package org.apache.spark.examples :1: error: illegal start of definition package org.apache.spark.examples ^ scala> import org.apache.hadoop.hbase.client.HBaseAdmin import org.apache.hadoop.hbase.client.HBaseAdmin scala> import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat import org.apache.hadoop.hbase.mapreduce.TableInputFormat scala> import org.apache.spark._ import org.apache.spark._ scala> object HBaseTest { | def main(args: Array[String]) { | val sparkConf = new SparkConf().setAppName("HBaseTest") | val sc = new SparkContext(sparkConf) | val conf = HBaseConfiguration.create() | // Other options for configuring scan behavior are available. More information available at | // http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html | conf.set(TableInputFormat.INPUT_TABLE, args(0)) | // Initialize hBase table if necessary | val admin = new HBaseAdmin(conf) | if (!admin.isTableAvailable(args(0))) { | val tableDesc = new HTableDescriptor(args(0)) | admin.createTable(tableDesc) | } | val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], | classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], | classOf[org.apache.hadoop.hbase.client.Result]) | hBaseRDD.count() | sc.stop() | } | } warning: there were 1 deprecation warning(s); re-run with -deprecation for details defined module HBaseTest Now only got error when trying to run "package org.apache.spark.examples” Please advise. Regards Arthur On 14 Sep, 2014, at 11:41 pm, Ted Yu wrote: > I applied the patch on master branch without rejects. > > If you use spark 1.0.2, use pom.xml attached to the JIRA. > > On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com > wrote: > Hi, > > Thanks! > > patch -p0 -i spark-1297-v5.txt > patching file docs/building-with-maven.md > patching file examples/pom.xml > Hunk #1 FAILED at 45. > Hunk #2 FAILED at 110. 
> 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej > > Still got errors. > > Regards > Arthur > > On 14 Sep, 2014, at 11:33 pm, Ted Yu wrote: > >> spark-1297-v5.txt is level 0 patch >> >> Please use spark-1297-v5.txt >> >> Cheers >> >> On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.c
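The remaining failure is expected: package declarations are not legal input in the Scala REPL, so HBaseTest.scala cannot be pasted in with its package line (dropping that line, as done above, is the right move). To run the compiled example instead, something like the following should work once the patched examples module is built — the table name and the examples jar path are assumptions:

# from the top of the patched Spark build
./bin/run-example HBaseTest my_hbase_table
# or explicitly via spark-submit (the jar name depends on how the build was produced):
# ./bin/spark-submit --class org.apache.spark.examples.HBaseTest \
#   examples/target/scala-2.10/spark-examples-*.jar my_hbase_table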
Re: Spark Hive max key length is 767 bytes
Hi, Fixed the issue by downgrade hive from 13.1 to 12.0, it works well now. Regards On 31 Aug, 2014, at 7:28 am, arthur.hk.c...@gmail.com wrote: > Hi, > > Already done but still get the same error: > > (I use HIVE 0.13.1 Spark 1.0.2, Hadoop 2.4.1) > > Steps: > Step 1) mysql: >> >>> alter database hive character set latin1; > Step 2) HIVE: >>> hive> create table test_datatype2 (testbigint bigint ); >>> OK >>> Time taken: 0.708 seconds >>> >>> hive> drop table test_datatype2; >>> OK >>> Time taken: 23.272 seconds > Step 3) scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) >>> 14/08/29 19:33:52 INFO Configuration.deprecation: >>> mapred.reduce.tasks.speculative.execution is deprecated. Instead, use >>> mapreduce.reduce.speculative >>> hiveContext: org.apache.spark.sql.hive.HiveContext = >>> org.apache.spark.sql.hive.HiveContext@395c7b94 >>> scala> hiveContext.hql(“create table test_datatype3 (testbigint bigint)”) >>> res0: org.apache.spark.sql.SchemaRDD = >>> SchemaRDD[0] at RDD at SchemaRDD.scala:104 >>> == Query Plan == >>> >>> scala> hiveContext.hql("drop table test_datatype3") >>> >>> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown >>> while adding/validating class(es) : Specified key was too long; max key >>> length is 767 bytes >>> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key >>> was too long; max key length is 767 bytes >>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) >>> at >>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) >>> >>> 14/08/29 19:34:17 WARN DataNucleus.Query: Query for candidates of >>> org.apache.hadoop.hive.metastore.model.MPartition and subclasses resulted >>> in no possible candidates >>> Error(s) were found while auto-creating/validating the datastore for >>> classes. The errors are printed in the log, and are attached to this >>> exception. >>> org.datanucleus.exceptions.NucleusDataStoreException: Error(s) were found >>> while auto-creating/validating the datastore for classes. The errors are >>> printed in the log, and are attached to this exception. >>> at >>> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.verifyErrors(RDBMSStoreManager.java:3609) >>> >>> >>> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: >>> Specified key was too long; max key length is 767 bytes >>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > > > > Should I use HIVE 0.12.0 instead of HIVE 0.13.1? > > Regards > Arthur > > On 31 Aug, 2014, at 6:01 am, Denny Lee wrote: > >> Oh, you may be running into an issue with your MySQL setup actually, try >> running >> >> alter database metastore_db character set latin1 >> >> so that way Hive (and the Spark HiveContext) can execute properly against >> the metastore. >> >> >> On August 29, 2014 at 04:39:01, arthur.hk.c...@gmail.com >> (arthur.hk.c...@gmail.com) wrote: >> >>> Hi, >>> >>> >>> Tried the same thing in HIVE directly without issue: >>> >>> HIVE: >>> hive> create table test_datatype2 (testbigint bigint ); >>> OK >>> Time taken: 0.708 seconds >>> >>> hive> drop table test_datatype2; >>> OK >>> Time taken: 23.272 seconds >>> >>> >>> >>> Then tried again in SPARK: >>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) >>> 14/08/29 19:33:52 INFO Configuration.deprecation: >>> mapred.reduce.tasks.speculative.execution is deprecated. 
Instead, use >>> mapreduce.reduce.speculative >>> hiveContext: org.apache.spark.sql.hive.HiveContext = >>> org.apache.spark.sql.hive.HiveContext@395c7b94 >>> >>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) >>> res0: org.apache.spark.sql.SchemaRDD = >>> SchemaRDD[0] at RDD at SchemaRDD.scala:104 >>> == Query Plan == >>> >>> >>> scala> hiveContext.hql("drop table test_datatype3") >>> >>> 14/08/29 19:34:14 ERROR DataNucleus.Datastore: An exception was thrown >>> while adding/validating class(es) : Specified key was too long; max key &g
Re: SparkSQL on Hive error
hi, I have just tested the same command, it works here, can you please provide your create table command? regards Arthur scala> hiveContext.hql("show tables") warning: there were 1 deprecation warning(s); re-run with -deprecation for details 2014-10-03 17:14:33,575 INFO [main] parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: show tables 2014-10-03 17:14:33,575 INFO [main] parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed 2014-10-03 17:14:33,600 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive 2014-10-03 17:14:33,600 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,600 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,600 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,600 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,600 INFO [main] parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: show tables 2014-10-03 17:14:33,600 INFO [main] parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed 2014-10-03 17:14:33,601 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,601 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,603 INFO [main] ql.Driver (Driver.java:compile(450)) - Semantic Analysis Completed 2014-10-03 17:14:33,603 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,604 INFO [main] exec.ListSinkOperator (Operator.java:initialize(338)) - Initializing Self 0 OP 2014-10-03 17:14:33,604 INFO [main] exec.ListSinkOperator (Operator.java:initializeChildren(403)) - Operator 0 OP initialized 2014-10-03 17:14:33,604 INFO [main] exec.ListSinkOperator (Operator.java:initialize(378)) - Initialization Done 0 OP 2014-10-03 17:14:33,604 INFO [main] ql.Driver (Driver.java:getSchema(264)) - Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null) 2014-10-03 17:14:33,605 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,605 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,605 INFO [main] ql.Driver (Driver.java:execute(1117)) - Starting command: show tables 2014-10-03 17:14:33,605 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,605 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,605 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,606 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 0: get_database: default 2014-10-03 17:14:33,606 INFO [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=edhuser ip=unknown-ip-addr cmd=get_database: default 2014-10-03 17:14:33,609 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 0: get_tables: db=default pat=.* 2014-10-03 17:14:33,609 INFO [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=edhuser ip=unknown-ip-addr cmd=get_tables: db=default pat=.* 2014-10-03 17:14:33,612 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,613 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,613 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,613 INFO [main] ql.Driver 
(SessionState.java:printInfo(410)) - OK 2014-10-03 17:14:33,613 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,613 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,613 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 17:14:33,614 INFO [main] mapred.FileInputFormat (FileInputFormat.java:listStatus(247)) - Total input paths to process : 1 2014-10-03 17:14:33,615 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 17:14:33,615 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - res4: org.apache.spark.sql.SchemaRDD = SchemaRDD[3] at RDD at SchemaRDD.scala:103 == Query Plan == scala> On 3 Oct, 2014, at 4:52 pm, Michael Armbrust wrote: > Are you running master? There was briefly a regression here that is > hopefully fixed by spark#2635. > > On Fri, Oct 3, 2014 at 1:43 AM, Kevin Paul wrote: > Hi all, I tried to launch my application with spark-submit, the command I use > is: > > bin/spark-submit --class ${MY_CLASS} --jars ${MY_JARS} --master local > myApplicationJar.jar > > I've buillt spark with SPARK_HIVE=true, and was able to start HiveContext, > and was able to run command like, > hiveContext.sql("select * from myTable") > or hiveContext.sql("select count(*) from myTable") > myTable is a table on my hive dat
How to save Spark log into file
Hi, How can the spark log be saved into file instead of showing them on console? Below is my conf/log4j.properties conf/log4j.properties ### # Root logger option log4j.rootLogger=INFO, file # Direct log messages to a log file log4j.appender.file=org.apache.log4j.RollingFileAppender #Redirect to Tomcat logs folder log4j.appender.file.File=/hadoop_logs/spark/spark_logging.log log4j.appender.file.MaxFileSize=10MB log4j.appender.file.MaxBackupIndex=10 log4j.appender.file.layout=org.apache.log4j.PatternLayout log4j.appender.file.layout.ConversionPattern=%d{-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n ### I tried to stop and start spark again, it still shows INFO WARN log on console. Any ideas? Regards Arthur scala> hiveContext.hql("show tables") warning: there were 1 deprecation warning(s); re-run with -deprecation for details 2014-10-03 19:35:01,554 INFO [main] parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: show tables 2014-10-03 19:35:01,715 INFO [main] parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed 2014-10-03 19:35:01,845 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive 2014-10-03 19:35:01,847 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 19:35:01,847 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 19:35:01,847 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 19:35:01,863 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 19:35:01,863 INFO [main] parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: show tables 2014-10-03 19:35:01,863 INFO [main] parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed 2014-10-03 19:35:01,863 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 19:35:01,863 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 19:35:01,941 INFO [main] ql.Driver (Driver.java:compile(450)) - Semantic Analysis Completed 2014-10-03 19:35:01,941 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 19:35:01,979 INFO [main] exec.ListSinkOperator (Operator.java:initialize(338)) - Initializing Self 0 OP 2014-10-03 19:35:01,980 INFO [main] exec.ListSinkOperator (Operator.java:initializeChildren(403)) - Operator 0 OP initialized 2014-10-03 19:35:01,980 INFO [main] exec.ListSinkOperator (Operator.java:initialize(378)) - Initialization Done 0 OP 2014-10-03 19:35:01,985 INFO [main] ql.Driver (Driver.java:getSchema(264)) - Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null) 2014-10-03 19:35:01,985 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 19:35:01,985 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 19:35:01,985 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1009)) - mapred.job.name is deprecated. 
Instead, use mapreduce.job.name 2014-10-03 19:35:01,986 INFO [main] ql.Driver (Driver.java:execute(1117)) - Starting command: show tables 2014-10-03 19:35:01,994 INFO [main] ql.Driver (PerfLogger.java:PerfLogEnd(124)) - 2014-10-03 19:35:01,994 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 19:35:01,994 INFO [main] ql.Driver (PerfLogger.java:PerfLogBegin(97)) - 2014-10-03 19:35:02,019 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:newRawStore(411)) - 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore 2014-10-03 19:35:02,034 INFO [main] metastore.ObjectStore (ObjectStore.java:initialize(232)) - ObjectStore, initialize called 2014-10-03 19:35:02,084 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file://hadoop/spark/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar" is already registered, and you are trying to register an identical plugin located at URL "file://hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar." 2014-10-03 19:35:02,108 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file://hadoop/spark/lib_managed/jars/datanucleus-core-3.2.2.jar" is already registered, and you are trying to register an identical plugin located at URL "file://hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-core-3.2.2.jar." 2014-10-03 19:35:02,137 WARN [main] DataNucleus.General (Log4JLogger.java:warn(96)) - Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the sa
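Two things worth checking here: the ConversionPattern above appears to have lost its year field (the usual form is %d{yyyy-MM-dd HH:mm:ss}), and spark-shell only honours this file if it is the log4j configuration the driver JVM actually loads. A hedged sketch of forcing that explicitly (the path is a placeholder for your installation):

# point the driver JVM at the file explicitly
./bin/spark-shell --driver-java-options "-Dlog4j.configuration=file:/path/to/spark/conf/log4j.properties"
# the same conf/log4j.properties must also be present on each worker node for executor-side logs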
Re: Breaking the previous large-scale sort record with Spark
Wonderful !! On 11 Oct, 2014, at 12:00 am, Nan Zhu wrote: > Great! Congratulations! > > -- > Nan Zhu > On Friday, October 10, 2014 at 11:19 AM, Mridul Muralidharan wrote: > >> Brilliant stuff ! Congrats all :-) >> This is indeed really heartening news ! >> >> Regards, >> Mridul >> >> >> On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia >> wrote: >>> Hi folks, >>> >>> I interrupt your regularly scheduled user / dev list to bring you some >>> pretty cool news for the project, which is that we've been able to use >>> Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x >>> faster on 10x fewer nodes. There's a detailed writeup at >>> http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html. >>> Summary: while Hadoop MapReduce held last year's 100 TB world record by >>> sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on >>> 206 nodes; and we also scaled up to sort 1 PB in 234 minutes. >>> >>> I want to thank Reynold Xin for leading this effort over the past few >>> weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali >>> Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for >>> providing the machines to make this possible. Finally, this result would of >>> course not be possible without the many many other contributions, testing >>> and feature requests from throughout the community. >>> >>> For an engine to scale from these multi-hour petabyte batch jobs down to >>> 100-millisecond streaming and interactive queries is quite uncommon, and >>> it's thanks to all of you folks that we are able to make this happen. >>> >>> Matei >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >
How To Implement More Than One Subquery in Scala/Spark
Hi, My Spark version is v1.1.0 and Hive is 0.12.0, I need to use more than 1 subquery in my Spark SQL, below are my sample table structures and a SQL that contains more than 1 subquery. Question 1: How to load a HIVE table into Scala/Spark? Question 2: How to implement a SQL_WITH_MORE_THAN_ONE_SUBQUERY in SCALA/SPARK? Question 3: What is the DATEADD function in Scala/Spark? or how to implement "DATEADD(MONTH, 3, '2013-07-01')” and "DATEADD(YEAR, 1, '2014-01-01')” in Spark or Hive? I can find HIVE (date_add(string startdate, int days)) but it is in days not MONTH / YEAR. Thanks. Regards Arthur === My sample SQL with more than 1 subquery: SELECT S_NAME, COUNT(*) AS NUMWAIT FROM SUPPLIER, LINEITEM L1, ORDERS WHERE S_SUPPKEY = L1.L_SUPPKEY AND O_ORDERKEY = L1.L_ORDERKEY AND O_ORDERSTATUS = 'F' AND L1.L_RECEIPTDATE > L1.L_COMMITDATE AND EXISTS (SELECT * FROM LINEITEM L2 WHERE L2.L_ORDERKEY = L1.L_ORDERKEY AND L2.L_SUPPKEY <> L1.L_SUPPKEY) AND NOT EXISTS (SELECT * FROM LINEITEM L3 WHERE L3.L_ORDERKEY = L1.L_ORDERKEY AND L3.L_SUPPKEY <> L1.L_SUPPKEY AND L3.L_RECEIPTDATE > L3.L_COMMITDATE) GROUP BY S_NAME ORDER BY NUMWAIT DESC, S_NAME limit 100; === Supplier Table: CREATE TABLE IF NOT EXISTS SUPPLIER ( S_SUPPKEY INTEGER PRIMARY KEY, S_NAME CHAR(25), S_ADDRESS VARCHAR(40), S_NATIONKEY BIGINT NOT NULL, S_PHONE CHAR(15), S_ACCTBAL DECIMAL, S_COMMENT VARCHAR(101) ) === Order Table: CREATE TABLE IF NOT EXISTS ORDERS ( O_ORDERKEY INTEGER PRIMARY KEY, O_CUSTKEY BIGINT NOT NULL, O_ORDERSTATUS CHAR(1), O_TOTALPRICEDECIMAL, O_ORDERDATE CHAR(10), O_ORDERPRIORITY CHAR(15), O_CLERK CHAR(15), O_SHIPPRIORITY INTEGER, O_COMMENT VARCHAR(79) === LineItem Table: CREATE TABLE IF NOT EXISTS LINEITEM ( L_ORDERKEY BIGINT not null, L_PARTKEY BIGINT, L_SUPPKEY BIGINT, L_LINENUMBERINTEGER not null, L_QUANTITY DECIMAL, L_EXTENDEDPRICE DECIMAL, L_DISCOUNT DECIMAL, L_TAX DECIMAL, L_SHIPDATE CHAR(10), L_COMMITDATECHAR(10), L_RECEIPTDATE CHAR(10), L_RETURNFLAGCHAR(1), L_LINESTATUSCHAR(1), L_SHIPINSTRUCT CHAR(25), L_SHIPMODE CHAR(10), L_COMMENT VARCHAR(44), CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER ) )
Re: How To Implement More Than One Subquery in Scala/Spark
Hi, Thank you so much! By the way, what is the DATEADD function in Scala/Spark? or how to implement "DATEADD(MONTH, 3, '2013-07-01')” and "DATEADD(YEAR, 1, '2014-01-01')” in Spark or Hive? Regards Arthur On 12 Oct, 2014, at 12:03 pm, Ilya Ganelin wrote: > Because of how closures work in Scala, there is no support for nested > map/rdd-based operations. Specifically, if you have > > Context a { > Context b { > > } > } > > Operations within context b, when distributed across nodes, will no longer > have visibility of variables specific to context a because that context is > not distributed alongside that operation! > > To get around this you need to serialize your operations. For example , run a > map job. Take the output of that and run a second map job to filter. Another > option is to run two separate map jobs and join their results. Keep in mind > that another useful technique is to execute the groupByKey routine , > particularly if you want to operate on a particular variable. > > On Oct 11, 2014 11:09 AM, "arthur.hk.c...@gmail.com" > wrote: > Hi, > > My Spark version is v1.1.0 and Hive is 0.12.0, I need to use more than 1 > subquery in my Spark SQL, below are my sample table structures and a SQL that > contains more than 1 subquery. > > Question 1: How to load a HIVE table into Scala/Spark? > Question 2: How to implement a SQL_WITH_MORE_THAN_ONE_SUBQUERY in > SCALA/SPARK? > Question 3: What is the DATEADD function in Scala/Spark? or how to implement > "DATEADD(MONTH, 3, '2013-07-01')” and "DATEADD(YEAR, 1, '2014-01-01')” in > Spark or Hive? > I can find HIVE (date_add(string startdate, int days)) but it is in days not > MONTH / YEAR. > > Thanks. > > Regards > Arthur > > === > My sample SQL with more than 1 subquery: > SELECT S_NAME, >COUNT(*) AS NUMWAIT > FROM SUPPLIER, >LINEITEM L1, >ORDERS > WHERE S_SUPPKEY = L1.L_SUPPKEY >AND O_ORDERKEY = L1.L_ORDERKEY >AND O_ORDERSTATUS = 'F' >AND L1.L_RECEIPTDATE > L1.L_COMMITDATE >AND EXISTS (SELECT * >FROM LINEITEM L2 >WHERE L2.L_ORDERKEY = L1.L_ORDERKEY > AND L2.L_SUPPKEY <> L1.L_SUPPKEY) >AND NOT EXISTS (SELECT * >FROM LINEITEM L3 >WHERE L3.L_ORDERKEY = L1.L_ORDERKEY > AND L3.L_SUPPKEY <> L1.L_SUPPKEY > AND L3.L_RECEIPTDATE > L3.L_COMMITDATE) > GROUP BY S_NAME > ORDER BY NUMWAIT DESC, S_NAME > limit 100; > > > === > Supplier Table: > CREATE TABLE IF NOT EXISTS SUPPLIER ( > S_SUPPKEY INTEGER PRIMARY KEY, > S_NAMECHAR(25), > S_ADDRESS VARCHAR(40), > S_NATIONKEY BIGINT NOT NULL, > S_PHONE CHAR(15), > S_ACCTBAL DECIMAL, > S_COMMENT VARCHAR(101) > ) > > === > Order Table: > CREATE TABLE IF NOT EXISTS ORDERS ( > O_ORDERKEYINTEGER PRIMARY KEY, > O_CUSTKEY BIGINT NOT NULL, > O_ORDERSTATUS CHAR(1), > O_TOTALPRICE DECIMAL, > O_ORDERDATE CHAR(10), > O_ORDERPRIORITY CHAR(15), > O_CLERK CHAR(15), > O_SHIPPRIORITYINTEGER, > O_COMMENT VARCHAR(79) > > === > LineItem Table: > CREATE TABLE IF NOT EXISTS LINEITEM ( > L_ORDERKEY BIGINT not null, > L_PARTKEY BIGINT, > L_SUPPKEY BIGINT, > L_LINENUMBERINTEGER not null, > L_QUANTITY DECIMAL, > L_EXTENDEDPRICE DECIMAL, > L_DISCOUNT DECIMAL, > L_TAX DECIMAL, > L_SHIPDATE CHAR(10), > L_COMMITDATECHAR(10), > L_RECEIPTDATE CHAR(10), > L_RETURNFLAGCHAR(1), > L_LINESTATUSCHAR(1), > L_SHIPINSTRUCT CHAR(25), > L_SHIPMODE CHAR(10), > L_COMMENT VARCHAR(44), > CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER ) > ) >
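On Question 1, a hedged sketch of pulling one of the Hive tables above into the shell with HiveContext (Spark 1.1 syntax; the lowercase table and column names rely on Hive's case-insensitive handling of the DDL shown earlier):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// a Hive table comes back as a SchemaRDD that can be registered and re-queried
val orders = hiveContext.hql("SELECT o_orderkey, o_orderstatus, o_orderdate FROM orders")
orders.registerTempTable("orders_rdd")
hiveContext.hql("SELECT count(*) FROM orders_rdd WHERE o_orderstatus = 'F'").collect().foreach(println)

On Questions 2 and 3, the usual Hive 0.12-era approach is the one Ilya describes: rewrite EXISTS as a LEFT SEMI JOIN and NOT EXISTS as an outer join plus an IS NULL filter, and build month/year arithmetic out of year(), month() and string concatenation on the CHAR(10) date columns, since that Hive version has no built-in month/year DATEADD.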
Spark Hive Snappy Error
Hi, When trying Spark with Hive table, I got the “java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I” error, val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql(“select count(1) from q8_national_market_share sqlContext.sql("select count(1) from q8_national_market_share").collect().foreach(println) java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) at org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79) at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83) at org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) at org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:68) at org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438) at $iwC$$iwC$$iwC$$iwC.(:15) at $iwC$$iwC$$iwC.(:20) at $iwC$$iwC.(:22) at $iwC.(:24) at (:26) at .(:30) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at
Spark/HIVE Insert Into values Error
Hi, When trying to insert records into HIVE, I got error, My Spark is 1.1.0 and Hive 0.12.0 Any idea what would be wrong? Regards Arthur hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int); OK hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1); NoViableAltException(26@[]) at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693) at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374) at org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083) at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968) at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762) at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238) at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938) at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '(' ''fred flintstone'' in select clause
Re: Spark Hive Snappy Error
hread.java:107) 2014-10-22 20:23:17,038 INFO [sparkDriver-akka.actor.default-dispatcher-14] remote.RemoteActorRefProvider$RemotingTerminator (Slf4jLogger.scala:apply$mcV$sp(74)) - Shutting down remote daemon. 2014-10-22 20:23:17,039 INFO [sparkDriver-akka.actor.default-dispatcher-14] remote.RemoteActorRefProvider$RemotingTerminator (Slf4jLogger.scala:apply$mcV$sp(74)) - Remote daemon shut down; proceeding with flushing remote transports. Regards Arthur On 17 Oct, 2014, at 9:33 am, Shao, Saisai wrote: > Hi Arthur, > > I think this is a known issue in Spark, you can check > (https://issues.apache.org/jira/browse/SPARK-3958). I’m curious about it, can > you always reproduce this issue, Is this issue related to some specific data > sets, would you mind giving me some information about you workload, Spark > configuration, JDK version and OS version? > > Thanks > Jerry > > From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] > Sent: Friday, October 17, 2014 7:13 AM > To: user > Cc: arthur.hk.c...@gmail.com > Subject: Spark Hive Snappy Error > > Hi, > > When trying Spark with Hive table, I got the “java.lang.UnsatisfiedLinkError: > org.xerial.snappy.SnappyNative.maxCompressedLength(I)I” error, > > > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql(“select count(1) from q8_national_market_share > sqlContext.sql("select count(1) from > q8_national_market_share").collect().foreach(println) > java.lang.UnsatisfiedLinkError: > org.xerial.snappy.SnappyNative.maxCompressedLength(I)I > at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) > at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) > at > org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79) > at > org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) > at > org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:68) > at > org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) > at > org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364) > at > org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) > at > 
org.apache.spark.sql.execution.SparkStrategies$HashAggregation$.apply(SparkStrategies.scala:146) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402) > at > org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406) > at > org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406) > at org.apache.spark.sql.SchemaRD
Re: Spark Hive Snappy Error
Hi, FYI, I use snappy-java-1.0.4.1.jar Regards Arthur On 22 Oct, 2014, at 8:59 pm, Shao, Saisai wrote: > Thanks a lot, I will try to reproduce this in my local settings and dig into > the details, thanks for your information. > > > BR > Jerry > > From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] > Sent: Wednesday, October 22, 2014 8:35 PM > To: Shao, Saisai > Cc: arthur.hk.c...@gmail.com; user > Subject: Re: Spark Hive Snappy Error > > Hi, > > Yes, I can always reproduce the issue: > > about you workload, Spark configuration, JDK version and OS version? > > I ran SparkPI 1000 > > java -version > java version "1.7.0_67" > Java(TM) SE Runtime Environment (build 1.7.0_67-b01) > Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode) > > cat /etc/centos-release > CentOS release 6.5 (Final) > > My Spark’s hive-site.xml with following: > > hive.exec.compress.output > true > > > > mapred.output.compression.codec > org.apache.hadoop.io.compress.SnappyCodec > > > > mapred.output.compression.type > BLOCK > > > e.g. > MASTER=spark://m1:7077,m2:7077 ./bin/run-example SparkPi 1000 > 2014-10-22 20:23:17,033 ERROR [sparkDriver-akka.actor.default-dispatcher-18] > actor.ActorSystemImpl (Slf4jLogger.scala:apply$mcV$sp(66)) - Uncaught fatal > error from thread [sparkDriver-akka.actor.default-dispatcher-2] shutting down > ActorSystem [sparkDriver] > java.lang.UnsatisfiedLinkError: > org.xerial.snappy.SnappyNative.maxCompressedLength(I)I > at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method) > at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316) > at > org.xerial.snappy.SnappyOutputStream.(SnappyOutputStream.java:79) > at > org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:68) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:829) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:769) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:753) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1360) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 2014-10-22 
20:23:17,036 INFO [main] scheduler.DAGScheduler > (Logging.scala:logInfo(59)) - Failed to run reduce at SparkPi.scala:35 > Exception in thread "main" org.apache.spark.SparkException: Job cancelled > because SparkContext was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:694) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:693) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:693) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.postStop(DAGScheduler.scala:1399) > at > akka.
ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId
Hi, I just tried sample PI calculation on Spark Cluster, after returning the Pi result, it shows ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(m37,35662) not found ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://m33:7077 --executor-memory 512m --total-executor-cores 40 examples/target/spark-examples_2.10-1.1.0.jar 100 14/10/23 05:09:03 INFO TaskSetManager: Finished task 87.0 in stage 0.0 (TID 87) in 346 ms on m134.emblocsoft.net (99/100) 14/10/23 05:09:03 INFO TaskSetManager: Finished task 98.0 in stage 0.0 (TID 98) in 262 ms on m134.emblocsoft.net (100/100) 14/10/23 05:09:03 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/10/23 05:09:03 INFO DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 2.597 s 14/10/23 05:09:03 INFO SparkContext: Job finished: reduce at SparkPi.scala:35, took 2.725328861 s Pi is roughly 3.1414948 14/10/23 05:09:03 INFO SparkUI: Stopped Spark web UI at http://m33:4040 14/10/23 05:09:03 INFO DAGScheduler: Stopping DAGScheduler 14/10/23 05:09:03 INFO SparkDeploySchedulerBackend: Shutting down all executors 14/10/23 05:09:03 INFO SparkDeploySchedulerBackend: Asking each executor to shut down 14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37852165 14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(m37,35662) 14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(m37,35662) 14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(m37,35662) not found 14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37852165 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) 14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(m36,34230) 14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(m35,50371) 14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(m36,34230) 14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(m34,41562) 14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(m36,34230) not found 14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(m35,50371) 14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(m35,50371) not found 14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(m33,39517) 14/10/23 05:09:04 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(m33,39517) 14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(m34,41562) 14/10/23 05:09:04 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(m33,39517) not found 14/10/23 05:09:04 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(m34,41562) java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295) at org.apache.spark.network.SendingConnection.read(Connection.scala:390) at 
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/10/23 05:09:04 INFO ConnectionManager: Handling connection error on connection to ConnectionManagerId(m34,41562) 14/10/23 05:09:04 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(m34,41562) 14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2e0b5c4a 14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2e0b5c4a java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) 14/10/23 05:09:04 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@653f8844 14/10/23 05:09:04 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@653f8844 java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(Connecti
Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId
Hi, I have managed to resolve it; it was caused by a wrong setting. Please ignore this. Regards Arthur On 23 Oct, 2014, at 5:14 am, arthur.hk.c...@gmail.com wrote: > > 14/10/23 05:09:04 WARN ConnectionManager: All connections not cleaned up >
Spark: Order by Failed, java.lang.NullPointerException
Hi, I got java.lang.NullPointerException. Please help! sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit 10").collect().foreach(println); 2014-10-23 08:20:12,024 INFO [sparkDriver-akka.actor.default-dispatcher-31] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 41 (runJob at basicOperators.scala:136) finished in 0.086 s 2014-10-23 08:20:12,024 INFO [Result resolver thread-1] scheduler.TaskSchedulerImpl (Logging.scala:logInfo(59)) - Removed TaskSet 41.0, whose tasks have all completed, from pool 2014-10-23 08:20:12,024 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Job finished: runJob at basicOperators.scala:136, took 0.090129332 s [9001,6,-4584121,17,1997-01-04,N,O] [9002,1,-2818574,23,1996-02-16,N,O] [9002,2,-2449102,21,1993-12-12,A,F] [9002,3,-5810699,26,1994-04-06,A,F] [9002,4,-489283,18,1994-11-11,R,F] [9002,5,2169683,15,1997-09-14,N,O] [9002,6,2405081,4,1992-08-03,R,F] [9002,7,3835341,40,1998-04-28,N,O] [9003,1,1900071,4,1994-05-05,R,F] [9004,1,-2614665,41,1993-06-13,A,F] If "order by L_LINESTATUS” is added then error: sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS limit 10").collect().foreach(println); 2014-10-23 08:22:08,524 INFO [main] parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: select l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS limit 10 2014-10-23 08:22:08,525 INFO [main] parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed 2014-10-23 08:22:08,526 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 0: get_table : db=boc_12 tbl=lineitem 2014-10-23 08:22:08,526 INFO [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=hd ip=unknown-ip-addr cmd=get_table : db=boc_12 tbl=lineitem java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269) at org.apache.spark.sql.hive.HadoopTableReader.(TableReader.scala:63) at org.apache.spark.sql.hive.execution.HiveTableScan.(HiveTableScan.scala:68) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$TakeOrdered$.apply(SparkStrategies.scala:191) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at 
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438) at $iwC$$iwC$$iwC$$iwC.(:15) at $iwC$$iwC$$iwC.(:20) at $iwC$$iwC.(:22) at $iwC.(:24) at (:26) at .(:30) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.ap
Re: Spark Hive Snappy Error
Hi, Please find the attached file.

lsof -p 16459 (Master)
COMMAND PID   USER   FD  TYPE DEVICE SIZE/OFF  NODE    NAME
java    16459 tester cwd DIR  253,2  4096      6039786 /hadoop/spark-1.1.0_patched
java    16459 tester rtd DIR  253,0  4096      2       /
java    16459 tester txt REG  253,0  12150     2780995 /usr/lib/jvm/jdk1.7.0_67/bin/java
java    16459 tester mem REG  253,0  156928    2228230 /lib64/ld-2.12.so
java    16459 tester mem REG  253,0  1926680   2228250 /lib64/libc-2.12.so
java    16459 tester mem REG  253,0  145896    2228251 /lib64/libpthread-2.12.so
java    16459 tester mem REG  253,0  22536     2228254 /lib64/libdl-2.12.so
java    16459 tester mem REG  253,0  109006    2759278 /usr/lib/jvm/jdk1.7.0_67/lib/amd64/jli/libjli.so
java    16459 tester mem REG  253,0  599384    2228264 /lib64/libm-2.12.so
java    16459 tester mem REG  253,0  47064     2228295 /lib64/librt-2.12.so
java    16459 tester mem REG  253,0  113952    2228328 /lib64/libresolv-2.12.so
java    16459 tester mem REG  253,0  99158576  2388225 /usr/lib/locale/locale-archive
java    16459 tester mem REG  253,0  27424     2228249 /lib64/libnss_dns-2.12.so
java    16459 tester mem REG  253,2  138832345 6555616 /hadoop/spark-1.1.0_patched/assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.4.1.jar
java    16459 tester mem REG  253,0  580624    2893171 /usr/lib/jvm/jdk1.7.0_67/jre/lib/jsse.jar
java    16459 tester mem REG  253,0  114742    2893221 /usr/lib/jvm/jdk1.7.0_67/jre/lib/amd64/libnet.so
java    16459 tester mem REG  253,0  91178     2893222 /usr/lib/jvm/jdk1.7.0_67/jre/lib/amd64/libnio.so
java    16459 tester mem REG  253,2  1769726   6816963 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-rdbms-3.2.1.jar
java    16459 tester mem REG  253,2  337012    6816961 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-api-jdo-3.2.1.jar
java    16459 tester mem REG  253,2  1801810   6816962 /hadoop/spark-1.1.0_patched/lib_managed/jars/datanucleus-core-3.2.2.jar
java    16459 tester mem REG  253,2  25153     7079998 /hadoop/hive-0.12.0-bin/csv-serde-1.1.2-0.11.0-all.jar
java    16459 tester mem REG  253,2  21817     6032989 /hadoop/hbase-0.98.5-hadoop2/lib/gmbal-api-only-3.0.0-b023.jar
java    16459 tester mem REG  253,2  177131    6032940 /hadoop/hbase-0.98.5-hadoop2/lib/jetty-util-6.1.26.jar
java    16459 tester mem REG  253,2  32677     6032915 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-hadoop-compat-0.98.5-hadoop2.jar
java    16459 tester mem REG  253,2  143602    6032959 /hadoop/hbase-0.98.5-hadoop2/lib/commons-digester-1.8.jar
java    16459 tester mem REG  253,2  97738     6032917 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-prefix-tree-0.98.5-hadoop2.jar
java    16459 tester mem REG  253,2  17884     6032949 /hadoop/hbase-0.98.5-hadoop2/lib/jackson-jaxrs-1.8.8.jar
java    16459 tester mem REG  253,2  253086    6032987 /hadoop/hbase-0.98.5-hadoop2/lib/grizzly-http-2.1.2.jar
java    16459 tester mem REG  253,2  73778     6032916 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-hadoop2-compat-0.98.5-hadoop2.jar
java    16459 tester mem REG  253,2  336904    6032993 /hadoop/hbase-0.98.5-hadoop2/lib/grizzly-http-servlet-2.1.2.jar
java    16459 tester mem REG  253,2  927415    6032914 /hadoop/hbase-0.98.5-hadoop2/lib/hbase-client-0.98.5-hadoop2.jar
java    16459 tester mem REG  253,2  125740    6033008 /hadoop/hbase-0.98.5-hadoop2/lib/hadoop-yarn-server-applicationhistoryservice-2.4.1.jar
java    16459 tester mem REG  253,2  15010     6032936 /hadoop/hbase-0.98.5-hadoop2/lib/xmlenc-0.52.jar
java    16459 tester mem REG  253,2  60686     6032926 /hadoop/hbase-0.98.5-hadoop2/lib/commons-logging-1.1.1.jar
java    16459 tester mem REG  253,2  259600    6032927 /hadoop/hbase-0.98.5-hadoop2/lib/commons-codec-1.7.jar
java    16459 tester mem REG  253,2  321806    6032957 /hadoop/hbase-0.98.5-hadoop2/lib/jets3t-0.6.1.jar
java    16459 tester mem REG  253,2  85353     6032982 /hadoop/hbase-0.98.5-hadoop2/lib/javax.servlet-api-3.0.1.jar
java    16459 tester mem REG
Re: Spark Hive Snappy Error
Hi May I know where to configure Spark to load libhadoop.so? Regards Arthur On 23 Oct, 2014, at 11:31 am, arthur.hk.c...@gmail.com wrote: > Hi, > > Please find the attached file. > > > > > my spark-default.xml > # Default system properties included when running spark-submit. > # This is useful for setting default environmental settings. > # > # Example: > # spark.masterspark://master:7077 > # spark.eventLog.enabled true > # spark.eventLog.dirhdfs://namenode:8021/directory > # spark.serializerorg.apache.spark.serializer.KryoSerializer > # > spark.executor.memory 2048m > spark.shuffle.spill.compressfalse > spark.io.compression.codecorg.apache.spark.io.SnappyCompressionCodec > > > > my spark-env.sh > #!/usr/bin/env bash > export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar" > export > CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar" > export JAVA_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64" > export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"} > export SPARK_WORKER_DIR="/edh/hadoop_data/spark_work/" > export SPARK_LOG_DIR="/edh/hadoop_logs/spark" > export SPARK_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64" > export > SPARK_CLASSPATH="$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar" > export > SPARK_CLASSPATH="$SPARK_CLASSPATH:$HBASE_HOME/lib/*:$HIVE_HOME/csv-serde-1.1.2-0.11.0-all.jar:" > export SPARK_WORKER_MEMORY=2g > export HADOOP_HEAPSIZE=2000 > export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER > -Dspark.deploy.zookeeper.url=m35:2181,m33:2181,m37:2181" > export SPARK_JAVA_OPTS=" -XX:+UseConcMarkSweepGC" > > > ll $HADOOP_HOME/lib/native/Linux-amd64-64 > -rw-rw-r--. 1 tester tester50523 Aug 27 14:12 hadoop-auth-2.4.1.jar > -rw-rw-r--. 1 tester tester 1062640 Aug 27 12:19 libhadoop.a > -rw-rw-r--. 1 tester tester 1487564 Aug 27 11:14 libhadooppipes.a > lrwxrwxrwx. 1 tester tester 24 Aug 27 07:08 libhadoopsnappy.so -> > libhadoopsnappy.so.0.0.1 > lrwxrwxrwx. 1 tester tester 24 Aug 27 07:08 libhadoopsnappy.so.0 -> > libhadoopsnappy.so.0.0.1 > -rwxr-xr-x. 1 tester tester54961 Aug 27 07:08 libhadoopsnappy.so.0.0.1 > -rwxrwxr-x. 1 tester tester 630328 Aug 27 12:19 libhadoop.so > -rwxrwxr-x. 1 tester tester 630328 Aug 27 12:19 libhadoop.so.1.0.0 > -rw-rw-r--. 1 tester tester 582472 Aug 27 11:14 libhadooputils.a > -rw-rw-r--. 1 tester tester 298626 Aug 27 11:14 libhdfs.a > -rwxrwxr-x. 1 tester tester 200370 Aug 27 11:14 libhdfs.so > -rwxrwxr-x. 1 tester tester 200370 Aug 27 11:14 libhdfs.so.0.0.0 > lrwxrwxrwx. 1 tester tester 55 Aug 27 07:08 libjvm.so -> > /usr/lib/jvm/jdk1.6.0_45/jre/lib/amd64/server/libjvm.so > lrwxrwxrwx. 1 tester tester 25 Aug 27 07:08 libprotobuf-lite.so -> > libprotobuf-lite.so.8.0.0 > lrwxrwxrwx. 1 tester tester 25 Aug 27 07:08 libprotobuf-lite.so.8 > -> libprotobuf-lite.so.8.0.0 > -rwxr-xr-x. 1 tester tester 964689 Aug 27 07:08 > libprotobuf-lite.so.8.0.0 > lrwxrwxrwx. 1 tester tester 20 Aug 27 07:08 libprotobuf.so -> > libprotobuf.so.8.0.0 > lrwxrwxrwx. 1 tester tester 20 Aug 27 07:08 libprotobuf.so.8 -> > libprotobuf.so.8.0.0 > -rwxr-xr-x. 1 tester tester 8300050 Aug 27 07:08 libprotobuf.so.8.0.0 > lrwxrwxrwx. 1 tester tester 18 Aug 27 07:08 libprotoc.so -> > libprotoc.so.8.0.0 > lrwxrwxrwx. 1 tester tester 18 Aug 27 07:08 libprotoc.so.8 -> > libprotoc.so.8.0.0 > -rwxr-xr-x. 1 tester tester 9935810 Aug 27 07:08 libprotoc.so.8.0.0 > -rw-r--r--. 1 tester tester 233554 Aug 27 15:19 libsnappy.a > lrwxrwxrwx. 
1 tester tester 23 Aug 27 11:32 libsnappy.so -> > /usr/lib64/libsnappy.so > lrwxrwxrwx. 1 tester tester 23 Aug 27 11:33 libsnappy.so.1 -> > /usr/lib64/libsnappy.so > -rwxr-xr-x. 1 tester tester 147726 Aug 27 07:08 libsnappy.so.1.2.0 > drwxr-xr-x. 2 tester tester 4096 Aug 27 07:08 pkgconfig > > > Regards > Arthur > > > On 23 Oct, 2014, at 10:57 am, Shao, Saisai wrote: > >> Hi Arthur, >> >> I think your problem might be different from what >> SPARK-3958(https://issues.apache.org/jira/browse/SPARK-3958) mentioned, >> seems your problem is more likely to be a library link problem, would you >> mind checking your Spark runtime to see if the snappy.so is loaded or not? >> (through lsof -p). >> >> I guess your problem is more likely to be a library not found problem. >> >> >> Thanks >> Jerry >> >>
Re: Spark Hive Snappy Error
HI Removed export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar” It works, THANK YOU!! Regards Arthur On 23 Oct, 2014, at 1:00 pm, Shao, Saisai wrote: > Seems you just add snappy library into your classpath: > > export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar" > > But for spark itself, it depends on snappy-0.2.jar. Is there any possibility > that this problem caused by different version of snappy? > > Thanks > Jerry > > From: arthur.hk.c...@gmail.com [mailto:arthur.hk.c...@gmail.com] > Sent: Thursday, October 23, 2014 11:32 AM > To: Shao, Saisai > Cc: arthur.hk.c...@gmail.com; user > Subject: Re: Spark Hive Snappy Error > > Hi, > > Please find the attached file. > > > > my spark-default.xml > # Default system properties included when running spark-submit. > # This is useful for setting default environmental settings. > # > # Example: > # spark.masterspark://master:7077 > # spark.eventLog.enabled true > # spark.eventLog.dir > hdfs://namenode:8021/directory > # spark.serializerorg.apache.spark.serializer.KryoSerializer > # > spark.executor.memory 2048m > spark.shuffle.spill.compressfalse > spark.io.compression.codec > org.apache.spark.io.SnappyCompressionCodec > > > > my spark-env.sh > #!/usr/bin/env bash > export CLASSPATH="$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar" > export > CLASSPATH="$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar" > export JAVA_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64" > export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"} > export SPARK_WORKER_DIR="/edh/hadoop_data/spark_work/" > export SPARK_LOG_DIR="/edh/hadoop_logs/spark" > export SPARK_LIBRARY_PATH="$HADOOP_HOME/lib/native/Linux-amd64-64" > export > SPARK_CLASSPATH="$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar" > export > SPARK_CLASSPATH="$SPARK_CLASSPATH:$HBASE_HOME/lib/*:$HIVE_HOME/csv-serde-1.1.2-0.11.0-all.jar:" > export SPARK_WORKER_MEMORY=2g > export HADOOP_HEAPSIZE=2000 > export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER > -Dspark.deploy.zookeeper.url=m35:2181,m33:2181,m37:2181" > export SPARK_JAVA_OPTS=" -XX:+UseConcMarkSweepGC" > > > ll $HADOOP_HOME/lib/native/Linux-amd64-64 > -rw-rw-r--. 1 tester tester50523 Aug 27 14:12 hadoop-auth-2.4.1.jar > -rw-rw-r--. 1 tester tester 1062640 Aug 27 12:19 libhadoop.a > -rw-rw-r--. 1 tester tester 1487564 Aug 27 11:14 libhadooppipes.a > lrwxrwxrwx. 1 tester tester 24 Aug 27 07:08 libhadoopsnappy.so -> > libhadoopsnappy.so.0.0.1 > lrwxrwxrwx. 1 tester tester 24 Aug 27 07:08 libhadoopsnappy.so.0 -> > libhadoopsnappy.so.0.0.1 > -rwxr-xr-x. 1 tester tester54961 Aug 27 07:08 libhadoopsnappy.so.0.0.1 > -rwxrwxr-x. 1 tester tester 630328 Aug 27 12:19 libhadoop.so > -rwxrwxr-x. 1 tester tester 630328 Aug 27 12:19 libhadoop.so.1.0.0 > -rw-rw-r--. 1 tester tester 582472 Aug 27 11:14 libhadooputils.a > -rw-rw-r--. 1 tester tester 298626 Aug 27 11:14 libhdfs.a > -rwxrwxr-x. 1 tester tester 200370 Aug 27 11:14 libhdfs.so > -rwxrwxr-x. 1 tester tester 200370 Aug 27 11:14 libhdfs.so.0.0.0 > lrwxrwxrwx. 1 tester tester 55 Aug 27 07:08 libjvm.so > ->/usr/lib/jvm/jdk1.6.0_45/jre/lib/amd64/server/libjvm.so > lrwxrwxrwx. 1 tester tester 25 Aug 27 07:08 libprotobuf-lite.so -> > libprotobuf-lite.so.8.0.0 > lrwxrwxrwx. 1 tester tester 25 Aug 27 07:08 libprotobuf-lite.so.8 > -> libprotobuf-lite.so.8.0.0 > -rwxr-xr-x. 1 tester tester 964689 Aug 27 07:08 > libprotobuf-lite.so.8.0.0 > lrwxrwxrwx. 1 tester tester 20 Aug 27 07:08 libprotobuf.so -> > libprotobuf.so.8.0.0 > lrwxrwxrwx. 
1 tester tester 20 Aug 27 07:08 libprotobuf.so.8 -> > libprotobuf.so.8.0.0 > -rwxr-xr-x. 1 tester tester 8300050 Aug 27 07:08 libprotobuf.so.8.0.0 > lrwxrwxrwx. 1 tester tester 18 Aug 27 07:08 libprotoc.so -> > libprotoc.so.8.0.0 > lrwxrwxrwx. 1 tester tester 18 Aug 27 07:08 libprotoc.so.8 -> > libprotoc.so.8.0.0 > -rwxr-xr-x. 1 tester tester 9935810 Aug 27 07:08 libprotoc.so.8.0.0 > -rw-r--r--. 1 tester tester 233554 Aug 27 15:19 libsnappy.a > lrwxrwxrwx. 1 tester tester 23 Aug 27 11:32 libsnappy.so -> > /usr/lib64/libsnappy.so > lrwxrwxrwx. 1 tester tester 23 Aug 27 11:33 libsnappy.so.1 -> > /usr/lib64/libsnappy.so > -rwxr-xr-x. 1 tester tester 147726 Aug 27 07:08 libsnappy.so.1.2.0
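For anyone hitting the same UnsatisfiedLinkError, a small sanity check that can be run in spark-shell before and after changing the classpath. It is not a fix; it only shows which jar the Snappy class is being loaded from and forces the native library to initialise, so a classpath or version clash surfaces immediately.

  // which jar provides org.xerial.snappy.Snappy on this JVM?
  val src = classOf[org.xerial.snappy.Snappy].getProtectionDomain.getCodeSource
  println(src.getLocation)   // expected: the Spark assembly (snappy-java), not hadoop-snappy

  // trigger the native load; throws UnsatisfiedLinkError if the wrong library wins
  println(org.xerial.snappy.Snappy.maxCompressedLength(1024))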
Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Hi, I got $TreeNodeException, few questions: Q1) How should I do aggregation in SparK? Can I use aggregation directly in SQL? or Q1) Should I use SQL to load the data to form RDD then use scala to do the aggregation? Regards Arthur MySQL (good one, without aggregation): sqlContext.sql("SELECT L_RETURNFLAG FROM LINEITEM WHERE L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS").collect().foreach(println); [A] [N] [N] [R] My SQL (problem SQL, with aggregation): sqlContext.sql("SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS SUM_QTY, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT )) AS SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT ) * ( 1 + L_TAX )) AS SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) AS AVG_PRICE, AVG(L_DISCOUNT) AS AVG_DISC, COUNT(*) AS COUNT_ORDER FROM LINEITEM WHERE L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS").collect().foreach(println); 14/10/23 20:38:31 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks have all completed, from pool org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree: Sort [l_returnflag#200 ASC,l_linestatus#201 ASC], true Exchange (RangePartitioning [l_returnflag#200 ASC,l_linestatus#201 ASC], 200) Aggregate false, [l_returnflag#200,l_linestatus#201], [l_returnflag#200,l_linestatus#201,SUM(PartialSum#216) AS sum_qty#181,SUM(PartialSum#217) AS sum_base_price#182,SUM(PartialSum#218) AS sum_disc_price#183,SUM(PartialSum#219) AS sum_charge#184,(CAST(SUM(PartialSum#220), DoubleType) / CAST(SUM(PartialCount#221L), DoubleType)) AS avg_qty#185,(CAST(SUM(PartialSum#222), DoubleType) / CAST(SUM(PartialCount#223L), DoubleType)) AS avg_price#186,(CAST(SUM(PartialSum#224), DoubleType) / CAST(SUM(PartialCount#225L), DoubleType)) AS avg_disc#187,SUM(PartialCount#226L) AS count_order#188L] Exchange (HashPartitioning [l_returnflag#200,l_linestatus#201], 200) Aggregate true, [l_returnflag#200,l_linestatus#201], [l_returnflag#200,l_linestatus#201,COUNT(l_discount#195) AS PartialCount#225L,SUM(l_discount#195) AS PartialSum#224,COUNT(1) AS PartialCount#226L,SUM((l_extendedprice#194 * (1.0 - l_discount#195))) AS PartialSum#218,SUM(l_extendedprice#194) AS PartialSum#217,COUNT(l_quantity#193) AS PartialCount#221L,SUM(l_quantity#193) AS PartialSum#220,COUNT(l_extendedprice#194) AS PartialCount#223L,SUM(l_extendedprice#194) AS PartialSum#222,SUM(((l_extendedprice#194 * (1.0 - l_discount#195)) * (1.0 + l_tax#196))) AS PartialSum#219,SUM(l_quantity#193) AS PartialSum#216] Project [l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193] Filter (l_shipdate#197 <= 1998-09-02) HiveTableScan [l_shipdate#197,l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193], (MetastoreRelation boc_12, lineitem, None), None at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:191) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438) at $iwC$$iwC$$iwC$$iwC.(:15) at $iwC$$iwC$$iwC.(:20) at $iwC$$iwC.(:22) at $iwC.(:24) at (:26) at .(:30) at .() at .(:7) at .() at $print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$an
Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
HI, My step to create LINEITEM: $HADOOP_HOME/bin/hadoop fs -mkdir /tpch/lineitem $HADOOP_HOME/bin/hadoop fs -copyFromLocal lineitem.tbl /tpch/lineitem/ Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/lineitem’; Regards Arthur On 23 Oct, 2014, at 9:36 pm, Yin Huai wrote: > Hello Arthur, > > You can use do aggregations in SQL. How did you create LINEITEM? > > Thanks, > > Yin > > On Thu, Oct 23, 2014 at 8:54 AM, arthur.hk.c...@gmail.com > wrote: > Hi, > > I got $TreeNodeException, few questions: > Q1) How should I do aggregation in SparK? Can I use aggregation directly in > SQL? or > Q1) Should I use SQL to load the data to form RDD then use scala to do the > aggregation? > > Regards > Arthur > > > MySQL (good one, without aggregation): > sqlContext.sql("SELECT L_RETURNFLAG FROM LINEITEM WHERE > L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY > L_RETURNFLAG, L_LINESTATUS").collect().foreach(println); > [A] > [N] > [N] > [R] > > > My SQL (problem SQL, with aggregation): > sqlContext.sql("SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS > SUM_QTY, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - > L_DISCOUNT )) AS SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE * ( 1 - L_DISCOUNT ) * ( > 1 + L_TAX )) AS SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) > AS AVG_PRICE, AVG(L_DISCOUNT) AS AVG_DISC, COUNT(*) AS COUNT_ORDER FROM > LINEITEM WHERE L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, > L_LINESTATUS ORDER BY L_RETURNFLAG, > L_LINESTATUS").collect().foreach(println); > > 14/10/23 20:38:31 INFO TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks > have all completed, from pool > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree: > Sort [l_returnflag#200 ASC,l_linestatus#201 ASC], true > Exchange (RangePartitioning [l_returnflag#200 ASC,l_linestatus#201 ASC], 200) > Aggregate false, [l_returnflag#200,l_linestatus#201], > [l_returnflag#200,l_linestatus#201,SUM(PartialSum#216) AS > sum_qty#181,SUM(PartialSum#217) AS sum_base_price#182,SUM(PartialSum#218) AS > sum_disc_price#183,SUM(PartialSum#219) AS > sum_charge#184,(CAST(SUM(PartialSum#220), DoubleType) / > CAST(SUM(PartialCount#221L), DoubleType)) AS > avg_qty#185,(CAST(SUM(PartialSum#222), DoubleType) / > CAST(SUM(PartialCount#223L), DoubleType)) AS > avg_price#186,(CAST(SUM(PartialSum#224), DoubleType) / > CAST(SUM(PartialCount#225L), DoubleType)) AS > avg_disc#187,SUM(PartialCount#226L) AS count_order#188L] >Exchange (HashPartitioning [l_returnflag#200,l_linestatus#201], 200) > Aggregate true, [l_returnflag#200,l_linestatus#201], > [l_returnflag#200,l_linestatus#201,COUNT(l_discount#195) AS > PartialCount#225L,SUM(l_discount#195) AS PartialSum#224,COUNT(1) AS > PartialCount#226L,SUM((l_extendedprice#194 * (1.0 - l_discount#195))) AS > PartialSum#218,SUM(l_extendedprice#194) AS > PartialSum#217,COUNT(l_quantity#193) AS PartialCount#221L,SUM(l_quantity#193) > AS PartialSum#220,COUNT(l_extendedprice#194) AS > PartialCount#223L,SUM(l_extendedprice#194) AS > PartialSum#222,SUM(((l_extendedprice#194 * (1.0 - l_discount#195)) * (1.0 + > l_tax#196))) AS PartialSum#219,SUM(l_quantity#193) AS PartialSum#216] > Project 
> [l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193] > Filter (l_shipdate#197 <= 1998-09-02) >HiveTableScan > [l_shipdate#197,l_extendedprice#194,l_linestatus#201,l_discount#195,l_returnflag#200,l_tax#196,l_quantity#193], > (MetastoreRelation boc_12, lineitem, None), None > > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) > at org.apache.spark.sql.execution.Sort.execute(basicOperators.scala:191) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85) > at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438) > at $iwC$$iwC$$iwC$$iwC.(:15) > at $iwC$$iwC$$iwC.(:20) > at $iwC$$iwC.(:22) > at $iwC.(:24) > at (:26) > at .(:30) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMetho
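On question 2 in the original mail (load with SQL, then aggregate in Scala), a hedged sketch of that route, assuming the DOUBLE/STRING column types from the DDL above and no NULLs in the selected columns. It computes a few of the same measures with reduceByKey instead of SQL aggregates; the field positions simply follow the SELECT list.

  import org.apache.spark.SparkContext._   // PairRDDFunctions (reduceByKey, sortByKey)

  val rows = sqlContext.sql(
    "SELECT l_returnflag, l_linestatus, l_quantity, l_extendedprice, l_discount " +
    "FROM lineitem WHERE l_shipdate <= '1998-09-02'")

  val perGroup = rows
    .map { r =>
      val qty   = r(2).asInstanceOf[Double]
      val price = r(3).asInstanceOf[Double]
      val disc  = r(4).asInstanceOf[Double]
      ((r(0).asInstanceOf[String], r(1).asInstanceOf[String]),   // (returnflag, linestatus)
       (qty, price, price * (1 - disc), 1L))                     // (qty, base price, disc price, count)
    }
    .reduceByKey { case ((q1, p1, d1, c1), (q2, p2, d2, c2)) =>
      (q1 + q2, p1 + p2, d1 + d2, c1 + c2)
    }

  perGroup.sortByKey().collect().foreach { case ((rf, ls), (sumQty, sumPrice, sumDisc, cnt)) =>
    println(s"$rf $ls sum_qty=$sumQty sum_base_price=$sumPrice sum_disc_price=$sumDisc count=$cnt")
  }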
Spark 1.1.0 and Hive 0.12.0 Compatibility Issue
(Please ignore if duplicated) Hi, My Spark is 1.1.0 and Hive is 0.12, I tried to run the same query in both Hive-0.12.0 then Spark-1.1.0, HiveQL works while SparkSQL failed. hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on l.l_orderkey = o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10; Ended Job = job_1414067367860_0011 MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.0 sec HDFS Read: 261 HDFS Write: 96 SUCCESS Job 1: Map: 1 Reduce: 1 Cumulative CPU: 0.88 sec HDFS Read: 458 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 2 seconds 880 msec OK Time taken: 38.771 seconds scala> sqlContext.sql("""select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment = 'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on l.l_orderkey = o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, o_orderdate limit 10""").collect().foreach(println); org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 5.0 failed 4 times, most recent failure: Lost task 14.3 in stage 5.0 (TID 568, m34): java.lang.ClassCastException: java.lang.String cannot be cast to scala.math.BigDecimal scala.math.Numeric$BigDecimalIsFractional$.minus(Numeric.scala:182) org.apache.spark.sql.catalyst.expressions.Subtract$$anonfun$eval$3.apply(arithmetic.scala:64) org.apache.spark.sql.catalyst.expressions.Subtract$$anonfun$eval$3.apply(arithmetic.scala:64) org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:114) org.apache.spark.sql.catalyst.expressions.Subtract.eval(arithmetic.scala:64) org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108) org.apache.spark.sql.catalyst.expressions.Multiply.eval(arithmetic.scala:70) org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:47) org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108) org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:58) org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:69) org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:433) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at
Re: Spark 1.1.0 and Hive 0.12.0 Compatibility Issue
.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)14/10/25 06:50:15 WARN TaskSetManager: Lost task 7.3 in stage 5.0 (TID 575, m137.emblocsoft.net): TaskKilled (killed intentionally) 14/10/25 06:50:15 WARN TaskSetManager: Lost task 24.2 in stage 5.0 (TID 560, m137.emblocsoft.net): TaskKilled (killed intentionally) 14/10/25 06:50:15 WARN TaskSetManager: Lost task 22.2 in stage 5.0 (TID 561, m137.emblocsoft.net): TaskKilled (killed intentionally) 14/10/25 06:50:15 WARN TaskSetManager: Lost task 20.2 in stage 5.0 (TID 564, m137.emblocsoft.net): TaskKilled (killed intentionally) 14/10/25 06:50:15 WARN TaskSetManager: Lost task 13.2 in stage 5.0 (TID 562, m137.emblocsoft.net): TaskKilled (killed intentionally) 14/10/25 06:50:15 WARN TaskSetManager: Lost task 27.2 in stage 5.0 (TID 565, m137.emblocsoft.net): TaskKilled (killed intentionally) 14/10/25 06:50:15 WARN TaskSetManager: Lost task 34.2 in stage 5.0 (TID 568, m137.emblocsoft.net): TaskKilled (killed intentionally) 14/10/25 06:50:15 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool Regards Arthur On 24 Oct, 2014, at 6:56 am, Michael Armbrust wrote: > Can you show the DDL for the table? It looks like the SerDe might be saying > it will produce a decimal type but is actually producing a string. 
> > On Thu, Oct 23, 2014 at 3:17 PM, arthur.hk.c...@gmail.com > wrote: > Hi > > My Spark is 1.1.0 and Hive is 0.12, I tried to run the same query in both > Hive-0.12.0 then Spark-1.1.0, HiveQL works while SparkSQL failed. > > > hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, > o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment = > 'BUILDING' and c.c_custkey = o.o_custkey join lineitem l on l.l_orderkey = > o.o_orderkey where o_orderdate < '1995-03-15' and l_shipdate > '1995-03-15' > group by l_orderkey, o_orderdate, o_shippriority order by revenue desc, > o_orderdate limit 10; > Ended Job = job_1414067367860_0011 > MapReduce Jobs Launched: > Job 0: Map: 1 Reduce: 1 Cumulative CPU: 2.0 sec HDFS Read: 261 HDFS > Write: 96 SUCCESS > Job 1: Map: 1 Reduce: 1 Cumulative CPU: 0.88 sec HDFS Read: 458 HDFS > Write: 0 SUCCESS > Total MapReduce CPU Time Spent: 2 seconds 880 msec > OK > Time taken: 38.771 seconds > > > scala> sqlContext.sql("""select l_orderkey, > sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority >
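Following Michael's diagnosis, a hedged workaround while the table definition / SerDe is being sorted out: if the SerDe is handing l_extendedprice and l_discount back as strings, casting them explicitly keeps the arithmetic in DOUBLE and avoids the String-to-BigDecimal cast error. Fixing the DDL (or the SerDe) so the columns really are numeric remains the cleaner long-term fix.

  sqlContext.sql("""
    SELECT l_orderkey,
           SUM(CAST(l_extendedprice AS DOUBLE) * (1 - CAST(l_discount AS DOUBLE))) AS revenue,
           o_orderdate, o_shippriority
    FROM customer c
    JOIN orders o   ON c.c_mktsegment = 'BUILDING' AND c.c_custkey = o.o_custkey
    JOIN lineitem l ON l.l_orderkey = o.o_orderkey
    WHERE o_orderdate < '1995-03-15' AND l_shipdate > '1995-03-15'
    GROUP BY l_orderkey, o_orderdate, o_shippriority
    ORDER BY revenue DESC, o_orderdate
    LIMIT 10""").collect().foreach(println)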
Re: Spark: Order by Failed, java.lang.NullPointerException
Hi,

Added “l_linestatus” and it works, THANK YOU!!

sqlContext.sql("select l_linestatus, l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS limit 10").collect().foreach(println);

14/10/25 07:03:24 INFO DAGScheduler: Stage 12 (takeOrdered at basicOperators.scala:171) finished in 54.358 s
14/10/25 07:03:24 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
14/10/25 07:03:24 INFO SparkContext: Job finished: takeOrdered at basicOperators.scala:171, took 54.374629175 s
[F,71769288,2,-5859884,13,1993-12-13,R,F]
[F,71769319,4,-4098165,19,1992-10-12,R,F]
[F,71769288,3,2903707,44,1994-10-08,R,F]
[F,71769285,2,-741439,42,1994-04-22,R,F]
[F,71769313,5,-1276467,12,1992-08-15,R,F]
[F,71769314,7,-5595080,13,1992-03-28,A,F]
[F,71769316,1,-1766622,16,1993-12-05,R,F]
[F,71769287,2,-767340,50,1993-06-21,A,F]
[F,71769317,2,665847,15,1992-05-03,A,F]
[F,71769286,1,-5667701,15,1994-04-17,A,F]

Regards
Arthur

On 24 Oct, 2014, at 2:58 pm, Akhil Das wrote:
> Not sure if this would help, but make sure you have the column l_linestatus in the data.
>
> Thanks
> Best Regards
>
> On Thu, Oct 23, 2014 at 5:59 AM, arthur.hk.c...@gmail.com wrote:
> Hi,
>
> I got java.lang.NullPointerException. Please help!
>
> sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem limit 10").collect().foreach(println);
>
> 2014-10-23 08:20:12,024 INFO [sparkDriver-akka.actor.default-dispatcher-31] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Stage 41 (runJob at basicOperators.scala:136) finished in 0.086 s
> 2014-10-23 08:20:12,024 INFO [Result resolver thread-1] scheduler.TaskSchedulerImpl (Logging.scala:logInfo(59)) - Removed TaskSet 41.0, whose tasks have all completed, from pool
> 2014-10-23 08:20:12,024 INFO [main] spark.SparkContext (Logging.scala:logInfo(59)) - Job finished: runJob at basicOperators.scala:136, took 0.090129332 s
> [9001,6,-4584121,17,1997-01-04,N,O]
> [9002,1,-2818574,23,1996-02-16,N,O]
> [9002,2,-2449102,21,1993-12-12,A,F]
> [9002,3,-5810699,26,1994-04-06,A,F]
> [9002,4,-489283,18,1994-11-11,R,F]
> [9002,5,2169683,15,1997-09-14,N,O]
> [9002,6,2405081,4,1992-08-03,R,F]
> [9002,7,3835341,40,1998-04-28,N,O]
> [9003,1,1900071,4,1994-05-05,R,F]
> [9004,1,-2614665,41,1993-06-13,A,F]
>
> If "order by L_LINESTATUS" is added, then the error occurs:
> sqlContext.sql("select l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS limit 10").collect().foreach(println);
>
> 2014-10-23 08:22:08,524 INFO [main] parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: select l_orderkey, l_linenumber, l_partkey, l_quantity, l_shipdate, L_RETURNFLAG, L_LINESTATUS from lineitem order by L_LINESTATUS limit 10
> 2014-10-23 08:22:08,525 INFO [main] parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed
> 2014-10-23 08:22:08,526 INFO [main] metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 0: get_table : db=boc_12 tbl=lineitem
> 2014-10-23 08:22:08,526 INFO [main] HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=hd ip=unknown-ip-addr cmd=get_table : db=boc_12 tbl=lineitem
> java.lang.NullPointerException
> at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1262)
> at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1269)
> at org.apache.spark.sql.hive.HadoopTableReader.<init>(TableReader.scala:63)
> at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:68)
> at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
> at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
> at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
> at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at ...
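For anyone hitting the same NullPointerException: a quick way to rule out a missing column is to describe the table first, then run the ordered query with the column included in the select list, which is what fixed it here. A minimal sketch, assuming an existing SparkContext `sc` and a `lineitem` table in the Hive metastore as in the thread (the explicit HiveContext setup is illustrative; the thread's `sqlContext` would be used the same way):

import org.apache.spark.sql.hive.HiveContext

// Sketch only: `sc` is assumed to be an existing SparkContext, and `lineitem`
// is assumed to be a table registered in the Hive metastore.
val hiveContext = new HiveContext(sc)

// DESCRIBE goes through the Hive parser, so it lists the exact column names
// known to the metastore; useful for confirming l_linestatus is really there.
hiveContext.sql("DESCRIBE lineitem").collect().foreach(println)

// With the column confirmed, the ordered query from the thread runs as expected.
hiveContext.sql(
  """SELECT l_linestatus, l_orderkey, l_linenumber, l_partkey, l_quantity,
    |       l_shipdate, l_returnflag, l_linestatus
    |FROM lineitem
    |ORDER BY l_linestatus
    |LIMIT 10""".stripMargin).collect().foreach(println)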
Re: Spark/HIVE Insert Into values Error
Hi,

I have already found a way to do “INSERT INTO HIVE_TABLE VALUES (…)”.

Regards
Arthur

On 18 Oct, 2014, at 10:09 pm, Cheng Lian wrote:
> Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT INTO ... VALUES ... syntax.
>
> On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote:
>> Hi,
>>
>> When trying to insert records into Hive, I got an error.
>>
>> My Spark is 1.1.0 and Hive is 0.12.0.
>>
>> Any idea what could be wrong?
>>
>> Regards
>> Arthur
>>
>> hive> CREATE TABLE students (name VARCHAR(64), age INT, gpa int);
>> OK
>>
>> hive> INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1);
>> NoViableAltException(26@[])
>> at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:693)
>> at org.apache.hadoop.hive.ql.parse.HiveParser.selectClause(HiveParser.java:31374)
>> at org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:29083)
>> at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28968)
>> at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28762)
>> at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1238)
>> at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:938)
>> at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:190)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:424)
>> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
>> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
>> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
>> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
>> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
>> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
>> at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
>> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
>> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:614)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>> FAILED: ParseException line 1:27 cannot recognize input near 'VALUES' '(' ''fred flintstone'' in select clause
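The thread does not say which approach Arthur ended up using. One common workaround on Hive 0.12, sketched below under the assumption of a Spark 1.1 HiveContext and the `students` table from the thread, is to stage the new rows in a temporary table and copy them with INSERT INTO ... SELECT, which Hive 0.12 does accept; the case class and the staging-table name are illustrative, not from the thread.

import org.apache.spark.sql.hive.HiveContext

// Hive 0.12 has no INSERT ... VALUES, but it does support INSERT INTO ... SELECT,
// so new rows can be staged in a temporary table and copied across.
case class Student(name: String, age: Int, gpa: Int)   // illustrative record type

val hiveContext = new HiveContext(sc)        // assumes an existing SparkContext `sc`
import hiveContext.createSchemaRDD           // implicit RDD -> SchemaRDD conversion (Spark 1.1)

val newRows = sc.parallelize(Seq(Student("fred flintstone", 35, 1)))
newRows.registerTempTable("students_staging")   // hypothetical staging-table name

// Copy the staged rows into the Hive-managed table created in the thread.
hiveContext.sql("INSERT INTO TABLE students SELECT name, age, gpa FROM students_staging")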
Spark 1.1.0 on Hive 0.13.1
Hi,

My Hive is 0.13.1. How can I make Spark 1.1.0 run on Hive 0.13? Please advise.

Or, is there any news on when Spark 1.1.0 on Hive 0.13.1 will be available?

Regards
Arthur
Re: Spark 1.1.0 on Hive 0.13.1
Hi,

Thanks for your update. Any idea when Spark 1.2 will be GA?

Regards
Arthur

On 29 Oct, 2014, at 8:22 pm, Cheng Lian wrote:
> Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and related PRs are already merged or being merged to the master branch.
>
> On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote:
>> Hi,
>>
>> My Hive is 0.13.1. How can I make Spark 1.1.0 run on Hive 0.13? Please advise.
>>
>> Or, is there any news on when Spark 1.1.0 on Hive 0.13.1 will be available?
>>
>> Regards
>> Arthur
Re: OOM with groupBy + saveAsTextFile
Hi,

FYI, see the excerpt from the JVM troubleshooting documentation below. Could you also post your heap size settings as well as your Spark app code?

Regards
Arthur

“3.1.3 Detail Message: Requested array size exceeds VM limit
The detail message Requested array size exceeds VM limit indicates that the application (or APIs used by that application) attempted to allocate an array that is larger than the heap size. For example, if an application attempts to allocate an array of 512MB but the maximum heap size is 256MB then OutOfMemoryError will be thrown with the reason Requested array size exceeds VM limit. In most cases the problem is either a configuration issue (heap size too small), or a bug that results in an application attempting to create a huge array, for example, when the number of elements in the array is computed using an algorithm that computes an incorrect size.”

On 2 Nov, 2014, at 12:25 pm, Bharath Ravi Kumar wrote:
> Resurfacing the thread. OOM shouldn't be the norm for a common groupBy / sort use case in a framework that is leading in sorting benchmarks? Or is there something fundamentally wrong in the usage?
>
> On 02-Nov-2014 1:06 am, "Bharath Ravi Kumar" wrote:
> Hi,
>
> I'm trying to run groupBy(function) followed by saveAsTextFile on an RDD of count ~ 100 million. The data size is 20GB, and groupBy results in an RDD of 1061 keys with values being Iterable<... String>>. The job runs on 3 hosts in a standalone setup, with each host's executor having 100G RAM and 24 cores dedicated to it. While the groupBy stage completes successfully with ~24GB of shuffle write, the saveAsTextFile fails after repeated retries, with each attempt failing due to an out of memory error [1]. I understand that a few partitions may be overloaded as a result of the groupBy, and I've tried the following config combinations unsuccessfully:
>
> 1) Repartition the initial RDD (44 input partitions but 1061 keys) across 1061 partitions and have max cores = 3 so that each key is a "logical" partition (though many partitions will end up on very few hosts), and each host likely runs saveAsTextFile on a single key at a time due to max cores = 3 with 3 hosts in the cluster. The level of parallelism is unspecified.
>
> 2) Leave max cores unspecified, set the level of parallelism to 72, and leave the number of partitions unspecified (in which case the number of input partitions was used, which is 44).
>
> Since I do not intend to cache RDDs, I have set spark.storage.memoryFraction=0.2 in both cases.
>
> My understanding is that if each host is processing a single logical partition to saveAsTextFile and is reading from other hosts to write out the RDD, it is unlikely that it would run out of memory. My interpretation of the Spark tuning guide is that the degree of parallelism has little impact in case (1) above since max cores = number of hosts. Can someone explain why there are still OOMs with 100G being available? On a related note, intuitively (though I haven't read the source), it appears that an entire key-value pair needn't fit into the memory of a single host for saveAsTextFile, since a single shuffle read from a remote host can be written to HDFS before the next remote read is carried out. This way, not all data needs to be collected at the same time.
>
> Lastly, if an OOM is (but shouldn't be) a common occurrence (as per the tuning guide and even as per Datastax's Spark introduction), there may need to be more documentation around the internals of Spark to help users make better-informed tuning decisions with parallelism, max cores, number of partitions and other tunables. Is there any ongoing effort on that front?
>
> Thanks,
> Bharath
>
> [1] OOM stack trace and logs
> 14/11/01 12:26:37 WARN TaskSetManager: Lost task 61.0 in stage 36.0 (TID 1264, proc1.foo.bar.com): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> java.util.Arrays.copyOf(Arrays.java:3326)
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> java.lang.StringBuilder.append(StringBuilder.java:136)
> scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
> scala.Tuple2.toString(Tuple2.scala:22)
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler. ...
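One observation on the trace above: the OutOfMemoryError is raised inside Tuple2.toString under saveAsTextFile, so each (key, Iterable) pair produced by the groupBy is rendered as a single output line, and a sufficiently skewed group produces a string whose backing char array exceeds the JVM's array limit regardless of heap size. A minimal sketch of two alternatives, using an illustrative (String, String) record layout and helper names that are not from the thread:

import org.apache.spark.SparkContext._   // pair-RDD operations (Spark 1.x)
import org.apache.spark.rdd.RDD

// Write one modest-sized line per element instead of one enormous line per group.
def saveGrouped(records: RDD[(String, String)], path: String): Unit = {
  records
    .groupByKey()                         // groups are still built ...
    .flatMap { case (key, values) =>      // ... but each value becomes its own line
      values.iterator.map(v => key + "\t" + v)
    }
    .saveAsTextFile(path)
}

// If grouping is only needed to co-locate keys in the output, sorting by key
// before writing avoids materialising the per-key Iterables altogether.
def saveSorted(records: RDD[(String, String)], path: String): Unit =
  records.sortByKey().map { case (k, v) => k + "\t" + v }.saveAsTextFile(path)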