Re: Spark SQL

2014-09-14 Thread Burak Yavuz
Hi, I'm not a master on SparkSQL, but from what I understand, the problem is that you're trying to access an RDD inside an RDD here: val xyz = file.map(line => *** extractCurRate(sqlContext.sql("select rate ... *** and here: xyz = file.map(line => *** extractCurRate(sqlContext.sql("select rate
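Since the original snippet is truncated, here is a minimal sketch of the usual fix, assuming the rates table is small enough to collect on the driver (table, column, and parsing names are all illustrative): run the SQL query once outside the map, then broadcast the result.

  // Run the query once on the driver instead of inside file.map(...)
  val rates = sqlContext.sql("select currency, rate from cur_rates")
    .collect()                                    // materialize on the driver
    .map(r => (r.getString(0), r.getDouble(1)))   // assumed schema: (currency, rate)
    .toMap
  val ratesBc = sc.broadcast(rates)               // ship the lookup table to executors

  val xyz = file.map { line =>
    val cur = line.split(",")(0)                  // illustrative parsing
    ratesBc.value.getOrElse(cur, 1.0)
  }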

Broadcast error

2014-09-14 Thread Chengi Liu
Hi, I am trying to create an RDD out of a large matrix. sc.parallelize suggests using broadcast, but when I do sc.broadcast(data) I get this error: Traceback (most recent call last): File "", line 1, in File "/usr/common/usg/spark/1.0.2/python/pyspark/context.py", line 370, in broadcast

Re: Broadcast error

2014-09-14 Thread Chengi Liu
Specifically, the error I see when I try to operate on the RDD created by the sc.parallelize method: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 12:12 was 12062263 bytes which exceeds spark.akka.frameSize (10485760 bytes). Consider using broadcast variables for large
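Two common ways around this limit, sketched in Scala even though the thread itself is PySpark (names and values illustrative): raise the frame size when building the conf, or broadcast the large data instead of closing over it.

  import org.apache.spark.{SparkConf, SparkContext}

  // Option 1: raise the frame size (the value is in MB in Spark 1.x)
  val conf = new SparkConf()
    .setAppName("large-matrix")
    .set("spark.akka.frameSize", "100")    // default is 10
  val sc = new SparkContext(conf)

  // Option 2: broadcast the large local data instead of serializing it into each task
  val matrix: Array[Array[Double]] = ???   // the large driver-side data
  val matrixBc = sc.broadcast(matrix)
  val rowSums = sc.parallelize(matrix.indices, 100)
    .map(i => matrixBc.value(i).sum)       // tasks read the broadcast copy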

File operations on spark

2014-09-14 Thread rapelly kartheek
Hi, I am trying to perform read/write file operations in Spark by creating a Writable object, but I am not able to write to a file. The concerned data is not an RDD. Can someone please tell me how to perform read/write file operations on non-RDD data in Spark. Regards karthik
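For non-RDD side data, one route is the Hadoop FileSystem API used directly from the driver; a minimal sketch (paths and payload illustrative):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(new Configuration())

  // Write plain (non-RDD) data
  val out = fs.create(new Path("/tmp/metadata.txt"))
  out.writeBytes("some non-RDD data\n")
  out.close()

  // Read it back
  val in = fs.open(new Path("/tmp/metadata.txt"))
  val content = scala.io.Source.fromInputStream(in).mkString
  in.close()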

Driver fail with out of memory exception

2014-09-14 Thread richiesgr
Hi, I've written a job (I think not very complicated, only 1 reduceByKey), and the driver JVM always hangs with an OOM, killing the worker of course. How can I know what is running on the driver and what is running on the worker, and how can I debug the memory problem? I've already used the --driver-memory 4g params to g

Re: Driver fail with out of memory exception

2014-09-14 Thread Akhil Das
Try increasing the number of partitions while doing a reduceByKey() Thanks Best Regards On Sun, Sep 14, 2014 at 5:11 PM, richiesgr wrote: > Hi > > I've written a job (I think not very complicated on
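Concretely, reduceByKey takes an optional numPartitions argument, so the suggestion boils down to something like this sketch (input path and partition count illustrative):

  val pairs = sc.textFile("hdfs:///input")
    .map(line => (line.split(",")(0), 1))
  // More partitions => smaller per-task shuffle state on each executor
  val counts = pairs.reduceByKey(_ + _, 200)   // instead of the default partition count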

object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, I have tried to run HBaseTest.scala, but I got the following errors; any ideas how to fix them? Q1) scala> package org.apache.spark.examples :1: error: illegal start of definition package org.apache.spark.examples Q2) scala> import org.apache.hadoop.hbase.mapreduce.TableInputF

Re: Broadcast error

2014-09-14 Thread Akhil Das
When the data size is huge, you're better off using the TorrentBroadcastFactory. Thanks Best Regards On Sun, Sep 14, 2014 at 2:54 PM, Chengi Liu wrote: > Specifically, the error I see when I try to operate on the RDD created by > the sc.parallelize method: org.apache.spark.SparkException: Job aborted due t
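In Spark 1.0.x that is a configuration switch (sketch below; 1.1 is believed to use the torrent implementation by default):

  val conf = new SparkConf()
    .set("spark.broadcast.factory",
         "org.apache.spark.broadcast.TorrentBroadcastFactory")
  val sc = new SparkContext(conf)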

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
Spark examples build against hbase 0.94 by default. If you want to run against 0.98, see: SPARK-1297 https://issues.apache.org/jira/browse/SPARK-1297 Cheers On Sun, Sep 14, 2014 at 7:36 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I have tried to run *HBaseTest.

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, Thanks!! I tried to apply the patches; both spark-1297-v2.txt and spark-1297-v4.txt are good here, but not spark-1297-v5.txt: $ patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml $ patch -p1 -i spark-1297-v5.txt can't find file to patch at input line 5 Perhaps you used the wrong -p or --st

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
spark-1297-v5.txt is a level-0 patch. Please apply spark-1297-v5.txt with -p0. Cheers On Sun, Sep 14, 2014 at 8:06 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > Thanks!! > > I tried to apply the patches, both spark-1297-v2.txt and spark-1297-v4.txt > are good here, but not spark-1

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, Thanks! patch -p0 -i spark-1297-v5.txt patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej Still got errors. Regards Arthur On 14 Sep, 2014, at 11:33 pm,

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, My bad. Tried again, worked. patch -p0 -i spark-1297-v5.txt patching file docs/building-with-maven.md patching file examples/pom.xml Thanks! Arthur On 14 Sep, 2014, at 11:38 pm, arthur.hk.c...@gmail.com wrote: > Hi, > > Thanks! > > patch -p0 -i spark-1297-v5.txt > patching file docs

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
I applied the patch on master branch without rejects. If you use spark 1.0.2, use pom.xml attached to the JIRA. On Sun, Sep 14, 2014 at 8:38 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > Thanks! > > patch -p0 -i spark-1297-v5.txt > patching file docs/building-with-mav

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread arthur.hk.c...@gmail.com
Hi, I applied the patch. 1) patched $ patch -p0 -i spark-1297-v5.txt patching file docs/building-with-maven.md patching file examples/pom.xml 2) Compilation result [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Proje

Re: object hbase is not a member of package org.apache.hadoop

2014-09-14 Thread Ted Yu
Take a look at bin/run-example Cheers On Sun, Sep 14, 2014 at 9:15 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I applied the patch. > > 1) patched > > $ patch -p0 -i spark-1297-v5.txt > patching file docs/building-with-maven.md > patching file examples/pom.xml > > >

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Can you post your whole SBT build file(s)? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogramming.com On Wed, Sep 10, 2014 at 6

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Sorry, I meant any *other* SBT files. However, what happens if you remove the line: exclude("org.eclipse.jetty.orbit", "javax.servlet") dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe
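For context, that exclude typically hangs off a dependency in build.sbt roughly like this (versions illustrative):

  libraryDependencies ++= Seq(
    ("org.apache.spark" %% "spark-core" % "1.0.2" % "provided")
      .exclude("org.eclipse.jetty.orbit", "javax.servlet"),
    "org.scalatest" %% "scalatest" % "2.2.1" % "test"
  )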

failed to run SimpleApp locally on macbook

2014-09-14 Thread Gary Zhao
Hello, I'm new to Spark and I couldn't make the SimpleApp run on my MacBook. I feel it's related to network configuration. Could anyone take a look? Thanks. 14/09/14 10:10:36 INFO Utils: Fetching http://10.63.93.115:59005/jars/simple-project_2.11-1.0.jar to /var/folders/3p/l2d9ljnx7f99q8hmms3wpcg4

Re: HBase 0.96+ with Spark 1.0+

2014-09-14 Thread Reinis Vicups
I did actually try Sean's suggestion just before I posted for the first time in this thread. I got an error when doing this and thought that I was not understanding what Sean was suggesting. Now I re-attempted your suggestions with spark 1.0.0-cdh5.1.0, hbase 0.98.1-cdh5.1.0 and hadoop 2.3.0-cdh

Re: Broadcast error

2014-09-14 Thread Chengi Liu
How? Example, please. Also, if I am running this in the pyspark shell, how do I configure spark.akka.frameSize? On Sun, Sep 14, 2014 at 7:43 AM, Akhil Das wrote: > When the data size is huge, you're better off using the TorrentBroadcastFactory. > > Thanks > Best Regards > > On Sun, Sep 14, 2014 at 2:5

Re: spark 1.1.0 unit tests fail

2014-09-14 Thread Koert Kuipers
OK, sounds good. Those were the only tests that failed, by the way. On Sun, Sep 14, 2014 at 1:07 AM, Andrew Or wrote: > Hi Koert, > > Thanks for reporting this. These tests have been flaky even on the master > branch for a long time. You can safely disregard these test failures, as > the root caus

Re: compiling spark source code

2014-09-14 Thread Matei Zaharia
I've seen the "file name too long" error when compiling on an encrypted Linux file system -- some of them have a limit on file name lengths. If you're on Linux, can you try compiling inside /tmp instead? Matei On September 13, 2014 at 10:03:14 PM, Yin Huai (huaiyin@gmail.com) wrote: Can yo

Re: Broadcast error

2014-09-14 Thread Chengi Liu
And when I use the spark-submit script, I get the following error: py4j.protocol.Py4JJavaError: An error occurred while calling o26.trainKMeansModel. : org.apache.spark.SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up. at org.apache.spark.scheduler.DAGScheduler.

Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-14 Thread Brad Miller
Hi Andrew, I agree with Nicholas. That was a nice, concise summary of the meaning of the locality customization options, indicators and default Spark behaviors. I haven't combed through the documentation end-to-end in a while, but I'm also not sure that information is presently represented somew
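For readers finding this later: the fallback timing is governed by the spark.locality.wait family of settings, e.g. (a sketch; values are in milliseconds and illustrative):

  val conf = new SparkConf()
    .set("spark.locality.wait", "3000")          // wait before dropping a locality level
    .set("spark.locality.wait.process", "3000")  // PROCESS_LOCAL -> NODE_LOCAL
    .set("spark.locality.wait.node", "3000")     // NODE_LOCAL -> RACK_LOCAL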

Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Patrick Wendell
Yeah that issue has been fixed by adding better docs, it just didn't make it in time for the release: https://github.com/apache/spark/blob/branch-1.1/make-distribution.sh#L54 On Thu, Sep 11, 2014 at 11:57 PM, Zhanfeng Huo wrote: > resolved: > > ./make-distribution.sh --name spark-hadoop-2.3.0

Alternative to spark.executor.extraClassPath ?

2014-09-14 Thread innowireless TaeYun Kim
Hi, On Spark Configuration document, spark.executor.extraClassPath is regarded as a backwards-compatibility option. It also says that users typically should not need to set this option. Now, I must add a classpath to the executor environment (as well as to the driver in the future, but for
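Despite the backwards-compatibility note, the option can be set explicitly; a sketch of the two usual routes (jar path illustrative):

  // In code:
  val conf = new SparkConf()
    .set("spark.executor.extraClassPath", "/opt/libs/custom.jar")

  // Or once per cluster in conf/spark-defaults.conf:
  // spark.executor.extraClassPath  /opt/libs/custom.jar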

Re: Re: spark-1.1.0 with make-distribution.sh problem

2014-09-14 Thread Zhanfeng Huo
Thank you very much. It is helpful for end users. Zhanfeng Huo From: Patrick Wendell Date: 2014-09-15 10:19 To: Zhanfeng Huo CC: user Subject: Re: spark-1.1.0 with make-distribution.sh problem Yeah that issue has been fixed by adding better docs, it just didn't make it in time for the relea

PathFilter for newAPIHadoopFile?

2014-09-14 Thread Eric Friedman
Hi, I have a directory structure with parquet+avro data in it. There are a couple of administrative files (.foo and/or _foo) that I need to ignore when processing this data or Spark tries to read them as containing parquet content, which they do not. How can I set a PathFilter on the FileInputFor

About SparkSQL 1.1.0 join between more than two table

2014-09-14 Thread boyingk...@163.com
Hi: When I use Spark SQL (1.0.1), I found it does not support joins across three tables, e.g.: sql("SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)") I get the exception: Excep
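One workaround on the 1.0.x parser is to join pairwise and register the intermediate result as a table (a sketch; the selected column names are illustrative, and registerTempTable was called registerAsTable before 1.1):

  // First join, saved as an intermediate logical table
  val step1 = sqlContext.sql(
    """SELECT youhao_data.rowkey, youhao_age.age FROM youhao_data
       LEFT JOIN youhao_age ON (youhao_data.rowkey = youhao_age.rowkey)""")
  step1.registerTempTable("youhao_step1")

  // Second join against the intermediate table
  val joined = sqlContext.sql(
    """SELECT youhao_step1.rowkey, youhao_step1.age, youhao_totalKiloMeter.totalKiloMeter
       FROM youhao_step1
       LEFT JOIN youhao_totalKiloMeter
       ON (youhao_step1.rowkey = youhao_totalKiloMeter.rowkey)""")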

Re: Broadcast error

2014-09-14 Thread Chengi Liu
Any suggestions? I am really blocked on this one. On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu wrote: > And when I use the spark-submit script, I get the following error: > > py4j.protocol.Py4JJavaError: An error occurred while calling > o26.trainKMeansModel. > : org.apache.spark.SparkException: Job a

Re: Broadcast error

2014-09-14 Thread Davies Liu
Hey Chengi, What's the version of Spark you are using? There are big improvements to broadcast in 1.1; could you try it? On Sun, Sep 14, 2014 at 8:29 PM, Chengi Liu wrote: > Any suggestions? I am really blocked on this one > > On Sun, Sep 14, 2014 at 2:43 PM, Chengi Liu wrote: >> >> And when

Re: Broadcast error

2014-09-14 Thread Chengi Liu
I am using Spark 1.0.2. This is my work cluster, so I can't set up a new version readily... But right now, I am not using broadcast: conf = SparkConf().set("spark.executor.memory", "32G").set("spark.akka.frameSize", "1000") sc = SparkContext(conf = conf) rdd = sc.parallelize(matrix,5) from pysp

Re: Broadcast error

2014-09-14 Thread Chengi Liu
And the thing is, the code runs just fine if I reduce the number of rows in my data. On Sun, Sep 14, 2014 at 8:45 PM, Chengi Liu wrote: > I am using Spark 1.0.2. > This is my work cluster, so I can't set up a new version readily... > But right now, I am not using broadcast .. > > > conf = SparkConf().

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-14 Thread Evan Chan
SPARK-1671 looks really promising. Note that even right now, you don't need to un-cache the existing table. You can do something like this: newAdditionRdd.registerTempTable("table2") sqlContext.cacheTable("table2") val unionedRdd = sqlContext.table("table1").unionAll(sqlContext.table("table2"))

Re: PathFilter for newAPIHadoopFile?

2014-09-14 Thread Nat Padmanabhan
Hi Eric, Something along the lines of the following should work val fs = getFileSystem(...) // standard hadoop API call val filteredConcatenatedPaths = fs.listStatus(topLevelDirPath, pathFilter).map(_.getPath.toString).mkString(",") // pathFilter is an instance of org.apache.hadoop.fs.PathFilter
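Fleshed out, the approach might look like this (a sketch; the filter predicate and directory are illustrative, and the key/value/input-format classes passed to newAPIHadoopFile depend on the parquet+avro setup):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}

  val fs = FileSystem.get(new Configuration())

  // Skip administrative files such as ".foo" and "_foo"
  val pathFilter = new PathFilter {
    def accept(p: Path): Boolean =
      !p.getName.startsWith(".") && !p.getName.startsWith("_")
  }

  val filteredConcatenatedPaths = fs
    .listStatus(new Path("/data/topLevelDir"), pathFilter)
    .map(_.getPath.toString)
    .mkString(",")   // newAPIHadoopFile accepts a comma-separated list of paths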

combineByKey throws ClassCastException

2014-09-14 Thread Tao Xiao
I followed an example presented in the tutorial Learning Spark to compute the per-key average as follows: val Array(appName) = args val sparkConf = new SparkConf() .setAppName(appName) val sc = new SparkContext(

Re: Accuracy hit in classification with Spark

2014-09-14 Thread jatinpreet
Hi, I have been able to get the same accuracy with MLlib as Mahout's. The pre-processing phase of Mahout was the reason behind the accuracy mismatch. After studying and applying the same logic in my code, it worked like a charm. Thanks, Jatin - Novice Big Data Programmer -- View this mess

SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-14 Thread linkpatrickliu
I started sparkSQL thrift server: "sbin/start-thriftserver.sh" Then I use beeline to connect to it: "bin/beeline" "!connect jdbc:hive2://localhost:1 op1 op1" I have created a database for user op1. "create database dw_op1"; And grant all privileges to user op1; "grant all on database dw_op1

Re: combineByKey throws ClassCastException

2014-09-14 Thread x
How about this. scala> val rdd2 = rdd.combineByKey( | (v: Int) => v.toLong, | (c: Long, v: Int) => c + v, | (c1: Long, c2: Long) => c1 + c2) rdd2: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[9] at combineByKey at :14 xj @ Tokyo On Mon, Sep 15, 2014 at 3:06 PM, Tao
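And since the original goal was the per-key average, the same pattern extended with a count (a sketch in the same REPL style):

  scala> val avgs = rdd.combineByKey(
       |   (v: Int) => (v.toLong, 1L),                        // createCombiner: (sum, count)
       |   (c: (Long, Long), v: Int) => (c._1 + v, c._2 + 1),
       |   (c1: (Long, Long), c2: (Long, Long)) => (c1._1 + c2._1, c1._2 + c2._2)
       | ).mapValues { case (sum, count) => sum.toDouble / count }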