java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-08-27 Thread durin
Hi, I'm using a cluster with 5 nodes that each use 8 cores and 10GB of RAM. Basically I'm creating a dictionary from text, i.e. giving each word that occurs more than n times in all texts a unique identifier. The essential part of the code looks like this: var texts = ctx.sql("SELECT text FROM

Re: Submitting multiple files pyspark

2014-08-27 Thread Andrew Or
Hi Cheng, You specify extra python files through --py-files. For example: bin/spark-submit [your other options] --py-files helper.py main_app.py -Andrew 2014-08-27 22:58 GMT-07:00 Chengi Liu : > Hi, > I have two files.. > > main_app.py and helper.py > main_app.py calls some functions in hel

Re: Update on Pig on Spark initiative

2014-08-27 Thread Russell Jurney
This is really exciting! Thanks so much for this work, I think you've guaranteed Pig's continued vitality. On Wednesday, August 27, 2014, Matei Zaharia wrote: > Awesome to hear this, Mayur! Thanks for putting this together. > > Matei > > On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.ru

Submitting multiple files pyspark

2014-08-27 Thread Chengi Liu
Hi, I have two files: main_app.py and helper.py. main_app.py calls some functions in helper.py. I want to use spark-submit to submit a job, but how do I specify helper.py? Basically, how do I specify multiple files in Spark? Thanks

Re: Update on Pig on Spark initiative

2014-08-27 Thread Matei Zaharia
Awesome to hear this, Mayur! Thanks for putting this together. Matei On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote: Hi, We have migrated Pig functionality on top of Spark passing 100% e2e for success cases in pig test suite. That means UDF, Joins & other func

Update on Pig on Spark initiative

2014-08-27 Thread Mayur Rustagi
Hi, We have migrated Pig functionality on top of Spark passing 100% e2e for success cases in the Pig test suite. That means UDF, Joins & other functionality is working quite nicely. We are in the process of merging with Apache Pig trunk (something that should happen over the next 2 weeks). Meanwhile if

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed. You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt Cheers On Wed, Aug 27, 2014 at 8:5

Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra

2014-08-27 Thread lmk
Hi Yana, I have done take and confirmed existence of data. Also checked that it is getting connected to Cassandra. That is why I suspect that this particular rdd is not serializable. Thanks, Lmk On Aug 28, 2014 5:13 AM, "Yana [via Apache Spark User List]" < ml-node+s1001560n12960...@n3.nabble.com>

Re: Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)

2014-08-27 Thread Ted Yu
See this thread: http://search-hadoop.com/m/JW1q5wwgyL1/Working+Formula+for+Hive+0.13&subj=Re+Working+Formula+for+Hive+0+13+ On Wed, Aug 27, 2014 at 8:54 PM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1. >

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297 You can download it and replace examples/pom.xml with the downloaded pom I am running this command locally: mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package Cheers

Compilation FAILURE : Spark 1.0.2 / Project Hive (0.13.1)

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi, I use Hadoop 2.4.1, HBase 0.98.5, Zookeeper 3.4.6 and Hive 0.13.1. I just tried to compile Spark 1.0.2, but got error on "Spark Project Hive", can you please advise which repository has "org.spark-project.hive:hive-metastore:jar:0.13.1"? FYI, below is my repository setting in maven which

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Matei Zaharia
You can use spark-shell -i file.scala to run that. However, that keeps the interpreter open at the end, so you need to make your file end with System.exit(0) (or even more robustly, do stuff in a try {} and add that in finally {}). In general it would be better to compile apps and run them with
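A minimal sketch of a script suited to spark-shell -i, following the advice above; the file name and logic are illustrative, not taken from the original message:

    // run with: spark-shell -i count-lines.scala
    try {
      val lines = sc.textFile("/tmp/input.txt")   // `sc` is provided by spark-shell
      println("line count: " + lines.count())
    } finally {
      System.exit(0)                              // exit the interpreter once the work is done
    }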

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xm

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
Can you use this command ? patch -p1 -i 1893.patch Cheers On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi Ted, > > I tried the following steps to apply the patch 1893 but got Hunk FAILED, > can you please advise how to get thru this error? or i

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
You can get the patch from this URL: https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml Cheers On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi Ted, > > Thank you so much!

RE: how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Henry Hung
Update: I use a shell script to execute spark-shell; inside my-script.sh: $SPARK_HOME/bin/spark-shell < $HOME/test.scala > $HOME/test.log 2>&1 & Although it correctly finishes the println("hallo world"), the strange thing is that my-script.sh finished before spark-shell even finish execu

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur On 28 Aug, 2014, at 10:13 am, Ted Yu wrote: > See SPARK-1297 > > The pull request is here: > https://github.com/apache/spark/pull/

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread Ted Yu
See SPARK-1297 The pull request is here: https://github.com/apache/spark/pull/1893 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > (correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please > ignore if duplicated) > > > Hi, > > I need

how to correctly run scala script using spark-shell through stdin (spark v1.0.0)

2014-08-27 Thread Henry Hung
Hi All, Right now I'm trying to execute a script using this command: nohup $SPARK_HOME/bin/spark-shell < $HOME/my-script.scala > $HOME/my-script.log 2>&1 & my-script.scala has just 1 line of code: println("hallo world") But after waiting for a minute, I still don't receive the result from sp

Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
(correction: "Compilation Error: Spark 1.0.2 with HBase 0.98", please ignore if duplicated) Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98. My steps: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2

Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-27 Thread arthur.hk.c...@gmail.com
Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98. My steps: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 edit project/SparkBuild.scala, set HBASE_VERSION // HBase version; set as appropriate. val H
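For reference, an illustrative sketch of the kind of edit described above in project/SparkBuild.scala; the exact value is an assumption (0.98.5-hadoop2 is the artifact Ted Yu suggests elsewhere in this thread):

    // project/SparkBuild.scala
    // HBase version; set as appropriate.
    val HBASE_VERSION = "0.98.5-hadoop2"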

Re: Kafka stream receiver stops input

2014-08-27 Thread Dibyendu Bhattacharya
I think this is a known issue in the existing KafkaUtils. Even we had this issue. The problem is that in the existing KafkaUtils there is no way to control the message flow. You can refer to another mail thread on Low Level Kafka Consumer which I have written to solve this issue along with many others. Dib On

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread Dibyendu Bhattacharya
I agree. This issue should be fixed in Spark rather than relying on replay of Kafka messages. Dib On Aug 28, 2014 6:45 AM, "RodrigoB" wrote: > Dibyendu, > > Tnks for getting back. > > I believe you are absolutely right. We were under the assumption that the > raw data was being computed again and that's

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread RodrigoB
Dibyendu, Thanks for getting back. I believe you are absolutely right. We were under the assumption that the raw data was being computed again, and that's not happening after further tests. This applies to Kafka as well. The issue is of major priority fortunately. Regarding your suggestion, I wou

Re: MLBase status

2014-08-27 Thread Ameet Talwalkar
Hi Sameer, MLbase started out as a set of three ML components on top of Spark. The lowest level, MLlib, is now a rapidly growing component within Spark and is maintained by the Spark community. The two higher-level components (MLI and MLOpt) are experimental components that serve as testbeds for

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Michael Armbrust
In general the various language interfaces try to return the natural type for the language. In Python we return lists; in Scala we return Seqs. Arrays on the JVM have all sorts of messy semantics (e.g. they are invariant and don't have erasure). On Wed, Aug 27, 2014 at 5:34 PM, Du Li wrote: >

Kafka stream receiver stops input

2014-08-27 Thread Tim Smith
Hi, I have Spark (1.0.0 on CDH5) running with Kafka 0.8.1.1. I have a streaming job that reads from a Kafka topic and writes output to another Kafka topic. The job starts fine but after a while the input stream stops getting any data. I think these messages show no incoming data on the stream: 1

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Du Li
I found this discrepancy when writing unit tests for my project. Basically the expectation was that the returned type should match that of the input data. Although it’s easy to work around, I was just feeling a bit weird. Is there a better reason to return ArrayBuffer? From: Michael Armbrust m

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Michael Armbrust
Arrays in the JVM are also mutable. However, you should not be relying on the exact type here. The only promise is that you will get back something of type Seq[_]. On Wed, Aug 27, 2014 at 4:27 PM, Du Li wrote: > Hi, Michael. > > I used HiveContext to create a table with a field of type Arra
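A minimal sketch of the advice above: rely only on the Seq interface for the array-typed column rather than a concrete collection class. The table and column names are hypothetical, and a HiveContext named hiveContext is assumed:

    val result = hiveContext.hql("SELECT tags FROM my_table")
    result.map { row =>
      // only rely on Seq; the backing type may be ArrayBuffer or something else
      val tags = row(0).asInstanceOf[Seq[String]]
      tags.mkString(",")
    }.collect().foreach(println)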

Re: Apache Spark- Cassandra - NotSerializable Exception while saving to cassandra

2014-08-27 Thread Yana
I'm not so sure that your error is coming from the Cassandra write. You have val data = test.map(..).map(..), so data will actually not get created until you try to save it. Can you try to do something like data.count() or data.take(k) after this line and see if you even get to the Cassandra part?
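A sketch of the debugging step described above, assuming data is the RDD built by the two map calls and that the DataStax spark-cassandra-connector is on the classpath; forcing data to materialize before the write shows whether the failure comes from the transformations or from the save itself:

    import com.datastax.spark.connector._

    println("rows to save: " + data.count())   // materializes the upstream maps
    data.take(5).foreach(println)              // eyeball a few records
    // only then attempt the save; keyspace and table names are illustrative
    data.saveToCassandra("my_keyspace", "log_lines")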

SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Du Li
Hi, Michael. I used HiveContext to create a table with a field of type Array. However, in the hql results, this field was returned as type ArrayBuffer which is mutable. Would it make more sense to be an Array? The Spark version of my test is 1.0.2. I haven’t tested it on SQLContext nor newer v

Re: FileNotFoundException (No space left on device) writing to S3

2014-08-27 Thread Frank Austin Nothaft
Hi Dan, Spark will clean up the temp files after a run (IIRC), so you won’t see the drive out of space after the run completes. In any case, by default, Spark puts shuffles files at /tmp/ (this is controlled by the spark.local.dir parameter). I assume you’re running on EC2? You’ll probably want
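A minimal sketch of pointing spark.local.dir at a larger volume, per the advice above; the mount point is an example, and on EC2 you would typically point it at the instance-store disks:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("s3-writer")
      .set("spark.local.dir", "/mnt/spark-tmp")   // shuffle files go here instead of /tmp
    val sc = new SparkContext(conf)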

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
I see an issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. I wish there were a DStream mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng wrote: > You can use RDD id as the seed, which is unique in the same spark > context. Suppose none of the RDDs would con

FileNotFoundException (No space left on device) writing to S3

2014-08-27 Thread Daniil Osipov
Hello, I've been seeing the following errors when trying to save to S3: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 4058 in stage 2.1 failed 4 times, most recent failure: Lost task 4058.3 in stage 2.1 (TID 12572, ip-10-81-151-40.ec2.interna

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Patrick Wendell
Yeah - each batch will produce a new RDD. On Wed, Aug 27, 2014 at 3:33 PM, Soumitra Kumar wrote: > Thanks. > > Just to double check, rdd.id would be unique for a batch in a DStream? > > > On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng wrote: >> >> You can use RDD id as the seed, which is unique

Re: How to get prerelease thriftserver working?

2014-08-27 Thread Cheng Lian
Hey Matt, if you want to access existing Hive data, you still need to run a Hive metastore service, and provide a proper hive-site.xml (just drop it in $SPARK_HOME/conf). Could you provide the error log you saw? On Wed, Aug 27, 2014 at 12:09 PM, Michael Armbrust wrote: > I would expect tha

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Thanks. Just to double check, rdd.id would be unique for a batch in a DStream? On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng wrote: > You can use RDD id as the seed, which is unique in the same spark > context. Suppose none of the RDDs would contain more than 1 billion > records. Then you can

Re: SchemaRDD

2014-08-27 Thread Matei Zaharia
I think this will increasingly be its role, though it doesn't make sense to move it into core because it is clearly just a client of the core APIs. What usage do you have in mind in particular? It would be nice to know how the non-SQL APIs for this could be better. Matei On August 27, 2014 at 2:3

Re: minPartitions ignored for bz2?

2014-08-27 Thread Xiangrui Meng
Are you using hadoop-1.0? Hadoop doesn't support splittable bz2 files before 1.2 (or a later version). But due to a bug (https://issues.apache.org/jira/browse/HADOOP-10614), you should try hadoop-2.5.0. -Xiangrui On Wed, Aug 27, 2014 at 2:49 PM, jerryye wrote: > Hi, > I'm running on the master br

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar wrote:
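A sketch of the hack above applied inside a DStream transform; stream is a hypothetical DStream, and the 1e9 offset assumes no RDD holds more than a billion records:

    val indexed = stream.transform { rdd =>
      rdd.zipWithUniqueId().map { case (value, uid) =>
        // rdd.id is unique per SparkContext, so the combined key is unique across batches
        (rdd.id * 1000000000L + uid, value)
      }
    }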

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
So, I guess zipWithUniqueId will be similar. Is there a way to get unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng wrote: > No. The indices start at 0 for every RDD. -Xiangrui > > On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar > wrote: > > Hello, > > > > If I do: > > > > DStream

minPartitions ignored for bz2?

2014-08-27 Thread jerryye
Hi, I'm running on the master branch and I noticed that textFile ignores minPartition for bz2 files. Is anyone else seeing the same thing? I tried varying minPartitions for a bz2 file and rdd.partitions.size was always 1 whereas doing it for a non-bz2 file worked. Not sure if this matters or not

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar wrote: > Hello, > > If I do: > > DStream transform { > rdd.zipWithIndex.map { > > Is the index guaranteed to be unique across all RDDs here? > > } > } > > Thanks, > -Soumitra.

Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Soumitra Kumar
Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks, -Soumitra.

SchemaRDD

2014-08-27 Thread Koert Kuipers
I feel like SchemaRDD has usage beyond just SQL. Perhaps it belongs in core?

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Frank van Lankvelt
you could try looking at ScalaCL[1], it's targeting OpenCL rather than CUDA, but that might be close enough? cheers, Frank 1. https://github.com/ochafik/ScalaCL On Wed, Aug 27, 2014 at 7:33 PM, Wei Tan wrote: > Thank you all. Actually I was looking at JCUDA. Function wise this may be > a perf

Historic data and clocks

2014-08-27 Thread Frank van Lankvelt
Hi, In an attempt to keep processing logic as simple as possible, I'm trying to use spark streaming for processing historic as well as real-time data. This works quite well, using big intervals that match the window size for historic data, and small intervals for real-time. I found this discussi

Re: disable log4j for spark-shell

2014-08-27 Thread Yana
You just have to tell Spark which log4j properties file to use. I think --driver-java-options="-Dlog4j.configuration=log4j.properties" should work but it didn't for me. set SPARK_JAVA_OPTS=-Dlog4j.configuration=log4j.properties did work though (this was on Windows, in local mode, assuming you put a

[Streaming] Cannot get executors to stay alive

2014-08-27 Thread Yana Kadiyska
Hi, I asked a similar question before and didn't get any answers, so I'll try again: I am using updateStateByKey, pretty much exactly as shown in the examples shipping with Spark: def createContext(master:String,dropDir:String, checkpointDirectory:String) = { val updateFunc = (values: Seq[Int
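For reference, a minimal sketch in the spirit of the stateful example the poster describes; the names, batch interval, and text-file source are illustrative, not taken from the original message:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    def createContext(master: String, dropDir: String, checkpointDir: String) = {
      // merge the new counts for a key into its running total
      val updateFunc = (values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0))

      val ssc = new StreamingContext(master, "StatefulApp", Seconds(10))
      ssc.checkpoint(checkpointDir)
      ssc.textFileStream(dropDir)
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .updateStateByKey[Int](updateFunc)
        .print()
      ssc
    }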

MLBase status

2014-08-27 Thread Sameer Tilak
Hi All, I was wondering if someone could please tell me the status of MLbase and its roadmap in terms of software releases. We are very interested in exploring it for our applications.

Spark N.C.

2014-08-27 Thread am
Looking for fellow Spark enthusiasts based in and around Research Triangle Park, Raleigh, Durham, and Chapel Hill, North Carolina Please get in touch off list for an employment opportunity. Must be local. Thanks! -Andrew - T

user@spark.apache.org

2014-08-27 Thread Michael Armbrust
You need to have the datanucleus jars on your classpath. It is not okay to merge them into an uber jar. On Wed, Aug 27, 2014 at 1:44 AM, centerqi hu wrote: > Hi all > > > When I run a simple SQL, encountered the following error. > > hive:0.12(metastore in mysql) > > hadoop 2.4.1 > > spark 1.0.2

RE: Execution time increasing with increase of cluster size

2014-08-27 Thread Sameer Tilak
Can you tell which nodes were doing the computation in each case? Date: Wed, 27 Aug 2014 20:29:38 +0530 Subject: Execution time increasing with increase of cluster size From: sarathchandra.jos...@algofusiontech.com To: user@spark.apache.org Hi, I've written a simple scala program which reads a fi

Re: How to get prerelease thriftserver working?

2014-08-27 Thread Michael Armbrust
I would expect that to work. What exactly is the error? On Wed, Aug 27, 2014 at 6:02 AM, Matt Chu wrote: > (apologies for sending this twice, first via nabble; didn't realize it > wouldn't get forwarded) > > Hey, I know it's not officially released yet, but I'm trying to understand > (and run)

Re: Does HiveContext support Parquet?

2014-08-27 Thread Michael Armbrust
I'll note the parquet jars are included by default in 1.1 On Wed, Aug 27, 2014 at 11:53 AM, lyc wrote: > Thanks a lot. Finally, I can create parquet table using your command > "-driver-class-path". > > I am using hadoop 2.3. Now, I will try to load data into the tables. > > Thanks, > lyc > > >

RE: Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi Burak, Thanks, I will then start benchmarking the cluster. > Date: Wed, 27 Aug 2014 11:52:05 -0700 > From: bya...@stanford.edu > To: ssti...@live.com > CC: user@spark.apache.org > Subject: Re: Amplab: big-data-benchmark > > Hi Sameer, > > I've faced this issue before. They don't show up on >

Re: Does HiveContext support Parquet?

2014-08-27 Thread lyc
Thanks a lot. Finally, I can create a parquet table using your command "--driver-class-path". I am using hadoop 2.3. Now, I will try to load data into the tables. Thanks, lyc -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp1

Re: Amplab: big-data-benchmark

2014-08-27 Thread Burak Yavuz
Hi Sameer, I've faced this issue before. They don't show up on http://s3.amazonaws.com/big-data-benchmark/. But you can directly use: `sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/crawl")` The gotcha is that you also need to supply which dataset you want: crawl, uservisits, or rankings
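A sketch of reading one of the benchmark datasets directly, per the note above; this assumes AWS credentials are configured for the s3n:// scheme, and the dataset and size suffix (rankings, tiny) are just examples:

    val rankings = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/rankings")
    println(rankings.take(3).mkString("\n"))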

Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi All, I am planning to run the amplab benchmark suite to evaluate the performance of our cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it mentions data availability at: s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix] where /tiny/, /1n

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread Bharat Venkat
Hi Dibyendu, That would be great. One of the biggest drawbacks of Kafka utils as well as your implementation is that I am unable to scale out processing. I am relatively new to Spark and Spark Streaming - from what I read and what I observe with my deployment is that having the RDD created on one rece

Re: Spark 1.1. doesn't work with hive context

2014-08-27 Thread S Malligarjunan
It is my mistake; somehow I added the io.compression.codec property value as the above-mentioned class. Resolved the problem now. Thanks and Regards, Sankar S. On Wednesday, 27 August 2014, 1:23, S Malligarjunan wrote: Hello all, I have just checked out branch-1.1 and executed

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Xiangrui Meng
Hi Wei, Please keep us posted about the performance result you get. This would be very helpful. Best, Xiangrui On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan wrote: > Thank you all. Actually I was looking at JCUDA. Function wise this may be a > perfect solution to offload computation to GPU. Will se

Re: Issue Connecting to HBase in spark shell

2014-08-27 Thread kpeng1
It looks like the issue I had is that I didn't pull in htrace-core jar into the spark class path. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issue-Connecting-to-HBase-in-spark-shell-tp12855p12924.html Sent from the Apache Spark User List mailing list ar

Re: Specifying classpath

2014-08-27 Thread Ashish Jain
I solved this issue by putting hbase-protobuf in Hadoop classpath, and not in the spark classpath. export HADOOP_CLASSPATH="/path/to/jar/hbase-protocol-0.98.1-cdh5.1.0.jar" On Tue, Aug 26, 2014 at 5:42 PM, Ashish Jain wrote: > Hello, > > I'm using the following version of Spark - 1.0.0+cdh5.1

Re: Out of memory on large RDDs

2014-08-27 Thread Jianshi Huang
I have the same issue (I'm using the latest 1.1.0-SNAPSHOT). I've increased my driver memory to 30G, executor memory to 10G, and spark.akka.askTimeout to 180. Still no good. My other configurations are: spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryoserializer.buffer.mb

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Wei Tan
Thank you all. Actually I was looking at JCUDA. Function wise this may be a perfect solution to offload computation to GPU. Will see how the performance will be, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Wat

RE: Save an RDD to a SQL Database

2014-08-27 Thread bdev
I have similar requirement to export the data to mysql. Just wanted to know what the best approach is so far after the research you guys have done. Currently thinking of saving to hdfs and use sqoop to handle export. Is that the best approach or is there any other way to write to mysql? Thanks!

Re: What is a Block Manager?

2014-08-27 Thread Victor Tso-Guillen
I have long-lived state I'd like to maintain on the executors that I'd like to initialize during some bootstrap phase and to update the master when such executor leaves the cluster. On Tue, Aug 26, 2014 at 11:18 PM, Liu, Raymond wrote: > The framework have those info to manage cluster status, a

Re: Execute HiveFormSpark ERROR.

2014-08-27 Thread Du Li
As suggested in the error messages, double-check your class path. From: CharlieLin mailto:chury...@gmail.com>> Date: Tuesday, August 26, 2014 at 8:29 PM To: "user@spark.apache.org" mailto:user@spark.apache.org>> Subject: Execute HiveFormSpark ERROR. hi, all :

Reference Accounts & Large Node Deployments

2014-08-27 Thread Steve Nunez
All, Does anyone have specific references to customers, use cases and large-scale deployments of Spark Streaming? By "large scale" I mean both throughput and number of nodes. I'm attempting an objective comparison of Streaming and Storm and while this data is known for Storm, there appears to be

Re: Spark Streaming Output to DB

2014-08-27 Thread Ravi Sharma
Thank you Akhil and Mayur. It will be really helpful. Thanks, On 27 Aug 2014 13:19, "Akhil Das" wrote: > Like Mayur said, its better to use mapPartition instead of map. > > Here's a piece of code which typically reads a text file and inserts each > raw into the database. I haven't tested it, It

Saddle structure in Spark

2014-08-27 Thread LPG
Hello everyone, Is it possible to use an "external" data structure, such as Saddle, in Spark? As far as I know, an RDD is a kind of wrapper or container that has a certain data structure inside. So I was wondering whether this data structure has to be either a basic (or native) structure or any avail

Re: Does HiveContext support Parquet?

2014-08-27 Thread Silvio Fiorito
What Spark and Hadoop versions are you on? I have it working in my Spark app with the parquet-hive-bundle-1.5.0.jar bundled into my app fat-jar. I'm running Spark 1.0.2 and CDH5. bin/spark-shell --master local[*] --driver-class-path ~/parquet-hive-bundle-1.5.0.jar To see if that works? On 8/26/1

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Chen He
JCUDA can let you do that in Java http://www.jcuda.org On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro < ajnava...@stratio.com> wrote: > Maybe this would interest you: > > CPU and GPU-accelerated Machine Learning Library: > > https://github.com/BIDData/BIDMach > > > 2014-08-27 4:08 GMT+02

RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
It didn’t work after adding file:// at the front. I compiled it again and ran it. The same errors are coming. Do you think there can be some problem with the java dependency? Also, I don’t want to install Hadoop; I just want to run it on my local machine. The reason is, whenever I install these thing

Re: Example File not running

2014-08-27 Thread Akhil Das
You can install hadoop 2 by reading this doc https://wiki.apache.org/hadoop/Hadoop2OnWindows Once you are done with it, you can set the environment variable HADOOP_HOME then it should work. Also Not sure if it will work, but can you provide file:// at the front and give it a go? I don't see any re

RE: Installation On Windows machine

2014-08-27 Thread Mishra, Abhishek
I got it working, Matei, thank you. I was giving the wrong directory path. Thank you...!! Thanks, Abhishek Mishra -Original Message- From: Mishra, Abhishek [mailto:abhishek.mis...@xerox.com] Sent: Wednesday, August 27, 2014 4:38 PM To: Matei Zaharia Cc: user@spark.apache.org Subject: RE: Ins

How to get prerelease thriftserver working?

2014-08-27 Thread Matt Chu
(apologies for sending this twice, first via nabble; didn't realize it wouldn't get forwarded) Hey, I know it's not officially released yet, but I'm trying to understand (and run) the Thrift-based JDBC server, in order to enable remote JDBC access to our dev cluster. Before asking about details,

Re: spark and matlab

2014-08-27 Thread Jaonary Rabarisoa
Regarding the second point, I found the answer myself inside the PipedRDD source code :) On Wed, Aug 27, 2014 at 1:36 PM, Jaonary Rabarisoa wrote: > Thank you Matei. > > I found a solution using pipe and matlab engine (an executable that can > call matlab behind the scene and uses stdin and stdou

Example file not running

2014-08-27 Thread Hingorani, Vineet
Hello all, I am able to use Spark in the shell but I am not able to run a spark file. I am using sbt and the jar is created but even the SimpleApp class example given on the site http://spark.apache.org/docs/latest/quick-start.html is not running. I installed a prebuilt version of spark and >

NotSerializableException while doing rdd.saveToCassandra

2014-08-27 Thread lmk
Hi All, I am using spark-1.0.0 to parse a json file and save to values to cassandra using case class. My code looks as follows: case class LogLine(x1:Option[String],x2: Option[String],x3:Option[List[String]],x4: Option[String],x5:Option[String],x6:Option[String],x7:Option[String],x8:Option[String],

RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
The code is the example given on Spark site: /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val logFile = "C:/Users/D062844/Desktop/HandsOnSpark/

Re: spark and matlab

2014-08-27 Thread Jaonary Rabarisoa
Thank you Matei. I found a solution using pipe and matlab engine (an executable that can call matlab behind the scene and uses stdin and stdout to communicate). I just need to fix two other issues : - how can I handle my dependencies ? My matlab script need other matlab files that need to be pre
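A minimal sketch of the pipe-based approach described above: each partition's elements are fed to the external process on stdin, one per line, and the process's stdout lines become the new RDD. The wrapper script name is illustrative:

    val input = sc.parallelize(Seq("1 2 3", "4 5 6"))
    // run_matlab_engine.sh is a hypothetical wrapper that reads stdin and writes results to stdout
    val processed = input.pipe("./run_matlab_engine.sh")
    processed.collect().foreach(println)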

Re: Example File not running

2014-08-27 Thread Akhil Das
It should point to your hadoop installation directory. (like C:\hadoop\) Since you don't have hadoop installed, what is the code that you are running? Thanks Best Regards On Wed, Aug 27, 2014 at 4:50 PM, Hingorani, Vineet wrote: > What should I put the value of that environment variable? I w

External dependencies management with spark

2014-08-27 Thread Jaonary Rabarisoa
Dear all, I'm looking for an efficient way to manage external dependencies. I know that one can add .jar or .py dependencies easily, but how can I handle other types of dependencies? Specifically, I have some data processing algorithms implemented in other languages (ruby, octave, matlab, c++) and

RE: Example File not running

2014-08-27 Thread Hingorani, Vineet
What should I put the value of that environment variable? I want to run the scripts locally on my machine and do not have any Hadoop installed. Thank you From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Mittwoch, 27. August 2014 12:54 To: Hingorani, Vineet Cc: user@spark.apache.org

RE: Installation On Windows machine

2014-08-27 Thread Mishra, Abhishek
Thank you for the reply, Matei. Is there something we missed? I am able to run the Spark instance on my local system, i.e. Windows 7, but the same set of steps does not allow me to run it on a Windows Server 2012 machine. The black screen just appears for a fraction of a second and disappears. I

Re: Example File not running

2014-08-27 Thread Akhil Das
The statement java.io.IOException: Could not locate executable null\bin\winutils.exe explains that the null is received when expanding or replacing an Environment Variable. I'm guessing that you are missing *HADOOP_HOME* in the environment variables. Thanks Best Regards On Wed, Aug 27, 2014 at

Replicate RDDs

2014-08-27 Thread rapelly kartheek
Hi I have a three node spark cluster. I restricted the resources per application by setting appropriate parameters and I could run two applications simultaneously. Now, I want to replicate an RDD and run two applications simultaneously. Can someone help how to go about doing this!!! I replicated

Example File not running

2014-08-27 Thread Hingorani, Vineet
Hello all, I am able to use Spark in the shell but I am not able to run a spark file. I am using sbt and the jar is created but even the SimpleApp class example given on the site http://spark.apache.org/docs/latest/quick-start.html is not running. I installed a prebuilt version of spark and >

Re: Spark - GraphX pregel like with global variables (accumulator / broadcast)

2014-08-27 Thread BertrandR
Thank you for your answers, and sorry for my lack of understanding. So I tried what you suggested, with/without unpersisting and with .cache() (also persist(StorageLevel.MEMORY_AND_DISK) but this is not allowed for msg because you can't change the Storage level apparently) for msg, g and newVerts,

user@spark.apache.org

2014-08-27 Thread centerqi hu
Hi all When I run a simple SQL, encountered the following error. hive:0.12(metastore in mysql) hadoop 2.4.1 spark 1.0.2 build with hive my hql code import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ import org.apache.spark.sql.hive.LocalHiveContext object HqlTes

Developing a spark streaming application

2014-08-27 Thread Filip Andrei
Hey guys, so the problem I'm trying to tackle is the following: - I need a data source that emits messages at a certain frequency - There are N neural nets that need to process each message individually - The outputs from all neural nets are aggregated and only when all N outputs for each message

Re: Spark Streaming Output to DB

2014-08-27 Thread Akhil Das
Like Mayur said, it's better to use mapPartitions instead of map. Here's a piece of code which typically reads a text file and inserts each row into the database. I haven't tested it; it might throw up some Serialization errors. In that case, you gotta serialize them! JavaRDD txtRDD
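A hedged Scala sketch of the pattern described above: open one JDBC connection per partition rather than per record. The driver, URL, and table are illustrative, not taken from the original message, and lines stands in for the text DStream being written out:

    lines.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // one connection and prepared statement per partition
        Class.forName("com.mysql.jdbc.Driver")
        val conn = java.sql.DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass")
        val stmt = conn.prepareStatement("INSERT INTO events(raw) VALUES (?)")
        try {
          partition.foreach { line =>
            stmt.setString(1, line)
            stmt.executeUpdate()
          }
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }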

Is there a way to insert data into existing parquet file using spark ?

2014-08-27 Thread rafeeq s
Hi, *Is there a way to insert data into an existing parquet file using Spark?* I am using spark streaming and spark sql to store real time data into parquet files and then query it using impala. Spark creates multiple sub-directories of parquet files, and it makes it a challenge for me while loading it t

Re: Trying to run SparkSQL over Spark Streaming

2014-08-27 Thread Zhan Zhang
I think currently ExistingRDD is not supported. But ParquetRelation is supported, so probably you can try this as a workaround. case logical.InsertIntoTable(table: ParquetRelation, partition, child, overwrite) => InsertIntoParquetTable(table, planLater(child), overwrite) :: Nil example: