problem about cluster mode of spark 1.0.0

2014-06-20 Thread randylu
my program runs in standalone mode; the command line is like:
    /opt/spark-1.0.0/bin/spark-submit \
      --verbose \
      --class $class_name --master spark://master:7077 \
      --driver-memory 15G \
      --driver-cores 2 \
      --deploy-mode cluster \
      hdfs://master:

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread abhiguruvayya
Does JavaPairRDD.saveAsHadoopFile store data as a sequenceFile? Then what is the significance of RDD.saveAsSequenceFile? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-JavaRDD-as-a-sequence-file-using-spark-java-API-tp7969p7983.html Sent from t

Re: problem about cluster mode of spark 1.0.0

2014-06-20 Thread randylu
in addition, jar file can be copied to driver node automatically. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-cluster-mode-of-spark-1-0-0-tp7982p7984.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: 1.0.1 release plan

2014-06-20 Thread Patrick Wendell
Hey There, I'd like to start voting on this release shortly because there are a few important fixes that have queued up. We're just waiting to fix an akka issue. I'd guess we'll cut a vote in the next few days. - Patrick On Thu, Jun 19, 2014 at 10:47 AM, Mingyu Kim wrote: > Hi all, > > Is there

broadcast in spark streaming

2014-06-20 Thread Hahn Jiang
I want to use broadcast in spark streaming, but I found there is no such function. How can I use a global variable in spark streaming? Thanks

Re: broadcast in spark streaming

2014-06-20 Thread Sourav Chandra
From the StreamingContext object you can get a reference to the SparkContext, with which you can create broadcast variables. On Fri, Jun 20, 2014 at 2:09 PM, Hahn Jiang wrote: > I want to use broadcast in spark streaming, but I found there is no such > function. > How can I use a global variable in sp
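
A minimal sketch of that approach, assuming a Spark 1.x streaming application (the source, batch interval, and broadcast map are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("broadcast-in-streaming")
    val ssc = new StreamingContext(conf, Seconds(10))

    // the underlying SparkContext is reachable from the StreamingContext
    val severity = ssc.sparkContext.broadcast(Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1))

    val lines = ssc.socketTextStream("localhost", 9999)  // illustrative input source
    lines.map(line => (line, severity.value.getOrElse(line.takeWhile(_ != ' '), 0))).print()

    ssc.start()
    ssc.awaitTermination()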

Re: broadcast in spark streaming

2014-06-20 Thread Hahn Jiang
I get it. thank you On Fri, Jun 20, 2014 at 4:43 PM, Sourav Chandra < sourav.chan...@livestream.com> wrote: > From the StreamingContext object, you can get reference of SparkContext > using which you can create broadcast variables > > > On Fri, Jun 20, 2014 at 2:09 PM, Hahn Jiang > wrote: > >>

How could I set the number of executor?

2014-06-20 Thread Earthson
"spark-submit" has an arguments "--num-executors" to set the number of executor, but how could I set it from anywhere else? We're using Shark, and want to change the number of executor. The number of executor seems to be same as workers by default? Shall we configure the executor number manually(

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
On Jun 20, 2014 01:46, "Shivani Rao" wrote: > > Hello Andrew, > > I wish I could share the code, but for proprietary reasons I can't. But I can give some idea, though, of what I am trying to do. The job reads a file and processes each of its lines. I am not doing anythin

Re: How could I set the number of executor?

2014-06-20 Thread Earthson
--num-executors seems to be available with YARN only. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-could-I-set-the-number-of-executor-tp7990p7992.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
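
For reference, a sketch of how the flag is used when YARN is the master (class, jar, and sizes are illustrative); on a standalone or Mesos cluster the closest knob is spark.cores.max rather than an explicit executor count:

    ./bin/spark-submit --master yarn-cluster \
      --num-executors 8 --executor-cores 4 --executor-memory 4G \
      --class com.example.MyJob my-job-assembly.jar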

Re: MLLib inside Storm : silly or not ?

2014-06-20 Thread Eustache DIEMERT
Yes, learning on a dedicated Spark cluster and predicting inside a Storm bolt is quite OK :) Thanks all for your answers. I'll post back if/when we try out this solution. E/ 2014-06-19 20:45 GMT+02:00 Shuo Xiang : > If I'm understanding correctly, you want to use MLlib for offline trainin

Re: problem about cluster mode of spark 1.0.0

2014-06-20 Thread Gino Bustelo
I've found that the jar will be copied to the worker from hdfs fine, but it is not added to the spark context for you. You have to know that the jar will end up in the driver's working dir, and so you just add the file name of the jar to the context in your program. In your example below, ju
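
In other words, something along these lines in the driver program (the jar name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("my-job"))
    // in cluster deploy-mode the submitted jar ends up in the driver's working
    // directory, so add it by bare file name rather than by its hdfs:// URL
    sc.addJar("my-job-assembly.jar")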

Anything like grid search available for mlbase?

2014-06-20 Thread Charles Earl
Looking for something like scikit's grid search module. C

parallel Reduce within a key

2014-06-20 Thread ansriniv
Hi, I am on Spark 0.9.0 I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32 cores in the cluster). I have an input rdd with 64 partitions. I am running "sc.mapPartitions(...).reduce(...)" I can see that I get full parallelism on the mapper (all my 32 cores are busy simu

java.net.SocketTimeoutException: Read timed out and java.io.IOException: Filesystem closed on Spark 1.0

2014-06-20 Thread Arun Ahuja
Hi all, I'm running a job that seems to continually fail with the following exception:
    java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.Socket

Re: Anything like grid search available for mlbase?

2014-06-20 Thread Xiangrui Meng
This is a planned feature for v1.1. I'm going to work on it after v1.0.1 release. -Xiangrui > On Jun 20, 2014, at 6:46 AM, Charles Earl wrote: > > Looking for something like scikit's grid search module. > C

Performance problems on SQL JOIN

2014-06-20 Thread mathias
Hi there, We're trying out Spark and are experiencing some performance issues using Spark SQL. Can anyone tell us if our results are normal? We are using the Amazon EC2 scripts to create a cluster with 3 workers/executors (m1.large). Tried both Spark 1.0.0 as well as the git master; the Scala

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Shivani Rao
Hello Abhi, I did try that and it did not work. And Eugene, yes, I am assembling the argonaut libraries in the fat jar. So how did you overcome this problem? Shivani On Fri, Jun 20, 2014 at 1:59 AM, Eugen Cepoi wrote: > > On Jun 20, 2014 01:46, "Shivani Rao" wrote: > > > > > Hello Andrew, >

Re: trying to understand yarn-client mode

2014-06-20 Thread Koert Kuipers
thanks! i will try that. i guess what i am most confused about is why the executors are trying to retrieve the jars directly using the info i provided to add jars to my spark context. i mean, that's bound to fail, no? i could be on a different machine (so my file:// isn't going to work for them), or i

Better way to use a large data set?

2014-06-20 Thread Muttineni, Vinay
Hi All, I have an 8 million row, 500 column data set, which is derived by reading a text file and doing a filter, flatMap operation to weed out some anomalies. Now, I have a process which has to run through all 500 columns, do a couple of map, reduce, forEach operations on the data set and return some

Re: Performance problems on SQL JOIN

2014-06-20 Thread Xiangrui Meng
Your data source is S3 and data is used twice. m1.large does not have very good network performance. Please try file.count() and see how fast it goes. -Xiangrui > On Jun 20, 2014, at 8:16 AM, mathias wrote: > > Hi there, > > We're trying out Spark and are experiencing some performance issues u

Re: Performance problems on SQL JOIN

2014-06-20 Thread Evan R. Sparks
Also - you could consider caching your data after the first split (before the first filter), this will prevent you from retrieving the data from s3 twice. On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng wrote: > Your data source is S3 and data is used twice. m1.large does not have very > good ne
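
A sketch of that suggestion (the S3 path, delimiter, and filters are made up):

    val raw = sc.textFile("s3n://my-bucket/rooms.csv")   // hypothetical S3 input
    val rows = raw.map(_.split(';')).cache()             // cache right after the split

    // both sides of the join now reuse the cached rows instead of re-reading S3
    val rooms2 = rows.filter(_(0) == "2")
    val rooms3 = rows.filter(_(0) == "3")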

broadcast not working in yarn-cluster mode

2014-06-20 Thread Christophe Préaud
Hi, Since I migrated to spark 1.0.0, a couple of applications that used to work in 0.9.1 now fail when broadcasting a variable. Those applications are run on a YARN cluster in yarn-cluster mode (and used to run in yarn-standalone mode in 0.9.1) Here is an extract of the error log: Exception in

Re: How do you run your spark app?

2014-06-20 Thread Shivani Rao
Hello Michael, I have a quick question for you. Can you clarify the statement " build fat JAR's and build dist-style TAR.GZ packages with launch scripts, JAR's and everything needed to run a Job". Can you give an example. I am using sbt assembly as well to create a fat jar, and supplying the spa

spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
i noticed that when i submit a job to yarn it mistakenly tries to upload files to local filesystem instead of hdfs. what could cause this? in spark-env.sh i have HADOOP_CONF_DIR set correctly (and spark-submit does find yarn), and my core-site.xml has a fs.defaultFS that is hdfs, not local filesys

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Marcelo Vanzin
Hi Koert, Could you provide more details? Job arguments, log messages, errors, etc. On Fri, Jun 20, 2014 at 9:40 AM, Koert Kuipers wrote: > i noticed that when i submit a job to yarn it mistakenly tries to upload > files to local filesystem instead of hdfs. what could cause this? > > in spark-en

Re: How to store JavaRDD as a sequence file using spark java API?

2014-06-20 Thread Kan Zhang
Yes, it can if you set the output format to SequenceFileOutputFormat. The difference is saveAsSequenceFile does the conversion to Writable for you if needed and then calls saveAsHadoopFile. On Fri, Jun 20, 2014 at 12:43 AM, abhiguruvayya wrote: > Does JavaPairRDD.saveAsHadoopFile store data as
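
A sketch of the two equivalent routes from Scala (the Java API is analogous; paths are illustrative):

    import org.apache.spark.SparkContext._   // brings in the save* implicits on pair RDDs
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapred.SequenceFileOutputFormat

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

    // convenience method: converts keys/values to Writables if needed,
    // then delegates to saveAsHadoopFile
    pairs.saveAsSequenceFile("hdfs:///tmp/seq-out")

    // the explicit equivalent
    pairs.map { case (k, v) => (new Text(k), new IntWritable(v)) }
      .saveAsHadoopFile("hdfs:///tmp/seq-out-explicit",
        classOf[Text], classOf[IntWritable],
        classOf[SequenceFileOutputFormat[Text, IntWritable]])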

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
In my case it was due to a case class I was defining in the spark-shell and not being available on the workers. So packaging it in a jar and adding it with ADD_JARS solved the problem. Note that I don't exactly remember if it was a Java heap space exception or a PermGen space one. Make sure your jarsP

Re: trying to understand yarn-client mode

2014-06-20 Thread Marcelo Vanzin
On Fri, Jun 20, 2014 at 8:22 AM, Koert Kuipers wrote: > thanks! i will try that. > i guess what i am most confused about is why the executors are trying to > retrieve the jars directly using the info i provided to add jars to my spark > context. i mean, thats bound to fail no? i could be on a diff

Re: 1.0.1 release plan

2014-06-20 Thread Andrew Ash
Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the below issues without running a patched version of Spark: https://issues.apache.org/jira/browse/SPARK-1935 -- commons-codec version conflicts for client applications https://issues.apache.org/jira/browse/SPARK-2043 -- correctnes

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
yeah sure see below. i strongly suspect its something i misconfigured causing yarn to try to use local filesystem mistakenly. * [koert@cdh5-yarn ~]$ /usr/local/lib/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --execu

Re: Performance problems on SQL JOIN

2014-06-20 Thread mathias
Thanks for your suggestions. file.count() takes 7s, so that doesn't seem to be the problem. Moreover, a union with the same code/CSV takes about 15s (SELECT * FROM rooms2 UNION SELECT * FROM rooms3). The web status page shows that both stages 'count at joins.scala:216' and 'reduce at joins.scala:

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread bc Wong
Koert, is there any chance that your fs.defaultFS isn't setup right? On Fri, Jun 20, 2014 at 9:57 AM, Koert Kuipers wrote: > yeah sure see below. i strongly suspect its something i misconfigured > causing yarn to try to use local filesystem mistakenly. > > * > > [koert@cdh5

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
in /etc/hadoop/conf/core-site.xml:
    fs.defaultFS  hdfs://cdh5-yarn.tresata.com:8020
also hdfs seems to be the default:
    [koert@cdh5-yarn ~]$ hadoop fs -ls /
    Found 5 items
    drwxr-xr-x - hdfs supergroup 0 2014-06-19 12:31 /data
    drwxrwxrwt - hdfs supergroup 0 2014-06-20 12:

Can not checkpoint Graph object's vertices but could checkpoint edges

2014-06-20 Thread dash
I'm trying to work around the StackOverflowError when an object has a long dependency chain; someone said I should use checkpoint to cut off dependencies. I wrote some sample code to test it, but I can only checkpoint edges, not vertices. I think I do materialize vertices and edges after calling c

Re: 1.0.1 release plan

2014-06-20 Thread Mingyu Kim
Cool. Thanks for the note. Looking forward to it. Mingyu From: Andrew Ash Reply-To: "user@spark.apache.org" Date: Friday, June 20, 2014 at 9:54 AM To: "user@spark.apache.org" Subject: Re: 1.0.1 release plan Sounds good. Mingyu and I are waiting on 1.0.1 to get the fix for the below issu

Re: Parallel LogisticRegression?

2014-06-20 Thread Kyle Ellrott
I've tried to parallelize the separate regressions using allResponses.toParArray.map( x => do logistic regression against labels in x ) But I start to see messages like:
    14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task 363.0:4)
    14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Shivani Rao
Hello Eugene, You are right about this. I did encounter the "PermGen space" error in the spark shell. Can you tell me a little more about "ADD_JARS"? In order to ensure my spark-shell has all required jars, I added the jars to the "$CLASSPATH" in the compute_classpath.sh script. Is there another way of

Re: parallel Reduce within a key

2014-06-20 Thread DB Tsai
Currently, the reduce operation combines the results from the mappers sequentially, so it's O(n). Xiangrui is working on treeReduce, which is O(log(n)). Based on the benchmark, it dramatically increases the performance. You can test the code in his own branch. https://github.com/apache/spark/pull/1110 Si
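
For reference, the usage after that change lands (treeReduce is available directly on RDD in later Spark releases) looks roughly like this; the data and depth are illustrative:

    val values = sc.parallelize(1 to 1000000, 64)
    // combine partial results in a tree of the given depth instead of
    // pulling every partition's result back to the driver sequentially
    val total = values.treeReduce((a, b) => a + b, depth = 2)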

Re: What is the best way to handle transformations or actions that takes forever?

2014-06-20 Thread Peng Cheng
Wow, that sounds a lot of work (need a mini-thread), thanks a lot for the answer. It might be a nice-to-have feature. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-best-way-to-handle-transformations-or-actions-that-takes-forever-tp7664p8024.htm

Re: How do you run your spark app?

2014-06-20 Thread Shrikar archak
Hi Shivani, I use sbt assembly to create a fat jar: https://github.com/sbt/sbt-assembly Example of the sbt file is below.
    import AssemblyKeys._ // put this at the top of the file
    assemblySettings
    mainClass in assembly := Some("FifaSparkStreaming")
    name := "FifaSparkStreaming"
    version := "1.
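
For context, a fuller build.sbt along those lines might look like this (project name, versions, and main class are illustrative; Spark is marked "provided" so it stays out of the fat jar):

    import AssemblyKeys._

    assemblySettings

    name := "my-streaming-app"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided"
    )

    mainClass in assembly := Some("com.example.Main")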

Re: parallel Reduce within a key

2014-06-20 Thread Michael Malak
How about a treeReduceByKey? :-) On Friday, June 20, 2014 11:55 AM, DB Tsai wrote: Currently, the reduce operation combines the result from mapper sequentially, so it's O(n). Xiangrui is working on treeReduce which is O(log(n)). Based on the benchmark, it dramatically increase the performan

Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Shrikar archak
Hi All, I was curious to know which of the two approaches is better for doing analytics using spark streaming. Let's say we want to add some metadata to the stream being processed, like sentiment, tags etc., and then perform some analytics using this added metadata. 1) Is it ok to make a htt

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-20 Thread Sébastien Rainville
Hi, this is just a follow-up regarding this issue. Turns out that it's caused by a bug in Spark. I created a case for it: https://issues.apache.org/jira/browse/SPARK-2204 and submitted a patch. Any chance this could be included in the 1.0.1 release? Thanks, - Sebastien On Tue, Jun 17, 2014 a

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
i put some logging statements in yarn.Client and that confirms its using local filesystem: 14/06/20 15:20:33 INFO Client: fs.defaultFS is file:/// so somehow fs.defaultFS is not being picked up from /etc/hadoop/conf/core-site.xml, but spark does correctly pick up yarn.resourcemanager.hostname from

Re: spark on yarn is trying to use file:// instead of hdfs://

2014-06-20 Thread Koert Kuipers
ok solved it. as it happens, in spark/conf i also had a file called core.site.xml (with some tachyon related stuff in it), so that's why it ignored /etc/hadoop/conf/core-site.xml On Fri, Jun 20, 2014 at 3:24 PM, Koert Kuipers wrote: > i put some logging statements in yarn.Client and that confi

Re: Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Mayur Rustagi
You can apply transformations on RDDs inside DStreams using transform or any number of operations. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Jun 20, 2014 at 2:16 PM, Shrikar archak wrote: > Hi
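
A small sketch of that transform route (the tagging function is a toy stand-in for a real sentiment/tag lookup):

    import org.apache.spark.streaming.dstream.DStream

    // toy metadata function standing in for a real sentiment or tag service
    def sentiment(line: String): String =
      if (line.contains("great")) "positive" else "neutral"

    def addMetadata(lines: DStream[String]): DStream[(String, String)] =
      lines.transform(rdd => rdd.map(line => (line, sentiment(line))))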

Re: Spark and RDF

2014-06-20 Thread andy petrella
For RDF, might GraphX be particularly appropriate? aℕdy ℙetrella about.me/noootsab On Thu, Jun 19, 2014 at 4:49 PM, Flavio Pompermaier wrote: > Hi guys, > > I'm analyzing the possibility of using Spark to analyze RDF files and define

Running Spark alongside Hadoop

2014-06-20 Thread Sameer Tilak
Dear Spark users, I have a small 4 node Hadoop cluster. Each node is a VM -- 4 virtual cores, 8GB memory and 500GB disk. I am currently running Hadoop on it. I would like to run Spark (in standalone mode) along side Hadoop on the same nodes. Given the configuration of my nodes, will that work? D

Re: Running Spark alongside Hadoop

2014-06-20 Thread Mayur Rustagi
The ideal way to do that is to use a cluster manager like YARN or Mesos. You can control how many resources to give to which node, etc. You should be able to run both together in standalone mode; however, you may experience varying latency & performance in the cluster as both MR & Spark demand resource

Re: Spark and RDF

2014-06-20 Thread Mayur Rustagi
Are you looking to create Shark operators for RDF? Since the Shark backend is shifting to SparkSQL that would be slightly hard; a much better effort would be to shift Gremlin to Spark (though a much beefier one :) ) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi

Re: Spark and RDF

2014-06-20 Thread andy petrella
Maybe some SPARQL features in Shark, then ? aℕdy ℙetrella about.me/noootsab On Fri, Jun 20, 2014 at 9:45 PM, Mayur Rustagi wrote: > You are looking to create Shark operators for RDF? Since Shark backend is > shifting to SparkSQL i

Re: Spark and RDF

2014-06-20 Thread Mayur Rustagi
or a separate RDD for SPARQL operations a la SchemaRDD.. operators for SPARQL can be defined there.. not a bad idea :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Jun 20, 2014 at 3:56 PM, andy petrella wrote: > M

Re: Running Spark alongside Hadoop

2014-06-20 Thread Koert Kuipers
for development/testing i think its fine to run them side by side as you suggested, using spark standalone. just be realistic about what size data you can load with limited RAM. On Fri, Jun 20, 2014 at 3:43 PM, Mayur Rustagi wrote: > The ideal way to do that is to use a cluster manager like Yar

Re: Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Tathagata Das
If the metadata is directly related to each individual records, then it can be done either ways. Since I am not sure how easy or hard will it be for you add tags before putting the data into spark streaming, its hard to recommend one method over the other. However, if the metadata is related to ea

Re: Spark and RDF

2014-06-20 Thread andy petrella
yep, would be cool. Even though SPARQL has its drawbacks (vs Cypher vs Gremlin, I mean), it's still cool for semantic thingies and co. aℕdy ℙetrella about.me/noootsab On Fri, Jun 20, 2014 at 10:03 PM, Mayur Rustagi wrote: > or

Set the number/memory of workers under mesos

2014-06-20 Thread Shuo Xiang
Hi, just wondering if anybody knows how to set the number of workers (and the amount of memory) in Mesos while launching spark-shell? I was trying to edit conf/spark-env.sh and it looks like the environment variables are for YARN or standalone. Thanks!

Re: Set the number/memory of workers under mesos

2014-06-20 Thread Mayur Rustagi
You should be able to configure it in the SparkContext from the Spark shell: spark.cores.max & memory. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, Jun 20, 2014 at 4:30 PM, Shuo Xiang wrote: > Hi, just wonderi
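
Concretely, something like this when building the context (the Mesos master URL and sizes are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos")   // illustrative Mesos master
      .set("spark.cores.max", "16")               // cap on total cores the app takes
      .set("spark.executor.memory", "4g")         // memory per executor
    val sc = new SparkContext(conf)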

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-20 Thread Eugen Cepoi
In short, ADD_JARS will add the jar to your driver classpath and also send it to the workers (similar to what you are doing when you do sc.addJars). ex: MASTER=master/url ADD_JARS=/path/to/myJob.jar ./bin/spark-shell You also have SPARK_CLASSPATH var but it does not distribute the code, it is on

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-20 Thread Koert Kuipers
thanks for the detailed answer andrew. thats helpful. i think the main thing thats bugging me is that there is no simple way for an admin to always set something on the executors for a production environment (an akka timeout comes to mind). yes i could use spark-defaults for that, although that m

Re: Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Shrikar archak
Thanks Mayur and TD for your inputs. ~Shrikar On Fri, Jun 20, 2014 at 1:20 PM, Tathagata Das wrote: > If the metadata is directly related to each individual records, then it > can be done either ways. Since I am not sure how easy or hard will it be > for you add tags before putting the data in

Re: Parallel LogisticRegression?

2014-06-20 Thread Kyle Ellrott
It looks like I was running into https://issues.apache.org/jira/browse/SPARK-2204 The issues went away when I changed to spark.mesos.coarse. Kyle On Fri, Jun 20, 2014 at 10:36 AM, Kyle Ellrott wrote: > I've tried to parallelize the separate regressions using > allResponses.toParArray.map( x=> d

Re: Set the number/memory of workers under mesos

2014-06-20 Thread Shuo Xiang
Hi Mayur, Are you referring to overriding the default sc in spark-shell? Is there any way to do that before running the shell? On Fri, Jun 20, 2014 at 1:40 PM, Mayur Rustagi wrote: > You should be able to configure it in the SparkContext from the Spark shell: > spark.cores.max & memory. > Regards > Mayur

Re: options set in spark-env.sh is not reflecting on actual execution

2014-06-20 Thread Andrew Or
Hi Meethu, Are you using Spark 1.0? If so, you should use spark-submit ( http://spark.apache.org/docs/latest/submitting-applications.html), which has --executor-memory. If you don't want to specify this every time you submit an application, you can also specify spark.executor.memory in $SPARK_HOME
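
For example (the class, jar, and size are illustrative):

    ./bin/spark-submit --class com.example.MyApp --master spark://master:7077 \
      --executor-memory 4G my-app-assembly.jar

Alternatively, a line like "spark.executor.memory 4g" in $SPARK_HOME/conf/spark-defaults.conf avoids repeating the flag on every submission.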

kibana like frontend for spark

2014-06-20 Thread Mohit Jaggi
Folks, I want to analyse logs and I want to use spark for that. However, elasticsearch has a fancy frontend in Kibana. Kibana's docs indicate that it works with elasticsearch only. Is there a similar frontend that can work with spark? Mohit. P.S.: On MapR's spark FAQ I read a statement like "Kiba

Re: Long running Spark Streaming Job increasing executing time per batch

2014-06-20 Thread Tathagata Das
In the Spark web UI, you should see the same pattern of stages repeating over time, as the same sequence of stages gets computed in every batch. From that you should be able to get a sense of how much time corresponding stages take across different batches, and which stage is actually taking more time,

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-06-20 Thread Andrew Or
Well, even before spark-submit the standard way of setting spark configurations is to create a new SparkConf, set the values in the conf, and pass this to the SparkContext in your application. It's true that this involves "hard-coding" these configurations in your application, but these configurati

Fwd: Using Spark

2014-06-20 Thread Ricky Thomas
Hi, Would like to add ourselves to the user list if possible please? Company: truedash url: truedash.io Automatic pulling of all your data in to Spark for enterprise visualisation, predictive analytics and data exploration at a low cost. Currently in development with a few clients. Thanks

Re: How do you run your spark app?

2014-06-20 Thread Shivani Rao
Hello Shrikar, Thanks for your email. I have been using the same workflow as you did. But my question was related to the creation of the SparkContext: if I am specifying jars in the "java -cp ", and adding them to my build.sbt, do I need to additionally add them in my code while c

Re: Worker dies while submitting a job

2014-06-20 Thread Shivani Rao
That error typically means that there is a communication error (wrong ports) between master and worker. Also check if the worker has "write" permissions to create the "work" directory. We were getting this error due to one of the above two reasons. On Tue, Jun 17, 2014 at 10:04 AM, Luis Ángel Vicent

Re: How do you run your spark app?

2014-06-20 Thread Andrei
Hi Shivani, Adding JARs to classpath (e.g. via "-cp" option) is needed to run your _local_ Java application, whatever it is. To deliver them to _other machines_ for execution you need to add them to SparkContext. And you can do it in 2 different ways: 1. Add them right from your code (your sugges
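
A sketch of the first option, with the assembly jar path purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .setJars(Seq("target/scala-2.10/my-app-assembly-1.0.jar"))  // shipped to the executors
    val sc = new SparkContext(conf)

    // equivalently, after the context exists:
    // sc.addJar("target/scala-2.10/my-app-assembly-1.0.jar")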

sc.textFile can't recognize '\004'

2014-06-20 Thread anny9699
Hi, I need to parse a file which is separated by a series of separators. I used SparkContext.textFile and I met two problems: 1) One of the separators is '\004', which can be recognized by Python or R or Hive; however, Spark doesn't seem to recognize this one and returns a symbol looking like '?'. A

Re: Running Spark alongside Hadoop

2014-06-20 Thread Ognen Duzlevski
I only ran HDFS on the same nodes as Spark and that worked out great performance and robustness wise. However, I did not run Hadoop itself to do any computations/jobs on the same nodes. My expectation is that if you actually ran both at the same time with your configuration, the performance wou

How to terminate job from the task code?

2014-06-20 Thread Piotr Kołaczkowski
If the task detects an unrecoverable error, i.e. an error that we can't expect to fix by retrying or by moving the task to another node, how can we stop the job / prevent Spark from retrying it?
    def process(taskContext: TaskContext, data: Iterator[T]) {
      ...
      if (unrecoverableError) {
        ??? // ter

Re: sc.textFile can't recognize '\004'

2014-06-20 Thread Sean Owen
These are actually Scala / Java questions. On Sat, Jun 21, 2014 at 1:08 AM, anny9699 wrote: > 1) One of the separators is '\004', which could be recognized by python or R > or Hive, however Spark seems can't recognize this one and returns a symbol > looking like '?'. Also this symbol is not a que
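
On the Scala side the separator can be written with its Unicode escape; a small sketch (the path is illustrative):

    val fields = sc.textFile("hdfs:///data/records.txt")
      .map(_.split('\u0004'))   // split each line on the \004 (EOT) control character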