Re: spilling in-memory map of 5.1 MB to disk (272 times so far)

2015-06-28 Thread Akhil Das
Here's a bunch of configuration options for that: https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior Thanks Best Regards On Fri, Jun 26, 2015 at 10:37 PM, igor.berman wrote: > Hi, > wanted to get some advice regarding tuning a spark application > I see for some of the tasks many log
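A minimal sketch of adjusting a couple of the shuffle-related properties from that page on the SparkConf; the values are illustrative, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-tuning-example")
  .set("spark.shuffle.memoryFraction", "0.4")    // more memory for shuffle before spilling
  .set("spark.shuffle.spill.compress", "true")   // compress data that does spill to disk
val sc = new SparkContext(conf)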

Re: Master dies after program finishes normally

2015-06-28 Thread Akhil Das
Which version of Spark are you using? You can try changing the heap size manually by export _JAVA_OPTIONS="-Xmx5g" Thanks Best Regards On Fri, Jun 26, 2015 at 7:52 PM, Yifan LI wrote: > Hi, > > I just encountered the same problem, when I run a PageRank program which > has lots of stages (iter

Re: spark streaming - checkpoint

2015-06-28 Thread ram kumar
SPARK_CLASSPATH=$CLASSPATH:/usr/hdp/2.2.0.0-2041/hadoop-mapreduce/* in spark-env.sh I think I am facing the same issue https://issues.apache.org/jira/browse/SPARK-6203 On Mon, Jun 29, 2015 at 11:38 AM, ram kumar wrote: > I am using Spark 1.2.0.2.2.0.0-82 (git revision de12451) built for Hadoop
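For reference, the usual checkpoint-aware way to build the StreamingContext is StreamingContext.getOrCreate, which recovers from the checkpoint directory if one exists. A minimal sketch; the checkpoint path and batch interval are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-example")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // define the DStream graph here before returning
  ssc
}

// Recovers from the checkpoint if present, otherwise builds a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()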

Spark SQL parallel query submission via single HiveContext

2015-06-28 Thread V Dineshkumar
Hi, As per my use case I need to submit multiple queries to Spark SQL in parallel, but because query submission through the HiveContext is synchronized, the jobs are getting submitted sequentially. I can see many threads waiting on the HiveContext. "on-spray-can-akka.actor.default-dispatcher-26" - Thread t@149 java.lang
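For reference, the commonly suggested pattern for getting concurrent queries out of a single SparkContext is to issue them from separate threads, e.g. with Futures; a minimal sketch (the table names are made up, and sc is assumed to be the existing SparkContext). Parts of HiveContext's query planning are synchronized, so some serialization at submission time may remain; setting spark.scheduler.mode=FAIR can additionally help concurrent jobs share resources.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val queries = Seq("SELECT count(*) FROM t1", "SELECT count(*) FROM t2")   // made-up queries

// Each query is submitted from its own thread; Spark can then schedule the jobs concurrently.
val futures = queries.map(q => Future { hc.sql(q).collect() })
val results = futures.map(f => Await.result(f, Duration.Inf))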

Re: Spark FP-Growth algorithm for frequent sequential patterns

2015-06-28 Thread Xiangrui Meng
Hi Ping, FYI, we just merged Feynman's PR: https://github.com/apache/spark/pull/6997 that adds sequential pattern support. Please check out master branch and help test. Thanks! Best, Xiangrui On Wed, Jun 24, 2015 at 2:16 PM, Feynman Liang wrote: > There is a JIRA for this which I just submitted

Re: Fine control with sc.sequenceFile

2015-06-28 Thread ๏̯͡๏
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.sequenceFile(getMostRecentDirectory(tablePath, _.startsWith("_")).get + "/*", classOf[Text], classOf[Text]) works On Sun, Jun 28, 2015 at 9:46 PM, Ted Yu wrote: > There isn't setter for sc.hadoopConf

Re: Fine control with sc.sequenceFile

2015-06-28 Thread Ted Yu
There isn't a setter for sc.hadoopConfiguration. You can directly change the value of the parameter in sc.hadoopConfiguration. However, see the note in the scaladoc: * '''Note:''' As it will be reused in all Hadoop RDDs, it's better not to modify it unless you * plan to set some global configurations for al
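If you would rather not touch the shared sc.hadoopConfiguration at all, the new-API read methods accept a per-call Configuration, which only affects that one read. A sketch of that alternative (using newAPIHadoopFile instead of sequenceFile; the path and split size are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

val localConf = new Configuration(sc.hadoopConfiguration)
localConf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864")

// The extra Configuration applies to this read only, not to other Hadoop RDDs.
val rdd = sc.newAPIHadoopFile(
  "/path/to/sequence/files",                      // placeholder path
  classOf[SequenceFileInputFormat[Text, Text]],
  classOf[Text],
  classOf[Text],
  localConf)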

Re: Fine control with sc.sequenceFile

2015-06-28 Thread ๏̯͡๏
val hadoopConf = new Configuration(sc.hadoopConfiguration) hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.hadoopConfiguration(hadoopConf) or sc.hadoopConfiguration = hadoopConf threw error. On Sun, Jun 28, 2015 at 9:32 PM, Ted Yu wrote: > seq

Re: Fine control with sc.sequenceFile

2015-06-28 Thread Ted Yu
sequenceFile() calls hadoopFile() where: val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration)) You can set the parameter in sc.hadoopConfiguration before calling sc.sequenceFile(). Cheers On Sun, Jun 28, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I can do this > >

Fine control with sc.sequenceFile

2015-06-28 Thread ๏̯͡๏
I can do this val hadoopConf = new Configuration(sc.hadoopConfiguration) hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", "67108864") sc.newAPIHadoopFile( path + "/*.avro", classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord

Re: Unable to start Pi (hello world) application on Spark 1.4

2015-06-28 Thread ๏̯͡๏
Figured it out. All the jars that were specified with driver-class-path are now exported through SPARK_CLASSPATH and it's working now. I thought SPARK_CLASSPATH was dead. Looks like it keeps flipping ON/OFF. On Sun, Jun 28, 2015 at 12:55 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Any thoughts on this ? > > On Fri, Ju

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
Other people might disagree, but I have had better luck with a model that looks more like traditional map-reduce if you use Spark for disk-to-disk computations: more cores per executor (and so less RAM per core/task). So I would suggest trying --executor-cores 4 and adjusting numPartitions accordingly.

Re: required: org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

2015-06-28 Thread Feynman Liang
You are trying to predict on a DStream[LabeledPoint] (data + labels) but predictOn expects a DStream[Vector] (just the data without the labels). Try doing: val unlabeledStream = labeledStream.map { x => x.features } model.predictOn(unlabeledStream).print() On Sun, Jun 28, 2015 at 6:03 PM, Arthur

Re: required: org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

2015-06-28 Thread Arthur Chan
Also, my Spark is 1.4. On Mon, Jun 29, 2015 at 9:02 AM, Arthur Chan wrote: > > > Hi, > > > line 99: model.trainOn(labeledStream) > > line 100: model.predictOn(labeledStream).print() > > line 101: ssc.start() > > line 102: ssc.awaitTermination() > > > Regards > > On Sun, Jun 28, 2015

Re: required: org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

2015-06-28 Thread Arthur Chan
Hi, line 99: model.trainOn(labeledStream) line 100: model.predictOn(labeledStream).print() line 101: ssc.start() line 102: ssc.awaitTermination() Regards On Sun, Jun 28, 2015 at 10:53 PM, Ted Yu wrote: > Can you show us your code around line 100? > > Which Spark release are

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
Regarding # of executors: I get 342 executors in parallel each time and I set executor-cores to 1. Hence I need to set 342 * 2 * x (x = 1, 2, 3, ...) as the number of partitions while running blockJoin. Is this correct? And are my assumptions on replication levels correct? Did you get a chance to look a

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
I am unable to run my application or the sample application with the prebuilt Spark 1.4 or with this custom 1.4. In both cases I get this error 15/06/28 15:30:07 WARN ipc.Client: Exception encountered while connecting to the server : java.lang.IllegalArgumentException: Server has invalid Kerberos principa

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
Regarding your calculation of executors... RAM in an executor is not really comparable to size on disk. If you read from file and write to file you do not have to set a storage level. In the join or blockJoin, specify the number of partitions as a multiple (say 2 times) of the number of cores available t

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
My code: val viEvents = details.filter(_.get(14).asInstanceOf[Long] != NULL_VALUE).map { vi => (vi.get(14).asInstanceOf[Long], vi) } //AVRO (150G) val lstgItem = DataUtil.getDwLstgItem(sc, DateUtil.addDaysToDate(startDate, -89)).filter(_.getItemId().toLong != NULL_VALUE).map { lstg => (ls

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
Could you please suggest and help me understand further. These are the actual sizes: -sh-4.1$ hadoop fs -count dw_lstg_item 1 764 2041084436189 /sys/edw/dw_lstg_item/snapshot/2015/06/25/00 This is not skewed; there is exactly one entry for each item, but it's 2 TB. So should I

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
Specify numPartitions or a partitioner for operations that shuffle. So use: def join[W](other: RDD[(K, W)], numPartitions: Int) or def blockJoin[W]( other: JavaPairRDD[K, W], leftReplication: Int, rightReplication: Int, partitioner: Partitioner) for example: left.blockJoin(right, 3, 1, new
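Putting the advice in this thread together, a sketch of what the call could look like, based only on the signature quoted above and assuming the blockJoin implicits from the library are already in scope; the partition count is illustrative:

import org.apache.spark.HashPartitioner

// A multiple of the total cores available to the job, as suggested elsewhere in this thread (illustrative).
val numPartitions = 342 * 2

// Following the advice in this thread: lstgItem (~2 TB) is the larger side here, so replicate viEvents: 1, 3.
val joined = lstgItem.blockJoin(viEvents, 1, 3, new HashPartitioner(numPartitions))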

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
A blockJoin spreads out one side while replicating the other. I would suggest replicating the smaller side. So if lstgItem is smaller try 3,1, or else 1,3. This should spread the "fat" keys out over multiple (3) executors... On Sun, Jun 28, 2015 at 5:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I am able to us

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
You mentioned storage levels (should be memory-and-disk or disk-only) and the number of partitions (should be large, a multiple of the number of executors); how do I specify those? On Sun, Jun 28, 2015 at 2:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I am able to use blockjoin API and it does not throw compilation erro

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
I am able to use the blockJoin API and it does not throw a compilation error: val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] = lstgItem.blockJoin(viEvents, 1, 1).map { } Here viEvents is highly skewed and both are on HDFS. What should be the optimal values of replication, i

Re: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Stephen Boesch
Vanilla map/reduce does not expose it, but Hive on top of map/reduce has superior partitioning (and bucketing) support compared to Spark. 2015-06-28 13:44 GMT-07:00 Koert Kuipers : > spark is partitioner aware, so it can exploit a situation where 2 datasets > are partitioned the same way (for example by d

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
I incremented the version of Spark from 1.4.0 to 1.4.0.1 and ran ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Phive -Phive-thriftserver The build was successful but the script failed. Is there a way to pass the incremented version? [INFO] BUILD SUCCESS [INFO] --

Re: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Koert Kuipers
Spark is partitioner-aware, so it can exploit a situation where 2 datasets are partitioned the same way (for example by doing a map-side join on them). Map-reduce does not expose this. On Sun, Jun 28, 2015 at 12:13 PM, YaoPau wrote: > I've heard "Spark is not just MapReduce" mentioned during Spark
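A small sketch of what exploiting the partitioner looks like in practice: if both pair RDDs are partitioned with the same partitioner (and that partitioning is preserved), the subsequent join does not need to re-shuffle either side. The data is purely illustrative, and sc is assumed to be the SparkContext.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(8)   // illustrative partition count
val left  = sc.parallelize(Seq((1L, "a"), (2L, "b"))).partitionBy(partitioner).cache()
val right = sc.parallelize(Seq((1L, "x"), (2L, "y"))).partitionBy(partitioner).cache()

// Both sides share the same partitioner, so this join needs no additional shuffle.
val joined = left.join(right)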

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
You need 1) to publish to an in-house Maven repository, so your application can depend on your version, and 2) to use the Spark distribution you compiled to launch your job (assuming you run with YARN, so you can launch multiple versions of Spark on the same cluster). On Sun, Jun 28, 2015 at 4:33 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
How can I import this pre-built Spark into my application via Maven, as I want to use the blockJoin API? On Sun, Jun 28, 2015 at 1:31 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I ran this w/o maven options > > ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Phive > -Phive-thriftserver > > I got this spark-1

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
I ran this w/o maven options ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Phive -Phive-thriftserver I got this spark-1.4.0-bin-2.4.0.tgz in the same working directory. I hope this is built with Hadoop 2.4.x as I did specify -P On Sun, Jun 28, 2015 at 1:10 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > ./

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
./make-distribution.sh --tgz --mvn "-Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package" or ./make-distribution.sh --tgz --mvn -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package" Both fail with +

Re: Join highly skewed datasets

2015-06-28 Thread Ted Yu
The maven command needs to be passed through the --mvn option. Cheers On Sun, Jun 28, 2015 at 12:56 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > Running this now > > ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 > -Phive -Phive-thriftserver -DskipTests clean package > > > Waiting for it to co

Re: Unable to start Pi (hello world) application on Spark 1.4

2015-06-28 Thread ๏̯͡๏
Any thoughts on this ? On Fri, Jun 26, 2015 at 2:27 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > It used to work with 1.3.1, however with 1.4.0 i get the following > exception > > > export SPARK_HOME=/home/dvasthimal/spark1.4/spark-1.4.0-bin-hadoop2.4 > export > SPARK_JAR=/home/dvasthimal/spark1.4/spark-1.4.0-bin

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
Running this now ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package Waiting for it to complete. There is no progress after initial log messages //LOGS $ ./make-distribution.sh --tgz -Phadoop-2.4 -Pyarn -Dhadoop.versi

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
I just did that; where can I find that "spark-1.4.0-bin-hadoop2.4.tgz" file? On Sun, Jun 28, 2015 at 12:15 PM, Ted Yu wrote: > You can use the following command to build Spark after applying the pull > request: > > mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package > > > Cheers > > > On S

Re: Join highly skewed datasets

2015-06-28 Thread Ted Yu
You can use the following command to build Spark after applying the pull request: mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package Cheers On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > I see that block support did not make it to spark 1.4 release. > > Can you share instruct

Share your cluster & run details

2015-06-28 Thread ๏̯͡๏
I would like to get a sense of the Spark YARN clusters used out there, and this thread can help others as well: 1. Number of nodes in the cluster 2. Container memory limit 3. Typical hardware configuration of worker nodes 4. Typical number of executors used 5. Any other related info you want to share. How d

RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Michael Malak
I would also add, from a data locality theoretic standpoint, mapPartitions() provides for node-local computation that plain old map-reduce does not. Original message From: Ashic Mahtab Date: 06/28/2015 10:5
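A minimal sketch of the kind of node-local, per-partition work mapPartitions enables: the setup object is built once per partition, on the node holding that partition, instead of once per record (the data and the formatter are just illustrative; sc is assumed to be the SparkContext).

import java.text.SimpleDateFormat

val rdd = sc.parallelize(Seq("2015-06-28 10:00:00", "2015-06-28 11:30:00"))
val parsed = rdd.mapPartitions { iter =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")   // built once per partition
  iter.map(s => fmt.parse(s).getTime)                     // reused for every record in it
}
parsed.collect()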

Re: Join highly skewed datasets

2015-06-28 Thread ๏̯͡๏
I see that the block join support did not make it into the Spark 1.4 release. Can you share instructions for building Spark with this support for a Hadoop 2.4.x distribution? Appreciate it. On Fri, Jun 26, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > This is nice. Which version of Spark has this support ? Or do I need

Re: spark streaming with kafka reset offset

2015-06-28 Thread Shushant Arora
A few doubts: In 1.2 streaming, when I use a union of streams, my streaming application sometimes hangs and nothing gets printed on the driver. [Stage 2:> (0 + 2) / 2] What does the (0 + 2) / 2 here signify? 1. Does the no. of streams in topicsMap.put("testSparkPartitio
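For reference, a sketch of the union-of-receivers pattern being described, using the receiver-based Kafka API; the ZooKeeper quorum, group id, topic name and receiver count are placeholders, and ssc is assumed to be the StreamingContext:

import org.apache.spark.streaming.kafka.KafkaUtils

val numReceivers = 2   // illustrative
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("testSparkPartition" -> 1))
}
// A single unioned DStream is then processed by the rest of the pipeline.
val unified = ssc.union(kafkaStreams)
unified.count().print()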

spark-submit in deployment mode with the "--jars" option

2015-06-28 Thread hishamm
Hi, I want to deploy my application on a standalone cluster. Spark submit acts in a strange way. When I deploy the application in "client" mode, everything works well and my application can see the additional jar files. Here is the command: > spark-submit --master spark://1.2.3.4:7077 --deplo

Use logback instead of log4j in a Spark job

2015-06-28 Thread Mario Pastorelli
Hey sparkers, I'm trying to use Logback for logging from my Spark jobs but I noticed that if I submit the job with spark-submit then the log4j implementation of slf4j is loaded instead of logback. Consequently, any call to org.slf4j.LoggerFactory.getLogger will return a log4j logger instead of a l
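One common approach, sketched here for an sbt build, is to keep Spark's slf4j-log4j12 binding off the classpath and add logback-classic (plus the log4j-over-slf4j bridge) yourself; the versions are illustrative, and whether logback actually wins at runtime still depends on what spark-submit puts first on the classpath.

// build.sbt sketch: exclude the log4j binding and route log4j calls through slf4j.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.4.0")
    .exclude("org.slf4j", "slf4j-log4j12")
    .exclude("log4j", "log4j"),
  "ch.qos.logback" % "logback-classic" % "1.1.3",
  "org.slf4j" % "log4j-over-slf4j" % "1.7.12"
)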

RE: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Ashic Mahtab
Spark comes with quite a few components. At its core is... surprise... Spark Core. This provides the core things required to run Spark jobs. Spark provides a lot of operators out of the box... take a look at https://spark.apache.org/docs/latest/programming-guide.html#transformations https://spark.a

What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread YaoPau
I've heard "Spark is not just MapReduce" mentioned during Spark talks, but it seems like every method that Spark has is really doing something like (Map -> Reduce) or (Map -> Map -> Map -> Reduce) etc behind the scenes, with the performance benefit of keeping RDDs in memory between stages. Am I wr

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
Ayman - it's really a question of recommending users to products vs products to users. There will only be a difference if you're not doing all-to-all, for example if you're producing only the top N recommendations. Then you may recommend only the top N products or the top N users, which would be
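For reference, MLlib's MatrixFactorizationModel exposes both directions directly; a small sketch (the ratings, ids and counts are placeholders, and sc is assumed to be the SparkContext):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Tiny illustrative dataset; rank, iterations and lambda are placeholders.
val ratings = sc.parallelize(Seq(Rating(123, 456, 5.0), Rating(123, 789, 1.0)))
val model = ALS.train(ratings, 10, 10, 0.01)

val topProductsForUser = model.recommendProducts(123, 10)   // top 10 products for user 123
val topUsersForProduct = model.recommendUsers(456, 10)      // top 10 users for product 456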

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ayman Farahat
Thanks Ilya. Is there an advantage of, say, partitioning by users/products when you train? Here are the two alternatives I have: # Partition by user or product tot = newrdd.map(lambda l: (l[1],Rating(int(l[1]),int(l[2]),l[4]))).partitionBy(50).cache() ratings = tot.values() model = ALS.train(ratings,

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
Oops - the code should be: val a = rdd.zipWithIndex().filter(s => s._2 > 1 && s._2 < 100) On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin wrote: > You can also select pieces of your RDD by first doing a zipWithIndex and > then doing a filter operation on the second element of the RDD. > > For example to sele

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
You can also select pieces of your RDD by first doing a zipWithIndex and then doing a filter operation on the second element of the RDD. For example to select the first 100 elements : Val a = rdd.zipWithIndex().filter(s => 1 < s < 100) On Sat, Jun 27, 2015 at 11:04 AM Ayman Farahat wrote: > How

Re: required: org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

2015-06-28 Thread Ted Yu
Can you show us your code around line 100 ? Which Spark release are you compiling against ? Cheers On Sun, Jun 28, 2015 at 5:49 AM, Arthur Chan wrote: > Hi, > > I am trying Spark with some sample programs, > > > In my code, the following items are imported: > > import > org.apache.spark.mllib.

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-28 Thread Steve Loughran
On 27 Jun 2015, at 07:56, Tim Chen wrote: Does YARN provide the token through that env variable you mentioned? Or how does YARN do this? Roughly: 1. The client-side launcher creates the delegation tokens and adds them as byte[] data to the request. 2. The YARN R

Re: Spark-Submit / Spark-Shell Error Standalone cluster

2015-06-28 Thread Tomas Hudik
/usr/bin/ looks like a strange directory. Did you copy some files to /usr/bin yourself? If you download (or possibly compile) Spark, it will never be placed into /usr/bin. On Sun, Jun 28, 2015 at 9:19 AM, Wojciech Pituła wrote: > I assume that /usr/bin/load-spark-env.sh exists. Have you got the rig

required: org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.linalg.Vector]

2015-06-28 Thread Arthur Chan
Hi, I am trying Spark with some sample programs, In my code, the following items are imported: import org.apache.spark.mllib.regression.{StreamingLinearRegressionWithSGD, LabeledPoint} import org.apache.spark.mllib.regression.{StreamingLinearRegressionWithSGD} import org.apache.spark.streamin
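Given the resolution earlier in this thread (predictOn takes a DStream[Vector], not a DStream[LabeledPoint]), a minimal sketch of the streaming regression setup; labeledStream is assumed to be an existing DStream[LabeledPoint], and the feature count is a placeholder:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 3   // placeholder
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(labeledStream)
// predictOn expects DStream[Vector], so strip the labels first.
model.predictOn(labeledStream.map(_.features)).print()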

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-28 Thread Iulian Dragoș
This is something we (at Typesafe) also thought about, but haven't started yet. It would be good to pool efforts. On Sat, Jun 27, 2015 at 12:44 AM, Dave Ariens wrote: > Fair. I will look into an alternative with a generated delegation token. > However the same issue exists. How can I have the

problem for submitting job

2015-06-28 Thread 郭谦
Hi, I'm a junior user of Spark from China. I have a problem with submitting a Spark job right now. I want to submit the job from code. In other words, "How to submit a Spark job from within a Java program to a YARN cluster without using spark-submit". I've learnt from the official site http://spark.apache.
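One option that shipped with Spark 1.4 is the SparkLauncher API, which starts a spark-submit process from inside your own JVM program; a sketch (all paths, class names and settings are placeholders):

import org.apache.spark.launcher.SparkLauncher

val process = new SparkLauncher()
  .setSparkHome("/opt/spark-1.4.0")          // placeholder
  .setAppResource("/path/to/my-app.jar")     // placeholder
  .setMainClass("com.example.MyApp")         // placeholder
  .setMaster("yarn-cluster")
  .setConf("spark.executor.memory", "2g")
  .launch()

process.waitFor()   // block until the underlying spark-submit exits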

Re: Spark-Submit / Spark-Shell Error Standalone cluster

2015-06-28 Thread Wojciech Pituła
I assume that /usr/bin/load-spark-env.sh exists. Have you got the rights to execute it? On Sun, 28.06.2015 at 04:53, Ashish Soni wrote: > Not sure what the issue is, but when I run spark-submit or spark-shell > I am getting the below error > > /usr/bin/spark-class: line 24: /usr/bin/lo