Re: Understanding the build params for spark with sbt.

2015-04-21 Thread Akhil Das
With maven you could do it like: mvn -Dhadoop.version=2.3.0 -DskipTests clean package -pl core Thanks Best Regards On Mon, Apr 20, 2015 at 8:10 PM, Shiyao Ma wrote: > Hi. > > My usage is only about the spark core and hdfs, so no spark sql or > mllib or other components involved. > > > I saw the hint

Re: meet weird exception when studying rdd caching

2015-04-21 Thread Akhil Das
It could be a similar issue as https://issues.apache.org/jira/browse/SPARK-4300 Thanks Best Regards On Tue, Apr 21, 2015 at 8:09 AM, donhoff_h <165612...@qq.com> wrote: > Hi, > > I am studying the RDD Caching function and write a small program to verify > it. I run the program in a Spark1.3.0 en

Re: Custom partitioning of DStream

2015-04-21 Thread Akhil Das
I think DStream.transform is the one that you are looking for. Thanks Best Regards On Mon, Apr 20, 2015 at 9:42 PM, Evo Eftimov wrote: > Is the only way to implement a custom partitioning of DStream via the > foreach > approach so to gain access to the actual RDDs comprising the DSTReam and > h
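A minimal Scala sketch of that approach, assuming a pair DStream and an illustrative partition count; DStream.transform lets you repartition every batch RDD with any Partitioner, custom or built-in:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.dstream.DStream

    // assumes a DStream of (String, Int) pairs; swap in your own custom Partitioner
    def repartitioned(stream: DStream[(String, Int)], parts: Int): DStream[(String, Int)] =
      stream.transform { rdd =>
        rdd.partitionBy(new HashPartitioner(parts)) // runs on every batch's RDD
      }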

WebUI shows poor locality during task scheduling

2015-04-21 Thread eric wong
Hi, When running an experimental KMeans job, the cached RDD is the original points data. I saw poor locality in the task details on the WebUI. Almost half of the tasks' input is Network instead of Memory. And a task with network input consumes almost the same time compared with the task with

Re: Spark and accumulo

2015-04-21 Thread Akhil Das
You can simply use a custom inputformat (AccumuloInputFormat) with the hadoop RDDs (sc.newAPIHadoopFile etc) for that; all you need to do is to pass the jobConfs. Here's a pretty clean discussion: http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-noteb
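A rough Scala sketch of the pattern from that discussion (the Job configuration for the Accumulo connector, table, and ranges is only hinted at here and depends on your setup):

    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.data.{Key, Value}
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    // configure AccumuloInputFormat on `job` (instance, zookeepers, user/token, table, ranges...)
    val accumuloRdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[AccumuloInputFormat],
      classOf[Key],
      classOf[Value])
    accumuloRdd.map { case (k, v) => (k.getRow.toString, v.toString) }.take(5)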

Re: Running spark over HDFS

2015-04-21 Thread Akhil Das
Your spark master should be spark://swetha:7077 :) Thanks Best Regards On Mon, Apr 20, 2015 at 2:44 PM, madhvi wrote: > PFA screenshot of my cluster UI > > Thanks > On Monday 20 April 2015 02:27 PM, Akhil Das wrote: > > Are you seeing your task being submitted to the UI? Under completed or >

Re: Running spark over HDFS

2015-04-21 Thread madhvi
On Tuesday 21 April 2015 12:12 PM, Akhil Das wrote: Your spark master should be spark://swetha:7077 :) Thanks Best Regards On Mon, Apr 20, 2015 at 2:44 PM, madhvi > wrote: PFA screenshot of my cluster UI Thanks On Monday 20 April 2015 02:27 PM, Akh

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-21 Thread N B
We already do have a cron job in place to clean just the shuffle files. However, what I would really like to know is whether there is a "proper" way of telling Spark to clean up these files once it's done with them? Thanks NB On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele wrote: > Write a

Re: Spark and accumulo

2015-04-21 Thread andy petrella
Hello Madvi, Some work has been done by @pomadchin using the spark notebook, maybe you should come on https://gitter.im/andypetrella/spark-notebook and poke him? There are some discoveries he made that might be helpful to know. Also you can poke @lossyrob from Azavea, he did that for geotrellis

Re: Different behavioural of HiveContext vs. Hive?

2015-04-21 Thread Ophir Cohen
BTW this: hc.sql("show tables").collect works great! On Tue, Apr 21, 2015 at 10:49 AM, Ophir Cohen wrote: > Lately we upgraded our Spark to 1.3. > Not surprisingly, along the way I found a few incompatibilities between the > versions, which is quite expected. > I found a change that I'm interested to unders

Features scaling

2015-04-21 Thread Denys Kozyr
Hi! I want to normalize features before train logistic regression. I setup scaler: scaler2 = StandardScaler(withMean=True, withStd=True).fit(features) and apply it to a dataset: scaledData = dataset.map(lambda x: LabeledPoint(x.label, scaler2.transform(Vectors.dense(x.features.toArray() b
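For comparison, the same flow in the Scala MLlib API looks roughly like this sketch (converting to dense vectors because withMean = true cannot be applied to sparse input):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def scaled(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
      val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
      data.map(p => LabeledPoint(p.label, scaler.transform(Vectors.dense(p.features.toArray))))
    }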

Column renaming after DataFrame.groupBy

2015-04-21 Thread Justin Yip
Hello, I would like to rename a column after aggregation. In the following code, the column name is "SUM(_1#179)"; is there a way to rename it to a more friendly name? scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10))) scala> d.groupBy("_1").sum().printSchema root |-- _1: integ

Re: Can't get SparkListener to work

2015-04-21 Thread Shixiong Zhu
You need to call sc.stop() to wait for the notifications to be processed. Best Regards, Shixiong(Ryan) Zhu 2015-04-21 4:18 GMT+08:00 Praveen Balaji : > Thanks Shixiong. I tried it out and it works. > > If you're looking at this post, here a few points you may be interested in: > > Turns out this
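A minimal Scala sketch of the point above; the listener body is illustrative, and sc.stop() at the end lets the queued events drain:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    val listener = new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        println(s"task finished in ${taskEnd.taskInfo.duration} ms")
    }
    sc.addSparkListener(listener)
    sc.parallelize(1 to 100).count()
    sc.stop() // flushes the listener bus so the notifications above get processed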

Re: Parquet Hive table become very slow on 1.3?

2015-04-21 Thread Rex Xiong
We have a similar issue with massive parquet files. Cheng Lian, could you have a look? 2015-04-08 15:47 GMT+08:00 Zheng, Xudong : > Hi Cheng, > > I tried both these patches, and it seems they still do not resolve my issue. And I > found most of the time is spent on this line in newParquet.scala: > > ParquetF

Different behavioural of HiveContext vs. Hive?

2015-04-21 Thread Ophir Cohen
Lately we upgraded our Spark to 1.3. Not surprisingly, along the way I found a few incompatibilities between the versions, which is quite expected. I found a change whose origin I'm interested in understanding. env: Amazon EMR, Spark 1.3, Hive 0.13, Hadoop 2.4 In Spark 1.2.1 I ran from the code a query such as: SHOW

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-21 Thread Sean Owen
I think maybe you need more partitions in your input, which might make for smaller tasks? On Tue, Apr 21, 2015 at 2:56 AM, Christian S. Perone wrote: > I keep seeing these warnings when using trainImplicit: > > WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). > The maxi
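A sketch of that suggestion in Scala (the partition count and ALS hyperparameters here are illustrative guesses, and `ratings` is assumed to be an existing RDD[Rating]):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val repartitioned = ratings.repartition(200) // more, smaller partitions => smaller tasks
    val model = ALS.trainImplicit(repartitioned, rank = 20, iterations = 10, lambda = 0.01, alpha = 40.0)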

Cassandra Connection Issue with Spark-jobserver

2015-04-21 Thread Anand
*I am new to Spark world and Job Server My Code :* package spark.jobserver import java.nio.ByteBuffer import scala.collection.JavaConversions._ import scala.collection.mutable.ListBuffer import scala.collection.immutable.Map import org.apache.cassandra.hadoop.ConfigHelper import org.apache.cas

SPOF in Spark driver in yarn-client mode.

2015-04-21 Thread guoqing0...@yahoo.com.hk
Hi all, I am a beginner with Spark and have hit some problems. I deploy Spark on YARN using "start-thriftserver.sh --master yarn", which should be yarn-client mode, and I found a SPOF in the driver process (SparkSubmit): the SparkSQL application in YARN will crash if the Spark driver goes down, so ho

Re: Number of input partitions in SparkContext.sequenceFile

2015-04-21 Thread Archit Thakur
Hi, It should generate the same number of partitions as the number of splits. How did you check the number of partitions? Also please paste your file size and hdfs-site.xml and mapred-site.xml here. Thanks and Regards, Archit Thakur. On Sat, Apr 18, 2015 at 6:20 PM, Wenlei Xie wrote: > Hi, > > I am wondering t

Re: HiveContext setConf seems not stable

2015-04-21 Thread Ophir Cohen
I think I encounter the same problem, I'm trying to turn on the compression of Hive. I have the following lines: def initHiveContext(sc: SparkContext): HiveContext = { val hc: HiveContext = new HiveContext(sc) hc.setConf("hive.exec.compress.output", "true") hc.setConf("mapreduce.output.

Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-21 Thread Sourav Chandra
Hi, We are building a spark streaming application which reads from kafka, does updateStateByKey based on the received message type, and finally stores into redis. After running for a few seconds the executor process gets killed by throwing an OutOfMemory error. The code snippet is below: *NoOfReceive

Compression and Hive with Spark 1.3

2015-04-21 Thread Ophir Cohen
Sadly I'm encountering too many issues migrating my code to Spark 1.3. I wrote about one problem in another mail, but my main problem is that I can't set the right compression type. In Spark 1.2.1 setting the following values was enough: hc.setConf("hive.exec.compress.output", "true") hc.setConf("mapreduce

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread ayan guha
Hi There are 2 ways of doing it. 1. Using SQL - this method directly creates another dataframe object. 2. Using methods of the DF object, but in that case you have to provide the schema through a row object. In this case you need to explicitly call createDataFrame again which will infer the schem

Re: Custom Partitioning Spark

2015-04-21 Thread Archit Thakur
Hi, This should work. How are you checking the number of partitions? Thanks and Regards, Archit Thakur. On Mon, Apr 20, 2015 at 7:26 PM, mas wrote: > Hi, > > I aim to do custom partitioning on a text file. I first convert it into a > pairRDD and then try to use my custom partitioner. However, some

Re: How to run spark programs in eclipse like mapreduce

2015-04-21 Thread Archit Thakur
You just need to specify your master as local and run the main class that creates the SparkContext object in Eclipse. On Mon, Apr 20, 2015 at 12:18 PM, Akhil Das wrote: > Why not build the project and submit the built jar with spark-submit? > > If you want to run it within eclipse, then all you h
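A minimal sketch of a main class that can be run directly from the IDE (app name and the sample job are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalRun {
      def main(args: Array[String]): Unit = {
        // "local[*]" runs Spark inside the same JVM, so no spark-submit is needed
        val conf = new SparkConf().setAppName("ide-local-run").setMaster("local[*]")
        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 1000).sum())
        sc.stop()
      }
    }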

RE: Streaming problems running 24x7

2015-04-21 Thread González Salgado , Miquel
Thank you Luis, I have tried without the window operation, but the memory leak is still present... I think it must be something related to Spark; running some example such as TwitterPopularTags, the same thing happens. I will post something if I find a solution. From: Luis Ángel Vicente Sánchez [mailto:

Meet Exception when learning Broadcast Variables

2015-04-21 Thread donhoff_h
Hi, experts. I wrote a very little program to learn how to use Broadcast Variables, but met an exception. The program and the exception are listed as following. Could anyone help me to solve this problem? Thanks! **My Program is as following** object TestBroadcast02 {
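For reference, a minimal working broadcast example looks roughly like this (names and data are illustrative); the key point is that executors only read it through .value:

    import org.apache.spark.{SparkConf, SparkContext}

    object TestBroadcastSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))
        val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
        val total = sc.parallelize(Seq("a", "b", "a"))
          .map(k => lookup.value.getOrElse(k, 0)) // read-only access inside the closure
          .sum()
        println(total)
        sc.stop()
      }
    }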

Spark Unit Testing

2015-04-21 Thread James King
I'm trying to write some unit tests for my spark code. I need to pass a JavaPairDStream to my spark class. Is there a way to create a JavaPairDStream using Java API? Also is there a good resource that covers an approach (or approaches) for unit testing using Java. Regards jk

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks for the hints guys! much appreciated! Even if I just do a something like: "Select * from tableX where attribute1 < 5" I see similar behaviour. @Michael Could you point me to any sample code that uses Spark's Rows? We are at a phase where we can actually change our JavaBeans for something

Re: Spark Unit Testing

2015-04-21 Thread Emre Sevinc
Hello James, Did you check the following resources: - https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming - http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs -- Emre Sevinç http://www.bigindus

Re: Spark Unit Testing

2015-04-21 Thread James King
Hi Emre, thanks for the help will have a look. Cheers! On Tue, Apr 21, 2015 at 1:46 PM, Emre Sevinc wrote: > Hello James, > > Did you check the following resources: > > - > https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming > > - > http://www.slidesh

UserDefinedTypes for SparkSQL Pitfalls (solved)

2015-04-21 Thread kmader
I was having the following issue with creating a user defined type (PosData) which I had naturally included in an object (SQLTests) In the SQLUserDefinedType.java in the sql.catalyst/src/main (https://github.com/apache/spark/blob/f9969098c8cb15e36c718b80c6cf5b534a6cf7c3/sql/catalyst/src/main/sca

Re: Streaming problems running 24x7

2015-04-21 Thread Conor Fennell
Hi, If the slow memory increase is in the driver, it could be related to this: https://issues.apache.org/jira/browse/SPARK-5967 *"After some hours disk space is being consumed. There are a lot of* *directories with name like "/tmp/spark-e3505437-f509-4b5b-92d2-ae2559badb3c"* Spark doesn't auto

Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread ayan guha
Hi, I am getting an error in the mllib ALS.train function when passing a dataframe (do I need to convert the DF to an RDD?) Code: training = ssc.sql("select userId,movieId,rating from ratings where partitionKey < 6").cache() print type(training) model = ALS.train(training,rank,n

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-21 Thread Conor Fennell
Hi, We set the spark.cleaner.ttl to some reasonable time and also set spark.streaming.unpersist=true. Those together cleaned up the shuffle files for us. -Conor On Tue, Apr 21, 2015 at 8:18 AM, N B wrote: > We already do have a cron job in place to clean just the shuffle files. > However,
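The two settings mentioned, as they would appear on a SparkConf (the TTL value is illustrative; it should comfortably exceed your longest window or batch dependency):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.cleaner.ttl", "3600")          // periodic cleanup of old metadata/shuffle data, in seconds
      .set("spark.streaming.unpersist", "true")  // let Spark drop old streaming RDDs automatically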

Problem with using Spark ML

2015-04-21 Thread Staffan
Hi, I've written an application that performs some machine learning on some data. I've validated that the data _should_ give a good output with a decent RMSE by using Lib-SVM: Mean squared error = 0.00922063 (regression) Squared correlation coefficient = 0.9987 (regression) When I try to use Spark

Shuffle question

2015-04-21 Thread Marius Danciu
Hello anyone, I have a question regarding the sort shuffle. Roughly I'm doing something like: rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) The problem is that in f2 I don't see the keys being sorted. The keys are Java Comparable not scala.math.Ordered or scala.math.Ordering
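If sorted keys within each partition are what f2 needs, one alternative to relying on groupByKey is repartitionAndSortWithinPartitions; a Scala sketch, assuming an Ordering exists for the key type:

    import scala.reflect.ClassTag
    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    def sortedPerPartition[K: Ordering: ClassTag, V: ClassTag](rdd: RDD[(K, V)], parts: Int): RDD[(K, V)] =
      rdd.repartitionAndSortWithinPartitions(new HashPartitioner(parts)) // sorts keys inside each partition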

Spark REPL no progress when run in cluster

2015-04-21 Thread bipin
Hi all, I am facing an issue, whenever I run a job on my mesos cluster, I cannot see any progress on my terminal. It shows : [Stage 0:>(0 + 0) / 204] I have setup the cluster on AWS EC2 manually. I first run mesos master and slaves, then run s

mapred.reduce.tasks

2015-04-21 Thread Shushant Arora
In a MapReduce job, how is the number of reduce tasks decided? I haven't overridden the mapred.reduce.tasks property and it's creating ~700 reduce tasks. Why is it creating this many tasks? Thanks

Re: Compression and Hive with Spark 1.3

2015-04-21 Thread Ophir Cohen
Some more info: I'm putting the compression values in hive-site.xml and running the spark job. hc.sql("set ") returns the expected (compression) configuration, but looking at the logs, it creates the tables without compression: 15/04/21 13:14:19 INFO metastore.HiveMetaStore: 0: create_table: Table(t

Re: When the old data dropped from the cache?

2015-04-21 Thread ayan guha
No, spark does not refresh new data automatically. Spark works on RDD. If you run any "action" on a RDD, then all its parents will be loaded to memory and computation will be done. Any further call to any of the parent will come from cache, else drop out from cache through LRU. On Mon, Apr 20, 20

Re: Meet Exception when learning Broadcast Variables

2015-04-21 Thread Ted Yu
Does line 27 correspond to brdcst.value ? Cheers > On Apr 21, 2015, at 3:19 AM, donhoff_h <165612...@qq.com> wrote: > > Hi, experts. > > I wrote a very little program to learn how to use Broadcast Variables, but > met an exception. The program and the exception are listed as following. > C

StandardScaler failing with OOM errors in PySpark

2015-04-21 Thread rok
I'm trying to use the StandardScaler in pyspark on a relatively small (a few hundred Mb) dataset of sparse vectors with 800k features. The fit method of StandardScaler crashes with Java heap space or Direct buffer memory errors. There should be plenty of memory around -- 10 executors with 2 cores e

Re: Map-Side Join in Spark

2015-04-21 Thread ayan guha
Hi Sorry was typing from mobile hence could not elaborate earlier. I presume you want to do map-side join and you mean you want to join 2 RDD without shuffle? Please have a quick look http://apache-spark-user-list.1001560.n3.nabble.com/Text-file-and-shuffle-td5973.html#none 1) co-partition you
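A rough Scala sketch of the broadcast (map-side) join idea, assuming the small side fits in driver/executor memory and illustrative key/value types:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def mapSideJoin(sc: SparkContext,
                    big: RDD[(Int, String)],
                    small: RDD[(Int, String)]): RDD[(Int, (String, String))] = {
      val smallMap = sc.broadcast(small.collectAsMap()) // small side shipped to every executor
      big.mapPartitions { iter =>
        iter.flatMap { case (k, v) => smallMap.value.get(k).map(sv => (k, (v, sv))) } // no shuffle of `big`
      }
    }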

Re: Spark REPL no progress when run in cluster

2015-04-21 Thread Prannoy
Hi, This is because your Logger setting is set to OFF. Just add the following lines into your code, probably this should resolve the issue. IMPORTS that are needed. import org.apache.log4j.Logger import org.apache.log4j.Level ADD the two lines to your code. Logger.getLogger("org").setLevel(Lev

Re: Meet Exception when learning Broadcast Variables

2015-04-21 Thread Jeetendra Gangele
Please always make the broadcast variable final; by design it can't be changed. Also, does line 27 correspond to extracting the value? On 21 April 2015 at 19:28, Ted Yu wrote: > Does line 27 correspond to brdcst.value ? > > Cheers > > > > > On Apr 21, 2015, at 3:19 AM, donhoff_h <165612

Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Karlson
Hi, can anyone confirm (and if so elaborate on) the following problem? When I join two DataFrames that originate from the same source DataFrame, the resulting DF will explode to a huge number of rows. A quick example: I load a DataFrame with n rows from disk: df = sql_context.parquetFil

implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-21 Thread Wang, Ningjun (LNG-NPV)
I tried to convert an RDD to a data frame using the example codes on spark website case class Person(name: String, age: Int) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split

Re: Custom Partitioning Spark

2015-04-21 Thread MUHAMMAD AAMIR
Hi Archit, Thanks a lot for your reply. I am using "rdd.partitions.length" to check the number of partitions; rdd.partitions returns the array of partitions. I would like to add one more question here: do you have any idea how to get the objects in each partition? Further, is there any way to figure

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
your code should be df_one = df.select('col1', 'col2') df_two = df.select('col1', 'col3') Your current code is generating a tuple, and of course df_1 and df_2 are different, so the join is yielding a cartesian product. Best Ayan On Wed, Apr 22, 2015 at 12:42 AM, Karlson wrote: > Hi, > > can anyone co

Re: implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-21 Thread Ted Yu
Have you tried the following ? import sqlContext._ import sqlContext.implicits._ Cheers On Tue, Apr 21, 2015 at 7:54 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > I tried to convert an RDD to a data frame using the example codes on > spark website > > > > > > case clas
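A minimal sketch of the whole toDF flow; note the SQLContext must be assigned to a val (a stable identifier) for the implicits import to compile:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc) // must be a val for the import below
    import sqlContext.implicits._

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()
    people.printSchema()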

Re: Custom Partitioning Spark

2015-04-21 Thread ayan guha
Are you looking for these? *mapPartitions*(*func*): Similar to map, but runs separately on each partition (block) of the RDD, so *func* must be of type Iterator<T> => Iterator<U> when running on an RDD of type T. *mapPartitionsWithIndex*(*func*): Similar to mapPartitions, but also provides *func* with an integer
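A small sketch of using mapPartitionsWithIndex to peek at what ended up in each partition (sample size is illustrative, and `rdd` is assumed to be the partitioned RDD in question):

    val samplePerPartition = rdd
      .mapPartitionsWithIndex { (idx, iter) => iter.take(3).map(elem => (idx, elem)) }
      .collect()
    samplePerPartition.foreach(println) // (partitionIndex, element) pairs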

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Karlson
Sorry, my code actually was df_one = df.select('col1', 'col2') df_two = df.select('col1', 'col3') But in Spark 1.4.0 this does not seem to make any difference anyway and the problem is the same with both versions. On 2015-04-21 17:04, ayan guha wrote: your code should be df_one =

Clustering algorithms in Spark

2015-04-21 Thread Jeetendra Gangele
I have a requirement in which I want to match the company name .. and I am thinking to solve this using clustering technique. Can anybody suggest which algo I should Use in Spark and how to evaluate the running time and accuracy for this particular problem. I checked K means looks good. Any idea

Re: Streaming Linear Regression problem

2015-04-21 Thread Xiangrui Meng
+user On Apr 21, 2015 4:46 AM, "baris akgun" wrote: > Hi , > > Actually I solved the problem, I just copied the existing file to train > folder, but I noticed that Spark Streaming look for file created date, > therfore I created new file after starting the streaming job and the > problem was solv

Re: Clustering algorithms in Spark

2015-04-21 Thread Jeetendra Gangele
The problem with k-means is that we have to define the number of clusters, which I don't want in this case. So I am thinking of something like hierarchical clustering; any ideas or suggestions? On 21 April 2015 at 20:51, Jeetendra Gangele wrote: > I have a requirement in which I want to match the company name .

Re: Understanding the build params for spark with sbt.

2015-04-21 Thread Sree V
Hi Shiyao, From the same page you referred to: Maven is the official recommendation for packaging Spark, and is the “build of reference”. But SBT is supported for day-to-day development since it can provide much faster iterative compilation. More advanced developers may wish to use SBT. For mav

RE:RE:maven compile error

2015-04-21 Thread Shuai Zheng
I have similar issue (I failed on the spark core project but with same exception as you). Then I follow the below steps (I am working on windows): Delete the maven repository, and re-download all dependency. The issue sounds like a corrupted jar can’t be opened by maven. Other than this,

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread ayan guha
you are correct. Just found the same thing. You are better off with sql, then. userSchemaDF = ssc.createDataFrame(userRDD) userSchemaDF.registerTempTable("users") #print userSchemaDF.take(10) #SQL API works as expected sortedDF = ssc.sql("SELECT userId,age,gender,work from users ord

Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Richard Marscher
Could you possibly describe what you are trying to learn how to do in more detail? Some basics of submitting programmatically: - Create a SparkContext instance and use that to build your RDDs - You can only have 1 SparkContext per JVM you are running, so if you need to satisfy concurrent job reque

Error while running SparkPi in Hadoop HA

2015-04-21 Thread Fernando O.
Hi all, I'm wondering if SparkPi works with hadoop HA (I guess it should) Hadoop's pi example works great on my cluster, so after having that done I installed spark and in the worker log I'm seeing two problems that might be related. Versions: Hadoop 2.6.0 Spark 1.3.1 I'm runn

Spark Scala Version?

2015-04-21 Thread ๏̯͡๏
While running my Spark application on 1.3.0 with Scala 2.10.0 I encountered 15/04/21 09:13:21 ERROR executor.Executor: Exception in task 7.0 in stage 2.0 (TID 28) java.lang.UnsupportedOperationException: tail of empty list at scala.collection.immutable.Nil$.tail(List.scala:339) at scala.collec

Re: MLlib - Collaborative Filtering - trainImplicit task size

2015-04-21 Thread Christian S. Perone
Hi Sean, thanks for the answer. I tried to call repartition() on the input with many different sizes and it still continues to show that warning message. On Tue, Apr 21, 2015 at 7:05 AM, Sean Owen wrote: > I think maybe you need more partitions in your input, which might make > for smaller tasks

Re: Spark Scala Version?

2015-04-21 Thread Dean Wampler
Without the rest of your code it's hard to make sense of errors. Why do you need to use reflection? ​Make sure you use the same Scala versions throughout and 2.10.4 is recommended. That's still the official version for Spark, even though provisional​ support for 2.11 exists. Dean Wampler, Ph.D. A

How does GraphX stores the routing table?

2015-04-21 Thread mas
Hi, How does GraphX stores the routing table? Is it stored on the master node or chunks of the routing table is send to each partition that maintains the record of vertices and edges at that node? If only customized edge partitioning is performed will the corresponding vertices be sent to same pa

Re: Column renaming after DataFrame.groupBy

2015-04-21 Thread Reynold Xin
You can use the more verbose syntax: d.groupBy("_1").agg(d("_1"), sum("_1").as("sum_1"), sum("_2").as("sum_2")) On Tue, Apr 21, 2015 at 1:06 AM, Justin Yip wrote: > Hello, > > I would like rename a column after aggregation. In the following code, the > column name is "SUM(_1#179)", is there a w
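Another option, assuming Spark 1.3's DataFrame API, is to rename after the aggregation; grabbing the generated column name via columns avoids hard-coding the "SUM(...)" form:

    val summed = d.groupBy("_1").sum("_2")
    val renamed = summed.withColumnRenamed(summed.columns.last, "sum_2") // last column is the aggregate
    renamed.printSchema()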

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Joseph Bradley
Hi Ayan, If you want to use DataFrame, then you should use the Pipelines API (org.apache.spark.ml.*) which will take DataFrames: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS In the examples/ directory for ml/, you can find a MovieLensALS example.

Re: Updating a Column in a DataFrame

2015-04-21 Thread Reynold Xin
You can use df.withColumn("a", df.b) to make column a having the same value as column b. On Mon, Apr 20, 2015 at 3:38 PM, ARose wrote: > In my Java application, I want to update the values of a Column in a given > DataFrame. However, I realize DataFrames are immutable, and therefore > cannot

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Xiangrui Meng
SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in 1.3. We should allow DataFrames in ALS.train. I will submit a patch. You can use `ALS.train(training.rdd, ...)` for now as a workaround. -Xiangrui On Tue, Apr 21, 2015 at 10:51 AM, Joseph Bradley wrote: > Hi Ayan, > > If you wa

Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-21 Thread Michal Klos
Hi, I'm trying to set up multiple spark clusters with high availability and I was wondering if I can re-use a single ZK cluster to manage them? It's not very clear in the docs and it seems like the answer may be that I need a separate ZK cluster for each spark cluster? thanks, M

Re: Join on DataFrames from the same source (Pyspark)

2015-04-21 Thread Michael Armbrust
This is https://issues.apache.org/jira/browse/SPARK-6231 Unfortunately this is pretty hard to fix as its hard for us to differentiate these without aliases. However you can add an alias as follows: from pyspark.sql.functions import * df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1

Re: SparkSQL performance

2015-04-21 Thread Michael Armbrust
Here is an example using rows directly: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema Avro or parquet input would likely give you the best performance. On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com
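A short Scala sketch of the "programmatically specifying the schema" approach from that guide (field names and data are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("attribute1", IntegerType, nullable = true),
      StructField("name", StringType, nullable = true)))

    val rowRdd = sc.parallelize(Seq(Row(1, 3, "a"), Row(2, 7, "b")))
    val tableX = sqlContext.createDataFrame(rowRdd, schema)
    tableX.registerTempTable("tableX")
    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 < 5").show()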

Re: Instantiating/starting Spark jobs programmatically

2015-04-21 Thread Steve Loughran
On 21 Apr 2015, at 17:34, Richard Marscher wrote: - There are System.exit calls built into Spark as of now that could kill your running JVM. We have shadowed some of the most offensive bits within our own application to work around this. You'd likely want to d

Re: Error while running SparkPi in Hadoop HA

2015-04-21 Thread Fernando O.
Solved Looks like it's some incompatibility in the build when using -Phadoop-2.4 , made the distribution with -Phadoop-provided and that fixed the issue On Tue, Apr 21, 2015 at 2:03 PM, Fernando O. wrote: > Hi all, > > I'm wondering if SparkPi works with hadoop HA (I guess it should) > >

Re: Spark Streaming updateStateByKey throws OutOfMemory Error

2015-04-21 Thread Olivier Girardot
Hi Sourav, Can you post your updateFunc as well please ? Regards, Olivier. Le mar. 21 avr. 2015 à 12:48, Sourav Chandra a écrit : > Hi, > > We are building a spark streaming application which reads from kafka, does > updateStateBykey based on the received message type and finally stores into >

Re: HiveContext setConf seems not stable

2015-04-21 Thread Michael Armbrust
As a workaround, can you call getConf first before any setConf? On Tue, Apr 21, 2015 at 1:58 AM, Ophir Cohen wrote: > I think I encounter the same problem, I'm trying to turn on the > compression of Hive. > I have the following lines: > def initHiveContext(sc: SparkContext): HiveContext = { >
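The suggested workaround, sketched out (the property name is taken from the thread; whether the early getConf helps is exactly what is being tested here):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    hc.getConf("hive.exec.compress.output", "false") // read once before the first setConf, per the workaround above
    hc.setConf("hive.exec.compress.output", "true")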

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Jean-Pascal Billaud
At this point I am assuming that nobody has an idea... I am still going to give it a last shot just in case it was missed by some people :) Thanks, On Mon, Apr 20, 2015 at 2:20 PM, Jean-Pascal Billaud wrote: > Hey, so I start the context at the very end when all the piping is done. > BTW a fore

Re: Features scaling

2015-04-21 Thread DB Tsai
Hi Denys, I don't see any issue in your python code, so maybe there is a bug in the python wrapper. If it's in scala, I think it should work. BTW, LogisticRegressionWithLBFGS does the standardization internally, so you don't need to do it yourself. It's worth giving it a try! Sincerely, DB Tsai --

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread ayan guha
Thank you all. On 22 Apr 2015 04:29, "Xiangrui Meng" wrote: > SchemaRDD subclasses RDD in 1.2, but DataFrame is no longer an RDD in > 1.3. We should allow DataFrames in ALS.train. I will submit a patch. > You can use `ALS.train(training.rdd, ...)` for now as a workaround. > -Xiangrui > > On Tue,

Re: Spark Performance on Yarn

2015-04-21 Thread hnahak
Try --executor-memory 5g , because you have 8 gb RAM in each machine -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22603.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks Michael! I have tried applying my schema programatically but I didn't get any improvement on performance :( Could you point me to some code examples using Avro please? Many thanks again! Renato M. 2015-04-21 20:45 GMT+02:00 Michael Armbrust : > Here is an example using rows directly: > >

Re: how to make a spark cluster ?

2015-04-21 Thread haihar nahak
I did some performance checks on socLiveJournal PageRank between my local machine (8 cores, 16 GB) in local mode and my small cluster (4 nodes, 12 cores, 40 GB), and I found cluster mode is way faster than local mode. So I am confused. no. of iterations ---> Local mode (in mins) --> cluster mode (in mins) 1

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Tathagata Das
Yeah, I am not sure what is going on. The only way to figure it out is to take a look at the disassembled bytecode using javap. TD On Tue, Apr 21, 2015 at 1:53 PM, Jean-Pascal Billaud wrote: > At this point I am assuming that nobody has an idea... I am still going to > give it a last shot just in case i

problem writing to s3

2015-04-21 Thread Daniel Mahler
I am having a strange problem writing to s3 that I have distilled to this minimal example: def jsonRaw = s"${outprefix}-json-raw" def jsonClean = s"${outprefix}-json-clean" val txt = sc.textFile(inpath)//.coalesce(shards, false) txt.count val res = txt.saveAsTextFile(jsonRaw) val txt2 = sc.text

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Jean-Pascal Billaud
Sure. But in general, I am assuming this "Graph is unexpectedly null when DStream is being serialized" must mean something. Under which circumstances would such an exception trigger? On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das wrote: > Yeah, I am not sure what is going on. The only way to f

Efficient saveAsTextFile by key, directory for each key?

2015-04-21 Thread Arun Luthra
Is there an efficient way to save an RDD with saveAsTextFile in such a way that the data gets shuffled into separated directories according to a key? (My end goal is to wrap the result in a multi-partitioned Hive table) Suppose you have: case class MyData(val0: Int, val1: string, directory_name:
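One way this is commonly done is with a MultipleTextOutputFormat that routes each record to a sub-directory named after its key; a hedged sketch, assuming an RDD of (directory_name, record) string pairs:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class KeyedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        s"$key/$name"              // e.g. dirA/part-00000
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()         // don't repeat the key inside each file
    }

    // `byDir` is assumed to be an RDD[(String, String)] of (directory_name, record)
    byDir.saveAsHadoopFile("/out/base", classOf[String], classOf[String], classOf[KeyedOutputFormat])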

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Tathagata Das
It is kind of unexpected, i can imagine a real scenario under which it should trigger. But obviously I am missing something :) TD On Tue, Apr 21, 2015 at 4:23 PM, Jean-Pascal Billaud wrote: > Sure. But in general, I am assuming this ""Graph is unexpectedly null > when DStream is being serialize

Throw Error: Delegation Token can be issued only with kerberos or web authentication when use saveAsNewAPIHadoopFile

2015-04-21 Thread yuemeng1
Hi all, my Spark version is 1.2, and I use saveAsNewAPIHadoopFile for my job, but after executing many times it may occasionally give me the following error. I think we may have lost some operation like: add SparkHadoopUtil.get.addCredentials(hadoopConf) in saveAsHadoopDataset(conf: JobConf) (SPARK-120

Not able run multiple tasks in parallel, spark streaming

2015-04-21 Thread Abhay Bansal
Hi, I have use case wherein I have to join multiple kafka topics in parallel. So if there are 2n topics there is a one to one mapping of topics which needs to be joined. val arr= ... for(condition) { val dStream1 = KafkaUtils.createDirectStream[String, String, StringDecoder, St

Re: sparksql - HiveConf not found during task deserialization

2015-04-21 Thread Manku Timma
Akhil, Thanks for the suggestions. I tried out sc.addJar, --jars, --conf spark.executor.extraClassPath and none of them helped. I added stuff into compute-classpath.sh. That did not change anything. I checked the classpath of the running executor and made sure that the hive jars are in that dir. Fo

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-21 Thread Akhil Das
It isn't mentioned anywhere in the doc, but you will probably need a separate ZK cluster for each of your HA clusters. Thanks Best Regards On Wed, Apr 22, 2015 at 12:02 AM, Michal Klos wrote: > Hi, > > I'm trying to set up mult

Re: How does GraphX stores the routing table?

2015-04-21 Thread Ankur Dave
On Tue, Apr 21, 2015 at 10:39 AM, mas wrote: > How does GraphX stores the routing table? Is it stored on the master node > or > chunks of the routing table is send to each partition that maintains the > record of vertices and edges at that node? > The latter: the routing table is stored alongsid

Re: problem writing to s3

2015-04-21 Thread Akhil Das
Can you look in your worker logs and see whats happening in there? Are you able to write the same to your HDFS? Thanks Best Regards On Wed, Apr 22, 2015 at 4:45 AM, Daniel Mahler wrote: > I am having a strange problem writing to s3 that I have distilled to this > minimal example: > > def jsonRa

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-21 Thread Akhil Das
You can enable this flag to run multiple jobs concurrently. It might not be production ready, but you can give it a try: sc.set("spark.streaming.concurrentJobs","2") Refer to TD's answer here
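For reference, the property is normally set on the SparkConf before the StreamingContext is created; a sketch with an illustrative app name and batch interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("parallel-streaming-jobs")
      .set("spark.streaming.concurrentJobs", "2") // undocumented/experimental, as noted above
    val ssc = new StreamingContext(conf, Seconds(5))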