RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
Can you provide the detailed failure call stack? From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 3, 2015 3:52 PM To: user@spark.apache.org Subject: Supporting Hive features in Spark SQL Thrift JDBC server Hi, According to Spark SQL documentation, "Spark SQL supports the va

Re: how to clean shuffle write each iteration

2015-03-03 Thread nitin
Shuffle write will be cleaned if it is not referenced by any object directly/indirectly. There is a garbage collector written inside spark which periodically checks for weak references to RDDs/shuffle write/broadcast and deletes them. -- View this message in context: http://apache-spark-user-li

Re: Exception while select into table.

2015-03-03 Thread LinQili
Hi Yi, Thanks for your reply. 1. The version of spark is 1.2.0 and the version of hive is 0.10.0-cdh4.2.1. 2. The full trace stack of the exception: 15/03/03 13:41:30 INFO Client: client token: DUrrav1rAADCnhQzX_Ic6CMnfqcW2NIxra5n8824CRFZQVJOX0NMSUVOVF9UT0tFTgA diagnostics: User cla

Re: how to clean shuffle write each iteration

2015-03-03 Thread lisendong
In ALS, I guess each iteration's RDDs are referenced by the next iteration's RDD, so none of the shuffle data will be deleted until the ALS job finishes… I guess checkpointing could solve my problem; do you know about checkpointing? > On March 3, 2015, at 4:18 PM, nitin [via Apache Spark User List] > wrote: > > S
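
For reference, a minimal sketch of the checkpointing idea under discussion, runnable in spark-shell; the checkpoint directory and the per-iteration work below are placeholders, not the poster's ALS code:

  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // placeholder path

  var current = sc.parallelize(1 to 1000000).map(i => (i % 100, i.toDouble))
  for (iter <- 1 to 50) {
    current = current.reduceByKey(_ + _).mapValues(_ * 0.9)   // stand-in for one iteration's work
    if (iter % 10 == 0) {
      current.checkpoint()   // cuts the lineage, so earlier shuffle files become unreferenced
      current.count()        // an action is needed to actually materialize the checkpoint
    }
  }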

gc time too long when using mllib als

2015-03-03 Thread lisendong
Why is the GC time so long? I'm using ALS in MLlib, and the garbage collection time is too long (about 1/3 of total time). I have tried some measures in the "Tuning Spark" guide and tried to set the new generation memory, but it still does not work... Tasks Task Index Task ID Stat

Re: gc time too long when using mllib als

2015-03-03 Thread Akhil Das
You need to increase the parallelism/repartition the data to a higher number to get rid of those. Thanks Best Regards On Tue, Mar 3, 2015 at 2:26 PM, lisendong wrote: > why is the gc time so long? > > I'm using als in mllib, while the garbage collection time is too long > (about 1/3 of tot
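
For illustration, a hedged sketch of raising parallelism before ALS training along these lines; the input path, partition count, rank, iteration count and lambda below are placeholder values, not the poster's settings:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  val ratings = sc.textFile("hdfs:///data/ratings")   // placeholder input path
    .map { line =>
      val Array(user, item, score) = line.split(",")
      Rating(user.toInt, item.toInt, score.toDouble)
    }
    .repartition(200)   // more, smaller tasks lower the per-task working set and GC pressure

  // rank = 10, iterations = 10, lambda = 0.01, blocks = 200 (blocks also drives ALS parallelism)
  val model = ALS.train(ratings, 10, 10, 0.01, 200)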

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
val sc: SparkContext = new SparkContext(conf) val sqlCassContext = new CassandraAwareSQLContext(sc) // I used some Calliope Cassandra Spark connector val rdd : SchemaRDD = sqlCassContext.sql("select * from db.profile " ) rdd.cache rdd.registerTempTable("profile") rdd.first //enforce

SparkSQL, executing an "OR"

2015-03-03 Thread Guillermo Ortiz
I'm trying to execute a query with Spark. (Example from the Spark Documentation) val teenagers = people.where('age >= 10).where('age <= 19).select('name) Is it possible to execute an OR with this syntax? val teenagers = people.where('age >= 10 'or 'age <= 4).where('age <= 19).select('name) I hav
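
As a hedged sketch, an actual OR can be written with the || operator of the Spark 1.2 SchemaRDD DSL, assuming the Catalyst expression operators (&& and ||) are in scope via the SQLContext implicits; people and the predicates below follow the documentation example:

  import sqlContext._   // Symbol -> attribute conversions, as in the Spark SQL examples

  val selected = people
    .where('age >= 10 || 'age <= 4)   // OR between the two predicates
    .where('age <= 19)                // chaining where() behaves as AND
    .select('name)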

Return jobid for a hive query?

2015-03-03 Thread Rex Xiong
Hi there, I have an app talking to Spark Hive Server using Hive ODBC, and querying is OK. But through this interface I can't get many running details when my query goes wrong; only one error message is shown. I want to get the jobid for my query, so that I can go to the Application Detail UI to see what's going o

Re: RDD partitions per executor in Cassandra Spark Connector

2015-03-03 Thread Pavel Velikhov
Hi, is there a paper or a document where one can read how Spark reads Cassandra data in parallel? And how it writes data back from RDDs? It's a bit hard to have a clear picture in mind. Thank you, Pavel Velikhov > On Mar 3, 2015, at 1:08 AM, Rumph, Frens Jan wrote: > > Hi all, > > I didn't fi

Re: One of the executor not getting StopExecutor message

2015-03-03 Thread twinkle sachdeva
Hi, Operations are not very extensive, as this scenario is not always reproducible. One of the executors starts behaving in this manner. For this particular application, we are using 8 cores in one executor, and practically, 4 executors are launched on one machine. This machine has good config wit

Is the RDD's Partitions determined before hand ?

2015-03-03 Thread Jeff Zhang
I mean is it possible to change the partition number at runtime. Thanks -- Best Regards Jeff Zhang

Re: Is the RDD's Partitions determined before hand ?

2015-03-03 Thread Sean Owen
An RDD has a certain fixed number of partitions, yes. You can't change an RDD. You can repartition() or coalesce() an RDD to make a new one with a different number of partitions, possibly requiring a shuffle. On Tue, Mar 3, 2015 at 10:21 AM, Jeff Zhang wrote: > I mean is it possible to change the parti
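
A small sketch illustrating the point, with a placeholder input path:

  val rdd = sc.textFile("hdfs:///data/input")    // placeholder path
  println(rdd.partitions.length)                 // fixed for this particular RDD

  val wider    = rdd.repartition(200)            // new RDD; performs a full shuffle
  val narrower = rdd.coalesce(10)                // new RDD; avoids a full shuffle when shrinking
  println(s"${wider.partitions.length} / ${narrower.partitions.length}")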

insert Hive table with RDD

2015-03-03 Thread patcharee
Hi, How can I insert into an existing Hive table from an RDD containing my data? Any examples? Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

delay between removing the block manager of an executor, and marking that as lost

2015-03-03 Thread twinkle sachdeva
Hi, Is there any relation between removing the block manager of an executor and marking that executor as lost? In my setup, even after removing the block manager (after failing to do some operation), it is taking more than 20 minutes to mark that executor as lost. Following are the logs: 15/03/03 10:26:49 WARN

java.lang.IncompatibleClassChangeError when using PrunedFilteredScan

2015-03-03 Thread taoewang
> > Hi, > > > > I’m trying to build the stratio spark-mongodb connector and got error > "java.lang.IncompatibleClassChangeError: class > com.stratio.deep.mongodb.MongodbRelation has interface > org.apache.spark.sql.sources.PrunedFilteredScan as super class” when trying > to create a table u

Re: RDDs

2015-03-03 Thread Kartheek.R
Hi TD, "You can always run two jobs on the same cached RDD, and they can run in parallel (assuming you launch the 2 jobs from two different threads)" Is this a correct way to launch jobs from two different threads? val threadA = new Thread(new Runnable { def run() { for(i<- 0 until e

Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi, Could you please let me know how to do this? (or) Any suggestion Regards, Rajesh On Mon, Mar 2, 2015 at 4:47 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a below edge list. How to find the parents path for every vertex? > > Example : > > Vertex 1 path : 2, 3,

RE: SparkSQL, executing an "OR"

2015-03-03 Thread Cheng, Hao
Using where('age >=10 && 'age <=4) instead. -Original Message- From: Guillermo Ortiz [mailto:konstt2...@gmail.com] Sent: Tuesday, March 3, 2015 5:14 PM To: user Subject: SparkSQL, executing an "OR" I'm trying to execute a query with Spark. (Example from the Spark Documentation) val teen

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Jaonary Rabarisoa
Here is my current implementation with the current master version of Spark: class DeepCNNFeature extends Transformer with HasInputCol with HasOutputCol ... { override def transformSchema(...) { ... } override def transform(dataSet: DataFrame, paramMap: ParamMap): DataFrame = {

RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
Hive UDF are only applicable for HiveContext and its subclass instance, is the CassandraAwareSQLContext a direct sub class of HiveContext or SQLContext? From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 3, 2015 5:10 PM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Support

LATERAL VIEW explode requests the full schema

2015-03-03 Thread matthes
I use "LATERAL VIEW explode(...)" to read data from a parquet-file but the full schema is requeseted by parquet instead just the used columns. When I didn't use LATERAL VIEW the requested schema has just the two columns which I use. Is it correct or is there place for an optimization or do I unders

RE: insert Hive table with RDD

2015-03-03 Thread Cheng, Hao
Use the SchemaRDD / DataFrame API via HiveContext. Assuming you're using the latest code, something probably like: val hc = new HiveContext(sc) import hc.implicits._ existedRdd.toDF().insertInto("hivetable") or existedRdd.toDF().registerTempTable("mydata") hc.sql("insert into hivetable as select
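
For reference, a hedged, self-contained version of the snippet above (Spark 1.3-style DataFrame API); the case class, the sample rows and the table names are placeholders, and hivetable is assumed to exist already:

  import org.apache.spark.sql.hive.HiveContext

  case class Record(id: Int, name: String)

  val hc = new HiveContext(sc)
  import hc.implicits._

  val existedRdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
  existedRdd.toDF().insertInto("hivetable")               // append into an existing Hive table
  // or register it and insert with SQL:
  existedRdd.toDF().registerTempTable("mydata")
  hc.sql("INSERT INTO TABLE hivetable SELECT * FROM mydata")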

RE: java.lang.IncompatibleClassChangeError when using PrunedFilteredScan

2015-03-03 Thread Cheng, Hao
As the call stack shows, the MongoDB connector is not compatible with the Spark SQL Data Source interface. The Data Source API changed in 1.2, so you probably need to confirm which Spark version the MongoDB connector was built against. By the way, a well-formatted call stack would be more hel

Re: SparkSQL, executing an "OR"

2015-03-03 Thread Guillermo Ortiz
thanks, it works. 2015-03-03 13:32 GMT+01:00 Cheng, Hao : > Using where('age >=10 && 'age <=4) instead. > > -Original Message- > From: Guillermo Ortiz [mailto:konstt2...@gmail.com] > Sent: Tuesday, March 3, 2015 5:14 PM > To: user > Subject: SparkSQL, executing an "OR" > > I'm trying to ex

spark.local.dir leads to "Job cancelled because SparkContext was shut down"

2015-03-03 Thread lisendong
As long as I set "spark.local.dir" to multiple disks, the job will fail; the errors are as follows (if I set spark.local.dir to only 1 dir, the job will succeed...): Exception in thread "main" org.apache.spark.SparkException: Job cancelled because SparkContext was shut down at org.

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-03 Thread Imran Rashid
The Scala syntax for arrays is Array[T], not T[], so you want to use something like: kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]]) kryo.register(classOf[Array[Short]]) Nonetheless, Spark should take care of this itself. I'll look into it later today. On Mon, Mar 2, 2015

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
You are right, CassandraAwareSQLContext is a subclass of SQLContext. But I did another experiment: I queried Cassandra using CassandraAwareSQLContext, then I registered the "rdd" as a temp table, next I tried to query it using HiveContext, but it seems that the hive context can not see the registere

[no subject]

2015-03-03 Thread shahab
I did an experiment with Hive and SQL context: I queried Cassandra using CassandraAwareSQLContext (a custom SQL context from Calliope), then I registered the "rdd" as a temp table, next I tried to query it using HiveContext, but it seems that hive context can not see the registered table using

Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread shahab
Hi, I did an experiment with Hive and SQL context: I queried Cassandra using CassandraAwareSQLContext (a custom SQL context from Calliope), then I registered the "rdd" as a temp table, next I tried to query it using HiveContext, but it seems that hive context can not see the registered table su

RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
The temp table in the metastore cannot be shared across SQLContext instances. Since HiveContext is a subclass of SQLContext (it inherits all of its functionality), why not use a single HiveContext globally? Is there any specific requirement in your case that you need multiple SQLContext/HiveContext?
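
As a minimal sketch of the "single HiveContext" suggestion, with placeholder table names, register and query the temp table from the same context instance:

  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)                          // use this one context everywhere
  val profiles = hc.sql("SELECT * FROM db.profile")     // placeholder source table
  profiles.registerTempTable("profile")

  hc.sql("SELECT count(*) FROM profile").collect().foreach(println)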

Re: RDDs

2015-03-03 Thread Manas Kar
The above is a great example using a thread. Does anyone have an example using a Scala/Akka Future to do the same? I am looking for an example that uses an Akka Future and does something if the Future times out. On Tue, Mar 3, 2015 at 7:00 AM, Kartheek.R wrote: > Hi TD, > "You can always
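
A hedged sketch of that idea with scala.concurrent Futures (which Akka builds on since Scala 2.10): two jobs run concurrently on the same cached RDD, and a TimeoutException from Await.result is the hook for the timeout case. The RDD contents and the 60-second timeout are placeholders.

  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration._
  import scala.concurrent.ExecutionContext.Implicits.global
  import java.util.concurrent.TimeoutException

  val cached = sc.parallelize(1 to 1000000).cache()

  val jobA = Future { cached.filter(_ % 2 == 0).count() }   // first job
  val jobB = Future { cached.map(_ * 2).max() }             // second job, runs concurrently

  try {
    val a = Await.result(jobA, 60.seconds)   // throws TimeoutException if the job takes too long
    val b = Await.result(jobB, 60.seconds)
    println(s"even count = $a, max doubled value = $b")
  } catch {
    case _: TimeoutException => println("a job timed out; handle it here")
  }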

Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processingmulti-line JSON?

2015-03-03 Thread Emre Sevinc
Hello, I have a Spark Streaming application (that uses Spark 1.2.1) that listens to an input directory, and when new JSON files are copied to that directory processes them, and writes them to an output directory. It uses a 3rd party library to process the multi-line JSON files ( https://github.co

Re: GraphX path traversal

2015-03-03 Thread Robin East
Rajesh, I'm not sure if I can help you; however, I don't even understand the question. Could you restate what you are trying to do? Sent from my iPhone > On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar > wrote: > > Hi, > > I have a below edge list. How to find the parents path for every ve

Re: RDD partitions per executor in Cassandra Spark Connector

2015-03-03 Thread Carl Yeksigian
These questions would be better addressed to the Spark Cassandra Connector mailing list, which can be found here: https://github.com/datastax/spark-cassandra-connector/#community Thanks, Carl On Tue, Mar 3, 2015 at 4:42 AM, Pavel Velikhov wrote: > Hi, is there a paper or a document where one ca

Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi Robin, Thank you for your response. Please find my question below. I have the below edge file (Source Vertex -> Destination Vertex): 1 -> 2, 2 -> 3, 3 -> 4, 4 -> 5, 5 -> 6, 6 -> 6. In this graph the 1st vertex is connected to the 2nd vertex, the 2nd vertex is connected to the 3rd vertex, ..., and the 6th vertex is connected to the 6th vertex. S

Re: On app upgrade, restore sliding window data.

2015-03-03 Thread Matus Faro
Thank you Arush, I've implemented initial data for a windowed operation and opened a pull request here: https://github.com/apache/spark/pull/4875 On Tue, Feb 24, 2015 at 4:49 AM, Arush Kharbanda wrote: > I think this could be of some help to you. > > https://issues.apache.org/jira/browse/SPARK-

Re: Running Spark jobs via oozie

2015-03-03 Thread nitinkak001
I am also starting to work on this one. Did you get any solution to this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-jobs-via-oozie-tp5187p21896.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

PRNG in Scala

2015-03-03 Thread Vijayasarathy Kannan
Hi, What pseudo-random-number generator does scala.util.Random use?

Re: PRNG in Scala

2015-03-03 Thread Robin East
This is more of a Java/Scala question than Spark - it uses java.util.Random: https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/Random.scala > On 3 Mar 2015, at 15:08, Vijayasarathy Kannan wrote: > > Hi, > > What pseudo-random-number generator does scala.util.Random use?

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1 (versions that we support). Seems https://issues.apache.org/jira/browse/HIVE-5472 added it to Hive recently. On Tue, Mar 3, 2015 at 6:03 AM, Cheng, Hao wrote: > The temp table in metastore can not be shared cross SQLContext

Re: GraphX path traversal

2015-03-03 Thread Robin East
What about the following which can be run in spark shell: import org.apache.spark._ import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"Three"), (4L,"Four"),(5L,"Five"),(6L,"Six")) val edgelist = Array(Edge(6,5,"6 to 5"),Edge(5,4,"

Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi, I have tried the below program using the Pregel API, but I'm not able to get my required output. I'm getting exactly the reverse of the output I'm expecting. // Creating graph using above mail mentioned edgefile val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, "/home/rajesh/Downloads/graphdata/da

Re: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: pyspark on yarn

2015-03-03 Thread Gustavo Enrique Salazar Torres
Hi Sam: Shouldn't you define the table schema? I had the same problem in Scala and then I solved it defining the schema. I did this: sqlContext.applySchema(dataRDD, tableSchema).registerTempTable(tableName) Hope it helps. On Mon, Jan 5, 2015 at 7:01 PM, Sam Flint wrote: > Below is the code th

Re: GraphX path traversal

2015-03-03 Thread Robin East
Have you tried EdgeDirection.In? > On 3 Mar 2015, at 16:32, Robin East wrote: > > What about the following which can be run in spark shell: > > import org.apache.spark._ > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > > val vertexlist = Array((1L,"One"), (2L,"Two"), (3L,"
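
For reference, a hedged sketch (runnable in spark-shell) of a Pregel traversal over the six-vertex edge list quoted in this thread, using EdgeDirection.In as suggested above; it is illustrative and not the original poster's code:

  import org.apache.spark.graphx._

  val edges = sc.parallelize(Seq(
    Edge(1L, 2L, ""), Edge(2L, 3L, ""), Edge(3L, 4L, ""),
    Edge(4L, 5L, ""), Edge(5L, 6L, ""), Edge(6L, 6L, "")))
  val graph = Graph.fromEdges(edges, List.empty[VertexId])

  // Each vertex accumulates the chain of vertices downstream of it. Messages flow from an
  // edge's destination back to its source, so the source learns "dst plus whatever dst knows".
  val paths = Pregel(graph, List.empty[VertexId], activeDirection = EdgeDirection.In)(
    vprog = (id, attr, msg) => (attr ++ msg).distinct,
    sendMsg = triplet => {
      val candidate = triplet.dstId :: triplet.dstAttr
      if (!candidate.toSet.subsetOf(triplet.srcAttr.toSet)) Iterator((triplet.srcId, candidate))
      else Iterator.empty
    },
    mergeMsg = (a, b) => if (a.length >= b.length) a else b)

  paths.vertices.collect().foreach { case (id, p) => println(s"Vertex $id path: ${p.mkString(", ")}") }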

Re: PRNG in Scala

2015-03-03 Thread Robin East
And this SO post goes into details on the PRNG in Java http://stackoverflow.com/questions/9907303/does-java-util-random-implementation-differ-between-jres-or-platforms > On 3 Mar 2015, at 16:15, Robin East wrote: > > This is more of a java/scala question than spark - it uses java.util.Random :

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Rohit Rai
Hello Shahab, I think CassandraAwareHiveContext in Calliope is what you are looking for. Create a CAHC instance and you should be able to run hive functions against

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
@Cheng :My problem is that the connector I use to query Spark does not support latest Hive (0.12, 0.13), But I need to perform Hive Queries on data retrieved from Cassandra. I assumed that if I get data out of cassandra in some way and register it as Temp table I would be able to query it using Hiv

Resource manager UI for Spark applications

2015-03-03 Thread Rohini joshi
Hi, I have 2 questions - 1. I was trying to use the Resource Manager UI for my Spark application using yarn cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct, or am I missing some setup?

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
Thanks Rohit, I am already using Calliope and quite happy with it, well done! Except for the fact that: 1- It seems that it does not support Hive 0.12 or higher, am I right? For example you can not use the current_time() UDF, or those new UDFs added in Hive 0.12. Are they supported? Any plan for sup

Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread abhi
I am trying to run the below Java class on a yarn cluster, but it hangs in the accepted state. I don't see any error. Below is the class and command. Any help is appreciated. Thanks, Abhi bin/spark-submit --class com.mycompany.app.SimpleApp --master yarn-cluster /home/hduser/my-app-1.0.jar

Re: Resource manager UI for Spark applications

2015-03-03 Thread Rohini joshi
Sorry for the half email - here it is again in full. Hi, I have 2 questions - 1. I was trying to use the Resource Manager UI for my Spark application using yarn cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct, or am I missing some setup? 2. When I click on Appli

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Rohit Rai
The Hive dependency comes from spark-hive. It does work with Spark 1.1; we will have the 1.2 release later this month. On Mar 3, 2015 8:49 AM, "shahab" wrote: > > Thanks Rohit, > > I am already using Calliope and quite happy with it, well done ! except > the fact that : > 1- It seems that it does

Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-03 Thread Krishnanand Khambadkone
Hello Ted, Some progress: now it seems that the spark job does get submitted; in the Spark web UI, I do see it under finished drivers. However, it seems to not go past this step: JavaPairReceiverInputDStream messages = KafkaUtils.createStream(jsc, "localhost:2181", "aa", topicMap); I d

Re: LBGFS optimizer performace

2015-03-03 Thread Gustavo Enrique Salazar Torres
Just did with the same error. I think the problem is the "data.count()" call in LBFGS because for huge datasets that's naive to do. I was thinking to write my version of LBFGS but instead of doing data.count() I will pass that parameter which I will calculate from a Spark SQL query. I will let you

Re: Resource manager UI for Spark applications

2015-03-03 Thread Ted Yu
bq. spark UI does not work for Yarn-cluster. Can you be a bit more specific on the error(s) you saw ? What Spark release are you using ? Cheers On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi wrote: > Sorry , for half email - here it is again in full > Hi , > I have 2 questions - > > 1. I was t

Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-03 Thread Jaonary Rabarisoa
Dear all, Is there a least square solver based on DistributedMatrix that we can use out of the box in the current (or the master) version of spark ? It seems that the only least square solver available in spark is private to recommender package. Cheers, Jao

Re: Shared Drivers

2015-03-03 Thread Timothy Chen
Hi John, I think there are limitations in the way drivers are designed that require a separate JVM process per driver, so it's not possible without code and design changes AFAIK. A driver shouldn't stay open past your job's lifetime though, so while not sharing between apps it s

Issue using S3 bucket from Spark 1.2.1 with hadoop 2.4

2015-03-03 Thread Ankur Srivastava
Hi, We recently upgraded to the Spark 1.2.1 - Hadoop 2.4 binary. We do not have any other dependency on hadoop jars, except for reading our source files from S3. Since we upgraded to the latest version our reads from S3 have slowed down considerably. For some jobs we see the read from S3 is s

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Yin Huai
@Shahab, based on https://issues.apache.org/jira/browse/HIVE-5472, current_date was added in Hive *1.2.0 (not 0.12.0)*. In my previous email, I meant that current_date is in neither Hive 0.12.0 nor Hive 0.13.1 (Spark SQL currently supports these two Hive versions). On Tue, Mar 3, 2015 at 8:55 AM,

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
Thanks Rohit, yes my mistake, it does work with 1.1 (I am actually running it on Spark 1.1). But do you mean that even the HiveContext of Spark (not the Calliope CassandraAwareHiveContext) does not support Hive 0.12? best, /Shahab On Tue, Mar 3, 2015 at 5:55 PM, Rohit Rai wrote: > The Hive dependenc

Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-03 Thread Srini Karri
Hi All, I am having trouble finding data related to my requirement. Here is the context, I have tried Standalone Spark Installation on Windows, I am able to submit the logs, able to see the history of events. My question is, is it possible to achieve the same monitoring UI experience with Yarn Clu

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Hi Ted, I used s3://support.elasticmapreduce/spark/install-spark to install spark on my EMR cluster. It is 1.2.0. When I click on the link for history or logs it takes me to http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop

Re: Issue using S3 bucket from Spark 1.2.1 with hadoop 2.4

2015-03-03 Thread Ted Yu
If you can use hadoop 2.6.0 binary, you can use s3a s3a is being polished in the upcoming 2.7.0 release: https://issues.apache.org/jira/browse/HADOOP-11571 Cheers On Tue, Mar 3, 2015 at 9:44 AM, Ankur Srivastava wrote: > Hi, > > We recently upgraded to Spark 1.2.1 - Hadoop 2.4 binary. We are n

Re: Issue using S3 bucket from Spark 1.2.1 with hadoop 2.4

2015-03-03 Thread Ankur Srivastava
Thanks a lot Ted!! On Tue, Mar 3, 2015 at 9:53 AM, Ted Yu wrote: > If you can use hadoop 2.6.0 binary, you can use s3a > > s3a is being polished in the upcoming 2.7.0 release: > https://issues.apache.org/jira/browse/HADOOP-11571 > > Cheers > > On Tue, Mar 3, 2015 at 9:44 AM, Ankur Srivastava < >

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
@Yin: sorry for my mistake, you are right, it was added in 1.2, not 0.12.0, my bad! On Tue, Mar 3, 2015 at 6:47 PM, shahab wrote: > Thanks Rohit, yes my mistake, it does work with 1.1 ( I am actually > running it on spark 1.1) > > But do you mean that even HiveConext of spark (nit Calliope > Ca

Re: Resource manager UI for Spark applications

2015-03-03 Thread Ted Yu
bq. changing the address with internal to the external one , but still does not work. Not sure what happened. For the time being, you can use yarn command line to pull container log (put in your appId and container Id): yarn logs -applicationId application_1386639398517_0007 -containerId container_

[no subject]

2015-03-03 Thread Jianshi Huang
Hi, I got this error message: 15/03/03 10:22:41 ERROR OneForOneBlockFetcher: Failed while starting block fetches java.lang.RuntimeException: java.io.FileNotFoundException: /hadoop01/scratch/local/usercache/jianshuang/appcache/application_1421268539738_202330/spark-local-20150303100549-fc3b/02/shu

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-03 Thread Todd Nist
Hi Srini, If you start the $SPARK_HOME/sbin/start-history-server, you should be able to see the basic spark ui. You will not see the master, but you will be able to see the rest as I recall. You also need to add an entry into the spark-defaults.conf, something like this: *## Make sure the host
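
For reference, a sketch of the kind of spark-defaults.conf entries being alluded to; the host name and log directory below are placeholders:

  spark.eventLog.enabled            true
  spark.eventLog.dir                hdfs:///spark-history
  spark.history.fs.logDirectory     hdfs:///spark-history
  spark.yarn.historyServer.address  historyhost:18080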

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-03 Thread Shivaram Venkataraman
There are a couple of solvers that I've written that are part of the AMPLab ml-matrix repo [1,2]. These aren't part of MLlib yet though, and if you are interested in porting them I'd be happy to review it. Thanks Shivaram [1] https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
Sorry that I forgot the subject. And in the driver, I got many FetchFailedException. The error messages are 15/03/03 10:34:32 WARN TaskSetManager: Lost task 31.0 in stage 2.2 (TID 7943, ): FetchFailed(BlockManagerId(86, , 43070), shuffleId=0, mapId=24, reduceId=1220, message= org.apache.s

UnsatisfiedLinkError related to libgfortran when running MLLIB code on RHEL 5.8

2015-03-03 Thread Prashant Sharma
Hi Folks, We are trying to run the following code from the spark shell in a CDH 5.3 cluster running on RHEL 5.8. spark-shell --master yarn --deploy-mode client --num-executors 15 --executor-cores 6 --executor-memory 12G import org.apache.spark.mllib.recommendation.ALS import org.apache.spa

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Aaron Davidson
"Failed to connect" implies that the executor at that host died, please check its logs as well. On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang wrote: > Sorry that I forgot the subject. > > And in the driver, I got many FetchFailedException. The error messages are > > 15/03/03 10:34:32 WARN TaskS

Why different numbers of partitions give different results for the same computation on the same dataset?

2015-03-03 Thread Saiph Kappa
Hi, I have a spark streaming application, running on a single node, consisting mainly of map operations. I perform repartitioning to control the number of CPU cores that I want to use. The code goes like this: val ssc = new StreamingContext(sparkConf, Seconds(5)) > val distFile = ssc.text

Re: Spark Monitoring UI for Hadoop Yarn Cluster

2015-03-03 Thread Marcelo Vanzin
Spark applications shown in the RM's UI should have an "Application Master" link when they're running. That takes you to the Spark UI for that application where you can see all the information you're looking for. If you're running a history server and add "spark.yarn.historyServer.address" to your

Re: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: pyspark on yarn

2015-03-03 Thread Michael Armbrust
In Spark 1.2 you'll have to create a partitioned hive table in order to read parquet data in this format. In Spark 1.3 the parquet data source will auto discover partitions when they are laid out

Re: gc time too long when using mllib als

2015-03-03 Thread Xiangrui Meng
Also try 1.3.0-RC1 or the current master. ALS should perform much better in 1.3. -Xiangrui On Tue, Mar 3, 2015 at 1:00 AM, Akhil Das wrote: > You need to increase the parallelism/repartition the data to a higher number > to get rid of those. > > Thanks > Best Regards > > On Tue, Mar 3, 2015

Re: insert Hive table with RDD

2015-03-03 Thread Jagat Singh
Will this recognize the Hive partitions as well, for example inserting into a specific partition of a Hive table? On Tue, Mar 3, 2015 at 11:42 PM, Cheng, Hao wrote: > Using the SchemaRDD / DataFrame API via HiveContext > > Assume you're using the latest code, something probably like: > > val hc = new HiveCont

Re: Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread Michael Armbrust
As it says in the API docs, tables created with registerTempTable are local to the context that creates them: ... The lifetime of this temporary table is tied to the SQLContext >

Re: LATERAL VIEW explode requests the full schema

2015-03-03 Thread Michael Armbrust
I believe that this has been optimized in Spark 1.3. On Tue, Mar 3, 2015 at 4:36 AM, matthes wrote: > I use "LATERAL VIEW explode(...)" to read data from a parquet-file but the > full schema is requested by parque

dynamically change receiver for a spark stream

2015-03-03 Thread Islem
Hi all, I have been trying to set up a stream using a custom receiver that would pick up data from Twitter, using the follow function to listen to just some users. I'd like to keep that stream context running and dynamically change the custom receiver by adding ids of users that I'd listen to.

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Joseph Bradley
I see. I think your best bet is to create the cnnModel on the master and then serialize it to send to the workers. If it's big (1M or so), then you can broadcast it and use the broadcast variable in the UDF. There is not a great way to do something equivalent to mapPartitions with UDFs right now

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-03 Thread Joseph Bradley
The minimization problem you're describing in the email title also looks like it could be solved using the RidgeRegression solver in MLlib, once you transform your DistributedMatrix into an RDD[LabeledPoint]. On Tue, Mar 3, 2015 at 11:02 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrot
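
For illustration, a hedged sketch of that RidgeRegression route with a tiny placeholder dataset: each row of A becomes a feature vector, each entry of b becomes a label, and regParam plays the role of lambda (the iteration count and step size are placeholders too):

  import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}
  import org.apache.spark.mllib.linalg.Vectors

  val data = sc.parallelize(Seq(                       // rows of A paired with entries of b
    LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
    LabeledPoint(2.0, Vectors.dense(2.0, 1.0, -1.0)),
    LabeledPoint(3.0, Vectors.dense(2.0, 1.3, 1.0)))).cache()

  // numIterations = 100, stepSize = 1.0, regParam = 0.01
  val model = RidgeRegressionWithSGD.train(data, 100, 1.0, 0.01)
  println(model.weights)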

Re: LBGFS optimizer performace

2015-03-03 Thread Joseph Bradley
Is that error actually occurring in LBFGS? It looks like it might be happening before the data even gets to LBFGS. (Perhaps the outer join you're trying to do is making the dataset size explode a bit.) Are you able to call count() (or any RDD action) on the data before you pass it to LBFGS? On

Re: Resource manager UI for Spark applications

2015-03-03 Thread Zhan Zhang
In Yarn (cluster or client), you can access the Spark UI while the app is running. After the app is done, you can still access it, but you need some extra setup for the history server. Thanks. Zhan Zhang On Mar 3, 2015, at 10:08 AM, Ted Yu wrote: bq. changing the address with

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-03 Thread Zhan Zhang
Do you have enough resources in your cluster? You can check your resource manager to see the usage. Thanks. Zhan Zhang On Mar 3, 2015, at 8:51 AM, abhi wrote: I am trying to run below java class with yarn cluster, but it hangs in accepted state. I don't see a

Re: Why different numbers of partitions give different results for the same computation on the same dataset?

2015-03-03 Thread Saiph Kappa
Sorry I made a mistake in my code. Please ignore my question number 2. Different numbers of partitions give *the same* results! On Tue, Mar 3, 2015 at 7:32 PM, Saiph Kappa wrote: > Hi, > > I have a spark streaming application, running on a single node, consisting > mainly of map operations. I p

Re: throughput in the web console?

2015-03-03 Thread Saiph Kappa
Sorry I made a mistake. Please ignore my question. On Tue, Mar 3, 2015 at 2:47 AM, Saiph Kappa wrote: > I performed repartitioning and everything went fine with respect to the > number of CPU cores being used (and respective times). However, I noticed > something very strange: inside a map opera

Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread S. Zhou
I did some experiments and it seems not. But I'd like to get confirmation (or perhaps I missed something). If it does support them, could you let me know how to specify multiple folders? Thanks. Senqiang

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
Ted, If the application is running then the logs are not available. Plus, what I want to view is the details about the running app, as in the Spark UI. Do I have to open some ports or make some other setting changes? On Tue, Mar 3, 2015 at 10:08 AM, Ted Yu wrote: > bq. changing the address with inter

Re: Resource manager UI for Spark applications

2015-03-03 Thread roni
ah!! I think I know what you mean. My job was just in the "accepted" stage for a long time as it was running a huge file. But now that it is in the running stage, I can see it. I can see it at port 9046 though, instead of 4040. But I can see it. Thanks -roni On Tue, Mar 3, 2015 at 1:19 PM, Zhan Zhang

Re: Why different numbers of partitions give different results for the same computation on the same dataset?

2015-03-03 Thread Tathagata Das
You can use DStream.transform() to do any arbitrary RDD transformations on the RDDs generated by a DStream. val coalescedDStream = myDStream.transform { _.coalesce(...) } On Tue, Mar 3, 2015 at 1:47 PM, Saiph Kappa wrote: > Sorry I made a mistake in my code. Please ignore my question number 2

Re: LBGFS optimizer performace

2015-03-03 Thread Gustavo Enrique Salazar Torres
Yeah, I can call count before that and it works. Also I was over caching tables but I removed those. Now there is no caching but it gets really slow since it calculates my table RDD many times. Also hacked the LBFGS code to pass the number of examples which I calculated outside in a Spark SQL query

java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils

2015-03-03 Thread Krishnanand Khambadkone
Hi,  When I submit my spark job, I see the following runtime exception in the log, Exception in thread "Thread-1" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils at SparkHdfs.run(SparkHdfs.java:56) Caused by: java.lang.ClassNotFoundException: org.apache.spark.

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Looking at scaladoc: /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */ def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]] Your conclusion is confirmed. On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou wrote: > I did some experiments and it seems not. But I like to get con

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread S. Zhou
Thanks Ted. Actually a follow-up question. I need to read multiple HDFS files into RDDs. What I am doing now is: for each file I read it into an RDD, then later on I union all these RDDs into one RDD. I am not sure if it is the best way to do it. Thanks, Senqiang On Tuesday, March 3, 2015 2

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Sean Owen
This API reads a directory of files, not one file. A "file" here really means a directory full of part-* files. You do not need to read those separately. Any syntax that works with Hadoop's FileInputFormat should work. I thought you could specify a comma-separated list of paths? maybe I am imagini

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Stephen Boesch
The sc.textFile() invokes the Hadoop FileInputFormat via the (subclass) TextInputFormat. Inside the logic does exist to do the recursive directory reading - i.e. first detecting if an entry were a directory and if so then descending: for (FileStatus

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Looking at FileInputFormat#listStatus(): // Whether we need to recursive look into the directory structure boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false); where: public static final String INPUT_DIR_RECURSIVE = "mapreduce.input.fileinputformat.input.dir.recursive"
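
Putting the thread together, a hedged sketch of both approaches; the paths below are placeholders:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // 1) A comma-separated list of paths; Hadoop's FileInputFormat splits on commas.
  val multi = sc.textFile("hdfs:///data/2015-03-01,hdfs:///data/2015-03-02")

  // 2) Recursive traversal of nested directories via the flag quoted above.
  sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
  val nested = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/root")
  val lines = nested.map(_._2.toString)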
