Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-15 Thread DB Tsai
In the LBFGS version of logistic regression, the data is properly standardized, so this should not happen. Can you provide a copy of your dataset to us so we can test it? If the dataset cannot be made public, can you just send me a copy so I can dig into this? I'm the author of LORWithLBFGS. Thanks.

NullPointerException due to mapVertices function in GraphX

2015-03-15 Thread James
I got a NullPointerException in aggregateMessages on a graph that is the output of the mapVertices function of another graph. I found the problem is that the mapVertices function did not affect all the triplets of the graph. // Initial the graph, assign a counter to each vertex that contains the ve
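
A minimal sketch of the pattern being described (hedged: the graph source, vertex attribute, and message logic are illustrative assumptions, not the poster's actual code; `sc` is an existing SparkContext):

```scala
import org.apache.spark.graphx._

// Load any graph (path is illustrative).
val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc, "hdfs:///edges.txt")

// Assign a counter to each vertex, as in the post.
val counters: Graph[Long, Int] = graph.mapVertices((id, attr) => 0L)

// Aggregate over edges -- the step where the NPE was reported.
val msgs: VertexRDD[Long] = counters.aggregateMessages[Long](
  ctx => ctx.sendToDst(ctx.srcAttr + 1L),  // sendMsg reads the new vertex attribute
  _ + _                                    // mergeMsg
)
```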

Re: Spark Streaming on Yarn Input from Flume

2015-03-15 Thread tarek_abouzeid
Have you fixed this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-Yarn-Input-from-Flume-tp11755p22055.html

Re: 1.3 release

2015-03-15 Thread Sean Owen
I think (I hope) it's because the generic builds "just work". Even though these are of course distributed mostly verbatim in CDH5, with tweaks to be compatible with other stuff at the edges, the stock builds should be fine too. Same for HDP as I understand. The CDH4 build may work on some builds o

Re: order preservation with RDDs

2015-03-15 Thread Sean Owen
Yes, I don't think this is entirely reliable in general. I would emit (label,features) pairs and then transform the values. In practice, this may happen to work fine in simple cases. On Sun, Mar 15, 2015 at 3:51 AM, kian.ho wrote: > Hi, I was taking a look through the mllib examples in the offici
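
A minimal sketch of that approach (hedged: `data` and the scaler are illustrative; MLlib's StandardScaler stands in for whatever per-vector transform is being applied):

```scala
import org.apache.spark.SparkContext._   // pair-RDD functions (needed on pre-1.3 Spark)
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// data: RDD[LabeledPoint], loaded elsewhere (assumption).
val pairs = data.map(lp => (lp.label, lp.features))     // each label rides along with its vector
val scaler = new StandardScaler(withMean = false, withStd = true).fit(pairs.values)
val scaled = pairs.mapValues(v => scaler.transform(v))  // values transformed; pairing preserved
```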

Software stack for Recommendation engine with Spark MLlib

2015-03-15 Thread Shashidhar Rao
Hi, Can anyone who has developed a recommendation engine suggest what the possible software stack for such an application could be? I am basically new to recommendation engines; I just found out that Mahout and Spark MLlib are available. I am thinking of the software stack below. 1. The user is goin

Re: Software stack for Recommendation engine with Spark MLlib

2015-03-15 Thread Sean Owen
I think you're assuming that you will pre-compute recommendations and store them in Mongo. That's one way to go, with certain tradeoffs. You can precompute offline easily, and serve results at large scale easily, but, you are forced to precompute everything -- lots of wasted effort, not completely

Re: Software stack for Recommendation engine with Spark MLlib

2015-03-15 Thread Shashidhar Rao
Thanks Sean, your suggestions and the links provided are just what I needed to start off with. On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen wrote: > I think you're assuming that you will pre-compute recommendations and > store them in Mongo. That's one way to go, with certain tradeoffs. You > can

Re: Spark Release 1.3.0 DataFrame API

2015-03-15 Thread David Mitchell
Thank you for your help. "toDF()" solved my first problem. And, the second issue was a non-issue, since the second example worked without any modification. David On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav wrote: > programmatically specifying Schema needs > > import org.apache.spark.sql.ty
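
For reference, a minimal sketch of the Spark 1.3 idiom that was likely missing (the Person class and data are illustrative; `sc` is an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

// Define case classes at the top level so Spark SQL's reflection can see them.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._              // brings toDF() into scope in Spark 1.3

val df = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25))).toDF()
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 28").show()
```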

[Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Renato Marroquín Mogrovejo
Hi Spark experts, Is there a way to convert a JavaSchemaRDD (for instance, loaded from a Parquet file) back to a JavaRDD of a given case class? I read on StackOverflow[1] that I could do a select over the Parquet file and then get the fields out by reflection, but I guess that would be overkill.

Re: deploying Spark on standalone cluster

2015-03-15 Thread tarek_abouzeid
I was having a similar issue, but it was in Spark and Flume integration: I was getting a 'failed to bind' error, but got it fixed by shutting down the firewall on both machines (make sure: service iptables status => firewall stopped). -- View this message in context: http://apache-spark-user-list.1001

Saving DStream into a single file

2015-03-15 Thread tarek_abouzeid
I am doing the word count example on a Flume stream and trying to save the output as text files in HDFS, but in the save directory I got multiple subdirectories, each having files of small size. I wonder if there is a way to append to one large file instead of saving many small files, as I intend to save
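
HDFS append is not something saveAsTextFiles offers, but here is a hedged sketch of a common workaround (names and paths are illustrative): coalesce each batch to one partition so every batch directory holds a single part file, merging downstream if one file is truly required.

```scala
// wordCounts: DStream[(String, Long)] from the streaming word count (assumption).
wordCounts.foreachRDD { (rdd, time) =>
  // One partition => a single part-00000 per batch directory.
  rdd.coalesce(1).saveAsTextFile(s"hdfs:///wordcount/batch-${time.milliseconds}")
}
```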

Re: [Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Cheng Lian
Currently there’s no convenient way to convert a SchemaRDD/JavaSchemaRDD back to an RDD/JavaRDD of some case class. But you can convert a SchemaRDD/JavaSchemaRDD into an RDD[Row]/JavaRDD using schemaRdd.rdd and new JavaRDD(schemaRdd.rdd). Cheng On 3/15/15 10:22 PM, Renato Marroquín
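
A sketch of that conversion in Scala (hedged: the Record class, column order, and Parquet path are assumptions):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}

case class Record(id: Int, name: String)   // hypothetical target class

val sqlContext = new SQLContext(sc)
val schemaRdd = sqlContext.parquetFile("hdfs:///data.parquet")  // path illustrative
// A SchemaRDD is itself an RDD[Row]; map each Row back by column position.
val records: RDD[Record] = schemaRdd.map(r => Record(r.getInt(0), r.getString(1)))
```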

Re: Need Advice about reading lots of text files

2015-03-15 Thread Pat Ferrel
Ah most interesting—thanks. So it seems sc.textFile(longFileList) has to read all metadata before starting the read for partitioning purposes so what you do is not use it? You create a task per file that reads one file (in parallel) per task without scanning for _all_ metadata. Can’t argue wit
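
A hedged sketch of that "one task per file" idea (the file list, paths, and line-reading logic are illustrative; streams are left unclosed for brevity):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val files = Seq("hdfs:///data/part-0001.txt", "hdfs:///data/part-0002.txt")
// One partition per file: each task opens and reads exactly one file,
// so there is no up-front metadata scan over the whole list.
val lines = sc.parallelize(files, files.size).flatMap { p =>
  val fs = FileSystem.get(new URI(p), new Configuration())
  Source.fromInputStream(fs.open(new Path(p))).getLines()
}
```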

Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread Cheng Lian
Spark SQL supports most commonly used features of HiveQL. However, different HiveQL statements are executed in different manners: 1. DDL statements (e.g. CREATE TABLE, DROP TABLE, etc.) and commands (e.g. SET = , ADD FILE, ADD JAR, etc.) In most cases, Spark SQL simply dele

Re: Issue with yarn cluster - hangs in accepted state.

2015-03-15 Thread abhi
Thanks, It worked. -Abhi On Tue, Mar 3, 2015 at 5:15 PM, Tobias Pfeiffer wrote: > Hi, > > On Wed, Mar 4, 2015 at 6:20 AM, Zhan Zhang wrote: > >> Do you have enough resource in your cluster? You can check your resource >> manager to see the usage. >> > > Yep, I can confirm that this is a very

Re: Problem connecting to HBase

2015-03-15 Thread HARIPRIYA AYYALASOMAYAJULA
Hello all, Thank you for your responses. I did try to include the zookeeper.znode.parent property in the hbase-site.xml. It still continues to give the same error. I am using Spark 1.2.0 and hbase 0.98.9. Could you please suggest what else could be done? On Fri, Mar 13, 2015 at 10:25 PM, Ted Y

Re: Writing wide parquet file in Spark SQL

2015-03-15 Thread Cheng Lian
This article by Ryan Blue should be helpful to understand the problem: http://ingest.tips/2015/01/31/parquet-row-group-size/ The TL;DR is, you may decrease parquet.block.size to reduce memory consumption. Anyway, 100K columns is a really big burden for Parquet, but I guess your data should be
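
A minimal sketch of that tuning (hedged: the 16 MB figure, the writing RDD, and the output path are illustrative):

```scala
// Writer memory grows roughly with row-group size times column count, so a
// smaller Parquet row group ("block") reduces peak memory for very wide schemas.
sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)
schemaRdd.saveAsParquetFile("hdfs:///out/wide_table.parquet")  // schemaRdd: a Spark SQL SchemaRDD (assumption)
```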

Re: Software stack for Recommendation engine with Spark MLlib

2015-03-15 Thread Nick Pentreath
As Sean says, precomputing recommendations is pretty inefficient. Though with 500k items it's easy to get all the item vectors in memory, so pre-computing is not too bad. Still, since you plan to serve these via a REST service anyway, computing on demand via a serving layer such as Oryx or Pre
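
For the MLlib side of either setup, a hedged sketch of offline training plus on-demand scoring (the data path, input format, and ALS parameters are illustrative):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Offline: train a factor model from (user, item, rating) triples.
val ratings = sc.textFile("hdfs:///ratings.csv").map { line =>
  val Array(u, i, r) = line.split(',')
  Rating(u.toInt, i.toInt, r.toDouble)
}
val model = ALS.train(ratings, 50, 10, 0.01)  // rank, iterations, lambda

// On demand (e.g. behind a REST endpoint): top-10 items for user 42.
val top10 = model.recommendProducts(42, 10)
```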

Re: Read Parquet file from scala directly

2015-03-15 Thread Cheng Lian
The parquet-tools code should be pretty helpful (although it's Java) https://github.com/apache/incubator-parquet-mr/tree/master/parquet-tools/src/main/java/parquet/tools/command On 3/10/15 12:25 AM, Shuai Zheng wrote: Hi All, I have a lot of parquet files, and I try to open them directly inst

Submitting spark application using Yarn Rest API

2015-03-15 Thread Srini Karri
Hi All, I am trying to submit a Spark application using the YARN REST API. I am able to submit the application, but the final status shows as 'UNDEFINED'. A couple of other observations: the user shows as Dr.who, and the application type is empty though I specify it as Spark. Has anyone had this problem before? I am

Re: From Spark web ui, how to prove the parquet column pruning working

2015-03-15 Thread Cheng Lian
Hey Yong, It seems that Hadoop `FileSystem` adds the size of a block to the metrics even if you only touch a fraction of it (reading Parquet metadata for example). This behavior can be verified by the following snippet: ```scala import org.apache.spark.sql.Row import org.apache.spark.sql.SQL

Re: Software stack for Recommendation engine with Spark MLlib

2015-03-15 Thread Shashidhar Rao
Thanks Nick, for your suggestions. On Sun, Mar 15, 2015 at 10:41 PM, Nick Pentreath wrote: > As Sean says, precomputing recommendations is pretty inefficient. Though > with 500k items its easy to get all the item vectors in memory so > pre-computing is not too bad. > > Still, since you plan to s

Re: Running spark function on parquet without sql

2015-03-15 Thread Cheng Lian
That's an unfortunate documentation bug in the programming guide... We failed to update it after making the change. Cheng On 2/28/15 8:13 AM, Deborah Siegel wrote: Hi Michael, Would you help me understand the apparent difference here.. The Spark 1.2.1 programming guide indicates: "Note tha

Re: Is there any problem in having a long opened connection to spark sql thrift server

2015-03-15 Thread Cheng Lian
It should be OK. If you encounter problems having a long-open connection to the Thrift server, it should be a bug. Cheng On 3/9/15 6:41 PM, fanooos wrote: I have some applications developed using PHP and currently we have a problem in connecting these applications to spark sql thrift se

Benchmarks of 'Hive on Tez' vs 'Hive on Spark' vs Spark SQL

2015-03-15 Thread Slim Baltagi
Hi I would like to share with you my comments on Hortonworks' benchmarks of 'Hive on Tez' vs 'Hive on Spark' vs 'Spark SQL'. Please check them in my related blog entry at http://goo.gl/K5mk0U Thanks Slim Baltagi Chicago, IL http://www.SparkBigData.com

Slides of my talk in LA: 'Spark or Hadoop: is it an either-or proposition?'

2015-03-15 Thread Slim Baltagi
Hi I would like to share with you the slide deck of my talk titled "Spark or Hadoop: is it an either-or proposition?" that I gave at the Los Angeles Spark Users Group on March 12, 2015. Please check it on slideshare.net at http://goo.gl/U4l1rI Thanks Slim Baltagi Chicago, IL http://www.Spark

Re: Problem connecting to HBase

2015-03-15 Thread Ted Yu
"org.apache.hbase" % "hbase" % "0.98.9-hadoop2" % "provided", There is no module in hbase 0.98.9 called hbase. But this would not be the root cause of the error. Most likely hbase-site.xml was not picked up. Meaning this is classpath issue. On Sun, Mar 15, 2015 at 10:04 AM, HARIPRIYA AYYALASOMA

Re: Streaming linear regression example question

2015-03-15 Thread Margus Roo
Hi again. Tried the same examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala from 1.3.0 and getting, in case the testing file content is: (0.0,[3.0,4.0,3.0]) (0.0,[4.0,4.0,4.0]) (4.0,[5.0,5.0,5.0]) (5.0,[5.0,6.0,6.0]) (6.0,[7.0,4.0,7.0]) (7.0,[8.0,6.0,

Re: AWS SDK HttpClient version conflict (spark.files.userClassPathFirst not working)

2015-03-15 Thread Adam Lewandowski
Just following up on this issue. I discovered that when I ran the application in a YARN cluster (on AWS EMR), I was able to use the AWS SDK without issue (without the 'spark.files.userClassPathFirst' flag set). Also, I learned that the entire 'child-first' classloader setup was changed in Spark 1.3.0 (r

Re: Spark 1.2 – How to change Default (Random) port ….

2015-03-15 Thread Shailesh Birari
Hi SM, Apologies for the delayed response. No, the issue is with Spark 1.2.0; there is a bug in Spark 1.2.0. Spark recently made the 1.3.0 release, so it might be fixed there. I am not planning to test it soon, maybe after some time. You can try it. Regards, Shailesh -- View this messa
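
For reference, a hedged sketch of pinning the ports that Spark 1.2 otherwise picks at random (port values are illustrative; whether each setting is honored in 1.2.0 is exactly the bug under discussion):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.port",       "40000")
  .set("spark.fileserver.port",   "40001")
  .set("spark.broadcast.port",    "40002")
  .set("spark.blockManager.port", "40003")
  .set("spark.executor.port",     "40004")
```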

Re: Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread bit1...@163.com
Thanks Cheng for the great explanation! bit1...@163.com From: Cheng Lian Date: 2015-03-16 00:53 To: bit1...@163.com; Wang, Daoyuan; user Subject: Re: Explanation on the Hive in the Spark assembly Spark SQL supports most commonly used features of HiveQL. However, different HiveQL statements ar

Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Hi, all Noting that the current Spark releases are built with Tachyon 0.5.0, if we want to recompile Spark with Maven targeting a specific Tachyon version (let's say the most recent 0.6.0 release), how should that be done? What should the Maven compile command look like? Thanks, Sun. figh

RE: Building spark over specified tachyon

2015-03-15 Thread Shao, Saisai
I think you could change the pom file under the Spark project to update the Tachyon-related dependency version and rebuild again (in case the API is compatible and the behavior is the same). I'm not sure whether there is any command you can use to compile against a given Tachyon version. Thanks Jerry From: fightf...

Re: RE: Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Thanks, Jerry. I get that approach. Just to make sure: is there some option to directly specify the Tachyon version? fightf...@163.com From: Shao, Saisai Date: 2015-03-16 11:10 To: fightf...@163.com CC: user Subject: RE: Building spark over specified tachyon I think you could change the

Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rohit U
Hi, I am trying to run LogisticRegressionWithSGD on RDD of LabeledPoints loaded using loadLibSVMFile: val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "s3n://logistic-regression/epsilon_normalized") val model = LogisticRegressionWithSGD.train(logistic, 100) It gives an input valida

Re: Spark Release 1.3.0 DataFrame API

2015-03-15 Thread Khanderao Kand
toDF() works very well -- thanks On Sun, Mar 15, 2015 at 6:12 AM, David Mitchell wrote: > > Thank you for your help. "toDF()" solved my first problem. And, the > second issue was a non-issue, since the second example worked without any > modification. > > David > > > On Sun, Mar 15, 2015 at 1:

Re: Trouble launching application that reads files

2015-03-15 Thread robert.tunney
I figured out how to use local files with "file://", but not with either the persistent or ephemeral HDFS. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Trouble-launching-application-that-reads-files-tp22065p22068.html

Re: Streaming linear regression example question

2015-03-15 Thread Jeremy Freeman
Hi Margus, thanks for reporting this, I’ve been able to reproduce and there does indeed appear to be a bug. I’ve created a JIRA and have a fix ready, can hopefully include in 1.3.1. In the meantime, you can get the desired result using transform: > model.trainOn(trainingData) > > testingData.t
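
A sketch of that transform workaround filled out (hedged: variable names follow the example; latestModel() is the accessor on the streaming regression class that exposes the current weights):

```scala
model.trainOn(trainingData)

// Score each test batch with the most recently trained weights.
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
predictions.print()
```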

k-means hang without error/warning

2015-03-15 Thread Xi Shen
Hi, I am running k-means using Spark in local mode. My data set is about 30k records, and I set k = 1000. The algorithm started and finished 13 jobs according to the UI monitor, then it stopped working. The last log I saw was: [Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cl

Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rishi Yadav
Can you share some sample data? On Sun, Mar 15, 2015 at 8:51 PM, Rohit U wrote: > Hi, > > I am trying to run LogisticRegressionWithSGD on RDD of LabeledPoints > loaded using loadLibSVMFile: > > val logistic: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, > "s3n://logistic-regression/epsilon_norma

Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Sparkers, I am not able to run spark-sql on Spark. Please find the following error: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient Regards, Sandeep.v

Re: Slides of my talk in LA: 'Spark or Hadoop: is it an either-or proposition?'

2015-03-15 Thread Slim Baltagi
Hi The video recording of this talk, titled "Spark or Hadoop: is it an either-or proposition?", at the Los Angeles Spark Users Group on March 12, 2015 is now available on YouTube at this link: http://goo.gl/0iJZ4n Thanks Slim Baltagi http://www.SparkBigData.com

Re: Input validation for LogisticRegressionWithSGD

2015-03-15 Thread Rohit U
I checked the labels across the entire dataset and it looks like it has -1 and 1 (not the 0 and 1 I originally expected). I will try replacing the -1 with 0 and run it again. On Mon, Mar 16, 2015 at 12:51 AM, Rishi Yadav wrote: > ca you share some sample data > > On Sun, Mar 15, 2015 at 8:51 PM,
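
A minimal sketch of that remap (hedged: MLlib's binary LogisticRegressionWithSGD expects labels in {0, 1}; `logistic` is the RDD from earlier in the thread):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

val binary = logistic.map(lp =>
  LabeledPoint(if (lp.label == -1.0) 0.0 else lp.label, lp.features))
val model = LogisticRegressionWithSGD.train(binary, 100)
```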

Re: RE: Building spark over specified tachyon

2015-03-15 Thread Haoyuan Li
Here is a patch: https://github.com/apache/spark/pull/4867 On Sun, Mar 15, 2015 at 8:46 PM, fightf...@163.com wrote: > Thanks, Jerry > I got that way. Just to make sure whether there can be some option to > directly > specifying tachyon version. > > > -- > fightf...@1

Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread Ted Yu
Can you provide more information? Such as: the version of Spark you're using, and the command line. Thanks > On Mar 15, 2015, at 9:51 PM, sandeep vura wrote: > > Hi Sparkers, > > > > I couldn't able to run spark-sql on spark.Please find the following error > > Unable to instantiate org.apache.hadoo

Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Ted, I am using Spark 1.2.1 and Hive 0.13.1; you can check my configuration files attached below. ERROR IN SPARK: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop

Running Scala Word Count Using Maven

2015-03-15 Thread Su She
Hello Everyone, I am trying to run the Word Count from here: https://github.com/holdenk/learning-spark-examples/blob/master/mini-complete-example/src/main/scala/com/oreilly/learningsparkexamples/mini/scala/WordCount.scala I was able to successfully run the app using SBT, but not Maven. I don't se

Re: Re: Building spark over specified tachyon

2015-03-15 Thread fightf...@163.com
Thanks haoyuan. fightf...@163.com From: Haoyuan Li Date: 2015-03-16 12:59 To: fightf...@163.com CC: Shao, Saisai; user Subject: Re: RE: Building spark over specified tachyon Here is a patch: https://github.com/apache/spark/pull/4867 On Sun, Mar 15, 2015 at 8:46 PM, fightf...@163.com wrote:

Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Ted, Did you find any solution? Thanks, Sandeep On Mon, Mar 16, 2015 at 10:44 AM, sandeep vura wrote: > Hi Ted, > > I am using Spark -1.2.1 and hive -0.13.1 you can check my configuration > files attached below. > > > ERROR IN SPARK >

Spark Streaming with compressed xml files

2015-03-15 Thread Vijay Innamuri
Hi All, Processing streaming JSON files with Spark features (Spark Streaming and Spark SQL) is very efficient and works like a charm. Below is the code snippet to process JSON files. windowDStream.foreachRDD(IncomingFiles => { val IncomingFilesTable = sqlContext.jsonRDD(Incoming

Re: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread fightf...@163.com
Hi, Sandeep From your error log I can see that the JDBC driver was not found in your classpath. Did you have your MySQL JDBC jar correctly configured in the specific classpath? Can you establish a Hive JDBC connection using the URL jdbc:hive2://localhost:1 ? Thanks, Sun. fightf...@163.com F

Re: Running Scala Word Count Using Maven

2015-03-15 Thread fightf...@163.com
Hi, If you use Maven, what are the actual compile errors? fightf...@163.com From: Su She Date: 2015-03-16 13:20 To: user@spark.apache.org Subject: Running Scala Word Count Using Maven Hello Everyone, I am trying to run the Word Count from here: https://github.com/holdenk/learning-spark-ex

Re: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
Hi Fightfate, I have attached my hive-site.xml file in the previous mail. Please check the configuration once. In Hive I am able to create tables and also to load data into Hive tables. Please find the attached file. Regards, Sandeep.v On Mon, Mar 16, 2015 at 11:34 AM, fightf...@163.com wro

Why is generateJob a private API?

2015-03-15 Thread madhu phatak
Hi, I am trying to create a simple subclass of DStream. If I understand correctly, I should override *compute* for lazy operations and *generateJob* for actions. But when I try to override generateJob, it gives an error saying the method is private to the streaming package. Is my approach correct, or am I
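
A hedged sketch of a minimal DStream subclass (purely illustrative: it emits one constant record per batch; only the public extension points compute, dependencies, and slideDuration are overridden, while generateJob stays inside the framework):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Duration, StreamingContext, Time}
import org.apache.spark.streaming.dstream.DStream

class ConstantDStream(ssc_ : StreamingContext, interval: Duration, value: String)
    extends DStream[String](ssc_) {

  override def dependencies: List[DStream[_]] = Nil
  override def slideDuration: Duration = interval   // typically the batch interval

  // Called by the framework once per batch; returning an RDD here is what
  // eventually becomes a job, so no generateJob override is needed.
  override def compute(validTime: Time): Option[RDD[String]] =
    Some(ssc_.sparkContext.parallelize(Seq(value)))
}
```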

RE: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread Cheng, Hao
Or you need to specify the jars either in configuration or bin/spark-sql --jars mysql-connector-xx.jar From: fightf...@163.com [mailto:fightf...@163.com] Sent: Monday, March 16, 2015 2:04 PM To: sandeep vura; Ted Yu Cc: user Subject: Re: Re: Unable to instantiate org.apache.hadoop.hive.metastor

Re: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-15 Thread sandeep vura
I have already added the mysql-connector-xx.jar file to the spark/lib-managed/jars directory. Regards, Sandeep.v On Mon, Mar 16, 2015 at 11:48 AM, Cheng, Hao wrote: > Or you need to specify the jars either in configuration or > > > > bin/spark-sql --jars mysql-connector-xx.jar > > > > *From:* fightf.

Re: Need Advice about reading lots of text files

2015-03-15 Thread madhu phatak
Hi, Internally Spark uses the HDFS API to handle file data. Have a look at HAR and the sequence file input format. More information in this Cloudera blog post. Regards, Madhukara Phatak http://datamantra.io/ On Sun, Mar 15, 2015 at 9:59 PM, Pat Fer

Re: Streaming linear regression example question

2015-03-15 Thread Margus Roo
Thanks for the workaround. Margus (margusja) Roo http://margus.roo.ee skype: margusja +372 51 480 On 16/03/15 06:20, Jeremy Freeman wrote: Hi Margus, thanks for reporting this, I’ve been able to reproduce and there does indeed appear to be a bug. I’ve created a JIRA and have a fix ready, can hope

Re: Re: How does Spark honor data locality when allocating computing resources for an application

2015-03-15 Thread bit1...@163.com
Thanks Eric. I revisited the code and found that the spreadOutApps option is enabled by default by the following code: val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true). I had misread it as val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", false). Thanks. bit1...@163

Re: Spark Streaming with compressed xml files

2015-03-15 Thread Akhil Das
One approach would be: if you are using fileStream, you can access the individual filenames from the partitions, and with that filename you can apply your decompression/parsing logic and get it done. Like: UnionPartition upp = (UnionPartition) ds.values().getPartitions()[i]; NewHadoo

Re: k-means hang without error/warning

2015-03-15 Thread Akhil Das
How many threads are you allocating when creating the SparkContext? For example, local[4] will allocate 4 threads. You can try increasing it to a higher number; also try setting the level of parallelism to a higher number. Thanks Best Regards On Mon, Mar 16, 2015 at 9:55 AM, Xi Shen wrote: > Hi, > > I am r
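
A minimal sketch of those two knobs (the thread count and parallelism values are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kmeans-local")
  .setMaster("local[8]")                    // 8 worker threads instead of a small default
  .set("spark.default.parallelism", "16")
val sc = new SparkContext(conf)
```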

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-15 Thread Akhil Das
Did you change both versions: the one in your project's build file and the Spark version of your cluster? Thanks Best Regards On Sat, Mar 14, 2015 at 6:47 AM, EH wrote: > Hi all, > > I've been using Spark 1.1.0 for a while, and now would like to upgrade to > Spark 1.1.1 or above. How