does the data split and the
datasets to which they are allocated.
Cheers
On Sun, Aug 21, 2016 at 4:37 PM, Krishna Sankar wrote:
> Hi,
> Looks like the test-dataset has different sizes for X & Y. Possible
> steps:
>
>1. What is the test-data-size ?
> - If i
Hi,
Looks like the test-dataset has different sizes for X & Y. Possible steps:
1. What is the test-data-size ?
- If it is 15,909, check the prediction variable vector - it is now
29,471 but should be 15,909.
- If you expect it to be 29,471, then the X matrix is not right.
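A quick sanity check, as a rough PySpark sketch (test_data and predictions are
hypothetical names for your own objects):
# Compare the row counts before going any further
n_test = test_data.count()
n_pred = predictions.count()
print("test rows: %d, prediction rows: %d" % (n_test, n_pred))
assert n_test == n_pred, "X and the prediction vector differ in size"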
This intrigued me as well.
- Just to be sure, I downloaded the 1.6.2 code and recompiled.
- spark-shell and pyspark both show 1.6.2 as expected.
Cheers
On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:
> Another possible explanation is that by accide
Thanks Nick. I also ran into this issue.
VG, one workaround is to drop the NaN rows from the predictions (df.na.drop())
and then pass that dataset to the evaluator. In real life, probably detect the
NaNs and recommend the most popular items over some window.
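A rough PySpark sketch of that workaround (assuming predictions is the
DataFrame from model.transform(test) and the column names below match yours):
from pyspark.ml.evaluation import RegressionEvaluator
# Cold-start users/items produce NaN predictions, so drop those rows first
clean = predictions.na.drop(subset=["prediction"])
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("RMSE without the NaN rows: %f" % evaluator.evaluate(clean))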
HTH.
Cheers
On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
Hi all,
Just wanted to thank all for the dataset API - most of the time we see
only bugs on these lists ;o).
- To put this in context: this weekend I was updating the SQL chapters of
my book - it had all the ugliness of SchemaRDD,
registerTempTable, take(10).foreach(println)
and take
Good question. It comes down to computational complexity, computational scale,
and data volume.
1. If you can store the data on a single server or a small cluster of DB
servers (say MySQL), then HDFS/Spark might be overkill
2. If you can run the computation/process the data on a single machine
Zhan Zhang :
>>
>>> Hi Krishna,
>>>
>>> For the time being, you can download from upstream, and it should be
>>> running OK for HDP 2.3. For HDP-specific problems, you can ask in the
>>> Hortonworks forum.
>>>
>>> Thanks.
>>>
Guys,
- We have just installed HDP 2.3. It comes with Spark 1.3.x. The
current wisdom is that it will support the 1.4.x train (which is good, need
DataFrame et al).
- What is the plan to support Spark 1.5.x ? Can we install 1.5.0 on HDP
2.3 ? Or will Spark 1.5.x support be in HD
A few points to consider:
a) SparkR gives the union of R_in_a_single_machine and the
distributed_computing_of_Spark.
b) It also gives the ability to wrangle, in R, data that lives in the
Spark ecosystem
c) Coming to MLlib, the question is MLlib and R (not MLlib or R) -
depending on the scale, dat
Looks like reduceByKey() should work here.
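For example, a minimal PySpark sketch (the data is illustrative):
# Sum the values per key with reduceByKey instead of a Python sum() over an iterator
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
totals = pairs.reduceByKey(lambda x, y: x + y)
print(totals.collect())  # [('a', 4), ('b', 2)]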
Cheers
On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna <
leonida.gianfa...@gmail.com> wrote:
> Thanks a lot oubrik,
>
> I got your point, my consideration is that sum() should be already a
> built-in function for iterators in python.
> Anyway I trie
Error - ImportError: No module named Row
Cheers & enjoy the long weekend
- use .cast("...").alias('...') after the DataFrame is read.
- sql.functions.udf for any domain-specific conversions.
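A minimal sketch of both, assuming df is the spark-csv DataFrame and "price"
is a hypothetical string column:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
# 1. Cast (and rename) after the DataFrame is read
df2 = df.select(df["price"].cast("double").alias("price"))
# 2. A udf for a domain-specific conversion, e.g. stripping a currency symbol
strip_currency = F.udf(lambda s: float(s.replace("$", "")), DoubleType())
df3 = df.withColumn("price", strip_currency(df["price"]))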
Cheers
On Wed, Jul 1, 2015 at 11:03 AM, Hafiz Mujadid
wrote:
> Hi experts!
>
>
> I am using spark-csv to load csv data into a dataframe. By default it makes
> type of each
Interesting. Looking at the definitions, sql.functions.pow is defined only
for (col,col). Just as an experiment, create a column with value 2 and see
if that works.
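Something along these lines (a sketch, assuming a DataFrame df with a numeric
column "x"):
from pyspark.sql import functions as F
# Pass the exponent as a literal column so pow sees (col, col)
df.select(F.pow(df["x"], F.lit(2)).alias("x_squared")).show()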
Cheers
On Mon, Jun 29, 2015 at 1:34 PM, Bob Corsaro wrote:
> 1.4 and I did set the second parameter. The DSL works fine but trying
You can predict on the features and then zip the result with the points RDD to
get approximately the same pairing as the LabeledPoint.
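A rough sketch of the zip approach (points is assumed to be an RDD of
LabeledPoint; k=3 is an arbitrary choice):
from pyspark.mllib.clustering import KMeans
features = points.map(lambda lp: lp.features)
model = KMeans.train(features, k=3)
# Predict on the feature vectors, then zip the cluster ids back onto the points
clustered = model.predict(features).zip(points)  # (clusterId, LabeledPoint) pairs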
Cheers
On Thu, May 21, 2015 at 6:19 PM, anneywarlord
wrote:
> Hello,
>
> New to Spark. I wanted to know if it is possible to use a Labeled Point RDD
> in org.apache.spark.mllib.clustering.KMeans. After I cluster my d
Do you have commons-csv-1.1-bin.jar in your path somewhere ? I had to
download and add this.
Cheers
On Wed, Apr 22, 2015 at 11:01 AM, Mohammed Omer
wrote:
> Afternoon all,
>
> I'm working with Scala 2.11.6, and Spark 1.3.1 built from source via:
>
> `mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipT
Thanks Olivier. Good work.
Interesting in more ways than one - including training, benchmarking,
testing new releases et al.
One quick question - do you plan to make it available as an S3 bucket ?
Cheers
On Wed, Apr 15, 2015 at 5:58 PM, Olivier Chapelle
wrote:
> Dear Spark users,
>
> I would l
Yep the command-option is gone. No big deal, just add the '%pylab
inline' command
as part of your notebook.
Cheers
On Fri, Mar 20, 2015 at 3:45 PM, cong yue wrote:
> Hello :
>
> I tried ipython notebook with the following command in my enviroment.
>
> PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVE
Without knowing the data size, computation & storage requirements ... :
- Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine.
Probably 5-10 machines.
- Don't go for the most exotic machines; on the other hand, don't go for the
cheapest ones either.
- Find a sweet spot with your ve
10.0, and numIter = 20.
>
>
> The best model was trained with rank = 12 and lambda = 0.1, and numIter =
> 10, and its RMSE on the test set is 0.865407.
>
>
> On Tue, Feb 24, 2015 at 7:23 AM, Xiangrui Meng wrote:
>
>> Try to set lambda to 0.1. -Xiangrui
>>
>
1. The RMSE varies a little bit between the versions.
2. Partitioned the training, validation, and test sets like so:
- training = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
- validation = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 6 and
(x[3] % 10) < 8)
- test = ratin
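For reference, the full three-way split under the same mod-10 convention would
look something like this (inferred, not the exact code from that run):
training   = ratings_rdd_01.filter(lambda x: (x[3] % 10) < 6)
validation = ratings_rdd_01.filter(lambda x: 6 <= (x[3] % 10) < 8)
test       = ratings_rdd_01.filter(lambda x: (x[3] % 10) >= 8)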
- Divide and conquer with reduceByKey (like Ashish mentioned, each pair
being the key) would work - looks like a "mapReduce with combiners"
problem. I think reduceByKey would use combiners while aggregateByKey
wouldn't.
- Could we optimize this further by using combineByKey directly
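A small PySpark sketch of both (records and the pair-as-key layout are
illustrative):
# reduceByKey: combines map-side before the shuffle
pair_counts = records.map(lambda r: ((r[0], r[1]), 1)) \
                     .reduceByKey(lambda a, b: a + b)
# The same thing spelled out with combineByKey
pair_counts2 = records.map(lambda r: ((r[0], r[1]), 1)) \
                      .combineByKey(lambda v: v,
                                    lambda c, v: c + v,
                                    lambda c1, c2: c1 + c2)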
Stephen,
Scala 2.11 worked fine for me. I made the dev change and then compiled. Not
using in production, but I go back and forth between 2.10 & 2.11.
Cheers
On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman <
stephen.haber...@gmail.com> wrote:
> Hey,
>
> I recently compiled Spark master against
Guys,
registerTempTable("Employees")
gives me the error
Exception in thread "main" scala.ScalaReflectionException: class
org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial
classloader with boot classpath
[/Applications/eclipse/plugins/org.scala-lang.scala-library_2.11.4.
I am also looking at this domain. We could potentially use the broadcast
capability in Spark to distribute the parameters. Haven't thought it through yet.
Cheers
On Fri, Jan 9, 2015 at 2:56 PM, Andrei wrote:
> Does it makes sense to use Spark's actor system (e.g. via
> SparkContext.env.actorSystem) t
Interestingly Google Chrome translates the materials.
Cheers
On Tue, Jan 6, 2015 at 7:26 PM, Boromir Widas wrote:
> I do not understand Chinese but the diagrams on that page are very helpful.
>
> On Tue, Jan 6, 2015 at 9:46 PM, eric wong wrote:
>
>> A good beginning if you are chinese.
>>
>> h
Alec,
Good questions. Suggestions:
1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer,
Cache, Queue, App Server, App (Interface), App (backend ML) et al.
2. Then slot-in the appropriate technologies - may be even multiple
technologies for the same layer and then
a) There is no absolute RMSE - it depends on the domain. Also, RMSE is the
error based on what you have seen so far, a snapshot of a slice of the
domain.
b) My suggestion is to put the system in place, see what happens when users
interact with the system, and then you can think of reducing the RMSE as
n
A very timely article
http://rahulkavale.github.io/blog/2014/11/16/scrap-your-map-reduce/
Cheers
P.S: Now reply to ALL.
On Sun, Nov 23, 2014 at 7:16 PM, Krishna Sankar wrote:
> Good point.
> On the positive side, whether we choose the most efficient mechanism in
> Scala might
Good point.
On the positive side, whether we choose the most efficient mechanism in
Scala might not be as important, as the Spark framework mediates the
distributed computation. Even if there is some declarative part in Spark,
we can still choose an inefficient computation path that is not apparent
Adding to already interesting answers:
- "Is there any case where MR is better than Spark? I don't know what cases
I should be used Spark by MR. When is MR faster than Spark?"
- Many. MR would be better (am not saying faster ;o)) for
- Very large dataset,
- Multistage ma
Well done guys. MapReduce sort at that time was a good feat and Spark now
has raised the bar with the ability to sort a PB.
Like some of the folks on the list, I think a summary of what worked (and what
didn't), as well as the monitoring practices, would be good.
Cheers
P.S: What are you folks planning next ?
O
Hi,
I am sure you can use the -Pspark-ganglia-lgpl switch to enable Ganglia.
This step only adds the support for Hadoop, YARN, Hive et al in the Spark
executable. No need to run it if one is not using them.
Cheers
On Thu, Oct 2, 2014 at 12:29 PM, danilopds wrote:
> Hi tsingfu,
>
> I want to see me
be 0.1 or 0.01?
>
> Best,
> Burak
>
> - Original Message -
> From: "Krishna Sankar"
> To: user@spark.apache.org
> Sent: Wednesday, October 1, 2014 12:43:20 PM
> Subject: MLlib Linear Regression Mismatch
>
> Guys,
>Obviously I am doing some
Guys,
Obviously I am doing something wrong. Maybe 4 points are too small a
dataset.
Can you help me figure out why the following doesn't work?
a) This works :
data = [
LabeledPoint(0.0, [0.0]),
LabeledPoint(10.0, [10.0]),
LabeledPoint(20.0, [20.0]),
LabeledPoint(30.0, [30.0])
]
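For completeness, a hedged sketch of feeding such data to MLlib's linear
regression (reusing the data list above; the smaller step size follows the
suggestion upthread, and all values are illustrative):
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
rdd = sc.parallelize(data)
# A smaller step (e.g. 0.01) keeps SGD from diverging on unscaled features
model = LinearRegressionWithSGD.train(rdd, iterations=100, step=0.01)
print(model.predict([40.0]))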
d use 1.1 instead. Its binary packages and documentation can
> be easily found on spark.apache.org, which is important for making
> hands-on tutorial.
>
> Best,
> Xiangrui
>
> On Sat, Sep 27, 2014 at 12:15 PM, Krishna Sankar
> wrote:
> > Guys,
> >
> > Need help in
Guys,
- Need help in terms of the interesting features coming up in MLlib 1.2.
- I have a 2 Part, ~3 hr hands-on tutorial at the Big Data Tech Con
- "The Hitchhiker's Guide to Machine Learning with Python & Apache
Spark"[2]
- At minimum, it would be good to take the last 30 mi
- IMHO, #2 is preferred as it could work in any environment (Mesos,
Standalone et al). While Spark needs HDFS (for any decent distributed
system), YARN is not required at all - Mesos is a lot better.
- Also managing the app with appropriate bootstrap/deployment framework
is more flexi
Probably you have - if not, try a very simple app in the Docker container
and make sure it works. Sometimes resource contention/allocation can get in
the way. This happened to me in the YARN container.
Also try a single worker thread.
Cheers
On Sat, Jul 19, 2014 at 2:39 PM, boci wrote:
> Hi guys
not if you have other configurations that might have conflicted with
>>> those.
>>>
>>> Could you try the following, remove anything that is spark specific
>>> leaving only hbase related codes. uber jar it and run it just like any
>>> other simple java
One vector to check is the HBase libraries in --jars, as in:
spark-submit --class --master --jars
hbase-client-0.98.3-hadoop2.jar,commons-csv-1.0-SNAPSHOT.jar,hbase-common-0.98.3-hadoop2.jar,hbase-hadoop2-compat-0.98.3-hadoop2.jar,hbase-it-0.98.3-hadoop2.jar,hbase-protocol-0.98.3-hadoop2.jar,
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in
all the nodes irrespective of Hadoop/YARN.
Cheers
On Tue, Jul 8, 2014 at 6:24 PM, Robert James wrote:
> I have a Spark app which runs well on local master. I'm now ready to
> put it on a cluster. What needs to be ins
Nick,
AFAIK, you can compile with yarn=true and still run Spark in standalone
cluster mode.
Cheers
On Wed, Jul 9, 2014 at 9:27 AM, Nick R. Katsipoulakis
wrote:
> Hello,
>
> I am currently learning Apache Spark and I want to see how it integrates
> with an existing Hadoop Cluster.
>
> My cu
Abel,
I rsync the spark-1.0.1 directory to all the nodes. Then whenever the
configuration changes, rsync the conf directory.
Cheers
On Wed, Jul 9, 2014 at 2:06 PM, Abel Coronado Iruegas <
acoronadoirue...@gmail.com> wrote:
> Hi everybody
>
> We have hortonworks cluster with many nodes, we wa
Couldn't find any reference to CDH in pom.xml - in the profiles or the
hadoop.version. Am also wondering how the CDH-compatible artifact was
compiled.
Cheers
On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S
wrote:
> Hi All,
>
> Does anyone know what the command line arguments to mvn are to generate
> the
Konstantin,
1. You need to install the Hadoop RPMs on all nodes. If it is Hadoop 2,
the nodes would have HDFS & YARN.
2. Then you need to install Spark on all nodes. I haven't had experience
with HDP, but the tech preview might have installed Spark as well.
3. In the end, one should
- After loading large RDDs that are > 60-70% of the total memory, (k,v)
operations like finding uniques/distinct, groupByKey and set operations
would be network-bound.
- A multi-stage MapReduce DAG should be a good test. When we tried this
for Hadoop, we used examples from genomics.
Hi,
- I have seen similar behavior before. As far as I can tell, the root
cause is the out of memory error - verified this by monitoring the memory.
- I had a 30 GB file and was running on a single machine with 16GB.
So I knew it would fail.
- But instead of raising an exce
Santhosh,
All the nodes should have access to the jar file and so the classpath
should be on all the nodes. Usually it is better to rsync the conf
directory to all nodes rather than editing them separately.
Cheers
On Tue, Jun 17, 2014 at 9:26 PM, santhoma wrote:
> Hi,
>
> This is about spar
Mahesh,
- One direction could be: create a Parquet schema, then convert & save the
records to HDFS.
- This might help
https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SparkParquetExample.scala
Cheers
On Tue, Jun 17, 2014 at 12:52 PM, maheshtwc
Ian,
Yep, HLL is an appropriate mechanism. The countApproxDistinctByKey is a
wrapper around the
com.clearspring.analytics.stream.cardinality.HyperLogLogPlus.
Cheers
On Sun, Jun 15, 2014 at 4:50 PM, Ian O'Connell wrote:
> Depending on your requirements when doing hourly metrics calculating
>
And got the first cut:
val res = pairs.groupByKey().map((x) => (x._1, x._2.size, x._2.toSet.size))
gives the total & unique.
The question: is it scalable & efficient? Would appreciate insights.
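One way to avoid materializing the full groups - a PySpark sketch of the same
total & unique counts with aggregateByKey (pairs is assumed to be a (key, value)
RDD):
# (count, set-of-values) per key; combines map-side, keeps only per-key
# distinct sets rather than whole groups
def seq_op(acc, v):
    acc[1].add(v)
    return (acc[0] + 1, acc[1])
def comb_op(a, b):
    return (a[0] + b[0], a[1] | b[1])
totals_uniques = pairs.aggregateByKey((0, set()), seq_op, comb_op) \
                      .mapValues(lambda acc: (acc[0], len(acc[1])))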
Cheers
On Fri, Jun 13, 2014 at 10:51 PM, Krishna Sankar
wrote:
> Answ
Answered one of my questions (#5): val pairs = new PairRDDFunctions()
works fine locally. Now I can do groupByKey et al. Am not sure if it is
scalable for millions of records & memory efficient.
Cheers
On Fri, Jun 13, 2014 at 8:52 PM, Krishna Sankar wrote:
> Hi,
>Would appreciat
Hi,
Would appreciate insights and wisdom on a problem we are working on:
1. Context:
- Given a csv file like:
- d1,c1,a1
- d1,c1,a2
- d1,c2,a1
- d1,c1,a1
- d2,c1,a3
- d2,c2,a1
- d3,c1,a1
- d3,c3,a1
- d3,c2,a1
- d3,c3,a2
Yep, it gives tons of errors. I was able to make it work with sudo. Looks
like an ownership issue.
Cheers
On Tue, Jun 10, 2014 at 6:29 PM, zhen wrote:
> I created a Spark 1.0 cluster on EC2 using the provided scripts. However, I
> do not seem to be able to start the history server on the master n
Project->Properties->Java Build Path->Add External Jars
Add the /spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
Cheers
On Sun, Jun 8, 2014 at 8:06 AM, Carter wrote:
> Hi All,
>
> I just downloaded the Scala IDE for Eclipse. After I created a Spark
> project
> and clicked "Run
Shahab,
Interesting question. Couple of points (based on the information from
your e-mail)
1. One can support the use case in Spark as a set of transformations on
a work-in-progress (WIP) RDD over a span of time, with the final transformation
outputting to a processed RDD
- Spark streaming would be a
curity group rules must specify protocols
> explicitly.7ff92687-b95a-4a39-94cb-e2d00a6928fd
>
> This sounds like it could have to do with the access settings of the
> security group, but I don't know how to change. Any advice would be much
> appreciated!
>
> Sam
>
> ---
One reason could be that the keys are in a different region. You need to create
the keys in us-east-1 (North Virginia).
Cheers
On Wed, Jun 4, 2014 at 7:45 AM, Sam Taylor Steyer
wrote:
> Hi,
>
> I am trying to launch an EC2 cluster from spark using the following
> command:
>
> ./spark-ec2 -k HackerPa
Nicholas,
Good question. Couple of thoughts from my practical experience:
- Coming from R, Scala feels more natural than other languages. The
functional nature & succinctness of Scala are more suited to Data Science than
other languages. In short, Scala-Spark makes sense for Data Science, ML
Carter,
Just as a quick & simple starting point for Spark (caveats - lots of
improvements required for scaling, graceful and efficient handling of RDDs et
al):
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection.immutable.ListMap
import scala.colle
I couldn't find the classification.SVM class.
- Most probably the command is something of the order of:
- bin/spark-submit --class
org.apache.spark.examples.mllib.BinaryClassification
examples/target/scala-*/spark-examples-*.jar --algorithm SVM train.csv
- For more details tr
It depends on what stack you want to run. A quick cut:
- Worker Machines (DataNode, HBase Region Servers, Spark Worker Nodes)
- Dual 6 core CPU
- 64 to 128 GB RAM
- 3 X 3TB disk (JBOD)
- Master Node (Name Node, HBase Master,Spark Master)
- Dual 6 core CPU
- 64 t
Stuti,
- The two numbers are in different contexts, but they finally end up on the two
sides of an && operator.
- A parallel K-Means consists of multiple iterations, which in turn
consist of moving centroids around. A centroid would be deemed stabilized
when the root square distance between suc
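A rough, self-contained Python sketch of that loop condition (epsilon and
maxIterations being the two numbers; this is illustrative, not Spark's actual
implementation):
import numpy as np
def has_converged(old_centers, new_centers, epsilon):
    # Stabilized when every centroid's squared movement is within epsilon^2
    return all(np.sum((o - n) ** 2) <= epsilon ** 2
               for o, n in zip(old_centers, new_centers))
epsilon, max_iterations, iteration = 1e-4, 20, 1
old = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
new = [np.array([1e-5, 0.0]), np.array([5.0, 5.0 + 1e-5])]
keep_going = iteration < max_iterations and not has_converged(old, new, epsilon)
print(keep_going)  # False - the centroids have effectively stopped moving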