Re: Linear Regression with SGD

2015-06-09 Thread Robin East
Hi Stephen, how many is a very large number of iterations? SGD is notorious for requiring 100s or 1000s of iterations; you may also need to spend some time tweaking the step-size. In 1.4 there is an implementation of ElasticNet Linear Regression which is supposed to compare favourably with an eq
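
To make the point about iteration counts concrete, here is a minimal sketch in plain Python (not Spark's LinearRegressionWithSGD). The data, the function names and the step size of 0.1 are all made up for illustration; it only shows how the error of a per-sample SGD fit shrinks as the number of iterations grows.

```python
import random


def sgd_linear_regression(xs, ys, step_size, iterations, seed=0):
    """Fit y ~ w*x + b with plain SGD, using one random sample per step."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(iterations):
        i = rng.randrange(len(xs))
        error = (w * xs[i] + b) - ys[i]
        # Gradient of 0.5 * error**2 with respect to w and b.
        w -= step_size * error * xs[i]
        b -= step_size * error
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the fitted line over the whole data set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical noiseless data: y = 2x + 1 for x in 0.0 .. 0.9.
xs = [i / 10 for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]

for iterations in (10, 100, 1000, 10000):
    w, b = sgd_linear_regression(xs, ys, step_size=0.1, iterations=iterations)
    print(iterations, mse(xs, ys, w, b))
```

Changing step_size in the loop is an easy way to see how sensitive the convergence is to that one parameter.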

Re: Extracting k-means cluster values along with centers?

2015-06-13 Thread Robin East
trying again > On 13 Jun 2015, at 10:15, Robin East wrote: > > Here’s a typical way to do it: > > import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} > impor

Re: Spark GraphX memory requirements + java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-06-26 Thread Robin East
You’ll get this issue if you just take the first 2000 lines of that file. The problem is triangleCount() expects srcId < dstId which is not the case in the file (e.g. vertex 28). You can get round this by calling graph.convertToCanonicalEdges() which removes bi-directional edges and ensures sr
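
A rough sketch of the idea behind canonical edges, in plain Python rather than GraphX: orient every edge so the smaller vertex id comes first, then drop the duplicates that bi-directional pairs produce. The edge list and the function name to_canonical_edges are hypothetical, and this is not the GraphX implementation.

```python
def to_canonical_edges(edges):
    """Orient each (src, dst) pair so that src <= dst and remove duplicates."""
    canonical = {(min(src, dst), max(src, dst)) for src, dst in edges}
    return sorted(canonical)


# Hypothetical edge list: (28, 5) and (5, 28) form a bi-directional pair,
# and (3, 2) has src > dst.
edges = [(28, 5), (5, 28), (1, 2), (3, 2)]

print(to_canonical_edges(edges))  # [(1, 2), (2, 3), (5, 28)]
```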

Re: Research ideas using spark

2015-07-15 Thread Robin East
Well said Will. I would add that you might want to investigate GraphChi which claims to be able to run a number of large-scale graph processing tasks on a workstation much quicker than a very large Hadoop cluster. It would be interesting to know how widely applicable the approach GraphChi takes

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread Robin East
Yes and even today CBO (e.g. in Oracle) will still require hints in some cases so I think it is more like: RBO -> RBO + Hints -> CBO + Hints. Most relational databases meet significant numbers of corner cases where CBO plans simply don’t do what you would want. I don’t know enough about Spark S

Re: What is the interpretation of Cores in Spark doc

2016-06-16 Thread Robin East
Mich >> A core may have one or more threads It would be more accurate to say that a core could run one or more threads scheduled for execution. Threads are a software/OS concept that represent executable code that is scheduled to run by the OS; A CPU, core or virtual core/virtual processor exec

Re: What is the interpretation of Cores in Spark doc

2016-06-17 Thread Robin East
> > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.word

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
random thought - do you need an explicit commit with the 2nd method? > On 20 Jun 2016, at 21:35, Mich Talebzadeh wrote: > > Hi, > > I have a DF based on a table and sorted and shown below > > This is fine and when I register as tempTable I can populate the underlying > table sales 2 in Hiv

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
file/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 21 June 2016 at 08:56, Robin Eas

Re: Saving data using tempTable versus save() method

2016-06-21 Thread Robin East
if you are able to trace the underlying oracle session you can see whether a commit has been called or not. > On 21 Jun 2016, at 09:57, Robin East wrote: > > I’m not sure - I don’t know what those APIs do under the hood. It simply rang > a bell with something I have fallen fo

Re: ML PipelineModel to be scored locally

2016-07-21 Thread Robin East
MLeap is another option (Apache licensed) https://github.com/TrueCar/mleap --- Robin East Spark GraphX in Action Michael Malak and Robin East Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action

Re: Is RowMatrix missing in org.apache.spark.ml package?

2016-07-27 Thread Robin East
Can you use the version from mllib?

Re: Questions about ml.random forest (only one decision tree?)

2016-08-04 Thread Robin East
> On 4 Aug 2016, at 09:48, 陈哲 wrote: &g

Re: Machine learning question (suing spark)- removing redundant factors while doing clustering

2016-08-08 Thread Robin East
Another approach is to use L1 regularisation, e.g. http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression. This adds a penalty term to the regression equation to reduce model complexity. When you use L1 (as opposed to, say, L2) this tends to prom
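
One well-known difference between the two penalties is that L1 can set small coefficients to exactly zero, while L2 only shrinks them. The sketch below illustrates that with the standard one-dimensional shrinkage formulas in plain Python; it is not MLlib code, and the weights and the penalty strength of 0.5 are made up.

```python
def soft_threshold(w, lam):
    """Minimiser of 0.5*(v - w)**2 + lam*abs(v): the L1 shrinkage step."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def ridge_shrink(w, lam):
    """Minimiser of 0.5*(v - w)**2 + 0.5*lam*v**2: the L2 shrinkage step."""
    return w / (1.0 + lam)


# Hypothetical unregularised coefficients.
weights = [3.0, -0.4, 0.2, -2.5]

l1_weights = [soft_threshold(w, 0.5) for w in weights]
l2_weights = [ridge_shrink(w, 0.5) for w in weights]

print(l1_weights)  # [2.5, 0.0, 0.0, -2.0]
print(l2_weights)  # every entry is smaller in magnitude but none is zero
```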

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-15 Thread Robin East
re 1. The pull requests reference the JIRA ticket, in this case https://issues.apache.org/jira/browse/SPARK-5133. The JIRA says it was released in 1.5.

Re: Constantly increasing Spark streaming heap memory

2016-02-22 Thread Robin East
Hi What you describe looks like normal behaviour for almost any Java/Scala application - objects are created on the heap until a limit point is reached and then GC clears away memory allocated to objects that are no longer referenced. Is there an issue you are experiencing? > On 21 Feb 201

Re: Reindexing in graphx

2016-02-24 Thread Robin East
It looks like you are adding vertices one-by-one; you definitely don’t want to do that. What happens when you batch together 400 vertices into an RDD and then add 400 in one go?

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Robin East
. > On 25 Feb

Re: Reindexing in graphx

2016-02-25 Thread Robin East
/stages are taking a long time, and what resource (CPU, IO, network, shuffles) do they seem to be bottle-necking on.

Re: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Robin East
possibilities, the key point is that everything is just a graph transformation until you call an action on the resulting graph

Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-03-01 Thread Robin East
. > On 1 Mar 2016, at 16:13, Mohammed Guller wrot

Re: LDA topic modeling and Spark

2015-12-03 Thread Robin East
the actual words in each topic? A typical way is to look at the top 5, 10 or 20 words in each topic and use those to infer something about what the topic represents.

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
architectural sense.

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
memory-mapped file reading feature.

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
ch of the current functionality they support... >>> Hi, Robin, >>> Thanks for your reply and thanks for copying my question to user mailing >>> list. >>> Yes, we have a distributed C++ application, that will store data on each >>> node in the

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
.. >>> Hi, Robin, >>> Thanks for your reply and thanks for copying my question to user mailing >>> list. >>> Yes, we have a distributed C++ application, that will store data on each >>> node in the cluster, and we hope to leverage Spark to do more fancy

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Robin East
not the case. If you didn’t mean that then we are both in agreement.

Re: Is spark suitable for real time query

2015-07-22 Thread Robin East
Real-time is, of course, relative but you’ve mentioned microsecond level. Spark is designed to process large amounts of data in a distributed fashion. No distributed system I know of could give any kind of guarantees at the microsecond level. Robin > On 22 Jul 2015, at 11:14, Louis Hust wrote

Re: Performance issue with Spak's foreachpartition method

2015-07-22 Thread Robin East
The first question I would ask is have you determined whether you have a performance issue writing to Oracle? In particular how many commits are you making? If you are issuing a lot of commits that would be a performance problem. Robin > On 22 Jul 2015, at 19:11, diplomatic Guru wrote: > > He

Re: [MLLIB] Anyone tried correlation with RDD[Vector] ?

2015-07-23 Thread Robin East
The OP’s problem is he gets this: :47: error: type mismatch; found : org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.DenseVector] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] Note: org.apache.spark.mllib.linalg.DenseVector <: org.apache.spark.mllib.linalg.Ve

Re: Spark return key value pair

2015-08-19 Thread Robin East
Dawid is right, if you did words.count it would be twice the number of input lines. You can use map like this:

    words = lines.map(mapper2)
    for i in words.take(10):
        msg = i[0] + ":" + i[1] + "\n"
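
The "twice the number of input lines" effect can be reproduced without Spark. The sketch below assumes the original code used flatMap with a mapper that returns a (key, value) pair; the body of mapper2 and the sample lines are hypothetical, and list comprehensions stand in for the RDD operations.

```python
from itertools import chain


def mapper2(line):
    """Hypothetical mapper: split 'key,value' into a (key, value) pair."""
    key, value = line.split(",", 1)
    return (key, value)


lines = ["a,1", "b,2", "c,3"]

# Like rdd.map: one output element (a pair) per input line.
mapped = [mapper2(line) for line in lines]

# Like rdd.flatMap: each returned pair is flattened into two elements.
flat_mapped = list(chain.from_iterable(mapper2(line) for line in lines))

print(len(mapped), mapped)            # 3 pairs
print(len(flat_mapped), flat_mapped)  # 6 separate strings
```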

Re: Spark ec2 lunch problem

2015-08-24 Thread Robin East
maybe someone on the list can help diagnose the specific problem.

Re: Build k-NN graph for large dataset

2015-08-26 Thread Robin East
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you could successfully compute similarities in the high-dimensional space you would probably run into the curse of dimensionality. > On 26 Aug 2015, at 12:35, Jaonary Rabarisoa wrote: > > Dear all, > > I'm

Re: Applying transformations on a JavaRDD using reflection

2015-09-09 Thread Robin East
Have you got some code already that demonstrates the problem? > On 9 Sep 2015, at 04:45, Nirmal Fernando wrote: > > Any thoughts? > > On Tue, Sep 8, 2015 at 3:37 PM, Nirmal Fernando > wrote: > Hi All, > > I'd like to apply a chain of Spark transformations (map/filter) o

Re: spark performance - executor computing time

2015-09-16 Thread Robin East
) that means the processing takes longer. > On 15 Sep 2015, at 12:35, patcharee wrote

Re: Forecasting algorithms in spark ML

2016-09-08 Thread Robin East
Spark's algorithms are summarised on this page (https://spark.apache.org/mllib/) and details are available from the MLlib user guide, which is linked from the above URL. > On 8 Sep 2016, at 05:30, Madabhattula Rajesh Kumar > wrote: > > Hi, > > Please let me know supported F

Re: MLib : Non Linear Optimization

2016-09-08 Thread Robin East
Do you have any particular algorithms in mind? If you state the most common algorithms you use then it might stimulate the appropriate comments. > On 8 Sep 2016, at 05:04, nsareen wrote: > > Any answer to this question group ? > > > > -- > View this message in context: > http://apache-spa

Re: Graphhopper/routing in Spark

2016-09-09 Thread Robin East
. > On 8 Sep 2016, at 22:45,

Re: MLib : Non Linear Optimization

2016-10-05 Thread Robin East
I would say no, at least not without a fair degree of algorithm writing experience. MLLib is primarily a set of machine learning algorithms, many of which are based on implementations of distributed optimisation procedures. The SAS routines you mention are optimisation routines which don't have

Re: MLib : Non Linear Optimization

2016-10-05 Thread Robin East
tml> > On 5 Oct 2016, at 08:29, Robin East wrote: > > I would say no, a

Re: K-Mean retrieving Cluster Members

2016-10-19 Thread Robin East
or alternatively this should work (assuming parsedData is an RDD[Vector]): clusters.predict(parsedData) > On 18 Oct 2016, at 00:35, Reth RM wrote: > > I think I got it > > parsedData.foreach( > new VoidFunction() { > @Override > p

Re: Need help with SVM

2016-10-26 Thread Robin East
As per Aseem’s point, what do you get from: data_rdd.toDF.groupBy("label").count.show > On 25 Oct 2016, at 15:41, Aseem Bansal wrote: > > Is there any labeled point with label 0 in your dataset? > > On Tue, Oct 25, 2016 at 2:13 AM, aditya1702 > wrote: > Hello,

Re: Need help with SVM

2016-10-26 Thread Robin East
It looks like the training is over-regularised - dropping the regParam to 0.1 or 0.01 should resolve the problem.

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
overfit your model.

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Robin East
. > On 1 Nov 2016, at 11:18, Nirav Patel wrote: >

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I don’t think the semantics of groupBy necessarily preserve ordering - whatever the implementation details or the observed behaviour. I would use a Window operation and order within the group. > On 3 Nov 2016, at 11:53, Rabin Banerjee wrote: > > Hi All , > > I want to do a dataframe oper

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I agree with Koert. Relying on something because it appears to work when you test it can be dangerous if there is nothing in the API that guarantees it. Going back quite a few years it used to be the case that Oracle would always return a group by with the rows in the order of the grouping key. This was

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-04 Thread Robin East
> On 4 Nov 2016,

Re: LinearRegressionWithSGD and Rank Features By Importance

2016-11-07 Thread Robin East
> On 7 Nov 2016, at 15:47, Carlo.Allocca wrote

Re: Pretrained Word2Vec models

2016-12-05 Thread Robin East
. > On 5 Dec 2016, at 21:34, L

Re: Which streaming platform is best? Kafka or Spark Streaming?

2017-03-10 Thread Robin East
on. > On 9

Re: Question on Spark's graph libraries

2017-03-10 Thread Robin East
I would love to know the answer to that too.

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Robin East
. > On 25 Jul 2017, at 13:21, Gokula Krishnan D wrote: > &

Re: Positive log-likelihood with Gaussian mixture

2018-05-30 Thread robin . east
Positive log likelihoods for continuous distributions are not unusual. You are evaluating a pdf, not a probability. For example, a univariate Gaussian pdf returns greater than 1 at the mean when the standard deviation goes below about 0.399 (variance below about 0.159), at which point the log pdf is positive.
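
A quick numerical check, as a sketch in plain Python: the peak density of a univariate Gaussian is 1/(sigma * sqrt(2*pi)), which exceeds 1 once the standard deviation sigma drops below 1/sqrt(2*pi), roughly 0.399. The function name and the sample values of sigma are chosen here for illustration only.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Standard deviation at which the density at the mean is exactly 1.
threshold_std = 1.0 / math.sqrt(2.0 * math.pi)
print("threshold std:", threshold_std, "threshold variance:", threshold_std ** 2)

for std in (1.0, 0.5, 0.39, 0.1):
    density = gaussian_pdf(0.0, 0.0, std)
    # The log density turns positive as soon as the density itself exceeds 1.
    print(std, density, math.log(density))
```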

Mllib / kalman

2018-12-17 Thread robin . east
Pretty sure there is nothing in MLlib. This seems to be the most comprehensive coverage of implementing one in Spark: https://dzone.com/articles/kalman-filter-with-apache-spark-streaming-and-kafk. I’ve only skimmed it rather than read it in detail, but it looks useful.

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-21 Thread Robin East
I don’t get those results. I get: spark 0.14, scikit-learn 0.85. The scikit-learn mse is due to the very low eta0 setting. Tweak that to 0.1 and push iterations to 400 and you get an mse ~= 0. Of course the coefficients are both ~1 and the intercept ~0. Similarly if you change the mll

Re: Is Apache Spark less accurate than Scikit Learn?

2015-01-22 Thread Robin East
/dataset it may be the > other way around? > > I ask because I encountered this situation on other, larger datasets, so this > is not an isolated case (though being the simplest example I could think of I > would imagine that it's somewhat indicative of general behaviour) &g

Re: is there a master for spark cluster in ec2

2015-02-02 Thread Robin East
There is a file $SPARK_HOME/conf/spark-env.sh which comes readily configured with the MASTER variable. So if you start pyspark or spark-shell from the ec2 login machine you will connect to the Spark master. On 29 Jan 2015, at 01:11, Mohit Singh wrote: > Hi, > Probably a naive question.. But

Re: obtain cluster assignment in K-means

2015-02-12 Thread Robin East
KMeans.train actually returns a KMeansModel so you can use the predict() method of the model, e.g. clusters.predict(pointToPredict) or clusters.predict(pointsToPredict). The first is a single Vector, the second is an RDD[Vector]. Robin On 12 Feb 2015, at 06:37, Shi Yu wrote: > Hi there, > > I am new to spark.
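
For anyone wondering what the cluster assignment amounts to, here is a plain-Python sketch of the idea: each point is assigned the index of its closest centre. This is not the MLlib code; the predict function, the centres and the points below are all made up for illustration.

```python
def predict(centers, point):
    """Return the index of the centre closest to point (squared Euclidean distance)."""
    def squared_distance(center):
        return sum((c - p) ** 2 for c, p in zip(center, point))

    return min(range(len(centers)), key=lambda i: squared_distance(centers[i]))


# Hypothetical cluster centres and points.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.1)]

# Single point, then a whole collection of points.
print(predict(centers, points[0]))       # 0
assignments = [predict(centers, p) for p in points]
print(list(zip(points, assignments)))    # each point paired with its cluster index
```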

Re: Spark SQL odbc on Windows

2015-02-23 Thread Robin East
Have you looked at Kylin? http://www.ebaytechblog.com/2014/10/20/announcing-kylin-extreme-olap-engine-for-big-data/#.VOtXUUsqnUk Pretty new but has the backing of eBay. On 23 Feb 2015, at 15:38, Denny Lee wrote: > Makes complete sense - I became a fan of Spark for pretty much the same > reas

Re: GraphX path traversal

2015-03-03 Thread Robin East
Rajesh I'm not sure if I can help you, however I don't even understand the question. Could you restate what you are trying to do? > On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar > wrote: > > Hi, > > I have a below edge list. How to find the parents path for every ve

Re: PRNG in Scala

2015-03-03 Thread Robin East
This is more of a java/scala question than spark - it uses java.util.Random : https://github.com/scala/scala/blob/2.11.x/src/library/scala/util/Random.scala > On 3 Mar 2015, at 15:08, Vijayasarathy Kannan wrote: > > Hi, > > What pseudo-random-number generator does scala.util.Random uses?
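
For reference, java.util.Random is documented as a 48-bit linear congruential generator. The sketch below re-implements just its seeding and nextInt() steps in plain Python, following the constants given in the Javadoc; the class name JavaRandom is made up here and other methods (nextDouble, nextGaussian and so on) are not covered.

```python
class JavaRandom:
    """Sketch of the 48-bit LCG documented for java.util.Random (nextInt only)."""

    MULTIPLIER = 0x5DEECE66D
    INCREMENT = 0xB
    MASK = (1 << 48) - 1

    def __init__(self, seed):
        # The constructor scrambles the seed with the multiplier.
        self.seed = (seed ^ self.MULTIPLIER) & self.MASK

    def _next(self, bits):
        # Advance the generator and keep the top `bits` bits of the 48-bit state.
        self.seed = (self.seed * self.MULTIPLIER + self.INCREMENT) & self.MASK
        value = self.seed >> (48 - bits)
        # Reinterpret as a signed integer, as Java's int would.
        if value >= 1 << (bits - 1):
            value -= 1 << bits
        return value

    def next_int(self):
        return self._next(32)


rng = JavaRandom(0)
print(rng.next_int())  # -1155484576, the first value of new Random(0).nextInt() in Java
```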

Re: GraphX path traversal

2015-03-03 Thread Robin East
> In this graph, How can I compute the 1st vertex parents like 2,3,4,5,6. > Similarly 2nd vertex parents like 3,4,5,6 6th vertex parent like 6 > because this is the root node. > > I'm planning to use Pregel API but I'm not able to define messages an

Re: GraphX path traversal

2015-03-03 Thread Robin East
Have you tried EdgeDirection.In? > On 3 Mar 2015, at 16:32, Robin East wrote: > > What about the following which can be run in spark shell: > > import org.apache.spark._ > import org.apache.spark.graphx._ > import org.apache.spark.rdd.RDD > > val vertexlist = Array(

Re: PRNG in Scala

2015-03-03 Thread Robin East
And this SO post goes into details on the PRNG in Java http://stackoverflow.com/questions/9907303/does-java-util-random-implementation-differ-between-jres-or-platforms > On 3 Mar 2015, at 16:15, Robin East wrote: > > This is more of a java/scala question than spark - it uses java.ut

Re: GraphX path traversal

2015-03-04 Thread Robin East
In this graph, How can I compute the 1st vertex parents like 2,3,4,5,6. > Similarly 2nd vertex parents like 3,4,5,6 6th vertex parent like 6 > because this is the root node. > > I'm planning to use Pregel API but I'm not able to define messages and vertex > prog

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
What query are you running? It may be the case that your query requires PostgreSQL to do a large amount of work before identifying the first n rows > On 4 May 2015, at 15:52, Yi Zhang wrote: > > I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and > improve query performance,

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
and a further question - have you tried running this query in psql? what’s the performance like there? > On 4 May 2015, at 16:04, Robin East wrote: > > What query are you running? It may be the case that your query requires > PostgreSQL to do a large amount of work before identifyi

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
on, May 4, 2015 at 8:06 AM, Robin East wrote: >> and a further question - have you tried running this query in psql? what’s >> the performance like there? >> >>>> On 4 May 2015, at 16:04, Robin East wrote: >>>> >>>> What query are you runni

Re: Current Build Gives HTTP ERROR

2015-01-13 Thread Robin East
I’ve just pulled down the latest commits from github, and done the following:
1) mvn clean package -DskipTests builds fine
2) ./bin/spark-shell works
3) run SparkPi example with no problems: ./bin/run-example SparkPi 10
4) Started a master ./sbin/start-master.sh grabbed the MasterWebUI fro