Re: Belief propagation algorithm is open sourced

2016-12-15 Thread Ulanov, Alexander
using BP. We were using both LibDAI and our own implementation of BP for GraphLab and as a reference. Best regards, Manish Marwah & Alexander From: Bertrand Dechoux Sent: Thursday, December 15, 2016 1:03:49 AM To: Bryan Cutler Cc: Ulanov, Alexander; user; d

Belief propagation algorithm is open sourced

2016-12-13 Thread Ulanov, Alexander
Dear Spark developers and users, HPE has open sourced an Apache Spark implementation of the belief propagation (BP) algorithm, a popular message-passing algorithm for performing inference in probabilistic graphical models. It provides exact inference for graphical models without loops. Wh

scalable-deeplearning 1.0.0 released

2016-09-09 Thread Ulanov, Alexander
Dear Spark users and developers, I have released version 1.0.0 of the scalable-deeplearning package. This package is based on the implementation of artificial neural networks in Spark ML. It is intended for new Spark deep learning features that have not yet been merged to Spark ML or that are too specif

Spark streaming get RDD within the sliding window

2016-08-24 Thread Ulanov, Alexander
Dear Spark developers, I am working with Spark Streaming 1.6.1. The task is to get RDDs for some external analytics from each time window. This external function accepts an RDD, so I cannot use DStream. I learned that DStream.window.compute(time) returns Option[RDD]. I am trying to use it in the fol
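A minimal sketch of an alternative approach, assuming an input DStream named stream; runAnalytics is a hypothetical stand-in for the external function that accepts an RDD. foreachRDD hands the materialized RDD for each window interval to ordinary RDD code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Seconds

    def runAnalytics(rdd: RDD[String]): Unit = { /* external analytics on the window's RDD */ }

    // window(length, slide) builds the sliding window; foreachRDD fires once per slide interval
    stream.window(Seconds(30), Seconds(10)).foreachRDD(rdd => runAnalytics(rdd))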

Graph edge type pattern matching in GraphX

2016-08-02 Thread Ulanov, Alexander
Dear Spark developers, Could you suggest how to perform pattern matching on the type of the graph edge in the following scenario? I need to perform some math by means of aggregateMessages on the graph edges if the edges are Double. Here is the code: def my[VD: ClassTag, ED: ClassTag] (graph: Graph[V
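A minimal sketch of one way to branch on the edge type, comparing the runtime ClassTag of ED against ClassTag.Double; the method body is a hypothetical stub:

    import scala.reflect.ClassTag
    import org.apache.spark.graphx.Graph

    def my[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Unit = {
      // ClassTags survive erasure, so the edge type can be inspected at runtime
      if (implicitly[ClassTag[ED]] == ClassTag.Double) {
        val g = graph.asInstanceOf[Graph[VD, Double]]
        // aggregateMessages math over the Double-typed edges goes here
      }
    }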

RE: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Ulanov, Alexander
-1, due to unresolved https://issues.apache.org/jira/browse/SPARK-15899 From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, July 14, 2016 12:00 PM To: dev@spark.apache.org Subject: [VOTE] Release Apache Spark 2.0.0 (RC4) Please vote on releasing the following candidate as Apache Spark

RE: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Ulanov, Alexander
Here is the fix https://github.com/apache/spark/pull/13868 From: Reynold Xin [mailto:r...@databricks.com] Sent: Wednesday, June 22, 2016 6:43 PM To: Ulanov, Alexander Cc: Mark Hamstra ; Marcelo Vanzin ; dev@spark.apache.org Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC1) Alex - if you have

RE: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Ulanov, Alexander
4:09 PM To: Marcelo Vanzin Cc: Ulanov, Alexander ; Reynold Xin ; dev@spark.apache.org Subject: Re: [VOTE] Release Apache Spark 2.0.0 (RC1) It's also marked as Minor, not Blocker. On Wed, Jun 22, 2016 at 4:07 PM, Marcelo Vanzin mailto:van...@cloudera.com>> wrote: On Wed, Jun 22, 2016

RE: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-22 Thread Ulanov, Alexander
-1. Spark unit tests fail on Windows. Still not resolved, though marked as resolved. https://issues.apache.org/jira/browse/SPARK-15893 From: Reynold Xin [mailto:r...@databricks.com] Sent: Tuesday, June 21, 2016 6:27 PM To: dev@spark.apache.org Subject: [VOTE] Release Apache Spark 2.0.0 (RC1) Please

RE: Shrinking the DataFrame lineage

2016-05-13 Thread Ulanov, Alexander
, May 13, 2016 12:38 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Shrinking the DataFrame lineage Here's a JIRA for it: https://issues.apache.org/jira/browse/SPARK-13346 I don't have a great method currently, but hacks can get around it: convert the DataFrame to an RD

Shrinking the DataFrame lineage

2016-05-11 Thread Ulanov, Alexander
Dear Spark developers, Recently, I was trying to switch my code from RDDs to DataFrames in order to compare the performance. The code computes an RDD in a loop. I use RDD.persist followed by RDD.count to force Spark to compute the RDD and cache it, so that it does not need to re-compute it on each it
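A minimal sketch of the RDD round-trip hack mentioned later in this thread for cutting a DataFrame's plan lineage, assuming a SQLContext sqlContext and a DataFrame df:

    // rebuilding the DataFrame from its RDD and schema starts a fresh logical plan
    val truncated = sqlContext.createDataFrame(df.rdd, df.schema)
    truncated.persist()
    truncated.count() // materialize so the next iteration reads from the cache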

RE: Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
will involve shuffling. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, April 26, 2016 2:44 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Number of partitions for binaryFiles From what I understand, Spark code was written this way because you don't end up with very

RE: Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
Hi Ted, I have 36 files of size ~600KB and the remaining 74 are about 400KB. Is there a workaround rather than changing Spark's code? Best regards, Alexander From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, April 26, 2016 1:22 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re

Number of partitions for binaryFiles

2016-04-26 Thread Ulanov, Alexander
Dear Spark developers, I have 100 binary files in the local file system that I want to load into a Spark RDD. I need the data from each file to be in a separate partition. However, I cannot make it happen: scala> sc.binaryFiles("/data/subset").partitions.size res5: Int = 66 The "minPartitions" param
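A minimal sketch of a workaround that avoids patching Spark: load the files and then repartition to one-file-per-partition granularity (this does shuffle the file contents, as noted later in the thread):

    val files = sc.binaryFiles("/data/subset")
    val onePerPartition = files.repartition(100)
    println(onePerPartition.partitions.size) // 100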

RE: MLPC model can not be saved

2016-03-21 Thread Ulanov, Alexander
Hi Pan, There is a pull request that is supposed to fix the issue: https://github.com/apache/spark/pull/9854 There is a workaround for saving/loading a model (however I am not sure if it will work for the pipeline): sc.parallelize(Seq(model), 1).saveAsObjectFile("path") val sameModel = sc.object
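A minimal sketch completing the workaround quoted above; model is assumed to be a trained MultilayerPerceptronClassificationModel and "path" is a placeholder:

    import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel

    // serialize the model object itself through a one-element RDD
    sc.parallelize(Seq(model), 1).saveAsObjectFile("path")
    val sameModel = sc.objectFile[MultilayerPerceptronClassificationModel]("path").first()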

RE: Using CUDA within Spark / boosting linear algebra

2016-01-21 Thread Ulanov, Alexander
erformance. Best regards, Alexander From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] Sent: Thursday, January 21, 2016 3:34 AM To: dev@spark.apache.org; Ulanov, Alexander; Joseph Bradley Cc: John Canny; Evan R. Sparks; Xiangrui Meng; Sam Halliday Subject: RE: Using CUDA within Spark / boosting

RE: Using CUDA within Spark / boosting linear algebra

2016-01-20 Thread Ulanov, Alexander
...@gmail.com] Sent: Thursday, March 26, 2015 9:27 AM To: John Canny Cc: Xiangrui Meng; dev@spark.apache.org; Joseph Bradley; Evan R. Sparks; Ulanov, Alexander Subject: Re: Using CUDA within Spark / boosting linear algebra John, I have to disagree with you there. Dense matrices come up a lot in industry

RE: Data and Model Parallelism in MLPC

2016-01-04 Thread Ulanov, Alexander
is handled by Spark RDD, i.e. each worker processes a subset of data partitions, and master serves the role of parameter server. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Wednesday, December 30, 2015 4:03 AM To: Ulanov, Alexander Cc: dev@spark.apache.org

RE: Support off-loading computations to a GPU

2016-01-04 Thread Ulanov, Alexander
Hi Kazuaki, Sounds very interesting! Could you elaborate on your benchmark with regards to logistic regression (LR)? Did you compare your implementation with the current implementation of LR in Spark? Best regards, Alexander From: Kazuaki Ishizaki [mailto:ishiz...@jp.ibm.com] Sent: Sunday, Jan

RE: SparkML algos limitations question.

2016-01-04 Thread Ulanov, Alexander
Hi Yanbo, As long as two models fit into the memory of a single machine, there should be no problems, so even 16GB machines can handle large models. (The master should have more memory because it runs LBFGS.) In my experiments, I’ve trained models with 12M and 32M parameters without issues. Best regards

RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
forward and back propagation. However, this option does not seem very practical to me. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Tuesday, December 08, 2015 11:19 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Data and Model Parallelism in MLPC

RE: Data and Model Parallelism in MLPC

2015-12-08 Thread Ulanov, Alexander
Hi Disha, Multilayer perceptron classifier in Spark implements data parallelism. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Tuesday, December 08, 2015 12:43 AM To: dev@spark.apache.org; Ulanov, Alexander Subject: Data and Model Parallelism in MLPC Hi, I

RE: A proposal for Spark 2.0

2015-11-12 Thread Ulanov, Alexander
Parameter Server is a new feature and thus does not match the goal of 2.0, which is “to fix things that are broken in the current API and remove certain deprecated APIs”. At the same time I would be happy to have that feature. With regards to machine learning, it would be great to move useful features

RE: Gradient Descent with large model size

2015-10-19 Thread Ulanov, Alexander
look into how to zip the data sent as an update. Do you know of any options except going from double to single precision (or less)? Best regards, Alexander From: Evan Sparks [mailto:evan.spa...@gmail.com] Sent: Saturday, October 17, 2015 2:24 PM To: Joseph Bradley Cc: Ulanov, Alexander; dev

RE: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-15 Thread Ulanov, Alexander
the size of the data and the model. Also, you have to make sure that all workers own local data, which is a separate thing from the number of partitions. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Thursday, October 15, 2015 10:13 AM To: Ulanov, Alexander Cc

RE: Gradient Descent with large model size

2015-10-15 Thread Ulanov, Alexander
Bradley [mailto:jos...@databricks.com] Sent: Wednesday, October 14, 2015 11:35 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Gradient Descent with large model size For those numbers of partitions, I don't think you'll actually use tree aggregation. The number of partition

Gradient Descent with large model size

2015-10-14 Thread Ulanov, Alexander
Dear Spark developers, I have noticed that Gradient Descent in Spark MLlib takes a long time if the model is large. It is implemented with TreeAggregate. I've extracted the code from GradientDescent.scala to perform the benchmark. It allocates an Array of a given size and then aggregates it: val
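A minimal sketch of such a benchmark, with a hypothetical model size n and a dummy dataset; each task contributes an Array[Double] of n weights that treeAggregate then sums pairwise:

    val n = 6000000 // assumed model size
    val data = sc.parallelize(1 to 128, 128)
    val t = System.nanoTime()
    data.treeAggregate(new Array[Double](n))(
      seqOp = (acc, _) => acc, // stands in for the per-instance gradient computation
      combOp = (a, b) => { var i = 0; while (i < n) { a(i) += b(i); i += 1 }; a })
    println("Aggregation time, s: " + (System.nanoTime() - t) / 1e9)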

RE: Operations with cached RDD

2015-10-12 Thread Ulanov, Alexander
Thank you, Nitin. This does explain the problem. It seems that the UI should make this more clear to the user; otherwise it is simply misleading if you read it as is. From: Nitin Goyal [mailto:nitin2go...@gmail.com] Sent: Sunday, October 11, 2015 5:57 AM To: Ulanov, Alexander Cc: dev

RE: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-12 Thread Ulanov, Alexander
worthwhile for this rather small dataset. Best regards, Alexander From: Disha Shrivastava [mailto:dishu@gmail.com] Sent: Sunday, October 11, 2015 9:29 AM To: Mike Hynes Cc: dev@spark.apache.org; Ulanov, Alexander Subject: Re: No speedup in MultiLayerPerceptronClassifier with increase in number of

Operations with cached RDD

2015-10-09 Thread Ulanov, Alexander
Dear Spark developers, I am trying to understand how the Spark UI displays operations with a cached RDD. For example, the following code caches an rdd: >> val rdd = sc.parallelize(1 to 5, 5).zipWithIndex.cache >> rdd.count The Jobs tab shows me that the RDD is evaluated: 1 count at <console>:24

RE: GraphX PageRank keeps 3 copies of graph in memory

2015-10-07 Thread Ulanov, Alexander
Hi Ankur, Could you help with an explanation of the problem below? Best regards, Alexander From: Ulanov, Alexander Sent: Friday, October 02, 2015 11:39 AM To: 'Robin East' Cc: dev@spark.apache.org Subject: RE: GraphX PageRank keeps 3 copies of graph in memory Hi Robin, Sounds interes

RE: GraphX PageRank keeps 3 copies of graph in memory

2015-10-02 Thread Ulanov, Alexander
] Sent: Friday, October 02, 2015 12:27 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: GraphX PageRank keeps 3 copies of graph in memory Alexander, I’ve just run the benchmark and only end up with 2 sets of RDDs in the Storage tab. This is on 1.5.0, what version are you using? Robin

GraphX PageRank keeps 3 copies of graph in memory

2015-09-30 Thread Ulanov, Alexander
Dear Spark developers, I would like to understand GraphX caching behavior with regards to PageRank in Spark, in particular, the following implementation of PageRank: https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala On each iteration the

Too many executors are created

2015-09-29 Thread Ulanov, Alexander
Dear Spark developers, I have created a simple Spark application for spark-submit. It calls a machine learning library from Spark MLlib that is executed in a number of iterations that correspond to the same number of tasks in Spark. It seems that Spark creates an executor for each task and then

RE: One element per node

2015-09-18 Thread Ulanov, Alexander
partitions per node? From: Reynold Xin [mailto:r...@databricks.com] Sent: Friday, September 18, 2015 4:37 PM To: Ulanov, Alexander Cc: Feynman Liang; dev@spark.apache.org Subject: Re: One element per node Use a global atomic boolean and return nothing from that partition if the boolean is true

RE: One element per node

2015-09-18 Thread Ulanov, Alexander
Thank you! How can I guarantee that I have only one element per executor (per worker, or per physical node)? From: Feynman Liang [mailto:fli...@databricks.com] Sent: Friday, September 18, 2015 4:06 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: One element per node

One element per node

2015-09-18 Thread Ulanov, Alexander
Dear Spark developers, Is it possible (and how to do it if possible) to pick one element per physical node from an RDD? Let's say the first element of any partition on that node. The result would be an RDD[element], where the count of elements equals the number of nodes that have partitions of the ini
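A minimal sketch that yields one element per partition, which coincides with one per node only when each node holds a single partition; per-node deduplication is the open question in this thread:

    // take(1) on each partition's iterator keeps at most one element per partition
    val firstPerPartition = rdd.mapPartitions(it => it.take(1))
    println(firstPerPartition.count()) // number of non-empty partitions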

RE: Enum parameter in ML

2015-09-16 Thread Ulanov, Alexander
, September 16, 2015 5:35 PM To: Feynman Liang Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Enum parameter in ML I've tended to use Strings. Params can be created with a validator (isValid) which can ensure users get an immediate error if they try to pass an unsupported String. N

RE: Enum parameter in ML

2015-09-14 Thread Ulanov, Alexander
Hi Feynman, Thank you for the suggestion. How can I ensure that there will be no problems for Java users? (I only use the Scala API.) Best regards, Alexander From: Feynman Liang [mailto:fli...@databricks.com] Sent: Monday, September 14, 2015 5:27 PM To: Ulanov, Alexander Cc: dev@spark.apache.org

Enum parameter in ML

2015-09-14 Thread Ulanov, Alexander
Dear Spark developers, I am currently implementing an Estimator in ML that has a parameter that can take several different values that are mutually exclusive. The most appropriate type seems to be a Scala Enum (http://www.scala-lang.org/api/current/index.html#scala.Enumeration). However, the cu
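A minimal sketch of the String-parameter-with-validator approach suggested later in the thread; the "solver" parameter and its values are hypothetical:

    import org.apache.spark.ml.param.{Param, Params, ParamValidators}

    trait HasSolver extends Params {
      // inArray rejects unsupported values immediately, for both Scala and Java callers
      final val solver: Param[String] = new Param[String](this, "solver",
        "solver algorithm", ParamValidators.inArray(Array("l-bfgs", "gd")))
      final def getSolver: String = $(solver)
    }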

RE: Data frame with one column

2015-09-14 Thread Ulanov, Alexander
Thank you for the quick response! I’ll use Tuple1 From: Feynman Liang [mailto:fli...@databricks.com] Sent: Monday, September 14, 2015 11:05 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Data frame with one column For an example, see the ml-feature word2vec user guide<ht

Data frame with one column

2015-09-14 Thread Ulanov, Alexander
Dear Spark developers, I would like to create a dataframe with one column. However, the createDataFrame method accepts at least a Product: val data = Seq(1.0, 2.0) val rdd = sc.parallelize(data, 2) val df = sqlContext.createDataFrame(rdd) [fail] <console>:25: error: overloaded method value createDataFrame
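A minimal sketch of the Tuple1 workaround from the follow-up reply: wrapping each value makes the RDD element type a Product, which createDataFrame/toDF accepts:

    import sqlContext.implicits._

    val data = Seq(1.0, 2.0)
    val df = sc.parallelize(data, 2).map(Tuple1.apply).toDF("value")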

Use of UnsafeRow

2015-09-01 Thread Ulanov, Alexander
Dear Spark developers, Could you suggest what is the intended use of UnsafeRow (except for Tungsten groupBy and sort) and give an example of how to use it? 1) Is it intended to be instantiated as a copy of a Row in order to perform in-place modifications of it? 2) Can I create a new UnsafeRow giv

Re: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Ulanov, Alexander
Thank you for the explanation. The size of the 100M data is ~1.4GB in memory and each worker has 32GB of memory. There seems to be a lot of free memory available. I wonder how Spark can hit GC with such a setup? Reynold Xin mailto:r...@databricks.com>> On Fri, Aug 21, 2015 at 11:07 AM,

RE: Dataframe aggregation with Tungsten unsafe

2015-08-21 Thread Ulanov, Alexander
t on this? It seems counterintuitive to me. Local performance was not as good as Reynold's: I see around a 1.5x speedup, while he had 5x. However, local mode is not interesting. From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, August 20, 2015 9:24 PM To: Ulanov, Alexander Cc: dev@spark.apa

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
t 20, 2015 5:43 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Dataframe aggregation with Tungsten unsafe Please git pull :) On Thu, Aug 20, 2015 at 5:35 PM, Ulanov, Alexander mailto:alexander.ula...@hp.com>> wrote: I am using Spark 1.5 cloned from master on June 12. (The

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
I am using Spark 1.5 cloned from master on June 12. (The aggregate unsafe feature was added to Spark on April 29.) From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, August 20, 2015 5:26 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Dataframe aggregation with

RE: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30) at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:64) ... 73 more From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, August 20, 2015 4:22 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject

Re: Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
xtDouble)).toDF("key", "value") data.write.parquet("/scratch/rxin/tmp/alex") val df = sqlContext.read.parquet("/scratch/rxin/tmp/alex") val t = System.nanoTime() val res = df.groupBy("key").agg(sum("value")) res.count() println((System.nanoTime

Dataframe aggregation with Tungsten unsafe

2015-08-20 Thread Ulanov, Alexander
Dear Spark developers, I am trying to benchmark the new DataFrame aggregation implemented under the project Tungsten and released with Spark 1.4 (I am using the latest Spark from the repo, i.e. 1.5): https://github.com/apache/spark/pull/5725 It says that the aggregation should be faster due to
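A minimal sketch of the benchmark shape used in this thread, with hypothetical paths; the Tungsten flag name was in flux between 1.4 and 1.5, so treat it as an assumption:

    import org.apache.spark.sql.functions.sum

    sqlContext.setConf("spark.sql.tungsten.enabled", "true") // assumed flag name
    val df = sqlContext.read.parquet("/path/to/data") // columns: key, value
    val t = System.nanoTime()
    df.groupBy("key").agg(sum("value")).count()
    println("Time, s: " + (System.nanoTime() - t) / 1e9)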

Machine learning unit tests guidelines

2015-07-30 Thread Ulanov, Alexander
Dear Spark developers, Are there any best practices or guidelines for machine learning unit tests in Spark? After taking a brief look at the unit tests in ML and MLlib, I have found that each algorithm is tested in a different way. There are a few kinds of tests: 1) Partial check of internal algor

RE: Two joins in GraphX Pregel implementation

2015-07-29 Thread Ulanov, Alexander
: Tuesday, July 28, 2015 12:05 PM To: Ulanov, Alexander Cc: Robin East; dev@spark.apache.org Subject: Re: Two joins in GraphX Pregel implementation On 27 Jul 2015, at 16:42, Ulanov, Alexander mailto:alexander.ula...@hp.com>> wrote: It seems that the mentioned two joins can be rewritten as one oute

RE: Two joins in GraphX Pregel implementation

2015-07-28 Thread Ulanov, Alexander
. Do you know the reason why this improvement is not pushed? CC’ing Dave From: Robin East [mailto:robin.e...@xense.co.uk] Sent: Monday, July 27, 2015 9:11 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Two joins in GraphX Pregel implementation Quite possibly - there is a JIRA open

RE: Two joins in GraphX Pregel implementation

2015-07-27 Thread Ulanov, Alexander
27, 2015 8:56 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Two joins in GraphX Pregel implementation What happens to this line of code: messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache() Part of the Pregel ‘contract’ is that vertices that

Two joins in GraphX Pregel implementation

2015-07-27 Thread Ulanov, Alexander
Dear Spark developers, Below is the GraphX Pregel code snippet from https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel-api (it does not contain the caching step): while (activeMessages > 0 && i < maxIterations) { // Receive the messages:

RE: Model parallelism with RDD

2015-07-17 Thread Ulanov, Alexander
Hi Shivaram, Thank you for the explanation. Is there a direct way to check the length of the lineage, i.e., that the computation is repeated? Best regards, Alexander From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] Sent: Friday, July 17, 2015 10:10 AM To: Ulanov, Alexander Cc
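A minimal sketch of a direct check: toDebugString prints the whole lineage, so its depth growing with each iteration is a sign that earlier iterations will be recomputed:

    println(rdd.toDebugString) // full lineage, one indented level per dependency
    println(rdd.dependencies) // immediate parent RDDs only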

RE: Model parallelism with RDD

2015-07-16 Thread Ulanov, Alexander
spark.sql.unsafe.enabled=true removes the GC when persisting/unpersisting the DataFrame? Best regards, Alexander From: Ulanov, Alexander Sent: Monday, July 13, 2015 11:15 AM To: shiva...@eecs.berkeley.edu Cc: dev@spark.apache.org Subject: RE: Model parallelism with RDD Below are the average

RE: BlockMatrix multiplication

2015-07-16 Thread Ulanov, Alexander
d decrease then. Best, Burak On Wed, Jul 15, 2015 at 3:04 PM, Ulanov, Alexander mailto:alexander.ula...@hp.com>> wrote: Hi Burak, I’ve modified my code as you suggested, however it still leads to shuffling. Could you suggest what’s wrong with my code or provide an example code with

RE: BlockMatrix multiplication

2015-07-15 Thread Ulanov, Alexander
/ 1e9) Best regards, Alexander From: Ulanov, Alexander Sent: Tuesday, July 14, 2015 6:24 PM To: 'Burak Yavuz' Cc: Rakesh Chalasani; dev@spark.apache.org Subject: RE: BlockMatrix multiplication Hi Burak, Thank you for explanation! I will try to make a diagonal block matrix and report y

RE: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
From: Burak Yavuz [mailto:brk...@gmail.com] Sent: Tuesday, July 14, 2015 10:14 AM To: Ulanov, Alexander Cc: Rakesh Chalasani; dev@spark.apache.org Subject: Re: BlockMatrix multiplication Hi Alexander, From your example code, using the GridPartitioner, you will have 1 column, and 5 rows. When you

RE: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
am missing something or using it wrong. Best regards, Alexander From: Rakesh Chalasani [mailto:vnit.rak...@gmail.com] Sent: Tuesday, July 14, 2015 9:05 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: BlockMatrix multiplication Hi Alexander: Aw, I missed the 'cogrou

Re: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
s a local reduce before aggregating across nodes. Rakesh On Mon, Jul 13, 2015 at 9:24 PM Ulanov, Alexander mailto:alexander.ula...@hp.com>> wrote: Dear Spark developers, I am trying to perform BlockMatrix multiplication in Spark. My test is as follows: 1)create a matrix of N blocks, so

BlockMatrix multiplication

2015-07-13 Thread Ulanov, Alexander
Dear Spark developers, I am trying to perform BlockMatrix multiplication in Spark. My test is as follows: 1) create a matrix of N blocks, so that each row of the block matrix contains only 1 block and each block resides in a separate partition on a separate node, 2) transpose the block matrix, and 3) multi
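A minimal sketch of that test setup with hypothetical sizes: N single-block rows, one block per partition, followed by the transpose and multiply steps:

    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    val n = 5
    val blockSize = 1000
    // block coordinate (i, 0): each row of the block matrix holds exactly one block
    val blocks = sc.parallelize(0 until n, n).map { i =>
      ((i, 0), Matrices.rand(blockSize, blockSize, new java.util.Random(i)))
    }
    val mat = new BlockMatrix(blocks, blockSize, blockSize)
    val product = mat.transpose.multiply(mat)
    product.blocks.count() // force the multiplication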

RE: Model parallelism with RDD

2015-07-13 Thread Ulanov, Alexander
= newRDD i += 1 } println("Avg iteration time:" + avgTime / numIterations) Best regards, Alexander From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] Sent: Friday, July 10, 2015 10:04 PM To: Ulanov, Alexander Cc: ; dev@spark.apache.org Subject: Re: Model parallelism with RD

Re: Model parallelism with RDD

2015-07-10 Thread Ulanov, Alexander
.count` before you do oldRDD.unpersist(true) -- Otherwise it might be recomputing all the previous iterations each time. Thanks Shivaram On Fri, Jul 10, 2015 at 7:44 PM, Ulanov, Alexander mailto:alexander.ula...@hp.com>> wrote: Hi, I am interested how scalable can be the model parallelism

Model parallelism with RDD

2015-07-10 Thread Ulanov, Alexander
Hi, I am interested in how scalable the model parallelism within Spark can be. Suppose the model contains N weights of type Double, and N is so large that it does not fit into the memory of a single node. So, we can store the model in an RDD[Double] across several nodes. To train the model, one needs to

RE: Force inner join to shuffle the smallest table

2015-06-25 Thread Ulanov, Alexander
[68] at explain at <console>:25 Could Spark SQL developers suggest why this happens? Best regards, Alexander From: Stephen Carman [mailto:scar...@coldlight.com] Sent: Wednesday, June 24, 2015 12:33 PM To: Ulanov, Alexander Cc: CC GP; dev@spark.apache.org Subject: Re: Force inner join to shuffle the smallest

RE: Force inner join to shuffle the smallest table

2015-06-24 Thread Ulanov, Alexander
It also fails, as I mentioned in the original question. From: CC GP [mailto:chandrika.gopalakris...@gmail.com] Sent: Wednesday, June 24, 2015 12:08 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Force inner join to shuffle the smallest table Try below and see if it makes a

Force inner join to shuffle the smallest table

2015-06-24 Thread Ulanov, Alexander
Hi, I am trying to inner join two tables on two fields (string and double). One table has 2B rows, the second 500K. They are stored in HDFS in Parquet. Spark v 1.4. val big = sqlContext.parquetFile("hdfs://big") big.registerTempTable("big") val small = sqlContext.parquetFile("hdfs://small") small.reg
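A minimal sketch of one lever for this: raising the broadcast threshold so the planner broadcasts the 500K-row table instead of shuffling the 2B-row one; the threshold value and join columns are assumptions:

    // allow tables up to ~100MB to be broadcast (the default is much lower)
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
    val joined = sqlContext.sql(
      "SELECT * FROM big JOIN small ON big.s = small.s AND big.d = small.d")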

Force Spark save parquet files with replication factor other than 3 (default one)

2015-06-22 Thread Ulanov, Alexander
Hi, My Hadoop is configured to have a replication factor of 2. I've added $HADOOP_HOME/config to the PATH as suggested in http://apache-spark-user-list.1001560.n3.nabble.com/hdfs-replication-on-saving-RDD-td289.html. Spark (1.4) does rdd.saveAsTextFile with replication=2. However DataFrame.saveAsP
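A minimal sketch of a workaround worth trying, assuming a DataFrame df: set the replication factor directly on Spark's Hadoop configuration before writing (whether the Parquet output path honors it is exactly what this thread is probing):

    sc.hadoopConfiguration.set("dfs.replication", "2")
    df.saveAsParquetFile("hdfs:///path/out") // DataFrame API as of Spark 1.4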

Increase partition count (repartition) without shuffle

2015-06-18 Thread Ulanov, Alexander
Hi, Is there a way to increase the number of partitions of an RDD without causing a shuffle? I've found JIRA issue https://issues.apache.org/jira/browse/SPARK-5997, however there is no implementation yet. Just in case, I am reading data from ~300 big binary files, which results in 300 partitions, the

RE: Using CUDA within Spark / boosting linear algebra

2015-05-21 Thread Ulanov, Alexander
IDMat used only one GPU. John, could you suggest how to force BIDMat to use all GPUs? Also, could you suggest how to test Double matrices multiplication in BIDMat-cuda (in GPU and with copy from/to main memory)? Best regards, Alexander -Original Message- From: Ulanov, Alexander

RE: testing HTML email

2015-05-14 Thread Ulanov, Alexander
Testing too. Recently I got a few undelivered mails to the dev list. From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, May 14, 2015 3:39 PM To: dev@spark.apache.org Subject: testing HTML email Testing html emails ... Hello This is bold This is a link

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
s1, s2) => s1 ++= s2) Best regards, Alexander -Original Message----- From: Ulanov, Alexander Sent: Monday, May 11, 2015 11:59 AM To: Olivier Girardot; Michael Armbrust Cc: Reynold Xin; dev@spark.apache.org Subject: RE: DataFrame distinct vs RDD distinct Hi, Could you suggest alternative way

RE: Easy way to convert Row back to case class

2015-05-11 Thread Ulanov, Alexander
Thank you for suggestions! From: Reynold Xin [mailto:r...@databricks.com] Sent: Friday, May 08, 2015 11:10 AM To: Will Benton Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Easy way to convert Row back to case class In 1.4, you can do row.getInt("colName") In 1.5, some

RE: DataFrame distinct vs RDD distinct

2015-05-11 Thread Ulanov, Alexander
Hi, Could you suggest an alternative way of implementing distinct, e.g. via fold or aggregate? Both SQL distinct and RDD distinct fail on my dataset due to overflow of the Spark shuffle disk. I have 7 nodes with 300GB dedicated to Spark shuffle each. My dataset is 2B rows, the field which I'm performi
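A minimal sketch of distinct expressed through reduceByKey, which at least lets you pick the shuffle partition count; the partition number is an assumption to tune against the per-node shuffle budget:

    // keep one representative per key; more partitions mean smaller spill files
    val distinctRdd = rdd.map(x => (x, null))
      .reduceByKey((a, _) => a, 2000)
      .keys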

Easy way to convert Row back to case class

2015-05-08 Thread Ulanov, Alexander
Hi, I created a dataset RDD[MyCaseClass], converted it to a DataFrame, and saved it to a Parquet file, following https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds When I load this dataset with sqlContext.parquetFile, I get a DataFrame with column names as in initia
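A minimal sketch of mapping Rows back by position, with a hypothetical case class shape; the accessor order must match the schema:

    case class MyCaseClass(a: Int, b: Double)

    val typed = df.map(r => MyCaseClass(r.getInt(0), r.getDouble(1)))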

RE: Speeding up Spark build during development

2015-05-01 Thread Ulanov, Alexander
Hi Pramod, For cluster-like tests you might want to use the same code as in mllib's LocalClusterSparkContext. You can rebuild only the package that you change and then run this main class. Best regards, Alexander -Original Message- From: Pramod Biligiri [mailto:pramodbilig...@gmail.com

Re: Should we let everyone set Assignee?

2015-04-24 Thread Ulanov, Alexander
after merging the PR (why it is important to put the JIRA in the title). It can't auto-assign the JIRA since usernames don't match up, but it is an easy reminder to set the Assignee. I do right aft

RE: Should we let everyone set Assignee?

2015-04-23 Thread Ulanov, Alexander
My thinking is that the current way of assigning a contributor after the patch is done (or almost done) is OK. Parallel efforts are also OK as long as they are discussed in the issue's thread. Ilya Ganelin made a good point that it is about moving the project forward. It also adds a means of competition "w

RE: Regularization in MLlib

2015-04-07 Thread Ulanov, Alexander
: Tuesday, April 07, 2015 3:28 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Regularization in MLlib 1) Norm(weights, N) will return (w_1^N + w_2^N + ...)^(1/N), so norm * norm is required. 2) This is a bug, as you said. I intend to fix this using weighted regularization, and

Regularization in MLlib

2015-04-07 Thread Ulanov, Alexander
Hi, Could anyone elaborate on the regularization in Spark? I've found that L1 and L2 are implemented with Updaters (L1Updater, SquaredL2Updater). 1) Why is the loss reported by L2 (0.5 * regParam * norm * norm), where norm is Norm(weights, 2.0)? It should be 0.5*regParam*norm (0.5 to disappear aft

RE: Stochastic gradient descent performance

2015-04-06 Thread Ulanov, Alexander
“many updates” equals impractical time needed for learning. From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] Sent: Sunday, April 05, 2015 7:13 PM To: Ulanov, Alexander Cc: shiva...@eecs.berkeley.edu; Joseph Bradley; dev@spark.apache.org Subject: Re: Stochastic gradient descent

RE: Running LocalClusterSparkContext

2015-04-03 Thread Ulanov, Alexander
you suggest? (It seems that the new version of Spark was not tested on Windows. Previous versions worked more or less fine for me.) -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Friday, April 03, 2015 1:04 PM To: Ulanov, Alexander Cc: dev@spark.apache.org

RE: Running LocalClusterSparkContext

2015-04-03 Thread Ulanov, Alexander
ideas? -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Friday, April 03, 2015 12:52 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Running LocalClusterSparkContext That looks like another bug on top of 6673; can you point that out in the PR to make

RE: Running LocalClusterSparkContext

2015-04-03 Thread Ulanov, Alexander
running executor java.lang.IllegalStateException: No assemblies found in 'C:\ulanov\dev\spark\mllib\.\assembly\target\scala-2.10'. -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Friday, April 03, 2015 12:31 PM To: Ulanov, Alexander Cc: dev@spark.

Running LocalClusterSparkContext

2015-04-03 Thread Ulanov, Alexander
Hi, I am trying to execute unit tests with LocalClusterSparkContext on Windows 7. I am getting a bunch of errors in the log saying: "Cannot find any assembly build directories." Below is the part of the log where it breaks. Could you suggest what's happening? In addition, the application h

RE: Stochastic gradient descent performance

2015-04-02 Thread Ulanov, Alexander
: Thursday, April 02, 2015 1:26 PM To: Joseph Bradley Cc: Ulanov, Alexander; dev@spark.apache.org Subject: Re: Stochastic gradient descent performance I haven't looked closely at the sampling issues, but regarding the aggregation latency, there are fixed overheads (in local and distributed mode)

RE: Stochastic gradient descent performance

2015-04-02 Thread Ulanov, Alexander
on this? I do understand that in cluster mode the network speed will kick in and then one can blame it. Best regards, Alexander From: Joseph Bradley [mailto:jos...@databricks.com] Sent: Thursday, April 02, 2015 10:51 AM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Stochastic

RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
reeman.jer...@gmail.com] Sent: Wednesday, April 01, 2015 1:37 PM To: Hector Yee Cc: Ulanov, Alexander; Evan R. Sparks; Stephen Boesch; dev@spark.apache.org Subject: Re: Storing large data for MLlib machine learning @Alexander, re: using flat binary and metadata, you raise excellent points! At least in ou

RE: Using CUDA within Spark / boosting linear algebra

2015-04-01 Thread Ulanov, Alexander
2:43 PM To: Sean Owen Cc: Evan R. Sparks; Sam Halliday; dev@spark.apache.org; Ulanov, Alexander; jfcanny Subject: Re: Using CUDA within Spark / boosting linear algebra Hi Alex, Since it is non-trivial to make nvblas work with netlib-java, it would be great if you can send the instructions to netl

RE: Storing large data for MLlib machine learning

2015-04-01 Thread Ulanov, Alexander
Thanks, sounds interesting! How do you load files to Spark? Did you consider having multiple files instead of file lines? From: Hector Yee [mailto:hector@gmail.com] Sent: Wednesday, April 01, 2015 11:36 AM To: Ulanov, Alexander Cc: Evan R. Sparks; Stephen Boesch; dev@spark.apache.org Subject

RE: Stochastic gradient descent performance

2015-04-01 Thread Ulanov, Alexander
Sorry for bothering you again, but I think that it is an important issue for the applicability of SGD in Spark MLlib. Could Spark developers please comment on it? -Original Message- From: Ulanov, Alexander Sent: Monday, March 30, 2015 5:00 PM To: dev@spark.apache.org Subject: Stochastic

RE: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Ulanov, Alexander
@spark.apache.org; Ulanov, Alexander; jfcanny Subject: Re: Using CUDA within Spark / boosting linear algebra Hi Alex, Since it is non-trivial to make nvblas work with netlib-java, it would be great if you can send the instructions to netlib-java as part of the README. Hopefully we don't need to m

Stochastic gradient descent performance

2015-03-30 Thread Ulanov, Alexander
Hi, It seems to me that there is an overhead in the "runMiniBatchSGD" function of MLlib's "GradientDescent". In particular, "sample" and "treeAggregate" might take time that is an order of magnitude greater than the actual gradient computation. For example, for the MNIST dataset of 60K instances, miniba

RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
...@gmail.com] Sent: Thursday, March 26, 2015 3:01 PM To: Ulanov, Alexander Cc: Stephen Boesch; dev@spark.apache.org Subject: Re: Storing large data for MLlib machine learning Hi Ulanov, great question, we've encountered it frequently with scientific data (e.g. time series). Agreed te

RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
Thanks, Evan. What do you think about Protobuf? Twitter has a library to manage protobuf files in HDFS: https://github.com/twitter/elephant-bird From: Evan R. Sparks [mailto:evan.spa...@gmail.com] Sent: Thursday, March 26, 2015 2:34 PM To: Stephen Boesch Cc: Ulanov, Alexander; dev

RE: Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
. From: Stephen Boesch [mailto:java...@gmail.com] Sent: Thursday, March 26, 2015 2:27 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Storing large data for MLlib machine learning There are some convenience methods you might consider including: MLUtils.loadLibSVMFile

Storing large data for MLlib machine learning

2015-03-26 Thread Ulanov, Alexander
Hi, Could you suggest what would be a reasonable file format to store feature vector data for machine learning in Spark MLlib? Are there any best practices for Spark? My data is dense feature vectors with labels. Some of the requirements are that the format should be easily loaded/serialized,

RE: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Ulanov, Alexander
adsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing -Original Message----- From: Ulanov, Alexander Sent: Wednesday, March 25, 2015 2:31 PM To: Sam Halliday Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; jfcanny Subject: RE: Using CUDA within Spark / boosting linear algebra Hi
