Re: Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
2015 at 12:36 PM, Reynold Xin wrote: > How about just using two fields, one boolean field to mark good/bad, and > another to get the source file? > > > On Wed, Jul 15, 2015 at 10:31 AM, RJ Nowling wrote: > >> Hi all, >> >> I'm working on an ETL task

Record metadata with RDDs and DataFrames

2015-07-15 Thread RJ Nowling
Hi all, I'm working on an ETL task with Spark. As part of this work, I'd like to mark records with some info such as: 1. Whether the record is good or bad (e.g., Either) 2. Originating file and lines Part of my motivation is to prevent errors with individual records from stopping the entire pipe
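To make the idea concrete, here is a minimal sketch of one way to carry such metadata, assuming text input and a numeric parse; the LineMeta case class, its field names, and the loadWithMetadata helper are hypothetical, not something from the thread:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical metadata carried with every record: originating file and line number.
case class LineMeta(file: String, lineNumber: Long)

def loadWithMetadata(sc: SparkContext, path: String): RDD[(LineMeta, Either[String, Double])] =
  sc.textFile(path).zipWithIndex().map { case (line, idx) =>
    // Left = bad record (error message kept), Right = parsed value;
    // a bad record no longer aborts the whole pipeline.
    val parsed: Either[String, Double] =
      try Right(line.trim.toDouble)
      catch { case e: NumberFormatException => Left(e.getMessage) }
    (LineMeta(path, idx), parsed)
  }
```

Bad records can then be pulled out with records.filter(_._2.isLeft) for inspection while the good ones flow through the rest of the pipeline.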

Re: Grouping runs of elements in a RDD

2015-07-02 Thread RJ Nowling
> > > If you can tolerate a few inaccuracies then you can just do the second > > step. You will miss the “boundaries” of the partitions but it might be > > acceptable for your use case. > > > On Tue, Jun 30, 2015 at 12:21 PM, RJ Nowling wrote: > >> That's

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
ines that were split prematurely.) On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh < abhis...@tetrationanalytics.com> wrote: > could you use a custom partitioner to preserve boundaries such that all > related tuples end up on the same partition? > > On Jun 30, 2015, at 1

Re: Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
ed by others? On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin wrote: > Try mapPartitions, which gives you an iterator, and you can produce an > iterator back. > > > On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling wrote: > >> Hi all, >> >> I have a problem where I have a

Grouping runs of elements in a RDD

2015-06-30 Thread RJ Nowling
Hi all, I have a problem where I have an RDD of elements: Item1 Item2 Item3 Item4 Item5 Item6 ... and I want to run a function over them to decide which runs of elements to group together: [Item1 Item2] [Item3] [Item4 Item5 Item6] ... Technically, I could use aggregate to do this, but I would h
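A minimal sketch of the mapPartitions approach suggested above in this thread (iterator in, iterator out), under the simplifying assumption that a run never crosses a partition boundary; the sameRun predicate and the groupRuns helper are hypothetical:

```scala
import org.apache.spark.rdd.RDD

// Group consecutive items into runs within each partition. This assumes a
// run never spans two partitions -- the boundary caveat discussed above.
// `sameRun` decides whether `next` continues the run currently ending with `prev`.
def groupRuns(rdd: RDD[String])(sameRun: (String, String) => Boolean): RDD[List[String]] =
  rdd.mapPartitions { iter =>
    val runs = scala.collection.mutable.ArrayBuffer.empty[List[String]]
    var current = List.empty[String]
    for (item <- iter) {
      if (current.isEmpty || sameRun(current.head, item)) current = item :: current
      else { runs += current.reverse; current = List(item) }
    }
    if (current.nonEmpty) runs += current.reverse
    runs.iterator
  }
```

Runs that straddle a partition boundary would still need a second pass (or a custom partitioner, as suggested elsewhere in the thread) to be stitched back together.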

Re: enum-like types in Spark

2015-03-11 Thread RJ Nowling
How do these proposals affect PySpark? I think compatibility with PySpark through Py4J should be considered. On Mon, Mar 9, 2015 at 8:39 PM, Patrick Wendell wrote: > Does this matter for our own internal types in Spark? I don't think > any of these types are designed to be used in RDD records,
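For readers following along, one of the Scala patterns commonly weighed in this kind of discussion is a sealed trait plus case objects; the sketch below is purely illustrative (the StorageFormat type is invented here, not one of the thread's proposals), with the string-based factory hinting at why Py4J compatibility matters:

```scala
// Hypothetical enum-like type: sealed trait + case objects.
sealed trait StorageFormat { def name: String }
object StorageFormat {
  case object Parquet extends StorageFormat { val name = "parquet" }
  case object Json    extends StorageFormat { val name = "json" }

  // A string-based factory keeps the type reachable from Py4J,
  // which only passes primitives and JVM object handles across the bridge.
  def fromString(s: String): StorageFormat = s.toLowerCase match {
    case "parquet" => Parquet
    case "json"    => Json
    case other     => throw new IllegalArgumentException(s"Unknown format: $other")
  }
}
```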

Re: SciSpark: NASA AIST14 proposal

2015-01-14 Thread RJ Nowling
Congratulations, Chris! I created a JIRA for "dimensional" RDDs that might be relevant: https://issues.apache.org/jira/browse/SPARK-4727 Jeremy Freeman pointed me to his lab's work on Thunder for neuroscience, which has some related functionality: http://thefreemanlab.com/thunder/ On Wed, Jan 14, 2015 a

Re: Incorrect Maven Artifact Names

2015-01-14 Thread RJ Nowling
n change assembly.pom to (i) > not skip the install plugin and (ii) have "jar" as the packaging, > instead of pom. > > > > On Wed, Jan 14, 2015 at 1:08 PM, RJ Nowling wrote: > > Hi Sean, > > > > I confirmed that if I take the Spark 1.2.0 re

Re: Incorrect Maven Artifact Names

2015-01-14 Thread RJ Nowling
-6382f8428b13fa6082fa688178f3dbcc On Wed, Jan 14, 2015 at 2:59 PM, RJ Nowling wrote: > Thanks, Sean. > > Yes, Spark is incorrectly copying the spark assembly jar to > com/google/guava in the maven repository. This is for the 1.2.0 release, > just to clarify. > > I reverted the patches that shade

Re: Incorrect Maven Artifact Names

2015-01-14 Thread RJ Nowling
ng is in com/google/guava? > > You can un-skip the install plugin with -Dmaven.install.skip=false > > On Wed, Jan 14, 2015 at 7:26 PM, RJ Nowling wrote: > > Hi all, > > > > I'm trying to upgrade some Spark RPMs from 1.1.0 to 1.2.0. As part of > the > > R

Incorrect Maven Artifact Names

2015-01-14 Thread RJ Nowling
Hi all, I'm trying to upgrade some Spark RPMs from 1.1.0 to 1.2.0. As part of the RPM process, we build Spark with Maven. With Spark 1.2.0, though, the artifacts are placed in com/google/guava and there is no org/apache/spark. I saw that the pom.xml files had been modified to prevent the instal

Re: Maintainer for Mesos

2015-01-08 Thread RJ Nowling
Hi Andrew, Patrick Wendell and Andrew Or have committed previous patches related to Mesos. Maybe they would be good committers to look at it? RJ On Mon, Jan 5, 2015 at 6:40 PM, Andrew Ash wrote: > Hi Spark devs, > > I'm interested in having a committer look at a PR [1] for Mesos, but > there's

RDDs for "dimensional" (time series, spatial) data

2014-12-04 Thread RJ Nowling
se/SPARK-4727 I saw that MLlib supports some operations for time series in 1.2.0-rc1, but I think that specialized RDDs could optimize the partitioning and algorithms better than a regular RDD. Or, for example, spatial data could be partitioned into a grid. Any feedback would be great! Thanks, RJ N
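As a hypothetical illustration of the grid idea for spatial data, here is a custom Partitioner that buckets (x, y) keys into square cells; the grid dimensions, cell size, and key type are assumptions made here, not part of the JIRA:

```scala
import org.apache.spark.Partitioner

// Hypothetical grid partitioner for 2-D spatial keys (x, y).
// Points are bucketed into square cells of side `cellSize`; each cell maps
// to one partition, so nearby points tend to land on the same partition.
class GridPartitioner(gridDim: Int, cellSize: Double) extends Partitioner {
  override def numPartitions: Int = gridDim * gridDim

  override def getPartition(key: Any): Int = key match {
    case (x: Double, y: Double) =>
      val col = math.min(gridDim - 1, math.max(0, (x / cellSize).toInt))
      val row = math.min(gridDim - 1, math.max(0, (y / cellSize).toInt))
      row * gridDim + col
    case _ => 0
  }
}
```

A pair RDD keyed by coordinates could then co-locate nearby points with points.partitionBy(new GridPartitioner(4, 10.0)).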

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread RJ Nowling
Matei, I saw that you're listed as a maintainer for ~6 different subcomponents, and on over half of those, you're only the 2nd person. My concern is that you would be stretched thin and maybe wouldn't be able to work as a "back up" on all of those subcomponents. Are you planning on adding more m

Re: Surprising Spark SQL benchmark

2014-11-01 Thread RJ Nowling
Two thoughts here: 1. The real flaw with the sort benchmark was that Hadoop wasn't run on the same hardware. Given the advances in networking (availability of 10Gb Ethernet) and disks (SSDs) since the Hadoop benchmarks it was compared to, it's an apples-to-oranges comparison. Without that, it does

Re: Multitenancy in Spark - within/across spark context

2014-10-25 Thread RJ Nowling
Ashwin, What is your motivation for needing to share RDDs between jobs? Optimizing for reusing data across jobs? If so, you may want to look into Tachyon. My understanding is that Tachyon acts like a caching layer and you can designate when data will be reused in multiple jobs so it knows to keep
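A minimal sketch of that caching idea, assuming a spark-shell SparkContext sc and a Spark 1.x deployment where Tachyon backs the off-heap store; the input path and the Tachyon master URL are made up:

```scala
import org.apache.spark.storage.StorageLevel

// Keep the RDD's blocks outside the executor heap; in Spark 1.x this
// storage level is backed by Tachyon when Tachyon is configured.
val events = sc.textFile("hdfs:///data/events").filter(_.nonEmpty) // hypothetical path
events.persist(StorageLevel.OFF_HEAP)

// To share data across separate applications (rather than jobs inside one
// SparkContext), the usual approach is to write it to Tachyon explicitly:
events.saveAsTextFile("tachyon://tachyon-master:19998/shared/events") // hypothetical URL
```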

Re: PR for Hierarchical Clustering Needs Review

2014-10-24 Thread RJ Nowling
ouple bugs. I will ask other developers to help > review the PR. Thanks for working with Yu and helping the code review! > > Best, > Xiangrui > > On Thu, Oct 23, 2014 at 2:58 AM, RJ Nowling wrote: > > Hi all, > > > > A few months ago, I collected feedback on what t

PR for Hierarchical Clustering Needs Review

2014-10-23 Thread RJ Nowling
Hi all, A few months ago, I collected feedback on what the community was looking for in clustering methods. A number of the community members requested a divisive hierarchical clustering method. Yu Ishikawa has stepped up to implement such a method. I've been working with him to communicate wha

Re: Spark on Mesos 0.20

2014-10-08 Thread RJ Nowling
h Scala and Spark (it's been all python). > > Fi > > > > Fairiz "Fi" Azizi > > On Tue, Oct 7, 2014 at 6:29 AM, RJ Nowling wrote: > >> I was able to reproduce it on a small 4 node cluster (1 mesos master and >> 3 mesos slaves) with relatively low-

Re: Spark on Mesos 0.20

2014-10-07 Thread RJ Nowling
, > Fi > > > > Fairiz "Fi" Azizi > > On Mon, Oct 6, 2014 at 9:20 AM, Timothy Chen wrote: > >> Ok I created SPARK-3817 to track this, will try to repro it as well. >> >> Tim >> >> On Mon, Oct 6, 2014 at 6:08 AM, RJ Nowling wrote: &

Re: Spark on Mesos 0.20

2014-10-06 Thread RJ Nowling
I've recently run into this issue as well. I get it from running Spark examples such as log query. Maybe that'll help reproduce the issue. On Monday, October 6, 2014, Gurvinder Singh wrote: > The issue does not occur if the task at hand has small number of map > tasks. I have a task which has 9

Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread RJ Nowling
I think it would be interesting to have a variety of matrix operations (multiplication, addition / subtraction, powers, scalar multiply, etc.) available in Spark. Diagonalization may be more difficult but iterative approximation approaches may be quite amenable. On Fri, Sep 5, 2014 at 5:26 AM, Y
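For reference, MLlib's RowMatrix already supports multiplying a distributed matrix by a small local matrix; a minimal sketch assuming a spark-shell SparkContext sc, with made-up dimensions (a 3 x 2 distributed matrix times a 2 x 2 local matrix):

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Distributed, row-oriented matrix built from an RDD of rows.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0),
  Vectors.dense(5.0, 6.0)))
val distributed = new RowMatrix(rows)

// Small local matrix (column-major values); here the 2 x 2 identity.
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val product: RowMatrix = distributed.multiply(local)
```

Distributed-by-distributed multiplication, addition, and iterative approximations for diagonalization would go beyond this and are part of what the thread is asking about.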

Re: Is breeze thread safe in Spark?

2014-09-03 Thread RJ Nowling
're using += should be safe. > > > On Wed, Sep 3, 2014 at 11:58 AM, RJ Nowling wrote: > >> David, >> >> Can you confirm that += is not thread safe but + is? I'm assuming + >> allocates a new object for the write, while += doesn't. >> >> Thanks!

Re: Is breeze thread safe in Spark?

2014-09-03 Thread RJ Nowling
cate separate work arrays for each call to >> lapack, so it should be fine. In general concurrent modification isn't >> thread safe of course, but things that "ought" to be thread safe really >> should be. >> >> >> On Wed, Sep 3, 2014 at 10:41 AM, RJ

Re: Is breeze thread safe in Spark?

2014-09-03 Thread RJ Nowling
so it should be fine. In general concurrent modification isn't > thread safe of course, but things that "ought" to be thread safe really > should be. > > > On Wed, Sep 3, 2014 at 10:41 AM, RJ Nowling wrote: > >> No, it's not in all cases. Since Bre

Re: Is breeze thread safe in Spark?

2014-09-03 Thread RJ Nowling
No, it's not in all cases. Since Breeze uses LAPACK under the hood, modifying the same memory from different threads is unsafe. There's actually a potential bug in the KMeans code where it uses += instead of +. On Wed, Sep 3, 2014 at 1:26 PM, Ulanov, Alexander wrote: > Hi, > > Is breeze library call
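A small sketch of the distinction being discussed, using Breeze's DenseVector: + allocates a new vector, while += mutates the left operand in place, which is where concurrent modification can bite:

```scala
import breeze.linalg.DenseVector

val a = DenseVector(1.0, 2.0, 3.0)
val b = DenseVector(0.5, 0.5, 0.5)

// `+` allocates and returns a new vector; `a` is untouched.
val c = a + b

// `+=` updates `a`'s backing array in place and returns `a` itself.
// If another thread reads `a` at the same time, that read races with
// this write -- the concern raised in this thread.
a += b
```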

[GraphX] JIRA / PR to fix breakage in GraphGenerator.logNormalGraph in PR #720

2014-08-27 Thread RJ Nowling
Hi all, PR #720 made multiple changes to GraphGenerator.logNormalGraph, including: - Replacing calls to the functions for generating random vertices and edges with in-line implementations that use different equations. Based on reading the Pregel pape

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
freeman, phd > neuroscientist > @thefreemanlab > > On Aug 12, 2014, at 2:20 PM, RJ Nowling wrote: > > Hi all, > > I wanted to follow up. > > I have a prototype for an optimized version of hierarchical k-means. I > wanted to get some feedback on my apporach. > >

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-27 Thread RJ Nowling
Hi Yu, A standardized API has not been implemented yet. I think it would be better to implement the other clustering algorithms and then extract a common API. Others may feel differently. :) Just a note, there was a pre-existing JIRA for hierarchical KMeans: SPARK-2429

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Also, another idea: many algorithms that use sampling tend to do so multiple times. It may be beneficial to allow a transformation to a representation that is more efficient for multiple rounds of sampling. On Tue, Aug 26, 2014 at 4:36 PM, RJ Nowling wrote: > Xiangrui, > > I posted
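A minimal sketch of the repeated-sampling pattern, where caching is the simplest way to avoid recomputing the lineage on every round (roughly what MLlib's runMiniBatchSGD does per iteration); the miniBatches helper, fraction, and iteration count are made up for illustration:

```scala
import org.apache.spark.rdd.RDD

// Draw one mini-batch per iteration from the same RDD.
// Caching first means each sample scans memory rather than recomputing the
// lineage; a representation tuned for repeated sampling could do better still.
def miniBatches(data: RDD[Double], iterations: Int, fraction: Double): Seq[RDD[Double]] = {
  data.cache()
  (0 until iterations).map(i => data.sample(withReplacement = false, fraction, seed = i))
}
```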

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
main/scala/org/apache/spark/mllib/classification/NeuralNetwork.scala > > > > Unit tests are in the same branch. > > > > Alexander > > > > From: RJ Nowling [mailto:rnowl...@gmail.com] > > Sent: Tuesday, August 26, 2014 6:59 PM > > To: Ulanov, Alexand

Re: Gradient descent and runMiniBatchSGD

2014-08-26 Thread RJ Nowling
Hi Alexander, Can you post a link to the code? RJ On Tue, Aug 26, 2014 at 6:53 AM, Ulanov, Alexander wrote: > Hi, > > I've implemented a back propagation algorithm using the Gradient class and a > simple update using the Updater class. Then I ran the algorithm with MLlib's > GradientDescent class. I ha

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-12 Thread RJ Nowling
e are any open-source examples that are being widely used in production? Thanks! On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling wrote: > Nice to meet you, Jeremy! > > This is great! Hierarchical clustering was next on my list -- > currently trying to get my PR for MiniBatch KMeans acce

Re: Examples have SparkContext improperly labeled?

2014-07-21 Thread RJ Nowling
e.g. "spark". > > -Sandy > > > On Mon, Jul 21, 2014 at 8:36 AM, RJ Nowling wrote: > >> Hi all, >> >> The examples listed here >> >> https://spark.apache.org/examples.html >> >> refer to the spark context as "spark" b

Examples have SparkContext improperly labeled?

2014-07-21 Thread RJ Nowling
Hi all, The examples listed here https://spark.apache.org/examples.html refer to the SparkContext as "spark", but the Spark shell uses "sc" for the SparkContext. Am I missing something? Thanks! RJ -- em rnowl...@gmail.com c 954.496.2314
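For comparison, in spark-shell the pre-built SparkContext is bound to sc, so an example along the lines of those on that page would read (the input path is hypothetical):

```scala
// In spark-shell, the SparkContext is already available as `sc`.
val lines = sc.textFile("hdfs:///path/to/file.txt")
val errorCount = lines.filter(_.contains("ERROR")).count()
```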

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-18 Thread RJ Nowling
Nice to meet you, Jeremy! This is great! Hierarchical clustering was next on my list -- currently trying to get my PR for MiniBatch KMeans accepted. If it's cool with you, I'll try converting your code to fit in with the existing MLlib code as you suggest. I also need to review the Decision Tree

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread RJ Nowling
us level of >>>> > kmeans. >>>> > >>>> > I haven't been much of a fan of bottom up approaches like HAC mainly >>>> > because they assume there is already a distance metric for items to >>>> items. >>>> > This

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread RJ Nowling
way the similarity >> > matrix), do you count paths to same parent node if you are computing >> > distances between items in two adjacent nodes for example. It is also a >> bit >> > harder to distribute the computation for bottom up approaches as you have >> &

Re: Contribution to MLlib

2014-07-09 Thread RJ Nowling
Hi Meethu, There is no code for a Gaussian Mixture Model clustering algorithm in the repository, but I don't know if anyone is working on it. RJ On Wednesday, July 9, 2014, MEETHU MATHEW wrote: > Hi, > > I am interested in contributing a clustering algorithm towards MLlib of > Spark. I am focus

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
tions the most useful would be >> > hierarchical >> > > k-means with back tracking and the ability to support k nearest >> > centroids. >> > > >> > > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling >> wrote: >> > &g

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Thanks, Hector! Your feedback is useful. On Tuesday, July 8, 2014, Hector Yee wrote: > I would say for bigdata applications the most useful would be hierarchical > k-means with back tracking and the ability to support k nearest centroids. > > > On Tue, Jul 8, 2014 at 10:54

Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread RJ Nowling
Hi all, MLlib currently has one clustering algorithm implementation, KMeans. It would benefit from having implementations of other clustering algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical Clustering, and Affinity Propagation. I recently submitted a PR [1] for a MiniBatch KMeans

Re: Contributing to MLlib

2014-07-02 Thread RJ Nowling
Hey Alex, I'm also a new contributor. I created a pull request for the KMeans MiniBatch implementation here: https://github.com/apache/spark/pull/1248 I also created a JIRA here: https://issues.apache.org/jira/browse/SPARK-2308 As part of my work, I started to refactor the common code to crea