Re: deep learning with heterogeneous cloud computing using spark

2016-01-30 Thread Christopher Nguyen
) for the "descent" step while Spark computes the gradients. The video was recently uploaded here http://bit.ly/1JnvQAO. Regards, -- *Algorithms of the Mind **http://bit.ly/1ReQvEW <http://bit.ly/1ReQvEW>* Christopher Nguyen CEO & Co-Founder www.Arimo.com (née Adatao) linkedin.com/in/ctnguyen

Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
On Sep 6, 2014 1:39 PM, "oppokui" wrote: > Thanks, Christopher. I saw it before; it is amazing. Last time I tried to > download it from Adatao, but got no response after filling out the form. How can I > download it or its source code? What is the license? > > Kui

Re: Support R in Spark

2014-09-06 Thread Christopher Nguyen
Hi Kui, DDF (open sourced) also aims to do something similar, adding RDBMS idioms, and is already implemented on top of Spark. One philosophy is that the DDF API aggressively hides the notion of parallel datasets, exposing only (mutable) tables to users, on which they can apply R and other famili

Re: First Bay Area Tachyon meetup: August 25th, hosted by Yahoo! (Limited Space)

2014-08-19 Thread Christopher Nguyen
Fantastic! Sent while mobile. Pls excuse typos etc. On Aug 19, 2014 4:09 PM, "Haoyuan Li" wrote: > Hi folks, > > We've posted the first Tachyon meetup, which will be on August 25th and is > hosted by Yahoo! (Limited Space): > http://www.meetup.com/Tachyon/events/200387252/ . Hope to see you ther

Re: How to save mllib model to hdfs and reload it

2014-08-14 Thread Christopher Nguyen
> works. > > > On Wed, Aug 13, 2014 at 9:20 PM, Christopher Nguyen > wrote: > >> Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to >> isolate the cause to serialization of the loaded model. And also try to >> serialize the deserialized

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Christopher Nguyen
Lance, some debugging ideas: you might try model.predict(RDD[Vector]) to isolate the cause to serialization of the loaded model. And also try to serialize the deserialized (loaded) model "manually" to see if that throws any visible exceptions. Sent while mobile. Pls excuse typos etc. On Aug 13, 20
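
A minimal sketch of those two checks, assuming "model" is an MLlib linear model (e.g., LogisticRegressionModel) that was just deserialized from HDFS and "sc" is the active SparkContext; the test vector is illustrative and must match the model's feature dimension.

    import java.io._
    import org.apache.spark.mllib.linalg.Vectors

    // 1) Does predict work on a distributed input at all? This isolates whether the
    //    failure comes from shipping the loaded model inside a task closure.
    val testVectors = sc.parallelize(Seq(Vectors.dense(0.1, 0.2, 0.3)))
    model.predict(testVectors).collect().foreach(println)

    // 2) Round-trip the loaded model "manually"; a NotSerializableException or
    //    InvalidClassException surfaces here with a clearer stack trace.
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(model); out.close()
    val reloaded = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray)).readObject()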

Re: How to save mllib model to hdfs and reload it

2014-08-13 Thread Christopher Nguyen
+1 what Sean said. And if there are too many state/argument parameters for your taste, you can always create a dedicated (serializable) class to encapsulate them. Sent while mobile. Pls excuse typos etc. On Aug 13, 2014 6:58 AM, "Sean Owen" wrote: > PS I think that solving "not serializable" exc
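
A minimal sketch of that suggestion, with illustrative names: put exactly the state the closure needs into one small Serializable class instead of letting the closure drag in a larger, non-serializable enclosing object.

    // only these two fields are shipped to the executors
    class ScoringParams(val threshold: Double, val weights: Array[Double]) extends Serializable

    val params = new ScoringParams(0.5, Array(0.2, 0.8))
    val flagged = sc.parallelize(Seq(Array(1.0, 2.0), Array(0.1, 0.1)))
      .map(x => (x, x.zip(params.weights).map { case (a, b) => a * b }.sum))
      .filter { case (_, score) => score > params.threshold }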

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-05 Thread Christopher Nguyen
Hi sparkuser2345, I'm inferring the problem statement is something like "how do I make this complete faster (given my compute resources)?" Several comments. First, Spark only allows launching parallel tasks from the driver, not from workers, which is why you're seeing the exception when you try.
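
A hedged sketch of one way to do this from the driver, since Spark accepts jobs submitted from multiple driver threads: "data" is assumed to be an RDD[LabeledPoint], and the model and iteration count are illustrative. (MLlib's MLUtils.kFold gives ready-made train/validation splits if your version includes it.)

    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val k = 5
    val folds = data.randomSplit(Array.fill(k)(1.0 / k), seed = 42L)

    // .par runs the k fits concurrently on driver threads; each fit is itself a distributed job
    val models = (0 until k).par.map { i =>
      val training = folds.zipWithIndex.filter(_._2 != i).map(_._1).reduce(_ union _)
      (i, LinearRegressionWithSGD.train(training, 100))
    }.toList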

Re: initial basic question from new user

2014-06-12 Thread Christopher Nguyen
Toby, #saveAsTextFile() and #saveAsObjectFile() are probably what you want for your use case. As for Parquet support, that's newly arrived in Spark 1.0.0 together with SparkSQL so continue to watch this space. Gerard's suggestion to look at JobServer, which you can generalize as "building a long-r
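
For reference, a minimal sketch of those two calls and the corresponding reload; paths are illustrative and "sc" is an existing SparkContext.

    val rdd = sc.parallelize(1 to 100).map(i => (i, i * i))
    rdd.saveAsTextFile("hdfs:///tmp/squares_text")    // plain text, one part-file per partition
    rdd.saveAsObjectFile("hdfs:///tmp/squares_obj")   // SequenceFile of Java-serialized objects
    val reloaded = sc.objectFile[(Int, Int)]("hdfs:///tmp/squares_obj")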

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Christopher Nguyen
Lakshmi, this is orthogonal to your question, but in case it's useful. It sounds like you're trying to determine the home location of a user, or something similar. If that's the problem statement, the data pattern may suggest a far more computationally efficient approach. For example, first map a
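
The preview cuts off, but one plausible reading of the suggested approach is a count-then-argmax aggregation rather than any per-user iteration; a sketch with an assumed (userId, locationId) record schema:

    val records = sc.parallelize(Seq(("u1", "SFO"), ("u1", "SFO"), ("u1", "NYC")))

    // visits per (user, location), then the most-visited location per user
    val counts = records.map { case (user, loc) => ((user, loc), 1L) }.reduceByKey(_ + _)
    val homes  = counts
      .map { case ((user, loc), n) => (user, (loc, n)) }
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)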

Re: Announcing Spark 1.0.0

2014-05-30 Thread Christopher Nguyen
Awesome work, Pat et al.! -- Christopher T. Nguyen Co-founder & CEO, Adatao linkedin.com/in/ctnguyen On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell wrote: > I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 > is a milestone release as the first in the

Re: Spark Memory Bounds

2014-05-28 Thread Christopher Nguyen
architectures, that focus on raw-data or transient caches that serve non-equivalent purposes when viewed from the application level. It allows for very fast access to ready-to-consume high-level data structures, as long as available RAM permits. > > > On Tue, May 27, 2014 at 6:21 PM, Christopher N

Re: Spark Memory Bounds

2014-05-27 Thread Christopher Nguyen
Keith, do you mean "bound" as in (a) strictly control to some quantifiable limit, or (b) try to minimize the amount used by each task? If "a", then that is outside the scope of Spark's memory management, which you should think of as an application-level (that is, above the JVM) mechanism. In this scop
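
For context, the knobs that do exist in the Spark 1.x era budget memory per executor JVM rather than strictly bounding it per task; a hedged sketch using that era's property names (values illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("memory-budgeted-app")
      .set("spark.executor.memory", "4g")            // JVM heap per executor
      .set("spark.storage.memoryFraction", "0.6")    // share of that heap for cached RDDs
      .set("spark.shuffle.memoryFraction", "0.2")    // share for shuffle-side aggregation
    val sc = new SparkContext(conf)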

Re: is Mesos falling out of favor?

2014-05-16 Thread Christopher Nguyen
Paco, that's a great video reference, thanks. To be fair to our friends at Yahoo, who have done a tremendous amount to help advance the cause of the BDAS stack, it's not FUD coming from them, certainly not in any organized or intentional manner. In vacuo we prefer Mesos ourselves, but also can't

Re: Opinions stratosphere

2014-05-01 Thread Christopher Nguyen
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Master's thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf According to this snapshot (c. 2013), Stratosphere is different from Spark in not having an explicit concept of a

Re: Spark and HBase

2014-04-08 Thread Christopher Nguyen
Flavio, the two are best at two orthogonal use cases: HBase on the transactional side, and Spark on the analytic side. Spark is not intended for row-based random-access updates, but it is far more flexible and efficient for dataset-scale aggregations and general computation. So yes, you can easily see
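
A common way to use the two together is to scan an HBase table into Spark for the analytic side; a minimal sketch, assuming the HBase client jars are on the classpath, "sc" is an existing SparkContext, and the table name is illustrative:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "events")
    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(rows.count())   // dataset-scale work happens in Spark, not in HBase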

Re: Spark on other parallel filesystems

2014-04-06 Thread Christopher Nguyen
hy" wrote: > Christopher > > Just to clarify - by 'load ops' do you mean RDD actions that result in > IO? > > Venkat > From: Christopher Nguyen > Reply-To: "user@spark.apache.org" > Date: Saturday, April 5, 2014 at 8:49 AM > To: "user

Re: Spark on other parallel filesystems

2014-04-05 Thread Christopher Nguyen
Avati, depending on your specific deployment config, there can be up to a 10X difference in data loading time. For example, we routinely parallel load 10+GB data files across small 8-node clusters in 10-20 seconds, which would take about 100s if bottlenecked over a 1GigE network. That's about the m

Re: Cross validation is missing in machine learning examples

2014-03-30 Thread Christopher Nguyen
Aureliano, you're correct that this is not "validation error", which is computed as the residuals on out-of-training-sample data, and helps minimize overfit variance. However, in this example, the errors are correctly referred to as "training error", which is what you might compute on a per-iterat
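
To make the distinction concrete, a hedged sketch that reports both numbers; "data" is assumed to be an RDD[LabeledPoint], the model choice and split ratio are illustrative, and randomSplit requires Spark 1.0 or later.

    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    val Array(train, holdout) = data.randomSplit(Array(0.8, 0.2), seed = 11L)
    val model = LinearRegressionWithSGD.train(train, 100)

    def mse(d: RDD[LabeledPoint]): Double = {
      val sqErr = d.map(p => math.pow(p.label - model.predict(p.features), 2))
      sqErr.reduce(_ + _) / sqErr.count()
    }

    val trainingError   = mse(train)     // what the MLlib example prints
    val validationError = mse(holdout)   // out-of-sample error, what cross-validation estimates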

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
something we want to do, since > that would require another complex system we'll have to implement. Is DDF > going to be an alternative to RDD? > > Thanks again! > > > > On Fri, Mar 28, 2014 at 7:02 PM, Christopher Nguyen wrote: > >> Sung Hwan, strictly sp

Re: Mutable tagging RDD rows ?

2014-03-28 Thread Christopher Nguyen
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to get what you want is to transform to another RDD. But you might look at MutablePair ( https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
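
A minimal sketch of the transform-to-another-RDD route, with an illustrative row type: the tag lives in a new RDD of pairs, and re-tagging is just another map that yields yet another RDD.

    case class Record(id: Long, value: Double)
    val rows = sc.parallelize(Seq(Record(1, 0.9), Record(2, 0.1)))

    val tagged = rows.map(r => (r, if (r.value > 0.5) "keep" else "drop"))

    // "updating" tags produces a new immutable RDD rather than mutating rows in place
    val retagged = tagged.map { case (r, tag) => (r, if (r.id % 2 == 0) "audit" else tag) }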

Re: Running a task once on each executor

2014-03-27 Thread Christopher Nguyen
Deenar, yes, you may indeed be overthinking it a bit about how Spark executes maps/filters etc. I'll focus on the high-order bits so it's clear. Let's assume you're doing this in Java. Then you'd pass some *MyMapper* instance to *JavaRDD#map(myMapper)*. So you'd have a class *MyMapper extends Fu

Re: Running a task once on each executor

2014-03-27 Thread Christopher Nguyen
Deenar, dmpour is correct in that there's a many-to-many mapping between executors and partitions (an executor can be assigned multiple partitions, and a given partition can in principle move to a different executor). I'm not sure why you seem to require this problem statement to be solved with RDDs.

Re: Announcing Spark SQL

2014-03-26 Thread Christopher Nguyen
+1 Michael, Reynold et al. This is key to some of the things we're doing. -- Christopher T. Nguyen Co-founder & CEO, Adatao linkedin.com/in/ctnguyen On Wed, Mar 26, 2014 at 2:58 PM, Michael Armbrust wrote: > Hey Everyone, > > This already went out to the dev list, but I wan

Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, the singleton pattern I'm suggesting would look something like this: public class TaskNonce { private transient boolean mIsAlreadyDone; private static transient TaskNonce mSingleton = new TaskNonce(); private transient Object mSyncObject = new Object(); public TaskNonce getSing
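
The archive preview cuts the Java snippet off mid-method; for the idea itself, here is a hedged Scala rendering of the same run-once-per-JVM nonce (method names are illustrative):

    object TaskNonce {
      @volatile private var alreadyDone = false
      // synchronized so concurrent tasks in the same executor JVM run the block at most once
      def doThisOnce(block: => Unit): Unit = synchronized {
        if (!alreadyDone) { block; alreadyDone = true }
      }
    }

    // e.g., from inside a map/mapPartitions closure, once per executor JVM:
    // TaskNonce.doThisOnce { initExpensiveResource() }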

Re: Running a task once on each executor

2014-03-25 Thread Christopher Nguyen
Deenar, when you say "just once", have you defined what "across multiple" means here (e.g., across multiple threads in the same JVM on the same machine)? In principle you can have multiple executors on the same machine. In any case, assuming it's the same JVM, have you considered using a singleton that maintains

Re: Spark enables us to process Big Data on an ARM cluster !!

2014-03-19 Thread Christopher Nguyen
Chanwit, that is awesome! Improvements in shuffle operations should help improve life even more for you. Great to see a data point on ARM. Sent while mobile. Pls excuse typos etc. On Mar 18, 2014 7:36 PM, "Chanwit Kaewkasi" wrote: > Hi all, > > We are a small team doing a research on low-power

Re: best practices for pushing an RDD into a database

2014-03-13 Thread Christopher Nguyen
Nicholas, > (Can we make that a thing? Let's make that a thing. :) Yes, we're soon releasing something called Distributed DataFrame (DDF) to the community that will make this (among other useful idioms) "a (straightforward) thing" for Spark. Sent while mobile. Pls excuse typos etc. On Mar 13, 20

Re: major Spark performance problem

2014-03-06 Thread Christopher Nguyen
Dana, when you run multiple "applications" under Spark, and each application takes up the entire cluster's resources, it is expected that one will block the other completely, which is why you see the wall times add up sequentially. In addition there is some overhead associated with starti
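
One hedged mitigation, if the cluster runs Spark standalone, is to cap what each application claims so that two applications can overlap instead of serializing; the property names are from that era and the values are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("app-a")
      .set("spark.cores.max", "8")          // don't claim every core in the cluster
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)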