Re: RFC: Remote "HBaseTest" from examples?

2016-08-18 Thread Ignacio Zendejas
I'm very late to this party, and I get hbase-spark... but what's the recommendation for PySpark + HBase? I realize this isn't necessarily a concern of the Spark project, but it'd be nice to at least document it here with a very short and sweet response, because I haven't found anything useful in the wild

Re: createDataframe from s3 results in error

2015-06-02 Thread Ignacio Zendejas
2, 2015 at 3:13 PM, Ignacio Zendejas wrote: > I've run into an error when trying to create a dataframe. Here's the code: > > -- > from pyspark import StorageLevel > from pyspark.sql import Row > > table = 'blah' > ssc = HiveContext(sc) > > data = sc

createDataframe from s3 results in error

2015-06-02 Thread Ignacio Zendejas
I've run into an error when trying to create a dataframe. Here's the code:
--
from pyspark import StorageLevel
from pyspark.sql import Row
table = 'blah'
ssc = HiveContext(sc)
data = sc.textFile('s3://bucket/some.tsv')
def deserialize(s):
    p = s.strip().split('\t')
    p[-1] = float(p[-1])
    ret

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-14 Thread Ignacio Zendejas
get a higher-level primitive (e.g. stochastic > gradient descent) that you can plug some functions into, without worrying > about the communication. > > Matei > > On August 13, 2014 at 11:10:02 AM, Ignacio Zendejas ( > ignacio.zendejas...@gmail.com) wrote: > > Has

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Ignacio Zendejas
uages for >> ML-oriented programming", and that's why they went ahead with Python. >> However, as I understand, very few people actually implement algorithms in >> Python directly because of the sub-optimal performance. Most people >> implement algorithms in other lan

A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Ignacio Zendejas
Has anyone had a chance to look at this paper (with title in subject)? http://www.cs.rice.edu/~lp6/comparison.pdf Interesting that they chose to use Python alone. Do we know how much faster Scala is vs. Python in general, if at all? As with any and all benchmarks, I'm sure there are caveats, but

Re: feature selection and sparse vector support

2014-04-11 Thread Ignacio Zendejas
Here's the JIRA: https://issues.apache.org/jira/browse/SPARK-1473 Future discussions should take place in its comments section. Thanks. On Fri, Apr 11, 2014 at 11:26 AM, Ignacio Zendejas < ignacio.zendejas...@gmail.com> wrote: > Thanks for the response, Xiangrui. > > And

Re: feature selection and sparse vector support

2014-04-11 Thread Ignacio Zendejas
on gain > > computation, so it is easy to track the progress. > > > > The sparse vector support for NaiveBayes is already implemented in > > branch-1.0 and master. You only need to provide an RDD of sparse > > vectors (created from Vectors.sparse). > > > > MLUti

Re: minor optimizations to get my feet wet

2014-04-10 Thread Ignacio Zendejas
> > > The tail change looks good to me. > > For foldLeft, I agree with you that the old way is more readable (although > less idiomatic scala). > > > > > On Thu, Apr 10, 2014 at 1:48 PM, Ignacio Zendejas < > ignacio.zendejas...@gmail.com> wrote: >

feature selection and sparse vector support

2014-04-10 Thread Ignacio Zendejas
Hi, again - As part of the next step, I'd like to make a more substantive contribution and propose some initial work on feature selection, primarily as it relates to text classification. Specifically, I'd like to contribute very straightforward code to perform information gain feature evaluation.
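For readers unfamiliar with the term, information gain scores a feature by how much knowing it reduces the entropy of the class label. The following is a minimal single-machine sketch in plain Python, not the code proposed in this thread and not an MLlib API. The function names (`entropy`, `information_gain`), the bag-of-words representation as Python sets, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """IG(class; term) = H(class) - H(class | term present or absent).

    docs   -- list of sets of tokens (hypothetical representation)
    labels -- list of class labels, one per document
    term   -- the token being scored
    """
    n = len(labels)
    if n == 0:
        return 0.0
    # Split the labels by whether the document contains the term.
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    # Weighted average entropy of the two partitions.
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Toy corpus: "spark" separates the classes perfectly, "fast" not at all.
docs = [{"spark", "fast"}, {"spark", "scala"}, {"cat", "dog"}, {"dog", "fast"}]
labels = ["tech", "tech", "pets", "pets"]

ig_spark = information_gain(docs, labels, "spark")
ig_fast = information_gain(docs, labels, "fast")
```

In this toy corpus the term "spark" yields the maximum gain of one bit, while "fast" yields zero because it appears equally in both classes. A distributed version would need to compute the same per-term counts across an RDD, which this sketch does not attempt.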

minor optimizations to get my feet wet

2014-04-10 Thread Ignacio Zendejas
Hi, all - First off, I want to say that I love Spark and am very excited about MLBase. I'd love to contribute now that I have some time, but before I do that I'd like to familiarize myself with the process. In looking for a few projects and settling on one which I'll discuss in another thread, I