Pretrained Word2Vec models

2016-12-05 Thread Lee Becker
Hi all, Is there a way for Spark to load Word2Vec models trained using gensim or the original C implementation of Word2Vec? Specifically I'd like to play with the Google News model
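One possible starting point, sketched under assumptions rather than taken from the thread: the binary files written by the original C tool (and by gensim when saving in that format) begin with a text header `<vocab size> <dimensions>\n`, followed by one record per word consisting of the word, a space, and the vector as raw 32-bit floats. The sketch below parses that layout into a `Map[String, Array[Float]]` in plain Scala. `Word2VecBinary`, `decode`, and `encode` are hypothetical names, not Spark or gensim APIs, and little-endian floats are assumed (the usual case for files produced on x86 machines).

```scala
import java.io.{ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper for illustration; not part of Spark or gensim.
object Word2VecBinary {
  private val Space: Int = ' '.toInt
  private val Newline: Int = '\n'.toInt

  // Read bytes up to (not including) the delimiter, skipping leading newlines,
  // since the C tool writes a '\n' after each vector.
  private def readToken(in: DataInputStream, delimiter: Int): String = {
    val buf = new ByteArrayOutputStream()
    var b = in.read()
    while (b == Newline) b = in.read()
    while (b != -1 && b != delimiter) {
      buf.write(b)
      b = in.read()
    }
    new String(buf.toByteArray, UTF_8)
  }

  // Parse the whole stream into word -> vector. Assumes little-endian floats.
  def decode(in: InputStream): Map[String, Array[Float]] = {
    val data = new DataInputStream(in)
    val vocabSize = readToken(data, Space).toInt
    val dim = readToken(data, Newline).toInt
    val raw = new Array[Byte](dim * 4)
    (0 until vocabSize).map { _ =>
      val word = readToken(data, Space)
      data.readFully(raw)
      val floats = new Array[Float](dim)
      ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(floats)
      word -> floats
    }.toMap
  }

  // Write the same layout; only here so the reader can be tried without a real model file.
  def encode(vectors: Seq[(String, Array[Float])]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val dim = vectors.headOption.map(_._2.length).getOrElse(0)
    out.write(s"${vectors.size} $dim\n".getBytes(UTF_8))
    vectors.foreach { case (word, vec) =>
      out.write(s"$word ".getBytes(UTF_8))
      val buf = ByteBuffer.allocate(dim * 4).order(ByteOrder.LITTLE_ENDIAN)
      vec.foreach(f => buf.putFloat(f))
      out.write(buf.array())
      out.write(Newline)
    }
    out.toByteArray
  }
}
```

For a real file, `Word2VecBinary.decode(new java.io.BufferedInputStream(new java.io.FileInputStream(path)))` would be the entry point, where `path` is whatever location the model was downloaded to. Whether the resulting map can be handed to a Spark `Word2VecModel` directly depends on the Spark version and should be checked against its API docs; a model the size of Google News also needs a driver with enough memory to hold the full map.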

Re: Returning DataFrame as Scala method return type

2016-09-08 Thread Lee Becker
On Thu, Sep 8, 2016 at 11:35 AM, Ashish Tadose wrote:
> I wish to organize these dataframe operations by grouping them into Scala
> object methods.
> Something like below
>
>> object Driver {
>>   def main(args: Array[String]) {
>>     val df = Operations.process(sparkContext)
>>   }
>> }

collect_set without nulls (1.6 vs 2.0)

2016-09-07 Thread Lee Becker
Hello everyone, Consider this toy example:
case class Foo(x: String)
val df = sparkSession.createDataFrame(Array(Foo(null), Foo("a"), Foo("b")))
df.select(collect_set($"x")).show
In Spark 2.0.0 I get the following results:
+--------------+
|collect_set(x)|
+--------------+
| [null, b

Re: countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker wrote:
> val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
> val grouped = df.groupBy($"x").agg(countDistinct($"

countDistinct, partial aggregates and Spark 2.0

2016-08-12 Thread Lee Becker
Hi everyone, I've started experimenting with my codebase to see how much work it will take to port it from 1.6.1 to 2.0.0. While regression-testing some of my DataFrame transforms, I've discovered I can no longer pair a countDistinct with a collect_set in the same aggregation. Consider: val df = sc.paralle

Re: Dataset aggregateByKey equivalent

2016-04-25 Thread Lee Becker
On Sat, Apr 23, 2016 at 8:56 AM, Michael Armbrust wrote:
> Have you looked at aggregators?
>
> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
Thanks for the pointer to aggregators. I wasn't yet aware of them. However, I still get similar error

Dataset aggregateByKey equivalent

2016-04-22 Thread Lee Becker
Is there a way to do aggregateByKey on Datasets the way one can on an RDD? Consider the following RDD code to build a set of KeyVals into a DataFrame containing a column with the KeyVals' keys and a column containing lists of KeyVals. The end goal is to join it with collections which will b

[graphx] PageRank with Edge weights

2014-06-07 Thread Lee Becker
Hello, I have been playing around with GraphX and its PageRank capabilities. Something I'm not seeing in the documentation is how to initialize PageRank using edge weights. Is this even possible, or would I need to reimplement the PageRank algorithm so that it can use an Edge property as part of