Note that both HashingTF and CountVectorizer are usually used for creating
TF-IDF normalized vectors. The definition of term frequency in TF-IDF
(https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) is actually the
"number of times the term occurs in the document".
So it's perhaps a bit of a
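For context, a small sketch of the usual two-step pipeline with the spark.ml
API; the input DataFrame docs with a string column "text" is hypothetical:

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// docs: hypothetical DataFrame with a string column "text"
val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(docs)

// HashingTF emits raw term counts (the "number of times the term occurs in
// the document"), which the IDF stage then rescales into TF-IDF weights.
val tf = new HashingTF().setInputCol("words").setOutputCol("rawTF").transform(words)
val idfModel = new IDF().setInputCol("rawTF").setOutputCol("tfidf").fit(tf)
val tfidf = idfModel.transform(tf)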
Based on your code, here is a simpler test case on Spark 2.0:
case class my (x: Int)
val rdd = sc.parallelize(0.until(1), 1000).map { x => my(x) }
val df1 = spark.createDataFrame(rdd)
val df2 = df1.limit(1)
df1.map { r => r.getAs[Int](0) }.first
df2.map { r => r.getAs[Int](0) }.first // Much slower
Thank you for your prompt response and the great examples, Sun Rui, but I am
still confused about one thing. Do you see any particular reason not
to merge subsequent limits? The following case
(limit n (map f (limit m ds)))
could be optimized to:
(map f (limit n (limit m ds)))
and further to
(map f (limit (min n m) ds))
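To make the intent concrete, here is a toy sketch of that merge on a made-up
plan type; this is deliberately not the Catalyst API, and the node names are
invented:

// A miniature made-up logical plan, just to illustrate the proposed merge.
sealed trait Plan
case class Source(name: String) extends Plan
case class MapOp(f: String, child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan

def mergeLimits(plan: Plan): Plan = plan match {
  // limit n (limit m ds)  ==>  limit (min n m) ds
  case Limit(n, Limit(m, child)) => mergeLimits(Limit(math.min(n, m), child))
  case Limit(n, child)           => Limit(n, mergeLimits(child))
  case MapOp(f, child)           => MapOp(f, mergeLimits(child))
  case s: Source                 => s
}

// mergeLimits(MapOp("f", Limit(3, Limit(5, Source("ds")))))
//   == MapOp("f", Limit(3, Source("ds")))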
Widening to dev@spark
On Mon, Aug 1, 2016 at 4:21 PM, Noorul Islam K M wrote:
>
> Hi all,
>
> I was trying to test the --supervise flag of spark-submit.
>
> The documentation [1] says that the flag helps in restarting your
> application automatically if it exits with a non-zero exit code.
>
> I am lo
Spark does optimise subsequent limits, for example:
scala> df1.limit(3).limit(1).explain
== Physical Plan ==
CollectLimit 1
+- *SerializeFromObject [assertnotnull(input[0, $line14.$read$$iw$$iw$my,
true], top level non-flat input object).x AS x#2]
+- Scan ExternalRDDScan[obj#1]
However, limit
Dear Spark developers,
Could you suggest how to perform pattern matching on the type of the graph edge
in the following scenario? I need to perform some math by means of
aggregateMessages on the graph edges if the edges are of type Double. Here is the code:
def my[VD: ClassTag, ED: ClassTag] (graph: Graph[V
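Not from the original message, but a minimal sketch of one way to branch on the
edge attribute type using the ClassTag that is already in scope; the function
name and the message logic are hypothetical:

import scala.reflect.{classTag, ClassTag}
import org.apache.spark.graphx._

// Hypothetical helper: run the Double-specific math only when ED is Double.
def sumDoubleEdges[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Option[VertexRDD[Double]] =
  if (classTag[ED].runtimeClass == classOf[Double]) {
    // The runtime check above makes this cast safe.
    val g = graph.asInstanceOf[Graph[VD, Double]]
    Some(g.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.attr), _ + _))
  } else {
    None // edge attribute is not Double, so skip the math
  }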
It seems like the += operator is missing from the new accumulator API,
although the docs still make reference to it. Anyone know if it was
intentionally not put in? I'm happy to do a PR for it or update the docs
to just use the add() method, just want to check if there was some reason
first.
Bry
I believe it was intentional with the idea that it would be more unified
between the Java and Scala APIs. If you're talking about the javadoc mention in
https://github.com/apache/spark/pull/14466/files - I believe the += is
meant to refer to what the internal implementation of the add function can
be for
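For what it's worth, a tiny sketch of the add()-based usage with the 2.0
accumulator API; the input path and the error check are made up for
illustration:

// In spark-shell, with sc already defined: count lines that look like errors.
val badRecords = sc.longAccumulator("badRecords")   // AccumulatorV2-based LongAccumulator

sc.textFile("hdfs:///path/to/input")                // hypothetical path
  .foreach(line => if (line.contains("ERROR")) badRecords.add(1L))  // add() instead of +=

println(s"bad records: ${badRecords.value}")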
Hi guys,
I wonder if anyone is working on SQL-based authorization already or not.
This is something we need badly right now, and we tried to embed a
Hive frontend in front of SparkSQL to achieve this, but it's not quite an
elegant solution. If SparkSQL has a way to do it or anyone is already
work
There was SPARK-12008, which was closed.
Not sure if there is an active JIRA in this regard.
On Tue, Aug 2, 2016 at 6:40 PM, 马晓宇 wrote:
> Hi guys,
>
> I wonder if anyone is working on SQL-based authorization already or not.
>
> This is something we need badly right now, and we tried to embed a
> H
Hi All,
I am trying to run a Spark job using YARN, and I specify the --executor-cores
value as 20.
But when I go to check the "Nodes of the cluster" page at
http://hostname:8088/cluster/nodes, I see 4 containers getting created
on each of the nodes in the cluster.
But I can only see 1 vcore getting assigned