Re: Minimum cost flow problem solving in Spark

2017-09-13 Thread Michael Malak
You might be interested in "Maximum Flow implementation on Spark GraphX" done by a Colorado School of Mines grad student a couple of years ago. http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx From: Swapnil Shinde To: u...@spark.ap

Re: Where is DataFrame.scala in 2.0?

2016-06-03 Thread Michael Malak
It's been reduced to a single line of code. http://technicaltidbit.blogspot.com/2016/03/dataframedataset-swap-places-in-spark-20.html From: Gerhard Fiedler To: "dev@spark.apache.org" Sent: Friday, June 3, 2016 9:01 AM Subject: Where is DataFrame.scala in 2.0? When I look at the

Re: [discuss] using deep learning to improve Spark

2016-04-01 Thread Michael Malak
I see you've been burning the midnight oil. From: Reynold Xin To: "dev@spark.apache.org" Sent: Friday, April 1, 2016 1:15 AM Subject: [discuss] using deep learning to improve Spark Hi all, Hope you all enjoyed the Tesla 3 unveiling earlier tonight. I'd like to bring your attention

Re: [discuss] DataFrame vs Dataset in Spark 2.0

2016-02-25 Thread Michael Malak
Would it make sense (in terms of feasibility, code organization, and politically) to have a JavaDataFrame, as a way to isolate the 1000+ extra lines to a Java compatibility layer/class? From: Reynold Xin To: "dev@spark.apache.org" Sent: Thursday, February 25, 2016 4:23 PM Subject: [d

Wrong initial bias in GraphX SVDPlusPlus?

2015-04-03 Thread Michael Malak
I believe that in the initialization portion of GraphX SVDPlusPlus, the initialization of biases is incorrect. Specifically, in line https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96 instead of (vd._1, vd._2, msg.get._2 / msg.ge

textFile() ordering and header rows

2015-02-22 Thread Michael Malak
Since RDDs are generally unordered, aren't things like textFile().first() not guaranteed to return the first row (such as looking for a header row)? If so, doesn't that make the example in http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?

Word2Vec IndexedRDD

2015-02-01 Thread Michael Malak
1. Is IndexedRDD planned for 1.3? https://issues.apache.org/jira/browse/SPARK-2365 2. Once IndexedRDD is in, is it planned to convert Word2VecModel to it from its current Map[String,Array[Float]]? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Wo

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Michael Malak
ose not immersed in data science or AI and thus may have narrower appeal. - Original Message ----- From: Evan R. Sparks To: Matei Zaharia Cc: Koert Kuipers ; Michael Malak ; Patrick Wendell ; Reynold Xin ; "dev@spark.apache.org" Sent: Tuesday, January 27, 2015 9:55 AM Subject: Re: renaming

Re: renaming SchemaRDD -> DataFrame

2015-01-26 Thread Michael Malak
And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup video on YouTube contained a wealth of background information on this idea (mostly from Patrick and Reynold :-). https://www.youtube.com/watch?v=YWppYPWznSQ From: Patrick Wendell To:

Re: GraphX ShortestPaths backwards?

2015-01-20 Thread Michael Malak
I created https://issues.apache.org/jira/browse/SPARK-5343 for this. - Original Message - From: Michael Malak To: "dev@spark.apache.org" Cc: Sent: Monday, January 19, 2015 5:09 PM Subject: GraphX ShortestPaths backwards? GraphX ShortestPaths seems to be following edges

GraphX ShortestPaths backwards?

2015-01-19 Thread Michael Malak
GraphX ShortestPaths seems to be following edges backwards instead of forwards: import org.apache.spark.graphx._ val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,"" lib.ShortestPaths.run(g,Array(3)).vertices.collect res1: Array[(org.apac
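For reference, here is a minimal sketch of what "forwards" would mean on the three-vertex chain in the post (edges 1 -> 2 and 2 -> 3, landmark 3). It uses plain Scala collections rather than GraphX, and the names ForwardHops and hopsTo are hypothetical, introduced only for illustration. It makes no claim about what GraphX itself returns.

```scala
object ForwardHops {
  // Hop count from every vertex to `landmark`, following each edge from src to dst.
  // Vertices that cannot reach the landmark are absent from the result.
  def hopsTo(edges: Seq[(Long, Long)], landmark: Long): Map[Long, Int] = {
    // For each dst, the list of srcs that point at it.
    val preds: Map[Long, Seq[Long]] =
      edges.groupBy(_._2).map { case (dst, es) => dst -> es.map(_._1) }

    // Breadth-first search outward from the landmark over predecessor links.
    @annotation.tailrec
    def loop(frontier: Set[Long], dist: Map[Long, Int], d: Int): Map[Long, Int] =
      if (frontier.isEmpty) dist
      else {
        val next = frontier
          .flatMap(v => preds.getOrElse(v, Seq.empty[Long]))
          .filterNot(dist.contains)
        loop(next, dist ++ next.map(v => v -> (d + 1)), d + 1)
      }

    loop(Set(landmark), Map(landmark -> 0), 0)
  }

  def main(args: Array[String]): Unit = {
    // Same topology as the post: 1 -> 2 -> 3, landmark 3.
    println(hopsTo(Seq((1L, 2L), (2L, 3L)), 3L))
  }
}
```

Under this forward reading, vertex 1 is two hops from landmark 3, vertex 2 is one hop, and vertex 3 is zero hops.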

Re: GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
But wouldn't the gain be greater under something similar to EdgePartition1D (but perhaps better load-balanced based on number of edges for each vertex) and an algorithm that primarily follows edges in the forward direction? From: Ankur Dave To: Michael Malak Cc: "dev@spark.

GraphX vertex partition/location strategy

2015-01-19 Thread Michael Malak
Does GraphX make an effort to co-locate vertices onto the same workers as the majority (or even some) of its edges?

GraphX doc: triangleCount() requirement overstatement?

2015-01-18 Thread Michael Malak
According to: https://spark.apache.org/docs/1.2.0/graphx-programming-guide.html#triangle-counting "Note that TriangleCount requires the edges to be in canonical orientation (srcId < dstId)" But isn't this overstating the requirement? Isn't the requirement really that IF there are duplicate ed

Re: GraphX rmatGraph hangs

2015-01-04 Thread Michael Malak
Thank you. I created https://issues.apache.org/jira/browse/SPARK-5064 - Original Message - From: xhudik To: dev@spark.apache.org Cc: Sent: Saturday, January 3, 2015 2:04 PM Subject: Re: GraphX rmatGraph hangs Hi Michael, yes, I can confirm the behavior. It get stuck (loop?) and eat a

GraphX rmatGraph hangs

2015-01-03 Thread Michael Malak
The following single line just hangs, when executed in either Spark Shell or standalone: org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8) It just outputs "0 edges" and then locks up. The only other information I've found via Google is: http://mail-archives.apache.org/mod_mbox/sp

15 new MLlib algorithms

2014-07-09 Thread Michael Malak
At Spark Summit, Patrick Wendell indicated the number of MLlib algorithms would "roughly double" in 1.1 from the current approx. 15. http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf What are the planned additional algorithms? In Jira, I only see two when fil

GraphX triplets on 5-node graph

2014-05-28 Thread Michael Malak
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I missing something fundamental? val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5"))) val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Michael Malak
While developers may appreciate "1.0 == API stability," I'm not sure that will be the understanding of the VP who gives the green light to a Spark-based development effort. I fear a bug that silently produces erroneous results will be perceived like the FDIV bug, but in this case without the mo

map() + lookup() exception

2014-05-15 Thread Michael Malak
When using map() and lookup() in conjunction, I get an exception (each independently works fine). I'm using Spark 0.9.0/Scala 2.10.3 val a = sc.parallelize(Array(11)) val m = sc.parallelize(Array((11,21))) a.map(m.lookup(_)(0)).collect 14/05/14 15:03:35 ERROR Executor: Exception in task ID 23 sc

Class-based key in groupByKey?

2014-05-13 Thread Michael Malak
Is it permissible to use a custom class (as opposed to e.g. the built-in String or Int) for the key in groupByKey? It doesn't seem to be working for me on Spark 0.9.0/Scala 2.10.3: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ class C(val s:String) extends Serializ
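As general background, any hash-based grouping needs the key type to define equals and hashCode by value. The sketch below shows that in plain Scala collections, not Spark; RawKey, ValueKey and CustomKeyGrouping are hypothetical names, and whether missing equality was the cause of the behavior reported above cannot be told from the truncated snippet.

```scala
// Hypothetical key types for illustration; neither is the class from the post.
// A plain class inherits reference equality from AnyRef.
class RawKey(val s: String) extends Serializable

// A case class gets value-based equals and hashCode generated by the compiler.
case class ValueKey(s: String)

object CustomKeyGrouping {
  // Number of distinct groups produced for the keys "a", "b", "a".
  def groupCount[K](mk: String => K): Int =
    Seq("a", "b", "a").map(s => (mk(s), 1)).groupBy(_._1).size

  def main(args: Array[String]): Unit = {
    println(groupCount(s => new RawKey(s))) // 3: every instance is its own key
    println(groupCount(s => ValueKey(s)))   // 2: the two "a" keys are merged
  }
}
```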

Re: Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
12))) r: org.apache.spark.rdd.RDD[(C, Int)] = ParallelCollectionRDD[3] at parallelize at :14 scala> r.lookup(new C("a")) :17: error: type mismatch;  found   : C  required: C   r.lookup(new C("a"))    ^ On Tuesday, May 13, 2014 3:05 PM, Ana

Serializable different behavior Spark Shell vs. Scala Shell

2014-05-13 Thread Michael Malak
Reposting here on dev since I didn't see a response on user: I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In the Spark Shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute with isInstanceOf[]. I am using Spark 0.9.0
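For readers unfamiliar with the two forms being compared, here is a minimal sketch of each, written as compiled plain Scala (outside any shell). MatchEq, InstanceOfEq and EqualsForms are hypothetical names; the class from the original post is not shown in full above. Compiled this way, both forms give the same answer; the sketch does not reproduce the Spark Shell behavior described in the post.

```scala
// equals() written with a type pattern in match.
class MatchEq(val s: String) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: MatchEq => s == that.s
    case _             => false
  }
  override def hashCode: Int = s.hashCode
}

// The same check written with isInstanceOf / asInstanceOf.
class InstanceOfEq(val s: String) extends Serializable {
  override def equals(other: Any): Boolean =
    other.isInstanceOf[InstanceOfEq] && other.asInstanceOf[InstanceOfEq].s == s
  override def hashCode: Int = s.hashCode
}

object EqualsForms {
  def matchForm: Boolean = new MatchEq("a") == new MatchEq("a")
  def instanceOfForm: Boolean = new InstanceOfEq("a") == new InstanceOfEq("a")

  def main(args: Array[String]): Unit = {
    println(matchForm)      // true
    println(instanceOfForm) // true
  }
}
```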