I would also add that, from a data-locality standpoint, mapPartitions()
provides for node-local computation that plain old map-reduce does not.
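As a concrete illustration, here is a minimal sketch (the date format and the RDD of strings are my own illustrative choices, not from the original thread) of the kind of per-partition work mapPartitions() enables: a relatively expensive, non-thread-safe object is built once per partition, on the node holding that partition, rather than once per record.

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.rdd.RDD

// Build the costly, non-thread-safe formatter once per partition, node-locally,
// and reuse it for every record in that partition.
def parseDates(lines: RDD[String]): RDD[Date] =
  lines.mapPartitions { iter =>
    val fmt = new SimpleDateFormat("yyyy-MM-dd")
    iter.map(line => fmt.parse(line))
  }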
Original message
From: Ashic Mahtab
Date: 06/28/2015 10:5
Will Spark 2.0 Structured Streaming obviate some of the Druid/Spark use cases?
From: Raymond Honderdors
To: "yuzhih...@gmail.com"
Cc: "user@spark.apache.org"
Sent: Wednesday, March 23, 2016 8:43 AM
Subject: Re: Spark with Druid
I saw these, but I fail to understand how to direct th
In terms of publication date, a paper on Nephele was published in 2009, prior
to the 2010 USENIX paper on Spark. Nephele is the execution engine of
Stratosphere, which became Flink.
From: Mark Hamstra
To: Mich Talebzadeh
Cc: Corey Nolet ; "user @spark"
Sent: Sunday, April 17, 2016 3:
There have been commercial CEP solutions for decades, including from my
employer.
From: Mich Talebzadeh
To: Mark Hamstra
Cc: Corey Nolet ; "user @spark"
Sent: Sunday, April 17, 2016 3:48 PM
Subject: Re: Apache Flink
The problem is that the strength and wider acceptance of a typic
As with all history, "what if"s are not scientifically testable hypotheses, but
my speculation is that the difference comes down to the energy (VCs, startups,
big Internet companies, universities) within Silicon Valley as contrasted with
Germany.
From: Mich Talebzadeh
To: Michael Malak ; "user @spark"
http://go.databricks.com/apache-spark-2.0-presented-by-databricks-co-founder-reynold-xin
From: Sourav Mazumder
To: user
Sent: Wednesday, April 20, 2016 11:07 AM
Subject: Spark 2.0 forthcoming features
Hi All,
Is there somewhere we can get an idea of the upcoming features in Spark 2
At first glance, it looks like the only streaming data sources available out of
the box from the github master branch are
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
and
https://github.com/apache/spark/blob/
Yes, it is possible to use GraphX from Java but it requires 10x the amount of
code and involves using obscure typing and pre-defined lambda prototype
facilities. I give an example of it in my book, the source code for which can
be downloaded for free from
https://www.manning.com/books/spark-gra
Yes. And a paper that describes using grids (actually varying grids) is
http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
In the Spark GraphX In Action book that Robin East and I are writing, we
implement a drastically simplified version of this in chapter
In chapter 10 of Spark GraphX In Action, we describe how to use Zeppelin with
d3.js to render graphs using d3's force-directed rendering algorithm. The
source code can be downloaded for free from
https://www.manning.com/books/spark-graphx-in-action
From: agc studio
To: user@spark.apache.
Chapter 6 of my book implements Dijkstra's Algorithm. The source code is
available to download for free.
https://www.manning.com/books/spark-graphx-in-action
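For anyone who wants something runnable right away, here is a minimal single-source shortest-paths sketch in the spirit of the standard GraphX Pregel example (the tiny edge list and source vertex are illustrative, a REPL-style sc is assumed, and this is not the book's chapter 6 code):

import org.apache.spark.graphx._

// Tiny weighted digraph: 1->2 (4.0), 1->3 (1.0), 3->2 (2.0), 2->4 (5.0)
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 4.0), Edge(1L, 3L, 1.0), Edge(3L, 2L, 2.0), Edge(2L, 4L, 5.0)))
val graph = Graph.fromEdges(edges, 0.0)

val sourceId: VertexId = 1L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (_, dist, newDist) => math.min(dist, newDist),                // vertex program: keep the shorter distance
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) // relax the edge
    else
      Iterator.empty,
  (a, b) => math.min(a, b))                                     // merge competing messages

sssp.vertices.collect.foreach(println)  // (1,0.0), (3,1.0), (2,3.0), (4,8.0)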
From: Brian Wilson
To: user@spark.apache.org
Sent: Monday, October 24, 2016 7:11 AM
Subject: Shortest path with directed and
You might be interested in "Maximum Flow implementation on Spark GraphX" done
by a Colorado School of Mines grad student a couple of years ago.
http://datascienceassn.org/2016-01-27-maximum-flow-implementation-spark-graphx
From: Swapnil Shinde
To: user@spark.ap
But isn't foldLeft() overkill for the originally stated use case of max diff of
adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative
accumulation as opposed to an embarrassingly parallel operation such as this
one?
This use case reminds me of FIR filtering in DSP. It se
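To make the point concrete, here is a minimal sketch of one embarrassingly parallel way to compute the max difference of adjacent pairs (the helper name and the zipWithIndex/join approach are my own illustration, not the original poster's code): pair each element with its predecessor by index, then take a parallel max.

import org.apache.spark.rdd.RDD

// Assumes at least two elements; "diff" here is current minus previous.
def maxAdjacentDiff(values: RDD[Double]): Double = {
  val indexed = values.zipWithIndex().map { case (v, i) => (i, v) } // (index, value)
  val shifted = indexed.map { case (i, v) => (i + 1, v) }           // value re-keyed to its successor's index
  indexed.join(shifted)                                             // (index, (current, previous))
    .map { case (_, (cur, prev)) => cur - prev }
    .reduce((a, b) => math.max(a, b))
}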
Can my new book, Spark GraphX In Action, which is currently in MEAP
http://manning.com/malak/, be added to
https://spark.apache.org/documentation.html and, if appropriate, to
https://spark.apache.org/graphx/ ?
Michael Malak
You could have your receiver send a "magic value" when it is done. I discuss
this Spark Streaming pattern in my presentation "Spark Gotchas and
Anti-Patterns". In the PDF version, it's slides
34-36.
http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language
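For reference, here is a minimal sketch of that pattern (the sentinel string, class name, and finite in-memory source are illustrative; a real receiver and its shutdown logic will differ):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A receiver over a finite source that emits a sentinel record when exhausted.
class FiniteSourceReceiver(records: Seq[String])
    extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    new Thread("finite-source-receiver") {
      override def run(): Unit = {
        records.foreach(r => store(r))  // push the real records
        store("__DONE__")               // the "magic value": no more data is coming
      }
    }.start()
  }

  override def onStop(): Unit = ()      // nothing to clean up in this sketch
}

// Downstream, a foreachRDD can check each batch for "__DONE__" and set a flag
// that the driver polls to decide when to stop the StreamingContext.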
http://www.datascienceassn.org/content/making-sense-making-sense-performance-data-analytics-frameworks
From: "bit1...@163.com"
To: user
Sent: Monday, April 27, 2015 8:33 PM
Subject: Why Spark is much faster than Hadoop MapReduce even on disk
How about a treeReduceByKey? :-)
On Friday, June 20, 2014 11:55 AM, DB Tsai wrote:
Currently, the reduce operation combines the results from the mappers
sequentially, so it's O(n).
Xiangrui is working on treeReduce, which is O(log(n)). Based on the
benchmark, it dramatically increases the performan
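As a minimal sketch of the difference, assuming a REPL-style sc and a Spark version in which treeReduce() is available on the core RDD API (the data and depth below are arbitrary illustrations):

// reduce() combines the per-partition results sequentially at the driver,
// while treeReduce() combines them in multiple levels of a tree.
val data = sc.parallelize(1 to 1000000, 100)

val flat = data.reduce(_ + _)
val tree = data.treeReduce(_ + _, depth = 3)  // depth chosen arbitrarily here

assert(flat == tree)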
It's really more of a Scala question than a Spark question, but the standard OO
(not Scala-specific) way is to create your own custom supertype (e.g.
MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD
and MyArray), each of which manually forwards method calls to the co
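A minimal sketch of that forwarding pattern (the trait name, wrapper names, and the two forwarded methods are illustrative):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// One common supertype, two concrete wrappers that each delegate to the
// underlying collection.
trait MyCollectionTrait[T] {
  def map[U: ClassTag](f: T => U): MyCollectionTrait[U]
  def collect(): Array[T]
}

class MyRDD[T](rdd: RDD[T]) extends MyCollectionTrait[T] {
  override def map[U: ClassTag](f: T => U): MyCollectionTrait[U] =
    new MyRDD(rdd.map(f))          // forwards to RDD.map
  override def collect(): Array[T] = rdd.collect()
}

class MyArray[T](array: Array[T]) extends MyCollectionTrait[T] {
  override def map[U: ClassTag](f: T => U): MyCollectionTrait[U] =
    new MyArray(array.map(f))      // forwards to Array.map
  override def collect(): Array[T] = array
}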
Depending on the density of your keys, the alternative signature
def updateStateByKey[S](updateFunc: (Iterator[(K, Seq[V], Option[S])]) =>
Iterator[(K, S)], partitioner: Partitioner, rememberPartitioner:
Boolean)(implicit arg0: ClassTag[S]): DStream[(K, S)]
at least iterates by key rather than
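A minimal usage sketch of that overload (the running-count state, the key/value types, and the partitioner are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Running count per key: the Iterator-based update function is applied to an
// iterator over the keys in each partition, rather than being called per key-value.
def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Long)] = {
  val updateFunc = (entries: Iterator[(String, Seq[Int], Option[Long])]) =>
    entries.map { case (key, newValues, state) =>
      (key, state.getOrElse(0L) + newValues.sum)
    }
  pairs.updateStateByKey(
    updateFunc,
    new HashPartitioner(4),        // illustrative partitioner
    rememberPartitioner = true)
}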
On Wednesday, October 22, 2014 9:06 AM, Sean Owen wrote:
> No, there's no such thing as an RDD of RDDs in Spark.
> Here though, why not just operate on an RDD of Lists? or a List of RDDs?
> Usually one of these two is the right approach whenever you feel
> inclined to operate on an RDD of RDDs.
Asim Jalis writes:
>
> Thanks. Another question. I have event data with timestamps. I want to
> create a sliding window
> using timestamps. Some windows will have a lot of events in them others
> won’t. Is there a way
> to get an RDD made of this kind of a variable length window?
You should c
"looks like Spark outperforms Stratosphere fairly consistently in the
experiments"
There was one exception the paper noted, which was when memory resources were
constrained. In that case, Stratosphere seemed to degrade more gracefully than
Spark, but the author did not explore it further.
Is this a bug?
scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect
res0: Array[(Int, Int)] = Array((1,11), (2,12))
scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
res1: Array[(Long, Int)] = Array((2,11))
s the ASF Jira system will let me
reset my password.
On Sunday, May 11, 2014 4:40 AM, Michael Malak wrote:
Is this a bug?
scala> sc.parallelize(1 to 2,4).zip(sc.parallelize(11 to 12,4)).collect
res0: Array[(Int, Int)] = Array((1,11), (2,12))
scala> sc.parallelize(1L to 2L,4).zip(sc.par
I'm seeing different Serializable behavior in the Spark Shell vs. the Scala
Shell. In the Spark Shell, equals() fails when I use the canonical equals()
pattern of match{}, but works when I substitute isInstanceOf[]. I am using
Spark 0.9.0/Scala 2.10.3.
Is this a bug?
Spark Shell (equals uses match
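For reference, a minimal sketch of the two equals() styles being compared (the classes are illustrative, and this snippet does not by itself reproduce the shell-specific difference):

// Pattern-match style (the "canonical" equals):
class MyId1(val id: Int) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: MyId1 => this.id == that.id
    case _           => false
  }
  override def hashCode: Int = id
}

// isInstanceOf style, equivalent in plain Scala:
class MyId2(val id: Int) extends Serializable {
  override def equals(other: Any): Boolean =
    other.isInstanceOf[MyId2] && other.asInstanceOf[MyId2].id == this.id
  override def hashCode: Int = id
}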
Mohit Jaggi:
A workaround is to use zipWithIndex (to appear in Spark 1.0, but if you're
still on 0.9x you can swipe the code from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala
), map it to (x => (x._2,x._1)) and then sortByKey.
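A minimal sketch of that workaround, assuming a REPL-style sc and the built-in zipWithIndex of Spark 1.0+ (the input data is illustrative):

val rdd = sc.parallelize(Seq("c", "a", "b"))

val byOriginalPosition = rdd
  .zipWithIndex()                              // (element, index)
  .map { case (value, idx) => (idx, value) }   // swap so the index is the key
  .sortByKey()

byOriginalPosition.collect()  // Array((0,c), (1,a), (2,b))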
Sp