Are there any plans to develop Graphx Streaming?

2014-03-14 Thread Qi Song
Hi, I'm an undergraduate student. Our team wants to build an anomalous event detection system based on graph-structured social network data. GraphX is a good system, but we need to deal with streaming data. Spark Streaming exists, but I find that GraphX cannot support streaming data. I want to know
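
GraphX itself has no streaming API, but a common workaround is to build a small graph per micro-batch inside foreachRDD. The sketch below assumes edges arrive as "srcId dstId" text lines on a socket; the host, port and the per-batch analysis are placeholders, not anything from the original thread.

import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PerBatchGraph {
  def main(args: Array[String]) {
    val sc  = new SparkContext("local[2]", "per-batch-graph")
    val ssc = new StreamingContext(sc, Seconds(10))

    // Assumed input: "srcId dstId" lines arriving on a socket (placeholder host/port)
    val edges = ssc.socketTextStream("localhost", 9999)
      .map(_.split("\\s+"))
      .map(parts => (parts(0).toLong, parts(1).toLong))

    edges.foreachRDD { rdd =>
      if (rdd.take(1).nonEmpty) {                  // skip empty micro-batches
        val graph = Graph.fromEdgeTuples(rdd, 1)   // default vertex attribute = 1
        graph.degrees.take(10).foreach(println)    // stand-in for the real analysis
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}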

Re: best practices for pushing an RDD into a database

2014-03-14 Thread Bertrand Dechoux
But you might run into performance issues. I don't know the subject with regard to Spark, but with Hadoop MapReduce, Sqoop might be a solution for handling the database with care. Bertrand Dechoux On Fri, Mar 14, 2014 at 4:47 AM, Christopher Nguyen wrote: > Nicholas, > > > (Can we make that a thing?
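
For Spark itself, the pattern that usually comes up (sketched below; the JDBC URL, credentials and table are placeholders) is to open one connection per partition with foreachPartition instead of one per record:

import java.sql.DriverManager
import org.apache.spark.rdd.RDD

object JdbcSink {
  // rows: (id, value) pairs to persist; connection details are placeholders
  def save(rows: RDD[(Int, String)]): Unit = {
    rows.foreachPartition { partition =>
      // one connection and prepared statement per partition, not per record
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://dbhost:5432/mydb", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO events (id, value) VALUES (?, ?)")
      try {
        partition.foreach { case (id, value) =>
          stmt.setInt(1, id)
          stmt.setString(2, value)
          stmt.executeUpdate()   // batching the inserts would cut round trips further
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }
  }
}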

Re: Spark usage patterns and questions

2014-03-14 Thread Rohit Rai
> > 3. In our use case we read from Kafka, do some mapping and lastly persist > data to Cassandra as well as push the data over a remote actor for > realtime updates in a dashboard. I used the approaches below > - First tried a very naive way like stream.map(...).foreachRDD( > push to actor) >
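
One way to express that fan-out (a sketch only; the Cassandra writer and the dashboard actor are replaced by stand-in stubs here) is to do the bulk write per partition on the workers and ship only a small sample back to the driver for the dashboard:

import org.apache.spark.streaming.dstream.DStream

object FanOut {
  // Stand-in sinks: swap in the real Cassandra writer and remote actor
  def writeToStore(rows: Iterator[String]): Unit = rows.foreach(r => println("store: " + r))
  def pushToDashboard(rows: Array[String]): Unit = rows.foreach(r => println("dash: " + r))

  def wire(parsed: DStream[String]): Unit = {
    parsed.foreachRDD { rdd =>
      rdd.foreachPartition(partition => writeToStore(partition)) // heavy write on the workers
      pushToDashboard(rdd.take(20))                              // small summary to the driver
    }
  }
}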

Re: Large shuffle RDD

2014-03-14 Thread sparrow
found out what the problem was. It turned out that Spark was consuming too much memory and not enough was left for the OS. When doing large shuffle writes, performance is greatly reduced if there is not enough memory left for the OS buffer cache. We have changed our configuration so that Spark on the workers on
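
For reference, the kind of change being described can be made when building the SparkConf; the numbers below are only an illustration of leaving headroom for the OS buffer cache, not the poster's actual values:

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleFriendlyContext {
  def create(): SparkContext = {
    val conf = new SparkConf()
      .setAppName("large-shuffle-job")
      // Illustrative only: on a 64 GB worker, claim well under the full amount
      // so the OS buffer cache can absorb the shuffle writes.
      .set("spark.executor.memory", "48g")
    new SparkContext(conf)
  }
}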

Realtime counting job with reading access from flatmappers

2014-03-14 Thread Dirk Weissenborn
Hey guys, first of all, nice job with Spark ;) I want to use Spark in the following setting and I am not completely sure what the best architecture would be. This is why I would like to ask for your opinion. Job: - read object from input stream - input is a set of ids - map input ids to new ids

Re: spark config params conventions

2014-03-14 Thread Chester Chen
Based on the Typesafe Config maintainer's response, with the latest version of Typesafe Config the double quotes are no longer needed for keys like spark.speculation, so you don't need code to strip the quotes. Chester Alpine Data Labs Sent from my iPhone On Mar 12, 2014, at 2:50 PM, Aaron Davidson wrote:
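
A minimal illustration of that behavior (assuming a recent Typesafe Config on the classpath): dotted keys are parsed as paths, so the value can be read back without any quoting.

import com.typesafe.config.ConfigFactory

object DottedKeys {
  def main(args: Array[String]) {
    val conf = ConfigFactory.parseString("spark.speculation = true")
    println(conf.getBoolean("spark.speculation"))   // prints: true
  }
}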

Re: Incrementally add/remove vertices in GraphX

2014-03-14 Thread alelulli
Hi Matei, Could you please clarify why I must call union before creating the graph? What's the behavior if I call union / subtract after the creation? Are the added / removed vertices processed? For example, if I'm implementing an iterative algorithm and at the 5th step I need to add some ver
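
As I read the suggestion (a sketch only, and whether this is cheap enough inside an iterative algorithm is exactly the open question), adding vertices after creation means unioning the RDDs and rebuilding the graph:

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

object RebuildGraph {
  // GraphX graphs are immutable: "adding" vertices means unioning the vertex RDDs
  // and constructing a new Graph from the combined pieces.
  def addVertices(graph: Graph[Int, Int],
                  added: RDD[(VertexId, Int)]): Graph[Int, Int] =
    Graph(graph.vertices.union(added), graph.edges)
}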

Fwd: Accessing HDFS file on CDH4.4 through Spark

2014-03-14 Thread Pariksheet Barapatre
-- Forwarded message -- From: Pariksheet Barapatre Date: 14 March 2014 23:09 Subject: Accessing HDFS file on CDH4.4 through Spark To: u...@spark.apaache.org, u...@spark.incubator.org Hello All, I just started exploring Spark functionality. I have downloaded and extracted binary
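
For completeness, reading the file usually comes down to pointing textFile at the NameNode; the host, port and path below are placeholders, and the Spark build has to be compiled against the CDH4.4 Hadoop client for this to work.

import org.apache.spark.SparkContext

object ReadFromHdfs {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "hdfs-read-test")
    // NameNode host, port and path are placeholders for the actual CDH4.4 cluster
    val lines = sc.textFile("hdfs://namenode.example.com:8020/user/data/sample.txt")
    println(lines.count())
    sc.stop()
  }
}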

Re: Pig on Spark

2014-03-14 Thread Julien Le Dem
Hi Mayur, Are you going to the Pig meetup this afternoon? http://www.meetup.com/PigUser/events/160604192/ Aniket and I will be there. We would be happy to chat about Pig-on-Spark. On Tue, Mar 11, 2014 at 8:56 AM, Mayur Rustagi wrote: > Hi Lin, > We are working on getting Pig on Spark functional

Re: Pig on Spark

2014-03-14 Thread Mayur Rustagi
Damn, I am off to NY for Structure Conf. Would it be possible to meet anytime after 28th March? I am really interested in making it stable & production quality. Regards Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Fri, M

slf4j and log4j loop

2014-03-14 Thread Adrian Mocanu
Hi, Have you encountered an slf4j and log4j loop when using Spark? I pull in a few packages via sbt. The Spark package uses slf4j-log4j12.jar and another package uses log4j-over-slf4j.jar, which creates the circular loop between the two loggers and thus the exception below. Do you know of a fix for this

Re: slf4j and log4j loop

2014-03-14 Thread Sean Owen
Yes, I think you are interested in this issue and fix: https://github.com/apache/spark/pull/107 -- Sean Owen | Director, Data Science | London On Fri, Mar 14, 2014 at 1:04 PM, Adrian Mocanu wrote: > Hi > > Have you encountered a slf4j and log4j loop when using Spark? I pull a few > packages vi

Re: Pig on Spark

2014-03-14 Thread Aniket Mokashi
We will post fixes from our side at - https://github.com/twitter/pig. Top of our list are: 1. Make it work with pig-trunk (execution engine interface) (with 0.8 or 0.9 Spark). 2. Support for algebraic UDFs (this mitigates the group-by OOM problems). Would definitely love more contributions on this

RE: slf4j and log4j loop

2014-03-14 Thread Adrian Mocanu
That’s great! How would I pull that with sbt? I currently use these 2 (mvnrepository.com/artifact/org.spark-project seems to be down atm): val spark="org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating" val sparkStreaming= "org.apache.spark" % "spark-streaming_2.10" % "0.9.0-incubating"
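
Until a release containing that fix is published, a commonly used interim workaround (my assumption, not the fix from the pull request itself) is to keep only one log4j binding on the classpath by excluding the other in sbt:

// build.sbt sketch: make sure slf4j-log4j12 and log4j-over-slf4j never
// end up on the classpath together
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating"
    exclude("org.slf4j", "slf4j-log4j12"),
  "org.apache.spark" % "spark-streaming_2.10" % "0.9.0-incubating"
    exclude("org.slf4j", "slf4j-log4j12")
)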

How to run a jar against spark

2014-03-14 Thread Chengi Liu
Hi, A very noob question.. Here is my code in eclipse import org.apache.spark.SparkContext; import org.apache.spark.SparkContext._; object HelloWorld { def main(args: Array[String]) { println("Hello, world!") val sc = new SparkContext("localhost","wordcount",args(0),Seq(args(1))
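
A runnable variant of that snippet might look like the following (a sketch: the word-count body is assumed, and the Spark home, jar path and input file come from the command line so SparkContext can ship the packaged jar to the workers). Build it with sbt package, pass the resulting jar as the second argument, and replace "local[2]" with the cluster's spark:// master URL to run against a real cluster.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object HelloWorld {
  def main(args: Array[String]) {
    println("Hello, world!")
    // args: 0 = Spark home, 1 = path to this application's jar, 2 = input file
    val sc = new SparkContext("local[2]", "wordcount", args(0), Seq(args(1)))
    val counts = sc.textFile(args(2))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}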

Re: possible bug in Spark's ALS implementation...

2014-03-14 Thread Michael Allman
I've been thoroughly investigating this issue over the past couple of days and have discovered quite a bit. For one thing, there is definitely (at least) one issue/bug in the Spark implementation that leads to incorrect results for models generated with rank > 1 or a large number of iterations. I w

Re: possible bug in Spark's ALS implementation...

2014-03-14 Thread Xiangrui Meng
Hi Michael, Thanks for looking into the details! Computing X first and computing Y first can deliver different results, because the initial objective values could differ by a lot. But the algorithm should converge after a few iterations. It is hard to tell which should go first. After all, the def
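
For context, both update orders are minimizing the same objective, roughly (Spark's ALS uses a weighted variant of the regularization term, so treat this as a sketch):

  \min_{X,Y} \sum_{(u,i)\in\Omega} \left( r_{ui} - x_u^\top y_i \right)^2
           + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

Each half-iteration fixes one factor and solves a least-squares problem for the other, so starting with X or Y only changes which random initialization the first solve sees; after a few alternations both orders should reach similar objective values.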

new user question on using scala collections inside RDDs

2014-03-14 Thread Peter
Hi, I'm new to Spark. I have played with some data locally but am starting to wonder if I'm going down the wrong track by using Scala collections inside RDDs. I'm looking at a log file of events from mobile clients. One of the engagement metrics we're interested in is lifetime (not terribly interes

Re: new user question on using scala collections inside RDDs

2014-03-14 Thread Ewen Cheslack-Postava
Code in a transformation (i.e. inside the function passed to RDD.map() or RDD.filter()) will run on workers, not the driver. They will run in parallel. In Spark, the driver actually doesn't do much -- it just builds up a description of the computation to be performed and then sends it off to th
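
Following that advice, a per-client lifetime metric can be computed entirely with RDD operations rather than driver-side Scala collections. The sketch below assumes the log has already been parsed into (clientId, timestampMillis) pairs; that layout is an assumption on my part.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object Lifetimes {
  // events: (clientId, timestampMillis) pairs parsed from the log
  def compute(events: RDD[(String, Long)]): RDD[(String, Long)] = {
    events
      .mapValues(ts => (ts, ts))                         // seed (earliest, latest) per event
      .reduceByKey { case ((lo1, hi1), (lo2, hi2)) =>
        (math.min(lo1, lo2), math.max(hi1, hi2))         // combined in parallel on the workers
      }
      .mapValues { case (first, last) => last - first }  // lifetime per client
  }
}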

spark-streaming

2014-03-14 Thread Nathan Kronenfeld
I'm trying to update some Spark Streaming code from 0.8.1 to 0.9.0. Among other things, I've found the function clearMetadata, whose comment says: "...Subclasses of DStream may override this to clear their own metadata along with the generated RDDs" yet which is declared private[streaming].

Can two spark applications share rdd?

2014-03-14 Thread 林武康
Hi, I am a newbie to Spark; the question below may seem foolish, but I really want some advice: As loading data from disk to generate an RDD is very costly in my applications, I hope I can generate it once and cache it in memory, so that any other Spark application can refer to this RDD. Can this possib

Re: Problem with HBase external table on freshly created EMR cluster

2014-03-14 Thread Kanwaldeep
I'm getting the same error when writing data to an HBase cluster using Spark Streaming. Any suggestions on how to fix this? 2014-03-14 23:10:33,832 ERROR o.a.s.s.scheduler.JobScheduler - Error running job streaming job 139486383 ms.0 org.apache.spark.SparkExceptio