processing 50 gb data using just one machine

2016-06-15 Thread spR
Hi, can I use Spark in local mode using 4 cores to process 50 GB of data efficiently? Thank you, Misha
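For context, a minimal sketch of the setup being asked about (app name and path are illustrative; whether 50 GB is handled "efficiently" depends on the job, since shuffle spill and caching are bounded by the one machine's memory):

    import org.apache.spark.{SparkConf, SparkContext}

    // Local mode with 4 worker threads. Input is read partition by partition,
    // so the 50 GB need not fit in RAM, but shuffles and caching are limited
    // by this single JVM's memory.
    val conf = new SparkConf().setAppName("local-50gb").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("/path/to/50gb/input")   // illustrative path
    println(lines.count())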

update mysql in spark

2016-06-15 Thread spR
Hi, can we write an update query using sqlContext? sqlContext.sql("update act1 set loc = round(loc,4)") What is wrong in this? I get the following error. Py4JJavaError: An error occurred while calling o20.sql. : java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier update f
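For what it's worth, Spark SQL of this era has no UPDATE statement, which is why the parser chokes on the keyword; the usual workaround is to derive a new DataFrame with the column recomputed. A Scala sketch reusing the table name from the post (the PySpark calls are spelled the same; the result-table name is hypothetical):

    import org.apache.spark.sql.functions.round

    // No UPDATE in Spark SQL: recompute the column into a new DataFrame instead.
    val act1 = sqlContext.table("act1")
    val updated = act1.withColumn("loc", round(act1("loc"), 4))  // replaces "loc"
    updated.registerTempTable("act1_rounded")  // hypothetical name for the result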

Re: processing 50 gb data using just one machine

2016-06-15 Thread spR

Re: processing 50 gb data using just one machine

2016-06-15 Thread spR
Sergio Fernández wrote: In theory yes... common sense says that volume / resources = time, so more volume on the same processing resources would just take more time.

Re: processing 50 gb data using just one machine

2016-06-15 Thread spR
spR wrote: I have 16 GB RAM, i7. Will this config be able to handle the processing without my IPython notebook dying? The local mode is for testing purposes. But, I do not have any cluster at my dispo

concat spark dataframes

2016-06-15 Thread spR
Hi, how do I concatenate Spark DataFrames? I have two frames with certain columns, and I want to get a dataframe with columns from both of the other frames. Regards, Misha
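"Concatenate" can mean two things here; both are sketched below with hypothetical frames df1 and df2. Putting columns from both frames side by side requires a shared key to join on; stacking rows (the direction the reply below takes) requires selecting the common columns first:

    // Columns from both frames side by side: join on a shared key ("id" is assumed).
    val joined = df1.join(df2, df1("id") === df2("id"))

    // Rows from both frames stacked: select the common columns, then union.
    val common = Seq("a", "b")                        // assumed shared columns
    val stacked = df1.selectExpr(common: _*).unionAll(df2.selectExpr(common: _*))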

data too long

2016-06-15 Thread spR
I am trying to save a Spark dataframe in the MySQL database by using: df.write(sql_url, table='db.table'). The first column in the dataframe seems too long and I get this error: Data too long for column 'custid' at row 1. What should I do? Thanks
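"Data too long" is MySQL rejecting a value wider than the column's declared size, so either widen the column in MySQL (ALTER TABLE) or shorten the data before writing. A sketch assuming custid is a VARCHAR(16), an illustrative width:

    import org.apache.spark.sql.functions.substring

    // Truncate custid to the column's declared width before the JDBC write.
    val trimmed = df.withColumn("custid", substring(df("custid"), 1, 16))
    trimmed.write.jdbc(sql_url, "db.table", new java.util.Properties())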

Re: concat spark dataframes

2016-06-15 Thread spR
Natu Lauchande wrote (Wednesday, June 15, 2016 2:07 PM): Hi, you can select the common columns and use DataFra

java server error - spark

2016-06-15 Thread spR
Hi, I am getting this error while executing a query using sqlcontext.sql. The table has around 2.5 GB of data to be scanned. First I get an out-of-memory exception, even though I have 16 GB of RAM. Then my notebook dies and I get the error below. Py4JNetworkError: An error occurred while trying to connect to the

Re: java server error - spark

2016-06-15 Thread spR
Jeff Zhang wrote: Could you paste the full stacktrace? On Thu, Jun 16, 2016 at 7:24 AM, spR wrote: Hi, I am getting this error while executing a query using sqlcontext.sql. The table has around 2.5 GB of data to be scanned. First I g

Re: java server error - spark

2016-06-15 Thread spR
> "--executor-memory" > > > > > > On Thu, Jun 16, 2016 at 8:54 AM, spR wrote: > >> Hey, >> >> error trace - >> >> hey, >> >> >> error trace - >> >> >> --

Re: java server error - spark

2016-06-15 Thread spR
sc.conf = conf On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang wrote: Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded. It is OOM on the executor. Please try to increase executor memory: "--executor-memory"

Re: java server error - spark

2016-06-15 Thread spR
mode, please use other cluster mode. On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang wrote: Specify --executor-memory in your spark-submit command. On Thu, Jun 16, 2016 at 9:01 AM, spR wrote: Thank you. Can you pls te

Re: java server error - spark

2016-06-15 Thread spR
Hey, thanks. Now it worked. :) On Wed, Jun 15, 2016 at 6:59 PM, Jeff Zhang wrote: Then the only solution is to increase your driver memory, but still restricted by your machine's memory. "--driver-memory"
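To summarize the thread's resolution: in local mode the query runs inside the single driver JVM, so it is driver memory, not executor memory, that has to grow, and it must be set at launch (setting spark.driver.memory in SparkConf afterwards is too late for an already-running JVM). A sketch, with an illustrative value:

    spark-submit --master local[4] --driver-memory 8g ...   # 8g is illustrative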

noob: how to extract different members of a VertexRDD

2014-08-19 Thread spr
I'm a Scala / Spark / GraphX newbie, so may be missing something obvious. I have a set of edges that I read into a graph. For an iterative community-detection algorithm, I want to assign each vertex to a community with the name of the vertex. Intuitively it seems like I should be able to pull
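The fix that emerges in the reply below (ankurdave's) is a one-liner; as a sketch, assuming g is the already-loaded graph:

    import org.apache.spark.graphx._

    // Replace each vertex attribute with the vertex's own ID, so every vertex
    // starts out in its own community. Edge attributes are untouched.
    val newG = g.mapVertices((id, attr) => id)
    // newG.vertices has type VertexRDD[VertexId], i.e. RDD[(VertexId, VertexId)]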

Re: noob: how to extract different members of a VertexRDD

2014-08-19 Thread spr
ankurdave wrote: val g = ...; val newG = g.mapVertices((id, attr) => id) // newG.vertices has type VertexRDD[VertexId], or RDD[(VertexId, VertexId)]. Yes, that worked perfectly. Thanks much. One follow-up question. If I just wanted to get those values into a vanilla variable (n

how to group within the messages at a vertex?

2014-09-17 Thread spr
Sorry if this is in the docs someplace and I'm missing it. I'm trying to implement label propagation in GraphX. The core step of that algorithm is: for each vertex, find the most frequent label among its neighbors and set its label to that. (I think) I see how to get the input from all the ne
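A sketch of that core step using the mapReduceTriplets API of this era (GraphX 1.2+ would use aggregateMessages instead), assuming g's vertex attributes hold the current labels as VertexIds:

    import org.apache.spark.graphx._

    // Each edge sends each endpoint's label to the other end as a 1-count histogram.
    val neighborLabels: VertexRDD[Map[VertexId, Long]] =
      g.mapReduceTriplets[Map[VertexId, Long]](
        t => Iterator((t.dstId, Map(t.srcAttr -> 1L)), (t.srcId, Map(t.dstAttr -> 1L))),
        // Merge histograms by summing per-label counts.
        (a, b) => (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
      )
    // Each vertex adopts its most frequent neighbor label; isolated vertices keep their own.
    val next = g.outerJoinVertices(neighborLabels) {
      (id, old, histOpt) => histOpt.map(_.maxBy(_._2)._1).getOrElse(old)
    }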

SparkStreaming program does not start

2014-10-07 Thread spr
_ println("Point 0") val appName = "try1.scala" val master = "local[5]" val conf = new SparkConf().setAppName(appName).setMaster(master) val ssc = new StreamingContext(conf, Seconds(10)) println("Point 1") val lines = ssc.textFileStream("/Users/spr/D

Re: SparkStreaming program does not start

2014-10-07 Thread spr
Quoting: "Try using spark-submit instead of spark-shell." Two questions: (1) What does spark-submit do differently from spark-shell that makes you think that may be the cause of my difficulty? (2) When I try spark-submit, it complains about "Error: Cannot load main class from JAR: file:/Users/spr/...

Re: SparkStreaming program does not start

2014-10-14 Thread spr
Thanks Abraham Jacob, Tobias Pfeiffer, Akhil Das-2, and Sean Owen for your helpful comments. Cockpit error on my part in just putting the .scala file as an argument rather than redirecting stdin from it.

what does DStream.union() do?

2014-10-29 Thread spr
The documentation at https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.dstream.DStream describes the union() method as "Return a new DStream by unifying data of another DStream with this DStream." Can somebody provide a clear definition of what "unifying" means i
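In practice, "unifying" means: at each batch interval, the unioned DStream's RDD contains the elements of both parents' RDDs for that interval, a bag union with no deduplication, and both DStreams must have the same element type. A sketch:

    import org.apache.spark.streaming.dstream.DStream

    val s1: DStream[String] = ssc.textFileStream("/dir/a")   // illustrative dirs
    val s2: DStream[String] = ssc.textFileStream("/dir/b")
    val merged: DStream[String] = s1.union(s2)  // per batch: s1's elements plus s2's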

how to extract/combine elements of an Array in DStream element?

2014-10-29 Thread spr
I am processing a log file, from each line of which I want to extract the zeroth and 4th elements (and an integer 1 for counting) into a tuple. I had hoped to be able to index the Array for elements 0 and 4, but Arrays appear not to support vector indexing. I'm not finding a way to extract and co
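For what it's worth, Scala arrays do support positional indexing, just with parentheses rather than brackets: f(0), f(4). A sketch of the extraction, assuming a tab-delimited log arriving in a DStream called lines:

    // Pull the zeroth and 4th fields plus a count of 1 from each line.
    val tuples = lines.map { line =>
      val f = line.split("\t")   // assumed tab-delimited
      (f(0), f(4), 1)
    }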

Re: what does DStream.union() do?

2014-10-29 Thread spr
I need more precision to understand. If the elements of one DStream/RDD are (String) and the elements of the other are (Time, Int), what does "union" mean? I'm hoping for (String, Time, Int) but that appears optimistic. :) Do the elements have to be of homogeneous type? Holden Karau wrote:

does updateStateByKey accept a state that is a tuple?

2014-10-30 Thread spr
MinTime).min, Seq(maxTime, newMaxTime).max) } var DhcpSvrCum = newState.updateStateByKey[(Int, Time, Time)](updateDhcpState) The error I get is [info] Compiling 3 Scala sources to /Users/spr/Documents/.../target/scala-2.10/classes... [error] /Users/spr/Documents/...StatefulDhcpServer

Re: does updateStateByKey accept a state that is a tuple?

2014-10-30 Thread spr
I think I understand how to deal with this, though I don't have all the code working yet. The point is that the V of (K, V) can itself be a tuple. So the updateFunc prototype looks something like val updateDhcpState = (newValues: Seq[Tuple1[(Int, Time, Time)]], state: Option[Tuple1[(Int, Tim

Re: does updateStateByKey accept a state that is a tuple?

2014-10-31 Thread spr
Based on execution on small test cases, it appears that the construction below does what I intend. (Yes, all those Tuple1()s were superfluous.) var lines = ssc.textFileStream(dirArg) var linesArray = lines.map( line => (line.split("\t"))) var newState = linesArray.map( lineArray =>
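Put together, the working shape is: the state is a plain tuple (no Tuple1 wrapper) and the update function returns an Option of it. A reconstructed sketch, with illustrative defaults for the initial state:

    import org.apache.spark.streaming.Time

    val updateDhcpState = (newValues: Seq[(Int, Time, Time)],
                           state: Option[(Int, Time, Time)]) => {
      val prev  = state.getOrElse((0, Time(Long.MaxValue), Time(0L)))
      val count = prev._1 + newValues.map(_._1).sum
      val minT  = (prev._2 +: newValues.map(_._2)).min
      val maxT  = (prev._3 +: newValues.map(_._3)).max
      Some((count, minT, maxT))   // Option[S], as updateStateByKey requires
    }
    var DhcpSvrCum = newState.updateStateByKey[(Int, Time, Time)](updateDhcpState)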

with SparkStreaming spark-submit, don't see output after ssc.start()

2014-11-03 Thread spr
I have a Spark Streaming program that works fine if I execute it via sbt "runMain com.cray.examples.spark.streaming.cyber.StatefulDhcpServerHisto -f /Users/spr/Documents/<...>/tmp/ -t 10" but if I start it via $S/bin/spark-submit --master local[12] --class StatefulNewDhcpServe

Re: with SparkStreaming spark-submit, don't see output after ssc.start()

2014-11-03 Thread spr
P.S. I believe I am creating output from the Spark Streaming app, and thus not falling into the "no-output, no-execution" pitfall, as at the end I have newServers.print() newServers.saveAsTextFiles("newServers","out")

Re: with SparkStreaming spark-submit, don't see output after ssc.start()

2014-11-04 Thread spr
Yes, good catch. I also realized, after I posted, that I was calling 2 different classes, though they are in the same JAR. I went back and tried it again with the same class in both cases, and it failed the same way. I thought perhaps having 2 classes in a JAR was an issue, but commenting out o

Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread spr
The use case I'm working on has a main data stream in which a human needs to modify what to look for. I'm thinking to implement the main data stream with Spark Streaming and the things to look for with Spark. (Better approaches welcome.) To do this, I have intermixed Spark and Spark Streaming cod

Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread spr
I am trying to implement a use case that takes some human input. Putting that in a single file (as opposed to a collection of HDFS files) would be a simpler human interface, so I tried an experiment with whether Spark Streaming (via textFileStream) will recognize a new version of a filename it has

Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

2014-11-04 Thread spr
Holden Karau wrote: This is the expected behavior. Spark Streaming only reads new files once; this is why they must be created through an atomic move, so that Spark doesn't accidentally read a partially written file. I'd recommend looking at "Basic Sources" in the Spark Streaming guide (ht

Re: Streaming: which code is (not) executed at every batch interval?

2014-11-04 Thread spr
Good, thanks for the clarification. It would be great if this were precisely stated somewhere in the docs. :) To state this another way, it seems like there's no way to straddle the streaming world and the non-streaming world; to get input from both a (vanilla, Linux) file and a stream. Is tha

Re: with SparkStreaming spark-submit, don't see output after ssc.start()

2014-11-05 Thread spr
This problem turned out to be a cockpit error. I had the same class name defined in a couple different files, and didn't realize SBT was compiling them all together, and then executing the "wrong" one. Mea culpa.

how to blend a DStream and a broadcast variable?

2014-11-05 Thread spr
My use case has one large data stream (DS1) that obviously maps to a DStream. The processing of DS1 involves filtering it for any of a set of known values, which will change over time, though slowly by streaming standards. If the filter data were static, it seems to obviously map to a broadcast v
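One common pattern for a slowly-changing filter set (a sketch, not necessarily the list's answer): keep the set in a driver-side variable and consult it inside transform(), whose function is re-evaluated on the driver at every batch interval, so updates get picked up; re-broadcasting a fresh broadcast variable on each change is the heavier-weight alternative.

    import org.apache.spark.streaming.dstream.DStream

    // watchSet is hypothetical; the driver reassigns it as the human edits the list.
    @volatile var watchSet: Set[String] = Set("known-value-1", "known-value-2")

    def flagHits(ds1: DStream[String]): DStream[String] =
      ds1.transform { rdd =>
        val snapshot = watchSet   // captured per batch, shipped with the closure
        rdd.filter(rec => snapshot.contains(rec))
      }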

in function prototypes?

2014-11-11 Thread spr
ount, Seq(minTime, newMinTime).min) } var DnsSvrCum = DnsSvr.updateStateByKey[(Int, Time)](updateDnsCount) // <=== error here --compilation output-- [error] /Users/spr/Documents/.../cyberSCinet.scala:513: overloaded method value updateStateByKey with alternativ

"overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-11 Thread spr
reviousCount, Seq(minTime, newMinTime).min) } var DnsSvrCum = DnsSvr.updateStateByKey[(Int, Time)](updateDnsCount) // <=== error here --compilation output-- [error] /Users/spr/Documents/.../cyberSCinet.scala:513: overloaded method value updateStateByKey with alternative

Re: "overloaded method value updateStateByKey ... cannot be applied to ..." when Key is a Tuple2

2014-11-12 Thread spr
After comparing with previous code, I got it to work by making the return a Some instead of a Tuple2. Perhaps some day I will understand this. spr wrote: --code-- val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => { val
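The reason, for the record: updateStateByKey expects the update function to return Option[S], where None means "drop this key's state", so a bare tuple does not typecheck. A reconstructed sketch (defaults are illustrative):

    import org.apache.spark.streaming.Time

    val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => {
      val (prevCount, prevMin) = state.getOrElse((0, Time(Long.MaxValue)))
      val count = prevCount + values.map(_._1).sum
      val minT  = (prevMin +: values.map(_._2)).min
      Some((count, minT))   // Option[(Int, Time)], not a bare tuple
    }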

how to convert System.currentTimeMillis to calendar time

2014-11-13 Thread spr
Apologies for what seems an egregiously simple question, but I can't find the answer anywhere. I have timestamps from the Spark Streaming Time() interface, in milliseconds since an epoch, and I want to print out a human-readable calendar date and time. How does one do that?
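A sketch using the JDK classes of the era (on Java 8+, java.time.Instant is the nicer route):

    import java.text.SimpleDateFormat
    import java.util.Date
    import org.apache.spark.streaming.Time

    // Time.milliseconds is epoch millis; format it with the JDK's date classes.
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    def pretty(t: Time): String = fmt.format(new Date(t.milliseconds))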

representing RDF literals as vertex properties

2014-12-04 Thread spr
@ankurdave's concise code at https://gist.github.com/ankurdave/587eac4d08655d0eebf9, responding to an earlier thread (http://apache-spark-user-list.1001560.n3.nabble.com/How-to-construct-graph-in-graphx-tt16335.html#a16355) shows how to build a graph with multiple edge-types ("predicates" in RDF-sp

Re: representing RDF literals as vertex properties

2014-12-08 Thread spr
OK, have waded into implementing this and have gotten pretty far, but am now hitting something I don't understand, a NoSuchMethodError. The code looks like [...] val conf = new SparkConf().setAppName(appName) //conf.set("fs.default.name", "file://"); val sc = new SparkContext(c