Hi,
Can I use Spark in local mode with 4 cores to process 50 GB of data efficiently?
Thank you,
Misha
Hi,
Can we write an UPDATE query using sqlContext?
sqlContext.sql("update act1 set loc = round(loc,4)")
What is wrong with this? I get the following error:
Py4JJavaError: An error occurred while calling o20.sql.
: java.lang.RuntimeException: [1.1] failure: ``with'' expected but
identifier update f
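For what it's worth, Spark SQL in the 1.x line does not accept UPDATE statements, which is what the parser error above is complaining about. A minimal PySpark sketch of the usual workaround, assuming Spark 1.5+ for functions.round and that act1 is registered as a temporary table (the table and column names come from the question; everything else is illustrative):

from pyspark.sql import functions as F

# Rebuild the table with the rounded column instead of updating it in place.
act1 = sqlContext.table("act1")
act1_rounded = act1.withColumn("loc", F.round(act1["loc"], 4))
act1_rounded.registerTempTable("act1")  # re-register under the same name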
> On Wed, Jun 15, 2016, Sergio Fernández wrote:
>
>> In theory, yes... common sense says that:
>>
>> volume / resources = time
>>
>> So more volume on the same processing resources would just take more time.
>> On Jun 15, 2016 6:43 PM, "spR" wrote:
>>
>> I have 16 GB of RAM and an i7.
>>
>> Will this configuration be able to handle the processing without my IPython
>> notebook dying?
>>
>> Local mode is for testing purposes, but I do not have any cluster at
>> my disposal.
Hi,
How do I concatenate Spark DataFrames? I have two frames, each with certain columns,
and I want to get a DataFrame with the columns from both.
Regards,
Misha
I am trying to save a Spark DataFrame to a MySQL database using:
df.write(sql_url, table='db.table')
The data in the first column of the DataFrame seems to be too long for the table, and I get this error:
Data too long for column 'custid' at row 1
What should I do?
Thanks
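"Data too long for column ..." is MySQL rejecting a string that exceeds the declared width of the custid column, so either widen that column in MySQL (e.g. ALTER TABLE ... MODIFY custid VARCHAR(255)) or let Spark create the table itself through the JDBC writer. A minimal PySpark sketch of the latter, assuming sql_url is a JDBC URL and that overwriting the target table is acceptable; the credentials and driver class are illustrative:

# Assumption: sql_url looks like "jdbc:mysql://host:3306/db"
df.write.jdbc(
    url=sql_url,
    table="db.table",
    mode="overwrite",  # lets Spark (re)create the table with wide enough columns
    properties={"user": "...", "password": "...", "driver": "com.mysql.jdbc.Driver"},
)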
> On Wed, Jun 15, 2016 at 2:07 PM, Natu Lauchande wrote (Re: concat spark dataframes):
>
> Hi,
>
> You can select the common columns and use DataFra
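The reply above is cut off; for reference, a minimal PySpark sketch of the two usual ways to combine DataFrames (frame, column, and key names are illustrative, not from the thread):

# Stack rows when both frames share the same columns (Spark 1.x API):
combined_rows = df1.select("id", "loc").unionAll(df2.select("id", "loc"))

# Put the columns from both frames side by side by joining on a shared key:
combined_cols = df1.join(df2, on="id", how="inner")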
Hi,
I am getting this error while executing a query using sqlContext.sql.
The table has around 2.5 GB of data to be scanned.
First I get an out-of-memory exception, even though I have 16 GB of RAM.
Then my notebook dies and I get the error below:
Py4JNetworkError: An error occurred while trying to connect to the
Jeff Zhang wrote:
> Could you paste the full stacktrace ?
> "--executor-memory"
>
>
>
>
>
> On Thu, Jun 16, 2016 at 8:54 AM, spR wrote:
>
>> Hey,
>>
>> error trace -
>>
>> hey,
>>
>>
>> error trace -
>>
>>
>> --
sc.conf = conf
On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang wrote:
> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>
>
> It is OOM on the executor. Please try to increase executor memory.
> "--executor-memory"
mode, please use other cluster mode.
>
> On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang wrote:
>
>> Specify --executor-memory in your spark-submit command.
>>
>>
>>
>> On Thu, Jun 16, 2016 at 9:01 AM, spR wrote:
>>
>>> Thank you. Can you pls te
Hey,
Thanks, now it worked. :)
On Wed, Jun 15, 2016 at 6:59 PM, Jeff Zhang wrote:
> Then the only solution is to increase your driver memory but still
> restricted by your machine's memory. "--driver-memory"
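In local mode the "executor" runs inside the driver JVM, so the driver-memory setting is the one that matters, and it has to be given when that JVM is launched. A sketch of what that looks like on the command line (the 12g value and script name are illustrative):

spark-submit --master "local[4]" --driver-memory 12g my_job.py
pyspark --master "local[4]" --driver-memory 12g   # for an IPython/notebook session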
I'm a Scala / Spark / GraphX newbie, so may be missing something obvious.
I have a set of edges that I read into a graph. For an iterative
community-detection algorithm, I want to assign each vertex to a community
with the name of the vertex. Intuitively it seems like I should be able to
pull
ankurdave wrote
> val g = ...
> val newG = g.mapVertices((id, attr) => id)
> // newG.vertices has type VertexRDD[VertexId], or RDD[(VertexId,
> VertexId)]
Yes, that worked perfectly. Thanks much.
One follow-up question. If I just wanted to get those values into a vanilla
variable (n
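The follow-up is cut off above; if the goal was to pull those vertex values into a plain local variable on the driver, a minimal Scala sketch building on the newG from the snippet (only sensible when the result fits in driver memory):

// Materialize the (vertexId, attribute) pairs as a local Array on the driver.
val vertexPairs = newG.vertices.collect()   // Array[(VertexId, VertexId)] here
vertexPairs.take(5).foreach(println)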
Sorry if this is in the docs someplace and I'm missing it.
I'm trying to implement label propagation in GraphX. The core step of that
algorithm is
- for each vertex, find the most frequent label among its neighbors and set
its label to that.
(I think) I see how to get the input from all the ne
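For reference, a rough Scala sketch of that core step using aggregateMessages; the graph type, attribute layout, and names are assumptions for illustration, not code from the thread. (GraphX also ships org.apache.spark.graphx.lib.LabelPropagation, which may already cover this.)

import org.apache.spark.graphx._

// One label-propagation step: each vertex adopts the most frequent label
// among its neighbors' current labels (vertex attribute = current label).
def propagateOnce[ED](g: Graph[Long, ED]): Graph[Long, ED] = {
  // Each edge sends each endpoint's label to the other endpoint with a count of 1.
  val neighborLabels: VertexRDD[Map[Long, Long]] =
    g.aggregateMessages[Map[Long, Long]](
      ctx => {
        ctx.sendToDst(Map(ctx.srcAttr -> 1L))
        ctx.sendToSrc(Map(ctx.dstAttr -> 1L))
      },
      // Merge the per-neighbor label counts.
      (a, b) => (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
    )
  // Take the most frequent label; keep the old label if no messages arrived.
  g.outerJoinVertices(neighborLabels) { (_, oldLabel, countsOpt) =>
    countsOpt.map(_.maxBy(_._2)._1).getOrElse(oldLabel)
  }
}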
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
println("Point 0")
val appName = "try1.scala"
val master = "local[5]"
val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
println("Point 1")
val lines = ssc.textFileStream("/Users/spr/D
> Try using spark-submit instead of spark-shell
Two questions:
- What does spark-submit do differently from spark-shell that makes you
think that may be the cause of my difficulty?
- When I try spark-submit it complains about "Error: Cannot load main class
from JAR: file:/Users/spr/...
Thanks Abraham Jacob, Tobias Pfeiffer, Akhil Das-2, and Sean Owen for your
helpful comments. Cockpit error on my part in just putting the .scala file
as an argument rather than redirecting stdin from it.
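In other words, the script has to be fed to the REPL rather than passed as a positional argument. A sketch of two invocations that are commonly used (the -i flag is passed through to the underlying Scala REPL; paths and master are illustrative):

$SPARK_HOME/bin/spark-shell --master "local[5]" -i try1.scala
$SPARK_HOME/bin/spark-shell --master "local[5]" < try1.scala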
The documentation at
https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.dstream.DStream
describes the union() method as
"Return a new DStream by unifying data of another DStream with this
DStream."
Can somebody provide a clear definition of what "unifying" means i
I am processing a log file, from each line of which I want to extract the
zeroth and 4th elements (and an integer 1 for counting) into a tuple. I had
hoped to be able to index the Array for elements 0 and 4, but Arrays appear
not to support vector indexing. I'm not finding a way to extract and
co
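For what it's worth, Scala arrays are indexed with parentheses rather than square brackets, which may be the snag here. A minimal sketch of the extraction, assuming tab-separated fields (positions 0 and 4 come from the question; the sample line is illustrative):

val line = "f0\tf1\tf2\tf3\tf4"
val fields = line.split("\t")          // Array[String]
val triple = (fields(0), fields(4), 1) // ("f0", "f4", 1)

// Over a whole file: sc.textFile(path).map { l => val f = l.split("\t"); (f(0), f(4), 1) }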
I need more precision to understand. If the elements of one DStream/RDD are
(String) and the elements of the other are (Time, Int), what does "union"
mean? I'm hoping for (String, Time, Int) but that appears optimistic. :)
Do the elements have to be of homogeneous type?
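For what it's worth, DStream.union (like RDD.union) requires both streams to carry the same element type and simply pools their elements; it does not pair them up. Combining a keyed DStream[(K, V)] with a DStream[(K, W)] is done with join instead, which yields DStream[(K, (V, W))]. A small Scala sketch under those assumptions, assuming an existing StreamingContext ssc (directory paths are illustrative):

// Two streams of the SAME element type can be unioned; the result just
// contains the elements of both.
val a = ssc.textFileStream("/tmp/dirA")   // DStream[String]
val b = ssc.textFileStream("/tmp/dirB")   // DStream[String]
val both = a.union(b)                     // DStream[String]

// Streams of different element types cannot be unioned directly; key both
// sides and use join to get (key, (Time, Int))-style pairings instead.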
MinTime).min, Seq(maxTime, newMaxTime).max)
}
var DhcpSvrCum = newState.updateStateByKey[(Int, Time, Time)](updateDhcpState)
The error I get is
[info] Compiling 3 Scala sources to
/Users/spr/Documents/.../target/scala-2.10/classes...
[error] /Users/spr/Documents/...StatefulDhcpServer
I think I understand how to deal with this, though I don't have all the code
working yet. The point is that the V of (K, V) can itself be a tuple. So
the updateFunc prototype looks something like
val updateDhcpState = (newValues: Seq[Tuple1[(Int, Time, Time)]], state: Option[Tuple1[(Int, Tim
Based on execution on small test cases, it appears that the construction
below does what I intend. (Yes, all those Tuple1()s were superfluous.)
var lines = ssc.textFileStream(dirArg)
var linesArray = lines.map( line => (line.split("\t")))
var newState = linesArray.map( lineArray =>
I have a Spark Streaming program that works fine if I execute it via
sbt "runMain com.cray.examples.spark.streaming.cyber.StatefulDhcpServerHisto
-f /Users/spr/Documents/<...>/tmp/ -t 10"
but if I start it via
$S/bin/spark-submit --master local[12] --class StatefulNewDhcpServe
P.S. I believe I am creating output from the Spark Streaming app, and thus
not falling into the "no-output, no-execution" pitfall, as at the end I have
newServers.print()
newServers.saveAsTextFiles("newServers","out")
Yes, good catch. I also realized, after I posted, that I was calling 2
different classes, though they are in the same JAR. I went back and tried
it again with the same class in both cases, and it failed the same way. I
thought perhaps having 2 classes in a JAR was an issue, but commenting out
o
The use case I'm working on has a main data stream in which a human needs to
modify what to look for. I'm thinking to implement the main data stream
with Spark Streaming and the things to look for with Spark.
(Better approaches welcome.)
To do this, I have intermixed Spark and Spark Streaming cod
I am trying to implement a use case that takes some human input. Putting
that in a single file (as opposed to a collection of HDFS files) would be a
simpler human interface, so I tried an experiment with whether Spark
Streaming (via textFileStream) will recognize a new version of a filename it
has
Holden Karau wrote
> This is the expected behavior. Spark Streaming only reads new files once,
> this is why they must be created through an atomic move so that Spark
> doesn't accidentally read a partially written file. I'd recommend looking
> at "Basic Sources" in the Spark Streaming guide (
> ht
Good, thanks for the clarification. It would be great if this were precisely
stated somewhere in the docs. :)
To state this another way, it seems like there's no way to straddle the
streaming world and the non-streaming world; to get input from both a
(vanilla, Linux) file and a stream. Is tha
This problem turned out to be a cockpit error. I had the same class name
defined in a couple different files, and didn't realize SBT was compiling
them all together, and then executing the "wrong" one. Mea culpa.
My use case has one large data stream (DS1) that obviously maps to a DStream.
The processing of DS1 involves filtering it for any of a set of known
values, which will change over time, though slowly by streaming standards.
If the filter data were static, it seems to obviously map to a broadcast
v
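One pattern that is often suggested for this (a sketch, not something stated in the thread): re-read the small filter set on the driver inside DStream.transform, which runs once per batch, so edits are picked up without restarting the job. Assumes lines is a DStream[String] and the filter values live in a small local file; all names are illustrative.

import scala.io.Source

val filtered = lines.transform { rdd =>
  // transform's body runs on the driver every batch interval, so the file is
  // re-read each time and changes show up within one batch. For a larger
  // filter set, the re-read value could be broadcast here instead.
  val src = Source.fromFile("/path/to/filter-values.txt")
  val wanted = try src.getLines().map(_.trim).toSet finally src.close()
  rdd.filter(line => wanted.exists(v => line.contains(v)))
}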
reviousCount, Seq(minTime, newMinTime).min)
}
var DnsSvrCum = DnsSvr.updateStateByKey[(Int, Time)](updateDnsCount)
// <=== error here
--compilation output--
[error] /Users/spr/Documents/.../cyberSCinet.scala:513: overloaded method
value updateStateByKey with alternative
After comparing with previous code, I got it to work by making the return a Some instead of a Tuple2. Perhaps someday I will understand this.
spr wrote
> --code
>
> val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => {
> val
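For reference, a minimal Scala sketch of the shape updateStateByKey expects: the update function must return an Option (hence the Some), and the state value can itself be a tuple. The state layout below is an assumption for illustration, not the thread's code; it assumes DnsSvr is a DStream[(String, (Int, Time))].

import org.apache.spark.streaming.Time

// State per key: (running count, earliest Time seen).
val updateDnsCount = (values: Seq[(Int, Time)], state: Option[(Int, Time)]) => {
  if (values.isEmpty) {
    state                                            // nothing new for this key this batch
  } else {
    val newCount = values.map(_._1).sum + state.map(_._1).getOrElse(0)
    val times    = values.map(_._2) ++ state.map(_._2)
    Some((newCount, times.minBy(_.milliseconds)))    // return an Option, not a bare tuple
  }
}
// val DnsSvrCum = DnsSvr.updateStateByKey[(Int, Time)](updateDnsCount)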
Apologies for what seems an egregiously simple question, but I can't find the
answer anywhere.
I have timestamps from the Spark Streaming Time() interface, in milliseconds
since an epoch, and I want to print out a human-readable calendar date and
time. How does one do that?
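For reference, a minimal Scala sketch, assuming a Spark Streaming Time value; the format pattern and sample value are illustrative:

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.Time

val t = Time(1434400000000L)                      // milliseconds since the epoch (example)
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss z")
println(fmt.format(new Date(t.milliseconds)))     // human-readable calendar date and time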
@ankurdave's concise code at
https://gist.github.com/ankurdave/587eac4d08655d0eebf9, responding to an
earlier thread
(http://apache-spark-user-list.1001560.n3.nabble.com/How-to-construct-graph-in-graphx-tt16335.html#a16355)
shows how to build a graph with multiple edge-types ("predicates" in
RDF-sp
OK, I have waded into implementing this and have gotten pretty far, but am now hitting something I don't understand: a NoSuchMethodError.
The code looks like
[...]
val conf = new SparkConf().setAppName(appName)
//conf.set("fs.default.name", "file://");
val sc = new SparkContext(c