Hi,
When we run MLlib word2vec (spark-1.1.0), the driver gets stuck at 100% CPU
usage. Here is the jstack output:
"main" prio=10 tid=0x40112800 nid=0x46f2 runnable
[0x4162e000]
java.lang.Thread.State: RUNNABLE
at
java.io.ObjectOutputStream$BlockDataOutputStream.drain(Object
Hi,
I have set up a Spark 1.2 standalone cluster and am trying to run Hive on Spark
by following the link below.
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
I got the latest build of Hive on Spark from git and was trying to run a few
queries. Queries are runn
Can you confirm your EMR version? Could it be because of the classpath
entries for EMRFS? You might face issues using S3 without them.
Thanks,
Aniket
On Mon, Jan 5, 2015, 11:16 AM Adam Gilmore wrote:
> Just an update on this - I found that the script by Amazon was the culprit
> - not exact
@Brad, I'm guessing that the additional memory usage is coming from the
shuffle performed by coalesce, so that at least explains the memory blowup.
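For what it's worth, a quick sketch of the distinction (here `data` just stands in for the loaded RDD): repartition(n) is simply coalesce(n, shuffle = true), while a plain coalesce cannot increase the partition count without a shuffle.

val repartitioned = data.repartition(256)              // always shuffles
val alsoShuffled  = data.coalesce(256, shuffle = true) // equivalent to repartition(256)
val narrowed      = data.coalesce(30)                  // no shuffle, can only reduce partitions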
On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das
wrote:
> You can try:
>
> - Using KryoSerializer
> - Enabling RDD Compression
> - Setting storage type to
What are you trying to do? Can you paste the whole code? I used to see this
sort of exception when I closed the fs object inside map/mapPartitions, etc.
Thanks
Best Regards
On Mon, Jan 5, 2015 at 6:43 AM, Michael Albert <
m_albert...@yahoo.com.invalid> wrote:
> Greetings!
>
> So, I think I have dat
You can try:
- Using KryoSerializer
- Enabling RDD Compression
- Setting storage type to MEMORY_ONLY_SER or MEMORY_AND_DISK_SER
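For example, a minimal sketch of those settings (the app name and input path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("MemoryTuningExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")   // compress serialized RDD partitions
val sc = new SparkContext(conf)

// Keep partitions as serialized bytes in memory, spilling to disk when they don't fit.
val data = sc.textFile("hdfs:///path/to/json")
data.persist(StorageLevel.MEMORY_AND_DISK_SER)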
Thanks
Best Regards
On Sun, Jan 4, 2015 at 11:53 PM, Brad Willard wrote:
> I have a 10 node cluster with 600gb of ram. I'm loading a fairly large
> dataset from json
Just an update on this - I found that the script by Amazon was the culprit
- not exactly sure why. When I installed Spark manually onto the EMR (and
did the manual configuration of all the EMR stuff), it worked fine.
On Mon, Dec 22, 2014 at 11:37 AM, Adam Gilmore
wrote:
> Hi all,
>
> I've just
Boromir,
You may like to take a look at how we make Spray and Spark work together
at the PredictionIO project: https://github.com/PredictionIO/PredictionIO
Simon
On Sun, Jan 4, 2015 at 8:31 PM, Chester At Work
wrote:
> Just a guess here, may not be correct.
>
> Spray needs to start akka
I saw that in the source, which is why I was wondering.
I was mainly reading:
http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/
"A query that tries to parse the organizationId and userId from the 2
logTypes should be able to do so correctly, though they are positioned
differently
I'm trying to compare the performance of Spark running on Mesos vs YARN.
However, I am having trouble configuring the Spark workload to
run in a similar way on Mesos and YARN.
When running Spark on YARN, you can specify the number of executors per
node. So if I have a node with 4 CPUs
Hello,
It really depends on your requirements: what kind of machine learning
algorithm, your budget, whether you are doing something really new or
integrating it with an existing application, etc. You can run MongoDB as
well as a cluster. I don't think this question can be answered generally,
but depe
The issue is happening when I try to concatenate column values in the query
like "col1+'$'+col3". For some reason, this issue is not manifesting itself
when I do a single IF query.
Is there a concat function in SparkSQL? I can't find anything in the
documentation.
Thanks, RK
On Sunday, Ja
Sure, here is a ticket. https://issues.apache.org/jira/browse/SPARK-5081
--- Original Message ---
Sender : Josh Rosen
Date : 2015-01-05 06:14 (GMT+09:00)
Title : Re: Shuffle write increases in spark 1.2
If you have a small reproduction for this issue, can you open a ticke
Hello,
I am trying to launch a Spark app (client mode for a standalone cluster) from
a Spray server, using the following code.
When I run it as
$> java -cp SprayServer
the SimpleApp.getA() call from SprayService returns -1 (which means it sees
the logData RDD as null for HTTP requests), but the s
BTW, I am seeing this issue in Spark 1.1.1.
On Sunday, January 4, 2015 7:29 PM, RK wrote:
When I use a single IF statement like "select IF(col1 != "", col1+'$'+col3,
col2+'$'+col3) from my_table", it works fine.
However, when I use a nested IF like "select IF(col1 != "", col1+'$'+col
When I use a single IF statement like "select IF(col1 != "", col1+'$'+col3,
col2+'$'+col3) from my_table", it works fine.
However, when I use a nested IF like "select IF(col1 != "", col1+'$'+col3,
IF(col2 != "", col2+'$'+col3, '$')) from my_table", I am getting the following
exception.
Exception
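If the goal is plain string concatenation, the Hive concat UDF may be worth trying instead of '+'. A sketch, assuming the query is run through a HiveContext (table and column names taken from the query above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("NestedIfConcat"))
val hiveCtx = new HiveContext(sc)

// Same nested IF, but with concat() doing the string concatenation.
val result = hiveCtx.sql(
  "SELECT IF(col1 != '', concat(col1, '$', col3), " +
  "IF(col2 != '', concat(col2, '$', col3), '$')) FROM my_table")
result.collect().foreach(println)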
Hi,
I wonder if anyone knows when a Python API will be added for Gradient Boosted
Trees? I see that Java and Scala APIs were added for the 1.2 release, and would
love to be able to build GBMs in pyspark too.
cheers
chris
Christopher Thom
QUANTIUM
Level 25, 8 Chifley, 8-12 Chifley Square
Sydney
Recently we have wanted to use Spark to calculate the average shortest path
distance between each reachable pair of nodes in a very big graph.
Has anyone ever tried this? We hope to discuss the problem.
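In case it helps the discussion, one possible approach is GraphX's ShortestPaths, which computes unweighted hop distances from a set of landmark vertices; sampling landmarks gives an approximation of the all-pairs average. A sketch (the edge-list path and sample size are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

val sc = new SparkContext(new SparkConf().setAppName("AvgShortestPath"))

// Edge list with one "srcId dstId" pair per line.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")

// Exact all-pairs is usually infeasible on a very big graph, so sample landmark vertices.
val landmarks: Seq[VertexId] = graph.vertices.takeSample(false, 100).map(_._1).toSeq

// Each vertex ends up with a map of hop distances to the landmarks it can reach.
val result = ShortestPaths.run(graph, landmarks)

val distances = result.vertices
  .flatMap { case (_, spMap) => spMap.values.map(_.toDouble) }
  .filter(_ > 0) // drop each landmark's zero distance to itself

println("Approximate average shortest path distance: " + distances.sum / distances.count)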
Greetings!
So, I think I have data saved so that each partition (part-r-0, etc.) is
exactly what I want to translate into an output file of a format not related to
Hadoop.
I believe I've figured out how to tell Spark to read the data set without
re-partitioning (in another post I mentioned this
Greetings!
I would like to know if the code below will read "one-partition-at-a-time",
and whether I am reinventing the wheel.
If I may explain, upstream code has managed (I hope) to save an RDD such
that each partition file (e.g., part-r-0, part-r-1) contains exactly the
data subset whi
Ah, so I guess this *is* still an issue since we needed to use a bitmap for
tracking zero-sized blocks (see
https://issues.apache.org/jira/browse/SPARK-3740; this isn't just a
performance issue; it's necessary for correctness). This will require a
bit more effort to fix, since we'll either have to
If you have a small reproduction for this issue, can you open a ticket at
https://issues.apache.org/jira/browse/SPARK ?
On December 29, 2014 at 7:10:02 PM, Kevin Jung (itsjb.j...@samsung.com) wrote:
Hi all,
The size of the shuffle write shown in the Spark web UI is much different when I
execute
It doesn’t seem like there’s a whole lot of clues to go on here without seeing
the job code. The original "org.apache.spark.SparkException: PairwiseRDD:
unexpected value: List([B@130dc7ad)" error suggests that maybe there’s an issue
with PySpark’s serialization / tracking of types, but it’s har
import org.apache.spark.{SparkConf, SparkContext}

val sconf = new SparkConf().setMaster("local")
  .setAppName("MedicalSideFx-CassandraLogsMessageTypeCount")
val sc = new SparkContext(sconf)
val inputDir = "/path/to/cassandralogs.txt"
sc.textFile(inputDir).map(line => line.replace("\"", "")).map(line =>
(line.split(' ')(0) + " " + line.split(' ')(2
Compiling with YARN set to true is not required for Spark to work with
Hadoop 2.2.0 in standalone mode.
-Sandy
On Fri, Jan 2, 2015 at 12:06 PM, Moep wrote:
> Well that's confusing. I have the same issue. So you're saying I have to
> compile Spark with Yarn set to true to make it work with Hadoo
Go through the Spark API documentation. Basically you have to group by
(date, message_type) and then do a count.
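Roughly, a sketch of that, assuming the log lines look like the samples in the question and leaving aside how the rows are actually read from Cassandra:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("MessageTypeCount"))

// Placeholder input: lines of the form  2014-06-25 12:01:39 "START"
val lines = sc.textFile("/path/to/cassandralogs.txt")

val counts = lines
  .map(_.replace("\"", ""))
  .map { line =>
    val parts = line.split(' ')
    ((parts(0), parts(2)), 1)   // key by (date, message_type)
  }
  .reduceByKey(_ + _)

counts.collect().foreach { case ((date, msgType), n) => println(date + " " + msgType + " " + n) }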
On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas
wrote:
> A spark cassandra newbie question. Thanks in advance for the help.
> I have a cassandra table with 2 columns message_timestamp(t
I have a 10 node cluster with 600gb of ram. I'm loading a fairly large
dataset from json files. When I load the dataset it is about 200gb however
it only creates 60 partitions. I'm trying to repartition to 256 to increase
cpu utilization however when I do that it balloons in memory to way over 2x
t
The error you provided says that the build was unsuccessful. If you write what
you did (what command you used) and the whole error trace, someone might be able
to help you ...
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-building-spark-1-2-0-tp20961p20
Yeah, but none of the sites open.
On Sun, Jan 4, 2015 at 10:35 PM, Ted Yu wrote:
> Have you used Google to find some way of accessing github :-)
>
>
>
> On Jan 4, 2015, at 8:46 AM, Kartheek.R wrote:
>
> The problem is that my network is not able to access github.com for
> cloning some de
Have you used Google to find some way of accessing github :-)
> On Jan 4, 2015, at 8:46 AM, Kartheek.R wrote:
>
> The problem is that my network is not able to access github.com for cloning
> some dependencies as github is blocked in India. What are the other possible
> ways for this problem
The problem is that my network is not able to access github.com for cloning
some dependencies, as github is blocked in India. What are the other
possible ways around this problem?
Thank you!
On Sun, Jan 4, 2015 at 9:45 PM, Rapelly Kartheek
wrote:
> Hi,
>
> I get the following error when I build sp
A Spark Cassandra newbie question. Appreciate the help.
I have a cassandra table with 2 columns message_timestamp(timestamp) and
message_type(text). The data is of the form
2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
A Spark Cassandra newbie question. Thanks in advance for the help. I have a
Cassandra table with 2 columns message_timestamp(timestamp) and
message_type(text). The data is of the form
2014-06-25 12:01:39 "START"
2014-06-25 12:02:39 "START"
2014-06-25 12:02:39 "PAUSE"
2014-06-25 14:02:39 "STOP"
2014
Hi,
I get the following error when I build spark-1.2.0 using sbt:
[error] Nonzero exit code (128): git clone
https://github.com/ScrapCodes/sbt-pom-reader.git
/home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader
[error] Use 'last' for the full log.
Any help please?
Thanks
--
V
Hi,
We have recently upgraded to the latest version of Spark and suddenly the Spark
SQL queries are performing badly, i.e. response time has increased from sub-second
to 5-6 seconds. What information do I need to provide so that I can find the
reason for this performance degradation?
Regards,
Abhideep
Hi J_soft,
mvn does not produce a tar package by default. You get many jar files - each
project has its own jar (e.g. mllib has
mllib/target/spark-mllib_2.10-1.2.0.jar).
However, if you want one big tar package with all dependencies, look here:
https://github.com/apache/spark/tree/master/assembly
An
Hi Pengcheng YIN,
RDD cache / persist calls do not trigger evaluation.
The unpersist call is blocking (it does have an async flavor, but I am not sure
what the SLAs on its behavior are).
val rdd = sc.textFile().map()
rdd.persist() // This does not trigger actual storage
while(true){
val count = rdd.filt
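For reference, a self-contained version of that pattern (the input path and transformations are placeholders); nothing is read or cached until the first action runs:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PersistExample"))

val rdd = sc.textFile("/path/to/input").map(_.toUpperCase)
rdd.persist()                              // only marks the RDD for caching
val first = rdd.filter(_.nonEmpty).count() // first action: evaluates and fills the cache
val second = rdd.count()                   // reuses the cached partitions
rdd.unpersist()                            // blocking by default; unpersist(blocking = false) is async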
Please check the document added by Andrew. I could run tasks with Spark 1.2.0.
*
https://github.com/apache/spark/pull/3731/files#diff-c3cbe4cabe90562520f22d2306aa9116R86
*
https://github.com/apache/spark/pull/3757/files#diff-c3cbe4cabe90562520f22d2306aa9116R101
Thanks,
- Tsuyoshi
On Sun, Jan 4
Now I understand your requirement; maybe there are some limitations in the current
MetricsSystem, and I think we can improve it as well.
Thanks
Jerry
From: Enno Shioji [mailto:eshi...@gmail.com]
Sent: Sunday, January 4, 2015 5:46 PM
To: Shao, Saisai
Cc: user@spark.apache.org
Subject: Re: Better way of mea
Hi Jerry, thanks for your answer.
I had looked at MetricsSystem, but I couldn't see how I could use it in my
use case, which is:
stream
  .map { i =>
    Metriker.mr.meter(Metriker.metricName("testmetric123")).mark(i)
    i * 2
  }
From what I can see, a Source