Thanks, that worked! I downloaded the version pre-built against hadoop1 and
the examples worked.
- David
On Tue, Sep 30, 2014 at 5:08 PM, Kan Zhang wrote:
> > java.lang.IncompatibleClassChangeError: Found interface
> org.apache.hadoop.mapreduce.JobContext, but class was expected
Hi,
I am building a graph from a large CSV file. Each record contains a couple of
nodes and about 10 edges. When I try to load a large portion of the graph,
using multiple partitions, I get inconsistent results in the number of edges
between different runs. However, if I use a single partition
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1].
1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400
On Nov 26, 2014 12:24 PM, "Aaron Davidson" wrote:
> Spark has a known problem where it will do a pass of metadata on a large
> number of small files
keep getting StackOverflowErrors in DAGScheduler such as the one below. I've
attached a sample application that illustrates what I'm trying to do.
Can anyone point out how I can keep the DAG from growing so large that
Spark is not able to process it?
Thank you,
David
java.lang.Stac
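One common way to keep an iteratively built lineage from overwhelming the DAGScheduler is to checkpoint and materialize the RDD every few iterations. A rough sketch only (the checkpoint directory and the per-iteration transformation are placeholders, not the attached sample application):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-truncation-sketch"))
sc.setCheckpointDir("hdfs:///checkpoints")     // placeholder path on reliable storage

var current = sc.parallelize(1 to 1000000)
for (i <- 1 to 200) {
  current = current.map(_ + 1)                 // placeholder per-iteration transformation
  if (i % 10 == 0) {
    current.cache()
    current.checkpoint()
    current.count()                            // action forces the checkpoint, truncating the lineage
  }
}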
.(TableSchema.java:45)
at
org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:234)
... 51 more
Cheers
David
Doh...figured it out.
er.memoryOverhead", "1024") on my spark
configuration object but I still get "Will allocate AM container, with
MB memory including 384 MB overhead" when launching. I'm running
in yarn-cluster mode.
Any help or tips would be appreciated.
Thanks,
David
--
Davi
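For reference, a minimal sketch of setting the overhead on the configuration object before the context is created (property names are the standard Spark 1.x YARN settings; values are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-overhead-sketch")
  .set("spark.yarn.executor.memoryOverhead", "1024")  // MB of off-heap headroom per executor
  .set("spark.yarn.driver.memoryOverhead", "1024")    // driver-side overhead for yarn-cluster mode
val sc = new SparkContext(conf)

Note that in yarn-cluster mode the AM/driver container is sized before the application code runs, so driver-side settings generally need to go through spark-submit or spark-defaults.conf rather than the SparkConf built in code.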
Hi Ganelin, sorry if it wasn't clear from my previous email, but that is
how I am creating a spark context. I just didn't write out the lines
where I create the new SparkConf and SparkContext. I am also upping the
driver memory when running.
Thanks,
David
On 01/12/2015 11:12 A
I ran into this recently. Turned out we had an old
org-xerial-snappy.properties file in one of our conf directories that
had the setting:
# Disables loading Snappy-Java native library bundled in the
# snappy-java-*.jar file forcing to load the Snappy-Java native
# library from the java.library
would be if the AMP Lab or Databricks
maintained a set of benchmarks on the web that showed how much each successive
version of Spark improved.
Dave
From: Madabhattula Rajesh Kumar [mailto:mrajaf...@gmail.com]
Sent: Monday, January 12, 2015 9:24 PM
To: Buttler, David
Subject: Re: GraphX vs
EMR.
If that's not possible, is there some way to load multiple avro files into
the same table/RDD so the whole dataset can be processed (and in that case
I'd supply paths to each file concretely, but I *really* don't want to have
to do that).
Thanks
David
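A hedged sketch of how loading many files at once can look, assuming a Spark with the DataFrame reader (1.4+) and the spark-avro package on the classpath (the glob below is a placeholder): Hadoop-style globs are accepted, so all matching files land in one DataFrame.

val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("s3n://my-bucket/logs/2015/01/*/part-*.avro")   // placeholder glob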
Here was my question for reference:
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E
>
> On Wed, Jan 14, 2015 at 4:34 AM, David Jones
> wrote:
>
>> Hi,
>>
>> I have a p
4, 2015 at 3:53 PM, David Jones
wrote:
> Should I be able to pass multiple paths separated by commas? I haven't
> tried but didn't think it'd work. I'd expected a function that accepted a
> list of strings.
>
> On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska
So I'm having this code:
rdd.foreach(p => {
  print(p)
})
Where can I see this output? Currently I'm running my spark program on a
cluster. When I run the jar using sbt run, I see only INFO logs on the
console. Where should I check to see the application sysouts?
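For what it's worth: foreach runs on the executors, so anything it prints goes to each executor's stdout (visible in the worker/executor logs or the Executors tab of the UI), not to the driver console. A small sketch of bringing values back to the driver instead:

rdd.take(20).foreach(println)      // prints on the driver
// or, for small RDDs only:
rdd.collect().foreach(println)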
I have an RDD of (K, Array[V]) pairs.
For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2)))
How can I do a groupByKey such that I get back an RDD of the form (K,
Array[V]) pairs?
Ex: ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))
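One possible sketch for the example above (assuming an existing SparkContext named sc): after groupByKey the values are an Iterable of arrays, so flatten them back into one array per key, or combine the arrays directly with reduceByKey.

val rdd = sc.parallelize(Seq(
  ("key1", Array(1, 2, 3)),
  ("key2", Array(3, 2, 4)),
  ("key1", Array(4, 3, 2))))

val grouped = rdd.groupByKey().mapValues(_.flatten.toArray)   // (K, Array[V])
val reduced = rdd.reduceByKey(_ ++ _)                         // same result, combining arrays as it goes
// e.g. Array((key1, Array(1, 2, 3, 4, 3, 2)), (key2, Array(3, 2, 4)))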
Is there any guide available on creating a custom RDD?
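I'm not aware of a dedicated guide, but the essentials are small: a custom RDD mainly has to supply getPartitions and compute. A minimal, hypothetical sketch (class names are made up for illustration):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// each partition only needs to know its index
class ChunkPartition(override val index: Int) extends Partition

class RangeChunkRDD(sc: SparkContext, n: Int, numParts: Int)
    extends RDD[Int](sc, Nil) {                     // Nil: no parent RDDs

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts)(i => new ChunkPartition(i))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val chunk = n / numParts                        // this partition's slice of 0 until n
    val start = split.index * chunk
    (start until start + chunk).iterator
  }
}

// usage: new RangeChunkRDD(sc, 100, 4).collect()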
What is the concept of Block and BlockManager in Spark? How is a Block
related to a Partition of a RDD?
For example, is distinct() transformation lazy?
when I see the Spark source code, distinct applies a map-> reduceByKey ->
map function to the RDD elements. Why is this lazy? Won't the function be
applied immediately to the elements of the RDD when I call someRDD.distinct?
/**
* Return a new RDD
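A quick way to check the laziness empirically (a sketch assuming an existing SparkContext named sc): the transformation only builds lineage, and nothing runs until an action is called.

val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
val dedup = data.distinct()   // returns immediately; no job is submitted yet
val n = dedup.count()         // the map -> reduceByKey -> map pipeline actually runs here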
ld be lazy, but
> apparently uses an RDD.count call in its implementation:
> https://spark-project.atlassian.net/browse/SPARK-1021).
>
> David Thomas
> March 11, 2014 at 9:49 PM
> For example, is distinct() transformation lazy?
>
> when I see the Spark source code, distin
Spark runtime/scheduler traverses the DAG starting from
> that RDD and triggers evaluation of any parent RDDs it needs that
> aren't computed and cached yet.
>
> Any future operations build on the same DAG as long as you use the same
> RDD objects and, if you used cache
Is it possible to partition the RDD elements in a round robin fashion? Say I
have 5 nodes in the cluster and 5 elements in the RDD. I need to ensure
each element gets mapped to each node in the cluster.
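One possible sketch (assuming an existing SparkContext named sc): key each element by its index and use a partitioner that assigns index modulo the number of partitions, which gives one element per partition; whether each partition then lands on a distinct node is up to the scheduler and the available executors.

import org.apache.spark.Partitioner

class RoundRobinPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = (key.asInstanceOf[Long] % partitions).toInt
}

val elems = sc.parallelize(Seq("a", "b", "c", "d", "e"))
val roundRobin = elems
  .zipWithIndex()                      // (element, index)
  .map { case (v, i) => (i, v) }       // key by index
  .partitionBy(new RoundRobinPartitioner(5))
  .values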
, "student")), (7L, ("jgonzal", "postdoc")),
(5L, ("franklin", "prof")), (2L, ("istoica", "prof"
thanks
--david
How can we replicate RDD elements? Say I have 1 element and 100 nodes in
the cluster. I need to replicate this one item on all the nodes i.e.
effectively create an RDD of 100 elements.
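Two sketches, depending on the intent (myValue is a placeholder for the single element): a broadcast variable if the goal is just to have the value available on every node, or an explicit 100-way parallelize if an actual 100-element RDD is wanted.

val myValue = "some-element"                    // placeholder

val shared = sc.broadcast(myValue)              // read on any node as shared.value

val replicated = sc.parallelize(Seq.fill(100)(myValue), numSlices = 100)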
That helps! Thank you.
On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal wrote:
> Hi David,
>
> I am sorry but your question is not clear to me. Are you talking about
> taking some value and sharing it across your cluster so that it is present
> on all the nodes? You can
Is there a way to see the 'Application Detail UI' page (at master:4040) for
completed applications? Currently I can see that page only for running
applications; I would like to see the various numbers for an application after
it has completed.
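A hedged sketch of the usual route (property names are the standard event-log settings; the directory is a placeholder): with event logging enabled, the UI of a finished application can be rebuilt by the history server.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-event-logs")   // placeholder path
// after the app finishes, run ./sbin/start-history-server.sh and browse port 18080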
Can someone explain how an RDD is resilient? If one of the partitions is lost,
who is responsible for recreating that partition - is it the driver program?
but the
> re-computation will occur on an executor. So if several partitions are
> lost, e.g. due to a few machines failing, the re-computation can be striped
> across the cluster making it fast.
>
>
> On Wed, Apr 2, 2014 at 11:27 AM, David Thomas wrote:
>
>> Can someone e
Hi all,
We are currently using hbase to store user data and periodically doing a
full scan to aggregate data. The reason we use hbase is that we need a
single user's data to be contiguous, so as user data comes in, we need the
ability to update a random access store.
The performance of a full hba
What is the difference between checkpointing and caching an RDD?
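A small sketch contrasting the two (paths are placeholders): cache keeps partitions around for reuse but preserves the full lineage, while checkpoint writes the data to reliable storage and truncates the lineage so recovery no longer replays it.

sc.setCheckpointDir("hdfs:///checkpoints")             // placeholder path

val base = sc.textFile("hdfs:///input").map(_.length)  // placeholder input
base.cache()        // kept for reuse; lost partitions are recomputed from lineage
base.checkpoint()   // materialized to the checkpoint dir on the next action
base.count()        // first action populates the cache and performs the checkpoint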
During a Spark stage, how are tasks split among the workers? Specifically
for a HadoopRDD, who determines which worker has to get which task?
This sounds like a configuration issue. Either you have not set the MASTER
correctly, or possibly another process is using up all of the cores
Dave
From: ge ko [mailto:koenig@gmail.com]
Sent: Sunday, April 13, 2014 12:51 PM
To: user@spark.apache.org
Subject:
Hi,
I'm still going to start w
Hi,
I am trying to run the K-means code in mllib, and it works very nicely with
small K (less than 1000). However, when I try for a larger K (I am looking for
2000-4000 clusters), it seems like the code gets part way through (perhaps just
the initialization step) and freezes. The compute nodes
@spark.apache.org
Cc: user@spark.apache.org
Subject: Re: K-means with large K
David,
Just curious to know what kind of use cases demand such large k clusters
Chester
Sent from my iPhone
On Apr 28, 2014, at 9:19 AM, "Buttler, David" <buttl...@llnl.gov> wrote:
Hi,
I am trying to
For some reason the patch did not make it.
Trying via email:
/D
On May 23, 2014, at 9:52 AM, lemieud wrote:
> Hi,
>
> I think I found the problem.
> In SparkFlumeEvent the readExternal method uses in.read(bodyBuff), which reads
> the first 1020 bytes but no more. The code should make sure to r
Created https://issues.apache.org/jira/browse/SPARK-1916
I'll submit a pull request soon.
/D
On May 23, 2014, at 9:56 AM, David Lemieux
wrote:
> For some reason the patch did not make it.
>
> Trying via email:
>
>
>
> /D
>
> On May 23, 2014, at 9:52 AM, lem
Hi,
I'm running CDH5 and its bundled Spark (0.9.0). The Spark shell has been
coming up fine over the last couple of weeks. However today it doesn't come
up and I just see this message over and over:
14/05/30 12:06:05 INFO YarnClientSchedulerBackend: Application report from
ASM:
appMasterRpcPort
and how much memory on each?
> If you go to the RM web UI at port 8088, how much memory is used? Which
> YARN scheduler are you using?
>
> -Sandy
>
>
> On Fri, May 30, 2014 at 12:38 PM, David Belling wrote:
>
>> Hi,
>>
>> I'm running CDH5 and it
Hi chaps,
It seems there is an issue when saving DataFrames in Spark 1.4.
The default file extension inside the Hive warehouse folder is now
part-r-X.gz.parquet, but the SparkSQL Thrift server is still looking for
part-r-X.parquet when running queries.
Is there any config parameter we can use as wor
Hi all,
Do you know if there is an option to specify how many replicas we want
when caching a table in memory in the SparkSQL Thrift server? I have not seen
any option so far, but I assumed there is one since the Storage section of
the UI shows 1 x replica of your
Dataframe/Ta
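A hedged sketch of one possible workaround outside the plain CACHE TABLE path (names are placeholders): persist the DataFrame yourself with a replicated storage level before registering it, so it shows up with 2 replicas in the Storage tab.

import org.apache.spark.storage.StorageLevel

val df = sqlContext.table("my_table")          // placeholder source table
df.persist(StorageLevel.MEMORY_ONLY_2)         // 2x in-memory replication
df.registerTempTable("my_table_cached")
df.count()                                     // materialize the cache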
As Sean suggested you can actually build Spark 1.4 for CDH 5.4.x and also
include Hive libraries for 0.13.1, but *this will be completely unsupported
by Cloudera*.
I would suggest doing that only if you just want to experiment with new
features from Spark 1.4. I.e. Run SparkSQL with sort-merge join
It seems this feature was added in Hive 0.13.
https://issues.apache.org/jira/browse/HIVE-4943
I would assume this is supported as Spark is by default compiled using Hive
0.13.1.
On Sun, Jul 12, 2015 at 7:42 PM, Ruslan Dautkhanov
wrote:
> You can see what Spark SQL functions are supported in Spa
002 7003 7004 7005 7006
I'm using those docker images to run Spark jobs without a problem. I
only get errors on the streaming app.
Any pointers on what can be wrong?
Thank you very much in advance.
David
tcp6       0      0 :::6005            :::*          LISTEN      -
tcp6       0      0 172.18.0.2:6006    :::*          LISTEN      -
tcp6       0      0 172.18.0.2:        :::*          LISTEN      -
so far still no success
On Mon,
If you are using sbt, I personally use sbt-pack to pack all dependencies
under a certain folder and then I set those jars in the spark config
// just for demo I load this through config file overridden by environment
variables
val sparkJars = Seq
("/ROOT_OF_YOUR_PROJECT/target/pack/lib/YOUR_JAR_DE
The issue is related to this
https://issues.apache.org/jira/browse/SPARK-13906
.set("spark.rpc.netty.dispatcher.numThreads","2")
seem to fix the problem
On Tue, Mar 15, 2016 at 6:45 AM, David Gomez Saavedra
wrote:
> I have updated the config since I realized the act
NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
16/03/15 20:48:12 WARN ReliableDeliverySupervisor: Association with
remote system [akka.tcp://spark-engine@spark-engine:9083] has failed,
address is now gated for [5000] ms. Reason: [Disassociated]
Any idea why the two actor systems get disassociated?
Thank you very much in advance.
Best
David
e-detector {
heartbeat-interval = 4 s
acceptable-heartbeat-pause = 16 s
}
}
}
.set("spark.akka.heartbeat.interval", "4s")
.set("spark.akka.heartbeat.pauses", "16s")
On Tue, Mar 15, 2016 at 9:50 PM, David Gomez Saavedra
wrote:
> hi th
I have a Spark Streaming job that runs great the first time around (Elastic
MapReduce 4.1.0), but when recovering from a checkpoint in S3, the job runs
but Spark itself seems to be jacked-up in lots of little ways:
- Executors, which are normally stable for days, are terminated within a
coup
Hi,
I want to take a SQL string as user input and then transform it before
execution. In particular, I want to modify the top-level projection (select
clause), injecting additional columns to be retrieved by the query.
I was hoping to achieve this by hooking into Catalyst using
sparkSession.e
Hello,
I've got a Spark SQL dataframe containing a "key" column. The queries I
want to run start by filtering on the key range. My question in outline: is
it possible to sort the dataset by key so as to do efficient key range
filters, before subsequently running a more complex SQL query?
I'm awar
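A hedged sketch of the idea being asked about (assuming Spark 2.x; df is a placeholder DataFrame with a "key" column): sort by the key, cache, and register a view; the in-memory columnar cache keeps per-batch statistics, so later key-range filters can skip batches that cannot match.

val sorted = df.sort("key").cache()
sorted.createOrReplaceTempView("events")        // placeholder view name
val slice = spark.sql("SELECT * FROM events WHERE key >= 'aaa' AND key < 'abc'")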
Hi Mich,
Thank you for your quick reply!
What type of table is the underlying table? Is it Hbase, Hive ORC or what?
>
It is a custom datasource, but ultimately backed by HBase.
> By Key you mean a UNIQUE ID or something similar and then you do multiple
> scans on the tempTable which stores dat
Hi Mich,
Thank you again for your reply.
As I see you are caching the table already sorted
>
> val keyValRDDSorted = keyValRDD.sortByKey().cache
>
> and the next stage is you are creating multiple tempTables (different
> ranges) that cache a subset of rows already cached in RDD. The data stored
>
Hi again Mich,
"But the thing is that I don't explicitly cache the tempTables ..".
>
> I believe tempTable is created in-memory and is already cached
>
That surprises me since there is a sqlContext.cacheTable method to
explicitly cache a table in memory. Or am I missing something? This could
expl
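For reference, a minimal sketch of the explicit route (names are placeholders): registering alone does not cache, while cacheTable marks the table for lazy in-memory caching on first use.

keyValDF.registerTempTable("key_vals")          // keyValDF is a placeholder DataFrame
sqlContext.cacheTable("key_vals")               // lazy: cached on first use
sqlContext.sql("SELECT COUNT(*) FROM key_vals").show()   // materializes the cache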
Thanks for the link, I hadn't come across this.
According to https://forums.databricks.com/questions/400/what-is-the-
> difference-between-registertemptable-a.html
>
> and I quote
>
> "registerTempTable()
>
> registerTempTable() creates an in-memory table that is scoped to the
> cluster in which i
Awesome, thank you Michael for the detailed example!
I'll look into whether I can use this approach for my use case. If so, I
could avoid the overhead of repeatedly registering a temp table for one-off
queries, instead registering the table once and relying on the injected
strategy. Don't know how
n=2)]]
I use spark 2.1.0 and pyspark.
Regards,
/David
Hi!
I've been using Spark for the last few months and it is awesome. I'm pretty new
to this topic, so don't be too harsh on me.
Recently I've been doing some simple tests with Spark Streaming for log
processing and I'm considering different ETL input solutions such as Flume or
PDI+Kafka.
My use case
Hi Team,
We are trying to use the ML Gradient Boosting Tree classification algorithm and
found that its performance during training is very poor.
We would like to see if we can improve the training time, since it is taking
2 days to train on a smaller
Many thanks for your response Sean.
Question - why is Spark overkill for this, and why is sklearn faster?
It's the same algorithm, right?
Thanks again,
Dave Williams
From: Sean Owen <sro...@gmail.com>
Sent: 25 March 2021 16:40
To: Williams,
if we get that working in distributed mode, will we get
benefits similar to Spark ML?
Best Regards,
Dave Williams
From: Sean Owen
Sent: 26 March 2021 13:20
To: Williams, David (Risk Value Stream)
Cc: user@spark.apache.org
Subject: Re: FW: Email to Spark Org please
Many thanks for the info. So you wouldn't use sklearn with Spark for large
datasets, but would use it with smaller datasets, using hyperopt to build models
in parallel for hyperparameter tuning on Spark?
From: Sean Owen
Sent: 26 March 2021 13:53
To: Williams, David (Risk