Re: pyspark cassandra examples

2014-09-30 Thread David Vincelli
Thanks, that worked! I downloaded the version pre-built against hadoop1 and the examples worked. - David On Tue, Sep 30, 2014 at 5:08 PM, Kan Zhang wrote: > > java.lang.IncompatibleClassChangeError: Found interface > org.apache.hadoop.mapreduce.JobContext, but class was expected

inconsistent edge counts in GraphX

2014-11-10 Thread Buttler, David
Hi, I am building a graph from a large CSV file. Each record contains a couple of nodes and about 10 edges. When I try to load a large portion of the graph, using multiple partitions, I get inconsistent results in the number of edges between different runs. However, if I use a single partitio

subscribe

2014-11-11 Thread DAVID SWEARINGEN

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread David Blewett
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1]. 1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400 On Nov 26, 2014 12:24 PM, "Aaron Davidson" wrote: > Spark has a known problem where it will do a pass of metadata on a large > number of small files

DAGScheduler StackOverflowError

2014-12-19 Thread David McWhorter
I keep getting StackOverflowErrors in DAGScheduler such as the one below. I've attached a sample application that illustrates what I'm trying to do. Can anyone point out how I can keep the DAG from growing so large that spark is not able to process it? Thank you, David java.lang.Stac
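
A common way to keep an iterative job's lineage from growing without bound is to checkpoint periodically; a minimal sketch (not the poster's attached application, and the checkpoint directory is illustrative):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("lineage-trim-demo"))
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative path

  var current = sc.parallelize(1 to 1000000)
  for (i <- 1 to 200) {
    current = current.map(_ + 1)
    if (i % 10 == 0) {
      current.checkpoint()  // truncates the lineage once materialised
      current.count()       // force the checkpoint to be written
    }
  }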

SparkSQL Array type support - Unregonized Thrift TTypeId value: ARRAY_TYPE

2014-12-22 Thread David Allan
.(TableSchema.java:45) at org.apache.hive.jdbc.HiveQueryResultSet.retrieveSchema(HiveQueryResultSet.java:234) ... 51 more Cheers David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Array-type-support-Unregonized-Thrift-TTypeId-value-ARRAY-TYPE-tp20817.html Sent from the Ap

Re: SparkSQL Array type support - Unregonized Thrift TTypeId value: ARRAY_TYPE

2014-12-23 Thread David Allan
Doh...figured it out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Array-type-support-Unregonized-Thrift-TTypeId-value-ARRAY-TYPE-tp20817p20832.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread David McWhorter
er.memoryOverhead", "1024") on my spark configuration object but I still get "Will allocate AM container, with MB memory including 384 MB overhead" when launching. I'm running in yarn-cluster mode. Any help or tips would be appreciated. Thanks, David -- Davi

Re: configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread David McWhorter
Hi Ganelin, sorry if it wasn't clear from my previous email, but that is how I am creating a spark context. I just didn't write out the lines where I create the new SparkConf and SparkContext. I am also upping the driver memory when running. Thanks, David On 01/12/2015 11:12 A

Re: no snappyjava in java.library.path

2015-01-12 Thread David Rosenstrauch
I ran into this recently. Turned out we had an old org-xerial-snappy.properties file in one of our conf directories that had the setting: # Disables loading Snappy-Java native library bundled in the # snappy-java-*.jar file forcing to load the Snappy-Java native # library from the java.library

RE: GraphX vs GraphLab

2015-01-13 Thread Buttler, David
would be if the AMP Lab or Databricks maintained a set of benchmarks on the web that showed how much each successive version of Spark improved. Dave From: Madabhattula Rajesh Kumar [mailto:mrajaf...@gmail.com] Sent: Monday, January 12, 2015 9:24 PM To: Buttler, David Subject: Re: GraphX vs

Using Spark SQL with multiple (avro) files

2015-01-14 Thread David Jones
EMR. If that's not possible, is there some way to load multiple avro files into the same table/RDD so the whole dataset can be processed (and in that case I'd supply paths to each file concretely, but I *really* don't want to have to do that). Thanks David

Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread David Jones
; Here was my question for reference: > > http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E > > On Wed, Jan 14, 2015 at 4:34 AM, David Jones > wrote: > >> Hi, >> >> I have a p

Re: Using Spark SQL with multiple (avro) files

2015-01-15 Thread David Jones
4, 2015 at 3:53 PM, David Jones wrote: > Should I be able to pass multiple paths separated by commas? I haven't > tried but didn't think it'd work. I'd expected a function that accepted a > list of strings. > > On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska

Where does println output go?

2014-03-01 Thread David Thomas
So I'm having this code: rdd.foreach(p => { print(p) }) Where can I see this output? Currently I'm running my spark program on a cluster. When I run the jar using sbt run, I see only INFO logs on the console. Where should I check to see the application sysouts?
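
A rough illustration of where that output ends up, as a sketch of a standalone app (not the poster's program):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("println-demo"))
  val rdd = sc.parallelize(1 to 10)

  // Runs on the executors: the output goes to each executor's stdout,
  // visible under the worker/executor log links in the web UI, not on
  // the console where "sbt run" was launched.
  rdd.foreach(p => println(p))

  // Collecting first brings the data back to the driver, so println runs
  // locally and the output appears on the driver console (small RDDs only).
  rdd.collect().foreach(println)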

Help with groupByKey

2014-03-02 Thread David Thomas
I have an RDD of (K, Array[V]) pairs. For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2))) How can I do a groupByKey such that I get back an RDD of the form (K, Array[V]) pairs? Ex: ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))
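
A minimal sketch of two ways to get the concatenated (K, Array[V]) shape back, assuming the usual SparkContext sc:

  val pairs = sc.parallelize(Seq(
    ("key1", Array(1, 2, 3)),
    ("key2", Array(3, 2, 4)),
    ("key1", Array(4, 3, 2))))

  // groupByKey yields (K, Iterable[Array[V]]); flatten each group back
  // into a single array to get (K, Array[V]).
  val grouped = pairs.groupByKey().mapValues(_.flatten.toArray)

  // reduceByKey concatenates arrays pairwise as they are combined and
  // avoids materialising the whole group at once.
  val reduced = pairs.reduceByKey(_ ++ _)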

Custom RDD

2014-03-10 Thread David Thomas
Is there any guide available on creating a custom RDD?

Block

2014-03-10 Thread David Thomas
What is the concept of Block and BlockManager in Spark? How is a Block related to a Partition of an RDD?

Are all transformations lazy?

2014-03-11 Thread David Thomas
For example, is the distinct() transformation lazy? When I look at the Spark source code, distinct applies a map -> reduceByKey -> map pipeline to the RDD elements. Why is this lazy? Won't the function be applied immediately to the elements of the RDD when I call someRDD.distinct? /** * Return a new RDD
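
A small illustration of the laziness in question, assuming the spark-shell's SparkContext sc: distinct only composes other transformations, and nothing runs until an action is called.

  val nums = sc.parallelize(Seq(1, 1, 2, 3, 3))
  val dedup = nums.distinct()   // only records map -> reduceByKey -> map in the lineage
  val n = dedup.count()         // the whole pipeline actually executes here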

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
ld be lazy, but > apparently uses an RDD.count call in its implementation: > https://spark-project.atlassian.net/browse/SPARK-1021). > > David Thomas > March 11, 2014 at 9:49 PM > For example, is distinct() transformation lazy? > > when I see the Spark source code, distin

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
Spark runtime/scheduler traverses the DAG starting from > that RDD and triggers evaluation of any parent RDDs it needs that > aren't computed and cached yet. > > Any future operations build on the same DAG as long as you use the same > RDD objects and, if you used cache

Round Robin Partitioner

2014-03-13 Thread David Thomas
Is it possible to partition the RDD elements in a round robin fashion? Say I have 5 nodes in the cluster and 5 elements in the RDD. I need to ensure each element gets mapped to each node in the cluster.
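
One hedged sketch of spreading elements round-robin (names are illustrative; which node ends up hosting which partition is still up to the scheduler): key each element with its index and partition on index modulo the partition count.

  import org.apache.spark.Partitioner

  class RoundRobinPartitioner(override val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = (key.asInstanceOf[Long] % numPartitions).toInt
  }

  val data = sc.parallelize(Seq("a", "b", "c", "d", "e"))
  val spread = data.zipWithIndex()                 // (element, index)
    .map { case (v, i) => (i, v) }                 // key by index
    .partitionBy(new RoundRobinPartitioner(5))     // index % 5 -> one element per partition
    .values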

graphx samples in Java

2014-03-20 Thread David Soroko
, "student")), (7L, ("jgonzal", "postdoc")), (5L, ("franklin", "prof")), (2L, ("istoica", "prof" thanks --david

Replicating RDD elements

2014-03-27 Thread David Thomas
How can we replicate RDD elements? Say I have 1 element and 100 nodes in the cluster. I need to replicate this one item on all the nodes i.e. effectively create an RDD of 100 elements.
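
A minimal sketch, assuming a SparkContext sc; if the goal is just to make one value available on every node, a broadcast variable is the more idiomatic route.

  val item = "shared-value"

  // Option 1: build an RDD of 100 copies, one per partition.
  val replicated = sc.parallelize(Seq.fill(100)(item), 100)

  // Option 2: broadcast the value so every executor holds a read-only copy.
  val shared = sc.broadcast(item)
  val used = sc.parallelize(1 to 100, 100).map(_ => shared.value)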

Re: Replicating RDD elements

2014-03-28 Thread David Thomas
That helps! Thank you. On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal wrote: > Hi David, > > I am sorry but your question is not clear to me. Are you talking about > taking some value and sharing it across your cluster so that it is present > on all the nodes? You can

Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications, I would like to see various numbers for the application after it has completed.

Resilient nature of RDD

2014-04-02 Thread David Thomas
Can someone explain how an RDD is resilient? If one of the partitions is lost, who is responsible for recreating that partition - is it the driver program?

Re: Resilient nature of RDD

2014-04-03 Thread David Thomas
but the > re-computation will occur on an executor. So if several partitions are > lost, e.g. due to a few machines failing, the re-computation can be striped > across the cluster making it fast. > > > On Wed, Apr 2, 2014 at 11:27 AM, David Thomas wrote: > >> Can someone e

hbase scan performance

2014-04-09 Thread David Quigley
Hi all, We are currently using hbase to store user data and periodically doing a full scan to aggregate data. The reason we use hbase is that we need a single user's data to be contiguous, so as user data comes in, we need the ability to update a random access store. The performance of a full hba

Checkpoint Vs Cache

2014-04-13 Thread David Thomas
What is the difference between checkpointing and caching an RDD?
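
In API terms the difference looks roughly like this (a sketch, assuming a SparkContext sc; the checkpoint directory is illustrative): cache keeps the data in memory but retains the lineage, while checkpoint writes to reliable storage and truncates the lineage.

  val rdd = sc.parallelize(1 to 1000)

  // cache(): stored in memory, lineage retained; lost partitions are
  // recomputed from the parent RDDs.
  rdd.cache()

  // checkpoint(): written to reliable storage, lineage truncated; recovery
  // reads the saved files instead of recomputing.
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative path
  rdd.checkpoint()
  rdd.count()  // checkpointing happens when an action materialises the RDD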

Task splitting among workers

2014-04-19 Thread David Thomas
During a Spark stage, how are tasks split among the workers? Specifically for a HadoopRDD, who determines which worker has to get which task?

RE:

2014-04-23 Thread Buttler, David
This sounds like a configuration issue. Either you have not set the MASTER correctly, or possibly another process is using up all of the cores Dave From: ge ko [mailto:koenig@gmail.com] Sent: Sunday, April 13, 2014 12:51 PM To: user@spark.apache.org Subject: Hi, I'm still going to start w

K-means with large K

2014-04-28 Thread Buttler, David
Hi, I am trying to run the K-means code in mllib, and it works very nicely with small K (less than 1000). However, when I try for a larger K (I am looking for 2000-4000 clusters), it seems like the code gets part way through (perhaps just the initialization step) and freezes. The compute nodes

RE: K-means with large K

2014-04-28 Thread Buttler, David
@spark.apache.org Cc: user@spark.apache.org Subject: Re: K-means with large K David, Just curious to know what kind of use cases demand such large k clusters Chester Sent from my iPhone On Apr 28, 2014, at 9:19 AM, "Buttler, David" mailto:buttl...@llnl.gov>> wrote: Hi, I am trying to

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread David Lemieux
For some reason the patch did not make it. Trying via email: /D On May 23, 2014, at 9:52 AM, lemieud wrote: > Hi, > > I think I found the problem. > In SparkFlumeEvent the readExternal method use in.read(bodyBuff) which read > the first 1020 bytes, but no more. The code should make sure to r
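
The gist of the fix being described, as an illustrative sketch rather than the actual patch: ObjectInput.read may return fewer bytes than requested, so the event body has to be read in a loop (or with readFully) until it is complete.

  import java.io.{EOFException, ObjectInput}

  def readBody(in: ObjectInput, length: Int): Array[Byte] = {
    val bodyBuff = new Array[Byte](length)
    var readBytes = 0
    while (readBytes < length) {
      val n = in.read(bodyBuff, readBytes, length - readBytes)
      if (n < 0) throw new EOFException("stream ended before the Flume event body was fully read")
      readBytes += n
    }
    bodyBuff
  }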

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread David Lemieux
Created https://issues.apache.org/jira/browse/SPARK-1916 I'll submit a pull request soon. /D On May 23, 2014, at 9:56 AM, David Lemieux wrote: > For some reason the patch did not make it. > > Trying via email: > > > > /D > > On May 23, 2014, at 9:52 AM, lem

Spark shell never leaves ACCEPTED state in YARN CDH5

2014-05-30 Thread David Belling
Hi, I'm running CDH5 and its bundled Spark (0.9.0). The Spark shell has been coming up fine over the last couple of weeks. However today it doesn't come up and I just see this message over and over: 14/05/30 12:06:05 INFO YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2014-05-30 Thread David Belling
and how much memory on each? > If you go to the RM web UI at port 8088, how much memory is used? Which > YARN scheduler are you using? > > -Sandy > > > On Fri, May 30, 2014 at 12:38 PM, David Belling > wrote: > >> Hi, >> >> I'm running CDH5 and it

Issues when saving dataframe in Spark 1.4 with parquet format

2015-07-01 Thread David Sabater Dinter
Hi chaps, It seems there is an issue while saving dataframes in Spark 1.4. The default file extension inside the Hive warehouse folder is now part-r-X.gz.parquet, but when running queries the SparkSQL Thriftserver is still looking for part-r-X.parquet. Is there any config parameter we can use as wor

SparkSQL cache table with multiple replicas

2015-07-03 Thread David Sabater Dinter
Hi all, Do you know if there is an option to specify how many replicas we want while caching in memory a table in SparkSQL Thrift server? I have not seen any option so far but I assumed there is an option as you can see in the Storage section of the UI that there is 1 x replica of your Dataframe/Ta
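
For a table cached programmatically (outside the Thrift server's CACHE TABLE statement, which may not expose this), replication can be requested through the storage level; a hedged sketch assuming an existing SQLContext and a table named my_table:

  import org.apache.spark.storage.StorageLevel

  val df = sqlContext.table("my_table")
  df.persist(StorageLevel.MEMORY_ONLY_2)  // the _2 storage levels keep two replicas
  df.count()                              // materialise the cache; the Storage tab then shows the replication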

Re: How to upgrade Spark version in CDH 5.4

2015-07-12 Thread David Sabater Dinter
As Sean suggested you can actually build Spark 1.4 for CDH 5.4.x and also include Hive libraries for 0.13.1, but *this will be completely unsupported by Cloudera*. I would suggest to do that only if you just want to experiment with new features from Spark 1.4. I.e. Run SparkSQL with sort-merge join

Re: Does spark supports the Hive function posexplode function?

2015-07-12 Thread David Sabater Dinter
It seems this feature was added in Hive 0.13. https://issues.apache.org/jira/browse/HIVE-4943 I would assume this is supported as Spark is by default compiled using Hive 0.13.1. On Sun, Jul 12, 2015 at 7:42 PM, Ruslan Dautkhanov wrote: > You can see what Spark SQL functions are supported in Spa
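
If it is indeed available, it would be used through a HiveContext like any other Hive UDTF; a hedged sketch, assuming a SparkContext sc and that the function is registered in the bundled Hive version:

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  // posexplode returns each element's position alongside the element itself.
  hiveContext.sql("SELECT posexplode(array('a', 'b', 'c'))").show()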

Docker configuration for akka spark streaming

2016-03-14 Thread David Gomez Saavedra
002 7003 7004 7005 7006 I'm using those docker images to run spark jobs without a problem. I only get errors on the streaming app. Any pointers on what can be wrong? Thank you very much in advance. David

Re: Docker configuration for akka spark streaming

2016-03-14 Thread David Gomez Saavedra
- tcp6 0 0 :::6005 :::*LISTEN - tcp6 0 0 172.18.0.2:6006 :::*LISTEN - tcp6 0 0 172.18.0.2: :::*LISTEN - so far still no success On Mon,

Re: How to distribute dependent files (.so , jar ) across spark worker nodes

2016-03-15 Thread David Gomez Saavedra
If you are using sbt, I personally use sbt-pack to pack all dependencies under a certain folder and then I set those jars in the spark config // just for demo I load this through config file overridden by environment variables val sparkJars = Seq ("/ROOT_OF_YOUR_PROJECT/target/pack/lib/YOUR_JAR_DE
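
Completing that pattern as an illustrative sketch (the paths below are placeholders): the jars produced by sbt-pack are listed once and handed to the SparkConf so they are shipped to the executors.

  import org.apache.spark.{SparkConf, SparkContext}

  // Placeholder paths standing in for the sbt-pack output directory.
  val sparkJars = Seq(
    "/ROOT_OF_YOUR_PROJECT/target/pack/lib/your-app.jar",
    "/ROOT_OF_YOUR_PROJECT/target/pack/lib/some-dependency.jar")

  val conf = new SparkConf()
    .setAppName("distribute-jars-demo")
    .setJars(sparkJars)
  val sc = new SparkContext(conf)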

Re: Docker configuration for akka spark streaming

2016-03-15 Thread David Gomez Saavedra
The issue is related to this https://issues.apache.org/jira/browse/SPARK-13906 .set("spark.rpc.netty.dispatcher.numThreads","2") seems to fix the problem On Tue, Mar 15, 2016 at 6:45 AM, David Gomez Saavedra wrote: > I have updated the config since I realized the act
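
In context, the workaround amounts to one extra line on the SparkConf (a sketch; the app name and master URL are illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("streaming-app")
    .setMaster("spark://spark-master:7077")              // illustrative master URL
    .set("spark.rpc.netty.dispatcher.numThreads", "2")   // workaround noted for SPARK-13906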

Spark streaming with akka association with remote system failure

2016-03-15 Thread David Gomez Saavedra
ader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/03/15 20:48:12 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://spark-engine@spark-engine:9083] has failed, address is now gated for [5000] ms. Reason: [Disassociated] Any idea why the two actor systems get disassociated? Thank you very much in advance. Best David

Re: Spark streaming with akka association with remote system failure

2016-03-19 Thread David Gomez Saavedra
e-detector { heartbeat-interval = 4 s acceptable-heartbeat-pause = 16 s } } } .set("spark.akka.heartbeat.interval", "4s") .set("spark.akka.heartbeat.pauses", "16s") On Tue, Mar 15, 2016 at 9:50 PM, David Gomez Saavedra wrote: > hi th

Spark Streaming (1.5.0) flaky when recovering from checkpoint

2015-10-30 Thread David P. Kleinschmidt
I have a Spark Streaming job that runs great the first time around (Elastic MapReduce 4.1.0), but when recovering from a checkpoint in S3, the job runs but Spark itself seems to be jacked-up in lots of little ways: - Executors, which are normally stable for days, are terminated within a coup

Transforming Spark SQL AST with extraOptimizations

2016-10-25 Thread Michael David Pedersen
Hi, I'm wanting to take a SQL string as a user input, then transform it before execution. In particular, I want to modify the top-level projection (select clause), injecting additional columns to be retrieved by the query. I was hoping to achieve this by hooking into Catalyst using sparkSession.e
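
A minimal sketch of the extraOptimizations hook being referred to (the rule body is a placeholder; a real rewrite of the projection would go inside the transform):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.catalyst.rules.Rule

  object MyRewriteRule extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case other => other  // placeholder: inject extra projection columns here
    }
  }

  val spark = SparkSession.builder().appName("ast-rewrite").getOrCreate()
  spark.experimental.extraOptimizations ++= Seq(MyRewriteRule)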

Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hello, I've got a Spark SQL dataframe containing a "key" column. The queries I want to run start by filtering on the key range. My question in outline: is it possible to sort the dataset by key so as to do efficient key range filters, before subsequently running a more complex SQL query? I'm awar

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hi Mich, Thank you for your quick reply! What type of table is the underlying table? Is it Hbase, Hive ORC or what? > It is a custom datasource, but ultimately backed by HBase. > By Key you mean a UNIQUE ID or something similar and then you do multiple > scans on the tempTable which stores dat

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hi Mich, Thank you again for your reply. As I see you are caching the table already sorted > > val keyValRDDSorted = keyValRDD.sortByKey().cache > > and the next stage is you are creating multiple tempTables (different > ranges) that cache a subset of rows already cached in RDD. The data stored >

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Hi again Mich, "But the thing is that I don't explicitly cache the tempTables ..". > > I believe tempTable is created in-memory and is already cached > That surprises me since there is a sqlContext.cacheTable method to explicitly cache a table in memory. Or am I missing something? This could expl
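
The distinction at the API level, as a sketch against the Spark 1.x-style SQLContext API and assuming an existing DataFrame df with a key column:

  // registerTempTable only binds a name to the DataFrame's logical plan;
  // nothing is materialised by this call alone.
  df.registerTempTable("events")

  // cacheTable is the explicit step that pins the table in the in-memory
  // columnar cache (it then shows up under the Storage tab).
  sqlContext.cacheTable("events")
  val hits = sqlContext.sql("SELECT * FROM events WHERE key BETWEEN 10 AND 20")
  sqlContext.uncacheTable("events")  // release the cache when finished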

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Thanks for the link, I hadn't come across this. According to https://forums.databricks.com/questions/400/what-is-the- > difference-between-registertemptable-a.html > > and I quote > > "registerTempTable() > > registerTempTable() creates an in-memory table that is scoped to the > cluster in which i

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-02 Thread Michael David Pedersen
Awesome, thank you Michael for the detailed example! I'll look into whether I can use this approach for my use case. If so, I could avoid the overhead of repeatedly registering a temp table for one-off queries, instead registering the table once and relying on the injected strategy. Don't know how

Order of rows not preserved after cache + count + coalesce

2017-02-13 Thread David Haglund (external)
n=2)]] I use spark 2.1.0 and pyspark. Regards, /David

SparkStreaming + Flume/PDI+Kafka

2015-05-08 Thread GARCIA MIGUEL, DAVID
Hi! I've been using spark for the last months and it is awesome. I'm pretty new on this topic so don't be too harsh on me. Recently I've been doing some simple tests with Spark Streaming for log processing and I'm considering different ETL input solutions such as Flume or PDI+Kafka. My use case

FW: Email to Spark Org please

2021-03-25 Thread Williams, David (Risk Value Stream)
Classification: Public Hi Team, We are trying to utilize the ML Gradient Boosting Tree Classification algorithm and found the performance of the algorithm is very poor during training. We would like to see if we can improve the performance timings since it is taking 2 days for training for a smaller

RE: FW: Email to Spark Org please

2021-03-26 Thread Williams, David (Risk Value Stream)
Classification: Limited Many thanks for your response Sean. Question - why is Spark overkill for this, and why is sklearn faster? It's the same algorithm, right? Thanks again, Dave Williams From: Sean Owen mailto:sro...@gmail.com>> Sent: 25 March 2021 16:40 To: Williams,

RE: FW: Email to Spark Org please

2021-03-26 Thread Williams, David (Risk Value Stream)
if we get that working in distributed, will we get benefits similar to spark ML? Best Regards, Dave Williams From: Sean Owen Sent: 26 March 2021 13:20 To: Williams, David (Risk Value Stream) Cc: user@spark.apache.org Subject: Re: FW: Email to Spark Org please -- This email has reached the Bank v

RE: FW: Email to Spark Org please

2021-04-01 Thread Williams, David (Risk Value Stream)
Classification: Public Many thanks for the info. So you wouldn't use sklearn with Spark for large datasets, but use it with smaller datasets and use hyperopt to build models in parallel for hyperparameter tuning on Spark? From: Sean Owen Sent: 26 March 2021 13:53 To: Williams, David (Risk
