saveAsTextFile hangs with hdfs

2014-08-19 Thread David
sorted.size()); }); String outputPath = "/summarized/groupedByTrackingId4"; hdfs.rm(outputPath, true); stringIntegerJavaPairRDD.saveAsTextFile(String.format("%s/%s", hdfs.getUrl(), outputPath)); Thanks in advance, David

sortByKey trouble

2014-09-24 Thread david
Hi, Does anybody know how to use sortByKey in Scala on an RDD like: val rddToSave = file.map(l => l.split("\\|")).map(r => (r(34)+"-"+r(3), r(4), r(10), r(12))) because I received an error "sortByKey is not a member of org.apache.spark.rdd.RDD[(String,String,String,String)]". What i t
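
For reference, sortByKey is only defined on RDDs of (key, value) pairs, which is what the error message is pointing at. A minimal sketch of the reshape, reusing the post's field indices and its `file` RDD:

    import org.apache.spark.SparkContext._  // Spark 1.x: brings the pair-RDD implicits into scope
    import org.apache.spark.rdd.RDD

    // sortByKey needs a pair RDD, so fold the last three fields into the value.
    def sortRecords(file: RDD[String]): RDD[(String, (String, String, String))] =
      file.map(_.split("\\|"))
          .map(r => (r(34) + "-" + r(3), (r(4), r(10), r(12))))
          .sortByKey()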

Re: sortByKey trouble

2014-09-24 Thread david
Thanks, I've already tried this solution but it does not compile (in Eclipse). I'm surprised to see that in spark-shell, sortByKey works fine on 2 shapes: (String,String,String,String) and (String,(String,String,String))

foreachPartition: write to multiple files

2014-10-08 Thread david
Hi, I want to write my RDDs to multiple files based on a key value. So, I used groupByKey and iterated over partitions. Here is the code: rdd.map(f => (f.substring(0,4), f)).groupByKey().foreachPartition(iterator => iterator.map { case (key, values) => val fs: FileSystem = File

Re: foreachPartition: write to multiple files

2014-10-08 Thread david
Hi, I finally found a solution after reading this post: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-RDD-by-key-and-save-to-different-path-td11887.html#a11983
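
The linked thread's exact code isn't visible here, but the approach it describes is commonly implemented with Hadoop's MultipleTextOutputFormat rather than groupByKey; a sketch, with the output path and partition count purely illustrative:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner

    // Route each record to an output location named after its key, without
    // materializing all values for a key in memory as groupByKey would.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        s"${key.toString}/$name"  // one sub-folder per key, part files inside
    }

    rdd.map(f => (f.substring(0, 4), f))      // keying as in the original post
       .partitionBy(new HashPartitioner(16))  // illustrative partition count
       .saveAsHadoopFile("/out/path", classOf[String], classOf[String], classOf[KeyBasedOutput])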

Key-Value decomposition

2014-11-03 Thread david
Hi, I'm a newbie in Spark and face the following use case: val data = Array("A", "1;2;3") val rdd = sc.parallelize(data) // Something here to produce an RDD of (key, value): ("A", "1"), ("A", "2"), ("A", "3") Does anybody know how to do this? Thanks

Re: Key-Value decomposition

2014-11-03 Thread david
Hi, But I've only one RDD. Here is a more complete example: my rdd is something like ("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1") and I expect the following result: ("A",1), ("A",2), ("A",3), ("B",2), ("B",5), ("B",6), ("C",3), ("C",2), ("C",1). Any idea about how I can
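
One way to get exactly that result is flatMap, which emits one (key, value) pair per element of the split string; a minimal sketch:

    // Split each ("key", "v1;v2;v3") record into one pair per value.
    val rdd = sc.parallelize(Seq(("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1")))
    val decomposed = rdd.flatMap { case (key, values) =>
      values.split(";").map(value => (key, value))
    }
    // decomposed.collect(): ("A","1"), ("A","2"), ("A","3"), ("B","2"), ("B","5"), ...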

RE: Key-Value decomposition

2014-11-04 Thread david
Thanks.

Spark SQL (1.0)

2014-11-24 Thread david
Hi, I build 2 tables from files. Table F1 joins with table F2 on c5=d4. F1 has 46730613 rows. F2 has 3386740 rows. All keys d4 exist in F1.c5, so I expect to retrieve 46730613 rows, but it returns only 3437 rows. // --- begin code --- val sqlContext = new org.apache.spark.sql.SQLContext(s

Spark SQL Join returns less rows that expected

2014-11-25 Thread david
Hi, I have 2 files which come from a CSV import of 2 Oracle tables. F1 has 46730613 rows. F2 has 3386740 rows. I build 2 tables with Spark. Table F1 joins with table F2 on c1=d1. All keys F2.d1 exist in F1.c1, so I expect to retrieve 46730613 rows, but it returns only 3437 rows. // --- b

spark streaming kafka best practices?

2014-12-05 Thread david
Hi, what is the best way to process a batch window in Spark Streaming: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { // process the event process(event) }) }) or kafkaStream.foreachRDD(rdd => { rdd.map(event => { // pro

Spark streaming: works with collect() but not without collect()

2014-12-11 Thread david
Hi, We use the following Spark Streaming code to collect and process Kafka events: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { process(event._1, event._2) }) }) This works fine. But without the collect() function, the following exception is raised
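
The likely explanation: transformations such as rdd.map(...) are lazy, so a map inside foreachRDD with no action after it never runs, whereas collect() forces execution (at the cost of pulling every event to the driver). A sketch of processing on the executors instead, reusing the post's process():

    kafkaStream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        // foreachPartition is an action, so this actually executes, and it
        // runs on the executors rather than on the driver.
        events.foreach { case (key, value) => process(key, value) }
      }
    }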

Re: Job is not able to perform Broadcast Join

2020-10-06 Thread David Edwards
After adding the sequential ids you might need a repartition? I've found when using monotonically_increasing_id before that the df goes to a single partition. Usually becomes clear in the Spark UI though. On Tue, 6 Oct 2020, 20:38 Sachit Murarka, wrote: > Yes, even I tried the same first. Then I moved t
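
A sketch of the suggestion, given the thread's DataFrame df (the column name and partition count are illustrative):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // Add the id, then repartition to spread rows back across the cluster.
    val withId = df.withColumn("row_id", monotonically_increasing_id())
    val rebalanced = withId.repartition(200)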

Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Thanks in advance, David

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
> nodes should be able to read it immediately > > The solutions/workarounds depend on where you are hosting your Spark > application. > > > > *From: *David Morin > *Date: *Wednesday, December 23, 2020 at 11:08 AM > *To: *"user@spark.apache.org" > *Subject:

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
Does it work with the standard AWS S3 solution and its new consistency model <https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/>? On Wed, Dec 23, 2020 at 18:48, David Morin wrote: > Thanks. > My Spark applications run on nodes based on docke

Re: Spark 3.0.1 Structured streaming - checkpoints fail

2020-12-23 Thread David Morin
7a3d68f7%40%3Cuser.spark.apache.org%3E > Probably we may want to add it in the SS guide doc. We didn't need it as > it just didn't work with eventually consistent model, and now it works > anyway but is very inefficient. > > > On Thu, Dec 24, 2020 at 6:16 AM David Morin > wrote:

S3a Committer

2021-02-02 Thread David Morin
Hi, I have some issues at the moment with the S3 API of OpenStack Swift (S3a). This one is eventually consistent and it causes lots of issues with my distributed jobs in Spark. Is the S3A committer able to fix that? Or is an "S3Guard-like" implementation the only way? David

Re: S3a Committer

2021-02-02 Thread David Morin
Yes, that's true, but this is not (yet) the case for the OpenStack Swift S3 API. On Tue, Feb 2, 2021 at 21:41, Henoc wrote: > S3 is strongly consistent now > https://aws.amazon.com/s3/consistency/ > > Regards, > Henoc > > On Tue, Feb 2, 2021, 10:27 PM David Morin >

Missing stack function from SQL functions API

2021-06-14 Thread david . szakallas
I noticed that the stack SQL function is missing from the functions API. Could we add it?
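
Until then, a possible workaround is to invoke stack() through a SQL expression; a sketch with illustrative column names:

    // stack(2, 'a', a, 'b', b) unpivots columns a and b into (label, value) rows.
    val unpivoted = df.selectExpr("id", "stack(2, 'a', a, 'b', b) as (label, value)")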

Re: Performance of PySpark jobs on the Kubernetes cluster

2021-08-11 Thread David Diebold
eters to see if I can give more work to the executors. Cheers, David On Tue, Aug 10, 2021 at 12:20, Khalid Mammadov wrote: > Hi Mich > > I think you need to check your code. > If code does not use the PySpark API effectively you may get this. I.e. if you > use pure Python/pa

Trying to hash cross features with mllib

2021-10-01 Thread David Diebold
more from QuantileDiscretizer and other cool functions. Am I missing something in the transformation API? Or is my approach to hashing wrong? Or should we consider extending the API somehow? Thank you, kind regards, David

Re: Trying to hash cross features with mllib

2021-10-04 Thread David Diebold
Hello Sean, Thank you for the heads-up! The Interaction transform won't help for my use case as it returns a vector that I won't be able to hash. I will definitely dig further into custom transformations though. Thanks! David On Fri, Oct 1, 2021 at 15:49, Sean Owen wrote: > Ar

question about data skew and memory issues

2021-12-14 Thread David Diebold
ry before processing them. But why would it need to put them in memory when doing an aggregation? It looks to me that aggregation can be performed in a streaming fashion, so I would not expect any OOM at all. Thank you in advance for your insights :) David

Re: Pyspark debugging best practices

2022-01-03 Thread David Diebold
join operations make the execution plan too complicated at the end of the day; checkpointing could help there? Cheers, David On Thu, Dec 30, 2021 at 16:56, Andrew Davidson wrote: > Hi Gourav > > I will give Databricks a try. > > Each dataset gets loaded into a data frame. > I

Re: groupMapReduce

2022-01-14 Thread David Diebold
Hello, In the RDD API, you must be looking for reduceByKey. Cheers On Fri, Jan 14, 2022 at 11:56, frakass wrote: > Is there an RDD API which is similar to Scala's groupMapReduce? > https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/ > > Thank you. > > ---
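
A rough sketch of the correspondence: what groupMapReduce(key)(map)(reduce) does in plain Scala becomes map-to-pairs plus reduceByKey on an RDD:

    // groupMapReduce(identity)(_ => 1)(_ + _), i.e. a word count:
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // per-key reduce, no full groupBy of values
    // counts.collect(): ("a",3), ("b",2), ("c",1)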

Question about spark.sql min_by

2022-02-21 Thread David Diebold
pest_sellers_df = spark.sql("select min_by(sellerId, price) sellerId, min(price) from table group by productId") Is there a way I can rely on min_by directly in groupBy? Is there some code missing in the pyspark wrapper to make min_by visible somehow? Thank you in advance for your help. Cheers, David

Re: Question about spark.sql min_by

2022-02-21 Thread David Diebold
.3, up for release soon. It exists in SQL. You can still use it in >> SQL with `spark.sql(...)` in Python though, not hard. >> >> On Mon, Feb 21, 2022 at 4:01 AM David Diebold >> wrote: >> >>> Hello all, >>> >>> I'm trying to use the sp
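
Sketched in Scala (the same two routes work from pyspark), with an illustrative table name; both call the SQL min_by directly, as suggested above:

    // Route 1: plain SQL.
    val cheapest = spark.sql(
      """SELECT productId, min_by(sellerId, price) AS sellerId, min(price) AS price
        |FROM products GROUP BY productId""".stripMargin)

    // Route 2: expr() inside a DataFrame aggregation.
    import org.apache.spark.sql.functions.{expr, min}
    val viaAgg = df.groupBy("productId")
      .agg(expr("min_by(sellerId, price)").as("sellerId"), min("price"))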

Question about bucketing and custom partitioners

2022-04-11 Thread David Diebold
have any pointers on this? Thanks! David

Writing protobuf RDD to parquet

2023-01-20 Thread David Diebold
have more control, in case I rely on partitionBy to arrange the files in different folders. But I'm not sure there is a built-in way to convert an RDD of protobuf to a dataframe in Spark? I would need to rely on this: https://github.com/saurfang/sparksql-protobuf. What do you think? Kind regards, David

RDD boundaries and triggering processing using tags in the data

2015-05-27 Thread David Webber
reatly appreciate pointers to some specific documentation or examples if you have seen something like this before. Thanks, David

Re: spark sql - reading data from sql tables having space in column names

2015-06-02 Thread David Mitchell
I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space, "Executor Info" from the Spark logs. I suggest that we open a JIRA ticket to address this issue. On Jun 2, 2015 10:08 AM, "ayan guha" wrote: > I would think the easiest way would b
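
For what it's worth, recent Spark versions can usually address such a field either through col() with the raw name or with backticks in a SQL expression; a sketch, with df standing in for the loaded DataFrame:

    import org.apache.spark.sql.functions.col

    val info  = df.select(col("Executor Info"))  // DataFrame API accepts the raw name
    val info2 = df.selectExpr("`Executor Info`") // SQL expressions need backtick quoting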

Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return responses in milliseconds. Analytic databases tend to operate on large data sets, and retu

Re: No. of Task vs No. of Executors

2015-07-18 Thread David Mitchell
This is likely due to data skew. If you are using key-value pairs, one key has a lot more records than the other keys. Do you have any groupBy operations? David On Tue, Jul 14, 2015 at 9:43 AM, shahid wrote: > hi > > I have a 10 node cluster i loaded the data onto hdfs, so t

Unable to Limit UI to localhost interface

2016-03-28 Thread David O'Gwynn
Greetings to all, I've searched around the mailing list, but it would seem that (nearly?) everyone has the opposite problem to mine. I made a stab at looking in the source for an answer, but I figured I might as well see if anyone else has run into the same problem as I have. I'm trying to limit my Mast

Re: Unable to Limit UI to localhost interface

2016-03-29 Thread David O'Gwynn
http://talebzadehmich.wordpress.com > > > > On 28 March 2016 at 15:32, David O'Gwynn wrote: > >> Greetings to all, >> >> I've searched around the mailing list, but it would seem that (ne

Re: Unable to Limit UI to localhost interface

2016-03-30 Thread David O'Gwynn
Thanks much, Akhil. iptables is certainly a bandaid, but from an OpSec perspective, it's troubling. Is there any way to limit which interfaces the WebUI listens on? Is there a Jetty configuration that I'm missing? Thanks again for your help, David On Wed, Mar 30, 2016 at 2:25 AM,

RE: DStream how many RDD's are created by batch

2016-04-12 Thread David Newberger
Hi, Time is usually the criteria if I’m understanding your question. An RDD is created for each batch interval. If your interval is 500ms then an RDD would be created every 500ms. If it’s 2 seconds then an RDD is created every 2 seconds. Cheers, David From: Natu Lauchande [mailto:nlaucha
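
A minimal illustration of the point: the batch interval is fixed when the StreamingContext is created, and that is what drives RDD creation:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("batch-interval-demo")
    val ssc  = new StreamingContext(conf, Seconds(2))  // each DStream yields one RDD every 2 seconds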

RE: DStream how many RDD's are created by batch

2016-04-12 Thread David Newberger
Hi Natu, I believe you are correct one RDD would be created for each file. Cheers, David From: Natu Lauchande [mailto:nlaucha...@gmail.com] Sent: Tuesday, April 12, 2016 1:48 PM To: David Newberger Cc: user@spark.apache.org Subject: Re: DStream how many RDD's are created by batch Hi

RE: Spark replacing Hadoop

2016-04-14 Thread David Newberger
Can we assume your question is “Will Spark replace Hadoop MapReduce?” or do you literally mean replacing the whole of Hadoop? David From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID] Sent: Thursday, April 14, 2016 2:13 PM To: User Subject: Spark replacing Hadoop Hi, I hear that some

RE: Can not set spark dynamic resource allocation

2016-05-20 Thread David Newberger
Hi All, The error you are seeing looks really similar to Spark-13514 to me. I could be wrong though https://issues.apache.org/jira/browse/SPARK-13514 Can you check yarn.nodemanager.local-dirs in your YARN configuration for "file://" Cheers! David Newberger -Original Message

RE: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-05-31 Thread David Newberger
Is https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt the build.sbt you are using? David Newberger QA Analyst WAND - The Future of Restaurant Technology (W) www.wandcorp.com (E) david.newber...@wandcorp.com

RE: About a problem when mapping a file located within a HDFS vmware cdh-5.7 image

2016-05-31 Thread David Newberger
Have you tried it without either of the setMaster lines? Also, CDH 5.7 uses Spark 1.6.0 with some patches. I would recommend using the Cloudera repo for the Spark entries in build.sbt. I'd also check other entries in the build.sbt to see if there are CDH-specific versions. David Newberger From

RE: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-03 Thread David Newberger
Alonso, The CDH VM uses YARN and the default deploy mode is client. I’ve been able to use the CDH VM for many learning scenarios. http://www.cloudera.com/documentation/enterprise/latest.html http://www.cloudera.com/documentation/enterprise/latest/topics/spark.html David Newberger From

RE: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-03 Thread David Newberger
What does your processing time look like? Is it consistently within that 20sec micro-batch window? David Newberger From: Adrian Tanase [mailto:atan...@adobe.com] Sent: Friday, June 3, 2016 8:14 AM To: user@spark.apache.org Cc: Cosmin Ciobanu Subject: [REPOST] Severe Spark Streaming performance

RE: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-03 Thread David Newberger
rk, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime. “ David Newberger From: Alonso Isidoro Roman [mailto:alons...@gmail.com] Sent: Friday, June 3, 2016 10:37 AM To: David Newberger Cc: user@spark.apache.org Subject: Re: About

RE: Spark Streaming - long garbage collection time

2016-06-03 Thread David Newberger
Have you tried UseG1GC in place of UseConcMarkSweepGC? This article really helped me with GC a few short weeks ago https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html David Newberger -Original Message- From: Marco1982 [mailto:marco.plata
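
The switch is normally made through the executor JVM options; a sketch of both ways to set it:

    // In code, before the context is created:
    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    // Or on the command line:
    //   spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" ...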

RE: Twitter streaming error : No lease on /user/hduser/checkpoint/temp (inode 806125): File does not exist.

2016-06-03 Thread David Newberger
I was going to ask if you had 2 jobs running. If the checkpointing for both is set up to look at the same location I could see an error like this happening. Do both spark jobs have a reference to a checkpointing dir? David Newberger From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent

RE: Twitter streaming error : No lease on /user/hduser/checkpoint/temp (inode 806125): File does not exist.

2016-06-03 Thread David Newberger
Hi Mich, My gut says you are correct that each application should have its own checkpoint directory. Though honestly I’m a bit fuzzy on checkpointing still as I’ve not worked with it much yet. Cheers, David Newberger From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, June
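
The implication, sketched for two applications each holding its own StreamingContext (paths illustrative):

    // Each streaming application checkpoints to its own directory.
    ssc1.checkpoint("hdfs:///checkpoints/twitter-app")
    ssc2.checkpoint("hdfs:///checkpoints/other-app")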

RE: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread David Newberger
Could you be looking at 2 jobs trying to use the same file and one getting to it before the other and finally removing it? David Newberger From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Wednesday, June 8, 2016 1:33 PM To: user; user @spark Subject: Creating a Hive table through

RE: streaming example has error

2016-06-15 Thread David Newberger
Have you tried to “set spark.driver.allowMultipleContexts = true”? David Newberger From: Lee Ho Yeung [mailto:jobmatt...@gmail.com] Sent: Tuesday, June 14, 2016 8:34 PM To: user@spark.apache.org Subject: streaming example has error When simulating streaming with nc -lk, got the error below, then

RE: Handle empty kafka in Spark Streaming

2016-06-15 Thread David Newberger
If you're asking how to handle no messages in a batch window then I would add an isEmpty check like: dStream.foreachRDD(rdd => { if (!rdd.isEmpty()) ... }) Or something like that. David Newberger -Original Message- From: Yogesh Vyas [mailto:informy...@gmail.com] Sent: W

RE: Handle empty kafka in Spark Streaming

2016-06-15 Thread David Newberger
is as an application which is always running and doing something like windowed batching or micro-batching or whatever I'm trying to accomplish. If an RDD I get from Kafka is empty then I don't run the rest of the job. If the RDD I get from Kafka has some number of events then I

RE: Limit pyspark.daemon threads

2016-06-15 Thread David Newberger
the maximum amount of CPU cores to request for the application from across the cluster (not from each machine). If not set, the default will be spark.deploy.defaultCores on Spark's standalone cluster manager, or infinite (all available cores) on Mesos.” David Newberger From: agateaaa [mail

RE: streaming example has error

2016-06-16 Thread David Newberger
Try adding wordCounts.print() before ssc.start() David Newberger From: Lee Ho Yeung [mailto:jobmatt...@gmail.com] Sent: Wednesday, June 15, 2016 9:16 PM To: David Newberger Cc: user@spark.apache.org Subject: Re: streaming example has error got another error StreamingContext: Error starting the
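
The shape the fix implies, given a StreamingContext ssc (a sketch): at least one output operation such as print() must be registered before start() is called.

    val lines = ssc.socketTextStream("localhost", 9999)  // pairs with `nc -lk 9999`
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    wordCounts.print()   // output operation registered before start()
    ssc.start()
    ssc.awaitTermination()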

RE: difference between dataframe and dataframe.write

2016-06-16 Thread David Newberger
DataFrame is a collection of data which is organized into named columns. DataFrame.write is an interface for saving the contents of a DataFrame to external storage. Hope this helps David Newberger From: pseudo oduesp [mailto:pseudo20...@gmail.com] Sent: Thursday, June 16, 2016 9:43 AM To
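
A two-line illustration of the distinction, with illustrative paths (using the 1.6-era sqlContext):

    val df = sqlContext.read.json("/data/in")  // DataFrame: data organized into named columns
    df.write.parquet("/data/out")              // DataFrame.write: a DataFrameWriter saving it externally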

RE: HBase-Spark Module

2016-07-29 Thread David Newberger
Hi Ben, This seems more like a question for community.cloudera.com. However, it would be in hbase not spark I believe. https://repository.cloudera.com/artifactory/webapp/#/artifacts/browse/tree/General/cloudera-release-repo/org/apache/hbase/hbase-spark David Newberger -Original Message

Need Advice: Spark-Streaming Setup

2016-08-01 Thread David Kaufman
please don't hesitate to ask. Thanks, David

Spark on YARN multitenancy

2015-12-15 Thread David Fox
Hello Spark experts, We are currently evaluating Spark on our cluster that already supports MRv2 over YARN. We have noticed a problem with running jobs concurrently, in particular that a running Spark job will not release its resources until the job is finished. Ideally, if two people run any co

RE: fishing for help!

2015-12-21 Thread David Newberger
Hi Eran, Based on the limited information the first things that come to my mind are processor, RAM, and disk speed. David Newberger QA Analyst WAND - The Future of Restaurant Technology (W) www.wandcorp.com (E) david.newber...@wandcorp.com

Fat jar can't find jdbc

2015-12-21 Thread David Yerrington
ll that does "sqlContext.load("jdbc", myOptions)". I know this is a total newbie question but in my defense, I'm fairly new to Scala, and this is my first go at deploying a fat jar with sbt-assembly. Thanks for any advice! -- David Yerrington yerrington.net

Re: Fat jar can't find jdbc

2015-12-22 Thread David Yerrington
; MergeStrategy.discard case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first case PathList("org", "apache", xs @ _*) => MergeStrategy.first case PathList("org", "jboss", xs @ _*) => MergeStrategy.first case

Re: Fat jar can't find jdbc

2015-12-22 Thread David Yerrington
s the Maven manifest goes, I'm really not sure. I will research it though. Now I'm wondering if my mergeStrategy is to blame? I'm going to try there next. Thank you for the help! On Tue, Dec 22, 2015 at 1:18 AM, Igor Berman wrote: > David, can you verify that mysql connect

Problem About Worker System.out

2015-12-28 Thread David John
I have used Spark 1.4 for 6 months. Thanks to all the members of this community for your great work. I have a question about a logging issue. I hope this question can be solved. The program is running under these configurations: YARN cluster, YARN-client mode. In Scala, writing code like: rdd.

FW: Problem About Worker System.out

2015-12-28 Thread David John
015 at 5:33 PM, David John wrote: I have used Spark 1.4 for 6 months. Thanks all the members of this community for your great work.I have a question about the logging issue. I hope this question can be solved. The program is running under this configurations: YARN Cluster, YARN-client m

Using Experimental Spark Features

2015-12-30 Thread David Newberger
used this approach yet and if so what has your experience been with using it? If it helps we'd be looking to implement it using Scala. Secondly, in general what has people's experience been with using experimental features in Spark? Cheers, David Newberger

Re: [discuss] dropping Python 2.6 support

2016-01-11 Thread David Chin
on 2.7. Some libraries that Spark depend on >>> stopped supporting 2.6. We can still convince the library maintainers to >>> support 2.6, but it will be extra work. I'm curious if anybody still uses >>> Python 2.6 to run Spark. >>> >>> Thanks. >>> >>> >>> >> -- David Chin, Ph.D. david.c...@drexel.edu Sr. Systems Administrator, URCF, Drexel U. http://www.drexel.edu/research/urcf/ https://linuxfollies.blogspot.com/ +1.215.221.4747 (mobile) https://github.com/prehensilecode

ROSE: Spark + R on the JVM, now available.

2016-01-12 Thread David Russell
ou to [take a look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback, questions etc very welcome. David "All that is gold does not glitter, Not all those who wander are lost."

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Corey, > Would you mind providing a link to the github? Sure, here is the github link you're looking for: https://github.com/onetapbeyond/opencpu-spark-executor David "All that is gold does not glitter, Not all those who wander are lost." Original Message --

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
weight as ROSE and it's not designed to work in a clustered environment. ROSE on the other hand is designed for scale. David "All that is gold does not glitter, Not all those who wander are lost." Original Message Subject: Re: ROSE: Spark + R on the JVM. Local Time:

Re: ROSE: Spark + R on the JVM.

2016-01-13 Thread David Russell
PIs in Java, JavaScript and .NET that can easily support your use case. The outputs of your DeployR integration could then become inputs to your data processing system. David "All that is gold does not glitter, Not all those who wander are lost." Original Message Subject: R

Re: Kafka Streaming and partitioning

2016-01-13 Thread David D
Yep that's exactly what we want. Thanks for all the info Cody. Dave. On 13 Jan 2016 18:29, "Cody Koeninger" wrote: > The idea here is that the custom partitioner shouldn't actually get used > for repartitioning the kafka stream (because that would involve a shuffle, > which is what you're trying

Re: rdd.foreach return value

2016-01-18 Thread David Russell
The foreach operation on RDD has a void (Unit) return type. See attached. So there is no return value to the driver. David "All that is gold does not glitter, Not all those who wander are lost." Original Message Subject: rdd.foreach return value Local Time: Janua
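
A small illustration: foreach runs on the executors purely for its side effects, so to get values back to the driver an action that returns data is needed.

    val rdd = sc.parallelize(1 to 3)
    val nothing: Unit = rdd.foreach(println)           // returns (), output appears on the executors
    val doubled: Array[Int] = rdd.map(_ * 2).collect() // collect() brings results to the driver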

MLlib OneVsRest causing intermittent exceptions

2016-01-25 Thread David Brooks
ry rare classes. I'm happy to look into patching the code, but I first wanted to confirm that the problem was real, and that I wasn't somehow misunderstanding how I should be using OneVsRest. Any guidance would be appreciated - I'm new to the list. Many thanks, David

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
and issue. I'm happy to try a simpler method for providing column metadata, if one is available. Thanks, David On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha wrote: > Hi David > > What happens if you provide the class labels via metadata instead of > letting OneVsRest de

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
with using the label metadata as a shortcut. Do you agree that there is an issue here? Would you accept contributions to the code to remedy it? I'd gladly take a look if I can be of help. Many thanks, David On Tue, Jan 26, 2016 at 1:29 PM David Brooks wrote: > Hi Ram, > > I did

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-26 Thread David Brooks
g JIRAs and getting patches tomorrow morning. It's late here! Thanks for the swift response, David On Tue, Jan 26, 2016 at 11:09 PM Ram Sriharsha wrote: > Hi David > > If I am reading the email right, there are two problems here right? > a) for rare classes the random spli

Re: MLlib OneVsRest causing intermittent exceptions

2016-01-27 Thread David Brooks
/apache/spark/commit/2388de51912efccaceeb663ac56fc500a79d2ceb This should resolve the issue I'm experiencing. I'll get hold of a build from source and try it out. Thanks for all your help! David On Wed, Jan 27, 2016 at 12:51 AM Ram Sriharsha wrote: > btw, OneVsRest is using the

[ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-01 Thread David Russell
> ROSE Spark Package: https://github.com/onetapbeyond/opencpu-spark-executor <https://github.com/onetapbeyond/opencpu-spark-executor> Questions, suggestions, feedback welcome. David -- "*All that is gold does not glitter,** Not all those who wander are lost."*

Re: Guidelines for writing SPARK packages

2016-02-01 Thread David Russell
the artifacts for your package to Maven central. David On Mon, Feb 1, 2016 at 7:03 AM, Praveen Devarao wrote: > Hi, > > Is there any guidelines or specs to write a Spark package? I would > like to implement a spark package and would like to know the way it needs to > be

Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread David Russell
the vision is to get rid of all cluster > management when using Spark. You might find one of the hosted Spark platform solutions such as Databricks or Amazon EMR that handle cluster management for you a good place to start. At least in my experience, they got me

How to add kafka streaming jars when initialising the sparkcontext in python

2016-02-15 Thread David Kennedy
fka/libs/metrics-core-2.2.0.jar,' '/usr/share/java/mysql.jar') got the logging to admit to adding the jars to the http server (just as for the spark submit output above) but leaving the other config options in place or removing them the class is still not found. Is this not possible in python? Incidentally, I have tried SPARK_CLASSPATH (getting the message that it's deprecated and ignored anyway) and I cannot find anything else to try. Can anybody help? David K.

Re: How to avoid Spark shuffle spill memory?

2015-10-06 Thread David Mitchell
your code to make it use less memory. David On Tue, Oct 6, 2015 at 3:19 PM, unk1102 wrote: > Hi I have a Spark job which runs for around 4 hours and it shares > SparkContext and runs many child jobs. When I see each job in UI I see > shuffle spill of around 30 to 40 GB and because of

Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-12 Thread David Bess
as java8u60. I double checked my Python version and it appears to be 2.7.10. I am familiar with the command line, and have a background in Hadoop, but this has me stumped. Thanks in advance, David Bess

Re: Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-13 Thread David Bess
Got it working! Thank you for confirming my suspicion that this issue was related to Java. When I dug deeper I found multiple versions and some other issues. I worked on it a while before deciding it would be easier to just uninstall all Java and reinstall clean JDK, and now it works perfectly.

Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread David Morales
work between daily increased large tables, >> for >> >> both spark sql and cassandra. I can see that the [1] use case facilitates >> FiloDB to achieve columnar storage and query performance, but we had >> nothing more >> >> knowledge. >> >>

RE: hdfs-ha on mesos - odd bug

2015-11-11 Thread Buttler, David
I have verified that this error exists on my system as well, and the suggested workaround also works. Spark version: 1.5.1; 1.5.2 Mesos version: 0.21.1 CDH version: 4.7 I have set up the spark-env.sh to contain HADOOP_CONF_DIR pointing to the correct place, and I have also linked in the hdfs-si

RE: Save GraphX to disk

2015-11-13 Thread Buttler, David
A graph is vertices and edges. What else are you expecting to save/load? You could save/load the triplets, but that is actually more work to reconstruct the graph than the vertices and edges separately. Dave From: Gaurav Kumar [mailto:gauravkuma...@gmail.com] Sent: Friday, November 13, 2015

Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-19 Thread David Rosenstrauch
I ran into this recently. Turned out we had an old org-xerial-snappy.properties file in one of our conf directories that had the setting: # Disables loading Snappy-Java native library bundled in the # snappy-java-*.jar file forcing to load the Snappy-Java native # library from the java.library

Passing SPARK_CONF_DIR to slaves in standalone mode under Grid Engine job

2015-07-29 Thread David Chin
Hi, all, I am just setting up to run Spark in standalone mode, as a (Univa) Grid Engine job. I have been able to set up the appropriate environment variables such that the master launches correctly, etc. In my setup, I generate GE job-specific conf and log dirs. However, I am finding that the SPA

Spark-Grid Engine light integration writeup

2015-08-06 Thread David Chin
ments from anyone who may be doing something similar. Cheers, Dave -- David Chin, Ph.D. david.c...@drexel.edu Sr. Systems Administrator, URCF, Drexel U. http://www.drexel.edu/research/urcf/ https://linuxfollies.blogspot.com/ 215.221.4747 (mobile) https://github.com/prehensilecode

Problem with take vs. takeSample in PySpark

2015-08-10 Thread David Montague
behavior with the take function, or at least without needing to choose an element randomly. I was able to get the behavior I wanted above by just changing the seed until I got the dataframe I wanted, but I don't think that is a good approach in general. Any insight is appreciated. Best, David Mon

Re: graph x issue spark 1.3

2015-08-17 Thread David Zeelen
The code below is taken from the Spark website and generates the error detailed. Hi, using Spark 1.3 and trying some sample code: val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")), (5L, ("franklin", "prof")), (2L, ("istoica",
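
For reference, the sample completed from the GraphX programming guide:

    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.rdd.RDD

    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Array(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    val defaultUser = ("John Doe", "Missing")  // default attribute for dangling edges
    val graph: Graph[(String, String), String] = Graph(users, relationships, defaultUser)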

Re: submit_spark_job_to_YARN

2015-08-30 Thread David Mitchell
Hi Ajay, Are you trying to save to your local file system or to HDFS? // This would save to HDFS under "/user/hadoop/counter" counter.saveAsTextFile("/user/hadoop/counter"); David On Sun, Aug 30, 2015 at 11:21 AM, Ajay Chander wrote: > Hi Everyone, > > Recent

Event logging not working when worker machine terminated

2015-09-08 Thread David Rosenstrauch
Our Spark cluster is configured to write application history event logging to a directory on HDFS. This all works fine. (I've tested it with Spark shell.) However, on a large, long-running job that we ran tonight, one of our machines at the cloud provider had issues and had to be terminated

Re: Event logging not working when worker machine terminated

2015-09-09 Thread David Rosenstrauch
Standalone. On 09/08/2015 11:18 PM, Jeff Zhang wrote: What cluster mode do you use ? Standalone/Yarn/Mesos ? On Wed, Sep 9, 2015 at 11:15 AM, David Rosenstrauch wrote: Our Spark cluster is configured to write application history event logging to a directory on HDFS. This all works fine

Re: Event logging not working when worker machine terminated

2015-09-09 Thread David Rosenstrauch
o be a bug introduced in 1.3. Hopefully it's fixed in 1.4. Thanks, Charles On 9/9/15, 7:30 AM, "David Rosenstrauch" wrote: Standalone. On 09/08/2015 11:18 PM, Jeff Zhang wrote: What cluster mode do you use ? Standalone/Yarn/Mesos ? On Wed, Sep 9, 2015 at 11:15 AM, David Rosens

Re: Spark Streaming Suggestion

2015-09-15 Thread David Morales
Storm writes the data to both cassandra and kafka, spark reads the >>> actual data from kafka , processes the data and writes to cassandra. >>> The second approach avoids additional hit of reading from cassandra >>> every minute , a device has written data to cassandra at the
