[Release Question]: Estimate on 3.5.2 release?

2024-04-26 Thread Paul Gerver
Hello, I'm curious if there is an estimate of when 3.5.2 for Spark Core will be released. There are several bug and security vulnerability fixes in the dependencies we are excited to receive! If anyone has any insights, that would be greatly appreciated. Thanks! - Paul

CFP for the 2nd Performance Engineering track at Community over Code NA 2023

2023-07-03 Thread Brebner, Paul
ce-engineering-track-over-code-brebner/> - Paul Brebner and Roger Abelenda

Rename columns without manually setting them all

2023-06-21 Thread John Paul Jayme
Hi, This is currently my column definition:
Employee ID | Name | Client | Project | Team | 01/01/2022 | 02/01/2022 | 03/01/2022 | 04/01/2022 | 05/01/2022
12345 | Dummy x | Dummy a | abc | team a | OFF | WO | WH | WH | WH
As you can see, the outer columns are just d
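
A rename like this is usually scripted rather than typed out per column. A minimal sketch, assuming a DataFrame df whose new names can be derived from the old ones (the derivation rule here is hypothetical):

    # build the full list of new names, then swap them all in one call
    new_names = [c.strip().replace(" ", "_").lower() for c in df.columns]
    df = df.toDF(*new_names)   # toDF(*names) renames positionally, no loop of withColumnRenamed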

How to read excel file in PySpark

2023-06-20 Thread John Paul Jayme
ct has no attribute 'read_excel'. Can you advise? JOHN PAUL JAYME Data Engineer m. +639055716384 w. www.tdcx.com<http://www.tdcx.com/> Winner of over 350 Industry Awards
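
Plain PySpark has no read_excel; that method belongs to pandas (and to the pandas-on-Spark API in newer releases). A common workaround, sketched under the assumption that the sheet fits in driver memory and an Excel engine such as openpyxl is installed:

    import pandas as pd

    # read on the driver with pandas, then hand the rows to Spark
    pdf = pd.read_excel("employees.xlsx")     # hypothetical file name
    df = spark.createDataFrame(pdf)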

Re: NoClassDefError and SparkSession should only be created and accessed on the driver.

2022-09-20 Thread Paul Rogalinski
Hi Rajat, I have been facing similar problem recently and could solve it by moving the UDF implementation into a dedicated class instead having it implemented in the driver class/object. Regards, Paul. On Tuesday 20 September 2022 10:11:31 (+02:00), rajat kumar wrote: Hi Alton, it'
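
That fix is for the JVM side, where the UDF body had captured the enclosing driver object. A rough PySpark analogue of the same idea, keeping the function in its own importable module (module and names are hypothetical; ship it with --py-files):

    # udfs.py -- standalone module, no references back to the driver script
    def normalize(s):
        return s.strip().lower() if s is not None else None

    # driver script
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    import udfs

    normalize_udf = udf(udfs.normalize, StringType())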

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Paul Wais
produce without reproducer and even couldn't reproduce even > they spent their time. Memory leak issue is not really easy to reproduce, > unless it leaks some objects without any conditions. > > - Jungtaek Lim (HeartSaVioR) > > On Sun, Oct 20, 2019 at 7:18 PM Paul Wais wrote

pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-20 Thread Paul Wais
over time. Those were very different jobs, but perhaps this issue is bespoke to local mode? Emphasis: I did try to del the pyspark objects and run python GC. That didn't help at all. pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image) 12-core i7

Avro support broken?

2019-07-04 Thread Paul Wais
nels%3Acomment-tabpanel#comment-16878896 Cheers, -Paul - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

dropping unused data from a stream

2019-01-22 Thread Paul Tremblay
I will be streaming data and am trying to understand how to get rid of old data from a stream so it does not become too large. I will stream in one large table of buying data and join that to another table of different data. I need the last 14 days from the second table. I will not need data that is
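
With Structured Streaming, the standard tool for this is a watermark plus an event-time bound on the join, so Spark can purge state older than the threshold. A minimal sketch, assuming streaming DataFrames purchases and lookup with an event_time column and a key column (all names hypothetical):

    from pyspark.sql.functions import expr

    # declare how late data may arrive; state older than this becomes droppable
    p = purchases.withWatermark("event_time", "14 days").alias("p")
    l = lookup.withWatermark("event_time", "14 days").alias("l")

    # the event-time range condition is what lets Spark actually discard old rows
    joined = p.join(l, expr(
        "p.key = l.key AND "
        "l.event_time >= p.event_time - interval 14 days"))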

Re: Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-05-02 Thread Paul Tremblay
I would like to see the full error. However, S3 can give misleading messages if you don't have the correct permissions. On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni wrote: > HI all > i am using the following code for persisting data into S3 (aws keys are > already stored in the environment vari

[Spark scheduling] Spark schedules single task although rdd has 48 partitions?

2018-05-02 Thread Paul Borgmans
(please notice this question was previously posted to https://stackoverflow.com/questions/49943655/spark-schedules-single-task-although-rdd-has-48-partitions) We are running Spark 2.3 / Python 3.5.2. For a job we run following code (please notice that the input txt files are just a simplified exa

History server and non-HDFS filesystems

2017-11-17 Thread Paul Mackles
, I know ADL is not heavily used at this time so I wonder if anyone is seeing this with S3 as well? Maybe not since S3 permissions are always reported as world-readable (I think) which causes checkAccessPermission() to succeed. Any thoughts or feedback appreciated. -- Thanks, Paul

Spark REST API

2017-11-07 Thread Paul Corley
currently streaming apps running on EMR. Paul Corley | Principal Data Engineer IgnitionOne | Marketing Technology. Simplified. Office: 1545 Peachtree St NE | Suite 500 | Atlanta, GA | 30309 Direct: 702.336.0094 Email: paul.cor...@ignitionone.com<mailto:paul.cor...@ignitionone.com>

Re: Running spark examples in Intellij

2017-10-11 Thread Paul
You say you did the maven package but did you do a maven install and define your local maven repo in SBT? -Paul Sent from my iPhone > On Oct 11, 2017, at 5:48 PM, Stephen Boesch wrote: > > When attempting to run any example program w/ Intellij I am running into > guava versi

Re: is it ok to have multiple sparksession's in one spark structured streaming app?

2017-09-08 Thread Paul
You would set the Kafka topic as your data source and you would write a custom output to Cassandra everything would be or could be contained within your stream -Paul Sent from my iPhone > On Sep 8, 2017, at 2:52 PM, kant kodali wrote: > > How can I use one SparkSession to tal
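
A sketch of that shape, assuming Spark 2.4+ for foreachBatch and the DataStax Cassandra connector on the classpath (servers, topic, keyspace, and table names are hypothetical):

    # single source: Kafka
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    # custom sink: write each micro-batch to Cassandra
    def write_batch(batch_df, batch_id):
        (batch_df.write.format("org.apache.spark.sql.cassandra")
         .options(keyspace="ks", table="events")
         .mode("append").save())

    query = events.writeStream.foreachBatch(write_batch).start()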

Structured Streaming from Parquet

2017-05-25 Thread Paul Corley
ntually throws a java OOM error. Additionally each cycle through this step takes successively longer. Hopefully someone can lend some insight as to what is actually taking place in this step and how to alleviate it. Thanks, Paul Corley | Principal Data Engineer

splitting a huge file

2017-04-21 Thread Paul Tremblay
as to be split up, right? We ended up using a single machine with a single thread to do the splitting. I just want to make sure I am not missing something obvious. Thanks! -- Paul Henry Tremblay Attunix

small job runs out of memory using wholeTextFiles

2017-04-07 Thread Paul Tremblay
the number of partitions, but get the same error each time. In contrast, if I run a simple: rdd = sc.textFile("s3://paulhtremblay/noaa_tmp/") rdd.count() The job finishes in 15 minutes, even with just 3 nodes. Thanks -- Paul Henry Tremblay Robert Half Technology

Re: bug with PYTHONHASHSEED

2017-04-05 Thread Paul Tremblay
ira/browse/SPARK-13330 > > > > Holden Karau wrote on Wed, Apr 5, 2017 at 12:03 AM: > >> Which version of Spark is this (or is it a dev build)? We've recently >> made some improvements with PYTHONHASHSEED propagation. >> >> On Tue, Apr 4, 2017 at 7:49 AM Eike von Seg

Re: bug with PYTHONHASHSEED

2017-04-04 Thread Paul Tremblay
So that means I have to pass that bash variable to the EMR clusters when I spin them up, not afterwards. I'll give that a go. Thanks! Henry On Tue, Apr 4, 2017 at 7:49 AM, Eike von Seggern wrote: > 2017-04-01 21:54 GMT+02:00 Paul Tremblay : > >> When I try to to do a groupBy
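
The seed has to be in place before the Python workers start on the executors, which is why setting it on a running cluster does not help. One hedged way to pass it at application level, assuming your deploy mode honors spark.executorEnv.*:

    from pyspark import SparkConf, SparkContext

    # pin the hash seed on every executor so string hashing is deterministic
    conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "0")
    sc = SparkContext(conf=conf)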

Re: Alternatives for dataframe collectAsList()

2017-04-03 Thread Paul Tremblay
View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/Alternatives-for-dataframe- > collectAsList-tp28547.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Paul Henry Tremblay Robert Half Technology

Re: Read file and represent rows as Vectors

2017-04-03 Thread Paul Tremblay
com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Paul Henry Tremblay Robert Half Technology

Re: Looking at EMR Logs

2017-04-02 Thread Paul Tremblay
d run the history server like: > ``` > cd /usr/local/src/spark-1.6.1-bin-hadoop2.6 > sbin/start-history-server.sh > ``` > and then open http://localhost:18080 > > > > > On Thu, Mar 30, 2017 at 8:45 PM, Paul Tremblay > wrote: > >> I am looking for tips on

bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
I get the same error: Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED Anyone know how to fix this problem in python 3.4? Thanks Henry -- Paul Henry Tremblay Robert Half Technology

pyspark bug with PYTHONHASHSEED

2017-04-01 Thread Paul Tremblay
I get the same error: Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED Anyone know how to fix this problem in python 3.4? Thanks Henry -- Paul Henry Tremblay Robert Half Technology

Looking at EMR Logs

2017-03-30 Thread Paul Tremblay
evaluate such things as how many tasks were completed, how many executors were used, etc. I currently save my logs to S3. Thanks! Henry -- Paul Henry Tremblay Robert Half Technology

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-11 Thread Paul Tremblay
work as well: http://michaelryanbell.com/processing-whole-files-spark-s3.html Jon On Mon, Feb 6, 2017 at 6:38 PM, Paul Tremblay <mailto:paulhtremb...@gmail.com>> wrote: I've actually been able to trace the problem to the files being read in. If I change to a different d

Re: Turning rows into columns

2017-02-11 Thread Paul Tremblay
chine On Feb 4, 2017 16:25, "Paul Tremblay" <mailto:paulhtremb...@gmail.com>> wrote: I am using pyspark 2.1 and am wondering how to convert a flat file, with one record per row, into a columnar format. Here is an example of the data: u'WARC/1.0

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory? On 02/06/2017 02:35 PM, Paul Tremblay wrote: When I try to create an rdd using wholeTextFiles

wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error. I am using pyspark with spark 2.1. in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/ rdd = sc.wholeTextFiles(

Turning rows into columns

2017-02-04 Thread Paul Tremblay
I am using pyspark 2.1 and am wondering how to convert a flat file, with one record per row, into a columnar format. Here is an example of the data: u'WARC/1.0', u'WARC-Type: warcinfo', u'WARC-Date: 2016-12-08T13:00:23Z', u'WARC-Record-ID: ', u'Content-Length: 344', u'Content-Type: applicati

RE: spark 2.02 error when writing to s3

2017-01-27 Thread VND Tremblay, Paul
Not sure what you mean by "a consistency layer on top." Any explanation would be greatly appreciated! Paul _ Paul Tremblay Analytics Specialist THE BOSTON CONSULTING GROUP Tel.

RE: spark 2.02 error when writing to s3

2017-01-26 Thread VND Tremblay, Paul
This seems to have done the trick, although I am not positive. If I have time, I'll test spinning up a cluster with and without consistent view to pinpoint the error. _____ Paul Tremblay Anal

RE: Ingesting Large csv File to relational database

2017-01-26 Thread VND Tremblay, Paul
. _ Paul Tremblay Analytics Specialist THE BOSTON CONSULTING GROUP Tel. + ▪ Mobile + _ From: Eric Dain [mailto:ericdai...@gmail.com] Sent: Wednesday, January 25, 2017 11:14 PM To

RE: spark 2.02 error when writing to s3

2017-01-20 Thread VND Tremblay, Paul
I am using an EMR cluster, and the latest version offered is 2.02. The link below indicates that that user had the same problem, which seems unresolved. Thanks Paul _ Paul Tremblay Analytics

spark 2.02 error when writing to s3

2017-01-19 Thread VND Tremblay, Paul
tries to write multiple times and causes the error. The suggestion is to turn off speculation, but I believe speculation is turned off by default in pyspark. Thanks! Paul _ Paul Tremblay Analytic
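
spark.speculation does default to false; forcing it off explicitly is still a cheap experiment. A sketch:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.speculation", "false")  # redundant if the default is in effect
    sc = SparkContext(conf=conf)
    print(sc.getConf().get("spark.speculation"))          # expect 'false'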

Spark 2.0 Encoder().schema() is sorting StructFields

2016-10-12 Thread Paul Stewart
considered a bug/enhancement? Regards, Paul - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Spark 2.0 Encoder().schema() is sorting StructFields

2016-10-07 Thread Paul Stewart
attributes is significant. Is there anyway to cause the Encoder().schema() method to return the array of StructFields in the original definition order of the Bean.class? Regards, Paul - To unsubscribe e-mail: user-unsubscr

Re: AVRO vs Parquet

2016-03-04 Thread Paul Leclercq
-- Paul Leclercq | Data engineer paul.lecle...@tabmo.io | http://www.tabmo.fr/

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-23 Thread Paul Leclercq
Topic}/{partitionId} {newOffset} Source : https://metabroadcast.com/blog/resetting-kafka-offsets 2016-02-22 11:55 GMT+01:00 Paul Leclercq : > Thanks for your quick answer. > > If I set "auto.offset.reset" to "smallest" as for KafkaParams like this > &

Re: Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Paul Leclercq
ffset.reset" through parameter > "kafkaParams" which is provided in some other overloaded APIs of > createStream. > > By default Kafka will pick data from latest offset unless you explicitly > set it, this is the behavior Kafka, not Spark. > > Thanks >

Kafka streaming receiver approach - new topic not read from beginning

2016-02-22 Thread Paul Leclercq
t.reset > to "earliest" for the new consumer in 0.9 and "smallest" for the old > consumer. https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whydoesmyconsumernevergetanydata? Thanks -- Paul Leclercq
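
With the receiver approach the override travels through kafkaParams. A sketch against the era-appropriate pyspark.streaming.kafka API (removed in Spark 3.x; host, group, and topic names hypothetical):

    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    ssc = StreamingContext(sc, 10)   # 10-second batches
    stream = KafkaUtils.createStream(
        ssc, "zk-host:2181", "my-group", {"my-topic": 1},
        kafkaParams={"auto.offset.reset": "smallest"})  # old consumer: 'smallest' = from the beginning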

Re: spark-1.2.0--standalone-ha-zookeeper

2016-01-20 Thread Paul Leclercq
ER" > -Dspark.deploy.zookeeper.url="ZOOKEEPER_IP:2181" > -Dspark.deploy.zookeeper.dir="/spark"' A good thing to check if everything went OK is the folder /spark on the ZooKeeper server. I could not find it on my server. Thanks for reading, Paul 2016-01-19 22:12 GMT+01

Re: Spark streaming job hangs

2015-12-01 Thread Paul Leclercq
or time 144894989 ms > 2015-12-01 06:04:55,064 [JobGenerator] INFO (Logging.scala:59) - Added > jobs for time 1448949895000 ms > 2015-12-01 06:05:00,125 [JobGenerator] INFO (Logging.scala:59) - Added > jobs for time 144894990 ms > > > Thanks > LCassa > -- Paul Leclercq | Data engineer paul.lecle...@tabmo.io | http://www.tabmo.fr/

Re: unpersist RDD from another thread

2015-09-16 Thread Paul Weiss
ill be unpredictable (some partition may use cache, some > may not be able to use the cache). > > On Wed, Sep 16, 2015 at 1:06 PM, Paul Weiss > wrote: > >> Hi, >> >> What is the behavior when calling rdd.unpersist() from a different thread >> while another thre

unpersist RDD from another thread

2015-09-16 Thread Paul Weiss
has been called? thanks, -paul
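
The safe pattern is a blocking unpersist once no running job still reads the RDD; recent PySpark exposes the flag directly (a sketch, assuming a version where the blocking argument is available):

    # returns only after the cached blocks are actually removed
    rdd.unpersist(blocking=True)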

RE: Too many open files

2015-07-29 Thread Paul Röwer
Maybe you forgot to close a reader or writer object. On 29 July 2015, 18:04:59 CEST, saif.a.ell...@wellsfargo.com wrote: >Thank you both, I will take a look, but > > >1. For high-shuffle tasks, is this right for the system to have >the size and thresholds high? I hope there is no bad con

Jobs with unknown origin.

2015-07-08 Thread Jan-Paul Bultmann
Hey, I have quite a few jobs appearing in the web-ui with the description "run at ThreadPoolExecutor.java:1142". Are these generated by SparkSQL internally? There are so many that they cause a RejectedExecutionException when the thread-pool runs out of space for them. RejectedExecutionExceptio

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
Sorry, that should be shortest path, and diameter of the graph. I shouldn't write emails before I get my morning coffee... > On 06 Jul 2015, at 09:09, Jan-Paul Bultmann wrote: > > I would guess the opposite is true for highly iterative benchmarks (common in > graph processing

Re: Benchmark results between Flink and Spark

2015-07-06 Thread Jan-Paul Bultmann
I would guess the opposite is true for highly iterative benchmarks (common in graph processing and data-science). Spark has a pretty large overhead per iteration, and more optimisations and planning only make this worse. Sure, people implemented things like Dijkstra's algorithm in Spark (a problem

Re: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Jan-Paul Bultmann
tayed the same though. But I didn’t run that many iterations due to the problem :). > As a workaround, you can break the iterations into smaller ones and trigger > them manually in sequence. You mean `write`-ing them to disk after each iteration? Thanks :), Jan > -Original Message

generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Jan-Paul Bultmann
org.apache.spark.sql.DataFrame.persist(StorageLevel) DataFrame.scala:1320 ^ | Application logic. | Could someone confirm my suspicion? And does somebody know why it’s called while caching, and why it walks the entire tree including cached results? Cheers, Jan-Paul

Re: build jar with all dependencies

2015-06-02 Thread Paul Röwer
t.scala:53) at mgm.tp.bigdata.ma_spark.SparkMain.main(SparkMain.java:38) what am I doing wrong? best regards, paul

Soft distinct on data frames.

2015-05-28 Thread Jan-Paul Bultmann
Hey, Is there a way to do a distinct operation on each partition only? My program generates quite a few duplicate tuples and it would be nice to remove some of these as an optimisation without having to reshuffle the data. I’ve also noticed that plans generated with a unique transformation have
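
There is no built-in partition-local distinct, but mapPartitions gives you one without a shuffle. A minimal sketch, assuming the elements are hashable:

    # de-duplicate within each partition only; duplicates across partitions survive
    deduped = rdd.mapPartitions(lambda part: iter(set(part)))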

Re: Best practice to avoid ambiguous columns in DataFrame.join

2015-05-17 Thread Jan-Paul Bultmann
It’s probably not advisable to use 1 though since it will break when `df = df2`, which can easily happen when you’ve written a function that does such a join internally. This could be solved by an identity like function that returns the dataframe unchanged but with a different identity. `.as` wo
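
Aliasing both sides achieves that disambiguation with the existing API. A sketch (the id and value columns are hypothetical):

    from pyspark.sql.functions import col

    # distinct aliases keep references unambiguous even when df is df2
    joined = (df.alias("a")
                .join(df2.alias("b"), col("a.id") == col("b.id"))
                .select(col("a.id"), col("b.value")))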

spark sql, creating literal columns in java.

2015-05-05 Thread Jan-Paul Bultmann
Hey, What is the recommended way to create literal columns in java? Scala has the `lit` function from `org.apache.spark.sql.functions`. Should it be called from java as well? Cheers jan - To unsubscribe, e-mail: user-unsubscr...
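
lit is a static method on org.apache.spark.sql.functions, so in Java a static import (import static org.apache.spark.sql.functions.lit;) lets you write df.withColumn("one", lit(1)). The PySpark spelling of the same call, for comparison:

    from pyspark.sql.functions import lit

    # add a constant column named 'one' with value 1
    df = df.withColumn("one", lit(1))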

Re: Jackson-core-asl conflict with Spark

2015-03-12 Thread Paul Brown
So... one solution would be to use a non-Jurassic version of Jackson. 2.6 will drop before too long, and 3.0 is in longer-term planning. The 1.x series is long deprecated. If you're genuinely stuck with something ancient, then you need to include the JAR that contains the class, and 1.9.13 does

Perf impact of BlockManager byte[] copies

2015-02-27 Thread Paul Wais
of this ByteBuffer API is possible and leverage it. Cheers, -Paul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Support for SQL on unions of tables (merge tables?)

2015-01-21 Thread Paul Wais
; On 1/11/15 9:51 PM, Paul Wais wrote: >> >> >> Dear List, >> >> What are common approaches for addressing over a union of tables / RDDs? >> E.g. suppose I have a collection of log files in HDFS, one log file per day, >> and I want to compute the sum of some fi

Re: spark 1.2 three times slower than spark 1.1, why?

2015-01-21 Thread Paul Wais
To force one instance per executor, you could explicitly subclass FlatMapFunction and have it lazy-create your parser in the subclass constructor. You might also want to try RDD#mapPartitions() (instead of RDD#flatMap()) if you want one instance per partition. This approach worked well for me when
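
The same trick carries over to PySpark. A sketch, with a hypothetical Parser that is expensive to construct:

    def parse_partition(records):
        parser = Parser()          # built once per partition, not once per record
        for r in records:
            yield parser.parse(r)

    parsed = rdd.mapPartitions(parse_partition)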

Support for SQL on unions of tables (merge tables?)

2015-01-11 Thread Paul Wais
interactive querying). * Related question: are there plans to use Parquet Index Pages to make Spark SQL faster? E.g. log indices over date ranges would be relevant here. All the best, -Paul
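
In the current API the union shape is straightforward: read the per-day files as one DataFrame and query that. A sketch with hypothetical paths and field names:

    # one Parquet file (or directory) per day, read as a single logical table
    logs = spark.read.parquet("logs/2015-01-09", "logs/2015-01-10", "logs/2015-01-11")
    logs.createOrReplaceTempView("logs")
    totals = spark.sql("SELECT sum(bytes) AS total_bytes FROM logs")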

Re: Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-20 Thread Paul Brown
I would suggest checking out disk IO on the nodes in your cluster and then reading up on the limiting behaviors that accompany different kinds of EC2 storage. Depending on how things are configured for your nodes, you may have a local storage configuration that provides "bursty" IOPS where you get

Using S3 block file system

2014-12-09 Thread Paul Colomiets
t find out how to do it. I use spark 1.2.0rc1 with hadoop 2.4 and Riak CS (instead of S3) if that matters. The s3n:// protocol with same settings work. Thanks. -- Paul - To unsubscribe, e-mail: user-unsubscr...@spark.apac

Re: Parsing a large XML file using Spark

2014-11-21 Thread Paul Brown
Unfortunately, unless you impose restrictions on the XML file (e.g., where namespaces are declared, whether entity replacement is used, etc.), you really can't parse only a piece of it even if you have start/end elements grouped together. If you want to deal effectively (and scalably) with large X

Re: Native / C/C++ code integration

2014-11-11 Thread Paul Wais
More thoughts. I took a deeper look at BlockManager, RDD, and friends. Suppose one wanted to get native code access to un-deserialized blocks. This task looks very hard. An RDD behaves much like a Scala iterator of deserialized values, and interop with BlockManager is all on deserialized data.

Native / C/C++ code integration

2014-11-07 Thread Paul Wais
h JNA. Is there a way to expose raw, in-memory partition/block data to native code? Has anybody else attacked this problem a different way? All the best, -Paul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Native-C-C-code-integration-tp18347.html Sen

[SQL] PERCENTILE is not working

2014-11-05 Thread Kevin Paul
2) Thanks, Kevin Paul - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Do Spark executors restrict native heap vs JVM heap?

2014-11-02 Thread Paul Wais
s also taking memory. > > On Oct 30, 2014 6:43 PM, "Paul Wais" > wrote: >> >> Dear Spark List, >> >> I have a Spark app that runs native code inside map functions. I've >> noticed that the native code sometimes sets errno to ENOMEM indicating

Do Spark executors restrict native heap vs JVM heap?

2014-10-30 Thread Paul Wais
freeMemory() shows gigabytes free and the native code needs only megabytes. Does spark limit the /native/ heap size somehow? Am poking through the executor code now but don't see anything obvious. Best Regards, -Paul Wais - To

SchemaRDD.where clause error

2014-10-21 Thread Kevin Paul
Hi all, I tried to use the function SchemaRDD.where() but got some error:
val people = sqlCtx.sql("select * from people")
people.where('age === 10)
:27: error: value === is not a member of Symbol
where did I go wrong? Thanks, Kevin Paul
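
The === sugar on a Symbol only resolves once the SQL context's implicits are in scope on the Scala side (typically import sqlCtx._ in that era). The equivalent filter in the Python API sidesteps the implicit entirely; a sketch:

    people = sqlCtx.sql("select * from people")
    tens = people.where(people["age"] == 10)   # plain column reference, no Symbol conversion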

Setting SparkSQL configuration

2014-10-13 Thread Kevin Paul
red to set the config using HiveContext's setConf function? Regards, Kelvin Paul

Re: SparkSQL on Hive error

2014-10-13 Thread Kevin Paul
Thanks Michael, your patch works for me :) Regards, Kelvin Paul On Fri, Oct 3, 2014 at 3:52 PM, Michael Armbrust wrote: > Are you running master? There was briefly a regression here that is > hopefully fixed by spark#2635 <https://github.com/apache/spark/pull/2635>. > > On F

Re: Any issues with repartition?

2014-10-08 Thread Paul Wais
Looks like an OOM issue? Have you tried persisting your RDDs to allow disk writes? I've seen a lot of similar crashes in a Spark app that reads from HDFS and does joins. I.e. I've seen "java.io.IOException: Filesystem closed," "Executor lost," "FetchFailed," etc etc with non-deterministic crashe
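
Persisting with a disk-backed storage level is the usual first experiment for that. A sketch:

    from pyspark import StorageLevel

    # let partitions spill to local disk instead of being recomputed or held only in memory
    rdd.persist(StorageLevel.MEMORY_AND_DISK)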

SparkSQL on Hive error

2014-10-03 Thread Kevin Paul
execute(NativeCommand.scala:38) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) Thanks, Kelvin Paul

Worker Random Port

2014-09-23 Thread Paul Magid
the spark-env.sh but it does not seem to stop the dynamic port behavior. I have included the startup output when running spark-shell from the edge server in a different dmz and then from a node in the cluster. Any help greatly appreciated. Paul Magid Toyota Motor Sales IS Enterprise

Re: Unable to find proto buffer class error with RDD

2014-09-19 Thread Paul Wais
Derp, one caveat to my "solution": I guess Spark doesn't use Kryo for Function serde :( On Fri, Sep 19, 2014 at 12:44 AM, Paul Wais wrote: > Well it looks like this is indeed a protobuf issue. Poked a little more > with Kryo. Since protobuf messages are serializable

Re: Unable to find proto buffer class error with RDD

2014-09-19 Thread Paul Wais
Well it looks like this is indeed a protobuf issue. Poked a little more with Kryo. Since protobuf messages are serializable, I tried just making Kryo use the JavaSerializer for my messages. The resulting stack trace made it look like protobuf GeneratedMessageLite is actually using the classloade

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
es the problem): https://github.com/apache/spark/blob/2f9b2bd7844ee8393dc9c319f4fefedf95f5e460/core/src/main/scala/org/apache/spark/rdd/ParallelCollectionRDD.scala#L74 If uber.jar is on the classpath, then the root classloader would have the code, hence why --driver-class-path fixes the bug. On Thu, Sep 18, 201

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
hmm would using Kryo help me here? On Thursday, September 18, 2014, Paul Wais wrote: > Ah, can one NOT create an RDD of any arbitrary Serializable type? It > looks like I might be getting bitten by the same > "java.io.ObjectInputStream uses root class loader only"

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
d3259.html * https://github.com/apache/spark/pull/181 * http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E * https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I On Thu, Sep 18, 2014 at 4:51 PM,

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
ache.org/repos/asf/hadoop/common/branches/branch-2.3.0/hadoop-project/pom.xml On Thu, Sep 18, 2014 at 1:06 AM, Paul Wais wrote: > Dear List, > > I'm writing an application where I have RDDs of protobuf messages. > When I run the app via bin/spark-submit with --master local >

RE: Spark SQL Exception

2014-09-18 Thread Paul Magid
identical keys in the input tuples.) SPARK-2926 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle The Exception is included below. Paul Magid Toyota Motor Sales IS Enterprise Architecture (EA) Architect I R&D Ph: 310-468-9091 (X69091) PCN 1C2970, Mail Drop PN12 Excep

Spark SQL Exception

2014-09-18 Thread Paul Magid
is there a document that lists current Spark SQL limitations/issues? Paul Magid Toyota Motor Sales IS Enterprise Architecture (EA) Architect I R&D Ph: 310-468-9091 (X69091) PCN 1C2970, Mail Drop PN12 Successful Re

Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
ark://my.master:7077 ) ? I've tried poking through the shell scripts and SparkSubmit.scala and unfortunately I haven't been able to grok exactly what Spark is doing with the remote/local JVMs. Cheers, -Paul - To

Re: Stable spark streaming app

2014-09-17 Thread Paul Wais
Thanks Tim, this is super helpful! Question about jars and spark-submit: why do you provide myawesomeapp.jar as the program jar but then include other jars via the --jars argument? Have you tried building one uber jar with all dependencies and just sending that to Spark as your app jar? Also, h

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
mib13.cloudfront.net/spark-1.1.0-bin-hadoop2.3.tgz pom.xml snippets: https://gist.github.com/ypwais/ff188611d4806aa05ed9 [1] http://stackoverflow.com/questions/24747037/how-to-define-a-dependency-scope-in-maven-to-include-a-library-in-compile-run Thanks everybody!! -Paul On Tue, Sep 16, 2014 at 3:

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
eSize=512m" mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package and hadoop 2.3 / cdh5 from http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.0.tar.gz On Mon, Sep 15, 2014 at 6:49 PM, Christian Chua wrote: > Hi Paul. > > I would recommend building you

Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-15 Thread Paul Wais
distro of hadoop is used at Databricks? Are there distros of Spark 1.1 and hadoop that should work together out-of-the-box? (Previously I had Spark 1.0.0 and Hadoop 2.3 working fine..) Thanks for any help anybody can give me here! -Paul --

Re: increase parallelism of reading from hdfs

2014-08-11 Thread Paul Hamilton
NewHadoopRDD. I am sure there is some way to use it with convenience methods like SparkContext.textFile, you could probably set the system property "mapreduce.input.fileinputformat.split.maxsize". Regards, Paul Hamilton From: Chen Song Date: Friday, August 8, 2014 at 9:13 PM
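
From PySpark the same Hadoop setting is reachable through the JVM gateway; _jsc is a private handle, so treat this as a hedged sketch:

    # cap input splits at ~256 MB so big files fan out into more partitions
    sc._jsc.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.split.maxsize", "268435456")
    rdd = sc.textFile("hdfs:///data/big/")   # hypothetical path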

Re: How to read a multipart s3 file?

2014-08-07 Thread paul
In this case any file larger than 256,000,000 bytes is split. If you don't explicitly set it the limit is infinite which leads to the behavior you are seeing where it is 1 split per file. Regards, Paul Hamilton -- View this message in context: http://apache-spark-user-list.1001560.n3.nab

Re: Release date for new pyspark

2014-07-17 Thread Paul Wais
ld and pass tests on Jenkins. >> >> You shouldn't expect new features to be added to stable code in >> maintenance releases (e.g. 1.0.1). >> >> AFAIK, we're still on track with Spark 1.1.0 development, which means that >> it should be released sometime in

Release date for new pyspark

2014-07-16 Thread Paul Wais
gards, -Paul Wais

Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Paul Brown
We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.) — p...@mult.ifario.us | Multi

Re: jackson-core-asl jar (1.8.8 vs 1.9.x) conflict with the spark-sql (version 1.x)

2014-06-27 Thread Paul Brown
Hi, Mans -- Both of those versions of Jackson are pretty ancient. Do you know which of the Spark dependencies is pulling them in? It would be good for us (the Jackson, Woodstox, etc., folks) to see if we can get people to upgrade to more recent versions of Jackson. -- Paul — p

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-25 Thread Paul Brown
Hi, Robert -- I wonder if this is an instance of SPARK-2075: https://issues.apache.org/jira/browse/SPARK-2075 -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Wed, Jun 25, 2014 at 6:28 AM, Robert James wrote: > On 6/24/14, Robert James wrote: > > My

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Paul Brown
jar not reporting the files. Also, the classes do get correctly packaged into the uberjar: unzip -l /target/[deleted]-driver.jar | grep 'rdd/RDD' | grep 'saveAs' 1519 06-08-14 12:05 org/apache/spark/rdd/RDD$anonfun$saveAsTextFile$1.class 1560 06-08-14 12:05 org/ap

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-08 Thread Paul Brown
entirely different* artifacts (spark-core-h1, spark-core-h2). Logged as SPARK-2075 <https://issues.apache.org/jira/browse/SPARK-2075>. Cheers. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Jun 6, 2014 at 2:45 AM, HenriV wrote: > I

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Paul Brown
Hi, Adrian -- If my memory serves, you need 1.7.7 of the various slf4j modules to avoid that issue. Best. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Mon, May 12, 2014 at 7:51 AM, Adrian Mocanu wrote: > Hey guys, > > I've asked before, in Spa

Unexpected results when caching data

2014-05-12 Thread paul
sted: 2014050917: 7 2014050918: 42 Persisted: 2014050917: 7 2014050918: 12 Any idea what could account for the differences? BTW I am using Spark 0.9.1. Thanks, Paul -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unexpected-results-when-caching-dat

Re: Packaging a spark job using maven

2014-05-12 Thread Paul Brown
Hi, Laurent -- That's the way we package our Spark jobs (i.e., with Maven). You'll need something like this: https://gist.github.com/prb/d776a47bd164f704eecb That packages separate driver (which you can run with java -jar ...) and worker JAR files. Cheers. -- Paul — p...@mult
