RE: Apache Spark Operator for Kubernetes?

2022-10-28 Thread Jim Halfpenny
Hi Clayton, I’m not aware of an official Apache operator, but I can recommend taking a look at the one we’ve created at Stackable: https://github.com/stackabletech/spark-k8s-operator It’s actively maintained and we’d be happy to receive feedback if you have feature requests. Kind regards, Jim

Re: Writing files to s3 with out temporary directory

2017-11-21 Thread Jim Carroll
I got it working. It's much faster. If someone else wants to try it I: 1) Was already using the code from the Presto S3 Hadoop FileSystem implementation modified to sever it from the rest of the Presto codebase. 2) I extended it and overrode the method "keyFromPath" so that anytime the Path referr

Re: Writing files to s3 with out temporary directory

2017-11-21 Thread Jim Carroll
It's not actually that tough. We already use a custom Hadoop FileSystem for S3 because when we started using Spark with S3 the native FileSystem was very unreliable. Our's is based on the code from Presto. (see https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/pr

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread Jim Carroll
Thanks. In the meantime I might just write a custom file system that maps writes to parquet file parts to their final locations and then skips the move. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To u

Re: Writing files to s3 with out temporary directory

2017-11-20 Thread Jim Carroll
I have this exact issue. I was going to intercept the call in the filesystem if I had to (since we're using the S3 filesystem from Presto anyway) but if there's simply a way to do this correctly I'd much prefer it. This basically doubles the time to write parquet files to s3. -- Sent from: http:
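
The fix this thread eventually settled on (see the replies above) was a custom Presto-based S3 FileSystem. A commonly suggested lighter-weight mitigation is the Hadoop "algorithm version 2" output committer, which has tasks commit their output directly instead of renaming it out of a temporary directory at job commit. A rough sketch, not the thread's solution; the bucket and prefix are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3-direct-commit-sketch")
      // Hadoop setting: task commits move files straight to the destination,
      // so the job-commit rename pass over the temporary directory is skipped.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // v2 commits are not atomic per job; keep speculation off to avoid duplicate files.
      .config("spark.speculation", "false")
      .getOrCreate()

    spark.range(1000).write.parquet("s3a://some-bucket/some-prefix/")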

Re: best spark spatial lib?

2017-10-10 Thread Jim Hughes
types and functions. Cheers, Jim On 10/10/2017 11:30 AM, Georg Heiler wrote: What about someting like gromesa? Anastasios Zouzias mailto:zouz...@gmail.com>> schrieb am Di. 10. Okt. 2017 um 15:29: Hi, Which spatial operations do you require exactly? Also, I don't foll

Spark on mesos in docker not getting parameters

2016-08-09 Thread Jim Carroll
. I "fixed" the classpath problem by putting my jar in /opt/spark/jars (/opt/spark is the location I have spark installed in the docker container). Can someone tell me what I'm missing? Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble

Re: How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Jim O'Flaherty
Nevermind, it literally just appeared right after I posted this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-download-2-0-The-main-download-page-isn-t-showing-it-tp27420p27421.html Sent from the Apache Spark User List mailing list archive at Nab

How do I download 2.0? The main download page isn't showing it?

2016-07-27 Thread Jim O'Flaherty
How do I download 2.0? The main download page isn't showing it? And all the other download links point to the same single download page. This is the one I end up at: http://spark.apache.org/downloads.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-

Non-classification neural networks

2016-03-27 Thread Jim Carroll
is continuous and you want to simply do prediction? Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Non-classification-neural-networks-tp26604.html Sent from the Apache Spark User List mail

Re: Spark 1.5.2 memory error

2016-02-02 Thread Jim Green
Look at part #3 in the blog below: http://www.openkb.info/2015/06/resource-allocation-configurations-for.html You may want to increase the executor memory, not just spark.yarn.executor.memoryOverhead. On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov wrote: > For the memoryOverhead I have the def
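
A minimal sketch of that suggestion; the values are illustrative, not from the thread, and should be sized to your YARN container limits:

    import org.apache.spark.{SparkConf, SparkContext}

    // yarn.scheduler.maximum-allocation-mb must cover executor memory + overhead.
    val conf = new SparkConf()
      .setAppName("yarn-memory-sketch")
      .set("spark.executor.memory", "4g")               // raise the executor heap itself ...
      .set("spark.yarn.executor.memoryOverhead", "768") // ... not only the off-heap overhead (MB)
    val sc = new SparkContext(conf)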

Re: spark.kryo.classesToRegister

2016-01-28 Thread Jim Lohse
You are only required to add classes to Kryo (compulsorily) if you use a specific setting: //require registration of all classes with Kryo .set("spark.kryo.registrationRequired","true") Here's an example of my setup; I think this is the best approach because it forces me to really think about
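
A minimal sketch of that setup, assuming hypothetical domain classes; with registrationRequired set to true, any unregistered class (including array types) fails fast instead of being silently serialized with its full class name:

    import org.apache.spark.SparkConf

    // Hypothetical stand-ins for whatever your job actually serializes.
    case class Click(userId: Long, url: String)
    case class Session(id: String, clicks: Array[Click])

    val conf = new SparkConf()
      .setAppName("kryo-registration-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail fast on anything not registered with Kryo:
      .set("spark.kryo.registrationRequired", "true")
      // Register the classes (array types count as separate classes):
      .registerKryoClasses(Array(
        classOf[Click],
        classOf[Session],
        classOf[Array[Click]]
      ))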

Contrib to Docs: Re: SparkContext SyntaxError: invalid syntax

2016-01-18 Thread Jim Lohse
I don't think you have to build the docs; just fork them on GitHub and submit the pull request? What I have been able to do is submit a pull request just by editing the markdown file; I am just confused whether I am supposed to merge it myself or wait for notification and/or wait for someone else to mer

Re: Fwd: how to submit multiple jar files when using spark-submit script in shell?

2016-01-12 Thread Jim Lohse
Thanks for your answer, you are correct, it's just a different approach than the one I am asking for :) Building an uber- or assembly- jar goes against the idea of placing the jars on all workers. Uber-jars increase network traffic, using local:/ in the classpath reduces network traffic. Eve

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Jim Lohse
Hey Python 2.6 don't let the door hit you on the way out! haha Drop It No Problem On 01/05/2016 12:17 AM, Reynold Xin wrote: Does anybody here care about us dropping support for Python 2.6 in Spark 2.0? Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json parsing) when compar

Re: feedback on the use of Spark’s gateway hidden REST API (standalone cluster mode) for application submission

2016-01-02 Thread Jim Lohse
#rest-api but the "REST" of the story was off my radar screen, pardon the pun :) Jim On 01/02/2016 09:14 AM, HILEM Youcef wrote: Happy new year. I solicit community feedback on the use of Spark’s gateway hidden REST API (standalone cluster mode) for application submission. We a

Re: Dynamic jar loading

2015-12-18 Thread Jim Lohse
I am going to say no, but have not actually tested this. Just going on this line in the docs: http://spark.apache.org/docs/latest/configuration.html spark.driver.extraClassPath (default: none): Extra classpath entries to prepend to the classpath of the driver. Note: In client mode, this config mus

Re: Pyspark submitted app just hangs

2015-12-02 Thread Jim Lohse
Is this the stderr output from a worker? Are any files being written? Can you run in debug and see how far it's getting? This doesn't give me a direction to look in without the actual logs from $SPARK_HOME or the stderr from the worker UI. Just imho; maybe someone knows what this means, but it

Confirm this won't parallelize/partition?

2015-11-28 Thread Jim Lohse
List(); // is Spark good at spreading a for loop like this? for (int i = 0; i ; i++) { returnRDD.add(pld.doDrop()); } return returnRDD; } }); On 11/27/2015 4:18 PM, Jim wrote: Hello there, (part of my problem is docs that say "undocumented" on parallelize <https://spark.apache.org/docs/

Give parallelize a dummy Arraylist length N to control RDD size?

2015-11-27 Thread Jim
Hello there, (part of my problem is docs that say "undocumented" on parallelize leave me reading books for examples that don't always pertain

localhost webui port

2015-10-13 Thread Langston, Jim
Hi all, Is there any way to change the default port 4040 for the localhost webUI? Unfortunately, that port is blocked and I have no control over that. I have not found any configuration parameter that would enable me to change it. Thanks, Jim
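
The driver UI port is controlled by spark.ui.port; a short sketch (8124 is an arbitrary replacement port):

    import org.apache.spark.{SparkConf, SparkContext}

    // Move the driver web UI off the default 4040. If the chosen port is busy,
    // Spark retries successive ports (bounded by spark.port.maxRetries).
    val conf = new SparkConf()
      .setAppName("custom-ui-port")
      .set("spark.ui.port", "8124")
    val sc = new SparkContext(conf)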

Array column stored as “.bag” in parquet file instead of “REPEATED INT64"

2015-08-27 Thread Jim Green
Hi Team, Say I have a test.json file: {"c1":[1,2,3]} I can create a parquet file like: var df = sqlContext.load("/tmp/test.json","json") var df_c = df.repartition(1) df_c.select("*").save("/tmp/testjson_spark","parquet") The output parquet file’s schema is like: c1: OPTIONAL F:1 .bag:

Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-05 Thread Jim Green
: > ConnectionManager has been deprecated and is no longer used by default > (NettyBlockTransferService is the replacement). Hopefully you would no > longer see these messages unless you have explicitly flipped it back on. > > On Tue, Aug 4, 2015 at 6:14 PM, Jim Green wrote: >

Re: Is SPARK-3322 fixed in latest version of Spark?

2015-08-04 Thread Jim Green
And also https://issues.apache.org/jira/browse/SPARK-3106 This one is still open. On Tue, Aug 4, 2015 at 6:12 PM, Jim Green wrote: > *Symptom:* > Even sample job fails: > $ MASTER=spark://xxx:7077 run-example org.apache.spark.examples.SparkPi 10 > Pi is roughly 3.14

Is SPARK-3322 fixed in latest version of Spark?

2015-08-04 Thread Jim Green
*Symptom:* Even sample job fails: $ MASTER=spark://xxx:7077 run-example org.apache.spark.examples.SparkPi 10 Pi is roughly 3.140636 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(xxx,) not found WARN ConnectionManager: All connections not cleaned up Found https

Re: EOFException using KryoSerializer

2015-06-24 Thread Jim Carroll
I finally got back to this and I just wanted to let anyone that runs into this know that the problem is a kryo version issue. Spark (at least 1.4.0) depends on Kryo 2.21 while my client had 2.24.0 on the classpath. Changing it to 2.21 fixed the problem. -- View this message in context: http://a

Resource allocation configurations for Spark on Yarn

2015-06-12 Thread Jim Green
Hi Team, Sharing one article which summarize the Resource allocation configurations for Spark on Yarn: Resource allocation configurations for Spark on Yarn -- Thanks, www.openkb.info (Open KnowledgeBase for Hadoop/Datab

EOFException using KryoSerializer

2015-05-19 Thread Jim Carroll
hat TorrentBroadcast expects. Any suggestions? Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-using-KryoSerializer-tp22948.html Sent from the Apache Spark User List mailing list a

Re: 1.3 Hadoop File System problem

2015-03-25 Thread Jim Carroll
Thanks Patrick and Michael for your responses. For anyone else that runs across this problem prior to 1.3.1 being released, I've been pointed to this Jira ticket that's scheduled for 1.3.1: https://issues.apache.org/jira/browse/SPARK-6351 Thanks again. -- View this message in context: http:

1.3 Hadoop File System problem

2015-03-24 Thread Jim Carroll
seen this? I'm running spark using "local[8]" so it's all internal. These are actually unit tests in our app that are failing now. Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/1-3-Hadoop-File-System-problem-tp22207.html S

Re: Mailing list schizophrenia?

2015-03-20 Thread Jim Kleckner
ed Yu wrote: > Jim: > I can find the example message here: > http://search-hadoop.com/m/JW1q5zP54J1 > > On Fri, Mar 20, 2015 at 12:29 PM, Jim Kleckner > wrote: > >> I notice that some people send messages directly to user@spark.apache.org >> and some via nabble

Mailing list schizophrenia?

2015-03-20 Thread Jim Kleckner
@spark.apache.org to see how that behaves. Jim

Reliable method/tips to solve dependency issues?

2015-03-19 Thread Jim Kleckner
Do people have a reliable/repeatable method for solving dependency issues or tips? The current world of spark-hadoop-hbase-parquet-... is very challenging given the huge footprint of dependent packages and we may be pushing against the limits of how many packages can be combined into one environme

Re: Spark excludes "fastutil" dependencies we need

2015-02-27 Thread Jim Kleckner
. On Thu, Feb 26, 2015 at 9:54 AM, Marcelo Vanzin wrote: > On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner > wrote: > > So, should the userClassPathFirst flag work and there is a bug? > > Sorry for jumping in the middle of conversation (and probably missing > some of it), but

Re: Fwd: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Jim Kleckner
I created an issue and pull request. Discussion can continue there: https://issues.apache.org/jira/browse/SPARK-6029 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Spark-excludes-fastutil-dependencies-we-need-tp21812p21814.html Sent from the Apache Spa

Fwd: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Jim Kleckner
Forwarding conversation below that didn't make it to the list. -- Forwarded message -- From: Jim Kleckner Date: Wed, Feb 25, 2015 at 8:42 PM Subject: Re: Spark excludes "fastutil" dependencies we need To: Ted Yu Cc: Sean Owen , user Inline On Wed, Feb 25,

Re: Spark excludes "fastutil" dependencies we need

2015-02-25 Thread Jim Kleckner
ld the userClassPathFirst flag work and there is a bug? Or is it reasonable to put in a pull request for the elimination of the exclusion? > >> On Wed, Feb 25, 2015 at 5:34 AM, Ted Yu wrote: >> > bq. depend on missing fastutil classes like Long2LongOpenHashMap >> > >

Spark excludes "fastutil" dependencies we need

2015-02-24 Thread Jim Kleckner
Spark includes the clearspring analytics package but intentionally excludes the dependencies of the fastutil package (see below). Spark includes parquet-column which includes fastutil and relocates it under parquet/ but creates a shaded jar file which is incomplete because it shades out some of t

Re: New combination-like RDD based on two RDDs

2015-02-04 Thread Jim Green
You should use join: val rdd1 = sc.parallelize(List((1,(3)), (2,(5)), (3,(6 val rdd2 = sc.parallelize(List((2,(1)), (2,(3)), (3,(9 rdd1.join(rdd2).collect res0: Array[(Int, (Int, Int))] = Array((2,(5,1)), (2,(5,3)), (3,(6,9))) Please see my cheat sheet at * 3.14 join(otherDataset, [numTas
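
A runnable version of the snippet above (the archive preview drops the closing parentheses; output ordering may differ):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("join-example").setMaster("local[*]"))

    val rdd1 = sc.parallelize(List((1, 3), (2, 5), (3, 6)))
    val rdd2 = sc.parallelize(List((2, 1), (2, 3), (3, 9)))

    // Inner join on the key: only keys present in both RDDs survive,
    // and every pairing of matching values is emitted.
    rdd1.join(rdd2).collect()
    // Array((2,(5,1)), (2,(5,3)), (3,(6,9)))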

Scala on Spark functions examples cheatsheet.

2015-02-02 Thread Jim Green
Hi Team, I just spent some time these 2 weeks on Scala and tried all Scala on Spark functions in the Spark Programming Guide . If you need example codes of Scala on Spark functions, I created this cheat sheet

Spark impersonation

2015-02-02 Thread Jim Green
Hi Team, Does Spark support impersonation? For example, when Spark runs on YARN/Hive/HBase/etc., which user is used by default? The user which starts the Spark job? Any suggestions related to impersonation? -- Thanks, www.openkb.info (Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)

Re: Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-28 Thread Jim Green
Thanks to everyone for responding. I finally figured out how to bulk load into HBase using Scala on Spark. The sample code is here, which others can refer to in future: http://www.openkb.info/2015/01/how-to-use-scala-on-spark-to-load-data.html Thanks! On Tue, Jan 27, 2015 at 6:27 PM, Jim Green wrote
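
The linked post has the full code; the gist of the approach (HFileOutputFormat expects KeyValue values, not Put, and row keys must arrive sorted) looks roughly like the sketch below. Class names follow the HBase 0.94-era API discussed in this thread, the family/path names are placeholders, and the generated HFiles still need to be handed to completebulkload afterwards:

    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hbase-bulkload-sketch"))
    val hbaseConf = HBaseConfiguration.create()

    val hfileRdd = sc.parallelize(1 to 100)
      .map(i => f"row$i%05d")
      .sortBy(identity)                      // HFileOutputFormat requires sorted row keys
      .map { rowKey =>
        val kv = new KeyValue(Bytes.toBytes(rowKey),       // row
                              Bytes.toBytes("cf1"),        // column family (placeholder)
                              Bytes.toBytes("c1"),         // qualifier
                              Bytes.toBytes("value_xxx"))  // cell value
        (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), kv)
      }

    // Writes HFiles under /tmp/hfiles-t1 on HDFS rather than going through the region servers.
    hfileRdd.saveAsNewAPIHadoopFile("/tmp/hfiles-t1",
      classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], hbaseConf)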

Parquet divide by zero

2015-01-28 Thread Jim Carroll
ved it? Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-divide-by-zero-tp21406.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsu

Re: Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Jim Green
Thanks Sun. My understanding is that saveAsNewAPIHadoopFile saves HFiles on HDFS. Is it doable to use saveAsNewAPIHadoopDataset to load directly into HBase? If so, is there any sample code for that? Thanks! On Tue, Jan 27, 2015 at 6:07 PM, fightf...@163.com wrote: > Hi, Jim > Your gen

Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Jim Green
tBytes(), "c1".getBytes(), ("value_xxx").getBytes()) (new ImmutableBytesWritable(Bytes.toBytes(x)), put) }) rdd.saveAsNewAPIHadoopFile("/tmp/13", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], conf) On Tue, Jan 27, 2015 at 12:17 PM,

Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Jim Green
public void write(ImmutableBytesWritable row, KeyValue kv) > > Meaning, KeyValue is expected, not Put. > > On Tue, Jan 27, 2015 at 10:54 AM, Jim Green wrote: > >> Hi Team, >> >> I need some help on writing a scala to bulk load some data into hbase. >> *Env:* &

Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Jim Green
Hi Team, I need some help on writing a scala to bulk load some data into hbase. *Env:* hbase 0.94 spark-1.0.2 I am trying below code to just bulk load some data into hbase table “t1”. import org.apache.spark._ import org.apache.spark.rdd.NewHadoopRDD import org.apache.hadoop.hbase.{HBaseConfigur

Cluster Aware Custom RDD

2015-01-16 Thread Jim Carroll
Hello all, I have a custom RDD for fast loading of data from a non-partitioned source. The partitioning happens in the RDD implementation by pushing data from the source into queues picked up by the current active partitions in worker threads. This works great on a multi-threaded single host (say

Re: No disk single pass RDD aggregation

2014-12-18 Thread Jim Carroll
it took me the better part of a day to find and several misguided emails to this list (including the one that started this thread). Sorry about that. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20763.htm

Re: How do I stop the automatic partitioning of my RDD?

2014-12-16 Thread Jim Carroll
Wow. I just realized what was happening and it's all my fault. I have a library method that I wrote that presents the RDD and I was actually repartitioning it myself. I feel pretty dumb. Sorry about that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-

How do I stop the automatic partitioning of my RDD?

2014-12-16 Thread Jim Carroll
I've been trying to figure out how to use Spark to do a simple aggregation without repartitioning and essentially creating fully instantiated intermediate RDDs, and it seems virtually impossible. I've now gone as far as writing my own single-partition RDD that wraps an Iterator[String] and calling ag

Re: Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Jim
equence,1) and THIS results in a call to "toArray" on my sequence. Since this won't fit on disk it certainly won't fit in an array in memory. Thanks for taking the time to respond. Jim On 12/16/2014 04:57 PM, Harry Brundage wrote: Are you certain that's happening Jim? Why? What happens if

Spark handling of a file://xxxx.gz Uri

2014-12-16 Thread Jim Carroll
Is there a way to get Spark to NOT repartition/shuffle/expand a sc.textFile(fileUri) when the URI is a gzipped file? Expanding a gzipped file should be thought of as a "transformation" and not an "action" (if the analogy is apt). There is no need to fully create and fill out an intermediate RDD wit

Re: No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
Nvm. I'm going to post another question since this has to do with the way spark handles sc.textFile with a file://.gz -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20725.html Sent from the Apache Spark User L

Re: No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
e comboOp doesn't even need to be called (since there would be nothing to combine across partitions). Thanks for any help. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/No-disk-single-pass-RDD-aggregation-tp20723p20724.html Sent from the A

No disk single pass RDD aggregation

2014-12-16 Thread Jim Carroll
(StorageLevel.NONE) and it had no effect. When I run locally I get my /tmp directory filled with transient RDD data even though I never need the data again after the row's been processed. Is there a way to turn this off? Thanks Jim -- View this message in context: http://apache-spark-user-list.10

Re: Standard SQL tool access to SchemaRDD

2014-12-02 Thread Jim Carroll
Thanks! I'll give it a try. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Standard-SQL-tool-access-to-SchemaRDD-tp20197p20202.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Standard SQL tool access to SchemaRDD

2014-12-02 Thread Jim Carroll
Hello all, Is there a way to load an RDD in a small driver app and connect with a JDBC client and issue SQL queries against it? It seems the thrift server only works with pre-existing Hive tables. Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
Jira: https://issues.apache.org/jira/browse/SPARK-4412 PR: https://github.com/apache/spark/pull/3271 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-turn-off-Parquet-logging-in-a-worker-tp18955p18977.html Sent from the Apache Spark User List maili

Re: Set worker log configuration when running "local[n]"

2014-11-14 Thread Jim Carroll
Just to be complete, this is a problem in Spark that I worked around and detailed here: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-turn-off-Parquet-logging-in-a-worker-td18955.html -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-worker-

Re: How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
This is a problem because of a bug in Spark in the current master (beyond the fact that Parquet uses java.util.logging). ParquetRelation.scala attempts to override the parquet logger but, at least currently (and if your application simply reads a parquet file before it does anything else with

How do I turn off Parquet logging in a worker?

2014-11-14 Thread Jim Carroll
I'm running a local spark master ("local[n]"). I cannot seem to turn off the parquet logging. I tried: 1) Setting a log4j.properties on the classpath. 2) Setting a log4j.properties file in a spark install conf directory and pointing to the install using setSparkHome 3) Editing the log4j-default.
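
The Spark-side fix is tracked in the Jira/PR linked in the reply above. As a local workaround (a sketch, not necessarily the fix that was merged), parquet's java.util.logging output can be silenced programmatically before the first Parquet read or write on each JVM:

    import java.util.logging.{Level, Logger}

    object ParquetLogMuter {
      // Hold a reference so the JUL logger is not garbage-collected
      // and re-created with default handlers later.
      private val parquetLogger: Logger = Logger.getLogger("parquet")

      def mute(): Unit = {
        parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
        parquetLogger.setLevel(Level.SEVERE)
        parquetLogger.setUseParentHandlers(false)
      }
    }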

Re: Set worker log configuration when running "local[n]"

2014-11-14 Thread Jim Carroll
Actually, it looks like it's Parquet logging that I don't have control over. For some reason the parquet project decided to use java.util.logging with its own logging configuration. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-worker-log-configuratio

Set worker log configuration when running "local[n]"

2014-11-14 Thread Jim Carroll
't work either. I'm running with the current master. Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Set-worker-log-configuration-when-running-local-n-tp18948.html Sent from the Apache Spark User List mail

Re: Wildly varying "aggregate" performance depending on code location

2014-11-12 Thread Jim Carroll
Well it looks like this is a scala problem after all. I loaded the file using pure scala and ran the exact same Processors without Spark and I got 20 seconds (with the code in the same file as the 'main') vs 30 seconds (with the exact same code in a different file) on the 500K rows. -- View thi

Wildly varying "aggregate" performance depending on code location

2014-11-12 Thread Jim Carroll
Hello all, I have a really strange thing going on. I have a test data set with 500K lines in a gzipped csv file. I have an array of "column processors," one for each column in the dataset. A Processor tracks aggregate state and has a method "process(v : String)" I'm calling: val processors:
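
The call being timed looks roughly like the sketch below; the Processor body here is hypothetical (the real one tracks per-column aggregate state), and the file path and column count are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical per-column processor: accumulates state and can merge with another instance.
    class Processor extends Serializable {
      private var count = 0L
      private var totalLen = 0L
      def process(v: String): Unit = { count += 1; totalLen += v.length }
      def merge(other: Processor): Processor = {
        count += other.count; totalLen += other.totalLen; this
      }
    }

    val sc = new SparkContext(new SparkConf().setAppName("column-aggregate").setMaster("local[*]"))
    val lines = sc.textFile("/tmp/test-500k.csv.gz")   // placeholder path
    val numColumns = 10                                // placeholder column count

    val processors: Array[Processor] = lines.aggregate(Array.fill(numColumns)(new Processor))(
      // seqOp: feed each column of a row into its processor (runs inside each partition)
      (procs, line) => { line.split(',').zip(procs).foreach { case (v, p) => p.process(v) }; procs },
      // combOp: merge the per-partition processor arrays
      (a, b) => a.zip(b).map { case (x, y) => x.merge(y) }
    )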

Ending a job early

2014-10-28 Thread Jim Carroll
alue. Within a partition, or better yet, within a worker across 'reduce' steps, is there a way to stop all of the aggregations and just continue on with reduces of already processed data? Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Ending

Bug in JettyUtils?

2014-09-24 Thread Jim Donahue
This method will return null in such implementations if this class was loaded by the bootstrap class loader." Unfortunately, the code in JettyUtils isn't prepared to handle this case. :( Jim Donahue

Re: Network requirements between Driver, Master, and Slave

2014-09-12 Thread Jim Carroll
would be all that's required. Can you describe what it is about the architecture that requires the master to connect back to the driver even when the driver initiates a connection to the master? Just curious. Thanks anyway. Jim -- View this message in context: http://apache-spark-user

Network requirements between Driver, Master, and Slave

2014-09-11 Thread Jim Carroll
imeout when any job is run. The master eventually removes the driver's job. I supposed this makes sense if there's a requirement for either the worker or the master to be on the same network as the driver. Is that the case? Thanks Jim -- View this message in context: http://apache-s

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
ses I rebuilt from source using the same codebase on my machine and moved the entire project into /root/spark (since to run the spark-shell it needs to match the same path as the install on ec2). Could I have missed something here? Thanks. Jim -- View this message in context: http://apache-spark-user

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
vity if it's reading the slice headers but the job is exiting anyway. So if this fixes the problem the measure of "activity" seems wrong. Ian and Manu, thanks for your help. I'll post back and let you know if that fixes it. Jim -- View this message in context: http://apac

Re: Querying a parquet file in s3 with an ec2 install

2014-09-09 Thread Jim Carroll
It just takes a while. I can try copying it to HDFS first but that wont help longer term. Thanks Jim - Manu's response: - I am pretty sure it is due to the number of parts you have.. I have a parquet data

Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Jim Carroll
f this exception is from the spark-shell jvm or transferred over from the master or a worker through the master. Any help would be greatly appreciated. Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Querying-a-parquet-file-in-s3-with-an-

Re: Using Spark to add data to an existing Parquet file without a schema

2014-09-04 Thread Jim Carroll
les into the same parquet directory and managing the file names externally but this seems like a work around. Is that the way others are doing it? Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-to-add-data-to-an-existing-Parquet-file-witho

Using Spark to add data to an existing Parquet file without a schema

2014-09-04 Thread Jim Carroll
rk. All of the examples assume a schema exists in the form of a serialization IDL and generated classes. I looked into the code and considered direct use of InsertIntoParquetTable or a copy of it but I was hoping someone already solved the problem. Any guidance would be greatly appreciated.

Re: Compiling SNAPSHOT

2014-08-14 Thread Jim Blomo
Tracked this down to incompatibility with Scala and encryptfs. Resolved by compiling in a directory not mounted with encryption (eg /tmp). On Thu, Aug 14, 2014 at 3:25 PM, Jim Blomo wrote: > Hi, I'm having trouble compiling a snapshot, any advice would be > appreciated. I get the

Compiling SNAPSHOT

2014-08-14 Thread Jim Blomo
Hi, I'm having trouble compiling a snapshot, any advice would be appreciated. I get the error below when compiling either master or branch-1.1. The key error is, I believe, "[ERROR] File name too long" but I don't understand what it is referring to. Thanks! ./make-distribution.sh --tgz --skip-

Re: No space left on device

2014-08-09 Thread Jim Donahue
Root partitions on AWS instances tend to be small (for example, an m1.large instance has 2 420 GB drives, but only a 10 GB root partition). Matei's probably right on about this - just need to be careful where things like the logs get stored. From: Matei Zaharia mailto:matei.zaha...@gmail.com>>

RE: Save an RDD to a SQL Database

2014-08-07 Thread Jim Donahue
.youtube.com/watch?v=C7gWtxelYNM&feature=youtu.be. Jim Donahue Adobe -Original Message- From: Ron Gonzalez [mailto:zlgonza...@yahoo.com.INVALID] Sent: Wednesday, August 06, 2014 7:18 AM To: Vida Ha Cc: u...@spark.incubator.apache.org Subject: Re: Save an RDD to a SQL Database Hi Vida,

Re: Command exited with code 137

2014-06-13 Thread Jim Blomo
I've seen these caused by the OOM killer. I recommend checking /var/log/syslog to see if it was activated due to lack of system memory. On Thu, Jun 12, 2014 at 11:45 PM, libl <271592...@qq.com> wrote: > I use standalone mode to submit tasks. But often I get an error. The stacktrace is as > > 2014-06-12 11

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
That worked amazingly well, thank you Matei! Numbers that worked for me were 400 for the textFile()s, 1500 for the join()s. On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia wrote: > Hey Jim, unfortunately external spilling is not implemented in Python right > now. While it would be possi

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
Should add that I had to tweak the numbers a bit to keep above swap threshold, but below the "Too many open files" error (`ulimit -n` is 32768). On Wed, May 14, 2014 at 10:47 AM, Jim Blomo wrote: > That worked amazingly well, thank you Matei! Numbers that worked for > me

Re: pySpark memory usage

2014-05-12 Thread Jim Blomo
g IOException > altogether. Long live exception propagation! > > > On Mon, Apr 28, 2014 at 1:28 PM, Patrick Wendell wrote: >> >> Hey Jim, >> >> This IOException thing is a general issue that we need to fix and your >> observation is spot-in.

Re: pySpark memory usage

2014-04-28 Thread Jim Blomo
ore elegant way to do this in Scala, though? Let me know if I should open a ticket or discuss this on the developers list instead. Best, Jim diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala b/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala index 0d71fdb.

Finding bad data

2014-04-24 Thread Jim Blomo
I'm using PySpark to load some data and getting an error while parsing it. Is it possible to find the source file and line of the bad data? I imagine that this would be extremely tricky when dealing with multiple derived RDDs, so an answer with the caveat of "this only works when running .map() o

stdout in workers

2014-04-21 Thread Jim Carroll
e to another list posting where the user thought his driver print statement should have printed on the worker. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stdout-in-workers-tp4537.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Continuously running non-streaming jobs

2014-04-17 Thread Jim Carroll
Daniel, I'm new to Spark but I thought that thread hinted at the right answer. Thanks, Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Continuously-running-non-streaming-jobs-tp4391p4397.html Sent from the Apache Spark User List mailing list ar

Continuously running non-streaming jobs

2014-04-17 Thread Jim Carroll
lization-performance-of-Spark-standalone-mode-vs-YARN-td2016.html Thanks. Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Continuously-running-non-streaming-jobs-tp4391.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark - ready for prime time?

2014-04-13 Thread Jim Blomo
On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash wrote: > The biggest issue I've come across is that the cluster is somewhat unstable > when under memory pressure. Meaning that if you attempt to persist an RDD > that's too big for memory, even with MEMORY_AND_DISK, you'll often still get > OOMs. I h

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
'd like to reproduce and fix this. > > Matei > > On Apr 9, 2014, at 3:52 PM, Jim Blomo wrote: > >> Hi Matei, thanks for working with me to find these issues. >> >> To summarize, the issues I've seen are: >> 0.9.0: >> - https://issues.apache.org/jira/b

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
might happen only if records are > large, or something like that. How much heap are you giving to your > executors, and does it show that much in the web UI? > > Matei > > On Mar 29, 2014, at 10:44 PM, Jim Blomo wrote: > >> I think the problem I ran into in 0.9 is c

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
, Mar 29, 2014 at 3:17 PM, Jim Blomo wrote: > I've only tried 0.9, in which I ran into the `stdin writer to Python > finished early` so frequently I wasn't able to load even a 1GB file. > Let me know if I can provide any other info! > > On Thu, Mar 27, 2014 at 8:48 PM, Mat

Re: pySpark memory usage

2014-03-29 Thread Jim Blomo
ions of Spark (0.9 or 0.8)? We'll > try to look into these, seems like a serious error. > > Matei > > On Mar 27, 2014, at 7:27 PM, Jim Blomo wrote: > >> Thanks, Matei. I am running "Spark 1.0.0-SNAPSHOT built for Hadoop >> 1.0.4" from GitHub on 2014-

Re: pySpark memory usage

2014-03-27 Thread Jim Blomo
t. Thanks again! On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia wrote: > Hey Jim, > > In Spark 0.9 we added a "batchSize" parameter to PySpark that makes it group > multiple objects together before passing them between Java and Python, but > this may be too high by d

pySpark memory usage

2014-03-21 Thread Jim Blomo
Hi all, I'm wondering if there's any settings I can use to reduce the memory needed by the PythonRDD when computing simple stats. I am getting OutOfMemoryError exceptions while calculating count() on big, but not absurd, records. It seems like PythonRDD is trying to keep too many of these records

Re: PySpark worker fails with IOError Broken Pipe

2014-03-20 Thread Jim Blomo
I think I've encountered the same problem and filed https://spark-project.atlassian.net/plugins/servlet/mobile#issue/SPARK-1284 For me it hung the worker, though. Can you add reproducible steps and what version you're running? On Mar 19, 2014 10:13 PM, "Nicholas Chammas" wrote: > So I have the p

Re: Pyspark worker memory

2014-03-19 Thread Jim Blomo
n the machine running the driver (typically the master host) and will affect the memory available to the Executor running on a slave node -D SPARK_DAEMON_OPTS: On Wed, Mar 19, 2014 at 12:48 AM, Jim Blomo wrote: > Thanks for the suggestion, Matei. I've tracked this down to a se

Re: Pyspark worker memory

2014-03-19 Thread Jim Blomo
env.sh on the workers as well. Maybe code there is > somehow overriding the spark.executor.memory setting. > > Matei > > On Mar 18, 2014, at 6:17 PM, Jim Blomo wrote: > > Hello, I'm using the Github snapshot of PySpark and having trouble setting > the worker memory correctly.

Pyspark worker memory

2014-03-18 Thread Jim Blomo
Hello, I'm using the Github snapshot of PySpark and having trouble setting the worker memory correctly. I've set spark.executor.memory to 5g, but somewhere along the way Xmx is getting capped to 512M. This was not occurring with the same setup and 0.9.0. How many places do I need to configure the m
