I got it working. It's much faster.
If someone else wants to try it, here's what I did:
1) I was already using the code from the Presto S3 Hadoop FileSystem
implementation, modified to sever it from the rest of the Presto codebase.
2) I extended it and overrode the method "keyFromPath" so that anytime the
Path referr
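(A rough sketch of the kind of override described above, in case it helps. "SeveredPrestoS3FileSystem" and the staging-path pattern are purely illustrative; the real keyFromPath signature and the committer's directory layout depend on how you severed the Presto code and on your Hadoop version.)

import org.apache.hadoop.fs.Path

// Sketch only: assumes the severed Presto S3 FileSystem exposes keyFromPath as
// an overridable method. The idea is to strip the output committer's
// "_temporary/..." staging prefix so part files land directly at their final
// S3 keys and the later rename/move becomes a no-op.
class DirectWriteS3FileSystem extends SeveredPrestoS3FileSystem {
  override protected def keyFromPath(path: Path): String = {
    val key = super.keyFromPath(path)
    // e.g. ".../_temporary/0/_temporary/attempt_x/part-r-00000.parquet" -> ".../part-r-00000.parquet"
    key.replaceFirst("_temporary/.*/(part-[^/]+)$", "$1")
  }
}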
It's not actually that tough. We already use a custom Hadoop FileSystem for
S3 because when we started using Spark with S3 the native FileSystem was
very unreliable. Ours is based on the code from Presto. (see
https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/pr
Thanks. In the meantime I might just write a custom file system that maps
writes to parquet file parts to their final locations and then skips the
move.
I have this exact issue. I was going to intercept the call in the filesystem
if I had to (since we're using the S3 filesystem from Presto anyway) but if
there's simply a way to do this correctly I'd much prefer it. This basically
doubles the time to write parquet files to s3.
I'm running spark 2.0.0 on Mesos using spark.mesos.executor.docker.image to
point to a docker container that I built with the Spark installation.
Everything is working except the Spark client process that's started inside
the container doesn't get any of my parameters I set in the spark config in
Hello all,
We were using the old "Artificial Neural Network":
https://github.com/apache/spark/pull/1290
This code appears to have been incorporated in 1.5.2 but it's only exposed
publicly via the MultilayerPerceptronClassifier. Is there a way to use the
old feedforward/backprop non-classificatio
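(For reference, the publicly exposed path since 1.5 is the classifier; the layer sizes and the training DataFrame below are just placeholders.)

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// The public API only does classification: layers = input size, hidden layers, #classes.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 8, 8, 3))
  .setMaxIter(100)
val model = mlp.fit(trainingDF)   // trainingDF: a DataFrame with "label"/"features" columns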
I finally got back to this and I just wanted to let anyone that runs into
this know that the problem is a kryo version issue. Spark (at least 1.4.0)
depends on Kryo 2.21 while my client had 2.24.0 on the classpath. Changing
it to 2.21 fixed the problem.
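(If you hit the same thing, forcing the Kryo version in the build is probably the simplest fix; this is an sbt example, adjust for Maven as needed.)

// build.sbt: pin Kryo to the version Spark 1.4.0 ships with, so the client
// classpath can't pull in a newer, incompatible one.
dependencyOverrides += "com.esotericsoftware.kryo" % "kryo" % "2.21"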
I'm seeing the following exception ONLY when I run on a Mesos cluster. If I
run the exact same code with master set to "local[N]" I have no problem:
2015-05-19 16:45:43,484 [task-result-getter-0] WARN TaskSetManager - Lost
task 0.0 in stage 0.0 (TID 0, 10.253.1.101): java.io.EOFException
Thanks Patrick and Michael for your responses.
For anyone else that runs across this problem prior to 1.3.1 being released,
I've been pointed to this Jira ticket that's scheduled for 1.3.1:
https://issues.apache.org/jira/browse/SPARK-6351
Thanks again.
I have code that works under 1.2.1 but when I upgraded to 1.3.0 it fails to
find the s3 hadoop file system.
I get the "java.lang.IllegalArgumentException: Wrong FS: s3://[path to my
file], expected: file:///" when I try to save a parquet file. This worked in
1.2.1.
Has anyone else seen this?
I'm
Hello all,
I've been hitting a divide by zero error in Parquet through Spark, detailed
(and fixed) here: https://github.com/apache/incubator-parquet-mr/pull/102
Is anyone else hitting this error? I hit it frequently.
It looks like the Parquet team is preparing to release 1.6.0 and, since they
have
Hello all,
I have a custom RDD for fast loading of data from a non-partitioned source.
The partitioning happens in the RDD implementation by pushing data from the
source into queues picked up by the current active partitions in worker
threads.
This works great on a multi-threaded single host (say
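(Roughly the shape of the RDD I mean, with all names illustrative; as noted, this only really makes sense when the queues and the partitions live on the same host/JVM.)

import java.util.concurrent.BlockingQueue
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each partition drains "its" queue, which loader threads on the same host feed.
case class QueuePartition(index: Int) extends Partition

class QueueBackedRDD(sc: SparkContext,
                     queues: Array[BlockingQueue[String]],
                     endOfData: String)
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    queues.indices.map(QueuePartition(_): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    // read until the loader pushes the end-of-data sentinel for this partition
    Iterator.continually(queues(split.index).take()).takeWhile(_ != endOfData)
}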
Hi,
This was all my fault. It turned out I had a line of code buried in a
library that did a "repartition." I used this library to wrap an RDD to
present it to legacy code as a different interface. That's what was causing
the data to spill to disk.
The really stupid thing is it took me the better
Wow. I just realized what was happening and it's all my fault. I have a
library method that I wrote that presents the RDD and I was actually
repartitioning it myself.
I feel pretty dumb. Sorry about that.
I've been trying to figure out how to use Spark to do a simple aggregation
without repartitioning and essentially creating fully instantiated
intermediate RDDs, and it seems virtually impossible.
I've now gone as far as writing my own single-partition RDD that wraps an
Iterator[String] and calling ag
Is there a way to get Spark to NOT repartition/shuffle/expand a
sc.textFile(fileUri) when the URI is a gzipped file?
Expanding a gzipped file should be thought of as a "transformation" and not
an "action" (if the analogy is apt). There is no need to fully create and
fill out an intermediate RDD wit
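(To illustrate what I'm after, assuming nothing downstream repartitions it: a .gz file isn't splittable, so textFile gives a single partition that aggregate can stream through in one pass. The path and the particular statistics below are placeholders.)

// One streaming pass: count lines and total characters without persisting anything.
val lines = sc.textFile("file:///data/big.csv.gz")     // single partition for a .gz
val (count, chars) = lines.aggregate((0L, 0L))(
  (acc, line) => (acc._1 + 1L, acc._2 + line.length),  // seqOp: fold each line into the accumulator
  (a, b)      => (a._1 + b._1, a._2 + b._2)            // combOp: merge per-partition results
)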
Nvm. I'm going to post another question since this has to do with the way
spark handles sc.textFile with a file://.gz
In case a little more information is helpful:
the RDD is constructed using sc.textFile(fileUri) where the fileUri is to a
".gz" file (that's too big to fit on my disk).
I do an rdd.persist(StorageLevel.NONE) and it seems to have no effect.
This rdd is what I'm calling aggregate on and I expect t
Okay,
I have an rdd that I want to run an aggregate over but it insists on
spilling to disk even though I structured the processing to only require a
single pass.
In other words, I can do all of my processing one entry in the rdd at a time
without persisting anything.
I set rdd.persist(StorageLe
Thanks! I'll give it a try.
Hello all,
Is there a way to load an RDD in a small driver app and connect with a JDBC
client and issue SQL queries against it? It seems the thrift server only
works with pre-existing Hive tables.
Thanks
Jim
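(A sketch of one way this can be done, assuming a Spark build with the Hive thrift server on the classpath; "myRdd" is a stand-in for whatever RDD of case classes you have.)

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// Register the in-memory data as a temp table, then expose this same context over JDBC.
val df = myRdd.toDF()
df.registerTempTable("my_rdd")
HiveThriftServer2.startWithContext(hiveContext)
// Connect with any JDBC/beeline client, e.g. jdbc:hive2://localhost:10000, and query my_rdd.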
Jira: https://issues.apache.org/jira/browse/SPARK-4412
PR: https://github.com/apache/spark/pull/3271
Just to be complete, this is a problem in Spark that I worked around and
detailed here:
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-I-turn-off-Parquet-logging-in-a-worker-td18955.html
This is a problem because of a bug in Spark in the current master (aside
from the fact that Parquet uses java.util.logging).
ParquetRelation.scala attempts to override the parquet logger but, at least
currently (and if your application simply reads a parquet file before it
does anything else with
I'm running a local spark master ("local[n]").
I cannot seem to turn off the parquet logging. I tried:
1) Setting a log4j.properties on the classpath.
2) Setting a log4j.properties file in a spark install conf directory and
pointing to the install using setSparkHome
3) Editing the log4j-default.
Actually, it looks like it's Parquet logging that I don't have control over.
For some reason the Parquet project decided to use java.util.logging with
its own logging configuration.
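(A workaround sketch: since Parquet logs through java.util.logging, the handlers on the "parquet" logger can be silenced directly in the driver before any Parquet classes are touched. The logger name and level here are what applied to Parquet 1.x; adjust as needed.)

import java.util.logging.{Level, Logger}

// Grab the root "parquet" JUL logger, drop its console handlers, and quiet it down.
val parquetLogger = Logger.getLogger("parquet")
parquetLogger.getHandlers.foreach(parquetLogger.removeHandler)
parquetLogger.setUseParentHandlers(false)
parquetLogger.setLevel(Level.SEVERE)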
How do I set the log level when running "local[n]"? It ignores the
log4j.properties file on my classpath.
I also tried to set the spark home dir on the SparkConfig using setSparkHome
and made sure an appropriate log4j.properties file was in a "conf"
subdirectory and that didn't work either.
I'm r
Well it looks like this is a Scala problem after all. I loaded the file
using pure Scala and ran the exact same Processors without Spark and I got
20 seconds (with the code in the same file as the 'main') vs 30 seconds
(with the exact same code in a different file) on the 500K rows.
Hello all,
I have a really strange thing going on.
I have a test data set with 500K lines in a gzipped csv file.
I have an array of "column processors," one for each column in the dataset.
A Processor tracks aggregate state and has a method "process(v : String)"
I'm calling:
val processors:
We have some very large datasets where the calculations converge on a result.
Our current implementation allows us to track how quickly the calculations
are converging and end the processing early. This can significantly speed up
some of our processing.
Is there a way to do the same thing in Spark?
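(The usual way to get this in Spark is to drive the iterations from the driver and check a convergence measure each round; everything named below -- initial, step, tolerance, maxIters -- is a placeholder.)

// Iterate until the change between rounds drops below a tolerance, then stop early.
var current = initial.cache()            // RDD[Double] of the quantities being refined
var delta   = Double.MaxValue
var i       = 0
while (delta > tolerance && i < maxIters) {
  val next = step(current).cache()       // one round of the calculation
  delta = next.zip(current).map { case (a, b) => math.abs(a - b) }.sum()
  current.unpersist()
  current = next
  i += 1
}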
Hi Akhil,
Thanks! I guess in short that means the master (or slaves?) connect back to
the driver. This seems like a really odd way to work given the driver needs
to already connect to the master on port 7077. I would have thought that if
the driver could initiate a connection to the master, that w
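(For anyone else tripping over this: yes, the cluster connects back to the driver, so the driver's host and ports have to be reachable from the cluster side. A sketch of the relevant settings, with illustrative values, follows.)

import org.apache.spark.SparkConf

// Fix the ports the cluster uses to reach back into the driver so they can be
// opened in the firewall / security group.
val conf = new SparkConf()
  .setMaster("spark://ec2-master-host:7077")
  .set("spark.driver.host", "my-laptop.example.com")   // an address reachable from the cluster
  .set("spark.driver.port", "51000")
  .set("spark.blockManager.port", "51001")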
Hello all,
I'm trying to run a Driver on my local network with a deployment on EC2 and
it's not working. I was wondering if either the master or slave instances
(in standalone) connect back to the driver program.
I outlined the details of my observations in a previous post but here is
what I'm se
Okay,
This seems to be either a code version issue or a communication issue. It
works if I execute the spark shell from the master node. It doesn't work if
I run it from my laptop and connect to the master node.
I had opened the ports for the WebUI (8080) and the cluster manager (7077)
for the m
>Why I think it's the number of files is that I believe that all of those,
>or a large part of those files, are read when you run
>sqlContext.parquetFile(), and the time it would take in S3 for that to
>happen is a lot, so something internally is timing out.
I'll create the parquet files with D
My apologies to the list. I replied to Manu's question and it went directly
to him rather than the list.
In case anyone else has this issue here is my reply and Manu's reply to me.
This also answers Ian's question.
---
Hi Manu,
The dataset is 7.5 million rows
Hello all,
I've been wrestling with this problem all day and any suggestions would be
greatly appreciated.
I'm trying to test reading a parquet file that's stored in s3 using a spark
cluster deployed on ec2. The following works in the spark shell when run
completely locally on my own machine (i.e
Okay,
Obviously I don't care about adding more files to the system so is there a
way to point to an existing parquet file (directory) and seed the individual
"part-r-***.parquet" (the value of "partition + offset") while preventing
I mean, I can hack it by copying files into the same parquet dir
Hello all,
I've been trying to figure out how to add data to an existing Parquet file
without having a schema. Spark has allowed me to load JSON and save it as a
Parquet file but I was wondering if anyone knows how to ADD/INSERT more
data.
I tried using sql insert and that doesn't work. All of t
I'm experimenting with a few things trying to understand how it's working. I
took the JavaSparkPi example as a starting point and added a few System.out
lines.
I added a System.out to the main body of the driver program (not inside of
any Functions).
I added another to the mapper.
I added another
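(A quick illustration of where the two kinds of output land, which is presumably what the experiment shows: statements in the driver body print to the driver's console, while statements inside functions shipped to tasks print to the executors' logs.)

// Prints from the driver body go to the driver's stdout; prints inside functions
// shipped to tasks go to the stdout of whichever executor runs the task.
println("hello from the driver")
sc.parallelize(1 to 4).foreach { i =>
  println(s"hello from a task, element $i")   // check the executor/worker logs for these
}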
Daniel,
I'm new to Spark but I thought that thread hinted at the right answer.
Thanks,
Jim
Is there a way to create continuously-running, or at least
continuously-loaded, jobs that can be 'invoked' rather than 'sent', to
avoid the job creation overhead of a couple seconds?
I read through the following:
http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-
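(The pattern that usually gets suggested for this: keep one long-lived SparkContext in a server process, cache the RDDs it needs, and run jobs against them per request, so the per-request cost is only the job itself. All names below are illustrative and the server wiring is omitted.)

import org.apache.spark.{SparkConf, SparkContext}

// One context for the life of the process; each "invocation" is just a job on cached data.
val sc   = new SparkContext(new SparkConf().setAppName("long-lived-service"))
val data = sc.textFile("hdfs:///data/events").cache()
data.count()                                   // materialize the cache once up front

def handleRequest(keyword: String): Long =     // called for each incoming request
  data.filter(_.contains(keyword)).count()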