I'm currently doing something like this in my Spark Streaming program
(Java):
dStream.foreachRDD((rdd, batchTime) -> {
    log.info("processing RDD from batch {}", batchTime);
    // my rdd processing code
});
Instead of having my
I should note that the amount of data in each batch is very small, so I'm
not concerned with performance implications of grouping into a single RDD.
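For what it's worth, if the goal is to hand several of those small batches
to foreachRDD as one RDD (an assumption on my part about the goal), a
windowed stream is one way to sketch it; the 5-minute durations are
placeholders:

import org.apache.spark.streaming.Durations;

// Hypothetical sketch: several small batches arrive as one RDD per window.
// The window/slide durations are arbitrary placeholders.
dStream.window(Durations.minutes(5), Durations.minutes(5))
       .foreachRDD((rdd, batchTime) -> {
           log.info("processing windowed RDD from batch {}", batchTime);
           // everything from the last window is available here as a single RDD
       });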
On Wed, Jul 15, 2015 at 9:58 PM, Jon Chase wrote:
> I'm currently doing something like this in my Spark Streaming program
I'm running into an issue with a pyspark job where I'm sometimes seeing
extremely variable job times (20min to 2hr) and very long shuffle times
(e.g. ~2 minutes for 18KB/86 records).
Cluster setup is Amazon EMR 4.4.0, Spark 1.6.0, an m4.2xl driver and a
single m4.10xlarge (40 vCPU, 160GB) executor
I'm seeing a similar (same?) problem on Spark 1.4.1 running on Yarn (Amazon
EMR, Java 8). I'm running a Spark Streaming app 24/7, and system memory
eventually gets exhausted after about 3 days; the JVM process dies with:
#
# There is insufficient memory for the Java Runtime Environment to continue.
Shahab -
This should do the trick until Hao's changes are out:
sqlContext.sql("create temporary function foobar as 'com.myco.FoobarUDAF'");
sqlContext.sql("select foobar(some_column) from some_table");
This works without requiring you to 'deploy' a JAR with the UDAF in it - just
make sure the UDAF
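For anyone following along, a rough sketch of what a class like
com.myco.FoobarUDAF might look like using the old-style Hive UDAF API (the
sum-of-doubles logic is purely illustrative, not the actual function):

package com.myco;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

// Illustrative only: a trivial sum aggregate that could be registered via
// "create temporary function foobar as 'com.myco.FoobarUDAF'".
public class FoobarUDAF extends UDAF {
    public static class Evaluator implements UDAFEvaluator {
        private double sum;
        private boolean empty;

        public void init() { sum = 0.0; empty = true; }

        // called once per input row
        public boolean iterate(Double value) {
            if (value != null) { sum += value; empty = false; }
            return true;
        }

        // partial aggregate shipped between tasks
        public Double terminatePartial() { return empty ? null : sum; }

        // merge a partial aggregate from another task
        public boolean merge(Double other) {
            if (other != null) { sum += other; empty = false; }
            return true;
        }

        // final result
        public Double terminate() { return empty ? null : sum; }
    }
}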
Spark 1.3.0, Parquet
I'm having trouble referencing partition columns in my queries.
In the following example, 'probeTypeId' is a partition column. For
example, the directory structure looks like this:
/mydata
/probeTypeId=1
...files...
/probeTypeId=2
...files...
I see
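For reference, this is roughly the kind of code involved (table name and
query are made up; assumes an existing sqlContext on Spark 1.3):

import org.apache.spark.sql.DataFrame;

// Load the partitioned root so partition discovery picks up probeTypeId,
// then reference it like any other column.
DataFrame df = sqlContext.parquetFile("/mydata");
df.registerTempTable("mydata");
sqlContext.sql(
    "SELECT probeTypeId, count(*) FROM mydata GROUP BY probeTypeId").show();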
I've filed this as https://issues.apache.org/jira/browse/SPARK-6554
On Thu, Mar 26, 2015 at 6:29 AM, Jon Chase wrote:
> Spark 1.3.0, Parquet
>
> I'm having trouble referencing partition columns in my queries.
>
> In the following example, 'probeTypeId' is a partition column.
Spark 1.3.0 on YARN (Amazon EMR), cluster of 10 m3.2xlarge (8 vCPU, 30GB),
executor memory 20GB, driver memory 10GB
I'm using Spark SQL, mainly via spark-shell, to query 15GB of data spread
out over roughly 2,000 Parquet files, and my queries frequently hang. Simple
queries like "select count(*) from
Spark 1.3.0
Two issues:
a) I'm unable to get a "lateral view explode" query to work on an array type
b) I'm unable to save an array type to a Parquet file
I keep running into this:
java.lang.ClassCastException: [I cannot be cast to
scala.collection.Seq
Here's a stack trace from the explo
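A rough reproduction sketch of the two operations, with made-up table and
column names (the LATERAL VIEW syntax assumes a HiveContext):

import org.apache.spark.sql.DataFrame;

// (a) lateral view explode over an array column
DataFrame exploded = sqlContext.sql(
    "SELECT id, item FROM my_table LATERAL VIEW explode(int_array) t AS item");
exploded.show();

// (b) saving a DataFrame that contains an array column to Parquet (Spark 1.3 API)
DataFrame withArray = sqlContext.sql("SELECT id, int_array FROM my_table");
withArray.saveAsParquetFile("/tmp/with_array.parquet");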
mes the
> underlying SQL array is represented by a Scala Seq. Would you mind opening
> a JIRA ticket for this? Thanks!
>
> Cheng
>
> On 3/27/15 7:00 PM, Jon Chase wrote:
>
> Spark 1.3.0
>
> Two issues:
>
> a) I'm unable to get a "lateral view explode" query to work on an array type
that, would you mind also providing the full stack
> trace of the exception thrown in the saveAsParquetFile call? Thanks!
>
> Cheng
>
> On 3/27/15 7:35 PM, Jon Chase wrote:
>
> https://issues.apache.org/jira/browse/SPARK-6570
>
> I also left in the call to saveAsParquetFile
Zsolt - what version of Java are you running?
On Mon, Mar 30, 2015 at 7:12 AM, Zsolt Tóth
wrote:
> Thanks for your answer!
> I don't call .collect because I want to trigger the execution. I call it
> because I need the rdd on the driver. This is not a huge RDD and it's not
> larger than the one
I'm trying to use the spark-ec2 command to launch a Spark cluster that runs
Java 8, but so far I haven't been able to get the Spark processes to use
the right JVM at startup.
Here's the command I use for launching the cluster. Note I'm using the
user-data feature to install Java 8:
./spark-ec2
I'm running a very simple Spark application that downloads files from S3,
does a bit of mapping, then uploads new files. Each file is roughly 2MB
and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not
having any download speed issues (Amazon's EMR provides a custom
implementation
I'm getting the same error ("ExecutorLostFailure") - input RDD is 100k
small files (~2MB each). I do a simple map, then keyBy(), and then
rdd.saveAsHadoopDataset(...). Depending on the memory settings given to
spark-submit, the time before the first ExecutorLostFailure varies (more
memory == longer
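The setting that comes up in the replies is the YARN executor memory
overhead; a minimal sketch of setting it programmatically (the 4096 value
just mirrors the --conf flag I tried, it isn't a recommendation):

import org.apache.spark.SparkConf;

// Equivalent to passing --conf spark.yarn.executor.memoryOverhead=4096
// to spark-submit; the value is in megabytes.
SparkConf conf = new SparkConf()
    .set("spark.yarn.executor.memoryOverhead", "4096");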
at it
> requests from yarn. It's required because jvms take up some memory beyond
> their heap size.
>
> -Sandy
>
> On Dec 19, 2014, at 9:04 AM, Jon Chase wrote:
>
> I'm getting the same error ("ExecutorLostFailure") - input RDD is 100k
> small files (~2MB each).
committed 7392K, reserved
1048576K
On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase wrote:
> I'm actually already running 1.1.1.
>
> I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
> luck. Still getting "ExecutorLostFailure (executor lost)".
>
Yes, same problem.
On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza
wrote:
> Do you hit the same errors? Is it now saying your containers are exceeding
> ~10 GB?
>
> On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase wrote:
>>
>> I'm actually already running 1.1.1.
Running on Amazon EMR w/Yarn and Spark 1.1.1, I have trouble getting Yarn
to use the number of executors that I specify in spark-submit:
--num-executors 2
In a cluster with two core nodes, this typically results in only one executor
running at a time. I can play with the memory settings and
num-co
I've had a lot of difficulties with using the s3:// prefix. s3n:// seems
to work much better. Can't find the link ATM, but I seem to recall that
s3:// (Hadoop's original block format for S3) is no longer recommended for
use. Amazon's EMR goes so far as to remap s3:// to s3n:// behind the
scenes.
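In code that just means using the s3n:// scheme in the path, e.g. (bucket
and path are placeholders, sc is an existing JavaSparkContext):

import org.apache.spark.api.java.JavaRDD;

// read via the s3n:// scheme rather than s3://
JavaRDD<String> lines = sc.textFile("s3n://my-bucket/path/to/input");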
Have a look at RDD.groupBy(...) and reduceByKey(...)
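For example, something along these lines to sum field c grouped by field a
(file path, field positions, and the choice of sum are assumptions; swap in
whatever aggregation the user picks):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CsvAggregate {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("csv-aggregate"));

        // key on field a (index 0), sum field c (index 2)
        JavaPairRDD<String, Double> sums = sc.textFile("data.csv")
            .mapToPair(line -> {
                String[] f = line.split(",");
                return new Tuple2<>(f[0], Double.parseDouble(f[2]));
            })
            .reduceByKey((x, y) -> x + y);

        sums.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}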
On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh
wrote:
> Hi,
> I have a CSV file with fields a, b, c.
> I want to do aggregation (sum, average, ...) based on any field (a, b, or c) as per
> user input,
> using the Apache Spark Java API. Please help, urgent
http://www.jets3t.org/toolkit/configuration.html
Put the following properties in a file named jets3t.properties and make
sure it is available while your Spark job is running (just place it in
~/ and pass it to spark-submit with --files
~/jets3t.properties)
httpclien
Multifarious, Inc. | http://mult.ifario.us/
>
> On Thu, Dec 18, 2014 at 5:56 AM, Jon Chase wrote:
>
>> I'm running a very simple Spark application that downloads files from S3,
>> does a bit of mapping, then uploads new files. Each file is roughly 2MB
>> and