Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
So when deciding whether to take on installing/configuring Spark, the size of the data does not automatically make that decision in your mind. Thanks, Gary On Thu, Feb 26, 2015 at 8:55 PM, Tobias Pfeiffer wrote: > Hi > > On Fri, Feb 27, 2015 at 10:50 AM, Gary Malouf > wrote: >

Re: Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
t 8:40 AM, Gary Malouf > wrote: > >> I'm considering whether or not it is worth introducing Spark at my new >> company. The data is nowhere near Hadoop size at this point (it sits in >> an RDS Postgres cluster). >> > > Will it ever become "Hadoop si

Dealing with 'smaller' data

2015-02-26 Thread Gary Malouf
I'm considering whether or not it is worth introducing Spark at my new company. The data is nowhere near Hadoop size at this point (it sits in an RDS Postgres cluster). I'm wondering at which point it is worth the overhead of adding the Spark infrastructure (deployment scripts, monitoring, etc.).

Shuffle Intensive Job: sendMessageReliably failed because ack was not received within 60 sec

2014-11-19 Thread Gary Malouf
Has anyone else received this type of error? We are not sure what the issue is nor how to correct it to get our job to complete...

GraphX bug re-opened

2014-11-19 Thread Gary Malouf
We keep running into https://issues.apache.org/jira/browse/SPARK-2823 when trying to use GraphX. The cost of repartitioning the data is really high for us (lots of network traffic) which is killing the job performance. I understand the bug was reverted to stabilize unit tests, but frankly it make

Re: Sourcing data from RedShift

2014-11-18 Thread Gary Malouf
Spark right now is not reasonable for us but this bug prevents solving the issue. On Fri, Nov 14, 2014 at 9:29 PM, Gary Malouf wrote: > I'll try this out and follow up with what I find. > > On Fri, Nov 14, 2014 at 8:54 PM, Xiangrui Meng > wrote: > >> For each node, i

Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
eed to cache the data > to be efficient, you may need a larger cluster or change the storage level > to MEMORY_AND_DISK. > > -Xiangrui > > On Nov 14, 2014, at 5:32 PM, Gary Malouf wrote: > > Hmm, we actually read the CSV data in S3 now and were looking to avoid > that

Re: Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
t; > On Nov 14, 2014, at 3:46 PM, Michael Armbrust > wrote: > > I'd guess that its an s3n://key:secret_key@bucket/path from the UNLOAD > command used to produce the data. Xiangrui can correct me if I'm wrong > though. > > On Fri, Nov 14, 2014 at 2:19 PM, Gary Mal

Sourcing data from RedShift

2014-11-14 Thread Gary Malouf
We have a bunch of data in RedShift tables that we'd like to pull into Spark during job runs. What is the path/URL format one uses to pull data from there? (This is in reference to using https://github.com/mengxr/redshift-input-format.)

Short Circuit Local Reads

2014-09-17 Thread Gary Malouf
Cloudera had a blog post about this in August 2013: http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/ Has anyone been using this in production? I'm curious whether it made a significant difference from a Spark perspective.

Dealing with Time Series Data

2014-09-15 Thread Gary Malouf
I have a use case for our data in HDFS that involves sorting chunks of data into time series format by a specific characteristic and doing computations from that. At large scale, what is the most efficient way to do this? Obviously, having the data sharded by that characteristic would make the pe

Re: ReduceByKey performance optimisation

2014-09-13 Thread Gary Malouf
You need something like: val x: RDD[MyAwesomeObject] x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l } Does that make sense? On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme wrote: > I need to remove objects with duplicate key, but I need the whole object. > Object which ha
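A minimal sketch of the pattern in the reply above, using plain Scala collections in place of an RDD so that it runs without Spark. `MyAwesomeObject`, `fieldToBeKey`, and `payload` are hypothetical names chosen for illustration. On a real RDD, which duplicate survives `reduceByKey` is generally not guaranteed, since it depends on how the data is partitioned; in this local version the first object seen for each key is kept.

```scala
// Plain Scala collections stand in for an RDD so the sketch runs without Spark.
// MyAwesomeObject and its fields are hypothetical names for illustration only.
final case class MyAwesomeObject(fieldToBeKey: String, payload: Int)

object DedupeByKey {
  // Pair each object with its key, then keep one whole object per key.
  // The merge function (l, _) => l mirrors the reduceByKey call in the message.
  def dedupe(xs: Seq[MyAwesomeObject]): Map[String, MyAwesomeObject] =
    xs.map(obj => obj.fieldToBeKey -> obj)
      .groupBy { case (key, _) => key }
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce((l, _) => l) }
}
```

If a specific duplicate should win (for example the one with the largest `payload`), the merge function can compare its two arguments instead of always returning the left one.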

Dealing with Idle shells

2014-08-14 Thread Gary Malouf
We have our quantitative team using Spark as part of their daily work. One of the more common problems we run into is that people unintentionally leave their shells open throughout the day. This eats up memory in the cluster and causes others to have limited resources to run their jobs. With som

DistCP - Spark-based

2014-08-12 Thread Gary Malouf
We are probably still the minority, but our analytics platform based on Spark + HDFS does not have map/reduce installed. I'm wondering if there is a distcp equivalent that leverages Spark to do the work. Our team is trying to find the best way to do cross-datacenter replication of our HDFS data t

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
Also, regarding something like redshift not having MLlib built in, much of that could be done on the derived results. On Aug 6, 2014 4:07 PM, "Nicholas Chammas" wrote: > On Wed, Aug 6, 2014 at 3:41 PM, Daniel, Ronald (ELS-SDG)< > r.dan...@elsevier.com> wrote: > >> Mostly I was just objecting to "

Re: Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
o Spark so we have the greater > capabilities mentioned above. > > > > > > Best regards, > > > > Ron Daniel, Jr. > > Director, Elsevier Labs > > r.dan...@elsevier.com > > mobile: +1 619 208 3064 > > > > > > > > *From:* Gary

Spark memory management

2014-08-06 Thread Gary Malouf
I have a few questions about managing Spark memory: 1) In a standalone setup, is there any CPU prioritization across users running jobs? If so, what is the behavior here? 2) With Spark 1.1, users will more easily be able to run drivers/shells from remote locations that do not cause firewall head

Re: Running a Spark Shell locally against EC2

2014-08-06 Thread Gary Malouf
cent > commit changes this ( > https://github.com/apache/spark/commit/09f7e4587bbdf74207d2629e8c1314f93d865999) > in that you can now manually configure all ports and only open up the ones > you configured. This will be available in Spark 1.1. > > -Andrew > > > 2014-08-06 8:

Regarding tooling/performance vs RedShift

2014-08-06 Thread Gary Malouf
My company is leaning towards moving much of its analytics work from our own Spark/Mesos/HDFS/Cassandra setup to RedShift. To date, I have been the internal advocate for using Spark for analytics, but a number of good points have been brought up to me. The reasons being pushed are: - RedShift

Running a Spark Shell locally against EC2

2014-08-06 Thread Gary Malouf
We have Spark 1.0.1 on Mesos deployed as a cluster in EC2. Our DevOps lead tells me that Spark jobs cannot be submitted from local machines due to the complexity of opening the right ports to the world, etc. Are other people running the shell locally in a production environment?

Re: Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
g loaded. On Fri, Jul 25, 2014 at 2:27 PM, Gary Malouf wrote: > After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going > well. Looking at the Mesos slave logs, I noticed: > > ERROR KryoSerializer: Failed to run spark.kryo.registrator > java.lang.Cla

Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going well. Looking at the Mesos slave logs, I noticed: ERROR KryoSerializer: Failed to run spark.kryo.registrator java.lang.ClassNotFoundException: com/mediacrossing/verrazano/kryo/MxDataRegistrator My spark-env.sh has the follow

Workarounds for accessing sequence file data via PySpark?

2014-07-23 Thread Gary Malouf
I am aware that PySpark cannot load sequence files directly today. Are there workarounds people are using (short of duplicating all the data to text files) for accessing this data?

SparkSQL with sequence file RDDs

2014-07-07 Thread Gary Malouf
Has anyone reported issues using SparkSQL with sequence files (all of our data is in this format within HDFS)? We are considering whether to burn the time upgrading to Spark 1.0 from 0.9 now and this is a main decision point for us.

Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Gary Malouf
Go to Expedia/Orbitz and look for hotels in the Union Square neighborhood. In my humble opinion, having visited San Francisco, it is worth any extra cost to be as close as possible to the conference vs. having to travel from other parts of the city. On Tue, May 27, 2014 at 9:36 AM, Gerard Maas wr

Re: is Mesos falling out of favor?

2014-05-11 Thread Gary Malouf
For what it is worth, our team here at MediaCrossing has been using the Spark/Mesos combination since last summer with much success (low operations overhead, high developer performance). IMO, Hadoop is overcomplicated from both a development and operations perspective so

SparkR with Sequence Files

2014-04-10 Thread Gary Malouf
Has anyone been using SparkR to work with data from sequence files? We use protobuf throughout our system and are considering whether to try out SparkR.

Building Spark 0.9.x for CDH5 with mrv1 installation (Protobuf 2.5 upgrade)

2014-03-25 Thread Gary Malouf
Today, our cluster setup is as follows: Mesos 0.15, CDH 4.2.1-MRV1, Spark 0.9-pre-scala-2.10 off master build targeted at appropriate CDH4 version. We are looking to upgrade all of these in order to get protobuf 2.5 working properly. The question is, which 'Hadoop version build' of Spark 0.9 is

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-03-25 Thread Gary Malouf
Can anyone verify the claims from Aureliano regarding the Akka dependency protobuf collision? Our team has a major need to upgrade to protobuf 2.5.0 up the pipe and Spark seems to be the blocker here. On Fri, Mar 21, 2014 at 6:49 PM, Aureliano Buendia wrote: > > > > On Tue, Mar 18, 2014 at 12:5