We use both Databricks and EMR. We use Databricks for our exploratory / ad hoc
use cases because their notebook is pretty badass and better than Zeppelin IMHO.
We use EMR for our production machine learning and ETL tasks. The nice thing
about EMR is that you can use applications other than Spark. From
If you are running on Amazon, then it's always a crapshoot as well.
M
> On Dec 21, 2015, at 4:41 PM, Josh Rosen wrote:
>
> @Eran, are Server 1 and Server 2 both part of the same cluster / do they have
> similar positions in the network topology w.r.t the Spark executors? If
> Server 1 had fas
We were looking into this as well --- the answer looks like "no".
Here's the ticket:
https://issues.apache.org/jira/browse/HADOOP-9680
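For context, a rough sketch of how static keys get wired in through the Hadoop
config (assumes an existing SparkContext `sc`) -- the missing piece for
temporary credentials is a session-token property, which is what that ticket
is about:

    // static credentials only -- s3n has no property for an STS session token
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))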
m
On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao wrote:
> Hi,
>
>
>
> Does anyone know if Spark running in AWS supports temporary access
> credentials (AccessKeyI
If you are running on AWS, I would recommend using S3 instead of HDFS as a
general practice if you are maintaining state or data there. This way you can
treat your Spark clusters as ephemeral compute resources that you can swap out
easily -- e.g. if something breaks, just spin up a fresh cluster and
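The pattern in a nutshell (bucket and paths made up):

    // state lives in S3; the cluster itself is disposable
    val events = sc.textFile("s3n://my-bucket/events/")            // hypothetical input
    val counts = events.map(line => (line.split(",")(0), 1L)).reduceByKey(_ + _)
    counts.saveAsTextFile("s3n://my-bucket/output/event-counts/")  // hypothetical output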
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
M
> On Dec 4, 2015, at 8:21 AM, Sateesh Karuturi
> wrote:
>
> Hello Spark experts...
> I am new to Apache Spark. Can anyone send me the proper documentation to learn
> RDD functions.
> Thanks in advance...
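To make the transformation/action split in that guide concrete, a tiny sketch:

    // transformations are lazy; the action at the end triggers the work
    val nums    = sc.parallelize(1 to 10)     // build an RDD
    val squares = nums.map(n => n * n)        // transformation: nothing runs yet
    val evens   = squares.filter(_ % 2 == 0)  // still lazy
    val total   = evens.reduce(_ + _)         // action: computes 220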
You should consider Presto for this use case. If you want fast "first query"
times, it is a better fit.
I think Spark SQL will catch up at some point, but if you are not doing multiple
queries against data cached in RDDs and need low latency, it may not be a good
fit.
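(If you do stick with Spark SQL, the usual trick is to pay the load cost once
and cache -- table name made up:)

    // repeated queries hit the in-memory cache instead of re-reading the data
    sqlContext.cacheTable("events")   // hypothetical table
    val daily = sqlContext.sql("SELECT day, count(*) FROM events GROUP BY day")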
M
> On Dec 1, 2015, at 7:23
PATH. Further info in the
> docs. I've had good luck with these options, and for ease of use I just set
> them in the spark defaults config.
>
> https://spark.apache.org/docs/latest/configuration.html
>
>> On Tue, 17 Nov 2015 at 21:06 Michal Klos wrote:
>> Hi,
Hi,
We are running a Spark Standalone cluster on EMR (note: not using YARN) and
are trying to use S3 w/ EmrFS as our event logging directory.
We are having difficulties with a ClassNotFoundException on EmrFileSystem
when we navigate to the event log screen. This is to be expected as the
EmrFS jar
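Presumably the fix is along these lines in spark-defaults.conf -- the EMRFS
jar location here is a guess about the EMR image layout:

    # event logging to S3, with the EMRFS jars on both classpaths
    spark.eventLog.enabled          true
    spark.eventLog.dir              s3://my-bucket/spark-events      # hypothetical bucket
    spark.driver.extraClassPath     /usr/share/aws/emr/emrfs/lib/*   # assumed location
    spark.executor.extraClassPath   /usr/share/aws/emr/emrfs/lib/*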
You must add the partitions to the Hive table with something like "alter table
your_table add if not exists partition (country='us');".
If you have dynamic partitioning turned on, you can do 'msck repair table
your_table' to recover the partitions.
I would recommend reviewing the Hive document
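Run from Spark, that would look something like this (using your table name;
assumes a HiveContext):

    // assumes: val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("ALTER TABLE your_table ADD IF NOT EXISTS PARTITION (country='us')")
    hiveContext.sql("MSCK REPAIR TABLE your_table")  // recovers all missing partitions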
Hi,
We would like to be added to the Powered by Spark list:
organization name: Localytics
URL: http://eng.localytics.com/
a list of which Spark components you are using: Spark, Spark Streaming,
MLlib
a short description of your use case: Batch, real-time, and predictive
analytics driving our mobi
Hi,
I'm trying to set up multiple Spark clusters with high availability, and I
was wondering if I can re-use a single ZK cluster to manage them? It's not
very clear in the docs, and it seems like the answer may be that I need a
separate ZK cluster for each Spark cluster?
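For reference, these are the knobs I mean -- my guess is a shared ensemble
would at least need a distinct spark.deploy.zookeeper.dir per cluster (ZK
hosts made up):

    # spark-env.sh on the masters of cluster A; cluster B would use /spark-cluster-b
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark-cluster-a"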
thanks,
M
t could even fully complete on one machine, but the machine dies
>> > immediately before reporting the result back to the driver.
>> >
>> > This means you need to make sure the side-effects are idempotent, or
>> > use some transactional locking. Spark's own ou
Hi Spark group,
We haven't been able to find a clear description of how Spark handles the
resiliency of RDDs in relation to executing actions with side-effects.
If you do an `rdd.foreach(someSideEffect)`, then you are doing a
side-effect for each element in the RDD. If a partition goes down --
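To make the idempotency point from the reply above concrete, a sketch (pairRdd
is a hypothetical RDD of key/value pairs, and the key-value client is made up):

    // safe under Spark's retries: re-running a partition just overwrites the same keys
    pairRdd.foreachPartition { part =>
      val store = KeyValueStore.connect("store-host:6379")  // hypothetical client
      part.foreach { case (k, v) => store.put(k, v) }       // put = overwrite, not append
      store.close()
    }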
Hi,
We have a Scala application and we want it to programmatically submit Spark
jobs to a Spark-YARN cluster in yarn-client mode.
We're running into a lot of classpath issues, e.g. once submitted it looks
for jars in our parent Scala application's local directory, jars that it
shouldn't need. Our
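For what it's worth, Spark 1.4+ ships a launcher API that forks a real
spark-submit instead of inheriting the parent JVM's classpath -- a sketch with
made-up paths and class names:

    import org.apache.spark.launcher.SparkLauncher

    // spawns a separate spark-submit process with its own classpath
    val job = new SparkLauncher()
      .setAppResource("/opt/jobs/our-job-assembly.jar")  // hypothetical jar
      .setMainClass("com.example.OurJob")                // hypothetical class
      .setMaster("yarn-client")
      .launch()
    job.waitFor()  // block until the submitted job exits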
function you pass can invoke an RDD
> action. So you can refactor that way too.
>
> On Wed, Mar 11, 2015 at 2:39 PM, Michal Klos
> wrote:
> > Is there a way to have the exception handling go lazily along with the
> > definition?
> >
> > e.g... we define it on the RDD
where the action
> occurs.
>
>
> On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos
> wrote:
> > Hi Spark Community,
> >
> > We would like to define exception handling behavior on RDD instantiation
> /
> > build. Since the RDD is lazily evaluated, it seems
Hi Spark Community,
We would like to define exception handling behavior on RDD instantiation /
build. Since the RDD is lazily evaluated, it seems like we are forced to
put all exception handling in the first action call?
This is an example of something that would be nice:
def myRDD = {
  Try {
    val rdd = sc.textFile(inputPath)  // hypothetical input path
    rdd
  }
}
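As it stands, though, the laziness means the Try around the definition sees
nothing -- the failure only surfaces when an action forces evaluation,
something like:

    import scala.util.{Success, Failure}

    // myRDD is a Try[RDD[String]]; count() runs inside Try.map, so a failure
    // at the first action still lands in the Failure branch
    myRDD.map(_.count()) match {
      case Success(n) => println(s"counted $n records")
      case Failure(e) => println(s"failed at the first action: $e")
    }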
even if the N partitioned
>> queries are smaller. Some queries require evaluating the whole data set
>> first. If our use case is a simple select * from table... then the partitions
>> would be an easier sell if it wasn't for the concurrency problem :) Long
>> story shor
Piggy-backing on the thread a little --
Does anyone out there use Luigi to manage Spark workflows?
I see they recently added Spark support.
On Sun, Mar 1, 2015 at 10:20 PM, Qiang Cao wrote:
> Thanks, Himanish and Felix!
>
> On Sun, Mar 1, 2015 at 7:50 PM, Himanish Kushary
> wrote:
>
>> We are
Hi Spark community,
We have a use case where we need to pull huge amounts of data from a SQL
query against a database into Spark. We need to execute the query against
our huge database and not a substitute (SparkSQL, Hive, etc) because of a
couple of factors including custom functions used in the
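For reference, here's the kind of pull we mean -- Spark's JDBC source takes a
subquery as its "table" and can split the read across executors (connection
details and query made up):

    // "dbtable" can be a pushed-down subquery, so custom DB functions stay in the DB
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/warehouse")                   // hypothetical
      .option("dbtable", "(SELECT id, custom_fn(payload) AS score FROM events) t") // hypothetical
      .option("partitionColumn", "id")  // numeric column used to split the read
      .option("lowerBound", "0")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")
      .load()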