Re: Databricks Cloud vs AWS EMR

2016-01-28 Thread Michal Klos
We use both Databricks and EMR. We use Databricks for our exploratory / ad hoc use cases because their notebook is pretty badass and better than Zeppelin IMHO. We use EMR for our production machine learning and ETL tasks. The nice thing about EMR is you can use applications other than Spark. …

Re: fishing for help!

2015-12-21 Thread Michal Klos
If you are running on Amazon, then it's always a crapshoot as well. M > On Dec 21, 2015, at 4:41 PM, Josh Rosen wrote: > > @Eran, are Server 1 and Server 2 both part of the same cluster / do they have similar positions in the network topology w.r.t. the Spark executors? If Server 1 had fas…

Re: Is Temporary Access Credential (AccessKeyId, SecretAccessKey + SecurityToken) support by Spark?

2015-12-04 Thread Michal Klos
We were looking into this as well --- the answer looks like "no". Here's the ticket: https://issues.apache.org/jira/browse/HADOOP-9680 m On Fri, Dec 4, 2015 at 1:41 PM, Lin, Hao wrote: > Hi, > > Does anyone knows if Spark run in AWS is supported by temporary access credential (AccessKeyI…

Re: newbie best practices: is spark-ec2 intended to be used to manage long-lasting infrastructure ?

2015-12-04 Thread Michal Klos
If you are running on AWS I would recommend using S3 instead of HDFS as a general practice if you are maintaining state or data there. This way you can treat your Spark clusters as ephemeral compute resources that you can swap out easily -- e.g. if something breaks, just spin up a fresh cluster and…
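A minimal sketch of the pattern described above: all durable state lives in S3, so the cluster holds nothing you'd miss if you replaced it. The bucket and paths are placeholders, and the S3 filesystem is assumed to be configured on the cluster (e.g. EMRFS or s3a).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EphemeralClusterJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-backed-job"))

    // Read input directly from S3 rather than cluster-local HDFS
    val counts = sc.textFile("s3://my-bucket/input/")
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)

    // Persist results back to S3; the cluster itself keeps no durable state
    counts.saveAsTextFile("s3://my-bucket/output/run-001/")
    sc.stop()
  }
}
```

With this layout, "recovering" from a broken cluster is just launching a new one and pointing it at the same bucket.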

Re: RDD functions

2015-12-04 Thread Michal Klos
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations M > On Dec 4, 2015, at 8:21 AM, Sateesh Karuturi > wrote: > > Hello Spark experts... > Iam new to Apache Spark..Can anyone send me the proper Documentation to learn > RDD functions. > Thanks in advance...

Re: Low Latency SQL query

2015-12-01 Thread Michal Klos
You should consider Presto for this use case. If you want fast "first query" times it is a better fit. I think Spark SQL will catch up at some point, but if you are not doing multiple queries against data cached in RDDs and need low latency, it may not be a good fit. M > On Dec 1, 2015, at 7:23…
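For the "multiple queries against cached data" case mentioned above, a sketch with the Spark 1.x SQLContext API: caching the table once amortizes the scan cost across repeated queries. The path and table name are placeholders, and an existing `sqlContext` (e.g. from spark-shell) is assumed.

```scala
// Load once, cache, then issue repeated low-latency queries against memory
val events = sqlContext.read.parquet("s3://my-bucket/events/")
events.registerTempTable("events")
sqlContext.cacheTable("events")   // materialized lazily on first query

val daily = sqlContext.sql(
  "SELECT dt, COUNT(*) AS n FROM events GROUP BY dt")
daily.show()
```

The first query still pays the full scan; it is only the second and later queries that see the low latencies being discussed.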

Re: Additional Master daemon classpath

2015-11-18 Thread Michal Klos
…PATH. Further info in the docs. I've had good luck with these options, and for ease of use I just set them in the spark defaults config. > > https://spark.apache.org/docs/latest/configuration.html > >> On Tue, 17 Nov 2015 at 21:06 Michal Klos wrote: >> Hi, …

Additional Master daemon classpath

2015-11-17 Thread Michal Klos
Hi, We are running a Spark Standalone cluster on EMR (note: not using YARN) and are trying to use S3 with EMRFS as our event logging directory. We are having difficulties with a ClassNotFoundException on EmrFileSystem when we navigate to the event log screen. This is to be expected, as the EMRFS jar…
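The reply in this thread suggests setting the extra-classpath options in the Spark defaults config. A sketch of what that might look like in `spark-defaults.conf`; the EMRFS jar locations below are assumptions that vary by EMR release, so verify them on your AMI:

```
# spark-defaults.conf -- assumed EMRFS jar locations; check your EMR release
spark.driver.extraClassPath     /usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*
spark.executor.extraClassPath   /usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*
```

Note that these properties extend the driver and executor classpaths; getting the jars onto the standalone Master daemon's own classpath (as the subject asks) may additionally require settings in `spark-env.sh`.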

Re: Partitioned Parquet based external table

2015-11-12 Thread Michal Klos
You must add the partitions to the Hive table with something like "alter table your_table add if not exists partition (country='us');". If you have dynamic partitioning turned on, you can do 'msck repair table your_table' to recover the partitions. I would recommend reviewing the Hive document…
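The two statements above can be issued from Spark as well. A sketch using the Spark 1.x HiveContext; the table name and partition value are placeholders, and the master/app name are assumed to come from spark-submit:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext()
val hiveContext = new HiveContext(sc)

// Register a single partition that was written directly to the filesystem
hiveContext.sql(
  "ALTER TABLE your_table ADD IF NOT EXISTS PARTITION (country='us')")

// Or recover all missing partitions from the directory layout in one pass
hiveContext.sql("MSCK REPAIR TABLE your_table")
```

Until the partitions are registered in the metastore, queries against the external table will silently return no rows for them, which is usually how this problem first shows up.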

Add to Powered by Spark page

2015-05-19 Thread Michal Klos
Hi, We would like to be added to the Powered by Spark list: organization name: Localytics URL: http://eng.localytics.com/ a list of which Spark components you are using: Spark, Spark Streaming, MLlib a short description of your use case: Batch, real-time, and predictive analytics driving our mobi…

Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-21 Thread Michal Klos
Hi, I'm trying to set up multiple spark clusters with high availability and I was wondering if I can re-use a single ZK cluster to manage them? It's not very clear in the docs and it seems like the answer may be that I need a separate ZK cluster for each spark cluster? thanks, M
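The standalone-mode docs define the recovery properties below; giving each cluster its own `spark.deploy.zookeeper.dir` should let a single ZooKeeper ensemble serve several Spark clusters, since each cluster's masters coordinate under a separate znode. A sketch; the ZK hostnames and directory names are placeholders:

```
# Masters of cluster A (e.g. via SPARK_DAEMON_JAVA_OPTS in spark-env.sh)
-Dspark.deploy.recoveryMode=ZOOKEEPER
-Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
-Dspark.deploy.zookeeper.dir=/spark-cluster-a

# Masters of cluster B: same ensemble, separate znode namespace
-Dspark.deploy.recoveryMode=ZOOKEEPER
-Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181
-Dspark.deploy.zookeeper.dir=/spark-cluster-b
```

If `spark.deploy.zookeeper.dir` is left at its default (`/spark`) on both clusters, their masters would contend for the same leader election path, which is presumably why the docs read as if a dedicated ensemble is needed.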

Re: RDD resiliency -- does it keep state?

2015-03-28 Thread Michal Klos
…it could even fully complete on one machine, but the machine dies immediately before reporting the result back to the driver. This means you need to make sure the side-effects are idempotent, or use some transactional locking. Spark's own ou…

RDD resiliency -- does it keep state?

2015-03-27 Thread Michal Klos
Hi Spark group, We haven't been able to find clear descriptions of how Spark handles the resiliency of RDDs in relation to executing actions with side-effects. If you do an `rdd.foreach(someSideEffect)`, then you are doing a side-effect for each element in the RDD. If a partition goes down --…
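The advice in the reply (make the side-effects idempotent) can be sketched like this. `KeyValueSink` and `openSink` are hypothetical stand-ins for a real client; the point is that a recomputed partition replays the same elements, so the effect must be safe to repeat:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical external sink; upserting by key means a replayed
// partition overwrites the same rows instead of duplicating them.
trait KeyValueSink {
  def upsert(key: String, value: String): Unit
  def close(): Unit
}
def openSink(): KeyValueSink = ???   // placeholder for a real connection

def writeOut(rdd: RDD[(String, String)]): Unit =
  rdd.foreachPartition { records =>
    val sink = openSink()
    try records.foreach { case (k, v) => sink.upsert(k, v) } // replay-safe
    finally sink.close()
  }
```

A blind append (`INSERT` without a key) in the same position would duplicate data whenever Spark retries a failed or slow partition.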

Spark yarn-client submission example?

2015-03-17 Thread Michal Klos
Hi, We have a Scala application and we want it to programmatically submit Spark jobs to a Spark-YARN cluster in yarn-client mode. We're running into a lot of classpath issues, e.g. once submitted it looks for jars in our parent Scala application's local directory, jars that it shouldn't need. Our…
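One way to avoid inheriting the parent application's classpath is `SparkLauncher` (available since Spark 1.4), which forks spark-submit in a child process with its own classpath. A sketch; the Spark home, jar path, and class name are placeholders:

```scala
import org.apache.spark.launcher.SparkLauncher

// Launch the job out-of-process so it uses its own assembly jar and
// Spark/YARN classpath, not the parent Scala application's.
val process = new SparkLauncher()
  .setSparkHome("/opt/spark")                  // placeholder
  .setAppResource("/path/to/job-assembly.jar") // placeholder
  .setMainClass("com.example.MyJob")           // placeholder
  .setMaster("yarn-client")
  .launch()

val exitCode = process.waitFor()
```

This keeps the parent JVM's dependencies out of the job entirely, at the cost of the job running in a separate process rather than sharing the parent's SparkContext.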

Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
…function you pass can invoke an RDD action. So you can refactor that way too. > On Wed, Mar 11, 2015 at 2:39 PM, Michal Klos wrote: > > Is there a way to have the exception handling go lazily along with the definition? > > e.g... we define it on the RDD…

Re: Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
…where the action occurs. > On Wed, Mar 11, 2015 at 1:51 PM, Michal Klos wrote: > > Hi Spark Community, > > We would like to define exception handling behavior on RDD instantiation / build. Since the RDD is lazily evaluated, it seems…

Define exception handling on lazy elements?

2015-03-11 Thread Michal Klos
Hi Spark Community, We would like to define exception handling behavior on RDD instantiation / build. Since the RDD is lazily evaluated, it seems like we are forced to put all exception handling in the first action call? This is an example of something that would be nice: def myRDD = { Try { val …
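Following the replies in this thread (the handling has to live where the action occurs, since that is where the lineage is evaluated), a sketch: the definition stays lazy, and `Try` wraps the action. The path is a placeholder and an existing `sc: SparkContext` is assumed.

```scala
import scala.util.{Try, Success, Failure}

// Definition: nothing runs here, so nothing can fail here
def myRdd() = sc.textFile("s3://my-bucket/input/").map(_.trim)

// Action: this is where failures in the whole lineage surface
def run(): Try[Long] = Try(myRdd().count())

run() match {
  case Success(n)  => println(s"counted $n lines")
  case Failure(ex) => println(s"job failed: ${ex.getMessage}") // handle/retry here
}
```

As the thread notes, the function passed to an RDD transformation can itself invoke an action, so the same `Try`-around-the-action pattern can be pushed into helper methods rather than kept at the top level.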

Re: Scalable JDBCRDD

2015-03-02 Thread Michal Klos
…even if the N partitioned queries are smaller. Some queries require evaluating the whole data set first. If our use case is a simple select * from table… then the partitions would be an easier sell if it wasn't for the concurrency problem :) Long story shor…

Re: Tools to manage workflows on Spark

2015-03-02 Thread Michal Klos
Piggy-backing on the thread a little -- does anyone out there use Luigi to manage Spark workflows? I see they recently added Spark support. On Sun, Mar 1, 2015 at 10:20 PM, Qiang Cao wrote: > Thanks, Himanish and Felix! > On Sun, Mar 1, 2015 at 7:50 PM, Himanish Kushary wrote: >> We are…

Scalable JDBCRDD

2015-02-28 Thread Michal Klos
Hi Spark community, We have a use case where we need to pull huge amounts of data from a SQL query against a database into Spark. We need to execute the query against our huge database and not a substitute (Spark SQL, Hive, etc.) because of a couple of factors, including custom functions used in the…
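For pulling a large query result in parallel, Spark's built-in `JdbcRDD` splits a range-bounded query into N concurrent slices. A sketch; the JDBC URL, table, key bounds, and partition count are placeholders, and the database must tolerate `numPartitions` concurrent connections (the concurrency concern raised later in this thread):

```scala
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// The SQL must contain exactly two '?' placeholders, which JdbcRDD
// binds to the lower/upper bounds of each partition's key slice.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://db-host/mydb"), // placeholder
  "SELECT id, payload FROM big_table WHERE id >= ? AND id <= ?",
  lowerBound = 1L,
  upperBound = 100000000L,
  numPartitions = 32,
  rs => (rs.getLong("id"), rs.getString("payload")))
```

Note the caveat from the thread: this only helps when the query can be meaningfully partitioned on a key range; queries that must evaluate the whole data set first (aggregates, window functions, custom server-side logic) gain nothing from slicing.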