Spark not working on windows 7 64 bit

2015-06-10 Thread Eran Medan
I've hit a roadblock trying to understand why Spark doesn't work for a colleague of mine on his Windows 7 laptop. I have pretty much the same setup and everything works fine. I googled the error message and didn't find anything that resolved it. Here is the exception message (after running spark 1

Joining HDFS and JDBC data sources - benchmarks

2015-11-13 Thread Eran Medan
Hi, I'm looking for some benchmarks on joining data frames where most of the data is in HDFS (e.g. in Parquet) and some "reference" or "metadata" is still in an RDBMS. I am only looking at the very first join, before any caching happens, and I assume there will be a loss of parallelization because JDBCRDD

DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-18 Thread Eran Medan
I understand that the following are equivalent: df.filter('account === "acct1") and sql("select * from tempTableName where account = 'acct1'"). But is Spark SQL "smart" enough to also push filter predicates down for the initial load? E.g. sqlContext.read.jdbc(…).filter('account === "acct1")

Re: DataFrames initial jdbc loading - will it be utilizing a filter predicate?

2015-11-19 Thread Eran Medan
to > generate new query with “where account = acct1” > > Thanks. > > Zhan Zhang > > On Nov 18, 2015, at 11:36 AM, Eran Medan wrote: > > I understand that the following are equivalent > > df.filter('account === "acct1") > > sql("selec

Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-27 Thread Eran Medan
Remember that article that went viral on HN, where a guy showed how GraphX / Giraph / GraphLab / Spark have worse performance on a 128-node cluster than on a single-threaded machine? If not, here is the article: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html Well, as you may recall

[spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-27 Thread Eran Medan
Hi everyone, I had a lot of questions today, sorry if I'm spamming the list, but I thought it's better than posting all questions in one thread. Let me know if I should throttle my posts ;) Here is my question: When I try to have a case class that has Any in it (e.g. I have a property map and va

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-29 Thread Eran Medan
, but I don't think it's some kind > of argument against distributed computing. > > > On Fri, Mar 27, 2015 at 6:32 PM, Eran Medan > wrote: > > Remember that article that went viral on HN? (Where a guy showed how > GraphX > > / Giraph / GraphLab / Spark have

Re: [spark-sql] What is the right way to represent an “Any” type in Spark SQL?

2015-03-29 Thread Eran Medan
e. > > Thanks for the PR btw :) > > On Fri, Mar 27, 2015 at 2:31 PM, Eran Medan > wrote: > >> Hi everyone, >> >> I had a lot of questions today, sorry if I'm spamming the list, but I >> thought it's better than posting all questions in one thr

Understanding Spark's caching

2015-04-27 Thread Eran Medan
Hi Everyone! I'm trying to understand how Spark's cache works. Here is my naive understanding, please let me know if I'm missing something: val rdd1 = sc.textFile("some data") rdd1.cache() //marks rdd1 as cached val rdd2 = rdd1.filter(...) val rdd3 = rdd1.map(...) rdd2.saveAsTextFile("...") rdd3.sa