Spark response times for queries seem slow

2015-01-05 Thread Sam Flint
Wondering if there is a configuration that needs to be tweaked or if this is the expected response time. Machines have 30 GB of RAM and 4 cores. It seems the CPUs are just getting pegged, and that is what is taking so long. Any help on this would be amazing. Thanks, -- MAGNE+IC, Sam Flint
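Without seeing the job it is hard to say, but on CPU-bound 4-core/30 GB nodes the usual knobs are how much of each machine an executor may use and how many partitions the job has. A configuration sketch of the kind of settings involved (the app name and all values are illustrative, not a recommendation):

```python
from pyspark import SparkConf, SparkContext

# Illustrative values only -- the right numbers depend on the job and the
# cluster manager. Leave headroom below 30 GB for the OS and HDFS daemons.
conf = (SparkConf()
        .setAppName("slow-query-investigation")   # hypothetical app name
        .set("spark.executor.memory", "24g")      # below the 30 GB physical limit
        .set("spark.executor.cores", "4")
        .set("spark.default.parallelism", "32"))  # several partitions per core
sc = SparkContext(conf=conf)
```

Note that with only 4 cores per machine, a CPU-bound scan will peg them no matter what; raising parallelism mainly helps when partitions are too few or skewed.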

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: pyspark on yarn

2015-01-05 Thread Sam Flint
er.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:

Spark Sql on Yarn using python

2014-12-16 Thread Sam Flint
I have tested my python script by using the pyspark shell. I run into an error because of memory limits on the name node. I am wondering how to run the script on Spark on YARN. I am not familiar with this at all. Any help would be greatly appreciated. Thanks, -- MAGNE+IC, Sam Flint
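For the 2014-era Spark releases this thread dates from, a pyspark script could target YARN either by passing `--master yarn-client` to `spark-submit`, or by setting the master inside the script. A minimal configuration sketch (the script name and app name are made up):

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Typically launched with:  spark-submit --master yarn-client query_avro.py
# ("yarn-client" / "yarn-cluster" were the Spark 1.x master strings).
conf = (SparkConf()
        .setAppName("query_avro")       # hypothetical name
        .setMaster("yarn-client"))      # driver runs locally, executors on YARN
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# ... build RDDs / run Spark SQL queries here ...
sc.stop()
```

Running under YARN moves the executors off the submitting machine, which is what avoids the memory limits hit when everything runs in the local pyspark shell.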

Re: NEW to spark and sparksql

2014-11-20 Thread Sam Flint
contains all the data. > > On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint wrote: > >> Michael, >> Thanks for your help. I found wholeTextFiles(), which I can use to >> import all files in a directory. I believe this would be the case if all >> the files existed in the

NEW to spark and sparksql

2014-11-19 Thread Sam Flint
Hi, I am new to Spark. I have begun reading to understand Spark's RDDs as well as Spark SQL. My question is more about how to build out the RDDs and best practices. I have data that is broken down by hour into files on HDFS in Avro format. Do I need to create a separate RDD for each
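A separate RDD per hourly file is usually unnecessary: the path argument to Spark's file-reading methods accepts globs, so one RDD can cover a whole day. As a sketch, a pure-Python helper that builds such a glob for a hypothetical HDFS layout of `/data/YYYY/MM/DD/HH/` (adjust to the real directory structure):

```python
def hourly_glob(base, year, month, day):
    """Build an HDFS glob matching all 24 hourly directories of one day.

    The /data/YYYY/MM/DD/HH/ layout is an assumption for illustration.
    """
    return "%s/%04d/%02d/%02d/*/*.avro" % (base, year, month, day)

# A single call such as sc.hadoopFile(hourly_glob("hdfs:///data", 2014, 11, 19), ...)
# would then read every hour's files into one RDD.
print(hourly_glob("hdfs:///data", 2014, 11, 19))
# -> hdfs:///data/2014/11/19/*/*.avro
```

Partitioning still happens per underlying file block, so one glob-backed RDD parallelizes just as well as many small RDDs would.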