Wondering if there is a configuration that needs to be tweaked or if
this is the expected response time.
Machines have 30 GB of RAM and 4 cores. It seems the CPUs are just getting
pegged, and that is what is taking so long.
Any help on this would be amazing.
Thanks,
--
MAGNE+IC
Sam Flint
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:
I have tested my Python script by using the pyspark shell, and I run into an
error because of memory limits on the name node.
I am wondering how to run the script on Spark on YARN. I am not familiar with
this at all.
Any help would be greatly appreciated.
Thanks,
--
MAGNE+IC
Sam Flint
contains all the data.
>
> On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint wrote:
>
>> Michael,
>> Thanks for your help. I found wholeTextFiles(), which I can use to
>> import all files in a directory. I believe this would be the case if all
>> the files existed in the
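For reference, a minimal sketch of the wholeTextFiles() approach mentioned above;
the HDFS path is hypothetical:

    # wholeTextFiles() returns an RDD of (path, file_contents) pairs, one
    # element per file under the directory, so a single RDD can cover every
    # file in it. `sc` is the SparkContext provided by the pyspark shell.
    pairs = sc.wholeTextFiles("hdfs:///data/events/2014-11-19/")
    print(pairs.keys().collect())   # list the files that were picked up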
Hi,
I am new to Spark. I have begun reading up on Spark's RDDs
as well as Spark SQL. My question is more about how to build out the RDDs
and about best practices. I have data that is broken down by hour into files on
HDFS in Avro format. Do I need to create a separate RDD for each
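For reference, a minimal sketch of reading all of the hourly Avro files through a
single glob path rather than building one RDD per hour; it assumes a recent Spark
with the spark-avro data source on the classpath, and the paths are hypothetical:

    from pyspark.sql import SparkSession

    # Assumes the Avro data source is available, e.g. launched with
    #   --packages org.apache.spark:spark-avro_2.12:3.4.1
    # (older releases used the com.databricks:spark-avro package instead).
    spark = SparkSession.builder.appName("hourly-avro").getOrCreate()

    # One glob picks up every hourly file for the day, so a single
    # DataFrame (and its underlying RDD) covers all hours.
    day = spark.read.format("avro").load("hdfs:///data/events/2014-11-19/*.avro")
    day.printSchema()
    print(day.count())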