Delaying failed task retries + giving failing tasks to different nodes

2015-04-02 Thread Stephen Merity
construct. For now I've tried to work around it by persisting to multiple machines (MEMORY_AND_DISK_SER_2). Thanks! ^_^
-- Regards, Stephen Merity, Data Scientist @ Common Crawl
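As context for the workaround mentioned above, a minimal sketch of persisting with a replicated storage level; the RDD name here is hypothetical. MEMORY_AND_DISK_SER_2 keeps a second serialized copy of each partition on another node, so losing an executor does not force a full recomputation:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical RDD standing in for any expensive-to-recompute dataset.
val ranks = expensiveComputation()

// Store partitions serialized in memory (spilling to disk as needed)
// and replicate each partition to a second node; a failed task can then
// be rescheduled against the surviving replica.
ranks.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
```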

GraphX for large scale PageRank (~4 billion nodes, ~128 billion edges)

2014-12-12 Thread Stephen Merity
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

GraphX for large scale PageRank (~4 billion nodes, ~128 billion edges)

2014-12-12 Thread Stephen Merity
Hi! tl;dr: We're looking at potentially using Spark + GraphX to compute PageRank over a 4 billion node, 128 billion edge graph on a regular (monthly) basis, possibly growing larger in size over time. If anyone has hints / tips / upcoming optimizations I should test out (or wants to contribute -- we'
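For reference, a hedged sketch of what such a job might look like using GraphX's built-in PageRank; the input path and iteration count are assumptions for illustration, not details from the thread (the real Common Crawl graph would be loaded from its own format):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

val conf = new SparkConf().setAppName("PageRank")
val sc = new SparkContext(conf)

// Assumed edge-list input: one "srcId dstId" pair per line.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///graph/edges.txt")

// Fixed-iteration PageRank; GraphX also offers a
// tolerance-based variant via pageRank(tol).
val ranks = graph.staticPageRank(10).vertices

ranks.take(5).foreach(println)
```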