construct. For now I've tried to work around it by
persisting to multiple machines (MEMORY_AND_DISK_SER_2).
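For reference, replicated persistence like that is just a storage-level choice at persist time. A minimal Spark sketch (the input path and variable names here are hypothetical, and this assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input; the point is only the storage level below.
val data = sc.textFile("hdfs:///path/to/input")

// MEMORY_AND_DISK_SER_2: store serialized in memory, spill to disk if
// needed, and replicate each partition on 2 nodes so losing an executor
// does not force a full lineage recomputation.
data.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
```

The trade-off is extra memory and network cost for the second replica in exchange for cheaper recovery.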
Thanks! ^_^
--
Regards,
Stephen Merity
Data Scientist @ Common Crawl
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
--
Regards,
Stephen Merity
Data Scientist @ Common Crawl
Hi!
tl;dr: We're looking at potentially using Spark+GraphX to compute PageRank
over a 4 billion node + 128 billion edge graph on a regular (monthly) basis,
possibly growing larger in size over time. If anyone has hints / tips /
upcoming optimizations I should test out (or wants to contribute -- we'