On Thu, Jul 31, 2014 at 08:28 PM, Jiaxin Shi <shijiaxin...@gmail.com> wrote:
> We have a 6-nodes cluster, each node has 64GB memory.
> [...]
> But it ran out of memory. I also try 2D and 1D partition.
>
> And I also try Giraph under the same configuration, and it runs for 10
> iterations, and then it ran out of memory as well.
If Giraph is also running out of memory, it sounds like the graph is just too big to fit entirely in memory on your cluster. In that case, you could try changing the storage level from MEMORY_ONLY (the default) to MEMORY_AND_DISK. That would allow GraphX to spill partitions to disk, hurting performance but at least allowing the computation to finish. You can do this by passing --edgeStorageLevel=MEMORY_AND_DISK --vertexStorageLevel=MEMORY_AND_DISK to spark-submit.

> Should the numEPart equal to the number of nodes or number of nodes*cores?
> I think if numEPart is smaller, it will require less memory, just like the
> powergraph.

Right, increasing the number of edge partitions will increase the memory and communication overhead for both GraphX and PowerGraph. Setting the number of edge partitions to the total number of cores (nodes * cores) is a good starting point, since that will allow GraphX to exploit parallelism fully, and you can experiment with half or double that number if necessary.

Ankur
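
P.S. In case it helps, here is a rough sketch of the same settings applied programmatically from the Scala shell rather than through the spark-submit flags. It assumes a Spark build recent enough that GraphLoader.edgeListFile accepts edgeStorageLevel/vertexStorageLevel parameters; the HDFS path and the 8-cores-per-node figure are just placeholders for your own values.

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
    import org.apache.spark.storage.StorageLevel

    // Placeholder sizing: 6 nodes * 8 cores = 48 edge partitions.
    val numEPart = 6 * 8

    // Load the edge list with spill-to-disk storage levels so partitions
    // that don't fit in memory go to local disk instead of failing.
    // `sc` is the SparkContext provided by spark-shell.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt",
        numEdgePartitions = numEPart,
        edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
        vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
      .partitionBy(PartitionStrategy.EdgePartition2D) // optional 2D partitioning

    // Example computation: PageRank run to a convergence tolerance of 0.0001.
    val ranks = graph.pageRank(0.0001).vertices

Note that partitionBy shuffles the edges an extra time, so you can drop that line if the initial load itself is what's running out of memory.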