We went through a similar process, switching from Scalding (where everything just works on large datasets) to Spark (where it does not).
Spark can be made to work on very large datasets; it just requires a little more effort. Pay attention to your storage levels (should be memory-and-disk or disk-only) and number of partitions (should be large, and a multiple of the number of executors), and avoid groupByKey.

Also see:
https://github.com/tresata/spark-sorted (for avoiding in-memory operations for certain types of reduce operations)
https://github.com/apache/spark/pull/6883 (for blockjoin)

On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <[email protected]> wrote:

> Not far at all. On large data sets everything simply fails with Spark.
> Worst is am not able to figure out the reason of failure, the logs run
> into millions of lines and i do not know the keywords to search for failure
> reason
>
> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <[email protected]> wrote:
>
>> How far did you get?
>>
>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <[email protected]> wrote:
>>
>>> We use Scoobi + MR to perform joins and we particularly use blockJoin()
>>> API of scoobi
>>>
>>> /** Perform an equijoin with another distributed list where this list is considerably smaller
>>>   * than the right (but too large to fit in memory), and where the keys of right may be
>>>   * particularly skewed. */
>>> def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>   Relational.blockJoin(left, right)
>>>
>>> I am trying to do a POC and what Spark join API(s) is recommended to
>>> achieve something similar ?
>>>
>>> Please suggest.
>>>
>>> --
>>> Deepak
>
> --
> Deepak
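
To make the tuning advice at the top of this message a bit more concrete, here is a minimal sketch of the three points (storage level, partition count, reduceByKey instead of groupByKey). It is only an illustration under assumptions: the input and output paths, the tab-separated record layout, the executor count, and the partitions-per-executor factor are all hypothetical and would need to be adapted to the actual job and cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LargeDatasetSketch {
  def main(args: Array[String]): Unit = {
    // Master and resources are expected to come from spark-submit.
    val conf = new SparkConf().setAppName("large-dataset-sketch")
    val sc = new SparkContext(conf)

    // Hypothetical numbers: pick a partition count that is large and a
    // multiple of the number of executors.
    val numExecutors = 50
    val numPartitions = numExecutors * 40

    // Hypothetical input: tab-separated lines whose first field is the key.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .map { line =>
        val fields = line.split("\t", 2)
        (fields(0), 1L)
      }
      .repartition(numPartitions)
      // Spill to disk instead of failing or recomputing when the data does
      // not fit in memory; StorageLevel.DISK_ONLY is the other option.
      .persist(StorageLevel.MEMORY_AND_DISK)

    // The RDD is used by two jobs below, which is why it is persisted.
    val totalRecords = pairs.count()

    // reduceByKey combines values map-side before the shuffle, whereas
    // groupByKey would pull every value for a key into one in-memory
    // collection.
    val counts = pairs.reduceByKey(_ + _, numPartitions)

    counts.saveAsTextFile("hdfs:///path/to/output")
    println(s"processed $totalRecords records")

    sc.stop()
  }
}
```

The sketch only covers aggregation; it does not attempt to reproduce the skew handling of Scoobi's blockJoin, for which the pull request linked above is the relevant pointer.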
