I just did that. Where can I find the "spark-1.4.0-bin-hadoop2.4.tgz" file?
On Sun, Jun 28, 2015 at 12:15 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> You can use the following command to build Spark after applying the pull
> request:
>
> mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
>
> Cheers
>
> On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>
>> I see that block join support did not make it into the Spark 1.4 release.
>>
>> Can you share instructions for building Spark with this support for a
>> Hadoop 2.4.x distribution?
>>
>> Appreciate it.
>>
>> On Fri, Jun 26, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>
>>> This is nice. Which version of Spark has this support, or do I need to
>>> build it? I have never built Spark from git; please share instructions
>>> for Hadoop 2.4.x YARN.
>>>
>>> I am struggling a lot to get a join to work between 200G and 2TB
>>> datasets. I am constantly getting the exception below; 1000s of
>>> executors are failing with:
>>>
>>> 15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to
>>> get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
>>> java.io.IOException: Failed to connect to
>>> executor_host_name/executor_ip_address:60162
>>>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>>>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>>>     at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>>>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>>>     at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> We went through a similar process, switching from Scalding (where
>>>> everything just works on large datasets) to Spark (where it does not).
>>>>
>>>> Spark can be made to work on very large datasets; it just requires a
>>>> little more effort. Pay attention to your storage levels (should be
>>>> memory-and-disk or disk-only), the number of partitions (should be
>>>> large, a multiple of the number of executors), and avoid groupByKey.
>>>>
>>>> Also see:
>>>> https://github.com/tresata/spark-sorted (for avoiding in-memory
>>>> operations for certain types of reduce operations)
>>>> https://github.com/apache/spark/pull/6883 (for blockJoin)
>>>>
>>>> On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>
>>>>> Not far at all. On large datasets everything simply fails with Spark.
>>>>> Worst of all, I am not able to figure out the reason for the failures:
>>>>> the logs run into millions of lines and I do not know which keywords
>>>>> to search for.
>>>>>
>>>>> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolf...@gmail.com> wrote:
>>>>>
>>>>>> How far did you get?
>>>>>>
>>>>>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>>>
>>>>>>> We use Scoobi + MR to perform joins, and in particular we use the
>>>>>>> blockJoin() API of Scoobi:
>>>>>>>
>>>>>>> /** Perform an equijoin with another distributed list where this
>>>>>>>   * list is considerably smaller than the right (but too large to
>>>>>>>   * fit in memory), and where the keys of right may be
>>>>>>>   * particularly skewed. */
>>>>>>> def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>>>>>   Relational.blockJoin(left, right)
>>>>>>>
>>>>>>> I am trying to do a POC. Which Spark join API(s) would you recommend
>>>>>>> to achieve something similar?
>>>>>>>
>>>>>>> Please suggest.
>>>>>>>
>>>>>>> --
>>>>>>> Deepak

--
Deepak
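
For reference, a minimal sketch of what Koert's suggestions above (a memory-and-disk storage level, a large partition count that is a multiple of the executor count, and reduceByKey instead of groupByKey) might look like with the Spark 1.x RDD API. The input path, executor count, and tab-separated record layout are illustrative assumptions, not details taken from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ShuffleTuningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-tuning-sketch"))

    val numExecutors  = 500                  // hypothetical; whatever --num-executors is set to
    val numPartitions = numExecutors * 8     // large, and a multiple of the executor count

    val counts = sc.textFile("hdfs:///data/events")
      .map { line => val f = line.split("\t"); (f(0), f(1).toLong) }
      .repartition(numPartitions)
      .persist(StorageLevel.MEMORY_AND_DISK) // spill to disk rather than fail when memory is tight
      .reduceByKey(_ + _)                    // combines map-side, unlike groupByKey

    counts.saveAsTextFile("hdfs:///data/event-counts")
    sc.stop()
  }
}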
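
On the blockJoin question itself: as noted above, the block-join support in https://github.com/apache/spark/pull/6883 did not ship with Spark 1.4, but the same idea, replicating the smaller side and salting the keys of the large, skewed side, can be hand-rolled with the RDD API. A rough sketch, where the paths, record layout, replication factor, and partition count are all illustrative assumptions:

import java.util.concurrent.ThreadLocalRandom

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SaltedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("salted-join-sketch"))

    // Hypothetical inputs: `small` is the ~200G side, `large` the ~2TB side with skewed keys.
    val small = sc.textFile("hdfs:///data/small").map { line => val f = line.split("\t"); (f(0), f(1)) }
    val large = sc.textFile("hdfs:///data/large").map { line => val f = line.split("\t"); (f(0), f(1)) }

    val replication = 50    // number of salt buckets; tune to the observed key skew

    // Replicate every record of the smaller side into each salt bucket...
    val saltedSmall = small.flatMap { case (k, v) =>
      (0 until replication).map(i => ((k, i), v))
    }
    // ...and route each record of the skewed side to one random bucket,
    // so a single hot key is spread across `replication` reduce tasks.
    val saltedLarge = large.map { case (k, v) =>
      ((k, ThreadLocalRandom.current().nextInt(replication)), v)
    }

    val joined = saltedSmall
      .join(saltedLarge, 4000)                       // large partition count for the shuffle
      .map { case ((k, _), (a, b)) => (k, (a, b)) }  // drop the salt again
      .persist(StorageLevel.MEMORY_AND_DISK)

    joined.saveAsTextFile("hdfs:///data/joined")
    sc.stop()
  }
}

This mirrors what the Scoobi blockJoin scaladoc quoted above describes: the left side is small enough to replicate but too large to broadcast, and the right side's skewed keys are split across buckets so no single reducer receives an entire hot key.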