I just did that. Where can I find the "spark-1.4.0-bin-hadoop2.4.tgz" file?
On Sun, Jun 28, 2015 at 12:15 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> You can use the following command to build Spark after applying the pull
> request:
>
> mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
>
> Cheers
>
> On Sun, Jun 28, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>
>> I see that block join support did not make it into the Spark 1.4 release.
>>
>> Can you share instructions for building Spark with this support for a
>> Hadoop 2.4.x distribution?
>>
>> Appreciate it.
>>
>> On Fri, Jun 26, 2015 at 9:23 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>
>>> This is nice. Which version of Spark has this support, or do I need to
>>> build it? I have never built Spark from git; please share instructions
>>> for Hadoop 2.4.x YARN.
>>>
>>> I am struggling a lot to get a join to work between 200G and 2TB
>>> datasets. I am constantly getting the exception below; 1000s of
>>> executors are failing with:
>>>
>>> 15/06/26 13:05:28 ERROR storage.ShuffleBlockFetcherIterator: Failed to
>>> get block(s) from phxdpehdc9dn2125.stratus.phx.ebay.com:60162
>>> java.io.IOException: Failed to connect to
>>> executor_host_name/executor_ip_address:60162
>>>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
>>>     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>>>     at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
>>>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>>>     at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>>>     at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> On Fri, Jun 26, 2015 at 3:20 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> We went through a similar process, switching from Scalding (where
>>>> everything just works on large datasets) to Spark (where it does not).
>>>>
>>>> Spark can be made to work on very large datasets; it just requires a
>>>> little more effort. Pay attention to your storage levels (should be
>>>> memory-and-disk or disk-only), the number of partitions (should be
>>>> large, a multiple of the number of executors), and avoid groupByKey.
>>>>
>>>> Also see:
>>>> https://github.com/tresata/spark-sorted (for avoiding in-memory
>>>> operations for certain types of reduce operations)
>>>> https://github.com/apache/spark/pull/6883 (for blockJoin)
>>>>
>>>> On Fri, Jun 26, 2015 at 5:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>
>>>>> Not far at all. On large datasets everything simply fails with Spark.
>>>>> Worst of all, I am not able to figure out the reason for the failures:
>>>>> the logs run into millions of lines and I do not know which keywords
>>>>> to search for.
>>>>>
>>>>> On Mon, Jun 15, 2015 at 6:52 AM, Night Wolf <nightwolf...@gmail.com> wrote:
>>>>>
>>>>>> How far did you get?
>>>>>>
>>>>>> On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com> wrote:
>>>>>>
>>>>>>> We use Scoobi + MR to perform joins, and in particular we use the
>>>>>>> blockJoin() API of Scoobi:
>>>>>>>
>>>>>>> /** Perform an equijoin with another distributed list where this
>>>>>>>   * list is considerably smaller than the right (but too large to
>>>>>>>   * fit in memory), and where the keys of right may be
>>>>>>>   * particularly skewed. */
>>>>>>> def blockJoin[B : WireFormat](right: DList[(K, B)]): DList[(K, (A, B))] =
>>>>>>>   Relational.blockJoin(left, right)
>>>>>>>
>>>>>>> I am trying to do a POC. Which Spark join API(s) would you recommend
>>>>>>> to achieve something similar?
>>>>>>>
>>>>>>> Please suggest.
>>>>>>>
>>>>>>> --
>>>>>>> Deepak

--
Deepak
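
For reference, a minimal sketch of what Koert's suggestions above (a memory-and-disk storage level, a large partition count that is a multiple of the executor count, and reduceByKey instead of groupByKey) might look like with the Spark 1.x RDD API. The input path, executor count, and tab-separated record layout are illustrative assumptions, not details taken from this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ShuffleTuningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-tuning-sketch"))

    val numExecutors  = 500                  // hypothetical; whatever --num-executors is set to
    val numPartitions = numExecutors * 8     // large, and a multiple of the executor count

    val counts = sc.textFile("hdfs:///data/events")
      .map { line => val f = line.split("\t"); (f(0), f(1).toLong) }
      .repartition(numPartitions)
      .persist(StorageLevel.MEMORY_AND_DISK) // spill to disk rather than fail when memory is tight
      .reduceByKey(_ + _)                    // combines map-side, unlike groupByKey

    counts.saveAsTextFile("hdfs:///data/event-counts")
    sc.stop()
  }
}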
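
On the blockJoin question itself: as noted above, the block-join support in https://github.com/apache/spark/pull/6883 did not ship with Spark 1.4, but the same idea, replicating the smaller side and salting the keys of the large, skewed side, can be hand-rolled with the RDD API. A rough sketch, where the paths, record layout, replication factor, and partition count are all illustrative assumptions:

import java.util.concurrent.ThreadLocalRandom

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SaltedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("salted-join-sketch"))

    // Hypothetical inputs: `small` is the ~200G side, `large` the ~2TB side with skewed keys.
    val small = sc.textFile("hdfs:///data/small").map { line => val f = line.split("\t"); (f(0), f(1)) }
    val large = sc.textFile("hdfs:///data/large").map { line => val f = line.split("\t"); (f(0), f(1)) }

    val replication = 50    // number of salt buckets; tune to the observed key skew

    // Replicate every record of the smaller side into each salt bucket...
    val saltedSmall = small.flatMap { case (k, v) =>
      (0 until replication).map(i => ((k, i), v))
    }
    // ...and route each record of the skewed side to one random bucket,
    // so a single hot key is spread across `replication` reduce tasks.
    val saltedLarge = large.map { case (k, v) =>
      ((k, ThreadLocalRandom.current().nextInt(replication)), v)
    }

    val joined = saltedSmall
      .join(saltedLarge, 4000)                       // large partition count for the shuffle
      .map { case ((k, _), (a, b)) => (k, (a, b)) }  // drop the salt again
      .persist(StorageLevel.MEMORY_AND_DISK)

    joined.saveAsTextFile("hdfs:///data/joined")
    sc.stop()
  }
}

This mirrors what the Scoobi blockJoin scaladoc quoted above describes: the left side is small enough to replicate but too large to broadcast, and the right side's skewed keys are split across buckets so no single reducer receives an entire hot key.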