[ 
https://issues.apache.org/jira/browse/SPARK-43221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43221:
----------------------------------
    Fix Version/s: 4.0.0

> Host local block fetching should use a block status of a block stored on disk
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-43221
>                 URL: https://issues.apache.org/jira/browse/SPARK-43221
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 3.1.1, 3.2.0, 3.3.0
>            Reporter: Qiang Yang
>            Assignee: Attila Zsolt Piros
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0, 4.1.0
>
>         Attachments: image-2023-04-21-00-19-58-021.png, 
> image-2023-04-21-00-24-22-059.png, image-2023-04-21-00-30-41-851.png, 
> image-2023-04-21-00-50-10-918.png, image-2023-04-21-00-53-20-720.png, 
> image-2023-04-21-00-54-11-968.png, image-2023-04-21-00-57-29-140.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Environment: Spark on a YARN cluster.
> When multiple executors run on the same node and the same block exists on 
> more than one of them, in memory on some and on disk on others, an executor 
> intermittently fails to obtain the block and throws:
> java.lang.ArrayIndexOutOfBoundsException: 0
>     at 
> org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBlocks$1(TorrentBroadcast.scala:183)
>  
> The following steps replay how the problem occurs:
> step 1:
> The executor asks the driver for the block information 
> (locationsAndStatusOption). The request parameters are the BlockId and the 
> host of the executor's own node. Note that the request carries no port 
> information.
> line: 1092
> !image-2023-04-21-00-24-22-059.png!
> step 2:
> On the driver side, the driver looks up all BlockManagers holding the block 
> based on the BlockId. For non-remote-shuffle scenarios, the driver takes the 
> first BlockManager in the locations and uses its status for the BlockId.
> Assume that two BlockManagers on this node hold the BlockId: BM-1 stores 
> the block in memory, and BM-2 stores it on disk.
> Assume the returned status is the in-memory one, so its diskSize is 0.
> line: 852, 856
> !image-2023-04-21-00-30-41-851.png!
> step 3:
> This method returns a BlockLocationsAndStatus object. If any BM on the 
> host stores the block on disk, its disk path information is placed in 
> localDirs.
> !image-2023-04-21-00-50-10-918.png!
> step 4:
> When the executor receives locationsAndStatusOption, localDirs is not 
> empty, but status.diskSize is 0.
> line: 1102
> !image-2023-04-21-00-54-11-968.png!
> step 5:
> readDiskBlockFromSameHostExecutor only checks whether the block file 
> exists, and then reads the byte array using the block size passed in. If 
> the block size is 0, it returns an empty byte array.
> line: 1234, 1240
> !image-2023-04-21-00-57-29-140.png!
> Reading an element from that empty array then causes the out-of-bounds 
> exception.
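
The five steps above can be modeled in a few lines. The sketch below is a simplified Python model, not Spark's code: BlockStatus, BlockManager, locations_and_status, read_host_local_block, the prefer_disk_status flag, and the /tmp paths are all hypothetical names introduced for illustration. It assumes the status is taken from the first location while the local directory comes from a same-host BlockManager that keeps the block on disk, as the report describes. The prefer_disk_status=True branch illustrates the direction named in the issue title (use the status of the block stored on disk); it is an assumption about the shape of a fix, not the actual patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockStatus:
    # Simplified stand-in for a block status: only the two sizes matter here.
    mem_size: int
    disk_size: int


@dataclass(frozen=True)
class BlockManager:
    # Hypothetical model of one executor's block manager.
    name: str
    host: str
    status: BlockStatus
    local_dir: str


def locations_and_status(locations, requester_host, prefer_disk_status):
    """Model of the driver-side lookup from steps 2 and 3.

    Returns (status, local_dir). With prefer_disk_status=False the status
    always comes from the first location, as in the report. With True, the
    status comes from the same-host disk holder whose local_dir is returned.
    """
    if not locations:
        return None, None
    # First block manager on the requester's host that has the block on disk.
    disk_holder = next(
        (bm for bm in locations
         if bm.host == requester_host and bm.status.disk_size > 0),
        None,
    )
    local_dir = disk_holder.local_dir if disk_holder else None
    if prefer_disk_status and disk_holder is not None:
        return disk_holder.status, local_dir
    return locations[0].status, local_dir


def read_host_local_block(file_bytes, status, local_dir):
    """Model of step 5: trust status.disk_size as the number of bytes to read."""
    if local_dir is None:
        return None
    return file_bytes[:status.disk_size]


# Two block managers on the same node: BM-1 in memory, BM-2 on disk.
bm1 = BlockManager("BM-1", "node-a", BlockStatus(mem_size=4, disk_size=0), "/tmp/bm1")
bm2 = BlockManager("BM-2", "node-a", BlockStatus(mem_size=0, disk_size=4), "/tmp/bm2")
block_file = b"\x01\x02\x03\x04"

# Reported behavior: in-memory status (disk_size 0) paired with BM-2's directory.
status, local_dir = locations_and_status([bm1, bm2], "node-a", prefer_disk_status=False)
buggy = read_host_local_block(block_file, status, local_dir)

# Sketch of the fix direction: status and directory come from the same disk holder.
status, local_dir = locations_and_status([bm1, bm2], "node-a", prefer_disk_status=True)
fixed = read_host_local_block(block_file, status, local_dir)
```

In this model, buggy is an empty byte string, so indexing its first element fails in the same way as the reported exception, while fixed contains the whole block. The failure depends on which BlockManager is first in the locations, which matches the intermittent behavior described in the report.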



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
