RE: Regarding spark-3.2.0 decommission features.

Patidar, Mohanlal (Nokia - IN/Bangalore) Wed, 19 Jan 2022 22:28:08 -0800

Gentle reminder!!!

Br,
-Mohan Patidar




From: Patidar, Mohanlal (Nokia - IN/Bangalore)
Sent: Tuesday, January 18, 2022 2:02 PM
To: user@spark.apache.org
Cc: Rao, Abhishek (Nokia - IN/Bangalore) <abhishek....@nokia.com>; Gowda Tp, 
Thimme (Nokia - IN/Bangalore) <thimme.gowda...@nokia.com>; Sharma, Prakash 
(Nokia - IN/Bangalore) <prakash.sha...@nokia.com>; Tarun, N (Nokia - 
IN/Bangalore) <n.ta...@nokia.com>; Badagandi, Srinivas B. (Nokia - 
IN/Bangalore) <srinivas.b.badaga...@nokia.com>
Subject: Regarding spark-3.2.0 decommission features.

Hi,
     We're using Spark 3.2.0 and have enabled the Spark decommission
feature. As part of validating it, we wanted to check whether the RDD
blocks and shuffle blocks from decommissioned executors are migrated to
other executors.
However, we could not see this happening. Below is the configuration we used.

  1.  Spark Configuration used:
     spark.local.dir /mnt/spark-ldir
     spark.decommission.enabled true
     spark.storage.decommission.enabled true
     spark.storage.decommission.rddBlocks.enabled true
     spark.storage.decommission.shuffleBlocks.enabled true
     spark.dynamicAllocation.enabled true
  2.  Brought up the Spark driver and executors on different nodes.
NAME                                            READY   STATUS    NODE
decommission-driver                             1/1     Running   Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-1   1/1     Running   Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-2   1/1     Running   Node2
gzip-compression-test-ae0b0b7e4d7fbe40-exec-3   1/1     Running   Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-4   1/1     Running   Node2
gzip-compression-test-ae0b0b7e4d7fbe40-exec-5   1/1     Running   Node1
  3.  Brought down Node2; the pod status was then as follows.
NAME                                            READY   STATUS        NODE
decommission-driver                             1/1     Running       Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-1   1/1     Running       Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-2   1/1     Terminating   Node2
gzip-compression-test-ae0b0b7e4d7fbe40-exec-3   1/1     Running       Node1
gzip-compression-test-ae0b0b7e4d7fbe40-exec-4   1/1     Terminating   Node2
gzip-compression-test-ae0b0b7e4d7fbe40-exec-5   1/1     Running       Node1
  4.  Driver logs:
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.296Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.459Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.564Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.601Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:55:28.667Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.885Z", "timezone":"UTC", "log":"Notify executor 5 to decommissioning."}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Notify executor 1 to decommissioning."}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Notify executor 3 to decommissioning."}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:58:21.887Z", "timezone":"UTC", "log":"Mark BlockManagers (BlockManagerId(5, X.X.X.X, 33359, None), BlockManagerId(1, X.X.X.X, 38655, None), BlockManagerId(3, X.X.X.X, 35797, None)) as being decommissioning."}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:59:24.426Z", "timezone":"UTC", "log":"Executor 2 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 1)."}
{"type":"log", "level":"INFO", "time":"2022-01-12T08:59:24.426Z", "timezone":"UTC", "log":"Executor 4 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 2)."}
  5.  Verified by exec'ing into all live executors (1, 3, 5) and checking
/mnt/spark-ldir/. Only one blockManager ID is present there; no other
blockManager ID has been copied to this location.
Example:
      $ kubectl exec -it gzip-compression-test-ae0b0b7e4d7fbe40-exec-1 -n test bash
      $ cd /mnt/spark-ldir/
      $ ls
      blockmgr-60872c99-e7d6-43ba-a43e-a97fc9f619ca
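
The directory check in step 5 can also be scripted. The following is a minimal Python sketch, not part of the original report: it counts the files under each blockmgr-* directory, grouped by file-name prefix, so the counts can be compared before and after Node2 goes down. The function name count_block_files is hypothetical, and grouping by the text before the first underscore (for example "shuffle" or "rdd") is an assumption about how block files are named on disk.

```python
from collections import Counter
from pathlib import Path


def count_block_files(local_dir):
    """Count files under each blockmgr-* directory, grouped by name prefix.

    Returns a dict such as {"blockmgr-<id>": {"shuffle": 2, "rdd": 1}}.
    The prefix is whatever precedes the first underscore in the file name
    (an assumed naming convention, not verified against this cluster).
    """
    counts = {}
    for mgr_dir in sorted(Path(local_dir).glob("blockmgr-*")):
        per_prefix = Counter()
        # Block files sit in hashed subdirectories, so walk recursively.
        for path in mgr_dir.rglob("*"):
            if path.is_file():
                per_prefix[path.name.split("_", 1)[0]] += 1
        counts[mgr_dir.name] = dict(per_prefix)
    return counts
```

Inside an executor pod this could be called as count_block_files("/mnt/spark-ldir"), where the path is the spark.local.dir value from step 1. Comparing counts avoids relying only on whether a second blockmgr-* directory appears.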

Since the migration was not happening, we tried the fallback storage option,
specifying an HDFS location. Unfortunately, we could not see the RDD and
shuffle blocks in this fallback storage location either. Below is the
configuration we used.


  1.  Spark Configuration Used:
     spark.decommission.enabled true
     spark.storage.decommission.enabled true
     spark.storage.decommission.rddBlocks.enabled true
     spark.storage.decommission.shuffleBlocks.enabled true
     spark.storage.decommission.fallbackStorage.path hdfs://namenodeHA/tmp/fallbackstorage
     spark.dynamicAllocation.enabled true

  2.  Brought up one Spark driver and one executor on different nodes.
      NAME                                            READY   NODE
      decommission-driver                             1/1     Node1
      gzip-compression-test-49acf67e679f9259-exec-1   1/1     Node2

  3.  Brought down Node2; the pod status was then as follows.
      Example:
      NAME                                            READY   STATUS
      decommission-driver                             1/1     Running
      gzip-compression-test-49acf67e679f9259-exec-1   1/1     Running
      gzip-compression-test-49acf67e679f9259-exec-1   1/1     Running
      gzip-compression-test-49acf67e679f9259-exec-1   1/1     Terminating

  4.  Checked the fallback storage location for migrated data:
       Example:
                     $ hdfs dfs -ls /tmp/fallbackstorage

       Note: this location is still empty.
  5.  Driver logs:
         Related to fallback:
         {"type":"log", "level":"INFO", "time":"2022-01-17T10:40:21.682Z", "timezone":"UTC", "log":"Registering BlockManager BlockManagerId(fallback, remote, 7337, None)"}
         {"type":"log", "level":"INFO", "time":"2022-01-17T10:40:21.682Z", "timezone":"UTC", "log":"Registering block manager remote:7337 with 0.0 B RAM, BlockManagerId(fallback, remote, 7337, None)"}
         {"type":"log", "level":"INFO", "time":"2022-01-17T10:40:21.682Z", "timezone":"UTC", "log":"Registered BlockManager BlockManagerId(fallback, remote, 7337, None)"}
         Related to decommissioning:
         {"type":"log", "level":"INFO", "time":"2022-01-17T10:40:21.661Z", "timezone":"UTC", "log":"Adding decommission script to lifecycle"}
         {"type":"log", "level":"INFO", "time":"2022-01-17T10:46:17.952Z", "timezone":"UTC", "log":"Executor 1 is removed. Remove reason statistics: (gracefully decommissioned: 0, decommision unfinished: 0, driver killed: 0, unexpectedly exited: 1)."}
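
The driver logs from both runs can be summarized mechanically. The following is a minimal Python sketch, not part of the original report. It assumes each log record sits on a single line and carries its message in the "log" field, as in the excerpts above. The function name summarize_decommission is hypothetical. It collects the executor IDs that were notified to decommission and the IDs that were later removed.

```python
import json
import re

# Message patterns copied from the driver log excerpts in this mail.
NOTIFY_RE = re.compile(r"Notify executor (\d+) to decommissioning")
REMOVED_RE = re.compile(r"Executor (\d+) is removed")


def summarize_decommission(log_lines):
    """Return (notified, removed) sets of executor IDs from JSON log lines."""
    notified, removed = set(), set()
    for line in log_lines:
        try:
            message = json.loads(line).get("log", "")
        except (json.JSONDecodeError, AttributeError):
            continue  # skip lines that are not JSON objects
        match = NOTIFY_RE.search(message)
        if match:
            notified.add(int(match.group(1)))
            continue
        match = REMOVED_RE.search(message)
        if match:
            removed.add(int(match.group(1)))
    return notified, removed
```

Applied to the first run's logs, this gives notified executors 1, 3 and 5 and removed executors 2 and 4, so the two sets do not overlap. For the second run there are no notify messages, and executor 1 is the only one removed.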


Please let us know if we are missing anything that is preventing the migration
of RDD and shuffle blocks.
Thanks and Regards,
-Mohan Patidar
