On 02/06/17 13:15, Marc Shapiro wrote:
I am pasting the result of smartctl -x /dev/sda below as I have no real
clue what to do with the information, but I have a few questions first.
1) I have purchased a new, very similar, Seagate 1TB drive and I plan to
install it and copy the whole system to the new drive.
It sounds like you don't have a backup of the failing 1 TB drive (?).
Do you have a file server with ~1 TB of free space? RAID?
Run memtest86+ for 24+ hours to verify that you don't have a memory problem.
Use SeaTools to wipe the new 1 TB drive and run the short and long
tests. Stop if anything fails.
What is the best
way to do this copy since I don't want to copy bad sectors?
I've done it with 'dd' in the past, but will use 'ddrescue' in the future.
2) Once I have verified that the new drive boots
I'd do a fresh install on a 16+ GB SSD (USB flash drives also work). A
recovered system disk image is too uncertain.
and everything is running properly
As I understand it, the drive microcontroller calculates and stores a
checksum with every sector (block). That's one way it knows that a
block is bad upon reading. So, when you copy out whatever blocks you
can get, you probably won't have errors in those blocks.
But, files and directories are stored on one or more sectors. Depending
upon your file system, fsck may or may not find the missing blocks.
When you're done, the destination disk is likely to be missing files
and/or directories.
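To make the "copy out whatever blocks you can get" idea concrete, here is a minimal Python sketch. It is an illustration only, not how 'dd' or 'ddrescue' are implemented. The 512-byte block size, the function name copy_readable_blocks, and the FlakySource class (a stand-in for a failing drive) are all assumptions made for the example.

```python
import io

BLOCK = 512  # assumed sector size in bytes


def copy_readable_blocks(src, dst, total_blocks, block=BLOCK):
    """Copy block by block; zero-fill and record blocks that cannot be read."""
    bad = []
    for n in range(total_blocks):
        try:
            src.seek(n * block)
            data = src.read(block)
        except OSError:
            data = b""
            bad.append(n)
        # Pad failed or short reads so later blocks stay at the right offset.
        dst.seek(n * block)
        dst.write(data.ljust(block, b"\0"))
    return bad


class FlakySource(io.BytesIO):
    """Hypothetical failing drive: reads of the listed blocks raise OSError."""

    def __init__(self, data, bad_blocks):
        super().__init__(data)
        self.bad_blocks = set(bad_blocks)

    def read(self, size=-1):
        if self.tell() // BLOCK in self.bad_blocks:
            raise OSError(5, "Input/output error")
        return super().read(size)


original = bytes(range(256)) * 8              # 4 blocks of 512 bytes
source = FlakySource(original, bad_blocks=[2])
target = io.BytesIO()
bad = copy_readable_blocks(source, target, total_blocks=4)
print("unreadable blocks:", bad)
```

The list of unreadable blocks is the important by-product: any file or directory that lived in one of those blocks is damaged on the destination, even though every block that did copy is intact.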
I am hoping to reformat the old drive. This should
reallocate the bad sectors IIRC. I then would like to set up a raid
with both drives (keeping a close eye on the old drive). The
feasibility of this, I would guess, depends on what the posted smartctl
information tells someone who knows what to look for.
3) As I understand it, the above mentioned raid should be safe since,
even if the old drive deteriorates further, the system can run on just
the new drive. Is that correct?
Once you've copied out whatever blocks you can get, use SeaTools to wipe
the old 1 TB drive and run short and long tests. If all three pass, I
might be tempted to re-use the drive.
If it fails to wipe and still holds plaintext data, destroy it with a sledgehammer.
(Wear safety glasses!)
If it wipes but fails the short or long tests, recycle it.
Here is the smartctl output:
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Interesting, given that the drive failed SeaTools (short test?).
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test
completed having
the read element of the test failed.
Matches SeaTools result.
Total time to complete Offline
data collection: ( 600) seconds.
...
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 117 095 006 - 165391146
3 Spin_Up_Time PO---- 095 093 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 406
5 Reallocated_Sector_Ct PO--CK 072 072 036 - 1181
7 Seek_Error_Rate POSR-- 087 060 030 - 656506200
9 Power_On_Hours -O--CK 048 048 000 - 46195
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 203
183 Runtime_Bad_Block -O--CK 092 092 000 - 8
184 End-to-End_Error -O--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 011 011 000 - 89
188 Command_Timeout -O--CK 100 097 000 - 51540394008
189 High_Fly_Writes -O-RCK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 070 049 045 - 30 (Min/Max 27/32)
194 Temperature_Celsius -O---K 030 051 000 - 30 (0 20 0 0 0)
195 Hardware_ECC_Recovered -O-RC- 034 003 000 - 165391146
197 Current_Pending_Sector -O--C- 093 083 000 - 310
198 Offline_Uncorrectable ----C- 093 083 000 - 310
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 26
240 Head_Flying_Hours ------ 100 253 000 - 46718 (49 76 0)
241 Total_LBAs_Written ------ 100 253 000 - 1725386978
242 Total_LBAs_Read ------ 100 253 000 - 265479204
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
I have yet to find a good explanation for reading smartctl reports.
This post gives some clues:
https://ubuntuforums.org/showthread.php?t=2192335
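One thing that makes the RAW_VALUE column hard to read is that some very large numbers may really be several smaller counters packed together. The sketch below assumes that a 48-bit raw value can be read as three 16-bit words. That is a commonly reported interpretation for some Seagate attributes, not something this smartctl output confirms, so treat the result as a hint. The function name split_raw48 is made up for the example.

```python
def split_raw48(raw):
    """Split a 48-bit raw value into three 16-bit words, most significant first.

    Assumption: the attribute packs three 16-bit counters. Whether a given
    attribute on a given drive really does so is vendor-specific.
    """
    return ((raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF)


# Command_Timeout raw value from the failing drive's report.
command_timeout_words = split_raw48(51540394008)
print(command_timeout_words)  # (12, 12, 24)
```

Read this way, the alarming-looking Command_Timeout figure becomes three small counts instead of 51 billion timeouts.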
Here are the statistics for my ST3000DM001 for comparison:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-- 115 099 006 - 90256224
3 Spin_Up_Time PO---- 094 094 000 - 0
4 Start_Stop_Count -O--CK 100 100 020 - 577
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
7 Seek_Error_Rate POSR-- 063 060 030 - 1955231
9 Power_On_Hours -O--CK 096 096 000 - 3552
10 Spin_Retry_Count PO--C- 100 100 097 - 0
12 Power_Cycle_Count -O--CK 100 100 020 - 576
183 Runtime_Bad_Block -O--CK 100 100 000 - 0
184 End-to-End_Error -O--CK 100 100 099 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
188 Command_Timeout -O--CK 100 100 000 - 0
189 High_Fly_Writes -O-RCK 100 100 000 - 0
190 Airflow_Temperature_Cel -O---K 070 059 045 - 30 (Min/Max 19/30)
191 G-Sense_Error_Rate -O--CK 100 100 000 - 0
192 Power-Off_Retract_Count -O--CK 100 100 000 - 35
193 Load_Cycle_Count -O--CK 100 100 000 - 1323
194 Temperature_Celsius -O---K 030 041 000 - 30 (0 17 0 0)
197 Current_Pending_Sector -O--C- 100 100 000 - 0
198 Offline_Uncorrectable ----C- 100 100 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
240 Head_Flying_Hours ------ 100 253 000 - 269092585999820
241 Total_LBAs_Written ------ 100 253 000 - 2338230420
242 Total_LBAs_Read ------ 100 253 000 - 19882466886
These statistics for your drive look suspicious:
Reallocated_Sector_Ct
Reported_Uncorrect
Runtime_Bad_Block
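The comparison above can be sketched in Python: parse the attribute rows and flag the three attributes named here when their raw value is non-zero. The rows are copied from the two reports in this message. The names WATCHED, parse_attributes and suspicious are made up for the example, and "raw value greater than zero" is a simplifying assumption, not an official failure criterion.

```python
WATCHED = {"Reallocated_Sector_Ct", "Reported_Uncorrect", "Runtime_Bad_Block"}


def parse_attributes(text):
    """Parse 'ID NAME FLAGS VALUE WORST THRESH FAIL RAW' rows into a dict."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and fields[0].isdigit():
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": int(fields[7]),
            }
    return attrs


def suspicious(attrs, watched=WATCHED):
    """Return the watched attribute names whose raw value is non-zero."""
    return sorted(n for n in watched if n in attrs and attrs[n]["raw"] > 0)


failing_drive = """\
5 Reallocated_Sector_Ct PO--CK 072 072 036 - 1181
183 Runtime_Bad_Block -O--CK 092 092 000 - 8
187 Reported_Uncorrect -O--CK 011 011 000 - 89
"""

st3000dm001 = """\
5 Reallocated_Sector_Ct PO--CK 100 100 010 - 0
183 Runtime_Bad_Block -O--CK 100 100 000 - 0
187 Reported_Uncorrect -O--CK 100 100 000 - 0
"""

print(suspicious(parse_attributes(failing_drive)))
print(suspicious(parse_attributes(st3000dm001)))
```

Current_Pending_Sector and Offline_Uncorrectable (both 310 on the failing drive, 0 on the ST3000DM001) could be added to the watched set in the same way.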
...
SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 89 (device log contains only the most recent 20 errors)
That's not good. Mine says:
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 46194
This could be SeaTools (?).
Let us know how it turns out.
David