M. Dietrich wrote:
Does this mean that with normal 686 or 486 kernel the corruption
doesn't happen?
yes.
So it could be a kernel bug. Or the bigmem kernel triggers the problem earlier
or more frequently.
Have you already searched the internet to see if someone else has hit your
problem? Because i suspect it's not a kernel problem (see later)...
there are no special settings installed using hdparm:
/dev/sda:
 multcount     =   0 (off)
 IO_support    =   1 (32-bit)
 readonly      =   0 (off)
 readahead     = 256 (on)
 geometry      = 30401/255/63, sectors = 488397168, start = 0
This is the output of the command, but it doesn't show everything
you could have changed from the defaults. Have you customized
/etc/hdparm.conf?
For example, i've set apm=254, but the above output doesn't report it.
My suggestion is: try commenting out everything you have customized
for the disk.
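To see at a glance what is actually active in that file, it can help to strip the comments and blank lines. A minimal Python sketch, assuming the file uses '#' for comments; the function name and the sample text are made up for illustration and are not taken from your system:

```python
def active_lines(conf_text):
    """Return the non-blank, non-comment lines of an hdparm.conf-style text."""
    result = []
    for raw in conf_text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if line:
            result.append(line)
    return result


# Hypothetical file content, only to show the idea.
sample = """\
# Default settings shipped with the package
/dev/sda {
    apm = 254   # avoid aggressive head parking
}
"""

for line in active_lines(sample):
    print(line)
```

On a real system you would read the text from /etc/hdparm.conf instead of the sample string; whatever is printed is a candidate for commenting out.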
it's already installed, this is the output:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   085   069   034    Pre-fail  Always       -       98867399
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   001   001   020    Old_age   Always   FAILING_NOW 248712
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       40211526
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       269350284038985
 10 Spin_Retry_Count        0x0013   100   100   034    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       448
184 End-to-End_Error        0x0032   100   253   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x003a   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x0022   100   100   045    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   071   052   000    Old_age   Always       -       29 (Lifetime Min/Max 10/48)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       19
192 Power-Off_Retract_Count 0x0022   062   062   000    Old_age   Always       -       77434
193 Load_Cycle_Count        0x001a   001   001   000    Old_age   Always       -       320283
194 Temperature_Celsius     0x0012   029   048   000    Old_age   Always       -       29 (0 10 0 0)
195 Hardware_ECC_Recovered  0x0010   070   061   000    Old_age   Offline      -       98881899
196 Reallocated_Event_Count 0x003e   096   096   000    Old_age   Always       -       3645 (28548, 0)
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0000   100   253   000    Old_age   Offline      -       0
i wonder how to interpret that. Start_Stop_Count has FAILING_NOW, maybe because
hdaps is stopping the device often? why is that bad? hm.
Good question. I suggest downloading a diagnostic tool from your disk
vendor's site to see whether it reports the disk as failing.
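As far as i know, smartctl prints FAILING_NOW when the normalized VALUE of an attribute is at or below its THRESH, and a THRESH of 0 never trips. So the flag is about the vendor's normalized scale, not about the raw counter directly. A small Python sketch of that check, assuming one attribute per line in the usual column order; the function name is made up, and the three rows are copied from your table:

```python
def failing_attributes(table_text):
    """Return names of attributes whose normalized VALUE is at or below THRESH."""
    failing = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least these columns:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH ...
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        name = fields[1]
        try:
            value, thresh = int(fields[3]), int(fields[5])
        except ValueError:
            continue
        # A threshold of 0 means "no threshold", so it can never trip.
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


rows = """\
  1 Raw_Read_Error_Rate     0x000f   085   069   034    Pre-fail  Always       -       98867399
  4 Start_Stop_Count        0x0032   001   001   020    Old_age   Always   FAILING_NOW 248712
193 Load_Cycle_Count        0x001a   001   001   000    Old_age   Always       -       320283
"""

print(failing_attributes(rows))  # ['Start_Stop_Count']
```

Note that Load_Cycle_Count has the same normalized value of 001 but is not flagged, only because its threshold is 000.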
One of the problems with SMART is that the semantics of the table above
are not consistent between manufacturers. So i suggest you look at the
Wikipedia SMART page, in particular the "Known ATA S.M.A.R.T.
attributes" table, but take it with a grain of salt:
http://en.wikipedia.org/wiki/S.M.A.R.T.
That said, from your SMART table i'd do some searching on these
attributes for your disk manufacturer:
* Raw_Read_Error_Rate
* Start_Stop_Count
* Seek_Error_Rate
* Power-Off_Retract_Count
* Load_Cycle_Count
* Hardware_ECC_Recovered
* Reallocated_Event_Count
I'd check whether the raw values of Raw_Read_Error_Rate and Seek_Error_Rate, as
used by your manufacturer, are worrying or not.
Same thing for Hardware_ECC_Recovered. At work we have at least 4 Maxtor drives
that show high and constantly increasing raw values, but they have worked without
problems for years.
The Reallocated_Event_Count also deserves some investigation: why
is it so high when Reallocated_Sector_Ct and Current_Pending_Sector are zero?
Last, from your SMART table it seems that your drive often goes into
standby/sleep mode. This can be seen from the high raw values of
Start_Stop_Count, Load_Cycle_Count and Power-Off_Retract_Count. And in
your initial report you said that you use suspend/resume.
I think you should reduce these counts, because they are very high
and all these start/stop cycles will reduce (or already have reduced) the life of
your disk.
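To put a number on "very high", you can relate the load cycles to the power-on time. The raw Power_On_Hours value in your table is far too large to be plain hours, so the sketch below ASSUMES that only its low 32 bits are the hour count. That is a guess about how this drive packs the field, not something the table confirms, so check it against the vendor tool. The function name is made up for illustration:

```python
def cycles_per_hour(load_cycles, power_on_hours_raw):
    """Average head load cycles per power-on hour."""
    # ASSUMPTION: only the low 32 bits of the raw value are hours;
    # the upper bits are taken to be some other vendor-specific counter.
    hours = power_on_hours_raw & 0xFFFFFFFF
    if hours == 0:
        raise ValueError("no power-on hours recorded")
    return load_cycles / hours


# Raw values of Load_Cycle_Count and Power_On_Hours from the table above.
rate = cycles_per_hour(320283, 269350284038985)
print(f"{rate:.1f} load cycles per power-on hour")
```

Under that assumption the drive has about 4937 power-on hours, which works out to roughly 65 load cycles per hour, i.e. about one head park per minute of use.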
Maybe something on your system forces overly aggressive power
saving on your disk. Is laptop-mode-tools installed?
In any case it is a common problem, as a quick search will show.
It is the reason i've put "apm=254" in my hdparm configuration: without
it my disk parked its heads a bit too often *during normal pc usage*.
I could notice this as clicks and very brief unresponsiveness of the
system. With that parameter i've forced my disk to work at full power
without parking and going to sleep automatically.
Your disk could require different settings.
i never had strange behaviour
related to mem like kernel freezes or program core dumps and i use the system
quite a lot with big (cross-)compiles and everything that uses a lot of mem...
In your initial report you said that you noticed the problem from 2.6.28 on,
but that you found it accidentally. Another test would be to try previous
kernels to see if they work, for example 2.6.26 from Lenny.
You can get other kernels from:
http://snapshot.debian.net/
I understand that removing RAM is difficult for you, but it is another
suspect. Is it the original from Lenovo?
A hard problem to solve.
Good luck.
Cesare.