Repairing S.M.A.R.T. errors on an HDD in a RAID1 array

Right after New Year's, S.M.A.R.T. reported bad sectors.

I checked the disk in Disk Utility.

This disk is part of a FakeRAID array set up on Ubuntu.
I had never repaired a RAID before...

So I gave it a try.

Backup

First things first: take a backup.
I copied off all the data on the affected HDD.
I had only just attached an external hard disk to my Nasne, so luckily I had somewhere to put the data.

Locating the errors

To locate the errors, run a S.M.A.R.T. self-test. (A short test does not always hit the error; if it doesn't, run a long test instead.)

$ sudo smartctl -t short /dev/sdd
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-35-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sun Jan  6 10:46:18 2013

Use smartctl -X to abort test.
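
The wait time printed in the output above could also be picked up by a script instead of by eye. This is only a minimal sketch: it assumes the "Please wait N minutes" line is worded as in the smartctl 5.41 output shown here (other versions may differ), and `selftest_wait_minutes` is a hypothetical helper name, not part of smartmontools.

```shell
# Hypothetical helper: read `smartctl -t ...` output on stdin and print the
# number of minutes from the "Please wait N minutes" line.
# Assumes the wording of the smartctl 5.41 output shown above.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes.*/\1/p'
}

# Example (needs root and a real drive, so it is left commented out):
#   minutes=$(sudo smartctl -t short /dev/sdd | selftest_wait_minutes)
#   sleep $(( minutes * 60 ))
```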

It says to wait 2 minutes, so I waited.
After 2 minutes, check the result.

$ sudo smartctl -A -l selftest /dev/sdd
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-35-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   113   109   021    Pre-fail  Always       -       7325
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       523
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20601
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       521
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       284
193 Load_Cycle_Count        0x0032   051   051   000    Old_age   Always       -       448507
194 Temperature_Celsius     0x0022   119   094   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%     20600         1348821247

Current_Pending_Sector shows two problem sectors. For now, the self-test log tells us that the first failing sector is at LBA 1348821247.
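
If the log gets long, the failing LBA could be pulled out with a small filter. The following is a sketch under the assumption that the self-test log keeps the layout shown above, with the LBA in the last column; `first_error_lba` is a name made up for this example.

```shell
# Hypothetical helper: read `smartctl -l selftest` output on stdin and print
# the LBA of the most recent self-test entry that ended in a read failure.
# Assumes the log layout shown above, where the LBA is the last column.
first_error_lba() {
    awk '/^# *[0-9]+/ && /read failure/ { print $NF; exit }'
}

# Example (needs root and a real drive, so it is left commented out):
#   sudo smartctl -l selftest /dev/sdd | first_error_lba
```

It prints nothing when no entry reports a read failure.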

Deactivating the RAID

I had no idea what might happen if I repaired the disk with the RAID still assembled, so I deactivated the RAID first.
First, check the RAID-set name.

$ sudo dmraid -s
*** Group superset isw_defhhdihjj
--> Active Subset
name   : isw_defhhdihjj_Volume0
size   : 1953517824
stride : 128
type   : mirror
status : ok
subsets: 0
devs   : 2
spares : 0

The name turns out to be isw_defhhdihjj_Volume0.
Unmount the filesystem, then remove the RAID mapping.

$ sudo umount /dev/mapper/isw_defhhdihjj_Volume0p1
$ sudo dmraid -an isw_defhhdihjj_Volume0
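
Rather than copying the RAID-set name by hand, the `key : value` lines of the `dmraid -s` output could be read with a small filter. This is a sketch that assumes the output format shown above; `dmraid_field` is a hypothetical helper, not a dmraid feature.

```shell
# Hypothetical helper: read `dmraid -s` output on stdin and print the value
# of the first "key : value" line whose key matches the argument.
# Assumes the output format shown above.
dmraid_field() {
    awk -F' *: *' -v key="$1" '$1 == key { print $2; exit }'
}

# Example (needs root and a real array, so it is left commented out):
#   set_name=$(sudo dmraid -s | dmraid_field name)
#   sudo dmraid -an "$set_name"
```

The same helper can read other fields, for example `dmraid_field status`.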

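Repair

The goal is to either repair the Current Pending Sector or have it reallocated, and in principle simply writing something to that sector should be enough.
Reference: Technical Memorandum, "ハードディスクエラーとSMARTのPending sector, reallocated sector" (hard disk errors and SMART pending / reallocated sectors)

I tried the write with hdparm, which lets you specify the LBA directly.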

$ sudo hdparm --write-sector 1348821247 /dev/sdd
Use of --write-sector is VERY DANGEROUS.
You are trying to deliberately overwrite a low-level sector on the media.
This is a BAD idea, and can easily result in total data loss.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.
Program aborted.

In short: "This can easily destroy your data. Are you really sure? If so, run it again with --yes-i-know-what-i-am-doing."
For reference, the hdparm version I used is v9.37.
I did as told and ran it with the flag.

$ sudo hdparm --yes-i-know-what-i-am-doing --write-sector 1348821247 /dev/sdd

/dev/sdd:
re-writing sector 1348821247: succeeded

The write appears to have succeeded.
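
Because `--write-sector` overwrites the sector, a mistyped LBA is costly. One cautious approach is to print the commands for review before running anything. The sketch below only echoes them; `print_rewrite_commands` is a hypothetical helper, and pairing the write with hdparm's `--read-sector` (to confirm that the sector really fails to read) is my own suggestion, not something done in this post.

```shell
# Hypothetical helper: validate the LBA and print (do NOT run) the hdparm
# commands for one sector so they can be reviewed first.
# Usage: print_rewrite_commands LBA DEVICE
print_rewrite_commands() {
    lba=$1
    dev=$2
    case $lba in
        ''|*[!0-9]*)
            echo "LBA must be a non-negative integer: $lba" >&2
            return 1
            ;;
    esac
    # A pending sector would normally be expected to fail this read.
    echo "hdparm --read-sector $lba $dev"
    # This overwrites the sector; whatever data was there is lost.
    echo "hdparm --yes-i-know-what-i-am-doing --write-sector $lba $dev"
}

# Example:
#   print_rewrite_commands 1348821247 /dev/sdd
```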

Checking whether the repair worked

Run a self-test again and, once it has finished, check the result.

$ sudo smartctl -t long /dev/sdd
$ sudo smartctl -A -l selftest /dev/sdd
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-35-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   113   109   021    Pre-fail  Always       -       7325
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       523
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20606
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       521
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       284
193 Load_Cycle_Count        0x0032   051   051   000    Old_age   Always       -       448544
194 Temperature_Celsius     0x0022   114   094   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     20606         -
# 2  Short offline       Completed without error       00%     20602         -
# 3  Short offline       Completed: read failure       90%     20600         1348821247

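In my case there were two Current Pending Sectors, so I had to repeat the same procedure for the second one; I'll skip the details here.

As shown below, Current_Pending_Sector then dropped to 0.
I had expected Reallocated_Event_Count to increase, but it did not this time. The sectors seem to have been accepted as healthy again after the rewrite, without being reallocated.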

$ sudo smartctl -A /dev/sdd
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-35-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   112   109   021    Pre-fail  Always       -       7366
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       524
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20645
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       522
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       284
193 Load_Cycle_Count        0x0032   051   051   000    Old_age   Always       -       448908
194 Temperature_Celsius     0x0022   115   094   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
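
Checking these counters by eye works, but the same check could be scripted. A minimal sketch, assuming the attribute table layout shown above with the raw value in the last column (attributes whose raw value carries extra text would need different handling); `smart_raw_value` is a hypothetical helper name.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# RAW_VALUE (last column) of the attribute whose name matches the argument.
# Assumes the table layout shown above.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $NF; exit }'
}

# Example (needs root and a real drive, so it is left commented out):
#   pending=$(sudo smartctl -A /dev/sdd | smart_raw_value Current_Pending_Sector)
#   [ "$pending" -eq 0 ] && echo "no pending sectors"
```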

Reassembling the RAID

With the disk in this state, I tried activating the RAID again.

$ sudo dmraid -ay

It activated without any problems.

Checking for differences against the backup

I ran a diff against the backup.

$ diff -r /mnt/Data /mnt/nasne/backup/Data

The result showed no problems.
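
For a scripted check, the exit status of `diff` is more reliable than the absence of output: 0 means no differences, 1 means differences were found, and anything higher means diff itself ran into trouble (for example an unreadable file). A small sketch of that distinction; `trees_match` is a hypothetical helper name.

```shell
# Hypothetical helper: compare two directory trees with `diff -r` and
# report the outcome based on diff's exit status.
trees_match() {
    diff -r "$1" "$2" >/dev/null 2>&1
    case $? in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error" ;;
    esac
}

# Example:
#   trees_match /mnt/Data /mnt/nasne/backup/Data
```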

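Well, everything was working fine to begin with, so I think it is okay, but I am still a little uneasy about things like whether the former pending sectors were really repaired properly.

This time I realized that my understanding of dmraid (device-mapper) is still far from sufficient.
I think I'll start studying it a bit.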

穀風