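Can Data on a Hard Drive Degrade Without a Corruption Warning?

We all worry about keeping our data and files safe and intact, but is it possible for data to become corrupted and then be accessed by a user with no notification or warning that anything is wrong? Today's SuperUser Q&A post has the answer to a worried reader's question.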

Today's Question & Answer session comes to us courtesy of SuperUser, a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.

The Question

SuperUser reader topo morto wants to know if data on a hard drive can degrade and then be accessed without any corruption warning:

Is it possible that physical degradation of a hard drive could cause bits to “flip” in a file’s contents without the operating system noticing the change and notifying the user about it when reading the file? For example, could a “p” (binary 01110000) in an ASCII text file change to a “q” (binary 01110001), then when a user opens the file, they see “q” without being aware that a failure has occurred?

I am interested in answers relating to FAT, NTFS, or ReFS (if it makes a difference). I want to know if operating systems protect users from this, or if we should be checking our data for variances between copies over time.

Can data on a hard drive degrade and be accessed without any corruption warning?

The Answer

SuperUser contributor Guntram Blohm has the answer for us:

Yes, there is a thing called bit rot. But no, it will not affect a user unnoticed.

When a hard drive writes a sector to the platters, it does not just write the bits in the same way that they are stored in RAM; it uses an encoding to make sure there are no sequences of the same bit that are too long. It also adds ECC codes that allow it to repair errors that affect a few bits and detect errors that affect more than a few bits.
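The "repair a few bits, detect more" behavior can be illustrated with a toy code. The sketch below uses an extended Hamming (8,4) code, which corrects any single flipped bit and detects any two flipped bits in a codeword. It is only an illustration of the principle: real drives use far stronger codes over whole sectors, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions holding a 1
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # One bit flipped. The syndrome is its position (0 = the parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return None, "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_flip = list(word)
one_flip[6] ^= 1                           # simulate a single flipped bit
two_flips = list(one_flip)
two_flips[2] ^= 1                          # simulate a second flipped bit
```

Here `decode(one_flip)` recovers the original value and reports a correction, while `decode(two_flips)` reports that the damage was detected but cannot be repaired. In neither case is wrong data handed back silently.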

When the hard drive reads the sector, it checks these ECC codes and repairs the data if necessary (and if possible). What happens next depends on the circumstances and the firmware of the hard drive, which is influenced by the designation of the drive.

If a sector can be read and has no ECC code problems, then it is passed on to the operating system.
If a sector can be repaired easily, the repaired version may be written to disk, read back, then verified to determine if the error was a random one (i.e. cosmic rays, etc.) or if there is a systematic error with the media.
If the hard drive determines that there is an error with the media, it reallocates the sector.
If a sector can be neither read nor corrected after a few read attempts (on a hard drive that is designated as a RAID hard drive), then the hard drive will give up, reallocate the sector, and tell the controller that there was a problem. It relies on the RAID controller to reconstruct the sector from the other RAID members and write it back to the failed hard drive, which then stores it in the reallocated sector (that hopefully does not have a problem).
If a sector cannot be read or corrected on a desktop’s hard drive, then the hard drive will engage in more attempts to read it. Depending on the quality of the hard drive, this might involve repositioning the head, checking to see if there are any bits that flip when read repeatedly, checking which bits are the weakest, and a few other things. If any of these attempts succeed, the hard drive will reallocate the sector and write back the repaired data.
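The cases above can be summarized as a small decision function. This is a rough sketch of the described behavior, not actual firmware logic: the `Outcome` names, the `handle_sector_read` function, and its boolean parameters are all hypothetical, and real firmware differs between vendors and models.

```python
from enum import Enum


class Outcome(Enum):
    PASSED_THROUGH = "clean data handed to the operating system"
    REPAIRED_IN_PLACE = "repaired data rewritten to the same sector"
    REPAIRED_AND_REALLOCATED = "repaired data written to a spare sector"
    REALLOCATED_ERROR_REPORTED = "sector reallocated, controller must rebuild it"
    ERROR_REPORTED = "read error reported to the controller"


def handle_sector_read(ecc_clean, easily_correctable, rewrite_verifies=True,
                       drive_class="desktop", deep_retry_succeeds=False):
    """Model the outcome of one sector read (all inputs are simplifications)."""
    if ecc_clean:
        return Outcome.PASSED_THROUGH
    if easily_correctable:
        # Rewrite, read back, verify: a clean result suggests a random error,
        # a failed one suggests a problem with the media itself.
        if rewrite_verifies:
            return Outcome.REPAIRED_IN_PLACE
        return Outcome.REPAIRED_AND_REALLOCATED
    if drive_class == "raid":
        # Give up quickly and let the RAID controller reconstruct the sector.
        return Outcome.REALLOCATED_ERROR_REPORTED
    # Desktop drive: keep retrying with more elaborate recovery attempts.
    if deep_retry_succeeds:
        return Outcome.REPAIRED_AND_REALLOCATED
    return Outcome.ERROR_REPORTED
```

Every branch ends either with verified data or with an error passed up the chain. None of them returns unverified data as if it were good.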

This is one of the main differences between hard drives that are sold as “desktop”, “NAS/RAID”, or “video surveillance” hard drives. A RAID hard drive can just give up quickly and make the controller repair the sector to avoid latency on the user’s side. A desktop hard drive will continue trying again and again because having the user wait a few seconds is probably better than telling them the data is lost. And a video hard drive values constant data rates more than error recovery as a damaged frame will typically not even be noticed.

At any rate, the hard drive will know if there has been bit rot, will typically recover from it, and if it cannot, it will tell the controller which will in turn tell the driver which will then tell the operating system. Then, it is up to the operating system to present the error to the user and act on it. This is why cybernard says:

I have never witnessed a single bit error myself, but I have seen plenty of hard drives where entire sectors have failed.

The hard drive will know if there is something wrong with a sector, but it will not know which bits have failed. A single bit that has failed will always be caught by ECC.

Please note that chkdsk and file systems that automatically repair themselves do not address repairing data within files. These are targeted at corruption within the structure of the file system itself, like a difference in a file’s size between the directory entry and the number of allocated blocks. The self-healing feature of NTFS will detect structural damage and prevent it from affecting your data further, but it will not repair any data that is already damaged.
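To see why a structural check says nothing about file contents, consider this deliberately simplified sketch. The function `allocation_is_consistent` and the 4096-byte block size are assumptions made for illustration; this is not how chkdsk or NTFS is implemented.

```python
import math


def allocation_is_consistent(recorded_size, allocated_blocks, block_size=4096):
    """Structural check: does the block count match the recorded file size?"""
    return math.ceil(recorded_size / block_size) == allocated_blocks


original = b"p" * 5000
damaged = b"q" + original[1:]  # same length, first byte changed

# Both versions need two 4096-byte blocks, so the structure still looks fine.
structure_ok = allocation_is_consistent(len(damaged), allocated_blocks=2)
content_ok = damaged == original
```

The structural check passes while the content comparison fails, which is the gap the paragraph above describes.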

There are, of course, other reasons why data may become damaged. For example, bad RAM on a controller may alter data before it is even sent to the hard drive. In that case, no mechanism on the hard drive will detect or repair the data, and this may be one reason why the structure of a file system is damaged. Other reasons include software bugs, blackouts while writing to the hard drive (although this is addressed by file system journaling), or bad file system drivers (the NTFS driver on Linux defaulted to read-only for a long time since NTFS was reverse engineered, not documented, and the developers did not trust their own code).

I had this scenario once where an application would save all of its files to two different servers in two different data centers in order to keep a working copy of the data available under all circumstances. After a few months, we noticed that about 0.1 percent of all the copied files did not match the MD5 checksum that the application stored in its database. It turned out to be a faulty fiber cable between the server and the SAN.
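A check like the one in this anecdote can be sketched in a few lines: record a checksum when a file is written, then recompute it later and compare. The helper names `md5_of_file` and `find_mismatches`, and the idea of passing the recorded checksums as a dictionary, are assumptions for this example and not the application's actual design.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(recorded, root):
    """List files under `root` whose current MD5 differs from the recorded one.

    `recorded` maps a relative file name to the hex digest stored at write
    time. A missing file raises FileNotFoundError rather than being skipped.
    """
    root = Path(root)
    return sorted(
        name for name, expected in recorded.items()
        if md5_of_file(root / name) != expected
    )
```

MD5 is used here only because the anecdote mentions it. For catching accidental corruption any reasonable hash works, and a stronger one such as SHA-256 (`hashlib.sha256`) is the safer choice if tampering is also a concern.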

These other reasons are why some file systems, like ZFS, keep additional checksum information in order to detect errors. They are designed to protect you from a lot more things that can go wrong than just bit rot.
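The general idea of end-to-end checksums can be sketched with a toy in-memory store: a checksum is kept apart from the data, every read is verified against it, and a bad copy is repaired from a good mirror. This is not ZFS's on-disk format or algorithm; the `MirroredStore` class and its two-copy layout are invented for illustration.

```python
import hashlib


class MirroredStore:
    """Toy block store with two copies of each block and a separate checksum."""

    def __init__(self):
        self.copies = [{}, {}]  # two simulated devices: block id -> bytes
        self.checksums = {}     # kept apart from the data it protects

    def write(self, block_id, data):
        self.checksums[block_id] = hashlib.sha256(data).digest()
        for device in self.copies:
            device[block_id] = data

    def read(self, block_id):
        expected = self.checksums[block_id]
        for device in self.copies:
            data = device[block_id]
            if hashlib.sha256(data).digest() == expected:
                # Repair any copy that disagrees with the verified data.
                for other in self.copies:
                    if other[block_id] != data:
                        other[block_id] = data
                return data
        raise OSError(f"block {block_id!r}: every copy fails its checksum")
```

Because the checksum is computed before the data travels through controllers, cables, and drives, it also catches damage introduced on the way to the disk, which the drive's own ECC cannot see.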

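Have something to add to the explanation? Sound off in the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.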