One day last week, I was writing code for my research project as usual.
I made some innocent code changes, and the build failed with a weird linker error, `reloc has bad offset 8585654`, inside some debug info section.
I had never hit such an error before, and Google gave no useful results either. The error clearly looked like a toolchain bug, so I didn’t think too much about it, cleared `ccache`, and ran the build again. This time, the build finished without a problem.
Then, a few days later, my code failed in a weird way while running a benchmark. All the tests were passing (including that same benchmark on a smaller dataset), but on the real benchmark dataset the code failed.
I rolled back a few git commits, but the failure persisted. Finally, I rolled back to a known “good” commit on which I had run the benchmark without problems in the past. Even there, the benchmark failed.
Now it was clear that the failure was caused by some configuration issue rather than by my code changes. On closer inspection, it turned out that the benchmark reads from a 50MB data file, `FASTA_5000000`. The file is not tracked by git due to its size, and the benchmark script generates it if it does not exist. I happened to have a copy of this file in another folder, so I ran `diff` on the two files.
To my surprise, `diff` found a difference! Specifically, exactly one byte of the two ~50MB files was different: the byte had gone from 0x41 (the letter `A`) to 0xc1, i.e., its highest bit was flipped.
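For the record, `cmp` is an easy way to pinpoint exactly where two files diverge (the second path here is just illustrative); it prints the offset of each differing byte along with the two values in octal, so this particular change shows up as 101 vs. 301:

```bash
# List every differing byte as: <offset> <octal byte in file 1> <octal byte in file 2>
cmp -l FASTA_5000000 /path/to/other-copy/FASTA_5000000
```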
Of course, the file had never been modified since its creation many months ago. And given that exactly one bit had changed, my first thought was the urban legend that a cosmic ray had struck my electronics and caused the bit flip.
Checksums Revealed More Errors
I’m not a fan of urban legends, but having witnessed such a weird bug myself, I decided to take preventive measures and migrate my disk to Btrfs, a file system that (among other things) maintains built-in checksums for all data and metadata. This way, any corruption should be detected automatically.
Then I discovered that the situation wasn’t so simple.
I used `btrfs-convert` to convert my filesystem from ext4 to btrfs. By default, it creates a snapshot file named `image` that allows one to roll back to ext4. After the conversion finished, as a sanity check, I ran `btrfs scrub` to scan for disk errors. To my surprise, the scrub immediately reported 288 corruptions in the `image` file!
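For reference, the conversion and the follow-up check look roughly like this (the mount point is a placeholder, and the partition must be unmounted for the conversion):

```bash
# Convert the ext4 partition to btrfs in place; the old filesystem is
# kept as a snapshot file ("image") so the conversion can be rolled back.
sudo btrfs-convert /dev/nvme0n1p2

# Mount the new filesystem and scrub it: read back all data and metadata
# and verify their checksums. -B runs in the foreground and prints a
# summary, including the number of checksum errors found.
sudo mount /dev/nvme0n1p2 /mnt
sudo btrfs scrub start -B /mnt
```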
It was clear that something was seriously wrong. Maybe I had gotten a defective SSD that was already at the end of its life after only a year of use? So I checked my SSD’s SMART log with `smartctl`, but saw no errors.
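Checking that log is a one-liner (the device name below is a placeholder):

```bash
# Print all SMART / NVMe health information, including the device error
# log, media error counters, and percentage of rated lifetime used.
sudo smartctl -a /dev/nvme0
```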
Of course, it could be that Western Digital didn’t faithfully implement their SSD error logging. But then I realized another possibility.
All the corruptions I had observed were disk-related (the `ccache` error, the benchmark data, the `image` file), but the faulty hardware didn’t have to be the disk. The corrupted benchmark data was especially misleading: I had observed a flipped bit with my own eyes. But even that could be explained by RAM corruption: Linux has a page cache that caches disk contents in memory. If the page cache is corrupted, then even though the file on disk is intact, all reads will see the corrupted contents until the corrupted page is evicted.
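As an aside, one way to tell these two cases apart is to drop the page cache and force the next read to come from the physical disk; here is a sketch of the idea (the file path is illustrative):

```bash
# Flush dirty pages, then drop the page cache, so that the next read of
# the file must come from the disk rather than from a cached copy in RAM.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

# If the file now hashes to the expected value, the on-disk data is fine
# and the corruption only ever existed in memory.
sha256sum FASTA_5000000
```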
So before concluding that WD is not trustworthy and buying a new SSD from another manufacturer, I needed to make sure that the problem was indeed my SSD.
So I looked into the kernel `dmesg` log just to see if there was anything interesting. Luckily, I soon noticed another unusual error, complaining that a page table entry was broken for some process. This is a corruption that cannot result from a user-level software bug (the page table is managed by the kernel), and it has nothing to do with the disk (unless the page table was swapped out, which is unlikely). The RAM was quickly becoming the top suspect.
Hunting Down the Faulty RAM Chip
I ran `memtester` to test my RAM, but it found no errors. And given that I could use my laptop as usual, even if there were a hardware glitch, it could likely be triggered only rarely, under complex conditions.
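For completeness, the test itself is just something along these lines (the size and iteration count are arbitrary choices):

```bash
# Lock 8 GiB of RAM and run memtester's pattern tests over it 4 times.
# A clean run is not a guarantee: it only exercises the memory the kernel
# lets it allocate, under a fairly simple access pattern.
sudo memtester 8G 4
```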
So to pinpoint the faulty component, I needed a way to reliably reproduce the bug. `btrfs-convert` could do it, but I couldn’t keep running that command repeatedly (and keep putting my file system at risk).
Fortunately, I soon found such a reproducer. I was trying out a tool called `duperemove`, as recommended by the Btrfs guide; the tool needs to scan the whole filesystem. As it was running, I soon noticed file corruption errors popping up in `dmesg`:
```
BTRFS warning (device nvme0n1p2): csum failed root 5 ino 73795687 off 19005440 csum 0xbd4923a63ec89b42 expected csum 0x6ee2d71bfaa19559 mirror 1
```
It turned out that if I ran `duperemove` in a loop, those errors showed up sporadically, anywhere from once every few minutes to once every couple of hours.
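The loop is nothing more than this kind of shell one-liner, with the kernel log being watched from another terminal (the mount point is a placeholder, and the exact `duperemove` flags may differ from what I used):

```bash
# Repeatedly scan the filesystem while watching for new checksum failures
# in another terminal with:  sudo dmesg -w | grep 'csum failed'
while true; do
    sudo duperemove -r /mnt > /dev/null
done
```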
Furthermore, the files on disk were actually not corrupted. The `ino` in the error messages can be traced back to a file path using `btrfs inspect-internal inode-resolve`, and re-reading those files worked without any issue (note that Btrfs cannot repair corruption since my disk is not RAIDed, so if a file were actually corrupted, every read would fail). This was strong evidence that my SSD was innocent: the faulty hardware was the RAM.
Fortunately, I have a spare old laptop, so I replaced the RAM with the one from the old laptop. After that, `duperemove` no longer produced any `dmesg` errors.
To be extra sure, since my new laptop has two RAM chips, I also tried installing only one of them at a time. As expected, `duperemove` triggered errors when one of the two RAM chips was installed, but not when the other one was.
This clearly demonstrated that the faulty component was one of my two RAM chips.
But I’m replacing both chips anyway, as I don’t trust Corsair anymore. To be clear, the chip has served five years, so it’s hard to say their quality is low; but as you can imagine, after all of this, I just don’t want to take any more risks.
Takeaways
We all know faulty hardware is real: after all, that’s why server RAM has ECC and disks get RAIDed. Even CPUs can malfunction, as shown by Google’s paper.
But it feels completely different to experience faulty hardware first-hand. For example, I used to believe that if my RAM became faulty, my computer wouldn’t even boot, so why worry about ECC? I have totally changed my mind now.
To summarize:
- Hardware is not always reliable.
- Hardware can fail in subtle ways: my faulty RAM worked just fine for normal daily use, but under complex workloads (compilation, benchmarking, file system conversion) it silently failed and caused data corruption.
- Error detection and recovery techniques like checksums, RAID, and ECC are not just for the paranoid: they are there for good reasons.