“I think it’s only because we came all this way that we could see a star this beautiful. So... even farther...”

Debugging a Bit-Flip Error

One day last week, I was writing code for my research project as usual.

I made some innocent code changes, and then the build failed with a weird linker error: reloc has bad offset 8585654 inside some debug info section.

I had never hit such an error before, and Google gave no useful results either. The error clearly looked like a toolchain bug, so I didn’t think too much about it: I cleared ccache and ran the build again. This time, the build finished without a problem.
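
For the record, wiping ccache is a one-liner. A minimal sketch, assuming the stock ccache CLI:

# Drop every cached compilation result so the next build starts from scratch
ccache --clear
# Optionally reset the hit/miss statistics too
ccache --zero-stats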

Then, a few days later, my code failed in a weird way while running a benchmark. All the tests were passing (including that benchmark on a smaller dataset), but on the real benchmark dataset the code failed.

I rolled back a few git commits, but the failure persisted. Finally, I rolled back to a known “good” commit where I had run the benchmark without problems in the past. The benchmark still failed.

Now it was clear that the failure was not caused by my code changes, but by some environment or configuration issue. On closer inspection, it turned out that the benchmark reads from a 50MB data file FASTA_5000000. The file is not tracked by git due to its size, and the benchmark script generates it if it does not exist.

I happened to have a copy of this file in another folder, so I ran diff on the two files.

To my surprise, diff found a difference! Exactly one byte of the two ~50MB files differed. The byte went from 0x41 (the letter A) to 0xc1, i.e., its highest bit (bit 7) was flipped.
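
For anyone curious how to locate the exact differing byte, cmp is handier than diff for this. A minimal sketch, with placeholder names for the two copies:

# Print every differing byte as: 1-based offset, old value, new value (both in octal)
cmp -l FASTA_5000000.good FASTA_5000000.bad
# A hypothetical output line for a flip like this one:
#   12345 101 301      (0o101 = 0x41 = 'A', 0o301 = 0xc1)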

Of course, the file had never been modified since its creation many months ago. And given that exactly one bit had changed, my first thought was the urban legend that a cosmic ray had struck my electronics and caused the bit flip.

Checksum Revealed More Errors

I’m not a fan of urban legends, but having witnessed such a weird bug myself, I decided to take preventive measures: I migrated my disk to Btrfs, a file system that (among other things) maintains built-in checksums for all data and metadata. This way, any corruption should be detected automatically.

Then I discovered that the situation wasn’t that simple.

I used btrfs-convert to convert my filesystem from ext4 to btrfs. By default, it creates a snapshot file named image that allows one to roll back to ext4. After the conversion finished, as a sanity check, I ran btrfs scrub to scan for disk errors. To my surprise, btrfs scrub immediately reported 288 corruptions in the image file!
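
For anyone retracing these steps, the sequence looks roughly like this. It is only a sketch with placeholder device and mount point; btrfs-convert must be run on an unmounted filesystem, so read its documentation before trying this on data you care about:

# Convert the ext4 partition in place; the original ext4 metadata is preserved
# (typically under ext2_saved/image) so the conversion can be rolled back
btrfs-convert /dev/nvme0n1p2

# Mount the freshly converted filesystem and verify every data/metadata checksum
mount /dev/nvme0n1p2 /mnt
btrfs scrub start -B /mnt    # -B: stay in the foreground and print a summary
btrfs scrub status /mnt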

It was clear that something was seriously wrong. Maybe I got a defective SSD that was already at the end of its life after only one year of use? So I checked my SSD’s smartctl log, but saw no errors.
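
For reference, the check itself is a single command (assuming smartmontools; the device name is a placeholder):

# Full health report for an NVMe drive, including media errors and the error log
smartctl -a /dev/nvme0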

Of course, it could be that Western Digital didn’t faithfully implement their SSD error logging. But then I suddenly realized another possibility.

All the corruptions I had observed were disk-related (the ccache, the benchmark data, the image file), but the faulty hardware doesn’t have to be the disk. The corrupted benchmark data is especially misleading: I observed a flipped bit with my own eyes. But even that could be explained by RAM corruption: Linux has a page cache that caches disk contents in memory. If a cached page is corrupted, then even though the file on disk is intact, every read will see the corrupted contents until that page is evicted.
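
One way to tell the two cases apart is to drop the clean page cache and re-read the file straight from disk: if the corruption disappears, the on-disk copy was fine all along. A sketch, assuming root and that nothing is writing to the file:

# Checksum the file through the (possibly corrupted) page cache
sha256sum FASTA_5000000

# Ask the kernel to drop clean page cache pages (and dentries/inodes)
sync
echo 3 > /proc/sys/vm/drop_caches

# Re-read from the physical disk; a different result means the cached copy was bad
sha256sum FASTA_5000000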

So before concluding that WD is not trustworthy and buying a new SSD from another manufacturer, I needed to make sure that the problem really was my SSD.

So I looked into the kernel dmesg log just to see if there was anything interesting. Luckily, I soon noticed another unusual error, which complained that a page table entry of some process was broken. This is a corruption that cannot result from a user-level software bug (the page table is managed by the kernel), and it has nothing to do with the disk (unless the page table had been swapped out, which is unlikely). The RAM was quickly becoming the top suspect.

Hunting down the Faulty RAM Chip

I ran memtester to test my RAM, but no errors were found. And given that I could still use my laptop as usual, even if there was a hardware glitch, it could likely only be triggered rarely, under complex conditions.
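
For completeness, the invocation is simple; memtester locks a chunk of RAM and runs its test patterns over it repeatedly (the size and pass count below are placeholders, picked to fit whatever memory the machine can spare):

# Test 4 GiB of RAM for 3 passes; run as root so the memory can be locked
memtester 4G 3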

So to pinpoint the faulty component, I needed a way to reliably reproduce the bug. btrfs-convert could reproduce it, but I couldn’t keep running that command over and over (not at the risk of my file system).

Fortunately, I soon found such a reproducer. I was trying a tool called duperemove, as recommended by the Btrfs guide. The tool needs to scan the whole filesystem, and as it was running, I soon noticed file corruption errors popping up in dmesg:

BTRFS warning (device nvme0n1p2): csum failed root 5 ino 73795687 off 19005440 csum 0xbd4923a63ec89b42 expected csum 0x6ee2d71bfaa19559 mirror 1
BTRFS error (device nvme0n1p2): bdev /dev/nvme0n1p2 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0

It turned out that if I ran duperemove in an infinite loop, those errors showed up sporadically, anywhere from once every few minutes to once every couple of hours.
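
The reproducer itself is nothing fancy: keep duperemove scanning in a loop while following the kernel log. A sketch with a placeholder path (without -d, duperemove only reads and hashes files, which is enough to trigger the errors):

# Terminal 1: repeatedly scan the filesystem to generate heavy read traffic
while true; do
    duperemove -r /home > /dev/null
done

# Terminal 2: watch for Btrfs checksum complaints as they appear
dmesg -w | grep -i 'csum failed'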

Furthermore, the files on disk were actually not corrupted. The ino in the error messages can be mapped back to a file path using btrfs inspect-internal inode-resolve, and re-reading those files succeeded without issue (note that Btrfs cannot repair corruption since my disk is not RAIDed, so if a file were actually corrupted, every read would fail). This is strong evidence that my SSD was innocent: the faulty hardware was the RAM.
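
Mapping the ino from the error message back to a path and re-reading the file looks roughly like this (the inode number is the one from the log above; the mount point is a placeholder):

# Resolve the inode number from the dmesg error to a file path
btrfs inspect-internal inode-resolve 73795687 /home

# Re-read the whole file; if the on-disk data were really corrupted,
# Btrfs would reject every read with a checksum error
cat "$(btrfs inspect-internal inode-resolve 73795687 /home)" > /dev/null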

Fortunately, I have a spare old laptop, so I replaced the RAM with the one from the old laptop. After that, duperemove no longer produced any dmesg errors.

To be extra sure, since my new laptop has two RAM chips, I also tried installing only one of them at a time. As expected, duperemove triggered errors with one of the two chips installed, but not with the other.

This clearly demonstrates that the faulty component is one of my two RAM chips:

The faulty RAM chip :)

But I’m replacing both chips anyway, as I don’t trust Corsair anymore. To be clear, the chip served five years, so it’s hard to say their quality is low; but as you can imagine, after all of this, I just don’t want to take any more risks.

Takeaways

We all know faulty hardware is real: after all, that’s why server RAM has ECC and disks are RAIDed. Even CPUs can malfunction, as shown by Google’s paper.

But it feels completely different to experience faulty hardware first-hand. For example, I used to believe that if my RAM went bad, my computer wouldn’t even boot, so why worry about ECC? I have totally changed my mind now.

To summarize:

  1. Hardware is not always reliable.
  2. Hardware can fail in subtle ways: my faulty RAM worked just fine for normal daily use, but under complex workloads (compilation, benchmarking, file system conversion) it silently failed and caused data corruption.
  3. Error detection and recovery techniques like checksums, RAID and ECC are not just for the paranoid: they are there for good reasons.