On btrfs and memory corruption
As you may have heard, I have a home server, which hosts mirror.quantum5.ca and doubles as my home NAS. To ensure my data is protected, I am running btrfs, which has data checksums to ensure that bit rot can be detected, and I am using the raid1 mode to enable btrfs to recover from such events and restore the correct data. In this mode, btrfs ensures that there are two copies of every file, each on a distinct drive. In theory, if one of the copies is damaged due to bit rot, or even if an entire drive is lost, all of your data can still be recovered.
For years, this setup has worked perfectly. I originally created this btrfs array on my old Atomic Pi, with drives inside a USB HDD dock, and the same array is still running on my current Ryzen home server—five years later—even after a bunch of drive changes and capacity upgrades.
In the past week, however, my NAS has experienced some terrible data corruption issues. Most annoyingly, it damaged a backup that I needed to restore, forcing me to perform some horrific sorcery to recover most of the data. After a whole day of troubleshooting, I was eventually able to track the problem down to a bad stick of RAM. Removing it enabled my NAS to function again, albeit with less available RAM than before.
I will now explain my setup and detail the entire event for posterity, including my thoughts on how btrfs fared against such memory corruption, how I managed to mostly recover the broken backup, and what might be done to prevent this in the future.
Background: Why btrfs?
A lot of you might be wondering—why btrfs? Why not ZFS, which is the more popular filesystem with similar data protection?
There are several reasons for this:
- The biggest reason is that btrfs is completely flexible on what sort of devices you can include in the array. This was my biggest consideration as a poor student who couldn’t easily afford all the drives I’d end up needing up front and had to slowly upgrade by adding drives.
  If, say, I bought two 4 TB drives, then in btrfs raid1 mode, I’d have 4 TB of usable storage. If I wanted more, I could add an 8 TB drive, and btrfs would ensure each file is mirrored on one of the 4 TB drives, creating 8 TB of usable storage (see the sketch after this list). This was a very attractive feature when money was tight. With ZFS, this kind of flexibility doesn’t exist. While there are upgrade options, they are highly restrictive, and I will not go into the details here, since that would involve explaining the entire ZFS structure, which can easily take up an entire blog post on its own.
- btrfs is in-tree, while ZFS isn’t and never will be due to licensing issues. As a Linux user, this means that I can upgrade the kernel at any time I wish without worrying about whether ZFS-on-Linux supports the new kernel or not, and historically, the kernel developers have not cared about breaking ZFS. It also means that any modern Linux install can read my data, and I don’t need to specifically prepare a rescue disk that can read ZFS in case things go wrong, which would really limit my options in an emergency.
- btrfs is fairly lightweight on memory usage, and I started this entire journey on an Atomic Pi with 2 GiB of RAM. While it is possible to configure ZFS to run on a system with so little RAM, it isn’t so practical and has been known to trigger the OOM killer, and I didn’t want to fiddle around with all the ZFS knobs to get it to work even though it’s theoretically possible.
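To make that flexibility concrete, here’s a minimal sketch of how such an array is created and later grown. The device names and mount point are illustrative, not my actual setup:

# Create a two-drive array with both data and metadata mirrored
$ sudo mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc
# Later, add a third (larger) drive and rebalance so existing chunks
# get redistributed across the new set of drives
$ sudo btrfs device add /dev/sdd /mnt/data
$ sudo btrfs balance start /mnt/data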
To be fair to ZFS, there are several things that I wish I could use:
- Fully functional data striping with RAIDZ, allowing more efficient usage with a larger number of drives. While btrfs has RAID 5/6 support, it’s still not fully ready, and at this rate, may never be. This means I am effectively locked to raid1, which results in only 50% of the total capacity being usable. This isn’t too big of a deal though, especially since I still only have two drives. Back then, it was also impossible to expand a RAIDZ array, so adding a drive would require the entire array to be rebuilt…
- zvols, which are block devices similar to LVM, but have all the niceties of ZFS checksums. These could be used for virtual machines, for example, and guarantee that the data is protected. btrfs unfortunately doesn’t have any similar feature.
Ultimately though, these are not sufficiently compelling for me to abandon a completely working btrfs setup, especially since I’d effectively need to buy a bunch of new drives and build a completely new array, which is a lot of work and will inevitably incur significant downtime. If I ever decide to build a giant, rack-mounted dedicated storage server with 20+ 20 TB drives, I’d probably switch to ZFS, but that’s not happening anytime soon.
Other details of the setup
Here’s the specification for relevant hardware in my home server immediately before the event:
- CPU: AMD Ryzen 9 3900X
- Motherboard: ASUS Prime X570-P
- RAM: 4×ADATA XPG Gammix D10 16 GiB 3200 MT/s
The CPU and motherboard are from my old desktop, while the RAM was the cheapest available during the COVID-19 supply chain disruptions. For a NAS, I’d normally have tried to get ECC memory, which can correct single-bit flips and detect and report bigger errors. It’s highly recommended for both ZFS and btrfs, but it wasn’t priced sanely back then, which is a shame, since the CPU has unofficial ECC support and the motherboard supports it as well.
The Corruption
So what exactly happened? Well, I had a Windows 10 gaming VM¹ with Looking Glass, and I wanted to upgrade to Windows 11 with the end-of-support date for Windows 10 coming up. To prevent any accidents, I took a backup of the VM by using zstd² on the entire disk image and stashing it on my NAS.
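The backup amounted to something along these lines; the paths and options here are a rough sketch rather than the literal command I ran:

$ sudo dd if=/dev/vg0/win10 bs=1M status=progress | zstd -T0 -o /mnt/nas/backups/win10.img.zst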
Well, somehow I managed to botch the Windows 11 update, so I decided to restore the backup I took, because it was easier than trying to fix the Windows 11 install.
So I ran the command to restore the disk³:
$ zstd -dc win10.img.zst | sudo dd of=/dev/vg0/win10 bs=1M
zstd: error 37 : Read error
What the @!#* just happened? Do I have to rebuild this VM from scratch now?
After I recovered from the shock, I did the obvious thing of looking at dmesg on the NAS to see what was going on:
Dec 21 01:54:17 kernel: BTRFS warning (device sda1): csum failed root 5 ino 1507910 off 3079737344 csum 0xee82929e expected csum 0x507475f6 mirror 2
Dec 21 01:54:17 kernel: BTRFS error (device sda1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 23, gen 0
Dec 21 01:54:17 kernel: BTRFS warning (device sda1): csum failed root 5 ino 1507910 off 3079737344 csum 0xee82929e expected csum 0x507475f6 mirror 1
Dec 21 01:54:17 kernel: BTRFS error (device sda1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 22, gen 0
This was not good. It meant btrfs found a checksum mismatch and couldn’t recover by reading from the mirror. Since both copies somehow had the same incorrect checksum, the data was most likely damaged in memory rather than on disk.
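Incidentally, the per-device error counters from those log lines can also be queried directly. A quick check against the HDD array’s mount point (/data, as shown further below) looks roughly like this:

$ sudo btrfs device stats /data
# a non-zero corruption_errs counter on any device is a red flag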
At first, I thought this was just a soft error and I just happened to get really unlucky. If only I had that ECC RAM… it might have corrected the error and I’d not have lost my data. But then, looking at dmesg, I saw something ominous—similar errors on my SSD btrfs array, which I use to store things like my photos with Immich.
Something seemed wrong, so the first thing I did was to run a btrfs scrub to determine the extent of the damage. It seemed like two files in my photo library were damaged. Well, at least my photos are backed up⁴, so I pulled the damaged files from the backup and replaced them. Running another btrfs scrub revealed that the SSD array was fully repaired. Whew.
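For reference, starting a scrub and checking on it are one-liners; here I’m using the HDD array’s /data mount point as an example:

$ sudo btrfs scrub start /data
$ sudo btrfs scrub status /data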
In retrospect, it should have been obvious what went wrong: multiple files got corrupted, and if these were soft errors, that was a lot of them in a very short span of time. It was far more likely that something was wrong with the hardware. The smart thing to do would have been to shut down the server and figure out what went wrong. But no, I thought maybe I just got really unlucky and multiple bits got flipped by a single cosmic ray, somehow corrupting three different files… So instead, I assumed it was something transient and proceeded to recover my backup. This wasn’t a great idea. If you ever find yourself in a similar situation, please make sure your hardware is working first.
Data recovery attempt
Since I found out about the problems while overwriting the VM disk, the Windows 11 install, on top of its existing problems, was also partially overwritten and fully corrupted. At that point, it was impossible to recover anything from it—and I tried with ntfsrecover from ntfs-3g—which meant I was fully committed to restoring the backup.
While something was obviously wrong with the backup, I could probably still get most of the data on the VM back if I could decompress past the errors. This meant I had to get the file to zstd without hitting the EIO that caused it to error out. So what if I just salvaged all the parts of the file that I could and left the broken sections as zeroes?
Since the drive failure incident, I’ve learned how to use ddrescue, which makes multiple attempts to read all the readable blocks in a file, leaving the totally unreadable blocks as zeroes. So I simply ran ddrescue to rescue the file⁵:
$ ddrescue /path/to/broken/btrfs/win10.img.zst /path/to/other/disk/win10.img.zst
After ages and a bunch more errors in dmesg (not a good sign…), ddrescue finished. So of course, I tried again:
$ ssh nas zstd -dc /path/to/other/disk/win10.img.zst | sudo dd of=/dev/vg0/win10 bs=1M
/path/to/other/disk/win10.img.zst : Decoding error (36) : Data corruption detected
This wasn’t good. It was probably because btrfs failed to return anything in the entire 4 kiB block that had the corruption, so it was all replaced with zeroes and zstd couldn’t continue decompressing. Now what?
I refused to believe that losing a single 4 kiB section of the backup meant that I couldn’t recover anything, so I persisted. I figured that while a bit or two in the 4 kiB block might be bad, having the block available to zstd would probably allow the whole file to be decompressed. So I just needed to convince btrfs to return the corrupted blocks to the best of its ability…
btrfs, of course, has exactly that ability. You simply need to mount with -o rescue=ignoredatacsums. However, you can only use this option if you mount the filesystem as read-only, which means I’d need to take down the entire server. Since the server was running a bunch of “important” stuff, and I still thought the corruption was a singular event, I figured there had to be another way to do this online.
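For completeness, the offline approach would have looked something like the following (the mount point is illustrative); the option only works together with a read-only mount:

$ sudo mount -o ro,rescue=ignoredatacsums /dev/sda1 /mnt/rescue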
I figured that if I could just find where on the block device btrfs stored the corrupted block, I could just use dd to read the block out and replace the zeroed block in the ddrescue output. As it turned out, the filefrag command does exactly that, returning the physical offset of each fragment of the file.
But wait… I was running btrfs in RAID 1 mode on two different drives. Presumably, btrfs doesn’t put the same data at the same offsets on each drive, so which drive was the physical offset relative to? I tried looking on both drives at a test file with known content, and that offset on neither drive contained the correct data. As it turned out, it was actually the btrfs “logical” block that filefrag returned, which wasn’t that helpful for this task.
After some Googling, I found this C program on GitHub that returned a mapping similar to filefrag, except it also returned the physical offset on each device in the btrfs array. So of course, I compiled the program, dd’ed the physical block for the test file, and voilà, I got the correct data.
So I just had to compute the offset for each corrupted block and recover the data, right? Unfortunately, there were 11 different corrupted blocks, and I didn’t want to manually calculate the offset for each one. So instead, I wrote a small Python library that can parse the output of that C program and compute the physical offset on each device:
import bisect
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class BtrfsExtent:
    file_offset: int
    file_size: int
    extent_offset: int
    extent_type: str
    logical_size: int
    logical_offset: int
    physical_size: int
    # Mapping from btrfs device ID to the physical byte offset on that device
    device_offset: dict[int, int]


@dataclass
class BtrfsMapPhysical:
    extents: list[BtrfsExtent]

    @classmethod
    def from_file(cls, file):
        extents = []
        for line in file:
            if line.startswith('FILE'):
                continue  # skip the header line
            fields = line.rstrip('\r\n').split('\t')
            if fields[0]:
                extents.append(BtrfsExtent(
                    int(fields[0]),  # file_offset
                    int(fields[1]),  # file_size
                    int(fields[2]),  # extent_offset
                    fields[3],       # extent_type
                    int(fields[4]),  # logical_size
                    int(fields[5]),  # logical_offset
                    int(fields[6]),  # physical_size
                    {int(fields[7]): int(fields[8])},
                ))
            else:
                # Continuation line: same extent, another device in the array
                extents[-1].device_offset[int(fields[7])] = int(fields[8])
        return cls(extents)

    def lookup_physical(self, offset):
        extent = self.extents[bisect.bisect(self.extents, offset, key=attrgetter('file_offset')) - 1]
        assert extent.file_offset <= offset < extent.file_offset + extent.file_size
        delta = offset - extent.file_offset
        return {devid: dev_offset + delta for devid, dev_offset in extent.device_offset.items()}
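Usage is straightforward. A hypothetical example, assuming the mapping program’s output was saved to win10.map.tsv beforehand (the file name is mine, purely for illustration):

with open('win10.map.tsv') as f:
    fsmap = BtrfsMapPhysical.from_file(f)

# File offset 3079737344 (from the dmesg csum error) maps to a physical
# byte offset on each device in the array, keyed by btrfs device ID.
print(fsmap.lookup_physical(3079737344))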
Since the device_offset attribute is a mapping from device ID to a physical offset, I needed to ask btrfs which device ID maps to which block device:
# btrfs filesystem show /data
Label: none uuid: bc943e9f-598b-4cb1-8276-e4158a0dc5b7
    Total devices 3 FS bytes used 10.27TiB
    devid 4 size 14.55TiB used 10.41TiB path /dev/sda1
    devid 5 size 14.55TiB used 10.41TiB path /dev/sdc1
This meant I could use the device ID 4 offset to read from /dev/sda1 and the device ID 5 offset to read from /dev/sdc1.
Then, I wrote a little script to generate dd commands to copy the raw blocks to fill in the blanks in the ddrescue output:
# fsmap is the BtrfsMapPhysical instance parsed from the mapping output
offsets = [
    3079737344,
    # ... more offsets pulled from dmesg
]

for i in offsets:
    # 4 kiB block number of this file offset on devid 4 (/dev/sda1)
    j = fsmap.lookup_physical(i)[4] // 4096
    print(f'sudo dd conv=notrunc if=/dev/sda1 bs=4k skip={j} count=1 seek={i // 4096} of=win10.img.zst')
The generated command used conv=notrunc to allow dd to edit a segment of the file without deleting anything else. I used bs=4k to work in 4 kiB blocks, which is what btrfs uses, skip to jump to the right offset on /dev/sda1, seek to jump to the corresponding offset in the output file, and count=1 to copy a single block. I also divided the byte offsets by 4096 to turn them into block offsets.
Then I ran the generated dd commands and tried decompressing again. This time, there were no errors, and the VM promptly booted into Windows 10.
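Had I wanted to check the patched archive before writing it over the logical volume again, zstd can also test integrity without writing the decompressed data anywhere:

$ zstd -t /path/to/other/disk/win10.img.zst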
Did I recover the data to successfully decompress the backup? Yes.
Did I recover the entire VM disk as it was originally? No.
Yet, the VM did boot, so I suppose it was a success in a way. Still, something probably got corrupted, but oh well. I didn’t really have anything particularly important on the VM, and I was able to successfully upgrade to Windows 11 later, which should have fixed any corrupted system files, and Steam was able to fix any corrupted game files.
Attempt to fix the filesystem
Since I still thought this was a singular event, I then attempted to fix the filesystem. So naturally, I ran btrfs scrub on the HDD array, which took over half a day to complete. It managed to correct some files, but several were irreversibly damaged; fortunately, they were either backups that could simply be deleted or files that were easily restored from backup.
Still, it was a bit strange that some of the files btrfs scrub reported as corrupted could then be read successfully on a subsequent attempt…
So anyways, after the scrubbing, I decided that I might as well keep the rescued Windows 10 image as a backup indefinitely, so I copied the file back into its original location, which succeeded without issue.
Just to be sure that nothing went wrong, though, I decided to run sha512sum on both files:
$ sha512sum /path/to/btrfs/win10.img.zst /path/to/other/disk/win10.img.zst
b9f2ca1c5d6a964cfae9dd59f20df3635e4f57530e8c5e33e6fd057d08eb3699e34d6568832b41584369dba503da2ee58079fccbd05a51b82cefefbf8b509777 /path/to/btrfs/win10.img.zst
120f3f6bffb3ac141e18f6253c8b7acc735de1701acd50574fdf6cb86e528f73633c07c7958f6cc72dfed82d40254cbb3ce5e6c0b90a271577ace42610352e88 /path/to/other/disk/win10.img.zst
What the @#*$? Okay, something is REALLY wrong.
Just to make sure it wasn’t a fluke, I copied it again.
$ sha512sum /path/to/btrfs/win10.img.zst /path/to/other/disk/win10.img.zst
cff0f83d39feedc2e0ebab29460ed9366d6c8a8a54c086b791e2ae50a532642b6291d6a03cca4c44e2af57cd555766fa6fc2da8417bd796a820ad67348a1d7de /path/to/btrfs/win10.img.zst
120f3f6bffb3ac141e18f6253c8b7acc735de1701acd50574fdf6cb86e528f73633c07c7958f6cc72dfed82d40254cbb3ce5e6c0b90a271577ace42610352e88 /path/to/other/disk/win10.img.zst
Oh no. OH NO! Something is really messed up.
This could only mean… data was actively being corrupted in memory. Suddenly, everything made sense:
- Why were files corrupted? Memory was corrupted.
- Why were both copies corrupted in raid1? Because the data was corrupted in memory and then written to both drives.
- Why did some files have unrecoverable errors during scrubs, but could later be read back perfectly fine? Because the memory was corrupted during the scrub, so the checksums didn’t match.
There was no safe way to keep the server running, so I shut it down and ran memtest86+.
Memory testing
To no particular surprise, this was what I got:
Memtest86+ v7.20 │ AMD Ryzen 9 3900X 12-Core Processor
CLK/Temp: 3800MHz 55/70°C │ Pass 13: ####
L1 Cache: 32KB 239 GB/s │ Test 70% ############################
L2 Cache: 512KB 94.5 GB/s │ Test #1 [Moving inversions, 8 bit pattern]
L3 Cache: 64MB 17.3 GB/s │ Testing: 43GB - 44GB [1GB of 63.9GB]
Memory: 63.9GB 5.64 GB/s │ Pattern: 0x2020202020202020
─────────────────────────────┴─────────────┬────────────────────────────────────
CPU: 12 Cores 24 Threads SMP: 24T (PAR) │ Time: 0:08:16 Status: Failed! /
IMC: 1600MHz (DDR4-3200) CAS 16-20-20-38 │ Pass: 0 Errors: 272
───────────────────────────────────────────┴────────────────────────────────────
pCPU Pass Test Failing Address Expected Found
---- ---- ---- --------------------- ---------------- ----------------
12 0 4 000320a22b38 (12.5GB) fefefefefefefefe fefe7efefefefefe
11 0 4 00031e036b38 (12.4GB) fefefefefefefefe fefe7efefefefefe
12 0 4 000320b26a38 (12.5GB) fefefefefefefefe fefe7efefefefefe
9 0 4 00031a902b38 (12.4GB) fefefefefefefefe fefe7efefefefefe
12 0 4 000320b86a38 (12.5GB) fefefefefefefefe fefe7efefefefefe
11 0 4 00031f576b38 (12.4GB) 0101010101010101 0101810101010101
12 0 4 000321b22b38 (12.5GB) 0101010101010101 0101810101010101
11 0 4 00031ec06a38 (12.5GB) 0101010101010101 0101810101010101
19 0 4 00031bc42b38 (12.4GB) 0101010101010101 0101810101010101
9 0 4 000318fb2b38 (12.5GB) 0101010101010101 0101810101010101
11 0 4 00031e9e2b38 (12.4GB) 0101010101010101 0101810101010101
13 0 4 000323446b38 (12.3GB) 0101010101010101 0101810101010101
<ESC> Exit <F1> Configuration <Space> Scroll Lock
I guess I had some bad RAM—or a bad RAM slot, or worse, a bad memory controller. Well, the easiest thing to try was disabling XMP/DOCP, which is technically considered “overclocking” even though it’s the speed printed on the memory module, and running the memory at the lower default speeds⁶ instead…
Still a deluge of errors. Oh well, time to dig deeper.
So I took out two sticks of RAM (the second and fourth slots from the CPU socket), and lo and behold, memtest86+ passed. Since the two sticks were on different memory channels, I guess that also rules out the memory controller.
So I plugged in another stick in the second slot, and the system refused to boot. I tried the other stick in the same slot and got the same results. So I put a stick into the last slot, and it booted. I guess the motherboard really didn’t like having the first three slots populated for some reason.
Then I ran memtest86+ again:
Memtest86+ v7.20 │ AMD Ryzen 9 3900X 12-Core Processor
CLK/Temp: 3793MHz 48/64°C │ Pass 10%: ####
L1 Cache: 32KB 244 GB/s │ Test 8%: ###
L2 Cache: 512KB 94.6 GB/s │ Test #1 [Moving inversions, 8 bit pattern]
L3 Cache: 64MB 10.1 GB/s │ Testing: 1GB - 2GB [1GB of 47.9GB]
Memory: 47.9GB 3.57 GB/s │ Pattern: 0x0808080808080808
─────────────────────────────┴─────────────┬────────────────────────────────────
CPU: 12 Cores 24 Threads SMP: 24T (PAR) │ Time: 0:04:49 Status: Failed! \
IMC: 1333MHz (DDR4-2666) CAS 20-19-19-43 │ Pass: 0 Errors: 54
───────────────────────────────────────────┴────────────────────────────────────
pCPU Pass Test Failing Address Expected Found
---- ---- ---- --------------------- ---------------- ----------------
15 0 3 0001a8d13538 (6.63GB) 0000000000000000 0008000000000000
15 0 3 0001a8593538 (6.62GB) ffffffffffffffff ff7fffffffffffff
15 0 3 0001a8bb3538 (6.63GB) ffffffffffffffff ff7fffffffffffff
14 0 3 0001a650b538 (6.59GB) ffffffffffffffff ff7fffffffffffff
15 0 3 0001a8bd9538 (6.63GB) ffffffffffffffff ff7fffffffffffff
14 0 3 0001a6b41538 (6.60GB) ffffffffffffffff ff7fffffffffffff
13 0 3 0001a54d1538 (6.58GB) ffffffffffffffff ff7fffffffffffff
14 0 3 0001a7653538 (6.61GB) 0000000000000000 0008000000000000
13 0 3 0001a4911538 (6.57GB) 0000000000000000 0008000000000000
15 0 3 0001a98b1538 (6.64GB) 0000000000000000 0008000000000000
14 0 3 0001a6d95338 (6.60GB) 0000000000000000 0008000000000000
15 0 3 0001a96b9538 (6.64GB) 0000000000000000 0008000000000000
<ESC> Exit <F1> Configuration <Space> Scroll Lock
It seemed like we had found the bad stick. Let’s double-check that the other stick works in the same slot, to rule out a second bad stick or a bad slot:
Memtest86+ v7.20 │ AMD Ryzen 9 3900X 12-Core Processor
CLK/Temp: 3792MHz 60/65°C │ Pass 1%:
L1 Cache: 32KB 242 GB/s │ Test 10%: ####
L2 Cache: 512KB 97.1 GB/s │ Test #2 [Address test, own address + window]
L3 Cache: 64MB 10.1 GB/s │ Testing: 1GB - 2GB [1GB of 47.9GB]
Memory: 47.9GB 3.58 GB/s │ Pattern: own address
─────────────────────────────┴─────────────┬────────────────────────────────────
CPU: 12 Cores 24 Threads SMP: 24T (PAR) │ Time: 1:24:14 Status: Pass \
IMC: 1333MHz (DDR4-2666) CAS 20-19-19-43 │ Pass: 1 Errors: 0
───────────────────────────────────────────┴────────────────────────────────────
Memory SPD Information
------------------
- Slot 0: 16GB DD ###### ## ##### #####
- Slot 1: 16GB DD ## ## #### ## ## ## ##
- Slot 2: 16GB DD ## ## ## ## ## ##
###### ## ## ##### #####
## ######## ## ##
## ## ## ## ## ## ##
## ## ## ##### #####
Press any key to remove this banner
ASUSTeK COMPUTER INC. PRIME X570-P
<ESC> Exit <F1> Configuration <Space> Scroll Lock 7.20.549cc.x64
Mystery solved.
I guess for now, instead of 64 GiB of RAM, I’ll have to make do with 48 GiB. It’s not the end of the world, though it is irritating, and I had to make some changes to the system to squeeze everything in, including adding more swap.
Making sure it all works
Before booting into the real OS, though, I booted into Finnix instead, so I could make sure that btrfs was working as expected. I simply ran btrfs check --readonly on every filesystem, and they all received a clean bill of health. With this, I finally rebooted into the real OS.
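For reference, the check runs against the unmounted devices, one per filesystem, something along these lines:

$ sudo btrfs check --readonly /dev/sda1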
Still, I made sure to run btrfs scrub on all the affected filesystems too, just to catch any lingering, detectable data corruption. Thankfully, no errors were found.
Warranty
With a clearly defective RAM stick, the obvious next step would be to submit a warranty claim and have the manufacturer send me a new one. The ADATA XPG Gammix D10 RAM is covered by a lifetime warranty, so I’m all good, right? I couldn’t have been more wrong.
So yes, in theory, the RAM is covered by a lifetime warranty. However, to submit a warranty claim, I’d have to sign up on ADATA/XPG’s web portal, which sent me a verification email containing the following snippet:
Thank you for registering an ADATA/XPG member account. If you used this email for the registration, please click the link below to complete the email verification:
Verify your email >>
Yes, the link in the email was unclickable, just like the transcription above. It seemed like I couldn’t file a claim…
Later, I randomly checked my spambox, and found 13 emails, all identical, all from ADATA, containing an actual activation link. So I was able to get into the portal… eventually. It would be an understatement to say this was a janky experience though.
Then, when I attempted to submit a warranty claim, there was no mention of shipping labels anywhere. A quick Google search revealed that ADATA makes you ship the product to them at your own expense. Furthermore, ADATA also mentioned this:
If we do not receive your product within 30 days after successful submission of your RMA application, your application will be automatically voided. You will be required to submit a new RMA application.
I decided not to bother, since the cost of international shipping, especially shipping that isn’t guaranteed to arrive within 30 days, outweighs the cost of getting another stick of DDR4-3200 RAM, which could be found for $33 on Amazon Canada at the time of writing.
I am less than pleased with ADATA/XPG, especially since this happened after my horrible experience with the SSD failure, which was also an ADATA/XPG product. Henceforth, I will never purchase any ADATA/XPG product and I would strongly advise everyone to avoid this brand at all costs.
Retrospective
I suppose with all the errors and difficulties recovering the data, some might feel that it’s indicative of btrfs’s fragility. Yet, after this experience, I feel that btrfs’s approach of failing loudly was ultimately the right call. If I had used ext4, my backup would have been restored perfectly fine, and I wouldn’t have been aware of the memory corruption at all until much later, when the corruption had gotten bad enough to ruin my data in much more conspicuous ways. By then, it would have been way too late.
I will also note that a warranty offers little consolation after wasting a whole day on this madness. With the price of RAM as it is right now, if I consider the value of a replacement stick as compensation for my troubles, I’d be working for far less than minimum wage. So no, it’s almost certainly worth buying better RAM rather than saving a buck on crappy RAM, even if the lifetime warranty were practical to use. With that in mind, I’ve decided to purchase ECC RAM from a reputable brand—which happened to be priced sanely at the time of writing—and use that in my server going forward.
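Once the ECC modules are installed, it’s also worth confirming that ECC is actually active rather than silently running in non-ECC mode. One way to check, assuming dmidecode is available:

$ sudo dmidecode --type memory | grep -i 'error correction'
# "Multi-bit ECC" (or similar) means ECC is active; "None" means it is not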
Still, what does this mean for my data going forward? Well, everything that was written to disk before the incident is probably fine. Any file written recently could have been silently damaged, and there really isn’t much I can do about that. All the important files have backups, so I can always pull them from a backup. I’ll definitely make sure to retain the backups from before this incident while I sort things out…
I will end this post by referencing Linus Torvalds’ angry email about the importance of ECC:
ECC absolutely matters.
…
Yes, I’m pissed off about it. You can find me complaining about this literally for decades now. I don’t want to say “I was right”. I want this fixed, and I want ECC.
If I had had ECC RAM, this memory corruption would have been detected, and I’d have been made aware of it without any data corruption. Unfortunately, ECC isn’t a standard feature on PCs and wasn’t priced sanely when I built the NAS. If only it were…
If you can afford ECC, I would strongly advise you to get it. Subtle data corruption isn’t something you want to deal with; it always comes back to bite you when you need the data the most. I’d rather spend a bit more and have full faith in the integrity of my data, instead of wondering what got corrupted.
Notes
1. Yes, I wrote that series on a fresh Windows 11 VM and was too lazy to upgrade my Windows 10 VM this whole time.
2. Why zstd? Because it’s faster than gzip and compresses better, and I didn’t have the patience to wait for xz to finish compressing.
3. The astute among you may have noticed that I am actually using LVM to store the VM disk, so I could have taken an LVM snapshot and reverted it that way. Yes, that probably would have been smarter, but then I wouldn’t have discovered this memory corruption and would probably have lost a lot more data.
4. If this doesn’t show how important backups are, I don’t know what does. Please back up your important files!
5. Yes, I ran this on the server. In retrospect, doing this on a machine with memory corruption wasn’t a good idea…
6. This is commonly called the “JEDEC speed”, since JEDEC (the Joint Electron Device Engineering Council) is the organization that manages the standard RAM timings; XMP profiles run the memory faster than these standard speeds.