On btrfs and memory corruption
As you may have heard, I have a home server, which hosts mirror.quantum5.ca and doubles as my home NAS. To ensure my data is protected, I am running btrfs, which has data checksums to ensure that bit rot can be detected, and I am using the raid1 mode to enable btrfs to recover from such events and restore the correct data. In this mode, btrfs ensures that there are two copies of every file, each on a distinct drive. In theory, if one of the copies is damaged due to bit rot, or even if an entire drive is lost, all of your data can still be recovered.
For years, this setup has worked perfectly. I originally created this btrfs array on my old Atomic Pi, with drives inside a USB HDD dock, and the same array is still running on my current Ryzen home server—five years later—even after a bunch of drive changes and capacity upgrades.
In the past week, however, my NAS has experienced some terrible data corruption issues. Most annoyingly, it damaged a backup that I needed to restore, forcing me to perform some horrific sorcery to recover most of the data. After a whole day of troubleshooting, I was eventually able to track the problem down to a bad stick of RAM. Removing it enabled my NAS to function again, albeit with less available RAM than before.
I will now explain my setup and detail the entire event for posterity, including my thoughts on how btrfs fared against such memory corruption, how I managed to mostly recover the broken backup, and what might be done to prevent this in the future.
Background: Why btrfs?
A lot of you might be wondering—why btrfs? Why not ZFS, which is the more popular filesystem with similar data protection?
There are several reasons for this:
- The biggest reason is that btrfs is completely flexible on what sort of devices you can include in the array. This was my biggest consideration as a poor student who couldn’t easily afford all the drives I’d end up needing up front and had to slowly upgrade by adding drives.
  If, say, I bought two 4 TB drives, then in btrfs raid1 mode, I’d have 4 TB of usable storage. If I wanted more, I could add an 8 TB drive, and btrfs would ensure each file is mirrored on one of the 4 TB drives, creating 8 TB of usable storage (see the sketch after this list). This was a very attractive feature when money was tight. With ZFS, this kind of flexibility doesn’t exist. While there are upgrade options, they are highly restrictive, and I will not go into the details here, since that would involve explaining the entire ZFS structure, which can easily take up an entire blog post on its own.
- btrfs is in-tree, while ZFS isn’t and never will be due to licensing issues. As a Linux user, this means that I can upgrade the kernel at any time I wish without worrying about whether ZFS-on-Linux supports the new kernel or not, and historically, the kernel developers have not cared about breaking ZFS. It also means that any modern Linux install can read my data, and I don’t need to specifically prepare a rescue disk that can read ZFS in case things go wrong, which would really limit my options in an emergency.
- btrfs is fairly lightweight on memory usage, and I started this entire journey on an Atomic Pi with 2 GiB of RAM. While it is possible to configure ZFS to run on a system with so little RAM, it isn’t so practical and has been known to trigger the OOM killer, and I didn’t want to fiddle around with all the ZFS knobs to get it to work even though it’s theoretically possible.
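To make that flexibility concrete, here’s a minimal sketch of how such an array is created and later grown. The device names and mount point are illustrative, not my actual setup:

# Create a two-drive array with both data and metadata mirrored
$ sudo mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc
# Later, add a third (larger) drive and rebalance so existing chunks
# get redistributed across the new set of drives
$ sudo btrfs device add /dev/sdd /mnt/data
$ sudo btrfs balance start /mnt/data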
To be fair to ZFS, there are several things that I wish I could use:
- Fully functional data striping with RAIDZ, allowing more efficient usage with a larger number of drives. While btrfs has RAID 5/6 support, it’s still not fully ready, and at this rate, may never be. This means I am effectively locked to raid1, which results in only 50% of the total capacity being usable. This isn’t too big of a deal though, especially since I still only have two drives. Back then, it was also impossible to expand a RAIDZ array, so adding a drive would require the entire array to be rebuilt…
- zvols, which are block devices similar to LVM, but have all the niceties of ZFS checksums. These could be used for virtual machines, for example, and guarantee that the data is protected. btrfs unfortunately doesn’t have any similar feature.
Ultimately though, these are not sufficiently compelling for me to abandon a completely working btrfs setup, especially since I’d effectively need to buy a bunch of new drives and build a completely new array, which is a lot of work and will inevitably incur significant downtime. If I ever decide to build a giant, rack-mounted dedicated storage server with 20+ 20 TB drives, I’d probably switch to ZFS, but that’s not happening anytime soon.
Other details of the setup
Here’s the specification for relevant hardware in my home server immediately before the event:
- CPU: AMD Ryzen 9 3900X
- Motherboard: ASUS Prime X570-P
- RAM: 4×ADATA XPG Gammix D10 16 GiB 3200 MT/s
The CPU and motherboard are from my old desktop, while the RAM was the cheapest available during the COVID-19 supply chain disruptions. For a NAS, I’d normally have tried to get ECC memory, which can correct single-bit flips and detect and report bigger errors. It’s highly recommended for both ZFS and btrfs, but it wasn’t priced sanely back then, which is a shame, since the CPU has unofficial ECC support and the motherboard supports it as well.
The Corruption
So what exactly happened? Well, I had a Windows 10 gaming VM¹ with Looking Glass, and I wanted to upgrade to Windows 11 with the end-of-support date for Windows 10 coming up. To prevent any accidents, I took a backup of the VM by using zstd² on the entire disk image and stashing it on my NAS.
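The backup amounted to something along these lines; the paths and options here are a rough sketch rather than the literal command I ran:

$ sudo dd if=/dev/vg0/win10 bs=1M status=progress | zstd -T0 -o /mnt/nas/backups/win10.img.zst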
Well, somehow I managed to botch the Windows 11 update, so I decided to restore the backup I took, because it was easier than trying to fix the Windows 11 install.
So I ran the command to restore the disk³:
$ zstd -dc win10.img.zst | sudo dd of=/dev/vg0/win10 bs=1M
zstd: error 37 : Read error
What the @!#* just happened? Do I have to rebuild this VM from scratch now?
After I recovered from the shock, I did the obvious thing of looking at dmesg on the NAS to see what was going on:
Dec 21 01:54:17 kernel: BTRFS warning (device sda1): csum failed root 5 ino 1507910 off 3079737344 csum 0xee82929e expected csum 0x507475f6 mirror 2
Dec 21 01:54:17 kernel: BTRFS error (device sda1): bdev /dev/sda1 errs: wr 0, rd 0, flush 0, corrupt 23, gen 0
Dec 21 01:54:17 kernel: BTRFS warning (device sda1): csum failed root 5 ino 1507910 off 3079737344 csum 0xee82929e expected csum 0x507475f6 mirror 1
Dec 21 01:54:17 kernel: BTRFS error (device sda1): bdev /dev/sdc1 errs: wr 0, rd 0, flush 0, corrupt 22, gen 0
This was not good. It meant btrfs found a checksum mismatch and couldn’t recover by reading from the mirror. Since both copies somehow had the same incorrect checksum, the data was most likely damaged in memory rather than on disk.
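Incidentally, the per-device error counters from those log lines can also be queried directly. A quick check against the HDD array’s mount point (/data, as shown further below) looks roughly like this:

$ sudo btrfs device stats /data
# a non-zero corruption_errs counter on any device is a red flag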
At first, I thought this was just a soft error and I just happened to get really unlucky. If only I had that ECC RAM… it might have corrected the error and I’d not have lost my data. But then, looking at dmesg, I saw something ominous—similar errors on my SSD btrfs array, which I use to store things like my photos with Immich.
Something seemed wrong, so the first thing I did was to run a btrfs scrub to determine the extent of the damage. It seemed like two files in my photo library were damaged. Well, at least my photos are backed up⁴, so I pulled the damaged files from the backup and replaced them. Running another btrfs scrub revealed that the SSD array was fully repaired. Whew.
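For reference, starting a scrub and checking on it are one-liners; here I’m using the HDD array’s /data mount point as an example:

$ sudo btrfs scrub start /data
$ sudo btrfs scrub status /data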
In retrospect, it should have been obvious what went wrong: multiple files got corrupted, and if these were soft errors, that was a lot of them in a very short span of time. It was far more likely that something was wrong with the hardware. The smart thing to do would have been to shut down the server and figure out what went wrong. But no, I thought maybe I just got really unlucky and multiple bits got flipped by a single cosmic ray, somehow corrupting three different files… So instead, I assumed it was something transient and proceeded to recover my backup. This wasn’t a great idea. If you ever find yourself in a similar situation, please make sure your hardware is working first.
Data recovery attempt
Since I found out about the problems while overwriting the VM disk, the Windows 11 install, on top of its existing problems, was also partially overwritten and fully corrupted. At that point, it was impossible to recover anything from it—and I tried with ntfsrecover from ntfs-3g—which meant I was fully committed to restoring the backup.
While something was obviously wrong with the backup, I could probably still get most of the data on the VM back if I could decompress past the errors. This meant I had to get the file to zstd without hitting the EIO that caused it to error out. So what if I just salvaged all the parts of the file that I could and left the broken sections as zeroes?
Since the drive failure incident, I’ve learned how to use ddrescue, which makes multiple attempts to read all the readable blocks in a file, leaving the totally unreadable blocks as zeroes. So I simply ran ddrescue to rescue the file⁵:
$ ddrescue /path/to/broken/btrfs/win10.img.zst /path/to/other/disk/win10.img.zst
After ages and a bunch more errors in dmesg (not a good sign…), ddrescue finished. So of course, I tried again:
$ ssh nas zstd -dc /path/to/other/disk/win10.img.zst | sudo dd of=/dev/vg0/win10 bs=1M
/path/to/other/disk/win10.img.zst : Decoding error (36) : Data corruption detected
This wasn’t good. It was probably because btrfs failed to return anything in the entire 4 kiB block that had the corruption, so it was all replaced with zeroes and zstd couldn’t continue decompressing. Now what?
I refused to believe that losing a single 4 kiB section of the backup meant that I couldn’t recover anything, so I persisted. I figured that while a bit or two in the 4 kiB block might be bad, having the block available to zstd would probably allow the whole file to be decompressed. So I just needed to convince btrfs to return the corrupted blocks to the best of its ability…
btrfs, of course, has exactly that ability. You simply need to mount with -o rescue=ignoredatacsums. However, you can only use this option if you mount the filesystem as read-only, which means I’d need to take down the entire server. Since the server was running a bunch of “important” stuff, and I still thought the corruption was a singular event, I figured there had to be another way to do this online.
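For completeness, the offline approach would have looked something like the following (the mount point is illustrative); the option only works together with a read-only mount:

$ sudo mount -o ro,rescue=ignoredatacsums /dev/sda1 /mnt/rescue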
I figured that if I could just find where on the block device btrfs stored the corrupted block, I could just use dd to read the block out and replace the zeroed block in the ddrescue output. As it turned out, the filefrag command does exactly that, returning the physical offset of each fragment of the file.
But wait… I was running btrfs in RAID 1 mode on two different drives. Presumably, btrfs doesn’t put the same data at the same offsets on each drive, so which drive was the physical offset relative to? I tried looking on both drives at a test file with known content, and that offset on neither drive contained the correct data. As it turned out, it was actually the btrfs “logical” block that filefrag returned, which wasn’t that helpful for this task.
After some Googling, I found this C program on GitHub that returned a mapping similar to filefrag, except it also returned the physical offset on each device in the btrfs array. So of course, I compiled the program, dd’ed the physical block for the test file, and voilà, I got the correct data.
So I just had to compute the offset for each corrupted block and recover the data, right? Unfortunately, there were 11 different corrupted blocks, and I didn’t want to manually calculate the offset for each one. So instead, I wrote a small Python library that can parse the output of that C program and compute the physical offset on each device:
import bisect
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class BtrfsExtent:
    file_offset: int
    file_size: int
    extent_offset: int
    extent_type: str
    logical_size: int
    logical_offset: int
    physical_size: int
    # Mapping from btrfs device ID to the physical byte offset on that device
    device_offset: dict[int, int]


@dataclass
class BtrfsMapPhysical:
    extents: list[BtrfsExtent]

    @classmethod
    def from_file(cls, file):
        extents = []
        for line in file:
            if line.startswith('FILE'):
                continue  # skip the header line
            fields = line.rstrip('\r\n').split('\t')
            if fields[0]:
                extents.append(BtrfsExtent(
                    int(fields[0]),  # file_offset
                    int(fields[1]),  # file_size
                    int(fields[2]),  # extent_offset
                    fields[3],       # extent_type
                    int(fields[4]),  # logical_size
                    int(fields[5]),  # logical_offset
                    int(fields[6]),  # physical_size
                    {int(fields[7]): int(fields[8])},
                ))
            else:
                # Continuation line: same extent, another device in the array
                extents[-1].device_offset[int(fields[7])] = int(fields[8])
        return cls(extents)

    def lookup_physical(self, offset):
        extent = self.extents[bisect.bisect(self.extents, offset, key=attrgetter('file_offset')) - 1]
        assert extent.file_offset <= offset < extent.file_offset + extent.file_size
        delta = offset - extent.file_offset
        return {devid: dev_offset + delta for devid, dev_offset in extent.device_offset.items()}
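Usage is straightforward. A hypothetical example, assuming the mapping program’s output was saved to win10.map.tsv beforehand (the file name is mine, purely for illustration):

with open('win10.map.tsv') as f:
    fsmap = BtrfsMapPhysical.from_file(f)

# File offset 3079737344 (from the dmesg csum error) maps to a physical
# byte offset on each device in the array, keyed by btrfs device ID.
print(fsmap.lookup_physical(3079737344))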
Since the device_offset attribute is a mapping from device ID to a physical offset, I needed to ask btrfs which device ID maps to which block device:
# btrfs filesystem show /data
Label: none uuid: bc943e9f-598b-4cb1-8276-e4158a0dc5b7
    Total devices 3 FS bytes used 10.27TiB
    devid 4 size 14.55TiB used 10.41TiB path /dev/sda1
    devid 5 size 14.55TiB used 10.41TiB path /dev/sdc1
This meant I could use the device ID 4 offset to read from /dev/sda1 and the device ID 5 offset to read from /dev/sdc1.
Then, I wrote a little script to generate dd commands to copy the raw blocks to fill in the blanks in the ddrescue output:
# fsmap is the BtrfsMapPhysical instance parsed from the mapping output
offsets = [
    3079737344,
    # ... more offsets pulled from dmesg
]

for i in offsets:
    # 4 kiB block number of this file offset on devid 4 (/dev/sda1)
    j = fsmap.lookup_physical(i)[4] // 4096
    print(f'sudo dd conv=notrunc if=/dev/sda1 bs=4k skip={j} count=1 seek={i // 4096} of=win10.img.zst')
The generated command used conv=notrunc to allow dd to edit a segment of the file without deleting anything else. I used bs=4k to work in 4 kiB blocks, which is what btrfs uses, skip to jump to the right offset on /dev/sda1, seek to jump to the corresponding offset in the output file, and count=1 to copy a single block. I also divided the byte offsets by 4096 to turn them into block offsets.
Then I ran the generated dd commands and tried decompressing again. This time, there were no errors, and the VM promptly booted into Windows 10.
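Had I wanted to check the patched archive before writing it over the logical volume again, zstd can also test integrity without writing the decompressed data anywhere:

$ zstd -t /path/to/other/disk/win10.img.zst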
Did I recover the data to successfully decompress the backup? Yes.
Did I recover the entire VM disk as it was originally? No.
Yet, the VM did boot, so I suppose it was a success in a way. Still, something probably got corrupted, but oh well. I didn’t really have anything particularly important on the VM, and I was able to successfully upgrade to Windows 11 later, which should have fixed any corrupted system files, and Steam was able to fix any corrupted game files.
Attempt to fix the filesystem
Since I still thought this was a singular event, I then attempted to fix the filesystem. So naturally, I ran btrfs scrub on the HDD array, which took over half a day to complete. It managed to correct some files, but several were irreversibly damaged; fortunately, they were either backups that could simply be deleted or files that were easily restored from backup.
Still, it was a bit strange that some of the files btrfs scrub reported as corrupted could then be read successfully on a subsequent attempt…
So anyways, after the scrubbing, I decided that I might as well keep the rescued Windows 10 image as a backup indefinitely, so I copied the file back into its original location, which succeeded without issue.
Just to be sure that nothing went wrong, though, I decided to run sha512sum on both files:
$ sha512sum /path/to/btrfs/win10.img.zst /path/to/other/disk/win10.img.zst
b9f2ca1c5d6a964cfae9dd59f20df3635e4f57530e8c5e33e6fd057d08eb3699e34d6568832b41584369dba503da2ee58079fccbd05a51b82cefefbf8b509777 /path/to/btrfs/win10.img.zst
120f3f6bffb3ac141e18f6253c8b7acc735de1701acd50574fdf6cb86e528f73633c07c7958f6cc72dfed82d40254cbb3ce5e6c0b90a271577ace42610352e88 /path/to/other/disk/win10.img.zst
What the @#*$? Okay, something is REALLY wrong.
Just to make sure it wasn’t a fluke, I copied it again.
$ sha512sum /path/to/btrfs/win10.img.zst /path/to/other/disk/win10.img.zst
cff0f83d39feedc2e0ebab29460ed9366d6c8a8a54c086b791e2ae50a532642b6291d6a03cca4c44e2af57cd555766fa6fc2da8417bd796a820ad67348a1d7de /path/to/btrfs/win10.img.zst
120f3f6bffb3ac141e18f6253c8b7acc735de1701acd50574fdf6cb86e528f73633c07c7958f6cc72dfed82d40254cbb3ce5e6c0b90a271577ace42610352e88 /path/to/other/disk/win10.img.zst
Oh no. OH NO! Something is really messed up.
This could only mean… data was actively being corrupted in memory. Suddenly, everything made sense:
- Why were files corrupted? Memory was corrupted.
- Why were both copies corrupted in raid1? Because the data was corrupted in memory and then written to both drives.
- Why did some files have unrecoverable errors during scrubs, but could later be read back perfectly fine? Because the memory was corrupted during the scrub, so the checksums didn’t match.
There was no safe way to keep the server running, so I shut it down and ran memtest86+.
Memory testing
To no particular surprise, this was what I got:
Memtest86+ v7.20 │ AMD Ryzen 9 3900X 12-Core Processor
CLK/Temp: 3800MHz 55/70°C │ Pass 13: ####
L1 Cache: 32KB 239 GB/s │ Test 70% ############################
L2 Cache: 512KB 94.5 GB/s │ Test #1 [Moving inversions, 8 bit pattern]
L3 Cache: 64MB 17.3 GB/s │ Testing: 43GB - 44GB [1GB of 63.9GB]
Memory: 63.9GB 5.64 GB/s │ Pattern: 0x2020202020202020
─────────────────────────────┴─────────────┬────────────────────────────────────
CPU: 12 Cores 24 Threads SMP: 24T (PAR) │ Time: 0:08:16 Status: Failed! /
IMC: 1600MHz (DDR4-3200) CAS 16-20-20-38 │ Pass: 0 Errors: 272
───────────────────────────────────────────┴────────────────────────────────────
pCPU Pass Test Failing Address Expected Found
---- ---- ---- --------------------- ---------------- ----------------
12 0 4 000320a22b38 (12.5GB) fefefefefefefefe fefe7efefefefefe
11 0 4 00031e036b38 (12.4GB) fefefefefefefefe fefe7efefefefefe
12 0 4 000320b26a38 (12.5GB) fefefefefefefefe fefe7efefefefefe
9 0 4 00031a902b38 (12.4GB) fefefefefefefefe fefe7efefefefefe
12 0 4 000320b86a38 (12.5GB) fefefefefefefefe fefe7efefefefefe
11 0 4 00031f576b38 (12.4GB) 0101010101010101 0101810101010101
12 0 4 000321b22b38 (12.5GB) 0101010101010101 0101810101010101
11 0 4 00031ec06a38 (12.5GB) 0101010101010101 0101810101010101
19 0 4 00031bc42b38 (12.4GB) 0101010101010101 0101810101010101
9 0 4 000318fb2b38 (12.5GB) 0101010101010101 0101810101010101
11 0 4 00031e9e2b38 (12.4GB) 0101010101010101 0101810101010101
13 0 4 000323446b38 (12.3GB) 0101010101010101 0101810101010101
<ESC> Exit <F1> Configuration <Space> Scroll Lock
I guess I had some bad RAM—or a bad RAM slot, or worse, a bad memory controller. Well, the easiest thing to try was disabling XMP/DOCP, which is technically considered “overclocking” even though it’s the speed printed on the memory module, and running the memory at the lower default speeds⁶ instead…
Still a deluge of errors. Oh well, time to dig deeper.
So I took out two sticks of RAM (the second and fourth slots from the CPU socket), and lo and behold, memtest86+ passed. Since the two sticks were on different memory channels, I guess that also rules out the memory controller.
So I plugged in another stick in the second slot, and the system refused to boot. I tried the other stick in the same slot and got the same results. So I put a stick into the last slot, and it booted. I guess the motherboard really didn’t like having the first three slots populated for some reason.
Then I ran memtest86+ again:
Memtest86+ v7.20 │ AMD Ryzen 9 3900X 12-Core Processor
CLK/Temp: 3793MHz 48/64°C │ Pass 10%: ####
L1 Cache: 32KB 244 GB/s │ Test 8%: ###
L2 Cache: 512KB 94.6 GB/s │ Test #1 [Moving inversions, 8 bit pattern]
L3 Cache: 64MB 10.1 GB/s │ Testing: 1GB - 2GB [1GB of 47.9GB]
Memory: 47.9GB 3.57 GB/s │ Pattern: 0x0808080808080808
─────────────────────────────┴─────────────┬────────────────────────────────────
CPU: 12 Cores 24 Threads SMP: 24T (PAR) │ Time: 0:04:49 Status: Failed! \
IMC: 1333MHz (DDR4-2666) CAS 20-19-19-43 │ Pass: 0 Errors: 54
───────────────────────────────────────────┴────────────────────────────────────
pCPU Pass Test Failing Address Expected Found
---- ---- ---- --------------------- ---------------- ----------------
15 0 3 0001a8d13538 (6.63GB) 0000000000000000 0008000000000000
15 0 3 0001a8593538 (6.62GB) ffffffffffffffff ff7fffffffffffff
15 0 3 0001a8bb3538 (6.63GB) ffffffffffffffff ff7fffffffffffff
14 0 3 0001a650b538 (6.59GB) ffffffffffffffff ff7fffffffffffff
15 0 3 0001a8bd9538 (6.63GB) ffffffffffffffff ff7fffffffffffff
14 0 3 0001a6b41538 (6.60GB) ffffffffffffffff ff7fffffffffffff
13 0 3 0001a54d1538 (6.58GB) ffffffffffffffff ff7fffffffffffff
14 0 3 0001a7653538 (6.61GB) 0000000000000000 0008000000000000
13 0 3 0001a4911538 (6.57GB) 0000000000000000 0008000000000000
15 0 3 0001a98b1538 (6.64GB) 0000000000000000 0008000000000000
14 0 3 0001a6d95338 (6.60GB) 0000000000000000 0008000000000000
15 0 3 0001a96b9538 (6.64GB) 0000000000000000 0008000000000000
<ESC> Exit <F1> Configuration <Space> Scroll Lock
It seemed like we had found the bad stick. Let’s double-check that the other stick works in the same slot, to rule out a second bad stick or a bad slot:
Memtest86+ v7.20 │ AMD Ryzen 9 3900X 12-Core Processor
CLK/Temp: 3792MHz 60/65°C │ Pass 1%:
L1 Cache: 32KB 242 GB/s │ Test 10%: ####
L2 Cache: 512KB 97.1 GB/s │ Test #2 [Address test, own address + window]
L3 Cache: 64MB 10.1 GB/s │ Testing: 1GB - 2GB [1GB of 47.9GB]
Memory: 47.9GB 3.58 GB/s │ Pattern: own address
─────────────────────────────┴─────────────┬────────────────────────────────────
CPU: 12 Cores 24 Threads SMP: 24T (PAR) │ Time: 1:24:14 Status: Pass \
IMC: 1333MHz (DDR4-2666) CAS 20-19-19-43 │ Pass: 1 Errors: 0
───────────────────────────────────────────┴────────────────────────────────────
Memory SPD Information
------------------
- Slot 0: 16GB DD ###### ## ##### #####
- Slot 1: 16GB DD ## ## #### ## ## ## ##
- Slot 2: 16GB DD ## ## ## ## ## ##
###### ## ## ##### #####
## ######## ## ##
## ## ## ## ## ## ##
## ## ## ##### #####
Press any key to remove this banner
ASUSTeK COMPUTER INC. PRIME X570-P
<ESC> Exit <F1> Configuration <Space> Scroll Lock 7.20.549cc.x64
Mystery solved.
I guess for now, instead of 64 GiB of RAM, I’ll have to make do with 48 GiB. It’s not the end of the world, though it is irritating, and I had to make some changes to the system to squeeze everything in, including adding more swap.
Making sure it all works
Before booting into the real OS, though, I booted into Finnix instead, so I could make sure that btrfs was working as expected. I simply ran btrfs check --readonly on every filesystem, and they all received a clean bill of health. With this, I finally rebooted into the real OS.
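For reference, the check runs against the unmounted devices, one per filesystem, something along these lines:

$ sudo btrfs check --readonly /dev/sda1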
Still, I made sure to run btrfs scrub on all the affected filesystems too, just to catch any lingering, detectable data corruption. Thankfully, no errors were found.
Warranty
With a clearly defective RAM stick, the obvious next step would be to submit a warranty claim and have the manufacturer send me a new one. The ADATA XPG Gammix D10 RAM is covered by a lifetime warranty, so I’m all good, right? I couldn’t have been more wrong.
So yes, in theory, the RAM is covered by a lifetime warranty. However, to submit a warranty claim, I’d have to sign up on ADATA/XPG’s web portal, which sent me a verification email containing the following snippet:
Thank you for registering an ADATA/XPG member account. If you used this email for the registration, please click the link below to complete the email verification:
Verify your email >>
Yes, the link in the email was unclickable, just like the transcription above. It seemed like I couldn’t file a claim…
Later, I randomly checked my spambox, and found 13 emails, all identical, all from ADATA, containing an actual activation link. So I was able to get into the portal… eventually. It would be an understatement to say this was a janky experience though.
Then, when I attempted to submit a warranty claim, there was no mention of shipping labels anywhere. A quick Google search revealed that ADATA makes you ship the product to them at your own expense. Furthermore, ADATA also mentioned this:
If we do not receive your product within 30 days after successful submission of your RMA application, your application will be automatically voided. You will be required to submit a new RMA application.
I decided not to bother, since the cost of international shipping, especially shipping that isn’t guaranteed to arrive within 30 days, outweighs the cost of getting another stick of DDR4-3200 RAM, which could be found for $33 on Amazon Canada at the time of writing.
I am less than pleased with ADATA/XPG, especially since this happened after my horrible experience with the SSD failure, which was also an ADATA/XPG product. Henceforth, I will never purchase any ADATA/XPG product and I would strongly advise everyone to avoid this brand at all costs.
Retrospective
I suppose with all the errors and difficulties recovering the data, some might feel that it’s indicative of btrfs’s fragility. Yet, after this experience, I feel that btrfs’s approach of failing loudly was ultimately the right call. If I had used ext4, my backup would have been restored perfectly fine, and I wouldn’t have been aware of the memory corruption at all until much later, when the corruption had gotten bad enough to ruin my data in much more conspicuous ways. By then, it would have been way too late.
I will also note that a warranty offers little consolation after wasting a whole day on this madness. With the price of RAM as it is right now, if I consider the value of a replacement stick as compensation for my troubles, I’d be working for far less than minimum wage. So no, it’s almost certainly worth buying better RAM rather than saving a buck on crappy RAM, even if the lifetime warranty were practical to use. With that in mind, I’ve decided to purchase ECC RAM from a reputable brand—which happened to be priced sanely at the time of writing—and use that in my server going forward.
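Once the ECC modules are installed, it’s also worth confirming that ECC is actually active rather than silently running in non-ECC mode. One way to check, assuming dmidecode is available:

$ sudo dmidecode --type memory | grep -i 'error correction'
# "Multi-bit ECC" (or similar) means ECC is active; "None" means it is not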
Still, what does this mean for my data going forward? Well, everything that was written to disk before the incident is probably fine. Any file written recently could have been silently damaged, and there really isn’t much I can do about that. All the important files have backups, so I can always pull them from a backup. I’ll definitely make sure to retain the backups from before this incident while I sort things out…
I will end this post by referencing Linus Torvalds’ angry email about the importance of ECC:
ECC absolutely matters.
…
Yes, I’m pissed off about it. You can find me complaining about this literally for decades now. I don’t want to say “I was right”. I want this fixed, and I want ECC.
If I had had ECC RAM, this memory corruption would have been detected, and I’d have been made aware of it without any data corruption. Unfortunately, ECC isn’t a standard feature on PCs and wasn’t priced sanely when I built the NAS. If only it were…
If you can afford ECC, I would strongly advise you to get it. Subtle data corruption isn’t something you want to deal with; it always comes back to bite you when you need the data the most. I’d rather spend a bit more and have full faith in the integrity of my data, instead of wondering what got corrupted.
Notes
1. Yes, I wrote that series on a fresh Windows 11 VM and was too lazy to upgrade my Windows 10 VM this whole time.
2. Why zstd? Because it’s faster than gzip and compresses better, and I didn’t have the patience to wait for xz to finish compressing.
3. The astute among you may have noticed that I am actually using LVM to store the VM disk, so I could have taken an LVM snapshot and reverted it that way. Yes, that probably would have been smarter, but then I wouldn’t have discovered this memory corruption and would probably have lost a lot more data.
4. If this doesn’t show how important backups are, I don’t know what does. Please back up your important files!
5. Yes, I ran this on the server. In retrospect, doing this on a machine with memory corruption wasn’t a good idea…
6. This is commonly called the “JEDEC speed”, since JEDEC (the Joint Electron Device Engineering Council) is the organization that manages the standard RAM timings; XMP profiles run the memory faster than these standard speeds.