Last time, I talked about how a bad stick of RAM drove me to buy ECC RAM for my Ryzen 9 3900X home server build—mostly because ECC would have been able to detect that something was wrong with the RAM and also correct single-bit errors, which would have saved me a ton of headache.

Now that I’ve received the RAM and run it for a while, I’ll write about the entire experience of getting the RAM working and my attempts to cause errors to verify the ECC functionality.

Spoilers: Injecting faults was way harder than it appeared from online research.

Buying the ECC RAM

Before buying RAM, we must first determine what type of RAM we need. For the Ryzen 9 3900X, we need DDR4 SDRAM, but DDR3 and DDR5 SDRAM are also in common use at the time of writing.

Among the modern generations of RAM, though, there are several subtypes:

  • DIMM (dual inline memory module): due to SIMMs being obsolete for two decades at this point, this is the only type of desktop RAM that matters;
  • SO-DIMM (small outline DIMM): this is the smaller RAM commonly used for laptops and mini-PCs;
  • Unbuffered: memory directly accessed by the memory controller on the CPU; and
  • Registered: memory with a register between the DRAM chips and the CPU’s memory controller, which puts less electrical load on the memory controller and allows it to handle more memory modules and thus higher capacities.

Typically, registered memory is used in server-class machines while unbuffered memory is used in desktop-class machines, ostensibly to save costs. However, registered ECC modules are way more common than unbuffered ECC modules and are thus cheaper when buying second-hand. This is because nearly all registered server memory is ECC, while most unbuffered desktop RAM isn’t due to the misconception that ECC is unnecessary for desktop applications.

Nevertheless, the Ryzen 9 3900X requires unbuffered memory, and my desktop motherboard takes the DIMM form factor, so unbuffered DIMMs (UDIMMs) were the RAM type I needed to buy. If you are dealing with a different system, make sure you are buying the correct RAM type.

Also, since I wanted 64 GiB of RAM in the system and have four RAM slots, I needed four sticks of DDR4 ECC UDIMMs of 16 GiB capacity each.

Buying the RAM I needed was as easy as searching for ddr4 ecc udimm 16gb on eBay. I wasn’t so picky with the RAM speed, but I found some brand new Crucial CT16G4WFD8266 RAM, which is 2666 MT/s1, on sale for about as cheap as used RAM, so I went with that.

Naturally, if you are considering buying RAM, it’s up to you to score a good deal. Remember to check reviews and be careful out there. While I trust eBay’s buyer protection, you can still have strange experiences, as I did.

eBay delivery

So I ordered the RAM from eBay in December and the package arrived in early January. Now, that should have been the end of it, but as it turned out, the package that was tracked by eBay contained… a USB cable. At this point, I was pretty upset and opened a dispute, to which the seller didn’t reply, at least initially.

Then, a few days later, an unexpected FedEx package arrived at my door. Opening it, I found the RAM. Adding to the mystery was that the FedEx tracking showed that it had been sent out before I received the other package containing the USB cable.

Now, FedEx has some pretty scummy practices here, like sending an invoice for duties weeks after the package was delivered, complete with a “disbursement fee” for clearing customs on my behalf… without ever receiving my consent to act as my agent. This was rather upsetting, but at least the eBay seller eventually refunded me the amount that FedEx charged, since it was ultimately their fault for not sending the RAM in the eBay-tracked package, which had the duties already paid.

I have no idea why any of this happened, but at least I got what I paid for in the end…

Installing the RAM

Once I had the RAM, I naturally needed to install it. So I found a convenient time, took down the server, and replaced the three remaining sticks of unbuffered non-ECC DIMMs with the four sticks of ECC RAM. The server booted right up, but I chose to boot into Finnix while I figured out the ECC situation.

Verifying ECC functionality

The first thing to check is that the RAM says it’s ECC:

Linux finnix 6.8.12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.8.12-1 (2024-05-31) x86_64
root@tty1:~# dmidecode -t memory | grep -E 'Width|Memory Device'
Memory Device
	Total Width: 72 bits
	Data Width: 64 bits
Memory Device
	Total Width: 72 bits
	Data Width: 64 bits
Memory Device
	Total Width: 72 bits
	Data Width: 64 bits
Memory Device
	Total Width: 72 bits
	Data Width: 64 bits

We see that the total width is 8 bits longer than the data width, which suggests there’s an extra hidden byte in the RAM for ECC.
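
As another sanity check, dmidecode’s Physical Memory Array record (type 16) has an “Error Correction Type” field. How faithfully the BIOS fills it in varies from board to board, so the output below is merely illustrative of what you’d hope to see:

root@tty1:~# dmidecode -t 16 | grep -i 'error correction'
	Error Correction Type: Multi-bit ECC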

Then, we check that the EDAC kernel module has detected the memory controller and some memory chips. This information is available in dmesg:

root@tty1:~# dmesg | grep -i edac
[    0.709482] EDAC MC: Ver: 3.0.0
[   19.833492] EDAC MC0: Giving out device to module amd64_edac controller F17h_M70h: DEV 0000:00:18.3 (INTERRUPT)
[   19.833491] EDAC amd64: F17h_M70h detected (node 0).
[   19.833499] EDAC MC: UMC0 chip selects:
[   19.833500] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[   19.833502] EDAC amd64: MC: 2: 8192MB 3: 8192MB
[   19.833501] EDAC MC: UMC1 chip selects:
[   19.833506] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[   19.833507] EDAC amd64: MC: 2: 8192MB 3: 8192MB

Okay, this definitely wasn’t there before with the old RAM, which suggests that the kernel is able to detect the ECC.
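
If those lines don’t appear at all, it’s worth checking whether the EDAC modules were even loaded before digging further; the exact module names vary a bit between kernel versions:

root@tty1:~# lsmod | grep -i edac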

There are also a bunch of new files under /sys that report the status of the ECC RAM:

root@tty1:~# cat /sys/devices/system/edac/mc/mc0/ce_count
0
root@tty1:~# cat /sys/devices/system/edac/mc/mc0/ue_count
0

As expected, there are 0 correctable and 0 uncorrectable errors. I’d be very concerned if these were greater than 0, given that I’d just installed the RAM. So ECC appears to be working.
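
For a one-shot look at all the counters the EDAC driver exposes, a simple glob over the same sysfs directory also works (the noinfo variants count errors the driver couldn’t attribute to a specific location); with everything healthy, every line should end in a zero:

root@tty1:~# grep -H . /sys/devices/system/edac/mc/mc0/[cu]e*count
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc0/ce_noinfo_count:0
/sys/devices/system/edac/mc/mc0/ue_count:0
/sys/devices/system/edac/mc/mc0/ue_noinfo_count:0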

Trying to induce errors

On the Internet, the commonly quoted way to make sure ECC works is to overclock the RAM, which should induce instability and cause some correctable errors. For example, this reviewer on Hardware Canucks was able to induce frequent correctable errors on some very similar RAM by tightening the timings—which will eventually cause the CPU to read from the memory bus before the memory output has fully stabilized—so I decided to try something similar.

For testing, I decided to watch dmesg while running stress, like the forum post did, and then also memtester for good measure. I used good old bash job control to run these concurrently:

root@tty1:~# dmesg -w
...
^Z
root@tty1:~# bg
root@tty1:~# stress --vm 250

I used stress --vm 250 since each VM worker uses 256 MiB by default, so this should in theory have consumed nearly the full 64 GiB on the system (250 × 256 MiB is 62.5 GiB). I did something similar with memtester, running memtester 60G since it got OOM-killed for anything bigger.
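
To avoid staring at dmesg the whole time, another option is to poll the EDAC counters from a second console while the stress test runs; a minimal sketch, assuming watch is available in the live environment:

root@tty1:~# watch -n 10 'grep -H . /sys/devices/system/edac/mc/mc0/[cu]e*count'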

Unfortunately, this proved to be way more tedious than I’d ever expected. I didn’t record everything I tried, but suffice it to say, when I messed with the timings or the RAM frequency, one of the following outcomes happened:

  1. The server booted as normal into Finnix, but then neither stress nor memtester triggered any errors (they might have eventually, but I didn’t have all day), so I dialled up the settings;
  2. The server failed to boot normally and “POSTed in safe mode”:
    The system has POSTed in safe mode.  
    This may be due to the previous POST attempt failing because of system instability,  
    or if the power button was held in to force the system off.  
    If the system failed to POST after you made changes to UEFI settings,  
    you may wish to revert to stable settings to prevent POST failure.  
    Press F1 to Run SETUP.
    

    At this point, I tried to dial down the settings;

  3. The server got stuck in some boot loop and I had to manually reset the CMOS, after which I dialled down the settings. This was super annoying, and I finally understood why people pay a premium for overclocking motherboards with clear-CMOS buttons on the back panel; or
  4. The server POSTed but crashed during boot, and then hung during reboot.

After hours, I was unable to observe a single error, either correctable or uncorrectable.

Now, at this point, I might have found some usable overclocks, so I could theoretically leave the server overclocked to see if errors crop up. Unfortunately, most overclocks that booted successfully also ended up in some cursed state where the system would not reboot cleanly, requiring a hard reset. Since this would then require physical access to fix, and Murphy’s law dictates that this kind of thing will eventually happen while I am away from home, I opted against risking the overclocks and just ran the ECC RAM at its regular rated speed and timings.

In the end, I simply had to trust other people’s findings that ECC truly works on Ryzen. I’ll report back if I actually observe an ECC error.

Monitoring ECC

Since I chose to believe ECC actually works, I decided to set up an ECC monitoring daemon, which should notify me of any ECC issues and also record them so that they persist through reboots. The currently recommended tool for this is rasdaemon, so I installed it:

$ sudo apt install rasdaemon/bookworm-backports
...
Selected version '0.8.1-3~bpo12+1' (Debian Backports:stable-backports [amd64]) for 'rasdaemon'
...
Setting up rasdaemon (0.8.1-3~bpo12+1) ...
...
$ sudo systemctl enable --now rasdaemon.service
Synchronizing state of rasdaemon.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable rasdaemon

This installs rasdaemon and ensures it’s running. The bookworm-backports version is required to get support for triggers, which are scripts that run when an error is detected.
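
Before going further, it’s worth confirming that rasdaemon’s tooling can actually see the EDAC drivers; if everything is in order, this should report that the drivers are loaded:

$ sudo ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.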

We can then check the RAM status by running:

$ sudo ras-mc-ctl --error-count
Label               	CE	UE
mc#0csrow#3channel#0	0	0
mc#0csrow#3channel#1	0	0
mc#0csrow#0channel#1	0	0
mc#0csrow#0channel#0	0	0
mc#0csrow#1channel#0	0	0
mc#0csrow#1channel#1	0	0
mc#0csrow#2channel#0	0	0
mc#0csrow#2channel#1	0	0

As expected, there are no errors, either correctable or uncorrectable. However, the RAM labels are confusing. It is possible to fix this if you know the mapping.

Fortunately for me, rasdaemon’s upstream version knows the memory mapping for my ASUS PRIME X570-P motherboard, even though the version in Debian doesn’t, so I can just install that file:

$ sudo wget -O /etc/ras/dimm_labels.d/asus https://raw.githubusercontent.com/mchehab/rasdaemon/refs/heads/master/labels/asus
...
$ sudo ras-mc-ctl --register-labels

Now the output is much more helpful should one of the sticks go bad and need replacing:

$ sudo ras-mc-ctl --error-count
Label  	CE	UE
DIMM_B2	0	0
DIMM_B2	0	0
DIMM_A1	0	0
DIMM_A1	0	0
DIMM_A2	0	0
DIMM_A2	0	0
DIMM_B1	0	0
DIMM_B1	0	0

Automatic notifications

Having rasdaemon isn’t really that helpful unless you can be notified about the ECC errors, so I decided to write a script to send them to my ntfy instance. Since rasdaemon passes the error information to triggers as environment variables, the script uses ntfy-run, a Rust program that I wrote a while ago for fun, to run env and send its output to ntfy.

The actual script /etc/ras/triggers/mc_event_trigger is:

#!/bin/sh
. /etc/ras/triggers/env
ntfy-run --success-title="RAS Detected RAM error on $(hostname)" --success-tags sob env
exit 0

/etc/ras/triggers/env contains NTFY_TOKEN and NTFY_URL for ntfy-run, which is kept secret for obvious reasons.
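
For reference, the env file is nothing more than those two variables, exported so that the ntfy-run child process inherits them; the values below are placeholders, not the real ones:

# /etc/ras/triggers/env -- placeholder values only
export NTFY_URL=https://ntfy.example.com/server-alerts
export NTFY_TOKEN=tk_XXXXXXXXXXXX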

Then, you’ll need to configure /etc/default/rasdaemon, changing the last few lines to enable the triggers:

# Event Trigger

# Event trigger will be executed when the specified event occurs.
#
# Execute triggers path
# For example: TRIGGER_DIR=/etc/ras/triggers
TRIGGER_DIR=/etc/ras/triggers

# Execute these triggers when the mc_event occurred, the triggers will not
# be executed if the trigger is not specified.
# For example:
#   MC_CE_TRIGGER=mc_event_trigger
#   MC_UE_TRIGGER=mc_event_trigger
MC_CE_TRIGGER=mc_event_trigger
MC_UE_TRIGGER=mc_event_trigger
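
After editing /etc/default/rasdaemon, restart the daemon so it picks up the trigger configuration:

$ sudo systemctl restart rasdaemon.service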

Since I’ve never seen the notification script in action, I am unable to confirm that it actually works. I’ll update this post if it ever comes to that.

In the meantime, I guess I’ll run sudo ras-mc-ctl --error-count every once in a while, perhaps monthly, to check that no memory errors have happened.
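
If remembering to do that manually sounds optimistic (it does to me), a throwaway cron job along these lines would surface the counts monthly; this is a hypothetical sketch, not something currently running on the server, and it assumes cron’s output gets mailed somewhere you’ll actually see it:

#!/bin/sh
# /etc/cron.monthly/ecc-report (hypothetical): cron mails any output to root
ras-mc-ctl --error-count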

Conclusion

In the end, the ECC RAM worked out of the box, though I was unable to induce any errors to verify it. Still, there is no reason to believe that ECC doesn’t work, and the server is currently running nicely. rasdaemon should in theory record any memory errors—even if the server is rebooted—and hopefully I’ll be notified via ntfy if any memory errors happen, although I also hope they never do.

Notes

  1. Note that I wrote “MT/s” instead of the more popular (and wrong) MHz. This is because DDR RAM is “double data rate” and transfers data at both the rising and falling edges of the clock. So technically, DDR4 2666 MT/s RAM runs at 1333 MHz, doing 2666 million transfers per second, or MT/s.