Since the inception of solid-state drives (SSDs), there has been a choice to make: either use SSDs for vastly superior speeds, especially with non-sequential reads and writes (“random I/O”), or use legacy spinning-rust hard disk drives (HDDs) for cheaper storage that’s a bit slow for sequential I/O1 and painfully slow for random I/O.

The idea of caching frequently used data on SSDs and storing the rest on HDDs is nothing new—solid-state hybrid drives (SSHDs) embodied this idea in hardware form, while filesystems like ZFS support using SSDs as L2ARC. However, with the falling price of SSDs, this no longer makes sense outside of niche scenarios with very large amounts of storage. For example, I have not needed to use HDDs in my PC for many years at this point, since all my data easily fits on an SSD.

One of the scenarios where this still makes sense is the mirrors I host at home. Oftentimes, a project will require hundreds of gigabytes of data to be mirrored just in case anyone needs it, but only a few files are frequently accessed and could be cached on SSDs for fast access2. Similarly, I keep many LLMs locally with Ollama, but only use a few of them frequently. The frequently used ones can be cached while the rest are loaded slowly from the HDDs when needed.

While ZFS may seem like the obvious option here, due to Linux compatibility issues with ZFS mentioned previously, I decided to use Linux’s Logical Volume Manager (LVM) instead for this task to save myself some headache. To ensure reliable storage in the event of HDD failures, I am running the HDDs in RAID 1 with Linux’s mdadm software RAID.

This post documents how to build such a cached RAID array and explores some considerations when building reliable and fast storage.

Table of contents

  1. Why use LVM cache?
  2. A quick introduction to LVM
  3. The hardware
  4. Why use RAID 1 on HDDs?
  5. Setting up RAID 1 with mdadm
  6. Creating the SSD cache partition
  7. Creating a new volume group
  8. Creating the cached LV
  9. Creating a filesystem
  10. Mounting the new filesystem
  11. Monitoring
  12. Conclusion

Why use LVM cache?

There are several alternative block device caching solutions on Linux, such as:

  • bcache: a built-in Linux kernel module that provides caching similar to LVM’s. I don’t like its design of taking ownership of the entire block device and relying on non-persistent sysfs configuration (whereas LVM remembers all the configuration options), nor do I enjoy hearing about all the reports of bcache corrupting data; and
  • EnhanceIO: an old kernel module that does something similar to bcache and LVM cache, but hasn’t been maintained for over a decade.

Since I am very familiar with LVM and have already used it for other reasons, I opted to use LVM for this exercise as well.

A quick introduction to LVM

If you aren’t familiar with LVM, we’ll need to first introduce some concepts, or none of the LVM portions of this post will make any sense.

First, we’ll need to introduce block devices, which are just devices with a fixed number of blocks that can be read or written at any offset. HDDs and SSDs show up as block devices, such as /dev/sda. They can be partitioned into multiple pieces, which show up as smaller block devices such as /dev/sda1, the first partition on /dev/sda. Filesystems can be created directly on block devices, but block devices can also be used for more advanced things like RAID and LVM.

LVM is a volume manager that allows you to create logical volumes that can be expanded much more easily than regular partitions. In LVM, there are three major entity types:

  • Physical volumes (PVs): block devices that are used as the underlying storage for LVM;
  • Logical volumes (LVs): block devices that are presented by LVM, stored on one or more PVs; and
  • Volume groups (VGs): a group of PVs on which LVs can be created.

LVs can be used just like partitions to store files, with the flexibility of being able to expand them at will while they are actively being accessed, without having to be contiguous like real partitions.

There are more advanced LV types, such as thin pools, which don’t allocate space for their LVs until it is actually used to store data, and cached volumes, which are what this post is about.
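
As a rough sketch of how these pieces fit together (the device and volume names here are made up for illustration), creating a plain LV looks roughly like this:

$ sudo pvcreate /dev/sdX1                # turn a partition into a PV
$ sudo vgcreate myvg /dev/sdX1           # create a VG backed by that PV
$ sudo lvcreate -n mylv -L 500G myvg     # carve out a 500 GiB LV
$ sudo lvextend -L +100G myvg/mylv       # later, grow it by 100 GiB while in use

The resulting LV shows up as /dev/myvg/mylv (and /dev/mapper/myvg-mylv) and can be formatted and mounted like any other block device. (After lvextend, the filesystem on it still needs its own resize.)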

The hardware

For the purposes of this post, we will assume that there are two SATA HDDs (4 TB each in my case), available as block devices /dev/sda and /dev/sdb:

$ lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                  8:0    0   3.6T  0 disk
sdb                  8:16   0   3.6T  0 disk
...

Warning: before copying any commands, ensure that you are operating on the correct device. There is no undo button for most of the commands in this post, so be very careful lest you destroy your precious data! When in doubt, run lsblk to double check!

We’ll also assume that the SSD is /dev/nvme0n13 (2 TB in my case), and we will allocate 100 GiB of it as the cache by creating a partition.

Effectively, the setup looks like this:

Diagram of the LVM cache setup

Why use RAID 1 on HDDs?

Mechanical HDDs, like everything mechanical, fail. It’s an inevitable fact of life. There are two choices here:

  1. Treat your data as ephemeral and replace it when the drive fails, accepting the inevitable downtime this causes; or
  2. Store your data in a redundant fashion (i.e. with RAID), so that it continues to be available despite drive failures4.

If your data is really that unimportant, I suppose you could store it on a single drive, or even use RAID 0 to stripe it across multiple drives, which lets you pool all the drives together but loses everything if any one drive fails.

However, as I learned the hard way, even easily replaceable data still takes effort to replace. I once deployed this exact setup with RAID 0, and one of the constituent drives suffered a failure, causing a few files to become unreadable. While I could easily download them again, it caused a lot of downtime, since I had to destroy the entire array and start over after replacing the failed drive.

This may not matter for your use case, but I would rather that my mirror experience minimal downtime in the event of a drive failure. For this reason, I chose to run the drives together in RAID 1.

Setting up RAID 1 with mdadm

One thing worth noting before we start setting up RAID is that all block devices (either whole drives or partitions) in a RAID must be identical in size5. This presents some interesting challenges, since a 4 TB HDD isn’t always the same size. Normally, for a drive to be sold as “4 TB,” it has to have at least 4 000 000 000 000 bytes (that’s 4 trillion bytes), which works out to around 3.638 TiB in power-of-two IEC units. Typically, drives have slightly more, though how much varies by manufacturer or even model.

This poses a problem when using non-identical drive models, which you are encouraged to do: drives produced in the same batch and subjected to the same workload tend to fail at similar times, so mixing models is a good precaution against simultaneous failures. A similar problem occurs when replacing a drive after it fails, especially if you can’t source an identical model.

To avoid this problem, we will partition the drive and cut the data partition off exactly at the 4 TB mark. This will ensure that any “4 TB” HDD could be similarly partitioned and used as a replacement. Another reason to partition is to avoid the drive being treated as uninitialized on operating systems that don’t understand Linux’s mdadm RAID, such as Windows.

Partitioning the drives

We’ll need to do some math to figure out which 512-byte logical sector to end the partition on. For a 4 TB drive, we want to end it at the exact 4 TB mark:

>>> 4e12/512 - 1
7812499999.0

Since partition tools typically ask for the offset of the last sector to be included in the partition, we’ll need to subtract 1.
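
To double-check that a particular drive actually has enough sectors before partitioning, blockdev can report its size (shown here for /dev/sda; the number matches the gdisk output below):

$ sudo blockdev --getsz /dev/sda    # size in 512-byte logical sectors
7814037168

Any drive reporting at least 7 812 500 000 sectors can hold the partition we’re about to create.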

To partition the drives, we first need to wipe everything on them:

$ sudo wipefs -a /dev/sda
...
$ sudo wipefs -a /dev/sdb
...

(You can skip this if you are using a brand new drive.)

Then, create the partition with gdisk:

$ sudo gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.9

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries in memory.

Command (? for help): n
Partition number (1-128, default 1):
First sector (34-7814037134, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-7814037134, default = 7814035455) or {+-}size{KMGTP}: 7812499999
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): fd00
Changed type of partition to 'Linux RAID'

Command (? for help): c
Using 1
Enter name: cached_raid1_a

Command (? for help): p
Disk /dev/sda: 7814037168 sectors, 3.6 TiB
Model: ST4000VN008-2DR1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): [redacted]
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 1539149 sectors (751.5 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7812499999   3.6 TiB     FD00  cached_raid1_a

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/sda.
The operation has completed successfully.

Now repeat this for /dev/sdb. Note that you don’t have to name the partitions with the c command, but it makes it easier to identify which partition is which if you have a lot of drives.

The partitions /dev/sda1 and /dev/sdb1 should now be available. If not, run partprobe to reload the partition table.
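
If the partitions haven’t shown up yet, something like this should get the kernel to re-read the partition tables:

$ sudo partprobe /dev/sda /dev/sdb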

Creating the mdadm RAID array

Now we can create the array on /dev/md0 by running mdadm:

$ sudo mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 3906116864K
mdadm: automatically enabling write-intent bitmap on large array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

To avoid having to assemble this array on every boot, you should declare it in /etc/mdadm/mdadm.conf. To do this, first run a command to get the definition:

$ sudo mdadm --detail --scan
ARRAY /dev/md0 metadata=1.2 name=example:0 UUID=6d539f5d:5b37:4bf0:b2d9:2af5efc99e6a

Now, append the output to /etc/mdadm/mdadm.conf.
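
One convenient way to do that without opening an editor is to pipe the scan output through tee (note the -a flag, which appends rather than overwrites):

$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf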

Then, make sure that this configuration is updated in the initrd for all kernels:

$ sudo update-initramfs -u -k all
update-initramfs: Generating /boot/initrd.img-6.1.0-34-amd64
update-initramfs: Generating /boot/initrd.img-6.1.0-33-amd64
...

The RAID 1 array on /dev/md0 is now ready to be used as a PV containing the HDD storage.

Background operations

In the background, Linux’s MD RAID driver is working hard to synchronize the two drives so that they store identical data:

$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdb1[1] sda1[0]
      3906116864 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  9.7% (379125696/3906116864) finish=402.8min speed=145930K/sec
      bitmap: 29/30 pages [116KB], 65536KB chunk

unused devices: <none>

We can safely ignore this and continue. It will finish eventually.
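
If you want to keep an eye on the resync, or let it temporarily use more bandwidth, the relevant knobs are the md speed limits (values are in KiB/s; the exact numbers below are just an illustration):

$ watch -n 10 cat /proc/mdstat                     # watch the resync progress
$ sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
$ sudo sysctl -w dev.raid.speed_limit_max=500000   # temporarily raise the ceiling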

Creating the SSD cache partition

You’ll need a partition on an SSD to serve as cache. This needs to be a real partition, not an LVM LV, as that would mean nested LVM, which never works reliably in my experience; I’ve given up trying. Nesting is especially nasty because I also use LVM to hold virtual machine disks, and if I blanket-allowed nested LVM, the host machine could access all the LVM volumes inside all the VMs, which can cause data corruption.

If you don’t have unpartitioned space lying around, you’ll need to shrink a partition and reallocate its space as a separate partition.

Calculating the size

In my case, I had two partitions on my SSD, one EFI system partition (ESP) for the bootloader, and an LVM PV covering the rest of the disk. It looks something like this:

$ sudo gdisk -l /dev/nvme0n1
...
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          206847   100.0 MiB   EF00  EFI system partition
   2          206848      3907029134   1.8 TiB     8E00  main_lvm_pv

For a 100 GiB cache, we’ll need to shrink the LVM PV by 100 GiB and then edit the partition table. To avoid off-by-one errors, we’ll first shrink the PV by 200 GiB or so, fix up the partition table, and then expand the PV again to fit the shrunken partition.

Effectively, we want to end the LVM PV at sector 3697313934, which is exactly 100 GiB worth of 512-byte sectors before the current last sector:

>>> 3907029134 - 100*1024*1024*2
3697313934

Note that we multiply by 1024 once to convert from GiB to MiB, then a second time to convert from MiB to KiB, and there are two sectors per KiB.

Shrink existing partition data

First, shrinking the PV:

$ sudo pvresize --setphysicalvolumesize 1600G /dev/nvme0n1p2
/dev/nvme0n1p2: Requested size 1.56 TiB is less than real size <1.82 TiB. Proceed?  [y/n]: y
  WARNING: /dev/nvme0n1p2: Pretending size is 3355443200 not 3906822287 sectors.
  Physical volume "/dev/nvme0n1p2" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized

If you aren’t using LVM, but instead a regular ext4 filesystem, you can try using resize2fs, passing the size as the second positional argument. This would require you to unmount the partition first, since ext4 doesn’t have online shrinking, unlike LVM.
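
As a rough sketch of that route (the device name and size here are purely illustrative; e2fsck is mandatory before shrinking, and resize2fs will refuse to shrink below the space already in use):

$ sudo umount /dev/sdX2
$ sudo e2fsck -f /dev/sdX2
$ sudo resize2fs /dev/sdX2 400G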

Editing the partition table

Then, we edit the partition table to shrink the partition for the PV and create a new one in the freed space:

$ sudo gdisk /dev/nvme0n1
...
Command (? for help): d
Partition number (1-2): 2

Command (? for help): n
Partition number (2-128, default 2):
First sector (34-3907029134, default = 206848) or {+-}size{KMGTP}:
Last sector (206848-3907029134, default = 3907028991) or {+-}size{KMGTP}: 3697313934
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 8e00
Changed type of partition to 'Linux LVM'

Command (? for help): n
Partition number (3-128, default 3):
First sector (34-3907029134, default = 3697315840) or {+-}size{KMGTP}:
Last sector (3697315840-3907029134, default = 3907028991) or {+-}size{KMGTP}:
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 8e00
Changed type of partition to 'Linux LVM'

Command (? for help): c
Partition number (1-3): 3
Enter name: cached_cache_pv

Command (? for help): p
...

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          206847   100.0 MiB   EF00  EFI system partition
   2          206848      3697313934   1.7 TiB     8E00  main_lvm_pv
   3      3697315840      3907028991   100.0 GiB   8E00  cached_cache_pv

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/nvme0n1.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.

Note that with gdisk, changing the size of a partition requires deleting it and recreating it with the same partition number at the same starting offset. The data in the partition is unaffected.

Now, we need to notify the kernel that the partition has shrunk:

$ sudo partprobe /dev/nvme0n1

Expand shrunk partition to fit new space

Then, we can expand the PV to fit all the available space:

$ sudo pvresize /dev/nvme0n1p2
  Physical volume "/dev/nvme0n1p2" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized
$ sudo pvdisplay /dev/nvme0n1p2
  --- Physical volume ---
  PV Name               /dev/nvme0n1p2
  PV Size               1.72 TiB / not usable <3.07 MiB
...

As we can see, the PV size now exactly matches the reduced size of the partition. With that done, we can use /dev/nvme0n1p3 as a PV containing our SSD cache.

Creating a new volume group

Now that we have the partitions to serve as our PVs, we can create a volume group called cached:

$ sudo vgcreate cached /dev/md0 /dev/nvme0n1p3
  WARNING: Devices have inconsistent physical block sizes (4096 and 512).
  Physical volume "/dev/md0" successfully created.
  Physical volume "/dev/nvme0n1p3" successfully created.
  Volume group "cached" successfully created

Creating the cached LV

Creating a cached LV is, somewhat surprisingly, a multistep process that requires a bit of math.

Creating an LV on the HDD

First, you’ll need to create an LV containing the underlying data. Let’s put it on /dev/md0, using up all available space. You can obviously use less space if you want and expand it later. This is the command:

$ sudo lvcreate -n example -l 100%FREE cached /dev/md0
  Logical volume "example" created.

Creating the cache metadata LV

Next, we need a cache metadata volume on the SSD. 1 GiB should be plenty:

$ sudo lvcreate -n example_meta -L 1G cached /dev/nvme0n1p3
  Logical volume "example_meta" created.

Creating the cache LV

Now, we’ll need to use all remaining space on the /dev/nvme0n1p3 PV to serve as our cache. However, -l 100%FREE will not work, because creating a cache pool requires some free space for a spare pool metadata LV (used for repair operations) of exactly the same size as the metadata LV. Since our metadata LV is 256 extents long, we’ll need to find out how much space is available and reduce it by 256 (adjust if your metadata size is different):

$ sudo pvdisplay /dev/nvme0n1p3
  --- Physical volume ---
  PV Name               /dev/nvme0n1p3
  VG Name               cached
  PV Size               <100.00 GiB / not usable 3.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              25599
  Free PE               25343
  Allocated PE          256

As you can see, we have 25343 extents left. We’ll need to subtract 256:

>>> 25343-256
25087
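
If you’d rather not eyeball pvdisplay, the same numbers are exposed as pvs reporting fields, so the arithmetic can be scripted; this is just a convenience sketch using the values above:

$ sudo pvs --noheadings -o pv_pe_count,pv_pe_alloc_count /dev/nvme0n1p3
  25599     256
$ echo $((25599 - 256 - 256))    # total PEs, minus the metadata LV, minus the spare
25087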

We can now create the actual cache LV:

$ sudo lvcreate -n example_cache -l 25087 cached /dev/nvme0n1p3
  Logical volume "example_cache" created.

Creating a cache pool

We can now merge the cache metadata and actual cache LV into a cache pool LV:

$ sudo lvconvert --type cache-pool --poolmetadata cached/example_meta cached/example_cache
  Using 128.00 KiB chunk size instead of default 64.00 KiB, so cache pool has less than 1000000 chunks.
  WARNING: Converting cached/example_cache and cached/example_meta to cache pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Do you really want to convert cached/example_cache and cached/example_meta? [y/n]: y
  Converted cached/example_cache and cached/example_meta to cache pool.

Here, we used the default chunk size chosen by LVM, but depending on the size of your files, you might benefit from a different chunk size. The lvmcache(7) man page has this to say:

The value must be a multiple of 32 KiB between 32 KiB and 1 GiB. Cache chunks bigger than 512 KiB shall be only used when necessary.

Using a chunk size that is too large can result in wasteful use of the cache, in which small reads and writes cause large sections of an LV to be stored in the cache. It can also require increasing migration threshold which defaults to 2048 sectors (1 MiB). Lvm2 ensures migration threshold is at least 8 chunks in size. This may in some cases result in very high bandwidth load of transferring data between the cache LV and its cache origin LV. However, choosing a chunk size that is too small can result in more overhead trying to manage the numerous chunks that become mapped into the cache. Overhead can include both excessive CPU time searching for chunks, and excessive memory tracking chunks.
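
If you do decide to override the chunk size, it can be specified when building the cache pool; for example (256 KiB here is purely illustrative, not a recommendation):

$ sudo lvconvert --type cache-pool --chunksize 256k \
      --poolmetadata cached/example_meta cached/example_cache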

Attach the cache pool to the HDD LV

Once that’s done, we can now attach the cache pool to the underlying storage to create a cached LV:

$ sudo lvconvert --type cache --cachepool cached/example_cache cached/example
Do you want wipe existing metadata of cache pool cached/example_cache? [y/n]: y
  Logical volume cached/example is now cached.

We can now see this LV:

$ sudo lvs
  LV             VG             Attr       LSize   Pool                  Origin          Data%  Meta%  Move Log Cpy%Sync Convert
  example        cached         Cwi-a-C---  <3.64t [example_cache_cpool] [example_corig] 0.01   0.62            0.00
...            

Cache modes

Note that there are several cache modes in LVM:

  • writethrough: any data written to the cached LV is stored in both the cache and the underlying block device (the default). This means that if the SSD fails for some reason, you don’t lose your data, but it also means writes are slower; and
  • writeback: data is written to cache, and after some unspecified delay, is written to the underlying block device. This means that cache drive failure can result in data loss.

Basically, use writethrough if you want your data to survive an SSD failure, or writeback if you don’t care.

Since I am using RAID 1 for reliability, it’d be pretty silly to then use writeback and risk losing the data and creating an outage, so I kept the default of writethrough.

To use writeback, you can specify --cachemode writeback during the initial lvconvert, or use sudo lvchange --cachemode writeback cached/example afterwards.
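
Concretely, using the same LV names as above, the two approaches look roughly like this:

$ sudo lvconvert --type cache --cachemode writeback \
      --cachepool cached/example_cache cached/example   # choose the mode at attach time
$ sudo lvchange --cachemode writeback cached/example    # or switch an existing cached LV
$ sudo lvchange --cachemode writethrough cached/example # and back to the safe default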

Creating a filesystem

Now that the cached LV is created, we just have to create a filesystem on it and mount it. For this exercise, we’ll use ext4, since that’s the traditional Linux filesystem and the most well-supported. I wouldn’t recommend using something like btrfs or ZFS since they are designed to access raw drives.

Creating an ext4 partition is simple:

$ sudo mkfs.ext4 /dev/cached/example
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done                            
Creating filesystem with 976528384 4k blocks and 244137984 inodes
Filesystem UUID: bb93c359-1915-4f09-b23f-2f3a5e8b8663
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

Mounting the new filesystem

Now, we need to mount it. We could just run mount, but it makes more sense to define a permanent place for it in /etc/fstab. For this exercise, let’s mount it at /example.

First, we create /example:

$ sudo mkdir /example

Then, we add the following line to /etc/fstab:

/dev/cached/example /example ext4 rw,noatime 0 2

Now, let’s mount it:

$ sudo systemctl daemon-reload
$ sudo mount /example
$ ls /example
lost+found

And there we have it. Our new cached LV is mounted on /example, and the default ext4 lost+found directory is visible. Now you can store anything you want in /example.

Monitoring

You can find most cache metrics by running lvdisplay on the cached LV:

$ sudo lvdisplay /dev/cached/example
  --- Logical volume ---
  LV Path                /dev/cached/example
  LV Name                example
  VG Name                cached
...
  LV Size                <3.64 TiB
  Cache used blocks      8.40%
  Cache metadata blocks  0.62%
  Cache dirty blocks     0.00%
  Cache read hits/misses 84786 / 40435
  Cache wrt hits/misses  222496 / 1883192
  Cache demotions        0
  Cache promotions       67420
  Current LE             953641
...
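
The same counters are also available as lvs reporting fields, which is handier for scripts or monitoring systems; the exact field names can be listed with lvs -o help, but roughly:

$ sudo lvs -o lv_name,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses cached/example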

Conclusion

In the previous iteration of this setup, before the drive failure, I was able to achieve over 95% cache hit rates on reads while storing a mix of mirrors and LLMs, with most of the files read very infrequently. If you have a similar workload, LVM caching is probably highly beneficial.

Note that this technique doesn’t have to be used to cache HDDs. Another possible application is in the cloud, where you frequently have access to very large but slow network-attached block storage and fast but small local storage. LVM cache works just as well there for caching the slow networked block device on the fast local storage.

I hope this was helpful and you learned something about LVM. See you next time!

Notes

  1. After getting spoiled by 3+ GB/s NVMe SSDs, the paltry 200 MB/s you can get on HDDs feels slow, but it’s probably fine in most situations. 

  2. Some smaller, higher traffic projects can easily justify being hosted completely on SSDs, which is what I do. The rest are hosted on the cached HDD array. 

  3. NVMe devices are a bit confusing due to NVMe namespaces. The drive is /dev/nvme0 while the first namespace is /dev/nvme0n1. On most consumer drives, only a single NVMe namespace is supported, but enterprise drives may support dividing into multiple namespaces, like partitioning except at a lower level. Namespace-level features may include encryption and write protection. 

  4. RAID is not a backup! RAID ensures data availability in the event of drive failures, but it doesn’t protect you from accidental deletion, ransomware, or corruption. You should always back up important data! 

  5. Those of you who have read the previous post on btrfs know that this isn’t the case when using btrfs’s raid1 profile. However, since btrfs doesn’t support SSD caching, I am forced to run ext4 on cached LVM instead.