Introducing my own mirroring service: mirror.quantum5.ca
In January, I upgraded my home Internet connection to 3 Gbps symmetric, because, strangely enough, it was cheaper than the package I already had at the time (1500 Mbps down, 940 Mbps up). This was connected to the second port on my ConnectX-3, allowing my home server to achieve the full speed where 2.5 Gbps Ethernet would have failed. Unfortunately, nothing I was doing could have harnessed the full speed of this Internet connection, or anywhere near it, so I started thinking…
In February, I realized that I could run a mirroring service for open-source software to serve the community at basically no additional cost—I am already paying for this 3 Gbps Internet connection and I have some spare disk space on my SSD. So I decided to do exactly that.
Today, I am happy to announce that this mirror, mirror.quantum5.ca, has been tested for a few months and is fully ready for production. If you find the service helpful, please feel free to support me via GitHub Sponsors, Ko-fi, Liberapay, or directly with credit card or bank through Stripe (CAD), though this is of course strictly optional.
If you are interested in how it’s all set up, please read on:
Beginnings
I started by mirroring Arch Linux, because it was decently popular, didn’t require much disk space, and was relatively welcoming to new mirrors. The process was fairly easy: I basically created a virtual host for mirror.quantum5.ca in nginx, set it to serve files from a directory, and created a cron job to periodically run rsync to update the Arch Linux files from another mirror. Then I simply filed a ticket on Arch Linux’s bug tracker, and in a few days, I became an official tier 2 mirror.
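As a rough illustration of the periodic sync step, here is a minimal Python sketch that assembles and runs an rsync command. The upstream URL, the target directory, and the exact choice of flags are assumptions made for the example; the post does not say which options were actually used.

```python
import subprocess

# Hypothetical values: neither the upstream URL nor the local path comes from the post.
UPSTREAM = "rsync://upstream.example.org/archlinux/"
TARGET = "/srv/mirror/archlinux/"


def build_rsync_command(upstream: str, target: str) -> list[str]:
    """Return an rsync invocation that makes `target` match `upstream`."""
    return [
        "rsync",
        "-rlptH",           # recurse; keep symlinks, permissions, times, hard links
        "--safe-links",     # ignore symlinks that point outside the tree
        "--delete-delay",   # remove files that vanished upstream, after the transfer
        "--delay-updates",  # move updated files into place together at the end
        upstream,
        target,
    ]


def sync(upstream: str = UPSTREAM, target: str = TARGET) -> int:
    """Run one synchronization pass and return rsync's exit status."""
    return subprocess.run(build_rsync_command(upstream, target), check=False).returncode
```

A cron job would then call `sync()` on whatever schedule the upstream mirror permits.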
To keep the mirror from accidentally saturating my Internet connection and affecting everything else, I’ve configured it so that it can never use more than 2 Gbps of upload bandwidth, leaving plenty of room for other things.
Due to the way Arch Linux spreads the load between mirrors, almost immediately I started pushing over 100 GB a day. Surprisingly, people at work started noticing the mirror and thanking me for running it. Clearly, people care about this stuff.
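To put the two numbers side by side, this short calculation converts 100 GB a day into an average rate and compares it with the 2 Gbps cap. It assumes decimal gigabytes and an even spread over the day; real traffic is bursty, so peaks will be far higher than the average.

```python
def average_mbps(bytes_per_day: float) -> float:
    """Convert a daily transfer volume in bytes to an average rate in Mbit/s."""
    seconds_per_day = 24 * 60 * 60
    return bytes_per_day * 8 / seconds_per_day / 1e6


daily_bytes = 100e9  # 100 GB a day, using decimal gigabytes (an assumption)
cap_mbps = 2000      # the 2 Gbps upload cap

avg = average_mbps(daily_bytes)
print(f"average: {avg:.2f} Mbit/s, {avg / cap_mbps:.2%} of the cap")
# prints "average: 9.26 Mbit/s, 0.46% of the cap"
```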
Making it look pretty
Encouraged by people caring, I decided to build a nice-looking page at the root of the mirror domain. I started by doing the minimal amount of work necessary, pulling out Bootstrap to make a simple page that looks decent. I also wrote a Python script to render it as a Jinja2 template so I could have the last synchronization time and the size dynamically updated. The script is run every time rsync is run.
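A sketch of what such a script might compute is shown here. To stay within the standard library it swaps Jinja2 for `string.Template`, so it is not the actual script; the page fragment and the function names are hypothetical.

```python
import os
from datetime import datetime, timezone
from string import Template
from typing import Optional

# Hypothetical page fragment; the real page is a Bootstrap page rendered with Jinja2.
PAGE = Template("<p>Last sync: $synced UTC. Total size: $size.</p>")


def tree_size(root: str) -> int:
    """Sum the sizes of all regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def human_size(n: int) -> str:
    """Format a byte count using binary units."""
    value = float(n)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if value < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} TiB"


def render(root: str, now: Optional[datetime] = None) -> str:
    """Fill the page fragment with the sync time and the tree size."""
    now = now or datetime.now(timezone.utc)
    return PAGE.substitute(
        synced=now.strftime("%Y-%m-%d %H:%M:%S"),
        size=human_size(tree_size(root)),
    )
```

Running `render()` right after each rsync pass keeps both values current without any server-side application.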
However, nginx’s autogenerated index pages simply looked out of place, and I wanted them to be pretty. It wouldn’t be hard to write some web app that rendered the directory index pages, but I didn’t really want to maintain a separate application. Instead, I wanted something that just runs inside nginx.
As it turns out, this was possible, and thus I became thoroughly
nerd-sniped.
You see, nginx could output the autogenerated index in multiple formats: the default HTML, XML, JSON, and the ancient JSONP. nginx also has an XSLT processing module that allows you to transform any XML server-side. Naturally, the idea was to write an XSLT stylesheet that transformed the autogenerated XML index into HTML.
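For a sense of what the XML index contains, this sketch parses a small sample with Python's standard library. The sample is hand-written to match the shape the stylesheet expects (a `list` root whose children carry `mtime` and, for files, `size` attributes); the entry names and values are made up.

```python
import xml.etree.ElementTree as ET

# Sample shaped like an XML directory index; the entries are made up.
SAMPLE = """<?xml version="1.0"?>
<list>
<directory mtime="2023-05-01T12:00:00Z">core</directory>
<file mtime="2023-05-01T12:34:56Z" size="1536">lastsync</file>
</list>
"""


def parse_listing(xml_text: str) -> list[dict]:
    """Turn an XML directory listing into a list of plain dictionaries."""
    entries = []
    for node in ET.fromstring(xml_text):
        size = node.get("size")  # directories carry no size attribute
        entries.append({
            "name": node.text,
            "is_dir": node.tag == "directory",
            "size": int(size) if size is not None else None,
            "mtime": node.get("mtime"),
        })
    return entries


for entry in parse_listing(SAMPLE):
    print(entry["name"], entry["size"], entry["mtime"])
```

Everything the HTML page needs (name, size, modification time) is already present in this structure, which is why a pure transformation is enough.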
The result was an XSLT stylesheet that looked something like this (example output):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:str="http://exslt.org/strings" exclude-result-prefixes="str">
  <xsl:output method="html" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
  <xsl:param name="uri"/>
  <xsl:template match="/">
    <xsl:text disable-output-escaping='yes'>&lt;!DOCTYPE html&gt;</xsl:text>
    ...
    <h2 id="path">
      <xsl:text>Index of /</xsl:text>
      <!-- Split the path into segments and link each one to its own directory -->
      <xsl:variable name="path" select="str:tokenize($uri, '/')" />
      <xsl:variable name="levels" select="count($path)"/>
      <xsl:for-each select="$path">
        <xsl:variable name="pos" select="position()"/>
        <xsl:variable name="parent">
          <xsl:for-each select="$path[position() &lt;= $pos]">
            <xsl:value-of select="."/><xsl:text>/</xsl:text>
          </xsl:for-each>
        </xsl:variable>
        <a href="/{$parent}"><xsl:value-of select="."/></a>
        <xsl:text>/</xsl:text>
      </xsl:for-each>
      <xsl:text> </xsl:text>
      <a class="button" href="..">⬆️</a>
    </h2>
    <table class="table table-sm table-hover sortable">
      <thead class="thead-dark">
        <tr>
          <th scope="col">Name</th>
          <th scope="col" class="col-1 text-right">Size</th>
          <th scope="col" class="col-6 col-sm-5 col-md-4 col-xl-3 text-right">Updated (UTC)</th>
        </tr>
      </thead>
      <tbody>
        <xsl:for-each select="list/*">
          <xsl:variable name="name">
            <xsl:value-of select="."/>
          </xsl:variable>
          <!-- Human-readable size: plain bytes, then k, M and G -->
          <xsl:variable name="size">
            <xsl:if test="string-length(@size) > 0">
              <xsl:if test="number(@size) > 0">
                <xsl:choose>
                  <xsl:when test="(@size div 1024) &lt; 0.9"><xsl:value-of select="@size" /></xsl:when>
                  <xsl:when test="(@size div 1048576) &lt; 0.9"><xsl:value-of select="format-number((@size div 1024), '0.0')" />k</xsl:when>
                  <xsl:when test="(@size div 1073741824) &lt; 0.9"><xsl:value-of select="format-number((@size div 1048576), '0.00')" />M</xsl:when>
                  <xsl:otherwise><xsl:value-of select="format-number((@size div 1073741824), '0.00')" />G</xsl:otherwise>
                </xsl:choose>
              </xsl:if>
            </xsl:if>
          </xsl:variable>
          <!-- Reassemble the mtime attribute as "YYYY-MM-DD HH:MM:SS" -->
          <xsl:variable name="date">
            <xsl:value-of select="substring(@mtime,1,4)"/>-<xsl:value-of select="substring(@mtime,6,2)"/>-<xsl:value-of select="substring(@mtime,9,2)"/><xsl:text> </xsl:text>
            <xsl:value-of select="substring(@mtime,12,2)"/>:<xsl:value-of select="substring(@mtime,15,2)"/>:<xsl:value-of select="substring(@mtime,18,2)"/>
          </xsl:variable>
          <tr>
            <td><a href="{$name}"><xsl:value-of select="."/></a></td>
            <td class="text-right" data-sort="{@size}"><xsl:value-of select="$size"/></td>
            <td class="col-6 col-sm-5 col-md-4 col-xl-3 text-right"><xsl:value-of select="$date"/></td>
          </tr>
        </xsl:for-each>
      </tbody>
    </table>
    ...
  </xsl:template>
</xsl:stylesheet>
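The size column follows a simple rule: switch to the next unit once the value reaches 0.9 of that unit. The same thresholds can be written in Python as a sketch for reading the stylesheet's `xsl:choose` block more easily. The function name is hypothetical, and rounding may differ slightly from `format-number` in edge cases.

```python
def format_size(size: str) -> str:
    """Mirror the stylesheet's size rules; `size` is the raw size attribute."""
    if not size or float(size) <= 0:
        return ""  # no size attribute (directories) or an empty file: show nothing
    n = float(size)
    if n / 1024 < 0.9:
        return size  # small files are shown in plain bytes
    if n / 1024**2 < 0.9:
        return f"{n / 1024:.1f}k"
    if n / 1024**3 < 0.9:
        return f"{n / 1024**2:.2f}M"
    return f"{n / 1024**3:.2f}G"


for raw in ("", "900", "1536", "1048576", "2147483648"):
    print(repr(raw), "->", repr(format_size(raw)))
```

Using 0.9 as the cut-off means a 950-byte file is shown as `0.9k` instead of `950`, which keeps the column narrow.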
This involved some major struggle with XSLT, since nginx (or really, libxslt) only supported XSLT 1.0, which is rather underpowered compared to newer versions. This made things like linking to all the parent directories difficult. Nevertheless, with some crazy EXSLT extensions (that libxslt does support), I did work out a solution, but it probably wasn’t worth the effort in retrospect. Still, I was impressed by the power of XSLT and would recommend using it for times when running an application server is overkill.
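The parent-directory links are the part that is awkward in XSLT 1.0 and trivial elsewhere. For comparison, the stylesheet's nested `for-each` over the tokenized path corresponds roughly to this Python sketch; the function name and the sample path are hypothetical, and it assumes empty path segments are dropped, as the stylesheet's output suggests.

```python
def breadcrumbs(uri: str) -> list[tuple[str, str]]:
    """Return (label, href) pairs, one per path segment of `uri`."""
    segments = [s for s in uri.split("/") if s]  # drop empty segments
    crumbs = []
    for i, label in enumerate(segments, start=1):
        # Each link points at the directory made of all segments up to this one.
        href = "/" + "".join(s + "/" for s in segments[:i])
        crumbs.append((label, href))
    return crumbs


for label, href in breadcrumbs("/archlinux/core/os/"):
    print(label, "->", href)
```

Each segment needs the concatenation of everything before it, which is why the stylesheet has to loop over the path a second time inside the first loop.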
Of course, since I wanted to keep the same header and footer as the home page, I converted the Jinja2 template to use template inheritance instead and made a template to render the XSLT. This had several issues, most notable of which was the different handling of self-closing tags between HTML and XML, but it was nothing that some variables couldn’t fix.
With this, the mirror finally looked good enough for my taste.
Drive failure
I also suffered a drive failure during the testing period, which was completely unexpected. This all stemmed from buying a cheap M.2 SSD, the 2 TB ADATA XPG GAMMIX S50 Lite. First of all, this SSD was constantly overheating under normal loads, causing it to thermal throttle, which then caused some I/O operations to time out. This forced me to install a fan to blow on it to keep the temperatures under control[1]. This didn’t bode well, but I did nothing about it, which was a mistake.
Then finally, in April, the SSD gave up the ghost, suddenly failing to respond to any I/O, causing every operation to time out. Since it contained the rootfs, the entire OS locked up, and the mirror (plus everything else) was dead. After a hard power cycle (turning off the PSU), it was working again… for 10 minutes before the same thing happened. Of course, this also happened right before I was going to sleep, so I had to deal with it while half-asleep.
Having no choice, I had to copy everything off the drive as fast as possible. In the end, I pulled out another SSD that I was using as a big flash drive and ended up using dd with a progress bar, restarting with seek and skip after the drive locked up[2]. This was highly unpleasant, but fortunately, there was no data loss. After this, I bought a 2 TB Samsung 970 Evo Plus to relieve the other cheap SSD.
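The restart trick works because dd can skip blocks on the input and seek past blocks on the output, so a copy can pick up where the previous attempt stopped. The idea looks roughly like this in Python. This is an analogue for illustration, not the command that was run; it assumes the destination is a regular image file (not a block device), and the function name is hypothetical.

```python
import os

BLOCK = 1024 * 1024  # copy in 1 MiB blocks


def resume_copy(src: str, dst: str, block: int = BLOCK) -> int:
    """Copy `src` to `dst`, restarting at the last whole block already in `dst`.

    Returns the number of bytes written during this call.
    """
    exists = os.path.exists(dst)
    done = os.path.getsize(dst) if exists else 0
    offset = (done // block) * block  # restart on a block boundary
    written = 0
    with open(src, "rb") as fin, open(dst, "r+b" if exists else "wb") as fout:
        fin.seek(offset)   # plays the role of dd's skip=
        fout.seek(offset)  # plays the role of dd's seek=
        while chunk := fin.read(block):
            fout.write(chunk)
            written += len(chunk)
    return written
```

Rounding the offset down to a block boundary rewrites at most one block, which guards against a partially written block from the interrupted run.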
Lessons learned:
- SSDs are said to be way more reliable than HDDs, but they still fail;
- SSDs are commonly said to fail because the NAND flash wears out, which forces the drive into a permanent read-only mode that requires replacement but causes no data loss. However, the controller itself can fail too, and that could easily be catastrophic; and
- Don’t cheap out on storage or buy from non-reputable brands, at least for things you care about. There’s no reason to buy the latest generation and pay the early-adopter tax though, since I didn’t need a PCIe Gen 4 SSD—Gen 3 is more than enough to saturate my networking.
More mirroring
I just left the mirror sitting there for a while. In May, I saw the post by Kenneth Finnegan about what he calls “Micro Mirrors”. Clearly, there’s more demand for these things than I thought, so I decided to mirror a few more things.
Not sure what the best things to mirror were, I wrote to Kenneth, and he kindly responded with some stats from mirror.fcix.net, the (macro) mirror he also runs. Dividing the daily traffic by the size of the directory allowed him to compute a CDN efficiency coefficient. The higher the number, the more impact mirroring that directory has for the amount of disk space used. From this data, I selected ubuntu-releases (ISOs) and LibreOffice to mirror, since these are reasonably sized and easy to mirror.
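The coefficient is simply daily traffic divided by directory size. Here is a small sketch of ranking candidates by it. The project names and figures are invented for illustration and are not the statistics from mirror.fcix.net.

```python
def cdn_efficiency(daily_traffic_bytes: float, size_bytes: float) -> float:
    """Bytes served per day for every byte of disk the project occupies."""
    return daily_traffic_bytes / size_bytes


# Made-up numbers for illustration only; these are not Kenneth's statistics.
projects = {
    "project-a": {"daily_traffic": 500e9, "size": 50e9},
    "project-b": {"daily_traffic": 900e9, "size": 1500e9},
    "project-c": {"daily_traffic": 40e9, "size": 20e9},
}

ranked = sorted(
    projects,
    key=lambda name: cdn_efficiency(projects[name]["daily_traffic"], projects[name]["size"]),
    reverse=True,
)
print(ranked)  # prints ['project-a', 'project-c', 'project-b']
```

In this example, project-b serves the most traffic overall but ranks last, because it needs far more disk per byte served.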
I also mirrored some software that I regularly use, such as TeX (really, the Comprehensive TeX Archive Network or CTAN) and Termux.
There are some projects that I use but decided not to mirror, the most notable of which is Debian, which has not been processing new mirror tickets for almost a year at the time of writing. What was the point of blowing at least a terabyte on the Debian archive when no one will even hear about the mirror? This was rather disheartening, to be honest.
If you would like me to mirror more stuff, feel free to put your suggestions in the comments and donate some money to support this endeavour. The links are in the sidebar.
Notes
1. A large passive heatsink might be enough, but my motherboard didn’t come with one, and the one the SSD came with was clearly insufficient.
2. I probably should have used ddrescue, but since this was also my router, my home Internet died as well and I didn’t want to figure out how to use ddrescue at some ungodly hour on mobile data when I had a working solution.