But this is going to be a bit of a journey and is about my personal infrastructure so this feels like a better home for it!
What is this all about?
For years now, I’ve been using dedicated servers from the likes of Hetzner or OVH to host my main online services, things ranging from DNS servers, to this blog, to websites for friends and family, to more critical things like the linuxcontainers.org website, forum and main image publishing logic.
All in all, that’s around 30 LXD instances with a mix of containers and virtual machines that need to run properly 24/7 and have good internet access.
I’m a sysadmin at heart, I know how to design and run complex infrastructures, how to automate things, monitor things and fix them when things go bad. But having everything rely on a single beefy machine rented month to month from an online provider definitely has its limitations and this series is all about fixing that!
The current state of things
As mentioned, I have about 30 LXD instances that need to be online 24/7.
This is currently done using a single server at OVH in Montreal with:
- CPU: Intel Xeon E3-1270v6 (4c/8t)
- RAM: 32GB
- HDD: 2x500GB NVMe + 2x2TB CMR HDD
- Network: 1Gb/s (with occasional limit down to 500Mb/s)
- OS: Ubuntu 18.04 LTS with HWE kernel and latest LXD
- Cost: 160.95 CAD/month (with taxes)
I consider this pretty good value for the cost. It comes with BMC access for remote maintenance, some amount of monitoring and on-site staff to deal with hardware failures.
But everything goes offline if:
- Any hardware fails (except for storage which is redundant)
- System needs rebooting
- LXD or kernel crashes
LXD has a very solid clustering feature now which requires a minimum of 3 servers and will provide a highly available database and API layer. This can be combined with distributed storage through Ceph and distributed networking through OVN.
But to benefit from this, you need 3 servers and you need fast networking between those 3 servers. Looking around for options in the sub-500CAD price range didn’t turn up anything particularly suitable so I started considering alternatives.
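A quick sketch of why three is the minimum, assuming the clustered database needs a strict majority of its voting members online to keep accepting writes (the usual rule for consensus-based databases). The function names here are made up for illustration and are not part of LXD:

```python
def quorum(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can go offline while a majority remains."""
    return voters - quorum(voters)


if __name__ == "__main__":
    # Assumption: every server in the cluster is a voting database member.
    for n in (1, 2, 3, 5):
        print(f"{n} server(s): quorum={quorum(n)}, "
              f"survives {tolerated_failures(n)} failure(s)")
```

Under that assumption, one or two servers can't survive any failure, while three servers can lose one and keep going, which is exactly what's needed to reboot or repair a machine without taking everything offline.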
Going with my own hardware
If you can’t rent anything at a reasonable price, the alternative is to own it.
Buying 3 brand new servers and associated storage and network equipment was out of the question. That would quickly balloon into the tens of thousands of dollars for the kind of hardware I’d want to buy new, and it just isn’t worth it given the amount of power I actually need out of this cluster.
So like many in that kind of situation, I went on eBay 🙂
My criteria list ended up being:
- SuperMicro servers
This is simply because I already own a few and I operate a rack full of them for NorthSec. I know their product lines, I know their firmware and BMCs and I know how to get replacement parts and fix them.
- Everything needs to be dual-PSU
- Servers must be on the SuperMicro X10 platform or more recent.
This effectively means I get Xeon E-series v3 or v4 and get basic Redfish support in the firmware for remote management.
- 64GB of RAM or more
- At least two 10Gb ports
- Cost 1000CAD or less a piece
In the end, I shopped directly with UnixSurplus through their eBay store. I’ve used them before and they pretty much beat everyone else on pricing for used SuperMicro kit.
What I settled on is three of the following:
- System: SuperMicro SYS-6018U-TR4T
- CPU: 2x Intel Xeon E5-2630v3
- RAM: 64GB DDR4
- Network: 4x10Gb (over Base-T copper sadly)
- Cost: 955CAD a piece with shipping
The motherboard supports Xeon E5v4 chips, so the CPUs can be swapped for more recent and much more powerful chips should the need arise and good candidates show up on eBay. Same story with the memory, this is just 4 sticks of 16GB leaving 20 free slots for expansion.
For each of them, I’ve then added some storage and networking:
- 1x 500GB Samsung 970 Pro NVMe (avoid the 980 Pro, they’re not as good)
- 1x 2TB Samsung 860 Pro SATA SSD
- 1x 4TB Seagate Ironwolf Pro CMR HDD
- 1x 6TB Seagate Ironwolf Pro CMR HDD
- 1x U.2 to PCIe adapter (no U.2 on this motherboard)
- 1x 2.5″ to 3.5″ adapter (so the SSD can fit in a tray)
- 1x Intel I350-T2
- Cost: 950CAD per server
For those, I went with new off Amazon/Newegg and picked what felt like the best deal at the time. I went with high quality consumer/NAS parts rather than DC grade but using parts I’ve been running 24/7 elsewhere before and that in my experience provide adequate performance.
For the network side of things, I wanted a 24-port gigabit switch with dual power supplies, hot-replaceable fans and support for 10Gbit uplinks. NorthSec has a whole bunch of C3750X which have worked well for us and are at the end of their supported life, making them very cheap on eBay, so I got a C3750X with a 10Gb module for around 450CAD.
Add everything up and the total hardware cost ends up at a bit over 6000CAD, make it 6500CAD with extra cables and random fees.
My goal is to keep that hardware running for around 5 years, which spreads the 6500CAD over 60 months for a monthly cost of just over 100CAD.
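For anyone who wants to check the math, here is a small sketch that adds up the figures quoted above. The 450CAD switch price is the "around 450CAD" from the post, and the 6500CAD budget is the rounded-up total with cables and fees. Co-location costs aren't included since they aren't listed here.

```python
# All amounts in CAD, taken from the figures quoted in this post.
SERVER_COUNT = 3
SERVER_PRICE = 955        # per server, with shipping
EXTRAS_PRICE = 950        # storage and networking added to each server
SWITCH_PRICE = 450        # approximate price of the C3750X with 10Gb module
BUDGET_WITH_FEES = 6500   # rounded-up total with cables and random fees
LIFETIME_MONTHS = 5 * 12  # target of around 5 years of service


def hardware_total() -> int:
    """Sum of the three servers, their extra parts and the switch."""
    return SERVER_COUNT * (SERVER_PRICE + EXTRAS_PRICE) + SWITCH_PRICE


def monthly_cost(total: float, months: int = LIFETIME_MONTHS) -> float:
    """Spread a one-time hardware cost evenly over its expected lifetime."""
    return total / months


if __name__ == "__main__":
    print(hardware_total())                           # 6165
    print(round(monthly_cost(BUDGET_WITH_FEES), 2))   # 108.33
```

This only covers the hardware itself, so the real monthly figure ends up higher once the datacenter contract is added on top.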
Where to put all this
Before I actually went ahead and ordered all that stuff though, I had to figure out a place for it and sort out a deal for power and internet.
Getting good co-location deals for less than 5U of space is pretty tricky. Most datacenters won’t even talk to you if you want less than a half rack or a rack.
Luckily for me, I found a Hive Datacenter that’s less than a 30min drive from here and which has nice public pricing on a per-U basis. After some chit chat, I got a contract for 4U of space with enough power and bandwidth for my needs. They also have a separate network for your OOB/IPMI/BMC equipment which you can connect to over VPN!
This sorts out the where to put it all, so I placed my eBay order and waited for the hardware to arrive!
So this post pretty much covers the needs and current situation, as well as the hardware and datacenter I’ll be using for the new setup.
What’s not described is how I actually intend to use all this hardware to get me the highly available setup that I’m looking for!
The next post goes over all the redundancy in this setup, looking at storage, network and control plane and how it can handle various failure cases.