Inexpensive highly available LXD cluster: Introduction

It’s been a couple of years since I last posted here, instead I’ve mostly been focusing on LXD specific content which I’ve been publishing on our discussion forum rather than on my personal blog.

But this is going to be a bit of a journey and is about my personal infrastructure so this feels like a better home for it!

What is this all about?

For years now, I’ve been using dedicated servers from the likes of Hetzner or OVH to host my main online services, things ranging from DNS servers, to this blog, to websites for friends and family, to more critical things like the linuxcontainers.org website, forum and main image publishing logic.

All in all, that’s around 30 LXD instances with a mix of containers and virtual machines that need to run properly 24/7 and have good internet access.

I’m a sysadmin at heart, I know how to design and run complex infrastructures, how to automate things, monitor things and fix them when things go bad. But having everything rely on a single beefy machine rented month to month from an online provider definitely has its limitations and this series is all about fixing that!

The current state of things

As mentioned, I have about 30 LXD instances that need to be online 24/7.
This is currently done using a single server at OVH in Montreal with:

  • CPU: Intel Xeon E3-1270v6 (4c/8t)
  • RAM: 32GB
  • HDD: 2x500GB NVME + 2x2TB CMR HDD
  • Network: 1Gb/s (with occasional limit down to 500Mb/s)
  • OS: Ubuntu 18.04 LTS with HWE kernel and latest LXD
  • Cost: 160.95 CAD/month (with taxes)

I consider this a pretty good value for the cost, it comes with BMC access for remote maintenance, some amount of monitoring and on-site staff to deal with hardware failures.

But everything goes offline if:

  • Any hardware fail (except for storage which is redundant)
  • System needs rebooting
  • LXD or kernel crashes

LXD has a very solid clustering feature now which requires a minimum of 3 servers and will provide a highly available database and API layer. This can be combined with distributed storage through Ceph and distributed networking through OVN.

But to benefit from this, you need 3 servers and you need fast networking between those 3 servers. Looking around for options in the sub-500CAD price range didn’t turn up anything particularly suitable so I started considering alternatives.

Going with my own hardware

If you can’t rent anything at a reasonable price, the alternative is to own it.
Buying 3 brand new servers and associated storage and network equipment was out of the question. This would quickly balloon into the tens of thousands of dollars for something I’d like to buy new and just isn’t worth it given the amount of power I actually need out of this cluster.

So as many in that kind of situation, I went on eBay 🙂
My criteria list ended up being:

  • SuperMicro servers
    This is simply because I already own a few and I operate a rack full of them for NorthSec. I know their product lines, I know their firmware and BMCs and I know how to get replacement parts and fix them.
  • Everything needs to be dual-PSU
  • Servers must be on the SuperMicro X10 platform or more recent.
    This effectively means I get Xeon E-series v3 or v4 and get basic RedFish support in the firmware for remote management.
  • 64GB of RAM or more
  • At least two 10Gb ports
  • Cost 1000CAD or less a piece

In the end, I ended up shopping directly with UnixSurplus through their eBay store. I’ve used them before and they pretty much beat everyone else on pricing for used SuperMicro kit.

What I settled on is three:

The motherboard supports Xeon E5v4 chips, so the CPUs can be swapped for more recent and much more powerful chips should the need arise and good candidates show up on eBay. Same story with the memory, this is just 4 sticks of 16GB leaving 20 free slots for expansion.

For each of them, I’ve then added some storage and networking:

  • 1x 500GB Samsung 970 Pro NVME (avoid the 980 Pro, they’re not as good)
  • 1x 2TB Samsung 860 Pro SATA SSD
  • 1x 4TB Seagate Ironwolf Pro CMR HDD
  • 1x 6TB Seagate Ironwolf Pro CMR HDD
  • 1x U.2 to PCIe adapter (no U.2 on this motherboard)
  • 1x 2.5″ to 3.5″ adapter (so the SSD can fit in a tray)
  • 1x Intel I350-T2
  • Cost: 950CAD per server

For those, I went with new off Amazon/Newegg and picked what felt like the best deal at the time. I went with high quality consumer/NAS parts rather than DC grade but using parts I’ve been running 24/7 elsewhere before and that in my experience provide adequate performance.

For the network side of things, I wanted a 24 ports gigabit switch with dual power supply, hot replaceable fans and support for 10Gbit uplink. NorthSec has a whole bunch of C3750X which have worked well for us and are at the end of their supported life making them very cheap on eBay, so I got a C3750X with a 10Gb module for around 450CAD.

Add everything up and the total hardware cost ends up at a bit over 6000CAD, make it 6500CAD with extra cables and random fees.
My goal is to keep that hardware running for around 5 years so a monthly cost of just over 100CAD.

Where to put all this

Before I actually went ahead and ordered all that stuff though, I had to figure out a place for it and sort out a deal for power and internet.

Getting good co-location deals for less than 5U of space is pretty tricky. Most datacenter won’t even talk to you if you want less than a half rack or a rack.

Luckily for me, I found a Hive Datacenter that’s less than a 30min drive from here and which has nice public pricing on a per-U basis. After some chit chat, I got a contract for 4U of space with enough power and bandwidth for my needs. They also have a separate network for your OOB/IPMI/BMC equipment which you can connect to over VPN!

This sorts out the where to put it all, so I placed my eBay order and waited for the hardware to arrive!

Up next

So that post pretty much cover the needs and current situation and the hardware and datacenter I’ll be using for the new setup.

What’s not described is how I actually intend to use all this hardware to get me the highly available setup that I’m looking for!

The next post goes over all the redundancy in this setup, looking at storage, network and control plane and how it can handle various failure cases.

Posted in Canonical voices, LXD, Planet Ubuntu | Tagged | Leave a comment

Custom user mappings in LXD containers

LXD logo

Introduction

As you may know, LXD uses unprivileged containers by default.
The difference between an unprivileged container and a privileged one is whether the root user in the container is the “real” root user (uid 0 at the kernel level).

The way unprivileged containers are created is by taking a set of normal UIDs and GIDs from the host, usually at least 65536 of each (to be POSIX compliant) and mapping those into the container.

The most common example and what most LXD users will end up with by default is a map of 65536 UIDs and GIDs, with a host base id of 100000. This means that root in the container (uid 0) will be mapped to the host uid 100000 and uid 65535 in the container will be mapped to uid 165535 on the host. UID/GID 65536 and higher in the container aren’t mapped and will return an error if you attempt to use them.

From a security point of view, that means that anything which is not owned by the users and groups mapped into the container will be inaccessible. Any such resource will show up as being owned by uid/gid “-1” (rendered as 65534 or nobody/nogroup in userspace). It also means that should there be a way to escape the container, even root in the container would find itself with just as much privileges on the host as a nobody user.

LXD does offer a number of options related to unprivileged configuration:

  • Increasing the size of the default uid/gid map
  • Setting up per-container maps
  • Punching holes into the map to expose host users and groups

Increasing the size of the default map

As mentioned above, in most cases, LXD will have a default map that’s made of 65536 uids/gids.

In most cases you won’t have to change that. There are however a few cases where you may have to:

  • You need access to uid/gid higher than 65535.
    This is most common when using network authentication inside of your containers.
  • You want to use per-container maps.
    In which case you’ll need 65536 available uid/gid per container.
  • You want to punch some holes in your container’s map and need access to host uids/gids.

The default map is usually controlled by the “shadow” set of utilities and files. On systems where that’s the case, the “/etc/subuid” and “/etc/subgid” files are used to configure those maps.

On systems that do not have a recent enough version of the “shadow” package. LXD will assume that it doesn’t have to share uid/gid ranges with anything else and will therefore assume control of a billion uids and gids, starting at the host uid/gid 100000.

But the common case, is a system with a recent version of shadow.
An example of what the configuration may look like is:

stgraber@castiana:~$ cat /etc/subuid
lxd:100000:65536
root:100000:65536

stgraber@castiana:~$ cat /etc/subgid
lxd:100000:65536
root:100000:65536

The maps for “lxd” and “root” should always be kept in sync. LXD itself is restricted by the “root” allocation. The “lxd” entry is used to track what needs to be removed if LXD is uninstalled.

Now if you want to increase the size of the map available to LXD. Simply edit both of the files and bump the last value from 65536 to whatever size you need. I tend to bump it to a billion just so I don’t ever have to think about it again:

stgraber@castiana:~$ cat /etc/subuid
lxd:100000:1000000000
root:100000:1000000000

stgraber@castiana:~$ cat /etc/subgid
lxd:100000:1000000000
root:100000:100000000

After altering those files, you need to restart LXD to have it detect the new map:

root@vorash:~# systemctl restart lxd
root@vorash:~# cat /var/log/lxd/lxd.log
lvl=info msg="LXD 2.14 is starting in normal mode" path=/var/lib/lxd t=2017-06-14T21:21:13+0000
lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2017-06-14T21:21:13+0000
lvl=info msg="Kernel uid/gid map:" t=2017-06-14T21:21:13+0000
lvl=info msg=" - u 0 0 4294967295" t=2017-06-14T21:21:13+0000
lvl=info msg=" - g 0 0 4294967295" t=2017-06-14T21:21:13+0000
lvl=info msg="Configured LXD uid/gid map:" t=2017-06-14T21:21:13+0000
lvl=info msg=" - u 0 1000000 1000000000" t=2017-06-14T21:21:13+0000
lvl=info msg=" - g 0 1000000 1000000000" t=2017-06-14T21:21:13+0000
lvl=info msg="Connecting to a remote simplestreams server" t=2017-06-14T21:21:13+0000
lvl=info msg="Expiring log files" t=2017-06-14T21:21:13+0000
lvl=info msg="Done expiring log files" t=2017-06-14T21:21:13+0000
lvl=info msg="Starting /dev/lxd handler" t=2017-06-14T21:21:13+0000
lvl=info msg="LXD is socket activated" t=2017-06-14T21:21:13+0000
lvl=info msg="REST API daemon:" t=2017-06-14T21:21:13+0000
lvl=info msg=" - binding Unix socket" socket=/var/lib/lxd/unix.socket t=2017-06-14T21:21:13+0000
lvl=info msg=" - binding TCP socket" socket=[::]:8443 t=2017-06-14T21:21:13+0000
lvl=info msg="Pruning expired images" t=2017-06-14T21:21:13+0000
lvl=info msg="Updating images" t=2017-06-14T21:21:13+0000
lvl=info msg="Done pruning expired images" t=2017-06-14T21:21:13+0000
lvl=info msg="Done updating images" t=2017-06-14T21:21:13+0000
root@vorash:~#

As you can see, the configured map is logged at LXD startup and can be used to confirm that the reconfiguration worked as expected.

You’ll then need to restart your containers to have them start using your newly expanded map.

Per container maps

Provided that you have a sufficient amount of uid/gid allocated to LXD, you can configure your containers to use their own, non-overlapping allocation of uids and gids.

This can be useful for two reasons:

  1. You are running software which alters kernel resource ulimits.
    Those user-specific limits are tied to a kernel uid and will cross container boundaries leading to hard to debug issues where one container can perform an action but all others are then unable to do the same.
  2. You want to know that should there be a way for someone in one of your containers to somehow get access to the host that they still won’t be able to access or interact with any of the other containers.

The main downsides to using this feature are:

  • It’s somewhat wasteful with using 65536 uids and gids per container.
    That being said, you’d still be able to run over 60000 isolated containers before running out of system uids and gids.
  • It’s effectively impossible to share storage between two isolated containers as everything written by one will be seen as -1 by the other. There is ongoing work around virtual filesystems in the kernel that will eventually let us get rid of that limitation.

To have a container use its own distinct map, simply run:

stgraber@castiana:~$ lxc config set test security.idmap.isolated true
stgraber@castiana:~$ lxc restart test
stgraber@castiana:~$ lxc config get test volatile.last_state.idmap
[{"Isuid":true,"Isgid":false,"Hostid":165536,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":165536,"Nsid":0,"Maprange":65536}]

The restart step is needed to have LXD remap the entire filesystem of the container to its new map.
Note that this step will take a varying amount of time depending on the number of files in the container and the speed of your storage.

As can be seen above, after restart, the container is shown to have its own map of 65536 uids/gids.

If you want LXD to allocate more than the default 65536 uids/gids to an isolated container, you can bump the size of the allocation with:

stgraber@castiana:~$ lxc config set test security.idmap.size 200000
stgraber@castiana:~$ lxc restart test
stgraber@castiana:~$ lxc config get test volatile.last_state.idmap
[{"Isuid":true,"Isgid":false,"Hostid":165536,"Nsid":0,"Maprange":200000},{"Isuid":false,"Isgid":true,"Hostid":165536,"Nsid":0,"Maprange":200000}]

If you’re trying to allocate more uids/gids than are left in LXD’s allocation, LXD will let you know:

stgraber@castiana:~$ lxc config set test security.idmap.size 2000000000
error: Not enough uid/gid available for the container.

Direct user/group mapping

The fact that all uids/gids in an unprivileged container are mapped to a normally unused range on the host means that sharing of data between host and container is effectively impossible.

Now, what if you want to share your user’s home directory with a container?

The obvious answer to that is to define a new “disk” entry in LXD which passes your home directory to the container:

stgraber@castiana:~$ lxc config device add test home disk source=/home/stgraber path=/home/ubuntu
Device home added to test

So that was pretty easy, but did it work?

stgraber@castiana:~$ lxc exec test -- bash
root@test:~# ls -lh /home/
total 529K
drwx--x--x 45 nobody nogroup 84 Jun 14 20:06 ubuntu

No. The mount is clearly there, but it’s completely inaccessible to the container.
To fix that, we need to take a few extra steps:

  • Allow LXD’s use of our user uid and gid
  • Restart LXD to have it load the new map
  • Set a custom map for our container
  • Restart the container to have the new map apply
stgraber@castiana:~$ printf "lxd:$(id -u):1\nroot:$(id -u):1\n" | sudo tee -a /etc/subuid
lxd:201105:1
root:201105:1

stgraber@castiana:~$ printf "lxd:$(id -g):1\nroot:$(id -g):1\n" | sudo tee -a /etc/subgid
lxd:200512:1
root:200512:1

stgraber@castiana:~$ sudo systemctl restart lxd

stgraber@castiana:~$ printf "uid $(id -u) 1000\ngid $(id -g) 1000" | lxc config set test raw.idmap -

stgraber@castiana:~$ lxc restart test

At which point, things should be working in the container:

stgraber@castiana:~$ lxc exec test -- su ubuntu -l
ubuntu@test:~$ ls -lh
total 119K
drwxr-xr-x 5  ubuntu ubuntu 8 Feb 18 2016 data
drwxr-x--- 4  ubuntu ubuntu 6 Jun 13 17:05 Desktop
drwxr-xr-x 3  ubuntu ubuntu 28 Jun 13 20:09 Downloads
drwx------ 84 ubuntu ubuntu 84 Sep 14 2016 Maildir
drwxr-xr-x 4  ubuntu ubuntu 4 May 20 15:38 snap
ubuntu@test:~$ 

Conclusion

User namespaces, the kernel feature that makes those uid/gid mappings possible is a very powerful tool which finally made containers on Linux safe by design. It is however not the easiest thing to wrap your head around and all of that uid/gid map math can quickly become a major issue.

In LXD we’ve tried to expose just enough of those underlying features to be useful to our users while doing the actual mapping math internally. This makes things like the direct user/group mapping above significantly easier than it otherwise would be.

Going forward, we’re very interested in some of the work around uid/gid remapping at the filesystem level, this would let us decouple the on-disk user/group map from that used for processes, making it possible to share data between differently mapped containers and alter the various maps without needing to also remap the entire filesystem.

Extra information

The main LXD website is at: https://linuxcontainers.org/lxd
Development happens on Github at: https://github.com/lxc/lxd
Discussion forun: https://discuss.linuxcontainers.org
Mailing-list support happens on: https://lists.linuxcontainers.org
IRC support happens in: #lxcontainers on irc.freenode.net
Try LXD online: https://linuxcontainers.org/lxd/try-it

Posted in Canonical voices, LXC, LXD, Planet Ubuntu | Tagged | 10 Comments

Using Wake on LAN with MAAS 2.x

Introduction

I maintain a number of development systems that are used as throw away machines to reproduce LXC and LXD bugs by the upstream developers. I use MAAS to track who’s using what and to have the machines deployed with whatever version of Ubuntu or Centos is needed to reproduce a given bug.

A number of those systems are proper servers with hardware BMCs on a management network that MAAS can drive using IPMI. Another set of systems are virtual machines that MAAS drives through libvirt.

But I’ve long had another system I wanted to get in there. That machine is a desktop computer but with a server grade SAS controller and internal and external arrays. That machine also has a Fiber Channel HBA and Infiniband card for even less common setups.

The trouble is that this being a desktop computer, it’s lacking any kind of remote management that MAAS supports. That machine does however have a good PCIe network card which provides reliable wake-on-lan.

Back in the days (MAAS 1.x), there was a wake-on-lan power type that would have covered my use case. This feature was however removed from MAAS 2.x (see LP: #1589140) and the development team suggests that users who want the old wake-on-lan feature, instead install Ubuntu 14.04 and the old MAAS 1.x branch.

Implementing Wake on LAN in MAAS 2.x

I am, however not particularly willing to install an old Ubuntu release and an old version of MAAS just for that one trivial feature, so I instead spent a bit of time to just implement the bits I needed and keep a patch around to be re-applied whenever MAAS changes.

MAAS doesn’t provide a plugin system for power types, so I unfortunately couldn’t just write a plugin and distribute that as an unofficial power type for those who need WOL. I instead had to resort to modifying MAAS directly to add the extra power type.

The code change needed to re-implement a wake-on-lan power type is pretty simple and only took me a few minutes to sort out. The patch can be found here: https://dl.stgraber.org/maas-wakeonlan.diff

To apply it to your MAAS, do:

sudo apt install wakeonlan
wget https://dl.stgraber.org/maas-wakeonlan.diff
sudo patch -p1 -d /usr/lib/python3/dist-packages/provisioningserver/ < maas-wakeonlan.diff
sudo systemctl restart maas-rackd.service maas-regiond.service

Once done, you’ll now see this in the web UI:

After selecting the new “Wake on LAN” power type, enter the MAC address of the network interface that you have WOL enabled on and save the change.

MAAS will then be able to turn the system on, allowing for the normal commissioning and deployment stages. For everything else, this power type behaves like the “Manual” type, asking the user to manually go shutdown or reboot the system as you can’t do that through Wake on LAN.

Note that you’ll have to re-apply part of the patch whenever MAAS is updated. The patch modifies two files and adds a new one. The new file won’t be removed during an upgrade, but the two modified files will get reverted and need patching again.

Conclusion

This is certainly a hack and if your system supports anything better than Wake on LAN, or you’re willing to buy a supported PDU just for that one system, then you should do that instead.

But if the inability to turn a system on is all that stands in your way from adding it to your MAAS, as was the case for me, then that patch may help you.

I hope that in time MAAS will either get that feature back in some way or get a plugin system that I can use to ship that extra power type in its own separate package without needing to alter any of MAAS’ own files.

Posted in Canonical voices, Planet Ubuntu | Tagged | 12 Comments

USB hotplug with LXD containers

LXD logo

USB devices in containers

It can be pretty useful to pass USB devices to a container. Be that some measurement equipment in a lab or maybe more commonly, an Android phone or some IoT device that you need to interact with.

Similar to what I wrote recently about GPUs, LXD supports passing USB devices into containers. Again, similarly to the GPU case, what’s actually passed into the container is a Unix character device, in this case, a /dev/bus/usb/ device node.

This restricts USB passthrough to those devices and software which use libusb to interact with them. For devices which use a kernel driver, the module should be installed and loaded on the host, and the resulting character or block device be passed to the container directly.

Note that for this to work, you’ll need LXD 2.5 or higher.

Example (Android debugging)

As an example which quite a lot of people should be able to relate to, lets run a LXD container with the Android debugging tools installed, accessing a USB connected phone.

This would for example allow you to have your app’s build system and CI run inside a container and interact with one or multiple devices connected over USB.

First, plug your phone over USB, make sure it’s unlocked and you have USB debugging enabled:

stgraber@dakara:~$ lsusb
Bus 002 Device 003: ID 0451:8041 Texas Instruments, Inc. 
Bus 002 Device 002: ID 0451:8041 Texas Instruments, Inc. 
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 021: ID 17ef:6047 Lenovo 
Bus 001 Device 031: ID 046d:082d Logitech, Inc. HD Pro Webcam C920
Bus 001 Device 004: ID 0451:8043 Texas Instruments, Inc. 
Bus 001 Device 005: ID 046d:0a01 Logitech, Inc. USB Headset
Bus 001 Device 033: ID 0fce:51da Sony Ericsson Mobile Communications AB 
Bus 001 Device 003: ID 0451:8043 Texas Instruments, Inc. 
Bus 001 Device 002: ID 072f:90cc Advanced Card Systems, Ltd ACR38 SmartCard Reader
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Spot your phone in that list, in my case, that’d be the “Sony Ericsson Mobile” entry.

Now let’s create our container:

stgraber@dakara:~$ lxc launch ubuntu:16.04 c1
Creating c1
Starting c1

And install the Android debugging client:

stgraber@dakara:~$ lxc exec c1 -- apt install android-tools-adb
Reading package lists... Done
Building dependency tree 
Reading state information... Done
The following NEW packages will be installed:
 android-tools-adb
0 upgraded, 1 newly installed, 0 to remove and 0 not upgraded.
Need to get 68.2 kB of archives.
After this operation, 198 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial/universe amd64 android-tools-adb amd64 5.1.1r36+git20160322-0ubuntu3 [68.2 kB]
Fetched 68.2 kB in 0s (0 B/s) 
Selecting previously unselected package android-tools-adb.
(Reading database ... 25469 files and directories currently installed.)
Preparing to unpack .../android-tools-adb_5.1.1r36+git20160322-0ubuntu3_amd64.deb ...
Unpacking android-tools-adb (5.1.1r36+git20160322-0ubuntu3) ...
Processing triggers for man-db (2.7.5-1) ...
Setting up android-tools-adb (5.1.1r36+git20160322-0ubuntu3) ...

We can now attempt to list Android devices with:

stgraber@dakara:~$ lxc exec c1 -- adb devices
* daemon not running. starting it now on port 5037 *
* daemon started successfully *
List of devices attached

Since we’ve not passed any USB device yet, the empty output is expected.

Now, let’s pass the specific device listed in “lsusb” above:

stgraber@dakara:~$ lxc config device add c1 sony usb vendorid=0fce productid=51da
Device sony added to c1

And try to list devices again:

stgraber@dakara:~$ lxc exec c1 -- adb devices
* daemon not running. starting it now on port 5037 *
* daemon started successfully *
List of devices attached 
CB5A28TSU6 device

To get a shell, you can then use:

stgraber@dakara:~$ lxc exec c1 -- adb shell
* daemon not running. starting it now on port 5037 *
* daemon started successfully *
E5823:/ $

LXD USB devices support hotplug by default. So unplugging the device and plugging it back on the host will have it removed and re-added to the container.

The “productid” property isn’t required, you can set only the “vendorid” so that any device from that vendor will be automatically attached to the container. This can be very convenient when interacting with a number of similar devices or devices which change productid depending on what mode they’re in.

stgraber@dakara:~$ lxc config device remove c1 sony
Device sony removed from c1
stgraber@dakara:~$ lxc config device add c1 sony usb vendorid=0fce
Device sony added to c1
stgraber@dakara:~$ lxc exec c1 -- adb devices
* daemon not running. starting it now on port 5037 *
* daemon started successfully *
List of devices attached 
CB5A28TSU6 device

The optional “required” property turns off the hotplug behavior, requiring the device be present for the container to be allowed to start.

More details on USB device properties can be found here.

Conclusion

We are surrounded by a variety of odd USB devices, a good number of which come with possibly dodgy software, requiring a specific version of a specific Linux distribution to work. It’s sometimes hard to accommodate those requirements while keeping a clean and safe environment.

LXD USB device passthrough helps a lot in such cases, so long as the USB device uses a libusb based workflow and doesn’t require a specific kernel driver.

If you want to add a device which does use a kernel driver, locate the /dev node it creates, check if it’s a character or block device and pass that to LXD as a unix-char or unix-block type device.

Extra information

The main LXD website is at: https://linuxcontainers.org/lxd
Development happens on Github at: https://github.com/lxc/lxd
Mailing-list support happens on: https://lists.linuxcontainers.org
IRC support happens in: #lxcontainers on irc.freenode.net
Try LXD online: https://linuxcontainers.org/lxd/try-it

Posted in Canonical voices, LXD, Planet Ubuntu | Tagged | 1 Comment

NVidia CUDA inside a LXD container

LXD logo

GPU inside a container

LXD supports GPU passthrough but this is implemented in a very different way than what you would expect from a virtual machine. With containers, rather than passing a raw PCI device and have the container deal with it (which it can’t), we instead have the host setup with all needed drivers and only pass the resulting device nodes to the container.

This post focuses on NVidia and the CUDA toolkit specifically, but LXD’s passthrough feature should work with all other GPUs too. NVidia is just what I happen to have around.

The test system used below is a virtual machine with two NVidia GT 730 cards attached to it. Those are very cheap, low performance GPUs, that have the advantage of existing in low-profile PCI cards that fit fine in one of my servers and don’t require extra power.
For production CUDA workloads, you’ll want something much better than this.

Note that for this to work, you’ll need LXD 2.5 or higher.

Host setup

Install the CUDA tools and drivers on the host:

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt update
sudo apt install cuda

Then reboot the system to make sure everything is properly setup. After that, you should be able to confirm that your NVidia GPU is properly working with:

ubuntu@canonical-lxd:~$ nvidia-smi 
Tue Mar 21 21:28:34 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 730      Off  | 0000:02:06.0     N/A |                  N/A |
| 30%   30C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 730      Off  | 0000:02:08.0     N/A |                  N/A |
| 30%   26C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
+-----------------------------------------------------------------------------+

And can check that the CUDA tools work properly with:

ubuntu@canonical-lxd:~$ /usr/local/cuda-8.0/extras/demo_suite/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GT 730
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			3059.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			3267.4

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			30805.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Container setup

First lets just create a regular Ubuntu 16.04 container:

ubuntu@canonical-lxd:~$ lxc launch ubuntu:16.04 c1
Creating c1
Starting c1

Then install the CUDA demo tools in there:

lxc exec c1 -- wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
lxc exec c1 -- dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
lxc exec c1 -- apt update
lxc exec c1 -- apt install cuda-demo-suite-8-0 --no-install-recommends

At which point, you can run:

ubuntu@canonical-lxd:~$ lxc exec c1 -- nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Which is expected as LXD hasn’t been told to pass any GPU yet.

LXD GPU passthrough

LXD allows for pretty specific GPU passthrough, the details can be found here.
First let’s start with the most generic one, just allow access to all GPUs:

ubuntu@canonical-lxd:~$ lxc config device add c1 gpu gpu
Device gpu added to c1
ubuntu@canonical-lxd:~$ lxc exec c1 -- nvidia-smi
Tue Mar 21 21:47:54 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 730      Off  | 0000:02:06.0     N/A |                  N/A |
| 30%   30C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 730      Off  | 0000:02:08.0     N/A |                  N/A |
| 30%   27C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
+-----------------------------------------------------------------------------+
ubuntu@canonical-lxd:~$ lxc config device remove c1 gpu
Device gpu removed from c1

Now just pass whichever is the first GPU:

ubuntu@canonical-lxd:~$ lxc config device add c1 gpu gpu id=0
Device gpu added to c1
ubuntu@canonical-lxd:~$ lxc exec c1 -- nvidia-smi
Tue Mar 21 21:50:37 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 730      Off  | 0000:02:06.0     N/A |                  N/A |
| 30%   30C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+
ubuntu@canonical-lxd:~$ lxc config device remove c1 gpu
Device gpu removed from c1

You can also specify the GPU by vendorid and productid:

ubuntu@canonical-lxd:~$ lspci -nnn | grep NVIDIA
02:06.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce GT 730] [10de:1287] (rev a1)
02:07.0 Audio device [0403]: NVIDIA Corporation GK208 HDMI/DP Audio Controller [10de:0e0f] (rev a1)
02:08.0 VGA compatible controller [0300]: NVIDIA Corporation GK208 [GeForce GT 730] [10de:1287] (rev a1)
02:09.0 Audio device [0403]: NVIDIA Corporation GK208 HDMI/DP Audio Controller [10de:0e0f] (rev a1)
ubuntu@canonical-lxd:~$ lxc config device add c1 gpu gpu vendorid=10de productid=1287
Device gpu added to c1
ubuntu@canonical-lxd:~$ lxc exec c1 -- nvidia-smi
Tue Mar 21 21:52:40 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 730      Off  | 0000:02:06.0     N/A |                  N/A |
| 30%   30C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 730      Off  | 0000:02:08.0     N/A |                  N/A |
| 30%   27C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
+-----------------------------------------------------------------------------+
ubuntu@canonical-lxd:~$ lxc config device remove c1 gpu
Device gpu removed from c1

Which adds them both as they are exactly the same model in my setup.

But for such cases, you can also select using the card’s PCI ID with:

ubuntu@canonical-lxd:~$ lxc config device add c1 gpu gpu pci=0000:02:08.0
Device gpu added to c1
ubuntu@canonical-lxd:~$ lxc exec c1 -- nvidia-smi
Tue Mar 21 21:56:52 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 730      Off  | 0000:02:08.0     N/A |                  N/A |
| 30%   27C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+
ubuntu@canonical-lxd:~$ lxc config device remove c1 gpu 
Device gpu removed from c1

And lastly, lets confirm that we get the same result as on the host when running a CUDA workload:

ubuntu@canonical-lxd:~$ lxc config device add c1 gpu gpu
Device gpu added to c1
ubuntu@canonical-lxd:~$ lxc exec c1 -- /usr/local/cuda-8.0/extras/demo_suite/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GT 730
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			3065.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			3305.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			30825.7

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Conclusion

LXD makes it very easy to share one or multiple GPUs with your containers.
You can either dedicate specific GPUs to specific containers or just share them.

There is no of the overhead involved with usual PCI based passthrough and only a single instance of the driver is running with the containers acting just like normal host user processes would.

This does however require that your containers run a version of the CUDA tools which supports whatever version of the NVidia drivers is installed on the host.

Extra information

The main LXD website is at: https://linuxcontainers.org/lxd
Development happens on Github at: https://github.com/lxc/lxd
Mailing-list support happens on: https://lists.linuxcontainers.org
IRC support happens in: #lxcontainers on irc.freenode.net
Try LXD online: https://linuxcontainers.org/lxd/try-it

Posted in Canonical voices, LXD, Planet Ubuntu | Tagged | 16 Comments