LXD 2.0: Resource control [4/12]

This is the fourth blog post in this series about LXD 2.0.


Available resource limits

LXD offers a variety of resource limits. Some of those are tied to the container itself, like memory quotas, CPU limits and I/O priorities. Some are tied to a particular device instead, like I/O bandwidth or disk usage limits.

As with all LXD configuration, resource limits can be dynamically changed while the container is running. Some may fail to apply, for example if setting a memory value smaller than the current memory usage, but LXD will try anyway and report back on failure.

All limits can also be inherited through profiles in which case each affected container will be constrained by that limit. That is, if you set limits.memory=256MB in the default profile, every container using the default profile (typically all of them) will have a memory limit of 256MB.

We don’t support pooling of resource limits, where a limit would be shared by a group of containers. There is simply no good way to implement something like that with the existing kernel APIs.

Disk

This is perhaps the most requested and obvious one: simply set a size limit on the container’s filesystem and have it enforced against the container.

And that’s exactly what LXD lets you do!
Unfortunately this is far more complicated than it sounds. Linux doesn’t have path-based quotas; instead, most filesystems only have user and group quotas, which are of little use to containers.

This means that right now LXD only supports disk limits if you’re using the ZFS or btrfs storage backend. It may be possible to implement this feature for LVM too but this depends on the filesystem being used with it and gets tricky when combined with live updates as not all filesystems allow online growth and pretty much none of them allow online shrink.

CPU

When it comes to CPU limits, we support 4 different things:

  • Just give me X CPUs
    In this mode, you let LXD pick a bunch of cores for you and then load-balance things as more containers and CPUs go online/offline.
    The container only sees that number of CPUs.
  • Give me a specific set of CPUs (say, core 1, 3 and 5)
    Similar to the first mode except that no load-balancing is happening, you’re stuck with those cores no matter how busy they may be.
  • Give me 20% of whatever you have
    In this mode, you get to see all the CPUs but the scheduler will restrict you to 20% of the CPU time but only when under load! So if the system isn’t busy, your container can have as much fun as it wants. When containers next to it start using the CPU, then it gets capped.
  • Out of every measured 200ms, give me 50ms (and no more than that)
    This mode is similar to the previous one in that you get to see all the CPUs but this time, you can only use as much CPU time as you set in the limit, no matter how idle the system may be. On a system without over-commit this lets you slice your CPU very neatly and guarantees constant performance to those containers.

It’s also possible to combine one of the first two with one of the last two, that is, request a set of CPUs and then further restrict how much CPU time you get on those.
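To make the fixed time slice mode a bit more concrete, here is a minimal sketch of the arithmetic behind it. The helper name allowance_percent is made up for this illustration and isn't part of LXD; it only shows what share of a single CPU's time a given "X ms out of every Y ms" request works out to.

```shell
# Hypothetical helper (not part of LXD): what share of one CPU's time does
# "QUOTA ms out of every PERIOD ms" amount to, as a whole percentage?
allowance_percent() {
    echo $(( $1 * 100 / $2 ))
}

# 50ms out of every 200ms is a quarter of one CPU's time.
allowance_percent 50 200
```

Unlike the percentage mode, that share is a ceiling even when the host is otherwise idle.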

On top of that, we also have a generic priority knob which is used to tell the scheduler who wins when you’re under load and two containers are fighting for the same resource.

Memory

Memory sounds pretty simple, just give me X MB of RAM!

And it absolutely can be that simple. We support that kind of limit as well as percentage-based requests: just give me 10% of whatever the host has!
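As a rough idea of what a percentage request translates to, the sketch below works out 10% of a host total. The helper name percent_of_kb and the 16GB host are assumptions made for the example, and this isn't how LXD computes it internally, just the same arithmetic done by hand.

```shell
# Hypothetical helper: PERCENT of a memory total given in kB, result in kB.
percent_of_kb() {
    echo $(( $1 * $2 / 100 ))
}

# Assumed host with 16777216 kB (16GB) of RAM. On a real Linux host the
# total can be read from the MemTotal line of /proc/meminfo, which is in kB.
percent_of_kb 16777216 10
```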

Then we support some extra stuff on top. For example, you can choose to turn swap on and off on a per-container basis and if it’s on, set a priority so you can choose which container will have its memory swapped out to disk first!

Oh and memory limits are “hard” by default. That is, when you run out of memory, the kernel out of memory killer will start having some fun with your processes.

Alternatively you can set the enforcement policy to “soft”, in which case you’ll be allowed to use as much memory as you want so long as nothing else is. As soon as something else wants that memory, you won’t be able to allocate anything until you’re back under your limit or until the host has memory to spare again.

Network I/O

Network I/O is probably our simplest looking limit, trust me, the implementation really isn’t simple though!

We support two things. The first is basic bit/s limits on network interfaces. You can set separate ingress and egress limits or just set the “max” limit, which then applies to both. This is only supported for “bridged” and “p2p” type interfaces.

The second thing is a global network I/O priority which only applies when the network interface you’re trying to talk through is saturated.

Block I/O

I kept the weirdest for last. It may look straightforward and feel like that to the user but there are a bunch of cases where it won’t exactly do what you think it should.

What we support here is basically identical to what I described in Network I/O.

You can set IOps or byte/s read and write limits directly on a disk device entry and there is a global block I/O priority which tells the I/O scheduler who to prefer.

The weirdness comes from how and where those limits are applied. Unfortunately the underlying feature we use to implement those uses full block devices. That means we can’t set per-partition I/O limits let alone per-path.

It also means that when using ZFS or btrfs which can use multiple block devices to back a given path (with or without RAID), we effectively don’t know what block device is providing a given path.

This means that it’s entirely possible, in fact likely, that a container may have multiple disk entries (bind-mounts or straight mounts) which are coming from the same underlying disk.

And that’s where things get weird. To make things work, LXD has logic to guess what block devices back a given path. This includes interrogating the ZFS and btrfs tools and even figuring things out recursively when it finds a loop-mounted file backing a filesystem.

That logic, while not perfect, usually yields a set of block devices that should have a limit applied. LXD then records that and moves on to the next path. When it’s done looking at all the paths, it gets to the very weird part: it averages the limits you’ve set for every affected block device and then applies those.

That means that “on average” you’ll be getting the right speed in the container, but it also means that you can’t have a “/fast” and a “/slow” directory both coming from the same physical disk and with differing speed limits. LXD will let you set it up but in the end, they’ll both give you the average of the two values.
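The averaging can be sketched in a few lines of shell. This is only an illustration of the behavior described above, not LXD's actual code; the helper name average_limit and the 30MB/10MB values are made up for the example.

```shell
# Hypothetical illustration (not LXD's code): take the limits, in MB/s, of
# every disk entry that resolves to the same block device and print the
# single value that ends up applied to that device.
average_limit() {
    total=0
    count=0
    for limit in "$@"; do
        total=$(( $total + $limit ))
        count=$(( $count + 1 ))
    done
    # Nothing to average if no limits were passed in.
    [ "$count" -gt 0 ] || return 1
    echo $(( $total / $count ))
}

# "/fast" at 30MB/s and "/slow" at 10MB/s, both on the same physical disk:
average_limit 30 10
```

Both paths end up at 20MB/s, which is neither of the values that were asked for.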

How does it all work?

Most of the limits described above are applied through the Linux kernel Cgroups API. That’s with the exception of the network limits which are applied through good old “tc”.

LXD detects at startup which cgroups are enabled in your kernel and will only apply the limits which your kernel supports. Should you be missing some cgroups, the daemon will also print a warning, which will then get logged by your init system.

On Ubuntu 16.04, everything is enabled by default with the exception of swap memory accounting, which requires you to pass the “swapaccount=1” kernel boot parameter.
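If you're not sure whether that parameter made it onto your kernel command line, a small check like the one below can help. The function name has_swapaccount is made up for this sketch; the only assumption is that the running kernel's command line is available in /proc/cmdline, as it is on Linux.

```shell
# Hypothetical helper: succeeds if the given kernel command line
# contains the swapaccount=1 parameter as a separate word.
has_swapaccount() {
    case " $1 " in
        *" swapaccount=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a running host, pass in the real command line instead:
#   has_swapaccount "$(cat /proc/cmdline)" && echo "swap accounting enabled"
if has_swapaccount "ro quiet splash swapaccount=1"; then
    echo "swap accounting enabled"
fi
```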

Applying some limits

All the limits described above are applied directly to the container or to one of its profiles. Container-wide limits are applied with:

lxc config set CONTAINER KEY VALUE

or for a profile:

lxc profile set PROFILE KEY VALUE

while device-specific ones are applied with:

lxc config device set CONTAINER DEVICE KEY VALUE

or for a profile:

lxc profile device set PROFILE DEVICE KEY VALUE

The complete list of valid configuration keys, device types and device keys can be found here.
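The four command forms above only differ in scope (container or profile) and in whether a device name is involved. The dry-run helper below, whose name limit_cmd is invented for this post and isn't part of the lxc client, prints the matching command without running it, which can be handy when scripting.

```shell
# Hypothetical dry-run helper (not part of the lxc client).
#   limit_cmd container NAME KEY VALUE          container-wide limit
#   limit_cmd container NAME DEVICE KEY VALUE   device-specific limit
# Use "profile" instead of "container" to target a profile.
limit_cmd() {
    scope=$1
    shift
    case "$scope" in
        container) base="lxc config" ;;
        profile)   base="lxc profile" ;;
        *) echo "unknown scope: $scope" >&2; return 1 ;;
    esac
    case $# in
        3) echo "$base set $1 $2 $3" ;;
        4) echo "$base device set $1 $2 $3 $4" ;;
        *) echo "expected 3 or 4 arguments after the scope" >&2; return 1 ;;
    esac
}

limit_cmd container my-container limits.memory 256MB
limit_cmd profile default eth0 limits.ingress 100Mbit
```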

CPU

To just limit a container to any 2 CPUs, do:

lxc config set my-container limits.cpu 2

To pin to specific CPU cores, say the second and fourth:

lxc config set my-container limits.cpu 1,3

More complex pinning ranges like this work too:

lxc config set my-container limits.cpu 0-3,7-11
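Since the same key takes both a count and a set of cores, it helps to spell out how a value gets read. The sketch below follows the syntax shown in this post (a bare number is a count, anything with a comma or a range is a pinned set); the function name cpu_limit_mode is hypothetical and this is not LXD's own parser.

```shell
# Hypothetical helper following the syntax described in this post:
# a bare number is a CPU count, anything containing "," or "-" is a pinned set.
cpu_limit_mode() {
    case "$1" in
        *[,-]*) echo "pinned" ;;
        *)      echo "count" ;;
    esac
}

cpu_limit_mode 2          # count: any 2 CPUs, picked by LXD
cpu_limit_mode 1,3        # pinned: second and fourth cores
cpu_limit_mode 0-3,7-11   # pinned: two ranges
cpu_limit_mode 0-0        # pinned: a range of one, so only the first core
```

The last form matters because a bare 0 or 1 would be read as a count, so pinning to a single core needs the range syntax.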

The limits are applied live, as can be seen in this example:

stgraber@dakara:~$ lxc exec zerotier -- cat /proc/cpuinfo | grep ^proces
processor : 0
processor : 1
processor : 2
processor : 3
stgraber@dakara:~$ lxc config set zerotier limits.cpu 2
stgraber@dakara:~$ lxc exec zerotier -- cat /proc/cpuinfo | grep ^proces
processor : 0
processor : 1

Note that to avoid utterly confusing userspace, lxcfs arranges the /proc/cpuinfo entries so that there are no gaps.

As with just about everything in LXD, those settings can also be applied in profiles:

stgraber@dakara:~$ lxc exec snappy -- cat /proc/cpuinfo | grep ^proces
processor : 0
processor : 1
processor : 2
processor : 3
stgraber@dakara:~$ lxc profile set default limits.cpu 3
stgraber@dakara:~$ lxc exec snappy -- cat /proc/cpuinfo | grep ^proces
processor : 0
processor : 1
processor : 2

To limit the CPU time of a container to 10% of the total, set the CPU allowance:

lxc config set my-container limits.cpu.allowance 10%

Or to give it a fixed slice of CPU time:

lxc config set my-container limits.cpu.allowance 25ms/200ms

And lastly, to reduce the priority of a container to a minimum:

lxc config set my-container limits.cpu.priority 0

Memory

To apply a straightforward memory limit run:

lxc config set my-container limits.memory 256MB

(The supported suffixes are kB, MB, GB, TB, PB and EB)

To turn swap off for the container (defaults to enabled):

lxc config set my-container limits.memory.swap false

To tell the kernel to swap this container’s memory first:

lxc config set my-container limits.memory.swap.priority 0

And finally if you don’t want hard memory limit enforcement:

lxc config set my-container limits.memory.enforce soft

Disk and block I/O

Unlike CPU and memory, disk and I/O limits are applied to the actual device entry, so you either need to edit the original device or mask it with a more specific one.

To set a disk limit (requires btrfs or ZFS):

lxc config device set my-container root size 20GB

For example:

stgraber@dakara:~$ lxc exec zerotier -- df -h /
Filesystem                        Size Used Avail Use% Mounted on
encrypted/lxd/containers/zerotier 179G 542M  178G   1% /
stgraber@dakara:~$ lxc config device set zerotier root size 20GB
stgraber@dakara:~$ lxc exec zerotier -- df -h /
Filesystem                       Size  Used Avail Use% Mounted on
encrypted/lxd/containers/zerotier 20G  542M   20G   3% /

To restrict speed you can do the following:

lxc config device set my-container root limits.read 30MB
lxc config device set my-container root limits.write 10MB

Or to restrict IOps instead:

lxc config device set my-container root limits.read 20iops
lxc config device set my-container root limits.write 10iops

And lastly, if you’re on a busy system with over-commit, you may want to also do:

lxc config set my-container limits.disk.priority 10

To increase the I/O priority for that container to the maximum.

Network I/O

Network I/O is basically identical to block I/O as far as the available knobs go.

For example:

stgraber@dakara:~$ lxc exec zerotier -- wget http://speedtest.newark.linode.com/100MB-newark.bin -O /dev/null
--2016-03-26 22:17:34-- http://speedtest.newark.linode.com/100MB-newark.bin
Resolving speedtest.newark.linode.com (speedtest.newark.linode.com)... 50.116.57.237, 2600:3c03::4b
Connecting to speedtest.newark.linode.com (speedtest.newark.linode.com)|50.116.57.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104857600 (100M) [application/octet-stream]
Saving to: '/dev/null'

/dev/null 100%[===================>] 100.00M 58.7MB/s in 1.7s 

2016-03-26 22:17:36 (58.7 MB/s) - '/dev/null' saved [104857600/104857600]

stgraber@dakara:~$ lxc profile device set default eth0 limits.ingress 100Mbit
stgraber@dakara:~$ lxc profile device set default eth0 limits.egress 100Mbit
stgraber@dakara:~$ lxc exec zerotier -- wget http://speedtest.newark.linode.com/100MB-newark.bin -O /dev/null
--2016-03-26 22:17:47-- http://speedtest.newark.linode.com/100MB-newark.bin
Resolving speedtest.newark.linode.com (speedtest.newark.linode.com)... 50.116.57.237, 2600:3c03::4b
Connecting to speedtest.newark.linode.com (speedtest.newark.linode.com)|50.116.57.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104857600 (100M) [application/octet-stream]
Saving to: '/dev/null'

/dev/null 100%[===================>] 100.00M 11.4MB/s in 8.8s 

2016-03-26 22:17:56 (11.4 MB/s) - '/dev/null' saved [104857600/104857600]

And that’s how you throttle an otherwise nice gigabit connection to a mere 100Mbit/s one!
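To sanity-check the 11.4MB/s figure, here is the unit conversion as a small sketch. The helper name mbit_to_kib_per_s is made up for this example, and it assumes wget's MB/s are 1024-based, as its 100.00M figure for 104857600 bytes suggests.

```shell
# Hypothetical helper: convert a limit in Mbit/s to KiB/s
# (1 Mbit = 1,000,000 bits, 8 bits per byte, 1 KiB = 1024 bytes).
mbit_to_kib_per_s() {
    echo $(( $1 * 1000000 / 8 / 1024 ))
}

# A 100Mbit limit tops out at roughly 11.9 MiB/s before protocol overhead.
mbit_to_kib_per_s 100
```

So the measured 11.4MB/s sits just under what the limit allows.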

And as with block I/O, you can set an overall network priority with:

lxc config set my-container limits.network.priority 5

Getting the current resource usage

The LXD API exports quite a bit of information on current container resource usage. You can get:

  • Memory: current, peak, current swap and peak swap
  • Disk: current disk usage
  • Network: bytes and packets received and sent for every interface

And now if you’re running a very recent LXD (only in git at the time of this writing), you can also get all of those in “lxc info”:

stgraber@dakara:~$ lxc info zerotier
Name: zerotier
Architecture: x86_64
Created: 2016/02/20 20:01 UTC
Status: Running
Type: persistent
Profiles: default
Pid: 29258
Ips:
 eth0: inet 172.17.0.101
 eth0: inet6 2607:f2c0:f00f:2700:216:3eff:feec:65a8
 eth0: inet6 fe80::216:3eff:feec:65a8
 lo: inet 127.0.0.1
 lo: inet6 ::1
 lxcbr0: inet 10.0.3.1
 lxcbr0: inet6 fe80::f0bd:55ff:feee:97a2
 zt0: inet 29.17.181.59
 zt0: inet6 fd80:56c2:e21c:0:199:9379:e711:b3e1
 zt0: inet6 fe80::79:e7ff:fe0d:5123
Resources:
 Processes: 33
 Disk usage:
  root: 808.07MB
 Memory usage:
  Memory (current): 106.79MB
  Memory (peak): 195.51MB
  Swap (current): 124.00kB
  Swap (peak): 124.00kB
 Network usage:
  lxcbr0:
   Bytes received: 0 bytes
   Bytes sent: 570 bytes
   Packets received: 0
   Packets sent: 0
  zt0:
   Bytes received: 1.10MB
   Bytes sent: 806 bytes
   Packets received: 10957
   Packets sent: 10957
  eth0:
   Bytes received: 99.35MB
   Bytes sent: 5.88MB
   Packets received: 64481
   Packets sent: 64481
  lo:
   Bytes received: 9.57kB
   Bytes sent: 9.57kB
   Packets received: 81
   Packets sent: 81
Snapshots:
 zerotier/blah (taken at 2016/03/08 23:55 UTC) (stateless)
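If you only care about one of those values, say for a quick monitoring script, the text output is easy enough to pick apart. The sketch below assumes the output format shown above; the function name memory_current is made up for the example.

```shell
# Hypothetical helper: read "lxc info" output on stdin and print the
# value of the "Memory (current)" line.
memory_current() {
    awk -F': ' '/Memory \(current\)/ {print $2}'
}

# Fed here with a fragment of the output above. On a live system it would be:
#   lxc info zerotier | memory_current
printf '  Memory (current): 106.79MB\n  Memory (peak): 195.51MB\n' | memory_current
```

For anything more serious than a quick script, the API mentioned above is likely a sturdier source than scraping text.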

Conclusion

The LXD team spent quite a few months iterating over the language we’re using for those limits. It’s meant to be as simple as it can get while remaining very powerful and specific when you want it to be.

Live application of those limits and inheritance through profiles make it a very powerful tool to manage the load on your servers live without impacting the running services.

Extra information

The main LXD website is at: https://linuxcontainers.org/lxd
Development happens on GitHub at: https://github.com/lxc/lxd
Mailing-list support happens on: https://lists.linuxcontainers.org
IRC support happens in: #lxcontainers on irc.freenode.net

And if you don’t want to or can’t install LXD on your own machine, you can always try it online instead!


29 Responses to LXD 2.0: Resource control [4/12]

  1. Ben Vincent says:

    Is there a way to subscribe to your site? Every single article I have read here has been fantastic and I would like to receive notifications when new posts are made.

  2. Fred Jackson says:

    Could you please address how to fully “undo” [lxd init] and remove everything that has been done with lxc, in order to start over from scratch with a new [lxd init]? Removing the lxd and lxc packages does not do it. I am getting worn out from completely reinstalling ubuntu on my test system.

    One thing that definitely doesn’t work is [zfs destroy]ing the ZFS pool that lxd is using. Very bad news as it wedges lxd really bad.

  3. shibata says:

    Thank you for great articles!

    Anyway following “CONTAINER” should be “PROFILE”?

    > lxc profile device set CONTAINER DEVICE KEY VALUE

  4. AceSlash says:

    Fantastic series!

    I have one question though: how do you limit the /dev/shm size on a container?

    I limited the memory for the (unprivileged) container but could not limit the size of /dev/shm. Adding the following line to /etc/fstab inside the container crashed my /bin/bash several times until I removed it (and anyway it wasn’t working, /dev/shm was the same size as the host):
    tmpfs /dev/shm tmpfs defaults,noexec,nosuid,size=100M 0 0

    remounting doesn’t work either:
    mount -o remount,size=100M /dev/shm
    mount: cannot remount tmpfs read-write, is write-protected

    The issue is that when I try to fill /dev/shm, it crashes the container because it uses more than the memory limit :p

    Any idea?

    • Looks like apparmor is what’s blocking the remount. If the container is unprivileged, you could turn off apparmor by doing:
      lxc config set my-container raw.lxc lxc.aa_profile=unconfined

      Though I wouldn’t recommend this for a privileged container.

      By the looks of it, the problem is that apparmor cannot tell the filesystem type at the time of the remount so doesn’t know if it’s safe to proceed (as it would for a tmpfs) vs unsafe (as it would for proc).

      The problem shouldn’t happen on a straight mount though, so you could just over-mount it with:
      mount -o size=100M -t tmpfs tmpfs /run/shm/

      Or figure out how your init system mounts it and have it be mounted with the right size directly from the start.

      • AceSlash says:

        Thanks for your answer.

        It indeed works with a simple mount. Any idea how to set the tmpfs size at container start on Ubuntu 15.10? (both the host and the container use this OS)

        Shouldn’t the size of tmpfs be managed by LXD to avoid it from being bigger than the total amount of RAM given to the container? (at least at container start if the limit exists then). Currently this means that any user can crash a container if /dev/shm is bigger than the memory limit by just filling it.

        Should I open an issue on github for this?

        • Well, there is no kernel API allowing LXD to limit the default size of a tmpfs mount so there’s unfortunately not much LXD or LXC can do about this. At least the kernel doesn’t let you run over the memory quota as it once did but yeah, the resulting behavior isn’t amazing.

          I don’t know systemd so well, I know that back in the upstart days, you could just have booted, copy/pasted the /proc/mounts line you want to override and stuff it into /etc/fstab and you’d be good. There’s probably a systemd equivalent or at least some ways to override those default mount flags but you’d need to ask someone with more systemd knowledge…

          • AceSlash says:

            Okay, thank you for your answers, I will take a look at systemd.

            Maybe also a correct work around should be to simply drastically reduce the default tmpfs on the host since the containers take the same size. I usually don’t use tmpfs at all on a virtual/container host anyway.

  5. Mahesh says:

    First of all thanks for such a brilliant series. It would be great if you add command to set rootfs size for LVM backend.

    • We actually recently introduced a configuration option to set the default LVM LV size, it’s “storage.lvm_volume_size”.

      This only applies to newly created LVs and not to snapshots of existing LVs, so will only apply to new containers created from new images.

      Obviously we’d like for resizing of LVM backed containers to work in the same way as for ZFS and btrfs, but ext4 and xfs (which are used on LVM) simply don’t offer very reliable online filesystem resizing. At best they support growing it but never shrinking it.

  6. tom zhou says:

    Really cool idea.

    How can we implement QoS on CPU/MEM and DISK/NETWORK IO among lxd containers ?

  7. Pingback: LXD 2.0: LXD in LXD [8/12] - Linux Distros

  8. Igor says:

    I didn’t understand this phrase: “That is, when you run out of memory, the kernel out of memory killer will start having some fun with your processes.”

    • The kernel out of memory killer will pick random processes in your container and kill them if you exceed your memory allocation, same as happens on a real machine.

  9. Paul says:

    No matter what I do I can not increase the size of my container, I can not create a new container and I cannot move images to created

    Filesystem                  Size Used Avail Use% Mounted on
    /dev/mapper/ubuntu--vg-root 435G  58G  356G  14% /
    d1web1                       21G    0   21G   0% /d1web1
    d1web1/containers            21G    0   21G   0% /d1web1/containers
    d1web1/deleted               21G    0   21G   0% /d1web1/deleted
    d1web1/deleted/images        21G    0   21G   0% /d1web1/deleted/images
    d1web1/images                21G    0   21G   0% /d1web1/images
    mail                         97G    0   97G   0% /mail
    d1web1/containers/d1web1     21G 699M   21G   4% /var/lib/lxd/containers/d1web1.zfs
    d1web1/containers/mail       46G  26G   21G  56% /var/lib/lxd/containers/mail.zfs
    d1web1/containers/d1web3     21G 572M   21G   3% /var/lib/lxd/containers/d1web3.zfs
    d1web3                       49G    0   49G   0% /d1web3

    It seems all subsequent images I create are all added to the d1web1 container which was the first container I created at only 50GB. The subsequent “mail” container of 100G was created but image was created in the d1web1 container and I can not move it (same goes for d1web3).

    At the moment all images are using the same 50GB disk which I need them to use their own mounted disks:

    sudo zpool list
    NAME     SIZE  ALLOC   FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
    d1web1  49.8G  27.9G  21.9G         -   41%  56%  1.00x  ONLINE  -
    d1web3  49.8G  62.5K  49.7G         -    0%   0%  1.00x  ONLINE  -
    mail    99.5G   104K  99.5G         -    0%   0%  1.00x  ONLINE  -

    Any help appreciated.

  10. Pingback: LXD w praktyce – część 4. Zarządzanie zasobami. - OSWorld.pl

  11. Martin says:

    In experimenting with building an image for developing and testing against I needed a container with another “disk” (or a few disks) attached to it. Previously I’ve done this inside of VMs, but containers are just so much faster and convenient, so I wanted to create the same/similar in a container.

    What I used to “fake” these block devices in a VM was to create a loop device using `losetup` and then create a few LVM volumes on top of that.

    When trying the same trick inside a LXD container, using `losetup` it “cannot find an unused loop device”. That makes sense to me, but my question is: Is there a way I can add a loop device to the container to use?

    Thanks in advance!

  12. Samuel says:

    Is there any way to set a network I/O limit on a per-container basis? In the example above you set the limit through the default profile.

    $ lxc profile device set default eth0 limits.ingress 100Mbit

    I’m looking for something like this:

    $ lxc config set eth0 limits.ingress 100Mbit

    Best regards,

  13. hurelhuyag says:

    Hi, raw.lxc is useful for unsupported cgroups. But how can I use multiple raw.lxc config?

    • It’s multi-line. You can either set it from stdin with something like:

      lxc config set c1 raw.lxc - << EOF
      lxc.a=a
      lxc.b=b
      EOF

      Or use "lxc config edit c1" and make it a multi-line entry in there like:

      raw.lxc: |-
        lxc.a=a
        lxc.b=b

  14. eagleeye128 says:

    Hi, are there any statistics on disk io in lxc info coming from cgroups ?

    • Not right now. You can file an issue at https://github.com/lxc/lxd/issues with a suggestion of what kind of statistics would be useful to report from the blkio cgroup.

      I just had a very quick look and didn’t find anything that looks immediately useful to a normal user (no generic bytes counter or peak access speed or the like).

  15. Gavin Davies says:

    Great post 🙂 I’m struggling with one thing though

    “To just limit a container to any 2 CPUs, do:
    lxc config set my-container limits.cpu 2

    To pin to specific CPU cores, say the second and fourth:
    lxc config set my-container limits.cpu 1,3”

    As those commands have the exact same syntax, how would I pin container A to CPU 0 and container B to CPU 1?

    So, if I said:

    lxc config set containerA limits.cpu 0
    lxc config set containerB limits.cpu 1

    How would that behave? Would it allocate no CPU to containerA OR would it lock containerA into CPU 0?

    (I’m probably just being dense, I’m very new to all this!)

    Thanks,
    – Gavin

    • Good question and the answer is kinda odd. The syntax you’d use to pin to a single core is:

      lxc config set containerA limits.cpu 0-0
      lxc config set containerB limits.cpu 1-1

      So use the range syntax which has LXD do direct pinning, using a range of a single CPU.

  16. prafull gangawane says:

    LXC fails to achieve resource control, a container given 30% share of disk bandwidth consumes 70% share what could be the reason ?

    • The I/O limits are definitely the weirdest ones as far as kernel limits are concerned.

      LXD has logic to resolve what block device a particular container disk device is using and then sets the applicable blkio limit for that device in the kernel.

      There are quite a few shortcomings with this approach (and no way around them given how that cgroup works):
      – Filesystems that don’t use the normal kernel paths may not see the limit fully applied (ZFS is one of those).
      – Cached read/writes aren’t restricted (which is usually a reasonable behavior).
      – Filesystems spanning multiple block devices (btrfs, zfs) will have the limit applied on each of the backing devices, which typically means a higher possible throughput (twice the amount for raid1 with two devices) in the ideal case where the reads/writes are perfectly spread across devices.
      – If multiple “disk” devices are passed to the container from the same backing block device, then LXD will average the limits together as there is no way to do a per-path or per-mount limit.

      Should someone ever implement proper VFS bandwidth limits in the kernel, we’ll be sure to switch to that as that’s ultimately what we want for containers, but so far the only object we can interact with is a whole block device, and I mean whole block device, we can’t even apply per-partition I/O limits through cgroups.
