LXC 1.0: Scripting with the API [8/10]

This is post 8 out of 10 in the LXC 1.0 blog post series.

The API

The first version of liblxc was introduced in LXC 0.9 but it was very much at an experimental state. LXC 1.0 however will ship with a much more complete API, covering all of LXC’s features. We’ve actually been rebasing all of our tools (lxc-*) to using that API rather than doing direct calls to the internal functions.

The API also comes with a whole set of tests which we run as part of our continuous integration setup and before distro uploads.

There are also quite a few bindings for those who don’t feel like writing C, we have lua and python3 bindings in-tree upstream and there are official out-of-tree bindings for Go and ruby.

The API documentation can be found at:
https://linuxcontainers.org/lxc/documentation/

It’s not necessarily the most readable API documentation ever and certainly could do with some examples, especially for the bindings, but it does cover all functions that are exported over the API. Any help improving our API documentation is very much welcome!

The basics

So let’s start with a very simple example of the LXC API using C, the following example will create a new container struct called “apicontainer”, create a root filesystem using the new download template, start the container, print its state and PID number, then attempt a clean shutdown before killing it.

#include <stdio.h>

#include <lxc/lxccontainer.h>

int main() {
    struct lxc_container *c;
    int ret = 1;

    /* Setup container struct */
    c = lxc_container_new("apicontainer", NULL);
    if (!c) {
        fprintf(stderr, "Failed to setup lxc_container struct
");
        goto out;
    }

    if (c->is_defined(c)) {
        fprintf(stderr, "Container already exists
");
        goto out;
    }

    /* Create the container */
    if (!c->createl(c, "download", NULL, NULL, LXC_CREATE_QUIET,
                    "-d", "ubuntu", "-r", "trusty", "-a", "i386", NULL)) {
        fprintf(stderr, "Failed to create container rootfs
");
        goto out;
    }

    /* Start the container */
    if (!c->start(c, 0, NULL)) {
        fprintf(stderr, "Failed to start the container
");
        goto out;
    }

    /* Query some information */
    printf("Container state: %s
", c->state(c));
    printf("Container PID: %d
", c->init_pid(c));

    /* Stop the container */
    if (!c->shutdown(c, 30)) {
        printf("Failed to cleanly shutdown the container, forcing.
");
        if (!c->stop(c)) {
            fprintf(stderr, "Failed to kill the container.
");
            goto out;
        }
    }

    /* Destroy the container */
    if (!c->destroy(c)) {
        fprintf(stderr, "Failed to destroy the container.
");
        goto out;
    }

    ret = 0;
out:
    lxc_container_put(c);
    return ret;
}

So as you can see, it’s not very difficult to use, most functions are fairly straightforward and error checking is pretty simple (most calls are boolean and errors are printed to stderr by LXC depending on the loglevel).

Python3 scripting

As much fun as C may be, I usually like to script my containers and C isn’t really the best language for that. That’s why I wrote and maintain the official python3 binding.

The equivalent to the example above in python3 would be:

import lxc
import sys

# Setup the container object
c = lxc.Container("apicontainer")
if c.defined:
    print("Container already exists", file=sys.stderr)
    sys.exit(1)

# Create the container rootfs
if not c.create("download", lxc.LXC_CREATE_QUIET, {"dist": "ubuntu",
                                                   "release": "trusty",
                                                   "arch": "i386"}):
    print("Failed to create the container rootfs", file=sys.stderr)
    sys.exit(1)

# Start the container
if not c.start():
    print("Failed to start the container", file=sys.stderr)
    sys.exit(1)

# Query some information
print("Container state: %s" % c.state)
print("Container PID: %s" % c.init_pid)

# Stop the container
if not c.shutdown(30):
    print("Failed to cleanly shutdown the container, forcing.")
    if not c.stop():
        print("Failed to kill the container", file=sys.stderr)
        sys.exit(1)

# Destroy the container
if not c.destroy():
    print("Failed to destroy the container.", file=sys.stderr)
    sys.exit(1)

Now for that specific example, python3 isn’t that much simpler than the C equivalent.

But what if we wanted to do something slightly more tricky, like iterating through all existing containers, start them (if they’re not already started), wait for them to have network connectivity, then run updates and shut them down?

import lxc
import sys

for container in lxc.list_containers(as_object=True):
    # Start the container (if not started)
    started=False
    if not container.running:
        if not container.start():
            continue
        started=True

    if not container.state == "RUNNING":
        continue

    # Wait for connectivity
    if not container.get_ips(timeout=30):
        continue

    # Run the updates
    container.attach_wait(lxc.attach_run_command,
                          ["apt-get", "update"])
    container.attach_wait(lxc.attach_run_command,
                          ["apt-get", "dist-upgrade", "-y"])

    # Shutdown the container
    if started:
        if not container.shutdown(30):
            container.stop()

The most interesting bit in the example above is the attach_wait command, which basically lets your run a standard python function in the container’s namespaces, here’s a more obvious example:

import lxc

c = lxc.Container("p1")
if not c.running:
    c.start()

def print_hostname():
    with open("/etc/hostname", "r") as fd:
        print("Hostname: %s" % fd.read().strip())

# First run on the host
print_hostname()

# Then on the container
c.attach_wait(print_hostname)

if not c.shutdown(30):
    c.stop()

And the output of running the above:

stgraber@castiana:~$ python3 lxc-api.py
/home/stgraber/<frozen>:313: Warning: The python-lxc API isn't yet stable and may change at any point in the future.
Hostname: castiana
Hostname: p1

It may take you a little while to wrap your head around the possibilities offered by that function, especially as it also takes quite a few flags (look for LXC_ATTACH_* in the C API) which lets you control which namespaces to attach to, whether to have the function contained by apparmor, whether to bypass cgroup restrictions, …

That kind of flexibility is something you’ll never get with a virtual machine and the way it’s supported through our bindings makes it easier than ever to use by anyone who wants to automate custom workloads.

You can also use the API to script cloning containers and using snapshots (though for that example to work, you need current upstream master due to a small bug I found while writing this…):

import lxc
import os
import sys

if not os.geteuid() == 0:
    print("The use of overlayfs requires privileged containers.")
    sys.exit(1)

# Create a base container (if missing) using an Ubuntu 14.04 image
base = lxc.Container("base")
if not base.defined:
    base.create("download", lxc.LXC_CREATE_QUIET, {"dist": "ubuntu",
                                                   "release": "precise",
                                                   "arch": "i386"})

    # Customize it a bit
    base.start()
    base.get_ips(timeout=30)
    base.attach_wait(lxc.attach_run_command, ["apt-get", "update"])
    base.attach_wait(lxc.attach_run_command, ["apt-get", "dist-upgrade", "-y"])

    if not base.shutdown(30):
        base.stop()

# Clone it as web (if not already existing)
web = lxc.Container("web")
if not web.defined:
    # Clone base using an overlayfs overlay
    web = base.clone("web", bdevtype="overlayfs",
                     flags=lxc.LXC_CLONE_SNAPSHOT)

    # Install apache
    web.start()
    web.get_ips(timeout=30)
    web.attach_wait(lxc.attach_run_command, ["apt-get", "update"])
    web.attach_wait(lxc.attach_run_command, ["apt-get", "install",
                                             "apache2", "-y"])

    if not web.shutdown(30):
        web.stop()

# Create a website container based on the web container
mysite = web.clone("mysite", bdevtype="overlayfs",
                   flags=lxc.LXC_CLONE_SNAPSHOT)
mysite.start()
ips = mysite.get_ips(family="inet", timeout=30)
if ips:
    print("Website running at: http://%s" % ips[0])
else:
    if not mysite.shutdown(30):
        mysite.stop()

The above will create a base container using a downloaded image, then clone it using an overlayfs based overlay, add apache2 to it, then clone that resulting container into yet another one called “mysite”. So “mysite” is effectively an overlay clone of “web” which is itself an overlay clone of “base”.

 

So there you go, I tried to cover most of the interesting bits of our API with the examples above, though there’s much more available, for example, I didn’t cover the snapshot API (currently restricted to system containers) outside of the specific overlayfs case above and only scratched the surface of what’s possible to do with the attach function.

LXC 1.0 will release with a stable version of the API, we’ll be doing additions in the next few 1.x versions (while doing bugfix only updates to 1.0.x) and hope not to have to break the whole API for quite a while (though we’ll certainly be adding more stuff to it).

Posted in Canonical voices, LXC, Planet Ubuntu | Tagged | 21 Comments

LXC 1.0: Unprivileged containers [7/10]

This is post 7 out of 10 in the LXC 1.0 blog post series.

Introduction to unprivileged containers

The support of unprivileged containers is in my opinion one of the most important new features of LXC 1.0.

You may remember from previous posts that I mentioned that LXC should be considered unsafe because while running in a separate namespace, uid 0 in your container is still equal to uid 0 outside of the container, meaning that if you somehow get access to any host resource through proc, sys or some random syscalls, you can potentially escape the container and then you’ll be root on the host.

That’s what user namespaces were designed for and implemented. It was a multi-year effort to think them through and slowly push the hundreds of patches required into the upstream kernel, but finally with 3.12 we got to a point where we can start a full system container entirely as a user.

So how do those user namespaces work? Well, simply put, each user that’s allowed to use them on the system gets assigned a range of unused uids and gids, ideally a whole 65536 of them. You can then use those uids and gids with two standard tools called newuidmap and newgidmap which will let you map any of those uids and gids to virtual uids and gids in a user namespace.

That means you can create a container with the following configuration:

lxc.id_map = u 0 100000 65536
lxc.id_map = g 0 100000 65536

The above means that I have one uid map and one gid map defined for my container which will map uids and gids 0 through 65536 in the container to uids and gids 100000 through 165536 on the host.

For this to be allowed, I need to have those ranges assigned to my user at the system level with:

stgraber@castiana:~$ grep stgraber /etc/sub* 2>/dev/null
/etc/subgid:stgraber:100000:65536
/etc/subuid:stgraber:100000:65536

LXC has now been updated so that all the tools are aware of those unprivileged containers. The standard paths also have their unprivileged equivalents:

  • /etc/lxc/lxc.conf => ~/.config/lxc/lxc.conf
  • /etc/lxc/default.conf => ~/.config/lxc/default.conf
  • /var/lib/lxc => ~/.local/share/lxc
  • /var/lib/lxcsnaps => ~/.local/share/lxcsnaps
  • /var/cache/lxc => ~/.cache/lxc

Your user, while it can create new user namespaces in which it’ll be uid 0 and will have some of root’s privileges against resources tied to that namespace will obviously not be granted any extra privilege on the host.

One such thing is creating new network devices on the host or changing bridge configuration. To workaround that, we wrote a tool called “lxc-user-nic” which is the only SETUID binary part of LXC 1.0 and which performs one simple task.
It parses a configuration file and based on its content will create network devices for the user and bridge them. To prevent abuse, you can restrict the number of devices a user can request and to what bridge they may be added.

An example is my own /etc/lxc/lxc-usernet file:

stgraber veth lxcbr0 10

This declares that the user “stgraber” is allowed up to 10 veth type devices to be created and added to the bridge called lxcbr0.

Between what’s offered by the user namespace in the kernel and that setuid tool, we’ve got all that’s needed to run most distributions unprivileged.

Pre-requirements

All examples and instructions I’ll be giving below are expecting that you are running a perfectly up to date version of Ubuntu 14.04 (codename trusty). That’s a pre-release of Ubuntu so you may want to run it in a VM or on a spare machine rather than upgrading your production computer.

The reason to want something that recent is because the rough requirements for well working unprivileged containers are:

  • Kernel: 3.13 + a couple of staging patches (which Ubuntu has in its kernel)
  • User namespaces enabled in the kernel
  • A very recent version of shadow that supports subuid/subgid
  • Per-user cgroups on all controllers (which I turned on a couple of weeks ago)
  • LXC 1.0 beta2 or higher (released two days ago)
  • A version of PAM with a loginuid patch that’s yet to be in any released version

Those requirements happen to all be true of the current development release of Ubuntu as of two days ago.

LXC pre-built containers

User namespaces come with quite a few obvious limitations. For example in a user namespace you won’t be allowed to use mknod to create a block or character device as being allowed to do so would let you access anything on the host. Same thing goes with some filesystems, you won’t for example be allowed to do loop mounts or mount an ext partition, even if you can access the block device.

Those limitations while not necessarily world ending in day to day use are a big problem during the initial bootstrap of a container as tools like debootstrap, yum, … usually try to do some of those restricted actions and will fail pretty badly.

Some templates may be tweaked to work and workaround such as a modified fakeroot could be used to bypass some of those limitations but the goal of the LXC project isn’t to require all of our users to be distro engineers, so we came up with a much simpler solution.

I wrote a new template called “download” which instead of assembling the rootfs and configuration locally will instead contact a server which contains daily pre-built rootfs and configuration for most common templates.

Those images are built from our Jenkins server using a few machines I have on my home network (a set of powerful x86 builders and a quadcore ARM board). The actual build process is pretty straightforward, a basic chroot is assembled, then the current git master is downloaded, built and the standard templates are run with the right release and architecture, the resulting rootfs is compressed, a basic config and metadata (expiry, files to template, …) is saved, the result is pulled by our main server, signed with a dedicated GPG key and published on the public web server.

The client side is a simple template which contacts the server over https (the domain is also DNSSEC enabled and available over IPv6), grabs signed indexes of all the available images, checks if the requested combination of distribution, release and architecture is supported and if it is, grabs the rootfs and metadata tarballs, validates their signature and stores them in a local cache. Any container creation after that point is done using that cache until the time the cache entries expires at which point it’ll grab a new copy from the server.

The current list of images is (as can be requested by passing –list):

---
DIST      RELEASE   ARCH    VARIANT    BUILD
---
debian    wheezy    amd64   default    20140116_22:43
debian    wheezy    armel   default    20140116_22:43
debian    wheezy    armhf   default    20140116_22:43
debian    wheezy    i386    default    20140116_22:43
debian    jessie    amd64   default    20140116_22:43
debian    jessie    armel   default    20140116_22:43
debian    jessie    armhf   default    20140116_22:43
debian    jessie    i386    default    20140116_22:43
debian    sid       amd64   default    20140116_22:43
debian    sid       armel   default    20140116_22:43
debian    sid       armhf   default    20140116_22:43
debian    sid       i386    default    20140116_22:43
oracle    6.5       amd64   default    20140117_11:41
oracle    6.5       i386    default    20140117_11:41
plamo     5.x       amd64   default    20140116_21:37
plamo     5.x       i386    default    20140116_21:37
ubuntu    lucid     amd64   default    20140117_03:50
ubuntu    lucid     i386    default    20140117_03:50
ubuntu    precise   amd64   default    20140117_03:50
ubuntu    precise   armel   default    20140117_03:50
ubuntu    precise   armhf   default    20140117_03:50
ubuntu    precise   i386    default    20140117_03:50
ubuntu    quantal   amd64   default    20140117_03:50
ubuntu    quantal   armel   default    20140117_03:50
ubuntu    quantal   armhf   default    20140117_03:50
ubuntu    quantal   i386    default    20140117_03:50
ubuntu    raring    amd64   default    20140117_03:50
ubuntu    raring    armhf   default    20140117_03:50
ubuntu    raring    i386    default    20140117_03:50
ubuntu    saucy     amd64   default    20140117_03:50
ubuntu    saucy     armhf   default    20140117_03:50
ubuntu    saucy     i386    default    20140117_03:50
ubuntu    trusty    amd64   default    20140117_03:50
ubuntu    trusty    armhf   default    20140117_03:50
ubuntu    trusty    i386    default    20140117_03:50

The template has been carefully written to work on any system that has a POSIX compliant shell with wget. gpg is recommended but can be disabled if your host doesn’t have it (at your own risks).

The same template can be used against your own server, which I hope will be very useful for enterprise deployments to build templates in a central location and have them pulled by all the hosts automatically using our expiry mechanism to keep them fresh.

While the template was designed to workaround limitations of unprivileged containers, it works just as well with system containers, so even on a system that doesn’t support unprivileged containers you can do:

lxc-create -t download -n p1 -- -d ubuntu -r trusty -a amd64

And you’ll get a new container running the latest build of Ubuntu 14.04 amd64.

Using unprivileged LXC

Right, so let’s get you started, as I already mentioned, all the instructions below have only been tested on a very recent Ubuntu 14.04 (trusty) installation.
You may want to grab a daily build and run it in a VM.

Install the required packages:

  • sudo apt-get update
  • sudo apt-get dist-upgrade
  • sudo apt-get install lxc systemd-services uidmap

Then, assign yourself a set of uids and gids with:

  • sudo usermod --add-subuids 100000-165536 $USER
  • sudo usermod --add-subgids 100000-165536 $USER
  • sudo chmod +x $HOME

That last one is required because LXC needs it to access ~/.local/share/lxc/ after it switched to the mapped UIDs. If you’re using ACLs, you may instead use “u:100000:x” as a more specific ACL.

Now create ~/.config/lxc/default.conf with the following content:

lxc.network.type = veth
lxc.network.link = lxcbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
lxc.id_map = u 0 100000 65536
lxc.id_map = g 0 100000 65536

And /etc/lxc/lxc-usernet with:

<your username> veth lxcbr0 10

And that’s all you need. Now let’s create our first unprivileged container with:

lxc-create -t download -n p1 -- -d ubuntu -r trusty -a amd64

You should see the following output from the download template:

Setting up the GPG keyring
Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs

---
You just created an Ubuntu container (release=trusty, arch=amd64).
The default username/password is: ubuntu / ubuntu
To gain root privileges, please use sudo.

So looks like your first container was created successfully, now let’s see if it starts:

ubuntu@trusty-daily:~$ lxc-start -n p1 -d
ubuntu@trusty-daily:~$ lxc-ls --fancy
NAME  STATE    IPV4     IPV6     AUTOSTART  
------------------------------------------
p1    RUNNING  UNKNOWN  UNKNOWN  NO

It’s running! At this point, you can get a console using lxc-console or can SSH to it by looking for its IP in the ARP table (arp -n).

One thing you probably noticed above is that the IP addresses for the container aren’t listed, that’s because unfortunately LXC currently can’t attach to an unprivileged container’s namespaces. That also means that some fields of lxc-info will be empty and that you can’t use lxc-attach. However we’re looking into ways to get that sorted in the near future.

There are also a few problems with job control in the kernel and with PAM, so doing a non-detached lxc-start will probably result in a rather weird console where things like sudo will most likely fail. SSH may also fail on some distros. A patch has been sent upstream for this, but I just noticed that it doesn’t actually cover all cases and even if it did, it’s not in any released version yet.

Quite a few more improvements to unprivileged containers are to come until the final 1.0 release next month and while we certainly don’t expect all workloads to be possible with unprivileged containers, it’s still a huge improvement on what we had before and a very good building block for a lot more interesting use cases.

Posted in Canonical voices, LXC, Planet Ubuntu | Tagged | 111 Comments

LXC 1.0: Security features [6/10]

This is post 6 out of 10 in the LXC 1.0 blog post series.

When talking about container security most people either consider containers as inherently insecure or inherently secure. The reality isn’t so black and white and LXC supports a variety of technologies to mitigate most security concerns.

One thing to clarify right from the start is that you won’t hear any of the LXC maintainers tell you that LXC is secure so long as you use privileged containers. However, at least in Ubuntu, our default containers ship with what we think is a pretty good configuration of both the cgroup access and an extensive apparmor profile which prevents all attacks that we are aware of.

Below I’ll be covering the various technologies LXC supports to let you restrict what a container may do. Just keep in mind that unless you are using unprivileged containers, you shouldn’t give root access to a container to someone whom you’d mind having root access to your host.

Capabilities

The first security feature which was added to LXC was Linux capabilities support. With that feature you can set a list of capabilities that you want LXC to drop before starting the container or a full list of capabilities to retain (all others will be dropped).

The two relevant configurations options are:

  • lxc.cap.drop
  • lxc.cap.keep

Both are lists of capability names as listed in capabilities(7).

This may sound like a great way to make containers safe and for very specific cases it may be, however if running a system container, you’ll soon notice that dropping sys_admin and net_admin isn’t very practical and short of dropping those, you won’t make your container much safer (as root in the container will be able to re-grant itself any dropped capability).

In Ubuntu we use lxc.cap.drop to drop sys_module, mac_admin, mac_override, sys_time which prevent some known problems at container boot time.

Control groups

Control groups are interesting because they achieve multiple things which while interconnected are still pretty different:

  • Resource bean counting
  • Resource quotas
  • Access restrictions

The first two aren’t really security related, though resource quotas will let you avoid some obvious DoS of the host (by setting memory, cpu and I/O limits).

The last is mostly about the devices cgroup which lets you define which character and block devices a container may access and what it can do with them (you can restrict creation, read access and write access for each major/minor combination).

In LXC, configuring cgroups is done with the “lxc.cgroup.*” options which can roughly be defined as: lxc.cgroup.<controller>.<key> = <value>

For example to set a memory limit on p1 you’d add the following to its configuration:

lxc.cgroup.memory.limit_in_bytes = 134217728

This will set a memory limit of 128MB (the value is in bytes) and will be the equivalent to writing that same value to /sys/fs/cgroup/memory/lxc/p1/memory.limit_in_bytes

Most LXC templates only set a few devices controller entries by default:

# Default cgroup limits
lxc.cgroup.devices.deny = a
## Allow any mknod (but not using the node)
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m
## /dev/null and zero
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
## consoles
lxc.cgroup.devices.allow = c 5:0 rwm
lxc.cgroup.devices.allow = c 5:1 rwm
## /dev/{,u}random
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 1:9 rwm
## /dev/pts/*
lxc.cgroup.devices.allow = c 5:2 rwm
lxc.cgroup.devices.allow = c 136:* rwm
## rtc
lxc.cgroup.devices.allow = c 254:0 rm
## fuse
lxc.cgroup.devices.allow = c 10:229 rwm
## tun
lxc.cgroup.devices.allow = c 10:200 rwm
## full
lxc.cgroup.devices.allow = c 1:7 rwm
## hpet
lxc.cgroup.devices.allow = c 10:228 rwm
## kvm
lxc.cgroup.devices.allow = c 10:232 rwm

This configuration allows the container (usually udev) to create any device it wishes (that’s the wildcard “m” above) but block everything else (the “a” deny entry) unless it’s listed in one of the allow entries below. This covers everything a container will typically need to function.

You will find reasonably up to date documentation about the available controllers, control files and supported values at:
https://www.kernel.org/doc/Documentation/cgroups/

Apparmor

A little while back we added Apparmor profiles support to LXC.
The Apparmor support is rather simple, there’s one configuration option “lxc.aa_profile” which sets what apparmor profile to use for the container.

LXC will then setup the container and ask apparmor to switch it to that profile right before starting the container. Ubuntu’s LXC profile is rather complex as it aims to prevent any of the known ways of escaping a container or cause harm to the host.

As things are today, Ubuntu ships with 3 apparmor profiles meaning that the supported values for lxc.aa_profile are:

  • lxc-container-default (default value if lxc.aa_profile isn’t set)
  • lxc-container-default-with-nesting (same as default but allows some needed bits for nested containers)
  • lxc-container-default-with-mounting (same as default but allows mounting ext*, xfs and btrfs file systems).
  • unconfined (a special value which will disable apparmor support for the container)

You can also define your own by copying one of the ones in /etc/apparmor.d/lxc/, adding the bits you want, giving it a unique name, then reloading apparmor with “sudo /etc/init.d/apparmor reload” and finally setting lxc.aa_profile to the new profile’s name.

SELinux

The SELinux support is very similar to Apparmor’s. An SELinux context can be set using “lxc.se_context”.

An example would be:

lxc.se_context = unconfined_u:unconfined_r:lxc_t:s0-s0:c0.c1023

Similarly to Apparmor, LXC will switch to the new SELinux context right before starting init in the container. As far as I know, no distributions are setting a default SELinux context at this time, however most distributions build LXC with SELinux support (including Ubuntu, should someone choose to boot their host with SELinux rather than Apparmor).

Seccomp

Seccomp is a fairly recent kernel mechanism which allows for filtering of system calls.
As a user you can write a seccomp policy file and set it using “lxc.seccomp” in the container’s configuration. As always, this policy will only be applied to the running container and will allow or reject syscalls with a pre-defined return value.

An example (though limited and useless) of a seccomp policy file would be:

1
whitelist
103

Which would only allow syscall #103 (syslog) in the container and reject everything else.

Note that seccomp is a rather low level feature and only useful for some very specific use cases. All syscalls have to be referred by their ID instead of their name and those may change between architectures. Also, as things are today, if your host is 64bit and you load a seccomp policy file, all 32bit syscalls will be rejected. We’d need per-personality seccomp profiles to solve that but it’s not been a high priority so far.

User namespace

And last but not least, what’s probably the only way of making a container actually safe. LXC now has support for user namespaces. I’ll go into more details on how to use that feature in a later blog post but simply put, LXC is no longer running as root so even if an attacker manages to escape the container, he’d find himself having the privileges of a regular user on the host.

All this is achieved by assigning ranges of uids and gids to existing users. Those users on the host will then be allowed to clone a new user namespace in which all uids/gids are mapped to uids/gids that are part of the user’s range.

This obviously means that you need to allocate a rather silly amount of uids and gids to each user who’ll be using LXC in that way. In a perfect world, you’d allocate 65536 uids and gids per container and per user. As this would likely exhaust the whole uid/gid range rather quickly on some systems, I tend to go with “just” 65536 uids and gids per user that’ll use LXC and then have the same range shared by all containers.

Anyway, that’s enough details about user namespaces for now. I’ll cover how to actually set that up and use those unprivileged containers in the next post.

Posted in Canonical voices, LXC, Planet Ubuntu | Tagged | 20 Comments

LXC 1.0: Container storage [5/10]

This is post 5 out of 10 in the LXC 1.0 blog post series.

Storage backingstores

LXC supports a variety of storage backends (also referred to as backingstore).
It defaults to “none” which simply stores the rootfs under
/var/lib/lxc/<container>/rootfs but you can specify something else to lxc-create or lxc-clone with the -B option.

Currently supported values are:

directory based storage (“none” and “dir)

This is the default backingstore, the container rootfs is stored under
/var/lib/lxc/<container>/rootfs

The --dir option (when using “dir”) can be used to override the path.

btrfs

With this backingstore LXC will setup a new subvolume for the container which makes snapshotting much easier.

lvm

This one will use a new logical volume for the container.
The LV can be set with --lvname (the default is the container name).
The VG can be set with --vgname (the default is “lxc”).
The filesystem can be set with --fstype (the default is “ext4”).
The size can be set with --fssize (the default is “1G”).
You can also use LVM thinpools with --thinpool

overlayfs

This one is mostly used when cloning containers to create a container based on another one and storing any changes in an overlay.

When used with lxc-create it’ll create a container where any change done after its initial creation will be stored in a “delta0” directory next to the container’s rootfs.

zfs

Very similar to btrfs, as I’ve not used either of those myself I can’t say much about them besides that it should also create some kind of subvolume for the container and make snapshots and clones faster and more space efficient.

Standard paths

One quick word with the way LXC usually works and where it’s storing its files:

  • /var/lib/lxc (default location for containers)
  • /var/lib/lxcsnap (default location for snapshots)
  • /var/cache/lxc (default location for the template cache)
  • $HOME/.local/share/lxc (default location for unprivileged containers)
  • $HOME/.local/share/lxcsnap (default location for unprivileged snapshots)
  • $HOME/.cache/lxc (default location for unprivileged template cache)

The default path, also called lxcpath can be overridden on the command line with the -P option or once and for all by setting “lxcpath = /new/path” in /etc/lxc/lxc.conf (or $HOME/.config/lxc/lxc.conf for unprivileged containers).

The snapshot directory is always “snap” appended to lxcpath so it’ll magically follow lxcpath. The template cache is unfortunately hardcoded and can’t easily be moved short of relying on bind-mounts or symlinks.

The default configuration used for all containers at creation time is taken from
/etc/lxc/default.conf (no unprivileged equivalent yet).
The templates themselves are stored in /usr/share/lxc/templates.

Cloning containers

All those backingstores only really shine once you start cloning containers.
For example, let’s take our good old “p1” Ubuntu container and let’s say you want to make a usable copy of it called “p4”, you can simply do:

sudo lxc-clone -o p1 -n p4

And there you go, you’ve got a working “p4” container that’ll be a simple copy of “p1” but with a new mac address and its hostname properly set.

Now let’s say you want to do a quick test against “p1” but don’t want to alter that container itself, yet you don’t want to wait the time needed for a full copy, you can simply do:

sudo lxc-clone -o p1 -n p1-test -B overlayfs -s

And there you go, you’ve got a new “p1-test” container which is entirely based on the “p1” rootfs and where any change will be stored in the “delta0” directory of “p1-test”.
The same “-s” option also works with lvm and btrfs (possibly zfs too) containers and tells lxc-clone to use a snapshot rather than copy the whole rootfs across.

Snapshotting

So cloning is nice and convenient, great for things like development environments where you want throw away containers. But in production, snapshots tend to be a whole lot more useful for things like backup or just before you do possibly risky changes.

In LXC we have a “lxc-snapshot” tool which will let you create, list, restore and destroy snapshots of your containers.
Before I show you how it works, please note that “lxc-snapshot” currently doesn’t appear to work with directory based containers. With those it produces an empty snapshot, this should be fixed by the time LXC 1.0 is actually released.

So, let’s say we want to backup our “p1-lvm” container before installing “apache2” into it, simply run:

echo "before installing apache2" > snap-comment
sudo lxc-snapshot -n p1-lvm -c snap-comment

At which point, you can confirm the snapshot was created with:

sudo lxc-snapshot -n p1-lvm -L -C

Now you can go ahead and install “apache2” in the container.

If you want to revert the container at a later point, simply use:

sudo lxc-snapshot -n p1-lvm -r snap0

Or if you want to restore a snapshot as its own container, you can use:

sudo lxc-snapshot -n p1-lvm -r snap0 p1-lvm-snap0

And you’ll get a new “p1-lvm-snap0” container which will contain a working copy of “p1-lvm” as it was at “snap0”.

Posted in Canonical voices, LXC, Planet Ubuntu | Tagged | 23 Comments

LXC 1.0: Some more advanced container usage [4/10]

This is post 4 out of 10 in the LXC 1.0 blog post series.

Running foreign architectures

By default LXC will only let you run containers of one of the architectures supported by the host. That makes sense since after all, your CPU doesn’t know what to do with anything else.

Except that we have this convenient package called “qemu-user-static” which contains a whole bunch of emulators for quite a few interesting architectures. The most common and useful of those is qemu-arm-static which will let you run most armv7 binaries directly on x86.

The “ubuntu” template knows how to make use of qemu-user-static, so you can simply check that you have the “qemu-user-static” package installed, then run:

sudo lxc-create -t ubuntu -n p3 -- -a armhf

After a rather long bootstrap, you’ll get a new p3 container which will be mostly running Ubuntu armhf. I’m saying mostly because the qemu emulation comes with a few limitations, the biggest of which is that any piece of software using the ptrace() syscall will fail and so will anything using netlink. As a result, LXC will install the host architecture version of upstart and a few of the networking tools so that the containers can boot properly.

stgraber@castiana:~$ file /bin/ls
/bin/ls: ELF 64-bit LSB  executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.24, """BuildID[sha1]""" =e50e0a5dadb8a7f4eaa2fd715cacb9842e157dc7, stripped
stgraber@castiana:~$ sudo lxc-start -n p3 -d
stgraber@castiana:~$ sudo lxc-attach -n p3
root@p3:/# file /bin/ls
/bin/ls: ELF 32-bit LSB  executable, ARM, EABI5 version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, """BuildID[sha1]""" =88ff013a8fd9389747fb1fea1c898547fb0f650a, stripped
root@p3:/# exit
stgraber@castiana:~$ sudo lxc-stop -n p3
stgraber@castiana:~$

Hooks

As we know people like to script their containers and that our configuration can’t always accommodate every single use case, we’ve introduced a set of hooks which you may use.

Those hooks are simple paths to an executable file which LXC will run at some specific time in the lifetime of the container. Those executables will also be passed a set of useful environment variables so they can easily know what container invoked them and what to do.

The currently available hooks are (details in lxc.conf(5)):

  • lxc.hook.pre-start (called before any initialization is done)
  • lxc.hook.pre-mount (called after creating the mount namespace but before mounting anything)
  • lxc.hook.mount (called after the mounts but before pivot_root)
  • lxc.hook.autodev (identical to mount but only called if using autodev)
  • lxc.hook.start (called in the container right before /sbin/init)
  • lxc.hook.post-stop (run after the container has been shutdown)
  • lxc.hook.clone (called when cloning a container into a new one)

Additionally each network section may also define two additional hooks:

  • lxc.network.script.up (called in the network namespace after the interface was created)
  • lxc.network.script.down (called in the network namespace before destroying the interface)

All of those hooks may be specified as many times as you want in the configuration so you can use each hooking point multiple times.

As a simple example, let’s add the following to our “p1” container:

lxc.hook.pre-start = /var/lib/lxc/p1/pre-start.sh

And create the hook itself at /var/lib/lxc/p1/pre-start.sh:

#!/bin/sh
echo "arguments: $*" > /tmp/test
echo "environment:" >> /tmp/test
env | grep LXC >> /tmp/test

Make it executable (chmod 755) and then start the container.
Checking /tmp/test you should see:

arguments: p1 lxc pre-start
environment:
LXC_ROOTFS_MOUNT=/usr/lib/x86_64-linux-gnu/lxc
LXC_CONFIG_FILE=/var/lib/lxc/p1/config
LXC_ROOTFS_PATH=/var/lib/lxc/p1/rootfs
LXC_NAME=p1

Android containers

I’ve often been asked whether it was possible to run Android in an LXC container. Well, the short answer is yes. However it’s not very simple and it really depends on what you want to do with it.

The first thing you’ll need if you want to do this is get your machine to run an Android kernel, you’ll need to have any modules needed by Android built and loaded before you can start the container.

Once you have that, you’ll need to create a new container by hand.
Let’s put it in “/var/lib/lxc/android/”, in there, you need a configuration file similar to this one:

lxc.rootfs = /var/lib/lxc/android/rootfs
lxc.utsname = armhf

lxc.network.type = none

lxc.devttydir = lxc
lxc.tty = 4
lxc.pts = 1024
lxc.arch = armhf
lxc.cap.drop = mac_admin mac_override
lxc.pivotdir = lxc_putold

lxc.hook.pre-start = /var/lib/lxc/android/pre-start.sh

lxc.aa_profile = unconfined

/var/lib/lxc/android/pre-start.sh is where the interesting bits happen. It needs to be an executable shell script, containing something along the lines of:

#!/bin/sh
mkdir -p $LXC_ROOTFS_PATH
mount -n -t tmpfs tmpfs $LXC_ROOTFS_PATH

cd $LXC_ROOTFS_PATH
cat /var/lib/lxc/android/initrd.gz | gzip -d | cpio -i

# Create /dev/pts if missing
mkdir -p $LXC_ROOTFS_PATH/dev/pts

Then get the initrd for your device and place it in /var/lib/lxc/android/initrd.gz.

At that point, when starting the LXC container, the Android initrd will be unpacked on a tmpfs (similar to Android’s ramfs) and Android’s init will be started which in turn should mount any partition that Android requires and then start all of the usual services.

Because there are no apparmor, cgroup or even network configuration applied to it, the container will have a lot of rights and will typically completely crash the machine. You unfortunately have to be familiar with the way Android works and not be afraid to modify its init scripts if not even its init process to only start the bits you actually want.

I can’t provide a generic recipe there as it completely depends on what you’re interested on, what version of Android and what device you’re using. But it’s clearly possible to do and you may want to look at Ubuntu Touch to see how we’re doing it by default there.

One last note, Android’s init script isn’t in /sbin/init, so you need to tell LXC where to load it with:

lxc-start -n android -- /init

LXC on Android devices

So now that we’ve seen how to run Android in LXC, let’s talk about running Ubuntu on Android in LXC.

LXC has been ported to bionic (Android’s C library) and while not feature-equivalent with its glibc build, it’s still good enough to be used.

Unfortunately due to the kind of low level access LXC requires and the fact that our primary focus isn’t Android, installation could be easier…You won’t be finding LXC on the Google PlayStore and we won’t provide you with a .apk that you can install.

Instead every time something changes in the upstream git branch, we produce a new tarball which can be downloaded here: https://jenkins.linuxcontainers.org/view/LXC/view/LXC%20builds/job/lxc-build-android/lastSuccessfulBuild/artifact/lxc-android.tar.gz

This build is known to work with Android >= 4.2 but will quite likely work on older versions too.

For this to work, you’ll need to grab your device’s kernel configuration and run lxc-checkconfig against it to see whether it’s compatible with LXC or not. Unfortunately it’s very likely that it won’t be… In that case, you’ll need to go hunt for the kernel source for your device, add the missing feature flags, rebuild it and update your device to boot your updated kernel.

As scary as this may sound, it’s usually not that difficult as long as your device is unlocked and you’re already using an alternate ROM like Cyanogen which usually make their kernel git tree easily available.

Once your device has a working kernel, all you need to do is unpack our tarball as root in your device’s / directory, copy an arm container to /data/lxc/containers/<container name>, get into /data/lxc and run “./run-lxc lxc-start -n <container name>”.
A few seconds later you’ll be greeted by a login prompt.

Posted in Canonical voices, LXC, Planet Ubuntu, Ubuntu Touch | Tagged | 32 Comments