Introduction
As you may know, LXD uses unprivileged containers by default.
The difference between an unprivileged container and a privileged one is whether the root user in the container is the “real” root user (uid 0 at the kernel level).
The way unprivileged containers are created is by taking a set of normal UIDs and GIDs from the host, usually at least 65536 of each (to be POSIX compliant), and mapping those into the container.
The most common example and what most LXD users will end up with by default is a map of 65536 UIDs and GIDs, with a host base id of 100000. This means that root in the container (uid 0) will be mapped to the host uid 100000 and uid 65535 in the container will be mapped to uid 165535 on the host. UID/GID 65536 and higher in the container aren’t mapped and will return an error if you attempt to use them.
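The arithmetic is a straight offset. A minimal shell sketch of the default map described above (base 100000, range 65536):

```shell
# Map a container uid to its host uid under the default map:
# container uids 0-65535 are offset by the host base id of 100000.
base=100000
range=65536

map_uid() {
    if [ "$1" -lt "$range" ]; then
        echo $((base + $1))
    else
        echo "unmapped"
    fi
}

map_uid 0        # root in the container -> 100000 on the host
map_uid 65535    # last mapped uid       -> 165535 on the host
map_uid 65536    # outside the map       -> "unmapped"
```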
From a security point of view, that means that anything which is not owned by the users and groups mapped into the container will be inaccessible. Any such resource will show up as being owned by uid/gid “-1” (rendered as 65534 or nobody/nogroup in userspace). It also means that should there be a way to escape the container, even root in the container would find itself with just as much privileges on the host as a nobody user.
LXD does offer a number of options related to unprivileged configuration:
- Increasing the size of the default uid/gid map
- Setting up per-container maps
- Punching holes into the map to expose host users and groups
Increasing the size of the default map
As mentioned above, in most cases, LXD will have a default map that’s made of 65536 uids/gids.
In most cases you won’t have to change that. There are however a few cases where you may have to:
- You need access to uids/gids higher than 65535. This is most common when using network authentication inside of your containers.
- You want to use per-container maps, in which case you’ll need 65536 available uids/gids per container.
- You want to punch some holes in your container’s map and need access to host uids/gids.
The default map is usually controlled by the “shadow” set of utilities and files. On systems where that’s the case, the “/etc/subuid” and “/etc/subgid” files are used to configure those maps.
On systems that do not have a recent enough version of the “shadow” package, LXD will assume that it doesn’t have to share uid/gid ranges with anything else and will therefore assume control of a billion uids and gids, starting at the host uid/gid 100000.
The common case, however, is a system with a recent version of shadow.
An example of what the configuration may look like is:
stgraber@castiana:~$ cat /etc/subuid
lxd:100000:65536
root:100000:65536

stgraber@castiana:~$ cat /etc/subgid
lxd:100000:65536
root:100000:65536
The maps for “lxd” and “root” should always be kept in sync. LXD itself is restricted by the “root” allocation. The “lxd” entry is used to track what needs to be removed if LXD is uninstalled.
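A quick way to verify the two entries are in sync is to compare them directly. A sketch (run here against a copy of the default configuration, so it works without touching the real /etc/subuid):

```shell
# check_sync prints "in sync" when the lxd and root entries of a
# subuid/subgid-style file carry the same range, "mismatch" otherwise.
check_sync() {
    lxd_map=$(grep '^lxd:' "$1" | cut -d: -f2-)
    root_map=$(grep '^root:' "$1" | cut -d: -f2-)
    if [ "$lxd_map" = "$root_map" ]; then
        echo "in sync"
    else
        echo "mismatch"
    fi
}

# Example using the default configuration shown above:
printf 'lxd:100000:65536\nroot:100000:65536\n' > /tmp/subuid.example
check_sync /tmp/subuid.example    # in sync
```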
Now if you want to increase the size of the map available to LXD, simply edit both of the files and bump the last value from 65536 to whatever size you need. I tend to bump it to a billion just so I don’t ever have to think about it again:
stgraber@castiana:~$ cat /etc/subuid
lxd:100000:1000000000
root:100000:1000000000

stgraber@castiana:~$ cat /etc/subgid
lxd:100000:1000000000
root:100000:1000000000
After altering those files, you need to restart LXD to have it detect the new map:
root@vorash:~# systemctl restart lxd
root@vorash:~# cat /var/log/lxd/lxd.log
lvl=info msg="LXD 2.14 is starting in normal mode" path=/var/lib/lxd t=2017-06-14T21:21:13+0000
lvl=warn msg="CGroup memory swap accounting is disabled, swap limits will be ignored." t=2017-06-14T21:21:13+0000
lvl=info msg="Kernel uid/gid map:" t=2017-06-14T21:21:13+0000
lvl=info msg=" - u 0 0 4294967295" t=2017-06-14T21:21:13+0000
lvl=info msg=" - g 0 0 4294967295" t=2017-06-14T21:21:13+0000
lvl=info msg="Configured LXD uid/gid map:" t=2017-06-14T21:21:13+0000
lvl=info msg=" - u 0 1000000 1000000000" t=2017-06-14T21:21:13+0000
lvl=info msg=" - g 0 1000000 1000000000" t=2017-06-14T21:21:13+0000
lvl=info msg="Connecting to a remote simplestreams server" t=2017-06-14T21:21:13+0000
lvl=info msg="Expiring log files" t=2017-06-14T21:21:13+0000
lvl=info msg="Done expiring log files" t=2017-06-14T21:21:13+0000
lvl=info msg="Starting /dev/lxd handler" t=2017-06-14T21:21:13+0000
lvl=info msg="LXD is socket activated" t=2017-06-14T21:21:13+0000
lvl=info msg="REST API daemon:" t=2017-06-14T21:21:13+0000
lvl=info msg=" - binding Unix socket" socket=/var/lib/lxd/unix.socket t=2017-06-14T21:21:13+0000
lvl=info msg=" - binding TCP socket" socket=[::]:8443 t=2017-06-14T21:21:13+0000
lvl=info msg="Pruning expired images" t=2017-06-14T21:21:13+0000
lvl=info msg="Updating images" t=2017-06-14T21:21:13+0000
lvl=info msg="Done pruning expired images" t=2017-06-14T21:21:13+0000
lvl=info msg="Done updating images" t=2017-06-14T21:21:13+0000
root@vorash:~#
As you can see, the configured map is logged at LXD startup and can be used to confirm that the reconfiguration worked as expected.
You’ll then need to restart your containers to have them start using your newly expanded map.
Per container maps
Provided that you have a sufficient amount of uid/gid allocated to LXD, you can configure your containers to use their own, non-overlapping allocation of uids and gids.
This can be useful for two reasons:
- You are running software which alters kernel resource ulimits.
Those user-specific limits are tied to a kernel uid and will cross container boundaries, leading to hard-to-debug issues where one container can perform an action but all others are then unable to do the same.
- You want to know that should there be a way for someone in one of your containers to somehow get access to the host, they still won’t be able to access or interact with any of the other containers.
The main downsides to using this feature are:
- It’s somewhat wasteful, using 65536 uids and gids per container. That being said, you’d still be able to run over 60000 isolated containers before running out of system uids and gids.
- It’s effectively impossible to share storage between two isolated containers, as everything written by one will be seen as -1 by the other. There is ongoing work around virtual filesystems in the kernel that will eventually let us get rid of that limitation.
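The “over 60000” figure is simple division, assuming LXD is handed roughly the full 32-bit id space (about 4 billion ids; the exact total depends on your allocation):

```shell
# How many isolated containers fit, at 65536 uids/gids each,
# in an approximately 4-billion-id allocation.
total=4000000000
per_container=65536
echo $((total / per_container))    # 61035
```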
To have a container use its own distinct map, simply run:
stgraber@castiana:~$ lxc config set test security.idmap.isolated true
stgraber@castiana:~$ lxc restart test
stgraber@castiana:~$ lxc config get test volatile.last_state.idmap
[{"Isuid":true,"Isgid":false,"Hostid":165536,"Nsid":0,"Maprange":65536},{"Isuid":false,"Isgid":true,"Hostid":165536,"Nsid":0,"Maprange":65536}]
The restart step is needed to have LXD remap the entire filesystem of the container to its new map.
Note that this step will take a varying amount of time depending on the number of files in the container and the speed of your storage.
As can be seen above, after restart, the container is shown to have its own map of 65536 uids/gids.
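The Hostid of 165536 in the output above isn’t arbitrary: isolated containers are allocated right after the default map, so the first one starts at the base id plus one full default range:

```shell
# First isolated container's host base id: default base + default range.
base=100000
default_range=65536
echo $((base + default_range))    # 165536
```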
If you want LXD to allocate more than the default 65536 uids/gids to an isolated container, you can bump the size of the allocation with:
stgraber@castiana:~$ lxc config set test security.idmap.size 200000
stgraber@castiana:~$ lxc restart test
stgraber@castiana:~$ lxc config get test volatile.last_state.idmap
[{"Isuid":true,"Isgid":false,"Hostid":165536,"Nsid":0,"Maprange":200000},{"Isuid":false,"Isgid":true,"Hostid":165536,"Nsid":0,"Maprange":200000}]
If you’re trying to allocate more uids/gids than are left in LXD’s allocation, LXD will let you know:
stgraber@castiana:~$ lxc config set test security.idmap.size 2000000000
error: Not enough uid/gid available for the container.
Direct user/group mapping
The fact that all uids/gids in an unprivileged container are mapped to a normally unused range on the host means that sharing of data between host and container is effectively impossible.
Now, what if you want to share your user’s home directory with a container?
The obvious answer to that is to define a new “disk” entry in LXD which passes your home directory to the container:
stgraber@castiana:~$ lxc config device add test home disk source=/home/stgraber path=/home/ubuntu
Device home added to test
So that was pretty easy, but did it work?
stgraber@castiana:~$ lxc exec test -- bash
root@test:~# ls -lh /home/
total 529K
drwx--x--x 45 nobody nogroup 84 Jun 14 20:06 ubuntu
No. The mount is clearly there, but it’s completely inaccessible to the container.
To fix that, we need to take a few extra steps:
- Allow LXD’s use of our user uid and gid
- Restart LXD to have it load the new map
- Set a custom map for our container
- Restart the container to have the new map apply
stgraber@castiana:~$ printf "lxd:$(id -u):1\nroot:$(id -u):1\n" | sudo tee -a /etc/subuid
lxd:201105:1
root:201105:1

stgraber@castiana:~$ printf "lxd:$(id -g):1\nroot:$(id -g):1\n" | sudo tee -a /etc/subgid
lxd:200512:1
root:200512:1

stgraber@castiana:~$ sudo systemctl restart lxd
stgraber@castiana:~$ printf "uid $(id -u) 1000\ngid $(id -g) 1000" | lxc config set test raw.idmap -
stgraber@castiana:~$ lxc restart test
At which point, things should be working in the container:
stgraber@castiana:~$ lxc exec test -- su ubuntu -l
ubuntu@test:~$ ls -lh
total 119K
drwxr-xr-x  5 ubuntu ubuntu  8 Feb 18  2016 data
drwxr-x---  4 ubuntu ubuntu  6 Jun 13 17:05 Desktop
drwxr-xr-x  3 ubuntu ubuntu 28 Jun 13 20:09 Downloads
drwx------ 84 ubuntu ubuntu 84 Sep 14  2016 Maildir
drwxr-xr-x  4 ubuntu ubuntu  4 May 20 15:38 snap
ubuntu@test:~$
Conclusion
User namespaces, the kernel feature that makes those uid/gid mappings possible, are a very powerful tool which finally made containers on Linux safe by design. They are however not the easiest thing to wrap your head around, and all of that uid/gid map math can quickly become a major issue.
In LXD we’ve tried to expose just enough of those underlying features to be useful to our users while doing the actual mapping math internally. This makes things like the direct user/group mapping above significantly easier than it otherwise would be.
Going forward, we’re very interested in some of the work around uid/gid remapping at the filesystem level. This would let us decouple the on-disk user/group map from that used for processes, making it possible to share data between differently mapped containers and alter the various maps without needing to also remap the entire filesystem.
Extra information
The main LXD website is at: https://linuxcontainers.org/lxd
Development happens on Github at: https://github.com/lxc/lxd
Discussion forum: https://discuss.linuxcontainers.org
Mailing-list support happens on: https://lists.linuxcontainers.org
IRC support happens in: #lxcontainers on irc.freenode.net
Try LXD online: https://linuxcontainers.org/lxd/try-it
Got it! Thank you for your explanations!
Hello,
first of all: Thank you for your great post. The part “Direct user/group mapping” is exactly what I’m looking for, but it’s not working for me. I’m trying to map UID 1001 from the host to the container’s UID 1000.
I executed the following commands:
printf "lxd:1001:1\nroot:1001:1\n" | sudo tee -a /etc/subuid
printf "lxd:1001:1\nroot:1001:1\n" | sudo tee -a /etc/subgid
sudo systemctl restart lxd
printf "uid 1001 1000\ngid 1001 1000" | lxc config set raw.idmap -
lxc restart
But if I have a look at my disk in the container I still see uid 1001 instead of user ubuntu (with uid 1000). What am I doing wrong?
Found it by myself: the container had security.privileged set to true from earlier tests. With security.privileged false it’s now working.
Castiana? I’ll name mine Taoth Vaclarush 🙂
Many thanks
Thanks for the guide. In the section “Direct user/group mapping”, you mention the steps:
1) Allow LXD’s use of our user uid and gid
2) Set a custom map for our container
I am understanding that the users root and lxd on the host can be effectively mapped to the user with id 1000 in the container. Why is this required?
For instance, for the use-case of getting the correct permissions on the mounted disk device which the section is about, I did not do step 1) and I still got the correct permissions. Why is this step required? Furthermore, in the documentation for subuid (man 5 subuid), it is mentioned that the 3rd parameter is the size of the range of UIDs that can be mapped to the given user (1st parameter), so why did you restrict it to 1?
Thanks in advance.
I changed a container (103) to unprivileged.
Then I mounted a mergerFS.
The directories in it are set to uid/gid 100000:100000 (host side).
Now I see the directories in the LXC and can write to them (as root).
But I have to change the owner in the container to 2001:2001. At the moment it’s 0:0
But if I enter as root in the (container) terminal
chown -R 2001:2001 /home/data
comes the error message:
chown: changing ownership of ‘/home/daten’: Operation not permitted
I need to change the owner in the host.
chown -R 102001:102001 mnt/home/data
The same thing happens with chmod 755 /home/data
What can be the reason?