Gathering LXC and Docker containers metrics

December 8, 2025 · 2664 words · 13 min

Linux Containers rely on control groups which not only track groups of processes, but also expose a lot of metrics about CPU, memory, and block I/O usage. We will see how to access those metrics, and how to obtain network usage metrics as well. This is relevant for “pure” LXC containers, as well as for Docker containers.

Control groups are exposed through a pseudo-filesystem. In recent distros, you should find this filesystem under . Under that directory, you will see multiple sub-directories, called , , , etc.; each sub-directory actually corresponds to a different cgroup hierarchy. On older systems, the control groups might be mounted on , without distinct hierarchies. In that case, instead of seeing the sub-directories, you will see a bunch of files in that directory, and possibly some directories corresponding to existing containers.

To figure out where your control groups are mounted, you can run:

The fact that different control groups can be in different hierarchies means that you can use completely different groups (and policies) for e.g. CPU allocation and memory allocation. Let’s make up a completely imaginary example: you have a 2-CPU system running Python webapps with Gunicorn, a PostgreSQL database, and accepting SSH logins. You can put each webapp and each SSH session in their own memory control group (to make sure that a single app or user doesn’t use up the memory of the whole system), and at the same time, stick the webapps and database on a CPU, and the SSH logins on another CPU.

Of course, if you run LXC containers, each hierarchy will have one group per container, and all hierarchies will look the same.

Merging or splitting hierarchies is achieved by using special options when mounting the cgroup pseudo-filesystems. Note that if you want to change that, you will have to remove all existing cgroups in the hierarchies that you want to split or merge.
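
Since hierarchies are defined by how the cgroup pseudo-filesystems are mounted, listing those mounts shows both where the control groups live and which subsystems share a hierarchy. Here is a minimal Python sketch of that idea. It assumes the mount table uses the usual whitespace-separated layout (device, mountpoint, filesystem type, options) found in `/proc/mounts`; the function name `find_cgroup_mounts` and the sample mount table are made up for illustration and are not taken from the original post.

```python
# Illustrative mount table; the mountpoints and options here are assumptions,
# your system may use different ones.
SAMPLE_MOUNTS = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
"""


def find_cgroup_mounts(mounts_text):
    """Return {mountpoint: [mount options]} for every cgroup filesystem.

    The subsystems attached to a hierarchy appear among the mount options,
    so two subsystems listed on the same line share one hierarchy.
    """
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields: device, mountpoint, fstype, options, dump, pass
        if len(fields) >= 4 and fields[2] == "cgroup":
            mounts[fields[1]] = fields[3].split(",")
    return mounts
```

On a real host you would pass the contents of the mount table instead of the sample, e.g. `find_cgroup_mounts(open("/proc/mounts").read())`. Parsing text rather than opening the file inside the function keeps the sketch easy to test.
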

You can look into to see the different control group subsystems known to the system, the hierarchy they belong to, and how many groups they contain. You can also look at to see which control groups a process belongs to. The control group will be shown as a path relative to the root of the hierarchy mountpoint; e.g. means “this process has not been assigned into a particular group”, while means that the process is likely to be a member of a container named .

For each container, one cgroup will be created in each hierarchy. On older systems with older versions of the LXC userland tools, the name of the cgroup will be the name of the container. With more recent versions of the LXC tools, the cgroup will be .

Additional note for Docker users: the container name will be the or of the container. If a container shows up as in , its long ID might be something like . You can look it up with or .

Putting everything together: on my system, if I want to look at the memory metrics for a Docker container, I have to look at .

For each subsystem, we will find one pseudo-file (in some cases, multiple) containing statistics about used memory, accumulated CPU cycles, or number of I/O operations completed. Those files are easy to parse, as we will see.

Those will be found in the cgroup (duh!). Note that the memory control group adds a little overhead, because it does very fine-grained accounting of the memory usage on your system. Therefore, many distros chose not to enable it by default. Generally, to enable it, all you have to do is to add some kernel command-line parameters: .

The metrics are in the pseudo-file . Here is what it will look like:

The first half (without the prefix) contains statistics relevant to the processes within the cgroup, excluding sub-cgroups. The second half (with the prefix) includes sub-cgroups as well.

Some metrics are “gauges”, i.e. values that can increase or decrease (e.g. , the amount of swap space used by the members of the cgroup). Some others are “counters”, i.e. values that can only go up, because they represent occurrences of a specific event (e.g. , which indicates the number of page faults which happened since the creation of the cgroup; this number can never decrease).

Let’s see what those metrics stand for. All memory amounts are in bytes (except for event counters).

Accounting for memory in the page cache is very complex. If two processes in different control groups both read the same file (ultimately relying on the same blocks on disk), the corresponding memory charge will be split between the control groups. It’s nice, but it also means that when a cgroup is terminated, it could increase the memory usage of another cgroup, because they are not splitting the cost anymore for those memory pages.

Now that we’ve covered memory metrics, everything else will look very simple in comparison. CPU metrics will be found in the controller.

For each container, you will find a pseudo-file , containing the CPU usage accumulated by the processes of the container, broken down between and time. If you’re not familiar with the distinction, is the time during which the processes were in direct control of the CPU (i.e. executing process code), and is the time during which the CPU was executing system calls on behalf of those processes.

Those times are expressed in ticks of 1/100th of a second. (Actually, they are expressed in “user jiffies”. There are per second, and on x86 systems, is 100. This used to map exactly to the number of scheduler “ticks” per second; but with the advent of higher frequency scheduling, as well as , the number of kernel ticks wasn’t relevant anymore. It stuck around anyway, mainly for legacy and compatibility reasons.)

Block I/O is accounted in the controller. Different metrics are scattered across different files. While you can find in-depth details in the blkio-controller file in the kernel documentation, here is a short list of the most relevant ones:

For each file, there is a variant, that aggregates the metrics of the control group and all its sub-cgroups.

Also, it’s worth mentioning that in most cases, if the processes of a control group have not done any I/O on a given block device, the block device will not appear in the pseudo-files. In other words, you have to be careful each time you parse one of those files, because new entries might have appeared since the previous time.

Interestingly, network metrics are not exposed directly by control groups. There is a good explanation for that: network interfaces exist within the context of network namespaces. The kernel could probably accumulate metrics about packets and bytes sent and received by a group of processes, but those metrics wouldn’t be very useful. You want (at least!) per-interface metrics (because traffic happening on the local interface doesn’t really count). But since processes in a single cgroup can belong to multiple network namespaces, those metrics would be harder to interpret: multiple network namespaces means multiple interfaces, potentially multiple interfaces, etc.; so this is why there is no easy way to gather network metrics with control groups.

So what shall we do? Well, we have multiple options.

When people think about iptables, they usually think about firewalling, and maybe NAT scenarios. But iptables (or rather, the framework for which iptables is just an interface) can also do some serious accounting.

For instance, you can set up a rule to account for the outbound HTTP traffic on a web server:

There is no or flag, so the rule will just count matched packets and go to the following rule.

Later, you can check the values of the counters, with:

(Technically, is not required, but it will prevent iptables from doing DNS reverse lookups, which are probably useless in this scenario.)

Counters include packets and bytes. If you want to set up metrics for container traffic like this, you could execute a loop to add two rules per container IP address (one in each direction), in the chain. This will only meter traffic going through the NAT layer; you will also have to add traffic going through the userland proxy.

Then, you will need to check those counters on a regular basis. If you happen to use , there is a nice plugin to automate iptables counters collection.

Since each container has a virtual Ethernet interface, you might want to check the TX and RX counters of this interface directly. However, this is not as easy as it sounds. If you use Docker (as of current version 0.6) or , then you will notice that each container is associated with a virtual Ethernet interface in your host, with a name like . Figuring out which interface corresponds to which container is, unfortunately, difficult. (If you know an easy way, let me know.)

In the long run, Docker will probably take over the setup of those virtual interfaces. It will keep track of their names, and make sure that it can easily associate containers with their respective interfaces.

But for now, the best way is to check the metrics from within the containers. I’m not talking about running a special agent in the container, or anything like that. We are going to run an executable from the host environment, but within the network namespace of a container.

To do that, we will use the command. This command will let you execute any program (present in the host system) within any network namespace visible to the current process.

This means that your host will be able to enter the network namespace of your containers, but your containers won’t be able to access the host, nor their sibling containers. Containers will be able to “see” and affect their sub-containers, though.

The exact format of the command is:

For instance:

How does the naming system work? How does find ? Answer: by using the namespaces pseudo-files. Each process belongs to one network namespace, one PID namespace, one namespace, etc.; and those namespaces are materialized under . For instance, the network namespace of PID 42 is materialized by the pseudo-file .

When you run , it expects to be one of those pseudo-files. (Symlinks are accepted.)

In other words, to execute a command within the network namespace of a container, we need to:

Now, we need to figure out a way to find the PID of a process (any process!) running in the container that we want to investigate. This is actually very easy. You have to locate one of the control groups corresponding to the container. We explained how to locate those cgroups in the beginning of this post, so we won’t cover that again.

On my machine, a control group will typically be located in . Within that directory, you will find a pseudo-file called . It contains the list of the PIDs that are in the control group, i.e., in the container. We can take any of them; so the first one will do.

If the “short ID” of a container is held in the environment variable , here is a small shell snippet that puts everything together:

The same mechanism is used in to set up network interfaces within containers.

Note that running a new process each time you want to update metrics is (relatively) expensive. If you want to collect metrics at high resolutions, and/or over a large number of containers (think 1000 containers on a single host), you do not want to fork a new process each time.

Here is how to collect metrics from a single process.

You will have to write your metric collector in C (or any language that lets you do low-level system calls). You need to use a special system call, , which lets the current process enter any arbitrary namespace. It requires, however, an open file descriptor to the namespace pseudo-file (remember: that’s the pseudo-file in ).

However, there is a catch: you must not keep this file descriptor open. If you do, when the last process of the control group exits, the namespace will not be destroyed, and its network resources (like the virtual interface of the container) will stay around forever (or until you close that file descriptor).

The right approach would be to keep track of the first PID of each container, and re-open the namespace pseudo-file each time.

Sometimes, you do not care about real time metric collection, but when a container exits, you want to know how much CPU, memory, etc. it has used.

The current implementation of Docker (as of 0.6) makes this particularly challenging, because it relies on , and when a container stops, carefully cleans up behind it. If you really want to collect the metrics anyway, here is how. For each container, start a collection process, and move it to the control groups that you want to monitor by writing its PID to the file of the cgroup. The collection process should periodically re-read the file to check if it’s the last process of the control group. (If you also want to collect network statistics as explained in the previous section, you should also move the process to the appropriate network namespace.)

When the container exits, will try to delete the control groups. It will fail, since the control group is still in use; but that’s fine. Your process should now detect that it is the only one remaining in the group. Now is the right time to collect all the metrics you need!

Finally, your process should move itself back to the root control group, and remove the container control group.
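
The “am I the last process in the group?” check and the final metrics snapshot described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the cgroup's PID-list pseudo-file is assumed to contain one numeric ID per line, the statistics pseudo-files are assumed to use a flat `key value` layout, and the collector is assumed to be single-threaded (a multi-threaded collector could show up under several IDs). The function names and the sample values are hypothetical, not part of any existing tool.

```python
def is_last_process(pid_list_text, own_pid):
    """Return True when own_pid is the only ID left in the cgroup's PID list.

    pid_list_text is the raw content of the cgroup's PID-list pseudo-file.
    """
    pids = {int(token) for token in pid_list_text.split()}
    return pids == {own_pid}


def parse_flat_stat(stat_text):
    """Parse a flat "key value" statistics pseudo-file into a dict of ints.

    Lines that do not consist of exactly two fields are skipped, so the
    sketch tolerates blank lines or unexpected entries.
    """
    stats = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats
```

A collection process would re-read the PID list at some interval, call `is_last_process(content, os.getpid())`, and once it returns `True`, read each statistics file it cares about and pass the content to `parse_flat_stat` before cleaning up.
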

To remove a control group, just remove its directory. It’s counter-intuitive to remove a directory as it still contains files; but remember that this is a pseudo-filesystem, so usual rules don’t apply. After the cleanup is done, the collection process can exit safely.

As you can see, collecting metrics when a container exits can be tricky; for this reason, it is usually easier to collect metrics at regular intervals (e.g. every minute, with the collectd LXC plugin) and rely on that instead.

To recap, we covered:

As we have seen, metrics collection is not insanely difficult, but still involves many complicated steps, with special cases like those for the network subsystem. Docker will take care of this, or at least expose hooks to make it more straightforward. It is one of the reasons why we repeat over and over “Docker is not production ready yet”: it’s fine to skip metrics for development, continuous testing, or staging environments, but it’s definitely not fine to run production services without metrics!

Last but not least, note that even with all that information, you will still need a storage and graphing system for those metrics. There are many such systems out there. If you want something that you can deploy on your own, you can check e.g. or . There are also “metrics-as-a-Service” offerings. Those services will store your metrics and let you query them in various ways, for a given price. Some examples include , , , and many more.

Jérôme is a senior engineer at dotCloud, where he rotates between Ops, Support and Evangelist duties and has earned the nickname of “master Yoda”. In a previous life he built and operated large scale Xen hosting back when EC2 was , supervised the deployment of fiber interconnects through the French subway, built a specialized GIS to visualize fiber infrastructure, specialized in commando deployments of large-scale computer systems in bandwidth-constrained environments such as conference centers, and various other feats of technical wizardry. He cares for the servers powering dotCloud, helps our users feel at home on the platform, and documents the many ways to use dotCloud in articles, tutorials and sample applications. He’s also an avid dotCloud power user who has deployed just about anything on dotCloud – look for one of his many custom services on our Github repository.