A Tragedy Caused by a Single Kubernetes Command

zouyee
7 min read · Feb 3, 2024

Description

Due to the CentOS EOL, we spent much of last year migrating to a new OS internally, and we decided to take the opportunity to move from cgroup v1 to cgroup v2. While adapting older Kubernetes versions to cgroup v2, however, we ran into problems. We had originally exposed container CPU load and other related monitoring data from the kubelet by enabling -enable_load_reader in a cgroup v1 environment; in a cgroup v2 environment, the same configuration caused the kubelet to panic.

Below are the key details:

container.go:422] Could not initialize cpu load reader for "/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podXXX.slice": failed to create a netlink based cpuload reader: failed to get netlink family id for task stats: binary.Read: invalid type int32

Technical Background

This section covers the following topics:

  • How container metrics are generated
  • How Kubernetes integrates container monitoring
  • How CPU load is calculated

cAdvisor

cAdvisor is a monitoring tool designed specifically for containerized environments: it collects, aggregates, processes, and exports information about running containers' resource usage and performance. Besides Docker containers, it supports other container runtimes as well.

Kubelet has built-in support for cAdvisor, allowing users to directly obtain monitoring metrics about containers on nodes through the Kubelet component.

> K8s 1.19 uses cAdvisor version 0.39.3, while the brief introduction here uses version 0.48.1.

Below is the main functional code, with some comments added for readability. The code path is cadvisor/cmd/cadvisor.go.

cAdvisor primarily accomplishes the following tasks:

  • Exposes APIs for external consumers, including general API endpoints and a Prometheus metrics endpoint.
  • Supports third-party data storage, including BigQuery, Elasticsearch, InfluxDB, Kafka, Prometheus, Redis, StatsD, and standard output.
  • Collects monitoring data related to containers, processes, machines, the Go runtime, and custom business logic.
func init() {
	optstr := container.AllMetrics.String()
	flag.Var(&ignoreMetrics, "disable_metrics", fmt.Sprintf("comma-separated list of `metrics` to be disabled. Options are %s.", optstr))
	flag.Var(&enableMetrics, "enable_metrics", fmt.Sprintf("comma-separated list of `metrics` to be enabled. If set, overrides 'disable_metrics'. Options are %s.", optstr))
}

From the code above, we can see that cAdvisor can enable or disable specific metrics. AllMetrics mainly includes the following metric kinds:

https://github.com/google/cadvisor/blob/master/container/factory.go#L72

var AllMetrics = MetricSet{
	CpuUsageMetrics:                struct{}{},
	ProcessSchedulerMetrics:        struct{}{},
	PerCpuUsageMetrics:             struct{}{},
	MemoryUsageMetrics:             struct{}{},
	MemoryNumaMetrics:              struct{}{},
	CpuLoadMetrics:                 struct{}{},
	DiskIOMetrics:                  struct{}{},
	DiskUsageMetrics:               struct{}{},
	NetworkUsageMetrics:            struct{}{},
	NetworkTcpUsageMetrics:         struct{}{},
	NetworkAdvancedTcpUsageMetrics: struct{}{},
	NetworkUdpUsageMetrics:         struct{}{},
	ProcessMetrics:                 struct{}{},
	AppMetrics:                     struct{}{},
	HugetlbUsageMetrics:            struct{}{},
	PerfMetrics:                    struct{}{},
	ReferencedMemoryMetrics:        struct{}{},
	CPUTopologyMetrics:             struct{}{},
	ResctrlMetrics:                 struct{}{},
	CPUSetMetrics:                  struct{}{},
	OOMMetrics:                     struct{}{},
}
func main() {
	...

	var includedMetrics container.MetricSet
	if len(enableMetrics) > 0 {
		includedMetrics = enableMetrics
	} else {
		includedMetrics = container.AllMetrics.Difference(ignoreMetrics)
	}

	klog.V(1).Infof("enabled metrics: %s", includedMetrics.String())
	setMaxProcs()

	memoryStorage, err := NewMemoryStorage()
	if err != nil {
		klog.Fatalf("Failed to initialize storage driver: %s", err)
	}

	sysFs := sysfs.NewRealSysFs()

	// The core of cAdvisor: the resource manager collects and caches container stats.
	resourceManager, err := manager.New(memoryStorage, sysFs, manager.HousekeepingConfigFlags, includedMetrics, &collectorHTTPClient, strings.Split(*rawCgroupPrefixWhiteList, ","), strings.Split(*envMetadataWhiteList, ","), *perfEvents, *resctrlInterval)
	if err != nil {
		klog.Fatalf("Failed to create a manager: %s", err)
	}

	// Register the HTTP handlers (API and Prometheus endpoints).
	err = cadvisorhttp.RegisterHandlers(mux, resourceManager, *httpAuthFile, *httpAuthRealm, *httpDigestFile, *httpDigestRealm, *urlBasePrefix)
	if err != nil {
		klog.Fatalf("Failed to register HTTP handlers: %v", err)
	}

	// Container labels: kubelet 1.28 changed to CRI-based collection, which requires rework in kubelet.
	containerLabelFunc := metrics.DefaultContainerLabels
	if !*storeContainerLabels {
		whitelistedLabels := strings.Split(*whitelistedContainerLabels, ",")
		// Trim spacing in labels
		for i := range whitelistedLabels {
			whitelistedLabels[i] = strings.TrimSpace(whitelistedLabels[i])
		}
		containerLabelFunc = metrics.BaseContainerLabels(whitelistedLabels)
	}

	...
}
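
To make that resolution rule concrete, here is a tiny self-contained sketch. The metricSet type and the metric names in it are simplified stand-ins of my own, not the real container.MetricSet, but the override logic mirrors the main() excerpt above:

package main

import "fmt"

// metricSet is a simplified stand-in for cadvisor's container.MetricSet.
type metricSet map[string]struct{}

// difference returns the metrics in s that are not in other.
func (s metricSet) difference(other metricSet) metricSet {
	out := metricSet{}
	for k := range s {
		if _, ok := other[k]; !ok {
			out[k] = struct{}{}
		}
	}
	return out
}

func main() {
	all := metricSet{"cpu": {}, "cpuLoad": {}, "memory": {}, "diskIO": {}}
	enabled := metricSet{}              // parsed from -enable_metrics
	disabled := metricSet{"diskIO": {}} // parsed from -disable_metrics

	// Same rule as cadvisor's main(): enable_metrics overrides disable_metrics
	// when set; otherwise start from all metrics and subtract the disabled ones.
	included := all.difference(disabled)
	if len(enabled) > 0 {
		included = enabled
	}
	fmt.Println("enabled metrics:", included) // map[cpu:{} cpuLoad:{} memory:{}]
}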

The generation of CPU load metrics is controlled by the command line flag enable_load_reader.

https://github.com/google/cadvisor/blob/42bb3d13a0cf9ab80c880a16c4ebb4f36e51b0c9/manager/container.go#L455

if *enableLoadReader {
	// Create cpu load reader.
	loadReader, err := cpuload.New()
	if err != nil {
		klog.Warningf("Could not initialize cpu load reader for %q: %s", ref.Name, err)
	} else {
		cont.loadReader = loadReader
	}
}
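
For reference, the branch above is the code path behind the log line quoted at the beginning. A minimal sketch of driving the same cpuload package directly might look like the following; the import path matches the cadvisor repository layout quoted later, while the container name and cgroup path are purely illustrative:

package main

import (
	"fmt"
	"log"

	"github.com/google/cadvisor/utils/cpuload"
)

func main() {
	// cpuload.New is the same constructor that -enable_load_reader guards in
	// the manager code above; it sets up the netlink-based reader. In the
	// environment described in this article, New() is where the "failed to get
	// netlink family id for task stats" warning came from.
	reader, err := cpuload.New()
	if err != nil {
		log.Fatalf("could not initialize cpu load reader: %v", err)
	}

	// Container name and cpu cgroup path are illustrative; in cadvisor they come
	// from cd.info.Name and cd.handler.GetCgroupPath("cpu") (shown later).
	loadStats, err := reader.GetCpuLoad("/kubepods.slice", "/sys/fs/cgroup/cpu/kubepods.slice")
	if err != nil {
		log.Fatalf("failed to get cpu load: %v", err)
	}
	fmt.Printf("runnable tasks: %d\n", loadStats.NrRunning)
}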

Kubelet

In Kubernetes, Google's cAdvisor project is used to collect container resource and performance metrics on each node. The kubelet server embeds cAdvisor to monitor all containers under the kubepods cgroup on that node (the default cgroup name, with a ".slice" suffix under systemd). As of version 1.29.0-alpha.2, the kubelet provides the following two stats providers (with useLegacyCadvisorStats now set to false):

if kubeDeps.useLegacyCadvisorStats {
	klet.StatsProvider = stats.NewCadvisorStatsProvider(
		klet.cadvisor,
		klet.resourceAnalyzer,
		klet.podManager,
		klet.runtimeCache,
		klet.containerRuntime,
		klet.statusManager,
		hostStatsProvider)
} else {
	klet.StatsProvider = stats.NewCRIStatsProvider(
		klet.cadvisor,
		klet.resourceAnalyzer,
		klet.podManager,
		klet.runtimeCache,
		kubeDeps.RemoteRuntimeService,
		kubeDeps.RemoteImageService,
		hostStatsProvider,
		utilfeature.DefaultFeatureGate.Enabled(features.PodAndContainerStatsFromCRI))
}

The kubelet exposes the relevant runtime metrics through its /stats/ endpoints and, in Prometheus format, through /metrics/cadvisor; the cAdvisor service is embedded in the kubelet.
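
As a quick way to look at these series, the sketch below scrapes a standalone cAdvisor on its default port 8080 and prints only the CPU load series discussed later; the port, the unauthenticated endpoint, and the local setup are assumptions about a test environment (a kubelet would require its normal authentication):

package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Illustrative: scrape a standalone cAdvisor listening on localhost:8080.
	resp, err := http.Get("http://localhost:8080/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // metric pages can be large
	for scanner.Scan() {
		line := scanner.Text()
		// Keep only the CPU load average series discussed below.
		if strings.HasPrefix(line, "container_cpu_load_average_10s") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}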

Finally, we can see how the cAdvisor component is initialized within the kubelet.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cadvisor/cadvisor_linux.go#L80

func New(imageFsInfoProvider ImageFsInfoProvider, rootPath string, cgroupRoots []string, usingLegacyStats, localStorageCapacityIsolation bool) (Interface, error) {
	sysFs := sysfs.NewRealSysFs()
	// Here are the types of monitoring metrics exposed by kubelet by default
	includedMetrics := cadvisormetrics.MetricSet{
		...
		cadvisormetrics.CpuLoadMetrics: struct{}{},
		...
	}
	// init cAdvisor container manager.
	m, err := manager.New(memory.New(statsCacheDuration, nil), sysFs, housekeepingConfig, includedMetrics, http.DefaultClient, cgroupRoots, nil /* containerEnvMetadataWhiteList */, "" /* perfEventsFile */, time.Duration(0) /*resctrlInterval*/)
	...

This is a direct invocation of cAdvisor's manager.New interface. For more details, please refer to https://zoues.com/posts/3f237e52/.

CPU Load metric

CPU usage reflects how busy the CPU is at the moment, while the CPU load average is the number of processes actively using the CPU plus those waiting for CPU time over a given period. Here, "waiting for CPU time" means runnable processes waiting to be scheduled; processes sleeping in a wait state are not counted.

When diagnosing a machine, you need to combine CPU usage, load average, and task states to reach a conclusion. For example, low CPU usage together with a high load average may indicate an I/O bottleneck, but we won't delve into that here.

The metric name exposed by cAdvisor is:

container_cpu_load_average_10s

So let's take a look at how it is calculated.

https://github.com/google/cadvisor/blob/master/manager/container.go#L632

// Calculate new smoothed load average using the new sample of runnable threads.
// The decay used ensures that the load will stabilize on a new constant value within
// 10 seconds.
func (cd *containerData) updateLoad(newLoad uint64) {
	if cd.loadAvg < 0 {
		cd.loadAvg = float64(newLoad) // initialize to the first seen sample for faster stabilization.
	} else {
		cd.loadAvg = cd.loadAvg*cd.loadDecay + float64(newLoad)*(1.0-cd.loadDecay)
	}
}

The formula is: cd.loadAvg = cd.loadAvg*cd.loadDecay + float64(newLoad)*(1.0-cd.loadDecay)

Essentially, the previously computed value cd.loadAvg is weighted by the decay factor cd.loadDecay, and the newly collected sample newLoad is weighted by (1.0 - cd.loadDecay); their sum becomes the new cd.loadAvg.

Here is the logic for calculating cont.loadDecay:

https://github.com/google/cadvisor/blob/master/manager/container.go#L453

cont.loadDecay = math.Exp(float64(-cont.housekeepingInterval.Seconds() / 10))

This is a fixed factor derived from housekeepingInterval and a 10-second decay window.
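
To get a feel for what this decay does, here is a small standalone sketch (the 1-second interval and the load values are made up) that applies the same two lines of arithmetic:

package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	housekeepingInterval := 1 * time.Second

	// Same decay factor as cadvisor: exp(-interval / 10s).
	loadDecay := math.Exp(-housekeepingInterval.Seconds() / 10)

	// Pretend the container was idle, then 4 runnable tasks appear and stay.
	loadAvg := 0.0
	for tick := 1; tick <= 20; tick++ {
		newLoad := 4.0
		loadAvg = loadAvg*loadDecay + newLoad*(1.0-loadDecay)
		fmt.Printf("t=%2ds loadAvg=%.3f (exported as %d milliLoad)\n",
			tick, loadAvg, int32(loadAvg*1000))
	}
}

With exp(-0.1) ≈ 0.905 per tick, the old value retains only exp(-1) ≈ 37% of its weight after 10 samples, which is what the "stabilize within 10 seconds" comment in the source refers to.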

For a detailed explanation of container CPU load, please refer to the referenced link.

Tracing Back to the Source

The load sample (newLoad) that feeds updateLoad is obtained as follows:

https://github.com/google/cadvisor/blob/master/manager/container.go#L650

if cd.loadReader != nil {
	// TODO(vmarmol): Cache this path.
	path, err := cd.handler.GetCgroupPath("cpu")
	if err == nil {
		loadStats, err := cd.loadReader.GetCpuLoad(cd.info.Name, path)
		if err != nil {
			return fmt.Errorf("failed to get load stat for %q - path %q, error %s", cd.info.Name, path, err)
		}
		stats.TaskStats = loadStats
		cd.updateLoad(loadStats.NrRunning)
		// convert to 'milliLoad' to avoid floats and preserve precision.
		stats.Cpu.LoadAverage = int32(cd.loadAvg * 1000)
	}
}

Digging a bit deeper, we find that netlink is used to retrieve these statistics. The critical call path is as follows:

updateStats->GetCpuLoad->getLoadStats->prepareCmdMessage->prepareMessage

From the above analysis, it is evident that cAdvisor retrieves CPU load information by sending a CGROUPSTATS_CMD_GET request over netlink:

cadvisor/utils/cpuload/netlink/netlink.go

At lines 128 to 132 in the v0.48.1 branch:

func prepareCmdMessage(id uint16, cfd uintptr) (msg netlinkMessage) {
	buf := bytes.NewBuffer([]byte{})
	addAttribute(buf, unix.CGROUPSTATS_CMD_ATTR_FD, uint32(cfd), 4)
	return prepareMessage(id, unix.CGROUPSTATS_CMD_GET, buf.Bytes())
}
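
One detail worth noting is that cfd here is simply an open file descriptor of the container's cpu cgroup directory; the kernel resolves it back to a dentry in the handler shown next. A minimal sketch of obtaining such an fd (the cgroup path is illustrative):

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// cadvisor opens the container's cpu cgroup directory and hands its fd to
	// prepareCmdMessage; the kernel side (cgroupstats_user_cmd) does fdget(fd)
	// and walks f.file->f_path.dentry to find the cgroup.
	dir, err := os.Open("/sys/fs/cgroup/cpu/kubepods.slice") // illustrative v1 path
	if err != nil {
		log.Fatal(err)
	}
	defer dir.Close()

	fmt.Printf("fd %d would be packed into CGROUPSTATS_CMD_ATTR_FD\n", dir.Fd())
}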

Finally, the kernel handles the retrieval request in cgroupstats_user_cmd:

/* user->kernel request/get-response */

kernel/taskstats.c#L407

static int cgroupstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
{
	int rc = 0;
	struct sk_buff *rep_skb;
	struct cgroupstats *stats;
	struct nlattr *na;
	size_t size;
	u32 fd;
	struct fd f;

	na = info->attrs[CGROUPSTATS_CMD_ATTR_FD];
	if (!na)
		return -EINVAL;

	fd = nla_get_u32(info->attrs[CGROUPSTATS_CMD_ATTR_FD]);
	f = fdget(fd);
	if (!f.file)
		return 0;

	size = nla_total_size(sizeof(struct cgroupstats));

	rc = prepare_reply(info, CGROUPSTATS_CMD_NEW, &rep_skb,
			   size);
	if (rc < 0)
		goto err;

	na = nla_reserve(rep_skb, CGROUPSTATS_TYPE_CGROUP_STATS,
			 sizeof(struct cgroupstats));
	if (na == NULL) {
		nlmsg_free(rep_skb);
		rc = -EMSGSIZE;
		goto err;
	}

	stats = nla_data(na);
	memset(stats, 0, sizeof(*stats));

	rc = cgroupstats_build(stats, f.file->f_path.dentry);
	if (rc < 0) {
		nlmsg_free(rep_skb);
		goto err;
	}

	rc = send_reply(rep_skb, info);

err:
	fdput(f);
	return rc;
}

The cgroup stats are then built in the cgroupstats_build function:

kernel/cgroup/cgroup-v1.c#L699

/**
* cgroupstats_build - build and fill cgroupstats
* @stats: cgroupstats to fill information into
* @dentry: A dentry entry belonging to the cgroup for which stats have
* been requested.
*
* Build and fill cgroupstats so that taskstats can export it to user
* space.
*
* Return: %0 on success or a negative errno code on failure
*/
int cgroupstats_build(struct cgroupstats *stats, struct dentry *dentry)
{
	...
	/* it should be kernfs_node belonging to cgroupfs and is a directory */
	if (dentry->d_sb->s_type != &cgroup_fs_type || !kn ||
	    kernfs_type(kn) != KERNFS_DIR)
		return -EINVAL;

Here it can be seen that cgroup_fs_type is the cgroup v1 filesystem type and there is no handling for cgroup v2, so for a cgroup v2 path cgroupstats_build returns -EINVAL at this type check.

There is also an explanation of this issue in the kernel community discussion.

Let's see how Tejun Heo (Meta, cgroup v2 maintainer) explains it:

> The exclusion of cgroupstats from v2 interface was intentional due to the duplication and inconsistencies with other statistics. If you need these numbers, please justify and add them to the appropriate cgroupfs stat file.

In other words, cgroupstats was deliberately excluded from the v2 interface because it duplicates, and is inconsistent with, other statistics.

Conclusion

So what is his suggestion?

He suggests using PSI instead of obtaining CPU statistics through the CGROUPSTATS_CMD_GET netlink API: read the pressure data directly from the cpu.pressure, memory.pressure, and io.pressure files. We will discuss the progress of PSI in the container ecosystem in a later post; Containerd already supports PSI-related monitoring.
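
As a rough illustration of that suggestion, CPU pressure can be read straight from the cgroup v2 filesystem. The sketch below parses the "some" avg10 value from cpu.pressure; the cgroup path and the choice of field are my assumptions, not something prescribed above:

package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

// readPSIAvg10 parses the "some" line of a cgroup v2 PSI file, e.g.
//   some avg10=1.23 avg60=0.80 avg300=0.50 total=123456
// and returns the avg10 value.
func readPSIAvg10(path string) (float64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "some" {
			continue
		}
		for _, f := range fields[1:] {
			if v, ok := strings.CutPrefix(f, "avg10="); ok {
				return strconv.ParseFloat(v, 64)
			}
		}
	}
	return 0, fmt.Errorf("no 'some avg10' entry in %s", path)
}

func main() {
	// Illustrative path: cpu.pressure of a pod-level cgroup under cgroup v2.
	avg10, err := readPSIAvg10("/sys/fs/cgroup/kubepods.slice/cpu.pressure")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cpu some avg10: %.2f%%\n", avg10)
}

Containerd's cgroups library exposes the same PSI data programmatically (see the first reference below).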

Reference

https://github.com/containerd/cgroups/pull/308

https://cloud.tencent.com/developer/article/2329489

https://github.com/google/cadvisor/issues/3137

https://www.cnblogs.com/vinsent/p/15830271.html

https://zoues.com/posts/5a8a6c8d/
