1. collectd
    1. Filtering
    2. Basic resources
    3. Plugins
      1. RAPL
      2. NVML
      3. Additional plugins
    4. Using collectd
    5. Example scripts

There are inherent challenges in collecting resource data from a measurable system. As the collection process requires executing code and/or storing data on the system under test, the process always consumes some energy and may impose a measurable load. The collected data also needs to be stored in memory or on disk, or transmitted over the network for archiving.

We have generally opted for a solution where data is collected in memory. Storing in memory has a relatively low overhead compared to storing on disk or transmitting over the network. We use a tmpfs-based virtual disk on our Linux system. The largest downside of this approach is that the system can run out of memory. In such cases, installing more RAM or storing the logs on disk are two possible solutions. Switching to disk-based logging is easy, since we only need to specify a different target directory that is not a tmpfs mount point.

Another challenge with logging data on two or more systems (the SUT and the PowerGoblin instance) is that the system clocks might not be synchronized, or might even run at different speeds. The timestamps of the systems therefore need to be matched; we use simple single-point linear matching for the measurements.
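Single-point linear matching amounts to computing one constant offset from a pair of simultaneous timestamps (one per system) and applying it to all imported samples. A minimal sketch of the idea; the function names are illustrative, not PowerGoblin's actual API:

```python
# Single-point linear timestamp matching: one (sut_time, goblin_time)
# sample pair defines a constant offset that maps SUT timestamps onto
# the PowerGoblin clock. Names are illustrative, not the real API.

def make_clock_mapper(sut_ms: int, goblin_ms: int):
    """Return a function mapping SUT timestamps (ms) onto the PowerGoblin clock."""
    offset = goblin_ms - sut_ms
    return lambda t: t + offset

# Example: the SUT clock is 1500 ms behind the PowerGoblin clock.
to_goblin = make_clock_mapper(sut_ms=1_000_000, goblin_ms=1_001_500)
print(to_goblin(1_000_250))  # a sample taken 250 ms after the sync point -> 1001750
```

Note that a single sync point can only correct the offset, not clock drift; for short measurements this is usually sufficient.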

Currently, PowerGoblin only supports importing data from the collectd tool.


collectd

The collectd tool can be used for monitoring resource usage with software-based meters. Collectd is extensible with a large set of plugins, so it is a quite comprehensive tool for data collection. Each plugin requires its own configuration section in the configuration file if the default configuration is not sufficient. Also keep in mind that collecting more data places more load on the system and may affect the results.

Collectd runs on the SUT. After the measurement is done, the data is compiled into a zip package that is sent over the network to the PowerGoblin instance with an HTTP POST request.

Filtering

While the zip package may contain collectd data from a variety of plugins, PowerGoblin defines an internal filter for selecting only the relevant data for further processing. When adding new plugins to the configuration, remember to also adjust the filter.

The global default filter is defined by the collectdFilters field in ~/.config/powergoblin/default.json, and it is a comma-separated list of prefixes. The list is inclusive: it consists of the prefixes of the files to include. For example, the prefix if_octets includes

  • mypc/interface-lo/if_octets-2024-11-11 and
  • mypc/interface-lan/if_octets-2024-11-11 and even
  • mypc/interface-lan/if_octets_foobar-2024-11-11, although most likely no such file is created by the plugins.
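The prefix matching described above can be sketched as a startswith check on the file-name component of each path. This is an illustration of the rule, not PowerGoblin's actual implementation:

```python
# Prefix-based inclusion filter, as described above: a file is kept if
# its base name starts with any configured prefix. This is a sketch of
# the idea, not PowerGoblin's actual implementation.

def included(path: str, prefixes: list[str]) -> bool:
    name = path.rsplit("/", 1)[-1]
    return any(name.startswith(p) for p in prefixes)

filters = ["if_octets"]
print(included("mypc/interface-lo/if_octets-2024-11-11", filters))          # True
print(included("mypc/interface-lan/if_octets_foobar-2024-11-11", filters))  # True
print(included("mypc/interface-lan/if_errors-2024-11-11", filters))         # False
```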

The HTTP API provides a way to include and exclude certain filters:

  • api/v2/session/latest/resource/counter/include: add filter counter
  • api/v2/session/latest/resource/counter/exclude: remove filter counter

Basic resources

Currently, PowerGoblin collects and filters the collectd data by default for the following variables (the set can be configured in ~/.config/powergoblin/default.json):

"memory-used", "memory-free", "memory-cached", "memory-buffered",
"if_octets", "cpu-user", "cpu-system", "cpu-idle"

The following configuration file (e.g. collectd.conf, does not require any specific path) sets up collectd to collect data on the variables that PowerGoblin analyzes using the default configuration shown above:

Interval 0.1

LoadPlugin cpu
LoadPlugin csv
LoadPlugin interface
LoadPlugin load
LoadPlugin memory

<Plugin csv>
  DataDir "/tmp/collectd"
  StoreRates false
</Plugin>

The configuration is now ready, and we can continue using collectd.
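The csv plugin configured above writes each resource into a plain CSV file with a header row. If the logs need to be inspected outside PowerGoblin, such a file can be parsed with the standard library; a sketch assuming the plugin's usual epoch-plus-values layout:

```python
import csv
import io

# Parse one collectd CSV log into (timestamp, values) pairs. Assumes the
# csv plugin's usual layout: a header row starting with "epoch" followed
# by one row per sample.

def parse_collectd_csv(text: str):
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return header[1:], [(float(r[0]), [float(v) for v in r[1:]]) for r in data]

# A made-up if_octets-style sample:
sample = "epoch,rx,tx\n1731310200.100,1024,2048\n1731310200.200,1536,2048\n"
names, samples = parse_collectd_csv(sample)
print(names)       # ['rx', 'tx']
print(samples[0])  # (1731310200.1, [1024.0, 2048.0])
```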


Plugins

RAPL

Recent Intel and AMD CPUs provide so-called RAPL (Running Average Power Limit) data via special machine registers and kernel APIs. Depending on the hardware support, RAPL provides energy estimations for:

  • The whole package (package-0)
  • The cores (core)
  • An unspecified uncore device (client processors, uncore)
  • The DRAM (server processors, dram)
  • System (psys)

We currently only support reading the RAPL data with collectd, which utilizes the Linux kernel APIs. Because the RAPL data can be used for malicious purposes (for example, side-channel attacks), not all hardware counters and kernel versions allow reading the data as a non-root user. Read the configuration section for a way to set up access to the RAPL data.

First, make sure you have Python installed on the SUT (e.g. by launching python). Download the following intel_rapl.py script.

The following configuration file (collectd.conf) contains the parts relevant to the measurements:

Interval 0.1

LoadPlugin csv
LoadPlugin python

<Plugin csv>
  DataDir "/tmp/collectd"
  StoreRates false
</Plugin>

<Plugin python>
  ModulePath "."
  LogTraces true
  Import "intel_rapl"
</Plugin>

When running collectd, both the Python script and the configuration file should be located in the same directory.

On our test system the exposed counters are called counter-package-0 and counter-core. The counters use microjoules as their unit. To include this data in the session files, the counter logs should be added to the list of logged resource types using the following API call:

msg_get session/latest/resource/counter/include

Another option is to configure the global application settings by adding the counter prefix in the configuration file (~/.config/powergoblin/default.json):

"memory-used", "memory-free", "memory-cached", "memory-buffered",
"if_octets", "cpu-user", "cpu-system", "cpu-idle", "counter"

The configuration is now ready, and we can continue using collectd.
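Since the RAPL counters are cumulative energy readings in microjoules, average power over an interval can be derived from two consecutive samples. A sketch only; counter wrap-around at the hardware-specific maximum is ignored here:

```python
# Average power from two cumulative RAPL energy samples (microjoules).
# Sketch only: real counters wrap around at a hardware-specific maximum,
# which is not handled here.

def average_power_watts(e0_uj: float, e1_uj: float, dt_s: float) -> float:
    """Average power over an interval from two cumulative energy samples."""
    return (e1_uj - e0_uj) / 1e6 / dt_s

# Two package-0 samples taken 0.1 s apart (the Interval configured above):
print(average_power_watts(52_000_000, 53_500_000, 0.1))  # approx. 15 W
```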


NVML

Nvidia graphics cards support the reading of resource information and power consumption estimates via the NVML interface. Depending on the hardware support, NVML provides resource data and energy estimations for:

  • GPU memory (memory-used, memory-free), unit: bytes
  • GPU load (percent-gpu_used), unit: percent
  • fan speed (fanspeed), unit: rpm or percent (?)
  • temperature (temperature-core), unit: °C
  • frequency (frequency-multiprocessor, frequency-memory), unit: Hz
  • power (power), unit: Watts

The information above was extracted from our test device (GeForce GTX 1080).

We currently only support reading the NVML data with collectd, which has built-in support for reading the data using APIs exposed by the official Nvidia drivers. Before the built-in support was added, there were third party collectd plugins for this purpose (e.g. #1, #2 & #3), but the built-in functionality is now recommended.

On Arch Linux, the collectd package from the AUR did not have the necessary NVML support enabled, so a custom package needs to be built to enable the support. First, make sure you have the Nvidia drivers and CUDA installed on the SUT. Note that the driver and CUDA packages are rather large compared to most system packages. Use the following commands to build and install the package:

$ curl https://tech.utugit.fi/soft/tools/power/doc/collectd-arch.tar.gz | tar xvfz -
$ cd collectd-archlinux/
$ makepkg -s
$ makepkg -i

The following configuration file (collectd.conf) contains the parts relevant to the measurements:

Interval 0.1

LoadPlugin csv
LoadPlugin gpu_nvidia

<Plugin csv>
  DataDir "/tmp/collectd"
  StoreRates false
</Plugin>

On our test system the exposed counters are called fanspeed, frequency-memory, frequency-multiprocessor, memory-free, memory-used, percent-gpu_used, and power. The memory counters are already included by the default filters. To include this data in the session files, the entry prefixes should be added to the list of logged resource types using the following API call:

msg_get session/latest/resource/fanspeed,frequency,percent,power/include

Another option is to configure the global application settings by adding the prefixes in the configuration file (~/.config/powergoblin/default.json):

"memory-used", "memory-free", "memory-cached", "memory-buffered",
"if_octets", "cpu-user", "cpu-system", "cpu-idle",
"fanspeed", "frequency-memory", "frequency-multiprocessor",
"percent-gpu_used", "power"

The configuration is now ready, and we can continue using collectd.
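Unlike the cumulative RAPL counters, the NVML power entry reports instantaneous draw in watts, so total energy for a measurement has to be integrated from the samples. A trapezoidal sketch with made-up sample data:

```python
# Integrate instantaneous power samples (watts) into energy (joules)
# using the trapezoidal rule. The sample data below is made up.

def energy_joules(samples: list[tuple[float, float]]) -> float:
    """samples: (timestamp_s, power_w) pairs, sorted by timestamp."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (p0 + p1) / 2 * (t1 - t0)
    return total

samples = [(0.0, 100.0), (1.0, 120.0), (2.0, 110.0)]
print(energy_joules(samples))  # 225.0 J over two seconds
```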


Additional plugins

The following list shows the different variables that collectd supports measuring, grouped by category, with the plugins that provide them:

  • Power: apcups (ups), nut (ups), ted (power), mic (mic), battery (laptop),
    ipmi (ipmi), snmp (ipmi), mbmon (sensors), redfish (sensors),
    sensors (sensors), thermal (sensors), multimeter (multimeter),
    sigrok (multimeter), smart (hdd), hddtemp (hdd), wireless (wireless),
    gpu_nvidia (gpu), intel_rapl (rapl)
  • CPU: cpu, cpufreq, xen, turbostat, contextswitch
  • Memory: memory, swap, vmem, buddyinfo
  • Network: dns, interface, iptables, netlink, ethstat, madwifi
  • Disk: disk, df
  • Process: cgroups, load, processes, dbi (db)
  • Multiple (CPU, memory, network, disk): virt

Using collectd

First, make sure that all the collectd plugins required by the measurement's collectd configuration are installed and functional. This might require doing test runs before the actual measurement: a configuration that works on another system does not necessarily work on the SUT. The syntax of the configuration file can be verified by running:

$ collectd -tT -C "$CONFFILE"

To use collectd in conjunction with PowerGoblin, you will need to start collectd before doing the measurement and shut down the utility after the measurement:

$ collectd -f -C "$CONFFILE"

As configured above, the data is collected into a directory tree under /tmp/collectd:

└── mypc
    ├── cpu-0
    │   ├── cpu-idle-2024-11-11
    │   ├── cpu-interrupt-2024-11-11
    │   ├── cpu-nice-2024-11-11
    │   ├── cpu-softirq-2024-11-11
    │   ├── cpu-steal-2024-11-11
    │   ├── cpu-system-2024-11-11
    │   ├── cpu-user-2024-11-11
    │   └── cpu-wait-2024-11-11
    ├── intel-rapl
    │   ├── counter-core-2024-11-11
    │   └── counter-package-0-2024-11-11
    ├── interface-lo
    │   ├── if_dropped-2024-11-11
    │   ├── if_errors-2024-11-11
    │   ├── if_octets-2024-11-11
    │   └── if_packets-2024-11-11
    ├── interface-lan
    │   ├── if_dropped-2024-11-11
    │   ├── if_errors-2024-11-11
    │   ├── if_octets-2024-11-11
    │   └── if_packets-2024-11-11
    ├── load
    │   └── load-2024-11-11
    ├── memory
    │   ├── memory-buffered-2024-11-11
    │   ├── memory-cached-2024-11-11
    │   ├── memory-free-2024-11-11
    │   ├── memory-slab_recl-2024-11-11
    │   ├── memory-slab_unrecl-2024-11-11
    │   └── memory-used-2024-11-11
    └── gpu_nvidia-0-NVIDIA GeForce GTX 1080
        ├── fanspeed-2024-11-11
        ├── frequency-memory-2024-11-11
        ├── frequency-multiprocessor-2024-11-11
        ├── memory-free-2024-11-11
        ├── memory-used-2024-11-11
        ├── percent-gpu_used-2024-11-11
        ├── power-2024-11-11
        └── temperature-core-2024-11-11

Before running the data collection, make sure the directory is empty to avoid contamination of the results. After the collection, the data should be compressed to a zip archive and uploaded to the PowerGoblin instance with the HTTP API. We should also synchronize the timestamps between the SUT and the PowerGoblin instance, because the clocks on both systems might not be in sync.

The PowerGoblin import functionality expects the exact directory structure (with relative paths) located under the data directory. That is, the root level of the archive should only contain a single folder named after the SUT's hostname.
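Because the import expects exactly one hostname folder at the archive root, it can be useful to verify the structure before uploading. A sketch of such a sanity check, not part of PowerGoblin itself:

```python
import io
import zipfile

# Check that a collectd archive has exactly one top-level folder (the
# SUT's hostname), as the import expects. Sketch of a sanity check, not
# part of PowerGoblin itself.

def zip_roots(data: bytes) -> set[str]:
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        return {name.split("/", 1)[0] for name in z.namelist()}

# Build a tiny in-memory archive for demonstration:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mypc/memory/memory-used-2024-11-11", "epoch,value\n")
    z.writestr("mypc/cpu-0/cpu-idle-2024-11-11", "epoch,value\n")
print(zip_roots(buf.getvalue()))  # {'mypc'}
```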

Example scripts

We provide example projects (#1, #2 & #3) for performing measurements with collectd. These projects utilize the PowerGoblin library, which simplifies the code quite a bit.

The following scripts are also provided for doing the measurement process without the PowerGoblin library. They start collectd as a background process, collect the log data in /tmp/collectd, compress the directory tree into a zip archive, synchronize the clocks, and upload the file to the PowerGoblin instance (running at localhost:8080).

Shell script:

#!/bin/sh
# requires coreutils, sh, curl, zip, collectd

msg_post() {
  curl -d@- http://$HOST/api/v2/$*
}
msg_post_file() {
  curl --data-binary "@$1" http://$HOST/api/v2/$2
}

# hostname and port of PowerGoblin
export HOST=localhost:8080

# start collecting resource data
mkdir -p /tmp/collectd
collectd -f -C collectd.conf &
PID=$!

# synchronize the clocks
date +%s%N | msg_post session/latest/sync

# stop collecting resource data and prepare the zip
kill $PID
wait $PID 2> /dev/null
(cd /tmp/collectd && zip -qr collectd.zip *)

# send the collectd data
msg_post_file /tmp/collectd/collectd.zip session/latest/import/collectd
Python script:

# requires Python 3, python-requests

import requests
import time
import shutil
import subprocess

class GoblinClient:
  def __init__(self, host):
    self.prefix = "http://" + host + "/api/v2/"

  def post_text(self, url, text):
    h = {'Content-Type': 'text/plain'}
    return requests.post(self.prefix + url, data=text, headers=h)

  def post_file(self, url, file):
    with open(file, 'rb') as p:
      h = {'content-type': 'application/x-zip'}
      return requests.post(self.prefix + url, data=p, verify=False, headers=h)

def milli_time():
  return str(round(time.time() * 1000))

c = GoblinClient("localhost:8080")

# start collecting resource data
p = subprocess.Popen(["/usr/sbin/collectd", "-fC", "collectd.conf"])

# synchronize the clocks
c.post_text("session/latest/sync", text = milli_time())

# stop collecting resource data and prepare the zip
p.terminate()
time.sleep(2)
shutil.make_archive("collectd", "zip", "/tmp/collectd/")

# send the collectd data
c.post_file("session/latest/import/collectd", "collectd.zip")
Java program:

// requires Java 21+

import java.io.*;
import java.net.*;
import java.net.http.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.*;

record GoblinClient(String host) {
  private HttpResponse<String> run(HttpRequest.Builder b) throws Exception {
    try (var client = HttpClient.newHttpClient()) {
      return client.send(
          b.build(),
          HttpResponse.BodyHandlers.ofString()
      );
    }
  }
  
  private HttpRequest.Builder api(String api) {
      return HttpRequest.newBuilder(URI.create("http://"+host+"/api/v2/"+api));
  }

  HttpResponse<String> post(String api, String msg) throws Exception {
    return run(api(api).POST(HttpRequest.BodyPublishers.ofString(msg)));
  }

  HttpResponse<String> post(String api, Path path) throws Exception {
    return run(api(api).POST(HttpRequest.BodyPublishers.ofFile(path)));
  }

  void zipFile(File file, String fn, ZipOutputStream out) throws IOException  {
    if (file.isDirectory()) {
      var name = fn.endsWith("/") ? fn : fn + "/";
      out.putNextEntry(new ZipEntry(name));
      out.closeEntry();
      File[] children = file.listFiles();
      if (children != null)
        for (File child: children)
          zipFile(child, name + child.getName(), out);
    } else
      try(var fis = new FileInputStream(file)) {
        out.putNextEntry(new ZipEntry(fn));
        fis.transferTo(out);
      }
  }

  void zip(Path target, List<Path> source) throws IOException {
    try(var fos = new FileOutputStream(target.toFile());
      var out = new ZipOutputStream(fos)) {
      for (Path it : source) {
        var fileToZip = it.toAbsolutePath().toFile();
        zipFile(fileToZip, fileToZip.getName(), out);
      }
    }
  }
  void zip(Path target, Path source) throws IOException {
    zip(target, Files.list(source).toList());
  }
}

void main() {
  var c = new GoblinClient("localhost:8080");

  // synchronize the clocks
  c.post("session/latest/sync", "" + System.currentTimeMillis());

  // send the collectd data
  c.zip(Path.of("collectd.zip"), Path.of("/tmp/collectd/"));
  c.post("session/latest/import/collectd", Path.of("collectd.zip"));
}
Node.js script:

// requires Node.js, request

const request = require('request');
const fs = require('fs');

const host = "localhost:8080";

function get(api) {
  request(
    "http://" + host + "/api/v2/" + api, 
    (e, r, body) => {
      if (!e && r.statusCode == 200)
        console.log(body);
    }
  );
}

function post_json(api, data) {
  request.post(
    {
      url: "http://" + host + "/api/v2/" + api,
      json: data,
    },
    (e, r, body) => {        
      if (!e && r.statusCode == 200)
        console.log(body)
    }
  );
}

function post_file(api, file) {
  fs.createReadStream(file).pipe(
    request.post(
    {
      url: "http://" + host + "/api/v2/" + api
    },
    (e, r, body) => {
      if (!e && r.statusCode == 200)
        console.log(body)
    }
    )
  );
}

// synchronize the clocks
post_json("session/latest/sync", +new Date());

// send the collectd data
post_file("session/latest/import/collectd", "collectd.zip")
Kotlin script:

// requires Kotlin and Java 21+

import java.nio.file.*
import java.net.*
import java.net.http.*
import java.util.zip.*
import java.io.*

class GoblinClient(val host: String) {
  private fun run(b: HttpRequest.Builder) =
    HttpClient.newHttpClient().use {
      it.send(
        b.build(),
        HttpResponse.BodyHandlers.ofString()
      )
    }
  
  private fun api(api: String) =
    HttpRequest.newBuilder(URI.create(("http://$host/api/v2/$api")))
  
  fun get(api: String) = run(api(api))

  fun post(api: String, msg: String) =
    run(api(api).POST(HttpRequest.BodyPublishers.ofString(msg)))

  fun post(api: String, path: Path) =
    run(api(api).POST(HttpRequest.BodyPublishers.ofFile(path)))
}

fun zipFile(fileToZip: File, fileName: String, out: ZipOutputStream) {
    if (fileToZip.isDirectory) {
        val name = if (fileName.endsWith("/")) fileName else "$fileName/"
        out.putNextEntry(ZipEntry(name))
        out.closeEntry()
        val children = fileToZip.listFiles() ?: 
          throw IOException("Error opening [$fileToZip]")
        for (child in children) 
            zipFile(child, name + child.name, out)
    } else
        FileInputStream(fileToZip).use { fis ->
            out.putNextEntry(ZipEntry(fileName))
            fis.transferTo(out)
        }
}

fun zip(destinationFile: Path, sourceDirs: List<Path>) {
    FileOutputStream(destinationFile.toFile()).use { fos ->
        ZipOutputStream(fos).use { zipOut ->
            sourceDirs.forEach {
                val path = it.toAbsolutePath()
                val fileToZip = path.toFile()
                zipFile(fileToZip, fileToZip.name, zipOut)
            }
        }
    }
}

// --- Examples ---

val c = GoblinClient("localhost:8080")

// synchronize the clocks
c.post("session/latest/sync", "" + System.currentTimeMillis())

// send the collectd data
zip(Path.of("collectd.zip"), Path.of("/tmp/collectd/"))
c.post("session/latest/import/collectd", Path.of("collectd.zip"))