Scalable Fleet Management and Automation for Next-Gen Substations
Content Type: Event Recap
Event: LF Energy Summit Europe 2025
Session: Scalable Fleet Management and Automation for Next-Gen Substations
Speakers: Ingo Boernig (Red Hat), Christian Koep (Red Hat)
TL;DR — Declarative lifecycle management for large substation fleets
Ingo Boernig and Christian Koep presented an approach for managing large fleets of substation systems using Flight Control and a declarative model. They described automating the device lifecycle from onboarding and approval to updates, health checks, rollbacks, and monitoring, with a focus on remote, bandwidth-constrained edge environments and strong security practices such as signed images, per-device certificates, and TPM support.
Why Next-Gen Substations Need Fleet Management
Koep framed the problem around the grid’s shift from centralized power plants to a decentralized system with many smaller renewable sources and prosumers. He noted a common expectation that grid capacity must double within the next 20 years. More secondary and primary equipment produces more data, and data only becomes useful when algorithms run on it continuously. That drives the need for compute close to where data is produced, at the edge.
They also highlighted pressures that come with moving IT platforms into edge environments, including long-term security, monitoring, avoiding vendor lock-in, and changing systems frequently while keeping them stable. Koep described this as a mindset shift from devices that provide fixed functions to functions that need to run somewhere, typically on an IT platform.
Edge Constraints and Operational Risk
Boernig emphasized that edge environments differ from data centers in ways that shape automation requirements:
- Limited compute resources, with hardware sized for the workload rather than for management overhead
- Less redundant networks, with limited bandwidth and restricted connectivity
- No onsite IT staff, so devices must be fully remotely manageable from the start
- Installation often done by technicians who are not deeply familiar with the technology
On the risk side, they stressed that software is not static: it evolves and needs updates, and updates must be reliable. When an update fails, services need to come back "very fast." They also described the need for an inventory of software components so teams can identify vulnerabilities and patch quickly.
Why This Is Not Just “Run Ansible Everywhere”
Koep addressed Ansible directly. He said Ansible can hit limits at scale, especially with tens of thousands of devices or VMs, because controllers or jump hosts need SSH or similar connectivity to targets, and jobs require targets to be available when executed. In edge environments that can lead to repeatedly “hitting the deploy button” until systems become eventually consistent.
He positioned the solution as a complement to Ansible rather than a replacement, with Ansible still fitting well for network equipment and similar domains.
Flight Control and the Agent Model
Koep introduced Flight Control as an open source project that came out of Red Hat's emerging technologies work. The key architectural point they emphasized was the agent:
- The central system does not need to connect to devices over SSH
- The agent "calls home," asks what it is supposed to do, and then it eventually happens
This model avoids central-to-edge connectivity requirements and key management, which they argued fits edge networking constraints and remote operations.
The presentation description characterizes this as centralized control using a declarative approach to automate the complete device lifecycle, including deployment, health checks, monitoring, rollbacks, and security, and as a way to simplify IT/OT convergence and integrate with existing infrastructure.
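To make the declarative model concrete, the sketch below shows roughly what a device resource could look like in this style. It is a hedged illustration modeled on Flight Control's Kubernetes-style API; the resource names, fields, and values are assumptions, not something shown in the talk.

```yaml
# Illustrative sketch of a declarative device resource (field names are
# assumptions; see github.com/flightctl/flightctl for the actual API).
apiVersion: v1alpha1
kind: Device
metadata:
  name: substation-dortmund-01      # hypothetical device name
  labels:
    substation: "true"
spec:
  os:
    image: registry.example.com/os/substation:1.2   # desired OS image
status: {}  # reported by the agent on check-in: running OS, health, update state
```

The operator edits only the spec; the agent pulls it when it calls home and reports back through the status, so no inbound connection to the device is ever needed.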
bootc as an Enabling Technology
Boernig explained why bootc matters for fleet-scale operations. He described the lack of standardization for Linux OS image formats across platforms, then contrasted that with the container world, where OCI images are a shared standard that can be built, pushed to registries, and run broadly.
He described bootc as adopting OCI images for operating systems by packaging the kernel, bootloader, and init system into the container image, then storing and distributing it through the same container tooling and registries. He said bootc images exist for multiple distributions, including Fedora, RHEL, CentOS, Debian, and others.
He also noted that, because this is a container image, vendors could package applications similarly, distribute them through registries, and start them from there.
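As a concrete, hedged illustration of this idea, a bootc OS image can be built from a Containerfile much like any application image. The base image below is the publicly documented Fedora bootc base; the layered content is a made-up example.

```dockerfile
# Build the OS as an OCI image: the base already contains kernel,
# bootloader, and init system; layers add site-specific content.
FROM quay.io/fedora/fedora-bootc:41

# Example layer: add a time-sync daemon and a config file (illustrative)
RUN dnf -y install chrony && dnf clean all
COPY chrony.conf /etc/chrony.conf
```

Such an image is built and pushed with the usual container tooling (podman build, podman push), and a running system can move to it with bootc switch or pick up newer versions with bootc upgrade.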
Onboarding: Zero Touch and Zero Trust Provisioning
Koep described onboarding as “zero touch provisioning” and “zero trust provisioning.” In the flow they presented:
- A device is plugged in and can provision itself
- TPMs are supported as a hardware root of trust
- Every device has a client certificate
- Operating system images and application images can be signed
He connected this to software supply chain security, especially where physical access controls are limited. Rogue devices are rejected when they attempt to connect and request secured content. In practice, he said this reduces the need for IT knowledge onsite. A technician, or “maybe a drone,” can deliver and plug in a device, then it can be approved and added to the fleet.
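The approval step might look like the following sketch, assuming a kubectl-style flightctl CLI; the exact commands and resource names here are assumptions rather than something shown verbatim in the talk.

```bash
# Hedged sketch: listing and approving pending devices with a
# kubectl-style CLI (commands are assumptions, not verbatim).
flightctl get enrollmentrequests          # devices waiting for approval
flightctl approve enrollmentrequest/<id>  # admit a reviewed device into the fleet
```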
Fleet Concept, Labels, and Desired State
Koep described Flight Control’s “fleet” concept as a collection of systems that are supposed to look exactly the same. Devices become part of a fleet based on labels, using examples like:
- substation=true
- region=NRW
- city=Dortmund
He showed an approval workflow where a device boots into a specified image, requests approval, and an operator reviews identifiers like boot ID, approves it, and assigns a name such as “substation Dortmund.”
He also said they collect metadata from the agent, including the running OS, architecture, and kernel version. Even without managing SSH keys, operators still get access to a terminal session on the device.
Once labeled into a fleet, the device compares its current state to the fleet’s desired state, which includes the OS image, applications, and configs. Devices then converge over time to that desired state.
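A fleet definition in this style might look like the hedged sketch below; the structure mirrors Flight Control's Kubernetes-like resources, but the names and fields are illustrative assumptions.

```yaml
# Illustrative fleet: every device whose labels match the selector
# converges to the template's desired state (fields are assumptions).
apiVersion: v1alpha1
kind: Fleet
metadata:
  name: substations-nrw             # hypothetical fleet name
spec:
  selector:
    matchLabels:                    # label-based fleet membership
      substation: "true"
      region: NRW
  template:
    spec:
      os:
        image: registry.example.com/os/substation:1.3   # desired OS image
```

Relabeling a device is then enough to move it between fleets, and with it between desired states.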
Updates, Health Checks, and Rollbacks
Koep emphasized that updates in this model are image-based:
- The new image is downloaded to the system
- The update is performed by booting into the new version
- There is "no residue" of old packages or configuration files; it is a clean cutover
He then tied this to “smart health checks.” Operators can define baseline checks (he gave examples like DNS and network connectivity). If checks fail, the system rolls back automatically to the latest healthy version. He described this as part of “zero touch,” meaning no manual intervention.
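At the level of a single device, the mechanics he described map onto bootc's documented commands. Flight Control drives these automatically as part of "zero touch," so the sketch below serves only to illustrate the model.

```bash
# Image-based update on one device (normally driven by the agent, not by hand)
bootc upgrade        # fetch and stage the new OS image alongside the running one
systemctl reboot     # the update happens by booting into the new version
bootc rollback       # if health checks fail, queue a boot back into the previous image
```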
Rollout Policies, Disruption Budgets, and Bandwidth-Aware Scheduling
Koep acknowledged the risk of updating “thousands of devices” at once if a fleet configuration changes. He said rollout can be controlled through policies that update subsets of machines, only proceed when prior updates succeed, and stop when disruption budgets are exceeded.
He also described scheduling controls designed for bandwidth constraints, where devices can download updates at one time (for example at night) and perform the actual update on a different day.
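Put together, rollout controls along these lines might look like the hedged sketch below. The field names echo Flight Control's documented rollout and update policies but are illustrative assumptions, not verbatim schema, and in the real API these knobs may sit on different resources.

```yaml
# Illustrative rollout controls (field names are assumptions): staged
# batches, a success gate, a disruption budget, and separate
# download/update windows for bandwidth-constrained sites.
spec:
  rolloutPolicy:
    deviceSelection:
      strategy: BatchSequence   # roll out in successive batches
      sequence:
        - limit: 1              # canary: one device first
        - limit: "10%"          # then a small slice of the fleet
    successThreshold: "95%"     # proceed only if earlier batches succeeded
    disruptionBudget:
      groupBy: ["city"]
      minAvailable: 1           # never take a whole site down at once
  updatePolicy:
    downloadSchedule:
      at: "0 1 * * *"           # cron: download the image at 01:00 nightly
    updateSchedule:
      at: "0 3 * * 0"           # cron: boot into it on Sunday at 03:00
```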
Git-Backed Configuration and Location-Aware Application
Koep described keeping configurations for devices in a Git repository and applying them based on labels and location. His example was that a device in a specific fleet automatically receives the network configuration that makes sense for that location. If the device moves, changing labels in software causes the system to apply the configuration appropriate to the new location.
He also showed that fleet-level configuration can include dropping files onto systems, running applications on top of the base system (he used a “hello world” application in Podman as an example), and adding monitoring specifics.
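A hedged sketch of how that could be expressed: a Git repository registered once, then referenced from a fleet template with a label-templated path, so a device's labels select which site's files it receives. The names, URL, and templating syntax here are illustrative assumptions.

```yaml
# Illustrative Git-backed config (names, URL, and template syntax are
# assumptions): register the repo once, then reference it per fleet.
apiVersion: v1alpha1
kind: Repository
metadata:
  name: site-configs
spec:
  type: git
  url: https://git.example.com/grid/site-configs
---
apiVersion: v1alpha1
kind: Fleet
metadata:
  name: substations-nrw
spec:
  selector:
    matchLabels:
      substation: "true"
  template:
    spec:
      config:
        - name: network-config
          gitRef:
            repository: site-configs    # the Repository defined above
            targetRevision: main
            # label-templated path: relabeling a moved device pulls the
            # config for its new location
            path: "/sites/{{ device.metadata.labels[city] }}/network"
```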
The presentation description further characterizes this as enabling predictive maintenance through comprehensive data collection. In the talk itself, they explicitly mentioned collecting device metadata from the agent and emphasized the need for component inventories to identify vulnerabilities.
Closing Message
Boernig summarized the vision as follows: "a device outside in the OT world feels like a container in a Kubernetes environment." You specify a desired state, keep everything in Git, and the fleet converges to that state in an eventually consistent way. He said they created Flight Control because customers want to control large numbers of similar remote deployments while keeping risk low and security standards high, using a method designed for simplicity and extensibility. They also stated that Flight Control is free software, developed in the open, and that they are looking for contributors.
Watch the presentation here: https://youtu.be/cs79xANQ8cg?si=zJUM8YOtFz6ohbct
Learn more about Flight Control: https://github.com/flightctl/flightctl