If you’re interested in this project and would like to help in engineering efforts, or have general usage questions, we are happy to have you!
We hold a weekly meeting that all audiences are welcome to attend.
We would appreciate your feedback so that we can make Talos even better!
To do so, you can take our survey.
You can subscribe to this meeting by joining the community forum above.
Note: You can convert the meeting hours to your local time.
Enterprise
If you are using Talos in a production setting, and need consulting services to get started or to integrate Talos into your existing environment, we can help.
Sidero Labs, Inc. offers support contracts with SLA (Service Level Agreement)-bound terms for mission-critical environments.
A quick introduction in to what Talos is and why it should be used.
Talos is a container optimized Linux distro; a reimagining of Linux for distributed systems such as Kubernetes.
Designed to be as minimal as possible while still maintaining practicality.
For these reasons, Talos has a number of features unique to it:
it is immutable
it is atomic
it is ephemeral
it is minimal
it is secure by default
it is managed via a single declarative configuration file and gRPC API
Talos can be deployed on container, cloud, virtualized, and bare metal platforms.
Why Talos
In having less, Talos offers more.
Security.
Efficiency.
Resiliency.
Consistency.
All of these areas are improved simply by having less.
1.2 - Quickstart
A short guide on setting up a simple Talos Linux cluster locally with Docker.
Local Docker Cluster
The easiest way to try Talos is by using the CLI (talosctl) to create a cluster on a machine with docker installed.
Download kubectl via one of methods outlined in the documentation.
Create the Cluster
Now run the following:
talosctl cluster create
Verify that you can reach Kubernetes:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
talos-default-controlplane-1 Ready master 115s v1.25.5 10.5.0.2 <none> Talos (v1.2.8) <host kernel> containerd://1.5.5
talos-default-worker-1 Ready <none> 115s v1.25.5 10.5.0.3 <none> Talos (v1.2.8) <host kernel> containerd://1.5.5
Destroy the Cluster
When you are all done, remove the cluster:
talosctl cluster destroy
1.3 - Getting Started
A guide to setting up a Talos Linux cluster on multiple machines.
This document will walk you through installing a full Talos Cluster.
If this is your first use of Talos Linux, we recommend the Quickstart first, to quickly create a local virtual cluster on your workstation.
Regardless of where you run Talos, in general you need to:
acquire the installation image
decide on the endpoint for Kubernetes
optionally create a load balancer
configure Talos
configure talosctl
bootstrap Kubernetes
Prerequisites
talosctl
talosctl is a CLI tool which interfaces with the Talos API in
an easy manner.
The most general way to install Talos is to use the ISO image (note there are easier methods for some platforms, such as pre-built AMIs for AWS - check the specific Installation Guides.)
The latest ISO image can be found on the Github Releases page:
When booted from the ISO, Talos will run in RAM, and will not install itself
until it is provided a configuration.
Thus, it is safe to boot the ISO onto any machine.
Alternative Booting
For network booting and self-built media, you can use the published kernel and initramfs images:
Note that to use alternate booting, there are a number of required kernel parameters.
Please see the kernel docs for more information.
Decide the Kubernetes Endpoint
In order to configure Kubernetes, Talos needs to know
what the endpoint (DNS name or IP address) of the Kubernetes API Server will be.
The endpoint should be the fully-qualified HTTP(S) URL for the Kubernetes API
Server, which (by default) runs on port 6443 using HTTPS.
Thus, the format of the endpoint may be something like:
https://192.168.0.10:6443
https://kube.mycluster.mydomain.com:6443
https://[2001:db8:1234::80]:6443
The Kubernetes API Server endpoint, in order to be highly available, should be configured in a way that functions off all available control plane nodes.
There are three common ways to do this:
Dedicated Load-balancer
If you are using a cloud provider or have your own load-balancer (such
as HAProxy, nginx reverse proxy, or an F5 load-balancer), using
a dedicated load balancer is a natural choice.
Create an appropriate frontend matching the endpoint, and point the backends at the addresses of each of the Talos control plane nodes.
(Note that given we have not yet created the control plane nodes, the IP addresses of the backends may not be known yet.
We can bind the backends to the frontend at a later point.)
Layer 2 Shared IP
Talos has integrated support for serving Kubernetes from a shared/virtual IP address.
This method relies on Layer 2 connectivity between control plane Talos nodes.
In this case, we choose an unused IP address on the same subnet as the Talos
control plane nodes.
For instance, if your control plane node IPs are:
192.168.0.10
192.168.0.11
192.168.0.12
you could choose the ip 192.168.0.15 as your shared IP address.
(Make sure that 192.168.0.15 is not used by any other machine and that your DHCP server
will not serve it to any other machine.)
Once chosen, form the full HTTPS URL from this IP:
https://192.168.0.15:6443
If you create a DNS record for this IP, note you will need to use the IP address itself, not the DNS name, to configure the shared IP (machine.network.interfaces[].vip.ip) in the Talos configuration.
For more information about using a shared IP, see the related
Guide
DNS records
If neither of the other methods work for you, you can use DNS records to
provide a measure of redundancy.
In this case, you would add multiple A or AAAA records (one for each control plane node) to a DNS name.
For instance, you could add:
kube.cluster1.mydomain.com IN A 192.168.0.10
kube.cluster1.mydomain.com IN A 192.168.0.11
kube.cluster1.mydomain.com IN A 192.168.0.12
Then, your endpoint would be:
https://kube.cluster1.mydomain.com:6443
Decide how to access the Talos API
Many administrative tasks are performed by calling the Talos API on Talos Linux control plane nodes.
We recommend accessing the control plane nodes directly from the talosctl client, if possible (i.e. set your endpoints to the IP addresses of the control plane nodes).
This requires your control plane nodes to be reachable from the client IP.
If the control plane nodes are not directly reachable from the workstation where you run talosctl, then configure a load balancer for TCP port 50000 to be forwarded to the control plane nodes.
Do not use Talos Linux’s built in VIP support for accessing the Talos API, as it will not function in the event of an etcd failure, and you will not be able to access the Talos API to fix things.
If you create a load balancer to forward the Talos API calls, make a note of the IP or
hostname so that you can configure your talosctl tool’s endpoints below.
Configure Talos
When Talos boots without a configuration, such as when using the Talos ISO, it
enters a limited maintenance mode and waits for a configuration to be provided.
In other installation methods, a configuration can be passed in on boot.
For example, Talos can be booted with the talos.config kernel
commandline argument set to an HTTP(s) URL from which it should receive its
configuration.
Where a PXE server is available, this is much more efficient than
manually configuring each node.
If you do use this method, note that Talos requires a number of other
kernel commandline parameters.
See required kernel parameters.
If creating EC2 kubernetes clusters, the configuration file can be passed in as --user-data to the aws ec2 run-instances command.
In any case, we need to generate the configuration which is to be provided:
talosctl gen config cluster-name cluster-endpoint
Here, cluster-name is an arbitrary name for the cluster, used
in your local client configuration as a label.
It should be unique in the configuration on your local workstation.
The cluster-endpoint is the Kubernetes Endpoint you
selected from above.
This is the Kubernetes API URL, and it should be a complete URL, with https://
and port.
(The default port is 6443, but you may have configured your load balancer to forward a different port.)
For example:
talosctl gen config my-cluster https://192.168.64.15:6443
generating PKI and tokens
created /Users/taloswork/controlplane.yaml
created /Users/taloswork/worker.yaml
created /Users/taloswork/talosconfig
When you run this command, a number of files are created in your current
directory:
controlplane.yaml
worker.yaml
talosconfig
The .yaml files are Machine Configs.
They provide Talos Linux servers their complete configuration,
describing everything from what disk Talos should be installed on, to network settings.
The controlplane.yaml file describes how Talos should form a Kubernetes cluster.
The talosconfig file (which is also YAML) is your local client configuration file.
Controlplane and Worker
The two types of Machine Configs correspond to the two roles of Talos nodes, control plane (which run both the Talos and Kubernetes control planes) and worker nodes (which run the workloads).
The main difference between Controlplane Machine Config files and Worker Machine
Config files is that the former contains information about how to form the
Kubernetes cluster.
Modifying the Machine configs
The generated Machine Configs have defaults that work for many cases.
They use DHCP for interface configuration, and install to /dev/sda.
If the defaults work for your installation, you may use them as is.
Sometimes, you will need to modify the generated files so they work with your systems.
A common example is needing to change the default installation disk.
If you try to to apply the machine config to a node, and get an error like the below, you need to specify a different installation disk:
talosctl apply-config --insecure -n 192.168.64.8 --file controlplane.yaml
error applying new configuration: rpc error: code= InvalidArgument desc= configuration validation failed: 1 error occurred:
* specified install disk does not exist: "/dev/sda"
You can verify which disks your nodes have by using the talosctl disks --insecure command.
Insecure mode is needed at this point as the PKI infrastructure has not yet been set up.
For example:
talosctl -n 192.168.64.8 disks --insecure
DEV MODEL SERIAL TYPE UUID WWID MODALIAS NAME SIZE BUS_PATH
/dev/vda - - HDD - - virtio:d00000002v00001AF4 - 69 GB /pci0000:00/0000:00:06.0/virtio2/
In this case, you would modiy the controlplane.yaml and worker.yaml and edit the line:
install:
disk: /dev/sda # The disk used for installations.
to reflect vda instead of sda.
Machine Configs as Templates
Individual machines may need different settings: for instance, each may have a
different static IP address.
When different files are needed for machines of the same type, simply
copy the source template (controlplane.yaml or worker.yaml) and make whatever
modifications are needed.
For instance, if you had three control plane nodes and three worker nodes, you
may do something like this:
for i in $(seq 0 2); do cp controlplane.yaml cp$i.yaml
end
for i in $(seq 0 2); do cp worker.yaml w$i.yaml
end
Then modify each file as needed.
Apply Configuration
To apply the Machine Configs, you need to know the machines’ IP addresses.
Talos will print out the IP addresses of the machines on the console during the boot process:
The insecure flag is necessary because the PKI infrastructure has not yet been made available to the node.
Note: the connection will be encrypted, it is just unauthenticated.
If you have console access you can extract the server certificate fingerprint and use it for an additional layer of validation:
Using the fingerprint allows you to be sure you are sending the configuration to the correct machine, but it is completely optional.
After the configuration is applied to a node, it will reboot.
Repeat this process for each of the nodes in your cluster.
Understand talosctl, endpoints and nodes
It is important to understand the concept of endpoints and nodes.
In short: endpoints are the nodes that talosctl sends commands to, but nodes are the nodes that the command operates on.
The endpoint will forward the command to the nodes, if needed.
Endpoints
Endpoints are the IP addresses to which the talosctl client directly talks.
These should be the set of control plane nodes, either directly or through a load balancer.
Each endpoint will automatically proxy requests destined to another node in the cluster.
This means that you only need access to the control plane nodes in order to access the rest of the network.
talosctl will automatically load balance requests and fail over between all of your endpoints.
You can pass in --endpoints <IP Address1>,<IP Address2> as a comma separated list of IP/DNS addresses to the current talosctl command.
You can also set the endpoints in your talosconfig, by calling talosctl config endpoint <IP Address1> <IP Address2>.
Note: these are space separated, not comma separated.
As an example, if the IP addresses of our control plane nodes are:
The node is the target you wish to perform the API call on.
When specifying nodes, their IPs and/or hostnames are as seen by the endpoint servers, not as from the client.
This is because all connections are proxied through the endpoints.
You may provide -n or --nodes to any talosctl command to supply the node or (comma-separated) nodes on which you wish to perform the operation.
For example, to see the containers running on node 192.168.0.200:
talosctl -n 192.168.0.200 containers
To see the etcd logs on both nodes 192.168.0.10 and 192.168.0.11:
talosctl -n 192.168.0.10,192.168.0.11 logs etcd
It is possible to set a default set of nodes in the talosconfig file, but our recommendation is to explicitly pass in the node or nodes to be operated on with each talosctl command.
For a more in-depth discussion of Endpoints and Nodes, please see talosctl.
Default configuration file
You can reference which configuration file to use directly with the --talosconfig parameter:
talosctl --talosconfig=./talosconfig \
--nodes 192.168.0.2 version
However, talosctl comes with tooling to help you integrate and merge this configuration into the default talosctl configuration file.
This is done with the merge option.
talosctl config merge ./talosconfig
This will merge your new talosconfig into the default configuration file ($XDG_CONFIG_HOME/talos/config.yaml), creating it if necessary.
Like Kubernetes, the talosconfig configuration files has multiple “contexts” which correspond to multiple clusters.
The <cluster-name> you chose above will be used as the context name.
Kubernetes Bootstrap
Bootstrapping your Kubernetes cluster with Talos is as simple as:
talosctl bootstrap --nodes 192.168.0.2
The bootstrap operation should only be called ONCE and only on a SINGLE control plane node!
The IP can be any of your control planes (or the loadbalancer, if used for the Talos API endpoint).
At this point, Talos will form an etcd cluster, generate all of the core Kubernetes assets, and start the Kubernetes control plane components.
After a few moments, you will be able to download your Kubernetes client configuration and get started:
talosctl kubeconfig
Running this command will add (merge) you new cluster into your local Kubernetes configuration.
If you would prefer the configuration to not be merged into your default Kubernetes configuration file, pass in a filename:
talosctl kubeconfig alternative-kubeconfig
You should now be able to connect to Kubernetes and see your nodes:
kubectl get nodes
And use talosctl to explore your cluster:
talosctl -n <NODEIP> dashboard
For a list of all the commands and operations that talosctl provides, see the CLI reference.
1.4 - Theila UI for Talos
An intro to Theila - a UI for Talos clusters.
Once you have a Talos cluster running, you may find it easier to get insights on your cluster(s) using a visual user interface rather than the talosctl CLI.
For this, Sidero Labs provides Theila, a simple, single-binary web-based visual user interface for Talos clusters.
Prerequisites
You should have a Talos cluster up & running, and the talosconfig file for Theila to access it.
Installation
Theila is published as a single static binary compiled for various platforms and architectures, as well as a container image.
Binary
You can download the correct binary for your system from the releases page, or use the following commands in your terminal.
Once it is running you should be able to point a browser at http://localhost:8080 to open the Theila UI.
Clusters
You can navigate around various Talos clusters using the menu at the upper-left corner (see 1.1), then selecting the specific cluster from the list (see 1.2).
Cluster Overview
Clicking on the “Overview” option in the menu (see 2.1) will display an overview of resource use & health of the cluster.
Nodes
Entering the “Nodes” section on the menu (see 3.1) will give a list of nodes in the cluster (see 3.2), along with information such as IP address, status, and any roles assigned to the node.
Opening the node menu (see 3.3) show the actions that can be taken on a specific node.
Clicking on a specific node name in the node list will open the node detail page for more information on each specific node (see 4.1), including running services and their logs (see 4.2).
Clicking on the “Monitor” tab (see 5.1) allows you to watch resource use over time, with CPU and memory consumption graphs updated in real time, and a detailed list of running process each with their individual resource use (see 5.2).
Lastly, the “Dmesg” tab shows all kernel messages of the node since boot.
Pods
Using the “Pods” section on the menu (see 6.1) will list all pods in the cluster, across all namespaces.
Clicking on the drop-down arrow (see 6.2) will open up more detailed information of the specified pod.
1.5 - System Requirements
Hardware requirements for running Talos Linux.
Minimum Requirements
Role
Memory
Cores
Control Plane
2GB
2
Worker
1GB
1
Recommended
Role
Memory
Cores
Control Plane
4GB
4
Worker
2GB
2
These requirements are similar to that of kubernetes.
Talos now defaults to node-role.kubernetes.io/control-plane label/taint.
On upgrades Talos now removes the node-role.kubernetes.io/master label/taint on control-plane nodes and replaces it with the node-role.kubernetes.io/control-plane label/taint.
Workloads that tolerate the old taints or have node selectors with the old labels will need to be updated.
Previously Talos labeled control plane nodes with both control-plane and master labels and tainted the node with master taint.
Scheduling on Control Plane Nodes
Machine configuration .cluster.allowSchedulingOnMasters is deprecated and replaced by .cluster.allowSchedulingOnControlPlanes.
The .cluster.allowSchedulingOnMasters will be removed in a future release of Talos.
If both .cluster.allowSchedulingOnMasters and .cluster.allowSchedulingOnControlPlanes are set to true, the .cluster.allowSchedulingOnControlPlanes will be used.
Control Plane Components
Talos now run all Kubernetes Control Plane Components with the CRI default Seccomp Profile and other recommendations as described in
KEP-2568.
k8s.gcr.io Registry
Talos now defaults to adding a registry mirror configuration in the machine configuration for k8s.gcr.io pointing to both registry.k8s.io and k8s.gcr.io unless overridden.
This is in line with the Kubernetes 1.25 release having the new registry.k8s.io registry endpoint.
This is only enabled by default on newly generated configurations and not on upgrades.
This can be enabled with a machine configuration as follows:
Talos now supports creating custom seccomp profiles on the host machine which in turn can be used by Kubernetes workloads.
It can be configured in the machine config as below:
This profile data can be either configured as a YAML definition or as a JSON string.
The profiles are created on the host under /var/lib/kubelet/seccomp/profiles.
Default seccomp Profile
Talos now runs Kubelet with the CRI default Seccomp Profile enabled.
This can be disabled by setting .machine.kubelet.defaultRuntimeSeccompProfileEnabled to false.
This feature is not enabled automatically on upgrades, so upgrading to Talos v1.2 needs this to be explicitly enabled.
Talos now generates the default hostname (when there is no explicitly specified hostname) for the nodes based on the
node id (e.g. talos-2gd-76y) instead of using the DHCP assigned IP address (e.g. talos-172-20-0-2).
This ensures that the node hostname is not changed when DHCP assigns a new IP to a node.
Note: the stable hostname generation algorithm changed between v1.2.0-beta.0 and v1.2.0-beta.1, please take care when upgrading
from versions >= 1.2.0-alpha.1 to versions >= 1.2.0-beta.1 when using stable default hostname feature.
Packet Capture
Talos now supports capturing packets on a network interface with talosctl pcap command:
talosctl pcap --interface eth0
Cluster Discovery and KubeSpan
KubeSpan Kubernetes Network Advertisement
KubeSpan no longer by default advertises Kubernetes pod networks of the node over KubeSpan.
This means that CNI should handle encapsulation of pod-to-pod traffic into the node-to-node tunnel,
and node-to-node traffic will be handled by KubeSpan.
This provides better compatibility with popular CNIs like Calico and Cilium.
Old behavior can be restored by setting .machine.kubespan.advertiseKubernetesNetworks = true in the machine config.
Kubernetes Discovery Backend
Kubernetes cluster discovery backend is now disabled by default for new clusters.
This backend doesn’t provide any benefits over the Discovery Service based backend, while it
causes issues for KubeSpan enabled clusters when control plane endpoint is KubeSpan-routed.
For air-gapped installations when the Discovery Service is not enabled, Kubernetes Discovery Backend can be enabled by applying
the following machine configuration patch:
The advertisedSubnets setting is used to control which subnet is used for etcd peer communication, it will be advertised
by each peer for other peers to connect to.
If advertiseSubnets is set, listenSubnets defaults to the same value, so that
etcd only listens on the same subnet as it advertises.
Additional subnets can be configured in listenSubnets if needed.
Default behavior hasn’t changed - if the advertisedSubnets is not set, Talos picks up the first available network address as
an advertised address and etcd is configured to listen on all interfaces.
Note: most of the etcd configuration changes are accepted on the fly, but they are fully applied only after a reboot.
CLI
Tracking Progress of API Calls
talosctl subcommands shutdown, reboot, reset and upgrade now have a new flag --wait to
wait until the operation is completed, displaying information on the current status of each node.
A new --debug flag is added to these commands to get the kernel logs output from these nodes if the operation fails.
Generating Machine Config from Secrets
It is now possible to pre-generate secret material for the cluster with talosctl gen secrets:
talosctl gen secrets -o cluster1-secrets.yaml
Secrets file should be stored in a safe place, and machine configuration for the node in the cluster can be generated on demand
with talosctl gen config:
talosctl gen config --with-secrets cluster1-secrets.yaml cluster1 https://cluster1.example.com:6443/
This way configuration can be generated on demand, for example with configuration patches.
Nodes with machine configuration generated from the same secrets file can join each other to form a cluster.
Integrations
Talos API access from Kubernetes
Talos now supports access to its API from within Kubernetes.
It can be configured in the machine config as below:
This feature introduces a new custom resource definition, serviceaccounts.talos.dev.
Creating custom resources of this type will provide credentials to access Talos API from within Kubernetes.
The new CLI subcommand talosctl inject serviceaccount can be used to configure Kubernetes manifests with Talos service accounts as below:
NVIDIA GPU support on Talos has been promoted to beta and SideroLabs now publishes the NVIDIA Open GPU Kernel Modules as a system extension making it easier to run GPU workloads on Talos.
Refer to enabling NVIDIA GPU support docs here:
In this guide we will create an Kubernetes cluster with 1 worker node, and 2 controlplane nodes.
We assume an existing digital rebar deployment, and some familiarity with iPXE.
We leave it up to the user to decide if they would like to use static networking, or DHCP.
The setup and configuration of DHCP will not be covered.
Create the Machine Configuration Files
Generating Base Configurations
Using the DNS name of the load balancer, generate the base configuration files for the Talos machines:
$ talosctl gen config talos-k8s-metal-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig
The loadbalancer is used to distribute the load across multiple controlplane nodes.
This isn’t covered in detail, because we assume some loadbalancing knowledge before hand.
If you think this should be added to the docs, please create a issue.
At this point, you can modify the generated configs to your liking.
Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.
Validate the Configuration Files
$ talosctl validate --config controlplane.yaml --mode metal
controlplane.yaml is valid for metal mode
$ talosctl validate --config worker.yaml --mode metal
worker.yaml is valid for metal mode
Publishing the Machine Configuration Files
Digital Rebar has a built-in fileserver, which means we can use this feature to expose the talos configuration files.
We will place controlplane.yaml, and worker.yaml into Digital Rebar file server by using the drpcli tools.
Copy the generated files from the step above into your Digital Rebar installation.
drpcli file upload <file>.yaml as <file>.yaml
Replacing <file> with controlplane or worker.
Download the boot files
Download a recent version of boot.tar.gz from github.
At this point we can retrieve the admin kubeconfig by running:
talosctl --talosconfig talosconfig kubeconfig .
2.1.1.2 - Equinix Metal
Creating Talos clusters with Equinix Metal.
You can create a Talos Linux cluster on Equinix Metal in a variety of ways, such as through the EM web UI, the metal command line too, or through PXE booting.
Talos Linux is a supported OS install option on Equinix Metal, so it’s an easy process.
Regardless of the method, the process is:
Create a DNS entry for your Kubernetes endpoint.
Generate the configurations using talosctl.
Provision your machines on Equinix Metal.
Push the configurations to your servers (if not done as part of the machine provisioning).
configure your Kubernetes endpoint to point to the newly created control plane nodes
bootstrap the cluster
Define the Kubernetes Endpoint
There are a variety of ways to create an HA endpoint for the Kubernetes cluster.
Some of the ways are:
DNS
Load Balancer
BGP
Whatever way is chosen, it should result in an IP address/DNS name that routes traffic to all the control plane nodes.
We do not know the control plane node IP addresses at this stage, but we should define the endpoint DNS entry so that we can use it in creating the cluster configuration.
After the nodes are provisioned, we can use their addresses to create the endpoint A records, or bind them to the load balancer, etc.
Create the Machine Configuration Files
Generating Configurations
Using the DNS name of the loadbalancer defined above, generate the base configuration files for the Talos machines:
$ talosctl gen config talos-k8s-em-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig
The port used above should be 6443, unless your load balancer maps a different port to port 6443 on the control plane nodes.
Validate the Configuration Files
talosctl validate --config controlplane.yaml --mode metal
talosctl validate --config worker.yaml --mode metal
Note: Validation of the install disk could potentially fail as validation
is performed on your local machine and the specified disk may not exist.
Passing in the configuration as User Data
You can use the metadata service provide by Equinix Metal to pass in the machines configuration.
It is required to add a shebang to the top of the configuration file.
The convention we use is #!talos.
Provision the machines in Equinix Metal
Using the Equinix Metal UI
Simply select the location and type of machines in the Equinix Metal web interface.
Select Talos as the Operating System, then select the number of servers to create, and name them (in lowercase only.)
Under optional settings, you can optionally paste in the contents of controlplane.yaml that was generated, above (ensuring you add a first line of #!talos).
You can repeat this process to create machines of different types for control plane and worker nodes (although you would pass in worker.yaml for the worker nodes, as user data).
If you did not pass in the machine configuration as User Data, you need to provide it to each machine, with the following command:
This guide assumes the user has a working API token,and the Equinix Metal CLI installed.
Because Talos Linux is a supported operating system, Talos Linux machines can be provisioned directly via the CLI, using the -O talos_v1 parameter (for Operating System).
Note: Ensure you have prepended #!talos to the controlplane.yaml file.
Now our control plane nodes have been created, and we know their IP addresses, we can associate them with the Kubernetes endpoint.
Configure your load balancer to route traffic to these nodes, or add A records to your DNS entry for the endpoint, for each control plane node.
e.g.
host endpoint.mydomain.com
endpoint.mydomain.com has address 145.40.90.201
endpoint.mydomain.com has address 147.75.109.71
endpoint.mydomain.com has address 145.40.90.177
This only needs to be issued to one control plane node.
Retrieve the kubeconfig
At this point we can retrieve the admin kubeconfig by running:
talosctl --talosconfig talosconfig kubeconfig .
2.1.1.3 - Matchbox
In this guide we will create an HA Kubernetes cluster with 3 worker nodes using an existing load balancer and matchbox deployment.
Creating a Cluster
In this guide we will create an HA Kubernetes cluster with 3 worker nodes.
We assume an existing load balancer, matchbox deployment, and some familiarity with iPXE.
We leave it up to the user to decide if they would like to use static networking, or DHCP.
The setup and configuration of DHCP will not be covered.
Create the Machine Configuration Files
Generating Base Configurations
Using the DNS name of the load balancer, generate the base configuration files for the Talos machines:
$ talosctl gen config talos-k8s-metal-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig
At this point, you can modify the generated configs to your liking.
Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.
Validate the Configuration Files
$ talosctl validate --config controlplane.yaml --mode metal
controlplane.yaml is valid for metal mode
$ talosctl validate --config worker.yaml --mode metal
worker.yaml is valid for metal mode
Publishing the Machine Configuration Files
In bare-metal setups it is up to the user to provide the configuration files over HTTP(S).
A special kernel parameter (talos.config) must be used to inform Talos about where it should retrieve its configuration file.
To keep things simple we will place controlplane.yaml, and worker.yaml into Matchbox’s assets directory.
This directory is automatically served by Matchbox.
Create the Matchbox Configuration Files
The profiles we will create will reference vmlinuz, and initramfs.xz.
Download these files from the release of your choice, and place them in /var/lib/matchbox/assets.
Now that we have our configuration files in place, boot all the machines.
Talos will come up on each machine, grab its configuration file, and bootstrap itself.
At this point we can retrieve the admin kubeconfig by running:
talosctl --talosconfig talosconfig kubeconfig .
2.1.1.4 - Sidero
Sidero is a project created by the Talos team that has native support for Talos.
Sidero Metal is a project created by the Talos team that provides a bare metal installer for Cluster API, and that has native support for Talos Linux.
It can be easily installed using clusterctl.
The best way to get started with Sidero Metal is to visit the website.
2.1.2 - Virtualized Platforms
Installation of Talos Linux for virtualization platforms.
2.1.2.1 - Hyper-V
Creating a Talos Kubernetes cluster using Hyper-V.
Pre-requisities
Download the latest talos-amd64.iso ISO from github releases page
Create a New-TalosVM folder in any of your PS Module Path folders $env:PSModulePath -split ';' and save the New-TalosVM.psm1 there
Plan Overview
Here we will create a basic 3 node cluster with a single control-plane node and two worker nodes.
The only difference between control plane and worker node is the amount of RAM and an additional storage VHD.
This is personal preference and can be configured to your liking.
We are using a VMNamePrefix argument for a VM Name prefix and not the full hostname.
This command will find any existing VM with that prefix and “+1” the highest suffix it finds.
For example, if VMs talos-cp01 and talos-cp02 exist, this will create VMs starting from talos-cp03, depending on NumberOfVMs argument.
Setup a Control Plane Node
Use the following command to create a single control plane node:
This will create two VMs: talos-worker01 and talos-wworker02 and attach an additional VHD of 50GB for storage (which in my case will be passed to Mayastor).
Pushing Config to the Nodes
Now that our VMs are ready, find their IP addresses from console of VM.
With that information, push config to the control plane node with:
# set control plane IP variable$CONTROL_PLANE_IP='10.10.10.x'# Generate talos configtalosctl gen config talos-cluster https://$($CONTROL_PLANE_IP):6443 --output-dir .
# Apply config to control plane nodetalosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file .\controlplane.yaml
Now that our nodes are ready, we are ready to bootstrap the Kubernetes cluster.
# Use following command to set node and endpoint permanantly in config so you dont have to type it everytimetalosctl config endpoint $CONTROL_PLANE_IPtalosctl config node $CONTROL_PLANE_IP# Bootstrap clustertalosctl bootstrap
# Generate kubeconfigtalosctl kubeconfig .
This will generate the kubeconfig file, you can use to connect to the cluster.
2.1.2.2 - KVM
Talos is known to work on KVM.
We don’t yet have a documented guide specific to KVM; however, you can have a look at our
Vagrant & Libvirt guide which uses KVM for virtualization.
If you run into any issues, our community can probably help!
2.1.2.3 - Proxmox
Creating Talos Kubernetes cluster using Proxmox.
In this guide we will create a Kubernetes cluster using Proxmox.
Video Walkthrough
To see a live demo of this writeup, visit Youtube here:
Installation
How to Get Proxmox
It is assumed that you have already installed Proxmox onto the server you wish to create Talos VMs on.
Visit the Proxmox downloads page if necessary.
In order to install Talos in Proxmox, you will need the ISO image from the Talos release page.
You can download talos-amd64.iso via
github.com/siderolabs/talos/releases
From the Proxmox UI, select the “local” storage and enter the “Content” section.
Click the “Upload” button:
Select the ISO you downloaded previously, then hit “Upload”
Create VMs
Start by creating a new VM by clicking the “Create VM” button in the Proxmox UI:
Fill out a name for the new VM:
In the OS tab, select the ISO we uploaded earlier:
Keep the defaults set in the “System” tab.
Keep the defaults in the “Hard Disk” tab as well, only changing the size if desired.
In the “CPU” section, give at least 2 cores to the VM:
Note: As of Talos v1.0 (which requires the x86-64-v2 microarchitecture), booting
with the default Processor Type kvm64 will not work.
You can enable the required
CPU features after creating the VM by adding the following line in the corresponding
/etc/pve/qemu-server/<vmid>.conf file:
Alternatively, you can set the Processor Type to host if your Proxmox host supports
these CPU features, this however prevents using live VM migration.
Verify that the RAM is set to at least 2GB:
Keep the default values for networking, verifying that the VM is set to come up on the bridge interface:
Finish creating the VM by clicking through the “Confirm” tab and then “Finish”.
Repeat this process for a second VM to use as a worker node.
You can also repeat this for additional nodes desired.
Start Control Plane Node
Once the VMs have been created and updated, start the VM that will be the first control plane node.
This VM will boot the ISO image specified earlier and enter “maintenance mode”.
With DHCP server
Once the machine has entered maintenance mode, there will be a console log that details the IP address that the node received.
Take note of this IP address, which will be referred to as $CONTROL_PLANE_IP for the rest of this guide.
If you wish to export this IP as a bash variable, simply issue a command like export CONTROL_PLANE_IP=1.2.3.4.
Without DHCP server
To apply the machine configurations in maintenance mode, VM has to have IP on the network.
So you can set it on boot time manually.
Press e on the boot time.
And set the IP parameters for the VM.
Format is:
With the IP address above, you can now generate the machine configurations to use for installing Talos and Kubernetes.
Issue the following command, updating the output directory, cluster name, and control plane IP as you see fit:
talosctl gen config talos-vbox-cluster https://$CONTROL_PLANE_IP:6443 --output-dir _out
This will create several files in the _out directory: controlplane.yaml, worker.yaml, and talosconfig.
Create Control Plane Node
Using the controlplane.yaml generated above, you can now apply this config using talosctl.
Issue:
You should now see some action in the Proxmox console for this VM.
Talos will be installed to disk, the VM will reboot, and then Talos will configure the Kubernetes control plane on this VM.
Note: This process can be repeated multiple times to create an HA control plane.
Create Worker Node
Create at least a single worker node using a process similar to the control plane creation above.
Start the worker node VM and wait for it to enter “maintenance mode”.
Take note of the worker node’s IP address, which will be referred to as $WORKER_IP
Note: This process can be repeated multiple times to add additional workers.
Using the Cluster
Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster.
For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace.
To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.
First, configure talosctl to talk to your control plane node by issuing the following, updating paths and IPs as necessary:
We will use Vagrant and its libvirt plugin to create a KVM-based cluster with 3 control plane nodes and 1 worker node.
For this, we will mount Talos ISO into the VMs using a virtual CD-ROM,
and configure the VMs to attempt to boot from the disk first with the fallback to the CD-ROM.
We will also configure a virtual IP address on Talos to achieve high-availability on kube-apiserver.
Preparing the environment
First, we download the latest talos-amd64.iso ISO from GitHub releases into the /tmp directory.
Current machine states:
control-plane-node-1 not created (libvirt)
control-plane-node-2 not created (libvirt)
control-plane-node-3 not created (libvirt)
worker-node-1 not created (libvirt)
Congratulations, you have a highly-available Talos cluster running!
Cleanup
You can destroy the vagrant environment by running:
vagrant destroy -f
And remove the ISO image you downloaded:
sudo rm -f /tmp/talos-amd64.iso
2.1.2.5 - VMware
Creating Talos Kubernetes cluster using VMware.
Creating a Cluster via the govc CLI
In this guide we will create an HA Kubernetes cluster with 2 worker nodes.
We will use the govc cli which can be downloaded here.
Prereqs/Assumptions
This guide will use the virtual IP (“VIP”) functionality that is built into Talos in order to provide a stable, known IP for the Kubernetes control plane.
This simply means the user should pick an IP on their “VM Network” to designate for this purpose and keep it handy for future steps.
Create the Machine Configuration Files
Generating Base Configurations
Using the VIP chosen in the prereq steps, we will now generate the base configuration files for the Talos machines.
This can be done with the talosctl gen config ... command.
Take note that we will also use a JSON6902 patch when creating the configs so that the control plane nodes get some special information about the VIP we chose earlier, as well as a daemonset to install vmware tools on talos nodes.
First, download cp.patch.yaml to your local machine and edit the VIP to match your chosen IP.
You can do this by issuing: curl -fsSLO https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.2/talos-guides/install/virtualized-platforms/vmware/cp.patch.yaml.
It’s contents should look like the following:
With the patch in hand, generate machine configs with:
$ talosctl gen config vmware-test https://<VIP>:<port> --config-patch-control-plane @cp.patch.yaml
created controlplane.yaml
created worker.yaml
created talosconfig
At this point, you can modify the generated configs to your liking if needed.
Optionally, you can specify additional patches by adding to the cp.patch.yaml file downloaded earlier, or create your own patch files.
Validate the Configuration Files
$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode
Set Environment Variables
govc makes use of the following environment variables
As part of this guide, we have a more automated install script that handles some of the complexity of importing OVAs and creating VMs.
If you wish to use this script, we will detail that next.
If you wish to carry out the manual approach, simply skip ahead to the “Manual Approach” section.
Scripted Install
Download the vmware.sh script to your local machine.
You can do this by issuing curl -fsSLO "https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.2/talos-guides/install/virtualized-platforms/vmware/vmware.sh".
This script has default variables for things like Talos version and cluster name that may be interesting to tweak before deploying.
Import OVA
To create a content library and import the Talos OVA corresponding to the mentioned Talos version, simply issue:
./vmware.sh upload_ova
Create Cluster
With the OVA uploaded to the content library, you can create a 5 node (by default) cluster with 3 control plane and 2 worker nodes:
./vmware.sh create
This step will create a VM from the OVA, edit the settings based on the env variables used for VM size/specs, then power on the VMs.
You may now skip past the “Manual Approach” section down to “Bootstrap Cluster”.
Manual Approach
Import the OVA into vCenter
A talos.ova asset is published with each release.
We will refer to the version of the release as $TALOS_VERSION below.
It can be easily exported with export TALOS_VERSION="v0.3.0-alpha.10" or similar.
Talos makes use of the guestinfo facility of VMware to provide the machine/cluster configuration.
This can be set using the govc vm.change command.
To facilitate persistent storage using the vSphere cloud provider integration with Kubernetes, disk.enableUUID=1 is used.
In the vSphere UI, open a console to one of the control plane nodes.
You should see some output stating that etcd should be bootstrapped.
This text should look like:
"etcd is waiting to join the cluster, if this node is the first node in the cluster, please run `talosctl bootstrap` against one of the following IPs:
The talos-vmtoolsd application was deployed as a daemonset as part of the cluster creation; however, we must now provide a talos credentials file for it to use.
Once configured, you should now see these daemonset pods go into “Running” state and in vCenter, you will now see IPs and info from the Talos nodes present in the UI.
2.1.2.6 - Xen
Talos is known to work on Xen.
We don’t yet have a documented guide specific to Xen; however, you can follow the General Getting Started Guide.
If you run into any issues, our community can probably help!
2.1.3 - Cloud Platforms
Installation of Talos Linux on many cloud platforms.
2.1.3.1 - AWS
Creating a cluster via the AWS CLI.
Creating a Cluster via the AWS CLI
In this guide we will create an HA Kubernetes cluster with 3 worker nodes.
We assume an existing VPC, and some familiarity with AWS.
If you need more information on AWS specifics, please see the official AWS documentation.
Set the needed info
Change to your desired region:
REGION="us-west-2"aws ec2 describe-vpcs --region $REGIONVPC="(the VpcId from the above command)"
Create the Subnet
Use a CIDR block that is present on the VPC specified above.
Replace amd64 in the line above with the desired architecture.
Note the AMI id that is returned is assigned to an environment variable: it will be used later when booting instances.
We now have an AMI we can use to create our cluster.
Save the AMI ID, as we will need it when we create EC2 instances.
AMI="(AMI ID of the register image command)"
Create a Security Group
aws ec2 create-security-group \
--region $REGION\
--group-name talos-aws-tutorial-sg \
--description "Security Group for EC2 instances to allow ports required by Talos"SECURITY_GROUP="(security group id that is returned)"
Using the security group from above, allow all internal traffic within the same security group:
Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines.
Note that the port used here is the externally accessible port configured on the load balancer - 443 - not the internal port of 6443:
$ talosctl gen config talos-k8s-aws-tutorial https://<load balancer DNS>:<port> --with-examples=false --with-docs=falsecreated controlplane.yaml
created worker.yaml
created talosconfig
Note that the generated configs are too long for AWS userdata field if the --with-examples and --with-docs flags are not passed.
At this point, you can modify the generated configs to your liking.
Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.
Validate the Configuration Files
$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode
Create the EC2 Instances
change the instance type if desired.
Note: There is a known issue that prevents Talos from running on T2 instance types.
Please use T3 if you need burstable instance types.
Set the endpoints (the control plane node to which talosctl commands are sent) and nodes (the nodes that the command operates on):
talosctl --talosconfig talosconfig config endpoint <control plane 1 PUBLIC IP>
talosctl --talosconfig talosconfig config node <control plane 1 PUBLIC IP>
Bootstrap etcd:
talosctl --talosconfig talosconfig bootstrap
Retrieve the kubeconfig
At this point we can retrieve the admin kubeconfig by running:
talosctl --talosconfig talosconfig kubeconfig .
The different control plane nodes should sendi/receive traffic via the load balancer, notice that one of the control plane has intiated the etcd cluster, and the others should join.
You can now watch as your cluster bootstraps, by using
talosctl --talosconfig talosconfig health
You can also watch the performance of a node, via:
talosctl --talosconfig talosconfig dashboard
And use standard kubectl commands.
2.1.3.2 - Azure
Creating a cluster via the CLI on Azure.
Creating a Cluster via the CLI
In this guide we will create an HA Kubernetes cluster with 1 worker node.
We assume existing Blob Storage, and some familiarity with Azure.
If you need more information on Azure specifics, please see the official Azure documentation.
Environment Setup
We’ll make use of the following environment variables throughout the setup.
Edit the variables below with your correct information.
# Storage account to useexportSTORAGE_ACCOUNT="StorageAccountName"# Storage container to upload toexportSTORAGE_CONTAINER="StorageContainerName"# Resource group nameexportGROUP="ResourceGroupName"# LocationexportLOCATION="centralus"# Get storage account connection string based on info aboveexportCONNECTION=$(az storage account show-connection-string \
-n $STORAGE_ACCOUNT\
-g $GROUP\
-o tsv)
Create the Image
First, download the Azure image from a Talos release.
Once downloaded, untar with tar -xvf /path/to/azure-amd64.tar.gz
Upload the VHD
Once you have pulled down the image, you can upload it to blob storage with:
Now that the image is present in our blob storage, we’ll register it.
az image create \
--name talos \
--source https://$STORAGE_ACCOUNT.blob.core.windows.net/$STORAGE_CONTAINER/talos-azure.vhd \
--os-type linux \
-g $GROUP
Network Infrastructure
Virtual Networks and Security Groups
Once the image is prepared, we’ll want to work through setting up the network.
Issue the following to create a network security group and add rules to it.
In Azure, we have to pre-create the NICs for our control plane so that they can be associated with our load balancer.
for i in $( seq 012); do# Create public IP for each nic az network public-ip create \
--resource-group $GROUP\
--name talos-controlplane-public-ip-$i\
--allocation-method static
# Create nic az network nic create \
--resource-group $GROUP\
--name talos-controlplane-nic-$i\
--vnet-name talos-vnet \
--subnet talos-subnet \
--network-security-group talos-sg \
--public-ip-address talos-controlplane-public-ip-$i\
--lb-name talos-lb \
--lb-address-pools talos-be-pool
done# NOTES:# Talos can detect PublicIPs automatically if PublicIP SKU is Basic.# Use `--sku Basic` to set SKU to Basic.
Cluster Configuration
With our networking bits setup, we’ll fetch the IP for our load balancer and create our configuration files.
LB_PUBLIC_IP=$(az network public-ip show \
--resource-group $GROUP\
--name talos-public-ip \
--query [ipAddress]\
--output tsv)talosctl gen config talos-k8s-azure-tutorial https://${LB_PUBLIC_IP}:6443
Compute Creation
We are now ready to create our azure nodes.
Azure allows you to pass Talos machine configuration to the virtual machine at bootstrap time via
user-data or custom-data methods.
Talos supports only custom-data method, machine configuration is available to the VM only on the first boot.
# Create availability setaz vm availability-set create \
--name talos-controlplane-av-set \
-g $GROUP# Create the controlplane nodesfor i in $( seq 012); do az vm create \
--name talos-controlplane-$i\
--image talos \
--custom-data ./controlplane.yaml \
-g $GROUP\
--admin-username talos \
--generate-ssh-keys \
--verbose \
--boot-diagnostics-storage $STORAGE_ACCOUNT\
--os-disk-size-gb 20\
--nics talos-controlplane-nic-$i\
--availability-set talos-controlplane-av-set \
--no-wait
done# Create worker node az vm create \
--name talos-worker-0 \
--image talos \
--vnet-name talos-vnet \
--subnet talos-subnet \
--custom-data ./worker.yaml \
-g $GROUP\
--admin-username talos \
--generate-ssh-keys \
--verbose \
--boot-diagnostics-storage $STORAGE_ACCOUNT\
--nsg talos-sg \
--os-disk-size-gb 20\
--no-wait
# NOTES:# `--admin-username` and `--generate-ssh-keys` are required by the az cli,# but are not actually used by talos# `--os-disk-size-gb` is the backing disk for Kubernetes and any workload containers# `--boot-diagnostics-storage` is to enable console output which may be necessary# for troubleshooting
Bootstrap Etcd
You should now be able to interact with your cluster with talosctl.
We will need to discover the public IP for our first control plane node first.
At this point we can retrieve the admin kubeconfig by running:
talosctl --talosconfig talosconfig kubeconfig .
2.1.3.3 - DigitalOcean
Creating a cluster via the CLI on DigitalOcean.
Creating a Cluster via the CLI
In this guide we will create an HA Kubernetes cluster with 1 worker node.
We assume an existing Space, and some familiarity with DigitalOcean.
If you need more information on DigitalOcean specifics, please see the official DigitalOcean documentation.
Create the Image
First, download the DigitalOcean image from a Talos release.
Extract the archive to get the disk.raw file, compress it using gzip to disk.raw.gz.
Using an upload method of your choice (doctl does not have Spaces support), upload the image to a space.
Now, create an image using the URL of the uploaded image:
We will need the IP of the load balancer.
Using the ID of the load balancer, run:
doctl compute load-balancer get --format IP <load balancer ID>
Save it, as we will need it in the next step.
Create the Machine Configuration Files
Generating Base Configurations
Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines:
$ talosctl gen config talos-k8s-digital-ocean-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig
At this point, you can modify the generated configs to your liking.
Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.
Validate the Configuration Files
$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode
Create the Droplets
Create the Control Plane Nodes
Run the following commands, to give ourselves three total control plane nodes:
Note: Although SSH is not used by Talos, DigitalOcean still requires that an SSH key be associated with the droplet.
Create a dummy key that can be used to satisfy this requirement.
At this point we can retrieve the admin kubeconfig by running:
talosctl --talosconfig talosconfig kubeconfig .
2.1.3.4 - GCP
Creating a cluster via the CLI on Google Cloud Platform.
Creating a Cluster via the CLI
In this guide, we will create an HA Kubernetes cluster in GCP with 1 worker node.
We will assume an existing Cloud Storage bucket, and some familiarity with Google Cloud.
If you need more information on Google Cloud specifics, please see the official Google documentation.
Once the image is prepared, we’ll want to work through setting up the network.
Issue the following to create a firewall, load balancer, and their required components.
Using GCP deployment manager automatically creates a Google Storage bucket and uploads the Talos image to it.
Once the deployment is complete the generated talosconfig and kubeconfig files are uploaded to the bucket.
By default this setup creates a three node control plane and a single worker in us-west1-b
First we need to create a folder to store our deployment manifests and perform all subsequent operations from that folder.
mkdir -p talos-gcp-deployment
cd talos-gcp-deployment
Getting the deployment manifests
We need to download two deployment manifests for the deployment from the Talos github repository.
curl -fsSLO "https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.2/talos-guides/install/cloud-platforms/gcp/config.yaml"curl -fsSLO "https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.2/talos-guides/install/cloud-platforms/gcp/talos-ha.jinja"# if using ccmcurl -fsSLO "https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.2/talos-guides/install/cloud-platforms/gcp/gcp-ccm.yaml"
Updating the config
Now we need to update the local config.yaml file with any required changes such as changing the default zone, Talos version, machine sizes, nodes count etc.
Note: The externalCloudProvider property is set to false by default.
The manifest used for deploying the ccm (cloud controller manager) is currently using the GCP ccm provided by openshift since there are no public images for the ccm yet.
Since the routes controller is disabled while deploying the CCM, the CNI pods needs to be restarted after the CCM deployment is complete to remove the node.kubernetes.io/network-unavailable taint.
See Nodes network-unavailable taint not removed after installing ccm for more information
Use a custom built image for the ccm deployment if required.
Creating the deployment
Now we are ready to create the deployment.
Confirm with y for any prompts.
Run the following command to create the deployment:
# use a unique name for the deployment, resources are prefixed with the deployment nameexportDEPLOYMENT_NAME="<deployment name>"gcloud deployment-manager deployments create "${DEPLOYMENT_NAME}" --config config.yaml
Retrieving the outputs
First we need to get the deployment outputs.
# first get the outputsOUTPUTS=$(gcloud deployment-manager deployments describe "${DEPLOYMENT_NAME}" --format json | jq '.outputs[]')BUCKET_NAME=$(jq -r '. | select(.name == "bucketName").finalValue'<<<"${OUTPUTS}")# used when cloud controller is enabledSERVICE_ACCOUNT=$(jq -r '. | select(.name == "serviceAccount").finalValue'<<<"${OUTPUTS}")PROJECT=$(jq -r '. | select(.name == "project").finalValue'<<<"${OUTPUTS}")
Note: If cloud controller manager is enabled, the below command needs to be run to allow the controller custom role to access cloud resources
In addition to the talosconfig and kubeconfig files, the storage bucket contains the controlplane.yaml and worker.yaml files used to join additional nodes to the cluster.
kubectl \
--kubeconfig kubeconfig \
--namespace kube-system \
apply \
--filename gcp-ccm.yaml
# wait for the ccm to be upkubectl \
--kubeconfig kubeconfig \
--namespace kube-system \
rollout status \
daemonset cloud-controller-manager
If the cloud controller manager is enabled, we need to restart the CNI pods to remove the node.kubernetes.io/network-unavailable taint.
# restart the CNI pods, in this case flannelkubectl \
--kubeconfig kubeconfig \
--namespace kube-system \
rollout restart \
daemonset kube-flannel
# wait for the pods to be restartedkubectl \
--kubeconfig kubeconfig \
--namespace kube-system \
rollout status \
daemonset kube-flannel
Check cluster status
kubectl \
--kubeconfig kubeconfig \
get nodes
Cleanup deployment
Warning: This will delete the deployment and all resources associated with it.
# delete the objects in the bucket firstgsutil -m rm -r "gs://${BUCKET_NAME}"gcloud deployment-manager deployments delete "${DEPLOYMENT_NAME}" --quiet
2.1.3.5 - Hetzner
Creating a cluster via the CLI (hcloud) on Hetzner.
Upload image
Hetzner Cloud does not support uploading custom images.
You can email their support to get a Talos ISO uploaded by following issues:3599 or you can prepare image snapshot by yourself.
There are two options to upload your own.
Run an instance in rescue mode and replace the system OS with the Talos image
Create a new Server in the Hetzner console.
Enable the Hetzner Rescue System for this server and reboot.
Upon a reboot, the server will boot a special minimal Linux distribution designed for repair and reinstall.
Once running, login to the server using ssh to prepare the system disk by doing the following:
# Check that you in Rescue modedf
### Result is like:# udev 987432 0 987432 0% /dev# 213.133.99.101:/nfs 308577696 247015616 45817536 85% /root/.oldroot/nfs# overlay 995672 8340 987332 1% /# tmpfs 995672 0 995672 0% /dev/shm# tmpfs 398272 572 397700 1% /run# tmpfs 5120 0 5120 0% /run/lock# tmpfs 199132 0 199132 0% /run/user/0# Download the Talos imagecd /tmp
wget -O /tmp/talos.raw.xz https://github.com/siderolabs/talos/releases/download/v1.2.8/hcloud-amd64.raw.xz
# Replace systemxz -d -c /tmp/talos.raw.xz | dd of=/dev/sda && sync
# shutdown the instanceshutdown -h now
To make sure disk content is consistent, it is recommended to shut the server down before taking an image (snapshot).
Once shutdown, simply create an image (snapshot) from the console.
You can now use this snapshot to run Talos on the cloud.
Create a new image by issuing the commands shown below.
Note that to create a new API token for your Project, switch into the Hetzner Cloud Console choose a Project, go to Access → Security, and create a new token.
# First you need set API TokenexportHCLOUD_TOKEN=${TOKEN}# Upload imagepacker init .
packer build .
# Save the image IDexportIMAGE_ID=<image-id-in-packer-output>
After doing this, you can find the snapshot in the console interface.
Using the IP/DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines by issuing:
$ talosctl gen config talos-k8s-hcloud-tutorial https://<load balancer IP or DNS>:6443
created controlplane.yaml
created worker.yaml
created talosconfig
At this point, you can modify the generated configs to your liking.
Optionally, you can specify --config-patch with RFC6902 jsonpatches which will be applied during the config generation.
Validate the Configuration Files
Validate any edited machine configs with:
$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode
Create the Servers
We can now create our servers.
Note that you can find IMAGE_ID in the snapshot section of the console: https://console.hetzner.cloud/projects/$PROJECT_ID/servers/snapshots.
Proxmox can create cloud-init disk for you.
Edit the cloud-init config information in Proxmox as follows, substitute your own information as necessary:
and then update cicustom param at /etc/pve/qemu-server/$ID.conf.
cicustom: user=local:snippets/controlplane-1.yml
ipconfig0: ip=192.168.1.10/24,gw=192.168.10.254
nameserver: 1.1.1.1
searchdomain: local
Note: snippets/controlplane-1.yml is Talos machine config.
It is usually located at /var/lib/vz/snippets/controlplane-1.yml.
This file must be placed to this path manually, as Proxmox does not support snippet uploading via API/GUI.
Click on Regenerate Image button after the above changes are made.
2.1.3.7 - OpenStack
Creating a cluster via the CLI on OpenStack.
Creating a Cluster via the CLI
In this guide, we will create an HA Kubernetes cluster in OpenStack with 1 worker node.
We will assume an existing some familiarity with OpenStack.
If you need more information on OpenStack specifics, please see the official OpenStack documentation.
Environment Setup
You should have an existing openrc file.
This file will provide environment variables necessary to talk to your OpenStack cloud.
See here for instructions on fetching this file.
Create the Image
First, download the OpenStack image from a Talos release.
These images are called openstack-$ARCH.tar.gz.
Untar this file with tar -xvf openstack-$ARCH.tar.gz.
The resulting file will be called disk.raw.
Upload the Image
Once you have the image, you can upload to OpenStack with:
openstack image create --public --disk-format raw --file disk.raw talos
Network Infrastructure
Load Balancer and Network Ports
Once the image is prepared, you will need to work through setting up the network.
Issue the following to create a load balancer, the necessary network ports for each control plane node, and associations between the two.
Creating loadbalancer:
# Create load balancer, updating vip-subnet-id if necessaryopenstack loadbalancer create --name talos-control-plane --vip-subnet-id public
# Create listeneropenstack loadbalancer listener create --name talos-control-plane-listener --protocol TCP --protocol-port 6443 talos-control-plane
# Pool and health monitoringopenstack loadbalancer pool create --name talos-control-plane-pool --lb-algorithm ROUND_ROBIN --listener talos-control-plane-listener --protocol TCP
openstack loadbalancer healthmonitor create --delay 5 --max-retries 4 --timeout 10 --type TCP talos-control-plane-pool
Creating ports:
# Create ports for control plane nodes, updating network name if necessaryopenstack port create --network shared talos-control-plane-1
openstack port create --network shared talos-control-plane-2
openstack port create --network shared talos-control-plane-3
# Create floating IPs for the ports, so that you will have talosctl connectivity to each control planeopenstack floating ip create --port talos-control-plane-1 public
openstack floating ip create --port talos-control-plane-2 public
openstack floating ip create --port talos-control-plane-3 public
Note: Take notice of the private and public IPs associated with each of these ports, as they will be used in the next step.
Additionally, take node of the port ID, as it will be used in server creation.
Associate port’s private IPs to loadbalancer:
# Create members for each port IP, updating subnet-id and address as necessary.openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-1 PORT> --protocol-port 6443 talos-control-plane-pool
openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-2 PORT> --protocol-port 6443 talos-control-plane-pool
openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-3 PORT> --protocol-port 6443 talos-control-plane-pool
Security Groups
This example uses the default security group in OpenStack.
Ports have been opened to ensure that connectivity from both inside and outside the group is possible.
You will want to allow, at a minimum, ports 6443 (Kubernetes API server) and 50000 (Talos API) from external sources.
It is also recommended to allow communication over all ports from within the subnet.
Cluster Configuration
With our networking bits setup, we’ll fetch the IP for our load balancer and create our configuration files.
LB_PUBLIC_IP=$(openstack loadbalancer show talos-control-plane -f json | jq -r .vip_address)talosctl gen config talos-k8s-openstack-tutorial https://${LB_PUBLIC_IP}:6443
Additionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.
Compute Creation
We are now ready to create our OpenStack nodes.
Create control plane:
# Create control planes 2 and 3, substituting the same info.for i in $( seq 13); do openstack server create talos-control-plane-$i --flavor m1.small --nic port-id=talos-control-plane-$i --image talos --user-data /path/to/controlplane.yaml
done
Create worker:
# Update network name as necessary.openstack server create talos-worker-1 --flavor m1.small --network shared --image talos --user-data /path/to/worker.yaml
Note: This step can be repeated to add more workers.
Bootstrap Etcd
You should now be able to interact with your cluster with talosctl.
We will use one of the floating IPs we allocated earlier.
It does not matter which one.
Using the IP/DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines by issuing:
$ talosctl gen config talos-k8s-oracle-tutorial https://<load balancer IP or DNS>:6443 --additional-sans <load balancer IP or DNS>
created controlplane.yaml
created worker.yaml
created talosconfig
At this point, you can modify the generated configs to your liking.
Optionally, you can specify --config-patch with RFC6902 jsonpatches which will be applied during the config generation.
Validate the Configuration Files
Validate any edited machine configs with:
$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode
Set the endpoints and nodes for your talosconfig with:
talosctl --talosconfig talosconfig config endpoint <load balancer IP or DNS>
talosctl --talosconfig talosconfig config node <control-plane-1-IP>
Bootstrap etcd on the first control plane node with:
talosctl --talosconfig talosconfig bootstrap
Retrieve the kubeconfig
At this point we can retrieve the admin kubeconfig by running:
talosctl --talosconfig talosconfig kubeconfig .
2.1.3.9 - Scaleway
Creating a cluster via the CLI (scw) on scaleway.com.
Talos is known to work on scaleway.com; however, it is currently undocumented.
2.1.3.10 - UpCloud
Creating a cluster via the CLI (upctl) on UpCloud.com.
In this guide we will create an HA Kubernetes cluster 3 control plane nodes and 1 worker node.
We assume some familiarity with UpCloud.
If you need more information on UpCloud specifics, please see the official UpCloud documentation.
Create the Image
The best way to create an image for UpCloud, is to build one using
Hashicorp packer, with the
upcloud-amd64.raw.xz image found on the Talos Releases.
Using the general ISO is also possible, but the UpCloud image has some UpCloud
specific features implemented, such as the fetching of metadata and user data to configure the nodes.
To create the cluster, you need a few things locally installed:
NOTE: Make sure your account allows API connections.
To do so, log into
UpCloud control panel and go to People
-> Account -> Permissions -> Allow API connections checkbox.
It is recommended
to create a separate subaccount for your API access and only set the API permission.
To use the UpCloud CLI, you need to create a config in $HOME/.config/upctl.yaml
To use the UpCloud packer plugin, you need to also export these credentials to your
environment variables, by e.g. putting the following in your .bashrc or .zshrc
Now create a new image by issuing the commands shown below.
packer init .
packer build .
After doing this, you can find the custom image in the console interface under storage.
Creating a Cluster via the CLI
Create an Endpoint
To communicate with the Talos cluster you will need a single endpoint that is used
to access the cluster.
This can either be a loadbalancer that will sit in front of
all your control plane nodes, a DNS name with one or more A or AAAA records pointing
to the control plane nodes, or directly the IP of a control plane node.
Which option is best for you will depend on your needs.
Endpoint selection has been further documented here.
After you decide on which endpoint to use, note down the domain name or IP, as
we will need it in the next step.
Create the Machine Configuration Files
Generating Base Configurations
Using the DNS name of the endpoint created earlier, generate the base
configuration files for the Talos machines:
$ talosctl gen config talos-upcloud-tutorial https://<load balancer IP or DNS>:<port> --install-disk /dev/vda
created controlplane.yaml
created worker.yaml
created talosconfig
At this point, you can modify the generated configs to your liking.
Depending on the Kubernetes version you want to run, you might need to select a different Talos version, as not all versions are compatible.
You can find the support matrix here.
Optionally, you can specify --config-patch with RFC6902 jsonpatch or yamlpatch
which will be applied during the config generation.
Validate the Configuration Files
$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode
Create the Servers
Create the Control Plane Nodes
Run the following to create three total control plane nodes:
for ID in $(seq 3); do upctl server create \
--zone us-nyc1 \
--title talos-us-nyc1-master-$ID\
--hostname talos-us-nyc1-master-$ID\
--plan 2xCPU-4GB \
--os "Talos (v1.2.8)"\
--user-data "$(cat controlplane.yaml)"\
--enable-metada
done
Note: modify the zone and OS depending on your preferences.
The OS should match the template name generated with packer in the previous step.
Note the IP address of the first control plane node, as we will need it later.
To configure talosctl we will need the first control plane node’s IP, as noted earlier.
We only add one node IP, as that is the entry into our cluster against which our commands will be run.
All requests to other nodes are proxied through the endpoint, and therefore not
all nodes need to be manually added to the config.
You don’t want to run your commands against all nodes, as this can destroy your
cluster if you are not careful (further documentation).
At this point we can retrieve the admin kubeconfig by running:
talosctl --talosconfig talosconfig kubeconfig
It will take a few minutes before Kubernetes has been fully bootstrapped, and is accessible.
You can check if the nodes are registered in Talos by running
talosctl --talosconfig talosconfig get members
To check if your nodes are ready, run
kubectl get nodes
2.1.3.11 - Vultr
Creating a cluster via the CLI (vultr-cli) on Vultr.com.
Creating a Cluster using the Vultr CLI
This guide will demonstrate how to create a highly-available Kubernetes cluster with one worker using the Vultr cloud provider.
Vultr have a very well documented REST API, and an open-source CLI tool to interact with the API which will be used in this guide.
Make sure to follow installation and authentication instructions for the vultr-cli tool.
Upload image
First step is to make the Talos ISO available to Vultr by uploading the latest release of the ISO to the Vultr ISO server.
vultr-cli iso create --url https://github.com/siderolabs/talos/releases/download/v1.2.8/talos-amd64.iso
Make a note of the ID in the output, it will be needed later when creating the instances.
Create a Load Balancer
A load balancer is needed to serve as the Kubernetes endpoint for the cluster.
Make a note of the ID of the load balancer from the output of the above command, it will be needed after the control plane instances are created.
vultr-cli load-balancer get $LOAD_BALANCER_ID | grep ^IP
Make a note of the IP address, it will be needed later when generating the configuration.
Create the Machine Configuration
Generate Base Configuration
Using the IP address (or DNS name if one was created) of the load balancer created above, generate the machine configuration files for the new cluster.
talosctl gen config talos-kubernetes-vultr https://$LOAD_BALANCER_ADDRESS
Once generated, the machine configuration can be modified as necessary for the new cluster, for instance updating disk installation, or adding SANs for the certificates.
First a control plane needs to be created, with the example below creating 3 instances in a loop.
The instance type (noted by the --plan vc2-2c-4gb argument) in the example is for a minimum-spec control plane node, and should be updated to suit the cluster being created.
for id in $(seq 3); do vultr-cli instance create \
--plan vc2-2c-4gb \
--region $REGION\
--iso $TALOS_ISO_ID\
--host talos-k8s-cp${id}\
--label "Talos Kubernetes Control Plane"\
--tags talos,kubernetes,control-plane
done
Make a note of the instance IDs, as they are needed to attach to the load balancer created earlier.
Now worker nodes can be created and configured in a similar way to the control plane nodes, the difference being mainly in the machine configuration file.
Note that like with the control plane nodes, the instance type (here set by --plan vc2-1-1gb) should be changed for the actual cluster requirements.
for id in $(seq 1); do vultr-cli instance create \
--plan vc2-1c-1gb \
--region $REGION\
--iso $TALOS_ISO_ID\
--host talos-k8s-worker${id}\
--label "Talos Kubernetes Worker"\
--tags talos,kubernetes,worker
done
Once the worker is booted and in maintenance mode, the machine configuration can be applied in the following manner.
Once all the cluster nodes are correctly configured, the cluster can be bootstrapped to become functional.
It is important that the talosctl bootstrap command be executed only once and against only a single control plane node.
Finally, with the cluster fully running, the administrative kubeconfig can be retrieved from the Talos API to be saved locally.
talosctl --talosconfig talosconfig kubeconfig .
Now the kubeconfig can be used by any of the usual Kubernetes tools to interact with the Talos-based Kubernetes cluster as normal.
2.1.4 - Local Platforms
Installation of Talos Linux on local platforms, helpful for testing and developing.
2.1.4.1 - Docker
Creating Talos Kubernetes cluster using Docker.
In this guide we will create a Kubernetes cluster in Docker, using a containerized version of Talos.
Running Talos in Docker is intended to be used in CI pipelines, and local testing when you need a quick and easy cluster.
Furthermore, if you are running Talos in production, it provides an excellent way for developers to develop against the same version of Talos.
Requirements
The follow are requirements for running Talos in Docker:
Due to the fact that Talos will be running in a container, certain APIs are not available.
For example upgrade, reset, and similar APIs don’t apply in container mode.
Further, when running on a Mac in docker, due to networking limitations, VIPs are not supported.
Create the Cluster
Creating a local cluster is as simple as:
talosctl cluster create --wait
Once the above finishes successfully, your talosconfig(~/.talos/config) will be configured to point to the new cluster.
Note: Startup times can take up to a minute or more before the cluster is available.
Finally, we just need to specify which nodes you want to communicate with using talosctl.
Talosctl can operate on one or all the nodes in the cluster – this makes cluster wide commands much easier.
talosctl config nodes 10.5.0.2 10.5.0.3
Using the Cluster
Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster.
For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace.
To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.
Cleaning Up
To cleanup, run:
talosctl cluster destroy
2.1.4.2 - QEMU
Creating Talos Kubernetes cluster using QEMU VMs.
In this guide we will create a Kubernetes cluster using QEMU.
Video Walkthrough
To see a live demo of this writeup, see the video below:
Requirements
Linux
a kernel with
KVM enabled (/dev/kvm must exist)
CONFIG_NET_SCH_NETEM enabled
CONFIG_NET_SCH_INGRESS enabled
at least CAP_SYS_ADMIN and CAP_NET_ADMIN capabilities
QEMU
bridge, static and firewall CNI plugins from the standard CNI plugins, and tc-redirect-tap CNI plugin from the awslabs tc-redirect-tap installed to /opt/cni/bin (installed automatically by talosctl)
iptables
/var/run/netns directory should exist
Installation
How to get QEMU
Install QEMU with your operating system package manager.
For example, on Ubuntu for x86:
Before the first cluster is created, talosctl will download the CNI bundle for the VM provisioning and install it to ~/.talos/cni directory.
Once the above finishes successfully, your talosconfig (~/.talos/config) will be configured to point to the new cluster, and kubeconfig will be
downloaded and merged into default kubectl config location (~/.kube/config).
Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster.
For example, to view current running containers, run talosctl -n 10.5.0.2 containers for a list of containers in the system namespace, or talosctl -n 10.5.0.2 containers -k for the k8s.io namespace.
To view the logs of a container, use talosctl -n 10.5.0.2 logs <container> or talosctl -n 10.5.0.2 logs -k <container>.
A bridge interface will be created, and assigned the default IP 10.5.0.1.
Each node will be directly accessible on the subnet specified at cluster creation time.
A loadbalancer runs on 10.5.0.1 by default, which handles loadbalancing for the Kubernetes APIs.
You can see a summary of the cluster state by running:
$ talosctl cluster show --provisioner qemu
PROVISIONER qemu
NAME talos-default
NETWORK NAME talos-default
NETWORK CIDR 10.5.0.0/24
NETWORK GATEWAY 10.5.0.1
NETWORK MTU 1500NODES:
NAME TYPE IP CPU RAM DISK
talos-default-controlplane-1 ControlPlane 10.5.0.2 1.00 1.6 GB 4.3 GB
talos-default-controlplane-2 ControlPlane 10.5.0.3 1.00 1.6 GB 4.3 GB
talos-default-controlplane-3 ControlPlane 10.5.0.4 1.00 1.6 GB 4.3 GB
talos-default-worker-1 Worker 10.5.0.5 1.00 1.6 GB 4.3 GB
Note: In that case that the host machine is rebooted before destroying the cluster, you may need to manually remove ~/.talos/clusters/talos-default.
Manual Clean Up
The talosctl cluster destroy command depends heavily on the clusters state directory.
It contains all related information of the cluster.
The PIDs and network associated with the cluster nodes.
If you happened to have deleted the state folder by mistake or you would like to cleanup
the environment, here are the steps how to do it manually:
Remove VM Launchers
Find the process of talosctl qemu-launch:
ps -elf | grep 'talosctl qemu-launch'
To remove the VMs manually, execute:
sudo kill -s SIGTERM <PID>
Example output, where VMs are running with PIDs 157615 and 157617
This is more tricky part as if you have already deleted the state folder.
If you didn’t then it is written in the state.yaml in the
~/.talos/clusters/<cluster-name> directory.
In order to install Talos in VirtualBox, you will need the ISO image from the Talos release page.
You can download talos-amd64.iso via
github.com/siderolabs/talos/releases
Start by creating a new VM by clicking the “New” button in the VirtualBox UI:
Supply a name for this VM, and specify the Type and Version:
Edit the memory to supply at least 2GB of RAM for the VM:
Proceed through the disk settings, keeping the defaults.
You can increase the disk space if desired.
Once created, select the VM and hit “Settings”:
In the “System” section, supply at least 2 CPUs:
In the “Network” section, switch the network “Attached To” section to “Bridged Adapter”:
Finally, in the “Storage” section, select the optical drive and, on the right, select the ISO by browsing your filesystem:
Repeat this process for a second VM to use as a worker node.
You can also repeat this for additional nodes desired.
Start Control Plane Node
Once the VMs have been created and updated, start the VM that will be the first control plane node.
This VM will boot the ISO image specified earlier and enter “maintenance mode”.
Once the machine has entered maintenance mode, there will be a console log that details the IP address that the node received.
Take note of this IP address, which will be referred to as $CONTROL_PLANE_IP for the rest of this guide.
If you wish to export this IP as a bash variable, simply issue a command like export CONTROL_PLANE_IP=1.2.3.4.
Generate Machine Configurations
With the IP address above, you can now generate the machine configurations to use for installing Talos and Kubernetes.
Issue the following command, updating the output directory, cluster name, and control plane IP as you see fit:
talosctl gen config talos-vbox-cluster https://$CONTROL_PLANE_IP:6443 --output-dir _out
This will create several files in the _out directory: controlplane.yaml, worker.yaml, and talosconfig.
Create Control Plane Node
Using the controlplane.yaml generated above, you can now apply this config using talosctl.
Issue:
You should now see some action in the VirtualBox console for this VM.
Talos will be installed to disk, the VM will reboot, and then Talos will configure the Kubernetes control plane on this VM.
Note: This process can be repeated multiple times to create an HA control plane.
Create Worker Node
Create at least a single worker node using a process similar to the control plane creation above.
Start the worker node VM and wait for it to enter “maintenance mode”.
Take note of the worker node’s IP address, which will be referred to as $WORKER_IP
Note: This process can be repeated multiple times to add additional workers.
Using the Cluster
Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster.
For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace.
To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.
First, configure talosctl to talk to your control plane node by issuing the following, updating paths and IPs as necessary:
Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node.
Following the instructions in the console output to connect to the interactive installer:
talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>
Once the interactive installation is applied, the cluster will form and you can then use kubectl.
Retrieve the kubeconfig
Retrieve the admin kubeconfig by running:
talosctl kubeconfig
2.1.5.2 - Jetson Nano
Installing Talos on Jetson Nano SBC using raw disk image.
We will use the R32.7.2 release for the Jetson Nano.
Most of the instructions is similar to this doc except that we’d be using a upstream version of u-boot with patches from NVIDIA u-boot so that USB boot also works.
Before flashing we need the following:
A USB-A to micro USB cable
A jumper wire to enable recovery mode
A HDMI monitor to view the logs if the USB serial adapter is not available
A USB to Serial adapter with 3.3V TTL (optional)
A 5V DC barrel jack
If you’re planning to use the serial console follow the documentation here
First start by downloading the Jetson Nano L4T release.
Next we will extract the L4T release and replace the u-boot binary with the patched version.
tar xf jetson-210_linux_r32.6.1_aarch64.tbz2
cd Linux_for_Tegra
crane --platform=linux/arm64 export ghcr.io/siderolabs/u-boot:v1.1.0-alpha.0-42-gcd05ae8 - | tar xf - --strip-components=1 -C bootloader/t210ref/p3450-0000/ jetson_nano/u-boot.bin
Next we will flash the firmware to the Jetson Nano SPI flash.
In order to do that we need to put the Jetson Nano into Force Recovery Mode (FRC).
We will use the instructions from here
Ensure that the Jetson Nano is powered off.
There is no need for the SD card/USB storage/network cable to be connected
Connect the micro USB cable to the micro USB port on the Jetson Nano, don’t plug the other end to the PC yet
Enable Force Recovery Mode (FRC) by placing a jumper across the FRC pins on the Jetson Nano
For board revision A02, these are pins 3 and 4 of header J40
For board revision B01, these are pins 9 and 10 of header J50
Place another jumper across J48 to enable power from the DC jack and connect the Jetson Nano to the DC jack J25
Now connect the other end of the micro USB cable to the PC and remove the jumper wire from the FRC pins
Now the Jetson Nano is in Force Recovery Mode (FRC) and can be confirmed by running the following command
lsusb | grep -i "nvidia"
Now we can move on the flashing the firmware.
sudo ./flash p3448-0000-max-spi external
This will flash the firmware to the Jetson Nano SPI flash and you’ll see a lot of output.
If you’ve connected the serial console you’ll also see the progress there.
Once the flashing is done you can disconnect the USB cable and power off the Jetson Nano.
| Replace /dev/mmcblk0 with the name of your SD card/USB storage.
Bootstrapping the Node
Insert the SD card/USB storage to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node.
Following the instructions in the console output to connect to the interactive installer:
talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>
Once the interactive installation is applied, the cluster will form and you can then use kubectl.
Retrieve the kubeconfig
Retrieve the admin kubeconfig by running:
talosctl kubeconfig
2.1.5.3 - Libre Computer Board ALL-H3-CC
Installing Talos on Libre Computer Board ALL-H3-CC SBC using raw disk image.
Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node.
Following the instructions in the console output to connect to the interactive installer:
talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>
Once the interactive installation is applied, the cluster will form and you can then use kubectl.
Retrieve the kubeconfig
Retrieve the admin kubeconfig by running:
talosctl kubeconfig
2.1.5.4 - Pine64
Installing Talos on a Pine64 SBC using raw disk image.
Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node.
Following the instructions in the console output to connect to the interactive installer:
talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>
Once the interactive installation is applied, the cluster will form and you can then use kubectl.
Retrieve the kubeconfig
Retrieve the admin kubeconfig by running:
talosctl kubeconfig
2.1.5.5 - Pine64 Rock64
Installing Talos on Pine64 Rock64 SBC using raw disk image.
Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node.
Following the instructions in the console output to connect to the interactive installer:
talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>
Once the interactive installation is applied, the cluster will form and you can then use kubectl.
Retrieve the kubeconfig
Retrieve the admin kubeconfig by running:
talosctl kubeconfig
2.1.5.6 - Radxa ROCK PI 4
Installing Talos on Radxa ROCK PI 4a/4b SBC using raw disk image.
Prerequisites
You will need
talosctl
an SD card or an eMMC or USB drive or an nVME drive
After these steps, Talos will boot from the nVME/USB and enter maintenance mode.
Proceed to bootstrapping the node.
Bootstrapping the Node
Wait for the console to show you the instructions for bootstrapping the node.
Following the instructions in the console output to connect to the interactive installer:
talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>
Once the interactive installation is applied, the cluster will form and you can then use kubectl.
Retrieve the kubeconfig
Retrieve the admin kubeconfig by running:
talosctl kubeconfig
2.1.5.7 - Radxa ROCK PI 4C
Installing Talos on Radxa ROCK PI 4c SBC using raw disk image.
Prerequisites
You will need
talosctl
an SD card or an eMMC or USB drive or an nVME drive
After these steps, Talos will boot from the nVME/USB and enter maintenance mode.
Proceed to bootstrapping the node.
Bootstrapping the Node
Wait for the console to show you the instructions for bootstrapping the node.
Following the instructions in the console output to connect to the interactive installer:
talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>
Once the interactive installation is applied, the cluster will form and you can then use kubectl.
Retrieve the kubeconfig
Retrieve the admin kubeconfig by running:
talosctl kubeconfig
2.1.5.8 - Raspberry Pi 4 Model B
Installing Talos on Rpi4 SBC using raw disk image.
Video Walkthrough
To see a live demo of this writeup, see the video below:
At least version v2020.09.03-138a1 of the bootloader (rpi-eeprom) is required.
To update the bootloader we will need an SD card.
Insert the SD card into your computer and use Raspberry Pi Imager
to install the bootloader on it (select Operating System > Misc utility images > Bootloader > SD Card Boot).
Alternatively, you can use the console on Linux or macOS.
The path to your SD card can be found using fdisk on Linux or diskutil on macOS.
In this example, we will assume /dev/mmcblk0.
Remove the SD card from your local machine and insert it into the Raspberry Pi.
Power the Raspberry Pi on, and wait at least 10 seconds.
If successful, the green LED light will blink rapidly (forever), otherwise an error pattern will be displayed.
If an HDMI display is attached to the port closest to the power/USB-C port,
the screen will display green for success or red if a failure occurs.
Power off the Raspberry Pi and remove the SD card from it.
Note: Updating the bootloader only needs to be done once.
Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node.
Following the instructions in the console output to connect to the interactive installer:
talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>
Once the interactive installation is applied, the cluster will form and you can then use kubectl.
Note: if you have an HDMI display attached and it shows only a rainbow splash,
please use the other HDMI port, the one closest to the power/USB-C port.
Retrieve the kubeconfig
Retrieve the admin kubeconfig by running:
talosctl kubeconfig
Troubleshooting
The following table can be used to troubleshoot booting issues:
Long Flashes
Short Flashes
Status
0
3
Generic failure to boot
0
4
start*.elf not found
0
7
Kernel image not found
0
8
SDRAM failure
0
9
Insufficient SDRAM
0
10
In HALT state
2
1
Partition not FAT
2
2
Failed to read from partition
2
3
Extended partition not FAT
2
4
File signature/hash mismatch - Pi 4
4
4
Unsupported board type
4
5
Fatal firmware error
4
6
Power failure type A
4
7
Power failure type B
2.2 - Configuration
Guides on how to configure Talos Linux machines
2.2.1 - Configuration Patches
In this guide, we’ll patch the generated machine configuration.
Talos generates machine configuration for two types of machines: controlplane and worker machines.
Many configuration options can be adjusted using talosctl gen config but not all of them.
Configuration patching allows modifying machine configuration to fit it for the cluster or a specific machine.
Configuration Patch Formats
Talos supports two configuration patch formats:
strategic merge patches
RFC6902 (JSON patches)
Strategic merge patches are the easiest to use, but JSON patches allow more precise configuration adjustments.
Strategic Merge patches
Strategic merge patches look like incomplete machine configuration files:
machine:
network:
hostname: worker1
When applied to the machine configuration, the patch gets merged with the respective section of the machine configuration:
In general, machine configuration contents are merged with the contents of the strategic merge patch, with strategic merge patch
values overriding machine configuration values.
There are some special rules:
If the field value is a list, the patch value is appended to the list, with the following exceptions:
values of the fields cluster.network.podSubnets and cluster.network.serviceSubnets are overwritten on merge
network.interfaces section is merged with the value in the machine config if there is a match on interface: or deviceSelector: keys
RFC6902 (JSON Patches)
JSON patches can be written either in JSON or YAML format.
A proper JSON patch requires an op field that depends on the machine configuration contents: whether the path already exists or not.
For example, the strategic merge patch from the previous section can be written either as:
Several talosctl commands accept config patches as command-line flags.
Config patches might be passed either as an inline value or as a reference to a file with @file.patch syntax:
If multiple config patches are specified, they are applied in the order of appearance.
The format of the patch (JSON patch or strategic merge patch) is detected automatically.
Generated machine configuration can be patched with talosctl gen config:
Create cluster like normal and see that metrics are now present on this port:
$ curl 127.0.0.1:11234/v1/metrics
# HELP container_blkio_io_service_bytes_recursive_bytes The blkio io service bytes recursive# TYPE container_blkio_io_service_bytes_recursive_bytes gaugecontainer_blkio_io_service_bytes_recursive_bytes{container_id="0677d73196f5f4be1d408aab1c4125cf9e6c458a4bea39e590ac779709ffbe14",device="/dev/dm-0",major="253",minor="0",namespace="k8s.io",op="Async"}0container_blkio_io_service_bytes_recursive_bytes{container_id="0677d73196f5f4be1d408aab1c4125cf9e6c458a4bea39e590ac779709ffbe14",device="/dev/dm-0",major="253",minor="0",namespace="k8s.io",op="Discard"}0...
...
2.2.3 - Custom Certificate Authorities
How to supply custom certificate authorities
Appending the Certificate Authority
Put into each machine the PEM encoded certificate:
It is possible to enable encryption for system disks at the OS level.
As of this writing, only STATE and EPHEMERAL partitions can be encrypted.
STATE contains the most sensitive node data: secrets and certs.
EPHEMERAL partition may contain some sensitive workload data.
Data is encrypted using LUKS2, which is provided by the Linux kernel modules and cryptsetup utility.
The operating system will run additional setup steps when encryption is enabled.
If the disk encryption is enabled for the STATE partition, the system will:
Save STATE encryption config as JSON in the META partition.
Before mounting the STATE partition, load encryption configs either from the machine config or from the META partition.
Note that the machine config is always preferred over the META one.
Before mounting the STATE partition, format and encrypt it.
This occurs only if the STATE partition is empty and has no filesystem.
If the disk encryption is enabled for the EPHEMERAL partition, the system will:
Get the encryption config from the machine config.
Before mounting the EPHEMERAL partition, encrypt and format it.
This occurs only if the EPHEMERAL partition is empty and has no filesystem.
Configuration
Right now this encryption is disabled by default.
To enable disk encryption you should modify the machine configuration with the following options:
Note: What the LUKS2 docs call “keys” are, in reality, a passphrase.
When this passphrase is added, LUKS2 runs argon2 to create an actual key from that passphrase.
LUKS2 supports up to 32 encryption keys and it is possible to specify all of them in the machine configuration.
Talos always tries to sync the keys list defined in the machine config with the actual keys defined for the LUKS2 partition.
So if you update the keys list you should have at least one key that is not changed to be used for keys management.
When you define a key you should specify the key kind and the slot:
Take a note that key order does not play any role on which key slot is used.
Every key must always have a slot defined.
Encryption Key Kinds
Talos supports two kinds of keys:
nodeID which is generated using the node UUID and the partition label (note that if the node UUID is not really random it will fail the entropy check).
static which you define right in the configuration.
Note: Use static keys only if your STATE partition is encrypted and only for the EPHEMERAL partition.
For the STATE partition it will be stored in the META partition, which is not encrypted.
Key Rotation
It is necessary to do talosctl apply-config a couple of times to rotate keys, since there is a need to always maintain a single working key while changing the other keys around it.
That’s it!
After you run the last command, the partition will be wiped and the node will reboot.
During the next boot the system will encrypt the partition.
State Partition
Calling wipe against the STATE partition will make the node lose the config, so the previous flow is not going to work.
The flow should be to first wipe the STATE partition:
talosctl reset --system-labels-to-wipe STATE -n <node ip> --reboot=true
Node will enter into maintenance mode, then run apply-config with --insecure flag:
After installation is complete the node should encrypt the STATE partition.
2.2.5 - Editing Machine Configuration
How to edit and patch Talos machine configuration, with reboot, immediately, or stage update on reboot.
Talos node state is fully defined by machine configuration.
Initial configuration is delivered to the node at bootstrap time, but configuration can be updated while the node is running.
Note: Be sure that config is persisted so that configuration updates are not overwritten on reboots.
Configuration persistence was enabled by default since Talos 0.5 (persist: true in machine configuration).
There are three talosctl commands which facilitate machine configuration updates:
talosctl apply-config to apply configuration from the file
talosctl edit machineconfig to launch an editor with existing node configuration, make changes and apply configuration back
talosctl patch machineconfig to apply automated machine configuration via JSON patch
Each of these commands can operate in one of four modes:
apply change in automatic mode(default): reboot if the change can’t be applied without a reboot, otherwise apply the change immediately
apply change with a reboot (--mode=reboot): update configuration, reboot Talos node to apply configuration change
apply change immediately (--mode=no-reboot flag): change is applied immediately without a reboot, fails if the change contains any fields that can not be updated without a reboot
apply change on next reboot (--mode=staged): change is staged to be applied after a reboot, but node is not rebooted
apply change in the interactive mode (--mode=interactive; only for talosctl apply-config): launches TUI based interactive installer
Note: applying change on next reboot (--mode=staged) doesn’t modify current node configuration, so next call to
talosctl edit machineconfig --mode=staged will not see changes
Additionally, there is also talosctl get machineconfig, which retrieves the current node configuration API resource and contains the machine configuration in the .spec field.
It can be used to modify the configuration locally before being applied to the node.
The list of config changes allowed to be applied immediately in Talos v1.2.8:
.debug
.cluster
.machine.time
.machine.certCANs
.machine.install (configuration is only applied during install/upgrade)
.machine.network
.machine.sysfs
.machine.sysctls
.machine.logging
.machine.controlplane
.machine.kubelet
.machine.pods
.machine.kernel
.machine.registries (CRI containerd plugin will not pick up the registry authentication settings without a reboot)
.machine.features.kubernetesTalosAPIAccess
talosctl apply-config
This command is mostly used to submit initial machine configuration to the node (generated by talosctl gen config).
It can be used to apply new configuration from the file to the running node as well, but most of the time it’s not convenient, as it doesn’t operate on the current node machine configuration.
Example:
talosctl -n <IP> apply-config -f config.yaml
Command apply-config can also be invoked as apply machineconfig:
Applying machine configuration immediately (without a reboot):
talosctl -n IP apply machineconfig -f config.yaml --mode=no-reboot
Starting the interactive installer:
talosctl -n IP apply machineconfig --mode=interactive
Note: when a Talos node is running in the maintenance mode it’s necessary to provide --insecure (-i) flag to connect to the API and apply the config.
taloctl edit machineconfig
Command talosctl edit loads current machine configuration from the node and launches configured editor to modify the config.
If config hasn’t been changed in the editor (or if updated config is empty), update is not applied.
Note: Talos uses environment variables TALOS_EDITOR, EDITOR to pick up the editor preference.
If environment variables are missing, vi editor is used by default.
Example:
talosctl -n <IP> edit machineconfig
Configuration can be edited for multiple nodes if multiple IP addresses are specified:
talosctl -n <IP1>,<IP2>,... edit machineconfig
Applying machine configuration change immediately (without a reboot):
Command talosctl patch works similar to talosctl edit command - it loads current machine configuration, but instead of launching configured editor it applies a set of JSON patches to the configuration and writes the result back to the node.
Example, updating kubelet version (in auto mode):
$ talosctl -n <IP> patch machineconfig -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.25.5"}]'patched mc at the node <IP>
Updating kube-apiserver version in immediate mode (without a reboot):
$ talosctl -n <IP> patch machineconfig --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "k8s.gcr.io/kube-apiserver:v1.25.5"}]'patched mc at the node <IP>
A patch might be applied to multiple nodes when multiple IPs are specified:
If a Talos node fails to boot because of wrong configuration (for example, control plane endpoint is incorrect), configuration can be updated to fix the issue.
2.2.6 - Logging
Dealing with Talos Linux logs.
Viewing logs
Kernel messages can be retrieved with talosctl dmesg command:
Service logs can be retrieved with talosctl logs command:
$ talosctl -n 172.20.1.2 services
NODE SERVICE STATE HEALTH LAST CHANGE LAST EVENT
172.20.1.2 apid Running OK 19m27s ago Health check successful
172.20.1.2 containerd Running OK 19m29s ago Health check successful
172.20.1.2 cri Running OK 19m27s ago Health check successful
172.20.1.2 etcd Running OK 19m22s ago Health check successful
172.20.1.2 kubelet Running OK 19m20s ago Health check successful
172.20.1.2 machined Running ? 19m30s ago Service started as goroutine
172.20.1.2 trustd Running OK 19m27s ago Health check successful
172.20.1.2 udevd Running OK 19m28s ago Health check successful
$ talosctl -n 172.20.1.2 logs machined
172.20.1.2: [talos] task setupLogger (1/1): done, 106.109µs
172.20.1.2: [talos] phase logger (1/7): done, 564.476µs
[...]
Container logs for Kubernetes pods can be retrieved with talosctl logs -k command:
Messages are newline-separated when sent over TCP.
Over UDP messages are sent with one message per packet.
msg, talos-level, talos-service, and talos-time fields are always present; there may be additional fields.
Kernel logs
Kernel log delivery can be enabled with the talos.logging.kernel kernel command line argument, which can be specified
in the .machine.installer.extraKernelArgs:
Kernel log destination is specified in the same way as service log endpoint.
The only supported format is json_lines.
Sample message:
{
"clock":6252819, // time relative to the kernel boot time
"facility":"user",
"msg":"[talos] task startAllServices (1/1): waiting for 6 services\n",
"priority":"warning",
"seq":711,
"talos-level":"warn", // Talos-translated `priority` into common logging level
"talos-time":"2021-11-26T16:53:21.3258698Z"// Talos-translated `clock` using current time
}
extraKernelArgs in the machine configuration are only applied on Talos upgrades, not just by applying the config.
(Upgrading to the same version is fine).
Filebeat example
To forward logs to other Log collection services, one way to do this is sending
them to a Filebeat running in the
cluster itself (in the host network), which takes care of forwarding it to
other endpoints (and the necessary transformations).
If Elastic Cloud on Kubernetes
is being used, the following Beat (custom resource) configuration might be
helpful:
The input configuration ensures that messages and timestamps are extracted properly.
Refer to the Filebeat documentation on how to forward logs to other outputs.
Also note the hostNetwork: true in the daemonSet configuration.
This ensures filebeat uses the host network, and listens on 127.0.0.1:12345
(UDP) on every machine, which can then be specified as a logging endpoint in
the machine configuration.
Fluent-bit example
First, we’ll create a value file for the fluentd-bit Helm chart.
# fluentd-bit.yamlpodAnnotations:
fluentbit.io/exclude: 'true'extraPorts:
- port: 12345containerPort: 12345protocol: TCP
name: talos
config:
service: | [SERVICE]
Flush 5
Daemon Off
Log_Level warn
Parsers_File custom_parsers.confinputs: | [INPUT]
Name tcp
Listen 0.0.0.0
Port 12345
Format json
Tag talos.*
[INPUT]
Name tail
Alias kubernetes
Path /var/log/containers/*.log
Parser containerd
Tag kubernetes.*
[INPUT]
Name tail
Alias audit
Path /var/log/audit/kube/*.log
Parser audit
Tag audit.*filters: | [FILTER]
Name kubernetes
Alias kubernetes
Match kubernetes.*
Kube_Tag_Prefix kubernetes.var.log.containers.
Use_Kubelet Off
Merge_Log On
Merge_Log_Trim On
Keep_Log Off
K8S-Logging.Parser Off
K8S-Logging.Exclude On
Annotations Off
Labels On
[FILTER]
Name modify
Match kubernetes.*
Add source kubernetes
Remove logtagcustomParsers: | [PARSER]
Name audit
Format json
Time_Key requestReceivedTimestamp
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
[PARSER]
Name containerd
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%zoutputs: | [OUTPUT]
Name stdout
Alias stdout
Match *
Format json_lines# If you wish to ship directly to Loki from Fluentbit,# Uncomment the following output, updating the Host with your Loki DNS/IP info as necessary.# [OUTPUT]# Name loki# Match *# Host loki.loki.svc# Port 3100# Labels job=fluentbit# Auto_Kubernetes_Labels ondaemonSetVolumes:
- name: varlog
hostPath:
path: /var/log
daemonSetVolumeMounts:
- name: varlog
mountPath: /var/log
tolerations:
- operator: Exists
effect: NoSchedule
Next, we will add the helm repo for FluentBit, and deploy it to the cluster.
$ kubectl -n kube-system get svc -l app.kubernetes.io/name=fluent-bit
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
fluent-bit ClusterIP 10.200.0.138 <none> 2020/TCP,5170/TCP 108m
Finally, we will change talos log destination with the command talosctl edit mc.
This example configuration was well tested with Cilium CNI, and it should work with iptables/ipvs based CNI plugins too.
Vector example
Vector is a lightweight observability pipeline ideal for a Kubernetes environment.
It can ingest (source) logs from multiple sources, perform remapping on the logs (transform), and forward the resulting pipeline to multiple destinations (sinks).
As it is an end to end platform, it can be run as a single-deployment ‘aggregator’ as well as a replicaSet of ‘Agents’ that run on each node.
As Talos can be set as above to send logs to a destination, we can run Vector as an Aggregator, and forward both kernel and service to a UDP socket in-cluster.
Below is an excerpt of a source/sink setup for Talos, with a ‘sink’ destination of an in-cluster Grafana Loki log aggregation service.
As Loki can create labels from the log input, we have set up the Loki sink to create labels based on the host IP, service and facility of the inbound logs.
Note that a method of exposing the Vector service will be required which may vary depending on your setup - a LoadBalancer is a good option.
In order to create a key pair, you will need the root CA.
Save the CA public key, and CA private key as ca.crt, and ca.key respectively.
Now, run the following commands to generate a certificate:
talosctl gen key --name admin
talosctl gen csr --key admin.key --ip 127.0.0.1
talosctl gen crt --ca ca --csr admin.csr --name admin
Now, base64 encode admin.crt, and admin.key:
cat admin.crt | base64
cat admin.key | base64
You can now set the crt and key fields in the talosconfig to the base64 encoded strings.
Renewing an Expired Administrator Certificate
In order to renew the certificate, you will need the root CA, and the admin private key.
The base64 encoded key can be found in any one of the control plane node’s configuration file.
Where it is exactly will depend on the specific version of the configuration file you are using.
Save the CA public key, CA private key, and admin private key as ca.crt, ca.key, and admin.key respectively.
Now, run the following commands to generate a certificate:
talosctl gen csr --key admin.key --ip 127.0.0.1
talosctl gen crt --ca ca --csr admin.csr --name admin
You should see admin.crt in your current directory.
Now, base64 encode admin.crt:
cat admin.crt | base64
You can now set the certificate in the talosconfig to the base64 encoded string.
2.2.8 - NVIDIA Fabric Manager
In this guide we’ll follow the procedure to enable NVIDIA Fabric Manager.
The published versions of the NVIDIA fabricmanager system extensions is available here
The nvidia-fabricmanager extension version has to match with the NVIDIA driver version in use.
Upgrading Talos and enabling the NVIDIA fabricmanager system extension
In addition to the patch defined in the NVIDIA drivers guide, we need to add the nvidia-fabricmanager system extension to the patch yaml gpu-worker-patch.yaml:
In this guide we’ll follow the procedure to support NVIDIA GPU using OSS drivers on Talos.
Enabling NVIDIA GPU support on Talos is bound by NVIDIA EULA.
Talos GPU support has been promoted to beta.
The Talos published NVIDIA OSS drivers are bound to a specific Talos release.
The extensions versions also needs to be updated when upgrading Talos.
The published versions of the NVIDIA system extensions can be found here:
Update the driver version and Talos release in the above patch yaml from the published versions if there is a newer one available.
Make sure the driver version matches for both the nvidia-open-gpu-kernel-modules and nvidia-container-toolkit extensions.
The nvidia-open-gpu-kernel-modules extension is versioned as <nvidia-driver-version>-<talos-release-version> and the nvidia-container-toolkit extension is versioned as <nvidia-driver-version>-<nvidia-container-toolkit-version>.
Now apply the patch to all Talos nodes in the cluster having NVIDIA GPU’s installed:
talosctl patch mc --patch @gpu-worker-patch.yaml
Now we can proceed to upgrading Talos to the same version to enable the system extension:
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
This can be confirmed by running:
talosctl read /proc/modules
which should produce an output similar to below:
nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
talosctl get extensions
which should produce an output similar to below:
NODE NAMESPACE TYPE ID VERSION NAME VERSION
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0 1 nvidia-container-toolkit 515.65.01-v1.10.0
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0 1 nvidia-open-gpu-kernel-modules 515.65.01-v1.2.0
talosctl read /proc/driver/nvidia/version
which should produce an output similar to below:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.65.01 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 12.2.0 (GCC)
Deploying NVIDIA device plugin
First we need to create the RuntimeClass
Apply the following manifest to create a runtime class that uses the extension:
NAME READY STATUS RESTARTS AGE
gpu-operator-test 0/1 Completed 0 13s
kubectl logs gpu-operator-test
which should produce an output similar to below:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
2.2.10 - NVIDIA GPU (Proprietary drivers)
In this guide we’ll follow the procedure to support NVIDIA GPU using proprietary drivers on Talos.
Enabling NVIDIA GPU support on Talos is bound by NVIDIA EULA.
Talos GPU support has been promoted to beta.
These are the steps to enabling NVIDIA support in Talos.
Talos pre-installed on a node with NVIDIA GPU installed.
Building a custom Talos installer image with NVIDIA modules
Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension
This requires that the user build and maintain their own Talos installer image.
Prerequisites
This guide assumes the user has access to a container registry with push permissions, docker installed on the build machine and the Talos host has pull access to the container registry.
Set the local registry and username environment variables:
Now run the following command to build and push custom Talos kernel image and the NVIDIA image with the NVIDIA kernel modules signed by the kernel built along with it.
make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true
Replace the platform with linux/arm64 if building for ARM64
Now we need to create a custom Talos installer image.
Start by creating a Dockerfile with the following content:
FROM scratch as customizationCOPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:v1.2.8-nvidia /lib/modules /lib/modules
FROM ghcr.io/siderolabs/installer:v1.2.8COPY --from=ghcr.io/talos-user/kernel:v1.2.8-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.
This can be confirmed by running:
talosctl read /proc/modules
which should produce an output similar to below:
nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
talosctl get extensions
which should produce an output similar to below:
NODE NAMESPACE TYPE ID VERSION NAME VERSION
172.31.41.27 runtime ExtensionStatus 000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0 1 nvidia-container-toolkit 510.60.02-v1.9.0
talosctl read /proc/driver/nvidia/version
which should produce an output similar to below:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 510.60.02 Wed Mar 16 11:24:05 UTC 2022
GCC version: gcc version 11.2.0 (GCC)
Deploying NVIDIA device plugin
First we need to create the RuntimeClass
Apply the following manifest to create a runtime class that uses the extension:
NAME READY STATUS RESTARTS AGE
gpu-operator-test 0/1 Completed 0 13s
kubectl logs gpu-operator-test
which should produce an output similar to below:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
2.2.11 - Pull Through Image Cache
How to set up local transparent container images caches.
In this guide we will create a set of local caching Docker registry proxies to minimize local cluster startup time.
When running Talos locally, pulling images from Docker registries might take a significant amount of time.
We spin up local caching pass-through registries to cache images and configure a local Talos cluster to use those proxies.
A similar approach might be used to run Talos in production in air-gapped environments.
It can be also used to verify that all the images are available in local registries.
Video Walkthrough
To see a live demo of this writeup, see the video below:
Requirements
The follow are requirements for creating the set of caching proxies:
Docker 18.03 or greater
Local cluster requirements for either docker or QEMU.
Launch the Caching Docker Registry Proxies
Talos pulls from docker.io, k8s.gcr.io, quay.io, gcr.io, and ghcr.io by default.
If your configuration is different, you might need to modify the commands below:
Note: Proxies are started as docker containers, and they’re automatically configured to start with Docker daemon.
Please note that quay.io proxy doesn’t support recent Docker image schema, so we run older registry image version (2.5).
As a registry container can only handle a single upstream Docker registry, we launch a container per upstream, each on its own
host port (5000, 5001, 5002, 5003 and 5004).
Using Caching Registries with QEMU Local Cluster
With a QEMU local cluster, a bridge interface is created on the host.
As registry containers expose their ports on the host, we can use bridge IP to direct proxy requests.
The Talos local cluster should now start pulling via caching registries.
This can be verified via registry logs, e.g. docker logs -f registry-docker.io.
The first time cluster boots, images are pulled and cached, so next cluster boot should be much faster.
Note: 10.5.0.1 is a bridge IP with default network (10.5.0.0/24), if using custom --cidr, value should be adjusted accordingly.
Using Caching Registries with docker Local Cluster
With a docker local cluster we can use docker bridge IP, default value for that IP is 172.17.0.1.
On Linux, the docker bridge address can be inspected with ip addr show docker0.
Note: Removing docker registry containers also removes the image cache.
So if you plan to use caching registries, keep the containers running.
2.2.12 - Role-based access control (RBAC)
Set up RBAC on the Talos Linux API.
Talos v0.11 introduced initial support for role-based access control (RBAC).
This guide will explain what that is and how to enable it without losing access to the cluster.
RBAC in Talos
Talos uses certificates to authorize users.
The certificate subject’s organization field is used to encode user roles.
There is a set of predefined roles that allow access to different API methods:
os:admin grants access to all methods;
os:reader grants access to “safe” methods (for example, that includes the ability to list files, but does not include the ability to read files content);
Roles in the current talosconfig can be checked with the following command:
$ talosctl config info
[...]Roles: os:admin
[...]
RBAC is enabled by default in new clusters created with talosctl v0.11+ and disabled otherwise.
Enabling RBAC
First, both the Talos cluster and talosctl tool should be upgraded.
Then the talosctl config new command should be used to generate a new client configuration with the os:admin role.
Additional configurations and certificates for different roles can be generated by passing --roles flag:
talosctl config new --roles=os:reader reader
That command will create a new client configuration file reader with a new certificate with os:reader role.
After that, RBAC should be enabled in the machine configuration:
machine:
features:
rbac: true
2.2.13 - System Extensions
Customizing the Talos Linux immutable root file system.
System extensions allow extending the Talos root filesystem, which enables a variety of features, such as including custom
container runtimes, loading additional firmware, etc.
System extensions are only activated during the installation or upgrade of Talos Linux.
With system extensions installed, the Talos root filesystem is still immutable and read-only.
Configuration
System extensions are configured in the .machine.install section:
During the initial install (e.g. when PXE booting or booting from an ISO), Talos will pull down container images for system extensions,
validate them, and include them into the Talos initramfs image.
System extensions will be activated on boot and overlaid on top of the Talos root filesystem.
In order to update the system extensions for a running instance, update .machine.install.extensions and upgrade Talos.
(Note: upgrading to the same version of Talos is fine).
Building a Talos Image with System Extensions
System extensions can be installed into the Talos disk image (e.g. AWS AMI or VMWare OVF) by running the following command to generate the image
from the Talos source tree:
make image-metal IMAGER_SYSTEM_EXTENSIONS="ghcr.io/siderolabs/amd-ucode:20220411 ghcr.io/siderolabs/gvisor:20220405.0-v1.0.0-10-g82b41ad"
Authoring System Extensions
A Talos system extension is a container image with the specific folder structure.
System extensions can be built and managed using any tool that produces container images, e.g. docker build.
Use talosctl get extensions to get a list of system extensions:
$ talosctl get extensions
NODE NAMESPACE TYPE ID VERSION NAME VERSION
172.20.0.2 runtime ExtensionStatus 000.ghcr.io-talos-systems-gvisor-54b831d 1 gvisor 20220117.0-v1.0.0
172.20.0.2 runtime ExtensionStatus 001.ghcr.io-talos-systems-intel-ucode-54b831d 1 intel-ucode microcode-20210608-v1.0.0
Use YAML or JSON format to see additional details about the extension:
machine.network.interfaces.interface name is generated by the Linux kernel and can be changed after a reboot.
Device names can change when the system has several interfaces of the same kind, e.g: eth0, eth1.
In that case pinning it to hardwareAddress will make Talos reliably configure the device even when interface name changes.
2.3.3 - Virtual (shared) IP
Using Talos Linux to set up a floating virtual IP address for cluster access.
One of the biggest pain points when building a high-availability controlplane
is giving clients a single IP or URL at which they can reach any of the controlplane nodes.
The most common approaches all require external resources: reverse proxy, load
balancer, BGP, and DNS.
Using a “Virtual” IP address, on the other hand, provides high availability
without external coordination or resources, so long as the controlplane members
share a layer 2 network.
In practical terms, this means that they are all connected via a switch, with no
router in between them.
The term “virtual” is misleading here.
The IP address is real, and it is assigned to an interface.
Instead, what actually happens is that the controlplane machines vie for
control of the shared IP address using etcd elections.
There can be only one owner of the IP address at any given time, but if that
owner disappears or becomes non-responsive, another owner will be chosen,
and it will take up the mantle: the IP address.
Talos has (as of version 0.9) built-in support for this form of shared IP address,
and it can utilize this for both the Kubernetes API server and the Talos endpoint set.
Talos uses etcd for elections and leadership (control) of the IP address.
It is not reccomended to use a virtual IP to access the API of Talos itself, since the
node using the shared IP is not deterministic and could change.
Video Walkthrough
To see a live demo of this writeup, see the video below:
Choose your Shared IP
To begin with, you should choose your shared IP address.
It should generally be a reserved, unused IP address in the same subnet as
your controlplane nodes.
It should not be assigned or assignable by your DHCP server.
For our example, we will assume that the controlplane nodes have the following
IP addresses:
192.168.0.10
192.168.0.11
192.168.0.12
We then choose our shared IP to be:
192.168.0.15
Configure your Talos Machines
The shared IP setting is only valid for controlplane nodes.
For the example above, each of the controlplane nodes should have the following
Machine Config snippet:
Obviously, for your own environment, the interface and the DHCP setting may
differ.
You are free to use static addressing (cidr) instead of DHCP.
Caveats
In general, the shared IP should just work.
However, since it relies on etcd for elections, the shared IP will not come
alive until after you have bootstrapped Kubernetes.
In general, this is not a problem, but it does mean that you cannot use the
shared IP when issuing the talosctl bootstrap command.
Instead, that command will need to target one of the controlplane nodes
discretely.
2.3.4 - Wireguard Network
A guide on how to set up Wireguard network using Kernel module.
Configuring Wireguard Network
Quick Start
The quickest way to try out Wireguard is to use talosctl cluster create command:
It will automatically generate Wireguard network configuration for each node with the following network topology:
Where all controlplane nodes will be used as Wireguard servers which listen on port 51111.
All controlplanes and workers will connect to all controlplanes.
It also sets PersistentKeepalive to 5 seconds to establish controlplanes to workers connection.
After the cluster is deployed it should be possible to verify Wireguard network connectivity.
It is possible to deploy a container with hostNetwork enabled, then do kubectl exec <container> /bin/bash and either do:
ping 10.1.0.2
Or install wireguard-tools package and run:
wg show
Wireguard show should output something like this:
interface: wg0
public key: OMhgEvNIaEN7zeCLijRh4c+0Hwh3erjknzdyvVlrkGM= private key: (hidden) listening port: 47946peer: 1EsxUygZo8/URWs18tqB5FW2cLVlaTA+lUisKIf8nh4= endpoint: 10.5.0.2:51111
allowed ips: 10.1.0.0/24
latest handshake: 1 minute, 55 seconds ago
transfer: 3.17 KiB received, 3.55 KiB sent
persistent keepalive: every 5 seconds
It is also possible to use generated configuration as a reference by pulling generated config files using:
All Wireguard configuration can be done by changing Talos machine config files.
As an example we will use this official Wireguard quick start tutorial.
Key Generation
This part is exactly the same:
wg genkey | tee privatekey | wg pubkey > publickey
Setting up Device
Inline comments show relations between configs and wg quickstart tutorial commands:
...
network:
interfaces:
...
# ip link add dev wg0 type wireguard - interface: wg0
mtu: 1500# ip address add dev wg0 192.168.2.1/24addresses:
- 192.168.2.1/24
# wg set wg0 listen-port 51820 private-key /path/to/private-key peer ABCDEF... allowed-ips 192.168.88.0/24 endpoint 209.202.254.14:8172wireguard:
privateKey: <privatekey file contents>
listenPort: 51820peers:
allowedIPs:
- 192.168.88.0/24
endpoint: 209.202.254.14.8172publicKey: ABCDEF...
...
When networkd gets this configuration it will create the device, configure it and will bring it up (equivalent to ip link set up dev wg0).
Steps on how to reset a Talos Linux machine to a clean state.
From time to time, it may be beneficial to reset a Talos machine to its “original” state.
Bear in mind that this is a destructive action for the given machine.
Doing this means removing the machine from Kubernetes, Etcd (if applicable), and clears any data on the machine that would normally persist a reboot.
CLI
WARNING: Running a talosctl reset on cloud VM’s might result in the VM being unable to boot as this wipes the entire disk.
It might be more useful to just wipe the STATE and EPHEMERAL partitions on a cloud VM if not booting via iPXE.
talosctl reset --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL
The API command for doing this is talosctl reset.
There are a couple of flags as part of this command:
Flags:
--graceful if true, attempt to cordon/drain node and leave etcd (if applicable)(default true) --reboot if true, reboot the node after resetting instead of shutting down
--system-labels-to-wipe strings if set, just wipe selected system disk partitions by label but keep other partitions intact keep other partitions intact
The graceful flag is especially important when considering HA vs. non-HA Talos clusters.
If the machine is part of an HA cluster, a normal, graceful reset should work just fine right out of the box as long as the cluster is in a good state.
However, if this is a single node cluster being used for testing purposes, a graceful reset is not an option since Etcd cannot be “left” if there is only a single member.
In this case, reset should be used with --graceful=false to skip performing checks that would normally block the reset.
Kernel Parameter
Another way to reset a machine is to specify talos.experimental.wipe=system kernel parameter.
If the machine got stuck in the boot loop and you access to the console you can use GRUB to specify this kernel argument.
Then when Talos boots for the next time it will reset system disk and reboot.
Next steps can be to install Talos either using PXE boot or by mounting an ISO.
2.5 - Upgrading Talos Linux
Guide on upgrading a Talos Linux machine.
OS upgrades, like other operations on Talos Linux, are effected by an API call, which can be sent via the talosctl CLI utility.
Because Talos Linux is image based, an upgrade is almost the same as installing Talos, with the difference that the system has already been initialized with a configuration.
The upgrade API call passes a node the installer image to use to perform the upgrade.
Each Talos version has a corresponding installer.
Upgrades use an A-B image scheme in order to facilitate rollbacks.
This scheme retains the previous Talos kernel and OS image following each upgrade.
If an upgrade fails to boot, Talos will roll back to the previous version.
Likewise, Talos may be manually rolled back via API (or talosctl rollback).
This will simply update the boot reference and reboot.
Unless explicitly told to preserve data, an upgrade will cause the node to wipe the EPHEMERAL partition, remove itself from the etcd cluster (if it is a control node), and generally make itself as pristine as is possible.
(This is generally the desired behavior, except in specialised use cases such as single-node clusters.)
Note that the default since Talos version 1.0 is to specify the Kubernetes version in the machine config.
This means an upgrade of the Talos Linux OS will not apply an upgrade of the Kubernetes version by default.
Kubernetes upgrades should be managed separately per upgrading kubernetes.
Each release of Talos Linux includes the latest stable Kubernetes version by default.
Before Upgrade to v1.2.8
PodSecurityPolicy removal
PodSecurityPolicy is removed in Kubernetes 1.25, so make sure that it is disabled on Talos side before trying to use Kubernetes 1.25:
This setting defaulted to true since Talos v1.0.0 release.
Video Walkthrough
To see a live demo of an upgrade of Talos Linux, see the video below:
After Upgrade to v1.2.8
There are no specific actions to be taken after an upgrade.
talosctl upgrade
To upgrade a Talos node, specify the node’s IP address and the
installer container image for the version of Talos to upgrade to.
For instance, if your Talos node has the IP address 10.20.30.40 and you want
to install the official version v1.0.0, you would enter a command such
as:
There is an option to this command: --preserve, which will explicitly tell Talos to keep ephemeral data intact.
In most cases, it is correct to let Talos perform its default action of erasing the ephemeral data.
However, if you are running a single-node control-plane, you will want to make sure that --preserve=true.
Rarely, a upgrade command will fail to run due to a process holding a file open on disk, or you may wish to set a node to upgrade, but delay the actual reboot as long as possible.
In these cases, you can use the --stage flag.
This puts the upgrade artifacts on disk, and adds some metadata to a disk partition that gets checked very early in the boot process.
Finally, the node is rebooted by the upgrade --stage process.
On the reboot, Talos sees that it needs to apply an upgrade, and will do so immediately.
Because this occurs in a just rebooted system, there will be no conflict with any files being held open.
After the upgrade is applied, the node will reboot again, in order to boot into the new version.
Note that because Talos Linux now reboots via the kexec syscall, the extra reboot adds very little time.
.cluster.allowSchedulingOnMasters renamed to .cluster.allowSchedulingOnControlPlanes, old setting is still supported
.cluster.etcd now has more options to configure etcd advertised and listen subnets, old settings are still supported
Upgrade Sequence
When a Talos node receives the upgrade command, it cordons
itself in Kubernetes, to avoid receiving any new workload.
It then starts to drain its existing workload.
NOTE: If any of your workloads are sensitive to being shut down ungracefully, be sure to use the lifecycle.preStop Pod spec.
Once all of the workload Pods are drained, Talos will start shutting down its
internal processes.
If it is a control node, this will include etcd.
If preserve is not enabled, Talos will leave etcd membership.
(Talos ensures the etcd cluster is healthy and will remain healthy after our node leaves the etcd cluster, before allowing a control plane node to be upgraded.)
Once all the processes are stopped and the services are shut down, the filesystems will be unmounted.
This allows Talos to produce a very clean upgrade, as close as possible to a pristine system.
We verify the disk and then perform the actual image upgrade.
We set the bootloader to boot once with the new kernel and OS image, then we reboot.
After the node comes back up and Talos verifies itself, it will make
the bootloader change permanent, rejoin the cluster, and finally uncordon itself to receive new workloads.
FAQs
Q. What happens if an upgrade fails?
A. Talos Linux attempts to safely handle upgrade failures.
The most common failure is an invalid installer image reference.
In this case, Talos will fail to download the upgraded image and will abort the upgrade.
Sometimes, Talos is unable to successfully kill off all of the disk access points, in which case it cannot safely unmount all filesystems to effect the upgrade.
In this case, it will abort the upgrade and reboot.
(upgrade --stage can ensure that upgrades can occur even when the filesytems cannot be unmounted.)
It is possible (especially with test builds) that the upgraded Talos system will fail to start.
In this case, the node will be rebooted, and the bootloader will automatically use the previous Talos kernel and image, thus effectively rolling back the upgrade.
Lastly, it is possible that Talos itself will upgrade successfully, start up, and rejoin the cluster but your workload will fail to run on it, for whatever reason.
This is when you would use the talosctl rollback command to revert back to the previous Talos version.
Q. Can upgrades be scheduled?
A. Because the upgrade sequence is API-driven, you can easily tie it in to your own business logic to schedule and coordinate your upgrades.
Q. Can the upgrade process be observed?
A. Yes, using the talosctl dmesg -f command.
Q. Are worker node upgrades handled differently from control plane node upgrades?
A. Short answer: no.
Long answer: Both node types follow the same set procedure.
From the user’s standpoint, however, the processes are identical.
However, since control plane nodes run additional services, such as etcd, there are some extra steps and checks performed on them.
For instance, Talos will refuse to upgrade a control plane node if that upgrade would cause a loss of quorum for etcd.
If multiple control plane nodes are asked to upgrade at the same time, Talos will protect the Kubernetes cluster by ensuring only one control plane node actively upgrades at any time, via checking etcd quorum.
If running a single-node cluster, and you want to force an upgrade despite the loss of quorum, you can set preserve to true.
Q. Can I break my cluster by upgrading everything at once?
A. Maybe - it’s not recommended.
Nothing prevents the user from sending near-simultaneous upgrades to each node of the cluster - and while Talos Linux and Kubernetes can generally deal with this situation, other components of the cluster may not be able to recover from more than one node rebooting at a time.
(e.g. any software that maintains a quorum or state across nodes, such as Rook/Ceph)
3 - Kubernetes Guides
Management of a Kubernetes Cluster hosted by Talos Linux
3.1 - Configuration
How to configure components of the Kubernetes cluster itself.
3.1.1 - Ceph Storage cluster with Rook
Guide on how to create a simple Ceph storage cluster with Rook for Kubernetes
Preparation
Talos Linux reserves an entire disk for the OS installation, so machines with multiple available disks are needed for a reliable Ceph cluster with Rook and Talos Linux.
Rook requires that the block devices or partitions used by Ceph have no partitions or formatted filesystems before use.
Rook also requires a minimum Kubernetes version of v1.16 and Helm v3.0 for installation of charts.
It is highly recommended that the Rook Ceph overview is read and understood before deploying a Ceph cluster with Rook.
Installation
Creating a Ceph cluster with Rook requires two steps; first the Rook Operator needs to be installed which can be done with a Helm Chart.
The example below installs the Rook Operator into the rook-ceph namespace, which is the default for a Ceph cluster with Rook.
$ helm repo add rook-release https://charts.rook.io/release
"rook-release" has been added to your repositories
$ helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph
W0327 17:52:44.277830 54987 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0327 17:52:44.612243 54987 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: rook-ceph
LAST DEPLOYED: Sun Mar 27 17:52:42 2022NAMESPACE: rook-ceph
STATUS: deployed
REVISION: 1TEST SUITE: None
NOTES:
The Rook Operator has been installed. Check its status by running:
kubectl --namespace rook-ceph get pods -l "app=rook-ceph-operator"Visit https://rook.io/docs/rook/latest for instructions on how to create and configure Rook clusters
Important Notes:
- You must customize the 'CephCluster' resource in the sample manifests for your cluster.
- Each CephCluster must be deployed to its own namespace, the samples use `rook-ceph`for the namespace.
- The sample manifests assume you also installed the rook-ceph operator in the `rook-ceph` namespace.
- The helm chart includes all the RBAC required to create a CephCluster CRD in the same namespace.
- Any disk devices you add to the cluster in the 'CephCluster' must be empty (no filesystem and no partitions).
Once that is complete, the Ceph cluster can be installed with the official Helm Chart.
The Chart can be installed with default values, which will attempt to use all nodes in the Kubernetes cluster, and all unused disks on each node for Ceph storage, and make available block storage, object storage, as well as a shared filesystem.
Generally more specific node/device/cluster configuration is used, and the Rook documentation explains all the available options in detail.
For this example the defaults will be adequate.
$ helm install --create-namespace --namespace rook-ceph rook-ceph-cluster --set operatorNamespace=rook-ceph rook-release/rook-ceph-cluster
NAME: rook-ceph-cluster
LAST DEPLOYED: Sun Mar 27 18:12:46 2022NAMESPACE: rook-ceph
STATUS: deployed
REVISION: 1TEST SUITE: None
NOTES:
The Ceph Cluster has been installed. Check its status by running:
kubectl --namespace rook-ceph get cephcluster
Visit https://rook.github.io/docs/rook/latest/ceph-cluster-crd.html for more information about the Ceph CRD.
Important Notes:
- You can only deploy a single cluster per namespace
- If you wish to delete this cluster and start fresh, you will also have to wipe the OSD disks using `sfdisk`
Now the Ceph cluster configuration has been created, the Rook operator needs time to install the Ceph cluster and bring all the components online.
The progression of the Ceph cluster state can be followed with the following command.
$ watch kubectl --namespace rook-ceph get cephcluster rook-ceph
Every 2.0s: kubectl --namespace rook-ceph get cephcluster rook-ceph
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL
rook-ceph /var/lib/rook 3 57s Progressing Configuring Ceph Mons
Depending on the size of the Ceph cluster and the availability of resources the Ceph cluster should become available, and with it the storage classes that can be used with Kubernetes Physical Volumes.
$ kubectl --namespace rook-ceph get cephcluster rook-ceph
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL
rook-ceph /var/lib/rook 3 40m Ready Cluster created successfully HEALTH_OK
$ kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
ceph-block (default) rook-ceph.rbd.csi.ceph.com Delete Immediate true 77m
ceph-bucket rook-ceph.ceph.rook.io/bucket Delete Immediate false 77m
ceph-filesystem rook-ceph.cephfs.csi.ceph.com Delete Immediate true 77m
Talos Linux Considerations
It is important to note that a Rook Ceph cluster saves cluster information directly onto the node (by default dataDirHostPath is set to /var/lib/rook).
If running only a single mon instance, cluster management is little bit more involved, as any time a Talos Linux node is reconfigured or upgraded, the partition that stores the /varfile system is wiped, but the --preserve option of talosctl upgrade will ensure that doesn’t happen.
By default, Rook configues Ceph to have 3 mon instances, in which case the data stored in dataDirHostPath can be regenerated from the other mon instances.
So when performing maintenance on a Talos Linux node with a Rook Ceph cluster (e.g. upgrading the Talos Linux version), it is imperative that care be taken to maintain the health of the Ceph cluster.
Before upgrading, you should always check the health status of the Ceph cluster to ensure that it is healthy.
$ kubectl --namespace rook-ceph get cephclusters.ceph.rook.io rook-ceph
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL
rook-ceph /var/lib/rook 3 98m Ready Cluster created successfully HEALTH_OK
If it is, you can begin the upgrade process for the Talos Linux node, during which time the Ceph cluster will become unhealthy as the node is reconfigured.
Before performing any other action on the Talos Linux nodes, the Ceph cluster must return to a healthy status.
$ talosctl upgrade --nodes 172.20.15.5 --image ghcr.io/talos-systems/installer:v0.14.3
NODE ACK STARTED
172.20.15.5 Upgrade request received 2022-03-27 20:29:55.292432887 +0200 CEST m=+10.050399758
$ kubectl --namespace rook-ceph get cephclusters.ceph.rook.io
NAME DATADIRHOSTPATH MONCOUNT AGE PHASE MESSAGE HEALTH EXTERNAL
rook-ceph /var/lib/rook 3 99m Progressing Configuring Ceph Mgr(s) HEALTH_WARN
$ kubectl --namespace rook-ceph wait --timeout=1800s --for=jsonpath='{.status.ceph.health}=HEALTH_OK' rook-ceph
cephcluster.ceph.rook.io/rook-ceph condition met
The above steps need to be performed for each Talos Linux node undergoing maintenance, one at a time.
Cleaning Up
Rook Ceph Cluster Removal
Removing a Rook Ceph cluster requires a few steps, starting with signalling to Rook that the Ceph cluster is really being destroyed.
Then all Persistent Volumes (and Claims) backed by the Ceph cluster must be deleted, followed by the Storage Classes and the Ceph storage types.
If the Rook Operator is cleanly removed following the above process, the node metadata and disks should be clean and ready to be re-used.
In the case of an unclean cluster removal, there may be still a few instances of metadata stored on the system disk, as well as the partition information on the storage disks.
First the node metadata needs to be removed, make sure to update the nodeName with the actual name of a storage node that needs cleaning, and path with the Rook configuration dataDirHostPath set when installing the chart.
The following will need to be repeated for each node used in the Rook Ceph cluster.
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: disk-clean
spec:
restartPolicy: Never
nodeName: <storage-node-name>
volumes:
- name: rook-data-dir
hostPath:
path: <dataDirHostPath>
containers:
- name: disk-clean
image: busybox
securityContext:
privileged: true
volumeMounts:
- name: rook-data-dir
mountPath: /node/rook-data
command: ["/bin/sh", "-c", "rm -rf /node/rook-data/*"]
EOFpod/disk-clean created
$ kubectl wait --timeout=900s --for=jsonpath='{.status.phase}=Succeeded' pod disk-clean
pod/disk-clean condition met
$ kubectl delete pod disk-clean
pod "disk-clean" deleted
Lastly, the disks themselves need the partition and filesystem data wiped before they can be reused.
Again, the following as to be repeated for each node and disk used in the Rook Ceph cluster, updating nodeName and of= in the command as needed.
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: disk-wipe
spec:
restartPolicy: Never
nodeName: <storage-node-name>
containers:
- name: disk-wipe
image: busybox
securityContext:
privileged: true
command: ["/bin/sh", "-c", "dd if=/dev/zero bs=1M count=100 oflag=direct of=<device>"]
EOFpod/disk-wipe created
$ kubectl wait --timeout=900s --for=jsonpath='{.status.phase}=Succeeded' pod disk-wipe
pod/disk-wipe condition met
$ kubectl delete pod disk-clean
pod "disk-wipe" deleted
3.1.2 - Cluster Endpoint
How to explicitly set up an endpoint for the cluster API
In this section, we will step through the configuration of a Talos based Kubernetes cluster.
There are three major components we will configure:
apid and talosctl
the controlplane nodes
the worker nodes
Talos enforces a high level of security by using mutual TLS for authentication and authorization.
We recommend that the configuration of Talos be performed by a cluster owner.
A cluster owner should be a person of authority within an organization, perhaps a director, manager, or senior member of a team.
They are responsible for storing the root CA, and distributing the PKI for authorized cluster administrators.
Recommended settings
Talos runs great out of the box, but if you tweak some minor settings it will make your life
a lot easier in the future.
This is not a requirement, but rather a document to explain some key settings.
Endpoint
To configure the talosctl endpoint, it is recommended you use a resolvable DNS name.
This way, if you decide to upgrade to a multi-controlplane cluster you only have to add the ip address to the hostname configuration.
The configuration can either be done on a Loadbalancer, or simply trough DNS.
For example:
This is in the config file for the cluster e.g. controlplane.yaml and worker.yaml.
for more details, please see: v1alpha1 endpoint configuration
If you have a DNS name as the endpoint, you can upgrade your talos cluster with multiple controlplanes in the future (if you don’t have a multi-controlplane setup from the start)
Using a DNS name generates the corresponding Certificates (Kubernetes and Talos) for the correct hostname.
3.1.3 - Deploying Metrics Server
In this guide you will learn how to set up metrics-server.
Metrics Server enables use of the Horizontal Pod Autoscaler and Vertical Pod Autoscaler.
It does this by gathering metrics data from the kubelets in a cluster.
By default, the certificates in use by the kubelets will not be recognized by metrics-server.
This can be solved by either configuring metrics-server to do no validation of the TLS certificates, or by modifying the kubelet configuration to rotate its certificates and use ones that will be recognized by metrics-server.
Node Configuration
To enable kubelet certificate rotation, all nodes should have the following Machine Config snippet:
We will want to ensure that new certificates for the kubelets are approved automatically.
This can easily be done with the Kubelet Serving Certificate Approver, which will automatically approve the Certificate Signing Requests generated by the kubelets.
We can have Kubelet Serving Certificate Approver and metrics-server installed on the cluster automatically during bootstrap by adding the following snippet to the Cluster Config of the node that will be handling the bootstrap process:
If you choose not to use extraManifests to install Kubelet Serving Certificate Approver and metrics-server during bootstrap, you can install them once the cluster is online using kubectl:
To see a live demo of Cluster Discovery, see the video below:
Registries
Peers are aggregated from a number of optional registries.
By default, Talos will use the service registry, while kubernetes registry is not enabled.
Either one can be disabled.
To disable a registry, set disabled to true (this option is the same for all registries):
For example, to disable the service registry:
Service registry uses external Discovery Service to exchange encrypted information about cluster members.
Resource Definitions
Talos provides seven resources that can be used to introspect the new discovery and KubeSpan features.
Discovery
Identities
The node’s unique identity (base62 encoded random 32 bytes) can be obtained with:
Note: Using base62 allows the ID to be URL encoded without having to use the ambiguous URL-encoding version of base64.
$ talosctl get identities -o yaml
...
spec:
nodeId: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd
Node identity is used as the unique Affiliate identifier.
Node identity resource is preserved in the STATE partition in node-identity.yaml file.
Node identity is preserved across reboots and upgrades, but it is regenerated if the node is reset (wiped).
Affiliates
An affiliate is a proposed member attributed to the fact that the node has the same cluster ID and secret.
$ talosctl get affiliates
ID VERSION HOSTNAME MACHINE TYPE ADDRESSES
2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 2 talos-default-controlplane-2 controlplane ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]6EVq8RHIne03LeZiJ60WsJcoQOtttw1ejvTS6SOBzhUA 2 talos-default-worker-1 worker ["172.20.0.5","fd83:b1f7:fcb5:2802:cc80:3dff:fece:d89d"]NVtfu1bT1QjhNq5xJFUZl8f8I8LOCnnpGrZfPpdN9WlB 2 talos-default-worker-2 worker ["172.20.0.6","fd83:b1f7:fcb5:2802:2805:fbff:fe80:5ed2"]Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd 4 talos-default-controlplane-1 controlplane ["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"]b3DebkPaCRLTLLWaeRF1ejGaR0lK3m79jRJcPn0mfA6C 2 talos-default-controlplane-3 controlplane ["172.20.0.4","fd83:b1f7:fcb5:2802:248f:1fff:fe5c:c3f"]
One of the Affiliates with the ID matching node identity is populated from the node data, other Affiliates are pulled from the registries.
Enabled discovery registries run in parallel and discovered data is merged to build the list presented above.
Details about data coming from each registry can be queried from the cluster-raw namespace:
Each Affiliate ID is prefixed with k8s/ for data coming from the Kubernetes registry and with service/ for data coming from the discovery service.
Members
A member is an affiliate that has been approved to join the cluster.
The members of the cluster can be obtained with:
$ talosctl get members
ID VERSION HOSTNAME MACHINE TYPE OS ADDRESSES
talos-default-controlplane-1 2 talos-default-controlplane-1 controlplane Talos (v1.2.8)["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"]talos-default-controlplane-2 1 talos-default-controlplane-2 controlplane Talos (v1.2.8)["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]talos-default-controlplane-3 1 talos-default-controlplane-3 controlplane Talos (v1.2.8)["172.20.0.4","fd83:b1f7:fcb5:2802:248f:1fff:fe5c:c3f"]talos-default-worker-1 1 talos-default-worker-1 worker Talos (v1.2.8)["172.20.0.5","fd83:b1f7:fcb5:2802:cc80:3dff:fece:d89d"]talos-default-worker-2 1 talos-default-worker-2 worker Talos (v1.2.8)["172.20.0.6","fd83:b1f7:fcb5:2802:2805:fbff:fe80:5ed2"]
3.1.5 - iSCSI Storage with Synology CSI
Automatically provision iSCSI volumes on a Synology NAS with the synology-csi driver.
Background
Synology is a company that specializes in Network Attached Storage (NAS) devices.
They provide a number of features within a simple web OS, including an LDAP server, Docker support, and (perhaps most relevant to this guide) function as an iSCSI host.
The focus of this guide is to allow a Kubernetes cluster running on Talos to provision Kubernetes storage (both dynamic or static) on a Synology NAS using a direct integration, rather than relying on an intermediary layer like Rook/Ceph or Maystor.
This guide assumes a very basic familiarity with iSCSI terminology (LUN, iSCSI target, etc.).
Prerequisites
Synology NAS running DSM 7.0 or above
Provisioned Talos cluster running Kubernetes v1.20 or above
The synology-csi controller interacts with your NAS in two different ways: via the API and via the iSCSI protocol.
Actions such as creating a new iSCSI target or deleting an old one are accomplished via the Synology API, and require administrator access.
On the other hand, mounting the disk to a pod and reading from / writing to it will utilize iSCSI.
Because you can only authenticate with one account per DSM configured, that account needs to have admin privileges.
In order to minimize access in the case of these credentials being compromised, you should configure the account with the lease possible amount of access – explicitly specify “No Access” on all volumes when configuring the user permissions.
Setting up the Synology CSI
Note: this guide is paraphrased from the Synology CSI readme.
Please consult the readme for more in-depth instructions and explanations.
While Synology provides some automated scripts to deploy the CSI driver, they can be finicky especially when making changes to the source code.
We will be configuring and deploying things manually in this guide.
The relevant files we will be touching are in the following locations:
Use config/client-info-template.yml as an example to configure the connection information for DSM.
You can specify one or more storage systems on which the CSI volumes will be created.
See below for an example:
---
clients:
- host: 192.168.1.1# ipv4 address or domain of the DSMport: 5000# port for connecting to the DSMhttps: false# set this true to use https. you need to specify the port to DSM HTTPS port as wellusername: username # usernamepassword: password # password
Create a Kubernetes secret using the client information config file.
Note that if you rename the secret to something other than client-info-secret, make sure you update the corresponding references in the deployment manifests as well.
Build the Talos-compatible image
Modify the Makefile so that the image is built and tagged under your GitHub Container Registry username:
REGISTRY_NAME=ghcr.io/<username>
When you run make docker-build or make docker-build-multiarch, it will push the resulting image to ghcr.io/<username>/synology-csi:v1.1.0.
Ensure that you find and change any reference to synology/synology-csi:v1.1.0 to point to your newly-pushed image within the deployment manifests.
Configure the CSI driver
By default, the deployment manifests include one storage class and one volume snapshot class.
See below for examples:
It can be useful to configure multiple different StorageClasses.
For example, a popular strategy is to create two nearly identical StorageClasses, with one configured with reclaimPolicy: Retain and the other with reclaimPolicy: Delete.
Alternately, a workload may require a specific filesystem, such as ext4.
If a Synology NAS is going to be the most common way to configure storage on your cluster, it can be convenient to add the storageclass.kubernetes.io/is-default-class: "true" annotation to one of your StorageClasses.
The following table details the configurable parameters for the Synology StorageClass.
Name
Type
Description
Default
Supported protocols
dsm
string
The IPv4 address of your DSM, which must be included in the client-info.yml for the CSI driver to log in to DSM
-
iSCSI, SMB
location
string
The location (/volume1, /volume2, …) on DSM where the LUN for PersistentVolume will be created
-
iSCSI, SMB
fsType
string
The formatting file system of the PersistentVolumes when you mount them on the pods. This parameter only works with iSCSI. For SMB, the fsType is always ‘cifs‘.
ext4
iSCSI
protocol
string
The backing storage protocol. Enter ‘iscsi’ to create LUNs or ‘smb‘ to create shared folders on DSM.
iscsi
iSCSI, SMB
csi.storage.k8s.io/node-stage-secret-name
string
The name of node-stage-secret. Required if DSM shared folder is accessed via SMB.
-
SMB
csi.storage.k8s.io/node-stage-secret-namespace
string
The namespace of node-stage-secret. Required if DSM shared folder is accessed via SMB.
-
SMB
The VolumeSnapshotClass can be similarly configured with the following parameters:
Name
Type
Description
Default
Supported protocols
description
string
The description of the snapshot on DSM
-
iSCSI
is_locked
string
Whether you want to lock the snapshot on DSM
false
iSCSI, SMB
Apply YAML manifests
Once you have created the desired StorageClass(es) and VolumeSnapshotClass(es), the final step is to apply the Kubernetes manifests against the cluster.
The easiest way to apply them all at once is to create a kustomization.yaml file in the same directory as the manifests and use Kustomize to apply:
kubectl apply -k path/to/manifest/directory
Alternately, you can apply each manifest one-by-one:
kubectl apply -f <file>
Run performance tests
In order to test the provisioning, mounting, and performance of using a Synology NAS as Kubernetes persistent storage, use the following command:
If these two jobs complete successfully, use the following commands to get the results of the speed tests:
# Pod logs for read test:kubectl logs -l app=speedtest,job=read# Pod logs for write test:kubectl logs -l app=speedtest,job=write
When you’re satisfied with the results of the test, delete the artifacts created from the speedtest:
kubectl delete -f speedtest.yaml
3.1.6 - Local Storage
Using local storage with OpenEBS Jiva
If you want to use replicated storage leveraging disk space from a local disk with Talos Linux installed, OpenEBS Jiva is a great option.
This requires installing the iscsi-toolssystem extension.
Since OpenEBS Jiva is a replicated storage, it’s recommended to have at least three nodes where sufficient local disk space is available.
The documentation will follow installing OpenEBS Jiva via the offical Helm chart.
Since Talos is different from standard Operating Systems, the OpenEBS components need a little tweaking after the Helm installation.
Refer to the OpenEBS Jiva documentation if you need further customization.
NB: Also note that the Talos nodes need to be upgraded with --preserve set while running OpenEBS Jiva, otherwise you risk losing data.
Even though it’s possible to recover data from other replicas if the node is wiped during an upgrade, this can require extra operational knowledge to recover, so it’s highly recommended to use --preserve to avoid data loss.
Preparing the nodes
Create a machine config patch with the contents below and save as patch.yaml
To install the system extension, the node needs to be upgraded.
If there is no new release of Talos, the node can be upgraded to the same version as the existing Talos version.
Run the following command on each nodes subsequently:
You should see that the ext-tgtd and the ext-iscsid services are running.
NODE SERVICE STATE HEALTH LAST CHANGE LAST EVENT
192.168.20.51 apid Running OK 64h57m15s ago Health check successful
192.168.20.51 containerd Running OK 64h57m23s ago Health check successful
192.168.20.51 cri Running OK 64h57m20s ago Health check successful
192.168.20.51 etcd Running OK 64h55m29s ago Health check successful
192.168.20.51 ext-iscsid Running ? 64h57m19s ago Started task ext-iscsid (PID 4040) for container ext-iscsid
192.168.20.51 ext-tgtd Running ? 64h57m19s ago Started task ext-tgtd (PID 3999) for container ext-tgtd
192.168.20.51 kubelet Running OK 38h14m10s ago Health check successful
192.168.20.51 machined Running ? 64h57m29s ago Service started as goroutine
192.168.20.51 trustd Running OK 64h57m19s ago Health check successful
192.168.20.51 udevd Running OK 64h57m21s ago Health check successful
This will create a storage class named openebs-jiva-csi-default which can be used for workloads.
The storage class named openebs-hostpath is used by jiva to create persistent volumes backed by local storage and then used for replicated storage by the jiva controller.
Patching the jiva installation
Since Jiva assumes iscisd to be running natively on the host and not as a Talos extension service, we need to modify the CSI node daemonset to enable it to find the PID of the iscsid service.
The default config map used by Jiva also needs to be modified so that it can execute iscsiadm commands inside the PID namespace of the iscsid service.
Start by creating a configmap definition named config.yaml as below:
Enabling Pod Security Admission plugin to configure Pod Security Standards.
Kubernetes deprecated Pod Security Policy as of v1.21, and it is
going to be removed in v1.25.
Pod Security Policy was replaced with Pod Security Admission.
Pod Security Admission is alpha in v1.22 (requires a feature gate) and beta in v1.23 (enabled by default).
In this guide we are going to enable and configure Pod Security Admission in Talos.
Configuration
Talos provides default Pod Security Admission in the machine configuration:
more strict restricted profile is not enforced, but API server warns about found issues
This default policy can be modified by updating the generated machine configuration before the cluster is created or on the fly by using the talosctl CLI utility.
Verify current admission plugin configuration with:
Create a deployment that satisfies the baseline policy but gives warnings on restricted policy:
$ kubectl create deployment nginx --image=nginx
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation !=false(container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot !=true(pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")deployment.apps/nginx created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-85b98978db-j68l8 1/1 Running 0 2m3s
Create a daemonset which fails to meet requirements of the baseline policy:
$ kubectl apply -f debug.yaml
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true, hostIPC=true), privileged (container "debug-container" must not set securityContext.privileged=true), allowPrivilegeEscalation !=false(container "debug-container" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "debug-container" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot !=true(pod or container "debug-container" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "debug-container" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")daemonset.apps/debug-container created
Daemonset debug-container gets created, but no pods are scheduled:
$ kubectl get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
debug-container 00000 <none> 34s
Pod Security Admission plugin errors are in the daemonset events:
$ kubectl describe ds debug-container
...
Warning FailedCreate 92s daemonset-controller Error creating: pods "debug-container-kwzdj" is forbidden: violates PodSecurity "baseline:latest": host namespaces (hostNetwork=true, hostPID=true, hostIPC=true), privileged (container "debug-container" must not set securityContext.privileged=true)
Pod Security Admission configuration can also be overridden on a namespace level:
$ kubectl label ns default pod-security.kubernetes.io/enforce=privileged
namespace/default labeled
$ kubectl get ds
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
debug-container 22020 <none> 4s
As enforce policy was updated to the privileged for the default namespace, debug-container is now successfully running.
3.1.8 - Seccomp Profiles
Using custom Seccomp Profiles with Kubernetes workloads.
Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12.
It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel.
You can clean up the test resources by running the following command:
kubectl delete pod audit-pod
3.1.9 - Storage
Setting up storage for a Kubernetes cluster
In Kubernetes, using storage in the right way is well-facilitated by the API. However, unless you are running in a major public cloud, that API may not be hooked up to anything.
This frequently sends users down a rabbit hole of researching all the various options for storage backends for their platform, for Kubernetes, and for their workloads.
There are a lot of options out there, and it can be fairly bewildering.
For Talos, we try to limit the options somewhat to make the decision-making easier.
Public Cloud
If you are running on a major public cloud, use their block storage.
It is easy and automatic.
Storage Clusters
Sidero Labs recommends having separate disks (apart from the Talos install disk) to be used for storage.
Redundancy, scaling capabilities, reliability, speed, maintenance load, and ease of use are all factors you must consider when managing your own storage.
Running a storage cluster can be a very good choice when managing your own storage, and there are two projects we recommend, depending on your situation.
If you need vast amounts of storage composed of more than a dozen or so disks, we recommend you use Rook to manage Ceph.
Also, if you need both mount-once and mount-many capabilities, Ceph is your answer.
Ceph also bundles in an S3-compatible object store.
The down side of Ceph is that there are a lot of moving parts.
Please note that most people should never use mount-many semantics.
NFS is pervasive because it is old and easy, not because it is a good idea.
While it may seem like a convenience at first, there are all manner of locking, performance, change control, and reliability concerns inherent in any mount-many situation, so we strongly recommend you avoid this method.
If your storage needs are small enough to not need Ceph, use Mayastor.
Rook/Ceph
Ceph is the grandfather of open source storage clusters.
It is big, has a lot of pieces, and will do just about anything.
It scales better than almost any other system out there, open source or proprietary, being able to easily add and remove storage over time with no downtime, safely and easily.
It comes bundled with RadosGW, an S3-compatible object store; CephFS, a NFS-like clustered filesystem; and RBD, a block storage system.
With the help of Rook, the vast majority of the complexity of Ceph is hidden away by a very robust operator, allowing you to control almost everything about your Ceph cluster from fairly simple Kubernetes CRDs.
So if Ceph is so great, why not use it for everything?
Ceph can be rather slow for small clusters.
It relies heavily on CPUs and massive parallelisation to provide good cluster performance, so if you don’t have much of those dedicated to Ceph, it is not going to be well-optimised for you.
Also, if your cluster is small, just running Ceph may eat up a significant amount of the resources you have available.
Troubleshooting Ceph can be difficult if you do not understand its architecture.
There are lots of acronyms and the documentation assumes a fair level of knowledge.
There are very good tools for inspection and debugging, but this is still frequently seen as a concern.
Mayastor
Mayastor is an OpenEBS project built in Rust utilising the modern NVMEoF system.
(Despite the name, Mayastor does not require you to have NVME drives.)
It is fast and lean but still cluster-oriented and cloud native.
Unlike most of the other OpenEBS project, it is not built on the ancient iSCSI system.
Unlike Ceph, Mayastor is just a block store.
It focuses on block storage and does it well.
It is much less complicated to set up than Ceph, but you probably wouldn’t want to use it for more than a few dozen disks.
Mayastor is new, maybe too new.
If you’re looking for something well-tested and battle-hardened, this is not it.
However, if you’re looking for something lean, future-oriented, and simpler than Ceph, it might be a great choice.
Video Walkthrough
To see a live demo of this section, see the video below:
Prep Nodes
Either during initial cluster creation or on running worker nodes, several machine config values should be edited.
(This information is gathered from the Mayastor documentation.)
We need to set the vm.nr_hugepages sysctl and add openebs.io/engine=mayastor labels to the nodes which are meant to be storage nodes.
This can be done with talosctl patch machineconfig or via config patches during talosctl gen config.
Note: If you are adding/updating the vm.nr_hugepages on a node which already had the openebs.io/engine=mayastor label set, you’d need to restart kubelet so that it picks up the new value, by issuing the following command
talosctl -n <node ip> service kubelet restart
Deploy Mayastor
Continue setting up Mayastor using the official documentation.
NFS
NFS is an old pack animal long past its prime.
NFS is slow, has all kinds of bottlenecks involving contention, distributed locking, single points of service, and more.
However, it is supported by a wide variety of systems.
You don’t want to use it unless you have to, but unfortunately, that “have to” is too frequent.
The NFS client is part of the kubelet image maintained by the Talos team.
This means that the version installed in your running kubelet is the version of NFS supported by Talos.
You can reduce some of the contention problems by parceling Persistent Volumes from separate underlying directories.
Object storage
Ceph comes with an S3-compatible object store, but there are other options, as
well.
These can often be built on top of other storage backends.
For instance, you may have your block storage running with Mayastor but assign a
Pod a large Persistent Volume to serve your object store.
One of the most popular open source add-on object stores is MinIO.
Others (iSCSI)
The most common remaining systems involve iSCSI in one form or another.
These include the original OpenEBS, Rancher’s Longhorn, and many proprietary systems.
iSCSI in Linux is facilitated by open-iscsi.
This system was designed long before containers caught on, and it is not well
suited to the task, especially when coupled with a read-only host operating
system.
iSCSI support in Talos is now supported via the iscsi-toolssystem extension installed.
The extension enables compatibility with OpenEBS Jiva - refer to the local storage installation guide for more information.
3.2 - Network
Managing the Kubernetes cluster networking
3.2.1 - Deploying Cilium CNI
In this guide you will learn how to set up Cilium CNI on Talos.
From v1.9 onwards Cilium does no longer provide a one-liner install manifest that can be used to install Cilium on a node via kubectl apply -f or passing it in as an extra url in the urls part in the Talos machine configuration.
Installing Cilium the new way via the cilium cli is broken, so we’ll be using helm to install Cilium.
For more information: Install with CLI fails, works with Helm
This documentation will outline installing Cilium CNI v1.11.2 on Talos in four different ways.
Adhering to Talos principles we’ll deploy Cilium with IPAM mode set to Kubernetes.
Each method can either install Cilium using kube proxy (default) or without: Kubernetes Without kube-proxy
Machine config preparation
When generating the machine config for a node set the CNI to none.
For example using a config patch:
After applying the machine config and bootstrapping Talos will appear to hang on phase 18/19 with the message: retrying error: node not ready.
This happens because nodes in Kubernetes are only marked as ready once the CNI is up.
As there is no CNI defined, the boot process is pending and will reboot the node to retry after 10 minutes, this is expected behavior.
During this window you can install Cilium manually by running the following:
After generating cilium.yaml using helm template, instead of applying this manifest directly during the Talos boot window (before the reboot timeout).
You can also host this file somewhere and patch the machine config to apply this manifest automatically during bootstrap.
To do this patch your machine configuration to include this config instead of the above:
name: custom # Name of CNI to use.# URLs containing manifests to apply for the CNI.urls:
- https://server.yourdomain.tld/some/path/cilium.yaml
However, beware of the fact that the helm generated Cilium manifest contains sensitive key material.
As such you should definitely not host this somewhere publicly accessible.
Method 4: Helm manifests inline install
A more secure option would be to include the helm template output manifest inside the machine configuration.
The machine config should be generated with CNI set to none
This will install the Cilium manifests at just the right time during bootstrap.
Beware though:
Changing the namespace when templating with Helm does not generate a manifest containing the yaml to create that namespace.
As the inline manifest is processed from top to bottom make sure to manually put the namespace yaml at the start of the inline manifest.
Only add the Cilium inline manifest to the control plane nodes machine configuration.
Make sure all control plane nodes have an identical configuration.
If you delete any of the generated resources they will be restored whenever a control plane node reboots.
As a safety measure, Talos only creates missing resources from inline manifests, it never deletes or updates anything.
If you need to update a manifest make sure to first edit all control plane machine configurations and then run talosctl upgrade-k8s as it will take care of updating inline manifests.
Known issues
Currently there is an interaction between a Kubespan enabled Talos cluster and Cilium that results in the cluster going down during bootstrap after applying the Cilium manifests.
For more details: Kubespan and Cilium compatiblity: etcd is failing
Some kernel values changed by kube-proxy are not set to good defaults when running the cilium kernel-proxy alternative.
For more details: Kernel default values (sysctl)
Other things to know
Talos has full kernel module support for eBPF, See:
This allows you to set --set enableXTSocketFallback=false on the helm install/template command preventing Cilium from disabling the ip_early_demux kernel feature.
This will win back some performance.
3.2.2 - KubeSpan
Learn to use KubeSpan to connect Talos Linux machines securely across networks.
KubeSpan is a feature of Talos that automates the setup and maintenance of a full mesh WireGuard network for your cluster, giving you the ability to operate hybrid Kubernetes clusters that can span the edge, datacenter, and cloud.
Management of keys and discovery of peers can be completely automated for a zero-touch experience that makes it simple and easy to create hybrid clusters.
KubeSpan consists of client code in Talos Linux, as well as a discovery service that enables clients to securely find each other.
Sidero Labs operates a free Discovery Service, but the discovery service may be operated by your organization and can be downloaded here.
Video Walkthrough
To learn more about KubeSpan, see the video below:
To see a live demo of KubeSpan, see one the videos below:
Enabling
Creating a New Cluster
To generate configuration files for a new cluster, we can use the --with-kubespan flag in talosctl gen config.
This will enable peer discovery and KubeSpan.
...
# Provides machine specific network configuration options.network:
# Configures KubeSpan feature.kubespan:
enabled: true# Enable the KubeSpan feature....
# Configures cluster member discovery.discovery:
enabled: true# Enable the cluster membership discovery feature.# Configure registries used for cluster member discovery.registries:
# Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional informationkubernetes: {}
# Service registry is using an external service to push and pull information about cluster members.service: {}
...
# Provides cluster specific configuration options.cluster:
id: yui150Ogam0pdQoNZS2lZR-ihi8EWxNM17bZPktJKKE= # Globally unique identifier for this cluster.secret: dAmFcyNmDXusqnTSkPJrsgLJ38W8oEEXGZKM0x6Orpc= # Shared secret of cluster.
The default discovery service is an external service hosted for free by Sidero Labs.
The default value is https://discovery.talos.dev/.
Contact Sidero Labs if you need to run this service privately.
Upgrading an Existing Cluster
In order to enable KubeSpan for an existing cluster, upgrade to the latest version of Talos (v1.2.8).
Once your cluster is upgraded, the configuration of each node must contain the globally unique identifier, the shared secret for the cluster, and have KubeSpan and discovery enabled.
Note: Discovery can be used without KubeSpan, but KubeSpan requires at least one discovery registry.
Talos v0.11 or Less
If you are migrating from Talos v0.11 or less, we need to generate a cluster ID and secret.
Now, update the configuration of each node with the cluster with the generated id and secret.
You should end up with the addition of something like this (your id and secret should be different):
Talos automatically configures unique IPv6 address for each node in the cluster-specific IPv6 ULA prefix.
Wireguard private key is generated for the node, private key never leaves the node while public key is published through the cluster discovery.
KubeSpanIdentity is persisted across reboots and upgrades in STATE partition in the file kubespan-identity.yaml.
KubeSpanPeerSpecs
A node’s WireGuard peers can be obtained with:
$ talosctl get kubespanpeerspecs
ID VERSION LABEL ENDPOINTS
06D9QQOydzKrOL7oeLiqHy9OWE8KtmJzZII2A5/FLFI=2 talos-default-controlplane-2 ["172.20.0.3:51820"]THtfKtfNnzJs1nMQKs5IXqK0DFXmM//0WMY+NnaZrhU=2 talos-default-controlplane-3 ["172.20.0.4:51820"]nVHu7l13uZyk0AaI1WuzL2/48iG8af4WRv+LWmAax1M=2 talos-default-worker-2 ["172.20.0.6:51820"]zXP0QeqRo+CBgDH1uOBiQ8tA+AKEQP9hWkqmkE/oDlc=2 talos-default-worker-1 ["172.20.0.5:51820"]
The peer ID is the Wireguard public key.
KubeSpanPeerSpecs are built from the cluster discovery data.
KubeSpanPeerStatuses
The status of a node’s WireGuard peers can be obtained with:
$ talosctl get kubespanpeerstatuses
ID VERSION LABEL ENDPOINT STATE RX TX
06D9QQOydzKrOL7oeLiqHy9OWE8KtmJzZII2A5/FLFI=63 talos-default-controlplane-2 172.20.0.3:51820 up 1504322017869488THtfKtfNnzJs1nMQKs5IXqK0DFXmM//0WMY+NnaZrhU=62 talos-default-controlplane-3 172.20.0.4:51820 up 1457320818157680nVHu7l13uZyk0AaI1WuzL2/48iG8af4WRv+LWmAax1M=60 talos-default-worker-2 172.20.0.6:51820 up 13007246888zXP0QeqRo+CBgDH1uOBiQ8tA+AKEQP9hWkqmkE/oDlc=60 talos-default-worker-1 172.20.0.5:51820 up 13004446556
KubeSpan peer status includes following information:
the actual endpoint used for peer communication
link state:
unknown: the endpoint was just changed, link state is not known yet
up: there is a recent handshake from the peer
down: there is no handshake from the peer
number of bytes sent/received over the Wireguard link with the peer
If the connection state goes down, Talos will be cycling through the available endpoints until it finds the one which works.
Peer status information is updated every 30 seconds.
KubeSpanEndpoints
A node’s WireGuard endpoints (peer addresses) can be obtained with:
$ talosctl get kubespanendpoints
ID VERSION ENDPOINT AFFILIATE ID
06D9QQOydzKrOL7oeLiqHy9OWE8KtmJzZII2A5/FLFI=1 172.20.0.3:51820 2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF
THtfKtfNnzJs1nMQKs5IXqK0DFXmM//0WMY+NnaZrhU=1 172.20.0.4:51820 b3DebkPaCRLTLLWaeRF1ejGaR0lK3m79jRJcPn0mfA6C
nVHu7l13uZyk0AaI1WuzL2/48iG8af4WRv+LWmAax1M=1 172.20.0.6:51820 NVtfu1bT1QjhNq5xJFUZl8f8I8LOCnnpGrZfPpdN9WlB
zXP0QeqRo+CBgDH1uOBiQ8tA+AKEQP9hWkqmkE/oDlc=1 172.20.0.5:51820 6EVq8RHIne03LeZiJ60WsJcoQOtttw1ejvTS6SOBzhUA
The endpoint ID is the base64 encoded WireGuard public key.
The observed endpoints are submitted back to the discovery service (if enabled) so that other peers can try additional endpoints to establish the connection.
3.3 - Upgrading Kubernetes
Guide on how to upgrade the Kubernetes cluster from Talos Linux.
This guide covers upgrading Kubernetes on Talos Linux clusters.
For upgrading the Talos Linux operating system, see Upgrading Talos
Video Walkthrough
To see a demo of this process, watch this video:
Automated Kubernetes Upgrade
The recommended method to upgrade Kubernetes is to use the talosctl upgrade-k8s command.
This will automatically update the components needed to upgrade Kubernetes safely.
Upgrading Kubernetes is non-disruptive to the cluster workloads.
To trigger a Kubernetes upgrade, issue a command specifiying the version of Kubernetes to ugprade to, such as:
Note that the --nodes parameter specifies the control plane node to send the API call to, but all members of the cluster will be upgraded.
To check what will be upgraded you can run talosctl upgrade-k8s with the --dry-run flag:
$ talosctl --nodes <controlplane node> upgrade-k8s --to 1.25.5 --dry-run
WARNING: found resources which are going to be deprecated/migrated in the version 1.25.5
RESOURCE COUNT
validatingwebhookconfigurations.v1beta1.admissionregistration.k8s.io 4mutatingwebhookconfigurations.v1beta1.admissionregistration.k8s.io 3customresourcedefinitions.v1beta1.apiextensions.k8s.io 25apiservices.v1beta1.apiregistration.k8s.io 54leases.v1beta1.coordination.k8s.io 4automatically detected the lowest Kubernetes version 1.24.3
checking for resource APIs to be deprecated in version 1.25.5
discovered controlplane nodes ["172.20.0.2""172.20.0.3""172.20.0.4"]discovered worker nodes ["172.20.0.5""172.20.0.6"]updating "kube-apiserver" to version "1.25.5" > "172.20.0.2": starting update
> update kube-apiserver: v1.24.3 -> 1.25.5
> skipped in dry-run
> "172.20.0.3": starting update
> update kube-apiserver: v1.24.3 -> 1.25.5
> skipped in dry-run
> "172.20.0.4": starting update
> update kube-apiserver: v1.24.3 -> 1.25.5
> skipped in dry-run
updating "kube-controller-manager" to version "1.25.5" > "172.20.0.2": starting update
> update kube-controller-manager: v1.24.3 -> 1.25.5
> skipped in dry-run
> "172.20.0.3": starting update
<snip>
updating manifests
> apply manifest Secret bootstrap-token-3lb63t
> apply skipped in dry run
> apply manifest ClusterRoleBinding system-bootstrap-approve-node-client-csr
> apply skipped in dry run
<snip>
To upgrade Kubernetes from v1.24.3 to v1.25.5 run:
$ talosctl --nodes <controlplane node> upgrade-k8s --to 1.25.5
automatically detected the lowest Kubernetes version 1.24.3
checking for resource APIs to be deprecated in version 1.25.5
discovered controlplane nodes ["172.20.0.2""172.20.0.3""172.20.0.4"]discovered worker nodes ["172.20.0.5""172.20.0.6"]updating "kube-apiserver" to version "1.25.5" > "172.20.0.2": starting update
> update kube-apiserver: v1.24.3 -> 1.25.5
> "172.20.0.2": machine configuration patched
> "172.20.0.2": waiting for API server state pod update
< "172.20.0.2": successfully updated
> "172.20.0.3": starting update
> update kube-apiserver: v1.24.3 -> 1.25.5
<snip>
This command runs in several phases:
Every control plane node machine configuration is patched with the new image version for each control plane component.
Talos renders new static pod definitions on the configuration update which is picked up by the kubelet.
The command waits for the change to propagate to the API server state.
The command updates the kube-proxy daemonset with the new image version.
On every node in the cluster, the kubelet version is updated.
The command then waits for the kubelet service to be restarted and become healthy.
The update is verified by checking the Node resource state.
Kubernetes bootstrap manifests are re-applied to the cluster.
Updated bootstrap manifests might come with a new Talos version (e.g. CoreDNS version update), or might be the result of machine configuration change.
Note: The upgrade-k8s command never deletes any resources from the cluster: they should be deleted manually.
If the command fails for any reason, it can be safely restarted to continue the upgrade process from the moment of the failure.
Manual Kubernetes Upgrade
Kubernetes can be upgraded manually by following the steps outlined below.
They are equivalent to the steps performed by the talosctl upgrade-k8s command.
Kubeconfig
In order to edit the control plane, you need a working kubectl config.
If you don’t already have one, you can get one by running:
talosctl --nodes <controlplane node> kubeconfig
API Server
Patch machine configuration using talosctl patch command:
$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "k8s.gcr.io/kube-apiserver:v1.25.5"}]'patched mc at the node 172.20.0.2
The JSON patch might need to be adjusted if current machine configuration is missing .cluster.apiServer.image key.
Also the machine configuration can be edited manually with talosctl -n <IP> edit mc --mode=no-reboot.
Capture the new version of kube-apiserver config with:
In this example, the new version is 5.
Wait for the new pod definition to propagate to the API server state (replace talos-default-controlplane-1 with the node name):
$ kubectl get pod -n kube-system -l k8s-app=kube-apiserver --field-selector spec.nodeName=talos-default-controlplane-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'5
Check that the pod is running:
$ kubectl get pod -n kube-system -l k8s-app=kube-apiserver --field-selector spec.nodeName=talos-default-controlplane-1
NAME READY STATUS RESTARTS AGE
kube-apiserver-talos-default-controlplane-1 1/1 Running 0 16m
Repeat this process for every control plane node, verifying that state got propagated successfully between each node update.
Controller Manager
Patch machine configuration using talosctl patch command:
$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "k8s.gcr.io/kube-controller-manager:v1.25.5"}]'patched mc at the node 172.20.0.2
The JSON patch might need be adjusted if current machine configuration is missing .cluster.controllerManager.image key.
Capture new version of kube-controller-manager config with:
In this example, new version is 3.
Wait for the new pod definition to propagate to the API server state (replace talos-default-controlplane-1 with the node name):
$ kubectl get pod -n kube-system -l k8s-app=kube-controller-manager --field-selector spec.nodeName=talos-default-controlplane-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'3
Check that the pod is running:
$ kubectl get pod -n kube-system -l k8s-app=kube-controller-manager --field-selector spec.nodeName=talos-default-controlplane-1
NAME READY STATUS RESTARTS AGE
kube-controller-manager-talos-default-controlplane-1 1/1 Running 0 35m
Repeat this process for every control plane node, verifying that state propagated successfully between each node update.
Scheduler
Patch machine configuration using talosctl patch command:
$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "k8s.gcr.io/kube-scheduler:v1.25.5"}]'patched mc at the node 172.20.0.2
JSON patch might need be adjusted if current machine configuration is missing .cluster.scheduler.image key.
Capture new version of kube-scheduler config with:
In this example, new version is 3.
Wait for the new pod definition to propagate to the API server state (replace talos-default-controlplane-1 with the node name):
$ kubectl get pod -n kube-system -l k8s-app=kube-scheduler --field-selector spec.nodeName=talos-default-controlplane-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'3
Check that the pod is running:
$ kubectl get pod -n kube-system -l k8s-app=kube-scheduler --field-selector spec.nodeName=talos-default-controlplane-1
NAME READY STATUS RESTARTS AGE
kube-scheduler-talos-default-controlplane-1 1/1 Running 0 39m
Repeat this process for every control plane node, verifying that state got propagated successfully between each node update.
Note: if some bootstrap resources were removed, they have to be removed from the cluster manually.
kubelet
For every node, patch machine configuration with new kubelet version, wait for the kubelet to restart with new version:
$ talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.25.5"}]'patched mc at the node 172.20.0.2
Once kubelet restarts with the new configuration, confirm upgrade with kubectl get nodes <name>:
$ kubectl get nodes talos-default-controlplane-1
NAME STATUS ROLES AGE VERSION
talos-default-controlplane-1 Ready control-plane 123m v1.25.5
4 - Advanced Guides
4.1 - Advanced Networking
How to configure advanced networking options on Talos Linux.
Static Addressing
Static addressing is comprised of specifying addresses, routes ( remember to add your default gateway ), and interface.
Most likely you’ll also want to define the nameservers so you have properly functioning DNS.
In some environments you may need to set additional addresses on an interface.
In the following example, we set two additional addresses on the loopback interface.
Setting up Talos Linux to work in environments with no internet access.
In this guide we will create a Talos cluster running in an air-gapped environment with all the required images being pulled from an internal registry.
We will use the QEMU provisioner available in talosctl to create a local cluster, but the same approach could be used to deploy Talos in bigger air-gapped networks.
In air-gapped environments, access to the public Internet is restricted, so Talos can’t pull images from public Docker registries (docker.io, ghcr.io, etc.)
We need to identify the images required to install and run Talos.
The same strategy can be used for images required by custom workloads running on the cluster.
The talosctl images command provides a list of default images used by the Talos cluster (with default configuration
settings).
To print the list of images, run:
talosctl images
This list contains images required by a default deployment of Talos.
There might be additional images required for the workloads running on this cluster, and those should be added to this list.
Preparing the Internal Registry
As access to the public registries is restricted, we have to run an internal Docker registry.
In this guide, we will launch the registry on the same machine using Docker:
This registry will be accepting connections on port 6000 on the host IPs.
The registry is empty by default, so we have fill it with the images required by Talos.
First, we pull all the images to our local Docker daemon:
$ for image in `talosctl images`; do docker pull $image; donev0.15.1: Pulling from coreos/flannel
Digest: sha256:9a296fbb67790659adc3701e287adde3c59803b7fcefe354f1fc482840cdb3d9
...
All images are now stored in the Docker daemon store:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
gcr.io/etcd-development/etcd v3.5.3 604d4f022632 6 days ago 181MB
ghcr.io/siderolabs/install-cni v1.0.0-2-gc5d3ab0 4729e54f794d 6 days ago 76MB
...
Now we need to re-tag them so that we can push them to our local registry.
We are going to replace the first component of the image name (before the first slash) with our registry endpoint 127.0.0.1:6000:
$ for image in `talosctl images`; do\
docker tag $image`echo$image | sed -E 's#^[^/]+/#127.0.0.1:6000/#'`; \
done
As the next step, we push images to the internal registry:
$ for image in `talosctl images`; do\
docker push `echo$image | sed -E 's#^[^/]+/#127.0.0.1:6000/#'`; \
done
We can now verify that the images are pushed to the registry:
Note: images in the registry don’t have the registry endpoint prefix anymore.
Launching Talos in an Air-gapped Environment
For Talos to use the internal registry, we use the registry mirror feature to redirect all image pull requests to the internal registry.
This means that the registry endpoint (as the first component of the image reference) gets ignored, and all pull requests are sent directly to the specified endpoint.
We are going to use a QEMU-based Talos cluster for this guide, but the same approach works with Docker-based clusters as well.
As QEMU-based clusters go through the Talos install process, they can be used better to model a real air-gapped environment.
Identify all registry prefixes from talosctl images, for example:
docker.io
gcr.io
ghcr.io
k8s.gcr.io
quay.io
The talosctl cluster create command provides conveniences for common configuration options.
The only required flag for this guide is --registry-mirror <endpoint>=http://10.5.0.1:6000 which redirects every pull request to the internal registry, this flag
needs to be repeated for each of the identified registry prefixes above.
The endpoint being used is 10.5.0.1, as this is the default bridge interface address which will be routable from the QEMU VMs (127.0.0.1 IP will be pointing to the VM itself).
$ sudo --preserve-env=HOME talosctl cluster create --provisioner=qemu --install-image=ghcr.io/siderolabs/installer:v1.2.8 \
--registry-mirror docker.io=http://10.5.0.1:6000 \
--registry-mirror gcr.io=http://10.5.0.1:6000 \
--registry-mirror ghcr.io=http://10.5.0.1:6000 \
--registry-mirror k8s.gcr.io=http://10.5.0.1:6000 \
--registry-mirror quay.io=http://10.5.0.1:6000
validating CIDR and reserving IPs
generating PKI and tokens
creating state directory in "/home/user/.talos/clusters/talos-default"creating network talos-default
creating load balancer
creating dhcpd
creating master nodes
creating worker nodes
waiting for API
...
Note: --install-image should match the image which was copied into the internal registry in the previous step.
You can be verify that the cluster is air-gapped by inspecting the registry logs: docker logs -f registry-airgapped.
Closing Notes
Running in an air-gapped environment might require additional configuration changes, for example using custom settings for DNS and NTP servers.
When scaling this guide to the bare-metal environment, following Talos config snippet could be used as an equivalent of the --registry-mirror flag above:
Other implementations of Docker registry can be used in place of the Docker registry image used above to run the registry.
If required, auth can be configured for the internal registry (and custom TLS certificates if needed).
4.3 - Customizing the Kernel
Guide on how to customize the kernel used by Talos Linux.
The installer image contains ONBUILD instructions that handle the following:
the decompression, and unpacking of the initramfs.xz
the unsquashing of the rootfs
the copying of new rootfs files
the squashing of the new rootfs
and the packing, and compression of the new initramfs.xz
When used as a base image, the installer will perform the above steps automatically with the requirement that a customization stage be defined in the Dockerfile.
Build and push your own kernel:
git clone https://github.com/talos-systems/pkgs.git
cd pkgs
make kernel-menuconfig USERNAME=_your_github_user_name_
docker login ghcr.io --username _your_github_user_name_
make kernel USERNAME=_your_github_user_name_ PUSH=true
Using a multi-stage Dockerfile we can define the customization stage and build FROM the installer image:
FROM scratch AS customizationCOPY --from=<custom kernel image> /lib/modules /lib/modules
FROM ghcr.io/siderolabs/installer:latestCOPY --from=<custom kernel image> /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
When building the image, the customization stage will automatically be copied into the rootfs.
The customization stage is not limited to a single COPY instruction.
In fact, you can do whatever you would like in this stage, but keep in mind that everything in / will be copied into the rootfs.
Note: buildkit has a bug #816, to disable it use DOCKER_BUILDKIT=0
Now that we have a custom installer we can build Talos for the specific platform we wish to deploy to.
4.4 - Customizing the Root Filesystem
How to add your own content to the immutable root file system of Talos Linux.
The installer image contains ONBUILD instructions that handle the following:
the decompression, and unpacking of the initramfs.xz
the unsquashing of the rootfs
the copying of new rootfs files
the squashing of the new rootfs
and the packing, and compression of the new initramfs.xz
When used as a base image, the installer will perform the above steps automatically with the requirement that a customization stage be defined in the Dockerfile.
For example, say we have an image that contains the contents of a library we wish to add to the Talos rootfs.
We need to define a stage with the name customization:
FROM scratch AS customizationCOPY --from=<name|index> <src> <dest>
Using a multi-stage Dockerfile we can define the customization stage and build FROM the installer image:
FROM scratch AS customizationCOPY --from=<name|index> <src> <dest>
FROM ghcr.io/siderolabs/installer:latest
When building the image, the customization stage will automatically be copied into the rootfs.
The customization stage is not limited to a single COPY instruction.
In fact, you can do whatever you would like in this stage, but keep in mind that everything in / will be copied into the rootfs.
Note: <dest> is the path relative to the rootfs that you wish to place the contents of <src>.
This will perform a rm -rf on the specified paths relative to the rootfs.
Note: RM must be a whitespace delimited list.
The resulting image can be used to:
generate an image for any of the supported providers
perform bare-metall installs
perform upgrades
We will step through common customizations in the remainder of this section.
4.5 - Developing Talos
Learn how to set up a development environment for local testing and hacking on Talos itself!
This guide outlines steps and tricks to develop Talos operating systems and related components.
The guide assumes Linux operating system on the development host.
Some steps might work under Mac OS X, but using Linux is highly advised.
Note: network=host allows buildx builder to access host network, so that it can push to a local container registry (see below).
Make sure the following steps work:
make talosctl
make initramfs kernel
Set up a local docker registry:
docker run -d -p 5005:5000 \
--restart always \
--name local registry:2
Try to build and push to local registry an installer image:
make installer IMAGE_REGISTRY=127.0.0.1:5005 PUSH=true
Record the image name output in the step above.
Note: it is also possible to force a stable image tag by using TAG variable: make installer IMAGE_REGISTRY=127.0.0.1:5005 TAG=v1.0.0-alpha.1 PUSH=true.
Running Talos cluster
Set up local caching docker registries (this speeds up Talos cluster boot a lot), script is in the Talos repo:
custom --cidr to make QEMU cluster use different network than default Docker setup (optional)
--registry-mirror uses the caching proxies set up above to speed up boot time a lot, last one adds your local registry (installer image was pushed to it)
--install-image is the image you built with make installer above
--controlplanes & --workers configure cluster size, choose to match your resources; 3 controlplanes give you HA control plane; 1 controlplane is enough, never do 2 controlplanes
--with-bootloader=false disables boot from disk (Talos will always boot from _out/vmlinuz-amd64 and _out/initramfs-amd64.xz).
This speeds up development cycle a lot - no need to rebuild installer and perform install, rebooting is enough to get new code.
Note: as boot loader is not used, it’s not necessary to rebuild installer each time (old image is fine), but sometimes it’s needed (when configuration changes are done and old installer doesn’t validate the config).
talosctl cluster create derives Talos machine configuration version from the install image tag, so sometimes early in the development cycle (when new minor tag is not released yet), machine config version can be overridden with --talos-version=v1.2.
If the --with-bootloader=false flag is not enabled, for Talos cluster to pick up new changes to the code (in initramfs), it will require a Talos upgrade (so new installer should be built).
With --with-bootloader=false flag, Talos always boots from initramfs in _out/ directory, so simple reboot is enough to pick up new code changes.
If the installation flow needs to be tested, --with-bootloader=false shouldn’t be used.
Once talosctl cluster create finishes successfully, talosconfig and kubeconfig will be set up automatically to point to your cluster.
Start playing with talosctl:
talosctl -n 172.20.0.2 version
talosctl -n 172.20.0.3,172.20.0.4 dashboard
talosctl -n 172.20.0.4 get members
Same with kubectl:
kubectl get nodes -o wide
You can deploy some Kubernetes workloads to the cluster.
You can edit machine config on the fly with talosctl edit mc --immediate, config patches can be applied via --config-patch flags, also many features have specific flags in talosctl cluster create.
Quick Reboot
To reboot whole cluster quickly (e.g. to pick up a change made in the code):
for socket in ~/.talos/clusters/talos-default/talos-default-*.monitor; doecho"q" | sudo socat - unix-connect:$socket; done
Sending q to a single socket allows to reboot a single node.
Note: This command performs immediate reboot (as if the machine was powered down and immediately powered back up), for normal Talos reboot use talosctl reboot.
Development Cycle
Fast development cycle:
bring up a cluster
make code changes
rebuild initramfs with make initramfs
reboot a node to pick new initramfs
verify code changes
more code changes…
Some aspects of Talos development require to enable bootloader (when working on installer itself), in that case quick development cycle is no longer possible, and cluster should be destroyed and recreated each time.
Running Integration Tests
If integration tests were changed (or when running them for the first time), first rebuild the integration test binary:
rm -f _out/integration-test-linux-amd64; make _out/integration-test-linux-amd64
Running short tests against QEMU provisioned cluster:
This command stops QEMU and helper processes, tears down bridged network on the host, and cleans up
cluster state in ~/.talos/clusters.
Note: if the host machine is rebooted, QEMU instances and helpers processes won’t be started back.
In that case it’s required to clean up files in ~/.talos/clusters/<cluster-name> directory manually.
Optional
Set up cross-build environment with:
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
Note: the static qemu binaries which come with Ubuntu 21.10 seem to be broken.
Unit tests
Unit tests can be run in buildx with make unit-tests, on Ubuntu systems some tests using loop devices will fail because Ubuntu uses low-index loop devices for snaps.
Most of the unit-tests can be run standalone as well, with regular go test, or using IDE integration:
go test -v ./internal/pkg/circular/
This provides much faster feedback loop, but some tests require either elevated privileges (running as root) or additional binaries available only in Talos rootfs (containerd tests).
Running tests as root can be done with -exec flag to go test, but this is risky, as test code has root access and can potentially make undesired changes:
go test -exec sudo -v ./internal/app/machined/pkg/controllers/network/...
Go Profiling
Build initramfs with debug enabled: make initramfs WITH_DEBUG=1.
Launch Talos cluster with bootloader disabled, and use go tool pprof to capture the profile and show the output in your browser:
go tool pprof http://172.20.0.2:9982/debug/pprof/heap
The IP address 172.20.0.2 is the address of the Talos node, and port :9982 depends on the Go application to profile:
9981: apid
9982: machined
9983: trustd
Testing Air-gapped Environments
There is a hidden talosctl debug air-gapped command which launches two components:
HTTP proxy capable of proxying HTTP and HTTPS requests
HTTPS server with a self-signed certificate
The command also writes down Talos machine configuration patch to enable the HTTP proxy and add a self-signed certificate
to the list of trusted certificates:
$ talosctl debug air-gapped --advertised-address 172.20.0.1
2022/08/04 16:43:14 writing config patch to air-gapped-patch.yaml
2022/08/04 16:43:14 starting HTTP proxy on :8002
2022/08/04 16:43:14 starting HTTPS server with self-signed cert on :8001
The --advertised-address should match the bridge IP of the Talos node.
The first section appends a self-signed certificate of the HTTPS server to the list of trusted certificates,
followed by the HTTP proxy setup (in-cluster traffic is excluded from the proxy).
The last section adds an extra Kubernetes manifest hosted on the HTTPS server.
The machine configuration patch can now be used to launch a test Talos cluster:
The following lines should appear in the output of the talosctl debug air-gapped command:
CONNECT discovery.talos.dev:443: the HTTP proxy is used to talk to the discovery service
http: TLS handshake error from 172.20.0.2:53512: remote error: tls: bad certificate: an expected error on Talos side, as self-signed cert is not written yet to the file
GET /debug.yaml: Talos successfully fetches the extra manifest successfully
There might be more output depending on the registry caches being used or not.
4.6 - Disaster Recovery
Procedure for snapshotting etcd database and recovering from catastrophic control plane failure.
etcd database backs Kubernetes control plane state, so if the etcd service is unavailable,
the Kubernetes control plane goes down, and the cluster is not recoverable until etcd is recovered.
etcd builds around the consensus protocol Raft, so highly-available control plane clusters can tolerate the loss of nodes so long as more than half of the members are running and reachable.
For a three control plane node Talos cluster, this means that the cluster tolerates a failure of any single node,
but losing more than one node at the same time leads to complete loss of service.
Because of that, it is important to take routine backups of etcd state to have a snapshot to recover the cluster from
in case of catastrophic failure.
Backup
Snapshotting etcd Database
Create a consistent snapshot of etcd database with talosctl etcd snapshot command:
$ talosctl -n <IP> etcd snapshot db.snapshot
etcd snapshot saved to "db.snapshot"(2015264 bytes)snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136
Note: filename db.snapshot is arbitrary.
This database snapshot can be taken on any healthy control plane node (with IP address <IP> in the example above),
as all etcd instances contain exactly same data.
It is recommended to configure etcd snapshots to be created on some schedule to allow point-in-time recovery using the latest snapshot.
Disaster Database Snapshot
If the etcd cluster is not healthy (for example, if quorum has already been lost), the talosctl etcd snapshot command might fail.
In that case, copy the database snapshot directly from the control plane node:
This snapshot might not be fully consistent (if the etcd process is running), but it allows
for disaster recovery when latest regular snapshot is not available.
Machine Configuration
Machine configuration might be required to recover the node after hardware failure.
Backup Talos node machine configuration with the command:
talosctl -n IP get mc v1alpha1 -o yaml | yq eval'.spec' -
Recovery
Before starting a disaster recovery procedure, make sure that etcd cluster can’t be recovered:
get etcd cluster member list on all healthy control plane nodes with talosctl -n IP etcd members command and compare across all members.
query etcd health across control plane nodes with talosctl -n IP service etcd.
If the quorum can be restored, restoring quorum might be a better strategy than performing full disaster recovery
procedure.
Latest Etcd Snapshot
Get hold of the latest etcd database snapshot.
If a snapshot is not fresh enough, create a database snapshot (see above), even if the etcd cluster is unhealthy.
Init Node
Make sure that there are no control plane nodes with machine type init:
$ talosctl -n <IP1>,<IP2>,... get machinetype
NODE NAMESPACE TYPE ID VERSION TYPE
172.20.0.2 config MachineType machine-type 2 controlplane
172.20.0.4 config MachineType machine-type 2 controlplane
172.20.0.3 config MachineType machine-type 2 controlplane
Init node type is deprecated, and are incompatible with etcd recovery procedure.
init node can be converted to controlplane type with talosctl edit mc --mode=staged command followed
by node reboot with talosctl reboot command.
Preparing Control Plane Nodes
If some control plane nodes experienced hardware failure, replace them with new nodes.
Use machine configuration backup to re-create the nodes with the same secret material and control plane settings
to allow workers to join the recovered control plane.
If a control plane node is up but etcd isn’t, wipe the node’s EPHEMERAL partition to remove the etcd
data directory (make sure a database snapshot is taken before doing this):
At this point, all control plane nodes should boot up, and etcd service should be in the Preparing state.
The Kubernetes control plane endpoint should be pointed to the new control plane nodes if there were
changes to the node addresses.
Recovering from the Backup
Make sure all etcd service instances are in Preparing state:
$ talosctl -n <IP> service etcd
NODE 172.20.0.2
ID etcd
STATE Preparing
HEALTH ?
EVENTS [Preparing]: Running pre state (17s ago)[Waiting]: Waiting for service "cri" to be "up", time sync (18s ago)[Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", time sync (20s ago)
Execute the bootstrap command against any control plane node passing the path to the etcd database snapshot:
$ talosctl -n <IP> bootstrap --recover-from=./db.snapshot
recovering from snapshot "./db.snapshot": hash c25fd181, revision 4193, total keys 1287, total size 3035136
Note: if database snapshot was copied out directly from the etcd data directory using talosctl cp,
add flag --recover-skip-hash-check to skip integrity check on restore.
Talos node should print matching information in the kernel log:
recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restoring snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/li}
{"level":"info","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":3360}
{"level":"info","msg":"added member","cluster-id":"a3390e43eb5274e2","local-member-id":"0","added-peer-id":"eb4f6f534361855e","added-peer-peer-urls":["https:/}
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
Now etcd service should become healthy on the bootstrap node, Kubernetes control plane components
should start and control plane endpoint should become available.
Remaining control plane nodes join etcd cluster once control plane endpoint is up.
Single Control Plane Node Cluster
This guide applies to the single control plane clusters as well.
In fact, it is much more important to take regular snapshots of the etcd database in single control plane node
case, as loss of the control plane node might render the whole cluster irrecoverable without a backup.
4.7 - Extension Services
Use extension services in Talos Linux.
Talos provides a way to run additional system services early in the Talos boot process.
Extension services should be included into the Talos root filesystem (e.g. using system extensions).
Extension services run as privileged containers with ephemeral root filesystem located in the Talos root filesystem.
Extension services can be used to use extend core features of Talos in a way that is not possible via static pods or
Kubernetes DaemonSets.
Potential extension services use-cases:
storage: Open iSCSI, software RAID, etc.
networking: BGP FRR, etc.
platform integration: VMWare open VM tools, etc.
Configuration
Talos on boot scans directory /usr/local/etc/containers for *.yaml files describing the extension services to run.
Format of the extension service config:
Field name sets the service name, valid names are [a-z0-9-_]+.
The service container root filesystem path is derived from the name: /usr/local/lib/containers/<name>.
The extension service will be registered as a Talos service under an ext-<name> identifier.
container
entrypoint defines the container entrypoint relative to the container root filesystem (/usr/local/lib/containers/<name>)
args defines the additional arguments to pass to the entrypoint
mounts defines the volumes to be mounted into the container root
All requested directories will be mounted into the extension service container mount namespace.
If the source directory doesn’t exist in the host filesystem, it will be created (only for writable paths in the Talos root filesystem).
Talos starts the container for the extension service with container root filesystem at /usr/local/lib/containers/hello-world:
/
├── hello
└── config.ini
Extension service is registered as ext-hello-world in talosctl services:
$ talosctl service ext-hello-world
NODE 172.20.0.5
ID ext-hello-world
STATE Running
HEALTH ?
EVENTS [Running]: Started task ext-hello-world (PID 1100)for container ext-hello-world (2m47s ago)[Preparing]: Creating service runner (2m47s ago)[Preparing]: Running pre state (2m47s ago)[Waiting]: Waiting for service "containerd" to be "up"(2m48s ago)[Waiting]: Waiting for service "containerd" to be "up", network (2m49s ago)
An extension service can be started, restarted and stopped using talosctl service ext-hello-world start|restart|stop.
Use talosctl logs ext-hello-world to get the logs of the service.
Complete example of the extension service can be found in the extensions repository.
4.8 - Migrating from Kubeadm
Migrating Kubeadm-based clusters to Talos.
It is possible to migrate Talos from a cluster that is created using
kubeadm to Talos.
High-level steps are the following:
Collect CA certificates and a bootstrap token from a control plane node.
Create a Talos machine config with the CA certificates with the ones you collected.
Update control plane endpoint in the machine config to point to the existing control plane (i.e. your load balancer address).
Boot a new Talos machine and apply the machine config.
Verify that the new control plane node is ready.
Remove one of the old control plane nodes.
Repeat the same steps for all control plane nodes.
Verify that all control plane nodes are ready.
Repeat the same steps for all worker nodes, using the machine config generated for the workers.
Remarks on kube-apiserver load balancer
While migrating to Talos, you need to make sure that your kube-apiserver load balancer is in place
and keeps pointing to the correct set of control plane nodes.
This process depends on your load balancer setup.
If you are using an LB that is external to the control plane nodes (e.g. cloud provider LB, F5 BIG-IP, etc.),
you need to make sure that you update the backend IPs of the load balancer to point to the control plane nodes as
you add Talos nodes and remove kubeadm-based ones.
If your load balancing is done on the control plane nodes (e.g. keepalived + haproxy on the control plane nodes),
you can do the following:
Add Talos nodes and remove kubeadm-based ones while updating the haproxy backends
to point to the newly added nodes except the last kubeadm-based control plane node.
Turn off keepalived to drop the virtual IP used by the kubeadm-based nodes (introduces kube-apiserver downtime).
Set up a virtual-IP based new load balancer on the new set of Talos control plane nodes.
Use the previous LB IP as the LB virtual IP.
Verify apiserver connectivity over the Talos-managed virtual IP.
Migrate the last control-plane node.
Prerequisites
Admin access to the kubeadm-based cluster
Access to the /etc/kubernetes/pki directory (e.g. SSH & root permissions)
on the control plane nodes of the kubeadm-based cluster
Access to kube-apiserver load-balancer configuration
Step-by-step guide
Download /etc/kubernetes/pki directory from a control plane node of the kubeadm-based cluster.
Create a new join token for the new control plane nodes:
# inside a control plane nodekubeadm token create
Create Talos secrets from the PKI directory you downloaded on step 1 and the token you generated on step 2:
talosctl gen secrets --kubernetes-bootstrap-token <TOKEN> --from-kubernetes-pki <PKI_DIR>
Create a new Talos config from the secrets:
talosctl gen config --with-secrets secrets.yaml <CLUSTER_NAME> https://<EXISTING_CLUSTER_LB_IP>
Collect the information about the kubeadm-based cluster from the kubeadm configmap:
kubectl get configmap -n kube-system kubeadm-config -oyaml
Take note of the following information in the ClusterConfiguration:
.controlPlaneEndpoint
.networking.dnsDomain
.networking.podSubnet
.networking.serviceSubnet
Replace the following information in the generated controlplane.yaml:
.cluster.network.cni.name with none
.cluster.network.podSubnets[0] with the value of the networking.podSubnet from the previous step
.cluster.network.serviceSubnets[0] with the value of the networking.serviceSubnet from the previous step
.cluster.network.dnsDomain with the value of the networking.dnsDomain from the previous step
Go through the rest of controlplane.yaml and worker.yaml to customize them according to your needs.
Bring up a Talos node to be the initial Talos control plane node.
Apply the generated controlplane.yaml to the Talos control plane node:
Clone the Linux kernel and check out the revision that pkgs uses (this can be found in kernel/kernel-prepare/pkg.yaml and it will be something like the following: https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-x.xx.x.tar.xz)
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git &&cd linux
git checkout v5.15
Your module will need to be converted to be in-tree.
The steps for this are different depending on the complexity of the module to port, but generally it would involve moving the module source code into the drivers tree and creating a new Makefile and Kconfig.
Stage your changes in Git with git add -A.
Run git diff --cached --no-prefix > foobar.patch to generate a patch from your changes.
Copy this patch to kernel/kernel/patches in the pkgs repo.
Add a patch line in the prepare segment of kernel/kernel/pkg.yaml:
patch -p0 < /pkg/patches/foobar.patch
Build the kernel image.
Make sure you are logged in to ghcr.io before running this command, and you can change or omit PLATFORM depending on what you want to target.
make kernel PLATFORM=linux/amd64 USERNAME=your-username PUSH=true
Make a note of the image name the make command outputs.
Building the installer image
Copy the following into a new Dockerfile:
FROM scratch AS customizationCOPY --from=ghcr.io/your-username/kernel:<kernel version> /lib/modules /lib/modules
FROM ghcr.io/siderolabs/installer:<talos version>COPY --from=ghcr.io/your-username/kernel:<kernel version> /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
Using Talos Linux to set up static pods in Kubernetes.
Static Pods
Static pods are run directly by the kubelet bypassing the Kubernetes API server checks and validations.
Most of the time DaemonSet is a better alternative to static pods, but some workloads need to run
before the Kubernetes API server is available or might need to bypass security restrictions imposed by the API server.
Talos renders static pod definitions to the kubelet manifest directory (/etc/kubernetes/manifests), kubelet picks up the definition and launches the pod.
Talos accepts changes to the static pod configuration without a reboot.
Usage
Kubelet mirrors pod definition to the API server state, so static pods can be inspected with kubectl get pods, logs can be retrieved with kubectl logs, etc.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-talos-default-controlplane-2 1/1 Running 0 17s
If the API server is not available, status of the static pod can also be inspected with talosctl containers --kubernetes:
Logs of static pods can be retrieved with talosctl logs --kubernetes:
$ talosctl logs --kubernetes default/nginx-talos-default-controlplane-2:nginx
172.20.0.3: 2022-02-10T15:26:01.289208227Z stderr F 2022/02/10 15:26:01 [notice] 1#1: using the "epoll" event method
172.20.0.3: 2022-02-10T15:26:01.2892466Z stderr F 2022/02/10 15:26:01 [notice] 1#1: nginx/1.21.6
172.20.0.3: 2022-02-10T15:26:01.28925723Z stderr F 2022/02/10 15:26:01 [notice] 1#1: built by gcc 10.2.1 20210110(Debian 10.2.1-6)
Troubleshooting
Talos doesn’t perform any validation on the static pod definitions.
If the pod isn’t running, use kubelet logs (talosctl logs kubelet) to find the problem:
$ talosctl logs kubelet
172.20.0.2: {"ts":1644505520281.427,"caller":"config/file.go:187","msg":"Could not process manifest file","path":"/etc/kubernetes/manifests/talos-default-nginx-gvisor.yaml","err":"invalid pod: [spec.containers: Required value]"}
Resource Definitions
Static pod definitions are available as StaticPod resources combined with Talos-generated control plane static pods:
$ talosctl get staticpods
NODE NAMESPACE TYPE ID VERSION
172.20.0.3 k8s StaticPod default-nginx 1172.20.0.3 k8s StaticPod kube-apiserver 1172.20.0.3 k8s StaticPod kube-controller-manager 1172.20.0.3 k8s StaticPod kube-scheduler 1
Talos assigns ID <namespace>-<name> to the static pods specified in the machine configuration.
On control plane nodes status of the running static pods is available in the StaticPodStatus resource:
$ talosctl get staticpodstatus
NODE NAMESPACE TYPE ID VERSION READY
172.20.0.3 k8s StaticPodStatus default/nginx-talos-default-controlplane-2 2 True
172.20.0.3 k8s StaticPodStatus kube-system/kube-apiserver-talos-default-controlplane-2 2 True
172.20.0.3 k8s StaticPodStatus kube-system/kube-controller-manager-talos-default-controlplane-2 3 True
172.20.0.3 k8s StaticPodStatus kube-system/kube-scheduler-talos-default-controlplane-2 3 True
4.11 - Talos API access from Kubernetes
How to access Talos API from within Kubernetes.
In this guide, we will enable the Talos feature to access the Talos API from within Kubernetes.
Enabling the Feature
Edit the machine configuration to enable the feature, specifying the Kubernetes namespaces from which Talos API
can be accessed and the allowed Talos API roles.
talosctl -n 172.20.0.2 edit machineconfig
Configure the kubernetesTalosAPIAccess like the following:
This means that the pod can talk to Talos API of node 172.20.0.2 successfully.
4.12 - Troubleshooting Control Plane
Troubleshoot control plane failures for running cluster and bootstrap process.
In this guide we assume that Talos client config is available and Talos API access is available.
Kubernetes client configuration can be pulled from control plane nodes with talosctl -n <IP> kubeconfig
(this command works before Kubernetes is fully booted).
What is the control plane endpoint?
The Kubernetes control plane endpoint is the single canonical URL by which the
Kubernetes API is accessed.
Especially with high-availability (HA) control planes, this endpoint may point to a load balancer or a DNS name which may
have multiple A and AAAA records.
Like Talos’ own API, the Kubernetes API uses mutual TLS, client
certs, and a common Certificate Authority (CA).
Unlike general-purpose websites, there is no need for an upstream CA, so tools
such as cert-manager, Let’s Encrypt, or products such
as validated TLS certificates are not required.
Encryption, however, is, and hence the URL scheme will always be https://.
By default, the Kubernetes API server in Talos runs on port 6443.
As such, the control plane endpoint URLs for Talos will almost always be of the form
https://endpoint:6443.
(The port, since it is not the https default of 443 is required.)
The endpoint above may be a DNS name or IP address, but it should be
directed to the set of all controlplane nodes, as opposed to a
single one.
As mentioned above, this can be achieved by a number of strategies, including:
BGP peering of a shared IP (such as with kube-vip)
Using a DNS name here is a good idea, since it allows any other option, while offering
a layer of abstraction.
It allows the underlying IP addresses to change without impacting the
canonical URL.
Unlike most services in Kubernetes, the API server runs with host networking,
meaning that it shares the network namespace with the host.
This means you can use the IP address(es) of the host to refer to the Kubernetes
API server.
For availability of the API, it is important that any load balancer be aware of
the health of the backend API servers, to minimize disruptions during
common node operations like reboots and upgrades.
It is critical that the control plane endpoint works correctly during cluster bootstrap phase, as nodes discover
each other using control plane endpoint.
kubelet is not running on control plane node
The kubelet service should be running on control plane nodes as soon as networking is configured:
$ talosctl -n <IP> service kubelet
NODE 172.20.0.2
ID kubelet
STATE Running
HEALTH OK
EVENTS [Running]: Health check successful (2m54s ago)[Running]: Health check failed: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused (3m4s ago)[Running]: Started task kubelet (PID 2334)for container kubelet (3m6s ago)[Preparing]: Creating service runner (3m6s ago)[Preparing]: Running pre state (3m15s ago)[Waiting]: Waiting for service "timed" to be "up"(3m15s ago)[Waiting]: Waiting for service "cri" to be "up", service "timed" to be "up"(3m16s ago)[Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up"(3m18s ago)
If the kubelet is not running, it may be due to invalid configuration.
Check kubelet logs with the talosctl logs command:
By far the most likely cause of etcd not running is because the cluster has
not yet been bootstrapped or because bootstrapping is currently in progress.
The talosctl bootstrap command must be run manually and only once per
cluster, and this step is commonly missed.
Once a node is bootstrapped, it will start etcd and, over the course of a
minute or two (depending on the download speed of the control plane nodes), the
other control plane nodes should discover it and join themselves to the cluster.
Also, etcd will only run on control plane nodes.
If a node is designated as a worker node, you should not expect etcd to be
running on it.
When node boots for the first time, the etcd data directory (/var/lib/etcd) is empty, and it will only be populated when etcd is launched.
If etcd is not running, check service etcd state:
$ talosctl -n <IP> service etcd
NODE 172.20.0.2
ID etcd
STATE Running
HEALTH OK
EVENTS [Running]: Health check successful (3m21s ago)[Running]: Started task etcd (PID 2343)for container etcd (3m26s ago)[Preparing]: Creating service runner (3m26s ago)[Preparing]: Running pre state (3m26s ago)[Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up"(3m26s ago)
If service is stuck in Preparing state for bootstrap node, it might be related to slow network - at this stage
Talos pulls the etcd image from the container registry.
If the etcd service is crashing and restarting, check its logs with talosctl -n <IP> logs etcd.
The most common reasons for crashes are:
wrong arguments passed via extraArgs in the configuration;
booting Talos on non-empty disk with previous Talos installation, /var/lib/etcd contains data from old cluster.
etcd is not running on non-bootstrap control plane node
The etcd service on control plane nodes which were not the target of the cluster bootstrap will wait until the bootstrapped control plane node has completed.
The bootstrap and discovery processes may take a few minutes to complete.
As soon as the bootstrapped node starts its Kubernetes control plane components, kubectl get endpoints will return the IP of bootstrapped control plane node.
At this point, the other control plane nodes will start their etcd services, join the cluster, and then start their own Kubernetes control plane components.
Kubernetes static pod definitions are not generated
Talos should write the static pod definitions for the Kubernetes control plane
in /etc/kubernetes/manifests:
$ talosctl -n <IP> ls /etc/kubernetes/manifests
NODE NAME
172.20.0.2 .
172.20.0.2 talos-kube-apiserver.yaml
172.20.0.2 talos-kube-controller-manager.yaml
172.20.0.2 talos-kube-scheduler.yaml
If the static pod definitions are not rendered, check etcd and kubelet service health (see above)
and the controller runtime logs (talosctl logs controller-runtime).
Talos prints error an error on the server ("") has prevented the request from succeeding
This is expected during initial cluster bootstrap and sometimes after a reboot:
[ 70.093289][talos] task labelNodeAsControlPlane (1/1): starting
[ 80.094038][talos] retrying error: an error on the server ("") has prevented the request from succeeding (get nodes talos-default-controlplane-1)
Initially kube-apiserver component is not running yet, and it takes some time before it becomes fully up
during bootstrap (image should be pulled from the Internet, etc.)
Once the control plane endpoint is up, Talos should continue with its boot
process.
If Talos doesn’t proceed, it may be due to a configuration issue.
In any case, the status of the control plane components on each control plane nodes can be checked with talosctl containers -k:
If kube-apiserver shows as CONTAINER_EXITED, it might have exited due to configuration error.
Logs can be checked with taloctl logs --kubernetes (or with -k as a shorthand):
$ talosctl -n <IP> logs -k kube-system/kube-apiserver-talos-default-controlplane-1:kube-apiserver
172.20.0.2: 2021-03-05T20:46:13.133902064Z stderr F 2021/03/05 20:46:13 Running command:
172.20.0.2: 2021-03-05T20:46:13.133933824Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)172.20.0.2: 2021-03-05T20:46:13.133938524Z stderr F Run from directory:
172.20.0.2: 2021-03-05T20:46:13.13394154Z stderr F Executable path: /usr/local/bin/kube-apiserver
...
Talos prints error nodes "talos-default-controlplane-1" not found
This error means that kube-apiserver is up and the control plane endpoint is healthy, but the kubelet hasn’t received
its client certificate yet, and it wasn’t able to register itself to Kubernetes.
The Kubernetes controller manager (kube-controller-manager)is responsible for monitoring the certificate
signing requests (CSRs) and issuing certificates for each of them.
The kubelet is responsible for generating and submitting the CSRs for its
associated node.
For the kubelet to get its client certificate, then, the Kubernetes control plane
must be healthy:
the API server is running and available at the Kubernetes control plane
endpoint URL
the controller manager is running and a leader has been elected
The states of any CSRs can be checked with kubectl get csr:
$ kubectl get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-jcn9j 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-p6b9q 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-sw6rm 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-vlghg 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
Talos prints error node not ready
A Node in Kubernetes is marked as Ready only once its CNI is up.
It takes a minute or two for the CNI images to be pulled and for the CNI to start.
If the node is stuck in this state for too long, check CNI pods and logs with kubectl.
Usually, CNI-related resources are created in kube-system namespace.
For example, for Talos default Flannel CNI:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
...
kube-flannel-25drx 1/1 Running 0 23m
kube-flannel-8lmb6 1/1 Running 0 23m
kube-flannel-gl7nx 1/1 Running 0 23m
kube-flannel-jknt9 1/1 Running 0 23m
...
Talos prints error x509: certificate signed by unknown authority
The full error might look like:
x509: certificate signed by unknown authority (possiby because of crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes"
Usually, this occurs because the control plane endpoint points to a different
cluster than the client certificate was generated for.
If a node was recycled between clusters, make sure it was properly wiped between
uses.
If a client has multiple client configurations, make sure you are matching the correct talosconfig with the
correct cluster.
etcd is running on bootstrap node, but stuck in pre state on non-bootstrap nodes
Please see question etcd is not running on non-bootstrap control plane node.
Checking kube-controller-manager and kube-scheduler
If the control plane endpoint is up, the status of the pods can be ascertained with kubectl:
$ kubectl get pods -n kube-system -l k8s-app=kube-controller-manager
NAME READY STATUS RESTARTS AGE
kube-controller-manager-talos-default-controlplane-1 1/1 Running 0 28m
kube-controller-manager-talos-default-controlplane-2 1/1 Running 0 28m
kube-controller-manager-talos-default-controlplane-3 1/1 Running 0 28m
If the control plane endpoint is not yet up, the container status of the control plane components can be queried with
talosctl containers --kubernetes:
If some of the containers are not running, it could be that image is still being pulled.
Otherwise the process might crashing.
The logs can be checked with talosctl logs --kubernetes <containerID>:
$ talosctl -n <IP> logs -k kube-system/kube-controller-manager-talos-default-controlplane-1:kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291667526Z stderr F 2021/03/09 13:59:34 Running command:
172.20.0.3: 2021-03-09T13:59:34.291702262Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)172.20.0.3: 2021-03-09T13:59:34.291707121Z stderr F Run from directory:
172.20.0.3: 2021-03-09T13:59:34.291710908Z stderr F Executable path: /usr/local/bin/kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291719163Z stderr F Args (comma-delimited): /usr/local/bin/kube-controller-manager,--allocate-node-cidrs=true,--cloud-provider=,--cluster-cidr=10.244.0.0/16,--service-cluster-ip-range=10.96.0.0/12,--cluster-signing-cert-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--cluster-signing-key-file=/system/secrets/kubernetes/kube-controller-manager/ca.key,--configure-cloud-routes=false,--kubeconfig=/system/secrets/kubernetes/kube-controller-manager/kubeconfig,--leader-elect=true,--root-ca-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--service-account-private-key-file=/system/secrets/kubernetes/kube-controller-manager/service-account.key,--profiling=false172.20.0.3: 2021-03-09T13:59:34.293870359Z stderr F 2021/03/09 13:59:34 Now listening for interrupts
172.20.0.3: 2021-03-09T13:59:34.761113762Z stdout F I0309 13:59:34.760982 10 serving.go:331] Generated self-signed cert in-memory
...
Checking controller runtime logs
Talos runs a set of controllers which operate on resources to build and support the Kubernetes control plane.
Some debugging information can be queried from the controller logs with talosctl logs controller-runtime:
Controllers continuously run a reconcile loop, so at any time, they may be starting, failing, or restarting.
This is expected behavior.
Things to look for:
v1alpha1.BootstrapStatusController: bootkube initialized status not found: control plane is not self-hosted, running with static pods.
k8s.KubeletStaticPodController: writing static pod "/etc/kubernetes/manifests/talos-kube-apiserver.yaml": static pod definitions were rendered successfully.
k8s.ManifestApplyController: controller failed: error creating mapping for object /v1/Secret/bootstrap-token-q9pyzr: an error on the server ("") has prevented the request from succeeding: control plane endpoint is not up yet, bootstrap manifests can’t be injected, controller is going to retry.
k8s.KubeletStaticPodController: controller failed: error refreshing pod status: error fetching pod status: an error on the server ("Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)") has prevented the request from succeeding: kubelet hasn’t been able to contact kube-apiserver yet to push pod status, controller
is going to retry.
k8s.ManifestApplyController: created rbac.authorization.k8s.io/v1/ClusterRole/psp:privileged: one of the bootstrap manifests got successfully applied.
secrets.KubernetesController: controller failed: missing cluster.aggregatorCA secret: Talos is running with 0.8 configuration, if the cluster was upgraded from 0.8, this is expected, and conversion process will fix machine config
automatically.
If this cluster was bootstrapped with version 0.9, machine configuration should be regenerated with 0.9 talosctl.
If there are no new messages in the controller-runtime log, it means that the controllers have successfully finished reconciling, and that the current system state is the desired system state.
Checking static pod definitions
Talos generates static pod definitions for the kube-apiserver, kube-controller-manager, and kube-scheduler
components based on its machine configuration.
These definitions can be checked as resources with talosctl get staticpods:
The status of the static pods can queried with talosctl get staticpodstatus:
$ talosctl -n <IP> get staticpodstatus
NODE NAMESPACE TYPE ID VERSION READY
172.20.0.2 controlplane StaticPodStatus kube-system/kube-apiserver-talos-default-controlplane-1 1 True
172.20.0.2 controlplane StaticPodStatus kube-system/kube-controller-manager-talos-default-controlplane-1 1 True
172.20.0.2 controlplane StaticPodStatus kube-system/kube-scheduler-talos-default-controlplane-1 1 True
The most important status field is READY, which is the last column printed.
The complete status can be fetched by adding -o yaml flag.
Checking bootstrap manifests
As part of the bootstrap process, Talos injects bootstrap manifests into Kubernetes API server.
There are two kinds of these manifests: system manifests built-in into Talos and extra manifests downloaded (custom CNI, extra manifests in the machine config):
Worker node is stuck with apid health check failures
Control plane nodes have enough secret material to generate apid server certificates, but worker nodes
depend on control plane trustd services to generate certificates.
Worker nodes wait for their kubelet to join the cluster.
Then the Talos apid queries the Kubernetes endpoints via control plane
endpoint to find trustd endpoints.
They then use trustd to request and receive their certificate.
So if apid health checks are failing on worker node: