1 - Installation

How to install Talos Linux on various platforms

1.1 - Bare Metal Platforms

Installation of Talos Linux on various bare-metal platforms.

1.1.1 - Digital Rebar

In this guide we will create an Kubernetes cluster with 1 worker node, and 2 controlplane nodes using an existing digital rebar deployment.

Prerequisites

Creating a Cluster

In this guide we will create an Kubernetes cluster with 1 worker node, and 2 controlplane nodes. We assume an existing digital rebar deployment, and some familiarity with iPXE.

We leave it up to the user to decide if they would like to use static networking, or DHCP. The setup and configuration of DHCP will not be covered.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the load balancer, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-metal-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig

The loadbalancer is used to distribute the load across multiple controlplane nodes. This isn’t covered in detail, because we assume some loadbalancing knowledge before hand. If you think this should be added to the docs, please create a issue.

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode metal
controlplane.yaml is valid for metal mode
$ talosctl validate --config worker.yaml --mode metal
worker.yaml is valid for metal mode

Publishing the Machine Configuration Files

Digital Rebar has a built-in fileserver, which means we can use this feature to expose the talos configuration files. We will place controlplane.yaml, and worker.yaml into Digital Rebar file server by using the drpcli tools.

Copy the generated files from the step above into your Digital Rebar installation.

drpcli file upload <file>.yaml as <file>.yaml

Replacing <file> with controlplane or worker.

Download the boot files

Download a recent version of boot.tar.gz from github.

Upload to DRB:

$ drpcli isos upload boot.tar.gz as talos.tar.gz
{
  "Path": "talos.tar.gz",
  "Size": 96470072
}

We have some Digital Rebar example files in the Git repo you can use to provision Digital Rebar with drpcli.

To apply these configs you need to create them, and then apply them as follow:

$ drpcli bootenvs create talos
{
  "Available": true,
  "BootParams": "",
  "Bundle": "",
  "Description": "",
  "Documentation": "",
  "Endpoint": "",
  "Errors": [],
  "Initrds": [],
  "Kernel": "",
  "Meta": {},
  "Name": "talos",
  "OS": {
    "Codename": "",
    "Family": "",
    "IsoFile": "",
    "IsoSha256": "",
    "IsoUrl": "",
    "Name": "",
    "SupportedArchitectures": {},
    "Version": ""
  },
  "OnlyUnknown": false,
  "OptionalParams": [],
  "ReadOnly": false,
  "RequiredParams": [],
  "Templates": [],
  "Validated": true
}
drpcli bootenvs update talos - < bootenv.yaml

You need to do this for all files in the example directory. If you don’t have access to the drpcli tools you can also use the webinterface.

It’s important to have a corresponding SHA256 hash matching the boot.tar.gz

Bootenv BootParams

We’re using some of Digital Rebar built in templating to make sure the machine gets the correct role assigned.

talos.platform=metal talos.config={{ .ProvisionerURL }}/files/{{.Param \"talos/role\"}}.yaml"

This is why we also include a params.yaml in the example directory to make sure the role is set to one of the following:

  • controlplane
  • worker

The {{.Param \"talos/role\"}} then gets populated with one of the above roles.

Boot the Machines

In the UI of Digital Rebar you need to select the machines you want to provision. Once selected, you need to assign to following:

  • Profile
  • Workflow

This will provision the Stage and Bootenv with the talos values. Once this is done, you can boot the machine.

To understand the boot process, we have a higher level overview located at metal overview.

Bootstrap Etcd

To configure talosctl we will need the first control plane node’s IP:

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.1.2 - Equinix Metal

Creating Talos cluster using Equinix Metal.

Prerequisites

This guide assumes the user has a working API token, the Equinix Metal CLI installed, and some familiarity with the CLI.

Network Booting

To install Talos to a server a working TFTP and iPXE server are needed. How this is done varies and is left as an exercise for the user. In general this requires a Talos kernel vmlinuz and initramfs. These assets can be downloaded from a given release.

Special Considerations

PXE Boot Kernel Parameters

The following is a list of kernel parameters required by Talos:

  • talos.platform: set this to equinixMetal
  • init_on_alloc=1: required by KSPP
  • slab_nomerge: required by KSPP
  • pti=on: required by KSPP

User Data

To configure a Talos you can use the metadata service provide by Equinix Metal. It is required to add a shebang to the top of the configuration file. The shebang is arbitrary in the case of Talos, and the convention we use is #!talos.

Creating a Cluster via the Equinix Metal CLI

Control Plane Endpoint

The strategy used for an HA cluster varies and is left as an exercise for the user. Some of the known ways are:

  • DNS
  • Load Balancer
  • BGP

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-aws-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig

Now add the required shebang (e.g. #!talos) at the top of controlplane.yaml, and worker.yaml At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

talosctl validate --config controlplane.yaml --mode metal
talosctl validate --config worker.yaml --mode metal

Note: Validation of the install disk could potentially fail as the validation is performed on you local machine and the specified disk may not exist.

Create the Control Plane Nodes

metal device create \
  --project-id $PROJECT_ID \
  --facility $FACILITY \
  --ipxe-script-url $PXE_SERVER \
  --operating-system "custom_ipxe" \
  --plan $PLAN\
  --hostname $HOSTNAME\
  --userdata-file controlplane.yaml

Note: The above should be invoked at least twice in order for etcd to form quorum.

Create the Worker Nodes

metal device create \
  --project-id $PROJECT_ID \
  --facility $FACILITY \
  --ipxe-script-url $PXE_SERVER \
  --operating-system "custom_ipxe" \
  --plan $PLAN\
  --hostname $HOSTNAME\
  --userdata-file worker.yaml

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.1.3 - Matchbox

In this guide we will create an HA Kubernetes cluster with 3 worker nodes using an existing load balancer and matchbox deployment.

Creating a Cluster

In this guide we will create an HA Kubernetes cluster with 3 worker nodes. We assume an existing load balancer, matchbox deployment, and some familiarity with iPXE.

We leave it up to the user to decide if they would like to use static networking, or DHCP. The setup and configuration of DHCP will not be covered.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the load balancer, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-metal-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode metal
controlplane.yaml is valid for metal mode
$ talosctl validate --config worker.yaml --mode metal
worker.yaml is valid for metal mode

Publishing the Machine Configuration Files

In bare-metal setups it is up to the user to provide the configuration files over HTTP(S). A special kernel parameter (talos.config) must be used to inform Talos about where it should retreive its’ configuration file. To keep things simple we will place controlplane.yaml, and worker.yaml into Matchbox’s assets directory. This directory is automatically served by Matchbox.

Create the Matchbox Configuration Files

The profiles we will create will reference vmlinuz, and initramfs.xz. Download these files from the release of your choice, and place them in /var/lib/matchbox/assets.

Profiles

Control Plane Nodes
{
  "id": "control-plane",
  "name": "control-plane",
  "boot": {
    "kernel": "/assets/vmlinuz",
    "initrd": ["/assets/initramfs.xz"],
    "args": [
      "initrd=initramfs.xz",
      "init_on_alloc=1",
      "slab_nomerge",
      "pti=on",
      "console=tty0",
      "console=ttyS0",
      "printk.devkmsg=on",
      "talos.platform=metal",
      "talos.config=http://matchbox.talos.dev/assets/controlplane.yaml"
    ]
  }
}

Note: Be sure to change http://matchbox.talos.dev to the endpoint of your matchbox server.

Worker Nodes
{
  "id": "default",
  "name": "default",
  "boot": {
    "kernel": "/assets/vmlinuz",
    "initrd": ["/assets/initramfs.xz"],
    "args": [
      "initrd=initramfs.xz",
      "init_on_alloc=1",
      "slab_nomerge",
      "pti=on",
      "console=tty0",
      "console=ttyS0",
      "printk.devkmsg=on",
      "talos.platform=metal",
      "talos.config=http://matchbox.talos.dev/assets/worker.yaml"
    ]
  }
}

Groups

Now, create the following groups, and ensure that the selectors are accurate for your specific setup.

{
  "id": "control-plane-1",
  "name": "control-plane-1",
  "profile": "control-plane",
  "selector": {
    ...
  }
}
{
  "id": "control-plane-2",
  "name": "control-plane-2",
  "profile": "control-plane",
  "selector": {
    ...
  }
}
{
  "id": "control-plane-3",
  "name": "control-plane-3",
  "profile": "control-plane",
  "selector": {
    ...
  }
}
{
  "id": "default",
  "name": "default",
  "profile": "default"
}

Boot the Machines

Now that we have our configuraton files in place, boot all the machines. Talos will come up on each machine, grab its’ configuration file, and bootstrap itself.

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.1.4 - Sidero

Sidero is a project created by the Talos team that has native support for Talos.

Sidero Metal is a project created by the Talos team that provides a bare metal installer for Cluster API, and that has native support for Talos Linux. It can be easily installed using clusterctl. The best way to get started with Sidero Metal is to visit the website.

1.2 - Virtualized Platforms

Installation of Talos Linux for virtualization platforms.

1.2.1 - Hyper-V

Creating a Talos Kubernetes cluster using Hyper-V.

Pre-requisities

  1. Download the latest talos-amd64.iso ISO from github releases page
  2. Create a New-TalosVM folder in any of your PS Module Path folders $env:PSModulePath -split ';' and save the New-TalosVM.psm1 there

Plan Overview

Here we will create a basic 3 node cluster with a single control-plane node and two worker nodes. The only difference between control plane and worker node is the amount of RAM and an additional storage VHD. This is personal preference and can be configured to your liking.

We are using a VMNamePrefix argument for a VM Name prefix and not the full hostname. This command will find any existing VM with that prefix and “+1” the highest suffix it finds. For example, if VMs talos-cp01 and talos-cp02 exist, this will create VMs starting from talos-cp03, depending on NumberOfVMs argument.

Setup a Control Plane Node

Use the following command to create a single control plane node:

New-TalosVM -VMNamePrefix talos-cp -CPUCount 2 -StartupMemory 4GB -SwitchName LAB -TalosISOPath C:\ISO\talos-amd64.iso -NumberOfVMs 1 -VMDestinationBasePath 'D:\Virtual Machines\Test VMs\Talos'

This will create talos-cp01 VM and power it on.

Setup Worker Nodes

Use the following command to create 2 worker nodes:

New-TalosVM -VMNamePrefix talos-worker -CPUCount 4 -StartupMemory 8GB -SwitchName LAB -TalosISOPath C:\ISO\talos-amd64.iso -NumberOfVMs 2 -VMDestinationBasePath 'D:\Virtual Machines\Test VMs\Talos' -StorageVHDSize 50GB

This will create two VMs: talos-worker01 and talos-wworker02 and attach an additional VHD of 50GB for storage (which in my case will be passed to Mayastor).

Pushing Config to the Nodes

Now that our VMs are ready, find their IP addresses from console of VM. With that information, push config to the control plane node with:

# set control plane IP variable
$CONTROL_PLANE_IP='10.10.10.x'

# Generate talos config
talosctl gen config talos-cluster https://$($CONTROL_PLANE_IP):6443 --output-dir .

# Apply config to control plane node
talosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file .\controlplane.yaml

Pushing Config to Worker Nodes

Similarly, for the workers:

talosctl apply-config --insecure --nodes 10.10.10.x --file .\worker.yaml

Apply the config to both nodes.

Bootstrap Cluster

Now that our nodes are ready, we are ready to bootstrap the Kubernetes cluster.

# Use following command to set node and endpoint permanantly in config so you dont have to type it everytime
talosctl config endpoint $CONTROL_PLANE_IP
talosctl config node $CONTROL_PLANE_IP

# Bootstrap cluster
talosctl bootstrap

# Generate kubeconfig
talosctl kubeconfig .

This will generate the kubeconfig file, you can use to connect to the cluster.

1.2.2 - KVM

Talos is known to work on KVM. We don’t yet have a documented guide specific to KVM; however, you can follow the General Getting Started Guide. If you run into any issues, our community can probably help!

1.2.3 - Proxmox

Creating Talos Kubernetes cluster using Proxmox.

In this guide we will create a Kubernetes cluster using Proxmox.

Video Walkthrough

To see a live demo of this writeup, visit Youtube here:

Installation

How to Get Proxmox

It is assumed that you have already installed Proxmox onto the server you wish to create Talos VMs on. Visit the Proxmox downloads page if necessary.

Install talosctl

You can download talosctl via github.com/siderolabs/talos/releases

curl https://github.com/siderolabs/talos/releases/download/<version>/talosctl-<platform>-<arch> -L -o talosctl

For example version v1.1.1 for linux platform:

curl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-linux-amd64 -L -o talosctl
sudo cp talosctl /usr/local/bin
sudo chmod +x /usr/local/bin/talosctl

Download ISO Image

In order to install Talos in Proxmox, you will need the ISO image from the Talos release page. You can download talos-amd64.iso via github.com/siderolabs/talos/releases

mkdir -p _out/
curl https://github.com/siderolabs/talos/releases/download/<version>/talos-<arch>.iso -L -o _out/talos-<arch>.iso

For example version v1.1.1 for linux platform:

mkdir -p _out/
curl https://github.com/siderolabs/talos/releases/download/v1.1.1/talos-amd64.iso -L -o _out/talos-amd64.iso

Upload ISO

From the Proxmox UI, select the “local” storage and enter the “Content” section. Click the “Upload” button:

Select the ISO you downloaded previously, then hit “Upload”

Create VMs

Start by creating a new VM by clicking the “Create VM” button in the Proxmox UI:

Fill out a name for the new VM:

In the OS tab, select the ISO we uploaded earlier:

Keep the defaults set in the “System” tab.

Keep the defaults in the “Hard Disk” tab as well, only changing the size if desired.

In the “CPU” section, give at least 2 cores to the VM:

Note: As of Talos v1.0 (which requires the x86-64-v2 microarchitecture), booting with the default Processor Type kvm64 will not work. You can enable the required CPU features after creating the VM by adding the following line in the corresponding /etc/pve/qemu-server/<vmid>.conf file:

args: -cpu kvm64,+cx16,+lahf_lm,+popcnt,+sse3,+ssse3,+sse4.1,+sse4.2

Alternatively, you can set the Processor Type to host if your Proxmox host supports these CPU features, this however prevents using live VM migration.

Verify that the RAM is set to at least 2GB:

Keep the default values for networking, verifying that the VM is set to come up on the bridge interface:

Finish creating the VM by clicking through the “Confirm” tab and then “Finish”.

Repeat this process for a second VM to use as a worker node. You can also repeat this for additional nodes desired.

Start Control Plane Node

Once the VMs have been created and updated, start the VM that will be the first control plane node. This VM will boot the ISO image specified earlier and enter “maintenance mode”.

With DHCP server

Once the machine has entered maintenance mode, there will be a console log that details the IP address that the node received. Take note of this IP address, which will be referred to as $CONTROL_PLANE_IP for the rest of this guide. If you wish to export this IP as a bash variable, simply issue a command like export CONTROL_PLANE_IP=1.2.3.4.

Without DHCP server

To apply the machine configurations in maintenance mode, VM has to have IP on the network. So you can set it on boot time manually.

Press e on the boot time. And set the IP parameters for the VM. Format is:

ip=<client-ip>:<srv-ip>:<gw-ip>:<netmask>:<host>:<device>:<autoconf>

For example $CONTROL_PLANE_IP will be 192.168.0.100 and gateway 192.168.0.1

linux /boot/vmlinuz init_on_alloc=1 slab_nomerge pti=on panic=0 consoleblank=0 printk.devkmsg=on earlyprintk=ttyS0 console=tty0 console=ttyS0 talos.platform=metal ip=192.168.0.100::192.168.0.1:255.255.255.0::eth0:off

Then press Ctrl-x or F10

Generate Machine Configurations

With the IP address above, you can now generate the machine configurations to use for installing Talos and Kubernetes. Issue the following command, updating the output directory, cluster name, and control plane IP as you see fit:

talosctl gen config talos-vbox-cluster https://$CONTROL_PLANE_IP:6443 --output-dir _out

This will create several files in the _out directory: controlplane.yaml, worker.yaml, and talosconfig.

Create Control Plane Node

Using the controlplane.yaml generated above, you can now apply this config using talosctl. Issue:

talosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file _out/controlplane.yaml

You should now see some action in the Proxmox console for this VM. Talos will be installed to disk, the VM will reboot, and then Talos will configure the Kubernetes control plane on this VM.

Note: This process can be repeated multiple times to create an HA control plane.

Create Worker Node

Create at least a single worker node using a process similar to the control plane creation above. Start the worker node VM and wait for it to enter “maintenance mode”. Take note of the worker node’s IP address, which will be referred to as $WORKER_IP

Issue:

talosctl apply-config --insecure --nodes $WORKER_IP --file _out/worker.yaml

Note: This process can be repeated multiple times to add additional workers.

Using the Cluster

Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster. For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace. To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.

First, configure talosctl to talk to your control plane node by issuing the following, updating paths and IPs as necessary:

export TALOSCONFIG="_out/talosconfig"
talosctl config endpoint $CONTROL_PLANE_IP
talosctl config node $CONTROL_PLANE_IP

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig _out/talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig _out/talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig _out/talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig _out/talosconfig kubeconfig .

Cleaning Up

To cleanup, simply stop and delete the virtual machines from the Proxmox UI.

1.2.4 - VMware

Creating Talos Kubernetes cluster using VMware.

Creating a Cluster via the govc CLI

In this guide we will create an HA Kubernetes cluster with 2 worker nodes. We will use the govc cli which can be downloaded here.

Prereqs/Assumptions

This guide will use the virtual IP (“VIP”) functionality that is built into Talos in order to provide a stable, known IP for the Kubernetes control plane. This simply means the user should pick an IP on their “VM Network” to designate for this purpose and keep it handy for future steps.

Create the Machine Configuration Files

Generating Base Configurations

Using the VIP chosen in the prereq steps, we will now generate the base configuration files for the Talos machines. This can be done with the talosctl gen config ... command. Take note that we will also use a JSON6902 patch when creating the configs so that the control plane nodes get some special information about the VIP we chose earlier, as well as a daemonset to install vmware tools on talos nodes.

First, download cp.patch.yaml to your local machine and edit the VIP to match your chosen IP. You can do this by issuing: curl -fsSLO https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.1/talos-guides/install/virtualized-platforms/vmware/cp.patch.yaml. It’s contents should look like the following:

- op: add
  path: /machine/network
  value:
    interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: <VIP>
- op: replace
  path: /cluster/extraManifests
  value:
    - "https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml"

With the patch in hand, generate machine configs with:

$ talosctl gen config vmware-test https://<VIP>:<port> --config-patch-control-plane @cp.patch.yaml
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking if needed. Optionally, you can specify additional patches by adding to the cp.patch.yaml file downloaded earlier, or create your own patch files.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Set Environment Variables

govc makes use of the following environment variables

export GOVC_URL=<vCenter url>
export GOVC_USERNAME=<vCenter username>
export GOVC_PASSWORD=<vCenter password>

Note: If your vCenter installation makes use of self signed certificates, you’ll want to export GOVC_INSECURE=true.

There are some additional variables that you may need to set:

export GOVC_DATACENTER=<vCenter datacenter>
export GOVC_RESOURCE_POOL=<vCenter resource pool>
export GOVC_DATASTORE=<vCenter datastore>
export GOVC_NETWORK=<vCenter network>

Choose Install Approach

As part of this guide, we have a more automated install script that handles some of the complexity of importing OVAs and creating VMs. If you wish to use this script, we will detail that next. If you wish to carry out the manual approach, simply skip ahead to the “Manual Approach” section.

Scripted Install

Download the vmware.sh script to your local machine. You can do this by issuing curl -fsSLO "https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.1/talos-guides/install/virtualized-platforms/vmware/vmware.sh". This script has default variables for things like Talos version and cluster name that may be interesting to tweak before deploying.

Import OVA

To create a content library and import the Talos OVA corresponding to the mentioned Talos version, simply issue:

./vmware.sh upload_ova

Create Cluster

With the OVA uploaded to the content library, you can create a 5 node (by default) cluster with 3 control plane and 2 worker nodes:

./vmware.sh create

This step will create a VM from the OVA, edit the settings based on the env variables used for VM size/specs, then power on the VMs.

You may now skip past the “Manual Approach” section down to “Bootstrap Cluster”.

Manual Approach

Import the OVA into vCenter

A talos.ova asset is published with each release. We will refer to the version of the release as $TALOS_VERSION below. It can be easily exported with export TALOS_VERSION="v0.3.0-alpha.10" or similar.

curl -LO https://github.com/siderolabs/talos/releases/download/$TALOS_VERSION/talos.ova

Create a content library (if needed) with:

govc library.create <library name>

Import the OVA to the library with:

govc library.import -n talos-${TALOS_VERSION} <library name> /path/to/downloaded/talos.ova

Create the Bootstrap Node

We’ll clone the OVA to create the bootstrap node (our first control plane node).

govc library.deploy <library name>/talos-${TALOS_VERSION} control-plane-1

Talos makes use of the guestinfo facility of VMware to provide the machine/cluster configuration. This can be set using the govc vm.change command. To facilitate persistent storage using the vSphere cloud provider integration with Kubernetes, disk.enableUUID=1 is used.

govc vm.change \
  -e "guestinfo.talos.config=$(cat controlplane.yaml | base64)" \
  -e "disk.enableUUID=1" \
  -vm control-plane-1

Update Hardware Resources for the Bootstrap Node

  • -c is used to configure the number of cpus
  • -m is used to configure the amount of memory (in MB)
govc vm.change \
  -c 2 \
  -m 4096 \
  -vm control-plane-1

The following can be used to adjust the EPHEMERAL disk size.

govc vm.disk.change -vm control-plane-1 -disk.name disk-1000-0 -size 10G
govc vm.power -on control-plane-1

Create the Remaining Control Plane Nodes

govc library.deploy <library name>/talos-${TALOS_VERSION} control-plane-2
govc vm.change \
  -e "guestinfo.talos.config=$(base64 controlplane.yaml)" \
  -e "disk.enableUUID=1" \
  -vm control-plane-2

govc library.deploy <library name>/talos-${TALOS_VERSION} control-plane-3
govc vm.change \
  -e "guestinfo.talos.config=$(base64 controlplane.yaml)" \
  -e "disk.enableUUID=1" \
  -vm control-plane-3
govc vm.change \
  -c 2 \
  -m 4096 \
  -vm control-plane-2

govc vm.change \
  -c 2 \
  -m 4096 \
  -vm control-plane-3
govc vm.disk.change -vm control-plane-2 -disk.name disk-1000-0 -size 10G

govc vm.disk.change -vm control-plane-3 -disk.name disk-1000-0 -size 10G
govc vm.power -on control-plane-2

govc vm.power -on control-plane-3

Update Settings for the Worker Nodes

govc library.deploy <library name>/talos-${TALOS_VERSION} worker-1
govc vm.change \
  -e "guestinfo.talos.config=$(base64 worker.yaml)" \
  -e "disk.enableUUID=1" \
  -vm worker-1

govc library.deploy <library name>/talos-${TALOS_VERSION} worker-2
govc vm.change \
  -e "guestinfo.talos.config=$(base64 worker.yaml)" \
  -e "disk.enableUUID=1" \
  -vm worker-2
govc vm.change \
  -c 4 \
  -m 8192 \
  -vm worker-1

govc vm.change \
  -c 4 \
  -m 8192 \
  -vm worker-2
govc vm.disk.change -vm worker-1 -disk.name disk-1000-0 -size 10G

govc vm.disk.change -vm worker-2 -disk.name disk-1000-0 -size 10G
govc vm.power -on worker-1

govc vm.power -on worker-2

Bootstrap Cluster

In the vSphere UI, open a console to one of the control plane nodes. You should see some output stating that etcd should be bootstrapped. This text should look like:

"etcd is waiting to join the cluster, if this node is the first node in the cluster, please run `talosctl bootstrap` against one of the following IPs:

Take note of the IP mentioned here and issue:

talosctl --talosconfig talosconfig bootstrap -e <control plane IP> -n <control plane IP>

Keep this IP handy for the following steps as well.

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig config endpoint <control plane IP>
talosctl --talosconfig talosconfig config node <control plane IP>
talosctl --talosconfig talosconfig kubeconfig .

Configure talos-vmtoolsd

The talos-vmtoolsd application was deployed as a daemonset as part of the cluster creation; however, we must now provide a talos credentials file for it to use.

Create a new talosconfig with:

talosctl --talosconfig talosconfig -n <control plane IP> config new vmtoolsd-secret.yaml --roles os:admin

Create a secret from the talosconfig:

kubectl -n kube-system create secret generic talos-vmtoolsd-config \
  --from-file=talosconfig=./vmtoolsd-secret.yaml

Clean up the generated file from local system:

rm vmtoolsd-secret.yaml

Once configured, you should now see these daemonset pods go into “Running” state and in vCenter, you will now see IPs and info from the Talos nodes present in the UI.

1.2.5 - Xen

Talos is known to work on Xen. We don’t yet have a documented guide specific to Xen; however, you can follow the General Getting Started Guide. If you run into any issues, our community can probably help!

1.3 - Cloud Platforms

Installation of Talos Linux on many cloud platforms.

1.3.1 - AWS

Creating a cluster via the AWS CLI.

Official AMI Images

Official AMI image ID can be found in the cloud-images.json file attached to the Talos release:

curl -sL https://github.com/siderolabs/talos/releases/download/v1.1.1/cloud-images.json | \
    jq -r '.[] | select(.region == "us-east-1") | select (.arch == "amd64") | .id'

Replace us-east-1 and amd64 in the line above with the desired region and architecture.

Creating a Cluster via the AWS CLI

In this guide we will create an HA Kubernetes cluster with 3 worker nodes. We assume an existing VPC, and some familiarity with AWS. If you need more information on AWS specifics, please see the official AWS documentation.

Create the Subnet

aws ec2 create-subnet \
    --region $REGION \
    --vpc-id $VPC \
    --cidr-block ${CIDR_BLOCK}

Create the AMI

Prepare the Import Prerequisites

Create the S3 Bucket
aws s3api create-bucket \
    --bucket $BUCKET \
    --create-bucket-configuration LocationConstraint=$REGION \
    --acl private
Create the vmimport Role

In order to create an AMI, ensure that the vmimport role exists as described in the official AWS documentation.

Note that the role should be associated with the S3 bucket we created above.

Create the Image Snapshot

First, download the AWS image from a Talos release:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/aws-amd64.tar.gz | tar -xv

Copy the RAW disk to S3 and import it as a snapshot:

aws s3 cp disk.raw s3://$BUCKET/talos-aws-tutorial.raw
aws ec2 import-snapshot \
    --region $REGION \
    --description "Talos kubernetes tutorial" \
    --disk-container "Format=raw,UserBucket={S3Bucket=$BUCKET,S3Key=talos-aws-tutorial.raw}"

Save the SnapshotId, as we will need it once the import is done. To check on the status of the import, run:

aws ec2 describe-import-snapshot-tasks \
    --region $REGION \
    --import-task-ids

Once the SnapshotTaskDetail.Status indicates completed, we can register the image.

Register the Image
aws ec2 register-image \
    --region $REGION \
    --block-device-mappings "DeviceName=/dev/xvda,VirtualName=talos,Ebs={DeleteOnTermination=true,SnapshotId=$SNAPSHOT,VolumeSize=4,VolumeType=gp2}" \
    --root-device-name /dev/xvda \
    --virtualization-type hvm \
    --architecture x86_64 \
    --ena-support \
    --name talos-aws-tutorial-ami

We now have an AMI we can use to create our cluster. Save the AMI ID, as we will need it when we create EC2 instances.

Create a Security Group

aws ec2 create-security-group \
    --region $REGION \
    --group-name talos-aws-tutorial-sg \
    --description "Security Group for EC2 instances to allow ports required by Talos"

Using the security group ID from above, allow all internal traffic within the same security group:

aws ec2 authorize-security-group-ingress \
    --region $REGION \
    --group-name talos-aws-tutorial-sg \
    --protocol all \
    --port 0 \
    --source-group $SECURITY_GROUP

and expose the Talos and Kubernetes APIs:

aws ec2 authorize-security-group-ingress \
    --region $REGION \
    --group-name talos-aws-tutorial-sg \
    --protocol tcp \
    --port 6443 \
    --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress \
    --region $REGION \
    --group-name talos-aws-tutorial-sg \
    --protocol tcp \
    --port 50000-50001 \
    --cidr 0.0.0.0/0

Create a Load Balancer

aws elbv2 create-load-balancer \
    --region $REGION \
    --name talos-aws-tutorial-lb \
    --type network --subnets $SUBNET

Take note of the DNS name and ARN. We will need these soon.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-aws-tutorial https://<load balancer IP or DNS>:<port> --with-examples=false --with-docs=false
created controlplane.yaml
created worker.yaml
created talosconfig

Take note that the generated configs are too long for AWS userdata field if the --with-examples and --with-docs flags are not passed.

At this point, you can modify the generated configs to your liking.

Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Create the EC2 Instances

Note: There is a known issue that prevents Talos from running on T2 instance types. Please use T3 if you need burstable instance types.

Create the Control Plane Nodes

CP_COUNT=1
while [[ "$CP_COUNT" -lt 4 ]]; do
  aws ec2 run-instances \
    --region $REGION \
    --image-id $AMI \
    --count 1 \
    --instance-type t3.small \
    --user-data file://controlplane.yaml \
    --subnet-id $SUBNET \
    --security-group-ids $SECURITY_GROUP \
    --associate-public-ip-address \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=talos-aws-tutorial-cp-$CP_COUNT}]"
  ((CP_COUNT++))
done

Make a note of the resulting PrivateIpAddress from the init and controlplane nodes for later use.

Create the Worker Nodes

aws ec2 run-instances \
    --region $REGION \
    --image-id $AMI \
    --count 3 \
    --instance-type t3.small \
    --user-data file://worker.yaml \
    --subnet-id $SUBNET \
    --security-group-ids $SECURITY_GROUP
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=talos-aws-tutorial-worker}]"

Configure the Load Balancer

aws elbv2 create-target-group \
    --region $REGION \
    --name talos-aws-tutorial-tg \
    --protocol TCP \
    --port 6443 \
    --target-type ip \
    --vpc-id $VPC

Now, using the target group’s ARN, and the PrivateIpAddress from the instances that you created :

aws elbv2 register-targets \
    --region $REGION \
    --target-group-arn $TARGET_GROUP_ARN \
    --targets Id=$CP_NODE_1_IP  Id=$CP_NODE_2_IP  Id=$CP_NODE_3_IP

Using the ARNs of the load balancer and target group from previous steps, create the listener:

aws elbv2 create-listener \
    --region $REGION \
    --load-balancer-arn $LOAD_BALANCER_ARN \
    --protocol TCP \
    --port 443 \
    --default-actions Type=forward,TargetGroupArn=$TARGET_GROUP_ARN

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.3.2 - Azure

Creating a cluster via the CLI on Azure.

Creating a Cluster via the CLI

In this guide we will create an HA Kubernetes cluster with 1 worker node. We assume existing Blob Storage, and some familiarity with Azure. If you need more information on Azure specifics, please see the official Azure documentation.

Environment Setup

We’ll make use of the following environment variables throughout the setup. Edit the variables below with your correct information.

# Storage account to use
export STORAGE_ACCOUNT="StorageAccountName"

# Storage container to upload to
export STORAGE_CONTAINER="StorageContainerName"

# Resource group name
export GROUP="ResourceGroupName"

# Location
export LOCATION="centralus"

# Get storage account connection string based on info above
export CONNECTION=$(az storage account show-connection-string \
                    -n $STORAGE_ACCOUNT \
                    -g $GROUP \
                    -o tsv)

Create the Image

First, download the Azure image from a Talos release. Once downloaded, untar with tar -xvf /path/to/azure-amd64.tar.gz

Upload the VHD

Once you have pulled down the image, you can upload it to blob storage with:

az storage blob upload \
  --connection-string $CONNECTION \
  --container-name $STORAGE_CONTAINER \
  -f /path/to/extracted/talos-azure.vhd \
  -n talos-azure.vhd

Register the Image

Now that the image is present in our blob storage, we’ll register it.

az image create \
  --name talos \
  --source https://$STORAGE_ACCOUNT.blob.core.windows.net/$STORAGE_CONTAINER/talos-azure.vhd \
  --os-type linux \
  -g $GROUP

Network Infrastructure

Virtual Networks and Security Groups

Once the image is prepared, we’ll want to work through setting up the network. Issue the following to create a network security group and add rules to it.

# Create vnet
az network vnet create \
  --resource-group $GROUP \
  --location $LOCATION \
  --name talos-vnet \
  --subnet-name talos-subnet

# Create network security group
az network nsg create -g $GROUP -n talos-sg

# Client -> apid
az network nsg rule create \
  -g $GROUP \
  --nsg-name talos-sg \
  -n apid \
  --priority 1001 \
  --destination-port-ranges 50000 \
  --direction inbound

# Trustd
az network nsg rule create \
  -g $GROUP \
  --nsg-name talos-sg \
  -n trustd \
  --priority 1002 \
  --destination-port-ranges 50001 \
  --direction inbound

# etcd
az network nsg rule create \
  -g $GROUP \
  --nsg-name talos-sg \
  -n etcd \
  --priority 1003 \
  --destination-port-ranges 2379-2380 \
  --direction inbound

# Kubernetes API Server
az network nsg rule create \
  -g $GROUP \
  --nsg-name talos-sg \
  -n kube \
  --priority 1004 \
  --destination-port-ranges 6443 \
  --direction inbound

Load Balancer

We will create a public ip, load balancer, and a health check that we will use for our control plane.

# Create public ip
az network public-ip create \
  --resource-group $GROUP \
  --name talos-public-ip \
  --allocation-method static

# Create lb
az network lb create \
  --resource-group $GROUP \
  --name talos-lb \
  --public-ip-address talos-public-ip \
  --frontend-ip-name talos-fe \
  --backend-pool-name talos-be-pool

# Create health check
az network lb probe create \
  --resource-group $GROUP \
  --lb-name talos-lb \
  --name talos-lb-health \
  --protocol tcp \
  --port 6443

# Create lb rule for 6443
az network lb rule create \
  --resource-group $GROUP \
  --lb-name talos-lb \
  --name talos-6443 \
  --protocol tcp \
  --frontend-ip-name talos-fe \
  --frontend-port 6443 \
  --backend-pool-name talos-be-pool \
  --backend-port 6443 \
  --probe-name talos-lb-health

Network Interfaces

In Azure, we have to pre-create the NICs for our control plane so that they can be associated with our load balancer.

for i in $( seq 0 1 2 ); do
  # Create public IP for each nic
  az network public-ip create \
    --resource-group $GROUP \
    --name talos-controlplane-public-ip-$i \
    --allocation-method static


  # Create nic
  az network nic create \
    --resource-group $GROUP \
    --name talos-controlplane-nic-$i \
    --vnet-name talos-vnet \
    --subnet talos-subnet \
    --network-security-group talos-sg \
    --public-ip-address talos-controlplane-public-ip-$i\
    --lb-name talos-lb \
    --lb-address-pools talos-be-pool
done

# NOTES:
# Talos can detect PublicIPs automatically if PublicIP SKU is Basic.
# Use `--sku Basic` to set SKU to Basic.

Cluster Configuration

With our networking bits setup, we’ll fetch the IP for our load balancer and create our configuration files.

LB_PUBLIC_IP=$(az network public-ip show \
              --resource-group $GROUP \
              --name talos-public-ip \
              --query [ipAddress] \
              --output tsv)

talosctl gen config talos-k8s-azure-tutorial https://${LB_PUBLIC_IP}:6443

Compute Creation

We are now ready to create our azure nodes. Azure allows you to pass Talos machine configuration to the virtual machine at bootstrap time via user-data or custom-data methods.

Talos supports only custom-data method, machine configuration is available to the VM only on the first boot.

# Create availability set
az vm availability-set create \
  --name talos-controlplane-av-set \
  -g $GROUP

# Create the controlplane nodes
for i in $( seq 0 1 2 ); do
  az vm create \
    --name talos-controlplane-$i \
    --image talos \
    --custom-data ./controlplane.yaml \
    -g $GROUP \
    --admin-username talos \
    --generate-ssh-keys \
    --verbose \
    --boot-diagnostics-storage $STORAGE_ACCOUNT \
    --os-disk-size-gb 20 \
    --nics talos-controlplane-nic-$i \
    --availability-set talos-controlplane-av-set \
    --no-wait
done

# Create worker node
  az vm create \
    --name talos-worker-0 \
    --image talos \
    --vnet-name talos-vnet \
    --subnet talos-subnet \
    --custom-data ./worker.yaml \
    -g $GROUP \
    --admin-username talos \
    --generate-ssh-keys \
    --verbose \
    --boot-diagnostics-storage $STORAGE_ACCOUNT \
    --nsg talos-sg \
    --os-disk-size-gb 20 \
    --no-wait

# NOTES:
# `--admin-username` and `--generate-ssh-keys` are required by the az cli,
# but are not actually used by talos
# `--os-disk-size-gb` is the backing disk for Kubernetes and any workload containers
# `--boot-diagnostics-storage` is to enable console output which may be necessary
# for troubleshooting

Bootstrap Etcd

You should now be able to interact with your cluster with talosctl. We will need to discover the public IP for our first control plane node first.

CONTROL_PLANE_0_IP=$(az network public-ip show \
                    --resource-group $GROUP \
                    --name talos-controlplane-public-ip-0 \
                    --query [ipAddress] \
                    --output tsv)

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint $CONTROL_PLANE_0_IP
talosctl --talosconfig talosconfig config node $CONTROL_PLANE_0_IP

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.3.3 - DigitalOcean

Creating a cluster via the CLI on DigitalOcean.

Creating a Cluster via the CLI

In this guide we will create an HA Kubernetes cluster with 1 worker node. We assume an existing Space, and some familiarity with DigitalOcean. If you need more information on DigitalOcean specifics, please see the official DigitalOcean documentation.

Create the Image

First, download the DigitalOcean image from a Talos release. Extract the archive to get the disk.raw file, compress it using gzip to disk.raw.gz.

Using an upload method of your choice (doctl does not have Spaces support), upload the image to a space. Now, create an image using the URL of the uploaded image:

doctl compute image create \
    --region $REGION \
    --image-description talos-digital-ocean-tutorial \
    --image-url https://talos-tutorial.$REGION.digitaloceanspaces.com/disk.raw.gz \
    Talos

Save the image ID. We will need it when creating droplets.

Create a Load Balancer

doctl compute load-balancer create \
    --region $REGION \
    --name talos-digital-ocean-tutorial-lb \
    --tag-name talos-digital-ocean-tutorial-control-plane \
    --health-check protocol:tcp,port:6443,check_interval_seconds:10,response_timeout_seconds:5,healthy_threshold:5,unhealthy_threshold:3 \
    --forwarding-rules entry_protocol:tcp,entry_port:443,target_protocol:tcp,target_port:6443

We will need the IP of the load balancer. Using the ID of the load balancer, run:

doctl compute load-balancer get --format IP <load balancer ID>

Save it, as we will need it in the next step.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-digital-ocean-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Create the Droplets

Create the Control Plane Nodes

Run the following commands, to give ourselves three total control plane nodes:

doctl compute droplet create \
    --region $REGION \
    --image <image ID> \
    --size s-2vcpu-4gb \
    --enable-private-networking \
    --tag-names talos-digital-ocean-tutorial-control-plane \
    --user-data-file controlplane.yaml \
    --ssh-keys <ssh key fingerprint> \
    talos-control-plane-1
doctl compute droplet create \
    --region $REGION \
    --image <image ID> \
    --size s-2vcpu-4gb \
    --enable-private-networking \
    --tag-names talos-digital-ocean-tutorial-control-plane \
    --user-data-file controlplane.yaml \
    --ssh-keys <ssh key fingerprint> \
    talos-control-plane-2
doctl compute droplet create \
    --region $REGION \
    --image <image ID> \
    --size s-2vcpu-4gb \
    --enable-private-networking \
    --tag-names talos-digital-ocean-tutorial-control-plane \
    --user-data-file controlplane.yaml \
    --ssh-keys <ssh key fingerprint> \
    talos-control-plane-3

Note: Although SSH is not used by Talos, DigitalOcean still requires that an SSH key be associated with the droplet. Create a dummy key that can be used to satisfy this requirement.

Create the Worker Nodes

Run the following to create a worker node:

doctl compute droplet create \
    --region $REGION \
    --image <image ID> \
    --size s-2vcpu-4gb \
    --enable-private-networking \
    --user-data-file worker.yaml \
    --ssh-keys <ssh key fingerprint> \
    talos-worker-1

Bootstrap Etcd

To configure talosctl we will need the first control plane node’s IP:

doctl compute droplet get --format PublicIPv4 <droplet ID>

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.3.4 - GCP

Creating a cluster via the CLI on Google Cloud Platform.

Creating a Cluster via the CLI

In this guide, we will create an HA Kubernetes cluster in GCP with 1 worker node. We will assume an existing Cloud Storage bucket, and some familiarity with Google Cloud. If you need more information on Google Cloud specifics, please see the official Google documentation.

jq and talosctl also needs to be installed

Manual Setup

Environment Setup

We’ll make use of the following environment variables throughout the setup. Edit the variables below with your correct information.

# Storage account to use
export STORAGE_BUCKET="StorageBucketName"
# Region
export REGION="us-central1"

Create the Image

First, download the Google Cloud image from a Talos release. These images are called gcp-$ARCH.tar.gz.

Upload the Image

Once you have downloaded the image, you can upload it to your storage bucket with:

gsutil cp /path/to/gcp-amd64.tar.gz gs://$STORAGE_BUCKET

Register the image

Now that the image is present in our bucket, we’ll register it.

gcloud compute images create talos \
 --source-uri=gs://$STORAGE_BUCKET/gcp-amd64.tar.gz \
 --guest-os-features=VIRTIO_SCSI_MULTIQUEUE

Network Infrastructure

Load Balancers and Firewalls

Once the image is prepared, we’ll want to work through setting up the network. Issue the following to create a firewall, load balancer, and their required components.

130.211.0.0/22 and 35.191.0.0/16 are the GCP Load Balancer IP ranges

# Create Instance Group
gcloud compute instance-groups unmanaged create talos-ig \
  --zone $REGION-b

# Create port for IG
gcloud compute instance-groups set-named-ports talos-ig \
    --named-ports tcp6443:6443 \
    --zone $REGION-b

# Create health check
gcloud compute health-checks create tcp talos-health-check --port 6443

# Create backend
gcloud compute backend-services create talos-be \
    --global \
    --protocol TCP \
    --health-checks talos-health-check \
    --timeout 5m \
    --port-name tcp6443

# Add instance group to backend
gcloud compute backend-services add-backend talos-be \
    --global \
    --instance-group talos-ig \
    --instance-group-zone $REGION-b

# Create tcp proxy
gcloud compute target-tcp-proxies create talos-tcp-proxy \
    --backend-service talos-be \
    --proxy-header NONE

# Create LB IP
gcloud compute addresses create talos-lb-ip --global

# Forward 443 from LB IP to tcp proxy
gcloud compute forwarding-rules create talos-fwd-rule \
    --global \
    --ports 443 \
    --address talos-lb-ip \
    --target-tcp-proxy talos-tcp-proxy

# Create firewall rule for health checks
gcloud compute firewall-rules create talos-controlplane-firewall \
     --source-ranges 130.211.0.0/22,35.191.0.0/16 \
     --target-tags talos-controlplane \
     --allow tcp:6443

# Create firewall rule to allow talosctl access
gcloud compute firewall-rules create talos-controlplane-talosctl \
  --source-ranges 0.0.0.0/0 \
  --target-tags talos-controlplane \
  --allow tcp:50000

Cluster Configuration

With our networking bits setup, we’ll fetch the IP for our load balancer and create our configuration files.

LB_PUBLIC_IP=$(gcloud compute forwarding-rules describe talos-fwd-rule \
               --global \
               --format json \
               | jq -r .IPAddress)

talosctl gen config talos-k8s-gcp-tutorial https://${LB_PUBLIC_IP}:443

Additionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Compute Creation

We are now ready to create our GCP nodes.

# Create the control plane nodes.
for i in $( seq 1 3 ); do
  gcloud compute instances create talos-controlplane-$i \
    --image talos \
    --zone $REGION-b \
    --tags talos-controlplane \
    --boot-disk-size 20GB \
    --metadata-from-file=user-data=./controlplane.yaml
    --tags talos-controlplane-$i
done

# Add control plane nodes to instance group
for i in $( seq 1 3 ); do
  gcloud compute instance-groups unmanaged add-instances talos-ig \
      --zone $REGION-b \
      --instances talos-controlplane-$i
done

# Create worker
gcloud compute instances create talos-worker-0 \
  --image talos \
  --zone $REGION-b \
  --boot-disk-size 20GB \
  --metadata-from-file=user-data=./worker.yaml
  --tags talos-worker-$i

Bootstrap Etcd

You should now be able to interact with your cluster with talosctl. We will need to discover the public IP for our first control plane node first.

CONTROL_PLANE_0_IP=$(gcloud compute instances describe talos-controlplane-0 \
                     --zone $REGION-b \
                     --format json \
                     | jq -r '.networkInterfaces[0].accessConfigs[0].natIP')

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint $CONTROL_PLANE_0_IP
talosctl --talosconfig talosconfig config node $CONTROL_PLANE_0_IP

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

Cleanup

# cleanup VM's
gcloud compute instances delete \
  talos-worker-0 \
  talos-controlplane-0 \
  talos-controlplane-1 \
  talos-controlplane-2

# cleanup firewall rules
gcloud compute firewall-rules delete \
  talos-controlplane-talosctl \
  talos-controlplane-firewall

# cleanup forwarding rules
gcloud compute forwarding-rules delete \
  talos-fwd-rule

# cleanup addresses
gcloud compute addresses delete \
  talos-lb-ip

# cleanup proxies
gcloud compute target-tcp-proxies delete \
  talos-tcp-proxy

# cleanup backend services
gcloud compute backend-services delete \
  talos-be

# cleanup health checks
gcloud compute health-checks delete \
  talos-health-check

# cleanup unmanaged instance groups
gcloud compute instance-groups unmanaged delete \
  talos-ig

# cleanup Talos image
gcloud compute images delete \
  talos

Using GCP Deployment manager

Using GCP deployment manager automatically creates a Google Storage bucket and uploads the Talos image to it. Once the deployment is complete the generated talosconfig and kubeconfig files are uploaded to the bucket.

By default this setup creates a three node control plane and a single worker in us-west1-b

First we need to create a folder to store our deployment manifests and perform all subsequent operations from that folder.

mkdir -p talos-gcp-deployment
cd talos-gcp-deployment

Getting the deployment manifests

We need to download two deployment manifests for the deployment from the Talos github repository.

curl -fsSLO "https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.1/talos-guides/install/cloud-platforms/gcp/config.yaml"
curl -fsSLO "https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.1/talos-guides/install/cloud-platforms/gcp/talos-ha.jinja"
# if using ccm
curl -fsSLO "https://raw.githubusercontent.com/siderolabs/talos/master/website/content/v1.1/talos-guides/install/cloud-platforms/gcp/gcp-ccm.yaml"

Updating the config

Now we need to update the local config.yaml file with any required changes such as changing the default zone, Talos version, machine sizes, nodes count etc.

An example config.yaml file is shown below:

imports:
  - path: talos-ha.jinja

resources:
  - name: talos-ha
    type: talos-ha.jinja
    properties:
      zone: us-west1-b
      talosVersion: v1.1.1
      externalCloudProvider: false
      controlPlaneNodeCount: 5
      controlPlaneNodeType: n1-standard-1
      workerNodeCount: 3
      workerNodeType: n1-standard-1
outputs:
  - name: bucketName
    value: $(ref.talos-ha.bucketName)

Enabling external cloud provider

Note: The externalCloudProvider property is set to false by default. The manifest used for deploying the ccm (cloud controller manager) is currently using the GCP ccm provided by openshift since there are no public images for the ccm yet.

Since the routes controller is disabled while deploying the CCM, the CNI pods needs to be restarted after the CCM deployment is complete to remove the node.kubernetes.io/network-unavailable taint. See Nodes network-unavailable taint not removed after installing ccm for more information

Use a custom built image for the ccm deployment if required.

Creating the deployment

Now we are ready to create the deployment. Confirm with y for any prompts. Run the following command to create the deployment:

# use a unique name for the deployment, resources are prefixed with the deployment name
export DEPLOYMENT_NAME="<deployment name>"
gcloud deployment-manager deployments create "${DEPLOYMENT_NAME}" --config config.yaml

Retrieving the outputs

First we need to get the deployment outputs.

# first get the outputs
OUTPUTS=$(gcloud deployment-manager deployments describe "${DEPLOYMENT_NAME}" --format json | jq '.outputs[]')

BUCKET_NAME=$(jq -r '. | select(.name == "bucketName").finalValue' <<< "${OUTPUTS}")
# used when cloud controller is enabled
SERVICE_ACCOUNT=$(jq -r '. | select(.name == "serviceAccount").finalValue' <<< "${OUTPUTS}")
PROJECT=$(jq -r '. | select(.name == "project").finalValue' <<< "${OUTPUTS}")

Note: If cloud controller manager is enabled, the below command needs to be run to allow the controller custom role to access cloud resources

gcloud projects add-iam-policy-binding \
    "${PROJECT}" \
    --member "serviceAccount:${SERVICE_ACCOUNT}" \
    --role roles/iam.serviceAccountUser

gcloud projects add-iam-policy-binding \
    "${PROJECT}" \
    --member serviceAccount:"${SERVICE_ACCOUNT}" \
    --role roles/compute.admin

gcloud projects add-iam-policy-binding \
    "${PROJECT}" \
    --member serviceAccount:"${SERVICE_ACCOUNT}" \
    --role roles/compute.loadBalancerAdmin

Downloading talos and kube config

In addition to the talosconfig and kubeconfig files, the storage bucket contains the controlplane.yaml and worker.yaml files used to join additional nodes to the cluster.

gsutil cp "gs://${BUCKET_NAME}/generated/talosconfig" .
gsutil cp "gs://${BUCKET_NAME}/generated/kubeconfig" .

Deploying the cloud controller manager

kubectl \
  --kubeconfig kubeconfig \
  --namespace kube-system \
  apply \
  --filename gcp-ccm.yaml
#  wait for the ccm to be up
kubectl \
  --kubeconfig kubeconfig \
  --namespace kube-system \
  rollout status \
  daemonset cloud-controller-manager

If the cloud controller manager is enabled, we need to restart the CNI pods to remove the node.kubernetes.io/network-unavailable taint.

# restart the CNI pods, in this case flannel
kubectl \
  --kubeconfig kubeconfig \
  --namespace kube-system \
  rollout restart \
  daemonset kube-flannel
# wait for the pods to be restarted
kubectl \
  --kubeconfig kubeconfig \
  --namespace kube-system \
  rollout status \
  daemonset kube-flannel

Check cluster status

kubectl \
  --kubeconfig kubeconfig \
  get nodes

Cleanup deployment

Warning: This will delete the deployment and all resources associated with it.

Run below if cloud controller manager is enabled

gcloud projects remove-iam-policy-binding \
    "${PROJECT}" \
    --member "serviceAccount:${SERVICE_ACCOUNT}" \
    --role roles/iam.serviceAccountUser

gcloud projects remove-iam-policy-binding \
    "${PROJECT}" \
    --member serviceAccount:"${SERVICE_ACCOUNT}" \
    --role roles/compute.admin

gcloud projects remove-iam-policy-binding \
    "${PROJECT}" \
    --member serviceAccount:"${SERVICE_ACCOUNT}" \
    --role roles/compute.loadBalancerAdmin

Now we can finally remove the deployment

# delete the objects in the bucket first
gsutil -m rm -r "gs://${BUCKET_NAME}"
gcloud deployment-manager deployments delete "${DEPLOYMENT_NAME}" --quiet

1.3.5 - Hetzner

Creating a cluster via the CLI (hcloud) on Hetzner.

Upload image

Hetzner Cloud does not support uploading custom images. You can email their support to get a Talos ISO uploaded by following issues:3599 or you can prepare image snapshot by yourself.

There are two options to upload your own.

  1. Run an instance in rescue mode and replace the system OS with the Talos image
  2. Use Hashicorp packer to prepare an image

Rescue mode

Create a new Server in the Hetzner console. Enable the Hetzner Rescue System for this server and reboot. Upon a reboot, the server will boot a special minimal Linux distribution designed for repair and reinstall. Once running, login to the server using ssh to prepare the system disk by doing the following:

# Check that you in Rescue mode
df

### Result is like:
# udev                   987432         0    987432   0% /dev
# 213.133.99.101:/nfs 308577696 247015616  45817536  85% /root/.oldroot/nfs
# overlay                995672      8340    987332   1% /
# tmpfs                  995672         0    995672   0% /dev/shm
# tmpfs                  398272       572    397700   1% /run
# tmpfs                    5120         0      5120   0% /run/lock
# tmpfs                  199132         0    199132   0% /run/user/0

# Download the Talos image
cd /tmp
wget -O /tmp/talos.raw.xz https://github.com/siderolabs/talos/releases/download/v1.1.1/hcloud-amd64.raw.xz
# Replace system
xz -d -c /tmp/talos.raw.xz | dd of=/dev/sda && sync
# shutdown the instance
shutdown -h now

To make sure disk content is consistent, it is recommended to shut the server down before taking an image (snapshot). Once shutdown, simply create an image (snapshot) from the console. You can now use this snapshot to run Talos on the cloud.

Packer

Install packer to the local machine.

Create a config file for packer to use:

# hcloud.pkr.hcl

packer {
  required_plugins {
    hcloud = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/hcloud"
    }
  }
}

variable "talos_version" {
  type    = string
  default = "v1.1.1"
}

locals {
  image = "https://github.com/siderolabs/talos/releases/download/${var.talos_version}/hcloud-amd64.raw.xz"
}

source "hcloud" "talos" {
  rescue       = "linux64"
  image        = "debian-11"
  location     = "hel1"
  server_type  = "cx11"
  ssh_username = "root"

  snapshot_name = "talos system disk"
  snapshot_labels = {
    type    = "infra",
    os      = "talos",
    version = "${var.talos_version}",
  }
}

build {
  sources = ["source.hcloud.talos"]

  provisioner "shell" {
    inline = [
      "apt-get install -y wget",
      "wget -O /tmp/talos.raw.xz ${local.image}",
      "xz -d -c /tmp/talos.raw.xz | dd of=/dev/sda && sync",
    ]
  }
}

Create a new image by issuing the commands shown below. Note that to create a new API token for your Project, switch into the Hetzner Cloud Console choose a Project, go to Access → Security, and create a new token.

# First you need set API Token
export HCLOUD_TOKEN=${TOKEN}

# Upload image
packer init .
packer build .
# Save the image ID
export IMAGE_ID=<image-id-in-packer-output>

After doing this, you can find the snapshot in the console interface.

Creating a Cluster via the CLI

This section assumes you have the hcloud console utility on your local machine.

# Set hcloud context and api key
hcloud context create talos-tutorial

Create a Load Balancer

Create a load balancer by issuing the commands shown below. Save the IP/DNS name, as this info will be used in the next step.

hcloud load-balancer create --name controlplane --network-zone eu-central --type lb11 --label 'type=controlplane'

### Result is like:
# LoadBalancer 484487 created
# IPv4: 49.12.X.X
# IPv6: 2a01:4f8:X:X::1

hcloud load-balancer add-service controlplane \
    --listen-port 6443 --destination-port 6443 --protocol tcp
hcloud load-balancer add-target controlplane \
    --label-selector 'type=controlplane'

Create the Machine Configuration Files

Generating Base Configurations

Using the IP/DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines by issuing:

$ talosctl gen config talos-k8s-hcloud-tutorial https://<load balancer IP or DNS>:6443
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatches which will be applied during the config generation.

Validate the Configuration Files

Validate any edited machine configs with:

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Create the Servers

We can now create our servers. Note that you can find IMAGE_ID in the snapshot section of the console: https://console.hetzner.cloud/projects/$PROJECT_ID/servers/snapshots.

Create the Control Plane Nodes

Create the control plane nodes with:

export IMAGE_ID=<your-image-id>

hcloud server create --name talos-control-plane-1 \
    --image ${IMAGE_ID} \
    --type cx21 --location hel1 \
    --label 'type=controlplane' \
    --user-data-from-file controlplane.yaml

hcloud server create --name talos-control-plane-2 \
    --image ${IMAGE_ID} \
    --type cx21 --location fsn1 \
    --label 'type=controlplane' \
    --user-data-from-file controlplane.yaml

hcloud server create --name talos-control-plane-3 \
    --image ${IMAGE_ID} \
    --type cx21 --location nbg1 \
    --label 'type=controlplane' \
    --user-data-from-file controlplane.yaml

Create the Worker Nodes

Create the worker nodes with the following command, repeating (and incrementing the name counter) as many times as desired.

hcloud server create --name talos-worker-1 \
    --image ${IMAGE_ID} \
    --type cx21 --location hel1 \
    --label 'type=worker' \
    --user-data-from-file worker.yaml

Bootstrap Etcd

To configure talosctl we will need the first control plane node’s IP. This can be found by issuing:

hcloud server list | grep talos-control-plane

Set the endpoints and nodes for your talosconfig with:

talosctl --talosconfig talosconfig config endpoint <control-plane-1-IP>
talosctl --talosconfig talosconfig config node <control-plane-1-IP>

Bootstrap etcd on the first control plane node with:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.3.6 - Nocloud

Creating a cluster via the CLI using qemu.

Talos supports nocloud data source implementation.

There are two ways to configure Talos server with nocloud platform:

  • via SMBIOS “serial number” option
  • using CDROM or USB-flash filesystem

SMBIOS Serial Number

This method requires the network connection to be up (e.g. via DHCP). Configuration is delivered from the HTTP server.

ds=nocloud-net;s=http://10.10.0.1/configs/;h=HOSTNAME

After the network initialization is complete, Talos fetches:

  • the machine config from http://10.10.0.1/configs/user-data
  • the network config (if available) from http://10.10.0.1/configs/network-config

SMBIOS: QEMU

Add the following flag to qemu command line when starting a VM:

qemu-system-x86_64 \
  ...\
  -smbios type=1,serial=ds=nocloud-net;s=http://10.10.0.1/configs/

SMBIOS: Proxmox

Set the source machine config through the serial number on Proxmox GUI.

The Proxmox stores the VM config at /etc/pve/qemu-server/$ID.conf ($ID - VM ID number of virtual machine), you will see something like:

...
smbios1: uuid=ceae4d10,serial=ZHM9bm9jbG91ZC1uZXQ7cz1odHRwOi8vMTAuMTAuMC4xL2NvbmZpZ3Mv,base64=1
...

Where serial holds the base64-encoded string version of ds=nocloud-net;s=http://10.10.0.1/configs/.

CDROM/USB

Talos can also get machine config from local attached storage without any prior network connection being established.

You can provide configs to the server via files on a VFAT or ISO9660 filesystem. The filesystem volume label must be cidata or CIDATA.

Example: QEMU

Create and prepare Talos machine config:

export CONTROL_PLANE_IP=192.168.1.10

talosctl gen config talos-nocloud https://$CONTROL_PLANE_IP:6443 --output-dir _out

Prepare cloud-init configs:

mkdir -p iso
mv _out/controlplane.yaml iso/user-data
echo "local-hostname: controlplane-1" > iso/meta-data
cat > iso/network-config << EOF
version: 1
config:
   - type: physical
     name: eth0
     mac_address: "52:54:00:12:34:00"
     subnets:
        - type: static
          address: 192.168.1.10
          netmask: 255.255.255.0
          gateway: 192.168.1.254
EOF

Create cloud-init iso image

cd iso && genisoimage -output cidata.iso -V cidata -r -J user-data meta-data network-config

Start the VM

qemu-system-x86_64 \
    ...
    -cdrom iso/cidata.iso \
    ...

Example: Proxmox

Proxmox can create cloud-init disk for you. Edit the cloud-init config information in Proxmox as follows, substitute your own information as necessary:

and then update cicustom param at /etc/pve/qemu-server/$ID.conf.

cicustom: user=local:snippets/master-1.yml
ipconfig0: ip=192.168.1.10/24,gw=192.168.10.254
nameserver: 1.1.1.1
searchdomain: local

Note: snippets/master-1.yml is Talos machine config. It is usually located at /var/lib/vz/snippets/master-1.yml. This file must be placed to this path manually, as Proxmox does not support snippet uploading via API/GUI.

Click on Regenerate Image button after the above changes are made.

1.3.7 - Openstack

Creating a cluster via the CLI on Openstack.

Creating a Cluster via the CLI

In this guide, we will create an HA Kubernetes cluster in Openstack with 1 worker node. We will assume an existing some familiarity with Openstack. If you need more information on Openstack specifics, please see the official Openstack documentation.

Environment Setup

You should have an existing openrc file. This file will provide environment variables necessary to talk to your Openstack cloud. See here for instructions on fetching this file.

Create the Image

First, download the Openstack image from a Talos release. These images are called openstack-$ARCH.tar.gz. Untar this file with tar -xvf openstack-$ARCH.tar.gz. The resulting file will be called disk.raw.

Upload the Image

Once you have the image, you can upload to Openstack with:

openstack image create --public --disk-format raw --file disk.raw talos

Network Infrastructure

Load Balancer and Network Ports

Once the image is prepared, you will need to work through setting up the network. Issue the following to create a load balancer, the necessary network ports for each control plane node, and associations between the two.

Creating loadbalancer:

# Create load balancer, updating vip-subnet-id if necessary
openstack loadbalancer create --name talos-control-plane --vip-subnet-id public

# Create listener
openstack loadbalancer listener create --name talos-control-plane-listener --protocol TCP --protocol-port 6443 talos-control-plane

# Pool and health monitoring
openstack loadbalancer pool create --name talos-control-plane-pool --lb-algorithm ROUND_ROBIN --listener talos-control-plane-listener --protocol TCP
openstack loadbalancer healthmonitor create --delay 5 --max-retries 4 --timeout 10 --type TCP talos-control-plane-pool

Creating ports:

# Create ports for control plane nodes, updating network name if necessary
openstack port create --network shared talos-control-plane-1
openstack port create --network shared talos-control-plane-2
openstack port create --network shared talos-control-plane-3

# Create floating IPs for the ports, so that you will have talosctl connectivity to each control plane
openstack floating ip create --port talos-control-plane-1 public
openstack floating ip create --port talos-control-plane-2 public
openstack floating ip create --port talos-control-plane-3 public

Note: Take notice of the private and public IPs associated with each of these ports, as they will be used in the next step. Additionally, take node of the port ID, as it will be used in server creation.

Associate port’s private IPs to loadbalancer:

# Create members for each port IP, updating subnet-id and address as necessary.
openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-1 PORT> --protocol-port 6443 talos-control-plane-pool
openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-2 PORT> --protocol-port 6443 talos-control-plane-pool
openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-3 PORT> --protocol-port 6443 talos-control-plane-pool

Security Groups

This example uses the default security group in Openstack. Ports have been opened to ensure that connectivity from both inside and outside the group is possible. You will want to allow, at a minimum, ports 6443 (Kubernetes API server) and 50000 (Talos API) from external sources. It is also recommended to allow communication over all ports from within the subnet.

Cluster Configuration

With our networking bits setup, we’ll fetch the IP for our load balancer and create our configuration files.

LB_PUBLIC_IP=$(openstack loadbalancer show talos-control-plane -f json | jq -r .vip_address)

talosctl gen config talos-k8s-openstack-tutorial https://${LB_PUBLIC_IP}:6443

Additionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Compute Creation

We are now ready to create our Openstack nodes.

Create control plane:

# Create control planes 2 and 3, substituting the same info.
for i in $( seq 1 3 ); do
  openstack server create talos-control-plane-$i --flavor m1.small --nic port-id=talos-control-plane-$i --image talos --user-data /path/to/controlplane.yaml
done

Create worker:

# Update network name as necessary.
openstack server create talos-worker-1 --flavor m1.small --network shared --image talos --user-data /path/to/worker.yaml

Note: This step can be repeated to add more workers.

Bootstrap Etcd

You should now be able to interact with your cluster with talosctl. We will use one of the floating IPs we allocated earlier. It does not matter which one.

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.3.8 - Oracle

Creating a cluster via the CLI (oci) on OracleCloud.com.

Upload image

Oracle Cloud at the moment does not have a Talos official image. So you can use Bring Your Own Image (BYOI) approach.

Once the image is uploaded, set the Boot volume type to Paravirtualized mode.

OracleCloud has highly available NTP service, it can be enabled in Talos machine config with:

machine:
  time:
    servers:
      - 169.254.169.254

Creating a Cluster via the CLI

Login to the console. And open the Cloud Shell.

Create a network

export cidr_block=10.0.0.0/16
export subnet_block=10.0.0.0/24
export compartment_id=<substitute-value-of-compartment_id> # https://docs.cloud.oracle.com/en-us/iaas/tools/oci-cli/latest/oci_cli_docs/cmdref/network/vcn/create.html#cmdoption-compartment-id

export vcn_id=$(oci network vcn create --cidr-block $cidr_block --display-name talos-example --compartment-id $compartment_id --query data.id --raw-output)
export rt_id=$(oci network subnet create --cidr-block $subnet_block --display-name kubernetes --compartment-id $compartment_id --vcn-id $vcn_id --query data.route-table-id --raw-output)
export ig_id=$(oci network internet-gateway create --compartment-id $compartment_id --is-enabled true --vcn-id $vcn_id --query data.id --raw-output)

oci network route-table update --rt-id $rt_id --route-rules "[{\"cidrBlock\":\"0.0.0.0/0\",\"networkEntityId\":\"$ig_id\"}]" --force

# disable firewall
export sl_id=$(oci network vcn list --compartment-id $compartment_id --query 'data[0]."default-security-list-id"' --raw-output)

oci network security-list update --security-list-id $sl_id --egress-security-rules '[{"destination": "0.0.0.0/0", "protocol": "all", "isStateless": false}]' --ingress-security-rules '[{"source": "0.0.0.0/0", "protocol": "all", "isStateless": false}]' --force

Create a Load Balancer

Create a load balancer by issuing the commands shown below. Save the IP/DNS name, as this info will be used in the next step.

export subnet_id=$(oci network subnet list --compartment-id=$compartment_id --display-name kubernetes --query data[0].id --raw-output)
export network_load_balancer_id=$(oci nlb network-load-balancer create --compartment-id $compartment_id --display-name controlplane-lb --subnet-id $subnet_id --is-preserve-source-destination false --is-private false --query data.id --raw-output)

cat <<EOF > talos-health-checker.json
{
  "intervalInMillis": 10000,
  "port": 50000,
  "protocol": "TCP"
}
EOF

oci nlb backend-set create --health-checker file://talos-health-checker.json --name talos --network-load-balancer-id $network_load_balancer_id --policy TWO_TUPLE --is-preserve-source false
oci nlb listener create --default-backend-set-name talos --name talos --network-load-balancer-id $network_load_balancer_id --port 50000 --protocol TCP

cat <<EOF > controlplane-health-checker.json
{
  "intervalInMillis": 10000,
  "port": 6443,
  "protocol": "HTTPS",
  "returnCode": 200,
  "urlPath": "/readyz"
}
EOF

oci nlb backend-set create --health-checker file://controlplane-health-checker.json --name controlplane --network-load-balancer-id $network_load_balancer_id --policy TWO_TUPLE --is-preserve-source false
oci nlb listener create --default-backend-set-name controlplane --name controlplane --network-load-balancer-id $network_load_balancer_id --port 6443 --protocol TCP

# Save the external IP
oci nlb network-load-balancer list --compartment-id $compartment_id --display-name controlplane-lb --query 'data.items[0]."ip-addresses"'

Create the Machine Configuration Files

Generating Base Configurations

Using the IP/DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines by issuing:

$ talosctl gen config talos-k8s-oracle-tutorial https://<load balancer IP or DNS>:6443 --additional-sans <load balancer IP or DNS>
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatches which will be applied during the config generation.

Validate the Configuration Files

Validate any edited machine configs with:

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Create the Servers

Create the Control Plane Nodes

Create the control plane nodes with:

export shape='VM.Standard.A1.Flex'
export subnet_id=$(oci network subnet list --compartment-id=$compartment_id --display-name kubernetes --query data[0].id --raw-output)
export image_id=$(oci compute image list --compartment-id $compartment_id --shape $shape --operating-system Talos --limit 1 --query data[0].id --raw-output)
export availability_domain=$(oci iam availability-domain list --compartment-id=$compartment_id --query data[0].name --raw-output)
export network_load_balancer_id=$(oci nlb network-load-balancer list --compartment-id $compartment_id --display-name controlplane-lb --query 'data.items[0].id' --raw-output)

cat <<EOF > shape.json
{
  "memoryInGBs": 4,
  "ocpus": 1
}
EOF

export instance_id=$(oci compute instance launch --shape $shape --shape-config file://shape.json --availability-domain $availability_domain --compartment-id $compartment_id --image-id $image_id --subnet-id $subnet_id --display-name controlplane-1 --private-ip 10.0.0.11 --assign-public-ip true --launch-options '{"networkType":"PARAVIRTUALIZED"}' --user-data-file controlplane.yaml --query 'data.id' --raw-output)

oci nlb backend create --backend-set-name talos --network-load-balancer-id $network_load_balancer_id --port 50000 --target-id $instance_id
oci nlb backend create --backend-set-name controlplane --network-load-balancer-id $network_load_balancer_id --port 6443 --target-id $instance_id

export instance_id=$(oci compute instance launch --shape $shape --shape-config file://shape.json --availability-domain $availability_domain --compartment-id $compartment_id --image-id $image_id --subnet-id $subnet_id --display-name controlplane-2 --private-ip 10.0.0.12 --assign-public-ip true --launch-options '{"networkType":"PARAVIRTUALIZED"}' --user-data-file controlplane.yaml --query 'data.id' --raw-output)

oci nlb backend create --backend-set-name talos --network-load-balancer-id $network_load_balancer_id --port 50000 --target-id $instance_id
oci nlb backend create --backend-set-name controlplane --network-load-balancer-id $network_load_balancer_id --port 6443 --target-id $instance_id

export instance_id=$(oci compute instance launch --shape $shape --shape-config file://shape.json --availability-domain $availability_domain --compartment-id $compartment_id --image-id $image_id --subnet-id $subnet_id --display-name controlplane-3 --private-ip 10.0.0.13 --assign-public-ip true --launch-options '{"networkType":"PARAVIRTUALIZED"}' --user-data-file controlplane.yaml --query 'data.id' --raw-output)

oci nlb backend create --backend-set-name talos --network-load-balancer-id $network_load_balancer_id --port 50000 --target-id $instance_id
oci nlb backend create --backend-set-name controlplane --network-load-balancer-id $network_load_balancer_id --port 6443 --target-id $instance_id

Create the Worker Nodes

Create the worker nodes with the following command, repeating (and incrementing the name counter) as many times as desired.

export subnet_id=$(oci network subnet list --compartment-id=$compartment_id --display-name kubernetes --query data[0].id --raw-output)
export image_id=$(oci compute image list --compartment-id $compartment_id --operating-system Talos --limit 1 --query data[0].id --raw-output)
export availability_domain=$(oci iam availability-domain list --compartment-id=$compartment_id --query data[0].name --raw-output)
export shape='VM.Standard.E2.1.Micro'

oci compute instance launch --shape $shape --availability-domain $availability_domain --compartment-id $compartment_id --image-id $image_id --subnet-id $subnet_id --display-name worker-1 --assign-public-ip true --user-data-file worker.yaml

oci compute instance launch --shape $shape --availability-domain $availability_domain --compartment-id $compartment_id --image-id $image_id --subnet-id $subnet_id --display-name worker-2 --assign-public-ip true --user-data-file worker.yaml

oci compute instance launch --shape $shape --availability-domain $availability_domain --compartment-id $compartment_id --image-id $image_id --subnet-id $subnet_id --display-name worker-3 --assign-public-ip true --user-data-file worker.yaml

Bootstrap Etcd

To configure talosctl we will need the first control plane node’s IP. This can be found by issuing:

export instance_id=$(oci compute instance list --compartment-id $compartment_id --display-name controlplane-1 --query 'data[0].id' --raw-output)

oci compute instance list-vnics --instance-id $instance_id --query 'data[0]."private-ip"' --raw-output

Set the endpoints and nodes for your talosconfig with:

talosctl --talosconfig talosconfig config endpoint <load balancer IP or DNS>
talosctl --talosconfig talosconfig config node <control-plane-1-IP>

Bootstrap etcd on the first control plane node with:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

1.3.9 - Scaleway

Creating a cluster via the CLI (scw) on scaleway.com.

Talos is known to work on scaleway.com; however, it is currently undocumented.

1.3.10 - UpCloud

Creating a cluster via the CLI (upctl) on UpCloud.com.

In this guide we will create an HA Kubernetes cluster 3 control plane nodes and 1 worker node. We assume some familiarity with UpCloud. If you need more information on UpCloud specifics, please see the official UpCloud documentation.

Create the Image

The best way to create an image for UpCloud, is to build one using Hashicorp packer, with the upcloud-amd64.raw.xz image found on the Talos Releases. Using the general ISO is also possible, but the UpCloud image has some UpCloud specific features implemented, such as the fetching of metadata and user data to configure the nodes.

To create the cluster, you need a few things locally installed:

  1. UpCloud CLI
  2. Hashicorp Packer

NOTE: Make sure your account allows API connections. To do so, log into UpCloud control panel and go to People -> Account -> Permissions -> Allow API connections checkbox. It is recommended to create a separate subaccount for your API access and only set the API permission.

To use the UpCloud CLI, you need to create a config in $HOME/.config/upctl.yaml

username: your_upcloud_username
password: your_upcloud_password

To use the UpCloud packer plugin, you need to also export these credentials to your environment variables, by e.g. putting the following in your .bashrc or .zshrc

export UPCLOUD_USERNAME="<username>"
export UPCLOUD_PASSWORD="<password>"

Next create a config file for packer to use:

# upcloud.pkr.hcl

packer {
  required_plugins {
    upcloud = {
      version = ">=v1.0.0"
      source  = "github.com/UpCloudLtd/upcloud"
    }
  }
}

variable "talos_version" {
  type    = string
  default = "v1.1.1"
}

locals {
  image = "https://github.com/siderolabs/talos/releases/download/${var.talos_version}/upcloud-amd64.raw.xz"
}

variable "username" {
  type        = string
  description = "UpCloud API username"
  default     = "${env("UPCLOUD_USERNAME")}"
}

variable "password" {
  type        = string
  description = "UpCloud API password"
  default     = "${env("UPCLOUD_PASSWORD")}"
  sensitive   = true
}

source "upcloud" "talos" {
  username        = "${var.username}"
  password        = "${var.password}"
  zone            = "us-nyc1"
  storage_name    = "Debian GNU/Linux 11 (Bullseye)"
  template_name   = "Talos (${var.talos_version})"
}

build {
  sources = ["source.upcloud.talos"]

  provisioner "shell" {
    inline = [
      "apt-get install -y wget xz-utils",
      "wget -q -O /tmp/talos.raw.xz ${local.image}",
      "xz -d -c /tmp/talos.raw.xz | dd of=/dev/vda",
    ]
  }

  provisioner "shell-local" {
      inline = [
      "upctl server stop --type hard custom",
      ]
  }
}

Now create a new image by issuing the commands shown below.

packer init .
packer build .

After doing this, you can find the custom image in the console interface under storage.

Creating a Cluster via the CLI

Create an Endpoint

To communicate with the Talos cluster you will need a single endpoint that is used to access the cluster. This can either be a loadbalancer that will sit in front of all your control plane nodes, a DNS name with one or more A or AAAA records pointing to the control plane nodes, or directly the IP of a control plane node.

Which option is best for you will depend on your needs. Endpoint selection has been further documented here.

After you decide on which endpoint to use, note down the domain name or IP, as we will need it in the next step.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the endpoint created earlier, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-upcloud-tutorial https://<load balancer IP or DNS>:<port> --install-disk /dev/vda
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking. Depending on the Kubernetes version you want to run, you might need to select a different Talos version, as not all versions are compatible. You can find the support matrix here.

Optionally, you can specify --config-patch with RFC6902 jsonpatch or yamlpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Create the Servers

Create the Control Plane Nodes

Run the following to create three total control plane nodes:

for ID in $(seq 3); do
    upctl server create \
      --zone us-nyc1 \
      --title talos-us-nyc1-master-$ID \
      --hostname talos-us-nyc1-master-$ID \
      --plan 2xCPU-4GB \
      --os "Talos (v1.1.1)" \
      --user-data "$(cat controlplane.yaml)" \
      --enable-metada
done

Note: modify the zone and OS depending on your preferences. The OS should match the template name generated with packer in the previous step.

Note the IP address of the first control plane node, as we will need it later.

Create the Worker Nodes

Run the following to create a worker node:

upctl server create \
  --zone us-nyc1 \
  --title talos-us-nyc1-worker-1 \
  --hostname talos-us-nyc1-worker-1 \
  --plan 2xCPU-4GB \
  --os "Talos (v1.1.1)" \
  --user-data "$(cat worker.yaml)" \
  --enable-metada

Bootstrap Etcd

To configure talosctl we will need the first control plane node’s IP, as noted earlier. We only add one node IP, as that is the entry into our cluster against which our commands will be run. All requests to other nodes are proxied through the endpoint, and therefore not all nodes need to be manually added to the config. You don’t want to run your commands against all nodes, as this can destroy your cluster if you are not careful (further documentation).

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig

It will take a few minutes before Kubernetes has been fully bootstrapped, and is accessible.

You can check if the nodes are registered in Talos by running

talosctl --talosconfig talosconfig get members

To check if your nodes are ready, run

kubectl get nodes

1.3.11 - Vultr

Creating a cluster via the CLI (vultr-cli) on Vultr.com.

Creating a Cluster using the Vultr CLI

This guide will demonstrate how to create a highly-available Kubernetes cluster with one worker using the Vultr cloud provider.

Vultr have a very well documented REST API, and an open-source CLI tool to interact with the API which will be used in this guide. Make sure to follow installation and authentication instructions for the vultr-cli tool.

Upload image

First step is to make the Talos ISO available to Vultr by uploading the latest release of the ISO to the Vultr ISO server.

vultr-cli iso create --url https://github.com/siderolabs/talos/releases/download/v1.1.1/talos-amd64.iso

Make a note of the ID in the output, it will be needed later when creating the instances.

Create a Load Balancer

A load balancer is needed to serve as the Kubernetes endpoint for the cluster.

vultr-cli load-balancer create \
   --region $REGION \
   --label "Talos Kubernetes Endpoint" \
   --port 6443 \
   --protocol tcp \
   --check-interval 10 \
   --response-timeout 5 \
   --healthy-threshold 5 \
   --unhealthy-threshold 3 \
   --forwarding-rules frontend_protocol:tcp,frontend_port:443,backend_protocol:tcp,backend_port:6443

Make a note of the ID of the load balancer from the output of the above command, it will be needed after the control plane instances are created.

vultr-cli load-balancer get $LOAD_BALANCER_ID | grep ^IP

Make a note of the IP address, it will be needed later when generating the configuration.

Create the Machine Configuration

Generate Base Configuration

Using the IP address (or DNS name if one was created) of the load balancer created above, generate the machine configuration files for the new cluster.

talosctl gen config talos-kubernetes-vultr https://$LOAD_BALANCER_ADDRESS

Once generated, the machine configuration can be modified as necessary for the new cluster, for instance updating disk installation, or adding SANs for the certificates.

Validate the Configuration Files

talosctl validate --config controlplane.yaml --mode cloud
talosctl validate --config worker.yaml --mode cloud

Create the Nodes

Create the Control Plane Nodes

First a control plane needs to be created, with the example below creating 3 instances in a loop. The instance type (noted by the --plan vc2-2c-4gb argument) in the example is for a minimum-spec control plane node, and should be updated to suit the cluster being created.

for id in $(seq 3); do
    vultr-cli instance create \
        --plan vc2-2c-4gb \
        --region $REGION \
        --iso $TALOS_ISO_ID \
        --host talos-k8s-cp${id} \
        --label "Talos Kubernetes Control Plane" \
        --tags talos,kubernetes,control-plane
done

Make a note of the instance IDs, as they are needed to attach to the load balancer created earlier.

vultr-cli load-balancer update $LOAD_BALANCER_ID --instances $CONTROL_PLANE_1_ID,$CONTROL_PLANE_2_ID,$CONTROL_PLANE_3_ID

Once the nodes are booted and waiting in maintenance mode, the machine configuration can be applied to each one in turn.

talosctl --talosconfig talosconfig apply-config --insecure --nodes $CONTROL_PLANE_1_ADDRESS --file controlplane.yaml
talosctl --talosconfig talosconfig apply-config --insecure --nodes $CONTROL_PLANE_2_ADDRESS --file controlplane.yaml
talosctl --talosconfig talosconfig apply-config --insecure --nodes $CONTROL_PLANE_3_ADDRESS --file controlplane.yaml

Create the Worker Nodes

Now worker nodes can be created and configured in a similar way to the control plane nodes, the difference being mainly in the machine configuration file. Note that like with the control plane nodes, the instance type (here set by --plan vc2-1-1gb) should be changed for the actual cluster requirements.

for id in $(seq 1); do
    vultr-cli instance create \
        --plan vc2-1c-1gb \
        --region $REGION \
        --iso $TALOS_ISO_ID \
        --host talos-k8s-worker${id} \
        --label "Talos Kubernetes Worker" \
        --tags talos,kubernetes,worker
done

Once the worker is booted and in maintenance mode, the machine configuration can be applied in the following manner.

talosctl --talosconfig talosconfig apply-config --insecure --nodes $WORKER_1_ADDRESS --file worker.yaml

Bootstrap etcd

Once all the cluster nodes are correctly configured, the cluster can be bootstrapped to become functional. It is important that the talosctl bootstrap command be executed only once and against only a single control plane node.

talosctl --talosconfig talosconfig boostrap --endpoints $CONTROL_PLANE_1_ADDRESS --nodes $CONTROL_PLANE_1_ADDRESS

Configure Endpoints and Nodes

While the cluster goes through the bootstrapping process and beings to self-manage, the talosconfig can be updated with the endpoints and nodes.

talosctl --talosconfig talosconfig config endpoints $CONTROL_PLANE_1_ADDRESS $CONTROL_PLANE_2_ADDRESS $CONTROL_PLANE_3_ADDRESS
talosctl --talosconfig talosconfig config nodes $CONTROL_PLANE_1_ADDRESS $CONTROL_PLANE_2_ADDRESS $CONTROL_PLANE_3_ADDRESS WORKER_1_ADDRESS

Retrieve the kubeconfig

Finally, with the cluster fully running, the administrative kubeconfig can be retrieved from the Talos API to be saved locally.

talosctl --talosconfig talosconfig kubeconfig .

Now the kubeconfig can be used by any of the usual Kubernetes tools to interact with the Talos-based Kubernetes cluster as normal.

1.4 - Local Platforms

Installation of Talos Linux on local platforms, helpful for testing and developing.

1.4.1 - Docker

Creating Talos Kubernetes cluster using Docker.

In this guide we will create a Kubernetes cluster in Docker, using a containerized version of Talos.

Running Talos in Docker is intended to be used in CI pipelines, and local testing when you need a quick and easy cluster. Furthermore, if you are running Talos in production, it provides an excellent way for developers to develop against the same version of Talos.

Requirements

The follow are requirements for running Talos in Docker:

  • Docker 18.03 or greater
  • a recent version of talosctl

Caveats

Due to the fact that Talos will be running in a container, certain APIs are not available. For example upgrade, reset, and similar APIs don’t apply in container mode. Further, when running on a Mac in docker, due to networking limitations, VIPs are not supported.

Create the Cluster

Creating a local cluster is as simple as:

talosctl cluster create --wait

Once the above finishes successfully, your talosconfig(~/.talos/config) will be configured to point to the new cluster.

Note: Startup times can take up to a minute or more before the cluster is available.

Finally, we just need to specify which nodes you want to communicate with using talosctl. Talosctl can operate on one or all the nodes in the cluster – this makes cluster wide commands much easier.

talosctl config nodes 10.5.0.2 10.5.0.3

Using the Cluster

Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster. For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace. To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.

Cleaning Up

To cleanup, run:

talosctl cluster destroy

1.4.2 - QEMU

Creating Talos Kubernetes cluster using QEMU VMs.

In this guide we will create a Kubernetes cluster using QEMU.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Requirements

  • Linux
  • a kernel with
    • KVM enabled (/dev/kvm must exist)
    • CONFIG_NET_SCH_NETEM enabled
    • CONFIG_NET_SCH_INGRESS enabled
  • at least CAP_SYS_ADMIN and CAP_NET_ADMIN capabilities
  • QEMU
  • bridge, static and firewall CNI plugins from the standard CNI plugins, and tc-redirect-tap CNI plugin from the awslabs tc-redirect-tap installed to /opt/cni/bin (installed automatically by talosctl)
  • iptables
  • /var/run/netns directory should exist

Installation

How to get QEMU

Install QEMU with your operating system package manager. For example, on Ubuntu for x86:

apt install qemu-system-x86 qemu-kvm

Install talosctl

You can download talosctl and all required binaries via github.com/siderolabs/talos/releases

curl https://github.com/siderolabs/talos/releases/download/<version>/talosctl-<platform>-<arch> -L -o talosctl

For example version v1.1.1 for linux platform:

curl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-linux-amd64 -L -o talosctl
sudo cp talosctl /usr/local/bin
sudo chmod +x /usr/local/bin/talosctl

Install Talos kernel and initramfs

QEMU provisioner depends on Talos kernel (vmlinuz) and initramfs (initramfs.xz). These files can be downloaded from the Talos release:

mkdir -p _out/
curl https://github.com/siderolabs/talos/releases/download/<version>/vmlinuz-<arch> -L -o _out/vmlinuz-<arch>
curl https://github.com/siderolabs/talos/releases/download/<version>/initramfs-<arch>.xz -L -o _out/initramfs-<arch>.xz

For example version v1.1.1:

curl https://github.com/siderolabs/talos/releases/download/v1.1.1/vmlinuz-amd64 -L -o _out/vmlinuz-amd64
curl https://github.com/siderolabs/talos/releases/download/v1.1.1/initramfs-amd64.xz -L -o _out/initramfs-amd64.xz

Create the Cluster

For the first time, create root state directory as your user so that you can inspect the logs as non-root user:

mkdir -p ~/.talos/clusters

Create the cluster:

sudo --preserve-env=HOME talosctl cluster create --provisioner qemu

Before the first cluster is created, talosctl will download the CNI bundle for the VM provisioning and install it to ~/.talos/cni directory.

Once the above finishes successfully, your talosconfig (~/.talos/config) will be configured to point to the new cluster, and kubeconfig will be downloaded and merged into default kubectl config location (~/.kube/config).

Cluster provisioning process can be optimized with registry pull-through caches.

Using the Cluster

Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster. For example, to view current running containers, run talosctl -n 10.5.0.2 containers for a list of containers in the system namespace, or talosctl -n 10.5.0.2 containers -k for the k8s.io namespace. To view the logs of a container, use talosctl -n 10.5.0.2 logs <container> or talosctl -n 10.5.0.2 logs -k <container>.

A bridge interface will be created, and assigned the default IP 10.5.0.1. Each node will be directly accessible on the subnet specified at cluster creation time. A loadbalancer runs on 10.5.0.1 by default, which handles loadbalancing for the Kubernetes APIs.

You can see a summary of the cluster state by running:

$ talosctl cluster show --provisioner qemu
PROVISIONER       qemu
NAME              talos-default
NETWORK NAME      talos-default
NETWORK CIDR      10.5.0.0/24
NETWORK GATEWAY   10.5.0.1
NETWORK MTU       1500

NODES:

NAME                     TYPE           IP         CPU    RAM      DISK
talos-default-master-1   Init           10.5.0.2   1.00   1.6 GB   4.3 GB
talos-default-master-2   ControlPlane   10.5.0.3   1.00   1.6 GB   4.3 GB
talos-default-master-3   ControlPlane   10.5.0.4   1.00   1.6 GB   4.3 GB
talos-default-worker-1   Worker         10.5.0.5   1.00   1.6 GB   4.3 GB

Cleaning Up

To cleanup, run:

sudo --preserve-env=HOME talosctl cluster destroy --provisioner qemu

Note: In that case that the host machine is rebooted before destroying the cluster, you may need to manually remove ~/.talos/clusters/talos-default.

Manual Clean Up

The talosctl cluster destroy command depends heavily on the clusters state directory. It contains all related information of the cluster. The PIDs and network associated with the cluster nodes.

If you happened to have deleted the state folder by mistake or you would like to cleanup the environment, here are the steps how to do it manually:

Remove VM Launchers

Find the process of talosctl qemu-launch:

ps -elf | grep 'talosctl qemu-launch'

To remove the VMs manually, execute:

sudo kill -s SIGTERM <PID>

Example output, where VMs are running with PIDs 157615 and 157617

ps -elf | grep '[t]alosctl qemu-launch'
0 S root      157615    2835  0  80   0 - 184934 -     07:53 ?        00:00:00 talosctl qemu-launch
0 S root      157617    2835  0  80   0 - 185062 -     07:53 ?        00:00:00 talosctl qemu-launch
sudo kill -s SIGTERM 157615
sudo kill -s SIGTERM 157617

Stopping VMs

Find the process of qemu-system:

ps -elf | grep 'qemu-system'

To stop the VMs manually, execute:

sudo kill -s SIGTERM <PID>

Example output, where VMs are running with PIDs 158065 and 158216

ps -elf | grep qemu-system
2 S root     1061663 1061168 26  80   0 - 1786238 -    14:05 ?        01:53:56 qemu-system-x86_64 -m 2048 -drive format=raw,if=virtio,file=/home/username/.talos/clusters/talos-default/bootstrap-master.disk -smp cpus=2 -cpu max -nographic -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=net0,mac=1e:86:c6:b4:7c:c4 -device virtio-rng-pci -no-reboot -boot order=cn,reboot-timeout=5000 -smbios type=1,uuid=7ec0a73c-826e-4eeb-afd1-39ff9f9160ca -machine q35,accel=kvm
2 S root     1061663 1061170 67  80   0 - 621014 -     21:23 ?        00:00:07 qemu-system-x86_64 -m 2048 -drive format=raw,if=virtio,file=/homeusername/.talos/clusters/talos-default/pxe-1.disk -smp cpus=2 -cpu max -nographic -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=net0,mac=36:f3:2f:c3:9f:06 -device virtio-rng-pci -no-reboot -boot order=cn,reboot-timeout=5000 -smbios type=1,uuid=ce12a0d0-29c8-490f-b935-f6073ab916a6 -machine q35,accel=kvm
sudo kill -s SIGTERM 1061663
sudo kill -s SIGTERM 1061663

Remove load balancer

Find the process of talosctl loadbalancer-launch:

ps -elf | grep 'talosctl loadbalancer-launch'

To remove the LB manually, execute:

sudo kill -s SIGTERM <PID>

Example output, where loadbalancer is running with PID 157609

ps -elf | grep '[t]alosctl loadbalancer-launch'
4 S root      157609    2835  0  80   0 - 184998 -     07:53 ?        00:00:07 talosctl loadbalancer-launch --loadbalancer-addr 10.5.0.1 --loadbalancer-upstreams 10.5.0.2
sudo kill -s SIGTERM 157609

Remove DHCP server

Find the process of talosctl dhcpd-launch:

ps -elf | grep 'talosctl dhcpd-launch'

To remove the LB manually, execute:

sudo kill -s SIGTERM <PID>

Example output, where loadbalancer is running with PID 157609

ps -elf | grep '[t]alosctl dhcpd-launch'
4 S root      157609    2835  0  80   0 - 184998 -     07:53 ?        00:00:07 talosctl dhcpd-launch --state-path /home/username/.talos/clusters/talos-default --addr 10.5.0.1 --interface talosbd9c32bc
sudo kill -s SIGTERM 157609

Remove network

This is more tricky part as if you have already deleted the state folder. If you didn’t then it is written in the state.yaml in the ~/.talos/clusters/<cluster-name> directory.

sudo cat ~/.talos/clusters/<cluster-name>/state.yaml | grep bridgename
bridgename: talos<uuid>

If you only had one cluster, then it will be the interface with name talos<uuid>

46: talos<uuid>: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether a6:72:f4:0a:d3:9c brd ff:ff:ff:ff:ff:ff
    inet 10.5.0.1/24 brd 10.5.0.255 scope global talos17c13299
       valid_lft forever preferred_lft forever
    inet6 fe80::a472:f4ff:fe0a:d39c/64 scope link
       valid_lft forever preferred_lft forever

To remove this interface:

sudo ip link del talos<uuid>

Remove state directory

To remove the state directory execute:

sudo rm -Rf /home/$USER/.talos/clusters/<cluster-name>

Troubleshooting

Logs

Inspect logs directory

sudo cat ~/.talos/clusters/<cluster-name>/*.log

Logs are saved under <cluster-name>-<role>-<node-id>.log

For example in case of k8s cluster name:

ls -la ~/.talos/clusters/k8s | grep log
-rw-r--r--. 1 root root      69415 Apr 26 20:58 k8s-master-1.log
-rw-r--r--. 1 root root      68345 Apr 26 20:58 k8s-worker-1.log
-rw-r--r--. 1 root root      24621 Apr 26 20:59 lb.log

Inspect logs during the installation

tail -f ~/.talos/clusters/<cluster-name>/*.log

1.4.3 - VirtualBox

Creating Talos Kubernetes cluster using VurtualBox VMs.

In this guide we will create a Kubernetes cluster using VirtualBox.

Video Walkthrough

To see a live demo of this writeup, visit Youtube here:

Installation

How to Get VirtualBox

Install VirtualBox with your operating system package manager or from the website. For example, on Ubuntu for x86:

apt install virtualbox

Install talosctl

You can download talosctl via github.com/siderolabs/talos/releases

curl https://github.com/siderolabs/talos/releases/download/<version>/talosctl-<platform>-<arch> -L -o talosctl

For example version v1.1.1 for linux platform:

curl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-linux-amd64 -L -o talosctl
sudo cp talosctl /usr/local/bin
sudo chmod +x /usr/local/bin/talosctl

Download ISO Image

In order to install Talos in VirtualBox, you will need the ISO image from the Talos release page. You can download talos-amd64.iso via github.com/siderolabs/talos/releases

mkdir -p _out/
curl https://github.com/siderolabs/talos/releases/download/<version>/talos-<arch>.iso -L -o _out/talos-<arch>.iso

For example version v1.1.1 for linux platform:

mkdir -p _out/
curl https://github.com/siderolabs/talos/releases/download/v1.1.1/talos-amd64.iso -L -o _out/talos-amd64.iso

Create VMs

Start by creating a new VM by clicking the “New” button in the VirtualBox UI:

Supply a name for this VM, and specify the Type and Version:

Edit the memory to supply at least 2GB of RAM for the VM:

Proceed through the disk settings, keeping the defaults. You can increase the disk space if desired.

Once created, select the VM and hit “Settings”:

In the “System” section, supply at least 2 CPUs:

In the “Network” section, switch the network “Attached To” section to “Bridged Adapter”:

Finally, in the “Storage” section, select the optical drive and, on the right, select the ISO by browsing your filesystem:

Repeat this process for a second VM to use as a worker node. You can also repeat this for additional nodes desired.

Start Control Plane Node

Once the VMs have been created and updated, start the VM that will be the first control plane node. This VM will boot the ISO image specified earlier and enter “maintenance mode”. Once the machine has entered maintenance mode, there will be a console log that details the IP address that the node received. Take note of this IP address, which will be referred to as $CONTROL_PLANE_IP for the rest of this guide. If you wish to export this IP as a bash variable, simply issue a command like export CONTROL_PLANE_IP=1.2.3.4.

Generate Machine Configurations

With the IP address above, you can now generate the machine configurations to use for installing Talos and Kubernetes. Issue the following command, updating the output directory, cluster name, and control plane IP as you see fit:

talosctl gen config talos-vbox-cluster https://$CONTROL_PLANE_IP:6443 --output-dir _out

This will create several files in the _out directory: controlplane.yaml, worker.yaml, and talosconfig.

Create Control Plane Node

Using the controlplane.yaml generated above, you can now apply this config using talosctl. Issue:

talosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file _out/controlplane.yaml

You should now see some action in the VirtualBox console for this VM. Talos will be installed to disk, the VM will reboot, and then Talos will configure the Kubernetes control plane on this VM.

Note: This process can be repeated multiple times to create an HA control plane.

Create Worker Node

Create at least a single worker node using a process similar to the control plane creation above. Start the worker node VM and wait for it to enter “maintenance mode”. Take note of the worker node’s IP address, which will be referred to as $WORKER_IP

Issue:

talosctl apply-config --insecure --nodes $WORKER_IP --file _out/worker.yaml

Note: This process can be repeated multiple times to add additional workers.

Using the Cluster

Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster. For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace. To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.

First, configure talosctl to talk to your control plane node by issuing the following, updating paths and IPs as necessary:

export TALOSCONFIG="_out/talosconfig"
talosctl config endpoint $CONTROL_PLANE_IP
talosctl config node $CONTROL_PLANE_IP

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig $TALOSCONFIG config endpoint <control plane 1 IP>
talosctl --talosconfig $TALOSCONFIG config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig $TALOSCONFIG bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig $TALOSCONFIG kubeconfig .

You can then use kubectl in this fashion:

kubectl get nodes

Cleaning Up

To cleanup, simply stop and delete the virtual machines from the VirtualBox UI.

1.5 - Single Board Computers

Installation of Talos Linux on single-board computers.

1.5.1 - Banana Pi M64

Installing Talos on Banana Pi M64 SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-bananapi_m64-arm64.img.xz
xz -d metal-bananapi_m64-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-bananapi_m64-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node. Following the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

1.5.2 - Jetson Nano

Installing Talos on Jetson Nano SBC using raw disk image.

Prerequisites

You will need

Download the latest talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Flashing the firmware to on-board SPI flash

Flashing the firmware only needs to be done once.

We will use the R32.7.2 release for the Jetson Nano. Most of the instructions is similar to this doc except that we’d be using a upstream version of u-boot with patches from NVIDIA u-boot so that USB boot also works.

Before flashing we need the following:

  • A USB-A to micro USB cable
  • A jumper wire to enable recovery mode
  • A HDMI monitor to view the logs if the USB serial adapter is not available
  • A USB to Serial adapter with 3.3V TTL (optional)
  • A 5V DC barrel jack

If you’re planning to use the serial console follow the documentation here

First start by downloading the Jetson Nano L4T release.

curl -SLO https://developer.nvidia.com/embedded/l4t/r32_release_v7.1/t210/jetson-210_linux_r32.7.2_aarch64.tbz2

Next we will extract the L4T release and replace the u-boot binary with the patched version.

tar xf jetson-210_linux_r32.6.1_aarch64.tbz2
cd Linux_for_Tegra
crane --platform=linux/arm64 export ghcr.io/siderolabs/u-boot:v1.1.0-alpha.0-42-gcd05ae8 - | tar xf - --strip-components=1 -C bootloader/t210ref/p3450-0000/ jetson_nano/u-boot.bin

Next we will flash the firmware to the Jetson Nano SPI flash. In order to do that we need to put the Jetson Nano into Force Recovery Mode (FRC). We will use the instructions from here

  • Ensure that the Jetson Nano is powered off. There is no need for the SD card/USB storage/network cable to be connected
  • Connect the micro USB cable to the micro USB port on the Jetson Nano, don’t plug the other end to the PC yet
  • Enable Force Recovery Mode (FRC) by placing a jumper across the FRC pins on the Jetson Nano
    • For board revision A02, these are pins 3 and 4 of header J40
    • For board revision B01, these are pins 9 and 10 of header J50
  • Place another jumper across J48 to enable power from the DC jack and connect the Jetson Nano to the DC jack J25
  • Now connect the other end of the micro USB cable to the PC and remove the jumper wire from the FRC pins

Now the Jetson Nano is in Force Recovery Mode (FRC) and can be confirmed by running the following command

lsusb | grep -i "nvidia"

Now we can move on the flashing the firmware.

sudo ./flash p3448-0000-max-spi external

This will flash the firmware to the Jetson Nano SPI flash and you’ll see a lot of output. If you’ve connected the serial console you’ll also see the progress there. Once the flashing is done you can disconnect the USB cable and power off the Jetson Nano.

Download the Image

Download the image and decompress it:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-jetson_nano-arm64.img.xz
xz -d metal-jetson_nano-arm64.img.xz

Writing the Image

Now dd the image to your SD card/USB storage:

sudo dd if=metal-jetson_nano-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M status=progress

| Replace /dev/mmcblk0 with the name of your SD card/USB storage.

Bootstrapping the Node

Insert the SD card/USB storage to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node. Following the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

1.5.3 - Libre Computer Board ALL-H3-CC

Installing Talos on Libre Computer Board ALL-H3-CC SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-libretech_all_h3_cc_h5-arm64.img.xz
xz -d metal-libretech_all_h3_cc_h5-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-libretech_all_h3_cc_h5-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node. Following the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

1.5.4 - Pine64

Installing Talos on a Pine64 SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-pine64-arm64.img.xz
xz -d metal-pine64-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-pine64-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node. Following the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

1.5.5 - Pine64 Rock64

Installing Talos on Pine64 Rock64 SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-rock64-arm64.img.xz
xz -d metal-rock64-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-rock64-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node. Following the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

1.5.6 - Radxa ROCK PI 4

Installing Talos on Radxa ROCK PI 4a/4b SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card or an eMMC or USB drive or an nVME drive

Download the latest talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-rockpi_4-arm64.img.xz
xz -d metal-rockpi_4-arm64.img.xz

Writing the Image

The path to your SD card/eMMC/USB/nVME can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-rockpi_4-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

The user has two options to proceed:

  • booting from a SD card or eMMC
  • booting from a USB or nVME (requires the RockPi board to have the SPI flash)

Booting from SD card or eMMC

Insert the SD card into the board, turn it on and proceed to bootstrapping the node.

Booting from USB or nVME

This requires the user to flash the RockPi SPI flash with u-boot.

This requires the user has access to crane CLI, a spare SD card and optionally access to the RockPi serial console.

  • Flash the Rock PI 4c variant of Debian to the SD card.
  • Boot into the debian image
  • Check that /dev/mtdblock0 exists otherwise the command will silently fail; e.g. lsblk.
  • Download u-boot image from talos u-boot:
mkdir _out
crane --platform=linux/arm64 export ghcr.io/siderolabs/u-boot:v1.1.0-alpha.0-19-g6691342 - | tar xf - --strip-components=1 -C _out rockpi_4/rkspi_loader.img
sudo dd if=rkspi_loader.img of=/dev/mtdblock0 bs=4K
  • Optionally, you can also write Talos image to the SSD drive right from your Rock PI board:
curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-rockpi_4-arm64.img.xz
xz -d metal-rockpi_4-arm64.img.xz
sudo dd if=metal-rockpi_4-arm64.img.xz of=/dev/nvme0n1
  • remove SD card and reboot.

After these steps, Talos will boot from the nVME/USB and enter maintenance mode. Proceed to bootstrapping the node.

Bootstrapping the Node

Wait for the console to show you the instructions for bootstrapping the node. Following the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

1.5.7 - Radxa ROCK PI 4C

Installing Talos on Radxa ROCK PI 4c SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card or an eMMC or USB drive or an nVME drive

Download the latest talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-rockpi_4c-arm64.img.xz
xz -d metal-rockpi_4c-arm64.img.xz

Writing the Image

The path to your SD card/eMMC/USB/nVME can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-rockpi_4c-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

The user has two options to proceed:

  • booting from a SD card or eMMC
  • booting from a USB or nVME (requires the RockPi board to have the SPI flash)

Booting from SD card or eMMC

Insert the SD card into the board, turn it on and proceed to bootstrapping the node.

Booting from USB or nVME

This requires the user to flash the RockPi SPI flash with u-boot.

This requires the user has access to crane CLI, a spare SD card and optionally access to the RockPi serial console.

  • Flash the Rock PI 4c variant of Debian to the SD card.
  • Boot into the debian image
  • Check that /dev/mtdblock0 exists otherwise the command will silently fail; e.g. lsblk.
  • Download u-boot image from talos u-boot:
mkdir _out
crane --platform=linux/arm64 export ghcr.io/siderolabs/u-boot:v1.1.0-alpha.0-19-g6691342 - | tar xf - --strip-components=1 -C _out rockpi_4c/rkspi_loader.img
sudo dd if=rkspi_loader.img of=/dev/mtdblock0 bs=4K
  • Optionally, you can also write Talos image to the SSD drive right from your Rock PI board:
curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-rockpi_4c-arm64.img.xz
xz -d metal-rockpi_4c-arm64.img.xz
sudo dd if=metal-rockpi_4c-arm64.img.xz of=/dev/nvme0n1
  • remove SD card and reboot.

After these steps, Talos will boot from the nVME/USB and enter maintenance mode. Proceed to bootstrapping the node.

Bootstrapping the Node

Wait for the console to show you the instructions for bootstrapping the node. Following the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

1.5.8 - Raspberry Pi 4 Model B

Installing Talos on Rpi4 SBC using raw disk image.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.1.1/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Updating the EEPROM

At least version v2020.09.03-138a1 of the bootloader (rpi-eeprom) is required. To update the bootloader we will need an SD card. Insert the SD card into your computer and use Raspberry Pi Imager to install the bootloader on it (select Operating System > Misc utility images > Bootloader > SD Card Boot). Alternatively, you can use the console on Linux or macOS. The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

curl -Lo rpi-boot-eeprom-recovery.zip https://github.com/raspberrypi/rpi-eeprom/releases/download/v2021.04.29-138a1/rpi-boot-eeprom-recovery-2021-04-29-vl805-000138a1.zip
sudo mkfs.fat -I /dev/mmcblk0
sudo mount /dev/mmcblk0p1 /mnt
sudo bsdtar rpi-boot-eeprom-recovery.zip -C /mnt

Remove the SD card from your local machine and insert it into the Raspberry Pi. Power the Raspberry Pi on, and wait at least 10 seconds. If successful, the green LED light will blink rapidly (forever), otherwise an error pattern will be displayed. If an HDMI display is attached to the port closest to the power/USB-C port, the screen will display green for success or red if a failure occurs. Power off the Raspberry Pi and remove the SD card from it.

Note: Updating the bootloader only needs to be done once.

Download the Image

Download the image and decompress it:

curl -LO https://github.com/siderolabs/talos/releases/download/v1.1.1/metal-rpi_4-arm64.img.xz
xz -d metal-rpi_4-arm64.img.xz

Writing the Image

Now dd the image to your SD card:

sudo dd if=metal-rpi_4-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card to your board, turn it on and wait for the console to show you the instructions for bootstrapping the node. Following the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --mode=interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Note: if you have an HDMI display attached and it shows only a rainbow splash, please use the other HDMI port, the one closest to the power/USB-C port.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

Troubleshooting

The following table can be used to troubleshoot booting issues:

Long FlashesShort FlashesStatus
03Generic failure to boot
04start*.elf not found
07Kernel image not found
08SDRAM failure
09Insufficient SDRAM
010In HALT state
21Partition not FAT
22Failed to read from partition
23Extended partition not FAT
24File signature/hash mismatch - Pi 4
44Unsupported board type
45Fatal firmware error
46Power failure type A
47Power failure type B

2 - Configuration

Guides on how to configure Talos Linux machines

2.1 - Containerd

Customize Containerd Settings

The base containerd configuration expects to merge in any additional configs present in /var/cri/conf.d/*.toml.

An example of exposing metrics

Into each machine config, add the following:

machine:
  ...
  files:
    - content: |
        [metrics]
          address = "0.0.0.0:11234"        
      path: /var/cri/conf.d/metrics.toml
      op: create

Create cluster like normal and see that metrics are now present on this port:

$ curl 127.0.0.1:11234/v1/metrics
# HELP container_blkio_io_service_bytes_recursive_bytes The blkio io service bytes recursive
# TYPE container_blkio_io_service_bytes_recursive_bytes gauge
container_blkio_io_service_bytes_recursive_bytes{container_id="0677d73196f5f4be1d408aab1c4125cf9e6c458a4bea39e590ac779709ffbe14",device="/dev/dm-0",major="253",minor="0",namespace="k8s.io",op="Async"} 0
container_blkio_io_service_bytes_recursive_bytes{container_id="0677d73196f5f4be1d408aab1c4125cf9e6c458a4bea39e590ac779709ffbe14",device="/dev/dm-0",major="253",minor="0",namespace="k8s.io",op="Discard"} 0
...
...

2.2 - Custom Certificate Authorities

How to supply custom certificate authorities

Appending the Certificate Authority

Put into each machine the PEM encoded certificate:

machine:
  ...
  files:
    - content: |
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----        
      permissions: 0644
      path: /etc/ssl/certs/ca-certificates
      op: append

2.3 - Disk Encryption

Guide on using system disk encryption

It is possible to enable encryption for system disks at the OS level. As of this writing, only STATE and EPHEMERAL partitions can be encrypted. STATE contains the most sensitive node data: secrets and certs. EPHEMERAL partition may contain some sensitive workload data. Data is encrypted using LUKS2, which is provided by the Linux kernel modules and cryptsetup utility. The operating system will run additional setup steps when encryption is enabled.

If the disk encryption is enabled for the STATE partition, the system will:

  • Save STATE encryption config as JSON in the META partition.
  • Before mounting the STATE partition, load encryption configs either from the machine config or from the META partition. Note that the machine config is always preferred over the META one.
  • Before mounting the STATE partition, format and encrypt it. This occurs only if the STATE partition is empty and has no filesystem.

If the disk encryption is enabled for the EPHEMERAL partition, the system will:

  • Get the encryption config from the machine config.
  • Before mounting the EPHEMERAL partition, encrypt and format it. This occurs only if the EPHEMERAL partition is empty and has no filesystem.

Configuration

Right now this encryption is disabled by default. To enable disk encryption you should modify the machine configuration with the following options:

machine:
  ...
  systemDiskEncryption:
    ephemeral:
      provider: luks2
      keys:
        - nodeID: {}
          slot: 0
    state:
      provider: luks2
      keys:
        - nodeID: {}
          slot: 0

Encryption Keys

Note: What the LUKS2 docs call “keys” are, in reality, a passphrase. When this passphrase is added, LUKS2 runs argon2 to create an actual key from that passphrase.

LUKS2 supports up to 32 encryption keys and it is possible to specify all of them in the machine configuration. Talos always tries to sync the keys list defined in the machine config with the actual keys defined for the LUKS2 partition. So if you update the keys list you should have at least one key that is not changed to be used for keys management.

When you define a key you should specify the key kind and the slot:

machine:
  ...
  state:
    keys:
      - nodeID: {} # key kind
        slot: 1

  ephemeral:
    keys:
      - static:
          passphrase: supersecret
        slot: 0

Take a note that key order does not play any role on which key slot is used. Every key must always have a slot defined.

Encryption Key Kinds

Talos supports two kinds of keys:

  • nodeID which is generated using the node UUID and the partition label (note that if the node UUID is not really random it will fail the entropy check).
  • static which you define right in the configuration.

Note: Use static keys only if your STATE partition is encrypted and only for the EPHEMERAL partition. For the STATE partition it will be stored in the META partition, which is not encrypted.

Key Rotation

It is necessary to do talosctl apply-config a couple of times to rotate keys, since there is a need to always maintain a single working key while changing the other keys around it.

So, for example, first add a new key:

machine:
  ...
  ephemeral:
    keys:
      - static:
          passphrase: oldkey
        slot: 0
      - static:
          passphrase: newkey
        slot: 1
  ...

Run:

talosctl apply-config -n <node> -f config.yaml

Then remove the old key:

machine:
  ...
  ephemeral:
    keys:
      - static:
          passphrase: newkey
        slot: 1
  ...

Run:

talosctl apply-config -n <node> -f config.yaml

Going from Unencrypted to Encrypted and Vice Versa

Ephemeral Partition

There is no in-place encryption support for the partitions right now, so to avoid losing any data only empty partitions can be encrypted.

As such, migration from unencrypted to encrypted needs some additional handling, especially around explicitly wiping partitions.

  • apply-config should be called with --mode=staged.
  • Partition should be wiped after apply-config, but before the reboot.

Edit your machine config and add the encryption configuration:

vim config.yaml

Apply the configuration with --mode=staged:

talosctl apply-config -f config.yaml -n <node ip> --mode=staged

Wipe the partition you’re going to encrypt:

talosctl reset --system-labels-to-wipe EPHEMERAL -n <node ip> --reboot=true

That’s it! After you run the last command, the partition will be wiped and the node will reboot. During the next boot the system will encrypt the partition.

State Partition

Calling wipe against the STATE partition will make the node lose the config, so the previous flow is not going to work.

The flow should be to first wipe the STATE partition:

talosctl reset  --system-labels-to-wipe STATE -n <node ip> --reboot=true

Node will enter into maintenance mode, then run apply-config with --insecure flag:

talosctl apply-config --insecure -n <node ip> -f config.yaml

After installation is complete the node should encrypt the STATE partition.

2.4 - Editing Machine Configuration

How to edit and patch Talos machine configuration, with reboot, immediately, or stage update on reboot.

Talos node state is fully defined by machine configuration. Initial configuration is delivered to the node at bootstrap time, but configuration can be updated while the node is running.

Note: Be sure that config is persisted so that configuration updates are not overwritten on reboots. Configuration persistence was enabled by default since Talos 0.5 (persist: true in machine configuration).

There are three talosctl commands which facilitate machine configuration updates:

  • talosctl apply-config to apply configuration from the file
  • talosctl edit machineconfig to launch an editor with existing node configuration, make changes and apply configuration back
  • talosctl patch machineconfig to apply automated machine configuration via JSON patch

Each of these commands can operate in one of four modes:

  • apply change in automatic mode(default): reboot if the change can’t be applied without a reboot, otherwise apply the change immediately
  • apply change with a reboot (--mode=reboot): update configuration, reboot Talos node to apply configuration change
  • apply change immediately (--mode=no-reboot flag): change is applied immediately without a reboot, fails if the change contains any fields that can not be updated without a reboot
  • apply change on next reboot (--mode=staged): change is staged to be applied after a reboot, but node is not rebooted
  • apply change in the interactive mode (--mode=interactive; only for talosctl apply-config): launches TUI based interactive installer

Note: applying change on next reboot (--mode=staged) doesn’t modify current node configuration, so next call to talosctl edit machineconfig --mode=staged will not see changes

Additionally, there is also talosctl get machineconfig, which retrieves the current node configuration API resource and contains the machine configuration in the .spec field. It can be used to modify the configuration locally before being applied to the node.

The list of config changes allowed to be applied immediately in Talos v1.1.1:

  • .debug
  • .cluster
  • .machine.time
  • .machine.certCANs
  • .machine.install (configuration is only applied during install/upgrade)
  • .machine.network
  • .machine.sysfs
  • .machine.sysctls
  • .machine.logging
  • .machine.controlplane
  • .machine.kubelet
  • .machine.pods
  • .machine.kernel
  • .machine.registries (CRI containerd plugin will not pick up the registry authentication settings without a reboot)

talosctl apply-config

This command is mostly used to submit initial machine configuration to the node (generated by talosctl gen config). It can be used to apply new configuration from the file to the running node as well, but most of the time it’s not convenient, as it doesn’t operate on the current node machine configuration.

Example:

talosctl -n <IP> apply-config -f config.yaml

Command apply-config can also be invoked as apply machineconfig:

talosctl -n <IP> apply machineconfig -f config.yaml

Applying machine configuration immediately (without a reboot):

talosctl -n IP apply machineconfig -f config.yaml --mode=no-reboot

Starting the interactive installer:

talosctl -n IP apply machineconfig --mode=interactive

Note: when a Talos node is running in the maintenance mode it’s necessary to provide --insecure (-i) flag to connect to the API and apply the config.

taloctl edit machineconfig

Command talosctl edit loads current machine configuration from the node and launches configured editor to modify the config. If config hasn’t been changed in the editor (or if updated config is empty), update is not applied.

Note: Talos uses environment variables TALOS_EDITOR, EDITOR to pick up the editor preference. If environment variables are missing, vi editor is used by default.

Example:

talosctl -n <IP> edit machineconfig

Configuration can be edited for multiple nodes if multiple IP addresses are specified:

talosctl -n <IP1>,<IP2>,... edit machineconfig

Applying machine configuration change immediately (without a reboot):

talosctl -n <IP> edit machineconfig --mode=no-reboot

talosctl patch machineconfig

Command talosctl patch works similar to talosctl edit command - it loads current machine configuration, but instead of launching configured editor it applies a set of JSON patches to the configuration and writes the result back to the node.

Example, updating kubelet version (in auto mode):

$ talosctl -n <IP> patch machineconfig -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.24.2"}]'
patched mc at the node <IP>

Updating kube-apiserver version in immediate mode (without a reboot):

$ talosctl -n <IP> patch machineconfig --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "k8s.gcr.io/kube-apiserver:v1.24.2"}]'
patched mc at the node <IP>

A patch might be applied to multiple nodes when multiple IPs are specified:

talosctl -n <IP1>,<IP2>,... patch machineconfig -p '[{...}]'

Patches can also be sourced from files using @file syntax:

talosctl -n <IP> patch machineconfig -p @kubelet-patch.json -p @manifest-patch.json

It might be easier to store patches in YAML format vs. the default JSON format. Talos can detect file format automatically:

# kubelet-patch.yaml
- op: replace
  path: /machine/kubelet/image
  value: ghcr.io/siderolabs/kubelet:v1.24.2
talosctl -n <IP> patch machineconfig -p @kubelet-patch.yaml

Recovering from Node Boot Failures

If a Talos node fails to boot because of wrong configuration (for example, control plane endpoint is incorrect), configuration can be updated to fix the issue.

2.5 - Logging

Dealing with Talos Linux logs.

Viewing logs

Kernel messages can be retrieved with talosctl dmesg command:

$ talosctl -n 172.20.1.2 dmesg

172.20.1.2: kern:    info: [2021-11-10T10:09:37.662764956Z]: Command line: init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 random.trust_cpu=on printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 console=ttyS0 reboot=k panic=1 talos.shutdown=halt talos.platform=metal talos.config=http://172.20.1.1:40101/config.yaml
[...]

Service logs can be retrieved with talosctl logs command:

$ talosctl -n 172.20.1.2 services

NODE         SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
172.20.1.2   apid         Running   OK       19m27s ago    Health check successful
172.20.1.2   containerd   Running   OK       19m29s ago    Health check successful
172.20.1.2   cri          Running   OK       19m27s ago    Health check successful
172.20.1.2   etcd         Running   OK       19m22s ago    Health check successful
172.20.1.2   kubelet      Running   OK       19m20s ago    Health check successful
172.20.1.2   machined     Running   ?        19m30s ago    Service started as goroutine
172.20.1.2   trustd       Running   OK       19m27s ago    Health check successful
172.20.1.2   udevd        Running   OK       19m28s ago    Health check successful

$ talosctl -n 172.20.1.2 logs machined

172.20.1.2: [talos] task setupLogger (1/1): done, 106.109µs
172.20.1.2: [talos] phase logger (1/7): done, 564.476µs
[...]

Container logs for Kubernetes pods can be retrieved with talosctl logs -k command:

$ talosctl -n 172.20.1.2 containers -k
NODE         NAMESPACE   ID                                                 IMAGE                                                         PID    STATUS
172.20.1.2   k8s.io      kube-system/kube-flannel-dk6d5                     k8s.gcr.io/pause:3.5                                          1329   SANDBOX_READY
172.20.1.2   k8s.io      └─ kube-system/kube-flannel-dk6d5:install-cni      ghcr.io/siderolabs/install-cni:v0.7.0-alpha.0-1-g2bb2efc      0      CONTAINER_EXITED
172.20.1.2   k8s.io      └─ kube-system/kube-flannel-dk6d5:install-config   quay.io/coreos/flannel:v0.13.0                                0      CONTAINER_EXITED
172.20.1.2   k8s.io      └─ kube-system/kube-flannel-dk6d5:kube-flannel     quay.io/coreos/flannel:v0.13.0                                1610   CONTAINER_RUNNING
172.20.1.2   k8s.io      kube-system/kube-proxy-gfkqj                       k8s.gcr.io/pause:3.5                                          1311   SANDBOX_READY
172.20.1.2   k8s.io      └─ kube-system/kube-proxy-gfkqj:kube-proxy         k8s.gcr.io/kube-proxy:v1.24.2                                 1379   CONTAINER_RUNNING

$ talosctl -n 172.20.1.2 logs -k kube-system/kube-proxy-gfkqj:kube-proxy
172.20.1.2: 2021-11-30T19:13:20.567825192Z stderr F I1130 19:13:20.567737       1 server_others.go:138] "Detected node IP" address="172.20.0.3"
172.20.1.2: 2021-11-30T19:13:20.599684397Z stderr F I1130 19:13:20.599613       1 server_others.go:206] "Using iptables Proxier"
[...]

Sending logs

Service logs

You can enable logs sendings in machine configuration:

machine:
  logging:
    destinations:
      - endpoint: "udp://127.0.0.1:12345/"
        format: "json_lines"
      - endpoint: "tcp://host:5044/"
        format: "json_lines"

Several destinations can be specified. Supported protocols are UDP and TCP. The only currently supported format is json_lines:

{
  "msg": "[talos] apply config request: immediate true, on reboot false",
  "talos-level": "info",
  "talos-service": "machined",
  "talos-time": "2021-11-10T10:48:49.294858021Z"
}

Messages are newline-separated when sent over TCP. Over UDP messages are sent with one message per packet. msg, talos-level, talos-service, and talos-time fields are always present; there may be additional fields.

Kernel logs

Kernel log delivery can be enabled with the talos.logging.kernel kernel command line argument, which can be specified in the .machine.installer.extraKernelArgs:

machine:
  install:
    extraKernelArgs:
      - talos.logging.kernel=tcp://host:5044/

Kernel log destination is specified in the same way as service log endpoint. The only supported format is json_lines.

Sample message:

{
  "clock":6252819, // time relative to the kernel boot time
  "facility":"user",
  "msg":"[talos] task startAllServices (1/1): waiting for 6 services\n",
  "priority":"warning",
  "seq":711,
  "talos-level":"warn", // Talos-translated `priority` into common logging level
  "talos-time":"2021-11-26T16:53:21.3258698Z" // Talos-translated `clock` using current time
}

extraKernelArgs in the machine configuration are only applied on Talos upgrades, not just by applying the config. (Upgrading to the same version is fine).

Filebeat example

To forward logs to other Log collection services, one way to do this is sending them to a Filebeat running in the cluster itself (in the host network), which takes care of forwarding it to other endpoints (and the necessary transformations).

If Elastic Cloud on Kubernetes is being used, the following Beat (custom resource) configuration might be helpful:

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: talos
spec:
  type: filebeat
  version: 7.15.1
  elasticsearchRef:
    name: talos
  config:
    filebeat.inputs:
      - type: "udp"
        host: "127.0.0.1:12345"
        processors:
          - decode_json_fields:
              fields: ["message"]
              target: ""
          - timestamp:
              field: "talos-time"
              layouts:
                - "2006-01-02T15:04:05.999999999Z07:00"
          - drop_fields:
              fields: ["message", "talos-time"]
          - rename:
              fields:
                - from: "msg"
                  to: "message"

  daemonSet:
    updateStrategy:
      rollingUpdate:
        maxUnavailable: 100%
    podTemplate:
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        securityContext:
          runAsUser: 0
        containers:
          - name: filebeat
            ports:
              - protocol: UDP
                containerPort: 12345
                hostPort: 12345

The input configuration ensures that messages and timestamps are extracted properly. Refer to the Filebeat documentation on how to forward logs to other outputs.

Also note the hostNetwork: true in the daemonSet configuration.

This ensures filebeat uses the host network, and listens on 127.0.0.1:12345 (UDP) on every machine, which can then be specified as a logging endpoint in the machine configuration.

Fluent-bit example

First, we’ll create a value file for the fluentd-bit Helm chart.

# fluentd-bit.yaml

podAnnotations:
  fluentbit.io/exclude: 'true'

extraPorts:
  - port: 12345
    containerPort: 12345
    protocol: TCP
    name: talos

config:
  service: |
    [SERVICE]
      Flush         5
      Daemon        Off
      Log_Level     warn
      Parsers_File  custom_parsers.conf    

  inputs: |
    [INPUT]
      Name          tcp
      Listen        0.0.0.0
      Port          12345
      Format        json
      Tag           talos.*

    [INPUT]
      Name          tail
      Alias         kubernetes
      Path          /var/log/containers/*.log
      Parser        containerd
      Tag           kubernetes.*

    [INPUT]
      Name          tail
      Alias         audit
      Path          /var/log/audit/kube/*.log
      Parser        audit
      Tag           audit.*    

  filters: |
    [FILTER]
      Name                kubernetes
      Alias               kubernetes
      Match               kubernetes.*
      Kube_Tag_Prefix     kubernetes.var.log.containers.
      Use_Kubelet         Off
      Merge_Log           On
      Merge_Log_Trim      On
      Keep_Log            Off
      K8S-Logging.Parser  Off
      K8S-Logging.Exclude On
      Annotations         Off
      Labels              On

    [FILTER]
      Name          modify
      Match         kubernetes.*
      Add           source kubernetes
      Remove        logtag    

  customParsers: |
    [PARSER]
      Name          audit
      Format        json
      Time_Key      requestReceivedTimestamp
      Time_Format   %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
      Name          containerd
      Format        regex
      Regex         ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
      Time_Key      time
      Time_Format   %Y-%m-%dT%H:%M:%S.%L%z    

  outputs: |
    [OUTPUT]
      Name    stdout
      Alias   stdout
      Match   *
      Format  json_lines    

  # If you wish to ship directly to Loki from Fluentbit,
  # Uncomment the following output, updating the Host with your Loki DNS/IP info as necessary.
  # [OUTPUT]
  # Name loki
  # Match *
  # Host loki.loki.svc
  # Port 3100
  # Labels job=fluentbit
  # Auto_Kubernetes_Labels on

daemonSetVolumes:
  - name: varlog
    hostPath:
      path: /var/log

daemonSetVolumeMounts:
  - name: varlog
    mountPath: /var/log

tolerations:
  - operator: Exists
    effect: NoSchedule

Next, we will add the helm repo for FluentBit, and deploy it to the cluster.

helm repo add fluent https://fluent.github.io/helm-charts
helm upgrade -i --namespace=kube-system -f fluentd-bit.yaml fluent-bit fluent/fluent-bit

Now we need to find the service IP.

$ kubectl -n kube-system get svc -l app.kubernetes.io/name=fluent-bit

NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
fluent-bit   ClusterIP   10.200.0.138   <none>        2020/TCP,5170/TCP   108m

Finally, we will change talos log destination with the command talosctl edit mc.

machine:
  logging:
    destinations:
      - endpoint: "tcp://10.200.0.138:5170"
        format: "json_lines"

This example configuration was well tested with Cilium CNI, and it should work with iptables/ipvs based CNI plugins too.

Vector example

Vector is a lightweight observability pipeline ideal for a Kubernetes environment. It can ingest (source) logs from multiple sources, perform remapping on the logs (transform), and forward the resulting pipeline to multiple destinations (sinks). As it is an end to end platform, it can be run as a single-deployment ‘aggregator’ as well as a replicaSet of ‘Agents’ that run on each node.

As Talos can be set as above to send logs to a destination, we can run Vector as an Aggregator, and forward both kernel and service to a UDP socket in-cluster.

Below is an excerpt of a source/sink setup for Talos, with a ‘sink’ destination of an in-cluster Grafana Loki log aggregation service. As Loki can create labels from the log input, we have set up the Loki sink to create labels based on the host IP, service and facility of the inbound logs.

Note that a method of exposing the Vector service will be required which may vary depending on your setup - a LoadBalancer is a good option.

role: "Stateless-Aggregator"

# Sources
sources:
  talos_kernel_logs:
    address: 0.0.0.0:6050
    type: socket
    mode: udp
    max_length: 102400
    decoding:
      codec: json
    host_key: __host

  talos_service_logs:
    address: 0.0.0.0:6051
    type: socket
    mode: udp
    max_length: 102400
    decoding:
      codec: json
    host_key: __host

# Sinks
sinks:
  talos_kernel:
    type: loki
    inputs:
      - talos_kernel_logs_xform
    endpoint: http://loki.system-monitoring:3100
    encoding:
      codec: json
      except_fields:
        - __host
    batch:
      max_bytes: 1048576
    out_of_order_action: rewrite_timestamp
    labels:
      hostname: >-
                {{`{{ __host }}`}}
      facility: >-
                {{`{{ facility }}`}}

  talos_service:
    type: loki
    inputs:
      - talos_service_logs_xform
    endpoint: http://loki.system-monitoring:3100
    encoding:
      codec: json
      except_fields:
        - __host
    batch:
      max_bytes: 400000
    out_of_order_action: rewrite_timestamp
    labels:
      hostname: >-
                {{`{{ __host }}`}}
      service: >-
                {{`{{ "talos-service" }}`}}

2.6 - Managing PKI

How to manage Public Key Infrastructure

Generating an Administrator Key Pair

In order to create a key pair, you will need the root CA.

Save the CA public key, and CA private key as ca.crt, and ca.key respectively. Now, run the following commands to generate a certificate:

talosctl gen key --name admin
talosctl gen csr --key admin.key --ip 127.0.0.1
talosctl gen crt --ca ca --csr admin.csr --name admin

Now, base64 encode admin.crt, and admin.key:

cat admin.crt | base64
cat admin.key | base64

You can now set the crt and key fields in the talosconfig to the base64 encoded strings.

Renewing an Expired Administrator Certificate

In order to renew the certificate, you will need the root CA, and the admin private key. The base64 encoded key can be found in any one of the control plane node’s configuration file. Where it is exactly will depend on the specific version of the configuration file you are using.

Save the CA public key, CA private key, and admin private key as ca.crt, ca.key, and admin.key respectively. Now, run the following commands to generate a certificate:

talosctl gen csr --key admin.key --ip 127.0.0.1
talosctl gen crt --ca ca --csr admin.csr --name admin

You should see admin.crt in your current directory. Now, base64 encode admin.crt:

cat admin.crt | base64

You can now set the certificate in the talosconfig to the base64 encoded string.

2.7 - NVIDIA GPU

In this guide we’ll follow the procedure to support NVIDIA GPU on Talos.

Enabling NVIDIA GPU support on Talos is bound by NVIDIA EULA Talos GPU support is an alpha feature.

These are the steps to enabling NVIDIA support in Talos.

  • Talos pre-installed on a node with NVIDIA GPU installed.
  • Building a custom Talos installer image with NVIDIA modules
  • Building NVIDIA container toolkit system extension which allows to register a custom runtime with containerd
  • Upgrading Talos with the custom installer and enabling NVIDIA modules and the system extension

Both these components require that the user build and maintain their own Talos installer image and the NVIDIA container toolkit Talos System Extension.

Prerequisites

This guide assumes the user has access to a container registry with push permissions, docker installed on the build machine and the Talos host has pull access to the container registry.

Set the local registry and username environment variables:

export USERNAME=<username>
export REGISTRY=<registry>

For eg:

export USERNAME=talos-user
export REGISTRY=ghcr.io

The examples below will use the sample variables set above. Modify accordingly for your environment.

Building the installer image

Start by cloning the pkgs repository.

Now run the following command to build and push custom Talos kernel image and the NVIDIA image with the NVIDIA kernel modules signed by the kernel built along with it.

make kernel nonfree-kmod-nvidia PLATFORM=linux/amd64 PUSH=true

Replace the platform with linux/arm64 if building for ARM64

Now we need to create a custom Talos installer image.

Start by creating a Dockerfile with the following content:

FROM scratch as customization
COPY --from=ghcr.io/talos-user/nonfree-kmod-nvidia:v1.1.1-nvidia /lib/modules /lib/modules

FROM ghcr.io/siderolabs/installer:v1.1.1
COPY --from=ghcr.io/talos-user/kernel:v1.1.1-nvidia /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz

Now build the image and push it to the registry.

DOCKER_BUILDKIT=0 docker build --squash --build-arg RM="/lib/modules" -t ghcr.io/talos-user/installer:v1.1.1-nvidia .
docker push ghcr.io/talos-user/installer:v1.1.1-nvidia

Note: buildkit has a bug #816, to disable it use DOCKER_BUILDKIT=0

Building the system extension

Start by cloning the extensions repository.

Now run the following command to build and push the system extension.

make nvidia-container-toolkit PLATFORM=linux/amd64 PUSH=true TAG=510.60.02-v1.9.0

Replace the platform with linux/arm64 if building for ARM64

Upgrading Talos and enabling the NVIDIA modules and the system extension

Make sure to use talosctl version v1.1.1 or later

First create a patch yaml gpu-worker-patch.yaml to update the machine config similar to below:

- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/talos-user/nvidia-container-toolkit:510.60.02-v1.9.0
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1

Now apply the patch to all Talos nodes in the cluster having NVIDIA GPU’s installed:

talosctl patch mc --patch @gpu-worker-patch.yaml

Now we can proceed to upgrading Talos with the installer built previously:

talosctl upgrade --image=ghcr.io/talos-user/installer:v1.1.1-nvidia

Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.

This can be confirmed by running:

talosctl read /proc/modules

which should produce an output similar to below:

nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
talosctl get extensions

which should produce an output similar to below:

NODE           NAMESPACE   TYPE              ID                                                                 VERSION   NAME                       VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0       1         nvidia-container-toolkit   510.60.02-v1.9.0
talosctl read /proc/driver/nvidia/version

which should produce an output similar to below:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  510.60.02  Wed Mar 16 11:24:05 UTC 2022
GCC version:  gcc version 11.2.0 (GCC)

Deploying NVIDIA device plugin

First we need to create the RuntimeClass

Apply the following manifest to create a runtime class that uses the extension:

---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

Install the NVIDIA device plugin:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.11.0 --set=runtimeClassName=nvidia

Apply the following manifest to run CUDA pod via nvidia runtime:

cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda11.6.0"
    resources:
      limits:
         nvidia.com/gpu: 1
<<EOF

The status can be viewed by running:

kubectl get pods

which should produce an output similar to below:

NAME                READY   STATUS      RESTARTS   AGE
gpu-operator-test   0/1     Completed   0          13s
kubectl logs gpu-operator-test

which should produce an output similar to below:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

2.8 - Pull Through Image Cache

How to set up local transparent container images caches.

In this guide we will create a set of local caching Docker registry proxies to minimize local cluster startup time.

When running Talos locally, pulling images from Docker registries might take a significant amount of time. We spin up local caching pass-through registries to cache images and configure a local Talos cluster to use those proxies. A similar approach might be used to run Talos in production in air-gapped environments. It can be also used to verify that all the images are available in local registries.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Requirements

The follow are requirements for creating the set of caching proxies:

  • Docker 18.03 or greater
  • Local cluster requirements for either docker or QEMU.

Launch the Caching Docker Registry Proxies

Talos pulls from docker.io, k8s.gcr.io, quay.io, gcr.io, and ghcr.io by default. If your configuration is different, you might need to modify the commands below:

docker run -d -p 5000:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    --restart always \
    --name registry-docker.io registry:2

docker run -d -p 5001:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://k8s.gcr.io \
    --restart always \
    --name registry-k8s.gcr.io registry:2

docker run -d -p 5002:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://quay.io \
    --restart always \
    --name registry-quay.io registry:2.5

docker run -d -p 5003:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://gcr.io \
    --restart always \
    --name registry-gcr.io registry:2

docker run -d -p 5004:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://ghcr.io \
    --restart always \
    --name registry-ghcr.io registry:2

Note: Proxies are started as docker containers, and they’re automatically configured to start with Docker daemon. Please note that quay.io proxy doesn’t support recent Docker image schema, so we run older registry image version (2.5).

As a registry container can only handle a single upstream Docker registry, we launch a container per upstream, each on its own host port (5000, 5001, 5002, 5003 and 5004).

Using Caching Registries with QEMU Local Cluster

With a QEMU local cluster, a bridge interface is created on the host. As registry containers expose their ports on the host, we can use bridge IP to direct proxy requests.

sudo talosctl cluster create --provisioner qemu \
    --registry-mirror docker.io=http://10.5.0.1:5000 \
    --registry-mirror k8s.gcr.io=http://10.5.0.1:5001 \
    --registry-mirror quay.io=http://10.5.0.1:5002 \
    --registry-mirror gcr.io=http://10.5.0.1:5003 \
    --registry-mirror ghcr.io=http://10.5.0.1:5004

The Talos local cluster should now start pulling via caching registries. This can be verified via registry logs, e.g. docker logs -f registry-docker.io. The first time cluster boots, images are pulled and cached, so next cluster boot should be much faster.

Note: 10.5.0.1 is a bridge IP with default network (10.5.0.0/24), if using custom --cidr, value should be adjusted accordingly.

Using Caching Registries with docker Local Cluster

With a docker local cluster we can use docker bridge IP, default value for that IP is 172.17.0.1. On Linux, the docker bridge address can be inspected with ip addr show docker0.

talosctl cluster create --provisioner docker \
    --registry-mirror docker.io=http://172.17.0.1:5000 \
    --registry-mirror k8s.gcr.io=http://172.17.0.1:5001 \
    --registry-mirror quay.io=http://172.17.0.1:5002 \
    --registry-mirror gcr.io=http://172.17.0.1:5003 \
    --registry-mirror ghcr.io=http://172.17.0.1:5004

Cleaning Up

To cleanup, run:

docker rm -f registry-docker.io
docker rm -f registry-k8s.gcr.io
docker rm -f registry-quay.io
docker rm -f registry-gcr.io
docker rm -f registry-ghcr.io

Note: Removing docker registry containers also removes the image cache. So if you plan to use caching registries, keep the containers running.

2.9 - Role-based access control (RBAC)

Set up RBAC on the Talos Linux API.

Talos v0.11 introduced initial support for role-based access control (RBAC). This guide will explain what that is and how to enable it without losing access to the cluster.

RBAC in Talos

Talos uses certificates to authorize users. The certificate subject’s organization field is used to encode user roles. There is a set of predefined roles that allow access to different API methods:

  • os:admin grants access to all methods;
  • os:reader grants access to “safe” methods (for example, that includes the ability to list files, but does not include the ability to read files content);
  • os:etcd:backup grants access to /machine.MachineService/EtcdSnapshot method.

Roles in the current talosconfig can be checked with the following command:

$ talosctl config info

[...]
Roles:               os:admin
[...]

RBAC is enabled by default in new clusters created with talosctl v0.11+ and disabled otherwise.

Enabling RBAC

First, both the Talos cluster and talosctl tool should be upgraded. Then the talosctl config new command should be used to generate a new client configuration with the os:admin role. Additional configurations and certificates for different roles can be generated by passing --roles flag:

talosctl config new --roles=os:reader reader

That command will create a new client configuration file reader with a new certificate with os:reader role.

After that, RBAC should be enabled in the machine configuration:

machine:
  features:
    rbac: true

2.10 - System Extensions

Customizing the Talos Linux immutable root file system.

System extensions allow extending the Talos root filesystem, which enables a variety of features, such as including custom container runtimes, loading additional firmware, etc.

System extensions are only activated during the installation or upgrade of Talos Linux. With system extensions installed, the Talos root filesystem is still immutable and read-only.

Configuration

System extensions are configured in the .machine.install section:

machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/gvisor:33f613e

During the initial install (e.g. when PXE booting or booting from an ISO), Talos will pull down container images for system extensions, validate them, and include them into the Talos initramfs image. System extensions will be activated on boot and overlaid on top of the Talos root filesystem.

In order to update the system extensions for a running instance, update .machine.install.extensions and upgrade Talos. (Note: upgrading to the same version of Talos is fine).

Building a Talos Image with System Extensions

System extensions can be installed into the Talos disk image (e.g. AWS AMI or VMWare OVF) by running the following command to generate the image from the Talos source tree:

make image-metal IMAGER_SYSTEM_EXTENSIONS="ghcr.io/siderolabs/amd-ucode:20220411 ghcr.io/siderolabs/gvisor:20220405.0-v1.0.0-10-g82b41ad"

Authoring System Extensions

A Talos system extension is a container image with the specific folder structure. System extensions can be built and managed using any tool that produces container images, e.g. docker build.

Sidero Labs maintains a repository of system extensions.

Resource Definitions

Use talosctl get extensions to get a list of system extensions:

$ talosctl get extensions
NODE         NAMESPACE   TYPE              ID                                              VERSION   NAME          VERSION
172.20.0.2   runtime     ExtensionStatus   000.ghcr.io-talos-systems-gvisor-54b831d        1         gvisor        20220117.0-v1.0.0
172.20.0.2   runtime     ExtensionStatus   001.ghcr.io-talos-systems-intel-ucode-54b831d   1         intel-ucode   microcode-20210608-v1.0.0

Use YAML or JSON format to see additional details about the extension:

$ talosctl -n 172.20.0.2 get extensions 001.ghcr.io-talos-systems-intel-ucode-54b831d -o yaml
node: 172.20.0.2
metadata:
    namespace: runtime
    type: ExtensionStatuses.runtime.talos.dev
    id: 001.ghcr.io-talos-systems-intel-ucode-54b831d
    version: 1
    owner: runtime.ExtensionStatusController
    phase: running
    created: 2022-02-10T18:25:04Z
    updated: 2022-02-10T18:25:04Z
spec:
    image: 001.ghcr.io-talos-systems-intel-ucode-54b831d.sqsh
    metadata:
        name: intel-ucode
        version: microcode-20210608-v1.0.0
        author: Spencer Smith
        description: |
            This system extension provides Intel microcode binaries.
        compatibility:
            talos:
                version: '>= v1.0.0'

Example: gVisor

See readme of the gVisor extension.

3 - Network

Set up networking layers for Talos Linux

3.1 - Corporate Proxies

How to configure Talos Linux to use proxies in a corporate environment

Appending the Certificate Authority of MITM Proxies

Put into each machine the PEM encoded certificate:

machine:
  ...
  files:
    - content: |
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----        
      permissions: 0644
      path: /etc/ssl/certs/ca-certificates
      op: append

Configuring a Machine to Use the Proxy

To make use of a proxy:

machine:
  env:
    http_proxy: <http proxy>
    https_proxy: <https proxy>
    no_proxy: <no proxy>

Additionally, configure the DNS nameservers, and NTP servers:

machine:
  env:
  ...
  time:
    servers:
      - <server 1>
      - <server ...>
      - <server n>
  ...
  network:
    nameservers:
      - <ip 1>
      - <ip ...>
      - <ip n>

3.2 - Network Device Selector

How to configure network devices by selecting them using hardware information

Configuring Network Device Using Device Selector

deviceSelector is an alternative method of configuring a network device:

machine:
  ...
  network:
    interfaces:
      - deviceSelector:
          driver: virtio
          hardwareAddr: "00:00:*"
        address: 192.168.88.21

Selector has the following traits:

  • qualifiers match a device by reading the hardware information in /sys/class/net/...
  • qualifiers are applied using logical AND
  • machine.network.interfaces.deviceConfig option is mutually exclusive with machine.network.interfaces.interface
  • the selector is invalid when it matches multiple devices, the controller will fail and won’t create any devices for the malformed selector

The available hardware information used in the selector can be observed in the LinkStatus resource (works in maintenance mode):

# talosctl get links eth0 -o yaml
spec:
  ...
  hardwareAddr: 4e:95:8e:8f:e4:47
  busPath: 0000:06:00.0
  driver: alx
  pciID: 1969:E0B1

Use Case

machine.network.interfaces.interface name is generated by the Linux kernel and can be changed after a reboot. Device names can change when the system has several interfaces of the same kind, e.g: eth0, eth1.

In that case pinning it to hardwareAddress will make Talos reliably configure the device even when interface name changes.

3.3 - Virtual (shared) IP

Using Talos Linux to set up a floating virtual IP address for cluster access.

One of the biggest pain points when building a high-availability controlplane is giving clients a single IP or URL at which they can reach any of the controlplane nodes. The most common approaches all require external resources: reverse proxy, load balancer, BGP, and DNS.

Using a “Virtual” IP address, on the other hand, provides high availability without external coordination or resources, so long as the controlplane members share a layer 2 network. In practical terms, this means that they are all connected via a switch, with no router in between them.

The term “virtual” is misleading here. The IP address is real, and it is assigned to an interface. Instead, what actually happens is that the controlplane machines vie for control of the shared IP address using etcd elections. There can be only one owner of the IP address at any given time, but if that owner disappears or becomes non-responsive, another owner will be chosen, and it will take up the mantle: the IP address.

Talos has (as of version 0.9) built-in support for this form of shared IP address, and it can utilize this for both the Kubernetes API server and the Talos endpoint set. Talos uses etcd for elections and leadership (control) of the IP address. It is not reccomended to use a virtual IP to access the API of Talos itself, since the node using the shared IP is not deterministic and could change.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Choose your Shared IP

To begin with, you should choose your shared IP address. It should generally be a reserved, unused IP address in the same subnet as your controlplane nodes. It should not be assigned or assignable by your DHCP server.

For our example, we will assume that the controlplane nodes have the following IP addresses:

  • 192.168.0.10
  • 192.168.0.11
  • 192.168.0.12

We then choose our shared IP to be:

192.168.0.15

Configure your Talos Machines

The shared IP setting is only valid for controlplane nodes.

For the example above, each of the controlplane nodes should have the following Machine Config snippet:

machine:
  network:
    interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: 192.168.0.15

Virtual IP’s can also be configured on a VLAN interface.

machine:
  network:
    interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: 192.168.0.15
      vlans:
        - vlanId: 100
          dhcp: true
          vip:
            ip: 192.168.1.15

Obviously, for your own environment, the interface and the DHCP setting may differ. You are free to use static addressing (cidr) instead of DHCP.

Caveats

In general, the shared IP should just work. However, since it relies on etcd for elections, the shared IP will not come alive until after you have bootstrapped Kubernetes. In general, this is not a problem, but it does mean that you cannot use the shared IP when issuing the talosctl bootstrap command. Instead, that command will need to target one of the controlplane nodes discretely.

3.4 - Wireguard Network

A guide on how to set up Wireguard network using Kernel module.

Configuring Wireguard Network

Quick Start

The quickest way to try out Wireguard is to use talosctl cluster create command:

talosctl cluster create --wireguard-cidr 10.1.0.0/24

It will automatically generate Wireguard network configuration for each node with the following network topology:

Where all controlplane nodes will be used as Wireguard servers which listen on port 51111. All controlplanes and workers will connect to all controlplanes. It also sets PersistentKeepalive to 5 seconds to establish controlplanes to workers connection.

After the cluster is deployed it should be possible to verify Wireguard network connectivity. It is possible to deploy a container with hostNetwork enabled, then do kubectl exec <container> /bin/bash and either do:

ping 10.1.0.2

Or install wireguard-tools package and run:

wg show

Wireguard show should output something like this:

interface: wg0
  public key: OMhgEvNIaEN7zeCLijRh4c+0Hwh3erjknzdyvVlrkGM=
  private key: (hidden)
  listening port: 47946

peer: 1EsxUygZo8/URWs18tqB5FW2cLVlaTA+lUisKIf8nh4=
  endpoint: 10.5.0.2:51111
  allowed ips: 10.1.0.0/24
  latest handshake: 1 minute, 55 seconds ago
  transfer: 3.17 KiB received, 3.55 KiB sent
  persistent keepalive: every 5 seconds

It is also possible to use generated configuration as a reference by pulling generated config files using:

talosctl read -n 10.5.0.2 /system/state/config.yaml > controlplane.yaml
talosctl read -n 10.5.0.3 /system/state/config.yaml > worker.yaml

Manual Configuration

All Wireguard configuration can be done by changing Talos machine config files. As an example we will use this official Wireguard quick start tutorial.

Key Generation

This part is exactly the same:

wg genkey | tee privatekey | wg pubkey > publickey

Setting up Device

Inline comments show relations between configs and wg quickstart tutorial commands:

...
network:
  interfaces:
    ...
      # ip link add dev wg0 type wireguard
    - interface: wg0
      mtu: 1500
      # ip address add dev wg0 192.168.2.1/24
      addresses:
        - 192.168.2.1/24
      # wg set wg0 listen-port 51820 private-key /path/to/private-key peer ABCDEF... allowed-ips 192.168.88.0/24 endpoint 209.202.254.14:8172
      wireguard:
        privateKey: <privatekey file contents>
        listenPort: 51820
        peers:
          allowedIPs:
            - 192.168.88.0/24
          endpoint: 209.202.254.14.8172
          publicKey: ABCDEF...
...

When networkd gets this configuration it will create the device, configure it and will bring it up (equivalent to ip link set up dev wg0).

All supported config parameters are described in the Machine Config Reference.

4 - Resetting a Machine

Steps on how to reset a Talos Linux machine to a clean state.

From time to time, it may be beneficial to reset a Talos machine to its “original” state. Bear in mind that this is a destructive action for the given machine. Doing this means removing the machine from Kubernetes, Etcd (if applicable), and clears any data on the machine that would normally persist a reboot.

CLI

WARNING: Running a talosctl reset on cloud VM’s might result in the VM being unable to boot as this wipes the entire disk. It might be more useful to just wipe the STATE and EPHEMERAL partitions on a cloud VM if not booting via iPXE. talosctl reset --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL

The API command for doing this is talosctl reset. There are a couple of flags as part of this command:

Flags:
      --graceful                        if true, attempt to cordon/drain node and leave etcd (if applicable) (default true)
      --reboot                          if true, reboot the node after resetting instead of shutting down
      --system-labels-to-wipe strings   if set, just wipe selected system disk partitions by label but keep other partitions intact keep other partitions intact

The graceful flag is especially important when considering HA vs. non-HA Talos clusters. If the machine is part of an HA cluster, a normal, graceful reset should work just fine right out of the box as long as the cluster is in a good state. However, if this is a single node cluster being used for testing purposes, a graceful reset is not an option since Etcd cannot be “left” if there is only a single member. In this case, reset should be used with --graceful=false to skip performing checks that would normally block the reset.

Kernel Parameter

Another way to reset a machine is to specify talos.experimental.wipe=system kernel parameter. If the machine got stuck in the boot loop and you access to the console you can use GRUB to specify this kernel argument. Then when Talos boots for the next time it will reset system disk and reboot.

Next steps can be to install Talos either using PXE boot or by mounting an ISO.

5 - Upgrading Talos Linux

Guide on upgrading a Talos Linux machine.

OS upgrades, like other operations on Talos Linux, are effected by an API call, which can be sent via the talosctl CLI utility. Because Talos Linux is image based, an upgrade is almost the same as installing Talos, with the difference that the system has already been initialized with a configuration.

The upgrade API call passes a node the installer image to use to perform the upgrade. Each Talos version has a corresponding installer.

Upgrades use an A-B image scheme in order to facilitate rollbacks. This scheme retains the previous Talos kernel and OS image following each upgrade. If an upgrade fails to boot, Talos will roll back to the previous version. Likewise, Talos may be manually rolled back via API (or talosctl rollback). This will simply update the boot reference and reboot.

Unless explicitly told to preserve data, an upgrade will cause the node to wipe the EPHEMERAL partition, remove itself from the etcd cluster (if it is a control node), and generally make itself as pristine as is possible. (This is generally the desired behavior, except in specialised use cases such as single-node clusters.)

Note that the default since Talos version 1.0 is to specify the Kubernetes version in the machine config. This means an upgrade of the Talos Linux OS will not apply an upgrade of the Kubernetes version by default. Kubernetes upgrades should be managed separately per upgrading kubernetes. Each release of Talos Linux includes the latest stable Kubernetes version by default.

Video Walkthrough

To see a live demo of an upgrade of Talos Linux, see the video below:

After Upgrade to v1.1.1

There are no specific actions to be taken after an upgrade.

talosctl upgrade

To upgrade a Talos node, specify the node’s IP address and the installer container image for the version of Talos to upgrade to.

For instance, if your Talos node has the IP address 10.20.30.40 and you want to install the official version v1.0.0, you would enter a command such as:

  $ talosctl upgrade --nodes 10.20.30.40 \
      --image ghcr.io/siderolabs/installer:v1.1.1

There is an option to this command: --preserve, which will explicitly tell Talos to keep ephemeral data intact. In most cases, it is correct to let Talos perform its default action of erasing the ephemeral data. However, if you are running a single-node control-plane, you will want to make sure that --preserve=true.

Rarely, a upgrade command will fail to run due to a process holding a file open on disk, or you may wish to set a node to upgrade, but delay the actual reboot as long as possible. In these cases, you can use the --stage flag. This puts the upgrade artifacts on disk, and adds some metadata to a disk partition that gets checked very early in the boot process. The node is not rebooted by the upgrade --stage process. However, whenever the system does next reboot, Talos sees that it needs to apply an upgrade, and will do so immediately. Because this occurs in a just rebooted system, there will be no conflict with any files being held open. After the upgrade is applied, the node will reboot again, in order to boot into the new version. Note that because Talos Linux now reboots via the kexec syscall, the extra reboot adds very little time.

Machine Configuration Changes

Talos 1.1.0 provides a default configuration for Pod Security Admission:

cluster:
    apiServer:
        admissionControl:
            - name: PodSecurity
              configuration:
                apiVersion: pod-security.admission.config.k8s.io/v1alpha1
                defaults:
                    audit: restricted
                    audit-version: latest
                    enforce: baseline
                    enforce-version: latest
                    warn: restricted
                    warn-version: latest
                exemptions:
                    namespaces:
                        - kube-system
                    runtimeClasses: []
                    usernames: []
                kind: PodSecurityConfiguration

Upgrade Sequence

When a Talos node receives the upgrade command, it cordons itself in Kubernetes, to avoid receiving any new workload. It then starts to drain its existing workload.

NOTE: If any of your workloads are sensitive to being shut down ungracefully, be sure to use the lifecycle.preStop Pod spec.

Once all of the workload Pods are drained, Talos will start shutting down its internal processes. If it is a control node, this will include etcd. If preserve is not enabled, Talos will leave etcd membership. (Talos ensures the etcd cluster is healthy and will remain healthy after our node leaves the etcd cluster, before allowing a control plane node to be upgraded.)

Once all the processes are stopped and the services are shut down, the filesystems will be unmounted. This allows Talos to produce a very clean upgrade, as close as possible to a pristine system. We verify the disk and then perform the actual image upgrade. We set the bootloader to boot once with the new kernel and OS image, then we reboot.

After the node comes back up and Talos verifies itself, it will make the bootloader change permanent, rejoin the cluster, and finally uncordon itself to receive new workloads.

FAQs

Q. What happens if an upgrade fails?

A. Talos Linux attempts to safely handle upgrade failures.

The most common failure is an invalid installer image reference. In this case, Talos will fail to download the upgraded image and will abort the upgrade.

Sometimes, Talos is unable to successfully kill off all of the disk access points, in which case it cannot safely unmount all filesystems to effect the upgrade. In this case, it will abort the upgrade and reboot. (upgrade --stage can ensure that upgrades can occur even when the filesytems cannot be unmounted.)

It is possible (especially with test builds) that the upgraded Talos system will fail to start. In this case, the node will be rebooted, and the bootloader will automatically use the previous Talos kernel and image, thus effectively rolling back the upgrade.

Lastly, it is possible that Talos itself will upgrade successfully, start up, and rejoin the cluster but your workload will fail to run on it, for whatever reason. This is when you would use the talosctl rollback command to revert back to the previous Talos version.

Q. Can upgrades be scheduled?

A. Because the upgrade sequence is API-driven, you can easily tie it in to your own business logic to schedule and coordinate your upgrades.

Q. Can the upgrade process be observed?

A. Yes, using the talosctl dmesg -f command.

Q. Are worker node upgrades handled differently from control plane node upgrades?

A. Short answer: no.

Long answer: Both node types follow the same set procedure. From the user’s standpoint, however, the processes are identical. However, since control plane nodes run additional services, such as etcd, there are some extra steps and checks performed on them. For instance, Talos will refuse to upgrade a control plane node if that upgrade would cause a loss of quorum for etcd. If multiple control plane nodes are asked to upgrade at the same time, Talos will protect the Kubernetes cluster by ensuring only one control plane node actively upgrades at any time, via checking etcd quorum. If running a single-node cluster, and you want to force an upgrade despite the loss of quorum, you can set preserve to true.

Q. Can I break my cluster by upgrading everything at once?

A. Maybe - it’s not recommended.

Nothing prevents the user from sending near-simultaneous upgrades to each node of the cluster - and while Talos Linux and Kubernetes can generally deal with this situation, other components of the cluster may not be able to recover from more than one node rebooting at a time. (e.g. any software that maintains a quorum or state across nodes, such as Rook/Ceph)