Validation WebHook troubleshooting, how low can you go?


I'm Alex Movergan, DevOps team lead at Altenar. I focus on automation in general and on improving troubleshooting skills within my team. In this article, I'll share a captivating tale that revolves around Kubernetes, validation webhooks, kubespray, and Calico.

Join me on this DevOps journey as we explore real-world scenarios and practical solutions, unraveling the intricacies of troubleshooting in a Kubernetes environment. Continue reading to delve into the world of Kubernetes troubleshooting.

Introduction

Every time a new engineer joins our team, they go through an onboarding process before diving into production. As part of this process, they build a small instance of our Sportsbook. Our Sportsbook runs on Kubernetes, which is deployed on several VMware clusters. This particular story takes place during the onboarding of one of our new engineers, and it highlights a few technological aspects that will help you understand our environment.

Our infrastructure is based on the following technologies:

  • The Kubernetes cluster is deployed by Kubespray on top of VMware.

  • Calico is used as the CNI, with kube-proxy in IPVS mode.

  • MetalLB provides internal load balancers.

  • Flux manages Kubernetes infrastructure settings such as network policies, namespaces, and PSS.

The task is to build a Kubernetes cluster using our internal Terraform module, which invokes the VMware provider and Kubespray under the hood. The cluster consists of six workers and one master.

Symptoms

It all began when our new engineer approached me and my senior colleagues for assistance after days of troubleshooting. Initially, the issue seemed simple: deploying the MetalLB IPAddressPool was failing with the following error:

IPAddressPool/metallb/4p-address-pool dry-run failed,
reason: InternalError, error: Internal error occurred:
        failed calling webhook "ipaddresspoolvalidationwebhook.metallb.io":
        failed to call webhook:
        Post "https://metallb-webhook-service.metallb.svc:443/validate-metallb-io-v1beta1-ipaddresspool?timeout=10s":
        context deadline exceeded

"What is happening here?" you may ask. Well, during installation, MetaLLB creates a ValidatingAdmissionWebhook. This webhook is called by the kube-api server to validate an object before creation. However, it is currently encountering a timeout, leading to the failure.

Troubleshooting

Let's start with the basics. Calling the same URL from other pods works fine, while kube-api keeps flooding its logs with the error above. This leads us to the conclusion that the MetalLB controller itself is present and serving the webhook correctly.

To isolate the issue, we eliminated other components: we scaled the Flux pods down to 0 and removed all network policies from the cluster. The result remained the same: the webhook service was reachable from other pods, but not from kube-api. Therefore, we could conclude that it was not a network policy problem.
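For reference, the isolation steps looked roughly like this (the flux-system and metallb namespace names are assumptions based on the usual defaults; adjust to your setup):

# Stop Flux so it does not reconcile anything back while we test
kubectl -n flux-system scale deployment --all --replicas=0

# List the network policies, then remove them namespace by namespace
kubectl get networkpolicy -A
kubectl -n metallb delete networkpolicy --all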

To replicate the issue and proceed further, we faced a challenge: the kube-api pod does not have a shell, so we could not simply get into it and make the same URL call. To overcome this, we added a sidecar container. Since kube-api is not a regular pod but a static pod on the master node, we SSHed into the VM and modified the /etc/kubernetes/manifests/kube-apiserver.yaml file. Specifically, we added netshoot as a sidecar container:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint: 10.202.144.170:6443
  creationTimestamp: null
  labels:
    component: kube-apiserver
    tier: control-plane
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:

  - name: netshoot
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c", "while true; do ping localhost; sleep 60;done"]

  - command:
    - kube-apiserver
  ...
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/ssl/certs

An important detail here is the hostNetwork: true attribute. It means the kube-api pod runs directly in the node's network namespace, so it uses the node's interfaces and routing table.
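If you want to confirm this quickly, the field can be read straight from the API (the component=kube-apiserver label is taken from the manifest above):

# Prints "true" for pods that share the node's network namespace
kubectl -n kube-system get pods -l component=kube-apiserver \
  -o custom-columns=NAME:.metadata.name,HOSTNET:.spec.hostNetwork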

From the sidecar, we got the same outcome: the curl call failed with a timeout:

t005xknm000:~# curl https://10.202.145.97:443
curl: (28) Failed to connect to 10.202.145.97 port 443 after 131395 ms: Couldn't connect to server
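For completeness, this is roughly how we got a shell inside the injected sidecar (the pod name is assumed to follow the usual kube-apiserver-<node-name> pattern for static pods):

# Open a shell in the netshoot sidecar of the static kube-apiserver pod
kubectl -n kube-system exec -it kube-apiserver-<node-name> -c netshoot -- bash

# From inside, repeat the webhook call with verbose output and no cert checks
curl -vk https://10.202.145.97:443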

At this point, we no longer had to rely solely on kube-api logs; we could reproduce the issue ourselves, which simplifies troubleshooting a lot. Moving forward, let's bring out the heavy artillery of network troubleshooting: tcpdump. Personally, I find tcpdump indispensable in network troubleshooting scenarios (and it always comes down to network troubleshooting, doesn't it?). Open tcpdump in one terminal tab and run the curl command in another:

t005xknm000:~# tcpdump -i vxlan.calico port 9443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vxlan.calico, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:38:19.641205 IP 10.202.146.129.59577 > 10.202.146.97.9443: Flags [S], seq 934288524, win 43690, options [mss 65495,sackOK,TS val 177509761 ecr 0,nop,wscale 10], length 0
10:38:20.666577 IP 10.202.146.129.59577 > 10.202.146.97.9443: Flags [S], seq 934288524, win 43690, options [mss 65495,sackOK,TS val 177510787 ecr 0,nop,wscale 10], length 0
10:38:22.714509 IP 10.202.146.129.59577 > 10.202.146.97.9443: Flags [S], seq 934288524, win 43690, options [mss 65495,sackOK,TS val 177512835 ecr 0,nop,wscale 10], length 0
10:39:35.197070 IP 10.202.146.129.30507 > 10.202.146.97.9443: Flags [S], seq 79391636, win 43690, options [mss 65495,sackOK,TS val 177585317 ecr 0,nop,wscale 10], length 0
10:39:36.250518 IP 10.202.146.129.30507 > 10.202.146.97.9443: Flags [S], seq 79391636, win 43690, options [mss 65495,sackOK,TS val 177586371 ecr 0,nop,wscale 10], length 0
10:41:17.163901 IP 10.202.146.129.62428 > 10.202.146.97.9443: Flags [S], seq 2885711851, win 43690, options [mss 65495,sackOK,TS val 177687284 ecr 0,nop,wscale 10], length 0
10:41:18.203494 IP 10.202.146.129.62428 > 10.202.146.97.9443: Flags [S], seq 2885711851, win 43690, options [mss 65495,sackOK,TS val 177688324 ecr 0,nop,wscale 10], length 0

If you paid close attention, you may have noticed that curl was pointed at 10.202.145.97:443, while tcpdump shows traffic to 10.202.146.97.9443. This is service-to-pod translation at work: with kube-proxy in IPVS mode, the service IP:port is converted into a pod IP:port via IPVS, ipset, and iptables. In this case, 10.202.146.97:9443 is the address and port of the metallb-controller pod.
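That mapping can be double-checked from both sides; a quick sketch (ipvsadm has to run on the node itself):

# The Endpoints object shows the pod IP:port behind the webhook Service
kubectl -n metallb get endpoints metallb-webhook-service

# On the node, the IPVS table shows the same virtual-service -> real-server mapping
ipvsadm -Ln | grep -A 2 10.202.145.97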

In conclusion, we are sending traffic but not receiving any response.

The next step was to identify the network segment where the traffic is being lost. To do that, we added a netshoot sidecar to the metallb-controller pod, allowing us to verify whether the traffic reaches the pod:

    containers:
        - name: netshoot
          image: nicolaka/netshoot
          command:
            - /bin/bash
          args:
            - '-c'
            - while true; do ping localhost; sleep 60;done

        - name: metallb-controller
          image: docker.io/bitnami/metallb-controller:0.13.7-debian-11-r29

Let's repeat the experiment, this time running tcpdump on both ends of the traffic pipeline:

On kube-api pod:

t005xknm000:~# tcpdump -i vxlan.calico port 9443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vxlan.calico, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:34:27.081229 IP 10.202.146.129.32229 > 10.202.146.97.9443: Flags [S], seq 3654443615, win 43690, options [mss 65495,sackOK,TS val 177277201 ecr 0,nop,wscale 10], length 0
10:34:28.090506 IP 10.202.146.129.32229 > 10.202.146.97.9443: Flags [S], seq 3654443615, win 43690, options [mss 65495,sackOK,TS val 177278211 ecr 0,nop,wscale 10], length 0

On metallb-controller pod:

metallb-controller-667f54487b-24bv5:~# tcpdump -nnn port 9443
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
11:18:50.381973 IP 10.202.146.129.32229 > 10.202.146.97.9443: Flags [S], seq 3654443615, win 43690, options [mss 65495,sackOK,TS val 179940505 ecr 0,nop,wscale 10], length 0
11:18:50.382000 IP 10.202.146.97.9443 > 10.202.146.129.32229: Flags [S.], seq 1413484485, ack 41975616, win 27960, options [mss 1410,sackOK,TS val 3365562262 ecr 179940505,nop,wscale 10], length 0
11:18:51.383183 IP 10.202.146.129.32229 > 10.202.146.97.9443: Flags [S], seq 3654443615, win 43690, options [mss 65495,sackOK,TS val 179941507 ecr 0,nop,wscale 10], length 0
11:18:51.383208 IP 10.202.146.97.9443 > 10.202.146.129.32229: Flags [S.], seq 1413484485, ack 41975616, win 27960, options [mss 1410,sackOK,TS val 3365563263 ecr 179940505,nop,wscale 10], length 0
11:18:52.445353 IP 10.202.146.97.9443 > 10.202.146.129.32229: Flags [S.], seq 1413484485, ack 41975616, win 27960, options [mss 1410,sackOK,TS val 3365564326 ecr 179940505,nop,wscale 10], length 0

This is where it gets intriguing. The traffic reaches its final destination, and the metallb-controller replies with a SYN-ACK every time. The problem is that those replies never make it back to the kube-api pod.
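A check worth running at this point (commands only, no captured output) is to ask the metallb-controller pod how it would route the reply back to the source address seen in the capture:

# From inside the metallb-controller pod (or its netshoot sidecar):
# which route and source address would be used for the reply?
ip route get 10.202.146.129

# And the pod's full routing table, for comparison
ip route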

After roughly two hours of trial and error, we decided to take a lunch break. We couldn't stop discussing the issue even then, and came up with a few more tests to try.

Let's try creating a simple pod with netshoot and the hostNetwork option set to true. A few Ctrl+C / Ctrl+V later, we had another static pod running on the same master node.

The YAML definition of that pod:
apiVersion: v1
kind: Pod
metadata:
  name: netshoot
  namespace: kube-system
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c", "while true; do ping localhost; sleep 60;done"]
  hostNetwork: true

We proceeded with the same steps, and as expected, curl failed with a timeout once again.

To simplify the testing further, we attempted to ping the metallb-controller from our netshoot container.

On netshoot:

t005xknm000:~# ping 10.202.146.97
PING 10.202.146.97 (10.202.146.97) 56(84) bytes of data.
^C
--- 10.202.146.97 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2079ms

On metallb-controller:

metallb-controller-667f54487b-24bv5:~# tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:33:05.680242 IP 10.202.146.129 > metallb-controller-667f54487b-24bv5: ICMP echo request, id 10, seq 2060, length 64
12:33:05.680249 IP metallb-controller-667f54487b-24bv5 > 10.202.146.129: ICMP echo reply, id 10, seq 2060, length 64
12:33:05.691276 IP 10.202.146.129 > metallb-controller-667f54487b-24bv5: ICMP echo request, id 10, seq 2061, length 64
12:33:05.691290 IP metallb-controller-667f54487b-24bv5 > 10.202.146.129: ICMP echo reply, id 10, seq 2061, length 64
12:33:05.702445 IP 10.202.146.129 > metallb-controller-667f54487b-24bv5: ICMP echo request, id 10, seq 2062, length 64
12:33:05.702456 IP metallb-controller-667f54487b-24bv5 > 10.202.146.129: ICMP echo reply, id 10, seq 2062, length 64

The results were quite similar to what we observed earlier with curl. If you have read this far, it is evident that the issue lies on the network side. I will spare you the list of Calico, IPVS, and Kubernetes networking documentation we went through; suffice it to say it was quite an adventure.
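If you want to follow the same trail on your own cluster, the node-side view of Calico's VXLAN setup is a good place to start (run on the master node; vxlan.calico is the interface that already appeared in the tcpdump output above):

# Per-block routes that Calico programs towards other nodes' pod subnets
ip route | grep vxlan.calico

# The address and mask assigned to the VXLAN interface on this node
ip addr show dev vxlan.calico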

The next idea we had was to ping the kube-api server pod from the metallb-controller:

ping 10.202.146.129
connect: Invalid argument

Wait, what? Excuse me? What on earth is happening here? We attempted the same command multiple times and even manually entered the correct IP address, but the outcome remained unchanged. What could possibly be the issue? Not to mention, we also tried using telnet and curl, and they all failed in the same manner.

Another tool that I find quite handy is strace. So, let's strace telnet and connect it to the kube-api pod IP:

strace telnet 10.202.146.129 8443
...
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
setsockopt(3, SOL_IP, IP_TOS, [16], 4)  = 0
connect(3, {sa_family=AF_INET, sin_port=htons(8443), sin_addr=inet_addr("10.202.146.129")}, 16) = -1 EINVAL (Invalid argument)
write(2, "telnet: connect to address 10.20"..., 60telnet: connect to address 10.202.146.129: Invalid argument
) = 60
close(3)                                = 0
exit_group(1)                           = ?
+++ exited with 1 +++

Upon examination, we noticed that the connect() call returned "-1 EINVAL (Invalid argument)". Intrigued, we dove into the documentation for the underlying Linux system calls. How exciting! Funnily enough, the first Google result did not describe this failure code at all. Luckily, the second one proved much more informative:

EINVAL
The address_len argument is not a valid length for the address family; 
or invalid address family in the sockaddr structure.

In conclusion, there seems to be a network configuration issue with the IP addresses. The question remains, though: Where exactly is the problem originating from?
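Before showing what we found, here is a rough sketch of how the relevant pieces of configuration can be pulled (the CRD view works even without calicoctl installed; the pool name default-pool is the one from our cluster):

# Calico IPPool as stored in the CRD
kubectl get ippools.crd.projectcalico.org default-pool -o yaml

# Per-node pod CIDR allocations handed out by the controller-manager
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR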

At this point, we knew we were on the right track. We started with the Calico IPPool configuration:

apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  annotations:
  generation: 2
  name: default-pool
  resourceVersion: '4060638'
  uid: fe5070d5-4078-4265-87b1-5e24b964a37e
  selfLink: /apis/crd.projectcalico.org/v1/ippools/default-pool
spec:
  allowedUses:
    - Workload
    - Tunnel
  blockSize: 26
  cidr: 10.202.146.0/24
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Always

Then, the Terraform code we used for the deployment:

module "k8s_cluster" {
...
  k8s_cluster_name                  = var.cluster_name
  k8s_kubespray_url                 = var.k8s_kubespray_url
  k8s_kubespray_version             = var.k8s_kubespray_version
  k8s_version                       = var.k8s_version
  k8s_pod_cidr                      = var.k8s_pod_cidr
  k8s_service_cidr                  = var.k8s_service_cidr
  k8s_network_node_prefix           = var.k8s_network_node_prefix
  k8s_api_lb_vip                    = var.k8s_api_lb_vip
  k8s_metrics_server_enabled        = var.k8s_metrics_server_enabled
  k8s_vsphere_username              = var.k8s_vsphere_username
  k8s_vsphere_password              = var.k8s_vsphere_password
  k8s_zone_a_major_index            = var.k8s_zone_a_major_index
  k8s_zone_b_major_index            = var.k8s_zone_b_major_index
  k8s_zone_c_major_index            = var.k8s_zone_c_major_index
  container_manager                 = var.container_manager
  etcd_deployment_type              = var.etcd_deployment_type
  action                            = var.action
}

variable "k8s_pod_cidr" {
  description = "Subnet for Kubernetes pod IPs, should be /22 or wider"
  default     = "10.202.146.0/24"
  type        = string
}

variable "k8s_network_node_prefix" {
  description = "subnet allocated per-node for pod IPs. Also read the comments in templates/kubespray_extra.tpl"
  default     = "25"
  type        = string
}

Next, the pod CIDR on the nodes:

kubectl describe node t005xknm000.nix.tech.altenar.net  | grep PodCIDR
PodCIDR:                      10.202.146.0/25
PodCIDRs:                     10.202.146.0/25

Ah, there is the discrepancy! Calico has blockSize: 26 configured, whereas k8s_network_node_prefix (and therefore the node PodCIDR) is set to /25.

Therefore, the network configuration is incorrect. I must mention that this is a test cluster created solely for the purpose of exploring cluster building with our toolchain. Hence, all the networks in the cluster are "/24". However, even though it is a test cluster, it was deployed with six workers and one master. Simple network calculations reveal that a "/24" network cannot be divided into seven "/26" subnets, let alone seven "/25" subnets:

All 4 of the possible /26 networks for 10.202.146.*:

Network Address    Usable Host Range                 Broadcast Address
10.202.146.0       10.202.146.1 - 10.202.146.62      10.202.146.63
10.202.146.64      10.202.146.65 - 10.202.146.126    10.202.146.127
10.202.146.128     10.202.146.129 - 10.202.146.190   10.202.146.191
10.202.146.192     10.202.146.193 - 10.202.146.254   10.202.146.255
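The arithmetic is easy to sanity-check in a shell: a /24 contains 2^(26-24) = 4 blocks of /26 and 2^(25-24) = 2 blocks of /25, nowhere near the seven we would need for this cluster:

# Number of /26 blocks that fit in a /24
echo $(( 2 ** (26 - 24) ))   # prints 4

# Number of /25 blocks that fit in a /24
echo $(( 2 ** (25 - 24) ))   # prints 2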

Now the question is: why did this happen, given that we explicitly passed a value for that parameter? To uncover the root cause, let's look at the kubespray code.

This part:

- name: Calico | Set kubespray calico network pool
  set_fact:
    _calico_pool: >
      {
        "kind": "IPPool",
        "apiVersion": "projectcalico.org/v3",
        "metadata": {
          "name": "{{ calico_pool_name }}",
        },
        "spec": {
          "blockSize": {{ calico_pool_blocksize | default(kube_network_node_prefix) }},
          "cidr": "{{ calico_pool_cidr | default(kube_pods_subnet) }}",
          "ipipMode": "{{ calico_ipip_mode }}",
          "vxlanMode": "{{ calico_vxlan_mode }}",
          "natOutgoing": {{ nat_outgoing|default(false) }}
        }
      }

And this part too:

# add default ippool blockSize (defaults kube_network_node_prefix)
calico_pool_blocksize: 26

What we realized is that even though we used

kube_network_node_prefix: ${k8s_network_node_prefix}

in our Terraform module code, this variable never actually took effect: calico_pool_blocksize takes precedence in the template above, and it already has a default value of 26:

              "blockSize": {{ calico_pool_blocksize | default(kube_network_node_prefix) }},

At this point, we wrapped up the investigation with the root cause identified, and the action plan became clear:

    1. Obtain a larger network range.

    2. Rectify the Terraform module for Kubernetes deployment.

    3. Perform a complete redeployment of the cluster from scratch (a quick post-redeploy check is sketched below).
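Once the cluster is rebuilt, the two values can be compared directly to confirm they finally agree:

# Calico block size
kubectl get ippools.crd.projectcalico.org default-pool -o jsonpath='{.spec.blockSize}{"\n"}'

# Per-node pod CIDR prefix
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR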

Even though the root cause turned out to be trivial, as it so often does, we decided to document our troubleshooting steps in this article, in the hope that they may prove valuable to others facing similar challenges.

Thank you very much for your attention. I remain available to answer any questions or comments you may have.
