Network problems can occur in new installations of Kubernetes or when you increase the Kubernetes load. Other problems that relate back to networking might also occur. Always check the AKS troubleshooting guide to see whether your problem is described there. This article describes additional details and considerations from a network troubleshooting perspective and specific problems that might arise.

Client can't reach the API server

These errors involve connection problems that occur when you can't reach an Azure Kubernetes Service (AKS) cluster's API server through the Kubernetes cluster command-line tool (kubectl) or any other tool, like the REST API via a programming language.

Error

You might see errors that look like these:

Unable to connect to the server: dial tcp <API-server-IP>:443: i/o timeout 
Unable to connect to the server: dial tcp <API-server-IP>:443: connectex: A connection attempt
failed because the connected party did not properly respond after a period, or established 
connection failed because connected host has failed to respond. 

Cause 1

It's possible that API server authorized IP ranges are enabled on the cluster, but the client's IP address isn't included in those ranges. To determine whether authorized IP ranges are enabled, use the following az aks show command in Azure CLI. If they're enabled, the command produces a list of IP ranges.

az aks show --resource-group <cluster-resource-group> \ 
    --name <cluster-name> \ 
    --query apiServerAccessProfile.authorizedIpRanges 

Solution 1

Ensure that your client's IP address is within the ranges authorized by the cluster's API server:

  • Find your local IP address. For information on how to find it on Windows and Linux, see How to find my IP.

  • Update the range that's authorized by the API server by using the az aks update command in Azure CLI, and authorize your client's IP address (a sketch follows this list). For instructions, see Update a cluster's API server authorized IP ranges.
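
For example, the following sketch adds your current public IP address to the authorized ranges. The ifconfig.me lookup and the placeholder values are illustrative assumptions, and the command replaces the full set of ranges, so keep any existing authorized ranges in the list.

# Illustrative only: look up the client's public IP (any "what is my IP" service works).
CLIENT_IP=$(curl -s https://ifconfig.me)

# Replace the authorized ranges with the existing ranges plus the client's /32.
az aks update --resource-group <cluster-resource-group> \
    --name <cluster-name> \
    --api-server-authorized-ip-ranges <existing-ranges>,$CLIENT_IP/32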

    Cause 2

    If your AKS cluster is a private cluster, the API server endpoint doesn't have a public IP address. You need to use a VM that has network access to the AKS cluster's virtual network.

    Solution 2

    For information on how to resolve this problem, see options for connecting to a private cluster.
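
    If you only need to run occasional commands against a private cluster, az aks command invoke is one option; it runs the command from inside the cluster's network, so no direct line of sight to the API server is required. A minimal sketch:

    az aks command invoke --resource-group <cluster-resource-group> \
        --name <cluster-name> \
        --command "kubectl get pods -n kube-system"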

    Pod fails to allocate the IP address

    Error

    The Pod is stuck in the ContainerCreating state, and its events report a Failed to allocate address error:

    Normal   SandboxChanged          5m (x74 over 8m)    kubelet, k8s-agentpool-00011101-0 Pod sandbox
    changed, it will be killed and re-created. 
      Warning  FailedCreatePodSandBox  21s (x204 over 8m)  kubelet, k8s-agentpool-00011101-0 Failed 
    create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod 
    "deployment-azuredisk6-874857994-487td_default" network: Failed to allocate address: Failed to 
    delegate: Failed to allocate address: No available addresses 
    

    Check the allocated IP addresses in the plugin IPAM store. You might find that all IP addresses are allocated, even though the number of running Pods is much smaller than the number of allocated addresses:

    # Kubenet, for example. The actual path of the IPAM store file depends on network plugin implementation. 
    cd /var/lib/cni/networks/kubenet 
    ls -al|wc -l 
    docker ps | grep POD | wc -l 
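
    A quick way to compare the two counts side by side (a rough sketch that assumes the kubenet IPAM path above and a Docker-based node; adjust the path for your CNI plugin):

    # Count IP allocation files in the IPAM store (file names are IP addresses).
    echo "Allocated IPs: $(ls /var/lib/cni/networks/kubenet | grep -cE '^[0-9]+\.')"
    # Count running Pod sandbox containers.
    echo "Running Pod sandboxes: $(docker ps | grep -c POD)"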
    

    Cause 1

    This error can be caused by a bug in the network plugin. The plugin can fail to deallocate the IP address when a Pod is terminated.

    Solution 1

    Contact Microsoft for a workaround or fix.

    Cause 2

    Pod creation is much faster than garbage collection of terminated Pods.

    Solution 2

    Configure fast garbage collection for the kubelet. For instructions, see the Kubernetes garbage collection documentation.

    Service not accessible within Pods

    The first step to resolving this problem is to check whether endpoints have been created automatically for the service:

    kubectl get endpoints <service-name> 
    

    If you get an empty result, your service's label selector might be wrong. Confirm that the label is correct:

    # Query Service LabelSelector. 
    kubectl get svc <service-name> -o jsonpath='{.spec.selector}' 
    # Get Pods matching the LabelSelector and check whether they're running. 
    kubectl get pods -l key1=value1,key2=value2 
    

    If the preceding steps return expected values:

  • Check whether the Pod containerPort is the same as the service targetPort (a jsonpath check is shown after the following commands).

  • Check whether podIP:containerPort is working:

    # Testing via cURL.
    curl -v telnet://<Pod-IP>:<containerPort>
    # Testing via Telnet.
    telnet <Pod-IP> <containerPort>
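
    To compare the ports directly (a hedged sketch; substitute your own service and Pod names), you can read them with jsonpath:

    # Service targetPort(s).
    kubectl get svc <service-name> -o jsonpath='{.spec.ports[*].targetPort}'
    # Pod containerPort(s).
    kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].ports[*].containerPort}'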
    

    These are some other potential causes of service problems:

  • The container isn't listening on the specified containerPort. (Check the Pod description.)
  • A CNI plugin error or network route error is occurring.
  • kube-proxy isn't running, or iptables rules aren't configured correctly. (A quick check is shown after this list.)
  • Network Policies are dropping traffic. For information on applying and testing Network Policies, see Azure Kubernetes Network Policies overview.
  • If you're using Calico as your network plugin, you can capture network policy traffic as well. For information on configuring that, see the Calico site.
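
    To quickly rule out kube-proxy, confirm that it's running on every node. A minimal sketch (the component=kube-proxy label is an assumption; check the actual labels with kubectl get ds -n kube-system):

    # kube-proxy runs as a DaemonSet in kube-system; one Running Pod per node is expected.
    kubectl -n kube-system get pods -l component=kube-proxy -o wide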

    Nodes can't reach the API server

    Many add-ons and containers need to access the Kubernetes API (for example, kube-dns and operator containers). If errors occur during this process, the following steps can help you determine the source of the problem.

    First, confirm whether the Kubernetes API is accessible within Pods:

    kubectl run curl --image=mcr.microsoft.com/azure-cli -i -t --restart=Never --overrides='[{"op":"add","path":"/spec/containers/0/resources","value":{"limits":{"cpu":"200m","memory":"128Mi"}}}]' --override-type json --command -- sh
    

    Then run the following commands from the shell inside that container.

    # If you don't see a command prompt, try pressing Enter.
    KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) 
    curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/namespaces/default/pods
    

    Healthy output will look similar to the following.

    "kind": "PodList", "apiVersion": "v1", "metadata": { "selfLink": "/api/v1/namespaces/default/pods", "resourceVersion": "2285" "items": [

    If an error occurs, check whether the kubernetes-internal service and its endpoints are healthy:

    kubectl get service kubernetes-internal
    
    NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE 
    kubernetes-internal ClusterIP   10.96.0.1    <none>        443/TCP   25m 
    
    kubectl get endpoints kubernetes-internal
    
    NAME                ENDPOINTS          AGE 
    kubernetes-internal 172.17.0.62:6443   25m 
    

    If both tests return responses like the preceding ones, and the IP and port returned match the ones for your container, it's likely that kube-apiserver isn't running or is blocked from the network.

    There are four main reasons why the access might be blocked:

  • Your network policies. They might be preventing access to the API management plane. For information on testing Network Policies, see Network Policies overview.
  • Your API's allowed IP addresses. For information about resolving this problem, see Update a cluster's API server authorized IP ranges.
  • Your private firewall. If you route the AKS traffic through a private firewall, make sure there are outbound rules as described in Required outbound network rules and FQDNs for AKS clusters.
  • Your private DNS. If you're hosting a private cluster and you're unable to reach the API server, your DNS forwarders might not be configured properly. To ensure proper communication, complete the steps in Hub and spoke with custom DNS.

    You can also check kube-apiserver logs by using Container insights. For information on querying kube-apiserver logs, and many other queries, see How to query logs from Container insights.

    Finally, you can check the kube-apiserver status and its logs on the cluster itself:

    # Check kube-apiserver status. 
    kubectl -n kube-system get pod -l component=kube-apiserver 
    # Get kube-apiserver logs. 
    PODNAME=$(kubectl -n kube-system get pod -l component=kube-apiserver -o jsonpath='{.items[0].metadata.name}')
    kubectl -n kube-system logs $PODNAME --tail 100
    

    If a 403 - Forbidden error is returned, kube-apiserver is probably configured with role-based access control (RBAC), and your container's ServiceAccount probably isn't authorized to access resources. In this case, create appropriate RoleBinding and ClusterRoleBinding objects. For information about roles and role bindings, see Access and identity. For examples of how to configure RBAC on your cluster, see Using RBAC Authorization.
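
    As a hypothetical example (the binding name and ServiceAccount are illustrative; scope the role to what your workload actually needs), the built-in view ClusterRole can be bound to a ServiceAccount like this:

    # Grant read-only access to the default ServiceAccount in the default namespace.
    kubectl create clusterrolebinding default-view \
        --clusterrole=view \
        --serviceaccount=default:default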

    Contributors

    This article is maintained by Microsoft. It was originally written by the following contributors.

    Principal author:

  • Michael Walters | Senior Consultant

    Other contributors:

  • Mick Alberts | Technical Writer
  • Ayobami Ayodeji | Senior Program Manager
  • Bahram Rushenas | Architect

    Next steps

  • Network concepts for applications in AKS
  • Troubleshoot Applications
  • Debug Services
  • Kubernetes Cluster Networking
  • Choose the best networking plugin for AKS
  • AKS architecture design
  • Lift and shift to containers with AKS
  • Baseline architecture for an AKS cluster
  • AKS baseline for multiregion clusters
  • AKS day-2 operations guide
  • Triage practices
  • Patching and upgrade guidance
  • Monitoring AKS with Azure Monitor