update vsphere kubernetes |
https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vmware-vsphere-with-tanzu-70-release-notes/index.html |
谦逊的水煮肉
1 year ago |
VMware vSphere with Tanzu has monthly patches to introduce new features and capabilities, provide updates to Kubernetes and other services, keep up with upstream, and to resolve reported issues. Here we document what each monthly patch delivers.
This patch fixes the following issues.
With the activation of Large Receive Offload (LRO), Tanzu Kubernetes Grid cluster VMs that use Antrea-ENCAP might experience network performance issues.
Enabling the embedded Harbor registry on Supervisor can result in an insecure default configuration.
Supervisor
Supervisor Clusters Support Kubernetes 1.23 - This release adds support for Kubernetes 1.23 and drops support for Kubernetes 1.20. The supported versions of Kubernetes in this release are 1.23, 1.22, and 1.21. Supervisors running on Kubernetes version 1.20 will be automatically upgraded to version 1.21 to ensure that all your Supervisors are running on a supported version of Kubernetes.
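The support-window rule described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior (a Supervisor one minor version below the oldest supported version is moved to that oldest version); the function names are hypothetical, and anything further behind the window is treated as not covered by the release notes.

```python
from typing import List, Optional, Tuple


def _minor(version: str) -> Tuple[int, int]:
    """Parse '1.23' or '1.23.4' into the tuple (1, 23)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def auto_upgrade_target(current: str, supported: List[str]) -> Optional[str]:
    """Return the version a Supervisor would be auto-upgraded to.

    Returns None when `current` is not below the supported window.
    Raises ValueError for versions more than one minor release behind,
    because the release notes do not describe that case.
    """
    oldest = min(supported, key=_minor)
    oldest_major, oldest_minor = _minor(oldest)
    cur = _minor(current)
    if cur >= (oldest_major, oldest_minor):
        return None
    if cur == (oldest_major, oldest_minor - 1):
        return oldest
    raise ValueError(f"{current} is more than one minor version behind {oldest}")


if __name__ == "__main__":
    # Versions taken from the paragraph above.
    print(auto_upgrade_target("1.20", ["1.23", "1.22", "1.21"]))
```

Comparing parsed integer tuples rather than raw strings keeps the ordering correct across minor versions with different digit counts.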
New SecurityPolicy CRD: vSphere 7.0 Update 3i introduces a SecurityPolicy CRD for users to apply NSX based security policy to VMs and Pods in the Supervisor. This provides the ability to configure "Kubernetes network policy" via code by extending Kubernetes network policy via CRD for the Supervisor namespaces.
Support: The TLS version used in the kube-apiserver-authproxy-svc service and system pods has been updated to TLSv1.2.
This patch fixes the issue with the vCenter Server upgrade to version 70u3f that was failing because WCP service did not start after the upgrade. The error occurred when you attempted to upgrade vCenter Server with vSphere with Tanzu activated on versions prior to 70u3d. Attempts to upgrade vCenter Server from a version earlier than 70u3d to vCenter Server 70u3d and then to vCenter Server 70u3f failed.
To learn more about the issue that's been resolved, read Knowledge Base Article 89010.
Supervisor Cluster
Support LoadBalancer IP string value for VM Service – Lets a user set the spec.LoadBalancerIP field to a string value that matches an IPPool created and tagged in NSX-T.
Back up and Restore VM Service VMs – VMware now supports backup and restore for VM Service VMs in on-premises vSphere and VMware Cloud on AWS through a comprehensive and fully documented workflow that supports Veeam and other backup vendors based on vSphere Storage APIs for Data Protection (vADP), ensuring a more complete data protection solution and the general availability of VM Service on VMware Cloud on AWS.
Added Network Security Policy support for VMs deployed via VM operator service – Security Policies on NSX-T can be created via Security Groups based on Tags. It is now possible to create NSX-T based security policy and apply it to VMs deployed through VM operator based on NSX-T tags.
Supervisor Clusters Support Kubernetes 1.22 – This release adds the support of Kubernetes 1.22 and drops the support for Kubernetes 1.19. The supported versions of Kubernetes in this release are 1.22, 1.21, and 1.20. Supervisor Clusters running on Kubernetes version 1.19 will be auto-upgraded to version 1.20 to ensure that all your Supervisor Clusters are running on the supported versions of Kubernetes.
If you upgraded vCenter Server from a version prior to 7.0 Update 3c and your Supervisor Cluster is on Kubernetes 1.19.x, the tkg-controller-manager pods go into a CrashLoopBackOff state, rendering the guest clusters unmanageable. You will see an error similar to the following:
Observed a panic: Invalid memory address or nil pointer dereference
Workaround: To learn about and address this issue, read Knowledge Base Article 88443.
Spherelet VIB installations fail on vTPM hosts
Installation of the spherelet VIB fails with the error 'Could not find a trusted signer: self-signed certificate' on a vTPM host.
vSphere upgrade results in the Supervisor Cluster becoming unstable
After upgrading vSphere from 7.0 Update 1 to 7.0 Update 2 the Supervisor Cluster goes into a configuring state.
Supervisor cluster enablement stalls when using NSX-T Advanced Load Balancer
Enabling a Supervisor Cluster with NSX-T Advanced Load Balancer can cause the configuration to stall, depending on the default authentication configuration.
IMPORTANT : VMware removed ESXi 7.0 Update 3, 7.0 Update 3a and 7.0 Update 3b from all sites on November 19, 2021 due to an upgrade-impacting issue. Build 19193900 for ESXi 7.0 Update 3c ISO replaces build 18644231. See the VMware ESXi 7.0 Update 3c Release Notes for more information.
Tanzu Kubernetes Grid Service for vSphere
Enable containerized workloads to leverage GPU acceleration on your Tanzu Kubernetes cluster - vSphere admins can now provision GPU accelerated VMs to Tanzu Kubernetes clusters, and developers can now add GPU accelerated VMs to their Tanzu Kubernetes Grid clusters with native Kubernetes commands.
Tanzu Kubernetes release (TKr) based on Ubuntu 20.04 - This is our first TKr release based on Ubuntu 20.04. This image was optimized and tested specifically for GPU (AI / ML) workloads. Refer to the TKr release notes for details.
Supervisor Cluster
vSphere Services - By using the vSphere Services framework, vSphere administrators can now manage Supervisor Services asynchronously including MinIO, Cloudian Hyperstore, Dell ObjectScale, and Velero. The decoupled nature of Supervisor Services lets administrators add new services to the service catalog outside of a vCenter Server release, further empowering the DevOps community. Supervisor Services are only available in Supervisor Clusters configured with NSX-T Data Center networking. Check out the documentation for information on managing Supervisor Services in the vSphere Client.
Support for vGPU in VM Service - vSphere administrators can now provide self-service access to developers for consumption of GPUs through VMs using Kubernetes, bound by limits enforced through VM classes. DevOps users can then quickly create VMs with GPUs using these pre-defined VM classes and images.
Enable Workload Management with DHCP Networking - This release adds DHCP Network as an alternate network setup path to simplify enablement for a quicker POC. vSphere administrators can configure the Management Network and Workload Network with DHCP simply by selecting the network and port group and all other inputs including DNS, NTP, and Floating IP are automatically acquired using DHCP.
Network and Load Balancer Health Checking during Enablement - During enablement, health checks for network connectivity, DNS, NTP, and load-balancer connectivity validate the success of your Supervisor Cluster enablement and give human-readable error messages to help diagnose and take action on common issues. Check out the documentation for further instructions on resolving the error messages.
Supervisor Clusters Support Kubernetes 1.21 - This release adds the support of Kubernetes 1.21 and drops the support for Kubernetes 1.18. The supported versions of Kubernetes in this release are 1.21, 1.20, and 1.19. Supervisor Clusters running on Kubernetes version 1.18 will be auto-upgraded to version 1.19, to ensure all your Supervisor Clusters are running on the supported versions of Kubernetes.
Labels and Annotations to Supervisor Namespaces - Namespaces created by DevOps users through the namespace self-service template can now have Kubernetes labels and annotations.
Edit Supervisor Cluster Configuration after Enablement - After enabling Supervisor Clusters with the vSphere networking stack, vSphere administrators can now edit the following settings from both the API and vSphere Client: Load Balancer username and password, Management Network DNS Search Domains, and Workload Network DNS servers, NTP servers, expand the service IP range, and add a new workload network. For clusters using either vSphere or NSX-T networking, you can scale up the control plane size after enablement. Note that you can only increase the scale of the cluster, reducing the scale is not supported at this time. See the documentation for information on how to change the Supervisor Cluster settings through the vSphere Client.
Tanzu License Key Expiration - vSphere Administrators now have additional flexibility in managing expired Tanzu Edition license keys. On the expiration of Tanzu license keys, hard enforcements will not automatically occur, allowing the administrators more flexibility to procure and assign a new license key without impacting normal operations.
Tanzu Kubernetes Grid Service for vSphere
RWX Support for vSAN Persistent Volumes - Workloads running on Tanzu Kubernetes clusters can now mount vSAN based Persistent Volumes with RWX.
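As an illustration of what an RWX claim looks like, the sketch below builds a standard Kubernetes PersistentVolumeClaim with the `ReadWriteMany` access mode and prints it as JSON, which `kubectl apply -f -` accepts. The claim name, storage class name, and size are hypothetical placeholders; the storage class must map to a vSAN policy that supports file volumes in your environment.

```python
import json


def rwx_pvc(name: str, storage_class: str, size: str) -> dict:
    """Build a PersistentVolumeClaim that requests ReadWriteMany (RWX) access."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # RWX lets several pods on different nodes mount the volume read-write.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # All three values are placeholders, not names from the release notes.
    print(json.dumps(rwx_pvc("shared-data", "example-vsan-file-policy", "5Gi"), indent=2))
```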
Tanzu Kubernetes Grid Service v1alpha2 API Update - API updates to the Tanzu Kubernetes Cluster API, exposing new fields that allow for enhanced configuration of Tanzu Kubernetes Grid Service, including support for multiple Worker NodePools. Deprecation of the v1alpha1 API in favor of the new v1alpha2 API. See the documentation for more information.
Metrics Server - Metrics Server is now included by default in Tanzu Kubernetes clusters, beginning with the 1.20.9+ and 1.21 Tanzu Kubernetes releases.
Ability to support No-NAT (routed) topology - Tanzu Kubernetes clusters can now be created with a networking topology that allows cluster nodes to be routed outside of the cluster network. See the documentation for more information.
Supervisor Cluster
Supervisor Clusters Support Kubernetes 1.20 - This release adds the support of Kubernetes 1.20 and drops the support for Kubernetes 1.17. The supported versions of Kubernetes in this release are 1.20, 1.19 and 1.18. Supervisor Clusters running on Kubernetes version 1.17 will be auto-upgraded to version 1.18, to ensure all your supervisor clusters are running on the supported versions of Kubernetes.
Velero Plugin for vSphere support for vSphere Pods - This release supports Velero 1.5.1 and the Velero Plugin for vSphere 1.1.0 and higher for backup and restore of vSphere Pods.
Tanzu Kubernetes Grid Service for vSphere
Harbor and external-dns as new in-cluster extensions - Platform Operators now have access to two additional supported in-cluster extensions: Harbor and external-dns. Harbor is the CNCF graduated container registry, and external-dns is a popular tool for dynamically configuring DNS records based on Kubernetes load-balanced Services.
Improved remediation of control plane nodes - Tanzu Kubernetes clusters will now automatically remediate common control plane node failures which provides a more robust Kubernetes runtime.
Velero Plugin for vSphere support for Tanzu Kubernetes cluster workloads - This release provides support for Velero 1.5.1 and higher and Velero Plugin for vSphere 1.1.0 and higher for backup and restore of workloads running on Tanzu Kubernetes clusters.
Velero standalone support for Tanzu Kubernetes cluster workloads - This release provides support for Velero 1.6 for backing up and restoring Tanzu Kubernetes cluster workloads using standalone Velero with Restic.
Supervisor Cluster
Management of VMs using Kubernetes via the Virtual Machine Service. This release adds the Virtual Machine Service to the infrastructure services included in vSphere with Tanzu, delivering Kubernetes native VM management for developers. VM Service enables developers to deploy and manage VMs in a namespace using Kubernetes commands. At the same time, the vSphere administrator is able to govern resource consumption and availability of the service, while still providing developers with a cloud-native experience.
Self-service creation of namespaces for developers. vSphere administrators can now create and configure a Supervisor Namespace as a self-service namespace template. This template defines resource limits and permissions for usage. Developers can then use this template to provision a namespace and run workloads within it, without having to request one and wait for approval.
Tanzu Kubernetes Grid Service for vSphere
IMPORTANT: CVE Fix for TKRs : There are new Tanzu Kubernetes Releases available to address CVE-2021-30465.
IMPORTANT: CVE Fix for the Contour Ingress Extension : There is a new Envoy image version to address CVE-2021-28682, CVE-2021-28683, and CVE-2021-29258. See the associated KB article for more information.
New Workflow for Using Default VM Classes. There is a new workflow for using the default VM classes to provision Tanzu Kubernetes clusters. Before creating a new cluster, add the default VM classes to the vSphere Namespace where you are provisioning the cluster. See the documentation for guidance.
System mutating webhooks now support dry-run mode. Users can now integrate popular tools like the Terraform Kubernetes provider with Tanzu Kubernetes Grid Service. Previously the system webhooks did not support dry-run mode, which was a requirement for the Terraform `plan` command.
Custom VM Classes. Tanzu Kubernetes Clusters can consume the custom Virtual Machine Classes through VM Service. This will allow users to configure different amounts of CPU and Memory allocated to the Virtual Machines that make up a Tanzu Kubernetes Cluster.
Supervisor Cluster
Support of NSX Advanced Load Balancer for a Supervisor Cluster configured with VDS networking. You can now enable a Supervisor Cluster with NSX Advanced Load Balancer (Avi Networks) for L4 load balancing, as well as load balancing for the control plane nodes of Supervisor and Tanzu Kubernetes clusters. Check out the documentation page for guidance on configuring the NSX Advanced Load Balancer.
Upgrade of the Supervisor Cluster to Kubernetes 1.19 with auto-upgrade of a Supervisor Cluster running Kubernetes 1.16. You can upgrade the Supervisor Cluster to Kubernetes 1.19. With this update, the following Supervisor Cluster versions are supported: 1.19, 1.18, and 1.17. Supervisor Clusters running Kubernetes 1.16 will be automatically upgraded to 1.17 once vCenter Server is updated. This ensures that all your Supervisor Clusters are running a supported version of Kubernetes.
Expansion of PersistentVolumeClaims (PVCs). You can now expand existing volumes by modifying the PersistentVolumeClaim object, even when the volume is in active use. This applies to volumes in the Supervisor Cluster and Tanzu Kubernetes clusters.
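Expanding a volume comes down to raising `spec.resources.requests.storage` on the existing claim. The sketch below only builds the patch body; the helper name and the sizes are illustrative, and it assumes the storage class backing the claim allows expansion.

```python
import json


def pvc_expand_patch(new_size: str) -> str:
    """Return a JSON patch body that raises the requested size of a PVC."""
    patch = {"spec": {"resources": {"requests": {"storage": new_size}}}}
    return json.dumps(patch)


if __name__ == "__main__":
    # Pass the output to: kubectl patch pvc <claim-name> -p '<patch>'
    print(pvc_expand_patch("20Gi"))
```

Kubernetes only permits growing a claim, so the new size has to be larger than the current one.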
Management of Supervisor Cluster lifecycle using vSphere Lifecycle Manager. For Supervisor Clusters configured with NSX-T networking, you can use vSphere Lifecycle Manager for infrastructure configuration and lifecycle management.
Tanzu Kubernetes Grid Service for vSphere
Support for private container registries. vSphere administrators and Kubernetes platform operators can now define additional Certificate Authority certificates (CAs) to use in Tanzu Kubernetes clusters for trusting private container registries. This feature enables Tanzu Kubernetes clusters to pull container images from container registries that have enterprise or self-signed certificates. You can configure private CAs as a default for Tanzu Kubernetes clusters on a Supervisor Cluster-wide basis or per Tanzu Kubernetes cluster. Read more about how to configure support for private container registries for Tanzu Kubernetes clusters by visiting the documentation page.
User-defined IPs for Service type: LoadBalancer with NSX-T and NSX Advanced Load Balancer. Kubernetes application operators can now provide a user-defined LoadBalancerIP when configuring a Service type: LoadBalancer, allowing for a static IP endpoint for the service. This advanced feature requires either NSX-T load balancing or the NSX Advanced Load Balancer with the Supervisor Cluster. Learn how to configure this feature by visiting the documentation page.
ExternalTrafficPolicy and LoadBalancerSourceRanges for Service type: LoadBalancer with NSX-T. Kubernetes application operators can now configure the ExternalTrafficPolicy of 'local' for Services to propagate client IP address to the end pods. You also can define loadBalancerSourceRanges for Services to restrict which client IPs can access the load balanced service. These two advanced features require NSX-T load balancing with the Supervisor Cluster.
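The Service options mentioned in the two items above map to standard fields of the Kubernetes Service spec: `loadBalancerIP`, `externalTrafficPolicy`, and `loadBalancerSourceRanges`. The following sketch assembles such a manifest; the function name, the service name, the selector, and the addresses (taken from documentation-reserved ranges) are all hypothetical.

```python
import json
from typing import Dict, List, Optional


def load_balancer_service(
    name: str,
    selector: Dict[str, str],
    port: int,
    static_ip: Optional[str] = None,
    source_ranges: Optional[List[str]] = None,
    preserve_client_ip: bool = False,
) -> dict:
    """Build a Service of type LoadBalancer with the optional advanced fields."""
    spec: dict = {
        "type": "LoadBalancer",
        "selector": selector,
        "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
    }
    if static_ip:
        # User-defined static endpoint for the service.
        spec["loadBalancerIP"] = static_ip
    if source_ranges:
        # Restrict which client CIDRs may reach the load-balanced service.
        spec["loadBalancerSourceRanges"] = list(source_ranges)
    if preserve_client_ip:
        # 'Local' propagates the client IP address to the end pods.
        spec["externalTrafficPolicy"] = "Local"
    return {"apiVersion": "v1", "kind": "Service", "metadata": {"name": name}, "spec": spec}


if __name__ == "__main__":
    svc = load_balancer_service(
        "web", {"app": "web"}, 80,
        static_ip="192.0.2.10",
        source_ranges=["198.51.100.0/24"],
        preserve_client_ip=True,
    )
    print(json.dumps(svc, indent=2))
```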
Kubernetes version management and indications. You can now use kubectl to inspect the compatibility of TanzuKubernetesReleases with the underlying Supervisor Cluster environment. Tanzu Kubernetes clusters now also indicate whether there is a Kubernetes upgrade available and recommend the next TanzuKubernetesRelease(s) to use. For more information on using this new feature, see the documentation page.
Improved Cluster Status at a Glance. In a previous release, VMware expanded WCPCluster and WCPMachine CRDs by implementing conditional status reporting to surface common problems and errors. With the vSphere 7.0 Update 2 release, we enhanced TanzuKubernetesCluster CRDs to summarize conditional status reporting for subsystem components, supplying immediate answers and fine-grained guidance to help you investigate issues. Learn how to configure this feature by visiting the documentation page.
Per-Tanzu Kubernetes cluster HTTP Proxy Configuration. You can now define the HTTP/HTTPS Proxy configuration on a per-Tanzu Kubernetes cluster basis or, alternatively, define it Supervisor Cluster-wide through a default configuration. For information on configuring this feature, see the documentation page.
Support for Tanzu Kubernetes Grid Extensions. In-cluster extensions are now fully supported on Tanzu Kubernetes Grid Service, including Fluent Bit, Contour, Prometheus, AlertManager, and Grafana.
The vSphere 7.0 Update 2 release includes functionality that automatically upgrades the Supervisor Cluster when vCenter Server is updated. If you have Tanzu Kubernetes clusters provisioned in your environment, read Knowledge Base Article 82592 before upgrading to vCenter Server 7.0 Update 2. The article provides guidance on running a pre-check to determine whether any Tanzu Kubernetes cluster will become incompatible after the Supervisor Cluster is auto-upgraded.
The embedded container registry SSL certificate is not copied to Tanzu Kubernetes cluster nodes
When the embedded container registry is enabled for a Supervisor Cluster, the Harbor SSL certificate is not included in any Tanzu Kubernetes cluster nodes created on that Supervisor Cluster, and you cannot connect to the registry from those nodes.
Post upgrade from Tanzu Kubernetes Grid 1.16.8 to 1.17.4, the "guest-cluster-auth-svc" pod on one of the control plane nodes is stuck at "Container Creating" state
After updating a Tanzu Kubernetes Cluster from Tanzu Kubernetes Grid Service 1.16.8 to 1.17.4, the "guest-cluster-auth-svc" pod on one of the cluster control plane nodes is stuck at "Container Creating" state
User is unable to manage existing pods on a Tanzu Kubernetes cluster during or after performing a cluster update
User is unable to manage existing pods on a Tanzu Kubernetes cluster during or after performing a cluster update.
Tanzu Kubernetes cluster Upgrade Job fails with "timed out waiting for etcd health check to pass."
The upgrade job in the vmware-system-tkg namespace associated with the upgrade of a Tanzu Kubernetes cluster fails with the following error message "timed out waiting for etcd health check to pass." The issue is caused by the missing PodIP addresses for the etcd pods.
Antrea CNI not supported in current TKC version
While provisioning a Tanzu Kubernetes cluster, you receive the error "Antrea CNI not supported in current TKC version."
Option 1 (recommended): Update the Tanzu Kubernetes cluster to use the OVA version that supports Antrea (v1.17.8 or later).
Option 2: In the Tanzu Kubernetes cluster specification YAML, enter "calico" in the spec.settings.network.cni section.
Option 3: Change the default CNI to Calico. Refer to the topic in the documentation on how to do this.
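Option 2 can be illustrated with a small script that builds a cluster specification with Calico selected and prints it as JSON for `kubectl apply -f -`. The text only states that "calico" goes in the spec.settings.network.cni section; the exact nesting used here (`cni.name`), the `run.tanzu.vmware.com/v1alpha1` layout, and every name, class, and count are assumptions to verify against the documentation for your release.

```python
import json


def tkc_manifest_with_calico(
    name: str, namespace: str, version: str, vm_class: str, storage_class: str
) -> dict:
    """Build a TanzuKubernetesCluster spec that selects Calico as the CNI.

    The field layout is an assumed v1alpha1 shape, not taken from the release notes.
    """
    def node_pool(count: int) -> dict:
        return {"count": count, "class": vm_class, "storageClass": storage_class}

    return {
        "apiVersion": "run.tanzu.vmware.com/v1alpha1",
        "kind": "TanzuKubernetesCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "distribution": {"version": version},
            "topology": {"controlPlane": node_pool(1), "workers": node_pool(3)},
            # Option 2 from the text: set the CNI to calico for this cluster only.
            "settings": {"network": {"cni": {"name": "calico"}}},
        },
    }


if __name__ == "__main__":
    # Placeholder names; substitute values that exist in your vSphere Namespace.
    manifest = tkc_manifest_with_calico(
        "example-cluster", "example-namespace", "v1.16.8",
        "best-effort-small", "example-storage-policy",
    )
    print(json.dumps(manifest, indent=2))
```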
New Tanzu Kubernetes Grid Service features were unavailable in existing Supervisors with vSphere networking
In the previous release, new Tanzu Kubernetes Grid Service capabilities and bug-fixes were only available in newly created Supervisor Clusters when vSphere networking was used. In this release, users can now update Supervisor Clusters with vSphere networking to take advantage of the latest Tanzu Kubernetes Grid Service features and bug-fixes.
Note: To take advantage of new Tanzu Kubernetes Grid Service capabilities and bug-fixes in this release, you need to create a new Supervisor cluster if vSphere networking is used.
Supervisor Cluster
Supervisor Namespace Isolation with Dedicated T1 Router. Supervisor Clusters using NSX-T networking use a new topology where each namespace has its own dedicated T1 router.
Newly created Supervisor Clusters use this new topology automatically.
Existing Supervisor Clusters are migrated to this new topology during an upgrade.
Supervisor Clusters Support NSX-T 3.1.0. Supervisor Clusters are compatible with NSX-T 3.1.0.
Supervisor Cluster Version 1.16.x Support Removed. Support for Supervisor Cluster version 1.16.x is now removed. Supervisor Clusters running 1.16.x should be upgraded to a newer version.
Tanzu Kubernetes Grid Service for vSphere
HTTP/HTTPS Proxy Support. Newly created Tanzu Kubernetes clusters can use a global HTTP/HTTPS Proxy for egress traffic as well as for pulling container images from internet registries.
Integration with Registry Service. Newly created Tanzu Kubernetes clusters work out of the box with the vSphere Registry Service. Existing clusters, once updated to a new version, also work with the Registry Service.
Configurable Node Storage. Tanzu Kubernetes clusters can now mount an additional storage volume to virtual machines thereby increasing available node storage capacity. This enables users to deploy larger container images that might exceed the default 16GB root volume size.
Improved status information. WCPCluster and WCPMachine Custom Resource Definitions now implement conditional status reporting. Successful Tanzu Kubernetes cluster lifecycle management depends on a number of subsystems (for example, Supervisor, storage, networking) and understanding failures can be challenging. Now WCPCluster and WCPMachine CRDs surface common status and failure conditions to ease troubleshooting.
Missing new default VM Classes introduced in vCenter Server 7.0 Update 1
After upgrading to vSphere 7.0.1, and then performing a vSphere Namespaces update of the Supervisor Cluster, running the command "kubectl get virtualmachineclasses" did not list the new VM class sizes 2x-large, 4x-large, 8x-large. This has been resolved and all Supervisor Clusters will be configured with the correct set of default VM Classes.
Supervisor Cluster
Configuration of Supervisor Clusters with vSphere networking. We introduced vSphere networking for Supervisor Clusters, enabling you to deliver a developer-ready platform using your existing network infrastructure.
Support of HAproxy load balancer for setting up Supervisor Clusters with vSphere networking. If you configure Supervisor Clusters with vSphere networking, you need to add a load balancer to handle your modern workloads. You can deploy and set up your load balancer with an HAproxy OVA.
Management of Supervisor Cluster lifecycle using vSphere Lifecycle Manager. For Supervisor Clusters configured with vSphere networking, you can use vSphere Lifecycle Manager for infrastructure configuration and lifecycle management.
Opportunity to try vSphere with Tanzu on your hardware. We now offer you an in-product trial if you want to enable a Supervisor Cluster on your hardware and test this modern application platform at no additional cost.
Tanzu Kubernetes Grid Service for vSphere
Exposure of Kubernetes versions to DevOps users. We introduced a new 'TanzuKubernetesRelease' custom resource definition in the Supervisor Cluster. This custom resource definition provides detailed information to the DevOps user about the Kubernetes versions they can use in their Tanzu Kubernetes clusters.
Integration of VMware Container Networking with Antrea for Kubernetes. We integrated a commercially supported version of Antrea as the default Container Network Interface (CNI) for new Tanzu Kubernetes clusters. Antrea brings a comprehensive suite of enterprise network policy features to Tanzu Kubernetes Grid Service. For more details, read the release announcement. While Antrea is the default CNI, vSphere administrators and DevOps users can still choose Calico as the CNI for Tanzu Kubernetes clusters.
Support of Supervisor cluster environments that use vSphere networking. We now support Supervisor Cluster environments that use vSphere networking so you can leverage your existing network infrastructure.
None, this is simply a bug-fix release.
High CPU utilization upon upgrading to the July 30 patch
vCenter Server generates a high CPU utilization after upgrade to the July 30 patch. This issue is now fixed.
Supervisor cluster enablement failure due to certificate with Windows line endings
Enabling supervisor cluster can fail if there are Windows line endings in the certificate. This issue is now fixed.
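On builds that do not yet contain this fix, a certificate can be checked for Windows line endings before it is supplied during enablement. The helper names below are hypothetical and the sample bytes are a shortened stand-in, not a real certificate.

```python
def has_windows_line_endings(pem: bytes) -> bool:
    """Report whether the PEM data contains CRLF line endings."""
    return b"\r\n" in pem


def normalize_pem_line_endings(pem: bytes) -> bytes:
    """Convert CRLF (and stray CR) line endings to LF."""
    return pem.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    # Shortened stand-in for a certificate saved on Windows.
    sample = b"-----BEGIN CERTIFICATE-----\r\nMIIB\r\n-----END CERTIFICATE-----\r\n"
    print(has_windows_line_endings(sample))
    print(has_windows_line_endings(normalize_pem_line_endings(sample)))
```

Working on bytes rather than text avoids any implicit newline translation by the file layer.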
New Features
Supervisor cluster: new version of Kubernetes, support for custom certificates and PNID changes
The Supervisor cluster now supports Kubernetes 1.18.2 (along with 1.16.7 and 1.17.4)
Replacing machine SSL certificates with custom certificates is now supported
vCenter PNID update is now supported when there are Supervisor clusters in vCenter Server
Tanzu Kubernetes Grid Service for vSphere: new features added for cluster scale-in, networking and storage
Cluster scale-in operation is now supported for Tanzu Kubernetes Grid service clusters
Ingress firewall rules are now enforced by default for all Tanzu Kubernetes Grid service clusters
New versions of Kubernetes now ship regularly, asynchronously to vSphere patches. Current versions are 1.16.8, 1.16.12, 1.17.7, and 1.17.8
Network service: new version of NCP
SessionAffinity is now supported for ClusterIP services
IngressClass, PathType, and Wildcard domain are supported for Ingress in Kubernetes 1.18
Client Auth is now supported in Ingress Controller
Registry service: new version of Harbor
The Registry service is now upgraded to 1.10.3
For more information and instructions on how to upgrade, refer to the Updating vSphere with Tanzu Clusters documentation.
Resolved Issues
Tanzu Kubernetes Grid Service cluster NTP sync issue
Tanzu Kubernetes Grid Service cluster upgrade failure
We have resolved an issue where upgrading a Tanzu Kubernetes Grid service cluster could fail due to "Error: unknown previous node"
Supervisor cluster upgrade failure
We have resolved an issue where a Supervisor cluster update may get stuck if the embedded Harbor is in a failed state
New Features
Tanzu Kubernetes Grid Service for vSphere: rolling upgrade and services upgrade
Customers can now perform rolling upgrades over their worker nodes and control plane nodes for the Tanzu Kubernetes Grid Service for vSphere, and upgrade the pvCSI, Calico, and authsvc services. This includes pre-checks and upgrade compatibility for this matrix of services.
Rolling upgrades can be used to vertically scale worker nodes, i.e. change the VM class of your worker nodes to a smaller or larger size.
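Vertical scaling of this kind amounts to changing the VM class recorded for the worker nodes in the cluster specification. A minimal sketch of the patch body follows, assuming the v1alpha1 field path `spec.topology.workers.class`; the function name and the class name are placeholders.

```python
import json


def worker_class_patch(new_vm_class: str) -> str:
    """Return a JSON merge patch that changes the worker node VM class."""
    # Assumed v1alpha1 path: spec.topology.workers.class
    patch = {"spec": {"topology": {"workers": {"class": new_vm_class}}}}
    return json.dumps(patch)


if __name__ == "__main__":
    # Apply with:
    #   kubectl patch tanzukubernetescluster <name> --type merge -p '<patch>'
    print(worker_class_patch("best-effort-large"))
```

Custom resources do not support kubectl's default strategic merge patch, which is why the usage comment passes `--type merge`.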
Supervisor cluster: new versions of Kubernetes, upgrade supported
The Supervisor cluster now supports Kubernetes 1.17.4
The Supervisor cluster now supports upgrading from Kubernetes 1.16.x to 1.17.x
Resolved Issues
Naming conflict for deleted namespaces
We have resolved an issue where, if a user deleted a vSphere namespace and then created a new vSphere namespace with the same name, we had a naming collision that resulted in being unable to create Tanzu Kubernetes clusters.
Improved distribution names
We have made clearer which version of Kubernetes you are running by moving OVF versioning information to a separate column.
VMware provides a variety of resources you can use to learn about vSphere with Tanzu.
Learn how to configure, manage, and use vSphere with Tanzu by reading vSphere with Tanzu Configuration and Management. Designed for vSphere system administrators and DevOps teams, this guide provides details on vSphere with Tanzu architecture, services, licensing, system requirements, setup, and usage.
Use the VMware Compatibility Guides to learn about hardware compatibility and product interoperability for vSphere with Tanzu. vSphere with Tanzu has the same hardware requirements as vSphere 7.0. For certain configurations, it also requires the use of NSX-T Edge virtual machines, and those VMs have their own smaller subset of CPU compatibility. See the NSX-T Data Center Installation Guide for more information.
Find out what languages vSphere with Tanzu is available in by visiting the Internationalization section of the vSphere 7.0 Release Notes. These are the same languages VMware provides for vSphere.
View the copyrights and licenses for vSphere with Tanzu open source components by visiting the Open Source section of the vSphere 7.0 Release Notes. The vSphere 7.0 Release Notes also tell you where to download vSphere open source components.
VM Service card on the Namespace Summary page disappears after a vCenter upgrade to vCenter Server 7.0 Update 2c
After a vCenter upgrade from vCenter Server 7.0 Update 2a to vCenter Server 7.0 Update 2c, namespaces that were created before the upgrade do not show the VM Class and Content Libraries cards under the Namespace Summary view. This is specific to upgrades from vCenter Server 7.0 Update 2a and should not affect upgrades from earlier versions such as vCenter Server 7.0 Update 2.
For the affected namespaces, upgrading the Supervisor cluster should restore the cards on the Namespace Summary view.
Certain operations on virtual machines created with VM Service might fail due to incompatible virtual machine hardware version
Operations on virtual machines will fail if the virtual machine hardware version of the OVF image does not support the operation. For example, for an image with a virtual hardware version vmx-11, attaching a persistent volume will fail with the following error:
Attach a virtual disk: The device or operation specified at index '-1' is not supported for the current virtual machine version 'vmx-11'. A minimum version of 'vmx-13' is required for this operation to succeed
Workaround: None
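Because no workaround exists, it can help to check an image's hardware version before relying on an operation that needs a newer one. The sketch below compares `vmx-NN` strings numerically; the function names are hypothetical, and the default minimum of `vmx-13` is taken from the persistent volume example in the error message above.

```python
import re


def hardware_version(vmx: str) -> int:
    """Extract the numeric part of a hardware version string such as 'vmx-13'."""
    match = re.fullmatch(r"vmx-(\d+)", vmx.strip())
    if match is None:
        raise ValueError(f"unrecognized hardware version: {vmx!r}")
    return int(match.group(1))


def supports_operation(image_version: str, minimum: str = "vmx-13") -> bool:
    """Return True when the image's hardware version meets the required minimum."""
    return hardware_version(image_version) >= hardware_version(minimum)


if __name__ == "__main__":
    print(supports_operation("vmx-11"))  # the failing case described above
    print(supports_operation("vmx-15"))
```

The comparison is done on integers because a plain string comparison would rank `vmx-9` above `vmx-13`.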
During Supervisor cluster upgrade, extra vSphere Pods might be created and stuck at pending status if Daemon set is used
During Supervisor cluster upgrade, Daemon set controller creates extra vSphere Pods for each Supervisor control plane node. This is caused by an upstream Kubernetes issue.
Workaround: Add NodeSelector/NodeAffinity to vSphere Pod spec, so the Daemon set controller can skip the control plane nodes for pods creation.
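One way to express this workaround is a required node affinity that only matches nodes without the control plane role label. The sketch below builds the `affinity` block to place under `spec.template.spec` of the DaemonSet. The label key `node-role.kubernetes.io/master` is an assumption; confirm the labels actually present on your Supervisor control plane nodes with `kubectl get nodes --show-labels`.

```python
import json


def skip_control_plane_affinity(label_key: str = "node-role.kubernetes.io/master") -> dict:
    """Build an affinity block that excludes nodes carrying `label_key`."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            # DoesNotExist matches only nodes without the label.
                            {"key": label_key, "operator": "DoesNotExist"}
                        ]
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps({"affinity": skip_control_plane_affinity()}, indent=2))
```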
Some environments have reported pod creation intermittently failing with the error "Failed to get image", caused by a connection timeout with a No Route to Host error. The cause is a default 3-second timeout during the image fetcher request.
Contact GSS to increase the default timeout value.
In environments where many hosts are disconnected because of a bad username and password, the Supervisor service might crash and fail to come up again.
This problem is fixed in the latest release.
Configuring a Supervisor Cluster with NSX-T Data Center might fail on vCenter Server 7.0 Update 3
Configuring a Supervisor Cluster with NSX-T Data Center might fail to install the spherelet VIB on ESXi hosts with the following error:
Could not find a trusted signer: self signed certificate
A self-signed Spherelet VIB is bundled with vCenter Server 7.0 Update 3. However, if secure boot is not enabled on the ESXi hosts that are part of the vSphere cluster, or vSphere Lifecycle Manager (vLCM) is not enabled on the cluster, the Supervisor Cluster enable operation succeeds. This problem is now fixed in the latest release.
See Knowledge Base article 89010.
You can apply this workaround either on vCenter Server 7.0u3f, which is in the failed state, or on vCenter Server 7.0u3d before you start the upgrade to 7.0u3f.
1. Connect to vCenter Server over SSH with root credentials.
2. Connect to the WCP database by using the following command:
PGPASSFILE=/etc/vmware/wcp/.pgpass /opt/vmware/vpostgres/current/bin/psql -d VCDB -U wcpuser -h localhost
3. Run the following command to check the entries that have instance_id as null:
SELECT cluster, instance_id FROM cluster_db_configs WHERE instance_id is NULL;
4. Update the instance_id in cluster_db_configs to random UUID where it is null:
UPDATE cluster_db_configs SET instance_id=gen_random_uuid() WHERE instance_id is NULL;
5. After the database entry is fixed, restart the WCP service and any other service that did not start after the upgrade.
service-control --status --all
service-control --restart --all (--stop or --start)
service-control --restart wcp (--stop or --start)
6. Rerun steps 2 and 3 to verify that no instance_id is NULL. vCenter Server should now be up and running.
7. If you applied this workaround on vCenter Server 70u3d, proceed with the upgrade to vCenter Server 70u3f. If you applied the workaround on vCenter Server 70u3f, go to the VMware Appliance Management Interface (VAMI) or the CLI installer and resume the upgrade.
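The check-and-fix logic of steps 3 and 4 can be rehearsed outside vCenter Server. The sketch below is only a simulation: it uses an in-memory SQLite database as a stand-in for the PostgreSQL VCDB, assumes a two-column cluster_db_configs table, and generates UUIDs in Python instead of calling gen_random_uuid(). The sample cluster IDs are hypothetical.

```python
import sqlite3
import uuid


def fill_missing_instance_ids(conn: sqlite3.Connection) -> int:
    """Assign a random UUID to every row whose instance_id is NULL."""
    rows = conn.execute(
        "SELECT cluster FROM cluster_db_configs WHERE instance_id IS NULL"
    ).fetchall()
    for (cluster,) in rows:
        conn.execute(
            "UPDATE cluster_db_configs SET instance_id = ? WHERE cluster = ?",
            (str(uuid.uuid4()), cluster),
        )
    conn.commit()
    return len(rows)


def demo() -> tuple:
    """Build a throwaway table, repair it, and report (fixed, remaining)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cluster_db_configs (cluster TEXT PRIMARY KEY, instance_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO cluster_db_configs VALUES (?, ?)",
        [("domain-c8", None), ("domain-c9", "existing-id")],
    )
    fixed = fill_missing_instance_ids(conn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM cluster_db_configs WHERE instance_id IS NULL"
    ).fetchone()[0]
    conn.close()
    return fixed, remaining
```

Rows that already have an instance_id are left unchanged, which mirrors the WHERE instance_id is NULL condition in the original statements.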
Improved network throughput performance on Tanzu Kubernetes Grid cluster VMs that use Antrea-ENCAP with activated LRO
Added datapath optimizations to improve network throughput performance for Tanzu Kubernetes Grid clusters that use Antrea-ENCAP with enabled LRO.
Enabling the embedded Harbor registry on Supervisor can result in an insecure default configuration
If you have completed the steps outlined in "Enable the Embedded Harbor Registry on the Supervisor Cluster," an insecure default configuration may be present on the embedded Harbor registry. Further information on this matter is available in VMware Knowledge Base article 91452.
Workaround: The issue can be resolved by either installing this release or by implementing a temporary workaround as described in KB 91452.
NEW: When deleting multiple FCDs and volumes from shared datastores such as vSAN, you might notice changes in performance
The performance changes can be caused by a fixed issue. While unfixed, the issue caused stale FCDs and volumes to remain in the datastore after an unsuccessful FCD delete operation.
Workaround: None. The delete operation works as usual despite the change in the performance.
When you move a disconnected ESXi node from a cluster, but it stays under the same data center, pods and PVCs running on the node remain in terminating state
If an ESXi host is in Not responding state due to PSOD and you move this host out of the cluster, under the same data center, pods and PVCs running on the host get stuck in terminating state. The problem happens even when a spare node is available in the cluster.
This issue typically occurs when the following takes place:
You enable partner service on the Supervisor Cluster and create instances of the service.
An ESXi node where service instances run experiences PSOD.
You disconnect the non-responsive host and move it out of the cluster under the same data center.
When the host is out of cluster, you can observe that the pods and PVCs present on the node remain in terminating state.
Workaround: Remove the host from inventory instead of moving it out of the cluster under the same data center.
Clusters where you enable workload management also must have HA and automated DRS enabled. Enabling workload management on clusters where HA and DRS are not enabled or where DRS is running in manual mode can lead to inconsistent behavior and Pod creation failures.
Workaround: Enable DRS on the cluster and set it to Fully Automated or Partially Automated. Also ensure that HA is enabled on the cluster.
Storage class appears when you run kubectl get sc even after you remove the corresponding storage policy
If you run kubectl get sc after you create a storage policy, add the policy to a namespace, and then remove the policy, the command response still lists the corresponding storage class.
Workaround: Run kubectl describe namespace to see the storage classes actually associated with the namespace.
All storage classes returned when you run kubectl describe storage-class or kubectl get storage-class on a Supervisor Cluster instead of just the ones for the Supervisor namespace
When you run the kubectl describe storage-class or kubectl get storage-class command on a Supervisor Cluster, the command returns all storage classes instead of just the ones for the Supervisor namespace.
Workaround: Infer the storage class names associated with the namespace from the verbose name of the quota.
Even if FQDN is configured for the Kubernetes control plane IP for Supervisor Cluster namespace, the share namespace button gives the IP address instead of the FQDN.
Workaround: Manually share Supervisor Cluster namespace with FQDN.
You cannot access the API server via kubectl vsphere login when using a load balanced endpoint.
Workaround: This issue can manifest in two ways.
Check whether the API server is accessible through the control plane: curl -k https://vip:6443 (or port 443)
If you are unable to access the load balancer from the API server, then the API server is not up yet.
Workaround: Wait a few minutes for the API server to become accessible.
Check if the edge virtual machine node status is up.
Log in to the NSX Manager.
Go to System > Fabric > Nodes > Edge Transport Nodes. The node status should be up.
Go to Networking > Load Balancers > Virtual Servers. Find the VIPs that end with kube-apiserver-lb-svc-6443 and kube-apiserver-lb-svc-443. If their status is not up, use the following workaround.
Workaround: Reboot the edge VM. The edge VM should reconfigure after the reboot.
During the configuration of the cluster, you may see the following error messages:
Api request to param0 failed
Config operation for param0 node VM timed out
Workaround: None. Enabling vSphere with Tanzu can take from 30 to 60 minutes. If you see these or similar param0 timeout messages, they are not errors and can be safely ignored.
Disabling and immediately re-enabling a vSphere service in vSphere with Tanzu causes a corresponding namespace in vCenter Server to become unresponsive.
When the vSphere service is disabled and immediately reenabled, the namespace in the Supervisor cluster gets deleted and re-created. However, the corresponding namespace in vCenter Server remains in "Terminating" state. As a result, resource quotas cannot be assigned to the namespace in vCenter Server, and DevOps engineers cannot create such resources as pods and PVCs on the namespace. In addition, the UI plugin deployment fails because the service operator pod cannot run.
You can obtain the App Platform logs by running the following command on the Supervisor cluster: kubectl -n vmware-system-appplatform-operator-system logs vmware-system-appplatform-operator-mgr-0.
Workaround:
Before reenabling the service, wait for the namespace to be completely deleted from vCenter Server.
If the namespace is stuck in the terminating state, perform the following steps:
1. Disable the service again.
2. Restart the wcp service in vCenter Server.
3. Wait for the namespace to be deleted. This step might take some time.
4. Reenable the service.
When the user enables the container registry from the UI, the enable action fails after 10 minutes with a timeout error.
Workaround: Disable the container registry and try to enable it again. Note that the timeout error may occur again.
Enabling a cluster shortly after disabling the cluster may create a conflict in the service account password reset process. The enable action fails with an error.
Workaround: Restart the wcp service with the command vmon-cli --restart wcp.
Deleting a container image tag in an embedded container registry might delete all image tags that share the same physical container image
Multiple images with different tags can be pushed to a project in an embedded container registry from the same container image. If one of the images on the project is deleted, all other images with different tags that are pushed from the same image will be deleted.
Workaround: The operation cannot be undone. Push the image to the project again.
When you perform a purge operation on a registry project, the project temporarily displays as being in an error state. You cannot push or pull images from such a project. At regular intervals, the projects are checked, and all projects in an error state are deleted and recreated. When this happens, all previous project members are added back to the recreated project, and all the repositories and images that previously existed in the project are deleted, effectively completing the purge operation.
Workaround: None.
Container registry enablement fails when the storage capacity is less than 2000 mebibytes
There is a minimum total storage capacity requirement for the container registry, addressed as the "limit" field in VMODL. This is because some Kubernetes pods need enough storage space to work properly. To achieve container registry functionality, there is a minimum capacity of 5 Gigabytes. Note that this limit offers no guarantee of improved performance or increased number or size of images that can be supported.
Workaround: This issue can be avoided by deploying the container registry with a larger total capacity. The recommended storage volume is no less than 5 gigabytes.
If you replace the TLS certificate of the NSX load balancer for Kubernetes cluster you might fail to log in to the embedded Harbor registry from a docker client or the Harbor UI
To replace the TLS certificate of the NSX load balancer for Kubernetes cluster, from the vSphere UI navigate to Configure > Namespaces > Certificates > NSX Load Balancer > Actions and click Replace Certificate. When you replace the NSX certificate, the login operation to the embedded Harbor registry from a docker client or the Harbor UI might fail with the unauthorized: authentication required or Invalid user name or password error.
Workaround: Restart the registry agent pod in the vmware-system-registry namespace:
Run the kubectl get pod -n vmware-system-registry command.
Delete the pod listed in the output by running the kubectl delete pod vmware-registry-controller-manager-xxxxxx -n vmware-system-registry command.
Wait until the pod restarts.
Any vSphere Pod deployed in Supervisor Clusters that makes use of the DNSDefault will fall back to using the clusterDNS configured for the cluster
Workaround: None.
All hosts in a cluster might be updated simultaneously when upgrading a Supervisor Cluster
In certain cases, all hosts in a cluster will be updated in parallel during the Supervisor Cluster upgrade process. This will cause downtime for all pods running on this cluster.
Workaround: During Supervisor Cluster upgrade, don't restart wcpsvc or remove/add hosts.
Supervisor Cluster upgrade can be stuck indefinitely if VMCA is used as an intermediate CA
Supervisor Cluster upgrade can be stuck indefinitely in "configuring" if VMCA is being used as an intermediate CA.
Workaround: Switch to a non-intermediate CA for VMCA and delete any control plane VMs stuck in "configuring".
vSphere Pod deployment fails if a Storage Policy with encryption enabled is assigned for Pod Ephemeral Disks
If a Storage Policy with encryption enabled is used for Pod Ephemeral Disks, vSphere Pod creation fails with an "AttachVolume.Attach failed for volume" error.
Workaround: Use a storage policy with no encryption for Pod Ephemeral Disks.
Supervisor Cluster upgrade hangs at 50% during "Namespaces cluster upgrade is in upgrade host step"
The problem occurs when a vSphere Pod hangs in TERMINATING state during the upgrade of the Kubernetes control plane node. The controller of the control plane node tries to upgrade the Spherelet process, and during that phase vSphere Pods are evicted or killed on that control plane node to unregister the node from the Kubernetes control plane. For this reason, the Supervisor Cluster upgrade hangs at an older version until the vSphere Pods in TERMINATING state are removed from inventory.
Workaround:
1. Log in to the ESXi host on which the vSphere Pod hangs in TERMINATING state.
2. Remove the TERMINATING vSphere Pods by using the following commands:
# vim-cmd vmsvc/getallvms
# vim-cmd vmsvc/destroy <vmid>
After this step, the vSphere Pods display in orphaned state in the vSphere Client.
3. Delete the orphaned vSphere Pods by first adding a user to the ServiceProviderUsers group.
a.) Log in to the vSphere Client, select Administration -> Users and Groups -> Create User, and click Groups.
b.) Search for ServiceProviderUsers or the Administrators group and add a user to the group.
4. Log in to the vSphere Client as the newly created user and delete the orphaned vSphere Pods.
5. In kubectl, use the following command:
kubectl patch pod <pod-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'
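The patch command in step 5 needs a pod name and a namespace, and the JSON payload must arrive at kubectl intact. The sketch below assembles the argument list in Python so that no shell quoting is involved; the function name is hypothetical, and the pod name and namespace are values you supply.

```python
import json
from typing import List


def finalizer_patch_command(pod_name: str, namespace: str) -> List[str]:
    """Build the kubectl argument list that clears a pod's finalizers."""
    if not pod_name or not namespace:
        raise ValueError("pod name and namespace are both required")
    # json.dumps renders Python None as JSON null.
    patch = json.dumps({"metadata": {"finalizers": None}}, separators=(",", ":"))
    return ["kubectl", "patch", "pod", pod_name, "-n", namespace, "-p", patch]
```

The resulting list can be passed to subprocess.run as is. Clearing finalizers bypasses normal cleanup, so it should be limited to pods that were already removed as described in the preceding steps.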
Workload Management UI throws the following license error: None of the hosts connected to this vCenter are licensed for Workload Management
After successfully enabling Workload Management on a vSphere Cluster, you might see the following licensing error after rebooting vCenter Server or upgrading ESXi hosts where Workload Management is enabled: None of the hosts connected to this vCenter are licensed for Workload Management. This is a cosmetic UI error. Your license should still be valid and your workloads should still be running.
Workaround: Users should clear their browser cache for the vSphere Client.
Large vSphere environments might take a long time to sync on a cloud with the VMware NSX Advanced Load Balancer Controller
vSphere environments with inventories that contain more than 2,000 ESXi hosts and 45,000 virtual machines might take as much as 2 hours to sync on a cloud by using an NSX Advanced Load Balancer Controller.
Workaround: None
The private container registry of the Supervisor Cluster might become unhealthy after VMware Certificate Authority (VMCA) root certificate is changed on a vCenter Server 7.0 Update 2
After you change the VMware Certificate Authority (VMCA) root certificate on a vCenter Server system 7.0 Update 2, the private container registry of the Supervisor Cluster might become unhealthy and the registry operations might stop working as expected. The following health status message for the container registry is displayed on the cluster configuration UI:
Harbor registry harbor-1560339792 on cluster domain-c8 is unhealthy. Reason: failed to get harbor health: Get https://30.0.248.2/api/health: x509: certificate signed by unknown authority
Workaround:
Restart the registry agent pod manually in the vmware-system-registry namespace on the vSphere Kubernetes cluster:
Run the kubectl get pod -n vmware-system-registry command to get the registry agent pod.
Delete the pod listed in the output by running the kubectl delete pod vmware-registry-controller-manager-xxxxxx -n vmware-system-registry command.
Wait until the pod restarts.
Refresh the image registry on the cluster configuration UI, and the health status should show as running shortly.
Projects for newly-created namespaces on the Supervisor Cluster are not automatically created on the private container registry
Projects might not be automatically created on the private container registry for newly-created namespaces on a Supervisor Cluster. The status of the container registry still displays as healthy, but no projects are shown on the container registry of the cluster when a new namespace is created. You cannot push or pull images to the projects of the new namespaces on the container registry.
Workaround:
Run the kubectl get pod -n vmware-system-registry command to get the registry agent pod.
Delete the pod listed in the output by running the kubectl delete pod vmware-registry-controller-manager-xxxxxx -n vmware-system-registry command.
Wait until the pod restarts.
Log in to the private container registry to verify that projects are created for namespaces on the cluster.
This issue might occur when you use a deployment with 10 replica pods in a YAML file. When you create the deployment from this YAML file by using the private container registry, at least 7 of the 10 replicas might succeed and 3 might fail with the "ErrImgPull" issue.
Workaround: Use fewer replicas, with a maximum of 5.
The NSX Advanced Load Balancer Controller is not supported when vCenter Server is deployed with a custom port
You cannot register vCenter Server with the NSX Advanced Load Balancer Controller, because the NSX Advanced Load Balancer Controller UI provides no option for specifying a custom vCenter Server port during registration.
NSX Advanced Load Balancer Controller works only when vCenter Server is deployed with default ports 80 and 443.
When performing domain repointing on vCenter Server that already contains running Supervisor Clusters, the Supervisor Clusters will go in Configuring state
Domain repointing is not supported on vCenter Server that has Supervisor Clusters. When you try to perform domain repointing, existing Supervisor Clusters go into Configuring state, and control plane VMs and Tanzu Kubernetes cluster VMs stop appearing in the inventory under the Hosts and Clusters view.
Workaround: None
Assigning tags and custom attributes does not work for VMs created using Kubernetes in a Supervisor Cluster
When you try to assign tags or custom attributes to a VM created in a Supervisor Cluster either through the vSphere UI client or using vAPIs, the operation fails with the message "An error occurred while attaching tags".
Workaround: None
Developers with permission to self-service namespaces cannot access cluster-scoped resources such as storage classes
When developers attempt to list cluster-scoped storage classes using the kubectl command kubectl get storageclass -n test-ns or kubectl get storageclass, they encounter the following error:
Error from server (Forbidden): storageclasses.storage.k8s.io is forbidden: User "<sso:DEVUSER@DOMAIN>" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
Workaround: This is expected behavior, because developers only have access to the storage classes assigned to the namespace template they have access to. This is predetermined by the vSphere administrator.
The following command lists the storage classes associated with the namespace:
kubectl get resourcequota <NAMESPACE>-storagequota -n <NAMESPACE>
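The quota output names storage classes only indirectly, as prefixes of the quota keys. The sketch below extracts those prefixes, assuming the keys follow the standard Kubernetes pattern <storage-class>.storageclass.storage.k8s.io/<resource> and that the input is the spec.hard map from the command above run with -o json. The function name is hypothetical.

```python
from typing import Dict, List

# Marker that separates the storage class name from the quota resource.
STORAGE_CLASS_MARKER = ".storageclass.storage.k8s.io/"


def storage_classes_from_quota(hard: Dict[str, str]) -> List[str]:
    """Return the sorted storage class names referenced by quota keys."""
    names = set()
    for key in hard:
        if STORAGE_CLASS_MARKER in key:
            # Everything before the marker is the storage class name.
            names.add(key.split(STORAGE_CLASS_MARKER, 1)[0])
    return sorted(names)
```

Keys without the marker, such as a namespace-wide requests.storage limit, are ignored because they do not refer to a specific storage class.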
Trying to create a namespace using a manifest with kubectl apply -f fails with the following error:
Error from server (Forbidden): namespaces is forbidden: User "sso:user@vsphere.local" cannot list resource "namespaces" in API group "" at the cluster scope
Workaround: Developers can use the kubectl create -f command to create a namespace instead.
You may encounter the error cannot "create resource pods" when trying to create a vSphere pod in a namespace created using the self-service namespace template.
Workaround: Wait for a few seconds after namespace creation to create vSphere pods in the namespace.
Developers are unable to add labels and annotations to namespaces created using the self-service namespace template
Trying to specify labels and annotations in the manifest used to create or modify a namespace with the command kubectl apply -f fails with the following error:
Error from server (Forbidden): User "sso:nuser@vsphere.local" cannot patch resource "namespaces" in API group "" in the namespace ""
Workaround: Add the required labels and annotations to the namespace manifest and use kubectl create -f instead.
When you start a Supervisor Cluster upgrade, any task related to the namespace template, such as activation, deactivation or updates, stays in a queue until the upgrade completes.
Workaround: Wait for the upgrade operation to complete before running commands to manipulate the namespace template.
Attempts to attach a persistent volume to a pod on a Tanzu Kubernetes cluster fail with the “CNS: Failed to attach disk because missing SCSI controller” error message
When you try to attach a persistent volume to a pod on a Tanzu Kubernetes cluster, your attempts fail, and the pod remains in a pending state. The error message indicates that the SCSI controller is missing even though the worker node VM has a PVSCSI controller configured.
This problem might occur when the worker node VM reaches its block volume limit of 60 volumes. However, Kubernetes ignores the vSphere volume limits and schedules block volume creation on that node.
Workaround: Add a worker node to the cluster. Delete the pod to schedule its deployment to the new worker node.
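To see which worker nodes are close to the limit before scheduling more stateful pods, you can count attached volumes per node. The sketch below assumes you have already collected (node, volume) pairs, for example from the VolumeAttachment objects in the cluster; the function name and the input format are hypothetical, and the limit of 60 is taken from the description above.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Block volume limit per worker node VM, as stated in the issue description.
BLOCK_VOLUME_LIMIT = 60


def nodes_at_limit(
    attachments: Iterable[Tuple[str, str]], limit: int = BLOCK_VOLUME_LIMIT
) -> List[str]:
    """Return the sorted names of nodes with at least `limit` volumes."""
    counts = Counter(node for node, _volume in attachments)
    return sorted(node for node, count in counts.items() if count >= limit)
```

Any node returned by this function is a candidate for the workaround: add a worker node and reschedule the pending pod.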
Deleting a stateful pod after an inactive vCenter Server session and attempts to later reuse or delete its volume in the Tanzu Kubernetes cluster might cause failures and unpredictable behavior
When you delete a stateful pod after a day or so of being inactive, its volume appears to be successfully detached from the node VM of the Tanzu Kubernetes cluster. However, when you try to create a new stateful pod with that volume or delete the volume, your attempts fail because the volume is still attached to the node VM in vCenter Server.
Workaround: Use the CNS API to detach the volume from the node VM to synchronize the state of the volume in the Tanzu Kubernetes cluster and vCenter Server. Also restart the CSI controller in the Supervisor cluster to renew the inactive session.
Supervisor Cluster Upgrade stuck / not completing due to IP range exhaustion on Supervisor Cluster's primary workload network (if using vSphere Networking) / NCP cluster network pod CIDRs (if using NSX-T Container Plugin).
Supervisor Cluster Upgrade is stuck on Pending state, with message: "Namespaces cluster upgrade is in provision a new master step".
New Control Plane VMs deployed during cluster upgrade only receive one network interface. Control Plane VMs should have 2 network interfaces, one connected to the management network and the other to the workload network.
Certain system pods, such as coredns, that are supposed to be deployed to the new Control Plane VM may be stuck in Pending state.
Workaround: Delete a small number of workloads (VMs, PodVMs, TKG clusters) to free up enough IPs for the upgrade process to complete. At least 3 IPs should be freed up.
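Before retrying the upgrade, you can estimate whether enough addresses are free. The sketch below counts unused addresses in a start-to-end range and compares the result with the 3 addresses mentioned above. The range and the list of used addresses are inputs you must gather from your own environment, and the function names are hypothetical.

```python
import ipaddress
from typing import Iterable


def free_addresses(start: str, end: str, used: Iterable[str]) -> int:
    """Count addresses in the inclusive range that are not in `used`."""
    first = int(ipaddress.ip_address(start))
    last = int(ipaddress.ip_address(end))
    if last < first:
        raise ValueError("range end is lower than range start")
    used_in_range = {
        value
        for value in (int(ipaddress.ip_address(addr)) for addr in used)
        if first <= value <= last
    }
    return (last - first + 1) - len(used_in_range)


def can_upgrade(start: str, end: str, used: Iterable[str], required: int = 3) -> bool:
    """Return True when at least `required` addresses remain free."""
    return free_addresses(start, end, used) >= required
```

Addresses outside the range and duplicate entries in the used list are ignored, so the count is not skewed by noisy input.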
Update the key value pairs or registry credentials for a Supervisor Service configuration
You might want to change the configuration registry key-value pairs of a Supervisor Service because you have entered incorrect login credentials, or because the registry password might have expired.
Workaround:
1. Create a new secret resource on the Supervisor Cluster.
kubectl -n vmware-system-supervisor-services create secret generic new-secret --from-literal=registryName= --from-literal=registryPasswd= --from-literal=registryUsername=
2. Update service reference for the Supervisor Service resource.
# kubectl edit supervisorservice svc-minio -n vmware-system-supervisor-services
3. Update the spec.config section:
spec:
  config:
    secretNamespace: vmware-system-supervisor-services
    secretRef: new-secret
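As an alternative to passing the literals on the command line in step 1, the same secret can be described as a manifest. The sketch below builds an equivalent Secret object as a Python dictionary, using the three key names from the kubectl command above. The function name is hypothetical, and the values passed in are placeholders for your own registry details.

```python
import base64


def registry_secret(
    name: str,
    registry_name: str,
    username: str,
    password: str,
    namespace: str = "vmware-system-supervisor-services",
) -> dict:
    """Build a generic (Opaque) Secret manifest with registry credentials."""

    def encode(value: str) -> str:
        # Secret data values must be base64-encoded strings.
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "Opaque",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            "registryName": encode(registry_name),
            "registryUsername": encode(username),
            "registryPasswd": encode(password),
        },
    }
```

Serialized to JSON or YAML, the dictionary can be applied with kubectl. Base64 is an encoding and not encryption, so treat the manifest as sensitive.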
When vCenter Server is signed by custom CA certificate and the Supervisor Cluster is enabled, the Tanzu Kubernetes Grid Service UI plug-in and any Supervisor Service UI plug-ins deployed in the vSphere Client do not work. The UI plug-ins experience SSL authentication issues when trying to communicate with their respective backend servers in the Supervisor Cluster. The Tanzu Kubernetes Grid Service plug-in displays the following error:
The Tanzu Kubernetes Grid Service failed to authenticate with kubernetes; see the browser console for more technical details
Workaround: Add the trust root in a separate file (not in vmca.pem) in /etc/ssl/certs and regenerate the hashes by using c_rehash. You must perform this on all three control plane VMs.
Note that editing /etc/ssl/certs/vmca.pem is not advisable as the contents of this file will be overwritten by update-controller during every cluster synchronization.
Take into consideration that if one of the control plane VMs in the Supervisor Cluster is redeployed (for example after resizing the control plane VMs or due to a repair operation), the certificate added manually to /etc/ssl/certs will be lost and the workaround will have to be reapplied to that VM.
After configuring a vSphere Cluster as a Supervisor Cluster, the cluster hangs in configuring state. You cannot create vSphere Pods and Tanzu Kubernetes clusters.
Workaround: Restart the wcp service by using the following command:
vmon-cli --restart wcp
Check the documentation for more detailed instructions.
"kubectl get virtualmachineimages" does not return any results even when the associated Content Library is populated with VM images
If the Content Library being used by the VM Service is enabled with a security policy, VM Service will fail to read the images, and VMs cannot be deployed using images from this Content Library.
Workaround: Disassociate the Content Library with the security policy from the supervisor namespace. Remove the "Default OVF security policy" setting from the relevant Content Library, and then re-associate the Content Library with the namespace.
Developers will see incorrect POWERSTATE for VMs with vGPU and pass-through devices that are managed by VM Service when their host enters maintenance mode.
When a host enters maintenance mode, VMs with vGPU and pass-through devices that are managed by VM Service are automatically powered off by DRS. However, the output of kubectl get vm still displays the prior power state of the VMs. As a result, it shows POWERSTATE=poweredOn even when the VM is powered off in vCenter Server.
Workaround: None.
On a Supervisor Cluster with NSX-T setup, ESXi hosts might continue to use self-signed Spherelet VIBs
If you configured or upgraded your Supervisor Cluster with the Kubernetes version available in vCenter Server 7.0 Update 3 and you have further upgraded the Kubernetes version of your Supervisor Cluster to a version available in vCenter Server 7.0 Update 3a, ESXi host(s) might continue to use self-signed Spherelet VIBs. This occurs because the self-signed Spherelet VIB version installed on vCenter Server 7.0 Update 3 and vCenter Server 7.0 Update 3a are identical. This issue does not apply to a Supervisor Cluster configured with the vSphere networking stack.
Workaround 1:
If the vSphere cluster is not based on vSphere Lifecycle Manager, perform the below steps to get the VMware certified Spherelet VIBs installed:
Put an ESXi host in maintenance mode and wait until it is marked as "not ready" by the Supervisor Cluster.
Move that host outside the cluster as a standalone host and wait for the spherelet and NSX-T VIBs to get uninstalled.
Move the same host back to the cluster, exit from maintenance mode and wait for new spherelet and NSX-T VIB to get configured again.
Repeat the above steps for each ESXi host inside the cluster.
If the Supervisor Cluster is enabled with vSphere Lifecycle Manager, deactivate the Supervisor Cluster and re-configure it again.
Workaround 2:
Once vCenter Server is upgraded to vCenter Server 7.0 Update 3a, deactivate the Supervisor Cluster and re-configure it. This applies both to clusters enabled with vSphere Lifecycle Manager and to non-vLCM/VUM based clusters.
Some vSphere Pods show a NodeAffinity status after host reboot or supervisor cluster upgrade
After you reboot the host or upgrade supervisor cluster, you might see a NodeAffinity status for some vSphere Pods. This behavior results from an issue in upstream Kubernetes in which the Kubernetes scheduler creates redundant pods with a NodeAffinity status in the cluster after kubelet restart. This issue does not affect cluster functionality. For information on the upstream Kubernetes issue, read https://github.com/kubernetes/kubernetes/issues/92067 .
Workaround: Delete the redundant pods with NodeAffinity status.
vDPp service operator and instance pods enabled on vSphere with Tanzu enter a Pending state after you upgrade your environment to 7.0 Update 3e or later
This problem occurs if the version of the partner service installed on the Supervisor Cluster does not support Kubernetes version 1.22. After you upgrade your vSphere with Tanzu environment to the 1.22-compliant version 7.0 Update 3e or later, the service becomes incompatible. As a result, the operator and instance pods enter a Pending state.
Attempts to upgrade the service to a newer version fail.
Workaround: Upgrade the vDPp service to a version compatible with Kubernetes 1.22 before upgrading vSphere with Tanzu to 7.0 Update 3e.
On rare occasions, automatic upgrade of a cluster from Kubernetes 1.19.1 to 1.20.8 may fail with the following error:
Task for upgrade not found. Please trigger upgrade again.
Workaround: Manually start the cluster upgrade.
Operations on a Supervisor namespace may fail if the namespace has been reused within a short period of time
When a Supervisor namespace was deleted and recreated with the same name within a short period of time, operations could fail due to an invalid entry in the cache.
This has been resolved in the latest release.
When a Supervisor support bundle was generated, the vCenter Server support bundle was not automatically included.
Fixed in the latest version.
During Workload Management enablement it is possible to download a network checklist sheet. Localization is not applied to the sheet.
Fixed in latest release.
On a scaled Tanzu Kubernetes cluster setup, OOM issues are observed on the WCP system pod capi-kubeadm-control-plane-controller-manager
Documented in KB article https://kb.vmware.com/s/article/88914
Supervisor Cluster upgrade stuck at components upgrade step due to TKG or capw upgrade failure.
Documented in KB article https://kb.vmware.com/s/article/88854
If a security policy CR that contains rules A and B is created and is then updated so that its rules change to B and C, rule A still takes effect.
Workaround: Delete the security policy and re-create it with the updated rules.
LoadBalancers and Tanzu Kubernetes clusters are not created when two SE Groups exist on NSX Advanced Load Balancer.
If a second SE Group is added to NSX Advanced Load Balancer with or without SEs or virtual services assigned to it, the creation of new Supervisor Clusters or Tanzu Kubernetes clusters fails, and existing Supervisor Clusters cannot be upgraded. The virtual service creation on NSX Advanced Load Balancer controller fails with the following error:
"get() returned more than one ServiceEngineGroup – it returned 2"
As a result, new load balancers are unusable and you cannot create new workload clusters successfully.
For more information, see the VMware Knowledge Base article 90386.
Workaround: Use only the SE Group "Default-Group" for VirtualService creation. Delete all other SE Groups from NSX Advanced Load Balancer except Default-Group and retry the operation.
If a security policy CR that contains rules A and B is created and is then updated, changing the rules to B and C, rule A still takes effect and is not removed.
Workaround: Create and apply a new policy that includes rules B and C and delete the old policy containing A and B.
The namespace management API may sometimes return an HTTP 500 error, failing to authorize a request
A request to Workload Management may intermittently fail. The API returns a 500 status code, and the request is not processed. You will find a log message stating, "An unexpected error occurred during the authorization." This issue is intermittent but is more likely to happen when Workload Management is under load, such as when actively configuring one or more Supervisors.
Workaround: Retry the operation you were attempting when the error occurred.
During enablement, upgrade, or other node redeployment, a Supervisor Cluster may remain in the CONFIGURING or ERROR state and show the following error message:
"API request to VMware vCenter Server (vpxd) failed. Details 'ServerFaultCode: A general system error occurred: vix error codes = (1, 0)'"
The relevant logs can be found in: /var/log/vmware/wcp/wcpsvc.log
Workaround: Delete the vSphere Agent Manager (EAM) agency corresponding to the control plane VM experiencing the issue to force redeployment of the node.
After deleting a PersistentVolumeClaim (PVC), the corresponding PersistentVolume (PV) may remain in a terminating state in Supervisor. Additionally, the vSphere Client may display multiple failed "deleteVolume" tasks.
1. Authenticate with the Supervisor:
kubectl vsphere login --server=IP-ADDRESS --vsphere-username USERNAME
2. Get the name of the persistent volume in terminating state:
kubectl get pv
3. Note down the volume handle from the persistent volume:
kubectl describe pv <pv-name>
4. Using the volume handle from previous step, delete the CnsVolumeOperationRequest Custom resource in the Supervisor:
kubectl delete cnsvolumeoperationrequest delete-<volume-handle>
Note: Before deleting a PV, ensure that it is not being used by any other resources in the cluster.
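The lookup in steps 2 to 4 can be scripted. The following is a minimal Python sketch, not a supported tool. It assumes you have saved the output of kubectl get pv -o json as a string, that a PV being deleted carries a metadata.deletionTimestamp, and that the volume handle is stored under spec.csi.volumeHandle. The function name and the sample data are hypothetical.

```python
import json

def terminating_pv_cleanup_commands(pv_list_json: str) -> list[str]:
    """Return one kubectl delete command per PV that is stuck terminating."""
    commands = []
    for pv in json.loads(pv_list_json).get("items", []):
        # A PV that is being deleted has a deletionTimestamp in its metadata.
        if "deletionTimestamp" not in pv.get("metadata", {}):
            continue
        handle = pv.get("spec", {}).get("csi", {}).get("volumeHandle")
        if handle:
            commands.append(
                f"kubectl delete cnsvolumeoperationrequest delete-{handle}"
            )
    return commands

# Hypothetical sample shaped like `kubectl get pv -o json` output.
sample = json.dumps({"items": [
    {"metadata": {"name": "pv-a", "deletionTimestamp": "2023-01-01T00:00:00Z"},
     "spec": {"csi": {"volumeHandle": "handle-a"}}},
    {"metadata": {"name": "pv-b"},
     "spec": {"csi": {"volumeHandle": "handle-b"}}},
]})
for command in terminating_pv_cleanup_commands(sample):
    print(command)
```

The sketch only prints the commands, so you can review each one and confirm that the PV is not in use before running it.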
There is a combined 60 minute timeout for NSX Edge OVF deployment and NSX Edge VM registration. In slower networks or environments with slower storage, if the time elapsed for Edge deployment and registration exceeds this 60 minute timeout, the operation will fail.
Workaround: Clean up edges and restart the deployment.
NSX Edges are not updated if vCenter Server DNS, NTP, or Syslog settings are changed after cluster configuration
DNS, NTP, and Syslog settings are copied from vCenter Server to NSX Edge virtual machines during cluster configuration. If any of these vCenter Server settings are changed after configuration, the NSX Edges are not updated.
Workaround: Use the NSX Manager APIs to update the DNS, NTP, and Syslog settings of your NSX Edges.
NSX Edge Management Network Configuration only provides subnet and gateway configuration on select portgroups
The NSX Edge management network compatibility drop down list will show subnet and gateway information only if there are ESXi VMKnics configured on the host that are backed by a DVPG on the selected VDS. If you select a Distributed Portgroup without a VMKnic attached to it, you must provide a subnet and gateway for the network configuration.
Workaround: Use one of the following configurations:
Discrete Portgroup: This is where no VMKs currently reside. You must supply the appropriate subnet and gateway information for this portgroup.
Shared Management Portgroup: This is where the ESXi hosts' Management VMK resides. Subnet and gateway information will be pulled automatically.
When attempting to use VLAN 0 for overlay Tunnel Endpoints or uplink configuration, the operation fails with the message:
Argument 'uplink_network vlan' is not a valid VLAN ID for an uplink network. Please use a VLAN ID between 1-4094
Workaround: Manually enable VLAN 0 support using one of the following processes:
1. SSH into your deployed VC (root/vmware).
2. Open /etc/vmware/wcp/nsxdsvc.yaml. It will have content similar to:
logging:
  level: debug
  maxsizemb: 10
a. To enable VLAN 0 support for NSX Cluster Overlay Networks, append the following lines to /etc/vmware/wcp/nsxdsvc.yaml and save the file.
experimental:
  supportedvlan:
    hostoverlay:
      min: 0
      max: 4094
    edgeoverlay:
      min: 1
      max: 4094
    edgeuplink:
      min: 1
      max: 4094
b. To enable VLAN 0 support for NSX Edge Overlay Networks, append the following lines to /etc/vmware/wcp/nsxdsvc.yaml and save the file.
experimental:
  supportedvlan:
    hostoverlay:
      min: 1
      max: 4094
    edgeoverlay:
      min: 0
      max: 4094
    edgeuplink:
      min: 1
      max: 4094
c. To enable VLAN 0 support for NSX Edge Uplink Networks, append the following lines to /etc/vmware/wcp/nsxdsvc.yaml and save the file.
experimental:
  supportedvlan:
    hostoverlay:
      min: 1
      max: 4094
    edgeoverlay:
      min: 1
      max: 4094
    edgeuplink:
      min: 0
      max: 4094
3. Restart the workload management service with vmon-cli --restart wcp.
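The three variants in step 2 differ only in which network type gets min: 0. The following Python sketch generates the fragment for one network type. It assumes the keys nest in the order listed (experimental, then supportedvlan, then the network type, then min and max) with two-space indentation. The function name vlan0_block is hypothetical.

```python
NETWORKS = ("hostoverlay", "edgeoverlay", "edgeuplink")

def vlan0_block(enable_for: str) -> str:
    """Build the YAML fragment with min: 0 for one network type, min: 1 for the rest."""
    if enable_for not in NETWORKS:
        raise ValueError(f"unknown network type: {enable_for}")
    lines = ["experimental:", "  supportedvlan:"]
    for name in NETWORKS:
        minimum = 0 if name == enable_for else 1
        lines += [f"    {name}:", f"      min: {minimum}", "      max: 4094"]
    return "\n".join(lines) + "\n"

# Fragment for NSX Edge Uplink Networks (variant c).
print(vlan0_block("edgeuplink"))
```

Review the generated text against the variant you need before appending it to the file.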
vSphere with Tanzu and NSX-T cannot be enabled on a cluster where vSphere Lifecycle Manager Image is enabled
vSphere with Tanzu and NSX-T are not compatible with vSphere Lifecycle Manager Image. They are only compatible with vSphere Lifecycle Manager Baselines. When vSphere Lifecycle Manager Image is enabled on a cluster, you cannot enable vSphere with Tanzu or NSX-T on that cluster.
Workaround: Move hosts to a cluster where vSphere Lifecycle Manager Image is disabled. You must use a cluster with vSphere Lifecycle Manager Baselines. Once the hosts are moved, you can enable NSX-T and then vSphere with Tanzu on that new cluster.
When vSphere with Tanzu networking is configured with NSX-T, "ExternalTrafficPolicy: local" is not supported
For Kubernetes services of type LoadBalancer, the "ExternalTrafficPolicy: local" configuration is not supported.
Workaround: None.
When vSphere with Tanzu networking is configured with NSX-T, the number of services of type LoadBalancer that a Tanzu Kubernetes cluster can support is limited by the NodePort range of the Supervisor Cluster
Each VirtualMachineService of type LoadBalancer is translated to one Kubernetes service of type LoadBalancer and one Kubernetes endpoint. The maximum number of Kubernetes services of type LoadBalancer that can be created in a Supervisor Cluster is 2767. This includes those created on the Supervisor Cluster itself and those created in Tanzu Kubernetes clusters.
Workaround: None.
Once you configure the Supervisor Cluster with the NSX Advanced Load Balancer, you cannot change the vCenter Server PNID.
Workaround: If you must change the PNID of vCenter Server, remove the NSX Advanced Load Balancer Controller, change the vCenter Server PNID, and then redeploy and configure the NSX Advanced Load Balancer Controller with the new PNID of vCenter Server.
In vSphere Distributed Switch (vDS) environments, it is possible to configure Tanzu Kubernetes clusters with network CIDR ranges that overlap or conflict with those of the Supervisor Cluster, and vice versa, resulting in components not being able to communicate.
In vDS environments, there is no design-time network validation done when you configure the CIDR ranges for the Supervisor Cluster, or when you configure the CIDR ranges for Tanzu Kubernetes clusters. As a result, two problems can arise:
1) You create a Supervisor Cluster with CIDR ranges that conflict with the default CIDR ranges reserved for Tanzu Kubernetes clusters.
2) You create a Tanzu Kubernetes cluster with a custom CIDR range that overlaps with the CIDR range used for the Supervisor Clusters.
Workaround:
For vDS environments, when you configure a Supervisor Cluster, do not use either of the default CIDR ranges used for Tanzu Kubernetes clusters, including 192.168.0.0/16, which is reserved for pods, and 10.96.0.0/12, which is reserved for services. See also "Configuration Parameters for Tanzu Kubernetes Clusters" in the vSphere with Tanzu documentation.
For vDS environments, when you create a Tanzu Kubernetes cluster, do not use the same CIDR range that is used for the Supervisor Cluster.
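Because no design-time validation is done in vDS environments, you can check candidate ranges yourself before configuring a cluster. The following Python sketch uses the standard ipaddress module to compare candidate CIDR ranges against the two default Tanzu Kubernetes cluster ranges named above. The function name and the candidate ranges are hypothetical.

```python
import ipaddress

# Default ranges reserved for Tanzu Kubernetes clusters, as listed above.
TKG_DEFAULT_CIDRS = ("192.168.0.0/16", "10.96.0.0/12")

def overlapping_cidrs(candidates, reserved=TKG_DEFAULT_CIDRS):
    """Return (candidate, reserved) pairs whose address ranges overlap."""
    pairs = []
    for candidate in candidates:
        network = ipaddress.ip_network(candidate)
        for other in reserved:
            if network.overlaps(ipaddress.ip_network(other)):
                pairs.append((candidate, other))
    return pairs

# Hypothetical Supervisor Cluster ranges to validate.
print(overlapping_cidrs(["10.100.0.0/16", "172.16.0.0/16"]))
```

To check a custom Tanzu Kubernetes cluster range against the Supervisor Cluster, pass the Supervisor Cluster ranges as the reserved argument.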
To create guest clusters and VirtualMachines under a namespace successfully, you can add a maximum of 18 custom labels to the namespace. Otherwise, the virtual network for the guest cluster under the namespace cannot be created.
Workaround: Remove labels from the namespace so that the total number is 18 or fewer. The NCP retry mechanism will retry and create the namespace, but depending on the interval it might take up to 6 hours.
If an NSX Transport Zone was created via the Policy API, Supervisor enablement might have failed. Creating an NSX Transport Zone via the Policy API is now supported, and backward compatibility with NSX Transport Zones created with the MP API is maintained.
Fixed in latest release.
You cannot use NSX Advanced Load Balancer with a vCenter Server using an Embedded Linked Mode topology.
When you configure the NSX Advanced Load Balancer controller, you can configure it on multiple clouds. However, you do not have an option to select multiple clouds while enabling vSphere with Tanzu as it only supports the Default-Cloud option. As a result, you cannot use the NSX Advanced Load Balancer with a vCenter Server version using an Embedded Linked Mode topology.
Workaround: Configure NSX Advanced Load Balancer for each vCenter Server.
NEW: Attempts to run the Remove Disk operation on a vSAN Direct Datastore fail with the VimFault - Cannot complete the operation error
Generally, you can observe this error when one of the following scenarios occurs:
As a part of the Remove Disk operation, all persistent volumes placed on the vSAN Direct Datastore are relocated to other vSAN Direct Datastores on the same ESXi host. The relocation can fail if no space is available on the target vSAN Direct Datastores. To avoid this failure, make sure that the target vSAN Direct Datastores have sufficient storage space for running applications.
The Remove Disk operation can also fail when the target vSAN Direct Datastores on the ESXi host have sufficient storage. This might occur when the underlying persistent volume relocation operation, spawned by the Remove Disk parent operation, takes more than 30 minutes due to the size of the volume. In this case, you can observe that reconfiguration of the underlying vSphere Pod remains in progress in the Tasks view.
The in-progress status indicates that even though the Remove Disk operation times out and fails, the underlying persistent volume relocation done by reconfiguration of the vSphere Pod is not interrupted.
Workaround: After the reconfiguration task for the vSphere Pod completes, run the Remove Disk operation again. The Remove Disk operation then proceeds successfully.
An expansion of a Supervisor cluster PVC in offline or online mode does not result in an expansion of a corresponding Tanzu Kubernetes cluster PVC
A pod that uses the Tanzu Kubernetes cluster PVC cannot use the expanded capacity of the Supervisor cluster PVC because the filesystem has not been resized.
Workaround: Resize the Tanzu Kubernetes cluster PVC to a size equal to or greater than the size of the Supervisor cluster PVC.
Static provisioning in Kubernetes does not verify if the PV and backing volume sizes are equal. If you statically create a PVC in a Tanzu Kubernetes cluster, and the PVC size is less than the size of the underlying corresponding Supervisor cluster PVC, you might be able to use more space than the space you request in the PV. If the size of the PVC you statically create in the Tanzu Kubernetes cluster is greater than the size of the underlying Supervisor cluster PVC, you might notice a No space left on device error even before you exhaust the requested size in the Tanzu Kubernetes cluster PV.
Workaround:
1. In the Tanzu Kubernetes cluster PV, change the persistentVolumeReclaimPolicy to Retain.
2. Note the volumeHandle of the Tanzu Kubernetes cluster PV and then delete the PVC and PV in the Tanzu Kubernetes cluster.
3. Re-create the Tanzu Kubernetes cluster PVC and PV statically using the volumeHandle and set the storage to the same size as the size of the corresponding Supervisor cluster PVC.
Attempts to create a PVC from a supervisor namespace or a TKG cluster fail if the external csi.vsphere.vmware.com provisioner loses its lease for leader election
When you try to create a PVC from a supervisor namespace or a TKG cluster using the kubectl command, your attempts might not succeed. The PVC remains in the Pending state. If you describe the PVC, the Events field displays the following information in a table layout:
Type – Normal
Reason – ExternalProvisioning
Age – 56s (x121 over 30m)
From – persistentvolume-controller
Message – waiting for a volume to be created, either by external provisioner "csi.vsphere.vmware.com" or manually created by system administrator
Workaround:
1. Verify that all containers in the vsphere-csi-controller pod inside the vmware-system-csi namespace are running.
kubectl describe pod vsphere-csi-controller-pod-name -n vmware-system-csi
2. Check the external provisioner logs by using the following command.
kubectl logs vsphere-csi-controller-pod-name -n vmware-system-csi -c csi-provisioner
The following entry indicates that the external-provisioner sidecar container lost its leader election:
I0817 14:02:59.582663 1 leaderelection.go:263] failed to renew lease vmware-system-csi/csi-vsphere-vmware-com: failed to tryAcquireOrRenew context deadline exceeded
F0817 14:02:59.685847 1 leader_election.go:169] stopped leading
3. Delete this instance of vsphere-csi-controller.
kubectl delete pod vsphere-csi-controller-pod-name -n vmware-system-csi
Kubernetes will create a new instance of the CSI controller and all sidecars will be reinitialized.
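If you collect the csi-provisioner logs into a file or string, the check for the lost lease can be automated. The following Python sketch looks for the two messages quoted in the log entry above. The function name is hypothetical, and the match strings are taken from that sample entry only, so other releases might log different text.

```python
import re

# Messages taken from the sample csi-provisioner log entry above.
LOST_LEASE = re.compile(r"failed to renew lease|stopped leading")

def lost_leader_election(log_text: str) -> bool:
    """Return True if any log line reports a lost leader election lease."""
    return any(LOST_LEASE.search(line) for line in log_text.splitlines())

sample_log = (
    "I0817 14:02:59.582663 1 leaderelection.go:263] failed to renew lease "
    "vmware-system-csi/csi-vsphere-vmware-com: failed to tryAcquireOrRenew "
    "context deadline exceeded\n"
    "F0817 14:02:59.685847 1 leader_election.go:169] stopped leading\n"
)
print(lost_leader_election(sample_log))
```

A True result suggests that deleting the vsphere-csi-controller pod, as described in the workaround, is the next step.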
All PVC operations, such as create, attach, detach, or delete a volume, fail while CSI cannot connect to vCenter Server
In addition to operation failures, the Volume health information and the StoragePool CR cannot be updated in the Supervisor cluster. The CSI and Syncer logs display errors about not being able to connect to vCenter Server.
CSI connects to vCenter Server as a specific solution user. The password for this SSO user is rotated by wcpsvc once every 24 hours and the new password is transferred into a Secret that the CSI driver reads to connect to vCenter Server. If the new password fails to be delivered to the Secret, the stale password remains in the Supervisor cluster, and the CSI driver fails its operations.
This problem affects vSAN Data Persistence Platform and all CSI volume operations.
Workaround:
Typically the WCP Service delivers the updated password to CSI that runs in the Kubernetes cluster. Occasionally, the password delivery doesn't happen due to a problem, for example, a connectivity issue or an error in an earlier part of the sync process. The CSI continues to use the old password and eventually locks the account due to too many authentication failures.
Ensure that the WCP cluster is in a healthy and running state. No errors should be reported for that cluster on the Workload Management page. After problems causing the sync to fail are resolved, force a password refresh to unlock the locked account.
To force reset of the password:
1. Stop wcpsvc:
vmon-cli -k wcp
2. Edit the time of the last password rotation to a small value, for example 1, by changing the 3rd line in /etc/vmware/wcp/.storageUser to 1.
3. Start wcpsvc:
vmon-cli -i wcp
The wcpsvc resets the password, which unlocks the account and delivers the new password to the cluster.
When a Supervisor Cluster is upgraded, it can trigger a rolling update of all the Tanzu Kubernetes clusters to propagate any new configuration settings. During this process, a previously "Running" TKC Cluster might hang in the "Updating" phase. A "Running" Tanzu Kubernetes cluster only indicates the availability of the control plane and it is possible that the required control plane and worker nodes have not been successfully created. Such a Tanzu Kubernetes cluster might fail the health checks that are performed during the rolling update that initiates upon completion of the Supervisor Cluster upgrade. This results in the Tanzu Kubernetes cluster hanging in the "Updating" phase and can be confirmed by looking at the events on the KubeadmControlPlane resources associated with the Tanzu Kubernetes Cluster. The events emitted by the resource will be similar to the one below:
Warning ControlPlaneUnhealthy 2m15s (x1026 over 5h42m) kubeadm-control-plane-controller Waiting for control plane to pass control plane health check to continue reconciliation: machine's (gc-ns-1597045889305/tkg-cluster-3-control-plane-4bz9r) node (tkg-cluster-3-control-plane-4bz9r) was not checked
Workaround: None.
When a VI Admin deletes a storage class from the vCenter Server namespace, access to that storage class is not removed for any Tanzu Kubernetes cluster that is already using it.
Workaround:
As VI Admin, after deleting a storage class from the vCenter Server namespace, create a new storage policy with the same name.
Re-add the existing storage policy or the one you just recreated to the supervisor namespace. TanzuKubernetesCluster instances using this storage class should now be fully functional.
For each TanzuKubernetesCluster resource using the storage class you wish to delete, create a new TanzuKubernetesCluster instance using a different storage class and use Velero to migrate workloads into the new cluster.
Once no TanzuKubernetesCluster or PersistentVolume uses the storage class, it can be safely removed.
The embedded container registry SSL certificate is not copied to Tanzu Kubernetes cluster nodes
When the embedded container registry is enabled for a Supervisor Cluster, the Harbor SSL certificate is not included in any Tanzu Kubernetes cluster nodes created on that Supervisor Cluster, and you cannot connect to the registry from those nodes.
Workaround: Copy and paste the SSL certificate from the Supervisor Cluster control plane to the Tanzu Kubernetes cluster worker nodes.
When multiple vCenter Server instances are configured in an Embedded Linked Mode setup, the UI allows the user to select a content library created on a different vCenter Server instance. Selecting such a library results in virtual machine images not being available for DevOps users to provision a Tanzu Kubernetes cluster. In this case, `kubectl get virtualmachineimages` does not return any results.
Workaround: When you associate a content library with the Supervisor Cluster for Tanzu Kubernetes cluster VM images, choose a library that is created in the same vCenter Server instance where the Supervisor Cluster resides. Alternatively, create a local content library which also supports air-gapped provisioning of Tanzu Kubernetes clusters.
You cannot provision new Tanzu Kubernetes clusters, or scale out existing clusters, because the Content Library subscriber cannot synchronize with the publisher.
When you set up a Subscribed Content Library for Tanzu Kubernetes cluster OVAs, an SSL certificate is generated, and you are prompted to manually trust the certificate by confirming the certificate thumbprint. If the SSL certificate is changed after the initial library setup, the new certificate must be trusted again by updating the thumbprint.
Workaround: Edit the settings of the Subscribed Content Library. This will initiate a probe of the subscription URL even though no change is requested on the library. The probe will discover that the SSL certificate is not trusted and prompt you to trust it.
Tanzu Kubernetes Release version 1.16.8 is incompatible with vCenter Server 7.0 Update 1.
The Tanzu Kubernetes Release version 1.16.8 is incompatible with vCenter Server 7.0 Update 1. You must update Tanzu Kubernetes clusters to a later version before performing a vSphere Namespaces update to U1.
Workaround: Before performing a vSphere Namespaces update to the vCenter Server 7.0 Update 1 release, update each Tanzu Kubernetes cluster running version 1.16.8 to a later version. Refer to the List of Tanzu Kubernetes Releases in the documentation for more information.
After upgrading the Workload Control Plane to vCenter Server 7.0 Update 1, new VM Class sizes are not available.
Description: After upgrading to vSphere 7.0.1, and then performing a vSphere Namespaces update of the Supervisor Cluster, for Tanzu Kubernetes clusters, running the command "kubectl get virtualmachineclasses" does not list the new VM class sizes 2x-large, 4x-large, 8x-large.
Workaround: None. The new VM class sizes can only be used with a new installation of the Workload Control Plane.
The Tanzu Kubernetes Release version v1.17.11+vmware.1-tkg.1 times out connecting to the cluster DNS server when using the Calico CNI.
The Tanzu Kubernetes Release version v1.17.11+vmware.1-tkg.1 has a Photon OS kernel issue that prevents the image from working as expected with the Calico CNI.
Workaround: For Tanzu Kubernetes Release version 1.17.11, the image identified as "v1.17.11+vmware.1-tkg.2.ad3d374.516" fixes the issue with Calico. To run Kubernetes 1.17.11, use this version instead of "v1.17.11+vmware.1-tkg.1.15f1e18.489". Alternatively, use a different Tanzu Kubernetes Release, such as version 1.18.5 or 1.17.8 or 1.16.14.
When vSphere with Tanzu networking is configured with NSX-T Data Center, updating an "ExternalTrafficPolicy: Local" Service to "ExternalTrafficPolicy: Cluster" will render this Service's LB IP inaccessible on SV Masters
When a LoadBalancer type Kubernetes Service is initially created in workload clusters with ExternalTrafficPolicy: Local, and later updated to ExternalTrafficPolicy: Cluster, access to this Service's LoadBalancer IP on the Supervisor Cluster VMs will be dropped.
Workaround: Delete the Service and recreate it with ExternalTrafficPolicy: Cluster.
A known issue exists in the Kubernetes upstream project where occasionally kube-controller-manager goes into a loop resulting in high CPU usage, which might affect functionality of Tanzu Kubernetes clusters. You might notice that the process, kube-controller-manager, is consuming a larger than expected amount of CPU and is outputting repeated logs indicating failed for updating Node.Spec.PodCIDRs.
Workaround: Delete the kube-controller-manager pod that sits inside the control plane node with such an issue. The pod will be recreated and the issue should not reappear.
Kubelet's configuration file is generated at the time kubeadm init is run and then replicated during cluster upgrades. At the time of 1.16, kubeadm init generated a config file that set resolvConf to /etc/resolv.conf, which was then overridden by the command-line flag --resolv-conf pointing at /run/systemd/resolve/resolv.conf. During 1.17 and 1.18, kubeadm continued to configure Kubelet with the correct --resolv-conf. As of 1.19, kubeadm no longer configures the command-line flag and instead relies on the Kubelet configuration file. Due to the replication process during cluster upgrades, a 1.19 cluster upgraded from 1.16 will include a config file where resolvConf points at /etc/resolv.conf instead of /run/systemd/resolve/resolv.conf.
Workaround: Before upgrading a Tanzu Kubernetes cluster to 1.19, reconfigure the Kubelet configuration file to point to the correct resolv.conf. Manually duplicate the ConfigMap kubelet-config-1.18 to kubelet-config-1.19 in the kube-system namespace, then modify that new ConfigMap's data to point resolvConf at /run/systemd/resolve/resolv.conf.
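The edit to the duplicated ConfigMap data amounts to rewriting one key. The following Python sketch shows that rewrite on the kubelet configuration text. It assumes resolvConf appears as a top-level key on its own line. The function name and the sample configuration are hypothetical, and the sketch does not read from or write to a cluster.

```python
import re

def fix_resolv_conf(kubelet_config: str,
                    target: str = "/run/systemd/resolve/resolv.conf") -> str:
    """Point the top-level resolvConf key at `target`; other lines are unchanged."""
    return re.sub(
        r"(?m)^resolvConf:.*$",
        lambda _match: f"resolvConf: {target}",
        kubelet_config,
    )

# Hypothetical excerpt of the kubelet configuration stored in the ConfigMap.
sample_config = (
    "kind: KubeletConfiguration\n"
    "resolvConf: /etc/resolv.conf\n"
    "rotateCertificates: true\n"
)
print(fix_resolv_conf(sample_config))
```

Apply the resulting text to the kubelet-config-1.19 ConfigMap only after comparing it with the original data.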
When the Supervisor Cluster networking is configured with NSX-T, after updating a service from "ExternalTrafficPolicy: Local" to "ExternalTrafficPolicy: Cluster", requests made on the Supervisor Cluster control plane nodes to this service's load balancer IP fail
When you create a service on a Tanzu Kubernetes cluster with ExternalTrafficPolicy: Local and later update the service to ExternalTrafficPolicy: Cluster, kube-proxy incorrectly creates an iptables rule on the Supervisor Cluster control plane nodes that blocks traffic destined to the service's LoadBalancer IP. For example, if this service has LoadBalancer IP 192.182.40.4, the following iptables rule is created on any one of the control plane nodes:
-A KUBE-SERVICES -d 192.182.40.4/32 -p tcp -m comment --comment "antrea-17-171/antrea-17-171-c1-2bfcfe5d9a0cdea4de6eb has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
As a result, access to that IP is dropped.
Workaround: Delete the service and create it anew with ExternalTrafficPolicy: Cluster.
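To confirm that a control plane node carries such a rule, you can search a saved rule listing for the service's LoadBalancer IP. The following Python sketch assumes the listing is in the same -A rule format as the example above. The function name is hypothetical, and the second rule in the sample is an invented placeholder.

```python
def blocking_rules(iptables_dump: str, lb_ip: str) -> list[str]:
    """Return REJECT rules that target lb_ip with a 'has no endpoints' comment."""
    wanted = f"-d {lb_ip}/32"
    return [
        line for line in iptables_dump.splitlines()
        if wanted in line and "has no endpoints" in line and "-j REJECT" in line
    ]

# First rule is the example from this issue; the second is a placeholder.
sample_rules = '''-A KUBE-SERVICES -d 192.182.40.4/32 -p tcp -m comment --comment "antrea-17-171/antrea-17-171-c1-2bfcfe5d9a0cdea4de6eb has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
-A KUBE-SERVICES -d 10.0.0.10/32 -p tcp -m tcp --dport 443 -j KUBE-SVC-EXAMPLE
'''
for rule in blocking_rules(sample_rules, "192.182.40.4"):
    print(rule)
```

If the function returns a rule for your service's IP, the workaround above applies.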
After you enable HTTP Proxy and/or Trust settings in the TkgServiceConfiguration specification, all pre-existing clusters without Proxy/Trust settings will inherit the global Proxy/Trust settings when they are updated.
You can edit the TkgServiceConfiguration specification to configure the TKG Service, including specifying the default CNI, HTTP Proxy, and Trust certificates. Any configuration changes you make to the TkgServiceConfiguration specification apply globally to any Tanzu Kubernetes cluster provisioned or updated by that service. You cannot opt out of the global configuration using per-cluster settings.
For example, if you edit the TkgServiceConfiguration specification and enable an HTTP Proxy, all new clusters provisioned by that service inherit those proxy settings. In addition, all pre-existing clusters without a proxy server inherit the global proxy configuration when the cluster is modified or updated. In the case of HTTP/S proxy, which supports per-cluster configuration, you can update the cluster spec with a different proxy server, but you cannot remove the global proxy setting. If the HTTP Proxy is set globally, you must either use it or overwrite it with a different proxy server.
Workaround: Understand that the TkgServiceConfiguration specification applies globally. If you don't want all clusters to use an HTTP Proxy, don't enable it at the global level. Do so at the cluster level.
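The inheritance behavior described above can be summarized as a small model. The following Python sketch is illustrative only and does not call any TKG Service API. The function name and the proxy URLs are hypothetical. It encodes the rule that a cluster-level proxy overrides the global one, while a cluster without its own proxy inherits the global setting and cannot remove it.

```python
from typing import Optional

def effective_proxy(global_proxy: Optional[str],
                    cluster_proxy: Optional[str]) -> Optional[str]:
    """Model which HTTP proxy a cluster ends up using after it is updated."""
    # A per-cluster proxy replaces the global one; leaving it unset
    # means the cluster inherits the global proxy, if any.
    return cluster_proxy if cluster_proxy else global_proxy

# Hypothetical proxy URLs.
print(effective_proxy("http://global-proxy:3128", None))
print(effective_proxy("http://global-proxy:3128", "http://team-proxy:3128"))
print(effective_proxy(None, "http://team-proxy:3128"))
```

The last call corresponds to the recommended setup: no global proxy, with the proxy enabled only on the clusters that need it.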
In very large Supervisor Cluster deployments with many Tanzu Kubernetes Clusters and VMs, vmop-controller-manager pods might fail due to OutOfMemory resulting in the inability to lifecycle manage Tanzu Kubernetes Clusters
Within the Supervisor Cluster, the vmop-controller-manager pod is responsible for managing the lifecycle of the VMs that make up Tanzu Kubernetes Clusters. At very large numbers of such VMs (>850 VMs per Supervisor Cluster), the vmop-controller-manager pod can go into an OutOfMemory CrashLoopBackoff. When this occurs, lifecycle management of Tanzu Kubernetes Clusters is disrupted until the vmop-controller-manager pod resumes operations.
Workaround: Reduce the total number of Tanzu Kubernetes Cluster worker nodes managed in a Supervisor Cluster either by deleting clusters or scaling down clusters.
Upgrading from a pre-6.5 version of vSphere to vSphere 7 with Tanzu can throw an error indicating that the PhotonOS type is not supported.
After successfully upgrading a vSphere 6 environment to vSphere 7 with Tanzu, trying to deploy a Tanzu Kubernetes cluster results in an error message that PhotonOS is "not supported". Specifically:
failed to create or update VirtualMachine: admission webhook "default.validating.virtualmachine.vmoperator.vmware.com" denied the request: GuestOS not supported for osType vmwarePhoton64Guest on image v1.19.7---vmware.1-tkg.1.fc82c41'
If you have upgraded to vSphere 7 with Tanzu from a vSphere version that predates vSphere 6.5 (such as vSphere 6), you need to make sure that the default VM compatibility level is set to at least "ESXi 6.5 and later" for PhotonOS to show up as a supported guest operating system. To do this, select the vSphere cluster where Workload Management is enabled, right-click and choose Edit Default VM Compatibility. Select "ESXi 6.5 and later."
After provisioning a Tanzu Kubernetes cluster using a small size VM, and then deploying a workload to that cluster, the worker node enters a NotReady state and a continuous loop of attempting to respawn the node.
Description: On a small or extra small worker node for a cluster, the /bin/opm process may consume an inordinate portion of the VM memory, leading to an out-of-memory error for the worker node.
Workaround: Avoid using the small or extra small VM class for worker nodes, even for ephemeral development or test clusters. As a best practice, the minimum VM class for a worker node in any environment is medium. For more information, see Default Virtual Machine Classes in the documentation.
Synchronizing Tanzu Kubernetes releases from a Subscribed Content Library fails with the following HTTP request error message: "cannot authenticate SSL certificate for host wp-content.vmware.com."
Description: When you configure a Subscribed Content Library for Tanzu Kubernetes cluster OVAs, an SSL certificate is generated for wp-content.vmware.com. You are prompted to manually trust the certificate by confirming the certificate thumbprint. If the SSL certificate is changed after the initial library setup, the new certificate must be trusted again by updating the thumbprint. The current SSL certificate for the Content Library expires month-end June 2021. Customers who subscribed to wp-content.vmware.com will see their Content Library synchronization failing, and need to update the thumbprint by performing the steps in the Workaround. For additional guidance, see the associated VMware Knowledge Base article at https://kb.vmware.com/s/article/85268.
Workaround: Log in to the vCenter Server using the vSphere Client. Select the Subscribed Content Library and click Edit Settings. This action will initiate a probe of the subscription URL even though no change is requested on the library. The probe will discover that the SSL certificate is not trusted and prompt you to trust it. In the Actions drop-down menu that appears, select Continue and the fingerprint is updated. Now you can proceed with synchronizing the contents of the library.
Customers may see failures after applying an update to the cluster involving both a VMClass modification and a node scale up.
Workaround: Modify the TKC configuration to change only one of the two fields, then reapply the change.
Symptom: An unexpected rolling update of the cluster occurred after updating a label or a taint in the cluster spec.
Description: When using the TKGS v1alpha2 API, modifying spec.topology.nodePools[*].labels or spec.topology.nodePools[*].taints of one node pool will trigger a rolling update of that node pool.
Workaround: None
Symptom: After a cluster update, the taints and labels manually added to a node pool are no longer present.
Description: Using the TKGS v1alpha2 API, during a cluster update the system does not retain spec.topology.nodePools[*].taints and spec.topology.nodePools[*].labels that were manually added and present before the update.
Workaround: After the update completes, manually add back the labels and taints fields.
Symptom: Tanzu Kubernetes cluster is stuck in a creating phase as one of the control plane VMs did not get an IP address.
Description: Tanzu Kubernetes cluster is stuck in a creating phase as one of the control plane VMs did not get an IP address.
Workaround: Use Tanzu Kubernetes release 1.20.7 or later to avoid this issue.
Symptom: During the creation of a Tanzu Kubernetes cluster, the process is stuck in an updating state with the reason "WaitingForNetworkAddress."
Description: The control plane VMs and worker nodes are powered on but do not have an IP address assigned to them. This may be caused by vmware-tools running out of memory and not getting restarted.
Workaround: The specific issue is fixed in Tanzu Kubernetes releases v1.20.7 and later. In addition, you can fix the issue by increasing the VM memory. The best-effort-xsmall VM class provides only 2G of memory. Use a VM class with more memory to deploy Tanzu Kubernetes clusters.
Description: After deleting a vSphere namespace, the process is stuck in a "Removing" state for several hours without progressing.
Workaround: Use kubectl to check for remaining Kubernetes objects that haven't been deleted from the namespace yet. If there are remaining objects related to kubeadm-control-plane, restart the capi-kubeadm-control-plane-controller-manager pods. This action should requeue the objects to be deleted.
NOTE: This is an advanced operation. Contact VMware Support before performing this workaround.
Starting with the 7.0.3 MP2 release, TKr versions earlier than 1.18.x are incompatible. However, the TKC API still allows creation of TKCs with a version earlier than v1.18.x and does not block the creation of a TKC object that uses an incompatible TKr. These TKCs will never get created.
Workaround: Delete TKCs that are not in a running state and were created with an incompatible TKr, then recreate them with a compatible version.
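Because the TKC API does not reject incompatible versions, you can check a TKr version string yourself before creating a cluster. The following Python sketch compares the major and minor version against 1.18, the earliest compatible version according to the description above. The function name is hypothetical, and the sketch ignores patch and build suffixes.

```python
import re

MIN_SUPPORTED = (1, 18)  # TKr versions earlier than 1.18.x are incompatible.

def is_compatible_tkr(version: str) -> bool:
    """Return True if the TKr major.minor version is 1.18 or later."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized TKr version: {version}")
    return (int(match.group(1)), int(match.group(2))) >= MIN_SUPPORTED

for tkr in ("v1.17.11+vmware.1-tkg.1", "v1.19.7---vmware.1-tkg.1.fc82c41"):
    print(tkr, is_compatible_tkr(tkr))
```

Tuple comparison is used so that a version such as 1.9 is not treated as later than 1.18, which a plain string comparison would do.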
A vDPP UI plugin does not get deployed in the vSphere Client and an error message indicates that the plugin download was attempted from a Supervisor cluster that no longer exists
The problem might occur when you deploy a vDPP UI plugin on a cluster, and then remove this cluster. However, the stale entries related to this vDPP UI plugin remain in the vCenter Extension Manager. If you later create a new cluster, your attempts to install the same version of the vDPP UI plugin on this cluster fail because the vSphere client uses the stale entries to download the plugin from the old non-existing cluster.
Workaround:
Navigate to the vCenter Extension Manager using https://vc-ip/mob and locate the plugin extension.
Remove all ClientInfo and ServerInfo entries that contain the name of the old cluster.
From the ClientInfo array, select the ClientInfo with the greatest version number and change its type from vsphere-client-remote-backup to vsphere-client-remote.
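The cleanup logic in this workaround can be sketched on plain data. This is only an illustration: it assumes each ClientInfo entry is represented as a dictionary with version, type, and url keys, and that version numbers are dot-separated integers. The helper names and the sample layout are hypothetical; the actual change is made by hand in the Extension Manager.

```python
def version_key(version):
    """Turn '1.0.10' into (1, 0, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def clean_client_info(entries, old_cluster):
    """Drop entries that mention the old cluster, then promote the newest one.

    Returns new dictionaries; the input list is left untouched.
    """
    kept = [
        dict(entry) for entry in entries
        if not any(old_cluster in str(value) for value in entry.values())
    ]
    if kept:
        newest = max(kept, key=lambda entry: version_key(entry["version"]))
        if newest["type"] == "vsphere-client-remote-backup":
            newest["type"] = "vsphere-client-remote"
    return kept
```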
A vSphere pod in a Supervisor Cluster remains in pending state without the vm-uuid annotation
Occasionally, a pod might remain in pending state for a long time. This typically occurs when you use a vSAN Data Persistence platform (vDPP), and can be caused by an internal error or user actions.
When you use the kubectl command to query the pod instance, the vmware-system-vm-uuid/vmware-system-vm-moid annotation is missing in the metadata annotations.
Workaround: Use the kubectl delete pod pending_pod_name command to delete the pod. If the pod is a part of a stateful set, the pod is recreated automatically.
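A pod in this condition can be recognized from its JSON description, for example the output of kubectl get pod pending_pod_name -o json. The sketch below rests on an assumption: it reads the slash in the annotation text above as separating two annotation names, and it flags a pod only when neither is present. The function name is hypothetical.

```python
# Assumption: the slash in the note separates two annotation names.
VM_ANNOTATIONS = ("vmware-system-vm-uuid", "vmware-system-vm-moid")


def is_stuck_without_vm_annotation(pod):
    """Return True for a Pending pod that carries neither VM annotation.

    `pod` is the parsed JSON object of a single pod.
    """
    phase = pod.get("status", {}).get("phase")
    annotations = pod.get("metadata", {}).get("annotations") or {}
    return phase == "Pending" and not any(
        name in annotations for name in VM_ANNOTATIONS
    )
```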
An instance of a vDPP service fails to deploy when host local PVCs of its two replica pods are bound to the same cluster node
Occasionally, an instance of a vSAN Data Persistence Platform service, such as MinIO or Cloudian, might get into a state where its two replica pods have their host local PVCs allocated on the same cluster node. Normally, no two replicas of the same instance can have host local storage on the same cluster node. If this happens, one of the replica pods might remain in a pending state indefinitely, causing the instance deployment to fail.
The symptoms you observe during the failures include all of the following:
Instance health is yellow.
At least one pod of the instance remains in a pending state for over 10 minutes.
The pending pod and one of running replica pods of the same instance have their host local PVCs allocated on the same node of the cluster.
The failure scenarios that can lead to this problem are the following:
Some nodes in the cluster do not have sufficient storage resources for the instance.
The number of replicas is greater than the number of nodes in the cluster. This could be because one or more nodes are temporarily unavailable.
Workaround:
Make sure your cluster has enough storage resources, and the number of cluster nodes is greater than the instance replica count.
Delete the pending pod and all its host local PVCs.
The service operator should rebuild the volume data on the new node where the pod gets scheduled. This can take some time to complete, depending on the size of the volume and amount of valid data on it.
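The third symptom, a pending pod sharing a node with a running replica, can be detected mechanically once the node of each host local PVC is known. This is a minimal sketch, assuming the replicas have already been collected as (pod name, phase, PVC node) tuples; that layout, the function names, and the sample names in the usage note are hypothetical.

```python
from collections import defaultdict


def find_colocated_replicas(replicas):
    """Return {node: [pod names]} for nodes holding PVCs of two or more replicas.

    `replicas` is a list of (pod_name, phase, pvc_node) tuples.
    """
    by_node = defaultdict(list)
    for pod_name, _phase, pvc_node in replicas:
        by_node[pvc_node].append(pod_name)
    return {node: pods for node, pods in by_node.items() if len(pods) > 1}


def pods_to_delete(replicas):
    """Pending pods whose host local PVC shares a node with another replica."""
    crowded = find_colocated_replicas(replicas)
    return [
        pod_name for pod_name, phase, pvc_node in replicas
        if phase == "Pending" and pvc_node in crowded
    ]
```

For instance, with replicas minio-0 (Running) and minio-1 (Pending) both on node esx-1, only minio-1 is reported as a deletion candidate.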
After an ESXi node exits maintenance in Ensure accessibility mode, the taint applied to the node during the maintenance might still remain on the node
This issue might occur when you use vSAN Data Persistence Platform (vDPP) to create instances of partner services. After the ESXi node exits maintenance in Ensure accessibility mode, you can still find the remaining taint node.vmware.com/drain=planned-downtime:NoSchedule on the node.
Typically, the issue occurs when these actions take place:
1. A partner service is enabled on the Supervisor Cluster and instances of the service are created.
2. An ESXi node is put into maintenance in Ensure accessibility mode.
3. The node successfully enters maintenance mode in Ensure accessibility.
4. The taint node.vmware.com/drain=planned-downtime:NoSchedule is applied on the node.
5. The node exits maintenance mode.
After the node exits maintenance mode, use the following command to check whether the taint remains on the node:
kubectl describe node | grep Taints
Workaround:
If the taint node.vmware.com/drain=planned-downtime:NoSchedule is present on the host, manually remove the taint:
kubectl taint nodes nodeName key=value:Effect-
Note: Make sure to use a hyphen at the end of the command.
Follow this example:
kubectl taint nodes wdc-10-123-123-1.eng.vmware.com node.vmware.com/drain=planned-downtime:NoSchedule-
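The check and the removal command can be combined in a small helper. This is a sketch, assuming the node object has been fetched as JSON (for example with kubectl get node nodeName -o json), where taints appear under spec.taints with key, value, and effect fields. The function names are hypothetical, and the code only builds the command without running it.

```python
# The taint described in this issue, split into its Kubernetes fields.
DRAIN_TAINT = {
    "key": "node.vmware.com/drain",
    "value": "planned-downtime",
    "effect": "NoSchedule",
}


def has_drain_taint(node):
    """Return True if spec.taints of the node object contains the drain taint."""
    taints = node.get("spec", {}).get("taints") or []
    return any(
        all(taint.get(field) == expected for field, expected in DRAIN_TAINT.items())
        for taint in taints
    )


def untaint_command(node_name):
    """Build the removal command; the trailing hyphen tells kubectl to remove."""
    taint = "{key}={value}:{effect}".format(**DRAIN_TAINT)
    return ["kubectl", "taint", "nodes", node_name, taint + "-"]
```

Returning the command as an argument list keeps the trailing hyphen attached to the taint, which is the detail the note above warns about.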
After an APD recovery, persistent service pods that run on the vSAN Data Persistence platform might remain in pending state with AttachVolume errors
After an APD and recovery of ESXi hosts, a vDPP service instance pod might remain in pending state. If you run the kubectl -n instance_namespace describe pod name_of_pod_in_pending command, you can see the AttachVolume errors.
If the pod remains in pending state for more than 15 minutes, it is unlikely to come out of this state.
Workaround: Delete the pending pod using the following command:
kubectl -n instance_namespace delete pod name_of_pod_in_pending
The pod is recreated and moves to running state.
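The 15-minute rule of thumb can be expressed as a simple predicate. This is a minimal sketch, assuming the pod phase, the time at which it entered pending state, and the event messages from the describe output have already been collected; the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the note above: after 15 minutes the pod is unlikely to recover.
PENDING_LIMIT = timedelta(minutes=15)


def should_delete(phase, pending_since, event_messages, now=None):
    """Return True for a pod pending longer than the limit with AttachVolume errors.

    `pending_since` and `now` are timezone-aware datetimes.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    has_attach_error = any("AttachVolume" in message for message in event_messages)
    return (
        phase == "Pending"
        and has_attach_error
        and now - pending_since > PENDING_LIMIT
    )
```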
NEW: Attempts to deploy a vSphere Pod and a VM Service VM that have the same name cause errors and unpredictable behavior
You might observe failures and errors or other problematic behavior. You can experience these problems when you have a vSphere Pod running in a namespace and then try to deploy a VM Service virtual machine with the same name in the same namespace, or vice versa.
Workaround: Do not use the same names for the vSphere Pods and VMs in the same namespace.
When attempting to use a VM image that is not compatible with VM Service, the following message is encountered at VM creation
Error from server (GuestOS not supported for osType on image or VMImage is not compatible with v1alpha1 or is not a TKG Image): error when creating : admission webhook "default.validating.virtualmachine.vmoperator.vmware.com" denied the request: GuestOS not supported for osType on image or VMImage is not compatible with v1alpha1 or is not a TKG Image
Workaround: Only use VM images from the VMware Marketplace that have been validated to work with VM Service.
If the name of an OVF image in a Content Library is not DNS-subdomain name compliant, VM Service will not create a VirtualMachineImage from the OVF image. As a result, kubectl get vmimage will not list the images in a namespace and developers will not be able to consume this image.
Workaround: Use DNS subdomain name compliant names for OVF images.
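Image names can be checked for compliance before they are uploaded. The sketch below applies the DNS subdomain rules as Kubernetes defines them for object names: at most 253 characters, only lowercase alphanumerics, hyphens, and dots, and each dot-separated label starting and ending with an alphanumeric character. The function name is hypothetical, and whether VM Service applies exactly this check is an assumption.

```python
import re

# One DNS label: lowercase alphanumerics and hyphens, alphanumeric at both ends.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
_SUBDOMAIN = re.compile(rf"{_LABEL}(\.{_LABEL})*")
MAX_LENGTH = 253


def is_dns_subdomain(name):
    """Return True if `name` is a valid DNS subdomain name."""
    return len(name) <= MAX_LENGTH and _SUBDOMAIN.fullmatch(name) is not None
```

Under these rules a name such as centos-stream-8 passes, while names containing uppercase letters or underscores do not.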
OVF images with the same name in different Content Libraries do not show up as different VirtualMachineImages
Workaround: Use unique names for OVF images across Content Libraries that are configured with VM Service.
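Name collisions across libraries can be found with a short script before the libraries are associated with VM Service. This is a sketch, assuming the image names of each Content Library have already been listed into a dictionary keyed by library name; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def find_name_collisions(libraries):
    """Return {image name: [libraries]} for names present in more than one library.

    `libraries` maps a library name to an iterable of OVF image names.
    """
    owners = defaultdict(set)
    for library, image_names in libraries.items():
        for name in image_names:
            owners[name].add(library)
    return {
        name: sorted(libs) for name, libs in owners.items() if len(libs) > 1
    }
```

For example, if libraries lib-a and lib-b both contain an image named ubuntu-20, the result maps ubuntu-20 to both library names, and one of the images should be renamed.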
Virtual Machines created by VM Service do not allow access to the remote console. As a result, administrators cannot log into the VM using the Launch Web Console and Launch Remote Console options in the vSphere web client.
Workaround: Administrators can access the VM console by logging into the ESXi host on which the VM is located.
VMs with vGPU and pass-through devices that are managed by VM Service are powered off when the ESXi host enters maintenance mode
VM Service allows DevOps engineers to create VMs that are attached to vGPU or pass-through devices. Normally, when an ESXi host enters maintenance mode, DRS generates a recommendation for such VMs that are running on the host. A vSphere administrator can then accept the recommendation to trigger the vMotion.
However, VMs with vGPU and pass-through devices that are managed by VM Service are automatically powered off. This may temporarily affect workloads running in these VMs. The VMs are automatically powered on after the host exits maintenance mode.
Workaround: None.