Troubleshooting
Installation Permissions
No matter which deployment method you use, permission to create the following resources is required to install Speedscale:
- Cluster-wide resources:
  - CustomResourceDefinitions
  - ClusterRole
  - ClusterRoleBinding
  - MutatingWebhookConfiguration
  - ValidatingWebhookConfiguration
- Namespaced resources:
  - ConfigMap
  - Deployment
  - Job
  - Role
  - Service
  - Secret
You can verify you have permissions and other prerequisites by running speedctl check operator --pre. This will not modify anything within your cluster.
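If you prefer to confirm individual permissions yourself, kubectl auth can-i can check each create permission before installing. A minimal sketch, assuming you will install into the speedscale namespace:
kubectl auth can-i create customresourcedefinitions
kubectl auth can-i create clusterroles
kubectl auth can-i create mutatingwebhookconfigurations
# namespaced resources; the speedscale namespace is an assumption
kubectl auth can-i create deployments -n speedscale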
Operation Permissions
Once Speedscale is installed in your cluster, the speedscale-operator cluster role will have permissions to create, list, watch, and modify workload manifests. Speedscale will use these permissions to add a sidecar container to the workload. Workloads include DaemonSet, Deployment, Job, ReplicaSet, and StatefulSet.
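To confirm that the sidecar was added to a workload, you can list the container names in its pod template. This is a sketch; the deployment and namespace names are placeholders:
# my-app and my-namespace are placeholders for your workload
kubectl get deployment my-app -n my-namespace -o jsonpath='{.spec.template.spec.containers[*].name}'
You should see a Speedscale sidecar container listed alongside your application container.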
Additionally, the speedscale-operator role can create, modify, and watch configuration such as Istio's Sidecar resource.
For a full list of permissions that Speedscale uses, you may use one of the following methods:
- Review the latest version of the Helm chart
- Run kubectl get -n speedscale clusterrole/speedscale-operator -o yaml to see the installed manifest.
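You can also ask the API server what the operator is allowed to do by impersonating its service account. This assumes a default installation where the service account is named speedscale-operator in the speedscale namespace:
# service account name assumes a default installation
kubectl auth can-i --list --as=system:serviceaccount:speedscale:speedscale-operator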
Using Minikube
If you get webhook errors when running in minikube, it could be related to the network configuration. You need to add these 2 flags to your start command to ensure the network is properly configured:
minikube start \
--cni=true --container-runtime=containerd \
--ALL_YOUR_OTHER_FLAGS_HERE
Signature expired
If you see errors relating to an invalid signature involving timestamps (example below), this is because the VM your minikube instance is running on has fallen out of sync with the actual time. This is a known problem with Hyperkit.
SignatureDoesNotMatch: Signature expired: 20220727T233601Z is now earlier than 20220727T234712Z (20220728T000212Z - 15 min.)
The time needs to be resynced on the VM, which can be done via:
ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) "docker run --rm --privileged --pid=host alpine nsenter -t 1 -m -u -n -i date -u $(date -u +%m%d%H%M%Y)"
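To confirm the drift, you can compare the VM's clock with your host's clock. This sketch uses the same SSH key path and user as the resync command above:
# host time
date -u
# VM time, using minikube's default SSH key
ssh -i ~/.minikube/machines/minikube/id_rsa docker@$(minikube ip) date -u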
Using microk8s
If you get webhook errors when running in microk8s, it could be related to the network configuration. You need to enable the dns add-on to ensure the network is properly configured:
microk8s enable dns
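After enabling the add-on, you can check that the DNS pods are running before retrying. The label below assumes the standard CoreDNS deployment:
# k8s-app=kube-dns is the usual CoreDNS selector
microk8s kubectl get pods -n kube-system -l k8s-app=kube-dns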
Seeing Webhook Errors?
Manually deleting the speedscale namespace will cause your cluster to stop accepting deployments due to a dangling mutating webhook. The error may look something like this:
Internal error occurred: failed calling webhook "operator.speedscale.com":
Post "https://speedscale-operator.speedscale.svc:443/mutate?timeout=30s":dial tcp xx.xx.xx.xx:443: connect: connection refused
If you experience this problem, you can fix your cluster by deleting the webhooks manually:
kubectl delete mutatingwebhookconfigurations.admissionregistration.k8s.io speedscale-operator speedscale-operator-replay
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io speedscale-operator-replay
After the webhook has been deleted, re-run the full operator delete command to make sure that service roles and other items are properly cleaned up.
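You can confirm whether any Speedscale webhooks remain registered before and after the cleanup:
kubectl get mutatingwebhookconfigurations | grep speedscale
kubectl get validatingwebhookconfigurations | grep speedscale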
Unable to edit TrafficReplays?
Manually deleting the speedscale-operator deployment will cause the validating webhook for TrafficReplays to fail. This will prevent modifications such as manually removing any finalizers on the TrafficReplay.
If you experience this problem, you can fix your cluster by deleting the webhook manually:
kubectl delete validatingwebhookconfigurations.admissionregistration.k8s.io speedscale-operator
After the webhook has been deleted, re-run the full operator delete command to make sure that service roles and other items are properly cleaned up.
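Once the webhook has been removed, a stuck TrafficReplay can be edited again. As a sketch, a finalizer can be cleared with a merge patch like the following; the replay name and namespace are placeholders, and you may need the fully qualified CRD name if the short resource name is not recognized:
# my-replay and my-namespace are placeholders
kubectl patch trafficreplay my-replay -n my-namespace --type=merge -p '{"metadata":{"finalizers":null}}'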
Istio Errors
In Istio with dual proxy capture mode, the Operator creates a Sidecar resource that routes traffic through the Istio mesh into the Speedscale sidecar. In the case of invalid settings, you may see an error along the lines of
{"L":"ERROR","T":"2022-08-05T15:36:45.149Z","M":"failed to provision envoy sidecar config, provisioning failed","reqId":"2f061731-46f0-4d33-9f37-b99c16a0dec3","op":"UPDATE","kind":"Deployment","apiVersion":"apps/v1","name":"inventory-availability","namespace":"perf1-inventory-availability","error":"resource already exists"}
in the Operator logs. This usually happens if the port specified is invalid (not an integer). It can also happen if the Operator is not given permission to create Istio Sidecar resources. This should be handled during installation, but if you encounter it, try upgrading the operator.
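To check whether the operator can create Istio Sidecar resources, you can impersonate its service account. The service account name below assumes a default installation:
# sidecars.networking.istio.io is the Istio Sidecar resource
kubectl auth can-i create sidecars.networking.istio.io --as=system:serviceaccount:speedscale:speedscale-operator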
Leftover Certificates
"error":"could not verify cert: crypto/rsa: verification error"
This error usually happens when there's a fresh install after an incomplete uninstall: certs are left in the cluster that should have been removed. Search for Speedscale-related certs and delete them. The following commands may be helpful:
kubectl delete mutatingwebhookconfigurations speedscale-operator
kubectl delete mutatingwebhookconfigurations speedscale-operator-replay
kubectl delete validatingwebhookconfiguration speedscale-operator-replay
kubectl delete ns speedscale
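A quick way to find any remaining Speedscale-related secrets (including certificates) across the cluster:
kubectl get secrets -A | grep -i speedscale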
Self Check Errors
The Speedscale operator performs a set of self-checks during initialization. One of the common self-check errors is related to certificate verification.
"M":"self-check failed, exiting"
Self-check errors mostly occur due to corrupted configurations, version mismatches, permission issues, or unsuccessful previous deletions/upgrades. These issues can typically be resolved by performing a complete reinstallation.
Cert Verification Error:
"M":"self-check failed, exiting","error":"could not verify cert: crypto/rsa: verification error"
To fix this error, check the following:
- Certificate Assignment: After rotating certificates, verify that the new certificates are correctly assigned to the Speedscale operator.
- Helm Chart Update: Make sure the Helm chart is updated correctly to reflect the new certificate information.
- Reinstallation: If issues persist, perform a complete reinstallation of the Speedscale operator to ensure all configurations are correctly applied.
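The full self-check output is available in the operator logs and can help pinpoint which check failed:
kubectl logs -n speedscale deployment/speedscale-operator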