For the past couple of weeks our issue has gotten worse: we see massive CPU spikes every 20-30 minutes. The spikes do not correlate with any usage metrics or outside bandwidth. They also seem to cause our Apps from the Marketplace to become constantly disabled.
Server Setup Information
- Version of Rocket.Chat Server: 3.0.9
- Operating System: Container-Optimized OS
- Kubernetes: GKE master version 1.15.9-gke.24
- Deployment Method: Official Helm chart, Official Docker image, Gitlab CI/CD deployment
- Number of Running Instances: 4 rocketchat instances, 1 large one with ~1000 concurrent users each day
- DB Replicaset Oplog: Enabled. We use MongoDB Atlas.
- NodeJS Version: v12.14.0
- MongoDB Version: 4.2.6
- Proxy: Cloudflare, nginx ingress
- Firewalls involved: Cloudflare firewall
Any additional Information
For our Kubernetes helm chart we have:
requests_memory = "512Mi"
requests_cpu = "300m"
limits_memory = "2048Mi"
limits_cpu = "1500m"
We also have horizontal pod autoscaling enabled, which scales out to the maximum of 15 pods every time there is a spike. I believe the cause of the spikes is the database migration that is run each time a new pod is created. It eats up much of the CPU and creates a feedback loop in which the over-usage makes the spikes even worse.
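To put rough numbers on the suspected feedback loop, here is a minimal sketch of the replica calculation documented for the Kubernetes HPA, which measures CPU utilization as a percentage of the pod's CPU request. The 80% target is an assumption for illustration only (our actual target is not listed above), and the function name is hypothetical. The sketch ignores the minimum replica count, stabilization windows, and the special handling of pods that are not yet ready.

```python
import math

REQUEST_MILLICORES = 300   # requests_cpu from our chart values
LIMIT_MILLICORES = 1500    # limits_cpu from our chart values
MAX_REPLICAS = 15          # HPA maximum we keep hitting
ASSUMED_TARGET_PCT = 80    # ASSUMPTION: illustrative HPA target, not our real setting


def desired_replicas(current_replicas, current_pct, target_pct,
                     max_replicas, tolerance=0.1):
    """Simplified HPA rule: ceil(current * current_metric / target_metric)."""
    ratio = current_pct / target_pct
    # The HPA skips scaling when the ratio is within the tolerance of 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return min(max_replicas, math.ceil(current_replicas * ratio))


# A pod that bursts to its CPU limit reports utilization relative to its request.
peak_pct = 100 * LIMIT_MILLICORES / REQUEST_MILLICORES

if __name__ == "__main__":
    print(f"peak utilization: {peak_pct:.0f}% of request")
    print("replicas requested from 4 pods:",
          desired_replicas(4, peak_pct, ASSUMED_TARGET_PCT, MAX_REPLICAS))
```

Under these assumptions, any pod that bursts to its limit (for example during startup work) reports five times its request, so even a short burst is enough to ask for the maximum replica count, and each new pod then repeats the same startup work.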
Another possible cause is the Apps, which take up much of the CPU when they are uninstalled or reinstalled, causing the system to crash. It may be that when new pods are created, the Apps somehow eat up CPU and cause a spike.
Is there a way to turn off this migration? Or is there a Helm configuration that would give us the right combination of CPU requests and HPA percentage threshold? Any help figuring out the cause would be very much appreciated. Let me know if more information is needed.
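For anyone weighing the request/threshold trade-off with us, this is a small sketch of how the CPU request changes the utilization the HPA would see when a pod bursts to the current 1500m limit. The candidate request values are hypothetical and are not settings we have tried.

```python
LIMIT_MILLICORES = 1500  # limits_cpu from our chart values


def peak_utilization_pct(request_millicores, limit_millicores=LIMIT_MILLICORES):
    """Utilization (as % of request) reported when a pod runs at its CPU limit."""
    return 100 * limit_millicores / request_millicores


if __name__ == "__main__":
    # 300 is our current request; the other values are hypothetical candidates.
    for request in (300, 750, 1000, 1500):
        print(f"request {request}m -> peak {peak_utilization_pct(request):.0f}% of request")
```

The closer the request is to the limit, the less a burst to the limit can exceed any given HPA percentage, at the cost of reserving more CPU per pod.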