Hi, we’re trying to support 4k active users with RocketChat, but we are unable to go above 1k for now.
We are using RocketChat v3.6.3 with:
- 10 instances (2 CPU and 2 GB RAM each) on AWS Fargate
- a 3-node MongoDB v4.2 cluster (8 vCPU, 32 GB RAM, 16000 max connections) on Atlas, with retryWrites=true&w=majority&poolSize=75 in the connection string
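For reference, here is a rough sanity check of the connection budget those numbers imply. It is only a sketch: it assumes each instance opens a single client pool capped at poolSize, which may not hold if the app opens extra clients (for oplog tailing, for example). The function name connectionBudget is ours, for illustration only.

```javascript
// Rough upper bound on MongoDB connections opened by the app tier.
// Assumption: one client pool per instance, capped at poolSize.
function connectionBudget(instances, poolSize, clusterMaxConnections) {
  const maxAppConnections = instances * poolSize;
  return {
    maxAppConnections,
    // Fraction of the cluster-wide connection limit the app could use.
    shareOfClusterLimit: maxAppConnections / clusterMaxConnections,
  };
}

const budget = connectionBudget(10, 75, 16000);
console.log(budget.maxAppConnections); // 750
console.log((budget.shareOfClusterLimit * 100).toFixed(1) + '%'); // 4.7%
```

Under that assumption the cluster connection limit is nowhere near reached, so if connections are a bottleneck it would more likely be the per-instance pool of 75 than the Atlas limit.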
We are using Selenium with headless Chrome in the cloud to perform the load test.
All users are connected to the same public channel, and wait a random amount of time before sending a text message.
We tried two delay windows:
up to 10 min: const time = Math.floor(Math.random() * 10 * 60) * 1000;
up to 30 min: const time = Math.floor(Math.random() * 30 * 60) * 1000;
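To put those delays in perspective, here is a back-of-the-envelope estimate of the load they produce. It is a sketch under two assumptions that are not measured: each user sends exactly one message at a uniformly random time within the window, and every message is pushed to every connected user in the channel. The name estimateLoad is illustrative.

```javascript
// Estimate message and delivery rates for a single shared channel.
// Assumptions: one message per user per window, uniformly spread,
// and each message is delivered to every connected user.
function estimateLoad(users, windowMinutes) {
  const windowSeconds = windowMinutes * 60;
  const messagesPerSecond = users / windowSeconds;
  const deliveriesPerSecond = messagesPerSecond * users;
  return { messagesPerSecond, deliveriesPerSecond };
}

for (const minutes of [10, 30]) {
  const { messagesPerSecond, deliveriesPerSecond } = estimateLoad(2370, minutes);
  console.log(
    `${minutes} min window: ${messagesPerSecond.toFixed(2)} msg/s, ` +
      `${Math.round(deliveriesPerSecond)} deliveries/s`
  );
}
```

The inbound message rate stays in the low single digits per second, but the fan-out grows with the square of the user count, which is in the thousands of deliveries per second at 2370 users.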
In our last test we tried with 2370 users, and the chat was unusable: I could not send messages (they stayed grey and no sendMessage REST request was sent). If I reload the page I can access the channel, but the message loader spins forever.
The problem is that our monitoring does not show any heavy CPU load: the app instances are at ~50% CPU max and the DB is at ~40% CPU, so we're at a loss here.
We first discovered that having the setting Unread_Count set to all_messages is a big no for large channels: it was generating a lot of oplog updates on the subscription collection and slowing the app down. Changing it helped a little.
We also see a lot of this in our instance logs:
Mongodb Exception in setInterval callback: SwitchedToQuery TIMEOUT QUERY OPERATION
We would appreciate any additional hints from the experts in this forum.
Server Setup Information
- Version of Rocket.Chat Server: v3.6.3
- Operating System: Amazon Linux
- Deployment Method: Containers on AWS Fargate
- Number of Running Instances: 10
- DB Replicaset Oplog: enabled
- NodeJS Version: 12.16.1
- MongoDB Version: 4.2
- Proxy: AWS ALB
- Firewalls involved: none