Support 4k active users with Rocket.Chat

Description

Hi, we’re trying to support 4k active users with Rocket.Chat, but we can’t get above 1k for now.

Our setup:

  • Rocket.Chat v3.6.3
  • 10 app instances (2 vCPU & 2 GB RAM each) on AWS Fargate
  • 3-node MongoDB v4.2 cluster (8 vCPU, 32 GB RAM, 16,000 max connections) on Atlas; the connection string uses retryWrites=true&w=majority&poolSize=75 (sketched below)
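For reference, the app instances get those options through the MongoDB connection string in the environment; it looks roughly like this (hostnames and credentials here are placeholders, not our real values):

```
# Placeholder hosts/credentials; MONGO_OPLOG_URL points at the local database for oplog tailing.
MONGO_URL=mongodb+srv://rocketchat:<password>@cluster0.example.mongodb.net/rocketchat?retryWrites=true&w=majority&poolSize=75
MONGO_OPLOG_URL=mongodb+srv://oplog-user:<password>@cluster0.example.mongodb.net/local
```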

We are using Selenium with headless Chrome in the cloud to perform the load test.
All users are connected to the same public channel and wait a random amount of time before sending a text message.
We tried two delay windows (a simplified sketch of one simulated user follows below):
up to 10 min: const time = Math.floor(Math.random() * 10 * 60) * 1000;
and up to 30 min: const time = Math.floor(Math.random() * 30 * 60) * 1000;
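Each simulated user then does roughly the following (simplified sketch, not our exact script; the CSS selector for the message box and the channel URL are placeholders):

```js
// Simplified sketch of one simulated load-test user.
// The selector and URL handling are placeholders, not our exact script.
const { Builder, By, Key } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function simulateUser(channelUrl) {
  const options = new chrome.Options().addArguments('--headless');
  const driver = await new Builder().forBrowser('chrome').setChromeOptions(options).build();
  try {
    // Open the public channel (the test user is already logged in).
    await driver.get(channelUrl);

    // Wait a random amount of time within the 10-minute window (same formula as above).
    const time = Math.floor(Math.random() * 10 * 60) * 1000;
    await driver.sleep(time);

    // Type a short text message into the message box and send it.
    const input = await driver.findElement(By.css('.js-input-message'));
    await input.sendKeys('hello from the load test', Key.ENTER);
  } finally {
    await driver.quit();
  }
}
```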

In our last test we ran 2,370 users and the chat was unusable: I could not send messages (they stayed grey and no sendMessage REST request was sent), and if I reloaded the page I could still access the channel, but the message loader spun forever.

The problem is that our monitoring does not show any big CPU load: the app instances peak at ~50% CPU and the DB sits at ~40% CPU, so we’re at a loss here.

We first discovered that setting Unread_Count to all_messages is a big no for large channels: it generated a lot of oplog updates on the subscription collection and slowed the app down. Changing it helped a little.
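In case it’s useful to others, that setting can also be changed programmatically through the admin REST API; here is a rough, untested sketch (the server URL, token, user id, and option value are placeholders to double-check against your own admin panel):

```js
// Untested sketch: change the Unread_Count setting via the admin REST API.
// SERVER_URL, ADMIN_TOKEN, ADMIN_USER_ID and the option value are placeholders;
// check the exact option id against the choices listed in the admin panel.
// Assumes a runtime with fetch available (Node 18+ or a browser).
const SERVER_URL = 'https://chat.example.com';
const ADMIN_TOKEN = '<personal-access-token>';
const ADMIN_USER_ID = '<admin-user-id>';

async function setUnreadCount(value) {
  const res = await fetch(`${SERVER_URL}/api/v1/settings/Unread_Count`, {
    method: 'POST',
    headers: {
      'X-Auth-Token': ADMIN_TOKEN,
      'X-User-Id': ADMIN_USER_ID,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ value }),
  });
  if (!res.ok) throw new Error(`Failed to update setting: ${res.status}`);
  return res.json();
}

// Example: stop counting every message in large channels.
setUnreadCount('user_and_group_mentions_only').then(console.log).catch(console.error);
```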

We also see a lot of this in our instance logs:
Mongodb Exception in setInterval callback: SwitchedToQuery TIMEOUT QUERY OPERATION

We would appreciate any additional hints from the experts in this forum.

Server Setup Information

  • Version of Rocket.Chat Server: v3.6.3
  • Operating System: Amazon Linux
  • Deployment Method: Containers on AWS Fargate
  • Number of Running Instances: 10
  • DB Replicaset Oplog: enabled
  • NodeJS Version: 12.16.1
  • MongoDB Version: 4.2
  • Proxy: AWS ALB
  • Firewalls involved: none

Thanks

Did you find a solution?

Same here, we are experiencing the same problem.

The servers look fine.

However, the browser collapses. We are wondering whether the number of messages it receives could be the issue here, either the volume or the rate.

Have the same problem.
Did you solve it?

My cluster:

  • 2.5k active users
  • 25 containers on Docker Swarm behind a reverse proxy
  • 3 MongoDB servers in a replica set