During the working day I had to restart the Rocket.Chat server. After the restart, problems appeared: users try to connect but cannot get in; they see only a gray screen and the animated loading dots.
I looked at the server activity and saw the following picture:
Watching the CPU activity, I noticed a consistent pattern: one core is loaded at 100% while the others are less busy; then the next core goes to 100% and the load on the previous one drops. This is visible in the screenshot.
While searching for the cause, I found that this activity occurs when users try to connect to the server en masse; that is, the server cannot cope. Users who are already connected can still work in the chat, but when they open a new conversation, the messages load with some delay.
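For anyone who wants to reproduce the observation without a screenshot, here is a minimal sketch of how the per-core load could be sampled on the server. It assumes a Linux host with the standard `/proc/stat` layout; the function names (`parse_proc_stat`, `core_load`, `sample`) are my own and not part of Rocket.Chat or any library.

```python
import time


def parse_proc_stat(text):
    """Return {core_name: (busy_ticks, total_ticks)} for every per-core 'cpuN' line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like: cpu0 user nice system idle iowait irq softirq steal ...
        # The aggregate 'cpu' line (no core number) and all other lines are skipped.
        if not fields or not fields[0].startswith("cpu") or not fields[0][3:].isdigit():
            continue
        # Use only the first 8 counters; guest time is already included in user/nice.
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
        total = sum(ticks)
        cores[fields[0]] = (total - idle, total)
    return cores


def core_load(before, after):
    """Percentage of busy time per core between two parse_proc_stat() snapshots."""
    load = {}
    for name, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(name, (0, 0))
        delta_total = total_after - total_before
        if delta_total > 0:
            load[name] = 100.0 * (busy_after - busy_before) / delta_total
        else:
            load[name] = 0.0
    return load


def sample(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return the per-core load."""
    with open(path) as stat_file:
        before = parse_proc_stat(stat_file.read())
    time.sleep(interval)
    with open(path) as stat_file:
        after = parse_proc_stat(stat_file.read())
    return core_load(before, after)
```

Calling `sample()` in a loop and printing the result once a second should show the 100% load moving from one core to the next, if the pattern is the same as the one I described.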
Previously, this problem did not arise.
What can be done about this?
The Rocket.Chat server runs on Ubuntu 18.04 in a Hyper-V virtual machine. The VM is allocated 8 cores and 16 GB of RAM (dynamic).
Installed via snap. Current version: 0.72.1.
Total number of users: 478. Active during the working day: 350-400.
How I tested user connections:
Users connect from different addresses. About half of them connect from 2 large offices with static addresses. The server is behind NAT, which lets me experiment with the rules. On the gateway, I created rules for the static addresses:
- I allowed connections from the first office: they connected, everything is OK.
- I allowed connections from the second office: they connected, everything is OK.
- I allowed connections to the server from arbitrary addresses: the load on one of the cores soars to 100%, and then the pattern described above begins.
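To see how many clients are actually connected at each of these stages, the established connections could be counted per client address. This is only a sketch under assumptions: that Rocket.Chat listens on port 3000 (adjust `CHAT_PORT` if not), that the `ss` utility is available on the server, and that client addresses are preserved by the gateway. The helper names are hypothetical.

```python
import subprocess
from collections import Counter

CHAT_PORT = "3000"  # assumption: the port Rocket.Chat listens on in this setup


def count_clients(ss_output, port=CHAT_PORT):
    """Count established TCP connections to `port`, grouped by client address."""
    clients = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected `ss -tn` columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port != port:
            continue
        peer_address = fields[4].rsplit(":", 1)[0]
        clients[peer_address] += 1
    return clients


def current_clients(port=CHAT_PORT):
    """Run `ss -tn` on this machine and count the chat clients per address."""
    result = subprocess.run(
        ["ss", "-tn"], capture_output=True, text=True, check=True
    )
    return count_clients(result.stdout, port)
```

Running `current_clients()` after enabling each gateway rule would show whether the jump in load coincides with a jump in the number of connections. If a reverse proxy sits in front of the server, all connections will appear to come from the proxy address, and this count will not be useful.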