Sudden 100% CPU usage causing unresponsive chat until service restart


#1

For a few days now, Rocket.Chat has been using 100% CPU from time to time. It stays at 100% until I restart the service, and during that time the entire chat becomes unresponsive.

At first I thought it was related to the API usage for setting/uploading avatars to RC, but I disabled this flow and the issue remains. The only thing that did change over the past days is the upgrade from 0.71 to 0.72. Yesterday I upgraded to 0.72.1, but the problem remains.

Going through the changelog of 0.72, I don’t see anything that might cause this 100% CPU usage in our case. It also doesn’t happen at the same hour of the day.

Below are two screenshots of the CPU spike that persists until I restart the process.

Screenshot 1:

Screenshot 2:

Logs
I don’t see anything in the logs that jumps out that wasn’t there earlier. I now log exceptions to an RC channel as well, but the errors don’t align time-wise with the outages.

Exceptions:

Exception in added observe/observeChanges callback:
TypeError: Cannot destructure property `username` of 'undefined' or 'null'.

BulkWriteError: E11000 duplicate key error collection: 
rocketchat.rocketchat_message index: _id_ dup key: { : "G9oJAbj6iYRqHxLjo" }

Log:

Dec  3 04:01:01 tri-rc01 systemd: Removed slice User Slice of root.
Dec  3 04:01:01 tri-rc01 systemd: Stopping User Slice of root.
Dec  3 04:01:01 tri-rc01 systemd: Created slice User Slice of root.
Dec  3 04:01:01 tri-rc01 systemd: Starting User Slice of root.
Dec  3 04:01:01 tri-rc01 systemd: Started Session 3238 of user root.
Dec  3 04:01:01 tri-rc01 systemd: Starting Session 3238 of user root.
Dec  3 04:01:01 tri-rc01 systemd: Removed slice User Slice of root.
Dec  3 04:01:01 tri-rc01 systemd: Stopping User Slice of root.
Dec  3 04:09:20 tri-rc01 rocketchat: The endpoint "permissions" is deprecated and will be removed after version v0.69
Dec  3 04:15:08 tri-rc01 rocketchat: TypeError: Cannot read property 'getRedirectURL' of undefined
Dec  3 04:15:08 tri-rc01 rocketchat: at FileUploadClass.copy (/opt/Rocket.Chat/programs/server/packages/rocketchat_file-upload.js:1008:30)
Dec  3 04:15:08 tri-rc01 rocketchat: at Object.copy (/opt/Rocket.Chat/programs/server/packages/rocketchat_file-upload.js:627:13)
Dec  3 04:15:08 tri-rc01 rocketchat: at copyFile (/opt/Rocket.Chat/programs/server/packages/rocketchat_user-data-download.js:272:20)
Dec  3 04:15:08 tri-rc01 rocketchat: at exportOperation.fileList.forEach.attachmentData (/opt/Rocket.Chat/programs/server/packages/rocketchat_user-data-download.js:536:9)
Dec  3 04:15:08 tri-rc01 rocketchat: at Array.forEach (<anonymous>:null:null)
Dec  3 04:15:08 tri-rc01 rocketchat: at continueExportOperation (/opt/Rocket.Chat/programs/server/packages/rocketchat_user-data-download.js:535:32)
Dec  3 04:15:08 tri-rc01 rocketchat: at cursor.forEach.exportOperation (/opt/Rocket.Chat/programs/server/packages/rocketchat_user-data-download.js:577:5)
Dec  3 04:15:08 tri-rc01 rocketchat: at SynchronousCursor.forEach (/opt/Rocket.Chat/programs/server/packages/mongo.js:1120:16)
Dec  3 04:15:08 tri-rc01 rocketchat: at Cursor.(anonymous function) [as forEach] (/opt/Rocket.Chat/programs/server/packages/mongo.js:918:44)
Dec  3 04:15:08 tri-rc01 rocketchat: at Object.processDataDownloads [as job] (/opt/Rocket.Chat/programs/server/packages/rocketchat_user-data-download.js:572:10)
Dec  3 04:15:08 tri-rc01 rocketchat: at packages/littledata_synced-cron.js:251:26
Dec  3 04:15:08 tri-rc01 rocketchat: at scheduleTimeout (packages/littledata_synced-cron.js:304:9)
Dec  3 04:15:08 tri-rc01 rocketchat: at packages/littledata_synced-cron.js:356:42
Dec  3 04:15:08 tri-rc01 rocketchat: at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1304:12)
Dec  3 04:15:08 tri-rc01 rocketchat: at packages/meteor.js:620:25
Dec  3 04:15:08 tri-rc01 rocketchat: at runWithEnvironment (packages/meteor.js:1356:24)
Dec  3 04:25:10 tri-rc01 rocketchat: The endpoint "permissions" is deprecated and will be removed after version v0.69

Setup and server information:
Type: VPS, only running RC and MongoDB
CPU: 4 core
RAM: 8GB
Number of users online: 50-300
Operating System: CentOS Linux release 7.6.1810
Version of Rocket.Chat Server: 0.72.1
Deployment Method: tar
Number of Running Instances: 1
Oplog: disabled
NodeJS Version: 8.12.0
MongoDB Version: v3.2.21
Node Version: v8.11.4


#2

Hello!

It’s interesting that we can see a spike in I/O and network activity at the same time CPU usage increases. Do you have top output from that time so we can see which processes were involved? If not, it would be good to capture it the next time a CPU spike happens.

Additionally, do you have any recurring scheduled jobs that might impact server performance at the times of the spikes?


#3

Today we set up nmon so we can log some more data. “Luckily” we didn’t have to wait long: I was able to take two back-to-back screenshots of nmon in interactive mode just now when it happened.

Next time I will take a screenshot of top and, if everything goes well, have additional information from nmon. But maybe the screenshots above already help a bit.


#4

I just had to restart the service again. Below are the requested screenshots. I also asked one of our members for help, and he thought this might be related:


Two screenshots of top when it just happened again:

Screenshot 1:

Screenshot 2:

NMON screenshot:

There are no recurring scheduled jobs around the times of the spikes.


#5

Can you look at your mongo logs and see if there are entries about long queries around that time? It’s possible a query or set of queries is somehow blocking and causing the Rocket.Chat side to spike.

If you don’t have mongo piping to syslog, the mongo logs should be at /var/log/mongodb.log (the name might be slightly different; I don’t have my PC at hand to check).

Most of the time, queries longer than 200 ms are logged.
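
If you want to confirm or change that threshold, something like this in the mongo shell should do it (assuming you haven’t customized the profiler settings):

    use rocketchat
    // shows the current profiler level and the slow-op threshold (slowms) in ms
    db.getProfilingStatus()
    // keep the profiler off but log any operation slower than 200 ms
    db.setProfilingLevel(0, 200)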


#6

Queries that are logged:

This one I see occur the most:
2018-12-07T21:16:01.928+0000 I COMMAND [conn72] command rocketchat.rocketchat_message command: count { count: "rocketchat_message", query: { _hidden: { $ne: true }, rid: "cwJ7WBJKbfSLZY89C", ts: { $gt: new Date(1521725899059), $lt: new Date(1544202188085) }, t: { $nin: [ "uj", "ul", "ru", "au", "user-muted", "user-unmuted" ] } } } planSummary: IXSCAN { rid: 1, ts: 1 } keyUpdates:0 writeConflicts:0 numYields:374 reslen:62 locks:{ Global: { acquireCount: { r: 750 } }, Database: { acquireCount: { r: 375 } }, Collection: { acquireCount: { r: 375 } } } protocol:op_query 196ms

2018-12-07T17:41:58.790+0000 I WRITE    [conn67] warning: log line attempted (25kB) over max size (10kB), printing beginning and end ... update rocketchat.rocketchat_message query: { _id: { $in: [ "j8BKsAmPipjMkNxZy", "ziuSRhZ3bGTrLbywi", "Ee3ovi5wTbGvLSYBN", "L37qQqGB9F6YRsQKY", "hzp7PTtxbcD2HQauS", "hLRwQ9EEjCd6GWr4F", "WRZpQ4ysFFJj33CDK", "K4L88Ahqrmx3HFeMy", "mff6MdnPQ7Wo56Qdq", "zwvFszydiQ3sPnNKq", "nMkKA7ranNhaF5RJX", "8wh43XddT3iPcheRm", "Z5cPeMHuuhFopjE47", "r6qZEiGZR74dMrrsn", "KRumMq9T7sPFPhdi3", "gPQXEMpF6QJjyGFig", "WFit7g9gdvyW2GTx5", "wHv8vDWp7Tn7YmLgS", "EvHhyx75985j6Wy2iT", "L4tyLuhGPOLgyvaC8q", "egz3OwPlrDM7mhPdX4", "4TNzG7qpSay6AN3vz", "mWoSew

I attached the log since 2018-11-30 here:
https://pastebin.com/97vRRhxu

PS: for others looking for the log, my path was /var/log/mongodb/mongod.log


#7

What time did the CPU spike in relation to the logs?

Do you have any message retention settings enabled?


#8

I was waiting for a new downtime and it just happened again, in the same minute as this line appeared in the mongod log file:

2018-12-08T10:39:36.529+0000 I COMMAND  [conn74] command rocketchat.rocketchat_message command: count { count: "rocketchat_message", query: { _hidden: { $ne: true }, rid: "cwJ7WBJKbfSLZY89C", ts: { $gt: new Date(1533390937636), $lt: new Date(1544211853354) }, t: { $nin: [ "uj", "ul", "ru", "au", "user-muted", "user-unmuted" ] } } } planSummary: IXSCAN { rid: 1, ts: 1 } keyUpdates:0 writeConflicts:0 numYields:356 reslen:62 locks:{ Global: { acquireCount: { r: 714 } }, Database: { acquireCount: { r: 357 } }, Collection: { acquireCount: { r: 357 } } } protocol:op_que

The retention policy is disabled completely.

Edit: I just checked the room ID and this is our biggest channel by far in terms of activity. When I execute the logged query manually, the result for this room is 45294.
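
For anyone who wants to run it themselves, the logged count maps to roughly this in the mongo shell (room id and timestamps copied from the log line above):

    db.rocketchat_message.count({
      _hidden: { $ne: true },
      rid: "cwJ7WBJKbfSLZY89C",
      ts: { $gt: new Date(1533390937636), $lt: new Date(1544211853354) },
      t: { $nin: [ "uj", "ul", "ru", "au", "user-muted", "user-unmuted" ] }
    })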


#9

Was there a flurry of queries like this beforehand? Or any queries in general? Is mongo responsive during this time? As in, can you hop in and run a query?

It’s interesting that it’s your biggest channel… how many people are in that channel?


#10

Yes, I just went back to examine older log files, but that query isn’t new and, as it turns out, has been occurring for at least 6 months now. I tried running the query manually multiple times in a row through Robo3T without any issues.

There are 778 users in total but definitely not all are active. When someone registers on our site he/she automatically gets access to this channel.

I now have a script in place that monitors the load of the node process and restarts Rocket.Chat if the load stays above a certain threshold for a certain number of minutes, but this is merely a workaround, of course.
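
For reference, the watchdog is roughly like the sketch below (not my exact script). It assumes a systemd unit named rocketchat and that the server runs as a single node process, so adjust both for your own deployment:

    // cpu-watchdog.js - restart Rocket.Chat when node stays pegged at high CPU.
    // Run as root (or a user allowed to call systemctl), e.g.: node cpu-watchdog.js
    const { execSync } = require('child_process');

    const CPU_THRESHOLD = 90;        // percent CPU considered "stuck"
    const SUSTAINED_CHECKS = 5;      // consecutive checks above the threshold before restarting
    const CHECK_INTERVAL_MS = 60 * 1000;

    let strikes = 0;

    function nodeCpuPercent() {
      // Sum the %CPU of every node process reported by ps
      // (this includes the watchdog itself, which should stay near 0%).
      const out = execSync('ps -C node -o %cpu=').toString();
      return out.split('\n').filter(Boolean).reduce((sum, line) => sum + parseFloat(line), 0);
    }

    setInterval(() => {
      let cpu;
      try {
        cpu = nodeCpuPercent();
      } catch (e) {
        strikes = 0;                 // ps exits non-zero when no node process is running
        return;
      }
      strikes = cpu >= CPU_THRESHOLD ? strikes + 1 : 0;
      if (strikes >= SUSTAINED_CHECKS) {
        console.log(`node at ${cpu}% CPU for ${strikes} checks, restarting rocketchat`);
        execSync('systemctl restart rocketchat');   // unit name is an assumption
        strikes = 0;
      }
    }, CHECK_INTERVAL_MS);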