Sudden 100% CPU usage causing non-responsive chat until service restart

More information in case it's helpful.

I have changed the storage engine back to mmapv1, as I saw some posts saying it is preferred.

Still having the same issues. Even with just a single user logged in and SlackBridge disabled, CPU still goes to 100%, the service eventually fails, and the containers restart.

I think there must be bad data stored in the users table that's triggering this issue. Is there a cleanup routine or a data validation I can run to clear bad data? As far as I can tell, it's only that table that's causing issues.

Wow, that's some serious wait time on a query. Can you grab the user object it's delaying on? I know it might be a pain to redact. Is it the same ID each time, or a random one?

Well, I think I have found the issue, so here's something we can script to fix in case someone else hits it.

In each user document, the services field was huge: for a single user it was 28 MB as text. The login sessions stored there (all without dates) were enormous. I deleted these for 2 users (and am cleaning up the others) and it's so much faster now.

I presume there's a cleanup routine that deletes old sessions, but without dates they get ignored.

I am not a Mongo person, but I will look at writing a script to delete all sessions that have no date.
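
Something along these lines is what I'm thinking (a rough mongo shell sketch, not tested; the database name is a guess and the dateless tokens are assumed to sit under services.resume.loginTokens, so back up before running anything):

```js
// Rough mongo shell sketch (untested). "rocketchat" is a guess at the
// database name; adjust it for your install and take a backup first.
var rc = db.getSiblingDB("rocketchat");

// For every user that carries at least one resume login token without a
// "when" date, pull all of those dateless tokens out of the array.
rc.users.updateMany(
  { "services.resume.loginTokens": { $elemMatch: { when: { $exists: false } } } },
  { $pull: { "services.resume.loginTokens": { when: { $exists: false } } } }
);
```

The $pull only touches entries without a when, so sessions that do have a date should be left alone and expire normally.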

I am also assuming this data came from an old version, and there should probably be a published routine or something to clean it up.

I think the logs I had only pointed at the table causing the issues and locks, not the actual row, so they weren't entirely helpful.

Gotta say that I now know far more about MongoDB than I thought I'd ever need :slight_smile:

Any thoughts about this issue?

I have my instance running with 30 users and it is so annoying because I have to restart the server manually.

Hi,
I also have this problem. Has anybody found a solution? I have to restart the server manually every couple of days.

Do all of you have REST API or bot-type traffic (running periodically) in addition to the regular human users?

So were they under loginTokens or resumeTokens?

I'm guessing significant REST API traffic might lead to this, but I'm not sure why the tokens would be inserted without an expiry date. What version are you running?

They were login tokens. We don't do much REST API traffic, but we do have some. We tend to run close to the latest version when we remember to update or are looking for a bug fix. We have been using the same system for a long time, however, so it may have been old data.
All the current entries have a when attribute, and we haven't had an issue since.
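
If anyone wants to check whether they have the same problem, something like this from the mongo shell should show it (again only a sketch; the database name is a guess and may differ on your install):

```js
// Sketch: list users that still carry resume login tokens without a
// "when" date, and how many login tokens each of them holds in total.
var rc = db.getSiblingDB("rocketchat");

rc.users.find(
  { "services.resume.loginTokens": { $elemMatch: { when: { $exists: false } } } },
  { username: 1, "services.resume.loginTokens": 1 }
).forEach(function (u) {
  print(u.username + ": " + u.services.resume.loginTokens.length + " login tokens");
});
```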

Hello!
I’m having the same issue.
CPU goes to 90-100% and Rocket.Chat logs timeouts. After some time (about 10 minutes), Rocket.Chat comes back online.

t3.micro instance on AWS.
Ubuntu 18.04 - Snap Installation.
Brand New Installation (5 days)

CPU usage at peak: [screenshot]

Rocket.Chat log:

I20190622-06:53:57.404(0) Exception in setInterval callback: { MongoNetworkError: connection 6 to localhost:27017 timed out at Socket. (/snap/rocketchat-server/1377/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:259:7) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:410:8) at ontimeout (timers.js:498:11) at tryOnTimeout (timers.js:323:5) at Timer.listOnTimeout (timers.js:290:5) name: ‘MongoNetworkError’, errorLabels: [ ‘TransientTransactionError’ ], [Symbol(mongoErrorContextSymbol)]: {} }

I20190622-06:56:44.668(0) Exception in setTimeout callback: { MongoNetworkError: connection 3 to localhost:27017 timed out at Socket. (/snap/rocketchat-server/1377/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:259:7) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:410:8) at ontimeout (timers.js:498:11) at tryOnTimeout (timers.js:323:5) at Timer.listOnTimeout (timers.js:290:5) => awaited here: at Function.Promise.await (/snap/rocketchat-server/1377/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:56:12) at Promise.asyncApply (packages/mongo/mongo_driver.js:1042:14) at /snap/rocketchat-server/1377/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/fiber_pool.js:43:40 => awaited here: at Promise.await (/snap/rocketchat-server/1377/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:60:12) at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1091:12) at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1105:22) at SynchronousCursor.map (packages/mongo/mongo_driver.js:1115:10) at SynchronousCursor.fetch (packages/mongo/mongo_driver.js:1139:17) at Cursor.(anonymous function) [as fetch] (packages/mongo/mongo_driver.js:887:44) at PeerPinger.pingAllPeers (app/federation/server/PeerPinger.js:30:113) at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1304:12) at packages/meteor.js:620:25 at runWithEnvironment (packages/meteor.js:1356:24) name: ‘MongoNetworkError’, errorLabels: [ ‘TransientTransactionError’ ], [Symbol(mongoErrorContextSymbol)]: {} }

I20190622-06:56:51.902(0) Exception in setInterval callback: { MongoNetworkError: connection 5 to localhost:27017 timed out at Socket. (/snap/rocketchat-server/1377/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:259:7) at Object.onceWrapper (events.js:313:30) at emitNone (events.js:106:13) at Socket.emit (events.js:208:7) at Socket._onTimeout (net.js:410:8) at ontimeout (timers.js:498:11) at tryOnTimeout (timers.js:323:5) at Timer.listOnTimeout (timers.js:290:5) name: ‘MongoNetworkError’, errorLabels: [ ‘TransientTransactionError’ ], [Symbol(mongoErrorContextSymbol)]: {} }

MongoDB log:

-- Logs begin at Wed 2019-06-19 17:58:37 UTC. --
Jun 22 06:58:38 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T06:58:38.060+0000 I WRITE[conn23] update parties.instances query: { _id: “9PocQ7EM3e5mkwKNX” } planSummary: IDHACK update: { $currentDate: { _updatedAt: true } } keysExamined:0 docsExamined:0 nMatched:0 nModified:0 numYields:2 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } 160ms

Jun 22 06:58:38 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T06:58:38.120+0000 I COMMAND [conn28] command parties.rocketchat_cron_history command: insert { insert: “rocketchat_cron_history”, documents: [ { intendedAt: new Date(1561185900000), name: “Generate download files for user data”, startedAt: new Date(1561186054551), _id: “RyHRiaaLoiGpu5mke” } ], ordered: true } ninserted:1 keysInserted:3 numYields:0 reslen:104 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 286ms

Jun 22 06:58:38 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T06:58:38.121+0000 I COMMAND [conn23] command parties.$cmd command: update { update: "instances", updates: [ { q: { _id: "9PocQ7EM3e5mkwKNX" }, u: { $currentDate: { _updatedAt: true } }, upsert: false, multi: false } ], ordered: true } numYields:0 reslen:119 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } protocol:op_query 308ms

Jun 22 06:58:38 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T06:58:38.137+0000 I COMMAND [conn21] command parties.users command: find { find: "users", filter: { $or: [ { services.resume.loginTokens.when: { $lt: new Date(1553410026728) } }, { services.resume.loginTokens.when: { $lt: 1553410026728.0 } } ] }, returnKey: false, showRecordId: false } planSummary: IXSCAN { services.resume.loginTokens.when: 1 } keysExamined:1 docsExamined:0 cursorExhausted:1 numYields:1 nreturned:0 reslen:101 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 371ms

Jun 22 06:58:38 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T06:58:38.260+0000 I COMMAND [conn23] command parties._raix_push_notifications command: find { find: “_raix_push_notifications”, filter: { $and: [ { sent: false }, { sending: { $lt: 1561186611788.0 } }, { $or: [ { delayUntil: { $exists: false } }, { delayUntil: { $lte: new Date(1561186611792) } } ] } ] }, sort: { createdAt: 1 }, limit: 10, returnKey: false, showRecordId: false } planSummary: IXSCAN { createdAt: 1 } keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:1 nreturned:0 reslen:120 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 138ms

Jun 22 06:58:38 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T06:58:38.277+0000 I COMMAND [conn28] command parties.rocketchat_sessions command: find { find: “rocketchat_sessions”, filter: { instanceId: “9PocQ7EM3e5mkwKNX”, sessionId: “CjZs7JiSGAq7fFaDz”, closedAt: { $exists: 0 } }, limit: 1, returnKey: false, showRecordId: false } planSummary: IXSCAN { instanceId: 1, sessionId: 1 } keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:573 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 120ms

Jun 22 06:58:38 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T06:58:38.304+0000 I COMMAND [conn27] command parties.rocketchat_cron_history command: insert { insert: “rocketchat_cron_history”, documents: [ { intendedAt: new Date(1561185900000), name: “Generate download files for user data”, startedAt: new Date(1561186043404), _id: “4JhqzTBPYcTSuhrCG” } ], ordered: true } ninserted:0 keysInserted:1 writeConflicts:1 exception: E11000 duplicate key error collection: parties.rocketchat_cron_history index: intendedAt_1_name_1 dup key: { : new Date(1561185900000), : “Generate download files for user data” } code:11000 numYields:0 reslen:343 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 2 } } } protocol:op_query 478ms

Jun 22 06:58:38 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T06:58:38.307+0000 I COMMAND [conn29] command parties.rocketchat_cron_history command: insert { insert: “rocketchat_cron_history”, documents: [ { intendedAt: new Date(1561185900000), name: “Generate download files for user data”, startedAt: new Date(1561186063299), _id: “YXHhApGrXYo2QDYgs” } ], ordered: true } ninserted:0 keysInserted:1 writeConflicts:1 exception: E11000 duplicate key error collection: parties.rocketchat_cron_history index: intendedAt_1_name_1 dup key: { : new Date(1561185900000), : “Generate download files for user data” } code:11000 numYields:0 reslen:343 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 2 } } } protocol:op_query 481ms

Jun 22 18:04:15 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T18:04:15.637+0000 I COMMAND [conn25] command local.oplog.rs command: getMore { getMore: 14299273104, collection: “oplog.rs”, batchSize: 1000 } originatingCommand: { find: “oplog.rs”, filter: { ns: /^(?:parties.|admin.$cmd)/, $or: [ { op: { $in: [ “i”, “u”, “d” ] } }, { op: “c”, o.drop: { $exists: true } }, { op: “c”, o.dropDatabase: 1 }, { op: “c”, o.applyOps: { $exists: true } } ], ts: { $gt: Timestamp 1561185986000|1 } }, returnKey: false, showRecordId: false, tailable: true, oplogReplay: true, awaitData: true } planSummary: COLLSCAN cursorid:14299273104 keysExamined:0 docsExamined:0 numYields:0 nreturned:0 reslen:101 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } protocol:op_query 1188ms

Jun 22 18:04:20 ip-172-31-44-206 rocketchat-server.rocketchat-mongo[859]: 2019-06-22T18:04:20.428+0000 I COMMAND [conn27] command parties.$cmd command: update { update: “instances”, updates: [ { q: { _id: “9PocQ7EM3e5mkwKNX” }, u: { $currentDate: { _updatedAt: true } }, upsert: false, multi: false } ], ordered: true } numYields:0 reslen:119 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 2 } }, Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } protocol:op_query 111ms

Any idea what happened around that time? What about memory usage? How many users were connected then?

Hi,

until a few days ago I had a similar issue where one process got stuck at 100% CPU load.
I could see that all of the process's sockets were in the CLOSE_WAIT state.

I deactivated the "Federation" support and it has not happened again since.

Version of Rocket.Chat Server: 1.2.0
Operating System: CentOS 7
Deployment Method: TAR
Number of Running Instances: 3
DB Replicaset Oplog: yes
NodeJS Version: 6.11.4
MongoDB Version: 4.0
Proxy: NGINX
Firewalls involved: Yes

Best regards
Timmi