Hey guys, I think I have some good news!
This problem started happening more frequently lately (daily for the past few days) and it looks like I’ve got a solution.
Last night I added the environment variable USE_NATIVE_OPLOG=true. This disables the newer oplog tailing code added in RC 3.7.0 that is using change streams and instead reverts back to Meteor’s oplog tailing (https://github.com/RocketChat/Rocket.Chat/pull/18892).
I checked the RC server logs this morning and saw that Atlas had what looks like 2 Mongo primary elections in the space of 7 hours overnight and for the first time, my production RC instance survived them both and there is no trace of this error anymore in the UI. I can post messages and they appear instantly (instead of being greyed out), the recipient sees them right away if they are online, user presence is working and email push notifications are being delivered.
Here is the redacted environment variable entry for my Rocket.Chat service file in case that helps anyone:
Environment=USE_NATIVE_OPLOG=true MONGO_URL=mongodb+srv://USERNAME:PASSWORD@CLUSTER_NAME.mongodb.net/DB_NAME?retryWrites=true&readPreference=primaryPreferred&w=majority MONGO_OPLOG_URL=mongodb+srv://USERNAME:PASSWORD@CLUSTER_NAME.mongodb.net/local ROOT_URL=https://mydomain.com/ PORT=3000
Note that the “mongodb+srv
” URI syntax no longer requires replicaSet
or authSource
to be set on MONGO_URL or MONGO_OPLOG_URL.
====================================
Here is some previous analysis that I did which may be useful…
Some different scenarios I tested to help isolate the issue:
- Custom RC 3.7.1 codebase on my laptop (OS X) + local MongoDB replica set without USE_NATIVE_OPLOG=true -> NO problem
- Custom RC 3.7.1 codebase on my laptop (OS X) + Atlas MongoDB replica set without USE_NATIVE_OPLOG=true -> Problem reproduced
- Custom RC 3.7.1 codebase on DigitalOcean droplet (Ubuntu + Caddy 2) + Atlas MongoDB replica set without USE_NATIVE_OPLOG=true -> Problem reproduced
- Custom RC 3.7.1 codebase on DigitalOcean droplet (Ubuntu + Caddy 2) + Atlas MongoDB replica set with USE_NATIVE_OPLOG=true -> NO Problem
Since I was able to reproduce the problem on both my laptop and my production environment, that meant it was probably not related to my production environment.
Since I was not able to reproduce it locally when using a local MongoDB replica set, that meant it was probably something specific that’s happening on Atlas - which makes sense since Atlas frequently does primary elections, which I never do on my local Mongo replica set.
====================================
Also, I’m now seeing some new errors in my production RC logs apparently related to Atlas Mongo primary election (but they don’t seem to cause any problems or if they do, RC can now recover from them successfully):
ERROR 1: MongoError: not master
Dec 17 11:31:07 **************** rocketchat[25979]: Exception in setInterval callback: MongoError: not master
Dec 17 11:31:07 **************** rocketchat[25979]: at Connection. (/opt/Rocket.Chat/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/pool.js:451:61)
Dec 17 11:31:07 **************** rocketchat[25979]: at Connection.emit (events.js:210:5)
Dec 17 11:31:07 **************** rocketchat[25979]: at Connection.EventEmitter.emit (domain.js:475:20)
Dec 17 11:31:07 **************** rocketchat[25979]: at processMessage (/opt/Rocket.Chat/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/connection.js:452:10)
Dec 17 11:31:07 **************** rocketchat[25979]: at TLSSocket. (/opt/Rocket.Chat/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/connection.js:621:15)
Dec 17 11:31:07 **************** rocketchat[25979]: at TLSSocket.emit (events.js:210:5)
Dec 17 11:31:07 **************** rocketchat[25979]: at TLSSocket.EventEmitter.emit (domain.js:475:20)
Dec 17 11:31:07 **************** rocketchat[25979]: at addChunk (_stream_readable.js:309:12)
Dec 17 11:31:07 **************** rocketchat[25979]: at readableAddChunk (stream_readable.js:290:11)
Dec 17 11:31:07 **************** rocketchat[25979]: at TLSSocket.Readable.push (stream_readable.js:224:10)
Dec 17 11:31:07 **************** rocketchat[25979]: at TLSWrap.onStreamRead (internal/stream_base_commons.js:182:23) {
Dec 17 11:31:07 **************** rocketchat[25979]: operationTime: Timestamp { bsontype: ‘Timestamp’, low: 2, high: 1608204667 },
Dec 17 11:31:07 **************** rocketchat[25979]: ok: 0,
Dec 17 11:31:07 **************** rocketchat[25979]: code: 10107,
Dec 17 11:31:07 **************** rocketchat[25979]: codeName: ‘NotWritablePrimary’,
Dec 17 11:31:07 **************** rocketchat[25979]: ‘$clusterTime’: {
Dec 17 11:31:07 **************** rocketchat[25979]: clusterTime: Timestamp { bsontype: ‘Timestamp’, low: 3, high: 1608204667 },
Dec 17 11:31:07 **************** rocketchat[25979]: signature: { hash: [Binary], keyId: [Long] }
Dec 17 11:31:07 **************** rocketchat[25979]: },
Dec 17 11:31:07 **************** rocketchat[25979]: name: ‘MongoError’,
Dec 17 11:31:07 **************** rocketchat[25979]: [Symbol(errorLabels)]: Set { ‘RetryableWriteError’ }
Dec 17 11:31:07 **************** rocketchat[25979]: }
ERROR 2: UnHandledPromiseRejection - MongoError: not master
Dec 17 16:05:20 **************** rocketchat[25979]: === UnHandledPromiseRejection ===
Dec 17 16:05:20 **************** rocketchat[25979]: MongoError: not master
Dec 17 16:05:20 **************** rocketchat[25979]: at Connection. (/opt/Rocket.Chat/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/pool.js:451:61)
Dec 17 16:05:20 **************** rocketchat[25979]: at Connection.emit (events.js:210:5)
Dec 17 16:05:20 **************** rocketchat[25979]: at Connection.EventEmitter.emit (domain.js:475:20)
Dec 17 16:05:20 **************** rocketchat[25979]: at processMessage (/opt/Rocket.Chat/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/connection.js:452:10)
Dec 17 16:05:20 **************** rocketchat[25979]: at TLSSocket. (/opt/Rocket.Chat/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/connection.js:621:15)
Dec 17 16:05:20 **************** rocketchat[25979]: at TLSSocket.emit (events.js:210:5)
Dec 17 16:05:20 **************** rocketchat[25979]: at TLSSocket.EventEmitter.emit (domain.js:475:20)
Dec 17 16:05:20 **************** rocketchat[25979]: at addChunk (_stream_readable.js:309:12)
Dec 17 16:05:20 **************** rocketchat[25979]: at readableAddChunk (stream_readable.js:290:11)
Dec 17 16:05:20 **************** rocketchat[25979]: at TLSSocket.Readable.push (stream_readable.js:224:10)
Dec 17 16:05:20 **************** rocketchat[25979]: at TLSWrap.onStreamRead (internal/stream_base_commons.js:182:23)
Dec 17 16:05:20 **************** rocketchat[25979]: => awaited here:
Dec 17 16:05:20 **************** rocketchat[25979]: at Function.Promise.await (/opt/Rocket.Chat/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:56:12)
Dec 17 16:05:20 **************** rocketchat[25979]: at app/models/server/raw/NotificationQueue.ts:60:15
Dec 17 16:05:20 **************** rocketchat[25979]: at /opt/Rocket.Chat/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/fiber_pool.js:43:40
Dec 17 16:05:20 **************** rocketchat[25979]: => awaited here:
Dec 17 16:05:20 **************** rocketchat[25979]: at Function.Promise.await (/opt/Rocket.Chat/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:56:12)
Dec 17 16:05:20 **************** rocketchat[25979]: at app/notification-queue/server/NotificationQueue.ts:48:21
Dec 17 16:05:20 **************** rocketchat[25979]: at /opt/Rocket.Chat/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/fiber_pool.js:43:40 {
Dec 17 16:05:20 **************** rocketchat[25979]: operationTime: Timestamp { bsontype: ‘Timestamp’, low: 6, high: 1608221119 },
Dec 17 16:05:20 **************** rocketchat[25979]: ok: 0,
Dec 17 16:05:20 **************** rocketchat[25979]: code: 10107,
Dec 17 16:05:20 **************** rocketchat[25979]: codeName: ‘NotWritablePrimary’,
Dec 17 16:05:20 **************** rocketchat[25979]: ‘$clusterTime’: {
Dec 17 16:05:20 **************** rocketchat[25979]: clusterTime: Timestamp { bsontype: ‘Timestamp’, low: 8, high: 1608221119 },
Dec 17 16:05:20 **************** rocketchat[25979]: signature: { hash: [Binary], keyId: [Long] }
Dec 17 16:05:20 **************** rocketchat[25979]: },
Dec 17 16:05:20 **************** rocketchat[25979]: name: ‘MongoError’,
Dec 17 16:05:20 **************** rocketchat[25979]: [Symbol(errorLabels)]: Set { ‘RetryableWriteError’ }
Dec 17 16:05:20 **************** rocketchat[25979]: }
Dec 17 16:05:20 **************** rocketchat[25979]: ---------------------------------
Dec 17 16:05:20 **************** rocketchat[25979]: Errors like this can cause oplog processing errors.
Dec 17 16:05:20 **************** rocketchat[25979]: Setting EXIT_UNHANDLEDPROMISEREJECTION will cause the process to exit allowing your service to automatically restart the process
Dec 17 16:05:20 **************** rocketchat[25979]: Future node.js versions will automatically exit the process
Dec 17 16:05:20 **************** rocketchat[25979]: =================================