Mongo container restarting continuously; oplog size error

Description

Our locally hosted chat server has crashed, and the mongo service is failing to start up.

root@studiochat:~# docker-compose up -d
Recreating root_mongo_1 … done
Recreating root_rocketchat_1 … done
Recreating root_mongo-init-replica_1 … done
Recreating root_hubot_1 … done

The root_mongo_1 container keeps restarting.

Following is the error reported in the log (full log attached) -

{"log":"2021-07-08T03:19:28.379+0000 I NETWORK [LogicalSessionCacheReap] Successfully connected to localhost:27017 (1 connections now open to localhost:27017 with a 5 second timeout)\n","stream":"stdout","time":"2021-07-08T03:19:28.38007768Z"}

{"log":"2021-07-08T03:19:28.380+0000 W NETWORK [LogicalSessionCacheReap] Unable to reach primary for set rs0\n","stream":"stdout","time":"2021-07-08T03:19:28.380517864Z"}

{"log":"2021-07-08T03:19:28.381+0000 I REPL [replexec-0] transition to RECOVERING from STARTUP2\n","stream":"stdout","time":"2021-07-08T03:19:28.381474128Z"}

{"log":"2021-07-08T03:19:28.381+0000 I REPL [replexec-0] Starting replication fetcher thread\n","stream":"stdout","time":"2021-07-08T03:19:28.381501349Z"}

{"log":"2021-07-08T03:19:28.381+0000 I REPL [replexec-0] Starting replication applier thread\n","stream":"stdout","time":"2021-07-08T03:19:28.38180951Z"}

{"log":"2021-07-08T03:19:28.381+0000 I REPL [replexec-0] Starting replication reporter thread\n","stream":"stdout","time":"2021-07-08T03:19:28.381843154Z"}

{"log":"2021-07-08T03:19:28.381+0000 I REPL [rsSync-0] Starting oplog application\n","stream":"stdout","time":"2021-07-08T03:19:28.381932863Z"}

{"log":"2021-07-08T03:19:28.382+0000 I REPL [rsSync-0] transition to SECONDARY from RECOVERING\n","stream":"stdout","time":"2021-07-08T03:19:28.382933469Z"}

{"log":"2021-07-08T03:19:28.383+0000 I REPL [rsSync-0] conducting a dry run election to see if we could be elected. current term: 782\n","stream":"stdout","time":"2021-07-08T03:19:28.383206051Z"}

{"log":"2021-07-08T03:19:28.383+0000 I REPL [replexec-0] dry election run succeeded, running for election in term 783\n","stream":"stdout","time":"2021-07-08T03:19:28.383382223Z"}

{"log":"2021-07-08T03:19:28.393+0000 I REPL [replexec-0] election succeeded, assuming primary role in term 783\n","stream":"stdout","time":"2021-07-08T03:19:28.393883858Z"}

{"log":"2021-07-08T03:19:28.394+0000 I REPL [replexec-0] transition to PRIMARY from SECONDARY\n","stream":"stdout","time":"2021-07-08T03:19:28.394129048Z"}

{"log":"2021-07-08T03:19:28.394+0000 I REPL [replexec-0] Resetting sync source to empty, which was :27017\n","stream":"stdout","time":"2021-07-08T03:19:28.394291254Z"}

{"log":"2021-07-08T03:19:28.394+0000 I REPL [replexec-0] Entering primary catch-up mode.\n","stream":"stdout","time":"2021-07-08T03:19:28.394592251Z"}

{"log":"2021-07-08T03:19:28.394+0000 I REPL [replexec-0] Exited primary catch-up mode.\n","stream":"stdout","time":"2021-07-08T03:19:28.394624864Z"}

{"log":"2021-07-08T03:19:28.394+0000 I REPL [replexec-0] Stopping replication producer\n","stream":"stdout","time":"2021-07-08T03:19:28.394703921Z"}

{"log":"2021-07-08T03:19:28.394+0000 I **REPL [ReplBatcher] Oplog buffer has been drained in term 783\n","stream":"stdout","time":"2021-07-08T03:19:28.395103991Z"}**

**{"log":"2021-07-08T03:19:28.395+0000 E - [rsSync-0] Assertion: Location17322: write to oplog failed: DocTooLargeForCapped: document doesn't fit in capped collection. size: 124 storageSize:134217728** src/mongo/db/repl/oplog.cpp 142\n","stream":"stdout","time":"2021-07-08T03:19:28.396029271Z"}

{"log":"2021-07-08T03:19:28.396+0000 F - [rsSync-0] terminate() called. An exception is active; attempting to gather more information\n","stream":"stdout","time":"2021-07-08T03:19:28.396346626Z"}

{"log":"2021-07-08T03:19:28.432+0000 F - [rsSync-0] **DBException::toString(): Location17322: write to oplog failed: DocTooLargeForCapped: document doesn't fit in capped collection. size: 124 storageSize:134217728** \n","stream":"stdout","time":"2021-07-08T03:19:28.432546701Z"}

{"log":"Actual exception type: mongo::error_details::throwExceptionForStatus(mongo::Status const\u0026)::NonspecificAssertionException\n","stream":"stdout","time":"2021-07-08T03:19:28.432628803Z"}

Changing the oplog size from the default 128 to 512 in the docker-compose file has no effect. We are also not able to get to the mongo shell through the docker exec command, so we cannot change the oplog size at all.
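Concretely (a sketch only - the container name root_mongo_1 comes from the compose output above, and the exact commands are illustrative):

# the mongod line in docker-compose.yml was changed from --oplogSize 128
# to --oplogSize 512 before recreating the containers, with no effect

# and attaching a shell fails because the container keeps restarting:
docker exec -it root_mongo_1 mongo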

Any advice on how to resolve this issue?

Server Setup Information

  • Version of Rocket.Chat Server: 3.9.0
  • Operating System: Linux
  • Deployment Method: docker
  • Number of Running Instances: 1
  • DB Replicaset Oplog:
  • NodeJS Version:
  • MongoDB Version: 4.0.22
  • Proxy: nginx
  • Firewalls involved:

Additional Info -
The docker-compose config is as follows -

version: '2'

services:
  rocketchat:
    image: rocketchat/rocket.chat:latest
    command: >
      bash -c
        "for i in `seq 1 30`; do
          node main.js &&
          s=$$? && break || s=$$?;
          echo \"Tried $$i times. Waiting 5 secs...\";
          sleep 5;
        done; (exit $$s)"
    restart: unless-stopped
    volumes:
      - /root/uploads:/app/uploads
    environment:
      - PORT=3000
      - ROOT_URL=http://192.168.2.201:3000
      - MONGO_URL=mongodb://mongo:27017/rocketchat
      - MONGO_OPLOG_URL=mongodb://mongo:27017/local
      - MAIL_URL=smtp://smtp.email
#       - USE_NATIVE_OPLOG=true
#       - HTTP_PROXY=http://proxy.domain.com
#       - HTTPS_PROXY=http://proxy.domain.com
    depends_on:
      - mongo
    ports:
      - 3000:3000
    labels:
      - "traefik.backend=rocketchat"
      - "traefik.frontend.rule=Host: your.domain.tld"

  mongo:
    image: mongo:4.0
    restart: unless-stopped
    volumes:
     - ./data/db:/data/db
     #- ./data/dump:/dump
    command: mongod --smallfiles --oplogSize 128 --replSet rs0 --storageEngine=mmapv1
    labels:
      - "traefik.enable=false"
#     shm_size: '2gb'

  # this container's job is just run the command to initialize the replica set.
  # it will run the command and remove itself (it will not stay running)
  mongo-init-replica:
    image: mongo:4.0
    command: >
      bash -c
        "for i in `seq 1 30`; do
          mongo mongo/rocketchat --eval \"
            rs.initiate({
              _id: 'rs0',
              members: [ { _id: 0, host: 'localhost:27017' } ]})\" &&
          s=$$? && break || s=$$?;
          echo \"Tried $$i times. Waiting 5 secs...\";
          sleep 5;
        done; (exit $$s)"
    depends_on:
      - mongo

  # hubot, the popular chatbot (add the bot user first and change the password before starting this image)
  hubot:
    image: rocketchat/hubot-rocketchat:latest
    restart: unless-stopped
    environment:
      - ROCKETCHAT_URL=rocketchat:3000
      - ROCKETCHAT_ROOM=GENERAL
      - ROCKETCHAT_USER=bot
      - ROCKETCHAT_PASSWORD=botpassword
      - BOT_NAME=bot
  # you can add more scripts as you'd like here, they need to be installable by npm
      - EXTERNAL_SCRIPTS=hubot-help,hubot-seen,hubot-links,hubot-diagnostics
    depends_on:
      - rocketchat
    labels:
      - "traefik.enable=false"
    volumes:
      - ./scripts:/home/hubot/scripts
  # this is used to expose the hubot port for notifications on the host on port 3001, e.g. for hubot-jenkins-notifier
    ports:
      - 3001:8080

Have a read of this

So, first thing: you have Rocket.Chat 3.9.x, but your compose file shows latest, and we are on 3.16.1 right now, so you are missing some history here. (A good reason to never use latest in production, so you don’t get unexpected upgrades.)
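For example, pinning the image in docker-compose.yml to the version you are actually running (the tag here is illustrative) keeps upgrades deliberate:

  rocketchat:
    # pin an explicit tag instead of latest so the server only upgrades when you choose to
    image: rocketchat/rocket.chat:3.9.0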

Please go back a few steps and tell us what happened prior to this. Did you make any changes or modifications?

This may help you provide the right information to diagnose your issue. Some hardware and OS information might also be helpful.

Hi John,

The server version is actually 3.10.4; sorry for the wrong info earlier. There have been no modifications to the server settings in the last couple of days. Prior to that, we were setting up the file upload feature but then decided to turn it off when we realized that there is currently no permission control for file uploads. I don’t think there’s anything else to suspect.

Other details are as follows -
os: Linux x64, release: 4.15.0-142-generic
totalmem: 8364220416,
freemem: 4914024448,
cpus: 4 x Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz

nodeVersion: v12.18.4,
instanceCount: 1,
oplogEnabled: Yes,
mongoVersion: 4.0.22,
mongoStorageEngine: mmapv1

I am going to ask this just in case: you do have a backup, don’t you?

OK - 3.10.x should have been upgraded a while back :frowning:

Hmmmmmm… coincidence? What were you doing with the file uploads??

Did this occur straight after?

I’ll try and get some brains on this.

Another question: how are the files being stored? Are they on the file system, in GridFS, or on S3? If they are in GridFS, then that is the database, and potentially changing those settings caused something to trip up.

Yup, a few things:

  1. GridFS: this is a good point, as it will cause your database size to grow a lot.
  2. Double-check the disk this docker-compose file is on. Make sure you still have room. How big is that data/db folder? (See the commands below.)
  3. How much RAM is available on the server?
  4. Finally… if you start just mongo and not Rocket.Chat, does it still crash, or does it only crash after Rocket.Chat starts talking to it?
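For point 2, something like this from the directory holding the docker-compose file:

# free space on the disk holding the compose project and its volumes
df -h .

# size of the mongo data directory bind-mounted into the container
du -sh ./data/db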

If it’s only crashing when Rocket.Chat talks to it… I’d leave Rocket.Chat off. Do a mongodump somewhere and verify the backup. Then stop mongo, delete or (better yet) rename that ./data/db folder, and start mongo again. Then mongorestore.

You can do docker-compose stop and then docker-compose up -d mongo to start just mongo by itself and see how it behaves.
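Roughly like this (a sketch only - the container name, archive path, and rs.initiate step are assumptions based on the compose file above; verify the dump before touching ./data/db):

# stop everything, then bring up only mongo
docker-compose stop
docker-compose up -d mongo

# dump the database from the host (assumes mongod stays up with Rocket.Chat left off)
docker exec root_mongo_1 mongodump --archive > rocketchat-backup.archive

# verify the backup, then stop mongo and move the old data directory aside
docker-compose stop mongo
mv ./data/db ./data/db.old

# start mongo on a fresh data directory (give mongod a few seconds to come up),
# then re-initiate the single-node replica set - the same command the
# mongo-init-replica container runs
docker-compose up -d mongo
docker exec root_mongo_1 mongo --eval "rs.initiate({_id: 'rs0', members: [{_id: 0, host: 'localhost:27017'}]})"

# restore the dump
docker exec -i root_mongo_1 mongorestore --archive < rocketchat-backup.archive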