High CPU making chat unusable

Description

My instance of RC is becoming unusable because it consumes so much CPU. Mongo consumes about 400% CPU on a large server with only a few users online. This started a few weeks ago and is gradually getting worse; currently anonymous login isn’t even working since it can’t load the username suggestions.

I considered it may be due to an attack, but I’ve enabled DDoS protection from Cloudflare and the result is the same. There are, however, no problems on my dev server; the problem starts on production when there are users.

It would be much appreciated if anyone could help sort this out. I have previously restored an old backup of the server since I initially thought it was due to updating to 3.5.0, but a couple of days after restoring the backup the problems arose again.

Server Setup Information

  • Version of Rocket.Chat Server: 3.0.12 and 3.5.0
  • Operating System: Ubuntu
  • Deployment Method: Docker-compose
  • Number of Running Instances: 1
  • MongoDB Version: 3.6
  • Proxy: Nginx

@cchbr Many are running Rocket.Chat on a Raspberry Pi 4 and handling hundreds of users with few reported problems.

What “server” are you using? Please provide some details:

  1. is it a physical box you own? VM? VPS?
  2. what is the size of memory, how many cores, disk space?
  3. if it is in the cloud, which one?
  4. which version of the operating system is it running on?
  5. what else, other than Rocket.Chat, are you running on the “server”?
  1. It’s a VPS.
  2. 6 vCPUs, 16 GB Memory, 320 GB SSD.
  3. It’s a droplet at Digital Ocean.
  4. Ubuntu 18.04.3 LTS
  5. I’m also running a Wildfly server on it but it’s barely used by any users and doesn’t consume a lot of resources.

I thought of starting a completely new instance, in case something has become corrupted in my current one. But I suppose migrating all the data will be troublesome since dumping the whole DB will probably result in the same issues on the new server.

Yes. @cchbr The large amount of VPS memory PLUS the memory-hungry Wildfly PLUS Mongo is likely what is causing your problem.

Please read about how mongo allocates memory here: https://docs.mongodb.com/v4.2/core/wiredtiger/

Your best bet is to isolate mongoDB on its own VPS instance.
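
For illustration only, and only if you are on the default WiredTiger storage engine, the cache can be capped directly from the mongod command in your compose file (the 2 GB figure below is just an example, not a recommendation):

services:
  mongo:
    image: mongo:3.6
    # only applies with the WiredTiger storage engine; caps the cache so mongod
    # does not try to claim roughly half of the host's RAM
    command: mongod --replSet rs0 --oplogSize 128 --wiredTigerCacheSizeGB 2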

Run mongodb in a docker container or use cgroups to limit its memory usage.
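
Something along these lines in the compose file would do it; mem_limit works with any 2.x compose file format, while cpus needs 2.2 or later, and the numbers below are only placeholders:

version: '2.2'

services:
  mongo:
    image: mongo:3.6
    restart: unless-stopped
    # hard memory cap enforced via cgroups
    mem_limit: 4096m
    # CPU cap (compose file format 2.2+): limit mongod to two cores
    cpus: 2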

Thanks so much for your responses.

It’s deployed with docker-compose and I have now tried to limit memory usage with mem_limit: 2048m; I also tried using the cpu_shares limit on the mongo container. But mongod still consumes 200-400% CPU.

I’ll continue to monitor and see if there are any improvements, but I’ll try to isolate mongo on its own instance as well. Are there any specific settings that have to be set to run mongo on a separate server, or will it be enough to run mongo and mongo-init-replica on their own server in docker-compose and point the current rocketchat service to that server?

One thing you can also do is take a look at the logs generated by mongo. It’s likely some queries are slow.
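
If it helps, mongod can also be told to flag slow operations explicitly via flags on the command in the db compose file (the 100 ms threshold is just an example); they will then show up in docker-compose logs mongo:

services:
  mongo:
    image: mongo:3.6
    # --slowms sets the slow-operation threshold in milliseconds;
    # --profile 1 also records slow operations in the system.profile collection
    command: mongod --replSet rs0 --oplogSize 128 --profile 1 --slowms 100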

Also, are you using GridFS and uploading a lot of files? If so, this can add a lot of stress on mongo, as those files are stored inside of mongo.

Yes, I’m using GridFS; there are about 80 GB of files uploaded and approximately 1000 files are uploaded per day, mainly images.

I have set up an isolated server for mongo, but uploads aren’t working with GridFS now; it keeps looking for the images on the app server. I’ve changed to FileSystem, which uploads new files, but the progress meter gets stuck at 0% despite the file being uploaded. And none of the already uploaded files are available, since it still uses the app server URL for those as well.

It’s also very unpredictable right now; a lot of the time it won’t even load a channel, it just gets stuck at loading.

Here’s the docker-compose for the db server:

version: '2'

services:
  mongo:
    image: mongo:3.6
    restart: unless-stopped
    volumes:
     - /mnt/volume_lon1_01/data/runtime/db:/data/db
     - /mnt/volume_lon1_01/data/dump:/dump
    command: mongod --smallfiles --oplogSize 128 --replSet rs0 --storageEngine=mmapv1
    labels:
      - "traefik.enable=true"
    ports:
     - 27017:27017

  # this container's job is just to run the command to initialize the replica set.
  # it will run the command and remove itself (it will not stay running)
  mongo-init-replica:
    image: mongo:3.6
    command: 'bash -c "for i in `seq 1 30`; do mongo mongo/rocketchat --eval \"rs.initiate({ _id: ''rs0'', members: [ { _id: 0, host: ''localhost:27017'' } ]})\" && s=$$? && break || s=$$?; echo \"Tried $$i times. Waiting 5 secs...\"; sleep 5; done; (exit $$s)"'
    depends_on:
      - mongo

And here’s the docker-compose for the app server:

version: '2'

services:
  rocketchat:
    image: rocketchat/rocket.chat:3.0.12
    logging:
        driver: "json-file"
        options:
            max-file: "10"
            max-size: "50m"
    command: bash -c 'for i in `seq 1 30`; do node main.js && s=$$? && break || s=$$?; echo "Tried $$i times. Waiting 5 secs..."; sleep 5; done; (exit $$s)'
    restart: unless-stopped
    volumes:
      - ./uploads:/app/uploads
    environment:
      - PORT=3000
      - ROOT_URL=https://chat.domain.com
      - MONGO_URL=mongodb://64.217.43.48:27017/rocketchat
      - MONGO_OPLOG_URL=mongodb://64.217.43.48:27017/local
      - MAIL_URL=smtp://smtp.email
#       - HTTP_PROXY=http://proxy.domain.com
#       - HTTPS_PROXY=http://proxy.domain.com
    ports:
      - 3000:3000
    labels:
      - "traefik.backend=rocketchat"
      - "traefik.frontend.rule=Host: your.domain.tld"

Usually if it’s getting stuck uploading it can’t write to the folder.
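
Roughly speaking, the bind mount and the FileSystem upload path configured in the admin area have to line up, and the host directory must be writable by the user the container runs as. A minimal sketch, assuming the upload path is set to /app/uploads:

services:
  rocketchat:
    image: rocketchat/rocket.chat:3.0.12
    volumes:
      # host path : container path; the FileSystem upload path set in the admin
      # area has to match the container-side path (/app/uploads here), and the
      # host directory must be writable by the container's user
      - ./uploads:/app/uploads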

I’d recommend moving files to an object store. Minio is a nice easy self hosted one.

There is a community built tool that might be of use here: https://github.com/arminfelder/gridfsmigrate
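
As a rough sketch, Minio can live in the same compose file as Rocket.Chat; the credentials and data path below are placeholders, and the endpoint, bucket and keys are then entered under the file upload settings in the admin area:

services:
  minio:
    image: minio/minio
    command: server /data
    restart: unless-stopped
    volumes:
      - ./minio-data:/data
    environment:
      # placeholder credentials - change them before exposing the service
      - MINIO_ACCESS_KEY=rocketchat
      - MINIO_SECRET_KEY=change-me
    ports:
      - 9000:9000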

Also depending on how many users connect… might be time to scale horizontally and add another rocket.chat instance
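
Roughly what a second instance might look like in the app compose file, with nginx then balancing between ports 3000 and 3001; the db host and INSTANCE_IP values below are placeholders, and if I remember right INSTANCE_IP is what lets multiple instances find each other:

services:
  rocketchat2:
    image: rocketchat/rocket.chat:3.0.12
    restart: unless-stopped
    environment:
      - PORT=3001
      - ROOT_URL=https://chat.domain.com
      # placeholder address of the separate mongo server
      - MONGO_URL=mongodb://DB_HOST_IP:27017/rocketchat
      - MONGO_OPLOG_URL=mongodb://DB_HOST_IP:27017/local
      # lets the instances register themselves so events reach users on either one
      - INSTANCE_IP=APP_HOST_PRIVATE_IP
    ports:
      - 3001:3001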

New issues are just popping up each day with this software. When I activate FileSystem uploads it doesn’t store them in the correct folder, but in a path such as /var/lib/docker/overlay2/68a9f50e946d6f986b0cf19662a036eb1b1be74f196aa99fbb3aaafc25c96c6c/merged/app/bundle/programs/server/uploads. And as soon as I restart the container it breaks and all images return 404, even the avatars, probably because the container id changes on restart. Even if I run all containers as root and set the path for uploads to /uploads, it won’t save anything to that folder, only to the folder relative to Docker.

Usually about 70-100 users are connecting simultaneously, so you’d think that 6 vCPUs and 16 GB memory would be enough, but more users are leaving each day since it’s so unstable.