Scaling challenges


#1

Server Setup Information:

The version of Rocket.Chat Server: 0.63.3
Operating System: Ubuntu 16.04
Deployment Method: PM2
Number of Running Instances: 14
DB Replicaset Oplog: Yes, 5 member one of those is an arbiter.
MongoDB Version: 3.2.X

RocketChat statistics:
Total Users 7116
Active Users 6174
Inactive Users 942
Online Users 342
Away Users 344
Offline Users 6430
Total Rooms 49149
Total Channels 44
Total Private Groups 1917
Total Direct Message Rooms 47188
Total Livechat Rooms 47188
Total Messages 2399787
Total Messages in Channels 5345
Total Messages in Private Groups 416826
Total Messages in Direct Messages 1975032

Total users 7000+ and moderate chat/file upload activities by 300-900 users brings rocketchat installation to the unusable state and many 502 are thrown out. From infrastructure, this is how the request flows

Users <—> AWS ALB(SSL Termination) <–> RC Nodes

Here’s the 5XX report from ALB

In logs we get,
[ Node processes ]
Warning: connect.session() MemoryStore is not
designed for a production environment, as it will leak
memory, and will not scale past a single process.
Wed, 17 Oct 2018 02:35:13 GMT connect deprecated multipart: use parser (multiparty, busboy, formidable) npm module instead at npm/node_modules/connect
/lib/middleware/bodyParser.js:56:20
Wed, 17 Oct 2018 02:35:13 GMT connect deprecated limit: Restrict request size at location of read at npm/node_modules/connect/lib/middleware/multipart
.js:86:15
You have triggered an unhandledRejection, you may have forgotten to catch a Promise rejection:
MongoError: ns not found
at Function.MongoError.create (/home/deploy/channels/releases/20180719040156/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules
/mongodb-core/lib/error.js:31:11)
at /home/deploy/channels/releases/20180719040156/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/pool.js:497:72
at authenticateStragglers (/home/deploy/channels/releases/20180719040156/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/pool.js:443:16)
at Connection.messageHandler (/home/deploy/channels/releases/20180719040156/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/pool.js:477:5)
at Socket. (/home/deploy/channels/releases/20180719040156/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:333:22)
at emitOne (events.js:116:13)
at Socket.emit (events.js:211:7)
at addChunk (_stream_readable.js:263:12)
at readableAddChunk (_stream_readable.js:250:11)
at Socket.Readable.push (_stream_readable.js:208:10)
at TCP.onread (net.js:597:20)

Error: socket hang up
at createHangUpError (_http_client.js:331:15)
at Socket.socketOnEnd (_http_client.js:423:23)
at emitNone (events.js:111:20)
at Socket.emit (events.js:208:7)
at endReadableNT (_stream_readable.js:1064:12)
at _combinedTickCallback (internal/process/next_tick.js:138:11)
at process._tickDomainCallback (internal/process/next_tick.js:218:9)

Error: socket hang up
at createHangUpError (_http_client.js:331:15)
at Socket.socketCloseListener (_http_client.js:363:23)
at emitOne (events.js:121:20)
at Socket.emit (events.js:211:7)
at TCP._handle.close [as _onclose] (net.js:557:12)


[Nginx]
2018-10-17T12:58:26.239590Z XXX-production-app10 nginx - - - 2018/10/17 12:58:26 [error] 15374#15374: *34349 recv() failed (104: Connection reset by peer) while proxying upgraded connection, client: 172.17.37.220, server: XXXXXX, request: “GET /sockjs/008/vcqc9hke/websocket HTTP/1.1”, upstream: “x “,
xxxx
<14>1 2018-10-17T12:46:26.302082Z xxxx-production-app5 nginx - - - 2018/10/17 12:46:26 [error] 26789#26789: *1447807 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.17.108.56, server: xxxxxxxxx ; /AmazonS3:Uploads/e9D2Z4NzGWdJavrfG?token=d97AB9993b&progress=0.6322451184687814 HTTP/1.1”, upstream: “AmazonS3:Uploads/e9D2Z4NzGWdJavrfG?token=d97AB9993b&progress=0.6322451184687814”, host: “xxxxx”, referrer: “

Hope some RocketChat expert could help?


#2

What does the load to your mongo server look like? How big are those nodes? That’s a lot of mongo errors like its struggling with the connection there.


#3

Thanks Aaron.

Except for arbiter of the replicaset, all other 4 instances are m5 large i.e. 2 CPU cores, 8 GB RAM. Here’s a newrelic system resource consumption graph


#4

With some investigative analysis on logs and turning on “error” channels on rocketchat administration, we got following many times around the time we get 5XX errors.

Error: socket hang up
at createHangUpError (_http_client.js:331:15)
at Socket.socketCloseListener (_http_client.js:363:23)
at emitOne (events.js:121:20)
at Socket.emit (events.js:211:7)
at TCP._handle.close [as _onclose] (net.js:557:12)

There were some of these as well.

Error: getaddrinfo EMFILE localhost:3000
at Object._errnoException (util.js:992:11)
at errnoException (dns.js:55:15)
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:92:26)

It looks like following is the scenario - nodejs process would go exception condition and exit. Supervisor brings it up again. During that time, servers are unresponsive hence 5XX errors. We also found just by switching from PM2 supervisor to runit has reduced 5XX errors visibly but those are still happening.

I looked up EMFILE error code and looks like it has to do something with socket connection, we increased the connection count from 128 to 1024. It is still emitting EMFILE errors.


#5

your session stores should be moved to either mongo or to redis and not kept in the memory one.


#6

Thanks @sandys

@aaron.ogle any thought on that suggestion?


#7

We upgraded the system to 67.0 but it did not solve our issues sadly.

We have hundreds of these popping up in debug channel we set up through the administrative panel

Error: getaddrinfo EMFILE localhost:3000
at Object._errnoException (util.js:992:11)
at errnoException (dns.js:55:15)
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:92:26)

There are also new kinds of error message we receive now.

Exception in handling DDP reset:

TypeError: Cannot read property 'call' of undefined
at _authorizeConnection (/home/deploy/channels/releases/20181029033207/bundle/programs/server/app/app.js:13090:32)
at authorizeConnection 

Oplog error

Error in oplog callback
TypeError: Cannot read property 'u' of undefined

Any help is much appreciated.


#8

Zooming onto “oplog callback errors” - this happens and as soon as this is logged, nodejs process exits. Supervisor brings it up - in the meantime, the users connected to this server receives 5XX errors.

   Error in oplog callback TypeError: Cannot read property 'u' of undefined
    at ModelsBaseDb.RocketChat.models.Subscriptions.on 
 (/home/deploy/channels/releases/20181029033207/bundle/programs/server/app/app.js:12817:58)
  at emitOne (events.js:116:13)
at ModelsBaseDb.emit (events.js:211:7)
at ModelsBaseDb.processOplogRecord (/home/deploy/channels/releases/20181029033207/bundle/programs/server/packages/rocketchat_lib.js:11699:12) at /home/deploy/channels/releases/20181029033207/bundle/programs/server/packages/mongo.js:1494:7
at runWithEnvironment (packages/meteor.js:1238:24)
at Object.callback (packages/meteor.js:1251:14)
at /home/deploy/channels/releases/20181029033207/bundle/programs/server/packages/ddp-server.js:2130:36
at Array.forEach (<anonymous>:null:null)
at Function._.each._.forEach (packages/underscore.js:139:11)
at DDPServer._Crossbar.fire (/home/deploy/channels/releases/20181029033207/bundle/programs/server/packages/ddp-server.js:2128:7)
at /home/deploy/channels/releases/20181029033207/bundle/programs/server/packages/mongo.js:1719:26
at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1186:12)
at packages/meteor.js:502:25
at runWithEnvironment (packages/meteor.js:1238:24)

#9

Some of these socket issues look like an issue with the inter-instance event matrix having issues.

With your different instances are you setting PORT and or INSTANCE_IP environment variables?

When starting it takes and inserts this into the instances collection in your database. Then other instances as they come up look for other instances in this collection when it sees one it tries to communicate with it.

Events like userX is typing… and many other ephemeral events are propagated this way.


#10

All the instances have PORT and INSTANCE_IP set actually.

As a last resort, we turned all instances traffic for mongo to primary and most of the above issues disappeared. This is a quick relief only in my mind as it means we are limited to the vertical scalability of one mongo instance and other replica-members are there just for DR purposes.

So, replication lag is at fault as per our observation but not sure how you guys are managing open.rocket.chat ?

In our case, 7000+ users use it daily and with up to 1000 users online. The difference I speculate between usage of open.rocket.chat users and ours is - in our case, our users use it consistently and continuously with high participation. eg. I see open.rocket.chat has 200,000+ users in #general but there is hardly many conversations going where as in our case, there is much more activity within large channels. Again our message size tends to be paragraph size while in open.rocket.chat, it mostly sentences.

Do you think because of our use case, replication is inherently slow and rocketchat is not designed for lagging replicamembers as it is not agnostic to eventual consistency. We see our replication lag sometimes up to 2 secs.

It would be great to understand / document how replica set is being set up and how monitoring of those being done?

I know 0.67 there is monitoring that can be integrated into prometheous. It would be great to have documentation on that setup as well. Can you share @aaron.ogle?