Socket hang up error after EC2 autoscaling scales down instances

Description:

We run Rocket.Chat on AWS EC2 instances with autoscaling enabled. When autoscaling scales down and terminates an instance, the following errors appear in the logs of another EC2 instance (app2):

17:55:32.874 - app2 - - - Exception in callback of async function: { Error: socket hang up
17:55:32.874 - app2 - - - at createHangUpError (_http_client.js:331:15)
17:55:32.874 - app2 - - - at Socket.socketCloseListener (_http_client.js:363:23)
17:55:32.874 - app2 - - - at emitOne (events.js:121:20)
17:55:32.874 - app2 - - - at Socket.emit (events.js:211:7)
17:55:32.874 - app2 - - - at TCP._handle.close [as _onclose] (net.js:557:12) code: 'ECONNRESET' }
17:55:32.874 - app2 - - - Exception in callback of async function: { Error: connect ECONNREFUSED xxx.xxx.xxx.51:3000
17:55:32.874 - app2 - - - at Object._errnoException (util.js:992:11)
17:55:32.874 - app2 - - - at _exceptionWithHostPort (util.js:1014:20)
17:55:32.875 - app2 - - - at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1186:14)
17:55:32.875 - app2 - - - code: 'ECONNREFUSED',
17:55:32.875 - app2 - - - errno: 'ECONNREFUSED',
17:55:32.875 - app2 - - - syscall: 'connect',
17:55:33.015 - app2 - - - address: 'xxx.xxx.xxx.51',
17:55:33.015 - app2 - - - port: 3000 }

Here’s a snapshot of the data received from the Rocket.Chat Prometheus server.

In the above graph, app829 has already been terminated, but its DDP session count is still reported by the Prometheus server for a few more minutes.

After app2 logs the error, its Rocket.Chat application restarts, which can be seen in the above graph. All users are disconnected from app2.

The error occurs in 1 out of 5 scale-down events and is more likely with a higher number of application instances and under heavy traffic (100% CPU utilization).

My assumptions about the cause are:

  1. Terminated instances are not removed from the database records quickly enough. These records are used to maintain DDP connections between the app instances.
  2. Exceptions are not handled when sockets time out.
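
To illustrate the first assumption, here is a minimal sketch of how records with a stale heartbeat could be skipped before a connection to a peer is attempted. The record shape (`host`, `port`, `lastSeen`) and the 30-second threshold are hypothetical and are not taken from Rocket.Chat's actual schema or settings.

```javascript
// Assumed threshold for illustration only; not a Rocket.Chat setting.
const STALE_AFTER_MS = 30 * 1000;

// Hypothetical record shape: { host, port, lastSeen }, where lastSeen is a
// millisecond timestamp written periodically by the instance itself.
function liveInstances(records, now = Date.now()) {
  // Keep only instances whose heartbeat is recent enough to be trusted.
  return records.filter((record) => now - record.lastSeen <= STALE_AFTER_MS);
}

// Usage: connect only to peers that still look alive.
const peers = liveInstances([
  { host: '10.0.0.1', port: 3000, lastSeen: Date.now() },
  { host: '10.0.0.2', port: 3000, lastSeen: Date.now() - 5 * 60 * 1000 },
]);
console.log(peers.map((peer) => peer.host));
```

A shorter threshold removes terminated instances sooner, at the cost of occasionally skipping a live instance that is slow to update its record under high CPU load.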

Steps to reproduce:

  1. Launch multiple instances of the Rocket.Chat application. EC2 should not be required to reproduce this, but a larger number of instances under high CPU utilization makes it more likely.
  2. Enable the Prometheus server and log its metrics (optional).
  3. Scale down the number of application instances.

Expected behavior:

Socket timeouts caused by a disconnected Rocket.Chat instance are handled.
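
As a rough sketch of what such handling could look like in Node.js, the helper below reports a failed peer instead of letting the error propagate. `probePeer` and `isPeerGone` are hypothetical names for illustration and are not Rocket.Chat functions; the error codes are the ones that appear in the log excerpt in the description.

```javascript
const http = require('http');

// Error codes from the reported logs that indicate the peer went away.
const PEER_GONE_CODES = new Set(['ECONNRESET', 'ECONNREFUSED']);

function isPeerGone(err) {
  return Boolean(err) && PEER_GONE_CODES.has(err.code);
}

// Hypothetical helper: probe a peer and resolve with a result object
// rather than throwing when the connection fails.
function probePeer(host, port, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const req = http.get({ host, port, path: '/', timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body so the socket is released
      resolve({ ok: true, status: res.statusCode });
    });
    // 'timeout' only signals an idle socket; destroy the request so that
    // the 'error' handler below runs.
    req.on('timeout', () => req.destroy(new Error('peer timed out')));
    // Without an 'error' listener, the emitted error is thrown as an
    // uncaught exception.
    req.on('error', (err) => {
      resolve({ ok: false, peerGone: isPeerGone(err), code: err.code });
    });
  });
}
```

A caller could then drop the peer from its connection list when `peerGone` is true, rather than restarting the whole application.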

Actual behavior:

Neighbouring instances receive the error and the application restarts.

Server Setup Information:

  • Version of Rocket.Chat Server: 1.3.3 (also occurred in 0.71.2)
  • Operating System: Ubuntu 16.04.4 LTS
  • Deployment Method: meteor build
  • Number of Running Instances: 2 to 5
  • DB Replicaset Oplog:
  • NodeJS Version: v8.11.2 (Built on v8.11.3)
  • MongoDB Version: (3.4.1)

Client Setup Information

  • Desktop App or Browser Version:
  • Operating System:

Additional context

Relevant logs: