Socket hangs after scaling down of EC2 autoscaling instances

srekcah.rai · March 18, 2020, 10:08am

Description:

We run Rocket.Chat on AWS EC2 instances with autoscaling enabled. When autoscaling scales down and terminates an instance, we see the following errors in the logs of another EC2 instance (app2).

17:55:32.874 - app2 - - - Exception in callback of async function: { Error: socket hang up
17:55:32.874 - app2 - - - at createHangUpError (_http_client.js:331:15)
17:55:32.874 - app2 - - - at Socket.socketCloseListener (_http_client.js:363:23)
17:55:32.874 - app2 - - - at emitOne (events.js:121:20)
17:55:32.874 - app2 - - - at Socket.emit (events.js:211:7)
17:55:32.874 - app2 - - - at TCP._handle.close [as _onclose] (net.js:557:12) code: 'ECONNRESET' }
17:55:32.874 - app2 - - - Exception in callback of async function: { Error: connect ECONNREFUSED xxx.xxx.xxx.51:3000
17:55:32.874 - app2 - - - at Object._errnoException (util.js:992:11)
17:55:32.874 - app2 - - - at _exceptionWithHostPort (util.js:1014:20)
17:55:32.875 - app2 - - - at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1186:14)
17:55:32.875 - app2 - - - code: 'ECONNREFUSED',
17:55:32.875 - app2 - - - errno: 'ECONNREFUSED',
17:55:32.875 - app2 - - - syscall: 'connect',
17:55:33.015 - app2 - - - address: 'xxx.xxx.xxx.51',
17:55:33.015 - app2 - - - port: 3000 }

Here’s a snapshot of the data received from the Rocket.Chat Prometheus server.

In the graph above, app829 has already been terminated, yet the Prometheus server still reports its DDP session count for a few more minutes.

After the instance (app2) logs the error, the Rocket.Chat application on it restarts, which can be seen in the graph above. All users are disconnected from app2.

The error occurs in 1 out of 5 scale-down events and is more likely with a higher number of application instances and under heavy traffic (100% CPU utilization).

My assumptions about this issue are:

Terminated instances are not removed from the database records quickly enough. These records are used to maintain DDP connections between the app instances.
Exceptions raised by socket timeouts are not handled.
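
To illustrate the first assumption, here is a minimal sketch of filtering out peers whose record has gone stale before connecting to them. The record shape (`id`, `host`, `port`, `lastHeartbeat`), the threshold, and the `liveInstances` helper are all hypothetical and are not taken from Rocket.Chat's code or schema.

```javascript
// Hypothetical record shape: { id, host, port, lastHeartbeat }.
// This is NOT Rocket.Chat's actual schema; it only illustrates the idea.
const STALE_AFTER_MS = 10000; // assumed threshold, tune to the heartbeat interval

// Return only the peers whose last heartbeat is recent enough to be trusted.
function liveInstances(records, now = Date.now(), staleAfterMs = STALE_AFTER_MS) {
  return records.filter((r) => now - r.lastHeartbeat <= staleAfterMs);
}

// Example with a fixed clock: app829 stopped sending heartbeats a minute ago.
const now = 1000000;
const records = [
  { id: 'app2', host: '192.0.2.2', port: 3000, lastHeartbeat: now - 2000 },
  { id: 'app829', host: '192.0.2.51', port: 3000, lastHeartbeat: now - 60000 },
];
console.log(liveInstances(records, now).map((r) => r.id)); // [ 'app2' ]
```

With a check like this, a peer that was terminated without cleaning up its record would stop being contacted once its heartbeat ages past the threshold, instead of staying in the list until the record is removed.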

Steps to reproduce:

Launch multiple instances of the Rocket.Chat application. I don't think EC2 is required to reproduce this, but a larger number of instances under high CPU utilization makes the error more likely.
Enable the Prometheus server and record its metrics (optional).
Scale down the number of application instances.

Expected behavior:

Socket timeouts caused by a disconnected Rocket.Chat instance should be handled gracefully.
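
As a rough sketch of what such handling could look like in plain Node.js: an outbound request to a peer gets explicit `timeout` and `error` listeners, so a vanished peer produces a result value instead of an uncaught exception. `callPeer` and `isPeerGone` are hypothetical helpers written for this example and are not part of Rocket.Chat; the error codes are the ones visible in the logs above, plus an assumed `ETIMEDOUT` for the timeout case.

```javascript
const http = require('http');

// Error codes suggesting the peer itself is gone (ETIMEDOUT is set by this sketch).
const PEER_GONE_CODES = new Set(['ECONNRESET', 'ECONNREFUSED', 'ETIMEDOUT']);

function isPeerGone(err) {
  return Boolean(err && PEER_GONE_CODES.has(err.code));
}

// Hypothetical helper: sends a GET to a peer and always resolves, never throws.
function callPeer(host, port, path, timeoutMs = 5000) {
  return new Promise((resolve) => {
    const fail = (err) => resolve({ ok: false, peerGone: isPeerGone(err), code: err.code });
    const req = http.get({ host, port, path, timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body so the socket is released
      res.on('end', () => resolve({ ok: true, status: res.statusCode }));
      res.on('error', fail); // connection dropped after the headers arrived
    });
    // 'timeout' only signals an idle socket; the request must be destroyed explicitly.
    req.on('timeout', () => {
      const err = new Error('peer request timed out');
      err.code = 'ETIMEDOUT';
      req.destroy(err); // triggers the 'error' listener below
    });
    // Without an 'error' listener, the event is thrown as an uncaught exception.
    req.on('error', fail);
  });
}

console.log(isPeerGone({ code: 'ECONNREFUSED' })); // true
```

A caller could then check `peerGone` on the result of `callPeer(host, 3000, '/')` and drop that peer from its list rather than letting the process restart.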

Actual behavior:

Neighbouring instances receive the error and the application restarts.

Server Setup Information:

Version of Rocket.Chat Server: 1.3.3 (Occurred in 0.71.2 before as well)
Operating System: Ubuntu 16.04.4 LTS
Deployment Method: meteor build
Number of Running Instances: 2 to 5
DB Replicaset Oplog:
NodeJS Version: v8.11.2 (Built on v8.11.3)
MongoDB Version: (3.4.1)

Client Setup Information

Desktop App or Browser Version:
Operating System:

Additional context

Relevant logs:

principemestizo · August 21, 2021, 1:33am

Have you solved this already?
