Sudden 100% CPU usage causing non-responsive chat until service restart

#22

More information, if it's helpful.

I have changed the storage engine back to mmapv1 as I saw some posts saying it is preferred.

Still having the same issues. Even with just a single user logged in and slackbridge disabled, it still goes to 100% and eventually fails and the containers restart.

#23

I think there must be bad data stored in the users table that's triggering this issue. Is there a cleanup routine or a data validation I can run to clear bad data? As far as I can tell, it's only that table that's causing issues.

#24

Wow, that’s some serious wait time on a query. Can you grab the user object it’s delaying on? I know it might be a pain to redact. Is it the same Id each time or is it random?

#25

Well, I think I have found the issue, so here’s something we can script as a fix if someone else has it.

In the user documents, in the service column, there was 28MB of data (text version) for a single user. The login sessions (all without dates) were huge. I deleted these for 2 users (and am cleaning up others) and it's so much faster now.

I presume there's a cleanup routine running to delete old sessions, but without dates they will be ignored.
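For anyone checking their own instance, here is a rough Python sketch of how oversized user documents might be spotted. It only measures the JSON text size of each document, which is an approximation of the "text version" size mentioned above. The function names `doc_text_size` and `largest_docs` and the 1MB default threshold are made up for illustration.

```python
import json


def doc_text_size(doc):
    """Approximate size in bytes of a document dumped as JSON text."""
    # default=str turns values json cannot encode (dates, ObjectIds) into text.
    return len(json.dumps(doc, default=str).encode("utf-8"))


def largest_docs(docs, limit_bytes=1_000_000):
    """Return (_id, size) pairs for documents above limit_bytes, largest first."""
    sized = [(d.get("_id"), doc_text_size(d)) for d in docs]
    big = [pair for pair in sized if pair[1] > limit_bytes]
    return sorted(big, key=lambda pair: pair[1], reverse=True)
```

With pymongo this could be fed from something like `largest_docs(db.users.find())`, where `db` is a hypothetical handle to the chat database. Any iterable of dicts works, so it can be tried on an exported dump first.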

I am not a mongo person but will look at writing a script to delete all sessions where there is no date.
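A minimal sketch of what such a script might look like, in Python. It assumes the tokens live in an array at `services.resume.loginTokens` and that a valid token carries a `when` attribute; check both against your own data before running anything. The helper names are hypothetical, and `users` is expected to behave like a pymongo collection (only `find` and `update_one` are used).

```python
# Assumed location of the login tokens inside a user document.
TOKEN_PATH = "services.resume.loginTokens"


def split_tokens(user_doc):
    """Return (tokens that have a 'when' date, number of tokens without one)."""
    tokens = user_doc.get("services", {}).get("resume", {}).get("loginTokens", [])
    dated = [t for t in tokens if "when" in t]
    return dated, len(tokens) - len(dated)


def prune_undated_tokens(users):
    """Drop login tokens lacking 'when' and return how many were removed."""
    # Match only users holding at least one token without a 'when' field.
    query = {TOKEN_PATH: {"$elemMatch": {"when": {"$exists": False}}}}
    removed = 0
    for doc in users.find(query, {TOKEN_PATH: 1}):
        dated, undated = split_tokens(doc)
        if undated:
            # Rewrite the array so it keeps only the dated tokens.
            users.update_one({"_id": doc["_id"]}, {"$set": {TOKEN_PATH: dated}})
            removed += undated
    return removed
```

Because the array is read and then written back, a login that happens in between could be lost, so this is best run against a backup first and with the chat service stopped.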

I am also assuming that this has come from an old version and there probably should be a published routine or something to clean some of this up.

I think the logs I had were just pointing at the table causing the issues and locks, not the actual row so the logs weren’t entirely helpful.

Gotta say that I now know far more about MongoDB than I thought I needed originally :slight_smile:

#26

Any thoughts about this issue?

I have my instance running with 30 users and it is so annoying because I have to restart the server manually.

#27

Hi,
I also have this problem. Has anybody found a solution? I have to restart the server manually every couple of days.

#28

Do all of you have REST API or BOT type of traffic (running periodically) other than the regular human users?

#29

So were they under loginTokens or resumeTokens?

I’m guessing that significant REST API traffic might lead to this, but I’m not sure why they would be getting inserted without an expiry date. What version are you running?

#30

They were login tokens. We don’t do much REST API traffic but do have some. We tend to run almost the latest version, updating when we remember to or when we are looking for a bug fix. We have been using the same system for a long time, however, so it may have been old data.
All the current entries have a when attribute and we haven’t had an issue since.