Occasional connectivity issues with main Matrix Server

Ben

When did it occur?
Dec 1st – Dec 18th
What is the problem?
Our main Matrix server experiences load peaks that are higher than usual and becomes unreachable for several minutes at a time. Restoring it requires restarting the service, and sometimes the entire machine, which has to be done manually. It often recovers after a short period, but the problem keeps recurring. As a result, messages are not transmitted and the user experience in the Acter App is degraded.

The root cause seems to be a known problem with our hosting provider and the server we are using. We run the Matrix server on a VPS (a virtual private server), which comes with certain limitations. Among them are caps on the number of open files and on the number of processes running at the same time. Even though the server itself is generously sized (the average load rarely exceeds 0.7), those limits cause processes to fail all the time. This got even worse when we tried switching to background workers, because that causes more processes to run at the same time.
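
To illustrate the kind of limits involved, here is a minimal sketch in Python that reads the per-process limits for open files and processes and estimates how much headroom is left. It is not the tooling we used: the function names `limit_headroom` and `current_limits` are made up for this example, and limits imposed by a hosting provider on the whole virtual machine may not be visible through this interface at all.

```python
import resource


def limit_headroom(soft_limit, in_use):
    """Return the fraction of a limit still available (1.0 = unused, 0.0 = exhausted)."""
    # An unlimited resource always has full headroom.
    if soft_limit == resource.RLIM_INFINITY:
        return 1.0
    if soft_limit <= 0:
        return 0.0
    return max(0.0, 1.0 - in_use / soft_limit)


def current_limits():
    """Return the (soft, hard) limits that apply to the current process."""
    limits = {"open_files": resource.getrlimit(resource.RLIMIT_NOFILE)}
    # RLIMIT_NPROC is not defined on every platform, so look it up defensively.
    nproc = getattr(resource, "RLIMIT_NPROC", None)
    if nproc is not None:
        limits["processes"] = resource.getrlimit(nproc)
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

More worker processes means more processes and more open files counted against the same caps, which is why switching to background workers made the failures more frequent.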
What has been done about it?
We have monitored the situation for a while and have seen this happen repeatedly, but never as persistently as at the beginning of December. After a few unsuccessful attempts, we have finally been able to move the Matrix server to a new hosting provider that doesn't have the same limitation. We are also increasing our monitoring to detect such failures more quickly in the future.
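
As an illustration of what such monitoring could look like, the sketch below polls the unauthenticated `/_matrix/client/versions` endpoint of the Matrix client-server API and flags an outage after several consecutive failures. It is an assumption-laden example, not our actual setup: the homeserver URL `https://matrix.example.org`, the function names `probe` and `should_alert`, and the threshold of three failures are all placeholders.

```python
import time
import urllib.error
import urllib.request


def probe(base_url, timeout=5.0):
    """Send one request to the versions endpoint and return (ok, seconds)."""
    url = base_url.rstrip("/") + "/_matrix/client/versions"
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, DNS failures, and timeouts.
        ok = False
    return ok, time.monotonic() - started


def should_alert(results, threshold=3):
    """Return True when the last `threshold` probes all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


if __name__ == "__main__":
    # Hypothetical homeserver URL; replace it with a real one before running.
    history = []
    for _ in range(3):
        ok, seconds = probe("https://matrix.example.org")
        history.append(ok)
        print(f"ok={ok} after {seconds:.2f}s")
        time.sleep(1)
    print("alert" if should_alert(history) else "no alert")
```

Requiring several failures in a row before alerting avoids paging someone for a single slow response, at the cost of detecting a real outage a little later.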