Connectivity issues with main Matrix Server
— Ben
When did it occur?
Since Sunday, Dec 1st, 14:00 GMT – still ongoing
What is the problem?
Our main Matrix server peaks at a higher load than usual and becomes unreachable for several minutes at a time. Getting it back requires manually restarting the service, and sometimes the entire machine. It often recovers after a short period, but the problem keeps happening. As a result, messages are not being transmitted and the Acter App has a degraded user experience.
What's being done about it?
The root cause seems to be a known problem with our hosting provider and the type of server we are using. We run the Matrix server on a VPS (a virtual private server), which comes with certain limitations. Among them are caps on the number of open files and on the number of processes running at the same time. Even though the server itself is rather beefy (average load usually barely reaches 0.7), those limits cause processes to fail all the time. This got even worse when we tried switching to background workers, since that means more processes running at the same time.
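For readers who want to see what such limits look like on their own machine, here is a minimal sketch using Python's standard `resource` module (Unix only). It prints the per-process limits for open files and for processes. The function names `describe_limit` and `current_limits` are made up for this illustration, and limits imposed by a hosting provider at the virtualization layer may not show up here at all, so treat this as a rough look and not as a description of our actual setup.

```python
import resource


def describe_limit(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return the soft and hard limits for open files and processes."""
    limits = {}
    for name, key in (
        ("open_files", resource.RLIMIT_NOFILE),  # max open file descriptors
        ("processes", resource.RLIMIT_NPROC),    # max processes for this user
    ):
        soft, hard = resource.getrlimit(key)
        limits[name] = {
            "soft": describe_limit(soft),
            "hard": describe_limit(hard),
        }
    return limits


if __name__ == "__main__":
    for name, values in current_limits().items():
        print(f"{name}: soft={values['soft']} hard={values['hard']}")
```

A process that hits the soft limit for open files typically starts failing with "too many open files" errors, even when CPU load is low.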
We have monitored the situation for a while and have seen this happen repeatedly, but never as continuously as now. The plan was already to move to another hosting provider, one that also uses green energy sources, to combat the problem. This process has been sped up, and we hope that the move will fix the problem for good.
Unfortunately, until that move has happened, there is little we can do other than switching off all auxiliary processes, which we have done, to alleviate the strain on the system.