When did it occur?
Since Sunday, Dec 1st 14:00 GMT - still ongoing
What is the problem?
Our main Matrix server peaks at a higher load than usual and becomes unreachable for several minutes at a time. Restarting the service, and sometimes the entire machine, is necessary and must be done manually. Often it recovers after a short period, but the problem keeps recurring. As a result, messages are not transmitted and the Acter App has a degraded user experience.
What's being done about it?
The root cause seems to be a known problem with our hosting provider and the server we are using. We are running the Matrix server on a VPS, a virtual server, and that comes with certain limitations. Among them are caps on the number of open files and on the number of processes running at the same time. Even though we have a rather beefy setup for the server itself (the average load usually barely reaches 0.7), those limits cause processes to fail all the time. This got even worse when we tried switching to background workers, as this causes more processes to run at the same time.
We have monitored the situation for a while and seen this happen repeatedly, but never as continuously as now. The plan was to switch to another hosting provider, one that also uses green energy sources, to combat the problem. This process has been sped up and we hope that the move will fix this problem for good.
Unfortunately, until that move has happened, there is little we can do, other than switching off all auxiliary processes - which we have done - to alleviate the strain on the system itself.
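For readers who want to see what such limits look like from inside a machine, here is a minimal Python sketch that reads the open-file and process limits of the current user. It is an illustration only, not the tooling used on the Acter servers: the function names `describe_limits` and `limit_headroom` are made up for this example, and it assumes a Unix-like system where the limits are visible through `getrlimit`. Limits enforced by a VPS host at the container level may not show up here at all.

```python
import resource


def describe_limits():
    """Read the soft and hard limits for open files and processes (Unix only)."""
    limits = {}
    for name, key in (
        ("open_files", resource.RLIMIT_NOFILE),
        ("processes", resource.RLIMIT_NPROC),
    ):
        soft, hard = resource.getrlimit(key)
        limits[name] = {"soft": soft, "hard": hard}
    return limits


def limit_headroom(soft_limit, in_use):
    """Fraction of a limit that is still free: 1.0 means unused, 0.0 means exhausted."""
    if soft_limit == resource.RLIM_INFINITY:
        return 1.0  # no limit configured, so there is always headroom
    if soft_limit <= 0:
        return 0.0
    return max(0.0, 1.0 - in_use / soft_limit)


if __name__ == "__main__":
    for name, values in describe_limits().items():
        print(name, values)
```

A monitoring job could compare the number of files or processes currently in use against these values and raise an alert when the headroom gets small, well before processes start to fail.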
System Status
Acter App & Server
Push Notifications
DNS Discovery
Warrant Canary
As of September 30th 2024, Acter warrants that:
- Acter has never turned over any encryption or authentication keys or our customers' encryption or authentication keys to anyone.
- Acter has never installed any law enforcement software or equipment anywhere on our network.
- Acter has never provided any law enforcement organization a feed of our customers' content transiting our network.
- Acter has never modified customer content at the request of law enforcement or another third party.
- Acter has never weakened, compromised, or subverted any of its encryption at the request of law enforcement or another third party.
History
Record of infrastructure incidents
When & What was affected
Over several weeks, email sending was interrupted in the Acter App for all Matrix accounts on acter.global. In particular, verifying email addresses in the app or using them for password recovery didn't work: the emails just never reached their inboxes.
What happened
We first thought this was a bug in the App and thus searched there for quite a while before we realized the cause was in the running infrastructure. In an attempt to break free from unnecessary servers, we have recently consolidated the services we used to send emails through. We had already disabled mailgun and then also closed our sendgrid account in order to switch (temporarily) to Amazon SES, which worked fine in testing. What we didn't realise was that it was running in a sandbox mode that only allowed us to send to acter.global email addresses. Thus, our tests worked, but emails to external providers were not sent.
What was compromised/lost
Nothing was compromised or lost. Just emails were not delivered.
Measures taken to prevent this in the future
We have finished the switch and are now sending emails through sweego, a French company with Data Centers in France. Email sending has been restored successfully.
Additionally, we have been planning to add a more extensive monitoring system to our infrastructure and will definitely add the ability to get notified when email sending is no longer working as expected.
When & What was affected?
Sept 28-30th, Push Notifications to iOS devices (iPhone & iPad).
What happened
We missed that push notification certificates can and do expire. Unfortunately, the expiration date fell on a Saturday, so we were only informed about this in a report on Monday. We immediately issued new certificates and restarted the service.
What was compromised/lost?
Nothing was compromised or lost.
Measures taken to prevent this in the future?
Documentation has been added to the internal processes clarifying the procedure and steps to follow to rectify this in the future. Furthermore, an overhaul and inventory of all certificates that can expire on us has been put on the to-do list.
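Since expiring certificates were the cause here, a small check like the following could flag them ahead of time. This is a minimal Python sketch, not the process Acter actually follows: the function names `days_until_expiry` and `needs_renewal` and the 30-day warning threshold are assumptions made for illustration. It expects the expiry timestamp in the text form that OpenSSL prints, for example from `openssl x509 -enddate -noout`.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp (negative once expired).

    not_after uses the OpenSSL text form, e.g. "Jan  5 09:34:43 2018 GMT".
    now is a Unix timestamp; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within warn_days (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days
```

Run daily over the list of known certificates, a check like this would raise a warning weeks before an expiry date, regardless of whether that date falls on a weekend.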
When & What was affected?
From 3pm to 6pm GMT, the server matrix.acter.global, and with it the Acter App, was unavailable.
What happened
Due to a problem with the Apple Push Notifications, we had to update the server infrastructure. While doing so, we noticed that some database upgrades were due. When we ran them on the staging instance, everything worked just fine and came back up quickly. So we decided to also run them on the main server, underestimating how much larger that upgrade would be. We took the server into maintenance mode and started the update. After about an hour of upgrading (without much visible progress), we decided to cancel it and postpone it to a better time. Unfortunately, restoring the previous state took almost another hour, and thus we experienced prolonged downtime of the main server until everything was restored.
What was compromised/lost?
Nothing was compromised or lost. All data could be restored without problems.
Measures taken to prevent this in the future?
For the time being, we will only do database upgrades in pre-scheduled time frames with low activity and no longer do them "on the go".