Affected
Major outage from 1:40 PM to 3:01 PM, Operational from 3:01 PM to 3:11 PM, Major outage from 3:11 PM to 1:23 AM
- Update
Some final details about this incident:
We found an insidious "lock" that was triggered when new data arrived and was analyzed by an alert while a user was scrolling through a time graph of the same target. In rare cases, this would cause a deadlock, which would then back up the data analysis server (we run many of these in parallel). That deadlock could (but didn't always) take several minutes to release, and while that was happening, further actions on that server would queue up behind it. Depending on how busy the server was, it could be unusable for many minutes, which would affect access to any sessions being processed by that server. Whoa, complicated!
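For readers curious how this class of bug arises, here is a minimal, generic Python sketch. It is not the product's code: the names `data_lock`, `graph_lock`, "alert" and "scroll" are hypothetical, and it assumes the deadlock was a classic lock-ordering conflict, which the description above suggests but does not confirm. Each thread takes one lock and then tries to take the other with a timeout, so the sketch reports the conflict instead of hanging.

```python
import threading


def opposite_order_attempt(timeout=0.2):
    """Two threads take the same two locks in opposite order.

    Each thread holds its first lock while trying to get the second one.
    A timeout on the second acquire keeps this demo from hanging forever.
    """
    data_lock = threading.Lock()   # hypothetical: guards newly arrived samples
    graph_lock = threading.Lock()  # hypothetical: guards the time-graph view
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            barrier.wait()  # both threads now hold their first lock
            got_second = second.acquire(timeout=timeout)
            if got_second:
                second.release()
            results[name] = got_second
            barrier.wait()  # keep holding `first` until both attempts finish

    threads = [
        threading.Thread(target=worker, args=("alert", data_lock, graph_lock)),
        threading.Thread(target=worker, args=("scroll", graph_lock, data_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def same_order_attempt():
    """Both threads take the locks in the same order, so neither can block the other forever."""
    data_lock = threading.Lock()
    graph_lock = threading.Lock()
    results = {}

    def worker(name):
        with data_lock:
            with graph_lock:
                results[name] = True

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("alert", "scroll")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print("opposite order:", opposite_order_attempt())
    print("same order:", same_order_attempt())
```

In the opposite-order case neither thread gets its second lock; without the timeout both would wait indefinitely, and any work queued behind them would back up. Acquiring locks in one consistent order is one common way to avoid this, though it is not necessarily the fix that was shipped here.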
This bug happened twice on Friday morning (May 3rd) and has not happened since: once we found a way to recognize it, we kept a close eye on it and remediated before it could create an issue.
We rolled out a fix for that bug yesterday, and we feel confident that it is squashed.
Thanks for your understanding and patience!
- Resolved
Our team has successfully identified the root cause of the issue. While a comprehensive fix is still in progress, we have implemented effective remediation steps that have stabilized the system. At this time, the issue should no longer affect your user experience. We are diligently working on a permanent solution to ensure this issue does not recur.
- Identified
It looks like there's another issue that we haven't yet identified, since the problem came back on a different server. We're remediating and investigating.
- Resolved
One of the back end "computation" servers was overloaded in an unexpected path. This caused other servers to misreport their statuses, too, so the problem looked more widespread than it really was. Because of this, we restarted more than we needed to.
Everything is fully operational - all reports, views, summaries and quality monitor views should be accurate and up to date. No data was lost. We are investigating ways to reduce the likelihood of a repeat event.
Thanks for your patience and understanding!
- Identified
Found an issue with one of the back end data collection servers. Restarting it. Most targets are working, and more coming back online.
- Investigating
There's an issue with viewing summaries and agent target lists. We're investigating.