Performance Issue

Updates

Update
May 07, 2024 at 4:22 PM UTC
Some final details about this incident:
We found an insidious lock interaction, triggered when new data arrived and was analyzed by an alert while a user was scrolling through a time graph of the same target. In rare cases this caused a deadlock, which backed up that data analysis server (we run many of them in parallel). The deadlock could (but didn't always) take several minutes to release, and in the meantime further actions on that server queued up behind it. Depending on how busy the server was, it could be unusable for many minutes, which affected access to any sessions being processed by that server. Whoa, complicated!
This bug happened twice on Friday morning (May 3rd) and has not happened since. Once we found a way to recognize it, we kept a close eye on it and remediated before it could create an issue.
We rolled out a fix for that bug yesterday, and we're confident that this particular bug is squashed.
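For readers curious how this class of bug arises in general: a deadlock like this typically needs two code paths that each hold one lock while waiting for the other. The sketch below is purely illustrative and is not our actual code or our actual fix. The lock names and functions (target_data_lock, graph_view_lock, analyze_new_data, scroll_time_graph) are hypothetical. It shows one common remedy, which is to make every path take the locks in the same fixed order.

```python
import threading

# Hypothetical locks; the real server's internals are not described here.
target_data_lock = threading.Lock()  # guards a target's stored samples
graph_view_lock = threading.Lock()   # guards a user's time-graph session

# Every code path acquires locks in this one order, so no two threads
# can each hold one lock while waiting on the other.
LOCK_ORDER = (target_data_lock, graph_view_lock)


def with_both_locks(action):
    """Acquire both locks in the fixed order, run action, release in reverse."""
    for lock in LOCK_ORDER:
        lock.acquire()
    try:
        return action()
    finally:
        for lock in reversed(LOCK_ORDER):
            lock.release()


def analyze_new_data(results):
    # Stand-in for an alert analyzing newly arrived data.
    with_both_locks(lambda: results.append("alert analyzed"))


def scroll_time_graph(results):
    # Stand-in for a user scrolling a time graph of the same target.
    with_both_locks(lambda: results.append("graph scrolled"))


def run_concurrently(rounds=200):
    """Run both paths in parallel; return the results and whether both finished."""
    results = []

    def worker(fn):
        for _ in range(rounds):
            fn(results)

    threads = [
        threading.Thread(target=worker, args=(fn,))
        for fn in (analyze_new_data, scroll_time_graph)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    finished = all(not t.is_alive() for t in threads)
    return results, finished
```

If one path took graph_view_lock first and the other took target_data_lock first, the same two threads could occasionally block each other forever, which matches the "rare but nasty" shape of what we saw.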
Thanks for your understanding and patience!
Resolved
May 04, 2024 at 1:23 AM UTC
Our team has successfully identified the root cause of the issue. While a comprehensive fix is still in progress, we have implemented effective remediation steps that have stabilized the system. At this time, the issue should no longer affect your user experience. We are diligently working on a permanent solution to ensure this issue does not recur.
Identified
May 03, 2024 at 3:11 PM UTC
Looks like there's another issue that we haven't yet identified, because the problem came back on a different server. We're remediating and investigating.
Resolved
May 03, 2024 at 3:01 PM UTC
One of the back-end "computation" servers was overloaded via an unexpected path. This caused other servers to misreport their statuses too, so the problem looked more widespread than it really was. Because of this, we restarted more servers than we needed to.
Everything is fully operational - all reports, views, summaries and quality monitor views should be accurate and up to date. No data was lost. We are investigating ways to reduce the likelihood of a repeat event.
Thanks for your patience and understanding!
Identified
May 03, 2024 at 2:25 PM UTC
Found an issue with one of the back-end data collection servers. Restarting it. Most targets are working, and more are coming back online.
Investigating
May 03, 2024 at 1:40 PM UTC
There's an issue with viewing summaries and agent target lists. We're investigating.

PingPlotter Cloud - Performance Issue – Incident details
