Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-03-29 12:26 for 2026-03-22 07:00 to 2026-03-29 07:00 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts). Observed in Pingdom: 0
Impacted services: 8 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 5 (still alarming at window end)
Latest observed signal: 2026-03-29 05:42 (most recent cross-source activity)

Executive summary

What needs attention

No dominant issue stood out in this window.

Pingdom customer impact

External signal
No critical. Active: 0. Total seen: 0.

No strong signal in this lane.

No active issue listed in this category.

Slack impacted services

Application signal
No critical. Active: 0. Total seen: 0.

No strong signal in this lane.

No active issue listed in this category.

AWS alarms

Infrastructure signal
No critical. Active: 0. Total seen: 0.

No strong signal in this lane.

No active issue listed in this category.

What to do next

  1. Now: Review the evidence categories below

    No high-priority action was pre-ranked in this window.

Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

[Chart: Application + Infrastructure Alerts by Day]

[Chart: Pingdom latency + downtime by source]

Customer View

Pingdom Checks

| Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence |
|---|---|---|---|---|---|---|

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
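
The attribution rule above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual implementation: the function name `pick_impacted_targets`, the `OBSERVABILITY_COMPONENTS` set, and the case-insensitive matching are all assumptions made for the example.

```python
# Hypothetical sketch of the attribution rule; names and matching details
# are assumptions, not taken from the dashboard's code.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def pick_impacted_targets(candidates):
    """Return the targets an alert should be attributed to.

    `candidates` is the list of workload/resource names found in the alert
    text. Observability components are dropped only when at least one more
    specific target is also present; otherwise they are kept as-is.
    """
    specific = [
        name for name in candidates
        if name.lower() not in OBSERVABILITY_COMPONENTS
    ]
    return specific if specific else list(candidates)


# An alert naming both grafana and a workload is attributed to the workload.
print(pick_impacted_targets(["grafana", "web-80"]))
# An alert naming only grafana stays attributed to grafana.
print(pick_impacted_targets(["grafana"]))
```

Under this reading, a row such as grafana would collect the alerts whose text names no more specific target.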

| Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|
| web-80 | Critical | 6 | 2026-03-27 13:19 | Recent (72h) | TraefikServiceHighErrorRate (1), TraefikServiceHighLatency (5) | General investigation | There was a catalog query that was running slowly |
| uni-api-svc-4000 | Critical | 1 | 2026-03-25 10:58 | Seen this week | TraefikServiceHighErrorRate (1) | General investigation | Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi / Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… / OK, I'll track it down and fix it; I found a similar problem yesterday, and I'll follow up on this one too |
| grafana | Warning | 43 | 2026-03-29 04:44 | Seen today | KubeJobFailed (32), NodeSystemSaturation (6), NodeCPUHighUsage (3), KubeNodeEviction (2) | General investigation | production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… / Ionut Ciolan, it ran out of memory. Is the max memory in php.ini perhaps set to 128 MB? Can you raise it so it stops crashing? / It would be better to refactor the code a bit. It seems that … |
| download-album-29576545 | Warning | 12 | 2026-03-29 05:42 | Seen today | KubeJobFailed (12) | None | |
| ai-api-svc-3900 | Warning | 11 | 2026-03-27 17:21 | Recent (72h) | TraefikServiceHighLatency (11) | None | |
| docgen2-api | Warning | 5 | 2026-03-27 13:05 | Recent (72h) | KubeHpaMaxedOut (5) | None | |
| subscriptions-api-svc-3400 | Warning | 4 | 2026-03-27 08:35 | Recent (72h) | TraefikServiceHighLatency (4) | None | |
| rooms-api-svc-3700 | Warning | 1 | 2026-03-27 08:35 | Recent (72h) | TraefikServiceHighLatency (1) | None | |

Evidence

Slack Alert Families

| Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal | Latest Thread Note |
|---|---|---|---|---|---|---|---|---|
| TraefikServiceHighErrorRate | Critical | 2 | 2026-03-27 08:23 | Recent (72h) | 1 | uni-api-svc-4000 (1), web-80 (1) | General investigation | Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi / Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… / OK, I'll track it down and fix it; I found a similar problem yesterday, and I'll follow up on this one too |
| KubeJobFailed | Warning | 44 | 2026-03-29 05:42 | Seen today | 1 | grafana (32), download-album-29576545 (12) | General investigation | production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… / Ionut Ciolan, it ran out of memory. Is the max memory in php.ini perhaps set to 128 MB? Can you raise it so it stops crashing? / It would be better to refactor the code a bit. It seems that … |
| TraefikServiceHighLatency | Warning | 16 | 2026-03-27 17:21 | Recent (72h) | 1 | ai-api-svc-3900 (11), web-80 (5), subscriptions-api-svc-3400 (4), rooms-api-svc-3700 (1) | General investigation | There was a catalog query that was running slowly |
| NodeSystemSaturation | Warning | 6 | 2026-03-27 11:29 | Recent (72h) | 0 | grafana (6) | None | |
| KubeHpaMaxedOut | Warning | 5 | 2026-03-27 13:05 | Recent (72h) | 0 | docgen2-api (5) | None | |
| NodeCPUHighUsage | Warning | 3 | 2026-03-27 11:29 | Recent (72h) | 0 | grafana (3) | None | |
| KubeNodeEviction | Warning | 2 | 2026-03-26 08:38 | Seen this week | 0 | grafana (2) | None | |

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
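
As an illustration of how such a recency-based status could be derived, here is a minimal sketch. The thresholds are assumptions chosen to match the labels in the tables: "Seen today" is taken to mean the same calendar day as the reference time, "Seen this week" to mean within seven days, and the reference time is assumed to be the report generation timestamp. The function name `recency_status` and the fallback label "Older" are hypothetical.

```python
from datetime import datetime, timedelta


def recency_status(last_seen, now):
    """Map an alert family's last-seen time to a heuristic status label.

    Assumed rules (not confirmed by the report): same calendar day as `now`
    is "Seen today", within 72 hours is "Recent (72h)", within 7 days is
    "Seen this week". "Older" is a hypothetical fallback label.
    """
    age = now - last_seen
    if last_seen.date() == now.date():
        return "Seen today"
    if age <= timedelta(hours=72):
        return "Recent (72h)"
    if age <= timedelta(days=7):
        return "Seen this week"
    return "Older"


# Reference time assumed to be the report generation timestamp.
generated = datetime(2026, 3, 29, 12, 26)
print(recency_status(datetime(2026, 3, 29, 5, 42), generated))   # KubeJobFailed
print(recency_status(datetime(2026, 3, 27, 17, 21), generated))  # TraefikServiceHighLatency
print(recency_status(datetime(2026, 3, 26, 8, 38), generated))   # KubeNodeEviction
```

Note that the label depends only on when the alert last appeared, which is why it cannot by itself say whether the underlying problem was resolved.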

AWS Email Alarm Families

| AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status |
|---|---|---|---|---|---|---|---|---|
| adservio-rds-mysql-catalog2-memory-low | 43 | 22 | 21 | 42 | 2026-03-24 02:38 | 2026-03-25 10:11 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-swap-high | 2 | 1 | 1 | 1 | 2026-03-27 08:26 | 2026-03-27 12:14 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-memory-low | 2 | 1 | 1 | 1 | 2026-03-27 08:23 | 2026-03-27 08:41 | ALARM | Still alarming |
| adservio-rds-mysql-catalog2-swap-high | 1 | 1 | 0 | 0 | 2026-03-27 13:43 | 2026-03-27 13:43 | ALARM | Still alarming |
| adservio-rds-mysql-catalog-disk-queue-high | 2 | 1 | 1 | 1 | 2026-03-27 08:31 | 2026-03-27 08:41 | OK | Latest OK |

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
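
The columns of this table can be derived from the chronological sequence of alarm emails, as in the following sketch. It is an illustration under stated assumptions: the function name `summarize_alarm` is hypothetical, and the `flap_threshold` of 3 flips for "toggled repeatedly" is an assumed cut-off, not a value taken from the report.

```python
def summarize_alarm(states, flap_threshold=3):
    """Summarize a chronological list of "ALARM"/"OK" email states.

    A state flip is counted whenever two consecutive emails report different
    states. `flap_threshold` (assumed value) is the minimum number of flips
    for an alarm whose latest state is OK to be labelled as flapping.
    """
    if not states:
        raise ValueError("states must contain at least one email state")
    flips = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    latest = states[-1]
    if latest == "ALARM":
        status = "Still alarming"
    elif flips >= flap_threshold:
        status = "Flapping, latest OK"
    else:
        status = "Latest OK"
    return {
        "emails": len(states),
        "alarm": states.count("ALARM"),
        "ok": states.count("OK"),
        "flips": flips,
        "latest": latest,
        "status": status,
    }


# Hypothetical sequence: 43 strictly alternating emails starting with ALARM.
alternating = ["ALARM", "OK"] * 21 + ["ALARM"]
print(summarize_alarm(alternating))
```

A strictly alternating sequence like the hypothetical one above yields 22 ALARM emails, 21 OK emails, and 42 flips, which matches the counts shown for adservio-rds-mysql-catalog2-memory-low; the real email order is not shown in this report.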

Global Discussion-Derived Signal

| Thread Date | Alert | Severity | Services | Signal | Key Notes |
|---|---|---|---|---|---|
| 2026-03-26 11:19 | TraefikServiceHighLatency | Warning | web-80 | General investigation | There was a catalog query that was running slowly |
| 2026-03-25 10:58 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | General investigation | Andrei Alexandru, it looks like there is a bug with foreign keys on tuiasi / Service method DisciplineServiceImpl.patchDisciplineByIdAndAcademicPlan failed after 12ms: could not execute statement [(conn=4451547) Cann… / OK, I'll track it down and fix it; I found a similar problem yesterday, and I'll follow up on this one too |
| 2026-03-24 10:51 | KubeJobFailed | Warning | grafana | General investigation | production.ERROR: Allowed memory size of 134217728 bytes exhausted (tried to allocate 200704 bytes) {"exception":"[object] (Symfony\\Compon… / Ionut Ciolan, it ran out of memory. Is the max memory in php.ini perhaps set to 128 MB? Can you raise it so it stops crashing? / It would be better to refactor the code a bit. It seems that … |