Ops Briefing Surface

Production Reliability Dashboard

Generated 2026-03-17 16:22 for 2026-03-15 from Pingdom checks, Slack #_alerts_prod, and AWS SNS alerts.

Email-confirmed customer incidents: 0 (Pingdom down/slow events confirmed by inbox alerts; observed in Pingdom: 0)
Impacted services: 51 (mapped from Slack and Pingdom evidence)
AWS alarms in ALARM: 5 (still alarming at window end)
Latest observed signal: 2026-03-15 17:31 (most recent cross-source activity)
Executive summary

What needs attention

No dominant issue stood out in this window.

Pingdom customer impact

External signal
No critical; Active: 0; Total seen: 0

No strong signal in this lane.

No active issue listed in this category.

Slack impacted services

Application signal
No critical; Active: 0; Total seen: 0

No strong signal in this lane.

No active issue listed in this category.

AWS alarms

Infrastructure signal
No critical; Active: 0; Total seen: 0

No strong signal in this lane.

No active issue listed in this category.

What to do next

  1. Now: Review the evidence categories below

    No high-priority action was pre-ranked in this window.

    Dashboard overview
Global Evidence Explorer

Report-wide charts and tables stay here, separate from the active investigation scope.

Application + Infrastructure Alerts by Day

Pingdom latency + downtime by source

Customer View

Pingdom Checks

Pingdom Check | Status | Events | Downtime | Last Seen | Likely Services | Correlated Evidence

Pingdom rows show externally visible signal first. The correlated evidence column helps tie the failing check back to services, Slack alert families, or AWS alarms when those links exist.

Application View

Slack Impacted Service / Resource View

This view attributes alerts to the workload or resource named in the alert text. Grafana, Loki, and Tempo are treated as observability components and are excluded when a more specific impacted target is also present.
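The attribution rule described above can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: the function name, the input shape (a list of target names already extracted from the alert text), and the "unclassified" fallback are assumptions; only the list of observability components comes from the paragraph above.

```python
# Hypothetical sketch of the attribution rule; names and input shape are assumptions.
OBSERVABILITY_COMPONENTS = {"grafana", "loki", "tempo"}


def attribute_alert(targets):
    """Return the targets an alert should be attributed to.

    `targets` is the list of workload/resource names found in the alert text.
    Observability components are dropped only when a more specific target
    is also present; otherwise they are kept.
    """
    specific = [t for t in targets if t.lower() not in OBSERVABILITY_COMPONENTS]
    if specific:
        return specific
    # Assumed fallback: alerts that name no target land in "unclassified".
    return list(targets) if targets else ["unclassified"]


print(attribute_alert(["grafana", "uni-api-svc-4000"]))
print(attribute_alert(["loki"]))
```

Under these assumptions, an alert naming both grafana and uni-api-svc-4000 is attributed to uni-api-svc-4000 only, while an alert naming only loki stays on loki.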

Impacted Service / Resource | Highest Severity | Count | Last Seen | Status | Top Alert Types | Discussion Signal
(Latest Thread Note, where present, follows the row as an indented "Note:" line; "-" marks an empty cell.)
grafana | - | 173 | 2026-03-13 17:01 | Seen this week | KubeDeploymentReplicasMismatch (54), KubePodCrashLooping (52), KubeCPUOvercommit (11), KubeNodeEviction (11), KubeHpaMaxedOut (10) | Observability storage, Release / migration issue, General investigation
    Note: a wrong config was applied; it is being fixed now
uni-api-svc-4000 | - | 55 | 2026-03-11 10:04 | Seen this week | TraefikServiceHighLatency (29), TraefikServiceHighErrorRate (26) | App bug / schema mismatch
    Note: `d.nrCrt` - missing from `disciplina` | `d1_0.an` - missing from `disciplina` | Andrei Alexandru, it looks like some columns are still missing
accommodations-api-svc-4100 | - | 34 | 2026-03-11 16:31 | Seen this week | TraefikServiceHighLatency (27), TraefikServiceHighErrorRate (7) | None
subscriptions-api | - | 32 | 2026-03-12 18:24 | Likely noise / resolved | KubeHpaMaxedOut (27), CPUThrottlingHigh (3), KubeDeploymentRolloutStuck (1), KubeContainerWaiting (1) | Scaling config, Alert tuning / noise
    Note: I removed the CPU limits everywhere, so it should not appear - I fixed it by hand, I will take a look right away
ws2-api-svc-4300 | - | 32 | 2026-02-24 06:49 | No recent signal | TraefikServiceHighLatency (32) | None
app-gateway-ingress-production-we | - | 29 | 2026-02-20 20:49 | No recent signal | 5xx Response Rate Alert (19), Increased Latency Alert (7), Anomaly Detected: Unusual Request Ratio (3) | DB / maintenance, Attack / traffic anomaly, Dependency failure, General investigation
    Note: attack
admission-end-session-29535860 | - | 27 | 2026-03-02 23:07 | No recent signal | KubeJobFailed (27) | None
admission-end-session-29537300 | - | 27 | 2026-03-03 23:07 | No recent signal | KubeJobFailed (27) | None
ai-api-svc-3900 | - | 26 | 2026-03-13 16:39 | Seen this week | TraefikServiceHighLatency (14), TraefikServiceHighErrorRate (12) | Release / migration issue
    Note: the migrations had not been run on ai
admission-end-session-29538740 | - | 23 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (23) | None
admission-end-session-29554580 | - | 22 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (22) | Observability storage
    Note: Call to a member function format() on bool in FormItem.php | Marian, something is broken in the recurrence | Marian will investigate
service-av | - | 22 | 2026-02-20 08:26 | Likely noise / resolved | CPUThrottlingHigh (21), KubePodCrashLooping (1) | Alert tuning / noise
    Note: I increased the limit on av; it should not appear anymore
unclassified | - | 22 | 2026-02-26 03:27 | Likely noise / resolved | Watchdog (9), TargetDown (4), KubeClientErrors (3), KubeletTooManyPods (2), KubeDeploymentReplicasMismatch (1) | Alert tuning / noise
    Note: these are on the new environment
web-80 | - | 18 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (15), TraefikServiceHighErrorRate (3) | Release / migration issue
    Note: I did something wrong during the deploy; it is working fine
admission-end-session-29540180 | - | 16 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (16) | None
admission-end-session-29547380 | - | 16 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (16) | None
admission-end-session-29556020 | - | 16 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (16) | None
publish-results-29531525 | - | 15 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (15) | General investigation
    Note: error from the old ones; Marian fixed them
attendance-register-missed-attendance-29531530 | - | 13 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (13) | General investigation
    Note: error from the old ones; Marian fixed them
admission-end-session-29531540 | - | 12 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (12) | General investigation
    Note: error from the old ones; Marian fixed them
loki | - | 12 | 2026-03-13 17:01 | Seen this week | KubePersistentVolumeFillingUp (9), KubePodCrashLooping (1), KubeStatefulSetReplicasMismatch (1), TargetDown (1) | Release / migration issue, General investigation
    Note: a wrong config was applied; it is being fixed now
admission-end-session-29551700 | - | 11 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (11) | None
notifications-event-manager | - | 11 | 2026-03-12 18:19 | Seen this week | KubeHpaMaxedOut (5), KubePodCrashLooping (3), KubeDeploymentReplicasMismatch (2), KubePodNotReady (1) | Batch code bug
    Note: The arguments array must contain 2 items, 1 given in Notificari.php
admission-end-session-29557460 | - | 10 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (10) | None
attendance-register-missed-attendance-29518570 | - | 10 | 2026-02-16 06:32 | No recent signal | KubeJobFailed (10) | DB / maintenance
    Note: jobs failed because of DB maintenance
core-grafana-80 | - | 10 | 2026-03-13 17:02 | Likely noise / resolved | TraefikServiceHighLatency (9), TraefikServiceHighErrorRate (1) | Alert tuning / noise
    Note: todo: silence; this should not be critical for grafana
docgen2-api-svc-3600 | - | 10 | 2026-03-10 15:03 | No recent signal | TraefikServiceHighLatency (6), TraefikServiceHighErrorRate (4) | Release / migration issue
    Note: | "uri": "/docgen2/uni/disciplines/download?disciplineId=18461&academicPlanId=16905&cohortId=0&lang=ro", | "status": 500, | Here is the fix made by Stefan. I think we do not have labels yet, but maybe that is not necessary for this fix :slightly_smiling_face:
admission-end-session-29541620 | - | 9 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (9) | None
admission-end-session-29544500 | - | 9 | 2026-03-06 06:46 | No recent signal | KubeJobFailed (9) | None
admission-end-session-29548820 | - | 9 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (9) | None
docgen2-api | - | 9 | 2026-03-13 13:08 | Seen this week | KubeHpaMaxedOut (6), KubePodNotReady (1), KubeContainerWaiting (1), CPUThrottlingHigh (1) | None
download-album-29518575 | - | 8 | 2026-02-16 06:32 | No recent signal | KubeJobFailed (8) | DB / maintenance
    Note: jobs failed because of DB maintenance
download-album-29524325 | - | 8 | 2026-02-20 07:03 | No recent signal | KubeJobFailed (8) | None
publish-results-29524325 | - | 8 | 2026-02-20 07:03 | No recent signal | KubeJobFailed (8) | None
billing-api-svc-3100 | - | 7 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (7) | None
download-album-29532525 | - | 6 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (6) | General investigation
    Note: error from the old ones; Marian fixed them
download-album-29552865 | - | 6 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (6) | None
library-api-svc-3200 | - | 6 | 2026-03-05 08:10 | No recent signal | TraefikServiceHighErrorRate (4), TraefikServiceHighLatency (2) | None
admission-end-session-29534420 | - | 5 | 2026-02-26 19:01 | No recent signal | KubeJobFailed (5) | None
admission-end-session-29532980 | - | 4 | 2026-02-25 14:45 | No recent signal | KubeJobFailed (4) | General investigation
    Note: error from the old ones; Marian fixed them
admission-end-session-29558900 | - | 4 | 2026-03-15 14:57 | Recent (72h) | KubeJobFailed (4) | None
websocket-3800 | - | 4 | 2026-02-22 22:17 | No recent signal | TraefikServiceHighLatency (4) | None
admission-end-session-29553140 | - | 3 | 2026-03-11 10:52 | Seen this week | KubeJobFailed (3) | None
etcd | - | 3 | 2026-02-25 10:14 | Likely noise / resolved | TargetDown (1), etcdMembersDown (1), etcdInsufficientMembers (1) | Alert tuning / noise
    Note: it is OK, this alert should not be set on tuiasi
subscriptions-api-svc-3400 | - | 3 | 2026-03-12 18:12 | Seen this week | TraefikServiceHighLatency (3) | None
admission-end-session-29543060 | - | 2 | 2026-03-04 06:47 | No recent signal | KubeJobFailed (2) | None
admission-end-session-29545940 | - | 2 | 2026-03-06 06:46 | No recent signal | KubeJobFailed (2) | None
admission-end-session-29550260 | - | 2 | 2026-03-09 06:46 | No recent signal | KubeJobFailed (2) | None
publish-results-29527205 | - | 2 | 2026-02-22 16:12 | No recent signal | KubeJobFailed (2) | None
uni-api | - | 2 | 2026-02-17 18:25 | No recent signal | KubeDeploymentRolloutStuck (1), KubePodNotReady (1) | None
tempo | - | 1 | 2026-02-16 21:29 | No recent signal | KubePodCrashLooping (1) | None
Evidence

Slack Alert Families

Alert | Severity | Count | Last Seen | Status | Threads | Top Impacted Services | Discussion Signal
(Latest Thread Note, where present, follows the row as an indented "Note:" line.)
KubeJobFailed | Warning | 140 | 2026-03-15 14:57 | Recent (72h) | 3 | admission-end-session-29537300 (27), admission-end-session-29535860 (27), admission-end-session-29538740 (23), admission-end-session-29554580 (22), admission-end-session-29540180 (16) | DB / maintenance, General investigation, Observability storage
    Note: Call to a member function format() on bool in FormItem.php | Marian, something is broken in the recurrence | Marian will investigate
TraefikServiceHighLatency | Warning | 140 | 2026-03-13 16:39 | Seen this week | 0 | ws2-api-svc-4300 (32), uni-api-svc-4000 (29), accommodations-api-svc-4100 (27), web-80 (15), ai-api-svc-3900 (14) | None
KubePodCrashLooping | Warning | 58 | 2026-03-13 17:01 | Seen this week | 0 | grafana (52), notifications-event-manager (3), unclassified (1), tempo (1), service-av (1) | None
KubeDeploymentReplicasMismatch | Warning | 57 | 2026-03-12 17:53 | Seen this week | 1 | grafana (54), notifications-event-manager (2), unclassified (1) | Batch code bug
    Note: The arguments array must contain 2 items, 1 given in Notificari.php
TraefikServiceHighErrorRate | Critical | 57 | 2026-03-13 17:02 | Likely noise / resolved | 6 | uni-api-svc-4000 (26), ai-api-svc-3900 (12), accommodations-api-svc-4100 (7), library-api-svc-3200 (4), docgen2-api-svc-3600 (4) | Release / migration issue, App bug / schema mismatch, Alert tuning / noise
    Note: todo: silence; this should not be critical for grafana
KubeHpaMaxedOut | Warning | 48 | 2026-03-13 13:08 | Seen this week | 1 | subscriptions-api (27), grafana (10), docgen2-api (6), notifications-event-manager (5), unclassified (1) | Scaling config
    Note: Răzvan Ionică, can you take a look at this? It is the same KEDA config on Azure, not sure why the alert appears | did you find the cause? | if so, yes. I think this is from the financial crunch at Christmas, when we scaled everything down to the minimum.
CPUThrottlingHigh | Warning | 28 | 2026-03-12 09:00 | Likely noise / resolved | 2 | service-av (21), subscriptions-api (3), grafana (3), docgen2-api (1) | Alert tuning / noise
    Note: I removed the CPU limits everywhere, so it should not appear - I fixed it by hand, I will take a look right away
5xx Response Rate Alert | Warning | 19 | 2026-02-20 20:46 | No recent signal | 9 | app-gateway-ingress-production-we (19) | DB / maintenance, Attack / traffic anomaly, Dependency failure, General investigation
    Note: attack
KubeNodeEviction | Warning | 12 | 2026-03-10 12:27 | No recent signal | 0 | grafana (11), unclassified (1) | None
KubeCPUOvercommit | Warning | 11 | 2026-02-25 08:55 | No recent signal | 0 | grafana (11) | None
Watchdog | Warning | 11 | 2026-02-24 09:44 | Likely noise / resolved | 1 | unclassified (9), grafana (2) | Alert tuning / noise
    Note: I silenced it; it is a debug alert
KubePersistentVolumeFillingUp | Warning | 9 | 2026-03-13 14:08 | Seen this week | 1 | grafana (9), loki (9) | Release / migration issue
    Note: I am deploying now to reduce the logs | in any case, it is set up to delete old logs when it approaches 100 GB
Increased Latency Alert | Warning | 7 | 2026-02-19 03:18 | No recent signal | 4 | app-gateway-ingress-production-we (7) | DB / maintenance, Attack / traffic anomaly
    Note: attack
NodeHighNumberConntrackEntriesUsed | Warning | 7 | 2026-03-13 09:11 | Seen this week | 0 | grafana (7) | None
TargetDown | Warning | 7 | 2026-03-13 16:56 | Likely noise / resolved | 2 | unclassified (4), grafana (2), etcd (1), loki (1) | Alert tuning / noise, General investigation
    Note: a wrong config was applied; it is being fixed now
Anomaly Detected: Unusual Request Ratio | Warning | 3 | 2026-02-20 20:49 | No recent signal | 0 | app-gateway-ingress-production-we (3) | None
KubeClientErrors | Warning | 3 | 2026-02-26 03:27 | No recent signal | 0 | unclassified (3) | None
KubeJobNotCompleted | Warning | 2 | 2026-03-07 08:05 | No recent signal | 0 | grafana (2) | None
KubeletTooManyPods | Warning | 2 | 2026-02-15 18:28 | No recent signal | 0 | unclassified (2) | None
KubeAggregatedAPIDown | Warning | 1 | 2026-02-15 16:24 | No recent signal | 0 | grafana (1) | None
KubeContainerWaiting | Warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | docgen2-api (1), subscriptions-api (1) | None
KubeDeploymentRolloutStuck | Warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | subscriptions-api (1), uni-api (1) | None
KubePodNotReady | Warning | 1 | 2026-02-17 18:25 | No recent signal | 0 | docgen2-api (1), notifications-event-manager (1), uni-api (1) | None
KubeStatefulSetReplicasMismatch | Warning | 1 | 2026-03-13 17:00 | Seen this week | 0 | grafana (1), loki (1) | None
NodeCPUHighUsage | Warning | 1 | 2026-03-12 18:18 | Seen this week | 0 | grafana (1) | None
NodeMemoryHighUtilization | Warning | 1 | 2026-03-10 12:24 | No recent signal | 1 | grafana (1) | Observability storage
    Note: it is from loki; it used too much memory
NodeSystemSaturation | Warning | 1 | 2026-03-12 18:17 | Seen this week | 0 | grafana (1) | None
etcdInsufficientMembers | Critical | 1 | 2026-02-25 10:12 | Likely noise / resolved | 1 | etcd (1) | Alert tuning / noise
    Note: it is OK, this alert should not be set on tuiasi
etcdMembersDown | Warning | 1 | 2026-02-25 10:14 | No recent signal | 0 | etcd (1) | None

Status is heuristic. Slack rarely posts explicit resolutions, so “Seen today” or “Recent” means the alert family still appeared in production recently, not that it is definitely unresolved.
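A status heuristic of this kind could look roughly like the sketch below. The real rules are not documented in this report, so everything here is an assumption: the function name, the 24-hour cutoff for "Seen today", measuring age from the report generation time, and treating an "Alert tuning / noise" discussion signal as "Likely noise / resolved". The 72-hour and 7-day cutoffs were chosen only because they appear consistent with the rows in this report.

```python
from datetime import datetime, timedelta


def heuristic_status(last_seen, now, signals=()):
    """Assumed status heuristic; thresholds are illustrative, not confirmed."""
    # Assumption: a thread tagged as alert tuning/noise marks the family as resolved.
    if "Alert tuning / noise" in signals:
        return "Likely noise / resolved"
    age = now - last_seen
    if age < timedelta(hours=24):
        return "Seen today"
    if age < timedelta(hours=72):
        return "Recent (72h)"
    if age < timedelta(days=7):
        return "Seen this week"
    return "No recent signal"


# Age is measured from the generation time given in the report header.
generated = datetime(2026, 3, 17, 16, 22)
print(heuristic_status(datetime(2026, 3, 15, 14, 57), generated))
print(heuristic_status(datetime(2026, 3, 10, 15, 3), generated))
```

None of these labels says anything about resolution; they only describe how recently the alert family was seen.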

AWS Email Alarm Families

AWS Alarm | Emails | ALARM | OK | State Flips | First Seen | Last Seen | Latest State | Status
adservio-rds-mysql-master-memory-low | 56 | 25 | 31 | 43 | 2026-03-04 15:19 | 2026-03-15 17:31 | OK | Flapping, latest OK
adservio-root-account-usage | 5 | 2 | 3 | 4 | 2026-02-25 11:53 | 2026-02-26 11:30 | OK | Latest OK
adservio-rds-mysql-catalog-memory-low | 1 | 1 | 0 | 0 | 2026-03-10 22:15 | 2026-03-10 22:15 | ALARM | Still alarming
adservio-rds-mysql-catalog-swap-high | 1 | 1 | 0 | 0 | 2026-03-11 11:38 | 2026-03-11 11:38 | ALARM | Still alarming
adservio-rds-postgres-billing-cpu-high | 1 | 1 | 0 | 0 | 2026-03-01 09:12 | 2026-03-01 09:12 | ALARM | Still alarming

“Flapping, latest OK” means the most recent email was an OK, but the alarm toggled repeatedly and is still a reliability concern.
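As an illustration of how such a status could be derived from the sequence of alarm emails, here is a minimal sketch. It is not the dashboard's implementation: the function names, the input shape (a chronological list of "ALARM"/"OK" states), and the flap threshold of 10 are all assumptions. The report only shows that an alarm with 43 state flips is labelled flapping while one with 4 is not.

```python
def count_state_flips(states):
    """Count transitions between consecutive, differing states."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


def classify_alarm(states, flap_threshold=10):
    """Classify an alarm from its chronological list of "ALARM"/"OK" states.

    flap_threshold is a hypothetical cutoff, not a value taken from the report.
    """
    if not states:
        raise ValueError("at least one state is required")
    if states[-1] == "ALARM":
        return "Still alarming"
    if count_state_flips(states) >= flap_threshold:
        return "Flapping, latest OK"
    return "Latest OK"


print(classify_alarm(["ALARM"]))
print(classify_alarm(["ALARM", "OK"] * 6))
```

The point of the separate flapping label is that a latest state of OK alone would hide an alarm that keeps toggling.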

Global Discussion-Derived Signal

Thread Date | Alert | Severity | Services | Signal
(Key Notes follow each row as an indented "Note:" line.)
2026-03-13 17:02 | TraefikServiceHighErrorRate | Critical | core-grafana-80 | Alert tuning / noise
    Note: todo: silence; this should not be critical for grafana
2026-03-13 16:56 | TargetDown | Warning | grafana, loki | General investigation
    Note: a wrong config was applied; it is being fixed now
2026-03-12 13:48 | KubeDeploymentReplicasMismatch | Warning | notifications-event-manager | Batch code bug
    Note: The arguments array must contain 2 items, 1 given in Notificari.php
2026-03-12 10:52 | KubeJobFailed | Warning | admission-end-session-29554580 | Observability storage
    Note: Call to a member function format() on bool in FormItem.php | Marian, something is broken in the recurrence | Marian will investigate
2026-03-12 08:25 | CPUThrottlingHigh | Warning | subscriptions-api | Alert tuning / noise
    Note: I removed the CPU limits everywhere, so it should not appear - I fixed it by hand, I will take a look right away
2026-03-11 08:01 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | App bug / schema mismatch
    Note: `d.nrCrt` - missing from `disciplina` | `d1_0.an` - missing from `disciplina` | Andrei Alexandru, it looks like some columns are still missing
2026-03-10 12:24 | NodeMemoryHighUtilization | Warning | grafana | Observability storage
    Note: it is from loki; it used too much memory
2026-03-10 10:49 | TraefikServiceHighErrorRate | Critical | docgen2-api-svc-3600 | Release / migration issue
    Note: | "uri": "/docgen2/uni/disciplines/download?disciplineId=18461&academicPlanId=16905&cohortId=0&lang=ro", | "status": 500, | Here is the fix made by Stefan. I think we do not have labels yet, but maybe that is not necessary for this fix :slightly_smiling_face:
2026-03-10 09:12 | TraefikServiceHighErrorRate | Critical | uni-api-svc-4000 | App bug / schema mismatch
    Note: | Got error 'missing ) at offset 911' from regexp | We have a fix in main for this as well. I added extra escaping for the characters that go into that regex expression
2026-03-06 14:16 | KubePersistentVolumeFillingUp | Warning | grafana, loki | Release / migration issue
    Note: I am deploying now to reduce the logs | in any case, it is set up to delete old logs when it approaches 100 GB
2026-02-25 16:23 | TraefikServiceHighErrorRate | Critical | ai-api-svc-3900 | Release / migration issue
    Note: the migrations had not been run on ai
2026-02-25 10:55 | KubeJobFailed | Warning | admission-end-session-29531540, admission-end-session-29532980, attendance-register-missed-attendance-29531530, download-album-29532525 | General investigation
    Note: error from the old ones; Marian fixed them
2026-02-25 10:12 | etcdInsufficientMembers | Critical | etcd | Alert tuning / noise
    Note: it is OK, this alert should not be set on tuiasi
2026-02-23 14:28 | TraefikServiceHighErrorRate | Critical | web-80 | Release / migration issue
    Note: I did something wrong during the deploy; it is working fine
2026-02-23 07:16 | KubeHpaMaxedOut | Warning | subscriptions-api | Scaling config
    Note: Răzvan Ionică, can you take a look at this? It is the same KEDA config on Azure, not sure why the alert appears | did you find the cause? | if so, yes. I think this is from the financial crunch at Christmas, when we scaled everything down to the minimum.
2026-02-20 08:26 | CPUThrottlingHigh | Warning | service-av | Alert tuning / noise
    Note: I increased the limit on av; it should not appear anymore
2026-02-17 17:22 | TargetDown | Warning | unclassified | Alert tuning / noise
    Note: these are on the new environment
2026-02-16 06:32 | KubeJobFailed | Warning | attendance-register-missed-attendance-29518570, download-album-29518575 | DB / maintenance
    Note: jobs failed because of DB maintenance
2026-02-16 06:11 | Watchdog | Warning | unclassified | Alert tuning / noise
    Note: I silenced it; it is a debug alert
2026-02-16 03:26 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | Attack / traffic anomaly
    Note: attack
2026-02-16 03:23 | Increased Latency Alert | Warning | app-gateway-ingress-production-we | Attack / traffic anomaly
    Note: attack
2026-02-16 03:21 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | Attack / traffic anomaly
    Note: attack
2026-02-15 18:01 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | Dependency failure
    Note: here the redis and rabbitmq node went down
2026-02-15 03:36 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | General investigation
    Note: there are 13 requests here
2026-02-15 03:33 | Increased Latency Alert | Warning | app-gateway-ingress-production-we | DB / maintenance
    Note: db restart
2026-02-15 03:11 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | DB / maintenance
    Note: db restart
2026-02-15 03:08 | Increased Latency Alert | Warning | app-gateway-ingress-production-we | DB / maintenance
    Note: db restart
2026-02-15 02:31 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | DB / maintenance
    Note: db restart
2026-02-15 02:26 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | DB / maintenance
    Note: db restart
2026-02-15 02:21 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | DB / maintenance
    Note: db restart
2026-02-15 02:13 | Increased Latency Alert | Warning | app-gateway-ingress-production-we | DB / maintenance
    Note: db restart
2026-02-15 02:11 | 5xx Response Rate Alert | Warning | app-gateway-ingress-production-we | DB / maintenance
    Note: db restart