AI Log Analytics & Predictive Anomaly Detection for Interoperability

Problem

Operational issues (bad data, unusual traffic spikes, rising error rates) are discovered too late, after queues back up or downstream systems fail.
The current Event Log, Message Viewer, and Visual Trace are excellent for investigation, but they are reactive and manual. Teams need proactive signals and clear root‑cause hints.

Proposal

Add a new “Log Analytics” section in the Management Portal that uses IRIS SQL + IntegratedML/Embedded Python to learn baselines from historical logs and predict anomalies in near real time.
It surfaces patterns of probable bad data, unusual traffic, adapter errors, latency changes, and queue growth for each Service/Process/Operation, then suggests likely causes and next actions.

Data sources (IRIS/Interoperability specific)

Ens.EventLog (errors/warnings/info), Ens.MessageHeader (message metadata), EnsLib.HL7.SearchTable (field indexes), EnsLib.FHIR.* resource validation results, Adapter logs (EnsLib.* TCP/HTTP/File/SFTP/JDBC), system metrics via %SYS.Monitor and ^%SS, production metrics (queue depth, retry counts, message sizes, latency).
Optional: Message Bank for centralized multi‑namespace aggregation.
Journal timestamps for high‑precision time alignment.

Data model (new Observability schema)

OBS.Event (ts, Production, Namespace, Host, Service, Process, Operation, AdapterType, Direction, Status, ErrorCode, ErrorText, RetryCount, LatencyMs, MsgSize, RemoteHost, Protocol, MessageType, HL7Event, FHIRResource, HTTPStatus, TLSFlags).
OBS.FieldQuality (per message type): RequiredMissingCnt, UnknownCodeCnt, OutOfRangeCnt, DuplicatesCnt, CoercionFailuresCnt, SegmentCountStats.
OBS.Aggregates (1min/5min/1hr windows): volume, error_rate, p95_latency, queue_depth, backoff_rate.
Views that join OBS.* back to Ens.* classes for deep-linking to Visual Trace/Message Viewer.
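
As a rough illustration of how OBS.Aggregates rows could be derived from OBS.Event rows, here is a minimal Python sketch (it could run as Embedded Python, but uses only the standard library). The dictionary keys (ts as epoch seconds, component, status, latency_ms) and the "Error" status value are assumptions made for the example, not a fixed schema. The percentile uses the nearest-rank method, which is one possible definition of p95.

```python
import math
from collections import defaultdict

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list (a discrete percentile)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def aggregate(events, window_seconds=60):
    """Roll events up into fixed windows per component.

    Each event is a dict with hypothetical keys:
    ts (epoch seconds), component, status, latency_ms.
    """
    buckets = defaultdict(list)
    for e in events:
        window_start = e["ts"] - e["ts"] % window_seconds
        buckets[(window_start, e["component"])].append(e)

    rows = []
    for (window_start, component), group in sorted(buckets.items()):
        errors = sum(1 for e in group if e["status"] == "Error")
        rows.append({
            "window_start": window_start,
            "component": component,
            "volume": len(group),
            "error_rate": errors / len(group),
            "p95_latency": p95([e["latency_ms"] for e in group]),
        })
    return rows
```

Calling aggregate(events, 300) or aggregate(events, 3600) would produce the 5min and 1hr windows from the same input.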

ML/Analytics approach

Baselines per interface: learn normal ranges by time‑of‑day/day‑of‑week and seasonality.
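
One simple way to learn such a baseline is to bucket historical values by (day of week, hour of day) and keep the median of each bucket. The sketch below assumes that approach. The function names are hypothetical, and a production baseline would likely also handle holidays and longer seasonality.

```python
import statistics
from collections import defaultdict
from datetime import datetime

def learn_baseline(samples):
    """samples: iterable of (datetime, value) pairs for one interface.

    Returns {(weekday, hour): median value}, with Monday == 0.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {key: statistics.median(values) for key, values in buckets.items()}

def deviation_pct(baseline, ts, observed):
    """Percent deviation of an observed value from its bucket's baseline.

    Returns None when there is no usable baseline for that bucket.
    """
    expected = baseline.get((ts.weekday(), ts.hour))
    if not expected:
        return None
    return 100.0 * (observed - expected) / expected
```

With three prior Mondays at 09:00 showing volumes of 100, 110, and 90, an observed volume of 280 on the next Monday at 09:00 is reported as a +180% deviation.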
Unsupervised anomaly detection (Isolation Forest or robust Z‑score) for:
- Volume spikes/drops
- Error/timeout surges
- Latency shifts
- Queue depth acceleration
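
The robust Z‑score option can be sketched with the median and the median absolute deviation (MAD) in place of the mean and standard deviation, so that a single spike does not distort the baseline it is measured against. The 0.6745 scaling constant and the 3.5 cut-off are the commonly used defaults for this "modified z-score"; treat them as starting points rather than tuned values. Isolation Forest would need scikit‑learn and is not shown.

```python
import statistics

def robust_z_scores(values):
    """Modified z-score of each value, based on median and MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # Simplification: with no spread around the median the score is
        # undefined, so this sketch reports no deviation at all.
        return [0.0] * len(values)
    return [0.6745 * (v - med) / mad for v in values]

def find_anomalies(values, threshold=3.5):
    """Indexes of values whose absolute robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]
```

The same function applies to any of the four signals above: per-window volume, error counts, latency, or the change in queue depth between windows.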
Predictive forecasts (next 30–60 minutes):
- Expected inbound volume and error rate
- Probability of SLA breach given current trends
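
A very simple stand-in for the forecast is a least-squares trend line over recent samples, extrapolated to the point where it crosses a threshold. The sketch below does only that. It yields a point estimate of time-to-breach, not a probability, and it assumes the recent trend is roughly linear. All names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def minutes_until_breach(minutes, depths, threshold):
    """Estimate minutes from the latest sample until the trend crosses threshold.

    minutes: sample times in minutes (at least two distinct values).
    depths:  queue depth at each sample time.
    Returns None when the trend is flat or falling, 0.0 when already breached.
    """
    slope, intercept = linear_fit(minutes, depths)
    if slope <= 0:
        return None
    now = minutes[-1]
    if slope * now + intercept >= threshold:
        return 0.0
    return (threshold - intercept) / slope - now
```

For a queue growing by 100 messages per minute from 1,000, sampled over minutes 0 to 4, the estimate for a 5,000-message threshold is 36 minutes from the last sample.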
Data quality scoring for HL7/FHIR/JSON:
- Required fields missing (e.g., PID‑3, MSH version)
- Unknown value sets/code systems
- Structural deviations (segment/resource cardinality)
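
To make the HL7 checks concrete, here is a minimal sketch that scores a single pipe-delimited HL7 v2 message. It assumes the default "|" and "^" delimiters, ignores repetitions and escaping, and treats the PID‑3 assigning authority (component 4) as the "code system" to validate. The known_authorities set and the SegmentCountDeviationCnt key are hypothetical. A real implementation would more likely use the IRIS HL7 virtual document classes than hand parsing.

```python
REQUIRED_FIELDS = [("MSH", 12), ("PID", 3)]  # MSH version and PID-3, as listed above

def parse_segments(message):
    """Group split fields by segment ID; accepts CR or LF segment terminators."""
    segments = {}
    for line in message.replace("\n", "\r").split("\r"):
        if line.strip():
            fields = line.split("|")
            segments.setdefault(fields[0], []).append(fields)
    return segments

def field_value(fields, number):
    """Return field `number` of a segment, or '' when it is absent.

    In MSH the field separator itself is MSH-1, so MSH-n sits at index n-1.
    """
    index = number - 1 if fields[0] == "MSH" else number
    return fields[index] if index < len(fields) else ""

def quality_report(message, known_authorities):
    segments = parse_segments(message)
    report = {"RequiredMissingCnt": 0, "UnknownCodeCnt": 0,
              "SegmentCountDeviationCnt": 0}
    # Required fields: count a missing segment or an empty field.
    for seg_id, number in REQUIRED_FIELDS:
        occurrences = segments.get(seg_id, [])
        if not occurrences:
            report["RequiredMissingCnt"] += 1
        for fields in occurrences:
            if not field_value(fields, number):
                report["RequiredMissingCnt"] += 1
    # Unknown value sets: PID-3 component 4 must be a known authority.
    for fields in segments.get("PID", []):
        components = field_value(fields, 3).split("^")
        authority = components[3] if len(components) > 3 else ""
        if authority and authority not in known_authorities:
            report["UnknownCodeCnt"] += 1
    # Structural check: this sketch expects exactly one MSH and one PID.
    for seg_id in ("MSH", "PID"):
        if len(segments.get(seg_id, [])) != 1:
            report["SegmentCountDeviationCnt"] += 1
    return report
```

Summing these per-message counters by message type and time window would yield the kind of rows described for OBS.FieldQuality.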
Implementation options:
- IntegratedML models: CREATE MODEL, TRAIN MODEL, PREDICT via SQL over OBS.Aggregates.
- Embedded Python (scikit‑learn/prophet) for advanced time‑series; exposed as IRIS methods or SQL table functions.
- Feature store built with computed columns and scheduled tasks (Task Manager tasks, e.g., subclasses of %SYS.Task.Definition) to maintain aggregates.

Portal UX

New menu: Interoperability > Log Analytics.
Dashboards:
- Traffic & Error Heatmap by Service/Operation
- Anomaly Timeline with severity score and confidence
- Queue Risk Forecast (when/where backlog will exceed threshold)
- Data Quality Panel (top offending fields, segments, value sets)
Drill‑downs:
- Click an anomaly to open Visual Trace filtered to the affected window and components.
- “Explain” side panel: contributing features (e.g., remote host, message type), top correlated signals, and suggested actions.
Alerts:
- Rules like “if AnomalyScore > 0.8 and queue_depth_slope > X, alert via Ens.Alert or email/SNMP/Webhook.”
- Maintenance: certificate expiry predictors (based on TLSFlags/errors), disk space risk (journal size/velocity), connection flaps.
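
The example alert rule can be expressed as a small predicate. In the sketch below, the Signal type and the default slope threshold of 50 messages per minute are placeholders, since the proposal leaves X open. Delivery to Ens.Alert, email, SNMP, or a webhook is out of scope here.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    """Latest scored observation for one component (hypothetical shape)."""
    component: str
    anomaly_score: float      # 0.0 to 1.0, higher means more unusual
    queue_depth_slope: float  # messages per minute

def should_alert(signal, score_threshold=0.8, slope_threshold=50.0):
    """True only when both the anomaly score and the queue slope exceed their limits."""
    return (signal.anomaly_score > score_threshold
            and signal.queue_depth_slope > slope_threshold)
```

Requiring both conditions keeps a high anomaly score on an otherwise healthy queue from paging anyone. An Admin-adjustable threshold pair per component would fit the role model described under Security and privacy.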

Suggested actions (generated)

Throttle or burst buffer recommendations per adapter.
Routing rule hints (e.g., quarantine messages with unknown PV1 values).
Index suggestions for heavy SQL in Operations.
Configuration checks: TCP keepalive, HTTP timeouts, SSL/TLS ciphers, retry/backoff tuning.

Technical details

OBS.* classes stored in a dedicated schema; populated by:
- A lightweight Business Service/Process that subscribes to Ens.EventLog and message metadata streams.
- Scheduled Task that compacts and aggregates into OBS.Aggregates.
Models:
- Train nightly; update baselines continuously using sliding windows.
- Governance: versioned models stored in a class (OBS.ModelRegistry), with metrics (AUC, precision/recall for labeled events when available).
Performance:
- Use shard-friendly tables (IDKEY on ts + component) and bitmap indexes for event type filters.
- Efficient queries via aggregate and window functions (AVG, STDDEV, PERCENTILE_DISC) and pre-aggregated summary tables that serve as materialized views for hot panels.

Security and privacy

Read‑only views by default; no PHI content stored beyond necessary field statistics unless enabled.
Role‑aware: Ops role can view/acknowledge; Admin can change thresholds and retraining cadence.
Redaction options for ErrorText; configurable field sampling.
Works on‑prem or with customer‑selected LLM/ML endpoints; no production payloads sent externally unless approved.
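
Redaction of ErrorText could start with pattern-based masking applied before the text is written to OBS.Event. The two patterns below (email addresses, and digit runs of six or more that may be identifiers) are illustrative assumptions only. Real deployments would need patterns reviewed against their own data, and pattern matching alone is not a guarantee that no PHI remains.

```python
import re

# Ordered (pattern, replacement) pairs; both patterns are illustrative.
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<email>"),
    (re.compile(r"\b\d{6,}\b"), "<id>"),
]

def redact(text):
    """Mask substrings of an error message that look like identifiers."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Short numbers such as error codes are left intact, so "ERROR #5001" stays readable while an eight-digit value is masked.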

MVP acceptance criteria

OBS.Event and OBS.Aggregates populated from Ens.EventLog and MessageHeader across at least one production.
Anomaly Timeline showing predicted spikes/drops with confidence and deep links to Visual Trace.
Data Quality Panel for HL7 messages using EnsLib.HL7.SearchTable indices (missing required, unknown codes).
IntegratedML model that forecasts next‑hour error rate or volume per Service and exposes PREDICT() in SQL.
Alerting to Ens.Alert with severity and suggested next step.
Dashboard performance: <2s for last 24h view; <5s for 7‑day view.

Example detections

“Inbound HL7 ADT volume +180% vs baseline; unknown PID‑3 code system rising; probable bad feed from Host X.”
“Operation OP_FHIR receiving a rising number of HTTP 429 responses; forecast SLA breach in 25 minutes; suggest retry/backoff adjustments.”
“Queue depth for BS_FileIn accelerating; expected to exceed 5,000 in 40 minutes; recommend temporary throttle and quarantine rule.”

Success metrics

50% reduction in time‑to‑detect compared to manual monitoring.
30% reduction in after‑hours incidents caused by unnoticed data/traffic anomalies.
Measurable decrease in message reprocessing due to early quarantine of bad data.
