InterSystems Ideas
Status Needs review
Created by Tirthankar Bachhar
Created on Nov 24, 2025

AI Log Analytics & Predictive Anomaly Detection for Interoperability

Problem

  • Operational issues (bad data, unusual traffic spikes, rising error rates) are discovered too late, after queues back up or downstream systems fail.

  • The current Event Log, Message Viewer, and Visual Trace are excellent for investigation, but they are reactive and manual. Teams need proactive signals and clear root‑cause hints.

Proposal

  • Add a new “Log Analytics” section in the Management Portal that uses IRIS SQL + IntegratedML/Embedded Python to learn baselines from historical logs and predict anomalies in near real time.

  • It surfaces patterns of probable bad data, unusual traffic, adapter errors, latency changes, and queue growth for each Service, Process, and Operation, then suggests likely causes and next actions.

Data sources (IRIS/Interoperability specific)

  • Ens.EventLog (errors/warnings/info), Ens.MessageHeader (message metadata), EnsLib.HL7.SearchTable (field indexes), EnsLib.FHIR.* resource validation results, Adapter logs (EnsLib.* TCP/HTTP/File/SFTP/JDBC), system metrics via %SYS.Monitor and ^%SS, production metrics (queue depth, retry counts, message sizes, latency).

  • Optional: Message Bank for centralized multi‑namespace aggregation.

  • Journal timestamps for high‑precision time alignment.

Data model (new Observability schema)

  • OBS.Event (ts, Production, Namespace, Host, Service, Process, Operation, AdapterType, Direction, Status, ErrorCode, ErrorText, RetryCount, LatencyMs, MsgSize, RemoteHost, Protocol, MessageType, HL7Event, FHIRResource, HTTPStatus, TLSFlags).

  • OBS.FieldQuality (per message type): RequiredMissingCnt, UnknownCodeCnt, OutOfRangeCnt, DuplicatesCnt, CoercionFailuresCnt, SegmentCountStats.

  • OBS.Aggregates (1‑minute, 5‑minute, and 1‑hour windows): volume, error_rate, p95_latency, queue_depth, backoff_rate.

  • Views that join OBS.* back to Ens.* classes for deep-linking to Visual Trace/Message Viewer.
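
As an illustration of how OBS.Aggregates rows could be derived from OBS.Event rows, here is a minimal Python sketch. It assumes events have already been read into memory. The `Event` dataclass, its field names, and the nearest-rank percentile method are assumptions made for the example, not part of any existing IRIS API.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical in-memory stand-in for one OBS.Event row."""
    ts: float          # epoch seconds
    component: str     # Service/Process/Operation name
    is_error: bool
    latency_ms: float


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def aggregate(events, window_s=60):
    """Group events into fixed windows per component and compute metrics."""
    buckets = defaultdict(list)
    for e in events:
        window_start = int(e.ts // window_s) * window_s
        buckets[(window_start, e.component)].append(e)

    rows = []
    for (window_start, component), group in sorted(buckets.items()):
        errors = sum(1 for e in group if e.is_error)
        rows.append({
            "window_start": window_start,
            "component": component,
            "volume": len(group),
            "error_rate": errors / len(group),
            "p95_latency": p95([e.latency_ms for e in group]),
        })
    return rows
```

Calling `aggregate(events, window_s=300)` would produce the 5‑minute variant. In a real deployment the same grouping would more likely be expressed in SQL over the event table.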

ML/Analytics approach

  • Baselines per interface: learn normal ranges by time‑of‑day/day‑of‑week and seasonality.

  • Unsupervised anomaly detection (Isolation Forest or robust Z‑score) for:

    • Volume spikes/drops

    • Error/timeout surges

    • Latency shifts

    • Queue depth acceleration

  • Predictive forecasts (next 30–60 minutes):

    • Expected inbound volume and error rate

    • Probability of SLA breach given current trends

  • Data quality scoring for HL7/FHIR/JSON:

    • Required fields missing (e.g., PID‑3, MSH version)

    • Unknown value sets/code systems

    • Structural deviations (segment/resource cardinality)

  • Implementation options:

    • IntegratedML models: CREATE MODEL, TRAIN MODEL, PREDICT via SQL over OBS.Aggregates.

    • Embedded Python (scikit‑learn/prophet) for advanced time‑series; exposed as IRIS methods or SQL table functions.

    • Feature store built with computed columns and scheduled Task Manager tasks (%SYS.Task.Definition subclasses) to maintain aggregates.
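
The robust Z‑score option mentioned above can be sketched in a few lines of standard-library Python. This is only an illustration: it uses the median and the median absolute deviation (MAD) with the commonly used 0.6745 scaling constant and a 3.5 cutoff. The function names are hypothetical, and a production version would score each interface against its own time‑of‑day baseline rather than a flat list.

```python
import statistics


def robust_z_scores(values):
    """Modified Z-scores based on the median and MAD instead of mean/stddev."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to scale by; this sketch simply treats every point as normal.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def find_anomalies(values, threshold=3.5):
    """Return the indexes whose modified Z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]
```

For per‑minute volumes such as `[100, 102, 98, 101, 99, 103, 97, 280]`, only the last window is flagged, because a single spike barely moves the median or the MAD.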

Portal UX

  • New menu: Interoperability > Log Analytics.

  • Dashboards:

    • Traffic & Error Heatmap by Service/Operation

    • Anomaly Timeline with severity score and confidence

    • Queue Risk Forecast (when/where backlog will exceed threshold)

    • Data Quality Panel (top offending fields, segments, value sets)

  • Drill‑downs:

    • Click an anomaly to open Visual Trace filtered to the affected window and components.

    • “Explain” side panel: contributing features (e.g., remote host, message type), top correlated signals, and suggested actions.

  • Alerts:

    • Rules such as “if AnomalyScore > 0.8 and queue_depth_slope > X, alert via Ens.Alert or email/SNMP/webhook.”

    • Maintenance: certificate expiry predictors (based on TLSFlags/errors), disk space risk (journal size/velocity), connection flaps.
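
To make the alert rule above concrete, here is a small Python sketch of how such a rule might be evaluated. The `AlertRule` class, the default slope threshold standing in for “X”, and the severity cutoff are all hypothetical values chosen for illustration. Delivery to Ens.Alert, email, SNMP, or a webhook is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: both conditions must hold for an alert to fire."""
    min_anomaly_score: float = 0.8
    min_queue_slope: float = 50.0  # messages per minute; stands in for "X"


def evaluate(rule: AlertRule, component: str,
             anomaly_score: float, queue_depth_slope: float) -> Optional[dict]:
    """Return an alert payload if the rule fires, otherwise None."""
    if (anomaly_score > rule.min_anomaly_score
            and queue_depth_slope > rule.min_queue_slope):
        # The 0.95 severity cutoff is an arbitrary illustrative choice.
        severity = "high" if anomaly_score > 0.95 else "medium"
        return {
            "component": component,
            "severity": severity,
            "text": (f"{component}: anomaly score {anomaly_score:.2f}, "
                     f"queue growing {queue_depth_slope:.0f} msg/min"),
        }
    return None
```

Keeping the rule as plain data would let an Admin role change thresholds without touching the evaluation code.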

Suggested actions (generated)

  • Throttle or burst buffer recommendations per adapter.

  • Routing rule hints (e.g., quarantine messages with unknown PV1 values).

  • Index suggestions for heavy SQL in Operations.

  • Configuration checks: TCP keepalive, HTTP timeouts, SSL/TLS ciphers, retry/backoff tuning.

Technical details

  • OBS.* classes stored in a dedicated schema; populated by:

    • A lightweight Business Service/Process that subscribes to Ens.EventLog and message metadata streams.

    • Scheduled Task that compacts and aggregates into OBS.Aggregates.

  • Models:

    • Train nightly; update baselines continuously using sliding windows.

    • Governance: versioned models stored in a class (OBS.ModelRegistry), with metrics (AUC, precision/recall for labeled events when available).

  • Performance:

    • Use shard-friendly tables (IDKEY on ts + component) and bitmap indexes for event type filters.

    • Efficient queries via window functions (AVG, STDDEV, PERCENTILE_DISC) and materialized views for hot panels.
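
One possible reading of “update baselines continuously using sliding windows” is a bounded history per day‑of‑week and hour‑of‑day bucket. The Python sketch below shows that idea. The `SlidingBaseline` class and the choice of eight samples per bucket are assumptions for the example, not a prescribed design.

```python
import statistics
from collections import defaultdict, deque
from datetime import datetime


class SlidingBaseline:
    """Keeps the most recent samples per (weekday, hour) bucket."""

    def __init__(self, max_samples=8):
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._samples = defaultdict(lambda: deque(maxlen=max_samples))

    @staticmethod
    def _bucket(ts: datetime):
        return (ts.weekday(), ts.hour)

    def update(self, ts: datetime, value: float) -> None:
        self._samples[self._bucket(ts)].append(value)

    def expected(self, ts: datetime):
        """Return (mean, stdev) for the bucket, or None with fewer than 2 samples."""
        samples = self._samples.get(self._bucket(ts))
        if samples is None or len(samples) < 2:
            return None
        data = list(samples)
        return statistics.mean(data), statistics.stdev(data)
```

Each new aggregate would first be compared against `expected(ts)` and then fed into `update(ts, value)`, so the baseline follows gradual drift while old weeks fall out of the window.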

Security and privacy

  • Read‑only views by default; no PHI content stored beyond necessary field statistics unless enabled.

  • Role‑aware: Ops role can view/acknowledge; Admin can change thresholds and retraining cadence.

  • Redaction options for ErrorText; configurable field sampling.

  • Works on‑prem or with customer‑selected LLM/ML endpoints; no production payloads sent externally unless approved.
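
The ErrorText redaction option could start from simple pattern masking, as in the following Python sketch. The three patterns (a US SSN shape, email addresses, and long digit runs treated as identifiers) are illustrative assumptions only. A real deployment would need patterns reviewed against its own data and privacy requirements.

```python
import re

# Hypothetical patterns; order matters because the SSN shape must be
# masked before the generic long-digit rule is applied.
_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{6,}\b"), "[ID]"),
]


def redact(text: str) -> str:
    """Replace likely identifiers in an error string with placeholder tokens."""
    for pattern, token in _PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Applying `redact` before an error string is written to OBS.Event would keep the raw text out of the analytics schema while preserving enough structure for grouping similar errors.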

MVP acceptance criteria

  • OBS.Event and OBS.Aggregates populated from Ens.EventLog and Ens.MessageHeader across at least one production.

  • Anomaly Timeline showing predicted spikes/drops with confidence and deep links to Visual Trace.

  • Data Quality Panel for HL7 messages using EnsLib.HL7.SearchTable indices (missing required, unknown codes).

  • IntegratedML model that forecasts next‑hour error rate or volume per Service and exposes PREDICT() in SQL.

  • Alerting to Ens.Alert with severity and suggested next step.

  • Dashboard performance: <2s for last 24h view; <5s for 7‑day view.
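
For the “missing required” part of the Data Quality Panel, the check could look roughly like the Python sketch below, which works on raw pipe‑delimited HL7 v2 text. It is a simplified stand-in: the proposal itself relies on EnsLib.HL7.SearchTable indices, and the helper names and the `REQUIRED` list here are hypothetical.

```python
# Required fields to check, as (segment, field number); illustrative only.
REQUIRED = [("MSH", 12), ("PID", 3)]


def parse_segments(message: str) -> dict:
    """Split an HL7 v2 message into {segment_id: fields}, first occurrence only."""
    segments = {}
    for line in message.replace("\n", "\r").split("\r"):
        if line:
            fields = line.split("|")
            segments.setdefault(fields[0], fields)
    return segments


def field(segments: dict, seg_id: str, index: int) -> str:
    """Return field `index` of a segment, or '' if the segment or field is absent."""
    fields = segments.get(seg_id)
    if fields is None:
        return ""
    # MSH-1 is the field separator itself, so MSH-n sits at list position n-1.
    pos = index - 1 if seg_id == "MSH" else index
    return fields[pos] if pos < len(fields) else ""


def required_missing_count(message: str) -> int:
    """Count required fields that are empty or absent (RequiredMissingCnt)."""
    segments = parse_segments(message)
    return sum(1 for seg, idx in REQUIRED
               if not field(segments, seg, idx).strip())
```

Summing `required_missing_count` per message type and window would feed the RequiredMissingCnt column of OBS.FieldQuality.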

Example detections

  • “Inbound HL7 ADT volume +180% vs baseline; unknown PID‑3 code system rising; probable bad feed from Host X.”

  • “Operation OP_FHIR posting 429s increasing; forecast SLA breach in 25 minutes; suggest retry/backoff adjustments.”

  • “Queue depth for BS_FileIn accelerating; expected to exceed 5,000 in 40 minutes; recommend temporary throttle and quarantine rule.”
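
A detection like the queue example above could be produced by fitting a trend to recent queue depth samples and extrapolating to the threshold. The Python sketch below uses an ordinary least‑squares line, which is a simplification: a queue that is truly accelerating would breach sooner than a linear fit suggests. The function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept).

    Requires at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_until_threshold(minutes, depths, threshold):
    """Estimate minutes from the last sample until depth reaches threshold.

    Returns None if the fitted trend is flat or falling.
    """
    slope, intercept = fit_line(minutes, depths)
    if slope <= 0:
        return None
    current = slope * minutes[-1] + intercept
    if current >= threshold:
        return 0.0
    return (threshold - current) / slope
```

With samples taken every 5 minutes rising from 2,000 to 3,000 messages, the estimate for a 5,000 threshold is 40 minutes, matching the shape of the example message.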

Success metrics

  • 50% faster time‑to‑detect issues compared to manual monitoring.

  • 30% reduction in after‑hours incidents caused by unnoticed data/traffic anomalies.

  • Measurable decrease in message reprocessing due to early quarantine of bad data.

  • ADMIN RESPONSE
    Nov 24, 2025

    Thank you for submitting the idea. The status has been changed to "Needs review".

    Stay tuned!