Remediate

This phase demonstrates autonomous incident response: a model-related change introduces drift, Dynatrace detects impact, and Event-Driven Ansible restores a healthy state without manual intervention.

Objectives

Enable a feature flag that changes the AI embedding model.
Observe impact without taking down the application.
Validate Dynatrace detection of model drift and resulting problem state.
Trigger EDA-driven remediation that restores known-good behavior.

Step 1: Introduce Feature Flag Change

AI workloads introduce resiliency challenges beyond traditional uptime and error-rate monitoring. A service can stay online and still deliver poor outcomes because of AI-specific failure modes, including hallucinations, prompt sensitivity, retrieval failures, and drift.

What Is Drift and Why It Matters

In this lab, focus on drift: a gradual or sudden change that makes model outputs less reliable for the same business task.

Common causes include:

Changes in user prompts, language patterns, or data distributions over time
Changes in model versions or model parameters
Changes in retrieval pipelines, ranking behavior, or document content
Embedding model changes that are not synchronized with indexed vectors

Business impact of drift can be significant even when infrastructure is healthy:

Lower answer relevance and trust, leading to poor user experience
More escalation and rework for support and operations teams
Increased token spend and latency from repeated or corrective prompts
Compliance and brand risk when responses become inaccurate or inconsistent

How a Vector Database and Embedding Model Work Together

A vector database stores numeric vector representations of text (embeddings) so that passages with similar meaning can be retrieved quickly.

Typical flow:

Source documents are chunked into smaller passages
An embedding model converts each chunk into a high-dimensional vector
Those vectors are indexed in the vector database
At query time, the user prompt is embedded with the same embedding model
The database returns nearest-neighbor passages for RAG context
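
The flow above can be sketched in a few lines of Python. This is a minimal illustration only: the bag-of-words `embed` function stands in for a real embedding model, and `ToyVectorIndex` is a hypothetical in-memory stand-in for a vector database such as Weaviate, not its actual API.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 12) -> list[str]:
    """Split a document into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding (assumption): a sparse word-count vector, not a real model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory index: stores (vector, passage) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, passage: str) -> None:
        self.items.append((embed(passage), passage))

    def search(self, query: str, k: int = 1) -> list[str]:
        # The query is embedded with the SAME embed() used at indexing time.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine_similarity(q, item[0]), reverse=True)
        return [passage for _, passage in ranked[:k]]


docs = [
    "Boston is known for the Freedom Trail and Fenway Park.",
    "Paris is known for the Eiffel Tower and the Louvre museum.",
]
index = ToyVectorIndex()
for doc in docs:
    for passage in chunk(doc):
        index.add(passage)

top = index.search("What should I see in Boston?", k=1)
print(top)
```

The returned passage is what a RAG pipeline would hand to the LLM as grounding context.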

AI Drift Conceptual Image

AI drift conceptual image: generated by AI, perhaps with a little drift and a few hallucinations.

Why Embedding Changes Can Cause Poor LLM Performance

Embedding spaces are model-specific. If stored document vectors were created with one embedding model, but live queries are embedded with a different model, similarity search quality can degrade sharply. The query and document vectors are no longer in the same semantic space, so retrieval returns weak or irrelevant context.

When retrieval quality drops, the LLM receives poorer grounding context, which can lead to:

Generic or off-target answers
Higher hallucination risk
Inconsistent quality across repeated prompts
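
The mismatch effect can be reproduced with a toy experiment. In the sketch below, each hypothetical "model" (`model-a`, `model-b`) assigns every word its own pseudo-random vector, which is an assumption made purely for illustration and not how real embedding models work. The point it shows is the same, though: vectors produced by one model are not comparable with vectors produced by another.

```python
import math
import random

DIM = 512  # toy dimensionality, chosen so the random noise stays small


def word_vector(model_name: str, word: str) -> list[float]:
    """Deterministic pseudo-random vector per (model, word) pair."""
    rng = random.Random(f"{model_name}:{word}")  # string seeds are reproducible
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def embed(model_name: str, text: str) -> list[float]:
    """Toy embedding: sum of the word vectors for the given model."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        for i, x in enumerate(word_vector(model_name, word)):
            vec[i] += x
    return vec


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


doc = "boston freedom trail fenway park harbor"
query = "boston freedom trail fenway park"

stored = embed("model-a", doc)  # document indexed with model A

matched = cosine_similarity(embed("model-a", query), stored)     # query embedded with model A
mismatched = cosine_similarity(embed("model-b", query), stored)  # query embedded with model B

print(f"same model: {matched:.2f}, different model: {mismatched:.2f}")
```

With the same model, the query lands close to the stored document. With a different model, the similarity falls to roughly what two unrelated random vectors would score, even though the text has not changed.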

OpenFeature

OpenFeature is an open standard and CNCF incubating project that provides a vendor-neutral API and SDK model for feature flagging, so application code can evaluate flags consistently without being tightly coupled to a single flag provider. Feature flags are now a core operational control for modern systems, enabling safer releases, gradual rollouts, instant rollback, and controlled experimentation. OpenFeature adds portability, reduced lock-in, and engineering consistency: teams can standardize flag behavior across languages and services while remaining free to swap providers, which improves governance and reduces risk when introducing or remediating production changes.
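
The decoupling idea can be shown with a small standard-library sketch. Note that this is not the OpenFeature SDK: `FlagProvider`, `InMemoryProvider`, `FlagClient`, and the `embedding-model` flag key are hypothetical names used only to illustrate how application code evaluates a flag through a stable client while the backing provider stays swappable.

```python
from typing import Protocol


class FlagProvider(Protocol):
    """Hypothetical provider interface: any flag backend can implement it."""

    def resolve_string(self, flag_key: str, default: str) -> str: ...


class InMemoryProvider:
    """Hypothetical provider that keeps flags in a dictionary."""

    def __init__(self, flags: dict[str, str]) -> None:
        self.flags = flags

    def resolve_string(self, flag_key: str, default: str) -> str:
        return self.flags.get(flag_key, default)


class FlagClient:
    """Application-facing client; it never depends on a concrete provider."""

    def __init__(self, provider: FlagProvider) -> None:
        self.provider = provider

    def get_string_value(self, flag_key: str, default: str) -> str:
        try:
            return self.provider.resolve_string(flag_key, default)
        except Exception:
            # Fall back to the default if the provider fails.
            return default


provider = InMemoryProvider({"embedding-model": "nomic-embed-text"})
client = FlagClient(provider)

before = client.get_string_value("embedding-model", "nomic-embed-text")

# Flipping the flag changes behavior at runtime, with no redeploy.
provider.flags["embedding-model"] = "some-other-embedding-model"
after = client.get_string_value("embedding-model", "nomic-embed-text")

print(before, "->", after)
```

Because the application only talks to the client, a flag flip like this one is also easy to reverse, which is what the remediation later in this lab relies on.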

In this step, you will introduce a controlled embedding-model change, via an OpenFeature feature flag, to create observable drift conditions.

As the workshop instructor, in the AAP UI locate the destination-automation-automate-app-ai-embedding job template
Launch the job template and proceed to the Extra Variables prompt
For the new embedding model value, select a model different from the one the application is currently using
Execute the job and confirm the application remains available

Automate App AI Embedding

Step 2: Observe Drift Behavior

Use this step to generate enough live traffic for Dynatrace Intelligence anomaly detection while observing response quality shifts in real time.

In the AI Travel Advisor app, run the same travel advice prompts you used earlier in the lab
For each response, submit feedback:
- Thumbs up for relevant and useful responses
- Thumbs down for generic, incorrect, or low-quality responses
Continue sending prompts for several minutes to create sustained telemetry
Try multiple destinations from the included destination list while using RAG
Switch between RAG and Direct LLM to compare behavior across both approaches

As you test, reflect on what has changed:

Are RAG answers less grounded or less consistent than before?
Do some destinations fail more often than others?
Does Direct LLM work as expected, as it did before?
Do repeated prompts produce unstable quality?

Generic, incorrect, and low-quality responses

Boston Not Found Boston Red Hat Office

Return to Normal

Keep submitting prompts and feedback. After some time, you should observe that RAG responses begin returning to expected quality.

Return to Normal

A few minutes ago the experience was clearly degraded: responses were weak, feedback trended negative, and users were getting frustrated. Now, without manual intervention, responses seem stable and useful again.

Reflect on the following before moving to the next step:

What evidence tells you quality actually recovered (and this is not random luck)?
If no one manually fixed the app in the UI, what mechanisms or automations might have intervened?
Which signals would you expect to see in telemetry during the failure period vs the recovery period?
In a real production incident, how would you explain this "it was broken, now it's fine" moment to stakeholders?

Step 3: Automated Recovery

When Dynatrace observability is connected to Red Hat EDA/AAP, detected problems can trigger trusted automated responses without waiting for manual intervention. This is auto-remediation: detect, decide, and act.

How Event-Driven Ansible Remediates

Event-Driven Ansible (EDA) is a real-time event processing and automation engine that extends Ansible Automation Platform with the ability to listen for events from multiple sources and automatically trigger predefined workflows in response. Rather than waiting for scheduled jobs or manual operator intervention, EDA evaluates incoming events against rulebook conditions as they arrive and executes remediation actions when a condition matches. This enables autonomous incident response, where problems detected by observability platforms such as Dynatrace are remediated without human delay. In this workshop, EDA receives AI degradation events from Dynatrace, matches them against rules, and automatically executes AAP job templates that restore the application to a healthy state.

EDA Rule Audit

In the AAP web interface, navigate to Automation Decisions -> Rule Audit
Find the recent event showing that EDA received an incoming event and matched it to a rule in the rulebook

AAP Remediation Rule Audit

What this means:

Dynatrace detected a problem condition
Dynatrace sent problem event data to Red Hat Event-Driven Ansible
EDA evaluated the event against rulebook logic
EDA/AAP executed the automation for the matching rule

EDA Event Action

Open the event details from the Rule Audit record
Confirm the event source is Dynatrace and review the payload metadata describing the AI Travel Advisor problem
Review the triggered action tied to that event

AAP Remediation Event Detail

You should see that a remediation job template was executed. The specific job is selected because the incoming event payload matched a rulebook condition that points to the desired remediation action.

EDA Rulebook

Each rule pairs a condition, evaluated against the event payload, with the action EDA takes when that condition matches. The rule in this workshop is a very simple example of matching conditions in the JSON payload sent by Dynatrace.


rules:
- name: "Trigger embedding model remediation on Dynatrace problem event"
  condition: |
    event.payload is defined and
    event.payload.eventData['event.category'] == 'ERROR' and
    event.payload.eventData['event.name'] == 'AI Travel Advisor: Embedding Model Drift Detected'
  action:
    run_job_template:
      name: "destination-automation-remediate-app-ai-embedding"
      organization: "destination-automation"
      extra_vars:
        app_ai_embedding_model_requested: "nomic-embed-text"
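
The condition in the rule above can be read as a plain predicate over the event. The Python sketch below mirrors that logic for illustration; it is not how EDA evaluates rulebooks internally, and the sample events contain only the two fields the rule checks (the rest of the real Dynatrace payload is not shown here).

```python
EXPECTED_NAME = "AI Travel Advisor: Embedding Model Drift Detected"


def matches_rule(event: dict) -> bool:
    """Mirror of the rulebook condition: payload defined, category and name match."""
    payload = event.get("payload")
    if payload is None:
        return False  # "event.payload is defined" fails
    data = payload.get("eventData", {})
    return (
        data.get("event.category") == "ERROR"
        and data.get("event.name") == EXPECTED_NAME
    )


# Sample events (assumed shape, reduced to the fields the rule inspects).
drift_event = {
    "payload": {
        "eventData": {
            "event.category": "ERROR",
            "event.name": EXPECTED_NAME,
        }
    }
}
unrelated_event = {
    "payload": {
        "eventData": {
            "event.category": "INFO",
            "event.name": "Deployment finished",
        }
    }
}

print(matches_rule(drift_event))      # True: the remediation job template would run
print(matches_rule(unrelated_event))  # False: the event is ignored by this rule
```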

Open the related job execution output in AAP.
Verify the job changed the OpenFeature embedding-model flag back to the correct original model used to build the Weaviate collection.

AAP Job Execution

This is the operational value of Red Hat EDA/AAP: secure, policy-driven automation that remediates issues across hybrid cloud environments using events from Dynatrace and many other sources.

Step 4: Automated Anomaly Detection

In this step, explore how Dynatrace Intelligence automatically detects AI service problems from infrastructure to response quality.

How Dynatrace Detects Anomalies

Dynatrace Intelligence continuously analyzes telemetry signals to automatically detect deviations from normal behavior. The detection approach combines multiple techniques:

Baseline Deviation

Dynatrace establishes dynamic baselines for key metrics by learning normal behavior patterns over time. When observed values deviate significantly from these baselines—whether up or down—an anomaly is flagged. For AI workloads, this includes application response time, error rate, and AI-specific metrics like vector database query performance and retrieval result quality.
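
As a rough intuition for baseline deviation, the sketch below flags a value that falls more than `k` standard deviations from the mean of recent history. This is a deliberately simplified illustration, not Dynatrace's actual algorithm; the function name, the threshold `k`, and the sample vector distances are all assumptions.

```python
import statistics


def is_anomalous(history: list[float], value: float, k: float = 3.0) -> bool:
    """Flag value if it deviates from the learned baseline by more than k sigma."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > k * stdev


# Hypothetical recent vector retrieval distances under normal conditions.
recent_distances = [0.21, 0.19, 0.22, 0.20, 0.18, 0.21, 0.20, 0.19]

print(is_anomalous(recent_distances, 0.22))  # within normal variation
print(is_anomalous(recent_distances, 0.65))  # far outside the baseline
```

Because the baseline is derived from observed history instead of a fixed number, the same check adapts as "normal" shifts over time.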

Forecasting

Beyond baselining, Dynatrace uses predictive forecasting to anticipate expected metric ranges. This accounts for time-of-day, day-of-week, and seasonal patterns in traffic and performance. Forecasting enables early detection of emerging issues before they become critical failures.
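
To illustrate why seasonality matters, the sketch below uses a seasonal-naive forecast: the expected next value is the mean of past values at the same position in the cycle. This is a generic textbook technique shown under assumed names and sample data, not a description of Dynatrace's forecasting models.

```python
from statistics import fmean


def seasonal_forecast(history: list[float], period: int) -> float:
    """Forecast the next value from past values at the same phase of the cycle."""
    next_index = len(history)
    if next_index < period:
        raise ValueError("need at least one full cycle of history")
    same_phase = history[next_index % period::period]
    return fmean(same_phase)


def outside_expected_range(
    history: list[float], period: int, observed: float, tolerance: float = 0.25
) -> bool:
    """True if observed differs from the forecast by more than tolerance (relative)."""
    expected = seasonal_forecast(history, period)
    return abs(observed - expected) > tolerance * expected


# Two hypothetical cycles of request counts, four intervals per cycle.
requests = [100, 400, 300, 120, 110, 420, 310, 130]

expected_next = seasonal_forecast(requests, period=4)
print(expected_next)
print(outside_expected_range(requests, 4, observed=108))
print(outside_expected_range(requests, 4, observed=400))
```

A value of 400 is routine in the second interval of each cycle but unexpected in the first, which a single global threshold would miss.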

Alerting on Signal Ingest

Problems are detected at signal ingest time—as soon as anomalous telemetry arrives—not after batch processing or scheduled jobs. This immediate detection enables rapid remediation workflows and reduces the window of poor end-user experience.

AI-Specific Metrics

For AI observability, Dynatrace tracks specialized metrics including:

Vector retrieval distance: How far the retrieved document vectors are from the query vector. Larger distances indicate weaker semantic similarity and lower-quality RAG grounding
Query result count: How many relevant documents were found. Zero or very few results indicate retrieval failure
LLM request/response metrics: Token counts, latency, model name, and span-level instrumentation of the generative AI pipeline
User feedback signals: Thumbs-up/down and explicit ratings that correlate with LLM response quality

Together, these signals allow Dynatrace to detect not just infrastructure problems, but AI-specific quality degradation that impacts end users.

From Automatic Detection to Event Orchestration

Open the Dynatrace environment and navigate to the AI Observability app
In the Explorer view, locate the ai-travel-advisor service
Click on the service to open the detail view
Navigate to the Problems tab to see automatically detected issues

You should see a problem card titled AI Travel Advisor: Embedding Model Drift Detected. This problem was automatically created when Dynatrace Intelligence detected the anomaly during your traffic generation in Step 2. Click on the problem card to explore its details. You may see metrics related to Weaviate vector distance and retrieval scores.

Dynatrace Problem Detected

Understanding Vector Metrics: Distance and Quality

Vector distance measures how far apart two embeddings are in the high-dimensional semantic space. Common distance metrics include:

Cosine distance: Defined as 1 minus cosine similarity, it ranges from 0.0 (identical direction) through 1.0 (orthogonal) to 2.0 (opposite direction). For embeddings, lower is better: values near 0.0 indicate strong semantic similarity
Euclidean distance: Geometric distance in the vector space. Lower values indicate closer, more similar vectors
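
Both metrics are short to compute by hand. The sketch below uses tiny two-dimensional vectors as illustrative inputs and defines cosine distance as 1 minus cosine similarity, which is a common convention; check which convention your vector database reports.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: depends only on direction, not on vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points in the vector space."""
    return math.dist(a, b)


v = [1.0, 0.0]

same_direction = cosine_distance(v, [2.0, 0.0])  # 0.0: same direction, length ignored
orthogonal = cosine_distance(v, [0.0, 3.0])      # 1.0: unrelated directions
opposite = cosine_distance(v, [-1.0, 0.0])       # 2.0: opposite directions

straight_line = euclidean_distance([0.0, 0.0], [3.0, 4.0])  # 5.0

print(same_direction, orthogonal, opposite, straight_line)
```

Cosine distance ignores magnitude, so `[1.0, 0.0]` and `[2.0, 0.0]` count as identical, while Euclidean distance would treat them as one unit apart.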

When your query vector is far from stored document vectors:

The vector database returns matches with poorer semantic relevance
The retrieved context is weaker or off-topic for the user's question
The LLM receives less useful grounding, leading to generic or hallucinatory responses

How Distance Signals Quality Issues

In the normal case (using the correct embedding model):

Query vectors and document vectors live in the same semantic space
Vector distances are small and consistent
Retrieved documents are highly relevant for RAG

In the degraded case (embedding model mismatch):

Query and document vectors are in different semantic spaces
Vector distances are large (the query is "far" from all stored documents)
Retrieved documents may be irrelevant or missing entirely
LLM receives poor context and produces poor answers

Dynatrace correlates high vector distances with low user satisfaction signals to create a clear problem statement: "Embedding Model Drift Detected."

Explore the Problem-Triggered Workflow

In the Dynatrace problem detail view, look for a Workflow section or Notifications section
Open the workflow execution that was triggered by this problem detection
Examine the workflow details to see:
The event payload sent by Dynatrace
How the problem data flows to Red Hat Event-Driven Ansible
The orchestration actions executed as a result

Dynatrace Workflow Send to EDA

The workflow represents the out-of-the-box integration between Dynatrace and Red Hat EDA/AAP. When a problem is detected, Dynatrace automatically sends event data—including the problem name, affected service, metrics, and context—directly to EDA. EDA's rulebook engine matches this event payload against rules and triggers appropriate remediation actions, all without manual operator involvement.

The Value of Integrated Observability and Automation

This end-to-end flow demonstrates why modern observability and automation platforms must work together:

Comprehensive AI Monitoring

Dynatrace provides visibility into the complete AI pipeline:

Infrastructure and transactional health (request latency, error rates, resource utilization)
Application performance (code-level instrumentation and spans)
AI workload quality (embeddings, retrieval metrics, model performance, user satisfaction)

Automatic Problem Detection

Dynatrace Intelligence automatically detects anomalies without requiring manual threshold tuning. This is critical for AI workloads where "normal" is harder to define and often changes based on data distributions and user behavior.

Immediate Remediation

When Dynatrace is connected to Red Hat EDA/AAP, detected problems trigger automated remediation workflows as soon as they are raised. Infrastructure or AI issues can be resolved in minutes rather than hours, which reduces Mean Time To Recovery (MTTR) and minimizes user impact.

Enterprise Notifications

Dynatrace's out-of-the-box integrations with ServiceNow, Jira, Slack, PagerDuty, and many others keep operations teams, development teams, and business stakeholders informed of AI service quality issues and remediation actions, enabling coordinated incident response.

Reflection: Your Organization's AI Readiness

Pause here and reflect on your own environment. If your organization is currently introducing AI capabilities, how confident are you that you can detect quality degradations quickly when no infrastructure outage occurs? Many teams can monitor uptime and errors, but struggle to detect semantic failures such as retrieval drift, hallucination trends, or declining user trust before business impact becomes visible.

Consider the operational challenge from end to end: signal collection, anomaly detection, triage, ownership, and remediation execution. How long would it take your teams today to recognize an AI quality issue, identify the likely root cause, and apply a safe correction across platforms? What customer, revenue, compliance, or brand risk accumulates during that detection-to-resolution window?

Use this lab outcome as a benchmark for future-state capabilities. The goal is not only to build AI features, but to build AI resilience: observable quality signals, intelligent detection, event-driven remediation, and clear enterprise communication that together reduce mean time to detect and mean time to recover.

Validation

[ ] Embedding model feature flag was changed via AAP job template without causing an application outage
[ ] RAG response quality degraded while Direct LLM remained unaffected, confirming drift is retrieval-specific
[ ] Dynatrace Intelligence automatically detected and created the AI Travel Advisor: Embedding Model Drift Detected problem
[ ] Dynatrace workflow executed and sent problem event data to Red Hat Event-Driven Ansible
[ ] EDA Rule Audit confirmed the event was received, matched rulebook conditions, and triggered the remediation job template
[ ] Service recovered automatically—RAG response quality returned to expected baseline without manual intervention

Continue to Summarize.