Observe and Automate¶

In this phase, you drive the application with prompts, observe behavior in Dynatrace AI observability views, and automate controlled runtime changes through deployment workflows.

Objectives¶

Exercise the travel advisor with realistic prompts
Observe workload behavior and AI response characteristics
Automate changes to model selection, temperature, and RAG instructions
Compare response quality and runtime signals after each change

Step 1: Generate Baseline AI Traffic¶

Use this step to establish a clean baseline before any tuning or remediation activities.

What You Are Learning

How an AI workload behaves under repeated prompts
How answer quality can change by prompting strategy
How user feedback helps quantify response quality
Why retrieval context (RAG) can improve consistency and trust

Pick Your Locations

Choose office destinations closest to your geography (continent).

Easy default options: Boston and Raleigh
Additional options are available in the repository destinations directory: app/destinations

Run Baseline Prompts with Direct LLM¶

Open the AI Travel Advisor application
Set the approach to Direct LLM
Ask for travel advice for one selected location (for example, Boston)
Repeat the same request 3-5 times with minor wording changes
Repeat steps 3-4 for a second location (for example, Raleigh)
For each response, use the app feedback buttons:
- Thumbs up when the answer is relevant, specific, and useful
- Thumbs down when the answer is generic, inconsistent, or off-topic

Suggested prompt pattern (maximum 50 characters):

I am traveling to <location>. Suggest a 2-day plan.

Repeat Prompts with RAG¶

Switch the approach to RAG
Ask for travel advice for the same location set used in Direct LLM
Repeat each request 3-5 times
Again provide thumbs up/down feedback for every response

Use the same or very similar prompts so your comparison remains fair.

Record Baseline Observations¶

Capture quick notes for each approach:

Response relevance to requested destination
Specificity (concrete suggestions vs generic text)
Consistency across repeated prompts
Perceived latency and response completeness
Thumbs up/down ratio

Concept: Direct LLM vs RAG¶

Simple comparison:

Direct LLM: You ask a smart friend a question, and they answer from memory
RAG: You ask the same smart friend, but first they look at a notebook with trusted facts, then answer

Detailed comparison (how it works):

Direct LLM: Prompt input is sent directly to the model, and the model generates output from its parametric memory and reasoning patterns. Quality depends on model capability, prompt quality, and prior training data representation.
RAG: Retrieval-Augmented Generation first queries a knowledge source (often vector search over indexed documents), injects retrieved context into the prompt, and then generates an answer conditioned on that external context. This reduces hallucination risk for domain-specific facts and increases answer grounding.

Comparison and Why RAG Helps

Direct LLM is simpler and faster to start, but can be less grounded for specialized or local context
RAG adds retrieval overhead but often improves factual relevance, repeatability, and explainability
For enterprise AI workloads, RAG is commonly used to improve trust, reduce bad recommendations, and support better cost-quality tradeoffs by using smaller models with stronger context

Step 2: Observe in Dynatrace¶

Use this step to develop observability experience around AI workload behavior and understand how observability is critical to effective understanding of AI quality and cost.

Dynatrace AI Observability: Overview

Dynatrace AI Observability provides unified visibility into AI workload performance, quality, and cost across your enterprise. It captures detailed telemetry from LLM calls, models, retrieval systems, and orchestration frameworks, enabling teams to detect quality degradation, latency bottlenecks, and security risks in near real-time. Use cases include monitoring model performance drift, detecting hallucination, optimizing token usage and cost, and triggering automated remediation on AI-specific failures. The platform integrates with OpenTelemetry and vendor-neutral standards, giving you portability and control over your AI observability data. Value is realized through faster incident detection, improved AI quality through data-driven tuning, and cost efficiency via token and latency optimization.

Reference: Dynatrace AI Observability App Documentation

Dynatrace AI Observability App¶

Log in to your Dynatrace tenant
In the Dynatrace web UI, search for AI Observability or navigate via the Apps menu
Open the AI Observability app

Explore the Overview Tab

The Overview tab displays a unified dashboard across all AI workloads, models, and agents in your environment.

Value of the Overview tab:

See all AI services in a single pane
Monitor aggregate metrics: request volume, average latency, model usage distribution, and error rates
Quickly identify which AI services are under stress or showing anomalies
Spot trends in token consumption, cost, and quality signals

Spend a few minutes reviewing the current metrics and services visible.

Review the Service Health Tab

Key metrics and signals available:

Request volume and latency — Transaction rate and p50/p95/p99 response times.
Model usage — Breakdown of models called, token counts, and cost estimates.
Quality signals — User feedback scores, hallucination detection, and retrieval quality.
Error rate and types — API failures, timeouts, rate limiting, and model-specific errors.
Infrastructure metrics — Memory, CPU, and network utilization of the AI service.

Review a few key metrics and understand what each tells you about service health.

Investigate the Explorer Tab

This is where you analyze individual AI services in detail.

In the service list, select the ai-travel-advisor service.
Once selected, you can now analyze:
- Health dashboard — High-level status and key metrics for this service.
- Prompts — Individual prompts sent to the service over time.
- Responses — Corresponding responses and quality assessments.
- Vulnerabilities — Security and safety issues detected (e.g., prompt injection, jailbreak attempts).
- Traces and performance — End-to-end call traces with detailed timing.

OpenLLMetry

OpenLLMetry is an open observability approach for LLM applications that captures prompt, response, model, token, latency, and retrieval metadata using OpenTelemetry-aligned telemetry. This turns AI behavior from a black box into measurable signals you can analyze for quality, performance, and cost. With this visibility, teams can detect drift, troubleshoot poor responses, compare model/runtime changes, and make evidence-based tuning decisions. In this workshop, OpenLLMetry-style instrumentation is exported through OpenTelemetry and ingested by Dynatrace AI Observability, where traces, spans, and AI metadata are correlated with infrastructure and application health for end-to-end analysis.

Analyze a Prompt Trace¶

In the Explorer tab for ai-travel-advisor, find the Prompts section.
Select one of your recent prompts (ideally from the baseline traffic you generated in Step 1).
Review the information captured:
- Prompt text
- Response text
- Model used
- Tokens consumed
- User feedback (if you provided it)
- Latency breakdown
Click Open Distributed Trace to view the full call flow.

Prompt Analysis

Explore the Distributed Trace

In the distributed trace view, examine:

Trace structure — How the prompt flows through the system (app → LLM → retrieval → response).
Spans — Individual operations: prompt preprocessing, vector search, model inference, response post-processing.
Latency contribution — Which spans consume the most time?
Metadata — Model name, temperature, top_k, embeddings used, retrieved documents, token counts.
Logs — Any errors, warnings, or structured logs tied to the trace.

Expand each span to understand its purpose and performance characteristics.

Gen AI Attributes

Gen AI Semantics

The OpenTelemetry GenAI semantic conventions define a standardized vocabulary for AI telemetry so traces, metrics, and logs consistently describe prompts, responses, model identity, token usage, latency, and retrieval context across tools and vendors. OpenLLMetry builds on these conventions with practical instrumentation patterns for LLM applications, making AI behavior portable and comparable while allowing platforms like Dynatrace to correlate AI quality, performance, and cost end to end.

Value of End-to-End Visibility¶

Being able to see the complete flow from user prompt to response—and the detailed breakdown of each step—enables:

Performance optimization — Identify latency bottlenecks (e.g., slow retrieval, high model latency).
Quality debugging — Trace back a poor response to the exact model call, retrieval results, and parameters used.
Cost attribution — Understand which operations consume tokens and where cost can be optimized.
Compliance and security — Audit what data was sent, retrieved, and returned for sensitivity analysis.

Reflection

Use these questions to deepen your thinking about what you've observed:

Performance:

Which span took the longest during your trace? Was it the LLM inference, retrieval, or data processing?
If your prompt took longer than expected, what would you hypothesize is the cause?

Architecture:

How many hops did your request take (app → retriever → model → app)? Would you design it differently?
Did you observe any parallel processing, or was everything sequential?

Metadata and Quality:

What model was used for your prompt? Would a different model have been better?
How many tokens were consumed? Was it more or less than you expected?
If you used RAG, how many documents were retrieved? Were they relevant?

Observability:

What information in the trace would be most useful if you were debugging a customer complaint?
What metrics would you add to this trace if you were responsible for cost control?

Step 3: Automate with Red Hat¶

Use this step to operationalize AI model tuning through enterprise automation, seeing how changes are deployed safely, repeatably, and with full audit trails.

Red Hat AI Automation¶

How Red Hat Ansible Automation Platform automates AI, app, infra runtime configuration changes
How infrastructure-as-code practices apply to AI workload management
How observability and automation work together to validate changes
Why governance, repeatability, and auditability matter for AI in production

Red Hat Ansible Automation Platform: Overview

Red Hat Ansible Automation Platform (AAP) provides enterprise-grade automation and orchestration for infrastructure, applications, and now AI workloads. It enables teams to define infrastructure and application configurations as code, version-control them, and execute them reliably across hybrid cloud and on-premises environments. When combined with Red Hat OpenShift for containerized AI services, AAP provides seamless orchestration of both compute and configuration. Red Hat's AI portfolio—including integration with IBM's watsonx and partnerships across LLM providers—extends AAP's capabilities into generative AI workflows, model serving, and governance. Value is realized through reduced manual toil, consistent deployments, lower error rates, compliance auditing, and the ability to scale operations globally without increasing headcount.

Access the AAP Web Interface

For Instructors:

Open the Red Hat AAP web interface in your browser
Log in using your instructor credentials
You have full permissions to execute and modify job templates

For Participants:

Open the Red Hat AAP web interface in your browser
Log in using your participant credentials (provided by your instructor)
You have read-only access and can view job status and outputs, but cannot directly launch templates. Your instructor will launch templates on your behalf.

Change the AI Runtime¶

Locate and Review the AI Runtime Workflow Template

Domains

Find your templates faster by filtering on Domains. Try using the App Domain. Then use the Search function to further narrow your results.

In the AAP web UI, navigate to Automation Execution → Templates
Find and select the destination-automation-workflow-app-ai-runtime template
Review the template configuration:
- Workflow design — A directed graph of job template steps
- Job templates involved — Which playbooks will be executed in order
- Survey/prompts — Input parameters that can be changed at launch time

Workflow Explanation

The destination-automation-workflow-app-ai-runtime workflow automates the deployment of the AI Travel Advisor application with configurable runtime parameters:

Stage 1: Validate — Pre-flight checks ensure target infrastructure is ready
Stage 2: Configure — Sets the environment variables for the app's model, temperature, and rag_instructions
Stage 3: Deploy — Deploy the app container with provided configuration values
Stage 4: Verify — Health checks confirm the new deployment is responsive and operational

Key parameters you can customize when launching the job via extra variables:

model — LLM model name (e.g., gemma3:1b-it-qat, orca-mini:3b)
temperature — Model creativity/randomness (0.0 = deterministic, 2.0 = very creative)
rag_instructions — Custom instructions for the RAG retrieval system

Execute the Workflow with New Parameters (Instructor Only)

Click Launch on the destination-automation-workflow-app-ai-runtime template.
A survey form appears with input fields for:
- model
- temperature
- rag_instructions
Enter new values:
- Example model change: Switch from qwen2.5:0.5b-instruct-q4_K_M to gemma3:1b-it-qat
- Example temperature change: Set to 0.2 (more deterministic) instead of 0.7
- Example RAG instruction change: Add a note like "- Emphasize things to do at the office location"
Click Next → Launch.

Automate App AI Runtime

The workflow job begins executing. You can monitor progress in real-time:

Status bar shows which stage is running.
Logs display output from each step.
Details tab shows timestamps and any errors.

Inspect Job Outputs and Validate Deployment

While the workflow executes, observe:

Stage output — Logs from each deployment step (stops, builds, deploys, verifies).
Deployment summary — Confirmation that the app was deployed with the new parameters.
Health checks — Verification that the application responded to health probes.

Re-run Prompts and Determine Results

Now return to the AI Travel Advisor application and re-run the same prompts from Step 1 using the newly configured model and temperature.

Access the AI Travel Advisor interface.
Submit the same prompts you used in Step 1 baseline.
Repeat 3-5 times per location, same as before.
Provide thumbs up/down feedback for each response.

Compare these new results against your Step 1 baseline notes:

Is the model more deterministic or more creative?
Are responses faster or slower?
Is answer consistency better or worse?
Did the temperature/model change impact relevance for your use case?

The Value of AAP for AI Operations¶

Red Hat Ansible Automation Platform brings enterprise discipline to AI workload management:

Repeatability — The same configuration deployed the same way every time, reducing human error
Auditability — Every deployment is logged with who launched it, what parameters were used, and the outcome
Security — Credentials and secrets are vaulted; sensitive parameters are never logged in plain text
Governance — Workflows can enforce approval gates, compliance checks, and rollback procedures
Scale — Deploy to one AI service or thousand services with identical repeatability and governance
Integration — AAP workflows orchestrate infrastructure, applications, and now AI runtime configurations, creating unified operations across the entire stack (OpenShift AI, RHEL AI, IBM WatsonX, vLLM, and more)

Without automation, scaling AI operations requires proportional growth in operational staff. With AAP, your team manages exponentially more AI services through codified, versioned, audited workflows.

Step 4: Compare AI Results with Dynatrace¶

Use this step to compare how the updated AI runtime behaves and determine whether the change improved quality, performance, and efficiency.

Comparing Model Behavior

Compare AI runtime behavior using real telemetry instead of opinion alone
Identify performance differences between models and prompt strategies
Evaluate token efficiency and cost across runtime variants
Understand why model comparison must include both technical and user-facing outcomes

Compare Prompts in the Explorer View¶

Open the AI Observability app in Dynatrace
Navigate to the Explorer tab
Select the ai-travel-advisor service
Find a prompt that uses the new model you configured in Step 3
Open the prompt details and review:
- Prompt text
- Response text
- Model metadata
- Temperature and runtime settings
- Token usage
- User feedback

From the selected prompt, open the distributed trace.

How to Analyze the Trace

When reviewing the distributed trace, focus on the following:

Total response time — Did the full request complete faster or slower than before?
Span timing — Which span now dominates latency: retrieval, inference, prompt assembly, or post-processing?
Metadata differences — Compare model, temperature, RAG instructions, and token counts.
Logs and events — Look for warnings, retries, or model-related anomalies.
Functional outcome — Determine whether better performance came at the cost of worse answer quality, or vice versa.

Compare Using the AI Model Versioning and A/B Testing Dashboard¶

In the context of AI models, A/B testing is crucial for assessing the impact of changes to training dataset, vector databases, algorithms, features, or configurations on key performance metrics, such as accuracy, user engagement, or revenue. By exposing different segments of users to each variation and analyzing their interactions, A/B testing provides data-driven insights into which version of specific models delivers better outcomes. This method ensures that updates to AI models are grounded in measurable improvements, minimizing risks and optimizing results in real-world applications.

In Dynatrace, open the Dashboards app.
Open the AI Model Versioning and A/B Testing dashboard.
At the top of the dashboard, set the comparison variables for:
- Provider A + Model A
- Provider B + Model B

Use these selectors to compare the original model/runtime against the new one you deployed in Step 3.

AI Model Comparison

Data Sampling

Data Sampling means the dashboard may analyze a representative subset of total requests instead of every single request. This helps keep dashboards fast and responsive while still preserving statistically useful comparisons. Sampling is especially important when request volume is high or when dashboards are comparing multiple services, models, or time windows. A sampled view is still valuable for trends and relative comparisons, but you should remember that it may not represent every outlier. For deep incident analysis, always drill into the full traces for the specific prompts that matter most.

Use the dashboard to compare the two model/runtime variants and identify which performs better.

Review these key indicators for each model:

Response time — Which model responds faster?
Token consumption — Which model uses fewer tokens for similar prompts?
Estimated cost — Which runtime is cheaper for the same user task?
Request volume — Is one runtime seeing more usage than the other?
Feedback quality — Which runtime is receiving better user feedback?

Compare Responses for the Same Prompt

Choose the same or very similar prompts across both model versions and compare:

Relevance of the travel advice
Specificity and grounding of the recommendations
Consistency across repeated requests
Verbosity and clarity
Hallucination risk or unsupported statements

Try to determine whether the newer runtime is truly better, or simply different.

Definitive Data for Improvement¶

Building and changing models, or their ecosystem, depends on observability. Blindly progressing without comparing definitive results leads to regression, poor responses, frustrated users, and missed business outcomes.

Use these questions to guide your analysis:

Performance and Cost:

Which runtime delivered the best balance of latency and token efficiency?
Did the lower-latency option also reduce cost, or was there a tradeoff?

Quality and Functionality:

Which model produced more useful travel advice for the same prompt?
Did the more expensive or slower runtime actually provide better answers?

Operational Decision-Making:

If you were choosing a production default, which runtime would you select and why?
Would your decision change based on region, cost constraints, or expected traffic volume?

Why This Comparison Matters

Dynatrace makes it possible to compare AI runtime inputs, outputs, performance, functionality, cost, and user feedback across the full delivery lifecycle. That means you can move from guesswork to evidence-based AI operations: compare versions, validate improvements, quantify regressions, and justify production changes using real telemetry. This is especially important in enterprise environments where model choice affects not only user experience, but also security posture, compliance, infrastructure consumption, and operating cost. With this level of visibility, teams can optimize AI services continuously instead of treating them as opaque black boxes.

Validation¶

[ ] Baseline prompts were run in both Direct LLM and RAG modes with thumbs up/down feedback captured
[ ] Baseline notes include quality and behavior signals (relevance, specificity, consistency, latency, and feedback ratio)
[ ] An AI Runtime workflow execution in AAP changed model, temperature, and/or rag_instructions, and the job history was reviewed
[ ] The same prompts were re-run after the runtime change, and post-change observations were documented
[ ] Dynatrace Explorer traces and the AI Model Versioning and A/B Testing dashboard were used to compare performance, quality, and cost across runtime variants

Continue to Remediate.