Data Job Failure Intelligence
AI-powered diagnostics for data job failures on Amazon EMR — automated root-cause analysis with log intelligence, PII redaction, and AWS Bedrock (Claude Sonnet).
1. Executive Summary
Data Job Failure Intelligence is an AI-powered, event-driven microservice that automatically diagnoses Amazon EMR step failures. When EMR emits a step failure event, the agent is triggered on AWS Lambda, gathers relevant cluster configuration and log data, applies intelligent log extraction to isolate the root cause, and then invokes AWS Bedrock (Claude Sonnet 4.6) to produce a structured diagnostic report.
The agent reduces mean time to resolution (MTTR) for EMR failures by eliminating manual log inspection. It surfaces actionable root cause analysis, log evidence, and targeted recommendations — all while enforcing strict data governance through built-in guardrails and PII redaction.
2. Goals and Non-Goals
2.1 Goals
- Automatically trigger root cause analysis on every EMR step failure event without human intervention.
- Retrieve and consolidate EMR cluster config, Spark, YARN, and Hive settings alongside step logs.
- Efficiently extract only failure-relevant log lines, minimising token usage and LLM cost.
- Redact all sensitive data (credentials, PII, internal endpoints) before sending logs to Bedrock.
- Return a structured diagnostic payload: root cause, log evidence, up to three recommendations, and failure category.
- Enforce Bedrock guardrails to prevent sensitive data leakage in LLM input and output.
- Expose all critical parameters (LLM model, token budget, log size cap, API endpoints) as externalised configuration.
2.2 Non-Goals
- The agent does not auto-remediate or restart EMR steps.
- The agent does not store analysis results; persistence is the responsibility of the calling system.
- The agent does not support non-EMR compute platforms (Glue, HDInsight, Databricks) in this version.
- Real-time streaming log analysis is out of scope; the agent operates on completed step logs.
3. System Context & Integration Points
Data Job Failure Intelligence sits within the broader data platform as a reactive diagnostic service.
| External Trigger | Agent Boundary | Downstream Consumers |
|---|---|---|
| Amazon EMR (EventBridge step failure event) | AWS Lambda (this service) | Operations / On-call team (Slack, PagerDuty) |
| EMR Config & Metadata API | AWS Bedrock (Claude Sonnet 4.6) | Incident Management System |
| S3 — Step & YARN Logs | LangChain orchestration layer | Data Platform Dashboard |
| AWS Secrets Manager (config) | Bedrock Guardrails | Audit / Compliance Log |
4. High-Level Architecture
The agent follows a linear pipeline architecture with five distinct stages. Each stage is independently testable and configuration-driven. The pipeline is implemented as a LangChain chain and executes entirely within a single Lambda invocation.
EventBridge (Step Failure)
│
▼
┌─────────────────┐
│ Lambda Handler │ ← Stage 1: Event Ingestion
└────────┬────────┘
│
▼
┌─────────────────────────┐
│ Config & Metadata Fetch │ ← Stage 2: API call for cluster config, log path, Spark/YARN/Hive
└────────────┬────────────┘
│
▼
┌──────────────────────────────┐
│ Log Extraction & Sanitisation│ ← Stage 3: Download stderr, extract errors, redact PII, trim
└────────────┬─────────────────┘
│
▼
┌──────────────────────────┐
│ LLM Analysis (Bedrock) │ ← Stage 4: Prompt assembly, Guardrails, claude-sonnet-4-6
└────────────┬─────────────┘
│
▼
┌──────────────────────┐
│ Response Structuring │ ← Stage 5: Parse JSON, validate schema, return payload
└──────────────────────┘
Pipeline Stage Summary
| # | Stage | Description | Output Artifact |
|---|---|---|---|
| 1 | Event Ingestion | EventBridge delivers step failure event to Lambda. Payload carries cluster-id, step-id, and failure timestamp. | Parsed event context |
| 2 | Config & Metadata Fetch | Calls internal metadata API to retrieve EMR cluster config, S3 log path, Spark, YARN, and Hive configuration. | Enriched context object |
| 3 | Log Extraction & Sanitisation | Downloads stderr from S3. Applies regex extraction to isolate error traces. If container IDs present, fetches YARN logs. Redacts PII and secrets. Trims to max_log_size. |
Sanitised log snippet |
| 4 | LLM Analysis (Bedrock) | Assembles system + user prompts with sanitised logs. Invokes Bedrock with claude-sonnet-4-6. Guardrails applied on input and output. | Structured LLM response |
| 5 | Response Structuring | Parses LLM JSON output into root_cause, log_evidences, recommendations (max 3), and category fields. Returns to caller. |
Final diagnostic payload |
5. Component Design
5.1 Event Ingestion — Lambda Handler
The Lambda entry point deserialises the EventBridge payload and validates required fields (clusterId, stepId, region). It initialises the LangChain pipeline, invokes it, and serialises the result for the caller.
- Runtime: Python 3.12
- Trigger: Amazon EventBridge Rule — source
aws.emr, detail-typeEMR Step Status Change, stateFAILED - Memory: configurable (default 512 MB); timeout: configurable (default 300 s)
- IAM Role permissions:
s3:GetObjecton log bucket,bedrock:InvokeModelon Bedrock endpoint,secretsmanager:GetSecretValuefor config,logs:CreateLogGroup/PutLogEvents
5.2 Config & Metadata Fetch
A dedicated LangChain tool wraps the internal metadata REST API. The tool is invoked as the first chain step. The response is parsed and normalised into a standard context schema consumed by downstream stages.
| Attribute Fetched | Usage |
|---|---|
| EMR Cluster Configuration | Cluster type, release, instance types — provides execution context to LLM |
| Step Logs S3 Location | Base path for stderr / stdout log download |
| Spark Configuration | spark.executor.memory, spark.sql.shuffle.partitions, etc. — context for OOM / shuffle diagnosis |
| YARN Configuration | yarn.nodemanager.resource.memory-mb, scheduler settings — context for resource failures |
| Hive Configuration | Metastore URI, execution engine — context for Hive query failures |
All API endpoint URLs and authentication headers are externalised via configuration.
5.3 Log Extraction & Sanitisation
This is the core value-delivery component of the pipeline. Its primary objective is to maximise diagnostic signal while minimising token consumption.
5.3.1 Stderr Log Download
- The agent constructs the S3 key from the log path returned by the metadata API.
boto3S3GetObjectis used; streaming is applied for large files to avoid Lambda memory exhaustion.- Total ingested log size is capped at
max_log_size_bytes(configurable).
5.3.2 Intelligent Error Extraction
Rather than forwarding raw log files to the LLM, the agent applies a multi-pass extraction strategy:
- Pass 1 — Exception Anchor: Locate lines containing exception keywords (
Exception,Error,FAILED,Caused by,OOM,OutOfMemory,FileNotFoundException, etc.) using compiled regex patterns. - Pass 2 — Context Window: Include N lines before and after each anchor (configurable, default 10 lines) to preserve stack-trace context.
- Pass 3 — Deduplication: Remove duplicate stack frames using a hash-based
seenset. - Pass 4 — Container Log Resolution: If container IDs (
container_XXXXXXXX) are present in stderr, fetch corresponding YARN container logs from S3 and apply the same extraction logic. - Pass 5 — Truncation: Hard-trim the final log snippet to
max_log_size_bytesto enforce token budget.
5.3.3 PII & Secret Redaction
All log content passes through a regex-based redaction pipeline before leaving the log extraction stage.
| Data Type | Pattern | Replacement |
|---|---|---|
| AWS Access Key ID | AKIA[0-9A-Z]{16} |
[REDACTED-AWS-KEY] |
| AWS Secret Key | (?i)secret.*key.*=\S+ |
[REDACTED-SECRET] |
| IP Addresses | \b(?:\d{1,3}\.){3}\d{1,3}\b |
[REDACTED-IP] |
| Email Addresses | \b[\w.+-]+@[\w-]+\.[a-z]{2,}\b |
[REDACTED-EMAIL] |
| JDBC / DB Passwords | (?i)password\s*=\s*\S+ |
[REDACTED-PASSWORD] |
| S3 Bucket Names | s3://[a-z0-9\-./]+ |
[REDACTED-S3-PATH] |
| Auth Tokens / Bearer | (?i)bearer\s+[A-Za-z0-9._\-]+ |
[REDACTED-TOKEN] |
The redaction regex set is extensible via configuration. Redaction is applied prior to any LLM call and any response logging.
5.4 LLM Analysis via AWS Bedrock
5.4.1 Prompt Architecture
The agent constructs a two-part prompt for each invocation:
System Prompt (static, version-controlled)
You are an expert data engineering assistant specialising in Amazon EMR, Apache Spark,
YARN, and Hive diagnostics. Your task is to analyse the provided log excerpt and return
a structured JSON diagnosis. Respond ONLY with a valid JSON object. Do not include
markdown or explanatory prose outside the JSON. Never include sensitive data such as
credentials, IP addresses, or internal hostnames in your response.
User Prompt (dynamic, per-invocation)
Cluster ID: {cluster_id} | Step ID: {step_id} | Failure Time: {timestamp}
Cluster Config: {cluster_config_summary}
Spark Config: {spark_config_summary}
YARN Config: {yarn_config_summary}
Log Excerpt:
{sanitised_log_snippet}
Return JSON with fields: root_cause (string), log_evidences (array of strings),
recommendations (array, max 3), category (string).
5.4.2 Bedrock Invocation Parameters
| Parameter | Value |
|---|---|
| Model ID | configurable (default: anthropic.claude-sonnet-4-6-20251101-v1:0) |
max_tokens |
configurable (default: 1024) |
temperature |
0 (deterministic output for consistent diagnostics) |
| Orchestration | LangChain ChatBedrock wrapper with ChatPromptTemplate |
| Guardrail | guardrailIdentifier + guardrailVersion passed on every invocation |
5.5 Bedrock Guardrails
AWS Bedrock Guardrails provide a second line of defence in addition to pre-processing redaction:
- Sensitive Information Filters: Block or mask AWS credential patterns and PII types (
NAME,EMAIL,PHONE,ADDRESS,US_SSN,AWS_ACCESS_KEY,AWS_SECRET_KEY) on both input and output. - Denied Topics: Block topics unrelated to EMR diagnostics (e.g. general code generation, personal advice) to prevent prompt injection exploitation.
- Content Filters: Block harmful content categories (
HATE,INSULTS,SEXUAL,VIOLENCE) atHIGHthreshold on output. - Word Filters: Block internal hostname patterns and custom terms specified in configuration.
6. Diagnostic Response Schema
The agent parses the LLM JSON output and validates it against a Pydantic schema before returning to the caller.
| Field | Type | Description |
|---|---|---|
root_cause |
string |
Concise, human-readable statement of the identified root cause. E.g., OOM in executor during shuffle stage. |
log_evidences |
string[] |
Array of exact log lines or paraphrased evidence snippets that support the root cause. |
recommendations |
string[] |
Ordered list of up to 3 actionable remediation steps, most impactful first. |
category |
string |
Failure taxonomy label. Enum: RESOURCE_EXHAUSTION | DATA_QUALITY | CONFIGURATION_ERROR | DEPENDENCY_FAILURE | PERMISSION_DENIED | UNKNOWN |
cluster_id |
string |
Echo of input cluster ID for traceability. |
step_id |
string |
Echo of input step ID for traceability. |
analysis_timestamp |
ISO 8601 | UTC timestamp of analysis completion. |
tokens_used |
integer |
Total prompt + completion tokens consumed; logged for cost monitoring. |
Example response:
{
"root_cause": "Executor out-of-memory error during shuffle write phase caused by insufficient spark.executor.memory.",
"log_evidences": [
"java.lang.OutOfMemoryError: GC overhead limit exceeded",
"ExecutorLostFailure: Executor 4 exited caused by one of the running tasks"
],
"recommendations": [
"Increase spark.executor.memory from 4g to 8g in the step configuration.",
"Enable spark.memory.offHeap.enabled=true and allocate off-heap memory for shuffle buffers.",
"Reduce spark.sql.shuffle.partitions to lower per-partition memory pressure."
],
"category": "RESOURCE_EXHAUSTION",
"cluster_id": "j-ABC123XYZ",
"step_id": "s-DEFGH456",
"analysis_timestamp": "2026-05-20T10:34:21Z",
"tokens_used": 847
}
7. Configuration Reference
All runtime parameters are externalised. The agent resolves configuration from AWS Secrets Manager (for secrets) and environment variables (for non-secret values). No values are hardcoded in the Lambda package.
| Parameter | Source | Default | Description |
|---|---|---|---|
BEDROCK_MODEL_ID |
Env Var | claude-sonnet-4-6 |
Bedrock model identifier |
BEDROCK_MAX_TOKENS |
Env Var | 1024 |
Maximum LLM completion tokens |
BEDROCK_GUARDRAIL_ID |
Env Var | required | Bedrock guardrail resource ID |
MAX_LOG_SIZE_BYTES |
Env Var | 32768 |
Hard cap on log snippet size (32 KB) |
LOG_CONTEXT_LINES |
Env Var | 10 |
Lines before/after each error anchor |
METADATA_API_URL |
Env Var | required | Internal EMR metadata API base URL |
METADATA_API_KEY |
Secrets Manager | required | API authentication key |
S3_LOG_BUCKET |
Env Var | required | S3 bucket for EMR step logs |
LAMBDA_TIMEOUT_SECONDS |
Env Var | 300 |
Lambda function timeout |
LAMBDA_MEMORY_MB |
Env Var | 512 |
Lambda function memory allocation |
REDACTION_EXTRA_PATTERNS |
Env Var | [] |
JSON array of extra regex redaction patterns |
YARN_LOG_FETCH_ENABLED |
Env Var | true |
Toggle YARN container log resolution |
AWS_REGION |
Env Var | us-east-1 |
AWS region for Bedrock and S3 calls |
8. Data Flow & Security Boundaries
[1] EventBridge → Lambda Handler (raw event — no log data)
[2] Lambda → Metadata API (HTTPS, API key) → Cluster context
[3] Lambda → S3 (IAM role) → Raw stderr logs
[4] Log Extractor → PII/Secret Redaction ⟹ Sanitised snippet
▲ SECURITY BOUNDARY 1
[5] Sanitised snippet → Bedrock Guardrails (input scan)
▲ SECURITY BOUNDARY 2
[6] Bedrock → LLM (claude-sonnet-4-6) → Raw LLM output
[7] Raw LLM output → Bedrock Guardrails (output scan)
▲ SECURITY BOUNDARY 3
[8] Response Structurer → Validated JSON payload → Caller
Security Checkpoints
| Boundary | What is checked | Action on violation |
|---|---|---|
| Boundary 1 — Pre-processing redaction | PII, secrets, IP addresses, credentials in raw logs | Replace with [REDACTED-*] placeholder |
| Boundary 2 — Guardrail input scan | Sensitive info, denied topics, injected instructions | Block invocation; return GUARDRAIL_BLOCKED |
| Boundary 3 — Guardrail output scan | PII leakage, harmful content, off-topic responses | Discard response; return GUARDRAIL_OUTPUT_BLOCKED |
9. Error Handling & Resilience
| Failure Scenario | Behaviour | Fallback |
|---|---|---|
| Metadata API unavailable | Retry 3× with exponential backoff (1 s, 2 s, 4 s) | Raise AnalysisAbortedException; return error payload |
| S3 log file not found | Log warning and continue with empty log snippet | LLM is informed that logs were unavailable |
| Bedrock throttling (429) | Retry 5× with jitter; respect Retry-After header |
Raise after max retries; return error payload |
| Guardrail input violation | Block invocation immediately | Return GUARDRAIL_BLOCKED status to caller |
| Guardrail output violation | Discard LLM response | Return GUARDRAIL_OUTPUT_BLOCKED status |
| LLM returns invalid JSON | Retry once with explicit JSON repair prompt | Return PARSE_ERROR with raw LLM text |
| Lambda timeout approaching | Check remaining time after each stage; abort gracefully at 30 s remaining | Return PARTIAL_ANALYSIS with completed stages |
10. Observability
10.1 Structured Logging
All Lambda log statements use structured JSON logging (python-json-logger). Every log line includes: cluster_id, step_id, stage, duration_ms, and status.
10.2 CloudWatch Custom Metrics
| Metric | Description |
|---|---|
EMRAnalyserInvocations |
Total Lambda invocations |
EMRAnalyserTokensUsed |
Prompt + completion tokens per invocation |
EMRAnalyserGuardrailBlocks |
Guardrail trigger count |
EMRAnalyserPipelineDuration |
End-to-end latency in ms |
EMRAnalyserLogSizeBytes |
Size of extracted log snippet |
10.3 CloudWatch Alarms
- Alert on
GuardrailBlocks > 5per 5 minutes (potential attack or misconfigured rule). - Alert on Lambda error rate
> 5%over 15 minutes. - Alert on
TokensUsedper invocation> 900(approachingmax_tokenscap).
11. Security Considerations
- Least-privilege IAM: Lambda execution role is scoped to specific S3 bucket ARN, specific Bedrock model ARN, and specific Secrets Manager secret ARN.
- No log data is stored in Lambda ephemeral storage (
/tmp) beyond the invocation lifecycle. - CloudWatch Logs group is encrypted with a customer-managed KMS key.
- The metadata API call uses TLS 1.2+ with certificate validation enforced.
- Redaction runs in-process before any data leaves the Lambda execution environment.
- Bedrock guardrail blocks are surfaced as distinct status codes to enable security team auditing.
- Lambda layers containing third-party packages (LangChain, boto3) are pinned to specific versions and scanned by Dependabot.