← Back to notes

Data Job Failure Intelligence

AI-powered diagnostics for data job failures on Amazon EMR — automated root-cause analysis with log intelligence, PII redaction, and AWS Bedrock (Claude Sonnet).

architectureawsemrbedrocklambdalangchain13 min read

1. Executive Summary

Data Job Failure Intelligence is an AI-powered, event-driven microservice that automatically diagnoses Amazon EMR step failures. When EMR emits a step failure event, the agent is triggered on AWS Lambda, gathers relevant cluster configuration and log data, applies intelligent log extraction to isolate the root cause, and then invokes AWS Bedrock (Claude Sonnet 4.6) to produce a structured diagnostic report.

The agent reduces mean time to resolution (MTTR) for EMR failures by eliminating manual log inspection. It surfaces actionable root cause analysis, log evidence, and targeted recommendations — all while enforcing strict data governance through built-in guardrails and PII redaction.


2. Goals and Non-Goals

2.1 Goals

  • Automatically trigger root cause analysis on every EMR step failure event without human intervention.
  • Retrieve and consolidate EMR cluster config, Spark, YARN, and Hive settings alongside step logs.
  • Efficiently extract only failure-relevant log lines, minimising token usage and LLM cost.
  • Redact all sensitive data (credentials, PII, internal endpoints) before sending logs to Bedrock.
  • Return a structured diagnostic payload: root cause, log evidence, up to three recommendations, and failure category.
  • Enforce Bedrock guardrails to prevent sensitive data leakage in LLM input and output.
  • Expose all critical parameters (LLM model, token budget, log size cap, API endpoints) as externalised configuration.

2.2 Non-Goals

  • The agent does not auto-remediate or restart EMR steps.
  • The agent does not store analysis results; persistence is the responsibility of the calling system.
  • The agent does not support non-EMR compute platforms (Glue, HDInsight, Databricks) in this version.
  • Real-time streaming log analysis is out of scope; the agent operates on completed step logs.

3. System Context & Integration Points

Data Job Failure Intelligence sits within the broader data platform as a reactive diagnostic service.

External Trigger Agent Boundary Downstream Consumers
Amazon EMR (EventBridge step failure event) AWS Lambda (this service) Operations / On-call team (Slack, PagerDuty)
EMR Config & Metadata API AWS Bedrock (Claude Sonnet 4.6) Incident Management System
S3 — Step & YARN Logs LangChain orchestration layer Data Platform Dashboard
AWS Secrets Manager (config) Bedrock Guardrails Audit / Compliance Log

4. High-Level Architecture

The agent follows a linear pipeline architecture with five distinct stages. Each stage is independently testable and configuration-driven. The pipeline is implemented as a LangChain chain and executes entirely within a single Lambda invocation.

EventBridge (Step Failure)
        │
        ▼
┌─────────────────┐
│  Lambda Handler │  ← Stage 1: Event Ingestion
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│  Config & Metadata Fetch │  ← Stage 2: API call for cluster config, log path, Spark/YARN/Hive
└────────────┬────────────┘
             │
             ▼
┌──────────────────────────────┐
│  Log Extraction & Sanitisation│  ← Stage 3: Download stderr, extract errors, redact PII, trim
└────────────┬─────────────────┘
             │
             ▼
┌──────────────────────────┐
│  LLM Analysis (Bedrock)   │  ← Stage 4: Prompt assembly, Guardrails, claude-sonnet-4-6
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────┐
│  Response Structuring │  ← Stage 5: Parse JSON, validate schema, return payload
└──────────────────────┘

Pipeline Stage Summary

# Stage Description Output Artifact
1 Event Ingestion EventBridge delivers step failure event to Lambda. Payload carries cluster-id, step-id, and failure timestamp. Parsed event context
2 Config & Metadata Fetch Calls internal metadata API to retrieve EMR cluster config, S3 log path, Spark, YARN, and Hive configuration. Enriched context object
3 Log Extraction & Sanitisation Downloads stderr from S3. Applies regex extraction to isolate error traces. If container IDs present, fetches YARN logs. Redacts PII and secrets. Trims to max_log_size. Sanitised log snippet
4 LLM Analysis (Bedrock) Assembles system + user prompts with sanitised logs. Invokes Bedrock with claude-sonnet-4-6. Guardrails applied on input and output. Structured LLM response
5 Response Structuring Parses LLM JSON output into root_cause, log_evidences, recommendations (max 3), and category fields. Returns to caller. Final diagnostic payload

5. Component Design

5.1 Event Ingestion — Lambda Handler

The Lambda entry point deserialises the EventBridge payload and validates required fields (clusterId, stepId, region). It initialises the LangChain pipeline, invokes it, and serialises the result for the caller.

  • Runtime: Python 3.12
  • Trigger: Amazon EventBridge Rule — source aws.emr, detail-type EMR Step Status Change, state FAILED
  • Memory: configurable (default 512 MB); timeout: configurable (default 300 s)
  • IAM Role permissions: s3:GetObject on log bucket, bedrock:InvokeModel on Bedrock endpoint, secretsmanager:GetSecretValue for config, logs:CreateLogGroup / PutLogEvents

5.2 Config & Metadata Fetch

A dedicated LangChain tool wraps the internal metadata REST API. The tool is invoked as the first chain step. The response is parsed and normalised into a standard context schema consumed by downstream stages.

Attribute Fetched Usage
EMR Cluster Configuration Cluster type, release, instance types — provides execution context to LLM
Step Logs S3 Location Base path for stderr / stdout log download
Spark Configuration spark.executor.memory, spark.sql.shuffle.partitions, etc. — context for OOM / shuffle diagnosis
YARN Configuration yarn.nodemanager.resource.memory-mb, scheduler settings — context for resource failures
Hive Configuration Metastore URI, execution engine — context for Hive query failures

All API endpoint URLs and authentication headers are externalised via configuration.


5.3 Log Extraction & Sanitisation

This is the core value-delivery component of the pipeline. Its primary objective is to maximise diagnostic signal while minimising token consumption.

5.3.1 Stderr Log Download

  • The agent constructs the S3 key from the log path returned by the metadata API.
  • boto3 S3 GetObject is used; streaming is applied for large files to avoid Lambda memory exhaustion.
  • Total ingested log size is capped at max_log_size_bytes (configurable).

5.3.2 Intelligent Error Extraction

Rather than forwarding raw log files to the LLM, the agent applies a multi-pass extraction strategy:

  • Pass 1 — Exception Anchor: Locate lines containing exception keywords (Exception, Error, FAILED, Caused by, OOM, OutOfMemory, FileNotFoundException, etc.) using compiled regex patterns.
  • Pass 2 — Context Window: Include N lines before and after each anchor (configurable, default 10 lines) to preserve stack-trace context.
  • Pass 3 — Deduplication: Remove duplicate stack frames using a hash-based seen set.
  • Pass 4 — Container Log Resolution: If container IDs (container_XXXXXXXX) are present in stderr, fetch corresponding YARN container logs from S3 and apply the same extraction logic.
  • Pass 5 — Truncation: Hard-trim the final log snippet to max_log_size_bytes to enforce token budget.

5.3.3 PII & Secret Redaction

All log content passes through a regex-based redaction pipeline before leaving the log extraction stage.

Data Type Pattern Replacement
AWS Access Key ID AKIA[0-9A-Z]{16} [REDACTED-AWS-KEY]
AWS Secret Key (?i)secret.*key.*=\S+ [REDACTED-SECRET]
IP Addresses \b(?:\d{1,3}\.){3}\d{1,3}\b [REDACTED-IP]
Email Addresses \b[\w.+-]+@[\w-]+\.[a-z]{2,}\b [REDACTED-EMAIL]
JDBC / DB Passwords (?i)password\s*=\s*\S+ [REDACTED-PASSWORD]
S3 Bucket Names s3://[a-z0-9\-./]+ [REDACTED-S3-PATH]
Auth Tokens / Bearer (?i)bearer\s+[A-Za-z0-9._\-]+ [REDACTED-TOKEN]

The redaction regex set is extensible via configuration. Redaction is applied prior to any LLM call and any response logging.


5.4 LLM Analysis via AWS Bedrock

5.4.1 Prompt Architecture

The agent constructs a two-part prompt for each invocation:

System Prompt (static, version-controlled)

You are an expert data engineering assistant specialising in Amazon EMR, Apache Spark,
YARN, and Hive diagnostics. Your task is to analyse the provided log excerpt and return
a structured JSON diagnosis. Respond ONLY with a valid JSON object. Do not include
markdown or explanatory prose outside the JSON. Never include sensitive data such as
credentials, IP addresses, or internal hostnames in your response.

User Prompt (dynamic, per-invocation)

Cluster ID: {cluster_id} | Step ID: {step_id} | Failure Time: {timestamp}
Cluster Config: {cluster_config_summary}
Spark Config:   {spark_config_summary}
YARN Config:    {yarn_config_summary}

Log Excerpt:
{sanitised_log_snippet}

Return JSON with fields: root_cause (string), log_evidences (array of strings),
recommendations (array, max 3), category (string).

5.4.2 Bedrock Invocation Parameters

Parameter Value
Model ID configurable (default: anthropic.claude-sonnet-4-6-20251101-v1:0)
max_tokens configurable (default: 1024)
temperature 0 (deterministic output for consistent diagnostics)
Orchestration LangChain ChatBedrock wrapper with ChatPromptTemplate
Guardrail guardrailIdentifier + guardrailVersion passed on every invocation

5.5 Bedrock Guardrails

AWS Bedrock Guardrails provide a second line of defence in addition to pre-processing redaction:

  • Sensitive Information Filters: Block or mask AWS credential patterns and PII types (NAME, EMAIL, PHONE, ADDRESS, US_SSN, AWS_ACCESS_KEY, AWS_SECRET_KEY) on both input and output.
  • Denied Topics: Block topics unrelated to EMR diagnostics (e.g. general code generation, personal advice) to prevent prompt injection exploitation.
  • Content Filters: Block harmful content categories (HATE, INSULTS, SEXUAL, VIOLENCE) at HIGH threshold on output.
  • Word Filters: Block internal hostname patterns and custom terms specified in configuration.

6. Diagnostic Response Schema

The agent parses the LLM JSON output and validates it against a Pydantic schema before returning to the caller.

Field Type Description
root_cause string Concise, human-readable statement of the identified root cause. E.g., OOM in executor during shuffle stage.
log_evidences string[] Array of exact log lines or paraphrased evidence snippets that support the root cause.
recommendations string[] Ordered list of up to 3 actionable remediation steps, most impactful first.
category string Failure taxonomy label. Enum: RESOURCE_EXHAUSTION | DATA_QUALITY | CONFIGURATION_ERROR | DEPENDENCY_FAILURE | PERMISSION_DENIED | UNKNOWN
cluster_id string Echo of input cluster ID for traceability.
step_id string Echo of input step ID for traceability.
analysis_timestamp ISO 8601 UTC timestamp of analysis completion.
tokens_used integer Total prompt + completion tokens consumed; logged for cost monitoring.

Example response:

{
  "root_cause": "Executor out-of-memory error during shuffle write phase caused by insufficient spark.executor.memory.",
  "log_evidences": [
    "java.lang.OutOfMemoryError: GC overhead limit exceeded",
    "ExecutorLostFailure: Executor 4 exited caused by one of the running tasks"
  ],
  "recommendations": [
    "Increase spark.executor.memory from 4g to 8g in the step configuration.",
    "Enable spark.memory.offHeap.enabled=true and allocate off-heap memory for shuffle buffers.",
    "Reduce spark.sql.shuffle.partitions to lower per-partition memory pressure."
  ],
  "category": "RESOURCE_EXHAUSTION",
  "cluster_id": "j-ABC123XYZ",
  "step_id": "s-DEFGH456",
  "analysis_timestamp": "2026-05-20T10:34:21Z",
  "tokens_used": 847
}

7. Configuration Reference

All runtime parameters are externalised. The agent resolves configuration from AWS Secrets Manager (for secrets) and environment variables (for non-secret values). No values are hardcoded in the Lambda package.

Parameter Source Default Description
BEDROCK_MODEL_ID Env Var claude-sonnet-4-6 Bedrock model identifier
BEDROCK_MAX_TOKENS Env Var 1024 Maximum LLM completion tokens
BEDROCK_GUARDRAIL_ID Env Var required Bedrock guardrail resource ID
MAX_LOG_SIZE_BYTES Env Var 32768 Hard cap on log snippet size (32 KB)
LOG_CONTEXT_LINES Env Var 10 Lines before/after each error anchor
METADATA_API_URL Env Var required Internal EMR metadata API base URL
METADATA_API_KEY Secrets Manager required API authentication key
S3_LOG_BUCKET Env Var required S3 bucket for EMR step logs
LAMBDA_TIMEOUT_SECONDS Env Var 300 Lambda function timeout
LAMBDA_MEMORY_MB Env Var 512 Lambda function memory allocation
REDACTION_EXTRA_PATTERNS Env Var [] JSON array of extra regex redaction patterns
YARN_LOG_FETCH_ENABLED Env Var true Toggle YARN container log resolution
AWS_REGION Env Var us-east-1 AWS region for Bedrock and S3 calls

8. Data Flow & Security Boundaries

[1]  EventBridge  →  Lambda Handler            (raw event — no log data)
[2]  Lambda  →  Metadata API  (HTTPS, API key) →  Cluster context
[3]  Lambda  →  S3  (IAM role)                 →  Raw stderr logs
[4]  Log Extractor  →  PII/Secret Redaction    ⟹  Sanitised snippet
                                                    ▲ SECURITY BOUNDARY 1
[5]  Sanitised snippet  →  Bedrock Guardrails (input scan)
                                                    ▲ SECURITY BOUNDARY 2
[6]  Bedrock  →  LLM (claude-sonnet-4-6)       →  Raw LLM output
[7]  Raw LLM output  →  Bedrock Guardrails (output scan)
                                                    ▲ SECURITY BOUNDARY 3
[8]  Response Structurer  →  Validated JSON payload  →  Caller

Security Checkpoints

Boundary What is checked Action on violation
Boundary 1 — Pre-processing redaction PII, secrets, IP addresses, credentials in raw logs Replace with [REDACTED-*] placeholder
Boundary 2 — Guardrail input scan Sensitive info, denied topics, injected instructions Block invocation; return GUARDRAIL_BLOCKED
Boundary 3 — Guardrail output scan PII leakage, harmful content, off-topic responses Discard response; return GUARDRAIL_OUTPUT_BLOCKED

9. Error Handling & Resilience

Failure Scenario Behaviour Fallback
Metadata API unavailable Retry 3× with exponential backoff (1 s, 2 s, 4 s) Raise AnalysisAbortedException; return error payload
S3 log file not found Log warning and continue with empty log snippet LLM is informed that logs were unavailable
Bedrock throttling (429) Retry 5× with jitter; respect Retry-After header Raise after max retries; return error payload
Guardrail input violation Block invocation immediately Return GUARDRAIL_BLOCKED status to caller
Guardrail output violation Discard LLM response Return GUARDRAIL_OUTPUT_BLOCKED status
LLM returns invalid JSON Retry once with explicit JSON repair prompt Return PARSE_ERROR with raw LLM text
Lambda timeout approaching Check remaining time after each stage; abort gracefully at 30 s remaining Return PARTIAL_ANALYSIS with completed stages

10. Observability

10.1 Structured Logging

All Lambda log statements use structured JSON logging (python-json-logger). Every log line includes: cluster_id, step_id, stage, duration_ms, and status.

10.2 CloudWatch Custom Metrics

Metric Description
EMRAnalyserInvocations Total Lambda invocations
EMRAnalyserTokensUsed Prompt + completion tokens per invocation
EMRAnalyserGuardrailBlocks Guardrail trigger count
EMRAnalyserPipelineDuration End-to-end latency in ms
EMRAnalyserLogSizeBytes Size of extracted log snippet

10.3 CloudWatch Alarms

  • Alert on GuardrailBlocks > 5 per 5 minutes (potential attack or misconfigured rule).
  • Alert on Lambda error rate > 5% over 15 minutes.
  • Alert on TokensUsed per invocation > 900 (approaching max_tokens cap).

11. Security Considerations

  • Least-privilege IAM: Lambda execution role is scoped to specific S3 bucket ARN, specific Bedrock model ARN, and specific Secrets Manager secret ARN.
  • No log data is stored in Lambda ephemeral storage (/tmp) beyond the invocation lifecycle.
  • CloudWatch Logs group is encrypted with a customer-managed KMS key.
  • The metadata API call uses TLS 1.2+ with certificate validation enforced.
  • Redaction runs in-process before any data leaves the Lambda execution environment.
  • Bedrock guardrail blocks are surfaced as distinct status codes to enable security team auditing.
  • Lambda layers containing third-party packages (LangChain, boto3) are pinned to specific versions and scanned by Dependabot.