Data Job Failure Intelligence

1. Executive Summary

Data Job Failure Intelligence is an AI-powered, event-driven microservice that automatically diagnoses Amazon EMR step failures. When EMR emits a step failure event, the agent is triggered on AWS Lambda, gathers relevant cluster configuration and log data, applies intelligent log extraction to isolate the root cause, and then invokes AWS Bedrock (Claude Sonnet 4.6) to produce a structured diagnostic report.

The agent reduces mean time to resolution (MTTR) for EMR failures by eliminating manual log inspection. It surfaces actionable root cause analysis, log evidence, and targeted recommendations — all while enforcing strict data governance through built-in guardrails and PII redaction.

2. Goals and Non-Goals

2.1 Goals

Automatically trigger root cause analysis on every EMR step failure event without human intervention.
Retrieve and consolidate EMR cluster config, Spark, YARN, and Hive settings alongside step logs.
Efficiently extract only failure-relevant log lines, minimising token usage and LLM cost.
Redact all sensitive data (credentials, PII, internal endpoints) before sending logs to Bedrock.
Return a structured diagnostic payload: root cause, log evidence, up to three recommendations, and failure category.
Enforce Bedrock guardrails to prevent sensitive data leakage in LLM input and output.
Expose all critical parameters (LLM model, token budget, log size cap, API endpoints) as externalised configuration.

2.2 Non-Goals

The agent does not auto-remediate or restart EMR steps.
The agent does not store analysis results; persistence is the responsibility of the calling system.
The agent does not support non-EMR compute platforms (Glue, HDInsight, Databricks) in this version.
Real-time streaming log analysis is out of scope; the agent operates on completed step logs.

3. System Context & Integration Points

Data Job Failure Intelligence sits within the broader data platform as a reactive diagnostic service.

External Trigger	Agent Boundary	Downstream Consumers
Amazon EMR (EventBridge step failure event)	AWS Lambda (this service)	Operations / On-call team (Slack, PagerDuty)
EMR Config & Metadata API	AWS Bedrock (Claude Sonnet 4.6)	Incident Management System
S3 — Step & YARN Logs	LangChain orchestration layer	Data Platform Dashboard
AWS Secrets Manager (config)	Bedrock Guardrails	Audit / Compliance Log

4. High-Level Architecture

The agent follows a linear pipeline architecture with five distinct stages. Each stage is independently testable and configuration-driven. The pipeline is implemented as a LangChain chain and executes entirely within a single Lambda invocation.

EventBridge (Step Failure)
        │
        ▼
┌─────────────────┐
│  Lambda Handler │  ← Stage 1: Event Ingestion
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│  Config & Metadata Fetch │  ← Stage 2: API call for cluster config, log path, Spark/YARN/Hive
└────────────┬────────────┘
             │
             ▼
┌──────────────────────────────┐
│  Log Extraction & Sanitisation│  ← Stage 3: Download stderr, extract errors, redact PII, trim
└────────────┬─────────────────┘
             │
             ▼
┌──────────────────────────┐
│  LLM Analysis (Bedrock)   │  ← Stage 4: Prompt assembly, Guardrails, claude-sonnet-4-6
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────┐
│  Response Structuring │  ← Stage 5: Parse JSON, validate schema, return payload
└──────────────────────┘

Pipeline Stage Summary

#	Stage	Description	Output Artifact
1	Event Ingestion	EventBridge delivers step failure event to Lambda. Payload carries cluster-id, step-id, and failure timestamp.	Parsed event context
2	Config & Metadata Fetch	Calls internal metadata API to retrieve EMR cluster config, S3 log path, Spark, YARN, and Hive configuration.	Enriched context object
3	Log Extraction & Sanitisation	Downloads stderr from S3. Applies regex extraction to isolate error traces. If container IDs present, fetches YARN logs. Redacts PII and secrets. Trims to `max_log_size`.	Sanitised log snippet
4	LLM Analysis (Bedrock)	Assembles system + user prompts with sanitised logs. Invokes Bedrock with claude-sonnet-4-6. Guardrails applied on input and output.	Structured LLM response
5	Response Structuring	Parses LLM JSON output into `root_cause`, `log_evidences`, `recommendations` (max 3), and `category` fields. Returns to caller.	Final diagnostic payload

5. Component Design

5.1 Event Ingestion — Lambda Handler

The Lambda entry point deserialises the EventBridge payload and validates required fields (clusterId, stepId, region). It initialises the LangChain pipeline, invokes it, and serialises the result for the caller.

Runtime: Python 3.12
Trigger: Amazon EventBridge Rule — source aws.emr, detail-type EMR Step Status Change, state FAILED
Memory: configurable (default 512 MB); timeout: configurable (default 300 s)
IAM Role permissions: s3:GetObject on log bucket, bedrock:InvokeModel on Bedrock endpoint, secretsmanager:GetSecretValue for config, logs:CreateLogGroup / PutLogEvents

5.2 Config & Metadata Fetch

A dedicated LangChain tool wraps the internal metadata REST API. The tool is invoked as the first chain step. The response is parsed and normalised into a standard context schema consumed by downstream stages.

Attribute Fetched	Usage
EMR Cluster Configuration	Cluster type, release, instance types — provides execution context to LLM
Step Logs S3 Location	Base path for stderr / stdout log download
Spark Configuration	`spark.executor.memory`, `spark.sql.shuffle.partitions`, etc. — context for OOM / shuffle diagnosis
YARN Configuration	`yarn.nodemanager.resource.memory-mb`, scheduler settings — context for resource failures
Hive Configuration	Metastore URI, execution engine — context for Hive query failures

All API endpoint URLs and authentication headers are externalised via configuration.

5.3 Log Extraction & Sanitisation

This is the core value-delivery component of the pipeline. Its primary objective is to maximise diagnostic signal while minimising token consumption.

5.3.1 Stderr Log Download

The agent constructs the S3 key from the log path returned by the metadata API.
boto3 S3 GetObject is used; streaming is applied for large files to avoid Lambda memory exhaustion.
Total ingested log size is capped at max_log_size_bytes (configurable).

5.3.2 Intelligent Error Extraction

Rather than forwarding raw log files to the LLM, the agent applies a multi-pass extraction strategy:

Pass 1 — Exception Anchor: Locate lines containing exception keywords (Exception, Error, FAILED, Caused by, OOM, OutOfMemory, FileNotFoundException, etc.) using compiled regex patterns.
Pass 2 — Context Window: Include N lines before and after each anchor (configurable, default 10 lines) to preserve stack-trace context.
Pass 3 — Deduplication: Remove duplicate stack frames using a hash-based seen set.
Pass 4 — Container Log Resolution: If container IDs (container_XXXXXXXX) are present in stderr, fetch corresponding YARN container logs from S3 and apply the same extraction logic.
Pass 5 — Truncation: Hard-trim the final log snippet to max_log_size_bytes to enforce token budget.

5.3.3 PII & Secret Redaction

All log content passes through a regex-based redaction pipeline before leaving the log extraction stage.

Data Type	Pattern	Replacement
AWS Access Key ID	`AKIA[0-9A-Z]{16}`	`[REDACTED-AWS-KEY]`
AWS Secret Key	`(?i)secret.key.=\S+`	`[REDACTED-SECRET]`
IP Addresses	`\b(?:\d{1,3}\.){3}\d{1,3}\b`	`[REDACTED-IP]`
Email Addresses	`\b[\w.+-]+@[\w-]+\.[a-z]{2,}\b`	`[REDACTED-EMAIL]`
JDBC / DB Passwords	`(?i)password\s=\s\S+`	`[REDACTED-PASSWORD]`
S3 Bucket Names	`s3://[a-z0-9\-./]+`	`[REDACTED-S3-PATH]`
Auth Tokens / Bearer	`(?i)bearer\s+[A-Za-z0-9._\-]+`	`[REDACTED-TOKEN]`

The redaction regex set is extensible via configuration. Redaction is applied prior to any LLM call and any response logging.

5.4 LLM Analysis via AWS Bedrock

5.4.1 Prompt Architecture

The agent constructs a two-part prompt for each invocation:

System Prompt (static, version-controlled)

You are an expert data engineering assistant specialising in Amazon EMR, Apache Spark,
YARN, and Hive diagnostics. Your task is to analyse the provided log excerpt and return
a structured JSON diagnosis. Respond ONLY with a valid JSON object. Do not include
markdown or explanatory prose outside the JSON. Never include sensitive data such as
credentials, IP addresses, or internal hostnames in your response.

User Prompt (dynamic, per-invocation)

Cluster ID: {cluster_id} | Step ID: {step_id} | Failure Time: {timestamp}
Cluster Config: {cluster_config_summary}
Spark Config:   {spark_config_summary}
YARN Config:    {yarn_config_summary}

Log Excerpt:
{sanitised_log_snippet}

Return JSON with fields: root_cause (string), log_evidences (array of strings),
recommendations (array, max 3), category (string).

5.4.2 Bedrock Invocation Parameters

Parameter	Value
Model ID	configurable (default: `anthropic.claude-sonnet-4-6-20251101-v1:0`)
`max_tokens`	configurable (default: 1024)
`temperature`	0 (deterministic output for consistent diagnostics)
Orchestration	LangChain `ChatBedrock` wrapper with `ChatPromptTemplate`
Guardrail	`guardrailIdentifier` + `guardrailVersion` passed on every invocation

5.5 Bedrock Guardrails

AWS Bedrock Guardrails provide a second line of defence in addition to pre-processing redaction:

Sensitive Information Filters: Block or mask AWS credential patterns and PII types (NAME, EMAIL, PHONE, ADDRESS, US_SSN, AWS_ACCESS_KEY, AWS_SECRET_KEY) on both input and output.
Denied Topics: Block topics unrelated to EMR diagnostics (e.g. general code generation, personal advice) to prevent prompt injection exploitation.
Content Filters: Block harmful content categories (HATE, INSULTS, SEXUAL, VIOLENCE) at HIGH threshold on output.
Word Filters: Block internal hostname patterns and custom terms specified in configuration.

6. Diagnostic Response Schema

The agent parses the LLM JSON output and validates it against a Pydantic schema before returning to the caller.

Field	Type	Description
`root_cause`	`string`	Concise, human-readable statement of the identified root cause. E.g., OOM in executor during shuffle stage.
`log_evidences`	`string[]`	Array of exact log lines or paraphrased evidence snippets that support the root cause.
`recommendations`	`string[]`	Ordered list of up to 3 actionable remediation steps, most impactful first.
`category`	`string`	Failure taxonomy label. Enum: `RESOURCE_EXHAUSTION` \| `DATA_QUALITY` \| `CONFIGURATION_ERROR` \| `DEPENDENCY_FAILURE` \| `PERMISSION_DENIED` \| `UNKNOWN`
`cluster_id`	`string`	Echo of input cluster ID for traceability.
`step_id`	`string`	Echo of input step ID for traceability.
`analysis_timestamp`	ISO 8601	UTC timestamp of analysis completion.
`tokens_used`	`integer`	Total prompt + completion tokens consumed; logged for cost monitoring.

Example response:

{
  "root_cause": "Executor out-of-memory error during shuffle write phase caused by insufficient spark.executor.memory.",
  "log_evidences": [
    "java.lang.OutOfMemoryError: GC overhead limit exceeded",
    "ExecutorLostFailure: Executor 4 exited caused by one of the running tasks"
  ],
  "recommendations": [
    "Increase spark.executor.memory from 4g to 8g in the step configuration.",
    "Enable spark.memory.offHeap.enabled=true and allocate off-heap memory for shuffle buffers.",
    "Reduce spark.sql.shuffle.partitions to lower per-partition memory pressure."
  ],
  "category": "RESOURCE_EXHAUSTION",
  "cluster_id": "j-ABC123XYZ",
  "step_id": "s-DEFGH456",
  "analysis_timestamp": "2026-05-20T10:34:21Z",
  "tokens_used": 847
}

7. Configuration Reference

All runtime parameters are externalised. The agent resolves configuration from AWS Secrets Manager (for secrets) and environment variables (for non-secret values). No values are hardcoded in the Lambda package.

Parameter	Source	Default	Description
`BEDROCK_MODEL_ID`	Env Var	`claude-sonnet-4-6`	Bedrock model identifier
`BEDROCK_MAX_TOKENS`	Env Var	`1024`	Maximum LLM completion tokens
`BEDROCK_GUARDRAIL_ID`	Env Var	required	Bedrock guardrail resource ID
`MAX_LOG_SIZE_BYTES`	Env Var	`32768`	Hard cap on log snippet size (32 KB)
`LOG_CONTEXT_LINES`	Env Var	`10`	Lines before/after each error anchor
`METADATA_API_URL`	Env Var	required	Internal EMR metadata API base URL
`METADATA_API_KEY`	Secrets Manager	required	API authentication key
`S3_LOG_BUCKET`	Env Var	required	S3 bucket for EMR step logs
`LAMBDA_TIMEOUT_SECONDS`	Env Var	`300`	Lambda function timeout
`LAMBDA_MEMORY_MB`	Env Var	`512`	Lambda function memory allocation
`REDACTION_EXTRA_PATTERNS`	Env Var	`[]`	JSON array of extra regex redaction patterns
`YARN_LOG_FETCH_ENABLED`	Env Var	`true`	Toggle YARN container log resolution
`AWS_REGION`	Env Var	`us-east-1`	AWS region for Bedrock and S3 calls

8. Data Flow & Security Boundaries

[1]  EventBridge  →  Lambda Handler            (raw event — no log data)
[2]  Lambda  →  Metadata API  (HTTPS, API key) →  Cluster context
[3]  Lambda  →  S3  (IAM role)                 →  Raw stderr logs
[4]  Log Extractor  →  PII/Secret Redaction    ⟹  Sanitised snippet
                                                    ▲ SECURITY BOUNDARY 1
[5]  Sanitised snippet  →  Bedrock Guardrails (input scan)
                                                    ▲ SECURITY BOUNDARY 2
[6]  Bedrock  →  LLM (claude-sonnet-4-6)       →  Raw LLM output
[7]  Raw LLM output  →  Bedrock Guardrails (output scan)
                                                    ▲ SECURITY BOUNDARY 3
[8]  Response Structurer  →  Validated JSON payload  →  Caller

Security Checkpoints

Boundary	What is checked	Action on violation
Boundary 1 — Pre-processing redaction	PII, secrets, IP addresses, credentials in raw logs	Replace with `[REDACTED-*]` placeholder
Boundary 2 — Guardrail input scan	Sensitive info, denied topics, injected instructions	Block invocation; return `GUARDRAIL_BLOCKED`
Boundary 3 — Guardrail output scan	PII leakage, harmful content, off-topic responses	Discard response; return `GUARDRAIL_OUTPUT_BLOCKED`

9. Error Handling & Resilience

Failure Scenario	Behaviour	Fallback
Metadata API unavailable	Retry 3× with exponential backoff (1 s, 2 s, 4 s)	Raise `AnalysisAbortedException`; return error payload
S3 log file not found	Log warning and continue with empty log snippet	LLM is informed that logs were unavailable
Bedrock throttling (429)	Retry 5× with jitter; respect `Retry-After` header	Raise after max retries; return error payload
Guardrail input violation	Block invocation immediately	Return `GUARDRAIL_BLOCKED` status to caller
Guardrail output violation	Discard LLM response	Return `GUARDRAIL_OUTPUT_BLOCKED` status
LLM returns invalid JSON	Retry once with explicit JSON repair prompt	Return `PARSE_ERROR` with raw LLM text
Lambda timeout approaching	Check remaining time after each stage; abort gracefully at 30 s remaining	Return `PARTIAL_ANALYSIS` with completed stages

10. Observability

10.1 Structured Logging

All Lambda log statements use structured JSON logging (python-json-logger). Every log line includes: cluster_id, step_id, stage, duration_ms, and status.

10.2 CloudWatch Custom Metrics

Metric	Description
`EMRAnalyserInvocations`	Total Lambda invocations
`EMRAnalyserTokensUsed`	Prompt + completion tokens per invocation
`EMRAnalyserGuardrailBlocks`	Guardrail trigger count
`EMRAnalyserPipelineDuration`	End-to-end latency in ms
`EMRAnalyserLogSizeBytes`	Size of extracted log snippet

10.3 CloudWatch Alarms

Alert on GuardrailBlocks > 5 per 5 minutes (potential attack or misconfigured rule).
Alert on Lambda error rate > 5% over 15 minutes.
Alert on TokensUsed per invocation > 900 (approaching max_tokens cap).

11. Security Considerations

Least-privilege IAM: Lambda execution role is scoped to specific S3 bucket ARN, specific Bedrock model ARN, and specific Secrets Manager secret ARN.
No log data is stored in Lambda ephemeral storage (/tmp) beyond the invocation lifecycle.
CloudWatch Logs group is encrypted with a customer-managed KMS key.
The metadata API call uses TLS 1.2+ with certificate validation enforced.
Redaction runs in-process before any data leaves the Lambda execution environment.
Bedrock guardrail blocks are surfaced as distinct status codes to enable security team auditing.
Lambda layers containing third-party packages (LangChain, boto3) are pinned to specific versions and scanned by Dependabot.