Automated CI/CD Debugging
An Interactive Flow for GitLab & Amazon Bedrock Integration
Introduction: The Challenge of CI/CD Failures
In a complex multi-account AWS environment, CI/CD pipeline failures can be time-consuming to diagnose. This workflow demonstrates an automated approach: a custom GitLab Component invokes an Amazon Bedrock Agent, which analyzes the failure, determines the root cause, and proposes a resolution, either a code fix submitted as a Merge Request or a recommended configuration change. The goal is to reduce manual debugging effort.
GitLab CI Job Fails
A job in the pipeline (e.g., `terraform apply`) fails during execution.
What's Happening:
The pipeline executes as normal, but a command returns a non-zero exit code, triggering a failure state in GitLab. This is the entry point for our automated debugging process.
Example Error Log:
on main.tf line 25, in resource "aws_instance" "web":
25: resource "aws_instance" "web" {
ERROR: Job failed: exit code 1
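A log excerpt like the one above already carries the two facts the later steps rely on: the source location and the exit code. The following is a minimal sketch of pulling them out with regular expressions. The function name `summarize_failure` and the patterns are illustrative assumptions, not part of the component described here.

```python
import re

# Matches Terraform-style source references, e.g. 'on main.tf line 25, in resource ...'
SOURCE_REF = re.compile(r"on (?P<file>\S+) line (?P<line>\d+)")
# Matches the closing line of a failed job log, e.g. 'ERROR: Job failed: exit code 1'
EXIT_CODE = re.compile(r"ERROR: Job failed: exit code (?P<code>\d+)")


def summarize_failure(log_text):
    """Return the first source reference and the exit code found in a job log."""
    ref = SOURCE_REF.search(log_text)
    code = EXIT_CODE.search(log_text)
    return {
        "file": ref.group("file") if ref else None,
        "line": int(ref.group("line")) if ref else None,
        "exit_code": int(code.group("code")) if code else None,
    }


sample_log = '''on main.tf line 25, in resource "aws_instance" "web":
25: resource "aws_instance" "web" {
ERROR: Job failed: exit code 1'''

print(summarize_failure(sample_log))
# -> {'file': 'main.tf', 'line': 25, 'exit_code': 1}
```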
Post-Stage Debug Component Runs
A component in the `.post` stage triggers on failure.
What's Happening:
This job runs only when at least one job in an earlier stage has failed. It uses a custom GitLab Component template, which contains the logic to gather context and call the Bedrock Agent. The GitLab Runner executing this job uses an IAM instance role that grants the permissions needed to invoke the agent.
Example `.gitlab-ci.yml` Usage:
include:
  - component: '$CI_SERVER_FQDN/my-org/gitlab-components/bedrock-debugger@1.0.0'
    inputs:
      bedrock_agent_id: 'ABC123XYZ'
      bedrock_agent_alias_id: 'TSTALIASID'
      aws_region: 'us-east-1'

# Job provided by the component template
debug:on-failure:
  stage: .post
  script:
    - /usr/bin/python3 /path/to/script/in/component.py
  when: on_failure
Invoke Amazon Bedrock Agent
The component script sends job logs and metadata to the agent.
What's Happening:
A Python script (part of the component) uses the GitLab API to fetch the logs of the failed job. It combines this with predefined CI/CD variables (like project URL and commit SHA) and sends it all as a payload to the specified Bedrock Agent endpoint using the AWS SDK (Boto3).
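The log collection described above might look roughly like the sketch below. It uses only the standard library and GitLab's pipeline-jobs and job-trace REST endpoints. `GITLAB_API_TOKEN` is a hypothetical variable name for a token with `read_api` scope, and the 20,000-character log limit is an arbitrary assumption to keep the prompt small. The `CI_*` names are GitLab's predefined variables.

```python
import json
import os
import urllib.parse
import urllib.request


def api_get(url, token):
    """GET a GitLab API URL and return the response body as text."""
    request = urllib.request.Request(url, headers={"PRIVATE-TOKEN": token})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def failed_jobs_url(api_url, project_id, pipeline_id):
    """URL listing only the failed jobs of one pipeline."""
    query = urllib.parse.urlencode({"scope[]": "failed"})
    return f"{api_url}/projects/{project_id}/pipelines/{pipeline_id}/jobs?{query}"


def job_trace_url(api_url, project_id, job_id):
    """URL of the raw log (trace) of one job."""
    return f"{api_url}/projects/{project_id}/jobs/{job_id}/trace"


def build_prompt(project_url, commit_sha, log_text, max_log_chars=20000):
    """Combine CI metadata and the tail of the log into the agent's input text."""
    tail = log_text[-max_log_chars:]  # assumed limit; the end of a log usually holds the error
    return (
        "A GitLab CI job failed. Here are the details. "
        "Please perform a root cause analysis.\n\n"
        f"Project URL: {project_url}\nCommit: {commit_sha}\n\nLogs:\n{tail}"
    )


def collect_failure_prompt(env=os.environ, fetch=api_get):
    """Fetch the logs of all failed jobs in this pipeline and build the prompt."""
    api_url, project_id = env["CI_API_V4_URL"], env["CI_PROJECT_ID"]
    token = env["GITLAB_API_TOKEN"]  # hypothetical variable name
    jobs_json = fetch(failed_jobs_url(api_url, project_id, env["CI_PIPELINE_ID"]), token)
    logs = "\n".join(
        fetch(job_trace_url(api_url, project_id, job["id"]), token)
        for job in json.loads(jobs_json)
    )
    return build_prompt(env["CI_PROJECT_URL"], env["CI_COMMIT_SHA"], logs)
```

Passing `fetch` as a parameter keeps the network call replaceable, so the prompt-building logic can be exercised without a GitLab instance.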
Example Payload to Bedrock:
{
  "sessionAttributes": {},
  "promptSessionAttributes": {},
  "inputText": "A GitLab CI job failed. Here are the details. Please perform a root cause analysis.\n\nProject URL: ${CI_PROJECT_URL}\nCommit: ${CI_COMMIT_SHA}\n\nLogs:\nError: creating EC2 Instance: InvalidIAMInstanceProfile.Name..."
}
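Sending such a payload with Boto3 could be sketched as follows. The `bedrock-agent-runtime` client's `invoke_agent` call returns an event stream under `completion`, and the text arrives in `chunk` events. In Boto3, session attributes, when used, are passed nested under a `sessionState` argument rather than at the top level. The helper names `make_client` and `invoke_agent_text` are illustrative assumptions.

```python
import uuid


def make_client(region):
    """Create the Bedrock Agent runtime client (imported lazily; requires boto3)."""
    import boto3
    return boto3.client("bedrock-agent-runtime", region_name=region)


def invoke_agent_text(client, agent_id, agent_alias_id, prompt, session_id=None):
    """Invoke the agent and return its full text answer."""
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id or uuid.uuid4().hex,  # one fresh session per failed pipeline
        inputText=prompt,
    )
    # The answer is streamed; non-text events (such as traces) are skipped.
    parts = [
        event["chunk"]["bytes"]
        for event in response["completion"]
        if "chunk" in event
    ]
    return b"".join(parts).decode("utf-8")


# Usage, with the example IDs from the component inputs:
# client = make_client("us-east-1")
# print(invoke_agent_text(client, "ABC123XYZ", "TSTALIASID", prompt))
```

Joining the raw bytes before decoding avoids errors if a multi-byte character happens to be split across two chunks.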
Agent Performs Analysis
Bedrock analyzes logs, queries AWS, and checks source code.
What's Happening:
The Bedrock Agent executes its pre-configured action groups. It can:
1. Parse Logs: Identify specific error messages.
2. Query AWS: Use its underlying Lambda functions and their IAM role to make AWS CLI or SDK calls that check the state of resources (e.g., whether an IAM role exists).
3. Examine Code: Use a GitLab Project Access Token (configured in the agent) to clone the repository at the specific commit and analyze the Terraform or script files referenced in the error log.
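As an illustration of the "Query AWS" step, the sketch below shows a Lambda handler for an action group that checks whether an IAM instance profile exists. The event and response shapes follow the Bedrock Agents "function details" format as best understood here and should be checked against the current AWS documentation. The parameter name `instance_profile_name` and the helper names are hypothetical.

```python
def instance_profile_exists(iam_client, name):
    """Return True if the IAM instance profile exists, False if IAM reports it missing."""
    try:
        iam_client.get_instance_profile(InstanceProfileName=name)
        return True
    except iam_client.exceptions.NoSuchEntityException:
        return False


def handle_action(event, iam_client):
    """Answer one action-group call from the agent with a plain-text result."""
    # The agent passes parameters as a list of {"name", "type", "value"} entries.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    name = params["instance_profile_name"]  # hypothetical parameter name
    exists = instance_profile_exists(iam_client, name)
    body = f"Instance profile '{name}' " + ("exists." if exists else "does not exist.")
    return {
        "messageVersion": event.get("messageVersion", "1.0"),
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {"responseBody": {"TEXT": {"body": body}}},
        },
    }


def lambda_handler(event, context):
    """Lambda entry point; boto3 is imported lazily so the helpers stay testable offline."""
    import boto3
    return handle_action(event, boto3.client("iam"))
```

The Lambda's execution role would need `iam:GetInstanceProfile` for this check; keeping such roles read-only limits what the agent can change in the account.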
Root Cause Identified & Resolution Proposed
The agent determines the failure type and generates a solution.
Outcome: Merge Request Created
In this example, the agent identifies a typo in the Terraform source code. It generates a fix and uses the GitLab API to create a new branch, commit the change, and open a Merge Request for review.
Example MR Description:
Root Cause:
The job failed due to `InvalidIAMInstanceProfile.Name`. Analysis of `main.tf` revealed a typo in the `iam_instance_profile` name.
Proposed Change:
Corrected `my-instance-pofile` to `my-instance-profile` in `resource "aws_instance" "web"`.
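The branch, commit, and Merge Request could be created with two GitLab REST calls, sketched below: `POST /projects/:id/repository/commits` (which creates the branch when `start_branch` is given) and `POST /projects/:id/merge_requests`. The function names are illustrative assumptions, and the token is assumed to be the Project Access Token configured in the agent.

```python
import json
import urllib.request


def apply_fix(content, wrong, right):
    """Replace the faulty text, failing loudly if it is not present."""
    if wrong not in content:
        raise ValueError(f"{wrong!r} not found in file content")
    return content.replace(wrong, right)


def commit_payload(branch, start_branch, file_path, new_content, message):
    """Body for the commits API: one file update on a new branch."""
    return {
        "branch": branch,
        "start_branch": start_branch,
        "commit_message": message,
        "actions": [{"action": "update", "file_path": file_path, "content": new_content}],
    }


def merge_request_payload(source_branch, target_branch, title, description):
    """Body for the merge requests API."""
    return {
        "source_branch": source_branch,
        "target_branch": target_branch,
        "title": title,
        "description": description,
    }


def gitlab_post(url, token, payload):
    """POST a JSON payload to the GitLab API and return the parsed response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())


def open_fix_merge_request(api_url, project_id, token, commit, merge_request, post=gitlab_post):
    """Commit the fix to a new branch, open the Merge Request, and return its URL."""
    base = f"{api_url}/projects/{project_id}"
    post(f"{base}/repository/commits", token, commit)
    created = post(f"{base}/merge_requests", token, merge_request)
    return created["web_url"]
```

For the example above, `apply_fix(content, "my-instance-pofile", "my-instance-profile")` would produce the new content of `main.tf`, and the MR description would carry the root cause and proposed change. Leaving the merge to a human reviewer keeps the agent's change auditable.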