[Concept,14/16] pickman: Support automatically fixing pipeline failures

Message ID 20260222154303.2851319-15-sjg@u-boot.org
State New
Series pickman: Support monitoring and fixing pipeline failures

Commit Message

Simon Glass Feb. 22, 2026, 3:42 p.m. UTC
  From: Simon Glass <simon.glass@canonical.com>

After pickman pushes cherry-pick MRs to GitLab, CI pipelines sometimes
fail due to build or test errors introduced by the cherry-picks. This
currently requires manual intervention.

Add a pipeline-fix feature that detects failed pipelines on open MRs and
uses a Claude agent to diagnose and fix them. The agent analyses the
failed job logs, identifies the responsible commit, amends it via
interactive rebase (using uman's rf/rn helpers), verifies the fix with
um build and buildman, then leaves the result on a local branch for the
caller to push.

The feature integrates into the existing do_step()/do_poll() flow,
running after review comment processing. A --fix-retries/-F flag
(default 3, 0 to disable) controls the maximum attempts per MR. Each
attempt is tracked per pipeline ID in the pipeline_fix table so the same
failure is never reprocessed, and a new pipeline from a rebase is treated
independently.

On success, pickman pushes the fix branch, posts an MR comment
summarising what was fixed, updates the MR description with the full
agent log, and records the fix in .pickman-history. On failure or when
the retry limit is reached, a comment is posted requesting manual
intervention.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Simon Glass <simon.glass@canonical.com>
---
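
For reviewers, example invocations of the new flag (the entry point is shown
as 'pickman' for brevity; 'us/next' is just the source branch name used in the
tests):

    # One step: create/update MRs, handle review comments, then attempt up
    # to two automatic fixes for any MR whose pipeline has failed
    pickman step us/next -F 2

    # Poll every ten minutes with automatic pipeline fixing disabled
    pickman poll us/next -i 600 -F 0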

 tools/pickman/README.rst  |  48 ++++
 tools/pickman/__main__.py |   6 +
 tools/pickman/agent.py    | 177 +++++++++++++
 tools/pickman/control.py  | 247 ++++++++++++++++++
 tools/pickman/ftest.py    | 536 +++++++++++++++++++++++++++++++++++++-
 5 files changed, 1011 insertions(+), 3 deletions(-)
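
The agent entry point added in agent.py can also be driven directly, e.g. from
a test harness. A minimal sketch based on the signatures in this patch (the
import layout and the FailedJob values are illustrative, borrowed from
ftest.py):

    from pickman import agent, gitlab  # assumed package layout

    # FailedJob fields as used by the tests: id, name, stage, web_url, log_tail
    failed_jobs = [
        gitlab.FailedJob(id=1, name='build:sandbox', stage='build',
                         web_url='https://gitlab.example.com/job/1',
                         log_tail='error: undefined reference'),
    ]

    # Runs the Claude agent and returns (success, conversation_log); the fix
    # is left on local branch 'cherry-abc123-fix1' for the caller to push
    success, log = agent.fix_pipeline(42, 'cherry-abc123', failed_jobs, 'ci',
                                      target='master', attempt=1)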
  

Patch

diff --git a/tools/pickman/README.rst b/tools/pickman/README.rst
index 70309032838..0c9d2d1521b 100644
--- a/tools/pickman/README.rst
+++ b/tools/pickman/README.rst
@@ -212,6 +212,39 @@  This ensures:
 - No manual intervention is required to continue
 - False positives are minimized by comparing actual patch content
 
+Pipeline Fix
+------------
+
+When a CI pipeline fails on a pickman MR, the ``step`` and ``poll`` commands
+can automatically diagnose and fix the failure using a Claude agent. This is
+useful when cherry-picks introduce build or test failures that need minor
+adjustments.
+
+**How it works**
+
+During each step, after processing review comments, pickman checks active MRs
+for failed pipelines. For each failed pipeline:
+
+1. Pickman fetches the failed job logs from GitLab
+2. A Claude agent analyses the logs, diagnoses the root cause, and makes
+   targeted fixes
+3. The fix is pushed to the MR branch, triggering a new pipeline
+4. The attempt is recorded in the database to avoid reprocessing
+
+**Retry behaviour**
+
+Each MR gets up to ``--fix-retries`` attempts (default: 3). If the limit is
+reached, pickman posts a comment on the MR indicating that manual intervention
+is required. Set ``--fix-retries 0`` to disable automatic pipeline fixing.
+
+Each attempt is tracked per pipeline ID, so a new pipeline triggered by a rebase
+or comment fix is treated independently.
+
+**Options**
+
+- ``-F, --fix-retries``: Maximum pipeline-fix attempts per MR (default: 3, 0 to
+  disable). Available on both ``step`` and ``poll`` commands.
+
 CI Pipelines
 ------------
 
@@ -448,6 +481,7 @@  review comments are handled automatically.
 
 Options for the step command:
 
+- ``-F, --fix-retries``: Max pipeline-fix attempts per MR (default: 3, 0 to disable)
 - ``-m, --max-mrs``: Maximum open MRs allowed (default: 5)
 - ``-r, --remote``: Git remote for push (default: ci)
 - ``-t, --target``: Target branch for MR (default: master)
@@ -461,6 +495,7 @@  creating new MRs as previous ones are merged. Press Ctrl+C to stop.
 
 Options for the poll command:
 
+- ``-F, --fix-retries``: Max pipeline-fix attempts per MR (default: 3, 0 to disable)
 - ``-i, --interval``: Interval between steps in seconds (default: 300)
 - ``-m, --max-mrs``: Maximum open MRs allowed (default: 5)
 - ``-r, --remote``: Git remote for push (default: ci)
@@ -563,6 +598,19 @@  Tables
     This table prevents the same comment from being addressed multiple times
     when running ``review`` or ``poll`` commands.
 
+**pipeline_fix**
+    Tracks pipeline fix attempts per MR to avoid reprocessing.
+
+    - ``id``: Primary key
+    - ``mr_iid``: GitLab merge request IID
+    - ``pipeline_id``: GitLab pipeline ID
+    - ``attempt``: Attempt number
+    - ``status``: Result ('success', 'failure', 'skipped', 'no_jobs', 'rebased')
+    - ``created_at``: Timestamp when the attempt was made
+
+    The ``(mr_iid, pipeline_id)`` pair is unique, so each pipeline is only
+    processed once.
+
 Configuration
 -------------
 
diff --git a/tools/pickman/__main__.py b/tools/pickman/__main__.py
index 7814fd0fedc..1e13646ec52 100755
--- a/tools/pickman/__main__.py
+++ b/tools/pickman/__main__.py
@@ -113,6 +113,9 @@  def add_main_commands(subparsers):
     step_cmd = subparsers.add_parser('step',
                                      help='Create MR if none pending')
     step_cmd.add_argument('source', help='Source branch name')
+    step_cmd.add_argument('-F', '--fix-retries', type=int, default=3,
+                          help='Max pipeline-fix attempts per MR '
+                               '(0 to disable, default: 3)')
     step_cmd.add_argument('-m', '--max-mrs', type=int, default=5,
                           help='Max open MRs allowed (default: 5)')
     step_cmd.add_argument('-r', '--remote', default='ci',
@@ -123,6 +126,9 @@  def add_main_commands(subparsers):
     poll_cmd = subparsers.add_parser('poll',
                                      help='Run step repeatedly until stopped')
     poll_cmd.add_argument('source', help='Source branch name')
+    poll_cmd.add_argument('-F', '--fix-retries', type=int, default=3,
+                          help='Max pipeline-fix attempts per MR '
+                               '(0 to disable, default: 3)')
     poll_cmd.add_argument('-i', '--interval', type=int, default=300,
                           help='Interval between steps in seconds '
                                '(default: 300)')
diff --git a/tools/pickman/agent.py b/tools/pickman/agent.py
index 4b914e314e8..ec5c1b0df75 100644
--- a/tools/pickman/agent.py
+++ b/tools/pickman/agent.py
@@ -536,3 +536,180 @@  def handle_mr_comments(mr_iid, branch_name, comments, remote, target='master',
     return asyncio.run(run_review_agent(mr_iid, branch_name, comments, remote,
                                         target, needs_rebase, has_conflicts,
                                         mr_description, repo_path))
+
+
+# pylint: disable=too-many-arguments
+def build_pipeline_fix_prompt(mr_iid, branch_name, failed_jobs, remote,
+                               target, mr_description, attempt):
+    """Build prompt and task description for the pipeline fix agent
+
+    Args:
+        mr_iid (int): Merge request IID
+        branch_name (str): Source branch name
+        failed_jobs (list): List of FailedJob tuples
+        remote (str): Git remote name
+        target (str): Target branch
+        mr_description (str): MR description with context
+        attempt (int): Fix attempt number
+
+    Returns:
+        tuple: (prompt, task_desc) where prompt is the full agent prompt and
+            task_desc is a short description
+    """
+    task_desc = f'fix {len(failed_jobs)} failed pipeline job(s) (attempt {attempt})'
+
+    # Format failed jobs
+    job_sections = []
+    for job in failed_jobs:
+        job_sections.append(
+            f'### Job: {job.name} (stage: {job.stage})\n'
+            f'URL: {job.web_url}\n'
+            f'Log tail:\n```\n{job.log_tail}\n```'
+        )
+    jobs_text = '\n\n'.join(job_sections)
+
+    # Include MR description for context
+    context_section = ''
+    if mr_description:
+        context_section = f'''
+Context from MR description:
+
+{mr_description}
+'''
+
+    # Extract board names from failed job names for targeted builds.
+    # CI job names typically contain a board name (e.g. 'build:sandbox',
+    # 'test:venice_gw7905', 'world build <board>').  Collect unique names
+    # to pass to buildman so the agent can verify all affected boards.
+    board_names = set()
+    for job in failed_jobs:
+        # Try common CI patterns: 'build:<board>', 'test:<board>',
+        # or a board name token in the job name
+        for part in job.name.replace(':', ' ').replace('/', ' ').split():
+            # Skip generic tokens that are not board names
+            if part.lower() in ('build', 'test', 'world', 'check', 'lint',
+                                'ci', 'job', 'stage'):
+                continue
+            board_names.add(part)
+
+    # Always include sandbox for a basic sanity check
+    board_names.add('sandbox')
+    boards_csv = ','.join(sorted(board_names))
+
+    prompt = f"""Fix pipeline failures for merge request !{mr_iid} \
+(branch: {branch_name}, attempt {attempt}).
+{context_section}
+Failed jobs:
+
+{jobs_text}
+
+Steps to follow:
+1. Checkout the branch: git checkout {branch_name}
+2. Diagnose the root cause from the job logs above
+3. Identify which commit introduced the problem:
+   - Use 'git log --oneline' to list the commits on the branch
+   - Correlate the failing file/symbol with the commit that touched it
+   - Use 'git log --oneline -- <file>' if needed
+4. Apply the fix to the appropriate commit:
+   - If you can identify the responsible commit, use uman's rebase helpers
+     to amend it:
+     a) 'rf N' to start an interactive rebase going back N commits from
+        HEAD, stopping at the oldest (first) commit in the range
+     b) Make your fix, then amend the commit with a 1-2 line note appended
+        to the end of the commit message describing the fix, e.g.:
+        git add <files>
+        git commit --amend -m "$(git log -1 --format=%B)
+
+        [pickman] Fix <short description of what was fixed>"
+     c) 'rn' to advance to the next commit (or 'git rebase --continue'
+        to finish)
+   - If the cause spans multiple commits or cannot be pinpointed, add a new
+     fixup commit on top of the branch
+5. Build and verify:
+   a) Quick sandbox check: um build sandbox
+   b) Build all affected boards: \
+buildman -o /tmp/pickman {boards_csv}
+   Fix any build errors before proceeding.
+6. Create a local branch: {branch_name}-fix{attempt}
+7. Report what was fixed: which commit was responsible, what the root cause
+   was, and what change was made.  Do NOT push the branch; the caller
+   handles that.
+
+Important:
+- Keep changes minimal and focused on fixing the failures
+- Prefer amending the responsible commit over adding a new commit, so the
+  MR history stays clean
+- If the failure is an infrastructure or transient issue (network timeout, \
+runner problem, etc.), report this without making changes
+- Do not modify unrelated code
+- Use 'um build sandbox' for sandbox builds (fast, local)
+- Use 'buildman -o /tmp/pickman <board1> <board2> ...' to build multiple
+  boards in one go
+- Leave the result on local branch {branch_name}-fix{attempt}
+"""
+
+    return prompt, task_desc
+
+
+async def run_pipeline_fix_agent(mr_iid, branch_name, failed_jobs, remote,
+                                  target='master', mr_description='',
+                                  attempt=1, repo_path=None):
+    """Run the Claude agent to fix pipeline failures
+
+    Args:
+        mr_iid (int): Merge request IID
+        branch_name (str): Source branch name
+        failed_jobs (list): List of FailedJob tuples
+        remote (str): Git remote name
+        target (str): Target branch
+        mr_description (str): MR description with context
+        attempt (int): Fix attempt number
+        repo_path (str): Path to repository (defaults to current directory)
+
+    Returns:
+        tuple: (success, conversation_log) where success is bool and
+            conversation_log is the agent's output text
+    """
+    if not check_available():
+        return False, ''
+
+    if repo_path is None:
+        repo_path = os.getcwd()
+
+    prompt, task_desc = build_pipeline_fix_prompt(
+        mr_iid, branch_name, failed_jobs, remote, target,
+        mr_description, attempt)
+
+    options = ClaudeAgentOptions(
+        allowed_tools=['Bash', 'Read', 'Grep', 'Edit', 'Write'],
+        cwd=repo_path,
+        max_buffer_size=MAX_BUFFER_SIZE,
+    )
+
+    tout.info(f'Starting Claude agent to {task_desc}...')
+    tout.info('')
+
+    return await run_agent_collect(prompt, options)
+
+
+def fix_pipeline(mr_iid, branch_name, failed_jobs, remote, target='master',
+                 mr_description='', attempt=1, repo_path=None):
+    """Synchronous wrapper for running the pipeline fix agent
+
+    Args:
+        mr_iid (int): Merge request IID
+        branch_name (str): Source branch name
+        failed_jobs (list): List of FailedJob tuples
+        remote (str): Git remote name
+        target (str): Target branch
+        mr_description (str): MR description with context
+        attempt (int): Fix attempt number
+        repo_path (str): Path to repository (defaults to current directory)
+
+    Returns:
+        tuple: (success, conversation_log) where success is bool and
+            conversation_log is the agent's output text
+    """
+    return asyncio.run(run_pipeline_fix_agent(
+        mr_iid, branch_name, failed_jobs, remote, target,
+        mr_description, attempt, repo_path))
diff --git a/tools/pickman/control.py b/tools/pickman/control.py
index 057e4400adb..f4d4a43c292 100644
--- a/tools/pickman/control.py
+++ b/tools/pickman/control.py
@@ -2398,6 +2398,248 @@  def process_mr_reviews(remote, mrs, dbs, target='master'):
     return processed
 
 
+def _rebase_mr_branch(remote, merge_req, dbs, target):
+    """Rebase an MR branch onto the target before attempting a pipeline fix
+
+    When a branch needs rebasing, the pipeline failure may be caused by the
+    stale base rather than by the cherry-picked commits. Rebasing and pushing
+    triggers a fresh pipeline run.
+
+    Args:
+        remote (str): Remote name
+        merge_req (PickmanMr): MR with a failed pipeline
+        dbs (Database): Database instance for tracking fix attempts
+        target (str): Target branch
+
+    Returns:
+        True if the branch was rebased and pushed, False if the rebase
+        failed (conflicts), or None if no rebase is needed
+    """
+    if not merge_req.needs_rebase and not merge_req.has_conflicts:
+        return None
+
+    mr_iid = merge_req.iid
+    branch = merge_req.source_branch
+    if merge_req.has_conflicts:
+        tout.info(f'MR !{mr_iid}: has conflicts, rebasing before '
+                  f'pipeline fix...')
+    else:
+        tout.info(f'MR !{mr_iid}: needs rebase, rebasing before '
+                  f'pipeline fix...')
+    run_git(['checkout', branch])
+    try:
+        run_git(['rebase', f'{remote}/{target}'])
+    except command.CommandExc:
+        tout.warning(f'MR !{mr_iid}: rebase failed, aborting')
+        try:
+            run_git(['rebase', '--abort'])
+        except command.CommandExc:
+            pass
+        return False
+    gitlab_api.push_branch(remote, branch, force=True, skip_ci=False)
+    dbs.pfix_add(mr_iid, merge_req.pipeline_id, 0, 'rebased')
+    dbs.commit()
+    tout.info(f'MR !{mr_iid}: rebased and pushed, waiting for '
+              f'new pipeline')
+    return True
+
+
+def _attempt_pipeline_fix(remote, merge_req, dbs, target, attempt):
+    """Run the agent to fix a failed pipeline and report the result
+
+    Fetches the failed-job logs, invokes the fix agent, then pushes the
+    result and updates the MR description and history on success, or posts
+    a failure comment otherwise.
+
+    Args:
+        remote (str): Remote name
+        merge_req (PickmanMr): MR with a failed pipeline
+        dbs (Database): Database instance for tracking fix attempts
+        target (str): Target branch
+        attempt (int): Current fix attempt number
+
+    Returns:
+        bool: True if the fix was attempted, False if no failed jobs
+            were found
+    """
+    mr_iid = merge_req.iid
+
+    # Fetch failed jobs
+    failed_jobs = gitlab_api.get_failed_jobs(remote, merge_req.pipeline_id)
+    if not failed_jobs:
+        tout.info(f'MR !{mr_iid}: no failed jobs found')
+        dbs.pfix_add(mr_iid, merge_req.pipeline_id, attempt, 'no_jobs')
+        dbs.commit()
+        return False
+
+    # Run agent to fix the failures
+    success, conversation_log = agent.fix_pipeline(
+        mr_iid,
+        merge_req.source_branch,
+        failed_jobs,
+        remote,
+        target,
+        mr_description=merge_req.description,
+        attempt=attempt,
+    )
+
+    status = 'success' if success else 'failure'
+    dbs.pfix_add(mr_iid, merge_req.pipeline_id, attempt, status)
+    dbs.commit()
+
+    if success:
+        # Push the fix branch to the original MR branch
+        branch = merge_req.source_branch
+        gitlab_api.push_branch(remote, branch, force=True,
+                               skip_ci=False)
+
+        # Update MR description with fix log
+        old_desc = merge_req.description
+        job_names = ', '.join(j.name for j in failed_jobs)
+        new_desc = (f"{old_desc}\n\n### Pipeline fix (attempt {attempt})"
+                    f"\n\n**Failed jobs:** {job_names}\n\n"
+                    f"**Response:**\n{conversation_log}")
+        gitlab_api.update_mr_desc(remote, mr_iid, new_desc)
+
+        # Post a comment summarising the fix
+        gitlab_api.reply_to_mr(
+            remote, mr_iid,
+            f'Pipeline fix (attempt {attempt}): '
+            f'fixed failed job(s) {job_names}.\n\n'
+            f'{conversation_log[:2000]}')
+
+        # Update .pickman-history
+        update_history_pipeline_fix(merge_req.source_branch, failed_jobs,
+                                    conversation_log, attempt)
+
+        tout.info(f'MR !{mr_iid}: pipeline fix pushed (attempt {attempt})')
+    else:
+        gitlab_api.reply_to_mr(
+            remote, mr_iid,
+            f'Pipeline fix attempt {attempt} failed. '
+            f'Agent output:\n\n{conversation_log[:1000]}')
+        tout.error(f'MR !{mr_iid}: pipeline fix failed '
+                   f'(attempt {attempt})')
+
+    return True
+
+
+def process_pipeline_failures(remote, mrs, dbs, target, max_retries):
+    """Process pipeline failures on open MRs
+
+    Checks each MR for failed pipelines and uses Claude agent to diagnose
+    and fix them. Tracks attempts in the database to avoid reprocessing.
+
+    Args:
+        remote (str): Remote name
+        mrs (list): List of active (non-skipped) PickmanMr tuples
+        dbs (Database): Database instance for tracking fix attempts
+        target (str): Target branch
+        max_retries (int): Maximum fix attempts per MR
+
+    Returns:
+        int: Number of MRs with pipeline fixes attempted
+    """
+    # Save current branch to restore later
+    original_branch = run_git(['rev-parse', '--abbrev-ref', 'HEAD'])
+
+    # Fetch to get latest remote state
+    tout.info(f'Fetching {remote}...')
+    run_git(['fetch', remote])
+
+    processed = 0
+    for merge_req in mrs:
+        mr_iid = merge_req.iid
+
+        # Skip if pipeline is not failed or has no pipeline
+        if merge_req.pipeline_status != 'failed':
+            continue
+        if merge_req.pipeline_id is None:
+            continue
+
+        # Skip if this pipeline was already handled
+        if dbs.pfix_has(mr_iid, merge_req.pipeline_id):
+            continue
+
+        rebased = _rebase_mr_branch(remote, merge_req, dbs, target)
+        if rebased is not None:
+            if rebased:
+                processed += 1
+            continue
+
+        attempt = dbs.pfix_count(mr_iid) + 1
+
+        # Check retry limit
+        if attempt > max_retries:
+            tout.info(f'MR !{mr_iid}: reached fix retry limit '
+                      f'({max_retries}), skipping')
+            gitlab_api.reply_to_mr(
+                remote, mr_iid,
+                f'Pipeline fix: reached retry limit ({max_retries} '
+                f'attempts). Manual intervention required.')
+            dbs.pfix_add(mr_iid, merge_req.pipeline_id, attempt, 'skipped')
+            dbs.commit()
+            continue
+
+        tout.info('')
+        tout.info(f'MR !{mr_iid}: pipeline {merge_req.pipeline_id} failed, '
+                  f'attempting fix (attempt {attempt}/{max_retries})...')
+
+        if _attempt_pipeline_fix(remote, merge_req, dbs, target, attempt):
+            processed += 1
+
+    # Restore original branch
+    if processed:
+        tout.info(f'Returning to {original_branch}')
+        run_git(['checkout', original_branch])
+
+    return processed
+
+
+def update_history_pipeline_fix(branch_name, failed_jobs, conversation_log,
+                                attempt):
+    """Append pipeline fix handling to .pickman-history
+
+    Args:
+        branch_name (str): Branch name for the MR
+        failed_jobs (list): List of FailedJob tuples that were fixed
+        conversation_log (str): Agent conversation log
+        attempt (int): Fix attempt number
+    """
+    job_summary = '\n'.join(
+        f'- {j.name} ({j.stage})'
+        for j in failed_jobs
+    )
+
+    entry = f'''### Pipeline fix: {date.today()} (attempt {attempt})
+
+Branch: {branch_name}
+
+Failed jobs:
+{job_summary}
+
+### Conversation log
+{conversation_log}
+
+---
+
+'''
+
+    # Append to history file
+    existing = ''
+    if os.path.exists(HISTORY_FILE):
+        with open(HISTORY_FILE, 'r', encoding='utf-8') as fhandle:
+            existing = fhandle.read()
+
+    with open(HISTORY_FILE, 'w', encoding='utf-8') as fhandle:
+        fhandle.write(existing + entry)
+
+    # Commit the history file
+    run_git(['add', '-f', HISTORY_FILE])
+    run_git(['commit', '-m',
+             f'pickman: Record pipeline fix for {branch_name}'])
+
+
 def update_history(branch_name, comments, conversation_log):
     """Append review handling to .pickman-history
 
@@ -2604,6 +2846,11 @@  def do_step(args, dbs):
         # in case they have an unskip request)
         process_mr_reviews(remote, mrs, dbs, args.target)
 
+        # Process pipeline failures on active MRs only
+        if active_mrs and args.fix_retries > 0:
+            process_pipeline_failures(remote, active_mrs, dbs,
+                                      args.target, args.fix_retries)
+
     # Only block new MR creation if we've reached the max allowed open MRs
     max_mrs = args.max_mrs
     if len(active_mrs) >= max_mrs:
diff --git a/tools/pickman/ftest.py b/tools/pickman/ftest.py
index 67a7d004ca6..af26cbb4229 100644
--- a/tools/pickman/ftest.py
+++ b/tools/pickman/ftest.py
@@ -2068,7 +2068,7 @@  class TestStep(unittest.TestCase):
                                        return_value=[mock_mr]):
                     args = argparse.Namespace(cmd='step', source='us/next',
                                               remote='ci', target='master',
-                                              max_mrs=1)
+                                              max_mrs=1, fix_retries=3)
                     with terminal.capture():
                         ret = control.do_step(args, None)
 
@@ -2118,7 +2118,7 @@  class TestStep(unittest.TestCase):
                                            return_value=0) as mock_apply:
                         args = argparse.Namespace(cmd='step', source='us/next',
                                                   remote='ci', target='master',
-                                                  max_mrs=5)
+                                                  max_mrs=5, fix_retries=3)
                         with terminal.capture():
                             ret = control.do_step(args, None)
 
@@ -2146,7 +2146,7 @@  class TestStep(unittest.TestCase):
                     with mock.patch.object(control, 'do_apply') as mock_apply:
                         args = argparse.Namespace(cmd='step', source='us/next',
                                                   remote='ci', target='master',
-                                                  max_mrs=3)
+                                                  max_mrs=3, fix_retries=3)
                         with terminal.capture():
                             ret = control.do_step(args, None)
 
@@ -5636,5 +5636,535 @@  class TestDoPick(unittest.TestCase):
             dbs.close()
 
 
+class TestPickmanMrPipelineFields(unittest.TestCase):
+    """Tests for PickmanMr pipeline fields."""
+
+    def test_defaults_none(self):
+        """Test that pipeline fields default to None"""
+        pmr = gitlab.PickmanMr(
+            iid=1,
+            title='[pickman] Test',
+            web_url='https://example.com/mr/1',
+            source_branch='cherry-test',
+            description='Test',
+        )
+        self.assertIsNone(pmr.pipeline_status)
+        self.assertIsNone(pmr.pipeline_id)
+
+    def test_with_pipeline(self):
+        """Test creating PickmanMr with pipeline fields"""
+        pmr = gitlab.PickmanMr(
+            iid=1,
+            title='[pickman] Test',
+            web_url='https://example.com/mr/1',
+            source_branch='cherry-test',
+            description='Test',
+            pipeline_status='failed',
+            pipeline_id=42,
+        )
+        self.assertEqual(pmr.pipeline_status, 'failed')
+        self.assertEqual(pmr.pipeline_id, 42)
+
+
+class TestGetFailedJobs(unittest.TestCase):
+    """Tests for get_failed_jobs function."""
+
+    def _make_mock_job(self, job_id, name, stage, web_url, trace_bytes):
+        """Helper to create a mock job object"""
+        job = mock.MagicMock()
+        job.id = job_id
+        job.name = name
+        job.stage = stage
+        job.web_url = web_url
+        return job
+
+    @mock.patch.object(gitlab, 'get_remote_url',
+                       return_value=TEST_SSH_URL)
+    @mock.patch.object(gitlab, 'get_token', return_value='test-token')
+    @mock.patch.object(gitlab, 'AVAILABLE', True)
+    def test_success(self, _mock_token, _mock_url):
+        """Test successful retrieval of failed jobs"""
+        mock_job = self._make_mock_job(
+            1, 'build:sandbox', 'build', 'https://gitlab.com/job/1',
+            b'line1\nline2\nerror: build failed\n')
+
+        mock_full_job = mock.MagicMock()
+        mock_full_job.trace.return_value = b'line1\nline2\nerror: build failed\n'
+
+        mock_pipeline = mock.MagicMock()
+        mock_pipeline.jobs.list.return_value = [mock_job]
+
+        mock_project = mock.MagicMock()
+        mock_project.pipelines.get.return_value = mock_pipeline
+        mock_project.jobs.get.return_value = mock_full_job
+
+        mock_glab = mock.MagicMock()
+        mock_glab.projects.get.return_value = mock_project
+
+        with mock.patch('gitlab.Gitlab', return_value=mock_glab):
+            with terminal.capture():
+                result = gitlab.get_failed_jobs('ci', 100)
+
+        self.assertIsNotNone(result)
+        self.assertEqual(len(result), 1)
+        self.assertEqual(result[0].name, 'build:sandbox')
+        self.assertEqual(result[0].stage, 'build')
+        self.assertIn('error: build failed', result[0].log_tail)
+
+    @mock.patch.object(gitlab, 'get_remote_url',
+                       return_value=TEST_SSH_URL)
+    @mock.patch.object(gitlab, 'get_token', return_value='test-token')
+    @mock.patch.object(gitlab, 'AVAILABLE', True)
+    def test_empty(self, _mock_token, _mock_url):
+        """Test when no failed jobs exist"""
+        mock_pipeline = mock.MagicMock()
+        mock_pipeline.jobs.list.return_value = []
+
+        mock_project = mock.MagicMock()
+        mock_project.pipelines.get.return_value = mock_pipeline
+
+        mock_glab = mock.MagicMock()
+        mock_glab.projects.get.return_value = mock_project
+
+        with mock.patch('gitlab.Gitlab', return_value=mock_glab):
+            with terminal.capture():
+                result = gitlab.get_failed_jobs('ci', 100)
+
+        self.assertIsNotNone(result)
+        self.assertEqual(len(result), 0)
+
+    @mock.patch.object(gitlab, 'get_remote_url',
+                       return_value=TEST_SSH_URL)
+    @mock.patch.object(gitlab, 'get_token', return_value='test-token')
+    @mock.patch.object(gitlab, 'AVAILABLE', True)
+    def test_log_truncation(self, _mock_token, _mock_url):
+        """Test that log output is truncated to max_log_lines"""
+        # Create a trace with 500 lines
+        trace_lines = [f'line {i}' for i in range(500)]
+        trace_bytes = '\n'.join(trace_lines).encode()
+
+        mock_job = self._make_mock_job(
+            1, 'test:sandbox', 'test', 'https://gitlab.com/job/1',
+            trace_bytes)
+
+        mock_full_job = mock.MagicMock()
+        mock_full_job.trace.return_value = trace_bytes
+
+        mock_pipeline = mock.MagicMock()
+        mock_pipeline.jobs.list.return_value = [mock_job]
+
+        mock_project = mock.MagicMock()
+        mock_project.pipelines.get.return_value = mock_pipeline
+        mock_project.jobs.get.return_value = mock_full_job
+
+        mock_glab = mock.MagicMock()
+        mock_glab.projects.get.return_value = mock_project
+
+        with mock.patch('gitlab.Gitlab', return_value=mock_glab):
+            with terminal.capture():
+                result = gitlab.get_failed_jobs('ci', 100, max_log_lines=50)
+
+        self.assertEqual(len(result), 1)
+        # Should only have last 50 lines
+        log_lines = result[0].log_tail.splitlines()
+        self.assertEqual(len(log_lines), 50)
+        self.assertIn('line 499', result[0].log_tail)
+
+
+class TestBuildPipelineFixPrompt(unittest.TestCase):
+    """Tests for build_pipeline_fix_prompt function."""
+
+    def test_single_job(self):
+        """Test prompt with a single failed job"""
+        failed_jobs = [
+            gitlab.FailedJob(
+                id=1, name='build:sandbox', stage='build',
+                web_url='https://gitlab.com/job/1',
+                log_tail='error: undefined reference'),
+        ]
+        prompt, task_desc = agent.build_pipeline_fix_prompt(
+            42, 'cherry-abc123', failed_jobs, 'ci', 'master',
+            'Test MR desc', 1)
+
+        self.assertIn('!42', prompt)
+        self.assertIn('cherry-abc123', prompt)
+        self.assertIn('build:sandbox', prompt)
+        self.assertIn('error: undefined reference', prompt)
+        self.assertIn('attempt 1', prompt)
+        self.assertIn('cherry-abc123-fix1', prompt)
+        self.assertIn('1 failed', task_desc)
+
+    def test_multiple_jobs(self):
+        """Test prompt with multiple failed jobs"""
+        failed_jobs = [
+            gitlab.FailedJob(
+                id=1, name='build:sandbox', stage='build',
+                web_url='https://gitlab.com/job/1',
+                log_tail='build error'),
+            gitlab.FailedJob(
+                id=2, name='test:dm', stage='test',
+                web_url='https://gitlab.com/job/2',
+                log_tail='test failure'),
+        ]
+        prompt, task_desc = agent.build_pipeline_fix_prompt(
+            42, 'cherry-abc123', failed_jobs, 'ci', 'master', '', 1)
+
+        self.assertIn('build:sandbox', prompt)
+        self.assertIn('test:dm', prompt)
+        self.assertIn('build error', prompt)
+        self.assertIn('test failure', prompt)
+        self.assertIn('2 failed', task_desc)
+
+    def test_attempt_number(self):
+        """Test that attempt number is reflected in prompt"""
+        failed_jobs = [
+            gitlab.FailedJob(
+                id=1, name='build', stage='build',
+                web_url='https://gitlab.com/job/1',
+                log_tail='error'),
+        ]
+        prompt, task_desc = agent.build_pipeline_fix_prompt(
+            42, 'cherry-abc123', failed_jobs, 'ci', 'master', '', 3)
+
+        self.assertIn('attempt 3', prompt)
+        self.assertIn('cherry-abc123-fix3', prompt)
+        self.assertIn('attempt 3', task_desc)
+
+    def test_uses_um_build(self):
+        """Test that prompt uses 'um build sandbox' for sandbox"""
+        failed_jobs = [
+            gitlab.FailedJob(
+                id=1, name='build:sandbox', stage='build',
+                web_url='https://gitlab.com/job/1',
+                log_tail='error'),
+        ]
+        prompt, _ = agent.build_pipeline_fix_prompt(
+            42, 'cherry-abc123', failed_jobs, 'ci', 'master', '', 1)
+
+        self.assertIn('um build sandbox', prompt)
+
+    def test_extracts_board_names(self):
+        """Test that board names are extracted from job names"""
+        failed_jobs = [
+            gitlab.FailedJob(
+                id=1, name='build:imx8mm_venice', stage='build',
+                web_url='https://gitlab.com/job/1',
+                log_tail='error'),
+            gitlab.FailedJob(
+                id=2, name='build:rpi_4', stage='build',
+                web_url='https://gitlab.com/job/2',
+                log_tail='error'),
+        ]
+        prompt, _ = agent.build_pipeline_fix_prompt(
+            42, 'cherry-abc123', failed_jobs, 'ci', 'master', '', 1)
+
+        # Should include both boards plus sandbox in the buildman command
+        self.assertIn('buildman', prompt)
+        self.assertIn('imx8mm_venice', prompt)
+        self.assertIn('rpi_4', prompt)
+        self.assertIn('sandbox', prompt)
+
+    def test_buildman_for_multiple_boards(self):
+        """Test that buildman is used for building multiple boards"""
+        failed_jobs = [
+            gitlab.FailedJob(
+                id=1, name='build:coral', stage='build',
+                web_url='https://gitlab.com/job/1',
+                log_tail='error'),
+        ]
+        prompt, _ = agent.build_pipeline_fix_prompt(
+            42, 'cherry-abc123', failed_jobs, 'ci', 'master', '', 1)
+
+        self.assertIn('buildman -o /tmp/pickman', prompt)
+        self.assertIn('coral', prompt)
+
+
+class TestProcessPipelineFailures(unittest.TestCase):
+    """Tests for process_pipeline_failures function."""
+
+    def setUp(self):
+        """Set up test fixtures."""
+        fd, self.db_path = tempfile.mkstemp(suffix='.db')
+        os.close(fd)
+        os.unlink(self.db_path)
+
+    def tearDown(self):
+        """Clean up test fixtures."""
+        if os.path.exists(self.db_path):
+            os.unlink(self.db_path)
+        database.Database.instances.clear()
+
+    def _make_mr(self, iid=1, pipeline_status='failed', pipeline_id=100,
+                 needs_rebase=False, has_conflicts=False):
+        """Helper to create a PickmanMr with pipeline fields"""
+        return gitlab.PickmanMr(
+            iid=iid,
+            title=f'[pickman] Test MR {iid}',
+            web_url=f'https://gitlab.com/mr/{iid}',
+            source_branch=f'cherry-test-{iid}',
+            description='Test description',
+            has_conflicts=has_conflicts,
+            needs_rebase=needs_rebase,
+            pipeline_status=pipeline_status,
+            pipeline_id=pipeline_id,
+        )
+
+    def test_skips_running(self):
+        """Test that running pipelines are skipped"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            mrs = [self._make_mr(pipeline_status='running')]
+            with mock.patch.object(control, 'run_git'):
+                result = control.process_pipeline_failures(
+                    'ci', mrs, dbs, 'master', 3)
+
+            self.assertEqual(result, 0)
+            dbs.close()
+
+    def test_skips_success(self):
+        """Test that successful pipelines are skipped"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            mrs = [self._make_mr(pipeline_status='success')]
+            with mock.patch.object(control, 'run_git'):
+                result = control.process_pipeline_failures(
+                    'ci', mrs, dbs, 'master', 3)
+
+            self.assertEqual(result, 0)
+            dbs.close()
+
+    def test_skips_already_processed(self):
+        """Test that already-processed pipelines are skipped"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            # Pre-record this pipeline
+            dbs.pfix_add(1, 100, 1, 'success')
+            dbs.commit()
+
+            mrs = [self._make_mr()]
+            with mock.patch.object(control, 'run_git'):
+                result = control.process_pipeline_failures(
+                    'ci', mrs, dbs, 'master', 3)
+
+            self.assertEqual(result, 0)
+            dbs.close()
+
+    def test_respects_retry_limit(self):
+        """Test that retry limit is respected"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            # Pre-record 3 attempts with different pipeline IDs
+            dbs.pfix_add(1, 10, 1, 'failure')
+            dbs.pfix_add(1, 20, 2, 'failure')
+            dbs.pfix_add(1, 30, 3, 'failure')
+            dbs.commit()
+
+            mrs = [self._make_mr(pipeline_id=40)]
+            with mock.patch.object(control, 'run_git'):
+                with mock.patch.object(gitlab, 'reply_to_mr',
+                                       return_value=True):
+                    result = control.process_pipeline_failures(
+                        'ci', mrs, dbs, 'master', 3)
+
+            # A limit comment is posted but no fix is attempted or counted
+            self.assertEqual(result, 0)
+            dbs.close()
+
+    def test_posts_comment_at_limit(self):
+        """Test that a comment is posted when retry limit is reached"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            # Pre-record 3 attempts
+            dbs.pfix_add(1, 10, 1, 'failure')
+            dbs.pfix_add(1, 20, 2, 'failure')
+            dbs.pfix_add(1, 30, 3, 'failure')
+            dbs.commit()
+
+            mrs = [self._make_mr(pipeline_id=40)]
+            with mock.patch.object(control, 'run_git'):
+                with mock.patch.object(gitlab, 'reply_to_mr',
+                                       return_value=True) as mock_reply:
+                    control.process_pipeline_failures(
+                        'ci', mrs, dbs, 'master', 3)
+
+            mock_reply.assert_called_once()
+            call_args = mock_reply.call_args
+            self.assertIn('retry limit', call_args[0][2])
+            dbs.close()
+
+    def test_processes_failed(self):
+        """Test processing a failed pipeline"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            failed_jobs = [
+                gitlab.FailedJob(id=1, name='build', stage='build',
+                                 web_url='https://gitlab.com/job/1',
+                                 log_tail='error'),
+            ]
+            mrs = [self._make_mr()]
+
+            with mock.patch.object(control, 'run_git'):
+                with mock.patch.object(gitlab, 'get_failed_jobs',
+                                       return_value=failed_jobs):
+                    with mock.patch.object(agent, 'fix_pipeline',
+                                           return_value=(True, 'Fixed it')):
+                        with mock.patch.object(
+                                gitlab, 'push_branch',
+                                return_value=True) as mock_push:
+                            with mock.patch.object(gitlab, 'update_mr_desc',
+                                                   return_value=True):
+                                with mock.patch.object(
+                                        gitlab, 'reply_to_mr',
+                                        return_value=True) as mock_reply:
+                                    with mock.patch.object(
+                                            control,
+                                            'update_history_pipeline_fix'):
+                                        result = \
+                                            control.process_pipeline_failures(
+                                                'ci', mrs, dbs, 'master', 3)
+
+            self.assertEqual(result, 1)
+            # Should be recorded in database
+            self.assertTrue(dbs.pfix_has(1, 100))
+            # Should push the branch
+            mock_push.assert_called_once_with(
+                'ci', 'cherry-test-1', force=True, skip_ci=False)
+            # Should post a comment on the MR
+            mock_reply.assert_called_once()
+            reply_msg = mock_reply.call_args[0][2]
+            self.assertIn('Fixed it', reply_msg)
+            self.assertIn('build', reply_msg)
+            dbs.close()
+
+    def test_skips_no_pipeline_id(self):
+        """Test that MRs without pipeline_id are skipped"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            mrs = [self._make_mr(pipeline_id=None)]
+            with mock.patch.object(control, 'run_git'):
+                result = control.process_pipeline_failures(
+                    'ci', mrs, dbs, 'master', 3)
+
+            self.assertEqual(result, 0)
+            dbs.close()
+
+    def test_rebases_before_fix(self):
+        """Test that a branch needing rebase is rebased instead of fixed"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            mrs = [self._make_mr(needs_rebase=True)]
+            with mock.patch.object(control, 'run_git'):
+                with mock.patch.object(
+                        gitlab, 'push_branch',
+                        return_value=True) as mock_push:
+                    with mock.patch.object(agent, 'fix_pipeline') as mock_fix:
+                        result = control.process_pipeline_failures(
+                            'ci', mrs, dbs, 'master', 3)
+
+            self.assertEqual(result, 1)
+            # Should push the rebased branch, not call fix_pipeline
+            mock_push.assert_called_once_with(
+                'ci', 'cherry-test-1', force=True, skip_ci=False)
+            mock_fix.assert_not_called()
+            # Should be recorded as 'rebased' in database
+            self.assertTrue(dbs.pfix_has(1, 100))
+            dbs.close()
+
+    def test_rebase_with_conflicts_skips(self):
+        """Test that a failed rebase skips the pipeline fix"""
+        with terminal.capture():
+            dbs = database.Database(self.db_path)
+            dbs.start()
+
+            mrs = [self._make_mr(has_conflicts=True)]
+
+            def mock_run_git_fn(args):
+                if args[0] == 'rebase':
+                    raise command.CommandExc('conflict', None)
+                return ''
+
+            with mock.patch.object(control, 'run_git',
+                                   side_effect=mock_run_git_fn):
+                with mock.patch.object(agent, 'fix_pipeline') as mock_fix:
+                    result = control.process_pipeline_failures(
+                        'ci', mrs, dbs, 'master', 3)
+
+            self.assertEqual(result, 0)
+            mock_fix.assert_not_called()
+            dbs.close()
+
+    def test_disabled_with_zero(self):
+        """Test that fix_retries=0 is handled in do_step (not called)"""
+        mock_mr = gitlab.PickmanMr(
+            iid=123,
+            title='[pickman] Test MR',
+            web_url='https://gitlab.com/mr/123',
+            source_branch='cherry-test',
+            description='Test',
+            pipeline_status='failed',
+            pipeline_id=100,
+        )
+        with mock.patch.object(control, 'run_git'):
+            with mock.patch.object(gitlab, 'get_merged_pickman_mrs',
+                                   return_value=[]):
+                with mock.patch.object(gitlab, 'get_open_pickman_mrs',
+                                       return_value=[mock_mr]):
+                    with mock.patch.object(
+                            control, 'process_pipeline_failures') as mock_ppf:
+                        args = argparse.Namespace(
+                            cmd='step', source='us/next',
+                            remote='ci', target='master',
+                            max_mrs=1, fix_retries=0)
+                        with terminal.capture():
+                            control.do_step(args, None)
+
+        mock_ppf.assert_not_called()
+
+
+class TestStepFixRetries(unittest.TestCase):
+    """Tests for --fix-retries argument parsing."""
+
+    def test_default(self):
+        """Test default fix-retries value for step"""
+        args = pickman.parse_args(['step', 'us/next'])
+        self.assertEqual(args.fix_retries, 3)
+
+    def test_custom(self):
+        """Test custom fix-retries value for step"""
+        args = pickman.parse_args(['step', 'us/next', '--fix-retries', '5'])
+        self.assertEqual(args.fix_retries, 5)
+
+    def test_zero_disables(self):
+        """Test that fix-retries=0 is accepted"""
+        args = pickman.parse_args(['step', 'us/next', '--fix-retries', '0'])
+        self.assertEqual(args.fix_retries, 0)
+
+    def test_poll_default(self):
+        """Test default fix-retries value for poll"""
+        args = pickman.parse_args(['poll', 'us/next'])
+        self.assertEqual(args.fix_retries, 3)
+
+    def test_poll_custom(self):
+        """Test custom fix-retries value for poll"""
+        args = pickman.parse_args(['poll', 'us/next', '--fix-retries', '1'])
+        self.assertEqual(args.fix_retries, 1)
+
+
 if __name__ == '__main__':
     unittest.main()