feat: upstream sync pipeline with per-commit AI analysis #3

Merged
sharpninja merged 2 commits into main from copilot/setup-incoming-sync-pipeline
Feb 28, 2026

Conversation

Copilot AI commented Feb 28, 2026

Description

Adds an automated pipeline that mirrors microsoft/graphrag main into an incoming branch daily, then dispatches an AI agent per new commit to analyze required .NET and documentation changes, opens a labeled PR, and enables auto-merge.

Related Issues

Upstream synchronization requirement.

Proposed Changes

  • .github/workflows/sync-incoming.yml

    • Runs daily at 06:00 UTC (+ workflow_dispatch)
    • Force-resets incoming branch to upstream/main tip (pure mirror, intentional --force)
    • Computes PREV..upstream/main commit range (capped at 10); on first run, queues only the latest commit
    • Fires analyze-upstream-commit.yml per SHA via matrix dispatch (max 2 parallel)
  • .github/workflows/analyze-upstream-commit.yml

    • Input: upstream_commit_sha (called by sync workflow or manually)
    • Extracts commit message, stat, and diff (Python/Markdown only, ≤ 8 KB)
    • Calls GitHub Models API (gpt-4o-mini via GITHUB_TOKEN) — structured output: Summary, .NET Changes Required, Documentation Changes Required, Priority, PR Title, PR Body
    • Commits analysis doc to docs/upstream-sync/upstream-<sha8>.md on a new sync/upstream-<sha8> branch
    • Opens PR → main labelled upstream-sync with full analysis in body
    • Enables auto-merge via GraphQL enablePullRequestAutoMerge (squash); falls back to direct merge; leaves PR open on conflict
    • Fully idempotent — re-running for the same SHA is a no-op
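
The structured output named in the list above could be split into its sections before the analysis document and PR are built. A minimal sketch, assuming the model replies with `## ` headers for each section; `split_sections` and `missing_sections` are hypothetical helper names, not functions from the workflow:

```python
import re

# Section headers the workflow asks the model to emit (from the list above).
EXPECTED = [
    "Summary",
    ".NET Changes Required",
    "Documentation Changes Required",
    "Priority",
    "PR Title",
    "PR Body",
]


def split_sections(reply: str) -> dict[str, str]:
    """Split a markdown reply into {header: body} using '## ' headers."""
    sections: dict[str, str] = {}
    current = None
    for line in reply.splitlines():
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match and match.group(1) in EXPECTED:
            # A recognised header starts a new section.
            current = match.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    # Strip surrounding whitespace; sections the model omitted stay absent.
    return {name: body.strip() for name, body in sections.items()}


def missing_sections(sections: dict[str, str]) -> list[str]:
    """Return the expected headers that the reply did not contain."""
    return [name for name in EXPECTED if name not in sections]
```

Checking `missing_sections` before opening the PR would let the workflow tell a malformed reply apart from a genuine "None required" answer.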

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

Auto-merge requires the repository's "Allow auto-merge" setting to be enabled, together with branch-protection rules on main. Without them, the workflow falls back to a direct squash-merge attempt; PRs with conflicts remain open for manual resolution.


Copilot AI self-assigned this Feb 28, 2026
Adds two GitHub Actions workflows:

sync-incoming.yml:
- Runs daily at 06:00 UTC (and on demand)
- Fetches microsoft/graphrag main and force-pushes to the incoming branch
- Calculates new upstream commits since the last sync (max 10)
- Dispatches analyze-upstream-commit.yml for each new commit

analyze-upstream-commit.yml:
- Triggered per-commit via workflow_dispatch
- Fetches the specific upstream commit and extracts message, stat, and diff
- Calls GitHub Models API (gpt-4o-mini) for structured dotnet/docs analysis
- Creates a sync/upstream-<sha8> branch with an analysis document
- Opens a PR targeting main labelled upstream-sync
- Enables auto-merge (GraphQL); falls back to direct squash-merge;
  leaves PR open if there are unresolvable conflicts
- Fully idempotent: re-running for same SHA is a no-op

Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Create pipeline for synchronizing with upstream repository" to "feat: upstream sync pipeline with per-commit AI analysis" Feb 28, 2026
@sharpninja sharpninja marked this pull request as ready for review February 28, 2026 20:47
Copilot AI review requested due to automatic review settings February 28, 2026 20:47
@sharpninja sharpninja merged commit eb062de into main Feb 28, 2026
15 of 22 checks passed

Copilot AI left a comment

Pull request overview

This PR adds a two-workflow automated pipeline for monitoring and analyzing upstream changes from microsoft/graphrag. A daily sync workflow force-mirrors upstream/main into an incoming branch, computes new commits since the last sync, and dispatches per-commit analysis jobs. The analysis workflow fetches each commit, sends its message/stat/diff to the GitHub Models API (gpt-4o-mini) for AI-generated .NET and docs change recommendations, commits the analysis to a sync/upstream-<sha8> branch, opens a labeled PR to main, and attempts to enable auto-merge.

Changes:

  • sync-incoming.yml: Daily cron + manual dispatch to mirror upstream main into the incoming branch and dispatch per-SHA analysis jobs (up to 10, capped at 2 parallel).
  • analyze-upstream-commit.yml: Per-SHA workflow that extracts commit info, calls the GitHub Models API for AI analysis, creates a sync branch with the analysis doc, opens a PR, and enables auto-merge.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 15 comments.

File | Description
.github/workflows/sync-incoming.yml | Syncs upstream to incoming branch and fans out per-commit analysis dispatches
.github/workflows/analyze-upstream-commit.yml | AI-driven per-commit analysis, branch creation, PR creation, and auto-merge enablement

- name: Extract commit information
  id: commit-info
  run: |
    SHA="${{ inputs.upstream_commit_sha }}"

Copilot AI Feb 28, 2026

At line 41, ${{ inputs.upstream_commit_sha }} is interpolated directly into a shell script (SHA="${{ inputs.upstream_commit_sha }}"). GitHub Actions substitutes the expression into the script text before the shell runs, so quoting "$SHA" in the later git show commands does not help: an attacker-controlled value containing shell metacharacters (e.g., $(...), backticks, or a closing quote) would execute while the assignment itself is evaluated. For the same reason, a check placed after an interpolated assignment runs too late. The input should be passed through the step's env: block and then validated as a hex SHA (e.g., [[ "$SHA" =~ ^[0-9a-fA-F]{40}$ ]]) before use.

Suggested change
- SHA="${{ inputs.upstream_commit_sha }}"
+ # Assumes the step sets env: UPSTREAM_SHA: ${{ inputs.upstream_commit_sha }}
+ SHA="$UPSTREAM_SHA"
+ if ! [[ "$SHA" =~ ^[0-9a-fA-F]{40}$ ]]; then
+   echo "Invalid upstream_commit_sha: $SHA" >&2
+   exit 1
+ fi

Copilot uses AI. Check for mistakes.
Comment on lines +44 to +59
git show "$SHA" --format="%s%n%b" --no-patch \
> /tmp/commit_message.txt 2>/dev/null \
|| echo "Commit ${SHORT}" > /tmp/commit_message.txt

git show "$SHA" --stat --no-patch \
> /tmp/commit_stat.txt 2>/dev/null \
|| echo "(stat unavailable)" > /tmp/commit_stat.txt

# Capture diff for Python and Markdown files only (capped to keep tokens low)
git show "$SHA" -- '*.py' '*.md' \
| head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \
|| echo "(diff unavailable)" > /tmp/commit_diff.txt

echo "sha=${SHA}" >> "$GITHUB_OUTPUT"
echo "short=${SHORT}" >> "$GITHUB_OUTPUT"
echo "branch=sync/upstream-${SHORT}" >> "$GITHUB_OUTPUT"

Copilot AI Feb 28, 2026

The "Extract commit information" step fetches only upstream main, yet line 53 runs git show on an arbitrary SHA supplied as input. If that SHA is not reachable from upstream/main (e.g., it was force-pushed away, belongs to a different branch, or was mistyped in a manual workflow_dispatch), the git show commands fail silently: the 2>/dev/null || fallbacks swallow the error and the extracted diff ends up empty. Nothing validates that the SHA exists in the fetched history before the workflow proceeds. A validation step after the fetch (e.g., git cat-file -e "$SHA^{commit}", or git merge-base --is-ancestor "$SHA" upstream/main to also confirm reachability) would prevent creating empty or misleading analysis documents.

Comment on lines +53 to +55
git show "$SHA" -- '*.py' '*.md' \
| head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \
|| echo "(diff unavailable)" > /tmp/commit_diff.txt

Copilot AI Feb 28, 2026

At line 53, git show "$SHA" -- '*.py' '*.md' is piped to head -c 8000, but the || fallback is applied to the entire pipeline. In bash, the exit code of the pipeline is the exit code of the last command (head), not git show. This means if git show fails, the pipeline may still "succeed" (because head returns 0), and the fallback echo "(diff unavailable)" will NOT be written. The correct approach would be to use set -o pipefail at the top of the shell script, or to check git show separately before piping.

Suggested change
- git show "$SHA" -- '*.py' '*.md' \
-   | head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \
-   || echo "(diff unavailable)" > /tmp/commit_diff.txt
+ if git show "$SHA" -- '*.py' '*.md' > /tmp/commit_diff_raw.txt 2>/dev/null; then
+   head -c 8000 /tmp/commit_diff_raw.txt > /tmp/commit_diff.txt
+ else
+   echo "(diff unavailable)" > /tmp/commit_diff.txt
+ fi
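
The same idea, checking the producer's exit status instead of the last command in a pipeline, can be sketched in Python, which the workflow already uses for its API call. This is an illustration only: `capture_capped` is a hypothetical helper, and unlike `head` it buffers the whole output in memory before truncating.

```python
import subprocess


def capture_capped(cmd: list[str], limit: int, fallback: str) -> str:
    """Run cmd; return at most `limit` bytes of stdout, or `fallback` on failure."""
    try:
        result = subprocess.run(cmd, capture_output=True, check=False)
    except OSError:
        # The executable could not be started at all.
        return fallback
    if result.returncode != 0:
        # Unlike `cmd | head`, the producer's own exit status is inspected.
        return fallback
    # Truncate on bytes, then decode; a split multi-byte character is replaced.
    return result.stdout[:limit].decode("utf-8", errors="replace")
```

A call such as `capture_capped(["git", "show", sha, "--", "*.py", "*.md"], 8000, "(diff unavailable)")` would mirror the step under review, with `sha` standing in for the validated input.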

Comment on lines +197 to +198
analysis_text = (
f"Analysis unavailable: {exc}\n\n"

Copilot AI Feb 28, 2026

The analyze-upstream-commit.yml workflow uses ${{ secrets.GITHUB_TOKEN }} to authenticate against the GitHub Models API at https://models.inference.ai.azure.com. The GITHUB_TOKEN is a short-lived token scoped to the current repository, and depending on the repository's token settings and the workflow's permissions block it may not be allowed to call GitHub Models. If the call fails in an automatically dispatched run, nothing raises an alert: the analysis silently falls back to a placeholder message (lines 197–200) and the PR is still opened, but with no useful content. Consider logging the failure and failing the step explicitly on API auth errors rather than creating empty analysis PRs.

Suggested change
-     analysis_text = (
-         f"Analysis unavailable: {exc}\n\n"
+     msg = str(exc)
+     print(f"GitHub Models API call failed: {msg}")
+     # If this looks like an authentication/authorization failure,
+     # fail the step explicitly so we don't create an empty analysis PR.
+     if "401" in msg or "403" in msg:
+         print(
+             "GitHub Models API authentication/authorization appears to have "
+             "failed (HTTP 401/403). Verify that the token used for "
+             "GitHub Models access has the required permissions."
+         )
+         raise
+     # For non-auth failures, fall back to a placeholder analysis but keep the workflow running.
+     analysis_text = (
+         f"Analysis unavailable: {msg}\n\n"
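
Matching "401" or "403" inside the exception text can misfire if those digits appear elsewhere in the message. Since the script imports urllib.request, a sketch that inspects the status code directly might look like the following; `analysis_or_placeholder` and its `call_api` argument are hypothetical names used only for illustration:

```python
import urllib.error


def analysis_or_placeholder(call_api):
    """Return the API result, a placeholder on transient errors, or re-raise on auth errors."""
    try:
        return call_api()
    except urllib.error.HTTPError as exc:
        # HTTPError carries the numeric status code in `exc.code`.
        if exc.code in (401, 403):
            print("GitHub Models API rejected the token (HTTP %d)." % exc.code)
            raise
        return f"Analysis unavailable: HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # Network-level failures (DNS, refused connection) have no status code.
        return f"Analysis unavailable: {exc.reason}"
```

HTTPError is a subclass of URLError, so the more specific handler has to come first.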

Comment on lines +311 to +313
with:
script: |
const prNumber = parseInt('${{ steps.create-pr.outputs.pr_number }}', 10);

Copilot AI Feb 28, 2026

At line 313, ${{ steps.create-pr.outputs.pr_number }} is interpolated directly into JavaScript. While pr_number is set to pr.data.number.toString() by an earlier step and is unlikely to be attacker-controlled, the pattern of direct interpolation into script strings is unsafe and inconsistent with GitHub's recommended practice. This value should be passed via an environment variable and read via process.env inside the script.

Suggested change
-   with:
-     script: |
-       const prNumber = parseInt('${{ steps.create-pr.outputs.pr_number }}', 10);
+   env:
+     PR_NUMBER: ${{ steps.create-pr.outputs.pr_number }}
+   with:
+     script: |
+       const prNumber = parseInt(process.env.PR_NUMBER || '', 10);

import textwrap
import urllib.request

sha = "${{ inputs.upstream_commit_sha }}"

Copilot AI Feb 28, 2026

The upstream_commit_sha input is interpolated directly into a Python heredoc at line 83: sha = "${{ inputs.upstream_commit_sha }}". A maliciously crafted SHA (e.g., one containing ", \n, or backticks) could escape the Python string literal or inject arbitrary shell/Python code. The SHA should be passed via an environment variable (using env:) and read inside Python with os.environ, rather than being interpolated directly into the script source code.
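
A minimal sketch of the environment-variable approach, assuming the step exports the input under the hypothetical name UPSTREAM_SHA via env: (the function name is likewise illustrative):

```python
import os
import re

# A full SHA-1 object name is exactly 40 hexadecimal characters.
_SHA_RE = re.compile(r"[0-9a-fA-F]{40}")


def read_commit_sha(env_var: str = "UPSTREAM_SHA") -> str:
    """Read the commit SHA from the environment and reject anything that is not 40 hex chars."""
    value = os.environ.get(env_var, "")
    # fullmatch rejects trailing newlines and any extra characters.
    if not _SHA_RE.fullmatch(value):
        raise ValueError(f"{env_var} is not a 40-character hex SHA: {value!r}")
    return value
```

Because the value never becomes part of the script source, quotes or newlines in it cannot change what the script does; they only cause the validation to fail.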

Comment on lines +243 to +248
with:
script: |
const fs = require('fs');
const sha = '${{ inputs.upstream_commit_sha }}';
const short = sha.substring(0, 8);
const branch = '${{ steps.commit-info.outputs.branch }}';

Copilot AI Feb 28, 2026

At lines 246 and 248, ${{ inputs.upstream_commit_sha }} and ${{ steps.commit-info.outputs.branch }} are interpolated directly into the JavaScript source using single-quoted string literals. If the SHA or branch name contains a single quote or other JS metacharacter, the script can be broken or exploited. These values should instead be read from environment variables (set via the env: key of the step) and accessed through process.env inside the script.

Suggested change
-   with:
-     script: |
-       const fs = require('fs');
-       const sha = '${{ inputs.upstream_commit_sha }}';
-       const short = sha.substring(0, 8);
-       const branch = '${{ steps.commit-info.outputs.branch }}';
+   env:
+     UPSTREAM_SHA: ${{ inputs.upstream_commit_sha }}
+     SYNC_BRANCH: ${{ steps.commit-info.outputs.branch }}
+   with:
+     script: |
+       const fs = require('fs');
+       const sha = process.env.UPSTREAM_SHA;
+       const short = sha.substring(0, 8);
+       const branch = process.env.SYNC_BRANCH;

Comment on lines +94 to +136
commit_msg = read_capped("/tmp/commit_message.txt", 800)
stat = read_capped("/tmp/commit_stat.txt", 2000)
diff = read_capped("/tmp/commit_diff.txt", 4000)

prompt = textwrap.dedent(f"""
You are analyzing an upstream commit from the microsoft/graphrag Python repository.
This fork (sharpninja/graphrag) adds a .NET/C# implementation in `dotnet/` and
extended documentation that mirrors the Python library behavior.

Upstream commit: {short}

Commit message:
{commit_msg}

Changed files (stat):
{stat}

Diff preview (Python/Markdown files only):
{diff}

Analyze what changes are required in the fork's `dotnet/` and `docs/` directories
to keep the .NET implementation and documentation synchronized with this upstream change.

Reply with EXACTLY this format (keep all section headers):

## Summary
<one-paragraph description of what this upstream commit does>

## .NET Changes Required
<bullet list of specific changes needed in dotnet/, or "None required" if not applicable>

## Documentation Changes Required
<bullet list of documentation changes needed, or "None required" if not applicable>

## Priority
HIGH | MEDIUM | LOW — with one-sentence justification

## PR Title
<concise imperative title, e.g. "sync: update X to match upstream Y behavior">

## PR Body
<markdown body (2-4 sentences) for the pull request>
""").strip()

Copilot AI Feb 28, 2026

The AGENTS.md, .clinerules, and custom-instructions/repo/.github/copilot-instructions.md files in this repository instruct AI agents to read a file AGENTS-README-FIRST.yaml in the repo root and POST to an MCP session log endpoint before doing any work. Combined with this workflow, that is a prompt injection / exfiltration vector: commit_msg, stat, and diff (lines 94–96) are untrusted upstream content, and a malicious commit message or diff could contain instructions that steer the model's output. That output is written to the repo and the PR body, where it could later trigger the agent MCP procedures described above. The GITHUB_TOKEN is also sent to the GitHub Models API at line 138 as the request credential; it is not part of the prompt text, but it should carry the narrowest permissions the workflow can run with. Content placed in the AI prompt should be treated as untrusted and clearly delimited, and raw diff content should not be included without additional escaping or content moderation.
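
One way to delimit untrusted content is to wrap each field in explicit markers and neutralise any copy of those markers inside the field. The sketch below is an assumption about how that could look (the `wrap_untrusted` name and the `<<<BEGIN ...>>>` marker format are invented for illustration); it reduces ambiguity for the model but does not by itself prevent prompt injection.

```python
import re


def wrap_untrusted(label: str, text: str, limit: int) -> str:
    """Cap `text` and wrap it in markers so the prompt can refer to it strictly as data."""
    begin = f"<<<BEGIN {label}>>>"
    end = f"<<<END {label}>>>"
    capped = text[:limit]
    # Break up runs of '<' and '>' so the untrusted text cannot reproduce a marker.
    capped = re.sub(r"<(?=<)", "< ", capped)
    capped = re.sub(r">(?=>)", "> ", capped)
    return f"{begin}\n{capped}\n{end}"
```

The prompt would then state that everything between the markers is data to analyze, never instructions to follow.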

id: branch-check
run: |
BRANCH="${{ steps.commit-info.outputs.branch }}"
if git ls-remote --heads origin "$BRANCH" | grep -q "$BRANCH"; then

Copilot AI Feb 28, 2026

At line 65 in analyze-upstream-commit.yml, the branch-existence check uses grep -q "$BRANCH", which treats the value of steps.commit-info.outputs.branch as a regular expression. If the branch name contains regex metacharacters (e.g., .), grep can match unintended refs. Use grep -qF "$BRANCH" so the name is matched as a fixed string.

Suggested change
- if git ls-remote --heads origin "$BRANCH" | grep -q "$BRANCH"; then
+ if git ls-remote --heads origin "$BRANCH" | grep -qF "$BRANCH"; then
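
Fixed-string matching still accepts substrings, so a stricter variant compares the full ref name. A small sketch, assuming the output of git ls-remote --heads has been captured as text (`remote_has_branch` is a hypothetical helper):

```python
def remote_has_branch(ls_remote_output: str, branch: str) -> bool:
    """Return True if the ls-remote output lists exactly refs/heads/<branch>."""
    wanted = f"refs/heads/{branch}"
    for line in ls_remote_output.splitlines():
        # Each line has the form "<sha>\t<refname>".
        _sha, _, ref = line.partition("\t")
        if ref.strip() == wanted:
            return True
    return False
```

Comparing whole ref names avoids both regex interpretation and accidental matches on branches that merely contain the target name.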

run: |
PREV="${{ steps.prev-head.outputs.prev }}"
if [ -n "$PREV" ]; then
# Commits reachable from upstream/main but not from PREV

Copilot AI Feb 28, 2026

The sync-incoming.yml workflow computes new commits in the "Collect new upstream commits" step as ${PREV}..upstream/main, where PREV is the SHA of origin/incoming as left by the previous run. If the upstream repository force-pushes its main branch (unusual but possible), PREV is no longer an ancestor of upstream/main and the range stops meaning "commits added since the last sync". It then contains every commit on the rewritten history back to the merge base with PREV, which can include rewritten copies of commits that were already analyzed and can exceed the 10-commit cap; if main was instead rewound to an ancestor of PREV, the range is empty and nothing is queued. In both cases incoming is still force-synced to the new tip without any signal that history changed. This edge case should at least be documented in a comment.

Suggested change
- # Commits reachable from upstream/main but not from PREV
+ # Commits reachable from upstream/main but not from PREV.
+ # NOTE: PREV is the previous origin/incoming tip, which normally mirrors upstream/main.
+ # If upstream/main is force-pushed so that PREV is no longer an ancestor, the range
+ # "${PREV}..upstream/main" covers the rewritten history back to the merge base (or is
+ # empty if main was rewound), so commits may be re-analyzed or dropped by the cap
+ # while the incoming branch is still force-synced to the new upstream tip.
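
The behaviour of a two-dot range after a force-push can be checked on a toy commit graph. The sketch below models history as a dict from commit to parents; it illustrates the range semantics only and is not how the workflow computes the list (in the workflow itself, git merge-base --is-ancestor would be the natural check).

```python
def reachable(parents: dict[str, list[str]], tip: str) -> set[str]:
    """Return every commit reachable from `tip`, including `tip` itself."""
    seen: set[str] = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def commit_range(parents: dict[str, list[str]], prev: str, tip: str) -> set[str]:
    """Model of `prev..tip`: commits reachable from tip but not from prev."""
    return reachable(parents, tip) - reachable(parents, prev)


def is_ancestor(parents: dict[str, list[str]], prev: str, tip: str) -> bool:
    """True when `prev` lies in the history of `tip` (a fast-forward update)."""
    return prev in reachable(parents, tip)


# Old history A <- B <- C (PREV = C); upstream is rewritten to A <- B2 <- C2.
HISTORY = {"A": [], "B": ["A"], "C": ["B"], "B2": ["A"], "C2": ["B2"], "D": ["C"]}
```

With this graph, `commit_range(HISTORY, "C", "C2")` yields the rewritten commits B2 and C2 rather than an empty set, while `is_ancestor(HISTORY, "C", "C2")` is false, which is the condition a sync job could log or alert on.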
