feat: upstream sync pipeline with per-commit AI analysis #3
Adds two GitHub Actions workflows:
sync-incoming.yml:
- Runs daily at 06:00 UTC (and on demand)
- Fetches microsoft/graphrag main and force-pushes to the incoming branch
- Calculates new upstream commits since the last sync (max 10)
- Dispatches analyze-upstream-commit.yml for each new commit
analyze-upstream-commit.yml:
- Triggered per-commit via workflow_dispatch
- Fetches the specific upstream commit and extracts message, stat, and diff
- Calls GitHub Models API (gpt-4o-mini) for structured dotnet/docs analysis
- Creates a sync/upstream-<sha8> branch with an analysis document
- Opens a PR targeting main labelled upstream-sync
- Enables auto-merge (GraphQL); falls back to direct squash-merge; leaves PR open if there are unresolvable conflicts
- Fully idempotent: re-running for same SHA is a no-op
Co-authored-by: sharpninja <16146732+sharpninja@users.noreply.github.com>
Pull request overview
This PR adds a two-workflow automated pipeline for monitoring and analyzing upstream changes from microsoft/graphrag. A daily sync workflow force-mirrors upstream/main into an incoming branch, computes new commits since the last sync, and dispatches per-commit analysis jobs. The analysis workflow fetches each commit, sends its message/stat/diff to the GitHub Models API (gpt-4o-mini) for AI-generated .NET and docs change recommendations, commits the analysis to a sync/upstream-<sha8> branch, opens a labeled PR to main, and attempts to enable auto-merge.
Changes:
- sync-incoming.yml: Daily cron + manual dispatch to mirror upstream main → incoming branch and dispatch per-SHA analysis jobs (up to 10, capped at 2 parallel).
- analyze-upstream-commit.yml: Per-SHA workflow that extracts commit info, calls the GitHub Models API for AI analysis, creates a sync branch with the analysis doc, opens a PR, and enables auto-merge.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| .github/workflows/sync-incoming.yml | Syncs upstream to incoming branch and fans out per-commit analysis dispatches |
| .github/workflows/analyze-upstream-commit.yml | AI-driven per-commit analysis, branch creation, PR creation, and auto-merge enablement |
    - name: Extract commit information
      id: commit-info
      run: |
        SHA="${{ inputs.upstream_commit_sha }}"
At line 41, ${{ inputs.upstream_commit_sha }} is interpolated directly into a shell script (SHA="${{ inputs.upstream_commit_sha }}"). Although the SHA is immediately quoted with "$SHA" when used in subsequent git show commands, an attacker-controlled SHA could contain shell metacharacters (e.g., $(...), backticks) that execute before the variable is assigned and quoted. The input should be sanitized or validated to be a valid hex SHA (e.g., using a regex check like [[ "$SHA" =~ ^[0-9a-fA-F]{40}$ ]]) before use.
Suggested change:
    - SHA="${{ inputs.upstream_commit_sha }}"
    + SHA="${{ inputs.upstream_commit_sha }}"
    + if ! [[ "$SHA" =~ ^[0-9a-fA-F]{40}$ ]]; then
    +   echo "Invalid upstream_commit_sha: $SHA" >&2
    +   exit 1
    + fi
    git show "$SHA" --format="%s%n%b" --no-patch \
      > /tmp/commit_message.txt 2>/dev/null \
      || echo "Commit ${SHORT}" > /tmp/commit_message.txt

    git show "$SHA" --stat --no-patch \
      > /tmp/commit_stat.txt 2>/dev/null \
      || echo "(stat unavailable)" > /tmp/commit_stat.txt

    # Capture diff for Python and Markdown files only (capped to keep tokens low)
    git show "$SHA" -- '*.py' '*.md' \
      | head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \
      || echo "(diff unavailable)" > /tmp/commit_diff.txt

    echo "sha=${SHA}" >> "$GITHUB_OUTPUT"
    echo "short=${SHORT}" >> "$GITHUB_OUTPUT"
    echo "branch=sync/upstream-${SHORT}" >> "$GITHUB_OUTPUT"
The "Extract commit information" step (line 53) runs git show on an arbitrary SHA supplied as input, but the workflow only fetches upstream main. If the SHA is not reachable from upstream/main (e.g., it was force-pushed away, belongs to a different branch, or was mistyped in a manual workflow_dispatch), the git show commands fail silently: the 2>/dev/null || fallback hides the error, and the extracted diff ends up empty. Nothing validates that the SHA exists in the fetched upstream ref before the workflow proceeds. Validating the SHA after fetching (e.g., git cat-file -e "$SHA") would prevent empty or misleading analysis documents.
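A minimal sketch of the existence check proposed above, written in Python to match the workflow's inline analysis script. The helper name `commit_exists` and its `repo_dir` parameter are illustrative assumptions and do not appear in the PR; only the `git cat-file -e` call comes from the review comment.

```python
import subprocess


def commit_exists(sha: str, repo_dir: str = ".") -> bool:
    """Return True only if `sha` resolves to a commit in the local clone.

    Hypothetical helper: the name and the repo_dir parameter are assumptions.
    """
    try:
        result = subprocess.run(
            # "-e" prints nothing and only sets the exit status; the
            # "^{commit}" suffix makes git fail for unknown names and for
            # objects that do not resolve to a commit.
            ["git", "cat-file", "-e", f"{sha}^{{commit}}"],
            cwd=repo_dir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # git is not installed or repo_dir does not exist: treat as missing.
        return False
    return result.returncode == 0
```

Under this sketch the workflow would continue to the extraction step only when the function returns True, and would otherwise exit with an error instead of writing placeholder files.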
    git show "$SHA" -- '*.py' '*.md' \
      | head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \
      || echo "(diff unavailable)" > /tmp/commit_diff.txt
At line 53, git show "$SHA" -- '*.py' '*.md' is piped to head -c 8000, but the || fallback is applied to the entire pipeline. In bash, the exit code of the pipeline is the exit code of the last command (head), not git show. This means if git show fails, the pipeline may still "succeed" (because head returns 0), and the fallback echo "(diff unavailable)" will NOT be written. The correct approach would be to use set -o pipefail at the top of the shell script, or to check git show separately before piping.
Suggested change:
    - git show "$SHA" -- '*.py' '*.md' \
    -   | head -c 8000 > /tmp/commit_diff.txt 2>/dev/null \
    -   || echo "(diff unavailable)" > /tmp/commit_diff.txt
    + if git show "$SHA" -- '*.py' '*.md' > /tmp/commit_diff_raw.txt 2>/dev/null; then
    +   head -c 8000 /tmp/commit_diff_raw.txt > /tmp/commit_diff.txt
    + else
    +   echo "(diff unavailable)" > /tmp/commit_diff.txt
    + fi
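The same "check the producer's exit status, then truncate" logic can be sketched in Python, the language of the workflow's analysis script. The function name `capture_capped` and its parameters are assumptions made for illustration; the 8000-byte cap and the fallback text come from the snippet above.

```python
import subprocess


def capture_capped(cmd: list[str], cap: int = 8000,
                   fallback: bytes = b"(diff unavailable)\n") -> bytes:
    """Run cmd and return at most `cap` bytes of its stdout.

    Unlike `cmd | head -c N || fallback`, the exit status inspected here is
    that of cmd itself, so a failing command always yields the fallback.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The executable could not be started at all.
        return fallback
    if result.returncode != 0:
        return fallback
    return result.stdout[:cap]
```

A call such as `capture_capped(["git", "show", sha, "--", "*.py", "*.md"])` would then stand in for the shell pipeline, with `sha` already validated.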
    analysis_text = (
        f"Analysis unavailable: {exc}\n\n"
The analyze-upstream-commit.yml workflow uses ${{ secrets.GITHUB_TOKEN }} to authenticate against the GitHub Models API at https://models.inference.ai.azure.com. The GITHUB_TOKEN is a short-lived token scoped to the current repository. Depending on the repository's token settings, this token may not have the required permissions to call GitHub Models. If this fails in a run that was dispatched automatically, there is no alerting mechanism — the analysis will silently fall back to a placeholder message (line 197–200), the PR will still be opened, but with no useful content. Consider logging or failing the step explicitly on API auth failures rather than silently creating empty analysis PRs.
Suggested change:
    - analysis_text = (
    -     f"Analysis unavailable: {exc}\n\n"
    + msg = str(exc)
    + print(f"GitHub Models API call failed: {msg}")
    + # If this looks like an authentication/authorization failure,
    + # fail the step explicitly so we don't create an empty analysis PR.
    + if "401" in msg or "403" in msg:
    +     print(
    +         "GitHub Models API authentication/authorization appears to have "
    +         "failed (HTTP 401/403). Verify that the token used for "
    +         "GitHub Models access has the required permissions."
    +     )
    +     raise
    + # For non-auth failures, fall back to a placeholder analysis but keep the workflow running.
    + analysis_text = (
    +     f"Analysis unavailable: {msg}\n\n"
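The suggestion above detects auth failures by searching the error message for the substrings "401" and "403", which could also match unrelated text such as a request ID. A possible refinement, assuming the call goes through `urllib.request.urlopen` (the script imports `urllib.request`), is to inspect the status code on the exception. The names `AUTH_STATUS_CODES` and `is_auth_failure` are hypothetical.

```python
import urllib.error

# HTTP statuses treated as authentication/authorization failures.
AUTH_STATUS_CODES = {401, 403}


def is_auth_failure(exc: BaseException) -> bool:
    """True when exc is a urllib HTTPError whose status code is 401 or 403."""
    return (
        isinstance(exc, urllib.error.HTTPError)
        and exc.code in AUTH_STATUS_CODES
    )
```

Inside the existing `except` block, `if is_auth_failure(exc): raise` would fail the step on auth errors, while other errors would still fall back to the placeholder analysis.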
    with:
      script: |
        const prNumber = parseInt('${{ steps.create-pr.outputs.pr_number }}', 10);
At line 313, ${{ steps.create-pr.outputs.pr_number }} is interpolated directly into JavaScript. While pr_number is set to pr.data.number.toString() by an earlier step and is unlikely to be attacker-controlled, the pattern of direct interpolation into script strings is unsafe and inconsistent with GitHub's recommended practice. This value should be passed via an environment variable and read via process.env inside the script.
Suggested change:
    - with:
    -   script: |
    -     const prNumber = parseInt('${{ steps.create-pr.outputs.pr_number }}', 10);
    + env:
    +   PR_NUMBER: ${{ steps.create-pr.outputs.pr_number }}
    + with:
    +   script: |
    +     const prNumber = parseInt(process.env.PR_NUMBER || '', 10);
    import textwrap
    import urllib.request

    sha = "${{ inputs.upstream_commit_sha }}"
The upstream_commit_sha input is interpolated directly into a Python heredoc at line 83: sha = "${{ inputs.upstream_commit_sha }}". A maliciously crafted SHA (e.g., one containing ", \n, or backticks) could escape the Python string literal or inject arbitrary shell/Python code. The SHA should be passed via an environment variable (using env:) and read inside Python with os.environ, rather than being interpolated directly into the script source code.
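One way the environment-variable approach could look inside the Python script is sketched below. The variable name `UPSTREAM_SHA` and both function names are assumptions for illustration; the step would need a matching `env:` entry that maps `inputs.upstream_commit_sha` to that variable. The 40-hex-character rule and the `sync/upstream-<sha8>` branch format are taken from the PR.

```python
import os
import re

# A full SHA-1 object name: exactly 40 hexadecimal characters.
_SHA_RE = re.compile(r"[0-9a-fA-F]{40}")


def read_upstream_sha(env_var: str = "UPSTREAM_SHA") -> str:
    """Read the commit SHA from the environment and reject anything else.

    The value is never spliced into source code, so quotes, newlines or
    backticks in the input cannot alter the script.
    """
    value = os.environ.get(env_var, "")
    if not _SHA_RE.fullmatch(value):
        raise ValueError(f"{env_var} is not a 40-character hex SHA: {value!r}")
    return value


def sync_branch_name(sha: str) -> str:
    """Build the sync/upstream-<sha8> branch name from a validated SHA."""
    return f"sync/upstream-{sha[:8]}"
```

Validation matters even with `env:`, because later steps pass the value to `git` and use it in branch and file names.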
    with:
      script: |
        const fs = require('fs');
        const sha = '${{ inputs.upstream_commit_sha }}';
        const short = sha.substring(0, 8);
        const branch = '${{ steps.commit-info.outputs.branch }}';
At lines 246 and 248, ${{ inputs.upstream_commit_sha }} and ${{ steps.commit-info.outputs.branch }} are interpolated directly into the JavaScript source using single-quoted string literals. If the SHA or branch name contains a single quote or other JS metacharacter, the script can be broken or exploited. These values should instead be read from environment variables (set via the env: key of the step) and accessed through process.env inside the script.
Suggested change:
    - with:
    -   script: |
    -     const fs = require('fs');
    -     const sha = '${{ inputs.upstream_commit_sha }}';
    -     const short = sha.substring(0, 8);
    -     const branch = '${{ steps.commit-info.outputs.branch }}';
    + env:
    +   UPSTREAM_SHA: ${{ inputs.upstream_commit_sha }}
    +   SYNC_BRANCH: ${{ steps.commit-info.outputs.branch }}
    + with:
    +   script: |
    +     const fs = require('fs');
    +     const sha = process.env.UPSTREAM_SHA;
    +     const short = sha.substring(0, 8);
    +     const branch = process.env.SYNC_BRANCH;
    commit_msg = read_capped("/tmp/commit_message.txt", 800)
    stat = read_capped("/tmp/commit_stat.txt", 2000)
    diff = read_capped("/tmp/commit_diff.txt", 4000)

    prompt = textwrap.dedent(f"""
    You are analyzing an upstream commit from the microsoft/graphrag Python repository.
    This fork (sharpninja/graphrag) adds a .NET/C# implementation in `dotnet/` and
    extended documentation that mirrors the Python library behavior.

    Upstream commit: {short}

    Commit message:
    {commit_msg}

    Changed files (stat):
    {stat}

    Diff preview (Python/Markdown files only):
    {diff}

    Analyze what changes are required in the fork's `dotnet/` and `docs/` directories
    to keep the .NET implementation and documentation synchronized with this upstream change.

    Reply with EXACTLY this format (keep all section headers):

    ## Summary
    <one-paragraph description of what this upstream commit does>

    ## .NET Changes Required
    <bullet list of specific changes needed in dotnet/, or "None required" if not applicable>

    ## Documentation Changes Required
    <bullet list of documentation changes needed, or "None required" if not applicable>

    ## Priority
    HIGH | MEDIUM | LOW — with one-sentence justification

    ## PR Title
    <concise imperative title, e.g. "sync: update X to match upstream Y behavior">

    ## PR Body
    <markdown body (2-4 sentences) for the pull request>
    """).strip()
The AGENTS.md, .clinerules, and custom-instructions/repo/.github/copilot-instructions.md files in this repository instruct AI agents to read a file AGENTS-README-FIRST.yaml in the repo root and POST to an MCP session log endpoint before doing any work. This is a prompt injection / exfiltration vector: a malicious upstream commit message or diff (processed by the AI in this workflow) could contain instructions that exploit these agent procedures. Moreover, the GITHUB_TOKEN is passed to the GitHub Models API at line 138, which could allow a compromised or unintentionally crafted prompt (via commit_msg, stat, or diff — lines 94–96) to exfiltrate the token via the API call, or to instruct the AI to output content that, when written to the repo, later triggers the agent MCP procedures. Input content used in the AI prompt should be carefully sandboxed and not include raw diff content without additional escaping or content moderation.
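One partial mitigation is to wrap each untrusted field in explicit markers before it is placed in the prompt, so the model can be told to treat everything between the markers as data. The sketch below is an illustration only: the function name `fence_untrusted` and the marker format are assumptions, and delimiting reduces the risk of prompt injection without removing it.

```python
import re


def fence_untrusted(label: str, text: str, cap: int) -> str:
    """Truncate untrusted text and wrap it in labelled BEGIN/END markers.

    Runs of '<' or '>' inside the text are broken up with spaces so the
    content cannot reproduce a marker and close the fence early.
    """
    marker = f"UNTRUSTED_{label.upper()}"
    body = text[:cap]
    body = re.sub(r"<(?=<)", "< ", body)   # "<<<" becomes "< < <"
    body = re.sub(r">(?=>)", "> ", body)   # ">>>" becomes "> > >"
    return f"<<<BEGIN_{marker}>>>\n{body}\n<<<END_{marker}>>>"
```

In the prompt, `{diff}` would become `fence_untrusted("diff", diff, 4000)`, alongside an instruction that text inside the markers must never be followed as instructions. Output written to the repository would still need its own review.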
    id: branch-check
    run: |
      BRANCH="${{ steps.commit-info.outputs.branch }}"
      if git ls-remote --heads origin "$BRANCH" | grep -q "$BRANCH"; then
At line 65 in analyze-upstream-commit.yml, the branch-existence check uses grep -q "$BRANCH", where $BRANCH comes from steps.commit-info.outputs.branch. The value is quoted, but grep still interprets it as a regular expression, so a branch name containing regex metacharacters (e.g., .) could match unintended refs. Use grep -qF "$BRANCH" so the name is matched as a fixed string.
Suggested change:
    - if git ls-remote --heads origin "$BRANCH" | grep -q "$BRANCH"; then
    + if git ls-remote --heads origin "$BRANCH" | grep -qF "$BRANCH"; then
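Fixed-string matching removes regex interpretation, but `grep -qF` still succeeds when the name is only a substring of a longer ref. A stricter check compares the full ref name exactly, which can be sketched in Python by parsing the tab-separated `git ls-remote` output. The function name `remote_branch_exists` is a hypothetical helper, not part of the PR.

```python
def remote_branch_exists(ls_remote_output: str, branch: str) -> bool:
    """Return True if the output of `git ls-remote --heads` lists the branch.

    Each output line has the form "<object-id>\t<ref-name>"; the ref name
    must equal refs/heads/<branch> exactly, so prefixes do not match.
    """
    wanted = f"refs/heads/{branch}"
    for line in ls_remote_output.splitlines():
        parts = line.split("\t", 1)
        if len(parts) == 2 and parts[1].strip() == wanted:
            return True
    return False
```

The workflow would pass the captured stdout of `git ls-remote --heads origin` to this function rather than piping it through `grep`.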
    run: |
      PREV="${{ steps.prev-head.outputs.prev }}"
      if [ -n "$PREV" ]; then
        # Commits reachable from upstream/main but not from PREV
The sync-incoming.yml workflow computes new commits in the "Collect new upstream commits" step as ${PREV}..upstream/main. PREV is the SHA of origin/incoming, which a previous run set to mirror upstream/main. If the upstream repository force-pushes its main branch (unusual but possible), PREV is no longer an ancestor of upstream/main and the range stops describing a simple fast-forward: it lists every commit on the rewritten history that is not reachable from PREV, and it is empty only if upstream was reset to an ancestor of PREV. The sync still force-updates incoming to the new tip, so analysis may be dispatched for rewritten commits (subject to the cap of 10) or for none at all. This edge case should at least be documented in a comment.
Suggested change:
    - # Commits reachable from upstream/main but not from PREV
    + # Commits reachable from upstream/main but not from PREV.
    + # NOTE: PREV is the previous origin/incoming tip, which normally mirrors upstream/main.
    + # If upstream/main is force-pushed so that PREV is no longer an ancestor, the range
    + # "${PREV}..upstream/main" covers the rewritten commits not reachable from PREV (and is
    + # empty if upstream was reset to an ancestor of PREV), while the incoming branch is
    + # still force-synced to the new upstream tip.
Description
Adds an automated pipeline that mirrors microsoft/graphrag main into an incoming branch daily, then dispatches an AI agent per new commit to analyze required .NET and documentation changes, opens a labeled PR, and enables auto-merge.
Related Issues
Upstream synchronization requirement.
Proposed Changes
.github/workflows/sync-incoming.yml
- Runs daily at 06:00 UTC (and on demand via workflow_dispatch)
- Force-pushes the incoming branch to the upstream/main tip (pure mirror, intentional --force)
- Computes the PREV..upstream/main commit range (capped at 10); on first run, queues only the latest commit
- Dispatches analyze-upstream-commit.yml per SHA via matrix dispatch (max 2 parallel)
.github/workflows/analyze-upstream-commit.yml
- Triggered with upstream_commit_sha (called by sync workflow or manually)
- Calls the GitHub Models API (gpt-4o-mini via GITHUB_TOKEN); structured output: Summary, .NET Changes Required, Documentation Changes Required, Priority, PR Title, PR Body
- Commits docs/upstream-sync/upstream-<sha8>.md on a new sync/upstream-<sha8> branch
- Opens a PR targeting main labelled upstream-sync with full analysis in body
- Enables auto-merge via enablePullRequestAutoMerge (squash); falls back to direct merge; leaves PR open on conflict
Checklist
Additional Notes
Auto-merge requires branch-protection rules with "Allow auto-merge" enabled on the repo. Without it, the workflow falls back to a direct squash-merge attempt; PRs with conflicts remain open for manual resolution.