Fix link checker CI failure by excluding LinkedIn URLs #5461

Open
dai-chen wants to merge 1 commit into opensearch-project:main from dai-chen:fix/exclude-linkedin-from-link-checker
@dai-chen
Collaborator

Description

Exclude LinkedIn URLs from the lychee link checker workflow. LinkedIn blocks automated crawlers and returns 404 for valid profile pages, causing false-positive CI failures.

Example:

## Errors per input

### Errors in docs/presentations/20201116-sql-demo.md

* [404] <https://www.linkedin.com/in/chen-dai/> (at 6:15) | Rejected status code: 404 Not Found

Notice: Summary report available at: https://github.com/opensearch-project/sql/actions/runs/26255105263#summary-77275788037
Error: Process completed with exit code 2.
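
As a rough illustration of why the new exclude entry silences this failure, the sketch below checks the failing URL against the pattern added in this PR. It uses Python's `re` module as a stand-in and assumes lychee applies each `--exclude` entry as an unanchored regex search; the helper name `is_excluded` is hypothetical and not part of lychee or this repository.

```python
import re

# Pattern added to the --exclude list in this PR.
LINKEDIN_EXCLUDE = re.compile(r"https://www.linkedin.com/.*")


def is_excluded(url: str) -> bool:
    """Return True if the URL would be skipped by the exclude pattern.

    Assumption: exclude patterns are applied as an unanchored search,
    which is approximated here with re.search.
    """
    return LINKEDIN_EXCLUDE.search(url) is not None


# URL reported in the CI failure above.
failing_url = "https://www.linkedin.com/in/chen-dai/"
# An unrelated URL that should still be checked.
other_url = "https://github.com/opensearch-project/sql"

print(is_excluded(failing_url))  # True
print(is_excluded(other_url))    # False
```

With the LinkedIn URL excluded, the checker never requests it, so the 404 returned to automated clients no longer affects the exit code.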

Related Issues

N/A

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dai-chen dai-chen self-assigned this May 21, 2026
@dai-chen dai-chen added the infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc. label May 21, 2026
@github-actions
Contributor

github-actions Bot commented May 21, 2026

PR Reviewer Guide 🔍

(Review updated until commit d6c3749)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions
Contributor

github-actions Bot commented May 21, 2026

PR Code Suggestions ✨

Latest suggestions up to d6c3749
Explore these optional code suggestions:

Category: General
Suggestion: Broaden LinkedIn URL exclusion pattern

The LinkedIn exclusion pattern https://www.linkedin.com/.* may not match all
LinkedIn URL variations (e.g., http:// URLs or subdomains such as in.linkedin.com).
Consider using a broader pattern such as .*linkedin\.com.* so that all LinkedIn
URLs are excluded.

.github/workflows/link-checker.yml [21]

-args: --accept=200,403,429,999  "./**/*.html" "./**/*.md" "./**/*.txt" --exclude "https://aws.oss.sonatype.*|https://ci.opensearch.*|https://central.sonatype.*|http://localhost.*|https://localhost|https://odfe-node1:9200/|https://community.tableau.com/docs/DOC-17978|.*family.zzz|opensearch*|.*@amazon.com|.*email.com|.*@github.com|http://timestamp.verisign.com/scripts/timstamp.dll|https://www.linkedin.com/.*"
+args: --accept=200,403,429,999  "./**/*.html" "./**/*.md" "./**/*.txt" --exclude "https://aws.oss.sonatype.*|https://ci.opensearch.*|https://central.sonatype.*|http://localhost.*|https://localhost|https://odfe-node1:9200/|https://community.tableau.com/docs/DOC-17978|.*family.zzz|opensearch*|.*@amazon.com|.*email.com|.*@github.com|http://timestamp.verisign.com/scripts/timstamp.dll|.*linkedin\.com.*"
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies that the current pattern https://www.linkedin.com/.* only matches HTTPS URLs with the www subdomain. Using .*linkedin\.com.* would catch all LinkedIn URL variations including HTTP, different subdomains, and paths, making the exclusion more robust and comprehensive.

Impact: Medium
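
To make the trade-off in this suggestion concrete, here is a small sketch comparing the pattern from the PR with the broader one proposed by the bot. It again uses Python's `re` as an approximation of lychee's regex matching (an assumption). The `STRICT` pattern and the sample URLs other than the one from the CI log are hypothetical and included only to show that the broad pattern also matches unrelated hosts whose names merely contain `linkedin.com`.

```python
import re

NARROW = re.compile(r"https://www.linkedin.com/.*")  # pattern in this PR
BROAD = re.compile(r".*linkedin\.com.*")             # pattern suggested by the bot
# Hypothetical stricter variant (not from the PR): anchors the scheme and host.
STRICT = re.compile(r"^https?://([a-z0-9-]+\.)*linkedin\.com(/|$)")

URLS = [
    "https://www.linkedin.com/in/chen-dai/",  # URL from the CI failure
    "http://www.linkedin.com/in/chen-dai/",   # plain HTTP variant
    "https://in.linkedin.com/in/example",     # regional subdomain (made-up path)
    "https://notlinkedin.com/page",           # unrelated host (made-up)
]


def matches(pattern: re.Pattern, url: str) -> bool:
    # Unanchored search, assumed to approximate how exclude regexes are applied.
    return pattern.search(url) is not None


# Map each URL to (narrow, broad, strict) match results.
results = {
    url: (matches(NARROW, url), matches(BROAD, url), matches(STRICT, url))
    for url in URLS
}

for url, (narrow, broad, strict) in results.items():
    print(f"{url}: narrow={narrow} broad={broad} strict={strict}")
```

In this sketch the narrow pattern covers only the first URL, the broad pattern covers all four (including the unrelated host), and the stricter variant covers the three LinkedIn URLs only. Whether the extra precision is worth the longer pattern is a judgment call for the reviewers.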

Previous suggestions

Suggestions up to commit f3c36b4
Category: General
Suggestion: Broaden LinkedIn URL exclusion pattern

The LinkedIn exclusion pattern https://www.linkedin.com/.* may not match all
LinkedIn URL variations (e.g., http:// URLs or subdomains such as in.linkedin.com).
Consider using a broader pattern such as .*linkedin\.com.* to catch all LinkedIn
URLs regardless of protocol or subdomain.

.github/workflows/link-checker.yml [21]

-args: --accept=200,403,429,999  "./**/*.html" "./**/*.md" "./**/*.txt" --exclude "https://aws.oss.sonatype.*|https://ci.opensearch.*|https://central.sonatype.*|http://localhost.*|https://localhost|https://odfe-node1:9200/|https://community.tableau.com/docs/DOC-17978|.*family.zzz|opensearch*|.*@amazon.com|.*email.com|.*@github.com|http://timestamp.verisign.com/scripts/timstamp.dll|https://www.linkedin.com/.*"
+args: --accept=200,403,429,999  "./**/*.html" "./**/*.md" "./**/*.txt" --exclude "https://aws.oss.sonatype.*|https://ci.opensearch.*|https://central.sonatype.*|http://localhost.*|https://localhost|https://odfe-node1:9200/|https://community.tableau.com/docs/DOC-17978|.*family.zzz|opensearch*|.*@amazon.com|.*email.com|.*@github.com|http://timestamp.verisign.com/scripts/timstamp.dll|.*linkedin\.com.*"
Suggestion importance[1-10]: 7

Why: The suggestion correctly identifies that the current pattern https://www.linkedin.com/.* only matches HTTPS URLs with the www subdomain. The proposed pattern .*linkedin\.com.* is more comprehensive and will catch all LinkedIn URLs regardless of protocol or subdomain, making the exclusion more robust.

Impact: Medium

LinkedIn actively blocks automated crawlers, returning 404 for valid
profiles. Add https://www.linkedin.com/.* to the lychee exclude list
to prevent false-positive CI failures.

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen force-pushed the fix/exclude-linkedin-from-link-checker branch from f3c36b4 to d6c3749 Compare May 21, 2026 23:18
@github-actions
Contributor

Persistent review updated to latest commit d6c3749

Labels

infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc.

2 participants