Skip to content

feat: bot detection logic#170

Open
tkotthakota-adobe wants to merge 81 commits intomainfrom
SITES-37727
Open

feat: bot detection logic#170
tkotthakota-adobe wants to merge 81 commits intomainfrom
SITES-37727

Conversation

@tkotthakota-adobe
Copy link
Collaborator

@tkotthakota-adobe tkotthakota-adobe commented Dec 23, 2025

TODO:
Add SCRAPE_JOB_CONFIGURATION to spacecat-services--task-manager secrets manager on prod. Value can be obtained from spacecat-services--audit-worker secrets manager.

Implemented comprehensive bot protection detection and alerting in the Task Processor to identify when sites are blocked by bot protection services (Cloudflare, Akamai, Imperva, etc.) and prevent unnecessary processing.

  • Using scrape client, retrieves scrape results along with abort info to check on bot protection signal.
  • If there is a bot protection signals user via slack channel.

Tests:
DynamoDB records created and fetched using scrape client:
Sample record

{
  "pk": "$spacecat#scrapejobid_ef09f4e4-c9cb-4a5f-b884-ac9b3ffa69a3",
  "sk": "$scrapejob_1",
  "abortInfo": {
    "reason": "bot-protection",
    "details": {
      "totalUrlsCount": 1,
      "blockedUrlsCount": 150,
      "blockedUrlsSampled": true,
      "blockedUrls": [
        {
          "url": "https://www.abbvie.com/science/areas-of-innovation/ai-and-data-convergence.html",
          "blockerType": "cloudflare",
          "httpStatus": 403,
          "confidence": 0.99
        }
      ],
      "byBlockerType": {
        "cloudflare": 150
      },
      "byHttpStatus": {
        "403": 150
      },
      "auditType": "meta-tags"
    }
  }
}

Debug Log:

"[BOT-BLOCKED] Bot protection detected" 

Tests:
https://cq-dev.slack.com/archives/C060T2PPF8V/p1770426770385289?thread_ts=1770424775.658269&cid=C060T2PPF8V

image

https://jira.corp.adobe.com/browse/SITES-37727

@tkotthakota-adobe tkotthakota-adobe marked this pull request as draft December 23, 2025 02:33
@github-actions
Copy link

This PR will trigger no release when merged.

@codecov
Copy link

codecov bot commented Dec 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@tkotthakota-adobe
Copy link
Collaborator Author

tkotthakota-adobe commented Feb 7, 2026

Hi @solaris007

  • I addressed your concerns. This PR is ready for review.

  • After approvals, on SpaceCat prod spacecat-services--task-manager Secrets Manager could you help with these two service requests:

    • add SCRAPE_JOB_CONFIGURATION. Value can be obtained from spacecat-services--audit-worker. This is needed for scrape client.
    • add SPACECAT_BOT_IPS with value 3.218.16.42,52.55.82.37,54.172.145.38 for task processor to add message in the slack channel about allowlisting these NAT Gateway Elastic IPs. Before adding please verify for me those are correct. Looks like below commands will do .
      • cd spacecat-infrastructure/environments/prod
      • terraform init
      • terraform output static_outbound_ips
  • I tried Error logs using (severity:"error" OR severity:"ERROR") for some reason did not work.

  • To look at debug logs I added, do they appear on dev environment by default or do I need to configure anything? Asking for future purpose.
    Thanks again.

Copy link
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkotthakota-adobe - good progress on the refactoring. The overall architecture is sound: database-based bot detection via bot-detection.js, clean separation from CloudWatch utils, and well-structured Slack messages. A few items to address:

Important

1. Test mock path mismatch - esmock won't intercept checkAndAlertBotProtection

The handler imports:

import { checkAndAlertBotProtection } from '../../utils/bot-detection.js';

But 15+ test locations mock it at the wrong path:

'../../../src/utils/cloudwatch-utils.js': {
  checkAndAlertBotProtection: sinon.stub().resolves(null),  // <-- ineffective
  getAuditStatus: sinon.stub().resolves({ executed: true, failureReason: null }),
},

esmock resolves by module path, so this mock doesn't intercept the handler's import from bot-detection.js. Tests pass because the bot detection code path either isn't reached (no jobId in scraping mock data) or the unmocked function returns null. Fix by mocking at the correct path:

'../../../src/utils/bot-detection.js': {
  checkAndAlertBotProtection: sinon.stub().resolves(null),
},
'../../../src/utils/cloudwatch-utils.js': {
  getAuditStatus: sinon.stub().resolves({ executed: true, failureReason: null }),
},

2. Dead re-export in cloudwatch-utils.js

// Bot detection functions have been moved to bot-detection.js
// Re-export for backward compatibility
export {
  convertAbortInfoToStats,
  checkAndAlertBotProtection,
} from './bot-detection.js';

This is a new file - there's no backward compatibility concern. Nothing uses this re-export (handler and tests both import from bot-detection.js directly). Remove it.

3. sortJobsByDate - confusing variable names

function sortJobsByDate(jobs) {
  return jobs.sort((a, b) => {
    const dateA = new Date(b.startedAt || b.createdAt || 0);  // dateA from b?
    const dateB = new Date(a.startedAt || a.createdAt || 0);  // dateB from a?
    return dateA - dateB;
  });
}

Functionally correct (sorts latest first) but the swapped names are confusing. Consider:

const dateA = new Date(a.startedAt || a.createdAt || 0);
const dateB = new Date(b.startedAt || b.createdAt || 0);
return dateB - dateA;

4. Test file naming / missing coverage

  • cloudwatch-utils.test.js actually tests bot-detection.js functions - consider renaming to bot-detection.test.js
  • No tests for getAuditStatus (the only function actually exported by cloudwatch-utils.js)

Minor

  • @slack/web-api added as a direct dependency in package.json but not imported in any source file - is this needed?
  • Skipped test (should use fallback stats when CloudWatch returns no logs) with no tracking ticket

What looks good

  • bot-detection.js - clean, focused, proper error handling, good separation
  • isScrapingAvailable correctly returns jobId from the scrape job
  • formatBotProtectionSlackMessage - well-structured with impact percentage, sample URLs, and resolution instructions
  • Defensive error handling throughout (try-catch around Slack alerts, null checks)

Overall this is in good shape. The mock path mismatch (#1) is the most important fix since it means bot detection integration isn't actually being tested.

@solaris007
Copy link
Member

@tkotthakota-adobe - regarding SPACECAT_BOT_IPS, I verified the NAT Gateway Elastic IPs via terraform output static_outbound_ips for all environments:

Environment Static Outbound IPs
Dev 54.87.205.187, 44.218.57.115
Stage 18.232.5.47, 34.197.74.223
Prod 34.205.100.249, 3.94.219.54

The IPs you proposed (3.218.16.42, 52.55.82.37, 54.172.145.38) don't match any environment. Where did those come from?

For SPACECAT_BOT_IPS in prod Secrets Manager, the correct value should be:

34.205.100.249,3.94.219.54

And for the Slack message to be useful per-environment, you'd want to set the right IPs for each:

  • Dev: 54.87.205.187,44.218.57.115
  • Stage: 18.232.5.47,34.197.74.223
  • Prod: 34.205.100.249,3.94.219.54

@solaris007
Copy link
Member

@tkotthakota-adobe - on your Coralogix questions:

Error log filtering

The filter severity:"error" doesn't work because it's missing the field path. Here are the working approaches (verified):

inv.functionName:*content-processor* AND level:error

or:

inv.functionName:*content-processor* AND coralogix.metadata.severity:error

level:error is the simplest - it matches the level field in the structured log JSON.

Debug log visibility

log.debug() statements are not forwarded to Coralogix. The spacecat-coralogix-feeder only passes info level and above. If you need statements visible in Coralogix, use log.info() instead of log.debug().

Heads up - prod error in task-processor

I spotted this error from yesterday (Feb 6) in Coralogix:

Error sending Slack message: Cannot read properties of undefined (reading 'debug')
  at _BaseSlackClient._logDuration

The log object appears to be undefined when BaseSlackClient._logDuration tries to call log.debug. This is in the say() function - might be related to how the Slack client context is constructed. Worth investigating as part of the cleanup.

@solaris007
Copy link
Member

@tkotthakota-adobe - both secrets are now set for all three environments:

Env SCRAPE_JOB_CONFIGURATION SPACECAT_BOT_IPS
Dev Copied from audit-worker 54.87.205.187,44.218.57.115
Stage Constructed from stage resources 18.232.5.47,34.197.74.223
Prod Copied from audit-worker 34.205.100.249,3.94.219.54

Note on the IPs: The values you originally proposed (3.218.16.42,52.55.82.37,54.172.145.38) didn't match any environment - see my earlier comment with the full breakdown. The values above are verified via terraform output static_outbound_ips for each environment.

The Lambda will pick up the new secrets on next cold start - no redeployment needed.

@tkotthakota-adobe
Copy link
Collaborator Author

tkotthakota-adobe commented Feb 7, 2026

@solaris007 Appreciate checking the IPs and adding both variables to secrets manager on all 3 environments. Review comments are addressed.

  • Do we use stage environment in some particular cases which I am not aware of?
  • Adjusted tests to use right module missed this after moving bot code to bot-detection.js.
  • IPs were suggested by cursor. I did not verify my bad. Looks like they were obtained from tests not from terraform. I also verified using terraform command. Agree with the IPs you posted. Sorry for the confusion.
  • log.debug is causing issue in the shared lib. Fixed it to use log.info.

@tkotthakota-adobe
Copy link
Collaborator Author

@solaris007 what could be causing this deploy error? Changes made should not effect deployment in my view.
Other services are deployed fine.
https://github.com/adobe/spacecat-task-processor/actions/runs/21789518189/job/62866745175?pr=170

image

@tkotthakota-adobe
Copy link
Collaborator Author

@solaris007 I noticed on aws console lambda is being deployed. For some reason Jenkins deployer is showing the error I shared earlier.

Copy link
Member

@solaris007 solaris007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review - All requested changes addressed

The 4 issues from the previous review have all been fixed:

  • Mock paths now correctly target bot-detection.js (verified all esmock calls)
  • Dead re-export removed - cloudwatch-utils.js only exports getAuditStatus
  • sortJobsByDate function vs sortedJobs result - no more name collision
  • Single bot-detection.test.js, no duplicate test file

Bot detection end-to-end flow is clean: handler gets jobId from scrape jobs, checkAndAlertBotProtection reads abortInfo via ScrapeClient, converts to stats, sends Slack alert. LGTM.

Minor cleanup items

  1. Duplicate JSDoc on filterJobsByTimestamp - Two JSDoc blocks stacked on the same function. Remove the shorter one.

  2. Unnecessary esmock mock - bot-detection.test.js mocks fetchRecentThreadMessages from slack-utils.js, but bot-detection.js doesn't import it. Harmless but confusing.

  3. Gist tarball dependencies - 4 packages on gist tarballs. Must be replaced with published npm versions once #1308 merges.

  4. getScrapeJobsByBaseURL no longer filters by 'default' processing type - Intentional? This now returns jobs of all processing types. Just confirming this is desired behavior.

@tkotthakota-adobe
Copy link
Collaborator Author

tkotthakota-adobe commented Feb 9, 2026

@solaris007 addressed your comments. For some reason this PR is not getting deployed. When I created fresh test PR it deployed fine. When I created new PR out of this PR it failed. That means some state in this PR is causing issue. I am running out of ideas. One other option is just cherry pick code to new PR. This issue started happening only from yesterday.

Update: Upgrading to latest @adobe/helix-deploy (same as in audit worker) solved the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants