Merged
16 changes: 12 additions & 4 deletions .github/workflows/db-backup.yml
@@ -7,8 +7,13 @@ on:

jobs:
backup:
name: PostgreSQL Backup to GCS
name: PostgreSQL Backup (${{ matrix.environment }})
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
environment: [staging, production]
environment: ${{ matrix.environment }}
steps:
- name: Dump and upload
uses: appleboy/ssh-action@v1
@@ -19,9 +24,12 @@ jobs:
port: ${{ secrets.SERVER_PORT || 22 }}
script: |
set -euo pipefail
STAMP="$(date +%Y%m%d)"
DUMP="/tmp/paperscout-${STAMP}.dump"
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
RUN_KEY="${{ github.run_id }}-${{ github.run_attempt }}"
DUMP="/tmp/paperscout-${{ matrix.environment }}-${STAMP}-${RUN_KEY}.dump"
DEST="gs://insights-db-backups/paperscout/${{ matrix.environment }}/paperscout-${STAMP}-${RUN_KEY}.dump"
trap 'rm -f "$DUMP"' EXIT

sudo -u postgres pg_dump -Fc paperscout > "$DUMP"
gsutil cp "$DUMP" "gs://paperscout-backups/paperscout-${STAMP}.dump"
gsutil cp "$DUMP" "$DEST"
rm -f "$DUMP"
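The naming scheme the workflow builds in shell can be sketched in Python for clarity. This is illustrative only; `backup_object_key` is a hypothetical helper, not part of the repo, but it mirrors the `STAMP`/`RUN_KEY` construction above:

```python
from datetime import datetime, timezone

def backup_object_key(environment: str, run_id: str, run_attempt: str, now=None) -> str:
    """Mirror the object key the workflow builds: UTC timestamp plus
    run id and run attempt, namespaced under paperscout/<environment>/."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # sortable, second-resolution UTC
    run_key = f"{run_id}-{run_attempt}"
    return f"paperscout/{environment}/paperscout-{stamp}-{run_key}.dump"

key = backup_object_key(
    "staging", "123456", "1",
    now=datetime(2026, 5, 5, 3, 0, 0, tzinfo=timezone.utc),
)
print(key)  # → paperscout/staging/paperscout-20260505T030000Z-123456-1.dump
```

Because the timestamp is zero-padded and the run id/attempt are appended, reruns of the same scheduled day produce distinct object names rather than overwriting each other.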
1 change: 1 addition & 0 deletions .gitignore
@@ -37,3 +37,4 @@ build/
Icon?
.com.apple.timemachine.donotpresent
.VolumeIcon.icns
.cursor
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -9,8 +9,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Post the same Slack **status** summary as the interactive command to `NOTIFICATION_CHANNEL` once when the process starts (when that channel is configured).
- Open-source hygiene: contributing guide, security policy, code of conduct, onboarding and handoff docs, pre-commit (Ruff), GitHub issue templates, Dependabot, CodeQL, CODEOWNERS template, and `.gitattributes`.

### Changed

- Documentation: deployment URLs (Slack Request URL behind nginx `/paperscout/`), clone URL in server setup, staging-style placeholders.
- `db-backup.yml`: matrix parallel backups for `staging` / `production` using environment-level SSH secrets; uploads under `gs://insights-db-backups/paperscout/<environment>/` with unique temp files and object keys (UTC timestamp + `run_id` + `run_attempt` + environment); `EXIT` trap removes temp dump on failure. `SERVER_SETUP` restore examples updated (`--no-owner`, listing/copy by object name).

## [0.1.0] - 2026-05-05

### Added
22 changes: 12 additions & 10 deletions README.md
@@ -113,15 +113,17 @@ python -m paperscout
Once the scout is running and reachable at a public URL:

1. Go back to **Event Subscriptions** in the Slack app config
2. Set **Request URL** to `https://your-server.com/slack/events`
3. Slack will send a challenge request -- the scout responds automatically
2. Set **Request URL** depending on how traffic reaches Bolt:
- **Reverse proxy (recommended for production/staging):** If nginx terminates TLS and proxies under a path prefix (see [`deploy/paperscout.conf`](deploy/paperscout.conf)), Slack must use that prefix. Example: `https://your-domain.example.org/paperscout/slack/events` — not `https://your-domain.example.org/slack/events`.
- **Direct to the app (local dev or ngrok without nginx):** Bolt serves `/slack/events` at the container root. Example: `https://staging.example.org/slack/events` or `https://abc123.ngrok-free.app/slack/events`.
3. Slack will send a challenge request — the scout responds automatically
4. Click **Save Changes**

For local testing with ngrok:
For local testing with ngrok (traffic straight to `PORT`, no path prefix):

```bash
ngrok http 3000
# Use the ngrok URL: https://abc123.ngrok.io/slack/events
# Use: https://<ngrok-host>/slack/events
```

### 8. Invite the Scout
@@ -191,7 +193,7 @@ curl -sf http://localhost:9102/health

See [`deploy/SERVER_SETUP.md`](deploy/SERVER_SETUP.md) for the full Ubuntu 22.04 provisioning guide, and [`.github/workflows/cd.yml`](.github/workflows/cd.yml) for the CD pipeline.

Database backups run daily via [`.github/workflows/db-backup.yml`](.github/workflows/db-backup.yml), uploading `pg_dump` snapshots to Google Cloud Storage.
Database backups run daily via [`.github/workflows/db-backup.yml`](.github/workflows/db-backup.yml): **matrix jobs** for **`staging`** and **`production`** run **in parallel**, each using that **GitHub Environment’s** SSH secrets (same names as CD: `SERVER_HOST`, `SERVER_USER`, `SERVER_SSH_KEY`, optional `SERVER_PORT`). Dumps are uploaded to **`gs://insights-db-backups/paperscout/<environment>/`** so staging and production stay under the shared **`paperscout`** prefix in the bucket.

## Scout Commands

@@ -331,7 +333,7 @@ paperscout/
.github/workflows/
ci.yml Test matrix on push/PR to main
cd.yml SSH deploy (git pull + build) on push to main
db-backup.yml Daily pg_dump to Google Cloud Storage
db-backup.yml Matrix pg_dump (staging + production) to GCS insights-db-backups/paperscout/<env>/
```

### PostgreSQL Schema
@@ -453,8 +455,8 @@ A `concurrency` group keyed by branch prevents overlapping deploys to the same environment.

The `.github/workflows/db-backup.yml` workflow runs daily at 3 AM UTC (and supports manual dispatch):

1. SSHes into the server and runs `pg_dump` on the host's PostgreSQL
2. Uploads the dump to Google Cloud Storage (`gs://paperscout-backups/`)
3. Old backups are auto-pruned by a GCS lifecycle rule (30 days)
1. Runs **two jobs in parallel** (matrix: `staging`, `production`), each bound to the matching **GitHub Environment** so SSH secrets match that tier’s server (same secret names as CD).
2. On each host, runs `pg_dump` and uploads to **`gs://insights-db-backups/paperscout/<environment>/`**, using object keys that include UTC time plus the GitHub Actions run id so backups do not collide on reruns.
3. Configure lifecycle rules on the bucket/prefixes as needed (for example, pruning objects older than 30 days).
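A minimal sketch of such a lifecycle configuration, written as a Python dict and dumped to JSON. The field names follow the GCS lifecycle JSON schema as I understand it (`rule`, `action`, `condition` with `age` and `matchesPrefix`); verify against current Google Cloud documentation before applying, and note the 30-day age and the `paperscout/` prefix are example values:

```python
import json

# Example lifecycle rule: delete objects older than 30 days under the
# shared paperscout/ prefix (covers both staging/ and production/).
lifecycle = {
    "rule": [
        {
            "action": {"type": "Delete"},
            "condition": {"age": 30, "matchesPrefix": ["paperscout/"]},
        }
    ]
}

# Write out a file suitable for something like:
#   gcloud storage buckets update gs://insights-db-backups --lifecycle-file=lifecycle.json
with open("lifecycle.json", "w") as fh:
    json.dump(lifecycle, fh, indent=2)
```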

CD secrets and variables are configured per **GitHub Environment** (`production` and `staging`); see the table in [Deployment](#deployment). Other secrets (e.g. database backups) are documented in [`deploy/SERVER_SETUP.md`](deploy/SERVER_SETUP.md#9-github-secrets-checklist).
SSH credentials for backups live under **each environment** (`staging`, `production`), not at the repository level — parallel to [Deployment](#deployment). See [`deploy/SERVER_SETUP.md`](deploy/SERVER_SETUP.md#9-github-secrets-and-environments).
58 changes: 38 additions & 20 deletions deploy/SERVER_SETUP.md
@@ -100,10 +100,14 @@ rm /tmp/paperscout.dump
```

If the dump is stored in GCS (from the daily backup workflow),
download it directly on the new server instead:
download it directly on the new server instead — use the prefix that matches
the environment you are restoring (**`staging`** or **`production`**). Object
names include UTC time and the workflow run id (see §8); pick the file you need,
for example:

```bash
gsutil cp gs://paperscout-backups/paperscout-<YYYYMMDD>.dump /tmp/paperscout.dump
gsutil ls gs://insights-db-backups/paperscout/<environment>/
gsutil cp gs://insights-db-backups/paperscout/<environment>/paperscout-<object-name>.dump /tmp/paperscout.dump
pg_restore -U paperscout -h localhost -d paperscout --no-owner /tmp/paperscout.dump
rm /tmp/paperscout.dump
```
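Because the timestamp in each object name is zero-padded UTC (`YYYYMMDDTHHMMSSZ`), lexicographic order matches chronological order, so the most recent backup can be picked by name. A hypothetical helper (not in the repo) sketching that selection:

```python
def newest_dump(object_names):
    """Pick the most recent backup object by name: the zero-padded UTC
    timestamp sorts lexicographically in chronological order, so max()
    on the .dump names returns the newest one."""
    dumps = [n for n in object_names if n.endswith(".dump")]
    return max(dumps) if dumps else None

names = [
    "paperscout-20260504T030001Z-111-1.dump",
    "paperscout-20260505T030002Z-222-1.dump",
]
print(newest_dump(names))  # → paperscout-20260505T030002Z-222-1.dump
```

In practice you would feed it the output of `gsutil ls` for the environment prefix; run ids of differing lengths only matter as a tie-break when two timestamps are identical to the second.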
@@ -180,7 +184,7 @@ Clone the repo into `/opt/paperscout`:

```bash
sudo mkdir -p /opt
sudo git clone https://github.com/<org>/<repo>.git /opt/paperscout
sudo git clone https://github.com/cppalliance/paperscout.git /opt/paperscout
sudo chown -R <deploy-user>:<deploy-user> /opt/paperscout
```

@@ -222,42 +226,56 @@ curl -sf http://localhost:9101/health | python3 -m json.tool
docker compose logs -f paperscout
```

### Example: staging-style host

If you use a **separate** staging deployment (second clone path and GitHub Environment `staging`), typical placeholders are:

- TLS / DNS: `sudo certbot --nginx -d staging.example.org` (replace with your real staging hostname when provisioning).
- Health check on the staging machine after mapping ports (see README CD table): `curl -sf http://localhost:9102/health` — use whatever port your staging compose publishes for health instead of `9102` if different.
- Slack **Request URL** when nginx proxies under `/paperscout/`: `https://staging.example.org/paperscout/slack/events`.

---

## 7. Restoring from a GCS backup (optional)

If migrating from another server with an existing database:

```bash
gsutil cp gs://paperscout-backups/paperscout-<YYYYMMDD>.dump /tmp/paperscout.dump
pg_restore -U paperscout -h localhost -d paperscout -c /tmp/paperscout.dump
gsutil ls gs://insights-db-backups/paperscout/<environment>/
gsutil cp gs://insights-db-backups/paperscout/<environment>/paperscout-<object-name>.dump /tmp/paperscout.dump
pg_restore -U paperscout -h localhost -d paperscout -c --no-owner /tmp/paperscout.dump
rm /tmp/paperscout.dump
```

---

## 8. Database backups

The `db-backup.yml` GitHub Actions workflow SSHes into the server daily
and runs `pg_dump` + `gsutil cp` to upload to GCS. The VM's service
account handles authentication automatically — no credentials needed.
The `db-backup.yml` workflow runs **two parallel matrix jobs** (`staging` and
`production`). Each job uses the **GitHub Environment** with the same name, so
SSH secrets (`SERVER_HOST`, etc.) resolve per tier — matching CD. Each run uploads to:

The GCS bucket `paperscout-backups` should have a lifecycle rule to
auto-delete objects older than 30 days (configured in the Cloud Console
under the bucket's **Lifecycle** tab).
```text
gs://insights-db-backups/paperscout/<environment>/paperscout-<UTC-timestamp>-<run-id>-<run-attempt>.dump
```

Object keys include the workflow run id so same-day reruns do not overwrite objects; each matrix job uses its own temp file on the host.

---

## 9. GitHub Secrets checklist
## 9. GitHub secrets and environments

**Continuous deployment** (`cd.yml`) and **database backups** (`db-backup.yml`)
both use the **`staging`** and **`production`** GitHub Environments. Configure the **same SSH secret names** in each environment (values differ per server):

Configure these in the repo under **Settings → Secrets and variables → Actions**:
| Secret | Purpose |
| ---------------- | -------------------------------------------------------- |
| `SERVER_HOST` | SSH target host for that environment’s VM |
| `SERVER_USER` | SSH username (e.g. `<deploy-user>`) |
| `SERVER_SSH_KEY` | Private SSH key for the deploy user |
| `SERVER_PORT` | SSH port (optional; default `22`) |

| Secret | Purpose |
| ---------------- | ----------------------------------- |
| `SERVER_HOST` | Server IP or hostname |
| `SERVER_USER` | SSH username (e.g. `<deploy-user>`) |
| `SERVER_SSH_KEY` | Private SSH key for the deploy user |
| `SERVER_PORT` | SSH port (optional, defaults to 22) |
CD also uses **environment Variables** (`DEPLOY_PATH`, `DEPLOY_BRANCH`, `HEALTH_PORT`) — see the README Deployment table. Backup jobs only need the secrets above.

`GITHUB_TOKEN` is provided automatically by GitHub Actions.
GCS authentication uses the VM's service account — no extra secrets needed.
GCS uploads use the VM's service account (`gsutil`) — ensure each server can write to `gs://insights-db-backups/paperscout/<environment>/`.
2 changes: 1 addition & 1 deletion docs/onboarding.md
Expand Up @@ -99,7 +99,7 @@ python -m paperscout
- **Slack HTTP app** listens on `PORT` (default **3000**).
- **Health** endpoint listens on `health_port` from settings (default **8080**) — `GET /health`.

For Slack Event Subscriptions you need a public URL (e.g. ngrok); see [README](../README.md#7-set-the-request-url).
For Slack Event Subscriptions you need a public URL (e.g. ngrok). With nginx and a `/paperscout/` prefix, the Request URL must include that path; see [README — Set the Request URL](../README.md#7-set-the-request-url).

## Deployment (summary)

11 changes: 10 additions & 1 deletion src/paperscout/__main__.py
@@ -14,7 +14,14 @@
from .db import init_db, init_pool
from .health import start_health_server
from .monitor import Scheduler
from .scout import MessageQueue, create_app, notify_channel, notify_users, register_handlers
from .scout import (
MessageQueue,
create_app,
enqueue_startup_status,
notify_channel,
notify_users,
register_handlers,
)
from .sources import ISOProber, WG21Index
from .storage import ProbeState, UserWatchlist

@@ -131,6 +138,8 @@ def _on_poll_result(result):
)
bolt_thread.start()

enqueue_startup_status(mq, state, paper_count_fn)

await scheduler.run_forever()


42 changes: 28 additions & 14 deletions src/paperscout/scout.py
@@ -440,30 +440,44 @@ def _show_watchlist(
)


def _handle_status(state: ProbeState, paper_count_fn, say, reply_opts: dict) -> None:
"""Post loaded paper count, last poll, probe settings."""
def format_status_message(state: ProbeState, paper_count_fn) -> str:
"""Mrkdwn body for the interactive ``status`` command and startup channel post."""
from datetime import datetime as _dt
from datetime import timezone as _tz

last = state.last_poll
last_str = (
_dt.fromtimestamp(last, tz=_tz.utc).strftime("%Y-%m-%d %H:%M:%S UTC") if last else "never"
)
say(
text=(
f"*Paperscout Status*\n"
f"• Papers loaded: {paper_count_fn():,}\n"
f"• Last poll: {last_str}\n"
f"• Poll interval: {settings.poll_interval_minutes} min\n"
f"• Discovered via probe: {len(state.get_all_discovered())}\n"
f"• ISO probing: {'enabled' if settings.enable_iso_probe else 'disabled'}\n"
f"• Alert window: {settings.alert_modified_hours}h\n"
f"• Cold cycle: 1/{settings.cold_cycle_divisor}"
),
**reply_opts,
return (
f"*Paperscout Status*\n"
f"• Papers loaded: {paper_count_fn():,}\n"
f"• Last poll: {last_str}\n"
f"• Poll interval: {settings.poll_interval_minutes} min\n"
f"• Discovered via probe: {len(state.get_all_discovered())}\n"
f"• ISO probing: {'enabled' if settings.enable_iso_probe else 'disabled'}\n"
f"• Alert window: {settings.alert_modified_hours}h\n"
f"• Cold cycle: 1/{settings.cold_cycle_divisor}"
)


def _handle_status(state: ProbeState, paper_count_fn, say, reply_opts: dict) -> None:
"""Post loaded paper count, last poll, probe settings."""
say(text=format_status_message(state, paper_count_fn), **reply_opts)


def enqueue_startup_status(
mq: MessageQueue,
state: ProbeState,
paper_count_fn,
) -> None:
"""Post *status* summary to ``NOTIFICATION_CHANNEL`` once at process start."""
channel = settings.notification_channel
if not channel:
return
mq.enqueue(channel, format_status_message(state, paper_count_fn))


def _handle_version(say, reply_opts: dict) -> None:
"""Post package version string."""
from . import __version__
31 changes: 31 additions & 0 deletions tests/test_scout.py
@@ -19,6 +19,8 @@
_paper_link,
_reply_opts,
_show_watchlist,
enqueue_startup_status,
format_status_message,
notify_channel,
notify_users,
register_handlers,
@@ -483,6 +485,35 @@ def test_status_after_poll(self, fake_pool):
assert "100" in text and "never" not in text


class TestFormatStatusMessage:
def test_matches_handle_status_output(self, fake_pool):
state = ProbeState(fake_pool)
say = MagicMock()
with patch("paperscout.scout.settings", _make_settings()):
expected = format_status_message(state, lambda: 42)
_handle_status(state, lambda: 42, say, {})
assert say.call_args[1]["text"] == expected


class TestEnqueueStartupStatus:
def test_enqueues_when_channel_configured(self, fake_pool):
mq = MagicMock()
state = ProbeState(fake_pool)
with patch("paperscout.scout.settings", _make_settings(channel="C-alerts")):
enqueue_startup_status(mq, state, lambda: 7)
mq.enqueue.assert_called_once()
assert mq.enqueue.call_args[0][0] == "C-alerts"
assert "Paperscout Status" in mq.enqueue.call_args[0][1]
assert "7" in mq.enqueue.call_args[0][1]

def test_skips_when_no_channel(self, fake_pool):
mq = MagicMock()
state = ProbeState(fake_pool)
with patch("paperscout.scout.settings", _make_settings(channel="")):
enqueue_startup_status(mq, state, lambda: 1)
mq.enqueue.assert_not_called()


# ── register_handlers ─────────────────────────────────────────────────────────

