Skip to content

MM-67613 Added handling of encryption key rotation #979

Open
avasconcelos114 wants to merge 7 commits intomasterfrom
MM-67613
Open

MM-67613 Added handling of encryption key rotation #979
avasconcelos114 wants to merge 7 commits intomasterfrom
MM-67613

Conversation

@avasconcelos114
Copy link
Contributor

@avasconcelos114 avasconcelos114 commented Mar 10, 2026

Summary

This PR applies a more graceful behavior for rotation of encryptionkey values in the settings.

This PR adds a rotation of encrypted values alongside a fallback if the rotation fails that clear user sessions and notifies the user. It also adds another fallback to prevent scenarios where a user is unable to use their account but also unable to disconnect due to the getGitHubUserInfo call in disconnectGitHubAccount running into encryption errors

Ticket Link

Fixes https://mattermost.atlassian.net/browse/MM-67613

QA Notes

Other than regenerating the Encryption Key value in the System Console and verifying that users can continue to use the plugin normally, we would also want to test a scenario where:

  1. You have the latest version (without this fix) installed
  2. The admin regenerates the EncryptionKey
  3. Users are forced to be in limbo, unable to disconnect or create a new connection
  4. Upload a build with this fix
  5. Use /github disconnect to confirm that a user is able to properly disconnect their account and create a fresh connection

Change Impact: 🔴 High

Reasoning: The PR modifies encryption key rotation, per-user token re-encryption, and disconnect/fallback flows—core authentication and data-persistence areas that affect all users with GitHub-linked accounts.

Regression Risk: High — changes touch encryption/decryption, KV persistence, and cluster-wide coordination; while tests were added, concurrent/race conditions, KV/API failures, or mis-handled decrypt errors could orphan credentials or force unintended disconnects.

** QA Recommendation:** Strongly recommend thorough manual QA covering key-rotation happy paths and failure modes (forced decrypt failures), verification of re-encryption vs. force-disconnect behavior (DM and WebSocket notifications), disconnect/reconnect flows for unaffected users, and concurrent rotations with active API usage. Skipping manual QA risks user lockout or silent credential loss.

Generated by CodeRabbitAI

- Ensuring that even if an environment somehow bypasses the OnConfigurationChanged hook, a user can still disconnect and re-connect their accounts instead of being locked
@avasconcelos114 avasconcelos114 self-assigned this Mar 10, 2026
@avasconcelos114 avasconcelos114 requested a review from a team as a code owner March 10, 2026 17:02
@avasconcelos114 avasconcelos114 added 2: Dev Review Requires review by a core committer 3: QA Review Requires review by a QA tester labels Mar 10, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c55ed518-4d95-4d1f-805e-bfa77c1dd621

📥 Commits

Reviewing files that changed from the base of the PR and between e0a4cb2 and ccf9268.

📒 Files selected for processing (8)
  • server/plugin/api.go
  • server/plugin/audit.go
  • server/plugin/command.go
  • server/plugin/configuration.go
  • server/plugin/mm_34646_token_refresh.go
  • server/plugin/plugin.go
  • server/plugin/plugin_test.go
  • server/plugin/utils.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/plugin/plugin_test.go

📝 Walkthrough

Walkthrough

On configuration change the plugin captures the previous configuration and its EncryptionKey, sends websocket updates using the previous config, applies the new config, and — when both old and new EncryptionKey are non-empty and different — coordinates cluster-wide re-encryption of stored user tokens, forcing per-user disconnects on failures.

Changes

Cohort / File(s) Summary
Configuration Management
server/plugin/configuration.go
OnConfigurationChange now captures the previous config and previous EncryptionKey, sends websocket updates using the previous config, applies the new config, and triggers async re-encryption only when both old and new keys are non-empty and different.
Encryption Key Rotation & Disconnect Flow
server/plugin/plugin.go
Changed storeGitHubUserInfo signature to accept encryptionKey string; added reEncryptUserData(newEncryptionKey, previousEncryptionKey string), reEncryptUserToken(kvKey, newEncryptionKey, previousEncryptionKey string) (string, error), and forceDisconnectUser(userID, githubUsername string); implemented cluster-coordinated re-encryption, per-user re-encrypt logic (decrypt with old key, encrypt with new), and best-effort force-disconnect/cleanup with websocket notifications and audit logging.
Public API Callsites
server/plugin/api.go, server/plugin/command.go, server/plugin/mm_34646_token_refresh.go
Updated call sites to pass current configuration EncryptionKey to storeGitHubUserInfo(info, encryptionKey) after signature change.
Audit Types
server/plugin/audit.go
Added ReEncryptUserDataAuditParams and ReEncryptUserDataAuditResult with Auditable() methods to record total users processed, migrated count, and force-disconnected count.
Tests
server/plugin/plugin_test.go
Added comprehensive tests for re-encryption and force-disconnect flows (happy path, decrypt/store failures, list errors, multiple users, already-migrated tokens, no-connected-users, and delete errors) using mocked KV and API clients.
Crypto Utils
server/plugin/utils.go
Hardened unpad to guard empty input, validate padding bounds (1..aes.BlockSize), check each padding byte for correctness, and return error on mismatch; no signature change.

Sequence Diagram(s)

sequenceDiagram
    participant Admin as Administrator
    participant Config as Configuration Handler
    participant WS as WebSocket Notifier
    participant Cluster as Cluster Mutex/Leader
    participant KV as KV Store
    participant Crypto as Encryption Engine
    participant UserMgr as User Cleanup & Notifier
    Note over Config: Admin changes EncryptionKey
    Admin->>Config: Update config (newEncryptionKey)
    Config->>Config: Capture previous config & previousEncryptionKey
    Config->>WS: Send update using previous config
    Config->>Config: Apply new configuration
    Config->>Cluster: Acquire re-encryption lock (cluster coord)
    Cluster-->>Config: Lock acquired
    Config->>KV: List token keys
    KV-->>Config: Return token keys
    loop For each token key
        Config->>KV: Get encrypted token (kvKey)
        KV-->>Config: Encrypted data
        alt Decrypt with previousEncryptionKey succeeds
            Config->>Crypto: Decrypt using previousEncryptionKey
            Crypto-->>Config: Plain token
            Config->>Crypto: Encrypt using newEncryptionKey
            Crypto-->>Config: New encrypted token
            Config->>KV: Store updated encrypted token
            KV-->>Config: Persisted
        else Decrypt or store fails
            Config->>UserMgr: Force-disconnect user (cleanup & notify)
            UserMgr->>KV: Delete token & username mapping
            UserMgr->>WS: Notify user & post reconnection guidance
        end
    end
    Config->>Cluster: Release lock
    Cluster-->>Config: Lock released
    Config->>Config: Finish rotation, audit results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I nibbled keys both old and new,
I hopped through bytes to make them true,
If decryption snapped, I swept the den,
I pinged and nudged the lonely friend,
A little hop — secure again.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Added handling of encryption key rotation' directly and accurately reflects the main objective, matching the MM-67613 ticket focus on graceful encryption key rotation handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch MM-67613

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: unknown linters: 'modernize', run 'golangci-lint help linters' to see the list of supported linters
The command is terminated due to an error: unknown linters: 'modernize', run 'golangci-lint help linters' to see the list of supported linters


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
server/plugin/plugin_test.go (1)

45-77: Consider adding assertion to verify token is correctly re-encrypted.

The happy path test verifies that Set is called, but doesn't verify that the stored data contains a token that can be decrypted with the new key. This would add confidence that the encryption cycle completes correctly.

💡 Optional: Add token verification
mockKvStore.EXPECT().Set("user1"+githubTokenKey, gomock.Any()).DoAndReturn(
    func(key string, value any, opts ...pluginapi.KVSetOption) (bool, error) {
        storedInfo := value.(*GitHubUserInfo)
        // Verify we can decrypt with new key
        decrypted, err := decrypt([]byte(testNewKey), storedInfo.Token.AccessToken)
        require.NoError(t, err)
        require.Equal(t, MockAccessToken, decrypted)
        return true, nil
    },
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/plugin/plugin_test.go` around lines 45 - 77, The test currently only
asserts that mockKvStore.Set was called; update the mockKvStore.EXPECT().Set
invocation in TestReEncryptUserData_HappyPath to use DoAndReturn and inspect the
value parameter (the stored GitHubUserInfo) — cast it to *GitHubUserInfo,
extract storedInfo.Token.AccessToken, attempt to decrypt it with testNewKey
(calling decrypt([]byte(testNewKey), ...)), and assert no error and that the
decrypted value equals MockAccessToken; keep returning (true, nil) from the
DoAndReturn so existing behavior is preserved and ensure
p.reEncryptUserData(testOldKey) and api.AssertExpectations(t) remain unchanged.
server/plugin/configuration.go (1)

210-220: Consider running re-encryption asynchronously to avoid blocking configuration changes.

The reEncryptUserData call is synchronous, which could block OnConfigurationChange for a significant duration when many users are connected. This may cause timeouts or degrade plugin responsiveness during key rotation.

Additionally, there's a subtle window between setConfiguration (line 215) and starting re-encryption (line 219) where the new key is active but tokens are still encrypted with the old key. If concurrent requests occur during this window, they may fail to decrypt tokens.

♻️ Consider wrapping re-encryption in a goroutine
 	p.setConfiguration(configuration)
 
 	if previousEncryptionKey != "" && configuration.EncryptionKey != "" &&
 		previousEncryptionKey != configuration.EncryptionKey {
-		p.reEncryptUserData(previousEncryptionKey)
+		go p.reEncryptUserData(previousEncryptionKey)
 	}

Note: If you make this async, ensure proper logging of completion/failure and consider adding a mutex or flag to prevent concurrent re-encryption attempts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/plugin/configuration.go` around lines 210 - 220, The synchronous call
to p.reEncryptUserData blocks OnConfigurationChange and leaves a window where
the new EncryptionKey is active but tokens are still encrypted with the old key;
change this to start re-encryption in a new goroutine and coordinate via a
mutex/flag on the plugin (e.g. add p.reEncryptInProgress bool + sync.Mutex or
sync/atomic flag) so only one re-encryption runs at a time, set the flag before
switching the configuration (or ensure decrypt logic checks the flag and
attempts both keys) and clear it when the goroutine finishes, and ensure the
goroutine logs start/completion/errors from p.reEncryptUserData so failures are
observable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/plugin/plugin.go`:
- Around line 744-755: The current loop calling p.store.ListKeys(page,
keysPerPage, pluginapi.WithChecker(checker)) returns early on any page error,
leaving previously-collected keys in allKeys unprocessed; change this so that on
ListKeys error you log the error (using p.client.Log.Warn) but do not return
immediately — instead break the pagination loop and proceed to the re-encryption
loop to process the already-collected allKeys (or alternatively accumulate
errors and continue fetching all pages, but ensure allKeys is processed
regardless). Ensure keysPerPage, checker and the allKeys slice are reused after
the loop so re-encryption runs even when a later page fails.

---

Nitpick comments:
In `@server/plugin/configuration.go`:
- Around line 210-220: The synchronous call to p.reEncryptUserData blocks
OnConfigurationChange and leaves a window where the new EncryptionKey is active
but tokens are still encrypted with the old key; change this to start
re-encryption in a new goroutine and coordinate via a mutex/flag on the plugin
(e.g. add p.reEncryptInProgress bool + sync.Mutex or sync/atomic flag) so only
one re-encryption runs at a time, set the flag before switching the
configuration (or ensure decrypt logic checks the flag and attempts both keys)
and clear it when the goroutine finishes, and ensure the goroutine logs
start/completion/errors from p.reEncryptUserData so failures are observable.

In `@server/plugin/plugin_test.go`:
- Around line 45-77: The test currently only asserts that mockKvStore.Set was
called; update the mockKvStore.EXPECT().Set invocation in
TestReEncryptUserData_HappyPath to use DoAndReturn and inspect the value
parameter (the stored GitHubUserInfo) — cast it to *GitHubUserInfo, extract
storedInfo.Token.AccessToken, attempt to decrypt it with testNewKey (calling
decrypt([]byte(testNewKey), ...)), and assert no error and that the decrypted
value equals MockAccessToken; keep returning (true, nil) from the DoAndReturn so
existing behavior is preserved and ensure p.reEncryptUserData(testOldKey) and
api.AssertExpectations(t) remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9a21d94e-4734-413e-b375-1a3e0e965fa0

📥 Commits

Reviewing files that changed from the base of the PR and between c4c0b47 and 6262378.

📒 Files selected for processing (3)
  • server/plugin/configuration.go
  • server/plugin/plugin.go
  • server/plugin/plugin_test.go

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/plugin/plugin_test.go`:
- Around line 113-114: The LogWarn mock in the decrypt-failure test has the
wrong arity: update the api.On("LogWarn", ...) expectation to match the actual
call signature (5 arguments) instead of 7 so the mock matches and the test can
exercise disconnect behavior; locate the mock setup that calls api.On("LogWarn",
mock.Anything, ...) and change the argument list to five mock.Anything entries
(and keep the .Maybe() behavior if intended) so it matches the real LogWarn(msg,
"user_id", ..., "error", ...) invocation.

In `@server/plugin/plugin.go`:
- Around line 813-830: forceDisconnectUser currently skips deleting the reverse
KV mapping when githubUsername is empty; change the logic so before skipping the
p.store.Delete call you fallback to the stored property by fetching the user
(use p.client.User.Get(userID)) and, if user.Props contains "git_user", set
githubUsername to that value and then proceed to call
p.store.Delete(githubUsername + githubUsernameKey); keep existing warning logs
on Delete/Get/Update failures and retain the existing update-of-user.Props
removal path (delete(user.Props["git_user"]); p.client.User.Update(user)) so
stale mappings are removed even when original githubUsername couldn't be
obtained.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 52b578b2-55f5-4c09-b470-335f2a9d6e43

📥 Commits

Reviewing files that changed from the base of the PR and between 6262378 and 5c972b9.

📒 Files selected for processing (2)
  • server/plugin/plugin.go
  • server/plugin/plugin_test.go

@marianunez marianunez requested a review from edgarbellot March 10, 2026 20:29
@marianunez marianunez added the 3: Security Review Review requested from Security Team label Mar 10, 2026
Copy link
Contributor

@nevyangelova nevyangelova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @avasconcelos114 just a couple of comments from me


if previousEncryptionKey != "" && configuration.EncryptionKey != "" &&
previousEncryptionKey != configuration.EncryptionKey {
p.reEncryptUserData(previousEncryptionKey)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.reEncryptUserData(previousEncryptionKey) runs synchronously. For many users this will block the config hook long enough to cause a plugin timeout. Run it in a goroutine maybe, the cluster mutex already prevents concurrent runs.

p.client.Log.Warn("Failed to load user info for disconnect, falling back to force-disconnect",
"user_id", userID, "error", apiErr.Message)
var rawInfo *GitHubUserInfo
_ = p.store.Get(userID+githubTokenKey, &rawInfo)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this fails, there's no log. Should at least log the error since getGitHubUserInfo already failed on the same data.

// forceDisconnectUser performs a best-effort cleanup of a user's encrypted
// data and notifies them to reconnect
func (p *Plugin) forceDisconnectUser(userID, githubUsername string) {
if err := p.store.Delete(userID + githubTokenKey); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this leaves the userID + "_githubprivate" key orphaned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! funnily enough the normal disconnect function was also leaving this as an orphan so I made sure this gets cleared properly on both cases

Copy link

@edgarbellot edgarbellot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avasconcelos114 Solid work on this one! These changes also address what we discussed in the ticket. Left a few suggestions to tighten up some edge cases.


p.sendWebsocketEventIfNeeded(previousConfig, configuration)

p.setConfiguration(configuration)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor suggestion: setConfiguration activates the new key immediately here, but reEncryptUserData runs after (line 219). During that window, any user whose token hasn't been migrated yet will get a decryption error in getGitHubUserInfo, since it only tries the current key with no fallback.

For small deployments this is barely noticeable, but with thousands of users the migration could take some seconds, which means a visible outage of the plugin for those users.

The simplest fix would be to swap the order: run reEncryptUserData before setConfiguration, passing the new key explicitly so it can decrypt with the old key and re-encrypt with the new one. That way, by the time the new key becomes active, all tokens are already migrated and there's no error window.

}

config := p.getConfiguration()
if _, err := decrypt([]byte(config.EncryptionKey), userInfo.Token.AccessToken); err == nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unpad function in server/plugin/utils.go:75 only checks unpadding > length, but doesn't validate that the value is in the valid PKCS7 range (1–16) or that all padding bytes match. This matters because the idempotency check here relies on decrypt failing to detect tokens still encrypted with the old key.

Since AES-CFB has no authentication, decrypting with the wrong key produces garbage instead of an error. The only safety net is unpad, and right now it accepts any last byte <= message length. For a typical GitHub token (~40 chars, padded to 48 bytes), the last byte of the garbage is random (0–255) and only fails the check if it's > 48. That means 49 out of 256 possible values pass - roughly a ~19% false positive rate per user per rotation (analysis provided by Claude). When this happens, the token silently stays encrypted with the old key and the user is left broken without being force-disconnected.

A stricter unpad would fix this:

func unpad(src []byte) ([]byte, error) {
	length := len(src)
	if length == 0 {
		return nil, errors.New("unpad error: empty input")
	}

	unpadding := int(src[length-1])
	if unpadding < 1 || unpadding > aes.BlockSize || unpadding > length {
		return nil, errors.New("unpad error. This could happen when incorrect encryption key is used")
	}

	for i := length - unpadding; i < length; i++ {
		if src[i] != byte(unpadding) {
			return nil, errors.New("unpad error. This could happen when incorrect encryption key is used")
		}
	}

	return src[:(length - unpadding)], nil
}

This is safe because thepad function already generates valid PKCS7, so all existing correctly encrypted tokens will pass the stricter validation. It brings the false positive rate down from ~19% to ~0.4%.

return
}

p.client.Log.Info("Encryption key changed, re-encrypting user tokens",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR introduces new state-changing operations (re-encryption and force-disconnect), it would be a good opportunity to use LogAuditRec here instead of Log.Info/Log.Warn. An audit record summarizing the result (users migrated vs. force-disconnected) would give admins much better visibility.

Not a blocker, but worth considering to go in line with the Audit Logs Everywhere initiative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added audit logging for the migration in the last commit :) ccf9268

- Making sure rotation is non-blocking
- Passing correct references to remove risk of race conditions
- Applying stricter unpad logic
@avasconcelos114
Copy link
Contributor Author

@edgarbellot @nevyangelova Thank you both for the amazing reviews! 🙌

I've addressed the points listed above including the audit logging, let me know if there is anything else :)

Copy link

@edgarbellot edgarbellot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avasconcelos114 just went through the latest two commits - all good from the security side. Thanks for turning this around so quickly! :)

@edgarbellot edgarbellot removed the 3: Security Review Review requested from Security Team label Mar 11, 2026
Copy link
Contributor

@nevyangelova nevyangelova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2: Dev Review Requires review by a core committer 3: QA Review Requires review by a QA tester

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants