MM-67613 Added handling of encryption key rotation #979
MM-67613 Added handling of encryption key rotation #979avasconcelos114 wants to merge 7 commits intomasterfrom
Conversation
- Ensuring that even if an environment somehow bypasses the OnConfigurationChanged hook, a user can still disconnect and re-connect their accounts instead of being locked
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughOn configuration change the plugin captures the previous configuration and its EncryptionKey, sends websocket updates using the previous config, applies the new config, and — when both old and new EncryptionKey are non-empty and different — coordinates cluster-wide re-encryption of stored user tokens, forcing per-user disconnects on failures. Changes
Sequence Diagram(s)sequenceDiagram
participant Admin as Administrator
participant Config as Configuration Handler
participant WS as WebSocket Notifier
participant Cluster as Cluster Mutex/Leader
participant KV as KV Store
participant Crypto as Encryption Engine
participant UserMgr as User Cleanup & Notifier
Note over Config: Admin changes EncryptionKey
Admin->>Config: Update config (newEncryptionKey)
Config->>Config: Capture previous config & previousEncryptionKey
Config->>WS: Send update using previous config
Config->>Config: Apply new configuration
Config->>Cluster: Acquire re-encryption lock (cluster coord)
Cluster-->>Config: Lock acquired
Config->>KV: List token keys
KV-->>Config: Return token keys
loop For each token key
Config->>KV: Get encrypted token (kvKey)
KV-->>Config: Encrypted data
alt Decrypt with previousEncryptionKey succeeds
Config->>Crypto: Decrypt using previousEncryptionKey
Crypto-->>Config: Plain token
Config->>Crypto: Encrypt using newEncryptionKey
Crypto-->>Config: New encrypted token
Config->>KV: Store updated encrypted token
KV-->>Config: Persisted
else Decrypt or store fails
Config->>UserMgr: Force-disconnect user (cleanup & notify)
UserMgr->>KV: Delete token & username mapping
UserMgr->>WS: Notify user & post reconnection guidance
end
end
Config->>Cluster: Release lock
Cluster-->>Config: Lock released
Config->>Config: Finish rotation, audit results
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: unknown linters: 'modernize', run 'golangci-lint help linters' to see the list of supported linters Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
server/plugin/plugin_test.go (1)
45-77: Consider adding assertion to verify token is correctly re-encrypted.The happy path test verifies that
Setis called, but doesn't verify that the stored data contains a token that can be decrypted with the new key. This would add confidence that the encryption cycle completes correctly.💡 Optional: Add token verification
mockKvStore.EXPECT().Set("user1"+githubTokenKey, gomock.Any()).DoAndReturn( func(key string, value any, opts ...pluginapi.KVSetOption) (bool, error) { storedInfo := value.(*GitHubUserInfo) // Verify we can decrypt with new key decrypted, err := decrypt([]byte(testNewKey), storedInfo.Token.AccessToken) require.NoError(t, err) require.Equal(t, MockAccessToken, decrypted) return true, nil }, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/plugin/plugin_test.go` around lines 45 - 77, The test currently only asserts that mockKvStore.Set was called; update the mockKvStore.EXPECT().Set invocation in TestReEncryptUserData_HappyPath to use DoAndReturn and inspect the value parameter (the stored GitHubUserInfo) — cast it to *GitHubUserInfo, extract storedInfo.Token.AccessToken, attempt to decrypt it with testNewKey (calling decrypt([]byte(testNewKey), ...)), and assert no error and that the decrypted value equals MockAccessToken; keep returning (true, nil) from the DoAndReturn so existing behavior is preserved and ensure p.reEncryptUserData(testOldKey) and api.AssertExpectations(t) remain unchanged.server/plugin/configuration.go (1)
210-220: Consider running re-encryption asynchronously to avoid blocking configuration changes.The
reEncryptUserDatacall is synchronous, which could blockOnConfigurationChangefor a significant duration when many users are connected. This may cause timeouts or degrade plugin responsiveness during key rotation.Additionally, there's a subtle window between
setConfiguration(line 215) and starting re-encryption (line 219) where the new key is active but tokens are still encrypted with the old key. If concurrent requests occur during this window, they may fail to decrypt tokens.♻️ Consider wrapping re-encryption in a goroutine
p.setConfiguration(configuration) if previousEncryptionKey != "" && configuration.EncryptionKey != "" && previousEncryptionKey != configuration.EncryptionKey { - p.reEncryptUserData(previousEncryptionKey) + go p.reEncryptUserData(previousEncryptionKey) }Note: If you make this async, ensure proper logging of completion/failure and consider adding a mutex or flag to prevent concurrent re-encryption attempts.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/plugin/configuration.go` around lines 210 - 220, The synchronous call to p.reEncryptUserData blocks OnConfigurationChange and leaves a window where the new EncryptionKey is active but tokens are still encrypted with the old key; change this to start re-encryption in a new goroutine and coordinate via a mutex/flag on the plugin (e.g. add p.reEncryptInProgress bool + sync.Mutex or sync/atomic flag) so only one re-encryption runs at a time, set the flag before switching the configuration (or ensure decrypt logic checks the flag and attempts both keys) and clear it when the goroutine finishes, and ensure the goroutine logs start/completion/errors from p.reEncryptUserData so failures are observable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/plugin/plugin.go`:
- Around line 744-755: The current loop calling p.store.ListKeys(page,
keysPerPage, pluginapi.WithChecker(checker)) returns early on any page error,
leaving previously-collected keys in allKeys unprocessed; change this so that on
ListKeys error you log the error (using p.client.Log.Warn) but do not return
immediately — instead break the pagination loop and proceed to the re-encryption
loop to process the already-collected allKeys (or alternatively accumulate
errors and continue fetching all pages, but ensure allKeys is processed
regardless). Ensure keysPerPage, checker and the allKeys slice are reused after
the loop so re-encryption runs even when a later page fails.
---
Nitpick comments:
In `@server/plugin/configuration.go`:
- Around line 210-220: The synchronous call to p.reEncryptUserData blocks
OnConfigurationChange and leaves a window where the new EncryptionKey is active
but tokens are still encrypted with the old key; change this to start
re-encryption in a new goroutine and coordinate via a mutex/flag on the plugin
(e.g. add p.reEncryptInProgress bool + sync.Mutex or sync/atomic flag) so only
one re-encryption runs at a time, set the flag before switching the
configuration (or ensure decrypt logic checks the flag and attempts both keys)
and clear it when the goroutine finishes, and ensure the goroutine logs
start/completion/errors from p.reEncryptUserData so failures are observable.
In `@server/plugin/plugin_test.go`:
- Around line 45-77: The test currently only asserts that mockKvStore.Set was
called; update the mockKvStore.EXPECT().Set invocation in
TestReEncryptUserData_HappyPath to use DoAndReturn and inspect the value
parameter (the stored GitHubUserInfo) — cast it to *GitHubUserInfo, extract
storedInfo.Token.AccessToken, attempt to decrypt it with testNewKey (calling
decrypt([]byte(testNewKey), ...)), and assert no error and that the decrypted
value equals MockAccessToken; keep returning (true, nil) from the DoAndReturn so
existing behavior is preserved and ensure p.reEncryptUserData(testOldKey) and
api.AssertExpectations(t) remain unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9a21d94e-4734-413e-b375-1a3e0e965fa0
📒 Files selected for processing (3)
server/plugin/configuration.goserver/plugin/plugin.goserver/plugin/plugin_test.go
…entire rotation flow
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/plugin/plugin_test.go`:
- Around line 113-114: The LogWarn mock in the decrypt-failure test has the
wrong arity: update the api.On("LogWarn", ...) expectation to match the actual
call signature (5 arguments) instead of 7 so the mock matches and the test can
exercise disconnect behavior; locate the mock setup that calls api.On("LogWarn",
mock.Anything, ...) and change the argument list to five mock.Anything entries
(and keep the .Maybe() behavior if intended) so it matches the real LogWarn(msg,
"user_id", ..., "error", ...) invocation.
In `@server/plugin/plugin.go`:
- Around line 813-830: forceDisconnectUser currently skips deleting the reverse
KV mapping when githubUsername is empty; change the logic so before skipping the
p.store.Delete call you fallback to the stored property by fetching the user
(use p.client.User.Get(userID)) and, if user.Props contains "git_user", set
githubUsername to that value and then proceed to call
p.store.Delete(githubUsername + githubUsernameKey); keep existing warning logs
on Delete/Get/Update failures and retain the existing update-of-user.Props
removal path (delete(user.Props["git_user"]); p.client.User.Update(user)) so
stale mappings are removed even when original githubUsername couldn't be
obtained.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 52b578b2-55f5-4c09-b470-335f2a9d6e43
📒 Files selected for processing (2)
server/plugin/plugin.goserver/plugin/plugin_test.go
nevyangelova
left a comment
There was a problem hiding this comment.
Thanks @avasconcelos114 just a couple of comments from me
server/plugin/configuration.go
Outdated
|
|
||
| if previousEncryptionKey != "" && configuration.EncryptionKey != "" && | ||
| previousEncryptionKey != configuration.EncryptionKey { | ||
| p.reEncryptUserData(previousEncryptionKey) |
There was a problem hiding this comment.
p.reEncryptUserData(previousEncryptionKey) runs synchronously. For many users this will block the config hook long enough to cause a plugin timeout. Run it in a goroutine maybe, the cluster mutex already prevents concurrent runs.
server/plugin/plugin.go
Outdated
| p.client.Log.Warn("Failed to load user info for disconnect, falling back to force-disconnect", | ||
| "user_id", userID, "error", apiErr.Message) | ||
| var rawInfo *GitHubUserInfo | ||
| _ = p.store.Get(userID+githubTokenKey, &rawInfo) |
There was a problem hiding this comment.
if this fails, there's no log. Should at least log the error since getGitHubUserInfo already failed on the same data.
| // forceDisconnectUser performs a best-effort cleanup of a user's encrypted | ||
| // data and notifies them to reconnect | ||
| func (p *Plugin) forceDisconnectUser(userID, githubUsername string) { | ||
| if err := p.store.Delete(userID + githubTokenKey); err != nil { |
There was a problem hiding this comment.
I think this leaves the userID + "_githubprivate" key orphaned.
There was a problem hiding this comment.
Great catch! funnily enough the normal disconnect function was also leaving this as an orphan so I made sure this gets cleared properly on both cases
edgarbellot
left a comment
There was a problem hiding this comment.
@avasconcelos114 Solid work on this one! These changes also address what we discussed in the ticket. Left a few suggestions to tighten up some edge cases.
server/plugin/configuration.go
Outdated
|
|
||
| p.sendWebsocketEventIfNeeded(previousConfig, configuration) | ||
|
|
||
| p.setConfiguration(configuration) |
There was a problem hiding this comment.
One minor suggestion: setConfiguration activates the new key immediately here, but reEncryptUserData runs after (line 219). During that window, any user whose token hasn't been migrated yet will get a decryption error in getGitHubUserInfo, since it only tries the current key with no fallback.
For small deployments this is barely noticeable, but with thousands of users the migration could take some seconds, which means a visible outage of the plugin for those users.
The simplest fix would be to swap the order: run reEncryptUserData before setConfiguration, passing the new key explicitly so it can decrypt with the old key and re-encrypt with the new one. That way, by the time the new key becomes active, all tokens are already migrated and there's no error window.
server/plugin/plugin.go
Outdated
| } | ||
|
|
||
| config := p.getConfiguration() | ||
| if _, err := decrypt([]byte(config.EncryptionKey), userInfo.Token.AccessToken); err == nil { |
There was a problem hiding this comment.
The unpad function in server/plugin/utils.go:75 only checks unpadding > length, but doesn't validate that the value is in the valid PKCS7 range (1–16) or that all padding bytes match. This matters because the idempotency check here relies on decrypt failing to detect tokens still encrypted with the old key.
Since AES-CFB has no authentication, decrypting with the wrong key produces garbage instead of an error. The only safety net is unpad, and right now it accepts any last byte <= message length. For a typical GitHub token (~40 chars, padded to 48 bytes), the last byte of the garbage is random (0–255) and only fails the check if it's > 48. That means 49 out of 256 possible values pass - roughly a ~19% false positive rate per user per rotation (analysis provided by Claude). When this happens, the token silently stays encrypted with the old key and the user is left broken without being force-disconnected.
A stricter unpad would fix this:
func unpad(src []byte) ([]byte, error) {
length := len(src)
if length == 0 {
return nil, errors.New("unpad error: empty input")
}
unpadding := int(src[length-1])
if unpadding < 1 || unpadding > aes.BlockSize || unpadding > length {
return nil, errors.New("unpad error. This could happen when incorrect encryption key is used")
}
for i := length - unpadding; i < length; i++ {
if src[i] != byte(unpadding) {
return nil, errors.New("unpad error. This could happen when incorrect encryption key is used")
}
}
return src[:(length - unpadding)], nil
}
This is safe because thepad function already generates valid PKCS7, so all existing correctly encrypted tokens will pass the stricter validation. It brings the false positive rate down from ~19% to ~0.4%.
| return | ||
| } | ||
|
|
||
| p.client.Log.Info("Encryption key changed, re-encrypting user tokens", |
There was a problem hiding this comment.
Since this PR introduces new state-changing operations (re-encryption and force-disconnect), it would be a good opportunity to use LogAuditRec here instead of Log.Info/Log.Warn. An audit record summarizing the result (users migrated vs. force-disconnected) would give admins much better visibility.
Not a blocker, but worth considering to go in line with the Audit Logs Everywhere initiative.
There was a problem hiding this comment.
Added audit logging for the migration in the last commit :) ccf9268
- Making sure rotation is non-blocking - Passing correct references to remove risk of race conditions - Applying stricter unpad logic
|
@edgarbellot @nevyangelova Thank you both for the amazing reviews! 🙌 I've addressed the points listed above including the audit logging, let me know if there is anything else :) |
edgarbellot
left a comment
There was a problem hiding this comment.
@avasconcelos114 just went through the latest two commits - all good from the security side. Thanks for turning this around so quickly! :)
nevyangelova
left a comment
There was a problem hiding this comment.
Thanks @avasconcelos114
Summary
This PR applies a more graceful behavior for rotation of encryptionkey values in the settings.
This PR adds a rotation of encrypted values alongside a fallback if the rotation fails that clear user sessions and notifies the user. It also adds another fallback to prevent scenarios where a user is unable to use their account but also unable to disconnect due to the
getGitHubUserInfocall indisconnectGitHubAccountrunning into encryption errorsTicket Link
Fixes https://mattermost.atlassian.net/browse/MM-67613
QA Notes
Other than regenerating the Encryption Key value in the System Console and verifying that users can continue to use the plugin normally, we would also want to test a scenario where:
/github disconnectto confirm that a user is able to properly disconnect their account and create a fresh connectionChange Impact: 🔴 High
Reasoning: The PR modifies encryption key rotation, per-user token re-encryption, and disconnect/fallback flows—core authentication and data-persistence areas that affect all users with GitHub-linked accounts.
Regression Risk: High — changes touch encryption/decryption, KV persistence, and cluster-wide coordination; while tests were added, concurrent/race conditions, KV/API failures, or mis-handled decrypt errors could orphan credentials or force unintended disconnects.
** QA Recommendation:** Strongly recommend thorough manual QA covering key-rotation happy paths and failure modes (forced decrypt failures), verification of re-encryption vs. force-disconnect behavior (DM and WebSocket notifications), disconnect/reconnect flows for unaffected users, and concurrent rotations with active API usage. Skipping manual QA risks user lockout or silent credential loss.
Generated by CodeRabbitAI