fix(video): check encoder started up before trying to flush frames on destruction#5257
fix(video): check encoder started up before trying to flush frames on destruction#5257Kishi85 wants to merge 1 commit into
Conversation
ec0fcc7 to
4ccc01b
Compare
Bundle ReportBundle size has no change ✅ |
|
This flushing was a workaround to address a memory leak each time streaming is started and stopped on AMD AMF encoders, so please test that to make sure you aren't reintroducing that. I agree that it shouldn't be necessary, but it also shouldn't be crashing either. Flushing is a totally valid thing to do here. The crash probably means there's a deeper use-after-free or race condition that we're just hiding by removing the flushing code. |
I'm also thinking that this is some kind of race condition that occurs when the hw context is torn down (by closing the capture instance as the specific problem here is accessing What would be interesting is to know if we have a way to ensure that this context is still valid when the frame flushing is validated (as that is where it segfaults) or if the fix can be done solely in FFmpeg. |
4ccc01b to
3731283
Compare
|
I think I've found a better solution and also finally the root cause for this issue which is due to the encoder never having seen a single frame (avcodec_ctx->frame_num being 0) when trying to flush frames. This does also explain why ctx->pic_end is NULL when the segfault hits as there was never a frame that could have been used to setup start/end. So my updated solution is to check if the encoder has started up before trying to flush frames and otherwise skip this step. Checking if the encoder has started up is done by checking if // Fix timestamps if we hit end-of-stream before the initial decode
// delay has elapsed.This seems to fix the issue without having to remove the frame flushing fully leaving the workaround for AMD AMF intact, except for when it has not seen enough frames. When the encoder is starting up the delay seems to be 0 for the encoders I was able to test here (so it's just checking if the encoder has seen a frame at all in order to do the flushing). @cgutman Does this look like a better solution to you? |
|
I'm not sure if we could just check for I'll have another look at the code, will do another testrun and update the PR if this is working. EDIT: Looks like ctx->pic_start and ctx->pic_end are setup from the first valid frame sent: |
3731283 to
5885788
Compare
|
|
Updated PR to check for at least one frame sent to the encoder to make sure |
|
This still doesn't resolve crashing on mode or display switch with Vulkan on my system, but adding an arbitrary sleep to the destructor actually seems to resolve the issue. This is not the right solution but perhaps it might help identify the proper area to fix. I tested the position of the sleep and it seems to work OK right before or after the flush block, but adding it after The crash that occurs on master or with this PR (and no sleep insertion) is easily reproduced either by spamming the display switch key combo (it often happens on the first switch, and usually within about 5 switch attempts) or with this script: Running with a 0 sec timeout triggers the crash every time on my system, but it can survive 50 or so of the 100 cycles sometimes. If you can't reproduce the issue, it might be an issue that's exposed with a slower CPU; my 5700X is getting old, so manually throttling your CPU frequency may help to reproduce the issue. Finally, adding the sleep to the destructor on master branch alone doesn't resolve the display switch crashing, so your flush check is definitely helping the situation. There may be multiple reasons for these crashes and the flush check you added is definitely resolving at least one of them. Edit: I can still reproduce a rare crash on display switch with the sleep directly after the flush block, but it may just be an insufficient sleep interval to workaround the issue. When I have time I will test this more thoroughly and open a separate issue if I don't make any progress. |
I've seen one instance of a crash that was specifically related to vulkan but the stacktrace is different from what this PR fixes. In fact it had a totally different callstack: This PR is only fixing the specific issue of Do you have a stacktrace/systemd coredump with debug symbols by any chance? Would be interesting to see what it looks like as it seems to be linked to the destructor call as well (if it's like my stacktrace above it might be a double free from the encoder and av_packet network stack, which a sleep workaround would help with) and was likely masked by the flush segfault before this PR is applied. That new vulkan-only crash would be a separate issue with a separate fix though. EDIT: I also cannot reproduce this with your script on a 0 second timeout unfortunately (even throttled). With 0 seconds its green screening from time to time but it never crashes. |
|
The problem is that I can't effectively test this PR with Vulkan due to the multiple conflicting issues - but I can confirm that it fixes VAAPI. The backtrace against master from a display switch crash shows the I'll move my UAF/deadlock issue with Vulkan that's not targeted by this PR into another PR or issue, but for the purposes of testing, I added the 1ms sleep to master branch in order to try and reproduce So I think I can confirm that your flush check is definitely working as intended for VAAPI and most likely for Vulkan too. |
Thanks for confirming the fix and ensuring the vulkan crash is indeed a separate issue. I'll keep an eye out for your Issue/PR on that and will help as much as I'm able fixing that one as well. |



Description
This PR aims to fix #4943 by removing the frame flushing on
~avcodec_encode_session_t()for cases where the encoder has not stared up yet. Checking this is done by checking ifavcodec_ctx->frame_num > 0. This check is sufficient as the first frame sent will setupctx->pic_endwhich is segfaulting otherwise here:This code sequence is expecting
ctx->pic_end != NULLbut if the encoder has not seen any frame yet then this expectation will not be fulfilled causing a segfault asctx->pic_startandctx->pic_endare stillNULL.Screenshot
Issues Fixed or Closed
Roadmap Issues
Type of Change
Checklist
AI Usage