Nhse o40 orkv.i141 silvermachine#3
Conversation
WIP. A framework of modules and functions for SliverMachine.
Apply only to new files, or heavily altered files so that broader change history is maintained.
Also add callback module and unit test the sending of responses both streamed and whole.
Plus some further testing/formatting
Retain mohijson2 and mochinum to ease transition
Where URI starts "/" will add a leading <<>> to the split path - which is confusing and easy to forget about.
Address initial feedback
Also provides functions for converting Last Modified Date in KV GET.
128 aligns with mochiweb - otherwise some tests with riak_test that use large bursts of connections may fail
Without requiring knowledge when implementing routes of the routes priority relative to others.
| {ok, PeerIP, Cert} = riak_api_web_socket:get_peer(Socket), | ||
| loop(Socket, <<>>, PeerIP, Cert, Port); | ||
| {error, timeout} -> | ||
| init(Server, Listener, Port); |
There was a problem hiding this comment.
No max number of attempts?
There was a problem hiding this comment.
I think not. We start a listener, and a pool of acceptors - however, there is no telling how long it might be before there is a connection. If we reached a max number of loops, we still want to maintain the pool, so even if we exit, it would only be to start another acceptor and put it in the loop.
This was copied from Elli:
https://github.com/elli-lib/elli/blob/main/src/elli_http.erl#L57-L81
However, is it actually the case that we don't need the timeout. i.e. we should use gen_tcp:accept/1 (with an infinite timeout) and then not have to loop on timeout - as the purpose of the loop isn Elli (reading the comments) is to support a code upgrade we will probably not require?
| State#socket_state{ | ||
| pool_size = State#socket_state.pool_size - 1, | ||
| acceptor_pool = | ||
| sets:del_element(Pid, State#socket_state.acceptor_pool) |
There was a problem hiding this comment.
If I don't misunderstand something, this is a potential bug.
Please correct me if I'm wrong :)
While a new acceptor is started on accept when:
PS < State#socket_state.max_pool_size
EXIT decrements pool_size and removes the pid.
When PS >= max_pool_size, every acceptor can be busy serving connections, so no process is blocked in accept.
When those workers exit, the listener still never get a new blocking acceptor unless something else spawns one.
From my understanding, the behaviour will be that some connections get served as normal, but new connection attempts will timeout (?) while the listener is still alive.
I will try to come up with a test that reproduces this.
There was a problem hiding this comment.
I think you're right. Once the max_pool_size has been exceeded, and all acceptors are busy handling connections - I don't think there is anything to start new acceptors.
I think the model for acceptor pools requires a rethink.
There was a problem hiding this comment.
It might be easier to follow if we track awaiting_acceptor and busy_acceptor. When an awaiting_acceptor gets a connection, it is moved to busy_acceptor, and if size(awaiting_acceptor) + size(busy_acceptor) < max_pool_size a new acceptor is created.
If an acceptor exit is detected, tit needs to be removed from the pool (presumably the busy_acceptor), but maybe check awaiting_acceptor if it is not in the busy list (perhaps it can exit before accepting a connection). If the size(awaiting_acceptor) < pool_size AND size(awaiting_acceptor) + size(busy_acceptor) < max_pool_size - then create a replacement acceptor.
There was a problem hiding this comment.
See 27d39b8
This contains a change to riak_api_web_socket acceptor pool management, and new connection tests in tests/riak_api_web_ets_store
There was a problem hiding this comment.
Nice!
To follow up on my comment - it took a while to wrap my head around this :)
This test reproduces the issue with the "single" acceptor_pool. My adhoc solution was to restart an acceptor on EXIT (it works to demonstrate the issue).
I very much like your approach: separating acceptor_pool and connected_pool makes this much clearer to follow.
There was a problem hiding this comment.
The only thing I am suspicious about is, that the accepted check happens BEFORE setting the new state.
For example, this edge case: AP = 0, CP = 3 and max_pool_size = 3.
The guard sees 0 + 3 < 3 -> which does not spawn a new acceptor (as it should ??). Only AFTER a CP process exits, a new acceptor gets triggered.
I am not sure if this is what we want.
Maybe better: compute the new state and then decide what to do.
Something like this:
AP1 = sets:del_element(ExP, AP),
CP1 = sets:del_element(ExP, CP),
NeedIdle = sets:size(AP1) < State#socket_state.pool_size,
BelowMax = sets:size(AP1) + sets:size(CP1) < State#socket_state.max_pool_size.Then start a replacement when NeedIdle andalso BelowMax is true.
Wdyt?
There was a problem hiding this comment.
I've done a different version of your proposed modification: 18a1f78
I think it is much clearer to update the sets and then have the case statement based on sets:size rather than assumed size. Hopefully the above reads OK (I set it out this way to reduce some code duplication).
I've added the commit with your test into the test suite. Thanks
Co-authored-by: Thomas Arts <thomas.arts@quviq.com>
No intention to support trailer fields. Cannot merge with headers, as headers already processed by callback. Actually looking at trailer fields provides further risk (how many should be supported, should they be included in content size etc).
hmmr
left a comment
There was a problem hiding this comment.
Overall, designing this riak component as a pared-down http server, with no attempt to present it as a general purpose http server outside of riak, is clearly the right approach. A handful of non-critical issues; otherwise, ready to approve.
Handle different error scenarios
Add a regression test for the empty acceptor pool case after max pool use and worker exits. Register the new module in the erlfmt file list.
Switch HTTP API to use SilverMachine not WebMachine/Mochiweb.
Intentions are: