Skip to content

Nhse o40 orkv.i141 silvermachine#3

Open
martinsumner wants to merge 71 commits into
openriak-4.0from
nhse-o40-orkv.i141-silvermachine
Open

Nhse o40 orkv.i141 silvermachine#3
martinsumner wants to merge 71 commits into
openriak-4.0from
nhse-o40-orkv.i141-silvermachine

Conversation

@martinsumner
Copy link
Copy Markdown
Contributor

Switch HTTP API to use SilverMachine not WebMachine/Mochiweb.

Intentions are:

  • improve performance of common requests, especially with significant volumes of index entries and/or user object metadata;
  • simplify the callbacks required within the application;
  • only implement the basics of HTTP, the full extent of rules (i.e. equivalent webmachine decision_core) may be implemented in the callbacks, but aren't applied by default;
  • allow for streaming on both inbound request bodies as well as outbound response bodies, to ready support for file upload/downloads;
  • simplify the task of providing SSL handshake information to the application

WIP.  A framework of modules and functions for SliverMachine.
Apply only to new files, or heavily altered files so that broader change history is maintained.
Also add callback module and unit test the sending of responses both streamed and whole.
Plus some further testing/formatting
Retain mohijson2 and mochinum to ease transition
Where URI starts "/" will add a leading <<>> to the split path - which is confusing and easy to forget about.
Also provides functions for converting Last Modified Date in KV GET.
128 aligns with mochiweb - otherwise some tests with riak_test that use large bursts of connections may fail
@martinsumner martinsumner marked this pull request as ready for review May 1, 2026 11:14
Without requiring knowledge when implementing routes of the routes priority relative to others.
Comment thread src/riak_api_web.erl Outdated
{ok, PeerIP, Cert} = riak_api_web_socket:get_peer(Socket),
loop(Socket, <<>>, PeerIP, Cert, Port);
{error, timeout} ->
init(Server, Listener, Port);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No max number of attempts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not. We start a listener, and a pool of acceptors - however, there is no telling how long it might be before there is a connection. If we reached a max number of loops, we still want to maintain the pool, so even if we exit, it would only be to start another acceptor and put it in the loop.

This was copied from Elli:

https://github.com/elli-lib/elli/blob/main/src/elli_http.erl#L57-L81

However, is it actually the case that we don't need the timeout. i.e. we should use gen_tcp:accept/1 (with an infinite timeout) and then not have to loop on timeout - as the purpose of the loop isn Elli (reading the comments) is to support a code upgrade we will probably not require?

Comment thread README.md Outdated
Comment thread docs/silverMachine.md Outdated
Comment thread src/riak_api_web_acceptor.erl Outdated
Comment thread src/riak_api_web_acceptor.erl
Comment thread src/riak_api_web_acceptor.erl Outdated
Comment thread src/riak_api_web_acceptor.erl Outdated
Comment thread src/riak_api_web_headers.erl Outdated
Comment thread src/riak_api_web_headers.erl
Comment thread src/riak_api_web_headers.erl
Comment thread src/riak_api_web_body.erl Outdated
Comment thread src/riak_api_web_security.erl
Comment thread src/riak_api_web_socket.erl Outdated
State#socket_state{
pool_size = State#socket_state.pool_size - 1,
acceptor_pool =
sets:del_element(Pid, State#socket_state.acceptor_pool)
Copy link
Copy Markdown

@f34nk f34nk May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't misunderstand something, this is a potential bug.
Please correct me if I'm wrong :)

While a new acceptor is started on accept when:

PS < State#socket_state.max_pool_size

EXIT decrements pool_size and removes the pid.

When PS >= max_pool_size, every acceptor can be busy serving connections, so no process is blocked in accept.

When those workers exit, the listener still never get a new blocking acceptor unless something else spawns one.

From my understanding, the behaviour will be that some connections get served as normal, but new connection attempts will timeout (?) while the listener is still alive.

I will try to come up with a test that reproduces this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right. Once the max_pool_size has been exceeded, and all acceptors are busy handling connections - I don't think there is anything to start new acceptors.

I think the model for acceptor pools requires a rethink.

Copy link
Copy Markdown
Contributor Author

@martinsumner martinsumner May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to follow if we track awaiting_acceptor and busy_acceptor. When an awaiting_acceptor gets a connection, it is moved to busy_acceptor, and if size(awaiting_acceptor) + size(busy_acceptor) < max_pool_size a new acceptor is created.

If an acceptor exit is detected, tit needs to be removed from the pool (presumably the busy_acceptor), but maybe check awaiting_acceptor if it is not in the busy list (perhaps it can exit before accepting a connection). If the size(awaiting_acceptor) < pool_size AND size(awaiting_acceptor) + size(busy_acceptor) < max_pool_size - then create a replacement acceptor.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 27d39b8

This contains a change to riak_api_web_socket acceptor pool management, and new connection tests in tests/riak_api_web_ets_store

Copy link
Copy Markdown

@f34nk f34nk May 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

To follow up on my comment - it took a while to wrap my head around this :)
This test reproduces the issue with the "single" acceptor_pool. My adhoc solution was to restart an acceptor on EXIT (it works to demonstrate the issue).

I very much like your approach: separating acceptor_pool and connected_pool makes this much clearer to follow.

Copy link
Copy Markdown

@f34nk f34nk May 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I am suspicious about is, that the accepted check happens BEFORE setting the new state.

For example, this edge case: AP = 0, CP = 3 and max_pool_size = 3.
The guard sees 0 + 3 < 3 -> which does not spawn a new acceptor (as it should ??). Only AFTER a CP process exits, a new acceptor gets triggered.

I am not sure if this is what we want.

Maybe better: compute the new state and then decide what to do.

Something like this:

AP1 = sets:del_element(ExP, AP),
CP1 = sets:del_element(ExP, CP),
NeedIdle = sets:size(AP1) < State#socket_state.pool_size,
BelowMax = sets:size(AP1) + sets:size(CP1) < State#socket_state.max_pool_size.

Then start a replacement when NeedIdle andalso BelowMax is true.

Wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a different version of your proposed modification: 18a1f78

I think it is much clearer to update the sets and then have the case statement based on sets:size rather than assumed size. Hopefully the above reads OK (I set it out this way to reduce some code duplication).

I've added the commit with your test into the test suite. Thanks

martinsumner and others added 4 commits May 12, 2026 15:44
Co-authored-by: Thomas Arts <thomas.arts@quviq.com>
No intention to support trailer fields.  Cannot merge with headers, as headers already processed by callback.

Actually looking at trailer fields provides further risk (how many should be supported, should they be included in content size etc).
Copy link
Copy Markdown

@hmmr hmmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, designing this riak component as a pared-down http server, with no attempt to present it as a general purpose http server outside of riak, is clearly the right approach. A handful of non-critical issues; otherwise, ready to approve.

Comment thread test/riak_api_web_ets_store.erl
Comment thread src/riak_api_web_handler.erl Outdated
Comment thread src/riak_api_web_headers.erl
Comment thread src/riak_api_web.erl Outdated
Comment thread src/riak_api_web.erl
Comment thread src/riak_api_web.erl Outdated
Comment thread src/riak_api_web.erl
Comment thread src/riak_api_web.erl Outdated
f34nk and others added 3 commits May 18, 2026 14:18
Add a regression test for the empty acceptor pool case after max pool use and worker exits.
Register the new module in the erlfmt file list.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants