REST: Add retry and timeout configuration for REST catalog#3418
REST: Add retry and timeout configuration for REST catalog#3418rahulsmahadev wants to merge 4 commits into
Conversation
Closes apache#2772. The REST Catalog uses requests with no retries and no timeout by default, so transient 5xx/network failures bubble up immediately and slow servers can hang the client indefinitely (e.g. a Polaris instance returning 504 from a proxy). Add an optional connection: block on the catalog properties: catalog: default: uri: http://rest-catalog/ws/ connection: timeout: 60 retry: total: 5 backoff_factor: 1.0 status_forcelist: [429, 500, 502, 503, 504] allowed_methods: [GET, HEAD, OPTIONS] connection.retry is passed verbatim to urllib3.util.retry.Retry. Both keys are optional and opt-in: when neither is set the default requests behavior is preserved. Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
|
Can you take a look at the linter failures? 'make lint' will help you run the linters locally. |
| | Key | Example | Description | | ||
| | ---------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------------ | | ||
| | connection.timeout | 60 | Per-request timeout in seconds. Must be a positive number. | | ||
| | connection.retry | `{total: 5, backoff_factor: 1.0}` | Mapping passed verbatim as kwargs to [`urllib3.util.retry.Retry`](https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.Retry). | |
There was a problem hiding this comment.
I feel like having a dictionary here is a bit tricky; this couples the configuration tightly with urllib3. When urllib changes something, or we want to use another library, then we have to map this. How about adding the options explicitly.
There was a problem hiding this comment.
+1 on this.
Allowing so many retry options can also lead to unintended behaviors. As an example, what happens if you set raise_on_status to False, but you receive a 403 that you should raise an error on?
I think we should focus on timeout, # of retries, and a backoff factor.
There was a problem hiding this comment.
Good call — dropped the dict pass-through. The connection block now exposes only three explicit options: timeout, retries, and backoff-factor. The status_forcelist and allowed_methods are hard-coded internally to transient codes (429/500/502/503/504) and idempotent methods (GET/HEAD/OPTIONS), so users can't accidentally swallow non-transient errors or retry non-idempotent writes. Fixed in 6fb87ff.
| assert "Could not find the TLS certificate file, invalid path: path_to_client_cert" in str(e.value) | ||
|
|
||
|
|
||
| def test_session_without_connection_config_uses_default_adapter(rest_mock: Mocker) -> None: |
There was a problem hiding this comment.
Can we get a test where we set the retry logic and then see the retries occur? We should be able to simulate this with mock HTTP calls and then see that X number of HTTP calls were made afterwards.
There was a problem hiding this comment.
Added test_session_retries_on_transient_5xx_then_succeeds in 6fb87ff. requests_mock actually replaces the HTTPAdapter on the session, which bypasses our retry logic, so the test instead stands up a real http.server on a loopback port. The handler returns three 503s followed by a 200, and the test asserts both that list_namespaces succeeds and that the handler saw 4 requests.
Per @Fokko + @rambleraptor: drop the urllib3 retry-dict pass-through and expose only three explicit knobs on the connection block (timeout, retries, backoff-factor). Hard-code the retry policy (status_forcelist of transient codes; allowed_methods of idempotent verbs) so users cannot misconfigure e.g. raise_on_status=False and silently swallow 4xx errors. Per @rambleraptor: add a test that exercises the retry path end-to-end by spinning up a loopback HTTP server that returns three 503s then a 200, verifying the catalog makes all four attempts. requests_mock can't be used here because it replaces the HTTPAdapter and bypasses retry logic. Fix the three mypy errors flagged by CI: - _RetryTimeoutHTTPAdapter.send now matches HTTPAdapter.send's full signature instead of (request, **kwargs). - Test's set(adapter.max_retries.allowed_methods) now guards the Collection[str] | None type. Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
|
Pushed 6fb87ff addressing the review feedback:
PTAL! |
|
Can you run |
requests' HTTPAdapter.send declares proxies as Mapping[str, str] | None (via types-requests). Overriding with the more specific MutableMapping is an invalid override under mypy 1.18 strict mode. Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
|
Linters are now passing — |
rambleraptor
left a comment
There was a problem hiding this comment.
A couple quick notes. This is basically there in my opinion. Thanks a lot for doing all of this!
| The REST Catalog uses `requests` with no retries and no timeout by default, so transient | ||
| 5xx / network failures bubble up immediately and slow servers can hang the client indefinitely. | ||
| Set a `connection:` block on the catalog to opt in to a per-request timeout and a retry policy. | ||
| Every key is optional; when none are set, the default `requests` behavior is preserved. |
There was a problem hiding this comment.
I'd get rid of this sentence.
If you override connection.timeout, connections.retries should still take its default value. This sentence makes it sound like you have to specify the default values on the other places or else you'll have undefined behavior.
| max_retries: Retry | None = None | ||
| if retries is not None or backoff_factor is not None: | ||
| max_retries = Retry( | ||
| total=retries if retries is not None else 0, |
There was a problem hiding this comment.
We can set retries initially to 0 to avoid this statement.
There was a problem hiding this comment.
Good call — initialized retries and backoff_factor to 0 in 47a5382, so the inline conditional defaults are gone. A Retry(total=0, backoff_factor=0) is a no-op, functionally equivalent to no Retry at all.
| if retries is not None or backoff_factor is not None: | ||
| max_retries = Retry( | ||
| total=retries if retries is not None else 0, | ||
| backoff_factor=backoff_factor if backoff_factor is not None else 0, |
There was a problem hiding this comment.
Same fix — backoff_factor now initializes to 0.0 alongside retries. See reply above.
| {"defaults": {}, "overrides": {}, "endpoints": [str(endpoint) for endpoint in TEST_SUPPORTED_ENDPOINTS]} | ||
| ).encode() | ||
|
|
||
| class _Handler(BaseHTTPRequestHandler): |
There was a problem hiding this comment.
Can you add the server setup into a different method? That way, we can easily see what this is actually testing + less about the test setup.
There was a problem hiding this comment.
Done in 47a5382 — extracted the handler + threading setup into a _local_rest_server_503_then_200(num_failures) context manager. The test body is now just the catalog construction, list_namespaces(), and the two assertions.
- Drop the misleading 'when none are set, the default requests behavior is preserved' sentence in the docs (partial overrides don't reset other knobs to undefined behavior). - Initialize retries and backoff_factor to 0 instead of None so the Retry() call no longer needs conditional defaults inline. A Retry(total=0) is a no-op and is functionally equivalent to no Retry at all. - Extract the loopback HTTP server setup from the retry test into a _local_rest_server_503_then_200 context manager; the test body now shows intent (set retries=3, list, verify 4 calls) without the handler scaffold. Signed-off-by: rahulsmahadev <rahul.mahadev@databricks.com>
|
Pushed 47a5382 addressing the four review nits:
Thanks again for the careful review! |
rambleraptor
left a comment
There was a problem hiding this comment.
Thank you for doing this! This will be super helpful.
Rationale for this change
The REST Catalog uses
requestswith no retries and no timeout configured, so transient 5xx / network failures bubble up immediately and slow servers can hang the client indefinitely. The reporting user on #2772 hit infinite-timeout hangs against a Polaris instance returning504from a proxy.This change adds an optional
connection:block on the catalog properties for opting in to a per-request timeout and a retry policy, as proposed in the issue body:Changes:
connection,connection.timeout,connection.retrycatalog properties._RetryTimeoutHTTPAdapterthat subclassesrequests.adapters.HTTPAdapterand applies a default timeout to every call (sincerequestsdoes not expose a session-wide timeout).connection.retrysub-mapping is passed verbatim tourllib3.util.retry.Retry, so any kwarg the upstream class accepts can be configured.requestsbehavior is preserved — no behavior change for existing users.timeoutraisesValueError; unknownRetrykwargs are caught and re-raised asValueErrorwith a clearer message.The SigV4 adapter mounted at the catalog URI (see #3063) is a longer-prefix match and still wins for that host, so SigV4 users continue to get their existing retry behavior unchanged — this change deliberately stays out of that path.
Are these changes tested?
Yes. Added five tests in
tests/catalog/test_rest.py:test_session_without_connection_config_uses_default_adapter— noconnection:block → no_RetryTimeoutHTTPAdapteris mounted (defaultrequestsbehavior preserved).test_session_with_connection_timeout_and_retry— full nestedconnection:block applies timeout and allRetrykwargs (total,backoff_factor,status_forcelist,allowed_methods).test_session_with_connection_timeout_only—timeoutalone works without aretryblock.test_session_with_invalid_connection_timeout_raises— negativetimeoutraisesValueError.test_session_with_invalid_connection_retry_kwarg_raises— unknownRetrykwarg raisesValueError.Are there any user-facing changes?
Yes — two new optional REST catalog properties (
connection.timeout,connection.retry) documented inmkdocs/docs/configuration.mdunder "Retry and timeout". No behavior change for users who do not set them.