feat(dynamo): add target_executorch setting to keep output-allocator ops in PyTorch by shoumikhin · Pull Request #4355 · pytorch/TensorRT

shoumikhin · 2026-06-21T00:28:32Z

Description

Some converters require a TensorRT output allocator because their output shape is
data-dependent (for example aten.nonzero and boolean aten.index.Tensor). Not every
downstream runtime that executes the compiled program can consume a TensorRT engine
that needs an output allocator. Runtimes that rely on ahead-of-time static memory
planning, for instance, cannot size an output whose shape is only known after the
engine runs.
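
To make the constraint concrete, here is a minimal pure-Python sketch, not code from this PR. `nonzero_indices` and `plan_buffer_bytes` are hypothetical helpers: the first mimics a 1-D nonzero, the second mimics a planner that can only size buffers from fully concrete shapes.

```python
from math import prod


def nonzero_indices(values):
    """Return the indices of non-zero entries, like a 1-D nonzero."""
    return [i for i, v in enumerate(values) if v != 0]


def plan_buffer_bytes(shape, itemsize=8):
    """Size a buffer ahead of time; every dimension must be a concrete int.

    Hypothetical stand-in for a static memory planner. A dimension of None
    models a size that is only known after execution.
    """
    if any(dim is None for dim in shape):
        raise ValueError(f"cannot plan a buffer for shape {shape}")
    return prod(shape) * itemsize


# Two inputs with the same shape produce outputs of different lengths,
# so the output shape cannot be derived from the input shape alone.
sparse = [0, 3, 0, 5]
dense = [1, 2, 3, 4]
lengths = (len(nonzero_indices(sparse)), len(nonzero_indices(dense)))
```

A planner of this kind can size the input, `plan_buffer_bytes((4,))`, but has nothing concrete to pass for the output, which is the situation the description refers to.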

This adds a fallback_data_dependent_ops compile setting (default False). When
enabled, an operator runs in PyTorch instead of TensorRT iff the converter selected
for that specific node requires an output allocator. When disabled (the default),
behavior is unchanged.

Details

The decision is made per node in the partitioner operator-support check
(TorchTensorRTOperatorSupport and OpSupportTester), which asks the converter
registry which converter it would select for the node (honoring
capability_validator). This matters for targets like aten.index.Tensor that
register two converters: boolean indexing requires an output allocator and falls
back, while ordinary integer gather indexing stays on TensorRT.
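
The per-node decision can be sketched as follows. This is an illustrative model only, assuming a registry that maps a target to an ordered list of converters; `Node`, `Converter`, `REGISTRY`, `select_converter`, and `is_node_supported` are hypothetical names and do not mirror the actual torch-tensorrt classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    target: str
    index_dtype: Optional[str] = None  # "bool" or "int64" for index ops


@dataclass(frozen=True)
class Converter:
    name: str
    capability_validator: Callable[[Node], bool]
    requires_output_allocator: bool = False


# Hypothetical registry: one target may register several converters.
REGISTRY = {
    "aten.nonzero": [
        Converter("nonzero", lambda n: True, requires_output_allocator=True),
    ],
    "aten.index.Tensor": [
        Converter(
            "index_bool",
            lambda n: n.index_dtype == "bool",
            requires_output_allocator=True,
        ),
        Converter("index_gather", lambda n: n.index_dtype != "bool"),
    ],
}


def select_converter(node):
    """Return the first registered converter whose validator accepts the node."""
    for converter in REGISTRY.get(node.target, []):
        if converter.capability_validator(node):
            return converter
    return None


def is_node_supported(node, fallback_data_dependent_ops=False):
    """Decide whether the node may run in TensorRT (True) or falls back (False)."""
    converter = select_converter(node)
    if converter is None:
        return False
    # Key on the converter selected for this node, not on the target.
    if fallback_data_dependent_ops and converter.requires_output_allocator:
        return False
    return True
```

With the flag on, `is_node_supported(Node("aten.index.Tensor", "bool"), True)` is rejected while the `"int64"` variant of the same target is still accepted, which is the distinction a per-target check would lose.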
- Wired through compile() and cross_compile_for_windows(); the check runs during
  partitioning, which both entry points reach through compile_module(). It is
  intentionally not exposed on convert_exported_program_to_serialized_trt_engine(),
  where a single serialized engine cannot contain PyTorch fallbacks.
- Combining fallback_data_dependent_ops with require_full_compilation raises a
  clear error, since routing ops to PyTorch contradicts full compilation.
- CompilationSettings.__setstate__ defaults the new field so older pickles load.
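
The conflict check and the pickle-compatibility default can be sketched like this. `Settings` is a hypothetical two-field stand-in for CompilationSettings, and `validate` and its error message are assumptions; the real class has many more fields and may be implemented differently.

```python
import pickle
from dataclasses import dataclass


@dataclass
class Settings:  # hypothetical stand-in for CompilationSettings
    require_full_compilation: bool = False
    fallback_data_dependent_ops: bool = False

    def __setstate__(self, state):
        # Older pickles predate the new field; default it before restoring.
        state = dict(state)
        state.setdefault("fallback_data_dependent_ops", False)
        self.__dict__.update(state)


def validate(settings):
    """Reject the contradictory combination of the two flags."""
    if settings.fallback_data_dependent_ops and settings.require_full_compilation:
        raise ValueError(
            "fallback_data_dependent_ops routes ops to PyTorch and cannot be "
            "combined with require_full_compilation"
        )
    return settings


# Simulate restoring a state dict written before the field existed.
restored = Settings.__new__(Settings)
restored.__setstate__({"require_full_compilation": True})

# A current object still round-trips with its value intact.
round_tripped = pickle.loads(pickle.dumps(Settings(fallback_data_dependent_ops=True)))
```

Using `setdefault` leaves a value that is present in the state untouched, so only states written before the field existed receive the default.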

Tests

tests/py/dynamo/models/test_fallback_data_dependent_ops.py:

- the setting defaults to False and is settable;
- a state missing the field (older pickle) restores to False;
- the per-node support decision for nonzero (CPU, no GPU needed);
- combining with require_full_compilation raises;
- end to end on GPU, a data-dependent op (nonzero) falls back to PyTorch.

Type of change

New feature (non-breaking change which adds functionality)

Checklist

My code follows the style guidelines of this project (isort + black)
I have added tests that prove my fix/feature works
Commit is signed off (DCO)

narendasan · 2026-06-23T17:01:06Z

@shoumikhin why can't downstream operators consume output allocated outputs?

Also name wise, I'd rather a specific name like fallback_data_dependent_ops so it could be used for other use cases naturally

…pendent ops in PyTorch

Some converters require a TensorRT output allocator because their output shape is data-dependent (for example aten.nonzero and boolean aten.index.Tensor). A TensorRT engine that needs an output allocator cannot be consumed by every downstream runtime that executes the compiled program.

This adds a fallback_data_dependent_ops compile setting (default False). When enabled, an operator runs in PyTorch instead of TensorRT iff the converter selected for that specific node requires an output allocator. When disabled (the default), behavior is unchanged.

The decision is made per node in the partitioner operator-support check (TorchTensorRTOperatorSupport and OpSupportTester), which asks the converter registry which converter it would select for the node (honoring capability_validator). This matters for targets like aten.index.Tensor that register two converters: boolean indexing requires an output allocator and falls back, while ordinary integer gather indexing stays on TensorRT.

Details:
- Wired through compile() and cross_compile_for_windows(); the check runs during partitioning, which both entry points reach through compile_module(). It is intentionally not exposed on convert_exported_program_to_serialized_trt_engine(), where a single serialized engine cannot contain PyTorch fallbacks.
- Combining fallback_data_dependent_ops with require_full_compilation raises a clear error, since routing ops to PyTorch contradicts full compilation.
- CompilationSettings.__setstate__ defaults the new field so older pickles load.

Tests (tests/py/dynamo/models/test_fallback_data_dependent_ops.py): default value; old-pickle compatibility; the per-node support decision for nonzero (CPU); the require_full_compilation conflict; and an end to end GPU test that a data-dependent op falls back to PyTorch.

Signed-off-by: shoumikhin <shoumikhin@meta.com>

shoumikhin · 2026-06-24T15:12:34Z

Thanks, good questions.

why can't downstream operators consume output allocated outputs?

The problem is on the production side, not consumption. For data-dependent ops like nonzero / boolean index.Tensor the output shape isn't known until the engine runs: TensorRT calls back into reallocateOutputAsync during enqueueV3, and the true shape is only readable afterward via notifyShape (the runtime sequence is set_output_allocator -> execute_async_v3 -> read shape -> reshape). The blocker is downstream runtimes that use ahead-of-time static memory planning: every output tensor gets a fixed buffer sized from a concrete shape at plan time, so there is no way for the embedded engine to install an output allocator, learn the post-enqueue shape, and hand back a runtime-sized buffer. torch-tensorrt already hard-rejects output-allocator engines on the ExecuTorch export path for this reason; this setting just turns that hard failure into a graceful per-op fallback.
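
As a rough illustration of that callback order, here is a toy model in plain Python. It does not use the TensorRT API: `RecordingAllocator`, `MockNonzeroEngine`, and `run` are hypothetical, and the method names only echo the real callbacks mentioned above.

```python
class RecordingAllocator:
    """Hypothetical stand-in for an output allocator."""

    def __init__(self):
        self.buffer = None
        self.shape = None

    def reallocate_output(self, size):
        # Called by the engine mid-execution, once the needed size is known.
        self.buffer = [0] * size
        return self.buffer

    def notify_shape(self, shape):
        # Called by the engine after it has produced the output.
        self.shape = shape


class MockNonzeroEngine:
    """Toy engine whose output length depends on the input values."""

    def __init__(self):
        self.allocator = None

    def set_output_allocator(self, allocator):
        self.allocator = allocator

    def execute(self, values):
        hits = [i for i, v in enumerate(values) if v != 0]
        out = self.allocator.reallocate_output(len(hits))
        out[:] = hits  # write results into the buffer the allocator returned
        self.allocator.notify_shape((len(hits),))


def run(values):
    """Install the allocator, execute, then read the shape afterward."""
    allocator = RecordingAllocator()
    engine = MockNonzeroEngine()
    engine.set_output_allocator(allocator)
    engine.execute(values)
    return allocator.shape, allocator.buffer
```

The caller only learns the shape after `execute` returns, and the buffer is created inside the call, which is the step a runtime with buffers fixed at plan time cannot accommodate.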

name wise, I'd rather a specific name like fallback_data_dependent_ops

Done, renamed to fallback_data_dependent_ops. I also made the routing per-node rather than per-target: it keys on the converter actually selected for each node (its requires_output_allocator), honoring capability_validator, so boolean index.Tensor falls back while ordinary integer gather index.Tensor stays on TensorRT. PR description updated to match.

meta-cla Bot added the cla signed label Jun 21, 2026

github-actions Bot added the component: tests, component: core, component: api [Python], and component: dynamo labels Jun 21, 2026

github-actions Bot requested a review from cehongwang June 21, 2026 00:28

shoumikhin force-pushed the target-executorch-setting branch 2 times, most recently from 424b09f to 85fa1eb Compare June 22, 2026 02:58

shoumikhin force-pushed the target-executorch-setting branch 3 times, most recently from 26f7d08 to de6d90a Compare June 24, 2026 09:13

shoumikhin force-pushed the target-executorch-setting branch from de6d90a to 8da5cd6 Compare June 24, 2026 15:11

shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:target-executorch-setting
