feat(dynamo): add target_executorch setting to keep output-allocator ops in PyTorch #4355

Open: shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:target-executorch-setting


@shoumikhin shoumikhin commented Jun 21, 2026

Contributor

Description

Some converters require a TensorRT output allocator because their output shape is
data-dependent (for example aten.nonzero and boolean aten.index.Tensor). Not every
downstream runtime that executes the compiled program can consume a TensorRT engine
that needs an output allocator: runtimes that rely on ahead-of-time static memory
planning, for instance, cannot size an output whose shape is known only after the
engine runs.

This adds a fallback_data_dependent_ops compile setting (default False). When
enabled, an operator runs in PyTorch instead of TensorRT iff the converter selected
for that specific node requires an output allocator. When disabled (the default),
behavior is unchanged.

Details

  • The decision is made per node in the partitioner operator-support check
    (TorchTensorRTOperatorSupport and OpSupportTester), which asks the converter
    registry which converter it would select for the node (honoring
    capability_validator). This matters for targets like aten.index.Tensor that
    register two converters: boolean indexing requires an output allocator and falls
    back, while ordinary integer gather indexing stays on TensorRT.
  • Wired through compile() and cross_compile_for_windows(); the check runs during
    partitioning, which both entry points reach through compile_module(). It is
    intentionally not exposed on convert_exported_program_to_serialized_trt_engine(),
    where a single serialized engine cannot contain PyTorch fallbacks.
  • Combining fallback_data_dependent_ops with require_full_compilation raises a
    clear error, since routing ops to PyTorch contradicts full compilation.
  • CompilationSettings.__setstate__ defaults the new field so older pickles load.
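
The per-node decision described above can be sketched with a toy registry. This is a simplified model, not the torch-tensorrt implementation: `Node`, `Converter`, `REGISTRY`, `select_converter`, and `is_node_supported` are hypothetical names invented for illustration, and only the fields `capability_validator` and `requires_output_allocator` are taken from the description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """Toy stand-in for a graph node: a target name plus the index dtype, if any."""
    target: str
    index_dtype: Optional[str] = None


@dataclass
class Converter:
    """Toy stand-in for a registered converter (hypothetical structure)."""
    name: str
    capability_validator: Callable[[Node], bool]
    requires_output_allocator: bool = False


# Hypothetical registry: aten.index.Tensor registers two converters,
# and only the boolean-indexing one needs an output allocator.
REGISTRY: Dict[str, List[Converter]] = {
    "aten.index.Tensor": [
        Converter("index_bool", lambda n: n.index_dtype == "bool",
                  requires_output_allocator=True),
        Converter("index_gather", lambda n: n.index_dtype != "bool"),
    ],
    "aten.nonzero": [
        Converter("nonzero", lambda n: True, requires_output_allocator=True),
    ],
    "aten.add.Tensor": [Converter("add", lambda n: True)],
}


def select_converter(node: Node) -> Optional[Converter]:
    """Return the first converter whose capability_validator accepts this node."""
    for converter in REGISTRY.get(node.target, []):
        if converter.capability_validator(node):
            return converter
    return None


def is_node_supported(node: Node, fallback_data_dependent_ops: bool) -> bool:
    """Return True if the node may be placed in a TensorRT partition."""
    converter = select_converter(node)
    if converter is None:
        return False  # no converter accepts the node: it runs in PyTorch
    if fallback_data_dependent_ops and converter.requires_output_allocator:
        return False  # data-dependent output shape: fall back to PyTorch
    return True
```

Because the check keys on the selected converter rather than on the target, `Node("aten.index.Tensor", "bool")` falls back when the flag is set while `Node("aten.index.Tensor", "int64")` does not. With the flag off, both are reported as supported, which matches the "behavior is unchanged" default.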

Tests

tests/py/dynamo/models/test_fallback_data_dependent_ops.py:

  • the setting defaults to False and is settable;
  • a state missing the field (older pickle) restores to False;
  • the per-node support decision for nonzero (CPU, no GPU needed);
  • combining with require_full_compilation raises;
  • end-to-end on GPU, a data-dependent op (nonzero) falls back to PyTorch.
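
The pickle-compatibility and conflict checks can be illustrated with a minimal sketch. It assumes a stand-in `ToySettings` dataclass rather than the real `CompilationSettings`; `validate` and `load_legacy_state` are hypothetical helpers, and the error type and message are assumptions, not the ones used in the PR.

```python
from dataclasses import dataclass


@dataclass
class ToySettings:
    """Stand-in for a settings object (not the real CompilationSettings)."""
    require_full_compilation: bool = False
    fallback_data_dependent_ops: bool = False

    def __setstate__(self, state: dict) -> None:
        # State saved before the field existed will not contain it,
        # so fill in the default before restoring the attributes.
        state = dict(state)
        state.setdefault("fallback_data_dependent_ops", False)
        self.__dict__.update(state)


def validate(settings: ToySettings) -> None:
    """Reject the contradictory combination (hypothetical error type and text)."""
    if settings.fallback_data_dependent_ops and settings.require_full_compilation:
        raise ValueError(
            "fallback_data_dependent_ops routes ops to PyTorch and cannot be "
            "combined with require_full_compilation"
        )


def load_legacy_state(state: dict) -> ToySettings:
    """Restore an object from raw state, the way unpickling calls __setstate__."""
    obj = ToySettings.__new__(ToySettings)
    obj.__setstate__(state)
    return obj
```

Calling `load_legacy_state({"require_full_compilation": False})` yields an object whose instance state includes `fallback_data_dependent_ops = False`, which is the behavior the old-pickle test checks for.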

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • My code follows the style guidelines of this project (isort + black)
  • I have added tests that prove my fix/feature works
  • Commit is signed off (DCO)

@meta-cla meta-cla Bot added the cla signed label Jun 21, 2026
@github-actions github-actions Bot added the component: tests, component: core, component: api [Python], and component: dynamo labels Jun 21, 2026
@github-actions github-actions Bot requested a review from cehongwang June 21, 2026 00:28
@shoumikhin shoumikhin force-pushed the target-executorch-setting branch 2 times, most recently from 424b09f to 85fa1eb Compare June 22, 2026 02:58
@narendasan

Collaborator

@shoumikhin why can't downstream operators consume output allocated outputs?

Also name wise, I'd rather a specific name like fallback_data_dependent_ops so it could be used for other use cases naturally

@shoumikhin shoumikhin force-pushed the target-executorch-setting branch 3 times, most recently from 26f7d08 to de6d90a Compare June 24, 2026 09:13
…pendent ops in PyTorch

Some converters require a TensorRT output allocator because their output shape is
data-dependent (for example aten.nonzero and boolean aten.index.Tensor). Not every
downstream runtime that executes the compiled program can consume a TensorRT engine
that needs an output allocator.

This adds a fallback_data_dependent_ops compile setting (default False). When enabled,
an operator runs in PyTorch instead of TensorRT iff the converter selected for that
specific node requires an output allocator. When disabled (the default), behavior is
unchanged.

The decision is made per node in the partitioner operator-support check
(TorchTensorRTOperatorSupport and OpSupportTester), which asks the converter registry
which converter it would select for the node (honoring capability_validator). This
matters for targets like aten.index.Tensor that register two converters: boolean
indexing requires an output allocator and falls back, while ordinary integer gather
indexing stays on TensorRT.

Details:
- Wired through compile() and cross_compile_for_windows(); the check runs during
  partitioning, which both entry points reach through compile_module(). It is
  intentionally not exposed on convert_exported_program_to_serialized_trt_engine(),
  where a single serialized engine cannot contain PyTorch fallbacks.
- Combining fallback_data_dependent_ops with require_full_compilation raises a clear
  error, since routing ops to PyTorch contradicts full compilation.
- CompilationSettings.__setstate__ defaults the new field so older pickles load.

Tests (tests/py/dynamo/models/test_fallback_data_dependent_ops.py): default value;
old-pickle compatibility; the per-node support decision for nonzero (CPU); the
require_full_compilation conflict; and an end-to-end GPU test that a data-dependent op
falls back to PyTorch.

Signed-off-by: shoumikhin <shoumikhin@meta.com>
@shoumikhin shoumikhin force-pushed the target-executorch-setting branch from de6d90a to 8da5cd6 Compare June 24, 2026 15:11
@shoumikhin

Contributor Author

Thanks, good questions.

> why can't downstream operators consume output allocated outputs?

It's the production side, not consumption. For data-dependent ops like nonzero / boolean index.Tensor the output shape isn't known until the engine runs: TensorRT calls back into reallocateOutputAsync during enqueueV3, and the true shape is only readable afterward via notifyShape (the runtime sequence is set_output_allocator -> execute_async_v3 -> read shape -> reshape). The blocker is downstream runtimes that use ahead-of-time static memory planning: every output tensor gets a fixed buffer sized from a concrete shape at plan time, so the embedded engine has no way to install an output allocator, learn the post-enqueue shape, and hand back a runtime-sized buffer. torch-tensorrt already hard-rejects output-allocator engines on the ExecuTorch export path for this reason; this setting just turns that hard failure into a graceful per-op fallback.
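
To make the contrast concrete, here is a toy model of the two flows in plain Python. It does not import TensorRT: `ToyOutputAllocator`, `toy_nonzero_engine`, `run_with_allocator`, and `run_with_static_plan` are hypothetical names that only loosely mirror the callbacks mentioned above, and the "engine" is a 1-D nonzero over a list rather than a real engine.

```python
from typing import List, Optional, Tuple


class ToyOutputAllocator:
    """Toy allocator: the engine requests memory mid-run, then reports the shape."""

    def __init__(self) -> None:
        self.buffer: List[int] = []
        self.shape: Optional[Tuple[int, ...]] = None

    def reallocate_output(self, num_elements: int) -> List[int]:
        # Called by the engine during execution, once the size is known.
        self.buffer = [0] * num_elements
        return self.buffer

    def notify_shape(self, shape: Tuple[int, ...]) -> None:
        # Called by the engine to report the final output shape.
        self.shape = shape


def toy_nonzero_engine(values: List[int], allocator: ToyOutputAllocator) -> None:
    """Stand-in for an engine running nonzero on a 1-D input."""
    indices = [i for i, v in enumerate(values) if v != 0]
    buf = allocator.reallocate_output(len(indices))
    buf[:] = indices
    allocator.notify_shape((len(indices),))


def run_with_allocator(values: List[int]) -> Tuple[List[int], Tuple[int, ...]]:
    """Install the allocator, execute, then read the shape after the run."""
    allocator = ToyOutputAllocator()
    toy_nonzero_engine(values, allocator)
    assert allocator.shape is not None
    return allocator.buffer[: allocator.shape[0]], allocator.shape


def run_with_static_plan(values: List[int], planned_size: int) -> List[int]:
    """A statically planned runtime must fix the output size before execution."""
    out = [0] * planned_size  # buffer sized at plan time
    indices = [i for i, v in enumerate(values) if v != 0]
    if len(indices) != planned_size:
        raise RuntimeError("output size is data-dependent and does not match the plan")
    out[:] = indices
    return out
```

In this sketch the allocator path works for any input because the size is chosen during the run, while the static plan only works when the planned size happens to equal the number of nonzero elements.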

> name wise, I'd rather a specific name like fallback_data_dependent_ops

Done, renamed to fallback_data_dependent_ops. I also made the routing per-node rather than per-target: it keys on the converter actually selected for each node (its requires_output_allocator), honoring capability_validator, so boolean index.Tensor falls back while ordinary integer gather index.Tensor stays on TensorRT. PR description updated to match.

