feat: fused mul_hilo for 64-bit batches (shared 32x32->64 partials) #1367

Merged: serge-sans-paille merged 1 commit into xtensor-stack:master from DiamonDinoia:fused-mul-hilo on Jun 11, 2026
@DiamonDinoia

Contributor

mul_hilo<uint64_t> previously fell through to the generic common path
{ mul_hi(x, y), x * y }, deriving the high half (mulhi_u64_core: 4 vpmuludq)
and the low half (operator*: 3 vpmuludq) from separately-computed 32-bit
partials -- 7 vpmuludq per pair, none CSE-able because the two halves split
the operands differently (&mask/>>32 vs vpshufd).

Add detail::mulhilo_u64_core, which derives BOTH halves from one set of four
32x32->64 partials (ll, lh, hl, hh): 4 vpmuludq per pair. By construction
hi == mulhi_u64_core and lo == operator*, so the returned pair is
bit-identical to the unfused result.
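
To make the shared-partials idea concrete, here is a minimal scalar sketch of one 64-bit lane. It is not the code from this PR: the name `mulhilo_u64_scalar` and the exact carry handling are assumptions chosen for illustration. In the SIMD kernels each of the four products would correspond to one widening 32x32->64 multiply.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical scalar model of a single 64-bit lane (illustration only,
// not xsimd's implementation). Returns { hi, lo } of the 128-bit product.
inline std::pair<std::uint64_t, std::uint64_t>
mulhilo_u64_scalar(std::uint64_t x, std::uint64_t y)
{
    const std::uint64_t mask = 0xFFFFFFFFull;
    const std::uint64_t xl = x & mask;
    const std::uint64_t xh = x >> 32;
    const std::uint64_t yl = y & mask;
    const std::uint64_t yh = y >> 32;

    // The four shared 32x32->64 partials.
    const std::uint64_t ll = xl * yl;
    const std::uint64_t lh = xl * yh;
    const std::uint64_t hl = xh * yl;
    const std::uint64_t hh = xh * yh;

    // Sum of everything that lands in bits 32..63 of the product.
    // Three terms below 2^32 each, so this cannot overflow 64 bits.
    const std::uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);

    // Low half: low 32 bits of ll plus the low 32 bits of mid.
    const std::uint64_t lo = (mid << 32) | (ll & mask);
    // High half: hh, the upper halves of the cross terms, and the carry
    // out of the middle column.
    const std::uint64_t hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return { hi, lo };
}
```

Both halves read the same `ll`, `lh`, `hl` and `hh`, which is why the fused form needs four multiplies per pair instead of seven.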

Native kernels for SSE4.1, AVX2 and AVX-512F each pass their _mm*_mul_epu32
widening functor, mirroring the existing mul_hi<uint64_t> structure. SSE2
keeps the common fallback (it has no fused mul_hi<uint64_t> either). Signed
int64 reuses the unsigned core through a single common overload (bitwise_cast
+ sign fixup on hi; lo is sign-invariant), so no per-arch signed overloads
are added.
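
The signed reuse can be sketched in scalar form as follows. The PR text only says "sign fixup on hi", so the specific correction below (subtract the other operand from hi for each negative input) is the standard identity and an assumption about what the overload does. The names `mulhilo_u64_ref` and `mulhilo_i64_scalar` are hypothetical, and the unsigned core is stood in for by the GCC/Clang `unsigned __int128` extension to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// GCC/Clang extension; used here only as a stand-in for the unsigned core.
__extension__ typedef unsigned __int128 u128_t;

// Hypothetical stand-in for the unsigned { hi, lo } core.
inline std::pair<std::uint64_t, std::uint64_t>
mulhilo_u64_ref(std::uint64_t x, std::uint64_t y)
{
    const u128_t p = static_cast<u128_t>(x) * y;
    return { static_cast<std::uint64_t>(p >> 64), static_cast<std::uint64_t>(p) };
}

// Signed { hi, lo } built on the unsigned core (illustration only).
inline std::pair<std::int64_t, std::int64_t>
mulhilo_i64_scalar(std::int64_t x, std::int64_t y)
{
    // Reinterpret the operands as unsigned (the scalar analogue of bitwise_cast).
    const std::uint64_t ux = static_cast<std::uint64_t>(x);
    const std::uint64_t uy = static_cast<std::uint64_t>(y);
    const std::pair<std::uint64_t, std::uint64_t> u = mulhilo_u64_ref(ux, uy);

    // A negative x reads as x + 2^64 when unsigned, which adds uy to the
    // high half; undo that, and likewise for a negative y. Arithmetic is
    // modulo 2^64, so unsigned wraparound is intended.
    std::uint64_t hi = u.first;
    if (x < 0) hi -= uy;
    if (y < 0) hi -= ux;

    // The low 64 bits are the same for signed and unsigned multiplication.
    return { static_cast<std::int64_t>(hi), static_cast<std::int64_t>(u.second) };
}
```

Only the high half needs the correction, which is consistent with the description's note that lo is sign-invariant and that one common overload is enough.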

Verified bit-identical to __int128 for uint64/int64 across SSE4.1 and AVX2
on g++ and clang (including edge cases); the AVX2 assembly shows 4 vpmuludq
for the fused mul_hilo vs 7 for the unfused { mul_hi(x, y), x * y }.
@serge-sans-paille serge-sans-paille merged commit 5d2490f into xtensor-stack:master Jun 11, 2026
88 of 92 checks passed
@serge-sans-paille

Contributor

Thanks!
