[NEP42 NEP43 NEP55] DType System and String Support #549

@Nucs

Overview

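This issue summarizes three NumPy Enhancement Proposals (NEPs) that reshape the dtype system: NEP 42 (class-based, extensible dtypes), NEP 43 (extensible ufuncs), and NEP 55 (a UTF-8 variable-width string dtype). It then sketches how NumSharp could follow them.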


NEP 42: New and Extensible DTypes

Status: Accepted | Full Text

Architectural Shift

NEP 42 replaces the single monolithic np.dtype class with a class hierarchy in which each dtype instance belongs to its own DType class:

numpy.dtype (base)
├── Abstract dtypes (Integer, Floating, Complex)
│   └── Cannot be instantiated
└── Concrete dtypes (Float64, Int32, String)
    └── Cannot be subclassed
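
The two rules in the tree (abstract DTypes cannot be instantiated, concrete DTypes cannot be subclassed) can be illustrated with a small pure-Python sketch. The metaclass and class names below are hypothetical stand-ins, not NumPy's implementation; the `abstract=` class keyword is an assumption made for the example.

```python
class DTypeMeta(type):
    """Hypothetical metaclass that marks each DType class abstract or concrete."""

    def __new__(mcls, name, bases, ns, abstract=False):
        # Rule 2: a concrete DType is a leaf and may not be used as a base class.
        for base in bases:
            if isinstance(base, DTypeMeta) and not base._abstract:
                raise TypeError(f"concrete DType {base.__name__} cannot be subclassed")
        cls = super().__new__(mcls, name, bases, ns)
        cls._abstract = abstract
        return cls

    def __init__(cls, name, bases, ns, abstract=False):
        # Swallow the extra keyword so type.__init__ does not reject it.
        super().__init__(name, bases, ns)

    def __call__(cls, *args, **kwargs):
        # Rule 1: an abstract DType only groups other DTypes and has no instances.
        if cls._abstract:
            raise TypeError(f"abstract DType {cls.__name__} cannot be instantiated")
        return super().__call__(*args, **kwargs)


class DType(metaclass=DTypeMeta, abstract=True):
    pass


class Floating(DType, abstract=True):
    pass


class Float64(Floating):  # concrete: instantiable, but not subclassable
    pass
```

With these definitions, `Float64()` succeeds and is an instance of `Floating`, while `Floating()` and `class Bad(Float64)` both raise `TypeError`.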

Class Getter Syntax

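np.dtype[np.int64]        # NEP 42 proposal: get the DType class for a scalar type
np.dtype[UserScalar]      # Would also work with user-defined scalar types
type(np.dtype(np.int64))  # Released NumPy: the DType class (np.dtypes.Int64DType)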

Key Methods

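class DType(np.dtype):
    # Promotion between DType classes, e.g. (Int32, Float64) -> Float64
    @classmethod
    def __common_dtype__(cls, other) -> DTypeMeta: ...
    # Promotion between two instances of the same DType class, e.g. S3 and S5 -> S5
    def __common_instance__(self, other) -> DType: ...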
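
As a rough illustration of how class-level promotion could work, the sketch below lets each class answer for the pairs it knows and defer with `NotImplemented` otherwise, so that a helper can try both operands. The two toy classes and the `common_dtype` helper are hypothetical; only the `__common_dtype__` slot name comes from NEP 42.

```python
class Int32:
    @classmethod
    def __common_dtype__(cls, other):
        # Int32 only knows how to promote with itself.
        if other is Int32:
            return Int32
        return NotImplemented


class Float64:
    @classmethod
    def __common_dtype__(cls, other):
        # Float64 can absorb Int32 without losing values.
        if other in (Float64, Int32):
            return Float64
        return NotImplemented


def common_dtype(a, b):
    """Hypothetical helper: ask `a` first, then fall back to `b`."""
    result = a.__common_dtype__(b)
    if result is NotImplemented:
        result = b.__common_dtype__(a)
    if result is NotImplemented:
        raise TypeError(f"no common DType for {a.__name__} and {b.__name__}")
    return result
```

Here `common_dtype(Int32, Float64)` and `common_dtype(Float64, Int32)` both return `Float64`, even though `Int32` itself knows nothing about floats.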

NEP 43: Extensible UFuncs

Status: Draft | Full Text

ArrayMethod Object

Encapsulates dtype-specific ufunc functionality:

class ArrayMethod:
    def resolve_descriptors(self, DTypes, given_descrs):
        """Pick the concrete descriptors for a call. For parametric types: S5 + S4 -> S9"""
        return (resolved_descrs, casting_safety)

    @staticmethod
    def strided_inner_loop(context, data, dims, strides):
        """The actual computation kernel, run over one 1-D strided chunk"""
        pass

Parametric Type Resolution

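# String concatenation: output length = sum of input lengths
# (needs NumPy 2.x, where np.add has string loops, also exposed as np.strings.add)
np.add(np.array(["abc"], "S3"), np.array(["xy"], "S2"))
# Result dtype must be S5 (3 + 2). It depends on the input dtype instances,
# so it is resolved per call rather than fixed per loop.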
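
The split between descriptor resolution and the inner loop can be sketched in pure Python. Everything here is a simplified stand-in: `Bytes` is a hypothetical descriptor for an `S<n>` dtype, buffers are plain `bytes`/`bytearray`, and the `"no"` casting level is a placeholder rather than what NumPy would report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bytes:
    """Hypothetical descriptor for a fixed-width bytes dtype such as S3."""
    length: int


def resolve_descriptors(given_descrs):
    """Fill in the output descriptor: concatenation needs len(in1) + len(in2)."""
    in1, in2, out = given_descrs
    if out is None:
        out = Bytes(in1.length + in2.length)
    return (in1, in2, out), "no"  # casting level is a placeholder


def strided_inner_loop(descrs, data, n, strides):
    """Concatenate n element pairs, stepping through each buffer by its stride."""
    in1, in2, out = descrs
    a, b, o = data
    sa, sb, so = strides
    for i in range(n):
        # Fixed-width elements are NUL-padded; strip the padding before joining.
        x = bytes(a[i * sa : i * sa + in1.length]).rstrip(b"\0")
        y = bytes(b[i * sb : i * sb + in2.length]).rstrip(b"\0")
        joined = (x + y)[: out.length].ljust(out.length, b"\0")
        o[i * so : i * so + out.length] = joined


# Two S3 elements ("abc", "de") plus two S2 elements ("xy", "z").
descrs, casting = resolve_descriptors((Bytes(3), Bytes(2), None))
result = bytearray(2 * descrs[2].length)
strided_inner_loop(descrs, (b"abcde\0", b"xyz\0", result), 2, (3, 2, 5))
```

After the call, `descrs[2]` is `Bytes(length=5)` and `result` holds the two padded elements `abcxy` and `dez\0\0`.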

NEP 55: UTF-8 Variable-Width String DType

Status: Final | Full Text

Comparison

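| Aspect   | Old (U/S)                                                  | New (StringDType)                       |
|----------|------------------------------------------------------------|-----------------------------------------|
| Width    | Fixed per array                                            | Variable per element                    |
| Encoding | UCS-4/UTF-32 (U), raw bytes (S)                            | UTF-8                                   |
| Memory   | Every element padded to the array width; 4 bytes/char (U)  | Each element sized to its own content   |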
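
A quick back-of-the-envelope comparison makes the memory row concrete. The two helper functions are hypothetical and count string payload only; they ignore the per-element bookkeeping that a variable-width dtype also needs.

```python
def fixed_width_ucs4_bytes(strings):
    """Bytes used by a 'U' array: every element is padded to the longest string."""
    width = max((len(s) for s in strings), default=0)
    return len(strings) * width * 4  # 4 bytes per code point


def utf8_payload_bytes(strings):
    """Bytes of UTF-8 payload when each element stores only its own content."""
    return sum(len(s.encode("utf-8")) for s in strings)


words = ["short", "very long"]
fixed = fixed_width_ucs4_bytes(words)   # 2 elements * 9 chars * 4 bytes
variable = utf8_payload_bytes(words)    # 5 + 9 bytes
```

For this input the fixed-width layout needs 72 bytes and the UTF-8 payload is 14 bytes; the gap grows with the spread between the shortest and longest string.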

Usage

import numpy as np
from numpy.dtypes import StringDType  # NumPy 2.0+

arr = np.array(["short", "very long"], dtype=StringDType())
arr = np.array(["hello"], dtype="T")  # Character code

Missing Data

dt = StringDType(na_object=np.nan)
arr = np.array(["hello", np.nan, "world"], dtype=dt)
np.isnan(arr)  # [False, True, False]

np.strings Namespace

np.strings.upper(arr)
np.strings.lower(arr)
np.strings.str_len(arr)
np.strings.replace(arr, old, new)

Suggested Implementation for NumSharp

DType Architecture (Future)

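// Potential class-based approach
abstract class DType {
    public abstract NPTypeCode TypeCode { get; }
    public abstract Type ScalarType { get; }
}

// Concrete dtypes are sealed, mirroring NEP 42's "cannot be subclassed" rule
sealed class Float64DType : DType {
    public override NPTypeCode TypeCode => NPTypeCode.Double;
    public override Type ScalarType => typeof(double);
}

sealed class Int32DType : DType {
    public override NPTypeCode TypeCode => NPTypeCode.Int32;
    public override Type ScalarType => typeof(int);
}

// Parametric types: instances carry parameters (here a maximum width;
// NEP 55's StringDType is variable-width and is parameterized by na_object instead)
sealed class StringDType : DType {
    public override NPTypeCode TypeCode => NPTypeCode.String; // assumes option 2 under "String Support Options"
    public override Type ScalarType => typeof(string);
    public int MaxLength { get; }
}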

Type Resolution Pattern

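public interface ITypeOperation<T> {
    // Decide the result type from the input types before any data is touched
    NPTypeCode ResolveOutputType(NPTypeCode[] inputs);
    // Inputs are read-only; only the output span is written
    void Execute(ReadOnlySpan<T> a, ReadOnlySpan<T> b, Span<T> output);
}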

String Support Options

  1. Object arrays with string references (current workaround)
  2. New NPTypeCode.String with variable-length storage
  3. np.strings namespace for string operations
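
To make option 2 more tangible, here is one possible variable-length layout: a single UTF-8 byte buffer plus an offsets list. This is an illustrative sketch only; the class name `Utf8Column` is hypothetical, and the layout is a common offsets-based scheme, not the storage format NEP 55 specifies.

```python
class Utf8Column:
    """Hypothetical variable-length string storage: one UTF-8 buffer plus offsets."""

    def __init__(self, strings):
        self.data = bytearray()
        self.offsets = [0]  # offsets[i]:offsets[i + 1] delimits element i
        for s in strings:
            self.data += s.encode("utf-8")
            self.offsets.append(len(self.data))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start, end = self.offsets[i], self.offsets[i + 1]
        return self.data[start:end].decode("utf-8")


col = Utf8Column(["short", "very long", "héllo"])
```

Each element costs only its encoded bytes plus one offset entry, and `col[1]` returns `"very long"` by slicing between two offsets. The trade-off is that in-place updates that change an element's length require shifting or reallocating the buffer.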

Documentation

See docs/neps/NEP42.md, docs/neps/NEP43.md, docs/neps/NEP55.md

Labels

NumPy 2.x Compliance (Aligns behavior with NumPy 2.x: NEPs, breaking changes)
architecture (Cross-cutting structural changes affecting multiple components)
core (Internal engine: Shape, Storage, TensorEngine, iterators)
documentation-needed (Feature requires documentation after implementation or depiction of lack of documentation)
enhancement (New feature or request)
