Skip to content

gh-111089: Add PyUnicode_AsUTF8NoNUL() function#111688

Closed
vstinner wants to merge 3 commits into
python:mainfrom
vstinner:unicode_asutf8safe
Closed

gh-111089: Add PyUnicode_AsUTF8NoNUL() function#111688
vstinner wants to merge 3 commits into
python:mainfrom
vstinner:unicode_asutf8safe

Conversation

@vstinner

@vstinner vstinner commented Nov 3, 2023

Copy link
Copy Markdown
Member

Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null characters: the PyUnicode_AsUTF8NoNUL() function should be used instead.


📚 Documentation preview 📚: https://cpython-previews--111688.org.readthedocs.build/

@vstinner

vstinner commented Nov 3, 2023

Copy link
Copy Markdown
Member Author

I'm not sure about PyUnicode_AsUTF8Safe() name: "safe" might give a feeling of security which can be wrong. As written in the function documentation, it doesn't check for "dangerous Unicode characters". It implements a single check: only tests if the string contains null characters.

For example, if the string is used to open a file, the function doesn't reject ../ substring which is commonly used for directory traversal vulnerabilities. Further sanitization and validation is required depending on the domain where the string is used.

Maybe the name should be more explicitly about its purpose, such as: PyUnicode_AsUTF8NoNul()? Can such name be misunderstood as "the result cannot be NULL", such as "the function cannot fail"? Or maybe PyUnicode_AsUTF8NoNulChar()?

@AlexWaygood

Copy link
Copy Markdown
Member

the clinic.py changes look fine to me; I have no opinion on the C changes :)

@erlend-aasland

Copy link
Copy Markdown
Contributor

Clinic changes are ok with me. I'll leave the C API discussion to the C API WG :)

@vstinner

vstinner commented Nov 3, 2023

Copy link
Copy Markdown
Member Author

If this change is merged, it would be interesting to go through PyUnicode_AsUTF8() usage in the Python code base and check if switching PyUnicode_AsUTF8Safe() would be worth it.

See #111656 (comment) list for example.

@vstinner

vstinner commented Nov 3, 2023

Copy link
Copy Markdown
Member Author

I'll leave the C API discussion to the C API WG :)

The working group doesn't exist yet, it's still a draft PEP: https://discuss.python.org/t/pep-731-c-api-working-group-charter/36117

@vstinner

vstinner commented Nov 3, 2023

Copy link
Copy Markdown
Member Author

@encukou @gpshead @Yhg1s @serhiy-storchaka: Would you mind to review this change?

@serhiy-storchaka

Copy link
Copy Markdown
Member

Are there strong objections against simply making PyUnicode_AsUTF8AdnSize(unicode, NULL) to check for embedded NULs?

Comment thread Tools/clinic/clinic.py Outdated
goto exit;
}}}}
{paramname} = PyUnicode_AsUTF8({argname});
{paramname} = PyUnicode_AsUTF8Safe({argname});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be different code for limited_capi is true.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I wanted to add PyUnicode_AsUTF8Safe() to the limited C API in a separated PR. Maybe it's more convenient to do it in the PR, since there is a lot of code generated by Argument Clinic impacted by these changes.

@vstinner

vstinner commented Nov 3, 2023

Copy link
Copy Markdown
Member Author

Are there strong objections against simply making PyUnicode_AsUTF8AdnSize(unicode, NULL) to check for embedded NULs?

I suppose that it's too late to change it, same rationale than changing PyUnicode_AsUTF8().

Moreover, PyUnicode_AsUTF8AndSize(str, NULL) is the most convenient API of the stable ABI to convert a Python str to char* (as UTF-8) without storing the size.

@vstinner

vstinner commented Nov 4, 2023

Copy link
Copy Markdown
Member Author

I rebased the PR on the main branch to get the test_asyncio fix. I also squashed commits.

@encukou

encukou commented Nov 6, 2023

Copy link
Copy Markdown
Member

Wait until the WG can discuss this.

I agree that Safe is too generic.

I wouldn't mind naming it UTF8Z, with Z used for “zero-terminated string without NUL characters” for all new API from now on.

@vstinner

vstinner commented Nov 6, 2023

Copy link
Copy Markdown
Member Author

If "Safe" is too generic, what about "AsUTF8NoNul"? I propose "Nul" instead of "Null" which looks like "NULL pointer". Or maybe "AsUTF8NoNullChars"?

"Z" looks too short to me, it's not easy to guess its intent.

@vstinner

vstinner commented Nov 7, 2023

Copy link
Copy Markdown
Member Author

If "Safe" is too generic, what about "AsUTF8NoNul"? I propose "Nul" instead of "Null" which looks like "NULL pointer". Or maybe "AsUTF8NoNullChars"?

Alternative short name: PyUnicode_AsUTF8NoNUL() since it's common to refer to "null characters" as NUL. It's commonly used in encoding tables such as the ASCII table. Example on Wikipedia: ASCII: Character Set.

Revert PyUnicode_AsUTF8() change: it no longer rejects embedded null
characters: the PyUnicode_AsUTF8Safe() function should be used
instead.
@vstinner vstinner requested review from a team and encukou as code owners November 7, 2023 11:01
@vstinner

vstinner commented Nov 7, 2023

Copy link
Copy Markdown
Member Author

@serhiy-storchaka @erlend-aasland: I renamed the function to PyUnicode_AsUTF8NoNUL(). Are you fine with this name?

@vstinner vstinner changed the title gh-111089: Add PyUnicode_AsUTF8Safe() function gh-111089: Add PyUnicode_AsUTF8NoNUL() function Nov 7, 2023
@erlend-aasland

Copy link
Copy Markdown
Contributor

@serhiy-storchaka @erlend-aasland: I renamed the function to PyUnicode_AsUTF8NoNUL(). Are you fine with this name?

I didn't follow the discussion. My first impression is that I don't understand what that API is supposed to do, based on the name only.

@serhiy-storchaka

Copy link
Copy Markdown
Member

I am fine with added the embedded null character check in PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize(). If you ask for a char pointer, but have no way to get the size, you ask for a null-terminated C string. If the Python string contains embedded nulls, the result is ambiguous, therefore it must be exception.

If people not fine with this, ask them with what are they fine.

@vstinner

vstinner commented Nov 7, 2023

Copy link
Copy Markdown
Member Author

If people not fine with this, ask them with what are they fine.

I asked @gpshead and @Yhg1s for a review ;-)

@Yhg1s

Yhg1s commented Nov 7, 2023

Copy link
Copy Markdown
Member

I don't know why we need a new public function. For a new function, if you feel like it needs a strlen check, that's fine. I don't think it's a sensible way to deal with C strings (everyone who works with C strings should know about the importance of NUL), but for new APIs I don't really care either way.

However:

  • Changing existing functions is a really, really bad idea.
  • Does this really need to be a public API function? It seems stunningly trivial.
  • There's nothing 'safer' about propagating NULs rather than truncating, so avoid that framing. It would be safer to remind people to learn about C strings when dealing with strings in C.
  • I think these kinds of decisions should go to the C API WG that's being considered. It doesn't hurt to wait a few weeks to see if that is going to happen.

@vstinner

vstinner commented Nov 7, 2023

Copy link
Copy Markdown
Member Author

I close my issue: #111089 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants