Don't choke on (legitimately) invalidly encoded Unicode paths #467
|
@nvie Thanks for bringing this up! I believe it's one of the worst decisions made in GitPython to attempt to decode everything using UTF-8 (or sometimes using whatever encoding Git hints us at). Especially for paths, doing this is plain wrong, as the filesystem may use arbitrary encodings. This is the reason Rust uses OsStrings to represent these, as opposed to Strings, which need to be valid UTF-8. That said, fixing the API in that regard might be a somewhat major undertaking, though certainly one worth making. Without a major version upgrade, the proposed adjustment might be good enough. The question is what you, as an API user, think about these options. I'd be happy to hear your thoughts. |
|
I don't think it's too bad in practice. In most parts where paths hit the UI, you'd want to use a However, sometimes there just is a need to use the raw bytes, especially when the result of command A (say, a diff), defines the input of command B (say, a blame). Only in these cases, you need to have the exact bytes. I would still show the I'll have a stab at adding those extra raw paths as properties. |
Previously, the following fields on Diff instances were assumed to be passed in as unicode strings:

- `a_path`
- `b_path`
- `rename_from`
- `rename_to`

However, since Git natively records paths as bytes, these may potentially not have a valid unicode representation.

This patch changes the Diff instance to instead take the following equivalent fields that should be raw bytes instead:

- `a_rawpath`
- `b_rawpath`
- `raw_rename_from`
- `raw_rename_to`

NOTE ON BACKWARD COMPATIBILITY: The original `a_path`, `b_path`, etc. fields are still available as properties (rather than slots). These properties now dynamically decode the raw bytes into a unicode string (performing the potentially destructive operation of replacing invalid unicode chars by "�"'s). This means that all code using Diffs should remain backward compatible. The only exception is when people would manually construct Diff instances by calling the constructor directly, in which case they should now pass in bytes rather than unicode strings.

See also the discussion on #467
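To make the backward-compatibility note concrete, here is a minimal sketch of the pattern the commit message describes: raw bytes stored in slots, with the friendly path exposed as a property that decodes on access. The class name `DiffSketch` is hypothetical and covers only the two path fields; it is an illustration under those assumptions, not GitPython's actual `Diff` class.

```python
class DiffSketch:
    """Hypothetical stand-in for a Diff-like object (not GitPython's real class)."""

    # Only the raw byte paths are stored; the decoded paths are computed on access.
    __slots__ = ("a_rawpath", "b_rawpath")

    def __init__(self, a_rawpath, b_rawpath):
        for value in (a_rawpath, b_rawpath):
            # None models a side of the diff that has no path (assumption).
            if value is not None and not isinstance(value, bytes):
                raise TypeError("raw paths must be bytes or None")
        self.a_rawpath = a_rawpath
        self.b_rawpath = b_rawpath

    @staticmethod
    def _decode(raw):
        # Lossy by design: undecodable bytes become U+FFFD.
        return None if raw is None else raw.decode("utf-8", "replace")

    @property
    def a_path(self):
        return self._decode(self.a_rawpath)

    @property
    def b_path(self):
        return self._decode(self.b_rawpath)


diff = DiffSketch(b"illegal-\x80.txt", b"ok.txt")
print(repr(diff.a_path))     # decoded, possibly lossy
print(repr(diff.a_rawpath))  # exact bytes, safe to hand back to Git
```

Callers that only read `a_path` or `b_path` keep working unchanged, while callers that construct the object directly have to pass bytes, which matches the one exception the note calls out.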
|
I guess this is related: #505. I have tried to run tests using a TMPDIR with unicode in it... Actually, maybe it is only marginally related, since the paths in that case are unicode outside of the repository, not within it (as is discussed here), correct? |
We've come across path names that contain bytes that are invalid in UTF-8 encoded strings, even though they're very rare. My assumption here is these commits have been created by an old (buggy?) version of Git, and now live in the tree objects with this data. Since we return only unicode strings for the `a_path` and `b_path` properties, we're not able to decode this string and thus choke when asking for the diff.

This PR fixes that by using "replace" semantics when decoding. This will effectively replace the illegal bytes `\200` (or `\x80`) by `\ufffd` (= �).

Follow-up discussion
However, this also means that if you would want to git-blame this file, there's no good way of referencing this path, since it's inherently a bytes path. Normally, when we pass unicode paths to git-blame via GitPython's blame API, the paths get converted to UTF-8 right before issuing the external command. But there's no way of getting the original bytes back after the "replace" operation happened.
Example:
- Path as recorded by Git: `b'illegal-\x80.txt'` (containing illegal byte `\x80`)
- Decoded with "replace": `u'illegal-\ufffd.txt'` (= "illegal-�.txt")
- Re-encoded as UTF-8: `b'illegal-\xef\xbf\xbd.txt'`

When we next pass `illegal-\xef\xbf\xbd.txt` to git-blame, it will not be able to find this path. Perhaps it would be a good idea to not only return the decoded path strings, but also provide access to the raw bytes found, i.e. by exposing `a_rawpath` and `b_rawpath`, which would always be bytes? That way, you could still have the friendly "unicode paths" for most use cases, but use bytes if you need to speak the language of Git more accurately.
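The lossy round trip from the example can be reproduced with plain Python, without GitPython. This is a small sketch; the helper name `lossy_round_trip` is made up for illustration.

```python
def lossy_round_trip(raw: bytes):
    """Decode with "replace" semantics, then encode back to UTF-8."""
    decoded = raw.decode("utf-8", errors="replace")  # invalid bytes -> U+FFFD
    reencoded = decoded.encode("utf-8")              # U+FFFD -> b"\xef\xbf\xbd"
    return decoded, reencoded


raw = b"illegal-\x80.txt"
decoded, reencoded = lossy_round_trip(raw)

print(repr(decoded))     # 'illegal-\ufffd.txt'
print(repr(reencoded))   # b'illegal-\xef\xbf\xbd.txt'
print(reencoded == raw)  # False: the original byte is gone
```

Because the re-encoded bytes differ from what Git recorded, only the untouched raw bytes can reliably name the file in a follow-up command.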