Unicode Objects and Codecs

Unicode Objects

Unicode Type

These are the basic Unicode object types used for the Unicode implementation in Python:

Py_UNICODE: This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds then use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4. On platforms where wchar_t is available and compatible with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).

Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep this in mind when writing extensions or interfaces.
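To make the storage difference concrete, here is a minimal standalone sketch that does not use the Python headers. It shows how one code point needs either one or two 16-bit units, while a 32-bit unit always holds it directly. The helper name `encode_utf16_units` is hypothetical, and the assumption that a 16-bit build represents code points above U+FFFF as surrogate pairs is not stated in the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the Python API): write the code
   point cp into out as 16-bit units.  Returns the number of units
   written (1 or 2), or 0 if cp is a surrogate or out of range. */
size_t encode_utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* Fits in a single 16-bit unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Split the remaining 20 bits into a high and a low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

Because the unit width and the number of units per character both differ, a buffer produced by one build variant cannot be handed to code compiled against the other.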

PyUnicodeObject: This subtype of PyObject represents a Python Unicode object.

PyTypeObject PyUnicode_Type: This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.

The following APIs are really C macros and can be used to do fast checks and to access internal read-only data of Unicode objects:

int PyUnicode_Check(PyObject *o): Return true if the object o is a Unicode object or an instance of a Unicode subtype.

int PyUnicode_CheckExact(PyObject *o): Return true if the object o is a Unicode object, but not an instance of a subtype.

Py_ssize_t PyUnicode_GET_SIZE(PyObject *o): Return the size of the object. o has to be a PyUnicodeObject (not checked).

Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o): Return the size of the object’s internal buffer in bytes. o has to be a PyUnicodeObject (not checked).

Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o): Return a pointer to the internal Py_UNICODE buffer of the object. o has to be a PyUnicodeObject (not checked).

const char* PyUnicode_AS_DATA(PyObject *o): Return a pointer to the internal buffer of the object. o has to be a PyUnicodeObject (not checked).

int PyUnicode_ClearFreeList(): Clear the free list. Return the total number of freed items.

Unicode Character Properties

Unicode provides many different character properties. The most often needed ones are available through these macros which are mapped to C functions depending on the Python configuration.

int Py_UNICODE_ISSPACE(Py_UNICODE ch): Return 1 or 0 depending on whether ch is a whitespace character.

int Py_UNICODE_ISLOWER(Py_UNICODE ch): Return 1 or 0 depending on whether ch is a lowercase character.

int Py_UNICODE_ISUPPER(Py_UNICODE ch): Return 1 or 0 depending on whether ch is an uppercase character.

int Py_UNICODE_ISTITLE(Py_UNICODE ch): Return 1 or 0 depending on whether ch is a titlecase character.

int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch): Return 1 or 0 depending on whether ch is a linebreak character.

int Py_UNICODE_ISDECIMAL(Py_UNICODE ch): Return 1 or 0 depending on whether ch is a decimal character.

int Py_UNICODE_ISDIGIT(Py_UNICODE ch): Return 1 or 0 depending on whether ch is a digit character.

int Py_UNICODE_ISNUMERIC(Py_UNICODE ch): Return 1 or 0 depending on whether ch is a numeric character.

int Py_UNICODE_ISALPHA(Py_UNICODE ch): Return 1 or 0 depending on whether ch is an alphabetic character.

int Py_UNICODE_ISALNUM(Py_UNICODE ch): Return 1 or 0 depending on whether ch is an alphanumeric character.

int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch): Return 1 or 0 depending on whether ch is a printable character. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)

These APIs can be used for fast direct character conversions:

Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch): Return the character ch converted to lower case.

Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch): Return the character ch converted to upper case.

Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch): Return the character ch converted to title case.

int Py_UNICODE_TODECIMAL(Py_UNICODE ch): Return the character ch converted to a decimal positive integer. Return -1 if this is not possible. This macro does not raise exceptions.

int Py_UNICODE_TODIGIT(Py_UNICODE ch): Return the character ch converted to a single digit integer. Return -1 if this is not possible. This macro does not raise exceptions.

double Py_UNICODE_TONUMERIC(Py_UNICODE ch): Return the character ch converted to a double. Return -1.0 if this is not possible. This macro does not raise exceptions.

Plain Py_UNICODE

To create Unicode objects and access their basic sequence properties, use these APIs:

PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size): Create a Unicode object from the Py_UNICODE buffer u of the given size. u may be NULL which causes the contents to be undefined. It is the user’s responsibility to fill in the needed data. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object. Therefore, modification of the resulting Unicode object is only allowed when u is NULL.

PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size): Create a Unicode object from the char buffer u. The bytes will be interpreted as being UTF-8 encoded. u may also be NULL which causes the contents to be undefined. It is the user’s responsibility to fill in the needed data. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object. Therefore, modification of the resulting Unicode object is only allowed when u is NULL.

PyObject* PyUnicode_FromString(const char *u): Create a Unicode object from a UTF-8 encoded null-terminated char buffer u.

PyObject* PyUnicode_FromFormat(const char *format, ...)

Take a C printf()-style format string and a variable number of arguments, calculate the size of the resulting Python unicode string and return a string with the values formatted into it. The variable arguments must be C types and must correspond exactly to the format characters in the ASCII-encoded format string. The following format characters are allowed:

Format Characters	Type	Comment
`%%`	n/a	The literal % character.
`%c`	int	A single character, represented as a C int.
`%d`	int	Exactly equivalent to `printf("%d")`.
`%u`	unsigned int	Exactly equivalent to `printf("%u")`.
`%ld`	long	Exactly equivalent to `printf("%ld")`.
`%lu`	unsigned long	Exactly equivalent to `printf("%lu")`.
`%lld`	long long	Exactly equivalent to `printf("%lld")`.
`%llu`	unsigned long long	Exactly equivalent to `printf("%llu")`.
`%zd`	Py_ssize_t	Exactly equivalent to `printf("%zd")`.
`%zu`	size_t	Exactly equivalent to `printf("%zu")`.
`%i`	int	Exactly equivalent to `printf("%i")`.
`%x`	int	Exactly equivalent to `printf("%x")`.
`%s`	char*	A null-terminated C character array.
`%p`	void*	The hex representation of a C pointer. Mostly equivalent to `printf("%p")` except that it is guaranteed to start with the literal `0x` regardless of what the platform’s `printf` yields.
`%A`	PyObject*	The result of calling `ascii()`.
`%U`	PyObject*	A unicode object.
`%V`	PyObject*, char*	A unicode object (which may be NULL) and a null-terminated C character array as a second parameter (which will be used if the first parameter is NULL).
`%S`	PyObject*	The result of calling `PyObject_Str()`.
`%R`	PyObject*	The result of calling `PyObject_Repr()`.

An unrecognized format character causes all the rest of the format string to be copied as-is to the result string, and any extra arguments are discarded.

Note

The “%lld” and “%llu” format specifiers are only available when HAVE_LONG_LONG is defined.

Changed in version 3.2: Support for "%lld" and "%llu" added.

PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs): Identical to PyUnicode_FromFormat() except that it takes exactly two arguments.

PyObject* PyUnicode_TransformDecimalToASCII(Py_UNICODE *s, Py_ssize_t size): Create a Unicode object by replacing all decimal digits in the Py_UNICODE buffer s of the given size by ASCII digits 0–9 according to their decimal value. Return NULL if an exception occurs.

Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode): Return a read-only pointer to the Unicode object’s internal Py_UNICODE buffer, or NULL if unicode is not a Unicode object.

Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)

Create a copy of a Unicode string ending with a nul character. Return NULL and raise a MemoryError exception on memory allocation failure, otherwise return a newly allocated buffer (use PyMem_Free() to free the buffer).

New in version 3.2.

Py_ssize_t PyUnicode_GetSize(PyObject *unicode): Return the length of the Unicode object.

PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

Coerce an encoded object obj to a Unicode object and return a reference with incremented refcount.

bytes, bytearray and other char buffer compatible objects are decoded according to the given encoding and using the error handling defined by errors. Both can be NULL to have the interface use the default values (see the next section for details).

All other objects, including Unicode objects, cause a TypeError to be set.

The API returns NULL if there was an error. The caller is responsible for decref’ing the returned objects.

PyObject* PyUnicode_FromObject(PyObject *obj): Shortcut for PyUnicode_FromEncodedObject(obj, NULL, "strict") which is used throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports wchar_t and provides a header file wchar.h, Python can interface directly to this type using the functions described in the wchar_t Support section. Support is optimized if Python’s own Py_UNICODE type is identical to the system’s wchar_t.

File System Encoding

To encode and decode file names and other environment strings, Py_FileSystemDefaultEncoding should be used as the encoding, and "surrogateescape" should be used as the error handler (PEP 383). To encode file names during argument parsing, the "O&" converter should be used, passing PyUnicode_FSConverter() as the conversion function:

int PyUnicode_FSConverter(PyObject* obj, void* result)

ParseTuple converter: encode str objects to bytes using PyUnicode_EncodeFSDefault(); bytes objects are output as-is. result must be a PyBytesObject* which must be released when it is no longer used.

New in version 3.1.

To decode file names during argument parsing, the "O&" converter should be used, passing PyUnicode_FSDecoder() as the conversion function:

int PyUnicode_FSDecoder(PyObject* obj, void* result)

ParseTuple converter: decode bytes objects to str using PyUnicode_DecodeFSDefaultAndSize(); str objects are output as-is. result must be a PyUnicodeObject* which must be released when it is no longer used.

New in version 3.2.

PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)

Decode a string using Py_FileSystemDefaultEncoding and the 'surrogateescape' error handler, or 'strict' on Windows.

If Py_FileSystemDefaultEncoding is not set, fall back to the locale encoding.

Changed in version 3.2: Use 'strict' error handler on Windows.

PyObject* PyUnicode_DecodeFSDefault(const char *s)

Decode a null-terminated string using Py_FileSystemDefaultEncoding and the 'surrogateescape' error handler, or 'strict' on Windows.

If Py_FileSystemDefaultEncoding is not set, fall back to the locale encoding.

Use PyUnicode_DecodeFSDefaultAndSize() if you know the string length.

Changed in version 3.2: Use 'strict' error handler on Windows.

PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)

Encode a Unicode object to Py_FileSystemDefaultEncoding with the 'surrogateescape' error handler, or 'strict' on Windows, and return bytes.

If Py_FileSystemDefaultEncoding is not set, fall back to the locale encoding.

New in version 3.2.

Python 3 Documentation (Simplified Chinese) 3.2.2 documentation