python-3.6.zip added from Github

README.cosmo contains the necessary links.
ahgamut 2021-08-08 09:38:33 +05:30 committed by Justine Tunney
parent 75fc601ff5
commit 0c4c56ff39
4219 changed files with 1968626 additions and 0 deletions


@ -0,0 +1,765 @@
*****************
Argparse Tutorial
*****************
:author: Tshepang Lekhonkhobe
.. _argparse-tutorial:
This tutorial is intended to be a gentle introduction to :mod:`argparse`, the
recommended command-line parsing module in the Python standard library.
.. note::
There are two other modules that fulfill the same task, namely
:mod:`getopt` (an equivalent for :c:func:`getopt` from the C
language) and the deprecated :mod:`optparse`.
Note also that :mod:`argparse` is based on :mod:`optparse`,
and therefore very similar in terms of usage.
Concepts
========
Let's show the sort of functionality that we are going to explore in this
introductory tutorial by making use of the :command:`ls` command:
.. code-block:: shell-session
$ ls
cpython devguide prog.py pypy rm-unused-function.patch
$ ls pypy
ctypes_configure demo dotviewer include lib_pypy lib-python ...
$ ls -l
total 20
drwxr-xr-x 19 wena wena 4096 Feb 18 18:51 cpython
drwxr-xr-x 4 wena wena 4096 Feb 8 12:04 devguide
-rwxr-xr-x 1 wena wena 535 Feb 19 00:05 prog.py
drwxr-xr-x 14 wena wena 4096 Feb 7 00:59 pypy
-rw-r--r-- 1 wena wena 741 Feb 18 01:01 rm-unused-function.patch
$ ls --help
Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.
...
A few concepts we can learn from the four commands:
* The :command:`ls` command is useful when run without any options at all. It defaults
to displaying the contents of the current directory.
* If we want more than what it provides by default, we tell it a bit more. In
this case, we want it to display a different directory, ``pypy``.
What we did is specify what is known as a positional argument. It's named so
because the program should know what to do with the value, solely based on
where it appears on the command line. This concept is more relevant
to a command like :command:`cp`, whose most basic usage is ``cp SRC DEST``.
The first position is *what you want copied,* and the second
position is *where you want it copied to*.
* Now, say we want to change the behaviour of the program. In our example,
we display more info for each file instead of just showing the file names.
The ``-l`` in that case is known as an optional argument.
* That's a snippet of the help text. It's very useful in that you can
come across a program you have never used before, and can figure out
how it works simply by reading its help text.
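As a rough preview of how these concepts map onto :mod:`argparse` (the sections that follow build this up step by step), the sketch below models a small part of :command:`ls`. The parser, the ``directory`` name and the argument lists handed to ``parse_args`` are illustrative assumptions; this is not how :command:`ls` itself is implemented.

```python
import argparse

# Hypothetical parser imitating a tiny part of ls; all names are illustrative.
parser = argparse.ArgumentParser(prog="ls")

# Positional argument: recognised by where it appears on the command line.
# nargs="?" lets it be omitted, falling back to the current directory.
parser.add_argument("directory", nargs="?", default=".")

# Optional argument: recognised by its name, used here as an on/off flag.
parser.add_argument("-l", action="store_true", help="use a long listing format")

# Passing a list to parse_args() avoids reading the real command line.
print(parser.parse_args([]))
print(parser.parse_args(["pypy"]))
print(parser.parse_args(["-l"]))
```

Each call prints a namespace object holding the parsed values, which is the same kind of object the later examples store in ``args``.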
The basics
==========
Let us start with a very simple example which does (almost) nothing::
import argparse
parser = argparse.ArgumentParser()
parser.parse_args()
Following is a result of running the code:
.. code-block:: shell-session
$ python3 prog.py
$ python3 prog.py --help
usage: prog.py [-h]
optional arguments:
-h, --help show this help message and exit
$ python3 prog.py --verbose
usage: prog.py [-h]
prog.py: error: unrecognized arguments: --verbose
$ python3 prog.py foo
usage: prog.py [-h]
prog.py: error: unrecognized arguments: foo
Here is what is happening:
* Running the script without any options results in nothing displayed to
stdout. Not so useful.
* The second one starts to display the usefulness of the :mod:`argparse`
module. We have done almost nothing, but already we get a nice help message.
* The ``--help`` option, which can also be shortened to ``-h``, is the only
option we get for free (i.e. no need to specify it). Specifying anything
else results in an error. But even then, we do get a useful usage message,
also for free.
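One detail the shell session hides: when :mod:`argparse` rejects the command line, it prints the usage message to stderr and raises :exc:`SystemExit` with status 2 instead of returning. A minimal sketch of observing this from Python follows; the explicit argument list and the ``exit_code`` variable exist only for illustration.

```python
import argparse

parser = argparse.ArgumentParser(prog="prog.py")

try:
    # An explicit list stands in for the real command line (sys.argv[1:]).
    parser.parse_args(["--verbose"])
except SystemExit as exc:
    # By this point argparse has written the usage and error text to stderr.
    exit_code = exc.code

print("exit status:", exit_code)
```

Catching :exc:`SystemExit` like this is mostly useful in tests; a normal script can simply let the exception end the program.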
Introducing Positional arguments
================================
An example::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("echo")
args = parser.parse_args()
print(args.echo)
And running the code:
.. code-block:: shell-session
$ python3 prog.py
usage: prog.py [-h] echo
prog.py: error: the following arguments are required: echo
$ python3 prog.py --help
usage: prog.py [-h] echo
positional arguments:
echo
optional arguments:
-h, --help show this help message and exit
$ python3 prog.py foo
foo
Here is what's happening:
* We've called the :meth:`add_argument` method, which is what we use to specify
which command-line options the program is willing to accept. In this case,
I've named the argument ``echo`` so that it's in line with its function.
* Calling our program now requires us to specify an option.
* The :meth:`parse_args` method actually returns some data from the
options specified, in this case, ``echo``.
* The ``args.echo`` attribute is some form of 'magic' that :mod:`argparse`
performs for free (i.e. no need to specify which variable that value is
stored in). You will also notice that its name matches the string argument
given to the method, ``echo``.
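To make the 'magic' a little more concrete, here is a small sketch of what ``parse_args`` hands back. Passing an explicit list instead of using the real command line is an assumption made so the snippet can run anywhere.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("echo")

# An explicit list stands in for the command line "prog.py foo".
args = parser.parse_args(["foo"])

print(args.echo)    # the attribute is named after the argument
print(vars(args))   # the same data viewed as a plain dict
```

The returned object is a simple namespace, so every argument you add shows up as one more attribute on it.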
Note however that, although the help display looks nice and all, it currently
is not as helpful as it can be. For example we see that we got ``echo`` as a
positional argument, but we don't know what it does, other than by guessing or
by reading the source code. So, let's make it a bit more useful::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("echo", help="echo the string you use here")
args = parser.parse_args()
print(args.echo)
And we get:
.. code-block:: shell-session
$ python3 prog.py -h
usage: prog.py [-h] echo
positional arguments:
echo echo the string you use here
optional arguments:
-h, --help show this help message and exit
Now, how about doing something even more useful::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", help="display a square of a given number")
args = parser.parse_args()
print(args.square**2)
Following is a result of running the code:
.. code-block:: shell-session
$ python3 prog.py 4
Traceback (most recent call last):
File "prog.py", line 5, in <module>
print(args.square**2)
TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
That didn't go so well. That's because :mod:`argparse` treats the options we
give it as strings, unless we tell it otherwise. So, let's tell
:mod:`argparse` to treat that input as an integer::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", help="display a square of a given number",
type=int)
args = parser.parse_args()
print(args.square**2)
Following is a result of running the code:
.. code-block:: shell-session
$ python3 prog.py 4
16
$ python3 prog.py four
usage: prog.py [-h] square
prog.py: error: argument square: invalid int value: 'four'
That went well. The program now even helpfully quits on invalid input
before proceeding.
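The ``type`` keyword is not limited to built-in types; any callable that takes the command-line string and returns a converted value should work. The following sketch assumes we only want positive numbers; ``positive_int`` is a hypothetical helper written for this example, not something :mod:`argparse` provides.

```python
import argparse


def positive_int(text):
    """Convert text to an int and reject zero or negative values."""
    value = int(text)  # a ValueError here is also reported by argparse
    if value <= 0:
        raise argparse.ArgumentTypeError(
            "{!r} is not a positive integer".format(text))
    return value


parser = argparse.ArgumentParser()
parser.add_argument("square", type=positive_int,
                    help="display a square of a given number")

# An explicit list stands in for the command line "prog.py 4".
args = parser.parse_args(["4"])
print(args.square**2)
```

Raising :exc:`argparse.ArgumentTypeError` lets the converter supply its own error message, which :mod:`argparse` prints together with the usage line.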
Introducing Optional arguments
==============================
So far we have been playing with positional arguments. Let us
have a look at how to add optional ones::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--verbosity", help="increase output verbosity")
args = parser.parse_args()
if args.verbosity:
print("verbosity turned on")
And the output:
.. code-block:: shell-session
$ python3 prog.py --verbosity 1
verbosity turned on
$ python3 prog.py
$ python3 prog.py --help
usage: prog.py [-h] [--verbosity VERBOSITY]
optional arguments:
-h, --help show this help message and exit
--verbosity VERBOSITY
increase output verbosity
$ python3 prog.py --verbosity
usage: prog.py [-h] [--verbosity VERBOSITY]
prog.py: error: argument --verbosity: expected one argument
Here is what is happening:
* The program is written so as to display something when ``--verbosity`` is
specified and display nothing when not.
* To show that the option is actually optional, there is no error when running
the program without it. Note that by default, if an optional argument isn't
used, the relevant variable, in this case :attr:`args.verbosity`, is
given ``None`` as a value, which is the reason it fails the truth
test of the :keyword:`if` statement.
* The help message is a bit different.
* When using the ``--verbosity`` option, one must also specify some value,
any value.
The above example accepts arbitrary values for ``--verbosity`` (and stores
them as strings), but for our simple program, only two values are actually
useful, ``True`` or ``False``.
Let's modify the code accordingly::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--verbose", help="increase output verbosity",
action="store_true")
args = parser.parse_args()
if args.verbose:
print("verbosity turned on")
And the output:
.. code-block:: shell-session
$ python3 prog.py --verbose
verbosity turned on
$ python3 prog.py --verbose 1
usage: prog.py [-h] [--verbose]
prog.py: error: unrecognized arguments: 1
$ python3 prog.py --help
usage: prog.py [-h] [--verbose]
optional arguments:
-h, --help show this help message and exit
--verbose increase output verbosity
Here is what is happening:
* The option is now more of a flag than something that requires a value.
We even changed the name of the option to match that idea.
Note that we now specify a new keyword, ``action``, and give it the value
``"store_true"``. This means that, if the option is specified,
assign the value ``True`` to :data:`args.verbose`.
Not specifying it implies ``False``.
* It complains when you specify a value, in the true spirit of what flags
actually are.
* Notice the different help text.
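The difference between the two versions can be checked directly. This is a minimal sketch that uses two throwaway parsers, ``plain`` and ``flag`` (names chosen for this example), and explicit argument lists instead of the real command line.

```python
import argparse

# Without an action, the option stores whatever string follows it.
plain = argparse.ArgumentParser()
plain.add_argument("--verbosity")
print(repr(plain.parse_args([]).verbosity))                    # None
print(repr(plain.parse_args(["--verbosity", "1"]).verbosity))  # '1'

# With action="store_true", the option takes no value and stores a bool.
flag = argparse.ArgumentParser()
flag.add_argument("--verbose", action="store_true")
print(flag.parse_args([]).verbose)             # False
print(flag.parse_args(["--verbose"]).verbose)  # True
```

Note that the first parser stores the string ``'1'``, not the integer ``1``, because no ``type`` was given.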
Short options
-------------
If you are familiar with command line usage,
you will notice that I haven't yet touched on the topic of short
versions of the options. It's quite simple::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", help="increase output verbosity",
action="store_true")
args = parser.parse_args()
if args.verbose:
print("verbosity turned on")
And here goes:
.. code-block:: shell-session
$ python3 prog.py -v
verbosity turned on
$ python3 prog.py --help
usage: prog.py [-h] [-v]
optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity
Note that the new ability is also reflected in the help text.
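Both spellings end up in the same attribute, which takes its name from the long option. A short sketch to confirm this, using explicit argument lists and the illustrative names ``short`` and ``long_form``:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", help="increase output verbosity",
                    action="store_true")

short = parser.parse_args(["-v"])
long_form = parser.parse_args(["--verbose"])

# The attribute name comes from the long option, "--verbose".
print(short.verbose, long_form.verbose)
print(short == long_form)
```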
Combining Positional and Optional arguments
===========================================
Our program keeps growing in complexity::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbose", action="store_true",
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbose:
print("the square of {} equals {}".format(args.square, answer))
else:
print(answer)
And now the output:
.. code-block:: shell-session
$ python3 prog.py
usage: prog.py [-h] [-v] square
prog.py: error: the following arguments are required: square
$ python3 prog.py 4
16
$ python3 prog.py 4 --verbose
the square of 4 equals 16
$ python3 prog.py --verbose 4
the square of 4 equals 16
* We've brought back a positional argument, hence the complaint.
* Note that the order does not matter.
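The claim about ordering is easy to verify by comparing the parsed results. The sketch below rebuilds the same parser and feeds it both orderings as explicit lists; the variable names are just for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
                    help="display a square of a given number")
parser.add_argument("-v", "--verbose", action="store_true",
                    help="increase output verbosity")

option_last = parser.parse_args(["4", "--verbose"])
option_first = parser.parse_args(["--verbose", "4"])

# Namespace objects compare equal when they hold the same values.
print(option_last == option_first)
print(option_first.square, option_first.verbose)
```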
How about we give this program of ours back the ability to have
multiple verbosity values, and actually get to use them::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", type=int,
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity == 2:
print("the square of {} equals {}".format(args.square, answer))
elif args.verbosity == 1:
print("{}^2 == {}".format(args.square, answer))
else:
print(answer)
And the output:
.. code-block:: shell-session
$ python3 prog.py 4
16
$ python3 prog.py 4 -v
usage: prog.py [-h] [-v VERBOSITY] square
prog.py: error: argument -v/--verbosity: expected one argument
$ python3 prog.py 4 -v 1
4^2 == 16
$ python3 prog.py 4 -v 2
the square of 4 equals 16
$ python3 prog.py 4 -v 3
16
These all look good except the last one, which exposes a bug in our program.
Let's fix it by restricting the values the ``--verbosity`` option can accept::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", type=int, choices=[0, 1, 2],
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity == 2:
print("the square of {} equals {}".format(args.square, answer))
elif args.verbosity == 1:
print("{}^2 == {}".format(args.square, answer))
else:
print(answer)
And the output:
.. code-block:: shell-session
$ python3 prog.py 4 -v 3
usage: prog.py [-h] [-v {0,1,2}] square
prog.py: error: argument -v/--verbosity: invalid choice: 3 (choose from 0, 1, 2)
$ python3 prog.py 4 -h
usage: prog.py [-h] [-v {0,1,2}] square
positional arguments:
square display a square of a given number
optional arguments:
-h, --help show this help message and exit
-v {0,1,2}, --verbosity {0,1,2}
increase output verbosity
Note that the change is reflected in both the error message and the
help string.
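``choices`` accepts any container that supports the ``in`` test, so a ``range`` can stand in for the explicit list. A minimal sketch, assuming the same three verbosity levels and using explicit argument lists; ``rejected_status`` is a name made up for this example.

```python
import argparse

parser = argparse.ArgumentParser(prog="prog.py")
parser.add_argument("square", type=int,
                    help="display a square of a given number")
# range(3) allows 0, 1 and 2, like the list [0, 1, 2] above.
parser.add_argument("-v", "--verbosity", type=int, choices=range(3),
                    help="increase output verbosity")

args = parser.parse_args(["4", "-v", "2"])
print(args.verbosity)

try:
    parser.parse_args(["4", "-v", "3"])  # outside the allowed choices
except SystemExit as exc:
    rejected_status = exc.code
print("rejected with exit status", rejected_status)
```

The value is converted by ``type`` first and checked against ``choices`` afterwards, which is why the choices here are integers rather than strings.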
Now, let's use a different approach of playing with verbosity, which is pretty
common. It also matches the way the CPython executable handles its own
verbosity argument (check the output of ``python --help``)::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", action="count",
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity == 2:
print("the square of {} equals {}".format(args.square, answer))
elif args.verbosity == 1:
print("{}^2 == {}".format(args.square, answer))
else:
print(answer)
We have introduced another action, "count",
to count the number of occurrences of a specific optional argument:
.. code-block:: shell-session
$ python3 prog.py 4
16
$ python3 prog.py 4 -v
4^2 == 16
$ python3 prog.py 4 -vv
the square of 4 equals 16
$ python3 prog.py 4 --verbosity --verbosity
the square of 4 equals 16
$ python3 prog.py 4 -v 1
usage: prog.py [-h] [-v] square
prog.py: error: unrecognized arguments: 1
$ python3 prog.py 4 -h
usage: prog.py [-h] [-v] square
positional arguments:
square display a square of a given number
optional arguments:
-h, --help show this help message and exit
-v, --verbosity increase output verbosity
$ python3 prog.py 4 -vvv
16
* Yes, it's now more of a flag (similar to ``action="store_true"`` in the
previous version of our script). That should explain the complaint.
* It also behaves similarly to the "store_true" action.
* Now here's a demonstration of what the "count" action gives. You've probably
seen this sort of usage before.
* And if you don't specify the ``-v`` flag, that flag is considered to have
``None`` value.
* As should be expected, specifying the long form of the flag gives
the same output.
* Sadly, our help output isn't very informative on the new ability our script
has acquired, but that can always be fixed by improving the documentation for
our script (e.g. via the ``help`` keyword argument).
* That last output exposes a bug in our program.
Let's fix::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", action="count",
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
# bugfix: replace == with >=
if args.verbosity >= 2:
print("the square of {} equals {}".format(args.square, answer))
elif args.verbosity >= 1:
print("{}^2 == {}".format(args.square, answer))
else:
print(answer)
And this is what it gives:
.. code-block:: shell-session
$ python3 prog.py 4 -vvv
the square of 4 equals 16
$ python3 prog.py 4 -vvvv
the square of 4 equals 16
$ python3 prog.py 4
Traceback (most recent call last):
File "prog.py", line 11, in <module>
if args.verbosity >= 2:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
* First output went well, and fixes the bug we had before.
That is, we want any value >= 2 to be as verbose as possible.
* Third output not so good.
Let's fix that bug::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
help="display a square of a given number")
parser.add_argument("-v", "--verbosity", action="count", default=0,
help="increase output verbosity")
args = parser.parse_args()
answer = args.square**2
if args.verbosity >= 2:
print("the square of {} equals {}".format(args.square, answer))
elif args.verbosity >= 1:
print("{}^2 == {}".format(args.square, answer))
else:
print(answer)
We've just introduced yet another keyword, ``default``.
We've set it to ``0`` in order to make it comparable to the other int values.
Remember that by default,
if an optional argument isn't specified,
it gets the ``None`` value, and that cannot be compared to an int value
(hence the :exc:`TypeError` exception).
And:
.. code-block:: shell-session
$ python3 prog.py 4
16
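For a quick check of how ``action="count"`` and ``default=0`` interact, the sketch below parses a few explicit argument lists; only the parser setup is taken from the example above, and ``count_of`` is a hypothetical helper added for brevity.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("square", type=int,
                    help="display a square of a given number")
parser.add_argument("-v", "--verbosity", action="count", default=0,
                    help="increase output verbosity")


def count_of(argv):
    """Return the verbosity count argparse derives from argv."""
    return parser.parse_args(argv).verbosity


print(count_of(["4"]))                           # 0, thanks to default=0
print(count_of(["4", "-vv"]))                    # 2
print(count_of(["4", "-v", "--verbosity", "-v"]))  # 3, forms can be mixed
```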
You can go quite far just with what we've learned so far,
and we have only scratched the surface.
The :mod:`argparse` module is very powerful,
and we'll explore a bit more of it before we end this tutorial.
Getting a little more advanced
==============================
What if we wanted to expand our tiny program to perform other powers,
not just squares::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
parser.add_argument("-v", "--verbosity", action="count", default=0)
args = parser.parse_args()
answer = args.x**args.y
if args.verbosity >= 2:
print("{} to the power {} equals {}".format(args.x, args.y, answer))
elif args.verbosity >= 1:
print("{}^{} == {}".format(args.x, args.y, answer))
else:
print(answer)
Output:
.. code-block:: shell-session
$ python3 prog.py
usage: prog.py [-h] [-v] x y
prog.py: error: the following arguments are required: x, y
$ python3 prog.py -h
usage: prog.py [-h] [-v] x y
positional arguments:
x the base
y the exponent
optional arguments:
-h, --help show this help message and exit
-v, --verbosity
$ python3 prog.py 4 2 -v
4^2 == 16
Notice that so far we've been using verbosity level to *change* the text
that gets displayed. The following example instead uses verbosity level
to display *more* text instead::
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
parser.add_argument("-v", "--verbosity", action="count", default=0)
args = parser.parse_args()
answer = args.x**args.y
if args.verbosity >= 2:
print("Running '{}'".format(__file__))
if args.verbosity >= 1:
print("{}^{} == ".format(args.x, args.y), end="")
print(answer)
Output:
.. code-block:: shell-session
$ python3 prog.py 4 2
16
$ python3 prog.py 4 2 -v
4^2 == 16
$ python3 prog.py 4 2 -vv
Running 'prog.py'
4^2 == 16
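A common use of this "more text" style is to let the count pick a :mod:`logging` level. The mapping below is one possible convention rather than anything :mod:`argparse` prescribes, and ``level_for`` is a hypothetical helper written for this sketch.

```python
import argparse
import logging


def level_for(verbosity):
    """Map a -v count to a logging level (an illustrative choice)."""
    if verbosity >= 2:
        return logging.DEBUG
    if verbosity == 1:
        return logging.INFO
    return logging.WARNING


parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbosity", action="count", default=0)

# An explicit list stands in for the command line "prog.py -vv".
args = parser.parse_args(["-vv"])
logging.basicConfig(level=level_for(args.verbosity))
logging.debug("debug output is visible at -vv")
```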
Conflicting options
-------------------
So far, we have been working with two methods of an
:class:`argparse.ArgumentParser` instance. Let's introduce a third one,
:meth:`add_mutually_exclusive_group`. It allows us to specify options that
conflict with each other. Let's also change the rest of the program so that
the new functionality makes more sense:
we'll introduce the ``--quiet`` option,
which will be the opposite of the ``--verbose`` one::
import argparse
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true")
group.add_argument("-q", "--quiet", action="store_true")
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
args = parser.parse_args()
answer = args.x**args.y
if args.quiet:
print(answer)
elif args.verbose:
print("{} to the power {} equals {}".format(args.x, args.y, answer))
else:
print("{}^{} == {}".format(args.x, args.y, answer))
Our program is now simpler, and we've lost some functionality for the sake of
demonstration. Anyway, here's the output:
.. code-block:: shell-session
$ python3 prog.py 4 2
4^2 == 16
$ python3 prog.py 4 2 -q
16
$ python3 prog.py 4 2 -v
4 to the power 2 equals 16
$ python3 prog.py 4 2 -vq
usage: prog.py [-h] [-v | -q] x y
prog.py: error: argument -q/--quiet: not allowed with argument -v/--verbose
$ python3 prog.py 4 2 -v --quiet
usage: prog.py [-h] [-v | -q] x y
prog.py: error: argument -q/--quiet: not allowed with argument -v/--verbose
That should be easy to follow. I've added that last output so you can see the
sort of flexibility you get, i.e. mixing long form options with short form
ones.
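The same conflict can be observed from Python. This is a minimal sketch with explicit argument lists; ``conflict_status`` is a name made up for the example, and the positional arguments are left out to keep it short.

```python
import argparse

parser = argparse.ArgumentParser(prog="prog.py")
group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true")
group.add_argument("-q", "--quiet", action="store_true")

# Using one option from the group is fine; the other keeps its default.
args = parser.parse_args(["-q"])
print(args.verbose, args.quiet)

try:
    parser.parse_args(["-v", "-q"])  # both options from the same group
except SystemExit as exc:
    conflict_status = exc.code
print("conflict rejected with exit status", conflict_status)
```

If exactly one of the options must always be given, ``add_mutually_exclusive_group`` also accepts ``required=True``.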
Before we conclude, you probably want to tell your users the main purpose of
your program, just in case they don't know::
import argparse
parser = argparse.ArgumentParser(description="calculate X to the power of Y")
group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true")
group.add_argument("-q", "--quiet", action="store_true")
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
args = parser.parse_args()
answer = args.x**args.y
if args.quiet:
print(answer)
elif args.verbose:
print("{} to the power {} equals {}".format(args.x, args.y, answer))
else:
print("{}^{} == {}".format(args.x, args.y, answer))
Note the slight difference in the usage text. Note the ``[-v | -q]``,
which tells us that we can either use ``-v`` or ``-q``,
but not both at the same time:
.. code-block:: shell-session
$ python3 prog.py --help
usage: prog.py [-h] [-v | -q] x y
calculate X to the power of Y
positional arguments:
x the base
y the exponent
optional arguments:
-h, --help show this help message and exit
-v, --verbose
-q, --quiet
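The help text can also be obtained as a string with ``format_help``, which is handy for tests. The sketch below additionally passes ``epilog``, a standard :class:`~argparse.ArgumentParser` keyword for text shown after the argument list; the epilog wording is invented for this example.

```python
import argparse

parser = argparse.ArgumentParser(
    prog="prog.py",
    description="calculate X to the power of Y",
    epilog="example: prog.py 4 2 -v",  # illustrative text, shown last
)
group = parser.add_mutually_exclusive_group()
group.add_argument("-v", "--verbose", action="store_true")
group.add_argument("-q", "--quiet", action="store_true")
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")

# format_help() returns the text that --help would print.
help_text = parser.format_help()
print(help_text)
```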
Conclusion
==========
The :mod:`argparse` module offers a lot more than shown here.
Its docs are quite detailed and thorough, and full of examples.
Having gone through this tutorial, you should easily digest them
without feeling overwhelmed.

1734
third_party/python/Doc/howto/clinic.rst vendored Normal file

File diff suppressed because it is too large


@ -0,0 +1,257 @@
.. highlightlang:: c
.. _cporting-howto:
*************************************
Porting Extension Modules to Python 3
*************************************
:author: Benjamin Peterson
.. topic:: Abstract
Although changing the C-API was not one of Python 3's objectives,
the many Python-level changes made leaving Python 2's API intact
impossible. In fact, some changes such as :func:`int` and
:func:`long` unification are more obvious on the C level. This
document endeavors to document incompatibilities and how they can
be worked around.
Conditional compilation
=======================
The easiest way to compile only some code for Python 3 is to check
if :c:macro:`PY_MAJOR_VERSION` is greater than or equal to 3. ::
#if PY_MAJOR_VERSION >= 3
#define IS_PY3K
#endif
API functions that are not present can be aliased to their equivalents within
conditional blocks.
Changes to Object APIs
======================
Python 3 merged together some types with similar functions while cleanly
separating others.
str/unicode Unification
-----------------------
Python 3's :func:`str` type is equivalent to Python 2's :func:`unicode`; the C
functions are called ``PyUnicode_*`` for both. The old 8-bit string type has become
:func:`bytes`, with C functions called ``PyBytes_*``. Python 2.6 and later provide a compatibility header,
:file:`bytesobject.h`, mapping ``PyBytes`` names to ``PyString`` ones. For best
compatibility with Python 3, :c:type:`PyUnicode` should be used for textual data and
:c:type:`PyBytes` for binary data. It's also important to remember that
:c:type:`PyBytes` and :c:type:`PyUnicode` in Python 3 are not interchangeable like
:c:type:`PyString` and :c:type:`PyUnicode` are in Python 2. The following example
shows best practices with regard to :c:type:`PyUnicode`, :c:type:`PyString`,
and :c:type:`PyBytes`. ::
#include "stdlib.h"
#include "Python.h"
#include "bytesobject.h"
/* text example */
static PyObject *
say_hello(PyObject *self, PyObject *args) {
PyObject *name, *result;
if (!PyArg_ParseTuple(args, "U:say_hello", &name))
return NULL;
result = PyUnicode_FromFormat("Hello, %S!", name);
return result;
}
/* just a forward */
static char * do_encode(PyObject *);
/* bytes example */
static PyObject *
encode_object(PyObject *self, PyObject *args) {
char *encoded;
PyObject *result, *myobj;
if (!PyArg_ParseTuple(args, "O:encode_object", &myobj))
return NULL;
encoded = do_encode(myobj);
if (encoded == NULL)
return NULL;
result = PyBytes_FromString(encoded);
free(encoded);
return result;
}
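The same separation is visible from Python 3 itself. The following Python snippet is only an illustration with made-up values, not part of an extension module; it shows that text and binary data no longer mix implicitly.

```python
text = "caf\u00e9"
data = text.encode("utf-8")  # str -> bytes must be requested explicitly
print(type(text).__name__, type(data).__name__)

try:
    text + data  # mixing the two types is an error in Python 3
except TypeError as exc:
    mixed_error = exc
print("mixing raised:", type(mixed_error).__name__)

# Decoding with the same codec recovers the original text.
print(data.decode("utf-8") == text)
```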
long/int Unification
--------------------
Python 3 has only one integer type, :func:`int`. But it actually
corresponds to Python 2's :func:`long` type—the :func:`int` type
used in Python 2 was removed. In the C-API, ``PyInt_*`` functions
are replaced by their ``PyLong_*`` equivalents.
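At the Python level the unification means that small and very large integers share one type. A short Python 3 illustration, with arbitrarily chosen values:

```python
import sys

small = 1
big = sys.maxsize + 1  # would have been a separate "long" in Python 2

# Both values are instances of the single int type.
print(type(small) is int, type(big) is int)
print(type(small) is type(2 ** 100))
```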
Module initialization and state
===============================
Python 3 has a revamped extension module initialization system. (See
:pep:`3121`.) Instead of storing module state in globals, it should
be stored in an interpreter-specific structure. Creating modules that
act correctly in both Python 2 and Python 3 is tricky. The following
simple example demonstrates how. ::
#include "Python.h"
struct module_state {
PyObject *error;
};
#if PY_MAJOR_VERSION >= 3
#define GETSTATE(m) ((struct module_state*)PyModule_GetState(m))
#else
#define GETSTATE(m) (&_state)
static struct module_state _state;
#endif
static PyObject *
error_out(PyObject *m) {
struct module_state *st = GETSTATE(m);
PyErr_SetString(st->error, "something bad happened");
return NULL;
}
static PyMethodDef myextension_methods[] = {
{"error_out", (PyCFunction)error_out, METH_NOARGS, NULL},
{NULL, NULL}
};
#if PY_MAJOR_VERSION >= 3
static int myextension_traverse(PyObject *m, visitproc visit, void *arg) {
Py_VISIT(GETSTATE(m)->error);
return 0;
}
static int myextension_clear(PyObject *m) {
Py_CLEAR(GETSTATE(m)->error);
return 0;
}
static struct PyModuleDef moduledef = {
PyModuleDef_HEAD_INIT,
"myextension",
NULL,
sizeof(struct module_state),
myextension_methods,
NULL,
myextension_traverse,
myextension_clear,
NULL
};
#define INITERROR return NULL
PyMODINIT_FUNC
PyInit_myextension(void)
#else
#define INITERROR return
void
initmyextension(void)
#endif
{
#if PY_MAJOR_VERSION >= 3
PyObject *module = PyModule_Create(&moduledef);
#else
PyObject *module = Py_InitModule("myextension", myextension_methods);
#endif
if (module == NULL)
INITERROR;
struct module_state *st = GETSTATE(module);
st->error = PyErr_NewException("myextension.Error", NULL, NULL);
if (st->error == NULL) {
Py_DECREF(module);
INITERROR;
}
#if PY_MAJOR_VERSION >= 3
return module;
#endif
}
CObject replaced with Capsule
=============================
The :c:type:`Capsule` object was introduced in Python 3.1 and 2.7 to replace
:c:type:`CObject`. CObjects were useful,
but the :c:type:`CObject` API was problematic: it didn't permit distinguishing
between valid CObjects, which allowed mismatched CObjects to crash the
interpreter, and some of its APIs relied on undefined behavior in C.
(For further reading on the rationale behind Capsules, please see :issue:`5630`.)
If you're currently using CObjects, and you want to migrate to 3.1 or newer,
you'll need to switch to Capsules.
:c:type:`CObject` was deprecated in 3.1 and 2.7 and completely removed in
Python 3.2. If you only support 2.7, or 3.1 and above, you
can simply switch to :c:type:`Capsule`. If you need to support Python 3.0,
or versions of Python earlier than 2.7,
you'll have to support both CObjects and Capsules.
(Note that Python 3.0 is no longer supported, and it is not recommended
for production use.)
The following example header file :file:`capsulethunk.h` may
solve the problem for you. Simply write your code against the
:c:type:`Capsule` API and include this header file after
:file:`Python.h`. Your code will automatically use Capsules
in versions of Python with Capsules, and switch to CObjects
when Capsules are unavailable.
:file:`capsulethunk.h` simulates Capsules using CObjects. However,
:c:type:`CObject` provides no place to store the capsule's "name". As a
result the simulated :c:type:`Capsule` objects created by :file:`capsulethunk.h`
behave slightly differently from real Capsules. Specifically:
* The name parameter passed in to :c:func:`PyCapsule_New` is ignored.
* The name parameter passed in to :c:func:`PyCapsule_IsValid` and
:c:func:`PyCapsule_GetPointer` is ignored, and no error checking
of the name is performed.
* :c:func:`PyCapsule_GetName` always returns NULL.
* :c:func:`PyCapsule_SetName` always raises an exception and
returns failure. (Since there's no way to store a name
in a CObject, noisy failure of :c:func:`PyCapsule_SetName`
was deemed preferable to silent failure here. If this is
inconvenient, feel free to modify your local
copy as you see fit.)
You can find :file:`capsulethunk.h` in the Python source distribution
as :source:`Doc/includes/capsulethunk.h`. We also include it here for
your convenience:
.. literalinclude:: ../includes/capsulethunk.h
Other options
=============
If you are writing a new extension module, you might consider `Cython
<http://cython.org/>`_. It translates a Python-like language to C. The
extension modules it creates are compatible with Python 3 and Python 2.

552
third_party/python/Doc/howto/curses.rst vendored Normal file

@ -0,0 +1,552 @@
.. _curses-howto:
**********************************
Curses Programming with Python
**********************************
:Author: A.M. Kuchling, Eric S. Raymond
:Release: 2.04
.. topic:: Abstract
This document describes how to use the :mod:`curses` extension
module to control text-mode displays.
What is curses?
===============
The curses library supplies a terminal-independent screen-painting and
keyboard-handling facility for text-based terminals; such terminals
include VT100s, the Linux console, and the simulated terminal provided
by various programs. Display terminals support various control codes
to perform common operations such as moving the cursor, scrolling the
screen, and erasing areas. Different terminals use widely differing
codes, and often have their own minor quirks.
In a world of graphical displays, one might ask "why bother"? It's
true that character-cell display terminals are an obsolete technology,
but there are niches in which being able to do fancy things with them
is still valuable. One niche is on small-footprint or embedded
Unixes that don't run an X server. Another is tools such as OS
installers and kernel configurators that may have to run before any
graphical support is available.
The curses library provides fairly basic functionality, providing the
programmer with an abstraction of a display containing multiple
non-overlapping windows of text. The contents of a window can be
changed in various ways---adding text, erasing it, changing its
appearance---and the curses library will figure out what control codes
need to be sent to the terminal to produce the right output. curses
doesn't provide many user-interface concepts such as buttons, checkboxes,
or dialogs; if you need such features, consider a user interface library such as
`Urwid <https://pypi.org/project/urwid/>`_.
The curses library was originally written for BSD Unix; the later System V
versions of Unix from AT&T added many enhancements and new functions. BSD curses
is no longer maintained, having been replaced by ncurses, which is an
open-source implementation of the AT&T interface. If you're using an
open-source Unix such as Linux or FreeBSD, your system almost certainly uses
ncurses. Since most current commercial Unix versions are based on System V
code, all the functions described here will probably be available. The older
versions of curses carried by some proprietary Unixes may not support
everything, though.
The Windows version of Python doesn't include the :mod:`curses`
module. A ported version called `UniCurses
<https://pypi.org/project/UniCurses>`_ is available. You could
also try `the Console module <http://effbot.org/zone/console-index.htm>`_
written by Fredrik Lundh, which doesn't
use the same API as curses but provides cursor-addressable text output
and full support for mouse and keyboard input.
The Python curses module
------------------------
The Python module is a fairly simple wrapper over the C functions provided by
curses; if you're already familiar with curses programming in C, it's really
easy to transfer that knowledge to Python. The biggest difference is that the
Python interface makes things simpler by merging different C functions such as
:c:func:`addstr`, :c:func:`mvaddstr`, and :c:func:`mvwaddstr` into a single
:meth:`~curses.window.addstr` method. You'll see this covered in more
detail later.
This HOWTO is an introduction to writing text-mode programs with curses
and Python. It doesn't attempt to be a complete guide to the curses API; for
that, see the Python library guide's section on ncurses, and the C manual pages
for ncurses. It will, however, give you the basic ideas.
Starting and ending a curses application
========================================
Before doing anything, curses must be initialized. This is done by
calling the :func:`~curses.initscr` function, which will determine the
terminal type, send any required setup codes to the terminal, and
create various internal data structures. If successful,
:func:`initscr` returns a window object representing the entire
screen; this is usually called ``stdscr`` after the name of the
corresponding C variable. ::

   import curses
   stdscr = curses.initscr()
Usually curses applications turn off automatic echoing of keys to the
screen, in order to be able to read keys and only display them under
certain circumstances. This requires calling the
:func:`~curses.noecho` function. ::

   curses.noecho()
Applications will also commonly need to react to keys instantly,
without requiring the Enter key to be pressed; this is called cbreak
mode, as opposed to the usual buffered input mode. ::

   curses.cbreak()
Terminals usually return special keys, such as the cursor keys or navigation
keys such as Page Up and Home, as a multibyte escape sequence. While you could
write your application to expect such sequences and process them accordingly,
curses can do it for you, returning a special value such as
:const:`curses.KEY_LEFT`. To get curses to do the job, you'll have to enable
keypad mode. ::

   stdscr.keypad(True)
Terminating a curses application is much easier than starting one. You'll need
to call::

   curses.nocbreak()
   stdscr.keypad(False)
   curses.echo()
to reverse the curses-friendly terminal settings. Then call the
:func:`~curses.endwin` function to restore the terminal to its original
operating mode. ::

   curses.endwin()
A common problem when debugging a curses application is to get your terminal
messed up when the application dies without restoring the terminal to its
previous state. In Python this commonly happens when your code is buggy and
raises an uncaught exception. Keys are no longer echoed to the screen when
you type them, for example, which makes using the shell difficult.
In Python you can avoid these complications and make debugging much easier by
importing the :func:`curses.wrapper` function and using it like this::

   from curses import wrapper

   def main(stdscr):
       # Clear screen
       stdscr.clear()

       # This raises ZeroDivisionError when i == 10.
       for i in range(0, 11):
           v = i-10
           stdscr.addstr(i, 0, '10 divided by {} is {}'.format(v, 10/v))

       stdscr.refresh()
       stdscr.getkey()

   wrapper(main)
The :func:`~curses.wrapper` function takes a callable object and does the
initializations described above, also initializing colors if color
support is present. :func:`wrapper` then runs your provided callable.
Once the callable returns, :func:`wrapper` will restore the original
state of the terminal. The callable is called inside a
:keyword:`try`...\ :keyword:`except` that catches exceptions, restores
the state of the terminal, and then re-raises the exception. Therefore
your terminal won't be left in a funny state on exception and you'll be
able to read the exception's message and traceback.
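The behavior described above can be approximated with a ``try``...``finally``
block. The following is only a rough sketch under that assumption, not the
library's actual source: ``run_with_cleanup`` is a hypothetical name, and the
``isatty()`` guard is an addition so that the example does nothing when no
terminal is attached.

```python
import curses
import sys


def run_with_cleanup(func):
    """Hypothetical stand-in for curses.wrapper(), for illustration only."""
    stdscr = curses.initscr()
    try:
        curses.noecho()           # don't echo typed keys
        curses.cbreak()           # deliver keys without waiting for Enter
        stdscr.keypad(True)       # translate escape sequences to KEY_* values
        try:
            curses.start_color()  # may fail on terminals without color
        except curses.error:
            pass
        return func(stdscr)
    finally:
        # Runs on a normal return and when func() raises.
        stdscr.keypad(False)
        curses.echo()
        curses.nocbreak()
        curses.endwin()


def main(stdscr):
    stdscr.addstr(0, 0, "Press any key to quit")
    stdscr.refresh()
    stdscr.getkey()


# Only touch the terminal when one is actually attached.
if __name__ == "__main__" and sys.stdin.isatty() and sys.stdout.isatty():
    run_with_cleanup(main)
```

In real programs, prefer :func:`curses.wrapper` itself; the sketch only makes
the order of setup and teardown calls visible.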
Windows and Pads
================
Windows are the basic abstraction in curses. A window object represents a
rectangular area of the screen, and supports methods to display text,
erase it, allow the user to input strings, and so forth.
The ``stdscr`` object returned by the :func:`~curses.initscr` function is a
window object that covers the entire screen. Many programs may need
only this single window, but you might wish to divide the screen into
smaller windows, in order to redraw or clear them separately. The
:func:`~curses.newwin` function creates a new window of a given size,
returning the new window object. ::

   begin_x = 20; begin_y = 7
   height = 5; width = 40
   win = curses.newwin(height, width, begin_y, begin_x)
Note that the coordinate system used in curses is unusual.
Coordinates are always passed in the order *y,x*, and the top-left
corner of a window is coordinate (0,0). This breaks the normal
convention for handling coordinates where the *x* coordinate comes
first. This is an unfortunate difference from most other computer
applications, but it's been part of curses since it was first written,
and it's too late to change things now.
Your application can determine the size of the screen by using the
:data:`curses.LINES` and :data:`curses.COLS` variables to obtain the *y* and
*x* sizes. Legal coordinates will then extend from ``(0,0)`` to
``(curses.LINES - 1, curses.COLS - 1)``.
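As a small illustration of the *y,x* ordering and of the legal coordinate
range, here is a sketch of a helper that computes where a centered window
should start. ``centered_origin`` is a hypothetical function, and its *lines*
and *cols* parameters stand in for :data:`curses.LINES` and
:data:`curses.COLS`, so it can be tried without a terminal.

```python
def centered_origin(height, width, lines, cols):
    """Return (begin_y, begin_x) that centers a height x width window.

    lines and cols stand in for curses.LINES and curses.COLS.
    """
    if height > lines or width > cols:
        raise ValueError("window is larger than the screen")
    begin_y = (lines - height) // 2
    begin_x = (cols - width) // 2
    return begin_y, begin_x      # y first, then x, as curses expects


# A 5x40 window on a 24x80 screen.
print(centered_origin(5, 40, 24, 80))   # (9, 20)
```

Inside a running curses program the result could be passed along as
``curses.newwin(height, width, begin_y, begin_x)``.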
When you call a method to display or erase text, the effect doesn't
immediately show up on the display. Instead you must call the
:meth:`~curses.window.refresh` method of window objects to update the
screen.
This is because curses was originally written with slow 300-baud
terminal connections in mind; with these terminals, minimizing the
time required to redraw the screen was very important. Instead curses
accumulates changes to the screen and displays them in the most
efficient manner when you call :meth:`refresh`. For example, if your
program displays some text in a window and then clears the window,
there's no need to send the original text because it's never
visible.
In practice, explicitly telling curses to redraw a window doesn't
really complicate programming with curses much. Most programs go into a flurry
of activity, and then pause waiting for a keypress or some other action on the
part of the user. All you have to do is to be sure that the screen has been
redrawn before pausing to wait for user input, by first calling
``stdscr.refresh()`` or the :meth:`refresh` method of some other relevant
window.
A pad is a special case of a window; it can be larger than the actual display
screen, and only a portion of the pad is displayed at a time. Creating a pad
requires the pad's height and width, while refreshing a pad requires giving the
coordinates of the on-screen area where a subsection of the pad will be
displayed. ::

   pad = curses.newpad(100, 100)
   # These loops fill the pad with letters; addch() is
   # explained in the next section
   for y in range(0, 99):
       for x in range(0, 99):
           pad.addch(y,x, ord('a') + (x*x+y*y) % 26)

   # Displays a section of the pad in the middle of the screen.
   # (0,0) : coordinate of upper-left corner of pad area to display.
   # (5,5) : coordinate of upper-left corner of window area to be filled
   #         with pad content.
   # (20, 75) : coordinate of lower-right corner of window area to be
   #          : filled with pad content.
   pad.refresh( 0,0, 5,5, 20,75)
The :meth:`refresh` call displays a section of the pad in the rectangle
extending from coordinate (5,5) to coordinate (20,75) on the screen; the upper
left corner of the displayed section is coordinate (0,0) on the pad. Beyond
that difference, pads are exactly like ordinary windows and support the same
methods.
If you have multiple windows and pads on screen there is a more
efficient way to update the screen and prevent annoying screen flicker
as each part of the screen gets updated. :meth:`refresh` actually
does two things:
1) Calls the :meth:`~curses.window.noutrefresh` method of each window
to update an underlying data structure representing the desired
state of the screen.
2) Calls the :func:`~curses.doupdate` function to change the
physical screen to match the desired state recorded in the data structure.
Instead you can call :meth:`noutrefresh` on a number of windows to
update the data structure, and then call :func:`doupdate` to update
the screen.
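A minimal sketch of that pattern follows, assuming a terminal at least 45
columns wide and 5 lines tall. The window sizes and texts are arbitrary
choices for illustration, and the ``isatty()`` guard is an addition that
keeps the example inert when no terminal is attached.

```python
import curses
import sys


def main(stdscr):
    left = curses.newwin(5, 20, 0, 0)
    right = curses.newwin(5, 20, 0, 25)
    left.addstr(1, 1, "left pane")
    right.addstr(1, 1, "right pane")

    # Stage every window in the in-memory screen image first ...
    for win in (stdscr, left, right):
        win.noutrefresh()
    # ... then bring the physical screen up to date in one step.
    curses.doupdate()

    stdscr.getkey()


if __name__ == "__main__" and sys.stdin.isatty() and sys.stdout.isatty():
    curses.wrapper(main)
```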
Displaying Text
===============
From a C programmer's point of view, curses may sometimes look like a
twisty maze of functions, all subtly different. For example,
:c:func:`addstr` displays a string at the current cursor location in
the ``stdscr`` window, while :c:func:`mvaddstr` moves to a given y,x
coordinate first before displaying the string. :c:func:`waddstr` is just
like :c:func:`addstr`, but allows specifying a window to use instead of
using ``stdscr`` by default. :c:func:`mvwaddstr` allows specifying both
a window and a coordinate.
Fortunately the Python interface hides all these details. ``stdscr``
is a window object like any other, and methods such as
:meth:`~curses.window.addstr` accept multiple argument forms. Usually there
are four different forms.
+---------------------------------+-----------------------------------------------+
| Form | Description |
+=================================+===============================================+
| *str* or *ch* | Display the string *str* or character *ch* at |
| | the current position |
+---------------------------------+-----------------------------------------------+
| *str* or *ch*, *attr* | Display the string *str* or character *ch*, |
| | using attribute *attr* at the current |
| | position |
+---------------------------------+-----------------------------------------------+
| *y*, *x*, *str* or *ch* | Move to position *y,x* within the window, and |
| | display *str* or *ch* |
+---------------------------------+-----------------------------------------------+
| *y*, *x*, *str* or *ch*, *attr* | Move to position *y,x* within the window, and |
| | display *str* or *ch*, using attribute *attr* |
+---------------------------------+-----------------------------------------------+
Attributes allow displaying text in highlighted forms such as boldface,
underline, reverse video, or in color. They'll be explained in more detail in
the next subsection.
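The four argument forms in the table can be tried side by side. This is an
illustrative sketch only: the strings and positions are arbitrary, and the
``isatty()`` guard is an addition so that nothing happens without a terminal.

```python
import curses
import sys


def main(stdscr):
    stdscr.clear()
    stdscr.addstr("plain, at the cursor. ")                 # str
    stdscr.addstr("bold, at the cursor.", curses.A_BOLD)    # str, attr
    stdscr.addstr(2, 0, "plain, at row 2, column 0")        # y, x, str
    stdscr.addstr(3, 0, "underlined, at row 3, column 0",
                  curses.A_UNDERLINE)                       # y, x, str, attr
    stdscr.refresh()
    stdscr.getkey()


if __name__ == "__main__" and sys.stdin.isatty() and sys.stdout.isatty():
    curses.wrapper(main)
```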
The :meth:`~curses.window.addstr` method takes a Python string or
bytestring as the value to be displayed. The contents of bytestrings
are sent to the terminal as-is. Strings are encoded to bytes using
the value of the window's :attr:`encoding` attribute; this defaults to
the default system encoding as returned by
:func:`locale.getpreferredencoding`.
The :meth:`~curses.window.addch` methods take a character, which can be
either a string of length 1, a bytestring of length 1, or an integer.
Constants are provided for extension characters; these constants are
integers greater than 255. For example, :const:`ACS_PLMINUS` is a +/-
symbol, and :const:`ACS_ULCORNER` is the upper left corner of a box
(handy for drawing borders). You can also use the appropriate Unicode
character.
Windows remember where the cursor was left after the last operation, so if you
leave out the *y,x* coordinates, the string or character will be displayed
wherever the last operation left off. You can also move the cursor with the
``move(y,x)`` method. Because some terminals always display a flashing cursor,
you may want to ensure that the cursor is positioned in some location where it
won't be distracting; it can be confusing to have the cursor blinking at some
apparently random location.
If your application doesn't need a blinking cursor at all, you can
call ``curs_set(False)`` to make it invisible. For compatibility
with older curses versions, there's also a ``leaveok(bool)`` window method
with an effect similar to :func:`~curses.curs_set`. When *bool* is true, the
curses library will attempt to suppress the flashing cursor, and you
won't need to worry about leaving it in odd locations.
Attributes and Color
--------------------
Characters can be displayed in different ways. Status lines in a text-based
application are commonly shown in reverse video, or a text viewer may need to
highlight certain words. curses supports this by allowing you to specify an
attribute for each cell on the screen.
An attribute is an integer, each bit representing a different
attribute. You can try to display text with multiple attribute bits
set, but curses doesn't guarantee that all the possible combinations
are available, or that they're all visually distinct. That depends on
the ability of the terminal being used, so it's safest to stick to the
most commonly available attributes, listed here.
+----------------------+--------------------------------------+
| Attribute | Description |
+======================+======================================+
| :const:`A_BLINK` | Blinking text |
+----------------------+--------------------------------------+
| :const:`A_BOLD` | Extra bright or bold text |
+----------------------+--------------------------------------+
| :const:`A_DIM` | Half bright text |
+----------------------+--------------------------------------+
| :const:`A_REVERSE` | Reverse-video text |
+----------------------+--------------------------------------+
| :const:`A_STANDOUT` | The best highlighting mode available |
+----------------------+--------------------------------------+
| :const:`A_UNDERLINE` | Underlined text |
+----------------------+--------------------------------------+
So, to display a reverse-video status line on the top line of the screen, you
could code::

   stdscr.addstr(0, 0, "Current mode: Typing mode",
                 curses.A_REVERSE)
   stdscr.refresh()
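Because each attribute is a bit in an integer, attributes can be combined and
inspected with ordinary bitwise operators. The short sketch below only
manipulates the constants, so it runs without initializing curses; whether a
given combination is rendered distinctly still depends on the terminal.

```python
import curses

# Attribute constants are plain integers, usable before initscr().
style = curses.A_BOLD | curses.A_UNDERLINE

print(bool(style & curses.A_BOLD))      # True: the bold bit is set
print(bool(style & curses.A_REVERSE))   # False: reverse video was not requested

# Clearing one bit leaves the others intact.
style_without_bold = style & ~curses.A_BOLD
print(style_without_bold == curses.A_UNDERLINE)   # True
```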
The curses library also supports color on those terminals that provide it. The
most common such terminal is probably the Linux console, followed by color
xterms.
To use color, you must call the :func:`~curses.start_color` function soon
after calling :func:`~curses.initscr`, to initialize the default color set
(the :func:`curses.wrapper` function does this automatically). Once that's
done, the :func:`~curses.has_colors` function returns ``True`` if the terminal
in use can
actually display color. (Note: curses uses the American spelling 'color',
instead of the Canadian/British spelling 'colour'. If you're used to the
British spelling, you'll have to resign yourself to misspelling it for the sake
of these functions.)
The curses library maintains a finite number of color pairs, containing a
foreground (or text) color and a background color. You can get the attribute
value corresponding to a color pair with the :func:`~curses.color_pair`
function; this can be bitwise-OR'ed with other attributes such as
:const:`A_REVERSE`, but again, such combinations are not guaranteed to work
on all terminals.
An example, which displays a line of text using color pair 1::

   stdscr.addstr("Pretty text", curses.color_pair(1))
   stdscr.refresh()
As I said before, a color pair consists of a foreground and background color.
The ``init_pair(n, f, b)`` function changes the definition of color pair *n* to
foreground color *f* and background color *b*. Color pair 0 is hard-wired to white
on black, and cannot be changed.
Colors are numbered, and :func:`start_color` initializes 8 basic
colors when it activates color mode. They are: 0:black, 1:red,
2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white. The :mod:`curses`
module defines named constants for each of these colors:
:const:`curses.COLOR_BLACK`, :const:`curses.COLOR_RED`, and so forth.
Let's put all this together. To change color pair 1 to red text on a white
background, you would call::

   curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
When you change a color pair, any text already displayed using that color pair
will change to the new colors. You can also display new text in this color
with::

   stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1))
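Putting the color calls into one program might look like the sketch below.
It assumes :func:`curses.wrapper` is used (so :func:`~curses.start_color` has
already been attempted) and falls back to bold text on terminals without
color; the fallback and the ``isatty()`` guard are additions for
illustration.

```python
import curses
import sys


def main(stdscr):
    # curses.wrapper() has already called start_color() where possible.
    if curses.has_colors():
        curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
        attr = curses.color_pair(1) | curses.A_BOLD
    else:
        attr = curses.A_BOLD          # monochrome fallback
    stdscr.addstr(0, 0, "RED ALERT!", attr)
    stdscr.refresh()
    stdscr.getkey()


if __name__ == "__main__" and sys.stdin.isatty() and sys.stdout.isatty():
    curses.wrapper(main)
```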
Very fancy terminals can change the definitions of the actual colors to a given
RGB value. This lets you change color 1, which is usually red, to purple or
blue or any other color you like. Unfortunately, the Linux console doesn't
support this, so I'm unable to try it out, and can't provide any examples. You
can check if your terminal can do this by calling
:func:`~curses.can_change_color`, which returns ``True`` if the capability is
there. If you're lucky enough to have such a talented terminal, consult your
system's man pages for more information.
User Input
==========
The C curses library offers only very simple input mechanisms. Python's
:mod:`curses` module adds a basic text-input widget. (Other libraries
such as `Urwid <https://pypi.org/project/urwid/>`_ have more extensive
collections of widgets.)
There are two methods for getting input from a window:
* :meth:`~curses.window.getch` refreshes the screen and then waits for
the user to hit a key, displaying the key if :func:`~curses.echo` has been
called earlier. You can optionally specify a coordinate to which
the cursor should be moved before pausing.
* :meth:`~curses.window.getkey` does the same thing but converts the
integer to a string. Individual characters are returned as
1-character strings, and special keys such as function keys return
longer strings containing a key name such as ``KEY_UP`` or ``^G``.
It's possible to not wait for the user using the
:meth:`~curses.window.nodelay` window method. After ``nodelay(True)``,
:meth:`getch` and :meth:`getkey` for the window become
non-blocking. To signal that no input is ready, :meth:`getch` returns
``curses.ERR`` (a value of -1) and :meth:`getkey` raises an exception.
There's also a :func:`~curses.halfdelay` function, which can be used to (in
effect) set a timer on each :meth:`getch`; if no input becomes
available within a specified delay (measured in tenths of a second),
curses raises an exception.
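A non-blocking input loop built on ``nodelay(True)`` could be sketched as
follows. The idle counter and the 0.1 second sleep are arbitrary choices made
for this example, and the ``isatty()`` guard is an addition so the code does
nothing without a terminal.

```python
import curses
import sys
import time


def main(stdscr):
    stdscr.nodelay(True)           # getch() no longer blocks
    idle_ticks = 0
    while True:
        c = stdscr.getch()
        if c == ord('q'):
            break
        if c == curses.ERR:        # -1: no key was waiting
            idle_ticks += 1
            stdscr.addstr(0, 0, "idle ticks: {}  (press q to quit)"
                          .format(idle_ticks))
            time.sleep(0.1)        # avoid spinning the CPU


if __name__ == "__main__" and sys.stdin.isatty() and sys.stdout.isatty():
    curses.wrapper(main)
```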
The :meth:`getch` method returns an integer; if it's between 0 and 255, it
represents the ASCII code of the key pressed. Values greater than 255 are
special keys such as Page Up, Home, or the cursor keys. You can compare the
value returned to constants such as :const:`curses.KEY_PPAGE`,
:const:`curses.KEY_HOME`, or :const:`curses.KEY_LEFT`. The main loop of
your program may look something like this::

   while True:
       c = stdscr.getch()
       if c == ord('p'):
           PrintDocument()
       elif c == ord('q'):
           break  # Exit the while loop
       elif c == curses.KEY_HOME:
           x = y = 0
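The key-handling part of such a loop can be factored into a small function
that is easy to test without a terminal. In this sketch ``MOVES`` and
``move_cursor`` are hypothetical names, and *lines* and *cols* stand in for
:data:`curses.LINES` and :data:`curses.COLS`.

```python
import curses

# (dy, dx) offsets for the cursor keys; note the y, x ordering.
MOVES = {
    curses.KEY_UP: (-1, 0),
    curses.KEY_DOWN: (1, 0),
    curses.KEY_LEFT: (0, -1),
    curses.KEY_RIGHT: (0, 1),
}


def move_cursor(y, x, key, lines, cols):
    """Return the new (y, x) after *key*, clamped to a lines x cols screen."""
    if key == curses.KEY_HOME:
        return 0, 0
    dy, dx = MOVES.get(key, (0, 0))
    new_y = min(max(y + dy, 0), lines - 1)
    new_x = min(max(x + dx, 0), cols - 1)
    return new_y, new_x


print(move_cursor(0, 0, curses.KEY_DOWN, 24, 80))   # (1, 0)
print(move_cursor(0, 0, curses.KEY_LEFT, 24, 80))   # (0, 0): clamped at the edge
print(move_cursor(5, 7, ord('p'), 24, 80))          # (5, 7): not a movement key
```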
The :mod:`curses.ascii` module supplies ASCII class membership functions that
take either integer or 1-character string arguments; these may be useful in
writing more readable tests for such loops. It also supplies
conversion functions that take either integer or 1-character-string arguments
and return the same type. For example, :func:`curses.ascii.ctrl` returns the
control character corresponding to its argument.
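A few calls show both argument styles. This sketch uses only functions and
constants from :mod:`curses.ascii`, so it runs without initializing curses.

```python
import curses.ascii

# Class membership tests accept a 1-character string or an integer.
print(curses.ascii.isdigit('7'))         # True
print(curses.ascii.isdigit(ord('x')))    # False
print(curses.ascii.isprint(0x07))        # False: BEL is a control character

# Conversion functions return the same type they were given.
print(repr(curses.ascii.ctrl('g')))      # '\x07'
print(curses.ascii.ctrl(ord('g')))       # 7
print(curses.ascii.ctrl(ord('g')) == curses.ascii.BEL)   # True
```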
There's also a method to retrieve an entire string,
:meth:`~curses.window.getstr`. It isn't used very often, because its
functionality is quite limited; the only editing keys available are
the backspace key and the Enter key, which terminates the string. It
can optionally be limited to a fixed number of characters. ::

   curses.echo()            # Enable echoing of characters

   # Get a 15-character string, with the cursor on the top line
   s = stdscr.getstr(0,0, 15)
The :mod:`curses.textpad` module supplies a text box that supports an
Emacs-like set of keybindings. Various methods of the
:class:`~curses.textpad.Textbox` class support editing with input
validation and gathering the edit results either with or without
trailing spaces. Here's an example::

   import curses
   from curses.textpad import Textbox, rectangle

   def main(stdscr):
       stdscr.addstr(0, 0, "Enter IM message: (hit Ctrl-G to send)")

       editwin = curses.newwin(5,30, 2,1)
       rectangle(stdscr, 1,0, 1+5+1, 1+30+1)
       stdscr.refresh()

       box = Textbox(editwin)

       # Let the user edit until Ctrl-G is struck.
       box.edit()

       # Get resulting contents
       message = box.gather()
See the library documentation on :mod:`curses.textpad` for more details.
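Input validation works by passing a function to ``Textbox.edit()``; it is
called with each keystroke, and the value it returns is what the text box
acts on. As a hedged sketch, the hypothetical validator below maps the Enter
key to Ctrl-G so that Enter ends editing; the function name is an assumption
made for this example.

```python
import curses.ascii


def enter_terminates(ch):
    """Validator for Textbox.edit(): treat Enter like Ctrl-G."""
    if ch in (curses.ascii.NL, curses.ascii.CR):
        return curses.ascii.BEL     # Ctrl-G, which ends editing
    return ch


print(enter_terminates(ord('\n')) == curses.ascii.BEL)   # True
print(enter_terminates(ord('a')) == ord('a'))            # True
```

In a program like the one above it would be used as
``box.edit(enter_terminates)``.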
For More Information
====================
This HOWTO doesn't cover some advanced topics, such as reading the
contents of the screen or capturing mouse events from an xterm
instance, but the Python library page for the :mod:`curses` module is now
reasonably complete. You should browse it next.
If you're in doubt about the detailed behavior of the curses
functions, consult the manual pages for your curses implementation,
whether it's ncurses or a proprietary Unix vendor's. The manual pages
will document any quirks, and provide complete lists of all the
functions, attributes, and :const:`ACS_\*` characters available to
you.
Because the curses API is so large, some functions aren't supported in
the Python interface. Often this isn't because they're difficult to
implement, but because no one has needed them yet. Also, Python
doesn't yet support the menu library associated with ncurses.
Patches adding support for these would be welcome; see
`the Python Developer's Guide <https://devguide.python.org/>`_ to
learn more about submitting patches to Python.
* `Writing Programs with NCURSES <http://invisible-island.net/ncurses/ncurses-intro.html>`_:
a lengthy tutorial for C programmers.
* `The ncurses man page <http://linux.die.net/man/3/ncurses>`_
* `The ncurses FAQ <http://invisible-island.net/ncurses/ncurses.faq.html>`_
* `"Use curses... don't swear" <https://www.youtube.com/watch?v=eN1eZtjLEnU>`_:
video of a PyCon 2013 talk on controlling terminals using curses or Urwid.
* `"Console Applications with Urwid" <http://www.pyvideo.org/video/1568/console-applications-with-urwid>`_:
video of a PyCon CA 2012 talk demonstrating some applications written using
Urwid.

======================
Descriptor HowTo Guide
======================
:Author: Raymond Hettinger
:Contact: <python at rcn dot com>
.. Contents::
Abstract
--------
Defines descriptors, summarizes the protocol, and shows how descriptors are
called. Examines a custom descriptor and several built-in Python descriptors
including functions, properties, static methods, and class methods. Shows how
each works by giving a pure Python equivalent and a sample application.
Learning about descriptors not only provides access to a larger toolset, it
creates a deeper understanding of how Python works and an appreciation for the
elegance of its design.
Definition and Introduction
---------------------------
In general, a descriptor is an object attribute with "binding behavior", one
whose attribute access has been overridden by methods in the descriptor
protocol. Those methods are :meth:`__get__`, :meth:`__set__`, and
:meth:`__delete__`. If any of those methods are defined for an object, it is
said to be a descriptor.
The default behavior for attribute access is to get, set, or delete the
attribute from an object's dictionary. For instance, ``a.x`` has a lookup chain
starting with ``a.__dict__['x']``, then ``type(a).__dict__['x']``, and
continuing through the base classes of ``type(a)`` excluding metaclasses. If the
looked-up value is an object defining one of the descriptor methods, then Python
may override the default behavior and invoke the descriptor method instead.
Where this occurs in the precedence chain depends on which descriptor methods
were defined.
Descriptors are a powerful, general purpose protocol. They are the mechanism
behind properties, methods, static methods, class methods, and :func:`super()`.
They are used throughout Python itself to implement the new style classes
introduced in version 2.2. Descriptors simplify the underlying C-code and offer
a flexible set of new tools for everyday Python programs.
Descriptor Protocol
-------------------
``descr.__get__(self, obj, type=None) --> value``
``descr.__set__(self, obj, value) --> None``
``descr.__delete__(self, obj) --> None``
That is all there is to it. Define any of these methods and an object is
considered a descriptor and can override default behavior upon being looked up
as an attribute.
If an object defines both :meth:`__get__` and :meth:`__set__`, it is considered
a data descriptor. Descriptors that only define :meth:`__get__` are called
non-data descriptors (they are typically used for methods but other uses are
possible).
Data and non-data descriptors differ in how overrides are calculated with
respect to entries in an instance's dictionary. If an instance's dictionary
has an entry with the same name as a data descriptor, the data descriptor
takes precedence. If an instance's dictionary has an entry with the same
name as a non-data descriptor, the dictionary entry takes precedence.
To make a read-only data descriptor, define both :meth:`__get__` and
:meth:`__set__` with the :meth:`__set__` raising an :exc:`AttributeError` when
called. Defining the :meth:`__set__` method with an exception raising
placeholder is enough to make it a data descriptor.
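The precedence rules can be checked directly by planting entries in an
instance dictionary. The class and attribute names in this sketch
(``ReadOnly``, ``Plain``, ``Demo``, ``locked``, ``loose``) are made up for
illustration.

```python
class ReadOnly:
    """Data descriptor: defines __get__ and a __set__ that always raises."""

    def __get__(self, obj, objtype=None):
        return "from data descriptor"

    def __set__(self, obj, value):
        raise AttributeError("read-only attribute")


class Plain:
    """Non-data descriptor: defines only __get__."""

    def __get__(self, obj, objtype=None):
        return "from non-data descriptor"


class Demo:
    locked = ReadOnly()
    loose = Plain()


demo = Demo()
# Write straight into the instance dictionary, bypassing __set__.
demo.__dict__['locked'] = "from instance dict"
demo.__dict__['loose'] = "from instance dict"

print(demo.locked)   # from data descriptor
print(demo.loose)    # from instance dict

try:
    demo.locked = 1
except AttributeError as exc:
    print("rejected:", exc)   # rejected: read-only attribute
```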
Invoking Descriptors
--------------------
A descriptor can be called directly by its method name. For example,
``d.__get__(obj)``.
Alternatively, it is more common for a descriptor to be invoked automatically
upon attribute access. For example, ``obj.d`` looks up ``d`` in the dictionary
of ``obj``. If ``d`` defines the method :meth:`__get__`, then ``d.__get__(obj)``
is invoked according to the precedence rules listed below.
The details of invocation depend on whether ``obj`` is an object or a class.
For objects, the machinery is in :meth:`object.__getattribute__` which
transforms ``b.x`` into ``type(b).__dict__['x'].__get__(b, type(b))``. The
implementation works through a precedence chain that gives data descriptors
priority over instance variables, instance variables priority over non-data
descriptors, and assigns lowest priority to :meth:`__getattr__` if provided.
The full C implementation can be found in :c:func:`PyObject_GenericGetAttr()` in
:source:`Objects/object.c`.
For classes, the machinery is in :meth:`type.__getattribute__` which transforms
``B.x`` into ``B.__dict__['x'].__get__(None, B)``. In pure Python, it looks
like::

   def __getattribute__(self, key):
       "Emulate type_getattro() in Objects/typeobject.c"
       v = object.__getattribute__(self, key)
       if hasattr(v, '__get__'):
           return v.__get__(None, self)
       return v
The important points to remember are:
* descriptors are invoked by the :meth:`__getattribute__` method
* overriding :meth:`__getattribute__` prevents automatic descriptor calls
* :meth:`object.__getattribute__` and :meth:`type.__getattribute__` make
different calls to :meth:`__get__`.
* data descriptors always override instance dictionaries.
* non-data descriptors may be overridden by instance dictionaries.
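The two transformations can be verified by hand for an ordinary method,
since functions are non-data descriptors. The class ``D`` below is a
throwaway example.

```python
class D:
    def f(self, x):
        return x


d = D()

# Instance access: type(d).__dict__['f'].__get__(d, type(d))
manual_bound = type(d).__dict__['f'].__get__(d, type(d))
print(manual_bound == d.f)        # True
print(manual_bound(42))           # 42

# Class access: D.__dict__['f'].__get__(None, D)
manual_unbound = D.__dict__['f'].__get__(None, D)
print(manual_unbound is D.f)      # True
```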
The object returned by ``super()`` also has a custom :meth:`__getattribute__`
method for invoking descriptors. The call ``super(B, obj).m()`` searches
``obj.__class__.__mro__`` for the base class ``A`` immediately following ``B``
and then returns ``A.__dict__['m'].__get__(obj, B)``. If not a descriptor,
``m`` is returned unchanged. If not in the dictionary, ``m`` reverts to a
search using :meth:`object.__getattribute__`.
The implementation details are in :c:func:`super_getattro()` in
:source:`Objects/typeobject.c`, and a pure Python equivalent can be found in
`Guido's Tutorial`_.
.. _`Guido's Tutorial`: https://www.python.org/download/releases/2.2.3/descrintro/#cooperation
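The ``super()`` lookup described above can be compared with the explicit
descriptor call. This is a small check using two throwaway classes, not a
reimplementation of ``super()``.

```python
class A:
    def m(self):
        return "A.m"


class B(A):
    def m(self):
        return "B.m"


obj = B()

via_super = super(B, obj).m
via_descriptor = A.__dict__['m'].__get__(obj, B)

print(via_super())                    # A.m
print(via_super == via_descriptor)    # True: same function, same instance
```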
The details above show that the mechanism for descriptors is embedded in the
:meth:`__getattribute__()` methods for :class:`object`, :class:`type`, and
:func:`super`. Classes inherit this machinery when they derive from
:class:`object` or if they have a meta-class providing similar functionality.
Likewise, classes can turn off descriptor invocation by overriding
:meth:`__getattribute__()`.
Descriptor Example
------------------
The following code creates a class whose objects are data descriptors which
print a message for each get or set. Overriding :meth:`__getattribute__` is
an alternate approach that could do this for every attribute. However, this
descriptor is useful for monitoring just a few chosen attributes::

   class RevealAccess(object):
       """A data descriptor that sets and returns values
          normally and prints a message logging their access.
       """

       def __init__(self, initval=None, name='var'):
           self.val = initval
           self.name = name

       def __get__(self, obj, objtype):
           print('Retrieving', self.name)
           return self.val

       def __set__(self, obj, val):
           print('Updating', self.name)
           self.val = val

   >>> class MyClass(object):
   ...     x = RevealAccess(10, 'var "x"')
   ...     y = 5
   ...
   >>> m = MyClass()
   >>> m.x
   Retrieving var "x"
   10
   >>> m.x = 20
   Updating var "x"
   >>> m.x
   Retrieving var "x"
   20
   >>> m.y
   5
The protocol is simple and offers exciting possibilities. Several use cases are
so common that they have been packaged into individual function calls.
Properties, bound methods, static methods, and class methods are all
based on the descriptor protocol.
Properties
----------
Calling :func:`property` is a succinct way of building a data descriptor that
triggers function calls upon access to an attribute. Its signature is::

   property(fget=None, fset=None, fdel=None, doc=None) -> property attribute
The documentation shows a typical use to define a managed attribute ``x``::

   class C(object):
       def getx(self): return self.__x
       def setx(self, value): self.__x = value
       def delx(self): del self.__x
       x = property(getx, setx, delx, "I'm the 'x' property.")
To see how :func:`property` is implemented in terms of the descriptor protocol,
here is a pure Python equivalent::

   class Property(object):
       "Emulate PyProperty_Type() in Objects/descrobject.c"

       def __init__(self, fget=None, fset=None, fdel=None, doc=None):
           self.fget = fget
           self.fset = fset
           self.fdel = fdel
           if doc is None and fget is not None:
               doc = fget.__doc__
           self.__doc__ = doc

       def __get__(self, obj, objtype=None):
           if obj is None:
               return self
           if self.fget is None:
               raise AttributeError("unreadable attribute")
           return self.fget(obj)

       def __set__(self, obj, value):
           if self.fset is None:
               raise AttributeError("can't set attribute")
           self.fset(obj, value)

       def __delete__(self, obj):
           if self.fdel is None:
               raise AttributeError("can't delete attribute")
           self.fdel(obj)

       def getter(self, fget):
           return type(self)(fget, self.fset, self.fdel, self.__doc__)

       def setter(self, fset):
           return type(self)(self.fget, fset, self.fdel, self.__doc__)

       def deleter(self, fdel):
           return type(self)(self.fget, self.fset, fdel, self.__doc__)
The :func:`property` builtin helps whenever a user interface has granted
attribute access and then subsequent changes require the intervention of a
method.
For instance, a spreadsheet class may grant access to a cell value through
``Cell('b10').value``. Subsequent improvements to the program require the cell
to be recalculated on every access; however, the programmer does not want to
affect existing client code accessing the attribute directly. The solution is
to wrap access to the value attribute in a property data descriptor::

   class Cell(object):
       . . .
       def getvalue(self):
           "Recalculate the cell before returning value"
           self.recalc()
           return self._value
       value = property(getvalue)
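To make the fragment above runnable, the elided parts have to be filled in
somehow. The version below is one possible completion, not the original
spreadsheet code: the ``formula`` callable, the ``recalc_count`` counter, and
the body of ``recalc()`` are assumptions added purely for demonstration.

```python
class Cell:
    """Toy cell whose value is recomputed on every read."""

    def __init__(self, formula):
        self.formula = formula        # a zero-argument callable (assumption)
        self.recalc_count = 0
        self._value = None

    def recalc(self):
        self.recalc_count += 1
        self._value = self.formula()

    def getvalue(self):
        "Recalculate the cell before returning value"
        self.recalc()
        return self._value

    value = property(getvalue)


cell = Cell(lambda: 2 + 3)
print(cell.value)          # 5
print(cell.value)          # 5
print(cell.recalc_count)   # 2: one recalculation per access
```

Client code keeps writing ``cell.value`` as before. Because no setter was
supplied, assigning to ``cell.value`` raises :exc:`AttributeError`.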
Functions and Methods
---------------------
Python's object oriented features are built upon a function based environment.
Using non-data descriptors, the two are merged seamlessly.
Class dictionaries store methods as functions. In a class definition, methods
are written using :keyword:`def` or :keyword:`lambda`, the usual tools for
creating functions. Methods only differ from regular functions in that the
first argument is reserved for the object instance. By Python convention, the
instance reference is called *self* but may be called *this* or any other
variable name.
To support method calls, functions include the :meth:`__get__` method for
binding methods during attribute access. This means that all functions are
non-data descriptors which return bound methods when they are invoked from an
object. In pure Python, it works like this::
class Function(object):
. . .
def __get__(self, obj, objtype=None):
"Simulate func_descr_get() in Objects/funcobject.c"
if obj is None:
return self
return types.MethodType(self, obj)
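The same binding step can be observed by calling ``__get__`` by hand on a real function. This is only a small sketch; the class ``D`` and the variable names are illustrative.

```python
import types


class D:
    def f(self, x):
        return x


d = D()
func = D.__dict__["f"]            # the plain function stored in the class dict

# Calling the descriptor method explicitly mirrors what ``d.f`` does.
bound = func.__get__(d, D)
assert isinstance(bound, types.MethodType)

# Access through the class passes obj=None, so the function comes back unchanged.
unbound = func.__get__(None, D)
```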
Running the interpreter shows how the function descriptor works in practice::
>>> class D(object):
... def f(self, x):
... return x
...
>>> d = D()
# Access through the class dictionary does not invoke __get__.
# It just returns the underlying function object.
>>> D.__dict__['f']
<function D.f at 0x00C45070>
# Dotted access from a class calls __get__() which just returns
# the underlying function unchanged.
>>> D.f
<function D.f at 0x00C45070>
# The function has a __qualname__ attribute to support introspection
>>> D.f.__qualname__
'D.f'
# Dotted access from an instance calls __get__() which returns the
# function wrapped in a bound method object
>>> d.f
<bound method D.f of <__main__.D object at 0x00B18C90>>
# Internally, the bound method stores the underlying function and
# the bound instance; the method object itself is of type 'method'.
>>> d.f.__func__
<function D.f at 0x1012e5ae8>
>>> d.f.__self__
<__main__.D object at 0x1012e1f98>
>>> d.f.__class__
<class 'method'>
Static Methods and Class Methods
--------------------------------
Non-data descriptors provide a simple mechanism for variations on the usual
patterns of binding functions into methods.
To recap, functions have a :meth:`__get__` method so that they can be converted
to a method when accessed as attributes. The non-data descriptor transforms an
``obj.f(*args)`` call into ``f(obj, *args)``. Calling ``klass.f(*args)``
becomes ``f(*args)``.
This chart summarizes the binding and its two most useful variants:
+-----------------+----------------------+------------------+
| Transformation | Called from an | Called from a |
| | Object | Class |
+=================+======================+==================+
| function | f(obj, \*args) | f(\*args) |
+-----------------+----------------------+------------------+
| staticmethod | f(\*args) | f(\*args) |
+-----------------+----------------------+------------------+
| classmethod | f(type(obj), \*args) | f(klass, \*args) |
+-----------------+----------------------+------------------+
Static methods return the underlying function without changes. Calling either
``c.f`` or ``C.f`` is the equivalent of a direct lookup into
``object.__getattribute__(c, "f")`` or ``object.__getattribute__(C, "f")``. As a
result, the function becomes identically accessible from either an object or a
class.
Good candidates for static methods are methods that do not reference the
``self`` variable.
For instance, a statistics package may include a container class for
experimental data. The class provides normal methods for computing the average,
mean, median, and other descriptive statistics that depend on the data. However,
there may be useful functions which are conceptually related but do not depend
on the data. For instance, ``erf(x)`` is a handy conversion routine that comes up
in statistical work but does not directly depend on a particular dataset.
It can be called either from an object or the class: ``s.erf(1.5) --> .9332`` or
``Sample.erf(1.5) --> .9332``.
Since staticmethods return the underlying function with no changes, the example
calls are unexciting::
>>> class E(object):
... def f(x):
... print(x)
... f = staticmethod(f)
...
>>> E.f(3)
3
>>> E().f(3)
3
Using the non-data descriptor protocol, a pure Python version of
:func:`staticmethod` would look like this::
class StaticMethod(object):
"Emulate PyStaticMethod_Type() in Objects/funcobject.c"
def __init__(self, f):
self.f = f
def __get__(self, obj, objtype=None):
return self.f
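As a quick check of the emulation, the sketch below uses the pure Python class as a decorator. The class ``E`` and its doubling function are made up for the example.

```python
class StaticMethod:
    "Emulate the built-in staticmethod with a non-data descriptor"

    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        # Ignore both the instance and the class: hand back the raw function.
        return self.f


class E:
    @StaticMethod
    def f(x):
        return x * 2


from_class = E.f(3)
from_instance = E().f(3)
```

Both lookups return the very same function object, which is why the two calls behave identically.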
Unlike static methods, class methods prepend the class reference to the
argument list before calling the function. This format is the same
whether the caller is an object or a class::
>>> class E(object):
... def f(klass, x):
... return klass.__name__, x
... f = classmethod(f)
...
>>> print(E.f(3))
('E', 3)
>>> print(E().f(3))
('E', 3)
This behavior is useful whenever the function only needs to have a class
reference and does not care about any underlying data. One use for classmethods
is to create alternate class constructors. In Python 2.3, the classmethod
:func:`dict.fromkeys` creates a new dictionary from a list of keys. The pure
Python equivalent is::
class Dict(object):
. . .
def fromkeys(klass, iterable, value=None):
"Emulate dict_fromkeys() in Objects/dictobject.c"
d = klass()
for key in iterable:
d[key] = value
return d
fromkeys = classmethod(fromkeys)
Now a new dictionary of unique keys can be constructed like this::
>>> Dict.fromkeys('abracadabra')
{'a': None, 'r': None, 'b': None, 'c': None, 'd': None}
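To make the snippet above runnable, one possible assumption is that ``Dict`` subclasses the built-in :class:`dict`; the original leaves the rest of the class elided, so treat this as a sketch rather than the actual implementation.

```python
class Dict(dict):
    # Subclassing dict is an assumption; the original elides the class body.
    @classmethod
    def fromkeys(klass, iterable, value=None):
        "Emulate dict_fromkeys() in Objects/dictobject.c"
        d = klass()
        for key in iterable:
            d[key] = value
        return d


letters = Dict.fromkeys("abracadabra")
```

Because the class is passed in as ``klass``, the result is an instance of ``Dict`` rather than a plain ``dict``.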
Using the non-data descriptor protocol, a pure Python version of
:func:`classmethod` would look like this::
class ClassMethod(object):
"Emulate PyClassMethod_Type() in Objects/funcobject.c"
def __init__(self, f):
self.f = f
def __get__(self, obj, klass=None):
if klass is None:
klass = type(obj)
def newfunc(*args):
return self.f(klass, *args)
return newfunc
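A short usage sketch shows the class reference being prepended for both kinds of caller, including a subclass. ``Base``, ``Derived`` and ``create`` are hypothetical names.

```python
class ClassMethod:
    "Emulate the built-in classmethod with a non-data descriptor"

    def __init__(self, f):
        self.f = f

    def __get__(self, obj, klass=None):
        if klass is None:
            klass = type(obj)

        def newfunc(*args):
            # Prepend the class, whichever way the attribute was reached.
            return self.f(klass, *args)

        return newfunc


class Base:
    @ClassMethod
    def create(klass, x):
        return klass.__name__, x


class Derived(Base):
    pass


via_class = Base.create(3)
via_instance = Base().create(3)
via_subclass = Derived.create(1)
```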

File diff suppressed because it is too large

32
third_party/python/Doc/howto/index.rst vendored Normal file

@ -0,0 +1,32 @@
***************
Python HOWTOs
***************
Python HOWTOs are documents that cover a single, specific topic,
and attempt to cover it fairly completely. Modelled on the Linux
Documentation Project's HOWTO collection, this collection is an
effort to foster documentation that's more detailed than the
Python Library Reference.
Currently, the HOWTOs are:
.. toctree::
:maxdepth: 1
pyporting.rst
cporting.rst
curses.rst
descriptor.rst
functional.rst
logging.rst
logging-cookbook.rst
regex.rst
sockets.rst
sorting.rst
unicode.rst
urllib2.rst
argparse.rst
ipaddress.rst
clinic.rst
instrumentation.rst


@ -0,0 +1,412 @@
.. highlight:: shell-session
.. _instrumentation:
===============================================
Instrumenting CPython with DTrace and SystemTap
===============================================
:author: David Malcolm
:author: Łukasz Langa
DTrace and SystemTap are monitoring tools, each providing a way to inspect
what the processes on a computer system are doing. They both use
domain-specific languages allowing a user to write scripts which:
- filter which processes are to be observed
- gather data from the processes of interest
- generate reports on the data
As of Python 3.6, CPython can be built with embedded "markers", also
known as "probes", that can be observed by a DTrace or SystemTap script,
making it easier to monitor what the CPython processes on a system are
doing.
.. impl-detail::
DTrace markers are implementation details of the CPython interpreter.
No guarantees are made about probe compatibility between versions of
CPython. DTrace scripts can stop working or work incorrectly without
warning when changing CPython versions.
Enabling the static markers
---------------------------
macOS comes with built-in support for DTrace. On Linux, in order to
build CPython with the embedded markers for SystemTap, the SystemTap
development tools must be installed.
On a Linux machine, this can be done via::
$ yum install systemtap-sdt-devel
or::
$ sudo apt-get install systemtap-sdt-dev
CPython must then be configured ``--with-dtrace``:
.. code-block:: none
checking for --with-dtrace... yes
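For an interpreter that is already installed, a rough way to check whether it was configured this way is to inspect the recorded configure arguments from Python. This is a heuristic sketch: it assumes a POSIX build that records ``CONFIG_ARGS``, and the helper name ``built_with_dtrace`` is invented here.

```python
import sysconfig


def built_with_dtrace():
    """Best-effort check: was ``--with-dtrace`` among the configure arguments?"""
    # CONFIG_ARGS is recorded by POSIX builds; it may be None on other platforms.
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    return "--with-dtrace" in config_args


print(built_with_dtrace())
```

A ``False`` result is not conclusive, so listing the probes as described in the rest of this section remains the authoritative check.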
On macOS, you can list available DTrace probes by running a Python
process in the background and listing all probes made available by the
Python provider::
$ python3.6 -q &
$ sudo dtrace -l -P python$! # or: dtrace -l -m python3.6
ID PROVIDER MODULE FUNCTION NAME
29564 python18035 python3.6 _PyEval_EvalFrameDefault function-entry
29565 python18035 python3.6 dtrace_function_entry function-entry
29566 python18035 python3.6 _PyEval_EvalFrameDefault function-return
29567 python18035 python3.6 dtrace_function_return function-return
29568 python18035 python3.6 collect gc-done
29569 python18035 python3.6 collect gc-start
29570 python18035 python3.6 _PyEval_EvalFrameDefault line
29571 python18035 python3.6 maybe_dtrace_line line
On Linux, you can verify if the SystemTap static markers are present in
the built binary by seeing if it contains a ".note.stapsdt" section.
::
$ readelf -S ./python | grep .note.stapsdt
[30] .note.stapsdt NOTE 0000000000000000 00308d78
If you've built Python as a shared library (with ``--enable-shared``), you
need to look instead within the shared library. For example::
$ readelf -S libpython3.3dm.so.1.0 | grep .note.stapsdt
[29] .note.stapsdt NOTE 0000000000000000 00365b68
Sufficiently modern readelf can print the metadata::
$ readelf -n ./python
Displaying notes found at file offset 0x00000254 with length 0x00000020:
Owner Data size Description
GNU 0x00000010 NT_GNU_ABI_TAG (ABI version tag)
OS: Linux, ABI: 2.6.32
Displaying notes found at file offset 0x00000274 with length 0x00000024:
Owner Data size Description
GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring)
Build ID: df924a2b08a7e89f6e11251d4602022977af2670
Displaying notes found at file offset 0x002d6c30 with length 0x00000144:
Owner Data size Description
stapsdt 0x00000031 NT_STAPSDT (SystemTap probe descriptors)
Provider: python
Name: gc__start
Location: 0x00000000004371c3, Base: 0x0000000000630ce2, Semaphore: 0x00000000008d6bf6
Arguments: -4@%ebx
stapsdt 0x00000030 NT_STAPSDT (SystemTap probe descriptors)
Provider: python
Name: gc__done
Location: 0x00000000004374e1, Base: 0x0000000000630ce2, Semaphore: 0x00000000008d6bf8
Arguments: -8@%rax
stapsdt 0x00000045 NT_STAPSDT (SystemTap probe descriptors)
Provider: python
Name: function__entry
Location: 0x000000000053db6c, Base: 0x0000000000630ce2, Semaphore: 0x00000000008d6be8
Arguments: 8@%rbp 8@%r12 -4@%eax
stapsdt 0x00000046 NT_STAPSDT (SystemTap probe descriptors)
Provider: python
Name: function__return
Location: 0x000000000053dba8, Base: 0x0000000000630ce2, Semaphore: 0x00000000008d6bea
Arguments: 8@%rbp 8@%r12 -4@%eax
The above metadata contains information for SystemTap describing how it
can patch strategically-placed machine code instructions to enable the
tracing hooks used by a SystemTap script.
Static DTrace probes
--------------------
The following example DTrace script can be used to show the call/return
hierarchy of a Python script, only tracing within the invocation of
a function called "start". In other words, import-time function
invocations are not going to be listed:
.. code-block:: none
self int indent;
python$target:::function-entry
/copyinstr(arg1) == "start"/
{
self->trace = 1;
}
python$target:::function-entry
/self->trace/
{
printf("%d\t%*s:", timestamp, 15, probename);
printf("%*s", self->indent, "");
printf("%s:%s:%d\n", basename(copyinstr(arg0)), copyinstr(arg1), arg2);
self->indent++;
}
python$target:::function-return
/self->trace/
{
self->indent--;
printf("%d\t%*s:", timestamp, 15, probename);
printf("%*s", self->indent, "");
printf("%s:%s:%d\n", basename(copyinstr(arg0)), copyinstr(arg1), arg2);
}
python$target:::function-return
/copyinstr(arg1) == "start"/
{
self->trace = 0;
}
It can be invoked like this::
$ sudo dtrace -q -s call_stack.d -c "python3.6 script.py"
The output looks like this:
.. code-block:: none
156641360502280 function-entry:call_stack.py:start:23
156641360518804 function-entry: call_stack.py:function_1:1
156641360532797 function-entry: call_stack.py:function_3:9
156641360546807 function-return: call_stack.py:function_3:10
156641360563367 function-return: call_stack.py:function_1:2
156641360578365 function-entry: call_stack.py:function_2:5
156641360591757 function-entry: call_stack.py:function_1:1
156641360605556 function-entry: call_stack.py:function_3:9
156641360617482 function-return: call_stack.py:function_3:10
156641360629814 function-return: call_stack.py:function_1:2
156641360642285 function-return: call_stack.py:function_2:6
156641360656770 function-entry: call_stack.py:function_3:9
156641360669707 function-return: call_stack.py:function_3:10
156641360687853 function-entry: call_stack.py:function_4:13
156641360700719 function-return: call_stack.py:function_4:14
156641360719640 function-entry: call_stack.py:function_5:18
156641360732567 function-return: call_stack.py:function_5:21
156641360747370 function-return:call_stack.py:start:28
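The traced script itself is not shown here. A hypothetical ``call_stack.py`` with the same call structure as the output above might look like the following sketch; the function bodies are invented for illustration, and the line numbers reported by the probes depend on the exact layout of your file.

```python
def function_1():
    function_3()


def function_2():
    function_1()


def function_3():
    pass


def function_4():
    pass


def function_5():
    pass


def start():
    # Only calls made while start() is active are reported by the D script.
    function_1()
    function_2()
    function_3()
    function_4()
    function_5()


if __name__ == "__main__":
    start()
```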
Static SystemTap markers
------------------------
The low-level way to use the SystemTap integration is to use the static
markers directly. This requires you to explicitly state the binary file
containing them.
For example, this SystemTap script can be used to show the call/return
hierarchy of a Python script:
.. code-block:: none
probe process("python").mark("function__entry") {
filename = user_string($arg1);
funcname = user_string($arg2);
lineno = $arg3;
printf("%s => %s in %s:%d\n",
thread_indent(1), funcname, filename, lineno);
}
probe process("python").mark("function__return") {
filename = user_string($arg1);
funcname = user_string($arg2);
lineno = $arg3;
printf("%s <= %s in %s:%d\n",
thread_indent(-1), funcname, filename, lineno);
}
It can be invoked like this::
$ stap \
show-call-hierarchy.stp \
-c "./python test.py"
The output looks like this:
.. code-block:: none
11408 python(8274): => __contains__ in Lib/_abcoll.py:362
11414 python(8274): => __getitem__ in Lib/os.py:425
11418 python(8274): => encode in Lib/os.py:490
11424 python(8274): <= encode in Lib/os.py:493
11428 python(8274): <= __getitem__ in Lib/os.py:426
11433 python(8274): <= __contains__ in Lib/_abcoll.py:366
where the columns are:
- time in microseconds since start of script
- name of executable
- PID of process
and the remainder indicates the call/return hierarchy as the script executes.
For a ``--enable-shared`` build of CPython, the markers are contained within the
libpython shared library, and the probe's dotted path needs to reflect this. For
example, this line from the above example:
.. code-block:: none
probe process("python").mark("function__entry") {
should instead read:
.. code-block:: none
probe process("python").library("libpython3.6dm.so.1.0").mark("function__entry") {
(assuming a debug build of CPython 3.6)
Available static markers
------------------------
.. I'm reusing the "c:function" type for markers
.. c:function:: function__entry(str filename, str funcname, int lineno)
This marker indicates that execution of a Python function has begun.
It is only triggered for pure-Python (bytecode) functions.
The filename, function name, and line number are provided back to the
tracing script as positional arguments, which must be accessed using
``$arg1``, ``$arg2``, ``$arg3``:
* ``$arg1`` : ``(const char *)`` filename, accessible using ``user_string($arg1)``
* ``$arg2`` : ``(const char *)`` function name, accessible using
``user_string($arg2)``
* ``$arg3`` : ``int`` line number
.. c:function:: function__return(str filename, str funcname, int lineno)
This marker is the converse of :c:func:`function__entry`, and indicates that
execution of a Python function has ended (either via ``return``, or via an
exception). It is only triggered for pure-Python (bytecode) functions.
The arguments are the same as for :c:func:`function__entry`.
.. c:function:: line(str filename, str funcname, int lineno)
This marker indicates a Python line is about to be executed. It is
the equivalent of line-by-line tracing with a Python profiler. It is
not triggered within C functions.
The arguments are the same as for :c:func:`function__entry`.
.. c:function:: gc__start(int generation)
Fires when the Python interpreter starts a garbage collection cycle.
``arg0`` is the generation to scan, like :func:`gc.collect()`.
.. c:function:: gc__done(long collected)
Fires when the Python interpreter finishes a garbage collection
cycle. ``arg0`` is the number of collected objects.
SystemTap Tapsets
-----------------
The higher-level way to use the SystemTap integration is to use a "tapset":
SystemTap's equivalent of a library, which hides some of the lower-level
details of the static markers.
Here is a tapset file, based on a non-shared build of CPython:
.. code-block:: none
/*
Provide a higher-level wrapping around the function__entry and
function__return markers:
*/
probe python.function.entry = process("python").mark("function__entry")
{
filename = user_string($arg1);
funcname = user_string($arg2);
lineno = $arg3;
frameptr = $arg4
}
probe python.function.return = process("python").mark("function__return")
{
filename = user_string($arg1);
funcname = user_string($arg2);
lineno = $arg3;
frameptr = $arg4
}
If this file is installed in SystemTap's tapset directory (e.g.
``/usr/share/systemtap/tapset``), then these additional probepoints become
available:
.. c:function:: python.function.entry(str filename, str funcname, int lineno, frameptr)
This probe point indicates that execution of a Python function has begun.
It is only triggered for pure-Python (bytecode) functions.
.. c:function:: python.function.return(str filename, str funcname, int lineno, frameptr)
This probe point is the converse of :c:func:`python.function.entry`, and
indicates that execution of a Python function has ended (either via
``return``, or via an exception). It is only triggered for pure-Python
(bytecode) functions.
Examples
--------
This SystemTap script uses the tapset above to more cleanly implement the
example given above of tracing the Python function-call hierarchy, without
needing to directly name the static markers:
.. code-block:: none
probe python.function.entry
{
printf("%s => %s in %s:%d\n",
thread_indent(1), funcname, filename, lineno);
}
probe python.function.return
{
printf("%s <= %s in %s:%d\n",
thread_indent(-1), funcname, filename, lineno);
}
The following script uses the tapset above to provide a top-like view of all
running CPython code, showing the top 20 most frequently-entered bytecode
frames, each second, across the whole system:
.. code-block:: none
global fn_calls;
probe python.function.entry
{
fn_calls[pid(), filename, funcname, lineno] += 1;
}
probe timer.ms(1000) {
printf("\033[2J\033[1;1H") /* clear screen */
printf("%6s %80s %6s %30s %6s\n",
"PID", "FILENAME", "LINE", "FUNCTION", "CALLS")
foreach ([pid, filename, funcname, lineno] in fn_calls- limit 20) {
printf("%6d %80s %6d %30s %6d\n",
pid, filename, lineno, funcname,
fn_calls[pid, filename, funcname, lineno]);
}
delete fn_calls;
}


@ -0,0 +1,340 @@
.. testsetup::
import ipaddress
.. _ipaddress-howto:
***************************************
An introduction to the ipaddress module
***************************************
:author: Peter Moody
:author: Nick Coghlan
.. topic:: Overview
This document aims to provide a gentle introduction to the
:mod:`ipaddress` module. It is aimed primarily at users that aren't
already familiar with IP networking terminology, but may also be useful
to network engineers wanting an overview of how :mod:`ipaddress`
represents IP network addressing concepts.
Creating Address/Network/Interface objects
==========================================
Since :mod:`ipaddress` is a module for inspecting and manipulating IP addresses,
the first thing you'll want to do is create some objects. You can use
:mod:`ipaddress` to create objects from strings and integers.
A Note on IP Versions
---------------------
For readers that aren't particularly familiar with IP addressing, it's
important to know that the Internet Protocol is currently in the process
of moving from version 4 of the protocol to version 6. This transition is
occurring largely because version 4 of the protocol doesn't provide enough
addresses to handle the needs of the whole world, especially given the
increasing number of devices with direct connections to the internet.
Explaining the details of the differences between the two versions of the
protocol is beyond the scope of this introduction, but readers need to at
least be aware that these two versions exist, and it will sometimes be
necessary to force the use of one version or the other.
IP Host Addresses
-----------------
Addresses, often referred to as "host addresses", are the most basic unit
when working with IP addressing. The simplest way to create addresses is
to use the :func:`ipaddress.ip_address` factory function, which automatically
determines whether to create an IPv4 or IPv6 address based on the passed in
value:
>>> ipaddress.ip_address('192.0.2.1')
IPv4Address('192.0.2.1')
>>> ipaddress.ip_address('2001:DB8::1')
IPv6Address('2001:db8::1')
Addresses can also be created directly from integers. Values that will
fit within 32 bits are assumed to be IPv4 addresses::
>>> ipaddress.ip_address(3221225985)
IPv4Address('192.0.2.1')
>>> ipaddress.ip_address(42540766411282592856903984951653826561)
IPv6Address('2001:db8::1')
To force the use of IPv4 or IPv6 addresses, the relevant classes can be
invoked directly. This is particularly useful to force creation of IPv6
addresses for small integers::
>>> ipaddress.ip_address(1)
IPv4Address('0.0.0.1')
>>> ipaddress.IPv4Address(1)
IPv4Address('0.0.0.1')
>>> ipaddress.IPv6Address(1)
IPv6Address('::1')
Defining Networks
-----------------
Host addresses are usually grouped together into IP networks, so
:mod:`ipaddress` provides a way to create, inspect and manipulate network
definitions. IP network objects are constructed from strings that define the
range of host addresses that are part of that network. The simplest form
for that information is a "network address/network prefix" pair, where the
prefix defines the number of leading bits that are compared to determine
whether or not an address is part of the network and the network address
defines the expected value of those bits.
As for addresses, a factory function is provided that determines the correct
IP version automatically::
>>> ipaddress.ip_network('192.0.2.0/24')
IPv4Network('192.0.2.0/24')
>>> ipaddress.ip_network('2001:db8::0/96')
IPv6Network('2001:db8::/96')
Network objects cannot have any host bits set. The practical effect of this
is that ``192.0.2.1/24`` does not describe a network. Such definitions are
referred to as interface objects since the ip-on-a-network notation is
commonly used to describe network interfaces of a computer on a given network
and are described further in the next section.
By default, attempting to create a network object with host bits set will
result in :exc:`ValueError` being raised. To request that the
additional bits instead be coerced to zero, the flag ``strict=False`` can
be passed to the constructor::
>>> ipaddress.ip_network('192.0.2.1/24')
Traceback (most recent call last):
...
ValueError: 192.0.2.1/24 has host bits set
>>> ipaddress.ip_network('192.0.2.1/24', strict=False)
IPv4Network('192.0.2.0/24')
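When input arrives as "some address on the network" rather than as a clean network definition, ``strict=False`` can be wrapped in a small helper. The name ``to_network`` is hypothetical; this is a sketch, not part of the module.

```python
import ipaddress


def to_network(text):
    """Return the network containing ``text``, zeroing any host bits."""
    return ipaddress.ip_network(text, strict=False)


net4 = to_network("192.0.2.77/24")
net6 = to_network("2001:db8::1/96")
```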
While the string form offers significantly more flexibility, networks can
also be defined with integers, just like host addresses. In this case, the
network is considered to contain only the single address identified by the
integer, so the network prefix includes the entire network address::
>>> ipaddress.ip_network(3221225984)
IPv4Network('192.0.2.0/32')
>>> ipaddress.ip_network(42540766411282592856903984951653826560)
IPv6Network('2001:db8::/128')
As with addresses, creation of a particular kind of network can be forced
by calling the class constructor directly instead of using the factory
function.
Host Interfaces
---------------
As mentioned just above, if you need to describe an address on a particular
network, neither the address nor the network classes are sufficient.
Notation like ``192.0.2.1/24`` is commonly used by network engineers and the
people who write tools for firewalls and routers as shorthand for "the host
``192.0.2.1`` on the network ``192.0.2.0/24``". Accordingly, :mod:`ipaddress`
provides a set of hybrid classes that associate an address with a particular
network. The interface for creation is identical to that for defining network
objects, except that the address portion isn't constrained to being a network
address.
>>> ipaddress.ip_interface('192.0.2.1/24')
IPv4Interface('192.0.2.1/24')
>>> ipaddress.ip_interface('2001:db8::1/96')
IPv6Interface('2001:db8::1/96')
Integer inputs are accepted (as with networks), and use of a particular IP
version can be forced by calling the relevant constructor directly.
Inspecting Address/Network/Interface Objects
============================================
You've gone to the trouble of creating an IPv(4|6)(Address|Network|Interface)
object, so you probably want to get information about it. :mod:`ipaddress`
tries to make doing this easy and intuitive.
Extracting the IP version::
>>> addr4 = ipaddress.ip_address('192.0.2.1')
>>> addr6 = ipaddress.ip_address('2001:db8::1')
>>> addr6.version
6
>>> addr4.version
4
Obtaining the network from an interface::
>>> host4 = ipaddress.ip_interface('192.0.2.1/24')
>>> host4.network
IPv4Network('192.0.2.0/24')
>>> host6 = ipaddress.ip_interface('2001:db8::1/96')
>>> host6.network
IPv6Network('2001:db8::/96')
Finding out how many individual addresses are in a network::
>>> net4 = ipaddress.ip_network('192.0.2.0/24')
>>> net4.num_addresses
256
>>> net6 = ipaddress.ip_network('2001:db8::0/96')
>>> net6.num_addresses
4294967296
Iterating through the "usable" addresses on a network::
>>> net4 = ipaddress.ip_network('192.0.2.0/24')
>>> for x in net4.hosts():
... print(x) # doctest: +ELLIPSIS
192.0.2.1
192.0.2.2
192.0.2.3
192.0.2.4
...
192.0.2.252
192.0.2.253
192.0.2.254
Obtaining the netmask (i.e. set bits corresponding to the network prefix) or
the hostmask (any bits that are not part of the netmask):
>>> net4 = ipaddress.ip_network('192.0.2.0/24')
>>> net4.netmask
IPv4Address('255.255.255.0')
>>> net4.hostmask
IPv4Address('0.0.0.255')
>>> net6 = ipaddress.ip_network('2001:db8::0/96')
>>> net6.netmask
IPv6Address('ffff:ffff:ffff:ffff:ffff:ffff::')
>>> net6.hostmask
IPv6Address('::ffff:ffff')
Exploding or compressing the address::
>>> addr6.exploded
'2001:0db8:0000:0000:0000:0000:0000:0001'
>>> addr6.compressed
'2001:db8::1'
>>> net6.exploded
'2001:0db8:0000:0000:0000:0000:0000:0000/96'
>>> net6.compressed
'2001:db8::/96'
While IPv4 doesn't support explosion or compression, the associated objects
still provide the relevant properties so that version neutral code can
easily ensure the most concise or most verbose form is used for IPv6
addresses while still correctly handling IPv4 addresses.
Networks as lists of Addresses
==============================
It's sometimes useful to treat networks as lists. This means it is possible
to index them like this::
>>> net4[1]
IPv4Address('192.0.2.1')
>>> net4[-1]
IPv4Address('192.0.2.255')
>>> net6[1]
IPv6Address('2001:db8::1')
>>> net6[-1]
IPv6Address('2001:db8::ffff:ffff')
It also means that network objects lend themselves to using the list
membership test syntax like this::
if address in network:
# do something
Containment testing is done efficiently based on the network prefix::
>>> addr4 = ipaddress.ip_address('192.0.2.1')
>>> addr4 in ipaddress.ip_network('192.0.2.0/24')
True
>>> addr4 in ipaddress.ip_network('192.0.3.0/24')
False
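Membership testing combines naturally with a list comprehension to filter candidate addresses. The helper ``addresses_in`` below is an illustrative name, not a function provided by :mod:`ipaddress`.

```python
import ipaddress


def addresses_in(network, candidates):
    """Return the candidate addresses that fall inside ``network``."""
    net = ipaddress.ip_network(network)
    parsed = (ipaddress.ip_address(c) for c in candidates)
    # An address of the other IP version is simply reported as not contained.
    return [addr for addr in parsed if addr in net]


matches = addresses_in(
    "192.0.2.0/24",
    ["192.0.2.1", "192.0.3.7", "2001:db8::1", "192.0.2.254"],
)
```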
Comparisons
===========
:mod:`ipaddress` provides some simple, hopefully intuitive ways to compare
objects, where it makes sense::
>>> ipaddress.ip_address('192.0.2.1') < ipaddress.ip_address('192.0.2.2')
True
A :exc:`TypeError` exception is raised if you try to compare objects of
different versions or different types.
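The sketch below shows the exception for a mixed-version comparison, together with one possible way to order a mixed list anyway. Sorting by version first and numeric value second is a choice made for this example, not something the module prescribes.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")
v6 = ipaddress.ip_address("2001:db8::1")

try:
    v4 < v6
    mixed_comparison_ok = True
except TypeError:
    # IPv4 and IPv6 addresses have no defined ordering relative to each other.
    mixed_comparison_ok = False

# Sort on an explicit key so that objects of different versions are never
# compared with each other directly.
mixed = [v6, ipaddress.ip_address("192.0.2.2"), v4]
ordered = sorted(mixed, key=lambda addr: (addr.version, int(addr)))
```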
Using IP Addresses with other modules
=====================================
Other modules that use IP addresses (such as :mod:`socket`) usually won't
accept objects from this module directly. Instead, they must be coerced to
an integer or string that the other module will accept::
>>> addr4 = ipaddress.ip_address('192.0.2.1')
>>> str(addr4)
'192.0.2.1'
>>> int(addr4)
3221225985
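For example, :func:`socket.inet_aton` expects a dotted-quad string, so the address is converted with ``str()`` first. The variable names are illustrative.

```python
import ipaddress
import socket

addr4 = ipaddress.ip_address("192.0.2.1")

# socket functions take strings (or packed bytes), not ipaddress objects.
packed_by_socket = socket.inet_aton(str(addr4))

# The conversion also works in the other direction, from packed bytes or an integer.
from_bytes = ipaddress.ip_address(packed_by_socket)
from_int = ipaddress.ip_address(int(addr4))
```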
Getting more detail when instance creation fails
================================================
When creating address/network/interface objects using the version-agnostic
factory functions, any errors will be reported as :exc:`ValueError` with
a generic error message that simply says the passed in value was not
recognized as an object of that type. The lack of a specific error is
because it's necessary to know whether the value is *supposed* to be IPv4
or IPv6 in order to provide more detail on why it has been rejected.
To support use cases where it is useful to have access to this additional
detail, the individual class constructors actually raise the
:exc:`ValueError` subclasses :exc:`ipaddress.AddressValueError` and
:exc:`ipaddress.NetmaskValueError` to indicate exactly which part of
the definition failed to parse correctly.
The error messages are significantly more detailed when using the
class constructors directly. For example::
>>> ipaddress.ip_address("192.168.0.256")
Traceback (most recent call last):
...
ValueError: '192.168.0.256' does not appear to be an IPv4 or IPv6 address
>>> ipaddress.IPv4Address("192.168.0.256")
Traceback (most recent call last):
...
ipaddress.AddressValueError: Octet 256 (> 255) not permitted in '192.168.0.256'
>>> ipaddress.ip_network("192.168.0.1/64")
Traceback (most recent call last):
...
ValueError: '192.168.0.1/64' does not appear to be an IPv4 or IPv6 network
>>> ipaddress.IPv4Network("192.168.0.1/64")
Traceback (most recent call last):
...
ipaddress.NetmaskValueError: '64' is not a valid netmask
However, both of the module specific exceptions have :exc:`ValueError` as their
parent class, so if you're not concerned with the particular type of error,
you can still write code like the following::
try:
network = ipaddress.IPv4Network(address)
except ValueError:
print('address/netmask is invalid for IPv4:', address)
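If the distinction does matter, the more specific exceptions can be caught before the generic one. The function name and message prefixes below are invented for this sketch.

```python
import ipaddress


def explain_ipv4_network(text):
    """Return the normalized network, or a short description of what failed."""
    try:
        return str(ipaddress.IPv4Network(text))
    except ipaddress.AddressValueError as exc:
        return "bad address: {}".format(exc)
    except ipaddress.NetmaskValueError as exc:
        return "bad netmask: {}".format(exc)
    except ValueError as exc:
        # Remaining problems, such as host bits being set, raise plain ValueError.
        return "invalid network: {}".format(exc)


results = [
    explain_ipv4_network(text)
    for text in ("192.0.2.0/24", "192.168.0.256/24", "192.168.0.1/64", "192.0.2.1/24")
]
```

The subclasses must be listed before :exc:`ValueError`; otherwise the generic handler would match first.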

File diff suppressed because it is too large

1103
third_party/python/Doc/howto/logging.rst vendored Normal file

File diff suppressed because it is too large

BIN
third_party/python/Doc/howto/logging_flow.png vendored Executable file

Binary file not shown.



@ -0,0 +1,452 @@
.. _pyporting-howto:
*********************************
Porting Python 2 Code to Python 3
*********************************
:author: Brett Cannon
.. topic:: Abstract
With Python 3 being the future of Python while Python 2 is still in active
use, it is good to have your project available for both major releases of
Python. This guide is meant to help you figure out how best to support both
Python 2 & 3 simultaneously.
If you are looking to port an extension module instead of pure Python code,
please see :ref:`cporting-howto`.
If you would like to read one core Python developer's take on why Python 3
came into existence, you can read Nick Coghlan's `Python 3 Q & A`_ or
Brett Cannon's `Why Python 3 exists`_.
For help with porting, you can email the python-porting_ mailing list with
questions.
The Short Explanation
=====================
To make your project be single-source Python 2/3 compatible, the basic steps
are:
#. Only worry about supporting Python 2.7
#. Make sure you have good test coverage (coverage.py_ can help;
``pip install coverage``)
#. Learn the differences between Python 2 & 3
#. Use Futurize_ (or Modernize_) to update your code (e.g. ``pip install future``)
#. Use Pylint_ to help make sure you don't regress on your Python 3 support
(``pip install pylint``)
#. Use caniusepython3_ to find out which of your dependencies are blocking your
use of Python 3 (``pip install caniusepython3``)
#. Once your dependencies are no longer blocking you, use continuous integration
to make sure you stay compatible with Python 2 & 3 (tox_ can help test
against multiple versions of Python; ``pip install tox``)
#. Consider using optional static type checking to make sure your type usage
works in both Python 2 & 3 (e.g. use mypy_ to check your typing under both
Python 2 & Python 3).
Details
=======
A key point about supporting Python 2 & 3 simultaneously is that you can start
**today**! Even if your dependencies do not support Python 3 yet, that does
not mean you can't modernize your code **now** to support Python 3. Most changes
required to support Python 3 lead to cleaner code using newer practices even in
Python 2 code.
Another key point is that modernizing your Python 2 code to also support
Python 3 is largely automated for you. While you might have to make some API
decisions thanks to Python 3 clarifying text data versus binary data, the
lower-level work is now mostly done for you, and thus you can at least benefit
from the automated changes immediately.
Keep those key points in mind while you read on about the details of porting
your code to support Python 2 & 3 simultaneously.
Drop support for Python 2.6 and older
-------------------------------------
While you can make Python 2.5 work with Python 3, it is **much** easier if you
only have to work with Python 2.7. If dropping Python 2.5 is not an
option then the six_ project can help you support Python 2.5 & 3 simultaneously
(``pip install six``). Do realize, though, that nearly all the projects listed
in this HOWTO will not be available to you.
If you are able to skip Python 2.5 and older, then the required changes
to your code should continue to look and feel like idiomatic Python code. At
worst you will have to use a function instead of a method in some instances or
have to import a function instead of using a built-in one, but otherwise the
overall transformation should not feel foreign to you.
But you should aim for only supporting Python 2.7. Python 2.6 is no longer
freely supported and thus is not receiving bugfixes. This means **you** will have
to work around any issues you come across with Python 2.6. There are also some
tools mentioned in this HOWTO which do not support Python 2.6 (e.g., Pylint_),
and this will become more commonplace as time goes on. It will simply be easier
for you if you only support the versions of Python that you have to support.
Make sure you specify the proper version support in your ``setup.py`` file
--------------------------------------------------------------------------
In your ``setup.py`` file you should have the proper `trove classifier`_
specifying what versions of Python you support. As your project does not support
Python 3 yet you should at least have
``Programming Language :: Python :: 2 :: Only`` specified. Ideally you should
also specify each major/minor version of Python that you do support, e.g.
``Programming Language :: Python :: 2.7``.
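As a rough sketch, the classifiers for a project that currently supports only
Python 2.7 might be collected as shown here. The name ``CLASSIFIERS`` is an
assumption made for illustration; in a real ``setup.py`` the list would be
passed as the ``classifiers`` argument to ``setup()``.

```python
# Hypothetical classifier list for a project that supports only Python 2.7.
# In setup.py it would be passed as setup(..., classifiers=CLASSIFIERS).
CLASSIFIERS = [
    "Programming Language :: Python :: 2 :: Only",
    "Programming Language :: Python :: 2.7",
]
```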
Have good test coverage
-----------------------
Once you have your code supporting the oldest version of Python 2 you want it
to, you will want to make sure your test suite has good coverage. A good rule of
thumb is that you want to be confident enough in your test suite that any
failures that appear after having tools rewrite your code are actual bugs in the
tools and not in your code. If you want a number to aim for, try to get over 80%
coverage (and don't feel bad if you find it hard to get better than 90%
coverage). If you don't already have a tool to measure test coverage then
coverage.py_ is recommended.
Learn the differences between Python 2 & 3
-------------------------------------------
Once you have your code well-tested you are ready to begin porting your code to
Python 3! But to fully understand how your code is going to change and what
you want to look out for while you code, you will want to learn what changes
Python 3 makes relative to Python 2. Typically the two best ways of doing that
are reading the `"What's New"`_ doc for each release of Python 3 and the
`Porting to Python 3`_ book (which is free online). There is also a handy
`cheat sheet`_ from the Python-Future project.
Update your code
----------------
Once you feel like you know what is different in Python 3 compared to Python 2,
it's time to update your code! You have a choice between two tools in porting
your code automatically: Futurize_ and Modernize_. Which tool you choose will
depend on how much like Python 3 you want your code to be. Futurize_ does its
best to make Python 3 idioms and practices exist in Python 2, e.g. backporting
the ``bytes`` type from Python 3 so that you have semantic parity between the
major versions of Python. Modernize_,
on the other hand, is more conservative and targets a Python 2/3 subset of
Python, directly relying on six_ to help provide compatibility. As Python 3 is
the future, it might be best to consider Futurize to begin adjusting to any new
practices that Python 3 introduces which you are not accustomed to yet.
Regardless of which tool you choose, they will update your code to run under
Python 3 while staying compatible with the version of Python 2 you started with.
Depending on how conservative you want to be, you may want to run the tool over
your test suite first and visually inspect the diff to make sure the
transformation is accurate. After you have transformed your test suite and
verified that all the tests still pass as expected, then you can transform your
application code knowing that any test which fails is a translation failure.
Unfortunately the tools can't automate everything to make your code work under
Python 3 and so there are a handful of things you will need to update manually
to get full Python 3 support (which of these steps are necessary varies between
the tools). Read the documentation for the tool you choose to use to see what it
fixes by default and what it can do optionally to know what will (not) be fixed
for you and what you may have to fix on your own (e.g. using ``io.open()`` over
the built-in ``open()`` function is off by default in Modernize). Luckily,
though, there are only a couple of things to watch out for which can be
considered large issues that may be hard to debug if not watched for.
Division
++++++++
In Python 3, ``5 / 2 == 2.5`` and not ``2``; all division between ``int`` values
results in a ``float``. This change has actually been planned since Python 2.2
which was released in 2001. Since then users have been encouraged to add
``from __future__ import division`` to any and all files which use the ``/`` and
``//`` operators or to run the interpreter with the ``-Q`` flag. If you
have not been doing this then you will need to go through your code and do two
things:
#. Add ``from __future__ import division`` to your files
#. Update any division operator as necessary to either use ``//`` to use floor
division or continue using ``/`` and expect a float
The reason that ``/`` isn't simply translated to ``//`` automatically is that if
an object defines a ``__truediv__`` method but not ``__floordiv__`` then your
code would begin to fail (e.g. a user-defined class that uses ``/`` to
signify some operation but not ``//`` for the same thing or at all).
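The difference between the two operators, and the failure mode just described,
can be sketched as follows. The ``Path`` class is hypothetical and exists only
for this illustration. The results shown in the comments assume Python 3;
under Python 2.7 the same behaviour requires ``from __future__ import
division`` as the first statement of the file.

```python
class Path(object):
    """Hypothetical class that overloads ``/`` to join path segments."""

    def __init__(self, value):
        self.value = value

    def __truediv__(self, other):
        # Only true division is defined; there is no __floordiv__.
        return Path(self.value + "/" + other)


true_quotient = 5 / 2        # true division yields a float
floor_quotient = 5 // 2      # floor division yields an int
joined = Path("usr") / "lib"

# Blindly rewriting ``/`` to ``//`` would break the Path usage above.
try:
    Path("usr") // "lib"
    floor_supported = True
except TypeError:
    floor_supported = False
```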
Text versus binary data
+++++++++++++++++++++++
In Python 2 you could use the ``str`` type for both text and binary data.
Unfortunately this conflation of two different concepts could lead to brittle
code which sometimes worked for either kind of data, sometimes not. It also
could lead to confusing APIs if people didn't explicitly state that something
that accepted ``str`` accepted either text or binary data instead of one
specific type. This complicated the situation especially for anyone supporting
multiple languages as APIs wouldn't bother explicitly supporting ``unicode``
when they claimed text data support.
To make the distinction between text and binary data clearer and more
pronounced, Python 3 did what most languages created in the age of the internet
have done and made text and binary data distinct types that cannot blindly be
mixed together (Python predates widespread access to the internet). For any code
that deals only with text or only binary data, this separation doesn't pose an
issue. But for code that has to deal with both, it does mean you might have to
now care about when you are using text compared to binary data, which is why
this cannot be entirely automated.
To start, you will need to decide which APIs take text and which take binary
(it is **highly** recommended you don't design APIs that can take both due to
the difficulty of keeping the code working; as stated earlier it is difficult to
do well). In Python 2 this means making sure the APIs that take text can work
with ``unicode`` and those that work with binary data work with the
``bytes`` type from Python 3 (whose API is a subset of that of ``str`` in
Python 2; in Python 2, ``bytes`` is simply an alias for ``str``). Usually the biggest issue is
realizing which methods exist on which types in Python 2 & 3 simultaneously
(for text that's ``unicode`` in Python 2 and ``str`` in Python 3, for binary
that's ``str``/``bytes`` in Python 2 and ``bytes`` in Python 3). The following
table lists the **unique** methods of each data type across Python 2 & 3
(e.g., the ``decode()`` method is usable on the equivalent binary data type in
either Python 2 or 3, but it can't be used by the textual data type consistently
between Python 2 and 3 because ``str`` in Python 3 doesn't have the method). Do
note that as of Python 3.5 the ``__mod__`` method was added to the bytes type.
======================== =====================
**Text data**            **Binary data**
------------------------ ---------------------
\                        decode
------------------------ ---------------------
encode
------------------------ ---------------------
format
------------------------ ---------------------
isdecimal
------------------------ ---------------------
isnumeric
======================== =====================
Making the distinction easier to handle can be accomplished by encoding and
decoding between binary data and text at the edge of your code. This means that
when you receive text as binary data, you should immediately decode it. And if
your code needs to send text as binary data then encode it as late as possible.
This allows your code to work with only text internally and thus eliminates
having to keep track of what type of data you are working with.
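A minimal sketch of this decode-early, encode-late pattern is shown here. The
function name ``shout`` and the choice of UTF-8 as the default encoding are
assumptions made for illustration only.

```python
def shout(raw_request, encoding="utf-8"):
    """Hypothetical handler that takes and returns binary data."""
    # Decode as soon as the binary data enters your code.
    text = raw_request.decode(encoding)
    # Everything in between works on text only.
    reply = text.upper() + u"!"
    # Encode as late as possible, right before the data leaves.
    return reply.encode(encoding)


reply = shout(b"caf\xc3\xa9")
```

Because the internals only ever see text, the uppercase conversion handles the
non-ASCII character correctly instead of operating on individual bytes.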
The next issue is making sure you know whether the string literals in your code
represent text or binary data. You should add a ``b`` prefix to any
literal that represents binary data. For text you should add a ``u`` prefix to
the text literal. (There is a :mod:`__future__` import to force all unspecified
literals to be Unicode, but usage has shown it isn't as effective as adding a
``b`` or ``u`` prefix to all literals explicitly.)
As part of this dichotomy you also need to be careful about opening files.
Unless you have been working on Windows, there is a chance you have not always
bothered to add the ``b`` mode when opening a binary file (e.g., ``rb`` for
binary reading). Under Python 3, binary files and text files are clearly
distinct and mutually incompatible; see the :mod:`io` module for details.
Therefore, you **must** make a decision of whether a file will be used for
binary access (allowing binary data to be read and/or written) or textual access
(allowing text data to be read and/or written). You should also use :func:`io.open`
for opening files instead of the built-in :func:`open` function as the :mod:`io`
module is consistent from Python 2 to 3 while the built-in :func:`open` function
is not (in Python 3 it's actually :func:`io.open`). Do not bother with the
outdated practice of using :func:`codecs.open` as that's only necessary for
keeping compatibility with Python 2.5.
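As an illustration, the following sketch writes text through ``io.open()`` and
then reads the same file back once in binary mode and once in text mode. The
temporary directory, the file name and the UTF-8 encoding are assumptions made
for the example.

```python
import io
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Textual access: you pass text and name the encoding explicitly.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9")

# Binary access: you get the raw bytes that were stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()

# Textual access again: the bytes are decoded for you.
with io.open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
```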
The constructors of both ``str`` and ``bytes`` have different semantics for the
same arguments between Python 2 & 3. Passing an integer to ``bytes`` in Python 2
will give you the string representation of the integer: ``bytes(3) == '3'``.
But in Python 3, an integer argument to ``bytes`` will give you a bytes object
as long as the integer specified, filled with null bytes:
``bytes(3) == b'\x00\x00\x00'``. A similar worry is necessary when passing a
bytes object to ``str``. In Python 2 you just get the bytes object back:
``str(b'3') == b'3'``. But in Python 3 you get the string representation of the
bytes object: ``str(b'3') == "b'3'"``.
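The sketch below shows the Python 3 results described above, followed by
spellings that should give the same answer on Python 2.7 and Python 3. The
variable names and the choice of the ASCII codec are assumptions made for the
example.

```python
# Results under Python 3; on Python 2 these two lines behave as described
# in the paragraph above.
zero_filled = bytes(3)            # three null bytes, not b'3'
repr_text = str(b"3")             # the repr of the bytes object, not '3'

# Spellings that mean the same thing on Python 2.7 and Python 3:
digits = str(3).encode("ascii")   # the integer rendered as binary data
text = b"3".decode("ascii")       # binary data converted to text
```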
Finally, the indexing of binary data requires careful handling (slicing does
**not** require any special handling). In Python 2,
``b'123'[1] == b'2'`` while in Python 3 ``b'123'[1] == 50``. Because binary data
is simply a collection of binary numbers, Python 3 returns the integer value for
the byte you index on. But in Python 2 because ``bytes == str``, indexing
returns a one-item slice of bytes. The six_ project has a function
named ``six.indexbytes()`` which will return an integer like in Python 3:
``six.indexbytes(b'123', 1)``.
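If you would rather not depend on six for this, two standard-library spellings
behave the same on Python 2.7 and Python 3. This is only a sketch of the
alternatives, not a replacement for reading the six documentation.

```python
data = b"123"

# Slicing always returns binary data of length one.
as_slice = data[1:2]

# Indexing a bytearray always returns an integer.
as_int = bytearray(data)[1]
```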
To summarize:
#. Decide which of your APIs take text and which take binary data
#. Make sure that your code that works with text also works with ``unicode`` and
code for binary data works with ``bytes`` in Python 2 (see the table above
for what methods you cannot use for each type)
#. Mark all binary literals with a ``b`` prefix, textual literals with a ``u``
prefix
#. Decode binary data to text as soon as possible, encode text as binary data as
late as possible
#. Open files using :func:`io.open` and make sure to specify the ``b`` mode when
appropriate
#. Be careful when indexing into binary data
Use feature detection instead of version detection
++++++++++++++++++++++++++++++++++++++++++++++++++
Inevitably you will have code that has to choose what to do based on what
version of Python is running. The best way to do this is with feature detection
of whether the version of Python you're running under supports what you need.
If for some reason that doesn't work then you should make the version check be
against Python 2 and not Python 3. To help explain this, let's look at an
example.
Let's pretend that you need access to a feature of importlib_ that
is available in Python's standard library since Python 3.3 and available for
Python 2 through importlib2_ on PyPI. You might be tempted to write code to
access e.g. the ``importlib.abc`` module by doing the following::

    import sys

    if sys.version_info[0] == 3:
        from importlib import abc
    else:
        from importlib2 import abc

The problem with this code is: what happens when Python 4 comes out? It would
be better to treat Python 2 as the exceptional case instead of Python 3 and
assume that future Python versions will be more compatible with Python 3 than
Python 2::

    import sys

    if sys.version_info[0] > 2:
        from importlib import abc
    else:
        from importlib2 import abc

The best solution, though, is to do no version detection at all and instead rely
on feature detection. That avoids any potential issues of getting the version
detection wrong and helps keep you future-compatible::

    try:
        from importlib import abc
    except ImportError:
        from importlib2 import abc

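The same idea applies to built-in names. As a further sketch (the names
``text_type`` and ``is_text`` are made up for this example), you can probe for
the ``unicode`` built-in rather than checking the interpreter version:

```python
try:
    text_type = unicode          # the name only exists on Python 2
except NameError:
    text_type = str              # on Python 3, str is the text type


def is_text(value):
    """Return True if value is text rather than binary data."""
    return isinstance(value, text_type)
```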
Prevent compatibility regressions
---------------------------------
Once you have fully translated your code to be compatible with Python 3, you
will want to make sure your code doesn't regress and stop working under
Python 3. This is especially true if you have a dependency which is blocking you
from actually running under Python 3 at the moment.
To help with staying compatible, each new module you create should have
at least the following block of code at its top::

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function

You can also run Python 2 with the ``-3`` flag to be warned about various
compatibility issues your code triggers during execution. If you turn warnings
into errors with ``-Werror`` then you can make sure that you don't accidentally
miss a warning.
You can also use the Pylint_ project and its ``--py3k`` flag to lint your code
to receive warnings when your code begins to deviate from Python 3
compatibility. This also prevents you from having to run Modernize_ or Futurize_
over your code regularly to catch compatibility regressions. This does require
you only support Python 2.7 and Python 3.4 or newer as that is Pylint's
minimum Python version support.
Check which dependencies block your transition
----------------------------------------------
**After** you have made your code compatible with Python 3 you should begin to
care about whether your dependencies have also been ported. The caniusepython3_
project was created to help you determine which projects
-- directly or indirectly -- are blocking you from supporting Python 3. There
is both a command-line tool and a web interface at
https://caniusepython3.com.
The project also provides code which you can integrate into your test suite so
that you will have a failing test when you no longer have dependencies blocking
you from using Python 3. This allows you to avoid having to manually check your
dependencies and to be notified quickly when you can start running on Python 3.
Update your ``setup.py`` file to denote Python 3 compatibility
--------------------------------------------------------------
Once your code works under Python 3, you should update the classifiers in
your ``setup.py`` to contain ``Programming Language :: Python :: 3`` and to not
specify sole Python 2 support. This will tell anyone using your code that you
support Python 2 **and** 3. Ideally you will also want to add classifiers for
each major/minor version of Python you now support.
Use continuous integration to stay compatible
---------------------------------------------
Once you are able to fully run under Python 3 you will want to make sure your
code always works under both Python 2 & 3. Probably the best tool for running
your tests under multiple Python interpreters is tox_. You can then integrate
tox with your continuous integration system so that you never accidentally break
Python 2 or 3 support.
You may also want to use the ``-bb`` flag with the Python 3 interpreter to
trigger an exception when you are comparing bytes to strings or bytes to an int
(the latter is available starting in Python 3.5). By default type-differing
comparisons simply return ``False``, but if you made a mistake in your
separation of text/binary data handling or indexing on bytes you wouldn't easily
find the mistake. This flag will raise an exception when these kinds of
comparisons occur, making the mistake much easier to track down.
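To see why such mistakes are easy to miss, consider this sketch run under
Python 3 without the ``-bb`` flag. Both comparisons quietly evaluate to
``False``; the variable names are illustrative only.

```python
payload = b"abc"

# Comparing binary data to text: never equal on Python 3.
equal_to_text = (payload == "abc")

# Comparing an indexed byte (an int on Python 3) to a bytes object.
first_is_a = (payload[0] == b"a")
```

With ``-bb`` the first comparison raises an exception instead, and as of
Python 3.5 so does the second.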
And that's mostly it! At this point your code base is compatible with both
Python 2 and 3 simultaneously. Your testing will also be set up so that you
don't accidentally break Python 2 or 3 compatibility regardless of which version
you typically run your tests under while developing.
Consider using optional static type checking
--------------------------------------------
Another way to help port your code is to use a static type checker like
mypy_ or pytype_ on your code. These tools can be used to analyze your code as
if it's being run under Python 2, then you can run the tool a second time as if
your code is running under Python 3. By running a static type checker twice like
this you can discover if you're e.g. misusing a binary data type in one version
of Python compared to another. If you add optional type hints to your code you
can also explicitly state whether your APIs use textual or binary data, helping
to make sure everything functions as expected in both versions of Python.
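One way to write such hints so that the file still parses under Python 2 is
the comment-based syntax, sketched below. The function ``decode_header`` is
hypothetical, and the example assumes the ``typing`` module is importable
(it is in the standard library from Python 3.5 and is a separate install on
Python 2.7).

```python
from typing import Text


def decode_header(raw):
    # type: (bytes) -> Text
    """Hypothetical API that accepts binary data and returns text."""
    return raw.decode("ascii")


header = decode_header(b"Content-Type")
```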
.. _2to3: https://docs.python.org/3/library/2to3.html
.. _caniusepython3: https://pypi.org/project/caniusepython3
.. _cheat sheet: http://python-future.org/compatible_idioms.html
.. _coverage.py: https://pypi.org/project/coverage
.. _Futurize: http://python-future.org/automatic_conversion.html
.. _importlib: https://docs.python.org/3/library/importlib.html#module-importlib
.. _importlib2: https://pypi.org/project/importlib2
.. _Modernize: https://python-modernize.readthedocs.org/en/latest/
.. _mypy: http://mypy-lang.org/
.. _Porting to Python 3: http://python3porting.com/
.. _Pylint: https://pypi.org/project/pylint
.. _Python 3 Q & A: https://ncoghlan-devs-python-notes.readthedocs.org/en/latest/python3/questions_and_answers.html
.. _pytype: https://github.com/google/pytype
.. _python-future: http://python-future.org/
.. _python-porting: https://mail.python.org/mailman/listinfo/python-porting
.. _six: https://pypi.org/project/six
.. _tox: https://pypi.org/project/tox
.. _trove classifier: https://pypi.org/classifiers
.. _"What's New": https://docs.python.org/3/whatsnew/index.html
.. _Why Python 3 exists: http://www.snarky.ca/why-python-3-exists

1385
third_party/python/Doc/howto/regex.rst vendored Normal file

File diff suppressed because it is too large

383
third_party/python/Doc/howto/sockets.rst vendored Normal file

@ -0,0 +1,383 @@
.. _socket-howto:
****************************
Socket Programming HOWTO
****************************
:Author: Gordon McMillan
.. topic:: Abstract
Sockets are used nearly everywhere, but are one of the most severely
misunderstood technologies around. This is a 10,000 foot overview of sockets.
It's not really a tutorial - you'll still have work to do in getting things
operational. It doesn't cover the fine points (and there are a lot of them), but
I hope it will give you enough background to begin using them decently.
Sockets
=======
I'm only going to talk about INET (i.e. IPv4) sockets, but they account for at least 99% of
the sockets in use. And I'll only talk about STREAM (i.e. TCP) sockets - unless you really
know what you're doing (in which case this HOWTO isn't for you!), you'll get
better behavior and performance from a STREAM socket than anything else. I will
try to clear up the mystery of what a socket is, as well as give some hints on how to
work with blocking and non-blocking sockets. But I'll start by talking about
blocking sockets. You'll need to know how they work before dealing with
non-blocking sockets.
Part of the trouble with understanding these things is that "socket" can mean a
number of subtly different things, depending on context. So first, let's make a
distinction between a "client" socket - an endpoint of a conversation, and a
"server" socket, which is more like a switchboard operator. The client
application (your browser, for example) uses "client" sockets exclusively; the
web server it's talking to uses both "server" sockets and "client" sockets.
History
-------
Of the various forms of :abbr:`IPC (Inter Process Communication)`,
sockets are by far the most popular. On any given platform, there are
likely to be other forms of IPC that are faster, but for
cross-platform communication, sockets are about the only game in town.
They were invented in Berkeley as part of the BSD flavor of Unix. They spread
like wildfire with the Internet. With good reason --- the combination of sockets
with INET makes talking to arbitrary machines around the world unbelievably easy
(at least compared to other schemes).
Creating a Socket
=================
Roughly speaking, when you clicked on the link that brought you to this page,
your browser did something like the following::

    import socket

    # create an INET, STREAMing socket
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # now connect to the web server on port 80 - the normal http port
    s.connect(("www.python.org", 80))

When the ``connect`` completes, the socket ``s`` can be used to send
in a request for the text of the page. The same socket will read the
reply, and then be destroyed. That's right, destroyed. Client sockets
are normally only used for one exchange (or a small set of sequential
exchanges).
What happens in the web server is a bit more complex. First, the web server
creates a "server socket"::

    # create an INET, STREAMing socket
    serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # bind the socket to a public host, and a well-known port
    serversocket.bind((socket.gethostname(), 80))
    # become a server socket
    serversocket.listen(5)

A couple things to notice: we used ``socket.gethostname()`` so that the socket
would be visible to the outside world. If we had used
``serversocket.bind(('localhost', 80))`` or
``serversocket.bind(('127.0.0.1', 80))`` we would still have a "server" socket,
but one that was only visible within the same machine.
``serversocket.bind(('', 80))``
specifies that the socket is reachable by any address the machine happens to
have.
A second thing to note: low number ports are usually reserved for "well known"
services (HTTP, SNMP etc). If you're playing around, use a nice high number (4
digits).
Finally, the argument to ``listen`` tells the socket library that we want it to
queue up as many as 5 connect requests (the normal max) before refusing outside
connections. If the rest of the code is written properly, that should be plenty.
Now that we have a "server" socket, listening on port 80, we can enter the
mainloop of the web server::

    while True:
        # accept connections from outside
        (clientsocket, address) = serversocket.accept()
        # now do something with the clientsocket
        # in this case, we'll pretend this is a threaded server
        ct = client_thread(clientsocket)
        ct.run()

There are actually three general ways in which this loop could work: dispatching a
thread to handle ``clientsocket``, creating a new process to handle
``clientsocket``, or restructuring this app to use non-blocking sockets and
multiplexing between our "server" socket and any active ``clientsocket``\ s using
``select``. More about that later. The important thing to understand now is
this: this is *all* a "server" socket does. It doesn't send any data. It doesn't
receive any data. It just produces "client" sockets. Each ``clientsocket`` is
created in response to some *other* "client" socket doing a ``connect()`` to the
host and port we're bound to. As soon as we've created that ``clientsocket``, we
go back to listening for more connections. The two "clients" are free to chat it
up - they are using some dynamically allocated port which will be recycled when
the conversation ends.
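Here is a small end-to-end sketch of the threaded variant that you can run on
one machine. Everything specific in it is an assumption for the demo: the
handler ``handle_client``, the greeting it sends, binding to ``127.0.0.1``,
and asking for port 0 so the operating system picks a free port. The accept
loop is also capped at one client so that the script terminates.

```python
import socket
import threading


def handle_client(clientsocket):
    """Hypothetical handler: send a fixed greeting, then hang up."""
    clientsocket.sendall(b"hello")
    clientsocket.close()


serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(("127.0.0.1", 0))      # port 0: let the OS choose
serversocket.listen(5)
port = serversocket.getsockname()[1]


def accept_loop(max_clients):
    # The "server" socket does nothing but produce "client" sockets.
    for _ in range(max_clients):
        clientsocket, address = serversocket.accept()
        threading.Thread(target=handle_client, args=(clientsocket,)).start()


acceptor = threading.Thread(target=accept_loop, args=(1,))
acceptor.start()

# The other end of the conversation: connect, then read until the
# server closes the connection (recv returns no bytes).
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
chunks = []
while True:
    chunk = client.recv(1024)
    if chunk == b"":
        break
    chunks.append(chunk)
greeting = b"".join(chunks)

client.close()
acceptor.join()
serversocket.close()
```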
IPC
---
If you need fast IPC between two processes on one machine, you should look into
pipes or shared memory. If you do decide to use AF_INET sockets, bind the
"server" socket to ``'localhost'``. On most platforms, this will take a
shortcut around a couple of layers of network code and be quite a bit faster.
.. seealso::
   The :mod:`multiprocessing` module integrates cross-platform IPC into a
   higher-level API.
Using a Socket
==============
The first thing to note is that the web browser's "client" socket and the web
server's "client" socket are identical beasts. That is, this is a "peer to peer"
conversation. Or to put it another way, *as the designer, you will have to
decide what the rules of etiquette are for a conversation*. Normally, the
``connect``\ ing socket starts the conversation, by sending in a request, or
perhaps a signon. But that's a design decision - it's not a rule of sockets.
Now there are two sets of verbs to use for communication. You can use ``send``
and ``recv``, or you can transform your client socket into a file-like beast and
use ``read`` and ``write``. The latter is the way Java presents its sockets.
I'm not going to talk about it here, except to warn you that you need to use
``flush`` on sockets. These are buffered "files", and a common mistake is to
``write`` something, and then ``read`` for a reply. Without a ``flush`` in
there, you may wait forever for the reply, because the request may still be in
your output buffer.
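For the curious, here is a minimal sketch of the file-like approach. It uses
``socket.socketpair()`` to get two already-connected sockets in one process,
which is an assumption made to keep the demo self-contained; the variable
names are illustrative.

```python
import socket

left, right = socket.socketpair()

writer = left.makefile("wb")      # buffered, write-only view of the socket
reader = right.makefile("rb")     # buffered, read-only view of the socket

writer.write(b"hello\n")
writer.flush()                    # without this the line may sit in the buffer
line = reader.readline()

writer.close()
reader.close()
left.close()
right.close()
```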
Now we come to the major stumbling block of sockets - ``send`` and ``recv`` operate
on the network buffers. They do not necessarily handle all the bytes you hand
them (or expect from them), because their major focus is handling the network
buffers. In general, they return when the associated network buffers have been
filled (``send``) or emptied (``recv``). They then tell you how many bytes they
handled. It is *your* responsibility to call them again until your message has
been completely dealt with.
When a ``recv`` returns 0 bytes, it means the other side has closed (or is in
the process of closing) the connection. You will not receive any more data on
this connection. Ever. You may be able to send data successfully; I'll talk
more about this later.
A protocol like HTTP uses a socket for only one transfer. The client sends a
request, then reads a reply. That's it. The socket is discarded. This means that
a client can detect the end of the reply by receiving 0 bytes.
But if you plan to reuse your socket for further transfers, you need to realize
that *there is no* :abbr:`EOT (End of Transfer)` *on a socket.* I repeat: if a socket
``send`` or ``recv`` returns after handling 0 bytes, the connection has been
broken. If the connection has *not* been broken, you may wait on a ``recv``
forever, because the socket will *not* tell you that there's nothing more to
read (for now). Now if you think about that a bit, you'll come to realize a
fundamental truth of sockets: *messages must either be fixed length* (yuck), *or
be delimited* (shrug), *or indicate how long they are* (much better), *or end by
shutting down the connection*. The choice is entirely yours (but some ways are
righter than others).
Assuming you don't want to end the connection, the simplest solution is a fixed
length message::

    import socket

    # the fixed length both ends agree on (value assumed for illustration)
    MSGLEN = 64

    class MySocket:
        """demonstration class only
          - coded for clarity, not efficiency
        """

        def __init__(self, sock=None):
            if sock is None:
                self.sock = socket.socket(
                                socket.AF_INET, socket.SOCK_STREAM)
            else:
                self.sock = sock

        def connect(self, host, port):
            self.sock.connect((host, port))

        def mysend(self, msg):
            # msg must be exactly MSGLEN bytes long
            totalsent = 0
            while totalsent < MSGLEN:
                sent = self.sock.send(msg[totalsent:])
                if sent == 0:
                    raise RuntimeError("socket connection broken")
                totalsent = totalsent + sent

        def myreceive(self):
            chunks = []
            bytes_recd = 0
            while bytes_recd < MSGLEN:
                chunk = self.sock.recv(min(MSGLEN - bytes_recd, 2048))
                if chunk == b'':
                    raise RuntimeError("socket connection broken")
                chunks.append(chunk)
                bytes_recd = bytes_recd + len(chunk)
            return b''.join(chunks)

The sending code here is usable for almost any messaging scheme - in Python you
send bytes objects, and you can use ``len()`` to determine their length (even if
they have embedded ``\0`` bytes). It's mostly the receiving code that gets more
complex. (And in C, it's not much worse, except you can't use ``strlen`` if the
message has embedded ``\0``\ s.)
The easiest enhancement is to make the first character of the message an
indicator of message type, and have the type determine the length. Now you have
two ``recv``\ s - the first to get (at least) that first character so you can
look up the length, and the second in a loop to get the rest. If you decide to
go the delimited route, you'll be receiving in some arbitrary chunk size, (4096
or 8192 is frequently a good match for network buffer sizes), and scanning what
you've received for a delimiter.
One complication to be aware of: if your conversational protocol allows multiple
messages to be sent back to back (without some kind of reply), and you pass
``recv`` an arbitrary chunk size, you may end up reading the start of a
following message. You'll need to put that aside and hold onto it, until it's
needed.
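One possible shape for the delimited route, including the "put it aside and
hold onto it" part, is sketched below. The class name ``LineReceiver``, the
newline delimiter and the use of ``socket.socketpair()`` for the demo are all
assumptions; treat it as a starting point, not a finished design.

```python
import socket


class LineReceiver(object):
    """Hypothetical helper that splits a byte stream on a delimiter."""

    def __init__(self, sock, delimiter=b"\n"):
        self.sock = sock
        self.delimiter = delimiter
        self.buffer = b""         # bytes read past the previous message

    def receive(self):
        # Keep reading arbitrary chunks until a full message is buffered.
        while self.delimiter not in self.buffer:
            chunk = self.sock.recv(4096)
            if chunk == b"":
                raise RuntimeError("socket connection broken")
            self.buffer += chunk
        # Hand back one message and keep whatever follows for next time.
        message, _, self.buffer = self.buffer.partition(self.delimiter)
        return message


sender, receiving_end = socket.socketpair()
sender.sendall(b"first\nsecond\n")      # two messages sent back to back

receiver = LineReceiver(receiving_end)
first = receiver.receive()
second = receiver.receive()

sender.close()
receiving_end.close()
```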
Prefixing the message with its length (say, as 5 numeric characters) gets more
complex, because (believe it or not), you may not get all 5 characters in one
``recv``. In playing around, you'll get away with it; but in high network loads,
your code will very quickly break unless you use two ``recv`` loops - the first
to determine the length, the second to get the data part of the message. Nasty.
This is also when you'll discover that ``send`` does not always manage to get
rid of everything in one pass. And despite having read this, you will eventually
get bitten by it!
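If you want to check your own answer, one possible sketch of the
length-prefixed scheme follows. The helper names (``recv_exactly``,
``send_message``, ``recv_message``), the 5-digit ASCII header and the use of
``socket.socketpair()`` for the demo are all assumptions. It leans on the
standard ``sendall`` method, which repeats ``send`` until everything is out.

```python
import socket

HEADER_LEN = 5    # the length is sent as 5 ASCII digits, e.g. b"00011"


def recv_exactly(sock, count):
    """Loop on recv until exactly count bytes have arrived."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(min(remaining, 2048))
        if chunk == b"":
            raise RuntimeError("socket connection broken")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_message(sock, payload):
    # Payloads longer than 99999 bytes would not fit the header.
    header = ("%05d" % len(payload)).encode("ascii")
    sock.sendall(header + payload)


def recv_message(sock):
    # First loop: the length. Second loop: the data part.
    length = int(recv_exactly(sock, HEADER_LEN).decode("ascii"))
    return recv_exactly(sock, length)


a, b = socket.socketpair()
send_message(a, b"hello world")
send_message(a, b"")
first = recv_message(b)
second = recv_message(b)
a.close()
b.close()
```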
In the interests of space, building your character (and preserving my
competitive position), these enhancements are left as an exercise for the
reader. Let's move on to cleaning up.
Binary Data
-----------
It is perfectly possible to send binary data over a socket. The major problem is
that not all machines use the same formats for binary data. For example, a
Motorola chip will represent a 16 bit integer with the value 1 as the two hex
bytes 00 01. Intel and DEC, however, are byte-reversed - that same 1 is 01 00.
Socket libraries have calls for converting 16 and 32 bit integers - ``ntohl,
htonl, ntohs, htons`` where "n" means *network* and "h" means *host*, "s" means
*short* and "l" means *long*. Where network order is host order, these do
nothing, but where the machine is byte-reversed, these swap the bytes around
appropriately.
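A short sketch of the byte-order issue, using the :mod:`struct` module to show
both layouts of the 16 bit value 1 and the :mod:`socket` conversion calls to
round-trip it. The variable names are only for illustration.

.. code-block:: python

    import socket
    import struct

    value = 1

    # The same 16 bit integer packed both ways
    big_endian = struct.pack(">H", value)      # the "00 01" layout
    little_endian = struct.pack("<H", value)   # the byte-reversed "01 00" layout

    # "!" asks struct for network order, which is big-endian
    network_order = struct.pack("!H", value)

    # htons and ntohs undo each other, whatever the host byte order is
    short_round_trip = socket.ntohs(socket.htons(value))
    long_round_trip = socket.ntohl(socket.htonl(value))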
In these days of 32-bit machines, the ASCII representation of binary data is
frequently smaller than the binary representation. That's because a surprising
amount of the time, all those longs have the value 0, or maybe 1. The string "0"
would be two bytes, while binary is four. Of course, this doesn't fit well with
fixed-length messages. Decisions, decisions.
Disconnecting
=============
Strictly speaking, you're supposed to use ``shutdown`` on a socket before you
``close`` it. The ``shutdown`` is an advisory to the socket at the other end.
Depending on the argument you pass it, it can mean "I'm not going to send
anymore, but I'll still listen", or "I'm not listening, good riddance!". Most
socket libraries, however, are so used to programmers neglecting to use this
piece of etiquette that normally a ``close`` is the same as ``shutdown();
close()``. So in most situations, an explicit ``shutdown`` is not needed.
One way to use ``shutdown`` effectively is in an HTTP-like exchange. The client
sends a request and then does a ``shutdown(1)``. This tells the server "This
client is done sending, but can still receive." The server can detect "EOF" by
a receive of 0 bytes. It can assume it has the complete request. The server
sends a reply. If the ``send`` completes successfully then, indeed, the client
was still receiving.
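Here is a rough sketch of that exchange. To keep it self-contained it uses
``socket.socketpair()`` as a stand-in for a real client/server connection, and
the request and reply bytes are invented for the example. ``socket.SHUT_WR`` is
the symbolic name for the ``1`` used above.

.. code-block:: python

    import socket

    # Stand-in for a connected client socket and its server-side peer
    client, server = socket.socketpair()

    # Client: send the request, then announce "done sending"
    client.sendall(b"GET /index.html")
    client.shutdown(socket.SHUT_WR)

    # Server: a recv of 0 bytes is the "EOF" that marks a complete request
    request = b""
    while True:
        chunk = server.recv(4096)
        if chunk == b"":
            break
        request += chunk
    server.sendall(b"200 OK")
    server.close()

    # Client: still able to receive after its own shutdown
    reply = b""
    while True:
        chunk = client.recv(4096)
        if chunk == b"":
            break
        reply += chunk
    client.close()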
Python takes the automatic shutdown a step further, and says that when a socket
is garbage collected, it will automatically do a ``close`` if it's needed. But
relying on this is a very bad habit. If your socket just disappears without
doing a ``close``, the socket at the other end may hang indefinitely, thinking
you're just being slow. *Please* ``close`` your sockets when you're done.
When Sockets Die
----------------
Probably the worst thing about using blocking sockets is what happens when the
other side comes down hard (without doing a ``close``). Your socket is likely to
hang. TCP is a reliable protocol, and it will wait a long, long time
before giving up on a connection. If you're using threads, the entire thread is
essentially dead. There's not much you can do about it. As long as you aren't
doing something dumb, like holding a lock while doing a blocking read, the
thread isn't really consuming much in the way of resources. Do *not* try to kill
the thread - part of the reason that threads are more efficient than processes
is that they avoid the overhead associated with the automatic recycling of
resources. In other words, if you do manage to kill the thread, your whole
process is likely to be screwed up.
Non-blocking Sockets
====================
If you've understood the preceding, you already know most of what you need to
know about the mechanics of using sockets. You'll still use the same calls, in
much the same ways. It's just that, if you do it right, your app will be almost
inside-out.
In Python, you use ``socket.setblocking(0)`` to make it non-blocking. In C, it's
more complex (for one thing, you'll need to choose between the POSIX flavor
``O_NONBLOCK`` and the almost indistinguishable older BSD flavor ``O_NDELAY``,
which is completely different from ``TCP_NODELAY``), but it's the exact same idea.
You do this after creating the socket, but before using it. (Actually, if you're
nuts, you can switch back and forth.)
The major mechanical difference is that ``send``, ``recv``, ``connect`` and
``accept`` can return without having done anything. You have (of course) a
number of choices. You can check return codes and error codes and generally drive
yourself crazy. If you don't believe me, try it sometime. Your app will grow
large, buggy and suck CPU. So let's skip the brain-dead solutions and do it
right.
Use ``select``.
In C, coding ``select`` is fairly complex. In Python, it's a piece of cake, but
it's close enough to the C version that if you understand ``select`` in Python,
you'll have little trouble with it in C::
    ready_to_read, ready_to_write, in_error = \
        select.select(
            potential_readers,
            potential_writers,
            potential_errs,
            timeout)
You pass ``select`` three lists: the first contains all sockets that you might
want to try reading; the second all the sockets you might want to try writing
to, and the last (normally left empty) those that you want to check for errors.
You should note that a socket can go into more than one list. The ``select``
call is blocking, but you can give it a timeout. This is generally a sensible
thing to do - give it a nice long timeout (say a minute) unless you have good
reason to do otherwise.
In return, you will get three lists. They contain the sockets that are actually
readable, writable and in error. Each of these lists is a subset (possibly
empty) of the corresponding list you passed in.
If a socket is in the output readable list, you can be
as-close-to-certain-as-we-ever-get-in-this-business that a ``recv`` on that
socket will return *something*. Same idea for the writable list. You'll be able
to send *something*. Maybe not all you want to, but *something* is better than
nothing. (Actually, any reasonably healthy socket will return as writable - it
just means outbound network buffer space is available.)
If you have a "server" socket, put it in the potential_readers list. If it comes
out in the readable list, your ``accept`` will (almost certainly) work. If you
have created a new socket to ``connect`` to someone else, put it in the
potential_writers list. If it shows up in the writable list, you have a decent
chance that it has connected.
Actually, ``select`` can be handy even with blocking sockets. It's one way of
determining whether you will block - the socket returns as readable when there's
something in the buffers. However, this still doesn't help with the problem of
determining whether the other end is done, or just busy with something else.
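A small sketch of ``select`` at work, again assuming ``socket.socketpair()`` as
a convenient stand-in for two connected sockets. The short timeouts are chosen
only so the example finishes quickly; they are not a recommendation.

.. code-block:: python

    import select
    import socket

    a, b = socket.socketpair()

    # Nothing has been sent yet, so b is not readable and select times out
    readable, _, _ = select.select([b], [], [], 0.1)
    readable_before_send = list(readable)

    a.sendall(b"ping")

    # Now b shows up in the readable list, so recv will return something
    readable, _, _ = select.select([b], [], [], 1.0)
    data = b.recv(4096) if b in readable else b""

    # A healthy socket with free buffer space comes back as writable
    _, writable, _ = select.select([], [a], [], 1.0)

    a.close()
    b.close()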
**Portability alert**: On Unix, ``select`` works with both sockets and
files. Don't try this on Windows. On Windows, ``select`` works with sockets
only. Also note that in C, many of the more advanced socket options are done
differently on Windows. In fact, on Windows I usually use threads (which work
very, very well) with my sockets.

third_party/python/Doc/howto/sorting.rst vendored Normal file

@ -0,0 +1,293 @@
.. _sortinghowto:
Sorting HOW TO
**************
:Author: Andrew Dalke and Raymond Hettinger
:Release: 0.1
Python lists have a built-in :meth:`list.sort` method that modifies the list
in-place. There is also a :func:`sorted` built-in function that builds a new
sorted list from an iterable.
In this document, we explore the various techniques for sorting data using Python.
Sorting Basics
==============
A simple ascending sort is very easy: just call the :func:`sorted` function. It
returns a new sorted list::
>>> sorted([5, 2, 3, 1, 4])
[1, 2, 3, 4, 5]
You can also use the :meth:`list.sort` method. It modifies the list
in-place (and returns ``None`` to avoid confusion). Usually it's less convenient
than :func:`sorted` - but if you don't need the original list, it's slightly
more efficient.
>>> a = [5, 2, 3, 1, 4]
>>> a.sort()
>>> a
[1, 2, 3, 4, 5]
Another difference is that the :meth:`list.sort` method is only defined for
lists. In contrast, the :func:`sorted` function accepts any iterable.
>>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'})
[1, 2, 3, 4, 5]
Key Functions
=============
Both :meth:`list.sort` and :func:`sorted` have a *key* parameter to specify a
function to be called on each list element prior to making comparisons.
For example, here's a case-insensitive string comparison:
>>> sorted("This is a test string from Andrew".split(), key=str.lower)
['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']
The value of the *key* parameter should be a function that takes a single argument
and returns a key to use for sorting purposes. This technique is fast because
the key function is called exactly once for each input record.
A common pattern is to sort complex objects using some of the object's indices
as keys. For example:
>>> student_tuples = [
... ('john', 'A', 15),
... ('jane', 'B', 12),
... ('dave', 'B', 10),
... ]
>>> sorted(student_tuples, key=lambda student: student[2]) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The same technique works for objects with named attributes. For example:
>>> class Student:
...     def __init__(self, name, grade, age):
...         self.name = name
...         self.grade = grade
...         self.age = age
...     def __repr__(self):
...         return repr((self.name, self.grade, self.age))
>>> student_objects = [
... Student('john', 'A', 15),
... Student('jane', 'B', 12),
... Student('dave', 'B', 10),
... ]
>>> sorted(student_objects, key=lambda student: student.age) # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
Operator Module Functions
=========================
The key-function patterns shown above are very common, so Python provides
convenience functions to make accessor functions easier and faster. The
:mod:`operator` module has :func:`~operator.itemgetter`,
:func:`~operator.attrgetter`, and a :func:`~operator.methodcaller` function.
Using those functions, the above examples become simpler and faster:
>>> from operator import itemgetter, attrgetter
>>> sorted(student_tuples, key=itemgetter(2))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
>>> sorted(student_objects, key=attrgetter('age'))
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The operator module functions allow multiple levels of sorting. For example, to
sort by *grade* then by *age*:
>>> sorted(student_tuples, key=itemgetter(1,2))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
>>> sorted(student_objects, key=attrgetter('grade', 'age'))
[('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]
Ascending and Descending
========================
Both :meth:`list.sort` and :func:`sorted` accept a *reverse* parameter with a
boolean value. This is used to flag descending sorts. For example, to get the
student data in reverse *age* order:
>>> sorted(student_tuples, key=itemgetter(2), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
>>> sorted(student_objects, key=attrgetter('age'), reverse=True)
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
Sort Stability and Complex Sorts
================================
Sorts are guaranteed to be `stable
<https://en.wikipedia.org/wiki/Sorting_algorithm#Stability>`_\. That means that
when multiple records have the same key, their original order is preserved.
>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
>>> sorted(data, key=itemgetter(0))
[('blue', 1), ('blue', 2), ('red', 1), ('red', 2)]
Notice how the two records for *blue* retain their original order so that
``('blue', 1)`` is guaranteed to precede ``('blue', 2)``.
This wonderful property lets you build complex sorts in a series of sorting
steps. For example, to sort the student data by descending *grade* and then
ascending *age*, do the *age* sort first and then sort again using *grade*:
>>> s = sorted(student_objects, key=attrgetter('age')) # sort on secondary key
>>> sorted(s, key=attrgetter('grade'), reverse=True) # now sort on primary key, descending
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
The `Timsort <https://en.wikipedia.org/wiki/Timsort>`_ algorithm used in Python
does multiple sorts efficiently because it can take advantage of any ordering
already present in a dataset.
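The multi-pass technique can be wrapped in a small helper. The sketch below is
one possible way to do it; ``multisort`` and the ``Student`` named tuple are
hypothetical names introduced for this example, not functions provided by
Python.

.. code-block:: python

    from collections import namedtuple
    from operator import attrgetter

    # Stand-in records so the example runs on its own
    Student = namedtuple("Student", "name grade age")


    def multisort(items, specs):
        """Sort by several keys; `specs` is a list of (attribute, reverse)
        pairs with the primary key first."""
        result = list(items)
        # Sort on the least significant key first; stability keeps the
        # earlier passes intact wherever later keys compare equal.
        for attribute, reverse in reversed(specs):
            result.sort(key=attrgetter(attribute), reverse=reverse)
        return result


    students = [
        Student("john", "A", 15),
        Student("jane", "B", 12),
        Student("dave", "B", 10),
    ]
    ordered = multisort(students, [("grade", True), ("age", False)])
    names = [student.name for student in ordered]

Here ``names`` comes out as ``['dave', 'jane', 'john']``, matching the two-step
example above.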
The Old Way Using Decorate-Sort-Undecorate
==========================================
This idiom is called Decorate-Sort-Undecorate after its three steps:
* First, the initial list is decorated with new values that control the sort order.
* Second, the decorated list is sorted.
* Finally, the decorations are removed, creating a list that contains only the
initial values in the new order.
For example, to sort the student data by *grade* using the DSU approach:
>>> decorated = [(student.grade, i, student) for i, student in enumerate(student_objects)]
>>> decorated.sort()
>>> [student for grade, i, student in decorated] # undecorate
[('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]
This idiom works because tuples are compared lexicographically; the first items
are compared; if they are the same then the second items are compared, and so
on.
It is not strictly necessary in all cases to include the index *i* in the
decorated list, but including it gives two benefits:
* The sort is stable -- if two items have the same key, their order will be
preserved in the sorted list.
* The original items do not have to be comparable because the ordering of the
decorated tuples will be determined by at most the first two items. So for
example the original list could contain complex numbers which cannot be sorted
directly.
Another name for this idiom is
`Schwartzian transform <https://en.wikipedia.org/wiki/Schwartzian_transform>`_\,
after Randal L. Schwartz, who popularized it among Perl programmers.
Now that Python sorting provides key-functions, this technique is not often needed.
The Old Way Using the *cmp* Parameter
=====================================
Many constructs given in this HOWTO assume Python 2.4 or later. Before that,
there was no :func:`sorted` builtin and :meth:`list.sort` took no keyword
arguments. Instead, all of the Py2.x versions supported a *cmp* parameter to
handle user specified comparison functions.
In Py3.0, the *cmp* parameter was removed entirely (as part of a larger effort to
simplify and unify the language, eliminating the conflict between rich
comparisons and the :meth:`__cmp__` magic method).
In Py2.x, sort allowed an optional function which can be called for doing the
comparisons. That function should take two arguments to be compared and then
return a negative value for less-than, return zero if they are equal, or return
a positive value for greater-than. For example, we can do:
>>> def numeric_compare(x, y):
...     return x - y
>>> sorted([5, 2, 4, 1, 3], cmp=numeric_compare) # doctest: +SKIP
[1, 2, 3, 4, 5]
Or you can reverse the order of comparison with:
>>> def reverse_numeric(x, y):
...     return y - x
>>> sorted([5, 2, 4, 1, 3], cmp=reverse_numeric) # doctest: +SKIP
[5, 4, 3, 2, 1]
When porting code from Python 2.x to 3.x, a situation can arise where the user
supplies a comparison function and you need to convert it to a key
function. The following wrapper makes that easy to do::
    def cmp_to_key(mycmp):
        'Convert a cmp= function into a key= function'
        class K:
            def __init__(self, obj, *args):
                self.obj = obj
            def __lt__(self, other):
                return mycmp(self.obj, other.obj) < 0
            def __gt__(self, other):
                return mycmp(self.obj, other.obj) > 0
            def __eq__(self, other):
                return mycmp(self.obj, other.obj) == 0
            def __le__(self, other):
                return mycmp(self.obj, other.obj) <= 0
            def __ge__(self, other):
                return mycmp(self.obj, other.obj) >= 0
            def __ne__(self, other):
                return mycmp(self.obj, other.obj) != 0
        return K
To convert to a key function, just wrap the old comparison function:
.. testsetup::
from functools import cmp_to_key
.. doctest::
>>> sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric))
[5, 4, 3, 2, 1]
In Python 3.2, the :func:`functools.cmp_to_key` function was added to the
:mod:`functools` module in the standard library.
Odds and Ends
=============
* For locale aware sorting, use :func:`locale.strxfrm` for a key function or
:func:`locale.strcoll` for a comparison function.
* The *reverse* parameter still maintains sort stability (so that records with
equal keys retain the original order). Interestingly, that effect can be
simulated without the parameter by using the builtin :func:`reversed` function
twice:
>>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]
>>> standard_way = sorted(data, key=itemgetter(0), reverse=True)
>>> double_reversed = list(reversed(sorted(reversed(data), key=itemgetter(0))))
>>> assert standard_way == double_reversed
>>> standard_way
[('red', 1), ('red', 2), ('blue', 1), ('blue', 2)]
* The sort routines are guaranteed to use :meth:`__lt__` when making comparisons
between two objects. So, it is easy to add a standard sort order to a class by
defining an :meth:`__lt__` method::
>>> Student.__lt__ = lambda self, other: self.age < other.age
>>> sorted(student_objects)
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
* Key functions need not depend directly on the objects being sorted. A key
function can also access external resources. For instance, if the student grades
are stored in a dictionary, they can be used to sort a separate list of student
names:
>>> students = ['dave', 'john', 'jane']
>>> newgrades = {'john': 'F', 'jane':'A', 'dave': 'C'}
>>> sorted(students, key=newgrades.__getitem__)
['jane', 'dave', 'john']

third_party/python/Doc/howto/unicode.rst vendored Normal file

@ -0,0 +1,733 @@
.. _unicode-howto:
*****************
Unicode HOWTO
*****************
:Release: 1.12
This HOWTO discusses Python support for Unicode, and explains
various problems that people commonly encounter when trying to work
with Unicode.
Introduction to Unicode
=======================
History of Character Codes
--------------------------
In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized. ASCII defined numeric codes for various
characters, with the numeric values running from 0 to 127. For example, the
lowercase letter 'a' is assigned 97 as its code value.
ASCII was an American-developed standard, so it only defined unaccented
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)
For a while people just wrote programs that didn't display accents.
In the mid-1980s an Apple II BASIC program written by a French speaker
might have lines like these:
.. code-block:: basic
PRINT "MISE A JOUR TERMINEE"
PRINT "PARAMETRES ENREGISTRES"
Those messages should contain accents (terminée, paramètre, enregistrés) and
they just look wrong to someone who can read French.
In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters. Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128--255 range emerged.
Some were true standards, defined by the International Organization for
Standardization, and some were *de facto* conventions that were invented by one
company or another and managed to catch on.
256 characters aren't very many. For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128--255 range because there are more than 128 such characters.
You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text? In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.
Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0 through 1,114,111
(``0x10FFFF`` in base 16).
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.
(This discussion of Unicode's history is highly simplified. The
precise historical details aren't necessary for understanding how to
use Unicode effectively, but if you're curious, consult the Unicode
consortium site listed in the References or
the `Wikipedia entry for Unicode <https://en.wikipedia.org/wiki/Unicode#History>`_
for more information.)
Definitions
-----------
A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters are
abstractions, and vary depending on the language or context you're talking
about. For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.
The Unicode standard describes how characters are represented by **code
points**. A code point is an integer value, usually denoted in base 16. In the
standard, a code point is written using the notation ``U+12CA`` to mean the
character with value ``0x12ca`` (4,810 decimal). The Unicode standard contains
a lot of tables listing characters and their corresponding code points:
.. code-block:: none
0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+12CA``'. ``U+12CA`` is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
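The relationship between code points, characters and names can be checked from
Python itself. This is only a quick sketch using the built-in :func:`chr` and
:func:`ord` functions and the :mod:`unicodedata` module; the variable names are
arbitrary.

.. code-block:: python

    import unicodedata

    code_point = 0x12CA                    # the integer behind U+12CA
    character = chr(code_point)            # the character it represents
    name = unicodedata.name(character)     # its official Unicode name

    # Going the other way: from a character back to the U+XXXX notation
    notation = "U+%04X" % ord(character)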
A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.
Encodings
---------
To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal). This
sequence needs to be represented as a set of bytes (meaning, values
from 0 through 255) in memory. The rules for translating a Unicode string
into a sequence of bytes are called an **encoding**.
The first encoding you might think of is an array of 32-bit integers. In this
representation, the string "Python" would look like this:
.. code-block:: none
       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This representation is straightforward but using it presents a number of
problems.
1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points
are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
ASCII representation. Increased RAM usage doesn't matter too much (desktop
computers have gigabytes of RAM, and strings aren't usually that large), but
expanding our usage of disk and network bandwidth by a factor of 4 is
intolerable.
3. It's not compatible with existing C functions such as ``strlen()``, so a new
family of wide string functions would need to be used.
4. Many Internet standards are defined in terms of textual data, and can't
handle content with embedded zero bytes.
Generally people don't use this encoding, instead choosing other
encodings that are more efficient and convenient. UTF-8 is probably
the most commonly supported encoding; it will be discussed below.
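The space cost of the 32-bit representation is easy to measure. The sketch
below assumes the ``utf-32-le`` codec (little-endian, no byte order mark) as
the closest built-in equivalent of the array of 32-bit integers shown above.

.. code-block:: python

    text = "Python"

    as_utf32 = text.encode("utf-32-le")   # four bytes per code point
    as_ascii = text.encode("ascii")       # one byte per code point

    # Most of the 32-bit form is padding
    zero_bytes = as_utf32.count(0)

    # The other byte order puts the same four bytes in reverse
    first_big_endian = text.encode("utf-32-be")[:4]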
Encodings don't have to handle every possible Unicode character, and most
encodings don't. The rules for converting a Unicode string into the ASCII
encoding, for example, are simple; for each code point:
1. If the code point is < 128, each byte is the same as the value of the code
point.
2. If the code point is 128 or greater, the Unicode string can't be represented
in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
case.)
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
0--255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
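Both sets of rules are simple enough to write out by hand. The functions below
are an illustrative sketch only (``encode_ascii`` and ``encode_latin1`` are
made-up names); real code should call :meth:`str.encode`, which also raises the
more specific :exc:`UnicodeEncodeError`.

.. code-block:: python

    def encode_ascii(text):
        """Apply the two ASCII rules, one code point at a time."""
        out = bytearray()
        for character in text:
            code_point = ord(character)
            if code_point >= 128:
                # Rule 2: not representable in ASCII
                raise ValueError("cannot encode U+%04X as ASCII" % code_point)
            # Rule 1: the byte has the same value as the code point
            out.append(code_point)
        return bytes(out)


    def encode_latin1(text):
        """Latin-1: code points 0-255 map straight to byte values.

        bytes() itself raises ValueError for anything above 255.
        """
        return bytes(ord(character) for character in text)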
Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.
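The gap in the letter values can be seen with one of Python's EBCDIC codecs.
This sketch assumes ``cp037``, one of several EBCDIC variants that Python
ships; other variants may assign some characters differently.

.. code-block:: python

    # Byte values of a few lowercase letters in the cp037 EBCDIC variant
    letters = "aijr"
    values = list(letters.encode("cp037"))

    # 'i' and 'j' are adjacent letters but not adjacent byte values
    gap = values[2] - values[1]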
UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding. (There are also UTF-16 and UTF-32 encodings, but they are less
frequently used than UTF-8.) UTF-8 uses the following rules:
1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
four bytes, where each byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes containing no embedded zero
bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
processed by C functions such as ``strcpy()`` and sent through protocols that
can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
next UTF-8-encoded code point and resynchronize. It's also unlikely that
random 8-bit data will look like valid UTF-8.
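The two rules above leave out the exact bit layout. The sketch below fills it
in using the standard UTF-8 bit patterns, which are an addition here and not
spelled out in this HOWTO; ``utf8_encode`` is a hypothetical helper written for
illustration, and real code should simply call :meth:`str.encode`.

.. code-block:: python

    def utf8_encode(code_point):
        """Encode one code point as UTF-8 by hand.

        Illustration only: surrogate code points are not rejected here.
        """
        if code_point < 0x80:
            # Rule 1: a single byte with the same value
            return bytes([code_point])
        if code_point < 0x800:
            # Two bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (code_point >> 6),
                          0x80 | (code_point & 0x3F)])
        if code_point < 0x10000:
            # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (code_point >> 12),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        if code_point <= 0x10FFFF:
            # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return bytes([0xF0 | (code_point >> 18),
                          0x80 | ((code_point >> 12) & 0x3F),
                          0x80 | ((code_point >> 6) & 0x3F),
                          0x80 | (code_point & 0x3F)])
        raise ValueError("code point out of range")

Every byte produced for a code point of 128 or more has its high bit set, which
is why the result never contains an embedded zero byte or a stray ASCII
character.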
References
----------
The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.
To help understand the standard, Jukka Korpela has written `an introductory
guide <https://www.cs.tut.fi/~jkorpela/unicode/guide.html>`_ to reading the
Unicode character tables.
Another `good introductory article <http://www.joelonsoftware.com/articles/Unicode.html>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.
Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.
Python's Unicode Support
========================
Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.
The String Type
---------------
Since Python 3.0, the language features a :class:`str` type that contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.
The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::
   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except OSError:
       # 'File not found' error message.
       print("Fichier non trouvé")
You can use a different encoding from UTF-8 by putting a specially-formatted
comment as the first or second line of the source code::
# -*- coding: <encoding name> -*-
Side note: Python 3 also supports using Unicode characters in identifiers::
   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")
If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals. (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::
>>> "\N{GREEK CAPITAL LETTER DELTA}" # Using the character name
'\u0394'
>>> "\u0394" # Using a 16-bit hex value
'\u0394'
>>> "\U00000394" # Using a 32-bit hex value
'\u0394'
In addition, one can create a string using the :meth:`~bytes.decode` method of
:class:`bytes`. This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.
The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::
>>> b'\x80abc'.decode("utf-8", "strict") #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'
Encodings are specified as strings containing the encoding's name. Python 3.2
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::
>>> chr(57344)
'\ue000'
>>> ord('\ue000')
57344
Converting to Bytes
-------------------
The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.
The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).
The following example shows the different results::
>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii') #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'
The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module. Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.
Unicode Literals in Python Source Code
--------------------------------------
In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::
>>> s = "a\xac\u1234\u20ac\U00008000"
... #     ^^^^ two-digit hex escape
... #         ^^^^^^ four-digit Unicode escape
... #                     ^^^^^^^^^^ eight-digit Unicode escape
>>> [ord(c) for c in s]
[97, 172, 4660, 8364, 32768]
Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.
Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.
Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = 'abcdé'
print(ord(u[-1]))
The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.
If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. See also :pep:`263` for more information.
Unicode Properties
------------------
The Unicode specification includes a database of information about code points.
For each defined code point, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths). There are also properties related to the code point's use in
bidirectional text and other display-related properties.
The following program displays some information about several characters, and
prints the numeric value of one particular character::
import unicodedata
u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)
for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))
# Get numeric value of second character
print(unicodedata.numeric(u[1]))
When run, this prints:
.. code-block:: none
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0
The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.
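The numeric and bidirectional properties mentioned at the start of this section can be queried in the same way. The sketch below is only an illustration; the sample characters were picked for this example and do not come from the program above.

```python
import unicodedata

one_third = '\u2153'      # VULGAR FRACTION ONE THIRD
four_fifths = '\u2158'    # VULGAR FRACTION FOUR FIFTHS
roman_eight = '\u2167'    # ROMAN NUMERAL EIGHT
alef = '\u05d0'           # HEBREW LETTER ALEF

# numeric() returns the character's numeric value as a float.
for ch in (one_third, four_fifths, roman_eight):
    print(unicodedata.name(ch), unicodedata.numeric(ch))

# bidirectional() returns the bidirectional class as a short string,
# such as 'L' for left-to-right and 'R' for right-to-left characters.
print(unicodedata.bidirectional('a'), unicodedata.bidirectional(alef))
```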
Unicode Regular Expressions
---------------------------
The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings. Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string. For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.
The string in this example has the number 57 written in both Thai and
Arabic numerals::
import re
p = re.compile(r'\d+')
s = "Over \u0e55\u0e57 57 flavours"
m = p.search(s)
print(repr(m.group()))
When executed, ``\d+`` will match the Thai numerals and print them
out. If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.
Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.
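To compare the three behaviours side by side, a small sketch using the same sample string might look like this (the variable names are illustrative):

```python
import re

s = "Over \u0e55\u0e57 57 flavours"

# str pattern: \d matches any character in the 'Nd' category.
unicode_match = re.search(r'\d+', s)
# str pattern with re.ASCII: \d is restricted to [0-9].
ascii_match = re.search(r'\d+', s, re.ASCII)
# bytes pattern: \d is always [0-9]; the data must be bytes as well.
bytes_match = re.search(rb'\d+', s.encode('utf-8'))

print(ascii(unicode_match.group()))
print(ascii_match.group())
print(bytes_match.group())
```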
References
----------
.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?
Some good alternative discussions of Python's Unicode support are:
* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <http://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.
The :class:`str` type is described in the Python library reference at
:ref:`textseq`.
The documentation for the :mod:`unicodedata` module.
The documentation for the :mod:`codecs` module.
Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002. The slides are an excellent overview of the design of Python
2's Unicode features (where the Unicode string type is called ``unicode`` and
literals start with ``u``).
Reading and Writing Unicode Data
================================
Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?
It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``. However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`. This works through :func:`open`\'s *encoding* and
*errors* parameters which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.
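For illustration, this is roughly what handling partial sequences by hand involves. The sketch uses the incremental decoder from the :mod:`codecs` module and simulates a chunk boundary that falls in the middle of a character. The sample data and chunk sizes are assumptions made for the example; in normal code :func:`open` does this work for you.

```python
import codecs

data = 'caf\u00e9 \u4500'.encode('utf-8')
# Split inside the two-byte sequence for U+00E9 to simulate a chunk
# boundary that cuts a character in half.
chunks = [data[:4], data[4:]]

decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
for i, chunk in enumerate(chunks):
    # The decoder buffers an incomplete trailing sequence until more
    # bytes arrive; final=True on the last chunk reports truncated input.
    is_last = (i == len(chunks) - 1)
    pieces.append(decoder.decode(chunk, final=is_last))

text = ''.join(pieces)
print(pieces)
```

Calling ``bytes.decode`` on the first chunk alone would raise :exc:`UnicodeDecodeError`, which is the case the manual approach has to guard against.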
Reading Unicode from a file is therefore simple::
with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
It's also possible to open files in update mode, allowing both reading and
writing::
with open('test', encoding='utf-8', mode='w+') as f:
    f.write('\u4500 blah blah blah\n')
    f.seek(0)
    print(repr(f.readline()[:1]))
The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8. Use the
'utf-8-sig' codec to automatically skip the mark if present for reading such
files.
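The BOM behaviour described above can be observed directly on in-memory data. This is a minimal sketch; the sample text is arbitrary.

```python
import codecs

text = 'abc'

utf16 = text.encode('utf-16')        # BOM written first, native byte order
utf16_le = text.encode('utf-16-le')  # explicit byte order, no BOM written

has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A UTF-8 file that starts with the "BOM" signature.
utf8_with_mark = codecs.BOM_UTF8 + text.encode('utf-8')
plain = utf8_with_mark.decode('utf-8')         # mark kept as U+FEFF
stripped = utf8_with_mark.decode('utf-8-sig')  # mark skipped

print(has_bom, ascii(plain), ascii(stripped))
```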
Unicode filenames
-----------------
Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters. Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system. For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is. On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is UTF-8.
The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::
filename = 'filename\u4500abc'
with open(filename, 'w') as f:
    f.write('blah\n')
Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.
The :func:`os.listdir` function returns filenames and raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes. For example,
assuming the default filesystem encoding is UTF-8, running the following
program::
fn = 'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print(os.listdir(b'.'))
print(os.listdir('.'))
will produce the following output:
.. code-block:: shell-session
amk:~$ python t.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.
Note that on most occasions, the Unicode APIs should be used. The bytes APIs
should only be used on systems where undecodable file names can be present,
i.e. Unix systems.
Tips for Writing Unicode-aware Programs
---------------------------------------
This section provides some suggestions on writing software that deals with
Unicode.
The most important tip is:
Software should only work with Unicode strings internally, decoding the input
data as soon as possible and encoding the output only at the end.
If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
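A minimal sketch of the "decode early, encode late" pattern, including the :exc:`TypeError` raised when the two types are mixed. The sample bytes and the choice of UTF-8 are assumptions for the example.

```python
raw = b'caf\xc3\xa9'   # bytes as they might arrive from a file or socket

error = None
try:
    'Order: ' + raw    # mixing str and bytes is rejected
except TypeError as exc:
    error = exc

text = raw.decode('utf-8')            # decode at the input boundary
message = 'Order: ' + text.upper()    # work with str internally
output = message.encode('utf-8')      # encode only when writing out

print(type(error).__name__, message)
```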
When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible. This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.
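As an illustration of why the check belongs after decoding, consider input that declares itself to be UTF-7, an encoding that can spell ASCII characters in an alternative form. The payload below is a made-up example.

```python
payload = b'+AC8-etc+AC8-passwd'   # hypothetical attacker-supplied bytes
declared_encoding = 'utf-7'        # encoding named by the same untrusted input

# A check on the encoded bytes does not see any slash...
slash_in_bytes = b'/' in payload

# ...but the decoded string contains two of them.
decoded = payload.decode(declared_encoding)
slash_in_text = '/' in decoded

print(slash_in_bytes, slash_in_text, decoded)
```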
Converting Between File Encodings
'''''''''''''''''''''''''''''''''
The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.
For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::
import codecs
new_f = codecs.StreamRecoder(f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
    # reader/writer: used to read and write to the stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1') )
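To try this without a real file, the sketch below wraps an in-memory :class:`io.BytesIO` object that stands in for a Latin-1 file opened in binary mode. The sample text is an assumption for the example.

```python
import codecs
import io

# In-memory stand-in for a Latin-1 encoded file opened in binary mode.
f = io.BytesIO('caf\u00e9\n'.encode('latin-1'))

new_f = codecs.StreamRecoder(
    f,
    # Applied to data passing through read() and write().
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
    # Applied when talking to the underlying Latin-1 stream.
    codecs.getreader('latin-1'), codecs.getwriter('latin-1'))

utf8_bytes = new_f.read()
print(utf8_bytes)
```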
Files in an Unknown Encoding
''''''''''''''''''''''''''''
What can you do if you need to make a change to a file, but don't know
the file's encoding? If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::
with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()
# make changes to the string 'data'
with open(fname + '.new', 'w',
          encoding="ascii", errors="surrogateescape") as f:
    f.write(data)
The ``surrogateescape`` error handler will decode any non-ASCII bytes
as lone surrogate code points in the range U+DC80 to
U+DCFF. These code points will then be turned back into the
same bytes when the ``surrogateescape`` error handler is used when
encoding the data and writing it back out.
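The round trip can be checked on a bytes object without touching the filesystem. In this sketch the sample data is made up; it contains a single non-ASCII byte.

```python
raw = b'name=caf\xe9;size=10'   # unknown but ASCII-compatible encoding

text = raw.decode('ascii', errors='surrogateescape')
escaped = text[8]               # the byte 0xE9 was decoded as U+DCE9

# Modify only the ASCII part of the data.
text = text.replace('size=10', 'size=20')

# Encoding with the same handler restores the original byte.
round_tripped = text.encode('ascii', errors='surrogateescape')
print(ascii(escaped), round_tripped)
```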
References
----------
One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.
The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application. These slides cover Python 2.x only.
`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.
Acknowledgements
================
The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.
Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.

605
third_party/python/Doc/howto/urllib2.rst vendored Normal file

@ -0,0 +1,605 @@
.. _urllib-howto:
***********************************************************
HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************
:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
.. note::
There is a French translation of an earlier revision of this
HOWTO, available at `urllib2 - Le Manuel manquant
<http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.
Introduction
============
.. sidebar:: Related Articles
You may also find useful the following article on fetching web resources
with Python:
* `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_
A tutorial on *Basic Authentication*, with examples in Python.
**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.
urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.
For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.
Fetching URLs
=============
The simplest way to use urllib.request is as follows::
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    html = response.read()
If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::
import shutil
import tempfile
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        shutil.copyfileobj(response, tmp_file)
with open(tmp_file.name) as html:
    pass
Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.
HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::
import urllib.request
req = urllib.request.Request('http://www.voidspace.org.uk')
with urllib.request.urlopen(req) as response:
    the_page = response.read()
Note that urllib.request makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::
req = urllib.request.Request('ftp://example.com/')
In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.
Data
----
Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::
import urllib.parse
import urllib.request
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
data = urllib.parse.urlencode(values)
data = data.encode('ascii') # data should be bytes
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).
If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.
This is done as follows::
>>> import urllib.request
>>> import urllib.parse
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.parse.urlencode(data)
>>> print(url_values) # The order may differ from below. #doctest: +SKIP
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib.request.urlopen(full_url)
Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
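The difference between the two request types can be inspected without sending anything over the network. The sketch below builds both kinds of ``Request`` object and checks which HTTP method each would use; the URL and values are placeholders.

```python
import urllib.parse
import urllib.request

values = {'name': 'Somebody Here', 'language': 'Python'}
query = urllib.parse.urlencode(values)

url = 'http://www.example.com/example.cgi'

# No data argument: the values travel in the URL and the method is GET.
get_req = urllib.request.Request(url + '?' + query)
# With a data argument (bytes): the method becomes POST.
post_req = urllib.request.Request(url, data=query.encode('ascii'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)

# parse_qs() reverses urlencode(), returning a list of values per key.
decoded = urllib.parse.parse_qs(query)
```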
Headers
-------
We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.
Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::
import urllib.parse
import urllib.request
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python' }
headers = {'User-Agent': user_agent}
data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
    the_page = response.read()
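Headers can also be added after the Request object has been created, and inspected before the request is sent. A small sketch with a placeholder URL follows. Note that ``Request`` stores header names in capitalized form, so ``User-Agent`` is looked up as ``User-agent``.

```python
import urllib.request

req = urllib.request.Request(
    'http://www.example.com/',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'})
req.add_header('Accept-Language', 'en')

# Header names are normalized with str.capitalize() when stored.
user_agent = req.get_header('User-agent')
header_names = sorted(name for name, value in req.header_items())

print(header_names)
print(user_agent)
```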
The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.
Handling Exceptions
===================
*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).
:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.
The exception classes are exported from the :mod:`urllib.error` module.
URLError
--------
Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.
e.g. ::
>>> req = urllib.request.Request('http://www.pretend_server.org')
>>> try: urllib.request.urlopen(req)
... except urllib.error.URLError as e:
...     print(e.reason)      #doctest: +SKIP
...
(4, 'getaddrinfo failed')
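A :exc:`URLError` can be triggered without any network access by using a URL scheme that no handler knows about. This sketch uses a made-up scheme name; in this case the ``reason`` attribute is a plain message string rather than a network error.

```python
import urllib.error
import urllib.request

reason = None
try:
    # 'nosuchscheme' is a deliberately invalid, hypothetical URL scheme.
    urllib.request.urlopen('nosuchscheme://www.example.com/')
except urllib.error.URLError as e:
    reason = e.reason

print(reason)
```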
HTTPError
---------
Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).
See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.
The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.
Error Codes
~~~~~~~~~~~
Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.
:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
'Switching to new protocol; obey Upgrade header'),
200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),
300: ('Multiple Choices',
'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
'Document has not changed since given time'),
305: ('Use Proxy',
'You must use proxy specified in Location to access this '
'resource.'),
307: ('Temporary Redirect',
'Object moved temporarily -- see URI list'),
400: ('Bad Request',
'Bad request syntax or unsupported method'),
401: ('Unauthorized',
'No permission -- see authorization schemes'),
402: ('Payment Required',
'No payment -- see charging schemes'),
403: ('Forbidden',
'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
'Cannot satisfy request range.'),
417: ('Expectation Failed',
'Expect condition could not be satisfied.'),
500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
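In a program you would normally look codes up in the dictionary rather than copy it. A minimal sketch follows; ``describe_status`` is a hypothetical helper written for this example, not part of the standard library.

```python
from http.server import BaseHTTPRequestHandler

def describe_status(code):
    """Return 'code short-message', with a fallback for unlisted codes."""
    short, _long = BaseHTTPRequestHandler.responses.get(code, ('Unknown', ''))
    return '%d %s' % (code, short)

print(describe_status(404))
print(describe_status(799))
```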
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on the
page returned. This means that as well as the code attribute, it also has the read,
geturl, and info methods, as returned by the ``urllib.response`` module::
>>> req = urllib.request.Request('http://www.python.org/fish.html')
>>> try:
...     urllib.request.urlopen(req)
... except urllib.error.HTTPError as e:
...     print(e.code)
...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
...
404
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
...
<title>Page Not Found</title>\n
...
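The same behaviour can be reproduced offline against a throwaway local server. Everything in this sketch is an assumption made for the demonstration: the ``NotFoundHandler`` class, the loopback address, and the use of an opener with an empty ``ProxyHandler`` so that any configured proxy is bypassed.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a 404 page."""

    def do_GET(self):
        body = b'<title>Page Not Found</title>'
        self.send_response(404)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet

# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/fish.html' % server.server_port

# An empty ProxyHandler disables proxies for this opener.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

status = page = None
try:
    opener.open(url)
except urllib.error.HTTPError as e:
    status = e.code     # numeric status code sent by the server
    page = e.read()     # body of the error page
finally:
    server.shutdown()
    server.server_close()

print(status, page)
```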
Wrapping it Up
--------------
So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.
Number 1
~~~~~~~~
::
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    # everything is fine
    pass
.. note::
The ``except HTTPError`` *must* come first, otherwise ``except URLError``
will *also* catch an :exc:`HTTPError`.
Number 2
~~~~~~~~
::
from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    if hasattr(e, 'code'):
        # Check 'code' first: HTTPError also has a 'reason' attribute.
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    # everything is fine
    pass
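Both approaches depend on :exc:`HTTPError` being a subclass of :exc:`URLError`. The sketch below builds the two exceptions by hand to show how they can be told apart. ``describe`` is a hypothetical helper, and the constructor arguments are placeholder values chosen for the example.

```python
from urllib.error import HTTPError, URLError

def describe(exc):
    """Classify an exception raised by urlopen (illustrative only)."""
    if isinstance(exc, HTTPError):
        return 'HTTP error %d' % exc.code
    return 'unreachable: %s' % (exc.reason,)

# Arguments: url, code, msg, hdrs, fp (placeholder values).
http_err = HTTPError('http://www.example.com/', 404, 'Not Found', {}, None)
url_err = URLError('no route to host')

print(describe(http_err))
print(describe(url_err))

# An HTTPError has a 'reason' attribute too, so a test for 'reason'
# alone cannot distinguish it from a plain URLError.
print(hasattr(http_err, 'reason'), hasattr(url_err, 'code'))
```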
info and geturl
===============
The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the module
:mod:`urllib.response`.
**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.
**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.
Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <https://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
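Both methods can be tried without a network connection by fetching a ``data:`` URL, used here purely as a stand-in for an HTTP URL. For such a URL the object returned by ``info()`` is a generic message object and the headers are synthesized by urllib, so treat this as a sketch of the calls rather than of real server output.

```python
import urllib.request

url = 'data:text/plain;charset=utf-8,hello'   # stand-in for an HTTP URL

with urllib.request.urlopen(url) as response:
    final_url = response.geturl()        # URL actually fetched
    headers = response.info()            # dictionary-like header object
    content_type = headers['Content-type']
    body = response.read()

print(final_url)
print(content_type, body)
```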
Openers and Handlers
====================
When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.
To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.
Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.
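For instance, an opener that handles cookies and an opener that declines
redirections might be built as follows. This is a sketch: the
``NoRedirectHandler`` class name is made up for this example, and returning
``None`` from ``redirect_request`` is one common way to decline a redirect,
after which the 3xx response is reported as an :exc:`HTTPError`. ::

    import http.cookiejar
    import urllib.request

    # An opener that stores cookies and sends them back on later requests.
    cookie_jar = http.cookiejar.CookieJar()
    cookie_opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar))

    class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
        """Hypothetical handler that declines every redirect."""

        def redirect_request(self, req, fp, code, msg, headers, newurl):
            return None

    # Passing a subclass instance replaces the default HTTPRedirectHandler.
    no_redirect_opener = urllib.request.build_opener(NoRedirectHandler())

    handler_names = [type(h).__name__ for h in no_redirect_opener.handlers]
    print(handler_names)
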
Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.
``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.
Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
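The sketch below ties these pieces together with a made-up ``echo:`` scheme
(``EchoHandler`` and the scheme are assumptions for illustration, not part of
the library). It shows a handler serving one URL scheme, the opener's ``open``
method being called directly, and ``install_opener`` making the same opener
available through ``urlopen``. ::

    import email
    import io
    import urllib.request
    import urllib.response

    class EchoHandler(urllib.request.BaseHandler):
        """Hypothetical handler: 'echo:TEXT' yields TEXT as the body."""

        def echo_open(self, req):
            # A method named <scheme>_open is called for URLs of that scheme.
            text = req.full_url.split(":", 1)[1]
            headers = email.message_from_string("Content-type: text/plain\n")
            return urllib.response.addinfourl(
                io.BytesIO(text.encode("utf-8")), headers, req.full_url)

    opener = urllib.request.build_opener(EchoHandler())

    # Call the opener directly; nothing is installed globally yet.
    direct = opener.open("echo:hello").read()

    # After installation, plain urlopen() goes through the same opener.
    urllib.request.install_opener(opener)
    via_urlopen = urllib.request.urlopen("echo:world").read()

    print(direct, via_urlopen)
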
Basic Authentication
====================
To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works -- see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.
When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
realm="REALM"``.
e.g.
.. code-block:: none
WWW-Authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.
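For reference, here is a rough sketch of what the retried request carries under
the Basic scheme: the name and password are joined with a colon,
base64-encoded, and sent in an ``Authorization`` header. The helper name
``basic_auth_header`` and the credentials are made up for this illustration;
``HTTPBasicAuthHandler`` does the equivalent for you. ::

    import base64

    def basic_auth_header(username, password):
        """Return the Authorization header value for Basic authentication."""
        credentials = "{}:{}".format(username, password).encode("utf-8")
        return "Basic " + base64.b64encode(credentials).decode("ascii")

    header_value = basic_auth_header("Aladdin", "open sesame")
    print("Authorization:", header_value)
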
The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
unless you provide an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.
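A small sketch of how the default realm behaves, using the password manager on
its own (the URLs and credentials are placeholders). ``find_user_password`` is
the lookup performed when the server asks for authentication: ::

    import urllib.request

    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None as the realm means "use these credentials for any realm".
    password_mgr.add_password(None, "http://example.com/foo/", "joe", "secret")

    # The realm named by the server has no entry of its own,
    # so the default credentials apply.
    found = password_mgr.find_user_password(
        "cPanel Users", "http://example.com/foo/bar.html")

    # A URL outside the registered prefix yields no credentials.
    missing = password_mgr.find_user_password(
        "cPanel Users", "http://example.com/other/")

    print(found, missing)
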
The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``.add_password()`` will also match. ::
import urllib.request
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
opener.open(a_url)
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
.. note::
In the above example we only supplied our ``HTTPBasicAuthHandler`` to
``build_opener``. By default openers have the handlers for normal situations
-- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.
``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number). The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
not correct.
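As a sketch of the authority form (placeholder host and credentials), a
password registered for a bare hostname is found for URLs on that host, but not
for URLs that name another port: ::

    import urllib.request

    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # Register credentials for an authority rather than a full URL.
    password_mgr.add_password(None, "example.com", "joe", "secret")

    same_host = password_mgr.find_user_password(
        None, "http://example.com/private/page.html")
    other_port = password_mgr.find_user_password(
        None, "http://example.com:8080/private/page.html")

    print(same_host, other_port)
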
Proxies
=======
**urllib** will auto-detect your proxy settings and use those. This is done through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected. Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to disable proxy use is to set up our own
``ProxyHandler``, with no proxies defined. This is done using similar steps to
setting up a `Basic Authentication`_ handler: ::
>>> proxy_support = urllib.request.ProxyHandler({})
>>> opener = urllib.request.build_opener(proxy_support)
>>> urllib.request.install_opener(opener)
.. note::
Currently ``urllib.request`` *does not* support fetching of ``https`` locations
through a proxy. However, this can be enabled by extending urllib.request as
shown in the recipe [#]_.
.. note::
``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
the documentation on :func:`~urllib.request.getproxies`.
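To route requests through a specific proxy instead, a mapping from scheme to
proxy URL can be passed to ``ProxyHandler``. A sketch, with a made-up proxy
address; ``getproxies()`` shows what would be auto-detected on the current
machine: ::

    import urllib.request

    # Settings urllib would pick up automatically (may be an empty dict).
    detected = urllib.request.getproxies()
    print(detected)

    # Explicit mapping; the proxy address is a hypothetical placeholder.
    proxy_support = urllib.request.ProxyHandler(
        {"http": "http://proxy.example.com:3128"})
    opener = urllib.request.build_opener(proxy_support)

    # Use opener.open(...) for proxied requests, or install the opener
    # globally with urllib.request.install_opener(opener).
    print(proxy_support.proxies)
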
Sockets and Layers
==================
The Python support for fetching resources from the web is layered. urllib uses
the :mod:`http.client` library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. ``urlopen``
accepts an optional ``timeout`` argument (in seconds) for an individual call.
Alternatively, you can set the default timeout globally for all sockets using ::
import socket
import urllib.request
# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
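A timeout can also be given for a single request through the ``timeout``
argument of ``urlopen``, which leaves other sockets untouched. In this sketch
the helper name ``fetch`` is made up, and a ``data:`` URL stands in for a real
address so that the example runs offline: ::

    import urllib.request

    def fetch(url, timeout=10):
        """Return the body of url, or None if the request fails or times out."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except OSError as exc:
            # URLError and socket timeouts are both OSError subclasses.
            print("Request failed:", exc)
            return None

    body = fetch("data:text/plain,ok", timeout=5)
    print(body)
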
-------
Footnotes
=========
This document was reviewed and revised by John Lee.
.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
sites using web standards is much more sensible. Unfortunately a lot of
sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
*'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
`Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
attempt to fetch *localhost* URLs through this proxy it blocks them. IE
is set to use the proxy, which urllib picks up on. In order to test
scripts with a localhost server, I have to prevent urllib from using
the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
<https://code.activestate.com/recipes/456195/>`_.