A relatively little-known part of the Python standard library that I
find myself using fairly regularly at work these days is the groupby
function in the itertools module. In a nutshell, groupby takes an
iterable and breaks it up into sub-iterators, starting a new one each
time the "key" of the incoming items changes. It does this lazily,
without reading the entire source into memory.
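To make the laziness a bit more concrete, here is a minimal sketch. The readings generator, its data, and the consumed list are all made up for illustration; the list only exists to record how far into the source groupby has read.

```python
from itertools import groupby

consumed = []  # records which items have been pulled from the source so far

def readings():
    # Hypothetical source: yields (sensor, value) pairs one at a time.
    for item in [('a', 1), ('a', 2), ('b', 3), ('b', 4)]:
        consumed.append(item)
        yield item

groups = groupby(readings(), key=lambda pair: pair[0])
print(len(consumed))        # 0: creating the groupby object reads nothing

key, items = next(groups)   # reads just enough to learn the first key
first_group = list(items)   # drains only the first group

print(key)
print(first_group)
print(len(consumed) < 4)    # the tail of the source is still unread
```

The source is only advanced as far as the loop actually needs, which is why this works on inputs that are too big to hold in a list.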
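The "key" is almost always based on some part of the items returned by
the iterator. It is defined by a "key function", much like the key
argument to the sorted builtin function. groupby only groups consecutive
items that share a key, so it works best when the data is already
ordered by that key. This isn't strictly necessary, though: on unordered
data the same key simply shows up in more than one group, which may or
may not be what you want depending on the use case.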
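A small sketch of how the ordering of the data affects the groups. The rows list is hypothetical; the point is only to compare grouping the data as-is with grouping it after sorting by the same key function.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: the key 'a' appears in two separate runs.
rows = [('a', 1), ('b', 2), ('a', 3)]

# Unsorted input: groupby starts a new group at every key change.
unsorted_keys = [key for key, _ in groupby(rows, itemgetter(0))]
print(unsorted_keys)   # ['a', 'b', 'a']

# Sorting by the same key function first yields one group per key.
by_key = sorted(rows, key=itemgetter(0))
grouped = {key: [value for _, value in items]
           for key, items in groupby(by_key, itemgetter(0))}
print(grouped)         # {'a': [1, 3], 'b': [2]}
```

Keep in mind that sorted reads the whole input into memory, so for large sources it is usually better to have the data arrive already ordered (for example from the database) than to sort it in Python.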
I've successfully used groupby for splitting up the results of large
database queries or the contents of large data files. The resulting
code ends up being clean and small.
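As a rough sketch of the database case, assuming an in-memory SQLite database with a made-up orders table (the table, column names, and rows are all hypothetical), the cursor can be handed straight to groupby:

```python
import sqlite3
from itertools import groupby
from operator import itemgetter

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (customer TEXT, amount INTEGER)')
conn.executemany('INSERT INTO orders VALUES (?, ?)',
                 [('alice', 10), ('bob', 5), ('alice', 7), ('bob', 1)])

# ORDER BY makes rows with the same customer adjacent, which is what
# groupby needs in order to emit exactly one group per customer.
cursor = conn.execute(
    'SELECT customer, amount FROM orders ORDER BY customer')

totals = {}
for customer, rows in groupby(cursor, key=itemgetter(0)):
    totals[customer] = sum(amount for _, amount in rows)

conn.close()
print(totals)   # {'alice': 17, 'bob': 6}
```

A plain sum like this could of course be done with GROUP BY in SQL. The pattern earns its keep when the per-group work is something the database can't easily express.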
Here's an example:
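
    from itertools import groupby
    from operator import itemgetter

    # Dummy data; rows after the first are illustrative placeholders.
    things = [('2009-09-02', 11),
              ('2009-09-03', 10)]

    for key, items in groupby(things, itemgetter(0)):
        print(key)
        for subitem in items:
            print(subitem)
        print('-' * 20)
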
Here the dummy data in the "things" list is grouped by the first item
of each element. For each group, the key is printed, followed by every
item from that group's sub-iterator and then a separator line.
The output looks like:
The "things" list is a contrived example. In a real-world situation
this could be a database cursor object or a CSV reader object. Any
iterable object can be used.
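For instance, here is a minimal sketch with a CSV reader. The io.StringIO object stands in for an open file, and the two-column layout and values are assumptions made for the example.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Stand-in for an open file; the column layout is hypothetical.
data = io.StringIO(
    "2009-09-02,11\n"
    "2009-09-02,3\n"
    "2009-09-03,10\n"
)

reader = csv.reader(data)

# Count the rows for each date without loading the whole file.
counts = [(date, len(list(rows)))
          for date, rows in groupby(reader, key=itemgetter(0))]
print(counts)   # [('2009-09-02', 2), ('2009-09-03', 1)]
```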
Here's a closer look at what groupby is doing using the Python
interactive shell:

    >>> iterator = groupby(things, itemgetter(0))
    >>> iterator
    <itertools.groupby object at 0x95d3acc>
    >>> next(iterator)
    ('2009-09-02', <itertools._grouper object at 0x95e0d0c>)
    >>> next(iterator)
    ('2009-09-03', <itertools._grouper object at 0x95e0aec>)

You can see how a key and sub-iterator are returned for each pass
through the groupby iterator.
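One consequence worth knowing: each sub-iterator shares the underlying source with the groupby iterator, so it should be consumed (or copied into a list) before moving on to the next key. The sketch below uses made-up data shaped like the "things" list; the behaviour of the stale groups is what I'd expect on a current CPython 3.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical data with the same shape as the "things" list.
things = [('2009-09-02', 11), ('2009-09-02', 3), ('2009-09-03', 10)]

# Advancing the outer iterator first leaves the earlier groups unusable.
stale = [(key, items) for key, items in groupby(things, itemgetter(0))]
stale_contents = [list(items) for _, items in stale]
print(stale_contents)   # the groups come back empty on current CPython

# Materializing each group while it is current keeps the items.
kept = [(key, list(items))
        for key, items in groupby(things, itemgetter(0))]
print(kept)
```

If the groups are needed later, storing list(items) as in the second loop is the simplest fix, at the cost of holding each group in memory.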
groupby is a handy tool to have under your belt. Think of it whenever
you need to split up a dataset by some criterion.