Files
bonobo/docs/tutorial/tut01.rst
2017-04-22 22:47:41 +02:00

133 lines
5.1 KiB
ReStructuredText

Basic concepts
==============
To begin with Bonobo, you need to install it in a working python 3.5+ environment:
.. code-block:: shell-session
$ pip install bonobo
See :doc:`/install` for more options.
Let's write a first data transformation
:::::::::::::::::::::::::::::::::::::::
We'll start with the simplest transformation possible.
In **Bonobo**, a transformation is a plain old python callable, not more, not less. Let's write one that takes a string
and uppercase it.
.. code-block:: python
def uppercase(x: str):
return x.upper()
Pretty straightforward.
You could even use :func:`str.upper` directly instead of writing a wrapper, as a type's method (unbound) will take an
instance of this type as its first parameter (what you'd call `self` in your method).
The type annotations written here are not used, but can make your code much more readable, and may very well be used as
validators in the future.
Let's write two more transformations: a generator to produce the data to be transformed, and something that outputs it,
because, yeah, feedback is cool.
.. code-block:: python
def generate_data():
yield 'foo'
yield 'bar'
yield 'baz'
def output(x: str):
print(x)
Once again, you could have skipped the pain of writing this and simply use an iterable to generate the data and the
builtin :func:`print` for the output, but we'll stick to writing our own transformations for now.
Let's chain the three transformations together and run the transformation graph:
.. code-block:: python
import bonobo
graph = bonobo.Graph(generate_data, uppercase, output)
if __name__ == '__main__':
bonobo.run(graph)
.. graphviz::
digraph {
rankdir = LR;
stylesheet = "../_static/graphs.css";
BEGIN [shape="point"];
BEGIN -> "generate_data" -> "uppercase" -> "output";
}
We use the :func:`bonobo.run` helper that hides the underlying object composition necessary to actually run the
transformations in parallel, because it's simpler.
Depending on what you're doing, you may use the shorthand helper method, or the verbose one. Always favor the shorter,
if you don't need to tune the graph or the execution strategy (see below).
Takeaways
:::::::::
① The :class:`bonobo.Graph` class is used to represent a data-processing pipeline.
It can represent simple list-like linear graphs, like here, but it can also represent much more complex graphs, with
branches and cycles.
This is what the graph we defined looks like:
.. graphviz::
digraph {
rankdir = LR;
BEGIN [shape="point"];
BEGIN -> "iter(['foo', 'bar', 'baz'])" -> "str.upper" -> "print";
}
`Transformations` are simple python callables. Whatever can be called can be used as a `transformation`. Callables can
either `return` or `yield` data to send it to the next step. Regular functions (using `return`) should be prefered if
each call is guaranteed to return exactly one result, while generators (using `yield`) should be prefered if the
number of output lines for a given input varies.
③ The `Graph` instance, or `transformation graph` is then executed using an `ExecutionStrategy`. You did not use it
directly in this tutorial, but :func:`bonobo.run` created an instance of :class:`bonobo.ThreadPoolExecutorStrategy`
under the hood (which is the default strategy). Actual behavior of an execution will depend on the strategy chosen, but
the default should be fine in most of the basic cases.
④ Before actually executing the `transformations`, the `ExecutorStrategy` instance will wrap each component in an
`execution context`, whose responsibility is to hold the state of the transformation. It enables to keep the
`transformations` stateless, while allowing to add an external state if required. We'll expand on this later.
Concepts and definitions
::::::::::::::::::::::::
* Transformation: a callable that takes input (as call parameters) and returns output(s), either as its return value or
by yielding values (a.k.a returning a generator).
* Transformation graph (or Graph): a set of transformations tied together in a :class:`bonobo.Graph` instance, which is a simple
directed acyclic graph (also refered as a DAG, sometimes).
* Node: a transformation within the context of a transformation graph. The node defines what to do whith a
transformation's output, and especially what other node to feed with the output.
* Execution strategy (or strategy): a way to run a transformation graph. It's responsibility is mainly to parallelize
(or not) the transformations, on one or more process and/or computer, and to setup the right queuing mechanism for
transformations' inputs and outputs.
* Execution context (or context): a wrapper around a node that holds the state for it. If the node need the state, there
are tools available in bonobo to feed it to the transformation using additional call parameters, and so every
transformation will be atomic.
Next
::::
You now know all the basic concepts necessary to build (batch-like) data processors.
If you're confident with this part, let's get to a more real world example, using files and nice console output:
:doc:`basics2`