Graphs
======

Graphs are the glue that ties transformations together. They are the only data structure Bonobo can execute directly.
Graphs must be acyclic, and can contain as many nodes as your system can handle. In theory the number of nodes can be
very high, but practical use cases rarely exceed a few hundred nodes, and only in extreme cases.

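To make the "acyclic" requirement concrete, here is a small standard-library sketch that is independent of Bonobo. It
assumes a graph described as a plain mapping from each node to its predecessors, and uses `graphlib.TopologicalSorter`
(Python 3.9 or later) to detect a cycle. The helper name `is_acyclic` and the sample mappings are hypothetical; how
Bonobo itself stores or validates graphs is not shown here.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(predecessors):
    """Return True if the mapping {node: set of predecessor nodes} has no cycle."""
    try:
        # static_order() is lazy, so consume it to force the cycle check.
        list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True


# a -> b -> c
linear = {"b": {"a"}, "c": {"b"}}

# a -> b -> c -> a, which loops back on itself
looped = {"b": {"a"}, "c": {"b"}, "a": {"c"}}
```

Only the first of these two shapes is the kind of structure a graph of transformations may have.
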
Definitions
:::::::::::

Graph
    A directed acyclic graph of transformations that Bonobo can inspect and execute.

Node
    A transformation within a graph. Transformations are stateless, and have no idea whether they are included in one
    graph, in several graphs, or in none at all.

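The following sketch illustrates what "stateless and unaware of the graph" can mean in practice, assuming a
transformation is written as an ordinary Python generator function. The names `uppercase` and `split_words` are
hypothetical and only serve the illustration.

```python
def uppercase(row):
    """A stateless transformation: the output depends only on the input row."""
    yield row.upper()


def split_words(row):
    """Yield one output per word, so one input can produce several outputs."""
    for word in row.split():
        yield word


# Nothing here refers to a graph, so the functions can be called on their own.
standalone = list(uppercase("hello world"))

# The same function objects could be placed in any number of graphs, or simply
# composed by hand as done here.
words = [word for row in standalone for word in split_words(row)]
```

Because the functions keep no state between calls, they can be unit-tested without building or running a graph.
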
Creating a graph
::::::::::::::::

Graphs should be instances of :class:`bonobo.Graph`. The :meth:`bonobo.Graph.add_chain` method accepts any number of
positional arguments, each one a transformation, and chains them in the order given.

.. code-block:: python

    import bonobo

    # a, b and c stand for your own transformations.
    graph = bonobo.Graph()
    graph.add_chain(a, b, c)

Resulting graph:

.. graphviz::

    digraph {
        rankdir = LR;
        stylesheet = "../_static/graphs.css";

        BEGIN [shape="point"];
        BEGIN -> "a" -> "b" -> "c";
    }

Non-linear graphs
:::::::::::::::::

Divergences / forks
-------------------

To create two or more divergent data streams ("forks"), pass the `_input` keyword argument to `add_chain`. The new
chain is then attached to the output of the given node instead of to the start of the graph.

.. code-block:: python

    import bonobo

    graph = bonobo.Graph()
    graph.add_chain(a, b, c)
    # Attach f -> g to the output of b, creating a fork after b.
    graph.add_chain(f, g, _input=b)

Resulting graph:

.. graphviz::

    digraph {
        rankdir = LR;
        stylesheet = "../_static/graphs.css";

        BEGIN [shape="point"];
        BEGIN -> "a" -> "b" -> "c";
        "b" -> "f" -> "g";
    }

.. note:: Both branches receive the same data, at the same time.

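The fork semantics can be pictured with a plain-Python sketch that does not use Bonobo at all. It is single-threaded,
so it models only the "same data" part of the note and not the timing; the helper `fan_out` and the sample
transformations are hypothetical.

```python
def fan_out(source, *branches):
    """Send every item from source to each branch, item by item."""
    results = [[] for _ in branches]
    for item in source:
        for collected, branch in zip(results, branches):
            collected.extend(branch(item))
    return results


def b():
    """Stand-in for the output of node b."""
    yield from ("x", "y")


def c(row):
    yield row + "-c"


def f(row):
    yield row + "-f"


# Both branches see every row that b produced.
to_c, to_f = fan_out(b(), c, f)
```

Neither branch consumes rows away from the other: each row emitted by `b` reaches both `c` and `f`.
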
Convergence / merges
--------------------

To merge two data streams, you can use the `_output` keyword argument of `add_chain`, or use named nodes (see below).

.. code-block:: python

    import bonobo

    graph = bonobo.Graph()

    # Setting _input to None means normalize does not start on its own; it only
    # runs once it receives input from the other chains.
    graph.add_chain(normalize, store, _input=None)

    # Add two different chains, both feeding into normalize.
    graph.add_chain(a, b, _output=normalize)
    graph.add_chain(f, g, _output=normalize)

Resulting graph:

.. graphviz::

    digraph {
        rankdir = LR;
        stylesheet = "../_static/graphs.css";

        BEGIN [shape="point"];
        BEGIN -> "a" -> "b" -> "normalize";

        BEGIN2 [shape="point"];
        BEGIN2 -> "f" -> "g" -> "normalize";

        "normalize" -> "store";
    }

.. note::

    This is not a "join" or a "cartesian product". Any data that comes from `b` or `g` goes through `normalize`, one
    item at a time. Think of the graph edges as data flow pipes.

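The difference between a merge and a cartesian product can be shown with a short standard-library sketch that does not
involve Bonobo. The lists `from_b` and `from_g` are hypothetical stand-ins for the rows leaving `b` and `g`, and
`itertools.chain` is used only to model "one item at a time"; in a real execution the arrival order of rows from the
two streams should not be assumed.

```python
from itertools import chain, product

from_b = ["b1", "b2"]
from_g = ["g1", "g2", "g3"]


def normalize(row):
    """Stand-in transformation applied to each incoming row."""
    yield row.upper()


# Merge: every row from either stream passes through normalize on its own.
merged = [out for row in chain(from_b, from_g) for out in normalize(row)]

# For contrast only: a cartesian product would pair rows across the streams.
paired = list(product(from_b, from_g))
```

With two rows on one side and three on the other, the merge yields five rows, whereas a product would yield six pairs.
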
Named nodes
:::::::::::

Using the code above to create convergences often leads to code that is hard to read, because you have to define the
"target" stream before the streams that logically come first in the transformation graph. To overcome that, you can
use "named" nodes::

    graph.add_chain(x, y, z, _name='zed')
    graph.add_chain(f, g, h, _input='zed')

Applied to the convergence example, with `load` naming the chain that starts at `normalize`:

.. code-block:: python

    import bonobo

    graph = bonobo.Graph()

    # Add two different chains
    graph.add_chain(a, b, _output="load")
    graph.add_chain(f, g, _output="load")

    # Here we set _input to None, so normalize won't get the "begin" impulse.
    graph.add_chain(normalize, store, _input=None, _name="load")

Resulting graph:

.. graphviz::

    digraph {
        rankdir = LR;
        stylesheet = "../_static/graphs.css";

        BEGIN [shape="point"];
        BEGIN -> "a" -> "b" -> "normalize (load)";

        BEGIN2 [shape="point"];
        BEGIN2 -> "f" -> "g" -> "normalize (load)";

        "normalize (load)" -> "store";
    }

Inspecting graphs
:::::::::::::::::

Bonobo is bundled with an "inspector" that can use Graphviz to let you visualize your graphs.

Read `How to inspect and visualize your graph <https://www.bonobo-project.org/how-to/inspect-an-etl-jobs-graph>`_.

Executing graphs
::::::::::::::::

There are two ways to execute a graph. They produce similar results but target different use cases.

* You can use the Bonobo command line interface, which is the highest-level interface.
* You can use the Python API, which is lower level but lets you use Bonobo from within your own code (for example, a
  Django management command).

Executing a graph with the command line interface
-------------------------------------------------

Unless you have a good reason not to, you should use `bonobo run ...` to run transformation graphs found in your
Python source files.

.. code-block:: shell-session

    $ bonobo run file.py

You can also run a Python module:

.. code-block:: shell-session

    $ bonobo run -m my.own.etlmod

In each case, Bonobo's CLI will look for an instance of :class:`bonobo.Graph` in your file or module, create the
plumbing needed to execute it, and run it.

If you're in an interactive terminal context, it will use :class:`bonobo.ext.console.ConsoleOutputPlugin` for display.

If you're in a Jupyter notebook context, it will (try to) use :class:`bonobo.ext.jupyter.JupyterOutputPlugin`.

Executing a graph using the internal API
----------------------------------------

To integrate Bonobo executions into other Python code, you should use :func:`bonobo.run`. It behaves much like the
CLI, and you should be able to figure out its usage quite easily by reading its source.

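As a starting point, a minimal sketch of calling :func:`bonobo.run` from your own code might look like the following.
The transformations `extract`, `transform` and `load`, the sample data, and the wrapper `run_pipeline` are hypothetical
names chosen for illustration. The call passes only the graph and leaves out any optional arguments.

```python
def extract():
    """Produce the rows to process (hypothetical sample data)."""
    yield "foo"
    yield "bar"


def transform(row):
    """Upper-case each row."""
    yield row.upper()


def load(row):
    """Print each row; a real job would write it somewhere."""
    print(row)


def run_pipeline():
    """Build the graph and execute it from within your own code."""
    # Imported here so the transformations above stay usable without Bonobo.
    import bonobo

    graph = bonobo.Graph()
    graph.add_chain(extract, transform, load)
    return bonobo.run(graph)
```

A function such as `run_pipeline` could then be called from wherever your application needs it, for example from the
body of a Django management command.
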