Minor fixes and update documentation. Preparing the upcoming 0.2 release.
@ -1,161 +0,0 @@
|
||||
Basic concepts
|
||||
==============
|
||||
|
||||
To begin with Bonobo, you need to install it in a working python 3.5+ environment:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ pip install bonobo
|
||||
|
||||
See :doc:`/install` for more options.
|
||||
|
||||
Let's write a first data transformation
|
||||
:::::::::::::::::::::::::::::::::::::::
|
||||
|
||||
We'll start with the simplest components we can.
|
||||
|
||||
In **Bonobo**, a component is a plain old python callable, not more, not less. Let's write one that takes a string and
|
||||
uppercases it.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    def uppercase(x: str):
        return x.upper()
|
||||
|
||||
Pretty straightforward.
|
||||
|
||||
You could even use :func:`str.upper` directly instead of writing a wrapper, as a type's method (unbound) will take an
|
||||
instance of this type as its first parameter (what you'd call `self` in your method).
|
||||
|
||||
The type annotations written here are not used, but can make your code much more readable, and may very well be used as
|
||||
validators in the future.
|
||||
|
||||
Let's write two more components: a generator to produce the data to be transformed, and something that outputs it,
|
||||
because, yeah, feedback is cool.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    def generate_data():
        yield 'foo'
        yield 'bar'
        yield 'baz'

    def output(x: str):
        print(x)
|
||||
|
||||
Once again, you could have skipped the pain of writing this and simply used an iterable to generate the data and the
|
||||
builtin :func:`print` for the output, but we'll stick to writing our own components for now.
|
||||
|
||||
Let's chain the three components together and run the transformation:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo import run
|
||||
|
||||
run(generate_data, uppercase, output)
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
stylesheet = "../_static/graphs.css";
|
||||
|
||||
BEGIN [shape="point"];
|
||||
BEGIN -> "generate_data" -> "uppercase" -> "output";
|
||||
}
|
||||
|
||||
We use the :func:`bonobo.run` helper that hides the underlying object composition necessary to actually run the
|
||||
components in parallel, because it's simpler.
|
||||
|
||||
Depending on what you're doing, you may use the shorthand helper or the verbose form. Always favor the shorter one
if you don't need to tune the graph or the execution strategy (see below).
|
||||
|
||||
Diving in
|
||||
:::::::::
|
||||
|
||||
Let's rewrite it using the builtin functions :func:`str.upper` and :func:`print` instead of our own wrappers, and expand
|
||||
the :func:`bonobo.run()` helper so you see what's inside...
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    from bonobo import Graph, ThreadPoolExecutorStrategy

    # Represent our data processor as a simple directed graph of callables.
    graph = Graph()
    graph.add_chain(
        ('foo', 'bar', 'baz'),
        str.upper,
        print,
    )

    # Use a thread pool.
    executor = ThreadPoolExecutorStrategy()

    # Run the thing.
    executor.execute(graph)
|
||||
|
||||
We also switched our generator for a tuple; **Bonobo** will wrap it as a generator itself if it's not callable but
|
||||
iterable.
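
As a rough illustration of that wrapping, here is a plain-Python sketch. The helper name `as_callable` is hypothetical and this is not Bonobo's actual code; it only shows how a non-callable iterable can be turned into a function that yields its items.

```python
def as_callable(node):
    """Return node unchanged if it is callable, else wrap the iterable in a generator function."""
    if callable(node):
        return node

    def generate():
        # Each call yields the items of the original iterable, in order.
        yield from node

    return generate


source = as_callable(('foo', 'bar', 'baz'))
items = list(source())

# Callables such as str.upper pass through untouched.
upper = as_callable(str.upper)
```

With this in place, every node in a chain can be treated the same way: it is called, and its outputs are passed on.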
|
||||
|
||||
The shorthand version with builtins would look like this:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    from bonobo import run

    run(
        ('foo', 'bar', 'baz'),
        str.upper,
        print,
    )
|
||||
|
||||
Both methods are strictly equivalent (see :func:`bonobo.run`). When in doubt, prefer the shorter version.
|
||||
|
||||
Takeaways
|
||||
:::::::::
|
||||
|
||||
① The :class:`bonobo.Graph` class is used to represent a data-processing pipeline.
|
||||
|
||||
It can represent simple list-like linear graphs, like here, but it can also represent much more complex graphs, with
|
||||
branches and cycles.
|
||||
|
||||
This is what the graph we defined looks like:
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
"iter(['foo', 'bar', 'baz'])" -> "str.upper" -> "print";
|
||||
}
|
||||
|
||||
|
||||
② `Components` are simple python callables. Whatever can be called can be used as a `component`. Callables can
either `return` or `yield` data to send it to the next step. Regular functions (using `return`) should be preferred if
each call is guaranteed to return exactly one result, while generators (using `yield`) should be preferred if the
number of output lines for a given input varies.
|
||||
|
||||
③ The `graph` is then executed using an `ExecutionStrategy`. In this tutorial, we'll only use
|
||||
:class:`bonobo.ThreadPoolExecutorStrategy`, which uses an underlying `concurrent.futures.ThreadPoolExecutor` to
|
||||
schedule calls in a pool of threads, but basically this strategy is what determines the actual behaviour of execution.
|
||||
|
||||
④ Before actually executing the `components`, the `ExecutionStrategy` instance will wrap each component in a `context`,
|
||||
whose responsibility is to hold the state, to keep the `components` stateless. We'll expand on this later.
|
||||
|
||||
Concepts and definitions
|
||||
::::::::::::::::::::::::
|
||||
|
||||
* Component
|
||||
* Graph
|
||||
* Executor
|
||||
|
||||
.. todo:: Definitions, and substitute vague terms in the page by the exact term defined here
|
||||
|
||||
|
||||
Next
|
||||
::::
|
||||
|
||||
You now know all the basic concepts necessary to build (batch-like) data processors.
|
||||
|
||||
If you're confident with this part, let's get to a more real world example, using files and nice console output:
|
||||
:doc:`basics2`
|
||||
|
||||
@ -1,46 +0,0 @@
|
||||
Working with files
|
||||
==================
|
||||
|
||||
Bonobo would not be of any use if the aim was to uppercase small lists of strings. In fact, Bonobo should not be used
|
||||
if you don't expect any gain from parallelization of tasks.
|
||||
|
||||
Let's take the following graph as an example:
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
"A" -> "B" -> "C";
|
||||
}
|
||||
|
||||
The execution strategy does a bit of behind-the-scenes work, wrapping every component in a thread (assuming you're using
the :class:`bonobo.ThreadPoolExecutorStrategy`), which allows `B` to start running as soon as `A` has yielded its first line
of data, and `C` as soon as `B` has yielded its first line of data, even if `A` or `B` still have data to yield.
|
||||
|
||||
The great thing is that you generally don't have to think about it. Just be aware that your components will be run in
|
||||
parallel, and don't worry too much about blocking components, as they won't block their siblings.
|
||||
|
||||
That being said, let's try to write a more real-world transformation.
|
||||
|
||||
Reading a file
|
||||
::::::::::::::
|
||||
|
||||
There are a few component builders available in **Bonobo** that let you read files. You should at least know about the following:
|
||||
|
||||
* :class:`bonobo.FileReader` (aliased as :func:`bonobo.from_file`)
|
||||
* :class:`bonobo.JsonFileReader` (aliased as :func:`bonobo.from_json`)
|
||||
* :class:`bonobo.CsvFileReader` (aliased as :func:`bonobo.from_csv`)
|
||||
|
||||
Reading a file is as simple as using one of those, and for the example, we'll use a text file that was generated using
|
||||
Bonobo from the "liste-des-cafes-a-un-euro" dataset made available by Mairie de Paris under the Open Database
|
||||
License (ODbL). You can `explore the original dataset <https://opendata.paris.fr/explore/dataset/liste-des-cafes-a-un-euro/information/>`_.
|
||||
You'll need the example dataset, available in **Bonobo**'s repository.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    from bonobo import FileReader, run

    run(
        FileReader('examples/datasets/cheap_coffeeshops_in_paris.txt'),
        print,
    )
|
||||
@ -3,12 +3,38 @@ First steps
|
||||
|
||||
We tried hard to make **Bonobo** simple. We use simple python, and we believe it should be simple to learn.
|
||||
|
||||
Tutorial
|
||||
::::::::
|
||||
|
||||
We strongly advise that, even if you're an advanced Python developer, you go through the whole tutorial for two
|
||||
reasons: that should be sufficient to do anything possible with **Bonobo** and that's a good moment to learn the few
|
||||
concepts you'll see everywhere in the software.
|
||||
|
||||
If you're not familiar with python, you should first read :doc:`./python`.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
basics
|
||||
basics2
|
||||
tut01
|
||||
tut02
|
||||
|
||||
Where to go next?
|
||||
:::::::::::::::::
|
||||
|
||||
When you're done with the tutorial, you may be interested in the following next steps:
|
||||
|
||||
Read the :doc:`../reference/examples`
|
||||
|
||||
Read about best development practices
|
||||
-------------------------------------
|
||||
|
||||
* :doc:`../guide/index`
|
||||
* :doc:`../guide/purity`
|
||||
|
||||
Read about integrating external tools with bonobo
|
||||
-------------------------------------------------
|
||||
|
||||
* :doc:`../guide/ext/docker`: run transformation graphs in isolated containers.
|
||||
* :doc:`../guide/ext/jupyter`: run transformations within jupyter notebooks.
|
||||
* :doc:`../guide/ext/selenium`: run
|
||||
* :doc:`../guide/ext/sqlalchemy`: everything you need to interact with SQL databases.
|
||||
|
||||
16
docs/tutorial/python.rst
Normal file
@ -0,0 +1,16 @@
|
||||
Just enough Python for Bonobo
|
||||
=============================
|
||||
|
||||
This guide is intended to help programmers or enthusiasts grasp the Python basics necessary to use Bonobo. It should
definitely not be considered a general Python introduction, nor a deep dive into details.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
python01
|
||||
python02
|
||||
python03
|
||||
python04
|
||||
python05
|
||||
|
||||
|
||||
132
docs/tutorial/tut01.rst
Normal file
@ -0,0 +1,132 @@
|
||||
Basic concepts
|
||||
==============
|
||||
|
||||
To begin with Bonobo, you need to install it in a working python 3.5+ environment:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ pip install bonobo
|
||||
|
||||
See :doc:`/install` for more options.
|
||||
|
||||
Let's write a first data transformation
|
||||
:::::::::::::::::::::::::::::::::::::::
|
||||
|
||||
We'll start with the simplest transformation possible.
|
||||
|
||||
In **Bonobo**, a transformation is a plain old python callable, not more, not less. Let's write one that takes a string
|
||||
and uppercases it.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    def uppercase(x: str):
        return x.upper()
|
||||
|
||||
Pretty straightforward.
|
||||
|
||||
You could even use :func:`str.upper` directly instead of writing a wrapper, as a type's method (unbound) will take an
|
||||
instance of this type as its first parameter (what you'd call `self` in your method).
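
To make the point about unbound methods concrete, the short sketch below uses only standard Python (no Bonobo required) and shows that calling the method on the type with the instance as first argument is equivalent to calling it on the instance.

```python
word = 'foo'

# Method looked up on the instance: self is supplied implicitly.
bound_result = word.upper()

# Method looked up on the type: the instance is passed explicitly as the first argument.
unbound_result = str.upper(word)

# This is why str.upper can be handed around like any other one-argument callable.
mapped = list(map(str.upper, ['foo', 'bar', 'baz']))
```

Both calls produce the same string, so `str.upper` can stand in for the `uppercase` wrapper wherever a one-argument callable is expected.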
|
||||
|
||||
The type annotations written here are not used, but can make your code much more readable, and may very well be used as
|
||||
validators in the future.
|
||||
|
||||
Let's write two more transformations: a generator to produce the data to be transformed, and something that outputs it,
|
||||
because, yeah, feedback is cool.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    def generate_data():
        yield 'foo'
        yield 'bar'
        yield 'baz'

    def output(x: str):
        print(x)
|
||||
|
||||
Once again, you could have skipped the pain of writing this and simply used an iterable to generate the data and the
|
||||
builtin :func:`print` for the output, but we'll stick to writing our own transformations for now.
|
||||
|
||||
Let's chain the three transformations together and run the transformation graph:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
    import bonobo

    graph = bonobo.Graph(generate_data, uppercase, output)

    if __name__ == '__main__':
        bonobo.run(graph)
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
stylesheet = "../_static/graphs.css";
|
||||
|
||||
BEGIN [shape="point"];
|
||||
BEGIN -> "generate_data" -> "uppercase" -> "output";
|
||||
}
|
||||
|
||||
We use the :func:`bonobo.run` helper that hides the underlying object composition necessary to actually run the
|
||||
transformations in parallel, because it's simpler.
|
||||
|
||||
Depending on what you're doing, you may use the shorthand helper or the verbose form. Always favor the shorter one
if you don't need to tune the graph or the execution strategy (see below).
|
||||
|
||||
Takeaways
|
||||
:::::::::
|
||||
|
||||
① The :class:`bonobo.Graph` class is used to represent a data-processing pipeline.
|
||||
|
||||
It can represent simple list-like linear graphs, like here, but it can also represent much more complex graphs, with
|
||||
branches and cycles.
|
||||
|
||||
This is what the graph we defined looks like:
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
BEGIN [shape="point"];
|
||||
BEGIN -> "iter(['foo', 'bar', 'baz'])" -> "str.upper" -> "print";
|
||||
}
|
||||
|
||||
|
||||
② `Transformations` are simple python callables. Whatever can be called can be used as a `transformation`. Callables can
either `return` or `yield` data to send it to the next step. Regular functions (using `return`) should be preferred if
each call is guaranteed to return exactly one result, while generators (using `yield`) should be preferred if the
number of output lines for a given input varies.
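
The difference between the two styles can be sketched in plain Python. The helper `outputs_of` below is hypothetical and only serves to collect what one call would send downstream; it is not part of Bonobo.

```python
import inspect


def uppercase(line):
    # Exactly one output per input: a regular function using return.
    return line.upper()


def split_words(line):
    # Zero, one or many outputs per input: a generator using yield.
    for word in line.split():
        yield word


def outputs_of(transformation, value):
    """Hypothetical helper: collect the outputs of a single call as a list."""
    result = transformation(value)
    if inspect.isgenerator(result):
        return list(result)
    return [result]


one = outputs_of(uppercase, 'foo bar')
many = outputs_of(split_words, 'foo bar')
none = outputs_of(split_words, '')
```

Here `uppercase` always produces a single value, while `split_words` produces as many values as there are words, including none at all for an empty line.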
|
||||
|
||||
③ The `Graph` instance, or `transformation graph` is then executed using an `ExecutionStrategy`. You did not use it
|
||||
directly in this tutorial, but :func:`bonobo.run` created an instance of :class:`bonobo.ThreadPoolExecutorStrategy`
|
||||
under the hood (which is the default strategy). Actual behavior of an execution will depend on the strategy chosen, but
|
||||
the default should be fine in most of the basic cases.
|
||||
|
||||
④ Before actually executing the `transformations`, the `ExecutionStrategy` instance will wrap each component in an
`execution context`, whose responsibility is to hold the state of the transformation. This keeps the
`transformations` stateless, while still allowing external state to be added if required. We'll expand on this later.
|
||||
|
||||
Concepts and definitions
|
||||
::::::::::::::::::::::::
|
||||
|
||||
* Transformation: a callable that takes input (as call parameters) and returns output(s), either as its return value or
  by yielding values (a.k.a. returning a generator).
* Transformation graph (or Graph): a set of transformations tied together in a :class:`bonobo.Graph` instance, which is a simple
  directed acyclic graph (sometimes referred to as a DAG).
* Node: a transformation within the context of a transformation graph. The node defines what to do with a
  transformation's output, and especially which other node to feed with the output.
* Execution strategy (or strategy): a way to run a transformation graph. Its responsibility is mainly to parallelize
  (or not) the transformations, on one or more processes and/or computers, and to set up the right queuing mechanism for
  transformations' inputs and outputs.
* Execution context (or context): a wrapper around a node that holds the state for it. If the node needs the state, there
  are tools available in bonobo to feed it to the transformation using additional call parameters, so that every
  transformation stays atomic.
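
The idea of an execution context holding state on behalf of a transformation can be sketched as follows. The `Context` class and the `number_lines` function are hypothetical names chosen for illustration, and this is not how Bonobo implements its contexts; the sketch only shows state being passed in as an additional call parameter.

```python
def number_lines(state, line):
    # The transformation keeps nothing itself: the counter lives in the state it is given.
    state['count'] += 1
    return '{}: {}'.format(state['count'], line)


class Context:
    """Hypothetical wrapper around one node that owns the node's state."""

    def __init__(self, transformation, **initial_state):
        self.transformation = transformation
        self.state = dict(initial_state)

    def __call__(self, *args):
        # Feed the state to the transformation as an extra first argument.
        return self.transformation(self.state, *args)


context = Context(number_lines, count=0)
numbered = [context(line) for line in ('foo', 'bar', 'baz')]
```

Because the state lives in the wrapper, two contexts built around the same function count independently of each other.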
|
||||
|
||||
Next
|
||||
::::
|
||||
|
||||
You now know all the basic concepts necessary to build (batch-like) data processors.
|
||||
|
||||
If you're confident with this part, let's get to a more real world example, using files and nice console output:
|
||||
:doc:`basics2`
|
||||
|
||||
63
docs/tutorial/tut02.rst
Normal file
@ -0,0 +1,63 @@
|
||||
Working with files
|
||||
==================
|
||||
|
||||
Bonobo would not be of any use if the aim was to uppercase small lists of strings. In fact, Bonobo should not be used
|
||||
if you don't expect any gain from parallelization/distribution of tasks.
|
||||
|
||||
Let's take the following graph as an example:
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
BEGIN [shape="point"];
|
||||
BEGIN -> "A" -> "B" -> "C";
|
||||
}
|
||||
|
||||
The execution strategy does a bit of behind-the-scenes work, wrapping every component in a thread (assuming you're using
the :class:`bonobo.ThreadPoolExecutorStrategy`), which allows `B` to start running as soon as `A` has yielded its first line
of data, and `C` as soon as `B` has yielded its first line of data, even if `A` or `B` still have data to yield.
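
The mechanism described above can be approximated with the standard library alone. The sketch below is an assumption about the general technique (one thread per stage, connected by queues), not Bonobo's implementation; `run_stage`, `run_pipeline` and the `END` sentinel are hypothetical names.

```python
import queue
import threading

END = object()  # sentinel marking the end of a stream


def run_stage(func, inbox, outbox):
    """Apply func to each item read from inbox and push the result to outbox."""
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)  # propagate the end-of-stream marker downstream
            return
        outbox.put(func(item))


def run_pipeline(source, *stages):
    """Run every stage in its own thread; a stage starts working as soon as its first input arrives."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for thread in threads:
        thread.start()

    # Feed the first queue; downstream stages consume while we are still producing.
    for item in source:
        queues[0].put(item)
    queues[0].put(END)

    results = []
    while True:
        item = queues[-1].get()
        if item is END:
            break
        results.append(item)

    for thread in threads:
        thread.join()
    return results


results = run_pipeline(['foo', 'bar', 'baz'], str.upper, lambda s: s + '!')
```

Each stage blocks only on its own input queue, so a slow stage delays the items behind it without preventing the other stages from processing what they have already received.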
|
||||
|
||||
The great thing is that you generally don't have to think about it. Just be aware that your components will be run in
|
||||
parallel (with the default strategy), and don't worry too much about blocking components, as they won't block their
|
||||
siblings when run in bonobo.
|
||||
|
||||
That being said, let's try to write a more real-world transformation.
|
||||
|
||||
Reading a file
|
||||
::::::::::::::
|
||||
|
||||
There are a few component builders available in **Bonobo** that let you read files. You should at least know about the
|
||||
following:
|
||||
|
||||
* :class:`bonobo.io.FileReader`
|
||||
* :class:`bonobo.io.JsonReader`
|
||||
* :class:`bonobo.io.CsvReader`
|
||||
|
||||
Reading a file is as simple as using one of those, and for the example, we'll use a text file that was generated using
|
||||
Bonobo from the "liste-des-cafes-a-un-euro" dataset made available by Mairie de Paris under the Open Database
|
||||
License (ODbL). You can `explore the original dataset <https://opendata.paris.fr/explore/dataset/liste-des-cafes-a-un-euro/information/>`_.
|
||||
You'll need the example dataset, available in **Bonobo**'s repository.
|
||||
|
||||
.. literalinclude:: ../../examples/tut02_01_read.py
|
||||
:language: python
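
For intuition, a line-oriented file reader can be pictured as a generator that yields one value per line of the file. The sketch below is plain Python and not the implementation of :class:`bonobo.io.FileReader`; `read_lines` is a hypothetical name, and the sample file is created on the fly so the example is self-contained.

```python
import os
import tempfile


def read_lines(path, encoding='utf-8'):
    """Yield each line of a text file, without its trailing newline."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip('\n')


# Write a small sample file, then read it back line by line.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.txt')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('first\nsecond\n')
    lines = list(read_lines(path))
```

Since the reader is a generator, the next step of a chain can start working on the first line before the whole file has been read.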
|
||||
|
||||
Until now, we ran the file directly using our Python interpreter, but there are other options, one of them being
`bonobo run`. This command runs a graph defined in a Python file, and replaces the :func:`bonobo.run`
helper. That's exactly why we call :func:`bonobo.run` in the `if __name__ == '__main__'` block: the helper is only
invoked when the file is run directly.
|
||||
|
||||
Using the bonobo command line has a few advantages. It will look for one and only one :class:`bonobo.Graph` instance defined
in the file given as argument, configure an execution strategy and, optionally, plugins, then execute the graph. It also lets you
tune the "artifacts" surrounding the transformation graph from the command line (verbosity, plugins ...), and
it will ease the transition to running transformation graphs in containers, as the syntax will be the same. Of course,
it is not required, and the containerization capabilities are provided by an optional and separate python package.
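
The "one and only one" lookup can be pictured as scanning a module's namespace for instances of a given class. This is a sketch under assumptions: `find_single_instance` and the stand-in `Graph` class are hypothetical, and the real command may work differently.

```python
class Graph:
    """Stand-in for bonobo.Graph, so that the sketch runs without Bonobo installed."""


def find_single_instance(namespace, cls):
    """Return the only instance of cls found in namespace, or raise ValueError."""
    found = [value for value in namespace.values() if isinstance(value, cls)]
    if len(found) != 1:
        raise ValueError(
            'expected exactly one {} instance, found {}'.format(cls.__name__, len(found))
        )
    return found[0]


# A module namespace is just a dict of names to objects, like vars(module).
module_namespace = {'graph': Graph(), 'answer': 42}
graph = find_single_instance(module_namespace, Graph)
```

Raising when zero or several graphs are found keeps the behaviour predictable: the file given on the command line has to define exactly one graph.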
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo run examples/tut02_01_read.py
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||