{% trans %}
- Bonobo is an Extract Transform Load framework for the Python (3.5+) language.
+ Bonobo is an Extract Transform Load (or ETL) framework for the Python (3.5+) language.
{% endtrans %}
diff --git a/docs/conf.py b/docs/conf.py
index e164f8d..52fb506 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -75,9 +75,9 @@ html_theme = 'alabaster'
html_theme_options = {
'github_user': 'python-bonobo',
'github_repo': 'bonobo',
- 'github_button': True,
- 'show_powered_by': False,
- 'show_related': True,
+ 'github_button': 'true',
+ 'show_powered_by': 'false',
+ 'show_related': 'true',
}
html_sidebars = {
diff --git a/docs/genindex.rst b/docs/genindex.rst
new file mode 100644
index 0000000..de2955c
--- /dev/null
+++ b/docs/genindex.rst
@@ -0,0 +1,3 @@
+Full Index
+==========
+
diff --git a/docs/guide/graphs.rst b/docs/guide/graphs.rst
index 14af705..1c753a1 100644
--- a/docs/guide/graphs.rst
+++ b/docs/guide/graphs.rst
@@ -1,11 +1,211 @@
Graphs
======
-Writing graphs
-::::::::::::::
+Graphs are the glue that ties transformations together. They are the only data structure bonobo can execute directly.
+Graphs must be acyclic, and can contain as many nodes as your system can handle. Although this number can be rather
+high in theory, practical cases rarely exceed a few hundred nodes (and that is already extreme).
-Debugging graphs
+
+Definitions
+:::::::::::
+
+Graph
+
+ A directed acyclic graph of transformations, that Bonobo can inspect and execute.
+
+Node
+
+ A transformation within a graph. The transformations are stateless, and have no idea whether or not they are
+ included in a graph, multiple graphs, or not at all.
+
+
+Creating a graph
::::::::::::::::
+Graphs should be instances of :class:`bonobo.Graph`. The :func:`bonobo.Graph.add_chain` method can take as many
+positional parameters as you want.
+
+.. code-block:: python
+
+ import bonobo
+
+ graph = bonobo.Graph()
+ graph.add_chain(a, b, c)
+
+Resulting graph:
+
+.. graphviz::
+
+ digraph {
+ rankdir = LR;
+ stylesheet = "../_static/graphs.css";
+
+ BEGIN [shape="point"];
+ BEGIN -> "a" -> "b" -> "c";
+ }
+
+Non-linear graphs
+:::::::::::::::::
+
+Divergences / forks
+-------------------
+
+To create two or more divergent data streams ("fork"), specify the `_input` kwarg to `add_chain`.
+
+.. code-block:: python
+
+ import bonobo
+
+ graph = bonobo.Graph()
+ graph.add_chain(a, b, c)
+ graph.add_chain(f, g, _input=b)
+
+
+Resulting graph:
+
+.. graphviz::
+
+ digraph {
+ rankdir = LR;
+ stylesheet = "../_static/graphs.css";
+
+ BEGIN [shape="point"];
+ BEGIN -> "a" -> "b" -> "c";
+ "b" -> "f" -> "g";
+ }
+
+.. note:: Both branches will receive the same data, at the same time.
+
+Convergences / merges
+---------------------
+
+To merge two data streams ("merge"), you can use the `_output` kwarg to `add_chain`, or use named nodes (see below).
+
+
+.. code-block:: python
+
+ import bonobo
+
+ graph = bonobo.Graph()
+
+ # Here we mark _input to None, so normalize won't get the "begin" impulsion.
+ graph.add_chain(normalize, store, _input=None)
+
+ # Add two different chains
+ graph.add_chain(a, b, _output=normalize)
+ graph.add_chain(f, g, _output=normalize)
+
+
+Resulting graph:
+
+.. graphviz::
+
+ digraph {
+ rankdir = LR;
+ stylesheet = "../_static/graphs.css";
+
+ BEGIN [shape="point"];
+ BEGIN -> "a" -> "b" -> "normalize";
+
+ BEGIN2 [shape="point"];
+ BEGIN2 -> "f" -> "g" -> "normalize";
+
+ "normalize" -> "store"
+ }
+
+.. note::
+
+ This is not a "join" or "cartesian product". Any data that comes from `b` or `g` will go through `normalize`, one at
+ a time. Think of the graph edges as data flow pipes.
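
The difference between a merge and a join can be illustrated without Bonobo at all. The sketch below is plain Python, not Bonobo's API: `rows_from_b`, `rows_from_g` and this `normalize` are hypothetical stand-ins, and `itertools.chain` is a simplification, since in a real execution rows from both branches may arrive interleaved.

```python
from itertools import chain, product

# Hypothetical outputs of the two upstream branches (`b` and `g`).
rows_from_b = [" Alice ", "BOB"]
rows_from_g = ["Carol "]


def normalize(row):
    """Stand-in transformation: handles exactly one row per call."""
    return row.strip().lower()


# Merge: every row from either branch goes through `normalize` on its own.
merged = [normalize(row) for row in chain(rows_from_b, rows_from_g)]

# A join / cartesian product would instead pair rows across branches,
# which is NOT what converging edges do.
joined = list(product(rows_from_b, rows_from_g))

print(merged)
print(joined)
```

The merged list has one entry per incoming row (three here), whereas the product has one entry per pair of rows (two here).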
+
+
+Named nodes
+:::::::::::
+
+Using the above code to create convergences can lead to hard-to-read code, because you have to define the "target"
+stream before the streams that logically come first in the transformation graph. To overcome that, you can use
+"named" nodes:
+
+ graph.add_chain(x, y, z, _name='zed')
+ graph.add_chain(f, g, h, _input='zed')
+
+.. code-block:: python
+
+ import bonobo
+
+ graph = bonobo.Graph()
+
+ # Add two different chains
+ graph.add_chain(a, b, _output="load")
+ graph.add_chain(f, g, _output="load")
+
+ # Here we mark _input to None, so normalize won't get the "begin" impulsion.
+ graph.add_chain(normalize, store, _input=None, _name="load")
+
+
+Resulting graph:
+
+.. graphviz::
+
+ digraph {
+ rankdir = LR;
+ stylesheet = "../_static/graphs.css";
+
+ BEGIN [shape="point"];
+ BEGIN -> "a" -> "b" -> "normalize (load)";
+
+ BEGIN2 [shape="point"];
+ BEGIN2 -> "f" -> "g" -> "normalize (load)";
+
+ "normalize (load)" -> "store"
+ }
+
+
+Inspecting graphs
+:::::::::::::::::
+
+Bonobo is bundled with an "inspector" that can use graphviz to let you visualize your graphs.
+
+Read `How to inspect and visualize your graph `_.
+
+
Executing graphs
::::::::::::::::
+
+There are two ways to execute a graph. They give similar results, but target different use cases.
+
+* You can use the bonobo command line interface, which is the highest level interface.
+* You can use the python API, which is lower level but allows you to use bonobo from within your own code (for example, a
+ django management command).
+
+Executing a graph with the command line interface
+-------------------------------------------------
+
+If there is no good reason not to, you should use `bonobo run ...` to run transformation graphs found in your python
+source code files.
+
+.. code-block:: shell-session
+
+ $ bonobo run file.py
+
+You can also run a python module:
+
+.. code-block:: shell-session
+
+ $ bonobo run -m my.own.etlmod
+
+In each case, bonobo's CLI will look for an instance of :class:`bonobo.Graph` in your file/module, create the plumbing
+needed to execute it, and run it.
+
+If you're in an interactive terminal context, it will use :class:`bonobo.ext.console.ConsoleOutputPlugin` for display.
+
+If you're in a jupyter notebook context, it will (try to) use :class:`bonobo.ext.jupyter.JupyterOutputPlugin`.
+
+Executing a graph using the internal API
+----------------------------------------
+
+To integrate bonobo executions in any other python code, you should use :func:`bonobo.run`. It behaves very similarly to
+the CLI, and reading the source you should be able to figure out its usage quite easily.
+
+
+
diff --git a/docs/guide/index.rst b/docs/guide/index.rst
index 76e426a..360ed61 100644
--- a/docs/guide/index.rst
+++ b/docs/guide/index.rst
@@ -1,13 +1,14 @@
Guides
======
-Here are a few guides and best practices to work with bonobo.
+This section will guide you through your journey with Bonobo ETL.
.. toctree::
:maxdepth: 2
- graphs
+ introduction
transformations
+ graphs
services
environment
purity
diff --git a/docs/guide/introduction.rst b/docs/guide/introduction.rst
new file mode 100644
index 0000000..7d89253
--- /dev/null
+++ b/docs/guide/introduction.rst
@@ -0,0 +1,106 @@
+Introduction
+============
+
+Before deciding whether to use Bonobo, you need to understand what it does and what it does not do, so you can
+judge whether it is a good fit for your use cases.
+
+How does it work?
+:::::::::::::::::
+
+**Bonobo** is an **Extract Transform Load** framework aimed at coders, hackers, or any other person who's at ease with
+terminals and source code files.
+
+It is a **data streaming** solution that treats datasets as ordered collections of independent rows, allowing you to
+process them "first in, first out" using a set of transformations organized together in a directed graph.
+
+Let's take a few examples:
+
+.. graphviz::
+
+ digraph {
+ rankdir = LR;
+ stylesheet = "../_static/graphs.css";
+
+ BEGIN [shape="point"];
+ END [shape="none" label="..."];
+ BEGIN -> "A" -> "B" -> "C" -> "END";
+ }
+
+One of the simplest, by-the-book cases is an extractor sending to a transformation, itself sending to a loader.
+
+Bonobo will send an "impulsion" to all transformations linked to the little black dot on the left, here `A`.
+`A`'s job is to extract data from somewhere (a file, an endpoint, a database...) and generate some output.
+As soon as the first row of `A`'s output is available, Bonobo will start asking `B` to process it. As soon as the first
+row of `B`'s output is available, Bonobo will start asking `C` to process it.
+
+While `B` and `C` are processing, `A` continues to generate data.
+
+This approach can be efficient, depending on your requirements, because you may rely on a lot of services that may be
+slow to answer or unreliable, and you don't have to handle optimizations, parallelism or retry logic by yourself.
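
The row-by-row flow described above can be approximated with plain Python generators. This is only a sketch under simplifying assumptions: it is single-threaded, whereas Bonobo runs nodes concurrently, and the `extract`, `transform`, `load` and `events` names are hypothetical, not part of Bonobo.

```python
events = []


def extract():
    """Stand-in for `A`: produces rows one at a time."""
    for value in range(3):
        events.append(f"A emits {value}")
        yield value


def transform(rows):
    """Stand-in for `B`: handles each row as soon as it is available."""
    for row in rows:
        events.append(f"B handles {row}")
        yield row * 10


def load(rows):
    """Stand-in for `C`: consumes rows and keeps what it 'stored'."""
    stored = []
    for row in rows:
        events.append(f"C stores {row}")
        stored.append(row)
    return stored


stored = load(transform(extract()))

for event in events:
    print(event)
```

The event log shows that the first row travels through all three steps before `A` produces its second row: nothing waits for the whole dataset to be extracted.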
+
+.. graphviz::
+
+ digraph {
+ rankdir = LR;
+ stylesheet = "../_static/graphs.css";
+
+ BEGIN [shape="point"];
+ END [shape="none" label="..."];
+ END2 [shape="none" label="..."];
+ BEGIN -> "A" -> "B" -> "END";
+ "A" -> "C" -> "END2";
+ }
+
+In this case, any output row of `A` will be **sent to both** `B` and `C` simultaneously. Again, `A` will continue its
+processing while `B` and `C` are working.
+
+
+.. graphviz::
+
+ digraph {
+ rankdir = LR;
+ stylesheet = "../_static/graphs.css";
+
+ BEGIN [shape="point"];
+ BEGIN2 [shape="point"];
+ END [shape="none" label="..."];
+ BEGIN -> "A" -> "C" -> "END";
+ BEGIN2 -> "B" -> "C";
+ }
+
+
+What is it not?
+:::::::::::::::
+
+**Bonobo** is not:
+
+* A data science or statistical analysis tool, which needs to treat the dataset as a whole and not as a collection of
+ independent rows. If this is your need, you probably want to look at `pandas `_.
+
+* A workflow or scheduling solution for independent data-engineering tasks. If you're looking to manage your sets of
+ data processing tasks as a whole, you probably want to look at `airflow `_.
+ Although there is no Bonobo extension yet that handles that, it does make sense to integrate Bonobo jobs in an airflow
+ (or other similar tool) workflow.
+
+* A big data solution, `as defined by wikipedia `_. We're aiming at "small
+ scale" data processing, which can still be quite huge for humans, but not for computers. If you don't know whether or
+ not this is sufficient for your needs, it probably means you're not in the "big data" land.
+
+
+Where to jump next?
+:::::::::::::::::::
+
+If you have not gone through it yet, we highly suggest that you go through the :doc:`tutorial ` first.
+
+Then, you can jump to the following guides, in no particular order:
+
+.. toctree::
+ :maxdepth: 1
+
+ transformations
+ graphs
+ services
+ environment
+ purity
+
+
diff --git a/docs/guide/services.rst b/docs/guide/services.rst
index 4e1a22c..8c4ec69 100644
--- a/docs/guide/services.rst
+++ b/docs/guide/services.rst
@@ -1,14 +1,10 @@
Services and dependencies
=========================
-:Last-Modified: 20 may 2017
+You'll want to use external systems within your transformations, including databases, HTTP APIs, other web services,
+filesystems, etc.
-You'll probably want to use external systems within your transformations. Those systems may include databases, apis
-(using http, for example), filesystems, etc.
-
-You can start by hardcoding those services. That does the job, at first.
-
-If you're going a little further than that, you'll feel limited, for a few reasons:
+Hardcoding those services is a good first step, but it will show its limits rather quickly as your codebase grows.
* Hardcoded and tightly linked dependencies make your transformations hard to test, and hard to reuse.
* Processing data on your laptop is great, but being able to do it on different target systems (or stages), in different
@@ -16,70 +12,77 @@ If you're going a little further than that, you'll feel limited, for a few reaso
pre-production environment, or production system. Maybe you have similar systems for different clients and want to select
the system at runtime. Etc.
-Service injection
-:::::::::::::::::
+Definition of service dependencies
+::::::::::::::::::::::::::::::::::
-To solve this problem, we introduce a light dependency injection system. It allows to define named dependencies in
+To solve this problem, we introduce a light dependency injection system. It lets you define **named dependencies** in
your transformations, and provide an implementation at runtime.
-Class-based transformations
----------------------------
+For function-based transformations, you can use the :func:`bonobo.config.use` decorator to mark the dependencies. You'll
+still be able to call the function manually, providing the implementation yourself, but in a bonobo execution context,
+it will be resolved and injected automatically, as long as you provided an implementation to the executor (more on that below).
-To define a service dependency in a class-based transformation, use :class:`bonobo.config.Service`, a special
-descriptor (and subclass of :class:`bonobo.config.Option`) that will hold the service names and act as a marker
-for runtime resolution of service instances.
+.. code-block:: python
-Let's define such a transformation:
+ from bonobo.config import use
+
+ @use('orders_database')
+ def select_all(database):
+ yield from database.query('SELECT * FROM foo;')
+
+For class-based transformations, you can use :class:`bonobo.config.Service`, a special descriptor (and subclass of
+:class:`bonobo.config.Option`) that will hold the service names and act as a marker for runtime resolution of service
+instances.
.. code-block:: python
from bonobo.config import Configurable, Service
class JoinDatabaseCategories(Configurable):
- database = Service('primary_sql_database')
+ database = Service('orders_database')
- def __call__(self, database, row):
+ def call(self, database, row):
return {
**row,
'category': database.get_category_name_for_sku(row['sku'])
}
-This piece of code tells bonobo that your transformation expect a service called "primary_sql_database", that will be
+Both pieces of code tell bonobo that your transformation expects a service called "orders_database", that will be
injected to your calls under the parameter name "database".
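
To make the idea of named dependencies concrete, here is a minimal, Bonobo-free sketch of the mechanism. It is not how Bonobo implements injection: `requires`, `call_with_services` and the dictionary used as a fake database are hypothetical names chosen for illustration.

```python
def requires(service_name, as_param):
    """Hypothetical marker: record that `as_param` needs `service_name`."""

    def decorator(func):
        needs = dict(getattr(func, "required_services", {}))
        needs[as_param] = service_name
        func.required_services = needs
        return func

    return decorator


def call_with_services(func, services, *args):
    """Resolve each declared dependency by name, then call `func`."""
    needs = getattr(func, "required_services", {})
    injected = {param: services[name] for param, name in needs.items()}
    return func(*args, **injected)


@requires("orders_database", as_param="database")
def add_category(row, database):
    return {**row, "category": database[row["sku"]]}


# Any dictionary-like object can act as the service container.
services = {"orders_database": {"A1": "books"}}
injected_result = call_with_services(add_category, services, {"sku": "A1"})

# The function stays callable by hand, with an explicit implementation.
manual_result = add_category({"sku": "A1"}, database={"A1": "toys"})

print(injected_result)
print(manual_result)
```

Because the transformation only knows the name of what it needs, swapping the implementation (for a test double, or a different environment) never requires touching the transformation itself.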
-Function-based transformations
-------------------------------
+Providing implementations at run-time
+-------------------------------------
-No implementation yet, but expect something similar to CBT API, maybe using a `@Service(...)` decorator. See
-`issue #70 `_.
-
-Provide implementation at run time
-----------------------------------
-
-Let's see how to execute it:
+Bonobo will expect you to provide a dictionary of all service implementations required by your graph.
.. code-block:: python
import bonobo
- graph = bonobo.graph(
- *before,
- JoinDatabaseCategories(),
- *after,
- )
+ graph = bonobo.Graph(...)
+
+ def get_services():
+ return {
+ 'orders_database': my_database_service,
+ }
if __name__ == '__main__':
- bonobo.run(
- graph,
- services={
- 'primary_sql_database': my_database_service,
- }
- )
-
-A dictionary, or dictionary-like, "services" named argument can be passed to the :func:`bonobo.run` helper. The
-"dictionary-like" part is the real keyword here. Bonobo is not a DIC library, and won't become one. So the implementation
-provided is pretty basic, and feature-less. But you can use much more evolved libraries instead of the provided
-stub, and as long as it works the same (a.k.a implements a dictionary-like interface), the system will use it.
+ bonobo.run(graph, services=get_services())
+
+
+.. note::
+
+ A dictionary, or dictionary-like, "services" named argument can be passed to the :func:`bonobo.run` API method.
+ The "dictionary-like" part is the real keyword here. Bonobo is not a DIC library, and won't become one. So the
+ implementation provided is pretty basic, and feature-less. But you can use much more evolved libraries instead of
+ the provided stub, and as long as it works the same (a.k.a implements a dictionary-like interface), the system will
+ use it.
+
+The command line interface will look for services in two different places:
+
+* A `get_services()` function present at the same level as your graph definition.
+* A `get_services()` function in a `_services.py` file in the same directory as your graph's file, allowing you to reuse
+ the same service implementations for more than one graph.
Solving concurrency problems
----------------------------
@@ -87,7 +90,7 @@ Solving concurrency problems
If a service cannot be used by more than one thread at a time, either because it's just not threadsafe, or because
it requires to carefully order the calls made (apis that includes nonces, or work on results returned by previous
calls are usually good candidates), you can use the :class:`bonobo.config.Exclusive` context processor to lock the
-use of a dependency for a time period.
+use of a dependency for the duration of the context manager (`with` statement).
.. code-block:: python
@@ -101,18 +104,10 @@ use of a dependency for a time period.
api.last_call()
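
The locking idea can be sketched with the standard library alone. This is an illustration of the concept, not the implementation of :class:`bonobo.config.Exclusive`: the `exclusive` helper, the `FakeApi` class and its `first_call` method are hypothetical.

```python
import threading
from contextlib import contextmanager

_locks = {}
_locks_guard = threading.Lock()


@contextmanager
def exclusive(service):
    """Hold a per-service lock for the duration of the `with` block."""
    with _locks_guard:
        lock = _locks.setdefault(id(service), threading.RLock())
    with lock:
        yield service


class FakeApi:
    """Hypothetical service whose calls must not interleave across threads."""

    def __init__(self):
        self.log = []

    def first_call(self, who):
        self.log.append((who, "first"))

    def last_call(self, who):
        self.log.append((who, "last"))


api = FakeApi()


def worker(who):
    # Both calls happen while the lock is held, so no other thread
    # can slip a call in between them.
    with exclusive(api) as locked_api:
        locked_api.first_call(who)
        locked_api.last_call(who)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Group the log two by two: each pair should belong to a single worker.
pairs = list(zip(api.log[0::2], api.log[1::2]))
```

Whatever order the threads run in, each worker's two calls end up adjacent in the log, which is the guarantee an ordered or nonce-based API needs.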
-Service configuration (to be decided and implemented)
-:::::::::::::::::::::::::::::::::::::::::::::::::::::
-
-* There should be a way to configure default service implementation for a python file, a directory, a project ...
-* There should be a way to override services when running a transformation.
-* There should be a way to use environment for service configuration.
-
Future and proposals
::::::::::::::::::::
-This is the first proposed implementation and it will evolve, but looks a lot like how we used bonobo ancestor in
-production.
+This is the first implementation, and it will evolve. Base concepts will stay, though.
May or may not happen, depending on discussions.
diff --git a/docs/guide/transformations.rst b/docs/guide/transformations.rst
index e0fc347..e108a44 100644
--- a/docs/guide/transformations.rst
+++ b/docs/guide/transformations.rst
@@ -1,8 +1,90 @@
Transformations
===============
-Here is some guidelines on how to write transformations, to avoid the convention-jungle that could happen without
-a few rules.
+Transformations are the smallest building blocks in Bonobo ETL.
+
+They are written using standard python callables (or iterables, if you're writing transformations that have no input,
+a.k.a extractors).
+
+Definitions
+:::::::::::
+
+Transformation
+
+ The base building block of Bonobo, anything you would insert in a graph as a node. Mostly, a callable or an iterable.
+
+Extractor
+
+ Special case transformation that uses no input. It will be called only once, and its purpose is to generate data,
+ either by itself or by requesting it from an external service.
+
+Loader
+
+ Special case transformation that feeds an external service with data. For convenience, it can also yield the data but
+ a "pure" loader would have no output (although yielding things should have no bad side effect).
+
+Callable
+
+ Anything one can call, in python. Can be a function, a python builtin, or anything that implements `__call__`.
+
+Iterable
+
+ Something we can iterate on, in python, so basically anything you'd be able to use in a `for` loop.
+
+
+Function based transformations
+::::::::::::::::::::::::::::::
+
+The most basic transformations are function-based: you define a function, and it is used directly
+in a graph.
+
+.. code-block:: python
+
+ def get_representation(row):
+ return repr(row)
+
+ graph = bonobo.Graph(
+ [...],
+ get_representation,
+ [...],
+ )
+
+
+It does not allow any configuration, but if it's an option, prefer it as it's simpler to write.
+
+
+Class based transformations
+:::::::::::::::::::::::::::
+
+For less basic use cases, you'll want to use classes to define some of your transformations. It's also a better choice
+to build reusable blocks, as you'll be able to create parametrizable transformations that the end user will be able to
+configure at the last minute.
+
+
+Configurable
+------------
+
+.. autoclass:: bonobo.config.Configurable
+
+Options
+-------
+
+.. autoclass:: bonobo.config.Option
+
+Services
+--------
+
+.. autoclass:: bonobo.config.Service
+
+Methods
+-------
+
+.. autoclass:: bonobo.config.Method
+
+ContextProcessors
+-----------------
+
+.. autoclass:: bonobo.config.ContextProcessor
Naming conventions
@@ -44,50 +126,35 @@ can be used as a graph node, then use camelcase names:
upper = Apply(str.upper)
-Function based transformations
-::::::::::::::::::::::::::::::
+Testing
+:::::::
+
+As Bonobo uses plain old python objects as transformations, it's very easy to unit test your transformations using your
+favourite testing framework. We're using pytest internally for Bonobo, but it's up to you to use the one you prefer.
+
+If you want to test a transformation with the surrounding context provided (for example, service instances injected, and
+context processors applied), you can use :class:`bonobo.execution.NodeExecutionContext` as a context manager and have
+bonobo send the data to your transformation.
-The most basic transformations are function-based. Which means that you define a function, and it will be used directly
-in a graph.
.. code-block:: python
- def get_representation(row):
- return repr(row)
+ from bonobo.constants import BEGIN, END
+ from bonobo.execution import NodeExecutionContext
- graph = bonobo.Graph(
- [...],
- get_representation,
- )
+ with NodeExecutionContext(
+ JsonWriter(filename), services={'fs': ...}
+ ) as context:
+ # Write a list of rows, including BEGIN/END control messages.
+ context.write(
+ BEGIN,
+ Bag({'foo': 'bar'}),
+ Bag({'foo': 'baz'}),
+ END
+ )
-It does not allow any configuration, but if it's an option, prefer it as it's simpler to write.
-
-
-Class based transformations
-:::::::::::::::::::::::::::
-
-A lot of logic is a bit more complex, and you'll want to use classes to define some of your transformations.
-
-The :class:`bonobo.config.Configurable` class gives you a few toys to write configurable transformations.
-
-Options
--------
-
-.. autoclass:: bonobo.config.Option
-
-Services
---------
-
-.. autoclass:: bonobo.config.Service
-
-Methods
--------
-
-.. autoclass:: bonobo.config.Method
-
-ContextProcessors
------------------
-
-.. autoclass:: bonobo.config.ContextProcessor
+ # Out of the bonobo main loop, we need to call `step` explicitly.
+ context.step()
+ context.step()
diff --git a/docs/index.rst b/docs/index.rst
index 1d6b708..b747669 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -11,6 +11,10 @@ Bonobo
reference/index
faq
contribute/index
+
+
+.. toctree::
+ :hidden:
+
genindex
modindex
-