Working on the new version of the tutorial. Only Step1 implemented.

Romain Dorgueil
2017-11-05 19:41:27 +01:00
parent eb393331cd
commit 8f3c4252b4
13 changed files with 586 additions and 43 deletions


@ -10,16 +10,33 @@ __all__ = []
def register_api(x, __all__=__all__):
"""Register a function as being part of Bonobo's API, then returns the original function."""
__all__.append(get_name(x))
return x
def register_graph_api(x, __all__=__all__):
"""
Register a function as being part of Bonobo's API, after checking that its signature contains the right parameters
to work correctly, then returns the original function.
"""
from inspect import signature
parameters = list(signature(x).parameters)
required_parameters = {'plugins', 'services', 'strategy'}
assert parameters[0] == 'graph', 'First parameter of a graph api function must be "graph".'
assert required_parameters.intersection(
parameters) == required_parameters, 'Graph api functions must define the following parameters: ' + ', '.join(
sorted(required_parameters))
return register_api(x, __all__=__all__)
def register_api_group(*args):
for attr in args:
register_api(attr)
@register_api
@register_graph_api
def run(graph, *, plugins=None, services=None, strategy=None):
"""
Main entry point of bonobo. It takes a graph and creates all the necessary plumbery around to execute it.
@ -82,8 +99,8 @@ def _inspect_as_graph(graph):
_inspect_formats = {'graph': _inspect_as_graph}
@register_api
def inspect(graph, *, format):
@register_graph_api
def inspect(graph, *, plugins=None, services=None, strategy=None, format):
if not format in _inspect_formats:
raise NotImplementedError(
'Output format {} not implemented. Choices are: {}.'.format(


@ -1,3 +1,19 @@
svg {
border: 2px solid green
}
}
div.related {
width: 940px;
margin: 30px auto 0 auto;
}
@media screen and (max-width: 875px) {
div.related {
visibility: hidden;
display: none;
}
}
.brand {
font-family: 'Ubuntu', 'goudy old style', 'minion pro', 'bell mt', Georgia, 'Hiragino Mincho Pro', serif;
}


@ -4,17 +4,8 @@
{%- block extrahead %}
{{ super() }}
<style>
div.related {
width: 940px;
margin: 30px auto 0 auto;
}
@media screen and (max-width: 875px) {
div.related {
visibility: hidden;
display: none;
}
}
</style>
<link href="https://fonts.googleapis.com/css?family=Ubuntu" rel="stylesheet">
{% endblock %}
{%- block footer %}


@ -186,3 +186,12 @@ epub_exclude_files = ['search.html']
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}
rst_epilog = """
.. |bonobo| replace:: **Bonobo**
.. |longversion| replace:: v.{version}
""".format(
version = version,
)

258
docs/tutorial/1-init.rst Normal file

@ -0,0 +1,258 @@
Part 1: Let's get started!
==========================
To get started with |bonobo|, you need to install it in a working python 3.5+ environment (you should use a
`virtualenv <https://virtualenv.pypa.io/>`_).
.. code-block:: shell-session
$ pip install bonobo
Check that the installation worked, and that you're using a version that matches this tutorial (written for bonobo
|longversion|).
.. code-block:: shell-session
$ bonobo version
See :doc:`/install` for more options.
Create an ETL job
:::::::::::::::::
Since Bonobo 0.6, it's easy to bootstrap a simple ETL job using just one file.
We'll start here, and the later stages of the tutorial will guide you toward refactoring this to a python package.
.. code-block:: shell-session
$ bonobo init tutorial.py
This will create a simple job in a `tutorial.py` file. Let's run it:
.. code-block:: shell-session
$ python tutorial.py
Hello
World
- extract in=1 out=2 [done]
- transform in=2 out=2 [done]
- load in=2 [done]
If you have a similar result, then congratulations! You just ran your first |bonobo| ETL job.
Inspect your graph
::::::::::::::::::
The basic building blocks of |bonobo| are **transformations** and **graphs**.
**Transformations** are simple python callables (like functions) that handle a transformation step for a line of data.
**Graphs** are a set of transformations, with directional links between them to define the data-flow that will happen
at runtime.
To inspect the graph of your first transformation (you must install graphviz first to do so), run:
.. code-block:: shell-session
$ bonobo inspect --graph tutorial.py | dot -Tpng -o tutorial.png
Open the generated `tutorial.png` file to have a quick look at the graph.
.. graphviz::
digraph {
rankdir = LR;
"BEGIN" [shape="point"];
"BEGIN" -> {0 [label="extract"]};
{0 [label="extract"]} -> {1 [label="transform"]};
{1 [label="transform"]} -> {2 [label="load"]};
}
The picture makes the structure of your graph easy to understand. For such a simple graph, it's pretty much useless,
but as you write more complex transformations, it will be helpful.
Read the Code
:::::::::::::
Before we write our own job, let's look at the code we have in `tutorial.py`.
Import
------
.. code-block:: python
import bonobo
The highest level APIs of |bonobo| are all contained within the top level **bonobo** namespace.
If you're a beginner with the library, stick to using only those APIs (they also are the most stable APIs).
If you're an advanced user (and you'll be one quite soon), you can safely use second level APIs.
The third level APIs are considered private, and you should not use them unless you're hacking on |bonobo| directly.
Extract
-------
.. code-block:: python
def extract():
yield 'hello'
yield 'world'
This is a first transformation, written as a python generator, that will send some strings, one after the other, to its
output.
Transformations that take no input and yield a variable number of outputs are usually called **extractors**. You'll
encounter a few different types, either purely generating the data (like here), using an external service (a
database, for example) or using some filesystem (which is considered an external service too).
Extractors do not need to have their input connected to anything, and will be called exactly once when the graph is
executed.
Transform
---------
.. code-block:: python
def transform(*args):
yield tuple(
map(str.title, args)
)
This is a second transformation. It will get called a bunch of times, once for each input row it gets, and apply some
logic on the input to generate the output.
This is the most **generic** case. For each input row, you can generate zero, one or many rows of output.
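As a minimal sketch of that generic case, the generator below yields zero, one or many output rows depending on the input row. The name `split_words` is hypothetical and is not part of the generated `tutorial.py`.

```python
def split_words(line):
    """Yield one output row per word found in the input row."""
    for word in line.split():
        yield word.title()


# A blank row produces no output, a single word produces one row, and a
# sentence produces several rows.
for row in ('', 'hello', 'hello big world'):
    print(repr(row), '->', list(split_words(row)))
```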
Load
----
.. code-block:: python
def load(*args):
print(*args)
This is the third and last transformation in our "hello world" example. It will apply some logic to each row, and have
absolutely no output.
Transformations that take input and yield nothing are also called **loaders**. Like extractors, you'll encounter
different types, to work with various external systems.
Please note that as a convenience, and because the cost is marginal, most builtin `loaders` will send their
inputs to their output, so you can easily chain more than one loader, or apply more transformations after a given
loader.
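The pass-through idea can be sketched in plain Python as follows. This is only an illustration under assumptions: the `written` list stands in for an external system, `load_all` is a hypothetical helper that drives the loader by hand, and none of it reflects how the builtin loaders are actually implemented.

```python
written = []  # stand-in for an external system such as a file or a database


def load(*args):
    """Record the row, then forward it unchanged to the next node."""
    written.append(args)
    yield args


def load_all(rows):
    """Drive the loader by hand, collecting whatever it forwards."""
    forwarded = []
    for row in rows:
        forwarded.extend(load(*row))
    return forwarded


forwarded = load_all([('Hello',), ('World',)])
print(written)    # rows that reached the "external system"
print(forwarded)  # the same rows, still available to a following node
```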
Graph Factory
-------------
.. code-block:: python
def get_graph(**options):
graph = bonobo.Graph()
graph.add_chain(extract, transform, load)
return graph
All our transformations were defined above, but nothing ties them together, for now.
This "graph factory" function is in charge of the creation and configuration of a :class:`bonobo.Graph` instance, that
will be executed later.
By no means is |bonobo| limited to simple graphs like this one. You can add as many chains as you want, and each chain
can contain as many nodes as you want.
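To build a mental model of what a single chain does with rows, here is a hedged, plain-Python sketch that feeds each node's output rows to the next node, one after the other. The `collect_rows` helper is hypothetical; real execution goes through `bonobo.run` and its strategy, which this sequential loop does not try to reproduce.

```python
def extract():
    yield 'hello'
    yield 'world'


def transform(*args):
    yield tuple(map(str.title, args))


def collect_rows(*chain):
    """Feed every output row of one callable to the next one, in order."""
    rows = [()]  # a single empty row triggers the first callable exactly once
    for node in chain:
        next_rows = []
        for row in rows:
            # A node returning None (no output) contributes no rows.
            for out in node(*row) or ():
                next_rows.append(out if isinstance(out, tuple) else (out,))
        rows = next_rows
    return rows


print(collect_rows(extract, transform))
```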
Services Factory
----------------
.. code-block:: python
def get_services(**options):
return {}
This is the "services factory", that we'll use later to connect to external systems. Let's skip this one, for now.
(we'll dive into this topic in :doc:`4-services`)
Main Block
----------
.. code-block:: python
if __name__ == '__main__':
parser = bonobo.get_argument_parser()
with bonobo.parse_args(parser) as options:
bonobo.run(
get_graph(**options),
services=get_services(**options)
)
Here, the real thing happens.
Without diving into too much detail for now, using the :func:`bonobo.parse_args` context manager will allow our job to
be configurable later. Although we don't really need it right now, it does no harm either.
Reading the output
::::::::::::::::::
Let's run this job once again:
.. code-block:: shell-session
$ python tutorial.py
Hello
World
- extract in=1 out=2 [done]
- transform in=2 out=2 [done]
- load in=2 [done]
The console output contains two things.
* First, it contains the real output of your job (what was :func:`print`-ed to `sys.stdout`).
* Second, it displays the execution status (on `sys.stderr`). Each line contains a "status" character, the node name,
numbers and a human readable status. This status evolves in real time, and lets you follow a job's progress
while it's running.
* Status character:
* “ ” means that the node was not yet started.
* “`-`” means that the node finished its execution.
* “`+`” means that the node is currently running.
* “`!`” means that the node had problems running.
* Numerical statistics:
* “`in=...`” shows the input lines count, also known as the number of calls to your transformation.
* “`out=...`” shows the output lines count.
* “`read=...`” shows the count of reads applied to an external system, if the transformation supports it.
* “`write=...`” shows the count of writes applied to an external system, if the transformation supports it.
* “`err=...`” shows the count of exceptions that happened while running the transformation. Note that an exception
will abort the current call, but the execution will move on to the next row.
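If you want to check these fields programmatically, a status line can be split with a small regular expression. The sketch below assumes the exact layout shown in the sample output above (status character, node name, `key=count` pairs, optional bracketed status); the real output may differ in spacing or include terminal colour codes, and `parse_status_line` is a hypothetical helper, not a Bonobo API.

```python
import re

# Assumes the exact layout shown above: status character, node name,
# space-separated "key=count" pairs, then an optional "[status]" suffix.
STATUS_LINE = re.compile(
    r'^(?P<char>[ +\-!]) (?P<name>\S+)(?P<stats>(?: \w+=\d+)*)(?: \[(?P<status>[^\]]+)\])?$'
)


def parse_status_line(line):
    """Split one status line into its character, node name, counters and status."""
    match = STATUS_LINE.match(line)
    if match is None:
        raise ValueError('Unrecognized status line: {!r}'.format(line))
    counters = {
        key: int(value)
        for key, value in re.findall(r'(\w+)=(\d+)', match.group('stats'))
    }
    return match.group('char'), match.group('name'), counters, match.group('status')


print(parse_status_line('- extract in=1 out=2 [done]'))
```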
Moving forward
::::::::::::::
That's all for this first step.
You now know:
* How to create a new job file.
* How to inspect the content of a job file.
* What should go in a job file.
* How to execute a job file.
* How to read the console output.
**Next: :doc:`2-jobs`**

12
docs/tutorial/2-jobs.rst Normal file

@ -0,0 +1,12 @@
Part 2: Writing ETL Jobs
========================
Moving forward
::::::::::::::
You now know:
* How to ...
**Next: :doc:`3-files`**

12
docs/tutorial/3-files.rst Normal file

@ -0,0 +1,12 @@
Part 3: Working with Files
==========================
Moving forward
::::::::::::::
You now know:
* How to ...
**Next: :doc:`4-services`**


@ -0,0 +1,210 @@
Part 4: Services and Configurables
==================================
.. note::
This section lacks completeness, sorry for that (but you can still read it!).
In the last section, we used a few new tools.
Class-based transformations and configurables
:::::::::::::::::::::::::::::::::::::::::::::
Bonobo is a bit dumb. If something is callable, it considers it can be used as a transformation, and it's up to the
user to provide callables that logically fit in a graph.
You can use plain python objects with a `__call__()` method, and it will just work.
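For instance, a plain class with a `__call__()` method is a valid callable. The `Prefix` class below is a hypothetical illustration that uses nothing from Bonobo:

```python
class Prefix:
    """A plain callable object: instances can be used wherever a function can."""

    def __init__(self, prefix='>>>'):
        self.prefix = prefix

    def __call__(self, row):
        return self.prefix + ' ' + row


prefixer = Prefix('$')
print(callable(prefixer))
print(prefixer('hello'))
```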
As a lot of transformations need common machinery, there are a few tools to quickly build transformations, most of
them requiring your class to subclass :class:`bonobo.config.Configurable`.
Configurables allow you to use the following features:
* You can add **Options** (using the :class:`bonobo.config.Option` descriptor). Options can be positional, or keyword
based, can have a default value and will be consumed from the constructor arguments.
.. code-block:: python
from bonobo.config import Configurable, Option
class PrefixIt(Configurable):
prefix = Option(str, positional=True, default='>>>')
def call(self, row):
return self.prefix + ' ' + row
prefixer = PrefixIt('$')
* You can add **Services** (using the :class:`bonobo.config.Service` descriptor). Services are a subclass of
:class:`bonobo.config.Option`, sharing the same basics, but specialized in the definition of "named services" that
will be resolved at runtime (i.e. for which we will provide an implementation at runtime). We'll dive more into that
in the next section.
.. code-block:: python
from bonobo.config import Configurable, Option, Service
class HttpGet(Configurable):
url = Option(default='https://jsonplaceholder.typicode.com/users')
http = Service('http.client')
def call(self, http):
resp = http.get(self.url)
for row in resp.json():
yield row
http_get = HttpGet()
* You can add **Methods** (using the :class:`bonobo.config.Method` descriptor). :class:`bonobo.config.Method` is a
subclass of :class:`bonobo.config.Option` that allows passing callable parameters, either to the class constructor,
or by using the class as a decorator.
.. code-block:: python
from bonobo.config import Configurable, Method
class Applier(Configurable):
apply = Method()
def call(self, row):
return self.apply(row)
@Applier
def Prefixer(self, row):
return 'Hello, ' + row
prefixer = Prefixer()
* You can add **ContextProcessors**, which are an advanced feature we won't introduce here. If you're familiar with
pytest, you can think of them as pytest fixtures, execution wise.
Services
::::::::
The motivation behind services is mostly separation of concerns, testability and deployability.
Usually, your transformations will depend on services (like a filesystem, an http client, a database, a rest api, ...).
Those services can very well be hardcoded in the transformations, but there are two main drawbacks:
* You won't be able to change the implementation depending on the current environment (development laptop versus
production servers, bug-hunting session versus execution, etc.)
* You won't be able to test your transformations without testing the associated services.
To overcome those caveats of hardcoding things, we define Services in the configurable, which are basically
string-options of the service names, and we provide an implementation at the last moment possible.
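The idea of resolving services by name as late as possible can be sketched without Bonobo at all. Everything below is an assumption made for illustration: `inject`, `fetch` and `FakeHttp` are hypothetical names, and this is not how Bonobo resolves services internally.

```python
import functools


class FakeHttp:
    """Hypothetical stand-in for an HTTP client exposing a get() method."""

    def get(self, url):
        return 'response for ' + url


def fetch(http, url):
    """A transformation-like function that depends on an 'http' service."""
    return http.get(url)


def inject(func, requirements, services):
    """Bind named service implementations to the parameters that need them."""
    bound = {param: services[name] for param, name in requirements.items()}
    return functools.partial(func, **bound)


# The implementation is chosen only here, at the last possible moment.
services = {'http.client': FakeHttp()}
fetch_users = inject(fetch, {'http': 'http.client'}, services)
print(fetch_users(url='https://example.com/users'))
```

Swapping the value stored under `'http.client'` changes the behaviour without touching `fetch`, which is the separation of concerns described above.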
There are two ways of providing implementations:
* Either file-wide, by providing a `get_services()` function that returns a dict of named implementations (we did so
with filesystems in the previous step, :doc:`tut02`)
* Or directory-wide, by providing a `get_services()` function in a specially named `_services.py` file.
The first is simpler if you only have one transformation graph in one file, the second allows you to group coherent
transformations together in a directory and share the implementations.
Let's see how to use it, starting from the previous service example:
.. code-block:: python
from bonobo.config import Configurable, Option, Service
class HttpGet(Configurable):
url = Option(default='https://jsonplaceholder.typicode.com/users')
http = Service('http.client')
def call(self, http):
resp = http.get(self.url)
for row in resp.json():
yield row
We defined an "http.client" service, that obviously should have a `get()` method, returning responses that have a
`json()` method.
Let's provide two implementations for that. The first one will be using `requests <http://docs.python-requests.org/>`_,
which coincidentally satisfies the described interface:
.. code-block:: python
import bonobo
import requests
def get_services():
return {
'http.client': requests
}
graph = bonobo.Graph(
HttpGet(),
print,
)
If you run this code, you should see some mock data returned by the webservice we called (assuming it's up and you can
reach it).
Now, the second implementation will replace that with a mock, used for testing purposes:
.. code-block:: python
class HttpResponseStub:
def json(self):
return [
{'id': 1, 'name': 'Leanne Graham', 'username': 'Bret', 'email': 'Sincere@april.biz', 'address': {'street': 'Kulas Light', 'suite': 'Apt. 556', 'city': 'Gwenborough', 'zipcode': '92998-3874', 'geo': {'lat': '-37.3159', 'lng': '81.1496'}}, 'phone': '1-770-736-8031 x56442', 'website': 'hildegard.org', 'company': {'name': 'Romaguera-Crona', 'catchPhrase': 'Multi-layered client-server neural-net', 'bs': 'harness real-time e-markets'}},
{'id': 2, 'name': 'Ervin Howell', 'username': 'Antonette', 'email': 'Shanna@melissa.tv', 'address': {'street': 'Victor Plains', 'suite': 'Suite 879', 'city': 'Wisokyburgh', 'zipcode': '90566-7771', 'geo': {'lat': '-43.9509', 'lng': '-34.4618'}}, 'phone': '010-692-6593 x09125', 'website': 'anastasia.net', 'company': {'name': 'Deckow-Crist', 'catchPhrase': 'Proactive didactic contingency', 'bs': 'synergize scalable supply-chains'}},
]
class HttpStub:
def get(self, url):
return HttpResponseStub()
def get_services():
return {
'http.client': HttpStub()
}
graph = bonobo.Graph(
HttpGet(),
print,
)
The `Graph` definition stays exactly the same, so you can easily substitute the `_services.py` file depending on your
environment (how you do this is outside bonobo's scope and heavily depends on your usual way of managing
configuration files on different platforms).
Starting with bonobo 0.5 (not yet released), you will be able to use service injections with function-based
transformations too, using the `bonobo.config.requires` decorator to mark a dependency.
.. code-block:: python
from bonobo.config import requires
@requires('http.client')
def http_get(http):
resp = http.get('https://jsonplaceholder.typicode.com/users')
for row in resp.json():
yield row
Read more
:::::::::
* :doc:`/guide/services`
* :doc:`/reference/api_config`
Next
::::
:doc:`tut04`.
Moving forward
::::::::::::::
You now know:
* How to ...
**Next: :doc:`5-packaging`**


@ -0,0 +1,11 @@
Part 5: Projects and Packaging
==============================
Moving forward
::::::::::::::
You now know:
* How to ...

3
docs/tutorial/django.rst Normal file

@ -0,0 +1,3 @@
Working with Django
===================


@ -17,47 +17,43 @@ Bonobo uses simple python and should be quick and easy to learn.
Tutorial
::::::::
.. note::
.. toctree::
:maxdepth: 1
Good documentation is not easy to write. We do our best to make it better and better.
1-init
2-jobs
3-files
4-services
5-packaging
Although all content here should be accurate, you may feel a lack of completeness, for which we plead guilty and
apologize.
If you're stuck, please come and ask on our `slack channel <https://bonobo-slack.herokuapp.com/>`_, we'll figure
something out.
If you're not stuck but had trouble understanding something, please consider contributing to the docs (via GitHub
pull requests).
More
::::
.. toctree::
:maxdepth: 2
tut01
tut02
tut03
tut04
:maxdepth: 1
django
notebooks
sqlalchemy
What's next?
::::::::::::
Read a few examples
-------------------
* :doc:`The Bonobo Guide <../guide/index>`
* :doc:`Extensions <../extension/index>`
* :doc:`../reference/examples`
Read about best development practices
-------------------------------------
We're there!
::::::::::::
* :doc:`../guide/index`
* :doc:`../guide/purity`
Good documentation is not easy to write.
Read about integrating external tools with bonobo
-------------------------------------------------
Although all content here should be accurate, you may feel a lack of completeness, for which we plead guilty and
apologize.
* :doc:`../extension/docker`: run transformation graphs in isolated containers.
* :doc:`../extension/jupyter`: run transformations within jupyter notebooks.
* :doc:`../extension/selenium`: crawl the web using a real browser and work with the gathered data.
* :doc:`../extension/sqlalchemy`: everything you need to interract with SQL databases.
If you're stuck, please come to the `Bonobo Slack Channel <https://bonobo-slack.herokuapp.com/>`_ and we'll figure it
out.
If you're not stuck but had trouble understanding something, please consider contributing to the docs (using GitHub
pull requests).


@ -0,0 +1,4 @@
Working with Jupyter Notebooks
==============================


@ -0,0 +1,4 @@
Working with SQL Databases
==========================