Major update to documentation, removing deprecated docs and adding the new syntax to graph building options.
@ -1,9 +0,0 @@
|
||||
.. warning::
|
||||
|
||||
This tutorial was written for |bonobo| 0.5, while the current stable version is |bonobo| 0.6.
|
||||
|
||||
Please be aware that some things changed.
|
||||
|
||||
A summary of changes is available in the `migration guide from 0.5 to 0.6 <https://news.bonobo-project.org/migration-guide-for-bonobo-0-6-alpha-c1d36b0a9d35>`_.
|
||||
|
||||
|
||||
@ -1,65 +0,0 @@
|
||||
First steps
|
||||
===========
|
||||
|
||||
.. include:: _outdated_note.rst
|
||||
|
||||
What is Bonobo?
|
||||
:::::::::::::::
|
||||
|
||||
Bonobo is an ETL (Extract-Transform-Load) framework for python 3.5+. The goal is to define data transformations, with
python code in charge of handling similarly shaped, independent lines of data.
|
||||
|
||||
Bonobo *is not* a statistical or data-science tool. If you're looking for a data-analysis tool in python, use Pandas.
|
||||
|
||||
Bonobo is a lean manufacturing assembly line for data that lets you focus on the actual work instead of the plumbing
|
||||
(execution contexts, parallelism, error handling, console output, logging, ...).
|
||||
|
||||
Bonobo uses simple python and should be quick and easy to learn.
|
||||
|
||||
Tutorial
|
||||
::::::::
|
||||
|
||||
.. note::
|
||||
|
||||
Good documentation is not easy to write. We do our best to make it better and better.
|
||||
|
||||
Although all content here should be accurate, you may feel a lack of completeness, for which we plead guilty and
|
||||
apologize.
|
||||
|
||||
If you're stuck, please come and ask on our `slack channel <https://bonobo-slack.herokuapp.com/>`_, we'll figure
|
||||
something out.
|
||||
|
||||
If you're not stuck but had trouble understanding something, please consider contributing to the docs (via GitHub
|
||||
pull requests).
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
tut01
|
||||
tut02
|
||||
tut03
|
||||
tut04
|
||||
|
||||
|
||||
What's next?
|
||||
::::::::::::
|
||||
|
||||
Read a few examples
|
||||
-------------------
|
||||
|
||||
* :doc:`/reference/examples`
|
||||
|
||||
Read about best development practices
|
||||
-------------------------------------
|
||||
|
||||
* :doc:`/guide/index`
|
||||
* :doc:`/guide/purity`
|
||||
|
||||
Read about integrating external tools with bonobo
|
||||
-------------------------------------------------
|
||||
|
||||
* :doc:`/extension/docker`: run transformation graphs in isolated containers.
|
||||
* :doc:`/extension/jupyter`: run transformations within jupyter notebooks.
|
||||
* :doc:`/extension/selenium`: crawl the web using a real browser and work with the gathered data.
|
||||
* :doc:`/extension/sqlalchemy`: everything you need to interact with SQL databases.
|
||||
|
||||
@ -1,13 +0,0 @@
|
||||
Just enough Python for Bonobo
|
||||
=============================
|
||||
|
||||
.. include:: _outdated_note.rst
|
||||
|
||||
.. todo::
|
||||
|
||||
This is a work in progress and it is not yet available. Please come back later or even better, help us write this
|
||||
guide!
|
||||
|
||||
This guide is intended to help programmers and enthusiasts grasp the python basics necessary to use Bonobo. It
should definitely not be considered a general python introduction, nor a deep dive into details.
|
||||
|
||||
@ -1,202 +0,0 @@
|
||||
Let's get started!
|
||||
==================
|
||||
|
||||
.. include:: _outdated_note.rst
|
||||
|
||||
To begin with Bonobo, you need to install it in a working python 3.5+ environment, and you'll also need cookiecutter
|
||||
to bootstrap your project.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ pip install bonobo cookiecutter
|
||||
|
||||
See :doc:`/install` for more options.
|
||||
|
||||
|
||||
Create an empty project
|
||||
:::::::::::::::::::::::
|
||||
|
||||
Your ETL code will live in ETL projects, which are basically a bunch of files, including python code, that bonobo
|
||||
can run.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo init tutorial
|
||||
|
||||
This will create a `tutorial` directory (`content description here <https://www.bonobo-project.org/with/cookiecutter>`_).
|
||||
|
||||
To run this project, use:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo run tutorial
|
||||
|
||||
|
||||
Write a first transformation
|
||||
::::::::::::::::::::::::::::
|
||||
|
||||
Open `tutorial/main.py`, and delete all the code there.
|
||||
|
||||
A transformation can be anything python can call. The simplest transformations are functions and generators.
|
||||
|
||||
Let's write one:
|
||||
|
||||
.. code-block:: python

    def transform(x):
        return x.upper()
|
||||
|
||||
Easy.
|
||||
|
||||
.. note::
|
||||
|
||||
This function is very similar to :func:`str.upper`, which you can use directly.
|
||||
|
||||
Let's write two more transformations for the "extract" and "load" steps. In this example, we'll generate the data from
|
||||
scratch, and we'll use stdout to "simulate" data-persistence.
|
||||
|
||||
.. code-block:: python

    def extract():
        yield 'foo'
        yield 'bar'
        yield 'baz'

    def load(x):
        print(x)
|
||||
|
||||
Bonobo makes no distinction between generators (yielding functions) and regular functions. It will, in all cases, iterate
|
||||
on things returned, and a normal function will just be seen as a generator that yields only once.
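
The rule above can be illustrated without Bonobo at all. The following is a minimal sketch, not Bonobo's actual implementation: the `iter_outputs` helper is a hypothetical name introduced here to show how a caller can treat a regular function as a generator that yields exactly once.

```python
import inspect


def iter_outputs(transformation, *args):
    """Call a transformation and always return an iterator over its outputs."""
    result = transformation(*args)
    if inspect.isgenerator(result):
        # Generator functions may yield zero, one, or many outputs.
        return result
    # A regular function is treated as if it had yielded its return value once.
    return iter([result])


def extract():
    yield 'foo'
    yield 'bar'
    yield 'baz'


def transform(x):
    return x.upper()


# Chain the two steps by hand: every output of extract() feeds transform().
outputs = [out for row in iter_outputs(extract) for out in iter_outputs(transform, row)]
print(outputs)  # → ['FOO', 'BAR', 'BAZ']
```

With this kind of normalization, the code that connects nodes only ever deals with iterators, whichever style the transformation was written in.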
|
||||
|
||||
.. note::
|
||||
|
||||
Once again, you should use the builtin :func:`print` directly instead of this `load()` function.
|
||||
|
||||
|
||||
Create a transformation graph
|
||||
:::::::::::::::::::::::::::::
|
||||
|
||||
Among other features, Bonobo will mostly help you with the following:
|
||||
|
||||
* Execute the transformations in independent threads
|
||||
* Pass the outputs of one thread to the inputs of one or more other threads.
|
||||
|
||||
To do this, it needs to know what data-flow you want to achieve, and you'll use a :class:`bonobo.Graph` to describe it.
|
||||
|
||||
.. code-block:: python

    import bonobo

    graph = bonobo.Graph(extract, transform, load)

    if __name__ == '__main__':
        bonobo.run(graph)
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
stylesheet = "../_static/graphs.css";
|
||||
|
||||
BEGIN [shape="point"];
|
||||
BEGIN -> "extract" -> "transform" -> "load";
|
||||
}
|
||||
|
||||
.. note::
|
||||
|
||||
The `if __name__ == '__main__':` section is not required, unless you want to run it directly using the python
|
||||
interpreter.
|
||||
|
||||
|
||||
Execute the job
|
||||
:::::::::::::::
|
||||
|
||||
Save `tutorial/main.py` and execute your transformation again:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo run tutorial
|
||||
|
||||
This example is available in :mod:`bonobo.examples.tutorials.tut01e01`, and you can also run it as a module:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo run -m bonobo.examples.tutorials.tut01e01
|
||||
|
||||
|
||||
Rewrite it using builtins
|
||||
:::::::::::::::::::::::::
|
||||
|
||||
There is a much simpler way to describe an equivalent graph:
|
||||
|
||||
.. literalinclude:: ../../bonobo/examples/tutorials/tut01e02.py
|
||||
:language: python
|
||||
|
||||
The `extract()` generator has been replaced by a list, as Bonobo will interpret non-callable iterables as a no-input
|
||||
generator.
|
||||
|
||||
This example is also available in :mod:`bonobo.examples.tutorials.tut01e02`, and you can also run it as a module:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo run -m bonobo.examples.tutorials.tut01e02
|
||||
|
||||
You can now jump to the next part (:doc:`tut02`), or read a small summary of concepts and definitions introduced here
|
||||
below.
|
||||
|
||||
Takeaways
|
||||
:::::::::
|
||||
|
||||
① The :class:`bonobo.Graph` class is used to represent a data-processing pipeline.
|
||||
|
||||
It can represent simple list-like linear graphs, like here, but it can also represent much more complex graphs, with
|
||||
forks and joins.
|
||||
|
||||
This is what the graph we defined looks like:
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
BEGIN [shape="point"];
|
||||
BEGIN -> "iter(['foo', 'bar', 'baz'])" -> "str.upper" -> "print";
|
||||
}
|
||||
|
||||
|
||||
② `Transformations` are simple python callables. Whatever can be called can be used as a `transformation`. Callables can
|
||||
either `return` or `yield` data to send it to the next step. Regular functions (using `return`) should be preferred if
each call is guaranteed to return exactly one result, while generators (using `yield`) should be preferred if the
|
||||
number of output lines for a given input varies.
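
As a small illustration of this guideline (the function names and sample strings below are made up for the example), a one-to-one step fits a regular function, while a step whose output count depends on the input fits a generator:

```python
def normalize(line):
    # Exactly one output per input: a regular function with ``return`` is enough.
    return line.strip().lower()


def split_words(line):
    # The number of outputs depends on the input, so a generator is a better fit.
    for word in line.split():
        yield word


print(normalize('  Hello World '))            # → hello world
print(list(split_words('hello big world')))   # → ['hello', 'big', 'world']
print(list(split_words('')))                  # → []
```

Note that the generator can also produce no output at all for a given input, which is a simple way to filter rows out.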
|
||||
|
||||
③ The `Graph` instance, or `transformation graph` is executed using an `ExecutionStrategy`. You won't use it directly,
|
||||
but :func:`bonobo.run` creates an instance of :class:`bonobo.ThreadPoolExecutorStrategy` under the hood (the default
|
||||
strategy). Actual behavior of an execution will depend on the strategy chosen, but the default should be fine for most
|
||||
cases.
|
||||
|
||||
④ Before actually executing the `transformations`, the `ExecutionStrategy` instance will wrap each component in an
|
||||
`execution context`, whose responsibility is to hold the state of the transformation. It enables you to keep the
|
||||
`transformations` stateless, while allowing you to add an external state if required. We'll expand on this later.
|
||||
|
||||
Concepts and definitions
|
||||
::::::::::::::::::::::::
|
||||
|
||||
* **Transformation**: a callable that takes input (as call parameters) and returns output(s), either as its return value or
|
||||
by yielding values (a.k.a. returning a generator).
|
||||
|
||||
* **Transformation graph (or Graph)**: a set of transformations tied together in a :class:`bonobo.Graph` instance, which is
|
||||
a directed acyclic graph (or DAG).
|
||||
|
||||
* **Node**: a graph element, most probably a transformation in a graph.
|
||||
|
||||
* **Execution strategy (or strategy)**: a way to run a transformation graph. Its responsibility is mainly to parallelize
(or not) the transformations, on one or more processes and/or computers, and to set up the right queuing mechanism for
|
||||
transformations' inputs and outputs.
|
||||
|
||||
* **Execution context (or context)**: a wrapper around a node that holds the state for it. If the node needs state, there
|
||||
are tools available in bonobo to feed it to the transformation using additional call parameters, keeping
|
||||
transformations stateless.
|
||||
|
||||
Next
|
||||
::::
|
||||
|
||||
Time to jump to the second part: :doc:`tut02`.
|
||||
@ -1,123 +0,0 @@
|
||||
Working with files
|
||||
==================
|
||||
|
||||
.. include:: _outdated_note.rst
|
||||
|
||||
Bonobo would be pointless if the aim was just to uppercase small lists of strings.
|
||||
|
||||
In fact, Bonobo should not be used if you don't expect any gain from parallelization/distribution of tasks.
|
||||
|
||||
Some background...
|
||||
::::::::::::::::::
|
||||
|
||||
Let's take the following graph:
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
BEGIN [shape="point"];
|
||||
BEGIN -> "A" -> "B" -> "C";
|
||||
"B" -> "D";
|
||||
}
|
||||
|
||||
When run, the execution strategy wraps every component in a thread (assuming you're using the default
|
||||
:class:`bonobo.strategies.ThreadPoolExecutorStrategy`).
|
||||
|
||||
Bonobo will send each line of data in the input node's thread (here, `A`). Now, each time `A` *yields* or *returns*
|
||||
something, it will be pushed onto `B`'s input :class:`queue.Queue`, and will be consumed by `B`'s thread. Meanwhile, `A`
|
||||
will continue to run, if it's not done.
|
||||
|
||||
When there is more than one node linked as the output of a node (for example, with `B`, `C`, and `D`), the same thing
|
||||
happens, except that each result coming out of `B` will be sent to both the `C` and `D` input :class:`queue.Queue` instances.
|
||||
|
||||
One thing to keep in mind here is that as the objects are passed from thread to thread, you need to write "pure"
|
||||
transformations (see :doc:`/guide/purity`).
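
To make the idea of purity concrete, here is a minimal sketch with made-up function names and data. The impure version mutates the object it receives, which is risky when the same object may also have been handed to another node; the pure version builds a new object instead.

```python
def impure_add_slug(row):
    # Mutates the dict it received: any other holder of this object
    # sees the change too.
    row['slug'] = row['name'].lower()
    return row


def pure_add_slug(row):
    # Builds and returns a new dict, leaving the input untouched.
    return {**row, 'slug': row['name'].lower()}


original = {'name': 'Foo'}
result = pure_add_slug(original)
print(original)  # → {'name': 'Foo'}
print(result)    # → {'name': 'Foo', 'slug': 'foo'}
```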
|
||||
|
||||
You generally don't have to think about it. Just be aware that your nodes will run in parallel, and don't worry
|
||||
too much about nodes running blocking operations, as they will run in parallel. As soon as a line of output is ready,
|
||||
the next nodes will start consuming it.
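
The queue-based hand-off described in this section can be sketched with the standard library alone. This is an illustration of the principle, not Bonobo's execution strategy: the `END` sentinel and the `run_node` and `drain` helpers are hypothetical names, `B` is played by :func:`str.upper`, and `C` and `D` simply collect what they receive.

```python
import queue
import threading

END = object()  # sentinel marking the end of a stream (illustrative only)


def run_node(func, inbox, outboxes):
    """Consume items from ``inbox``, apply ``func``, and fan out the results."""
    while True:
        item = inbox.get()
        if item is END:
            for outbox in outboxes:
                outbox.put(END)
            return
        result = func(item)
        for outbox in outboxes:
            outbox.put(result)  # every downstream queue receives each result


def drain(inbox, sink):
    """Collect everything from ``inbox`` into ``sink`` until END arrives."""
    while True:
        item = inbox.get()
        if item is END:
            return
        sink.append(item)


b_in, c_in, d_in = queue.Queue(), queue.Queue(), queue.Queue()
c_seen, d_seen = [], []

threads = [
    threading.Thread(target=run_node, args=(str.upper, b_in, [c_in, d_in])),
    threading.Thread(target=drain, args=(c_in, c_seen)),
    threading.Thread(target=drain, args=(d_in, d_seen)),
]
for t in threads:
    t.start()

# "A" is played by the main thread here: it pushes rows onto B's input queue.
for row in ['foo', 'bar', 'baz']:
    b_in.put(row)
b_in.put(END)

for t in threads:
    t.join()

print(c_seen, d_seen)
```

Because `B` starts consuming as soon as the first row is queued, the producer never has to wait for downstream nodes to finish before emitting its next row.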
|
||||
|
||||
That being said, let's manipulate some files.
|
||||
|
||||
Reading a file
|
||||
::::::::::::::
|
||||
|
||||
There are a few component builders available in **Bonobo** that let you read from (or write to) files.
|
||||
|
||||
All readers work the same way. They need a filesystem to work with, and open a "path" they will read from.
|
||||
|
||||
* :class:`bonobo.CsvReader`
|
||||
* :class:`bonobo.FileReader`
|
||||
* :class:`bonobo.JsonReader`
|
||||
* :class:`bonobo.PickleReader`
|
||||
|
||||
We'll use a text file that was generated using Bonobo from the "liste-des-cafes-a-un-euro" dataset made available by
|
||||
Mairie de Paris under the Open Database License (ODbL). You can `explore the original dataset
|
||||
<https://opendata.paris.fr/explore/dataset/liste-des-cafes-a-un-euro/information/>`_.
|
||||
|
||||
You'll need the `"coffeeshops.txt" example dataset <https://github.com/python-bonobo/bonobo/blob/master/bonobo/examples/datasets/coffeeshops.txt>`_,
|
||||
available in **Bonobo**'s repository:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ curl https://raw.githubusercontent.com/python-bonobo/bonobo/master/bonobo/examples/datasets/coffeeshops.txt > `python3 -c 'import bonobo; print(bonobo.get_examples_path("datasets/coffeeshops.txt"))'`
|
||||
|
||||
.. note::
|
||||
|
||||
The "example dataset download" step will be easier in the future.
|
||||
|
||||
https://github.com/python-bonobo/bonobo/issues/134
|
||||
|
||||
.. literalinclude:: ../../bonobo/examples/tutorials/tut02e01_read.py
|
||||
:language: python
|
||||
|
||||
You can also run this example as a module (but you'll still need the dataset...):
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo run -m bonobo.examples.tutorials.tut02e01_read
|
||||
|
||||
.. note::
|
||||
|
||||
Don't focus too much on the `get_services()` function for now. It is required, with this exact name, but we'll get
|
||||
into that in a few minutes.
|
||||
|
||||
Writing to files
|
||||
::::::::::::::::
|
||||
|
||||
Let's split each line of this file on the first comma and store a json file mapping coffee shop names to their addresses.
|
||||
|
||||
Like the readers, the following classes are available to write files:
|
||||
|
||||
* :class:`bonobo.CsvWriter`
|
||||
* :class:`bonobo.FileWriter`
|
||||
* :class:`bonobo.JsonWriter`
|
||||
* :class:`bonobo.PickleWriter`
|
||||
|
||||
Let's write a first implementation:
|
||||
|
||||
.. literalinclude:: ../../bonobo/examples/tutorials/tut02e02_write.py
|
||||
:language: python
|
||||
|
||||
(run it with :code:`bonobo run -m bonobo.examples.tutorials.tut02e02_write` or :code:`bonobo run myfile.py`)
|
||||
|
||||
If you read the output file, you'll see it misses the "map" part of the problem.
|
||||
|
||||
Let's extend :class:`bonobo.io.JsonWriter` to finish the job:
|
||||
|
||||
.. literalinclude:: ../../bonobo/examples/tutorials/tut02e03_writeasmap.py
|
||||
:language: python
|
||||
|
||||
(run it with :code:`bonobo run -m bonobo.examples.tutorials.tut02e03_writeasmap` or :code:`bonobo run myfile.py`)
|
||||
|
||||
It should produce a nice map.
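
For readers who want to see the streaming idea in isolation, here is a minimal sketch that does not use Bonobo. It is not the code of the bundled example: `split_one` and `write_json_map` are hypothetical helpers, and the two sample lines are invented. Each entry is written as soon as it is available, so the full mapping never has to exist in memory.

```python
import io
import json


def split_one(line):
    # Split on the first comma only, so addresses may contain commas themselves.
    name, address = line.split(',', 1)
    return name.strip(), address.strip()


def write_json_map(lines, stream):
    """Write ``{name: address, ...}`` to ``stream`` one entry at a time."""
    stream.write('{')
    for index, line in enumerate(lines):
        name, address = split_one(line)
        if index:
            stream.write(',\n')  # separator before every entry but the first
        stream.write('{}: {}'.format(json.dumps(name), json.dumps(address)))
    stream.write('}')


buffer = io.StringIO()
write_json_map(['Cafe A, 1 rue X, Paris', 'Cafe B, 2 rue Y, Paris'], buffer)
parsed = json.loads(buffer.getvalue())
print(parsed)
```

One trade-off of this approach is that duplicate names are written twice rather than merged, since the writer keeps no record of what it has already emitted.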
|
||||
|
||||
We favored a slightly hackish solution here instead of constructing a map in python and then passing the whole thing to
:func:`json.dumps`, because we want to work with streams. If you have to construct the whole data structure in python,
you'll lose a lot of bonobo's benefits.
|
||||
|
||||
Next
|
||||
::::
|
||||
|
||||
Time to write some more advanced transformations, with service dependencies: :doc:`tut03`.
|
||||
@ -1,202 +0,0 @@
|
||||
Configurables and Services
|
||||
==========================
|
||||
|
||||
.. include:: _outdated_note.rst
|
||||
|
||||
.. note::
|
||||
|
||||
This section lacks completeness, sorry for that (but you can still read it!).
|
||||
|
||||
In the last section, we used a few new tools.
|
||||
|
||||
Class-based transformations and configurables
|
||||
:::::::::::::::::::::::::::::::::::::::::::::
|
||||
|
||||
Bonobo is a bit dumb. If something is callable, it considers it can be used as a transformation, and it's up to the
|
||||
user to provide callables that logically fit in a graph.
|
||||
|
||||
You can use plain python objects with a `__call__()` method, and it will just work.
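
For instance, a plain class like the following (a hypothetical `Prefixer`, with no Bonobo base class involved) produces objects that are callable and could therefore be used wherever a function is expected:

```python
class Prefixer:
    """A plain callable object: state lives on the instance."""

    def __init__(self, prefix='>>>'):
        self.prefix = prefix

    def __call__(self, row):
        return self.prefix + ' ' + row


prefix_it = Prefixer('$')
print(callable(prefix_it))  # → True
print(prefix_it('foo'))     # → $ foo
```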
|
||||
|
||||
As a lot of transformations need common machinery, there are a few tools to quickly build transformations, most of
|
||||
them requiring your class to subclass :class:`bonobo.config.Configurable`.
|
||||
|
||||
Configurables let you use the following features:
|
||||
|
||||
* You can add **Options** (using the :class:`bonobo.config.Option` descriptor). Options can be positional, or keyword
|
||||
based, can have a default value and will be consumed from the constructor arguments.
|
||||
|
||||
.. code-block:: python

    from bonobo.config import Configurable, Option

    class PrefixIt(Configurable):
        prefix = Option(str, positional=True, default='>>>')

        def call(self, row):
            return self.prefix + ' ' + row

    prefixer = PrefixIt('$')
|
||||
|
||||
* You can add **Services** (using the :class:`bonobo.config.Service` descriptor). Services are a subclass of
|
||||
:class:`bonobo.config.Option`, sharing the same basics, but specialized in the definition of "named services" that
|
||||
will be resolved at runtime (a.k.a for which we will provide an implementation at runtime). We'll dive more into that
|
||||
in the next section.
|
||||
|
||||
.. code-block:: python

    from bonobo.config import Configurable, Option, Service

    class HttpGet(Configurable):
        url = Option(default='https://jsonplaceholder.typicode.com/users')
        http = Service('http.client')

        def call(self, http):
            resp = http.get(self.url)

            for row in resp.json():
                yield row

    http_get = HttpGet()
|
||||
|
||||
|
||||
* You can add **Methods** (using the :class:`bonobo.config.Method` descriptor). :class:`bonobo.config.Method` is a
|
||||
subclass of :class:`bonobo.config.Option` that allows you to pass callable parameters, either to the class constructor,
|
||||
or using the class as a decorator.
|
||||
|
||||
.. code-block:: python

    from bonobo.config import Configurable, Method

    class Applier(Configurable):
        apply = Method()

        def call(self, row):
            return self.apply(row)

    @Applier
    def Prefixer(self, row):
        return 'Hello, ' + row

    prefixer = Prefixer()
|
||||
|
||||
* You can add **ContextProcessors**, which are an advanced feature we won't introduce here. If you're familiar with
|
||||
pytest, you can think of them as pytest fixtures, execution wise.
|
||||
|
||||
Services
|
||||
::::::::
|
||||
|
||||
The motivation behind services is mostly separation of concerns, testability and deployability.
|
||||
|
||||
Usually, your transformations will depend on services (like a filesystem, an http client, a database, a rest api, ...).
|
||||
Those services can very well be hardcoded in the transformations, but there are two main drawbacks:
|
||||
|
||||
* You won't be able to change the implementation depending on the current environment (development laptop versus
|
||||
production servers, bug-hunting session versus execution, etc.)
|
||||
* You won't be able to test your transformations without testing the associated services.
|
||||
|
||||
To overcome those caveats of hardcoding things, we define Services in the configurable, which are basically
|
||||
string-options of the service names, and we provide an implementation at the last moment possible.
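
The principle of resolving named services late can be sketched in a few lines of plain python. This is only an illustration of the idea, not how Bonobo resolves services internally: `resolve`, `FakeHttp` and `fetch` are hypothetical names, and the URL is a placeholder.

```python
def resolve(services, *names):
    """Look up each named service, failing loudly if one is missing."""
    missing = [name for name in names if name not in services]
    if missing:
        raise KeyError('Missing service(s): {}'.format(', '.join(missing)))
    return [services[name] for name in names]


class FakeHttp:
    def get(self, url):
        return 'response for ' + url


def fetch(http, url):
    # The transformation only knows it receives "something with a get() method".
    return http.get(url)


services = {'http.client': FakeHttp()}  # chosen per environment
(http,) = resolve(services, 'http.client')
print(fetch(http, 'https://example.com/users'))
```

Because the transformation refers to the service by name only, swapping the dictionary is enough to switch between a real client and a test double.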
|
||||
|
||||
There are two ways of providing implementations:
|
||||
|
||||
* Either file-wide, by providing a `get_services()` function that returns a dict of named implementations (we did so
|
||||
with filesystems in the previous step, :doc:`tut02`)
|
||||
* Or directory-wide, by providing a `get_services()` function in a specially named `_services.py` file.
|
||||
|
||||
The first is simpler if you only have one transformation graph in one file; the second allows you to group coherent
|
||||
transformations together in a directory and share the implementations.
|
||||
|
||||
Let's see how to use it, starting from the previous service example:
|
||||
|
||||
.. code-block:: python

    from bonobo.config import Configurable, Option, Service

    class HttpGet(Configurable):
        url = Option(default='https://jsonplaceholder.typicode.com/users')
        http = Service('http.client')

        def call(self, http):
            resp = http.get(self.url)

            for row in resp.json():
                yield row
|
||||
|
||||
We defined an "http.client" service, that obviously should have a `get()` method, returning responses that have a
|
||||
`json()` method.
|
||||
|
||||
Let's provide two implementations for that. The first one will be using `requests <http://docs.python-requests.org/>`_,
|
||||
which coincidentally satisfies the described interface:
|
||||
|
||||
.. code-block:: python

    import bonobo
    import requests

    def get_services():
        return {
            'http.client': requests
        }

    graph = bonobo.Graph(
        HttpGet(),
        print,
    )
|
||||
|
||||
If you run this code, you should see some mock data returned by the webservice we called (assuming it's up and you can
|
||||
reach it).
|
||||
|
||||
Now, the second implementation will replace that with a mock, used for testing purposes:
|
||||
|
||||
.. code-block:: python

    class HttpResponseStub:
        def json(self):
            return [
                {'id': 1, 'name': 'Leanne Graham', 'username': 'Bret', 'email': 'Sincere@april.biz', 'address': {'street': 'Kulas Light', 'suite': 'Apt. 556', 'city': 'Gwenborough', 'zipcode': '92998-3874', 'geo': {'lat': '-37.3159', 'lng': '81.1496'}}, 'phone': '1-770-736-8031 x56442', 'website': 'hildegard.org', 'company': {'name': 'Romaguera-Crona', 'catchPhrase': 'Multi-layered client-server neural-net', 'bs': 'harness real-time e-markets'}},
                {'id': 2, 'name': 'Ervin Howell', 'username': 'Antonette', 'email': 'Shanna@melissa.tv', 'address': {'street': 'Victor Plains', 'suite': 'Suite 879', 'city': 'Wisokyburgh', 'zipcode': '90566-7771', 'geo': {'lat': '-43.9509', 'lng': '-34.4618'}}, 'phone': '010-692-6593 x09125', 'website': 'anastasia.net', 'company': {'name': 'Deckow-Crist', 'catchPhrase': 'Proactive didactic contingency', 'bs': 'synergize scalable supply-chains'}},
            ]

    class HttpStub:
        def get(self, url):
            return HttpResponseStub()

    def get_services():
        return {
            'http.client': HttpStub()
        }

    graph = bonobo.Graph(
        HttpGet(),
        print,
    )
|
||||
|
||||
Since the `Graph` definition stays exactly the same, you can easily substitute the `_services.py` file depending on your
environment (how you do this is outside bonobo's scope and heavily depends on your usual way of managing
|
||||
configuration files on different platforms).
|
||||
|
||||
Starting with bonobo 0.5 (not yet released), you will be able to use service injections with function-based
|
||||
transformations too, using the `bonobo.config.requires` decorator to mark a dependency.
|
||||
|
||||
.. code-block:: python

    from bonobo.config import requires

    @requires('http.client')
    def http_get(http):
        resp = http.get('https://jsonplaceholder.typicode.com/users')

        for row in resp.json():
            yield row
|
||||
|
||||
|
||||
Read more
|
||||
:::::::::
|
||||
|
||||
* :doc:`/guide/services`
|
||||
* :doc:`/reference/api_config`
|
||||
|
||||
Next
|
||||
::::
|
||||
|
||||
:doc:`tut04`.
|
||||
@ -1,216 +0,0 @@
|
||||
Working with databases
|
||||
======================
|
||||
|
||||
.. include:: _outdated_note.rst
|
||||
|
||||
Databases (and especially SQL databases here) are not the focus of Bonobo, thus support for them is not (and will never
|
||||
be) included in the main package. Instead, working with databases is done using third party, well maintained and
|
||||
specialized packages, like SQLAlchemy, or other database access libraries from the python cheese shop.
|
||||
|
||||
.. note::
|
||||
|
||||
The SQLAlchemy extension is not yet complete. Things may not be optimal, and some APIs will change. You can still try it,
|
||||
of course.
|
||||
|
||||
Consider the following document as a "preview" (yes, it should work, yes it may break in the future).
|
||||
|
||||
Also, note that for early development stages, we explicitly support only PostgreSQL, although it may work well
|
||||
with `any other database supported by SQLAlchemy <http://docs.sqlalchemy.org/en/latest/core/engines.html#supported-databases>`_.
|
||||
|
||||
First, read https://www.bonobo-project.org/with/sqlalchemy for instructions on how to install. You **do need** the
|
||||
bleeding edge version of `bonobo` and `bonobo-sqlalchemy` to make this work.
|
||||
|
||||
Requirements
|
||||
::::::::::::
|
||||
|
||||
Once you have installed `bonobo_sqlalchemy` (read https://www.bonobo-project.org/with/sqlalchemy to use bleeding edge
|
||||
version), install the following additional packages:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ pip install -U python-dotenv psycopg2 awesome-slugify
|
||||
|
||||
Those packages are not required by the extension, but `python-dotenv` will help us configure the database DSN, and
|
||||
`psycopg2` is required by SQLAlchemy to connect to PostgreSQL databases. Also, we'll use a slugifier to create unique
|
||||
identifiers for the database (maybe not what you'd do in the real world, but sufficient for the purposes of this example).
|
||||
|
||||
Configure a database engine
|
||||
:::::::::::::::::::::::::::
|
||||
|
||||
Open your `_services.py` file and replace the code:
|
||||
|
||||
.. code-block:: python

    import bonobo, dotenv, logging, os
    from bonobo_sqlalchemy.util import create_postgresql_engine

    dotenv.load_dotenv(dotenv.find_dotenv())
    logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

    def get_services():
        return {
            'fs': bonobo.open_examples_fs('datasets'),
            'fs.output': bonobo.open_fs(),
            'sqlalchemy.engine': create_postgresql_engine(**{
                'name': 'tutorial',
                'user': 'tutorial',
                'pass': 'tutorial',
            })
        }
|
||||
|
||||
The `create_postgresql_engine` is a tiny function building the DSN from reasonable defaults, that you can override
|
||||
either by providing kwargs, or with system environment variables. If you want to override something, open the `.env`
|
||||
file and add values for one or more of `POSTGRES_NAME`, `POSTGRES_USER`, `POSTGRES_PASS`, `POSTGRES_HOST`,
`POSTGRES_PORT`. Please note that kwargs always take precedence over the environment, but that you should prefer using
|
||||
environment variables for anything that is not immutable from one platform to another.
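
The precedence rule (kwargs first, then environment variables, then defaults) can be sketched as follows. This is not the source of `create_postgresql_engine`: `build_dsn` is a hypothetical helper, and the default values and the DSN layout are assumptions made for the illustration. Only the `POSTGRES_*` variable names come from the text above.

```python
import os

# Hypothetical defaults, chosen for illustration only.
DEFAULTS = {'name': 'postgres', 'user': 'postgres', 'pass': '', 'host': 'localhost', 'port': '5432'}


def build_dsn(environ=None, **kwargs):
    """Build a DSN string; precedence is kwargs, then environment, then defaults."""
    environ = os.environ if environ is None else environ
    cfg = {}
    for key, default in DEFAULTS.items():
        if key in kwargs:
            cfg[key] = kwargs[key]  # explicit keyword argument wins
        else:
            cfg[key] = environ.get('POSTGRES_' + key.upper(), default)
    return 'postgresql://{}:{}@{}:{}/{}'.format(
        cfg['user'], cfg['pass'], cfg['host'], cfg['port'], cfg['name'])


dsn = build_dsn(
    environ={'POSTGRES_HOST': 'db.internal', 'POSTGRES_USER': 'ignored'},
    name='tutorial',
    user='tutorial',
)
print(dsn)
```

Here the `user` keyword overrides `POSTGRES_USER`, while the host, which is not passed explicitly, is taken from the environment mapping.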
|
||||
|
||||
Add database operation to the graph
|
||||
:::::::::::::::::::::::::::::::::::
|
||||
|
||||
Let's create a `tutorial/pgdb.py` job:
|
||||
|
||||
.. code-block:: python

    import bonobo
    import bonobo_sqlalchemy

    from bonobo.examples.tutorials.tut02e03_writeasmap import graph, split_one_to_map

    graph = graph.copy()
    graph.add_chain(
        bonobo_sqlalchemy.InsertOrUpdate('coffeeshops'),
        _input=split_one_to_map
    )
|
||||
|
||||
Notes here:
|
||||
|
||||
* We use the code from :doc:`tut02`, which is bundled with bonobo in the `bonobo.examples.tutorials` package.
|
||||
* We "fork" the graph, by creating a copy and appending a new "chain", starting at a point that exists in the other
|
||||
graph.
|
||||
* We use :class:`bonobo_sqlalchemy.InsertOrUpdate` (whose role, in case it is not obvious, is to create database rows if
they do not exist yet, or update the existing row, based on a "discriminant" criterion (by default, "id")).
|
||||
|
||||
If we run this transformation (with `bonobo run tutorial/pgdb.py`), we should get an error:
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
| File ".../lib/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
|
||||
| conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
|
||||
| sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: database "tutorial" does not exist
|
||||
|
|
||||
|
|
||||
| The above exception was the direct cause of the following exception:
|
||||
|
|
||||
| Traceback (most recent call last):
|
||||
| File ".../bonobo-devkit/bonobo/bonobo/strategies/executor.py", line 45, in _runner
|
||||
| node_context.start()
|
||||
| File ".../bonobo-devkit/bonobo/bonobo/execution/base.py", line 75, in start
|
||||
| self._stack.setup(self)
|
||||
| File ".../bonobo-devkit/bonobo/bonobo/config/processors.py", line 94, in setup
|
||||
| _append_to_context = next(_processed)
|
||||
| File ".../bonobo-devkit/bonobo-sqlalchemy/bonobo_sqlalchemy/writers.py", line 43, in create_connection
|
||||
| raise UnrecoverableError('Could not create SQLAlchemy connection: {}.'.format(str(exc).replace('\n', ''))) from exc
|
||||
| bonobo.errors.UnrecoverableError: Could not create SQLAlchemy connection: (psycopg2.OperationalError) FATAL: database "tutorial" does not exist.
|
||||
|
||||
The database we requested does not exist. It is not the role of bonobo to do database administration, and thus there is
no tool here to create either the database or the tables we want to use.
|
||||
|
||||
Create database and table
|
||||
:::::::::::::::::::::::::
|
||||
|
||||
There are, however, tools in `sqlalchemy` to manage tables, so we'll create the database ourselves, and ask sqlalchemy
|
||||
to create the table:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ psql -U postgres -h localhost
|
||||
|
||||
psql (9.6.1, server 9.6.3)
|
||||
Type "help" for help.
|
||||
|
||||
postgres=# CREATE ROLE tutorial WITH LOGIN PASSWORD 'tutorial';
|
||||
CREATE ROLE
|
||||
postgres=# CREATE DATABASE tutorial WITH OWNER=tutorial TEMPLATE=template0 ENCODING='utf-8';
|
||||
CREATE DATABASE
|
||||
|
||||
Now, let's use a little trick and add this section to `pgdb.py`:
|
||||
|
||||
.. code-block:: python

    import sys
    from sqlalchemy import Table, Column, String, Integer, MetaData

    def main():
        from bonobo.commands.run import get_default_services
        services = get_default_services(__file__)
        if len(sys.argv) == 1:
            return bonobo.run(graph, services=services)
        elif len(sys.argv) == 2 and sys.argv[1] == 'reset':
            engine = services.get('sqlalchemy.engine')
            metadata = MetaData()

            coffee_table = Table(
                'coffeeshops',
                metadata,
                Column('id', String(255), primary_key=True),
                Column('name', String(255)),
                Column('address', String(255)),
            )

            metadata.drop_all(engine)
            metadata.create_all(engine)
        else:
            raise NotImplementedError('I do not understand.')

    if __name__ == '__main__':
        main()
|
||||
|
||||
.. note::
|
||||
|
||||
We're using a private API of bonobo here, which is unsatisfactory, discouraged and may change. Some way to get the
service dictionary will be added to the public API in a future release of bonobo.
|
||||
|
||||
Now run:
|
||||
|
||||
.. code-block:: shell-session

    $ python tutorial/pgdb.py reset
|
||||
|
||||
Database and table should now exist.
|
||||
|
||||
Format the data
|
||||
:::::::::::::::
|
||||
|
||||
Let's prepare our data for the database, and change the `.add_chain(..)` call to do it prior to `InsertOrUpdate(...)`:
|
||||
|
||||
.. code-block:: python

    from slugify import slugify_url

    def format_for_db(row):
        name, address = list(row.items())[0]
        return {
            'id': slugify_url(name),
            'name': name,
            'address': address,
        }

    # ...

    graph = graph.copy()
    graph.add_chain(
        format_for_db,
        bonobo_sqlalchemy.InsertOrUpdate('coffeeshops'),
        _input=split_one_to_map
    )
|
||||
|
||||
Run!
|
||||
::::
|
||||
|
||||
You can now run the script (either with `bonobo run tutorial/pgdb.py` or directly with the python interpreter, as we
|
||||
added a "main" section) and the dataset should be inserted in your database. If you run it again, no new rows are
|
||||
created.
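
The "insert or update" behaviour that makes a second run harmless can be sketched with the standard library. This is not the implementation of `bonobo_sqlalchemy.InsertOrUpdate`: it uses an in-memory SQLite database as a stand-in for PostgreSQL, `insert_or_update` is a hypothetical helper, and the sample row is invented. Table and column names are interpolated into the SQL text, so they must come from trusted code only.

```python
import sqlite3


def insert_or_update(conn, table, row, discriminant='id'):
    """Update the row matching ``discriminant`` if it exists, else insert it."""
    columns = list(row)
    exists = conn.execute(
        'SELECT 1 FROM {} WHERE {} = ?'.format(table, discriminant),
        (row[discriminant],),
    ).fetchone()
    if exists:
        assignments = ', '.join('{} = ?'.format(c) for c in columns)
        conn.execute(
            'UPDATE {} SET {} WHERE {} = ?'.format(table, assignments, discriminant),
            [row[c] for c in columns] + [row[discriminant]],
        )
    else:
        placeholders = ', '.join('?' for _ in columns)
        conn.execute(
            'INSERT INTO {} ({}) VALUES ({})'.format(table, ', '.join(columns), placeholders),
            [row[c] for c in columns],
        )


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE coffeeshops (id TEXT PRIMARY KEY, name TEXT, address TEXT)')

row = {'id': 'cafe-a', 'name': 'Cafe A', 'address': '1 rue X'}
insert_or_update(conn, 'coffeeshops', row)                           # first run: insert
insert_or_update(conn, 'coffeeshops', dict(row, address='2 rue Y'))  # second run: update

count = conn.execute('SELECT COUNT(*) FROM coffeeshops').fetchone()[0]
address = conn.execute('SELECT address FROM coffeeshops').fetchone()[0]
print(count, address)  # → 1 2 rue Y
```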
|
||||
|
||||
Note that as we forked the graph from :doc:`tut02`, the transformation also writes the data to `coffeeshops.json`, as
|
||||
before.
|
||||
|
||||