259 lines
7.5 KiB
ReStructuredText
259 lines
7.5 KiB
ReStructuredText
Part 1: Let's get started!
|
|
==========================
|
|
|
|
To get started with |bonobo|, you need to install it in a working python 3.5+ environment (you should use a
|
|
`virtualenv <https://virtualenv.pypa.io/>`_).
|
|
|
|
.. code-block:: shell-session
|
|
|
|
$ pip install bonobo
|
|
|
|
Check that the installation worked, and that you're using a version that matches this tutorial (written for bonobo
|
|
|longversion|).
|
|
|
|
.. code-block:: shell-session
|
|
|
|
$ bonobo version
|
|
|
|
See :doc:`/install` for more options.
|
|
|
|
|
|
Create an ETL job
|
|
:::::::::::::::::
|
|
|
|
Since Bonobo 0.6, it's easy to bootstrap a simple ETL job using just one file.
|
|
|
|
We'll start here, and the later stages of the tutorial will guide you toward refactoring this to a python package.
|
|
|
|
.. code-block:: shell-session
|
|
|
|
$ bonobo init tutorial.py
|
|
|
|
This will create a simple job in a `tutorial.py` file. Let's run it:
|
|
|
|
.. code-block:: shell-session
|
|
|
|
$ python tutorial.py
|
|
Hello
|
|
World
|
|
- extract in=1 out=2 [done]
|
|
- transform in=2 out=2 [done]
|
|
- load in=2 [done]
|
|
|
|
If you have a similar result, then congratulations! You just ran your first |bonobo| ETL job.
|
|
|
|
|
|
Inspect your graph
|
|
::::::::::::::::::
|
|
|
|
The basic building blocks of |bonobo| are **transformations** and **graphs**.
|
|
|
|
**Transformations** are simple python callables (like functions) that handle a transformation step for a line of data.
|
|
|
|
**Graphs** are a set of transformations, with directional links between them to define the data-flow that will happen
|
|
at runtime.
|
|
|
|
To inspect the graph of your first transformation (you must install graphviz first to do so), run:
|
|
|
|
.. code-block:: shell-session
|
|
|
|
$ bonobo inspect --graph tutorial.py | dot -Tpng -o tutorial.png
|
|
|
|
Open the generated `tutorial.png` file to have a quick look at the graph.
|
|
|
|
.. graphviz::
|
|
|
|
digraph {
|
|
rankdir = LR;
|
|
"BEGIN" [shape="point"];
|
|
"BEGIN" -> {0 [label="extract"]};
|
|
{0 [label="extract"]} -> {1 [label="transform"]};
|
|
{1 [label="transform"]} -> {2 [label="load"]};
|
|
}
|
|
|
|
You can easily understand here the structure of your graph. For such a simple graph, it's pretty much useless, but as
|
|
you'll write more complex transformations, it will be helpful.
|
|
|
|
|
|
Read the Code
|
|
:::::::::::::
|
|
|
|
Before we write our own job, let's look at the code we have in `tutorial.py`.
|
|
|
|
|
|
Import
|
|
------
|
|
|
|
.. code-block:: python
|
|
|
|
import bonobo
|
|
|
|
|
|
The highest level APIs of |bonobo| are all contained within the top level **bonobo** namespace.
|
|
|
|
If you're a beginner with the library, stick to using only those APIs (they also are the most stable APIs).
|
|
|
|
If you're an advanced user (and you'll be one quite soon), you can safely use second level APIs.
|
|
|
|
The third level APIs are considered private, and you should not use them unless you're hacking on |bonobo| directly.
|
|
|
|
|
|
Extract
|
|
-------
|
|
|
|
.. code-block:: python
|
|
|
|
def extract():
|
|
yield 'hello'
|
|
yield 'world'
|
|
|
|
This is a first transformation, written as a python generator, that will send some strings, one after the other, to its
|
|
output.
|
|
|
|
Transformations that take no input and yields a variable number of outputs are usually called **extractors**. You'll
|
|
encounter a few different types, either purely generating the data (like here), using an external service (a
|
|
database, for example) or using some filesystem (which is considered an external service too).
|
|
|
|
Extractors do not need to have its input connected to anything, and will be called exactly once when the graph is
|
|
executed.
|
|
|
|
|
|
Transform
|
|
---------
|
|
|
|
.. code-block:: python
|
|
|
|
def transform(*args):
|
|
yield tuple(
|
|
map(str.title, args)
|
|
)
|
|
|
|
This is a second transformation. It will get called a bunch of times, once for each input row it gets, and apply some
|
|
logic on the input to generate the output.
|
|
|
|
This is the most **generic** case. For each input row, you can generate zero, one or many lines of output for each line
|
|
of input.
|
|
|
|
|
|
Load
|
|
----
|
|
|
|
.. code-block:: python
|
|
|
|
def load(*args):
|
|
print(*args)
|
|
|
|
This is the third and last transformation in our "hello world" example. It will apply some logic to each row, and have
|
|
absolutely no output.
|
|
|
|
Transformations that take input and yields nothing are also called **loaders**. Like extractors, you'll encounter
|
|
different types, to work with various external systems.
|
|
|
|
Please note that as a convenience mean and because the cost is marginal, most builtin `loaders` will send their
|
|
inputs to their output, so you can easily chain more than one loader, or apply more transformations after a given
|
|
loader was applied.
|
|
|
|
|
|
Graph Factory
|
|
-------------
|
|
|
|
.. code-block:: python
|
|
|
|
def get_graph(**options):
|
|
graph = bonobo.Graph()
|
|
graph.add_chain(extract, transform, load)
|
|
return graph
|
|
|
|
All our transformations were defined above, but nothing ties them together, for now.
|
|
|
|
This "graph factory" function is in charge of the creation and configuration of a :class:`bonobo.Graph` instance, that
|
|
will be executed later.
|
|
|
|
By no mean is |bonobo| limited to simple graphs like this one. You can add as many chains as you want, and each chain
|
|
can contain as many nodes as you want.
|
|
|
|
|
|
Services Factory
|
|
----------------
|
|
|
|
.. code-block:: python
|
|
|
|
def get_services(**options):
|
|
return {}
|
|
|
|
This is the "services factory", that we'll use later to connect to external systems. Let's skip this one, for now.
|
|
|
|
(we'll dive into this topic in :doc:`4-services`)
|
|
|
|
|
|
Main Block
|
|
----------
|
|
|
|
.. code-block:: python
|
|
|
|
if __name__ == '__main__':
|
|
parser = bonobo.get_argument_parser()
|
|
with bonobo.parse_args(parser) as options:
|
|
bonobo.run(
|
|
get_graph(**options),
|
|
services=get_services(**options)
|
|
)
|
|
|
|
Here, the real thing happens.
|
|
|
|
Without diving into too much details for now, using the :func:`bonobo.parse_args` context manager will allow our job to
|
|
be configurable, later, and although we don't really need it right now, it does not harm neither.
|
|
|
|
Reading the output
|
|
::::::::::::::::::
|
|
|
|
Let's run this job once again:
|
|
|
|
.. code-block:: shell-session
|
|
|
|
$ python tutorial.py
|
|
Hello
|
|
World
|
|
- extract in=1 out=2 [done]
|
|
- transform in=2 out=2 [done]
|
|
- load in=2 [done]
|
|
|
|
The console output contains two things.
|
|
|
|
* First, it contains the real output of your job (what was :func:`print`-ed to `sys.stdout`).
|
|
* Second, it displays the execution status (on `sys.stderr`). Each line contains a "status" character, the node name,
|
|
numbers and a human readable status. This status will evolve in real time, and allows to understand a job's progress
|
|
while it's running.
|
|
|
|
* Status character:
|
|
|
|
* “ ” means that the node was not yet started.
|
|
* “`-`” means that the node finished its execution.
|
|
* “`+`” means that the node is currently running.
|
|
* “`!`” means that the node had problems running.
|
|
|
|
* Numerical statistics:
|
|
|
|
* “`in=...`” shows the input lines count, also known as the amount of calls to your transformation.
|
|
* “`out=...`” shows the output lines count.
|
|
* “`read=...`” shows the count of reads applied to an external system, if the transformation supports it.
|
|
* “`write=...`” shows the count of writes applied to an external system, if the transformation supports it.
|
|
* “`err=...`” shows the count of exceptions that happened while running the transformation. Note that exception will abort
|
|
a call, but the execution will move to the next row.
|
|
|
|
|
|
Wrap up
|
|
:::::::
|
|
|
|
That's all for this first step.
|
|
|
|
You now know:
|
|
|
|
* How to create a new job file.
|
|
* How to inspect the content of a job file.
|
|
* What should go in a job file.
|
|
* How to execute a job file.
|
|
* How to read the console output.
|
|
|
|
**Next: :doc:`2-jobs`**
|