Working on the new version of the tutorial. Only Step1 implemented.
258
docs/tutorial/1-init.rst
Normal file
@ -0,0 +1,258 @@
Part 1: Let's get started!
==========================

To get started with |bonobo|, you need to install it in a working Python 3.5+ environment (you should use a
`virtualenv <https://virtualenv.pypa.io/>`_).

.. code-block:: shell-session

    $ pip install bonobo

Check that the installation worked, and that you're using a version that matches this tutorial (written for bonobo
|longversion|).

.. code-block:: shell-session

    $ bonobo version

See :doc:`/install` for more options.


Create an ETL job
:::::::::::::::::

Since Bonobo 0.6, it's easy to bootstrap a simple ETL job using just one file.

We'll start here, and the later stages of the tutorial will guide you toward refactoring this into a Python package.

.. code-block:: shell-session

    $ bonobo init tutorial.py

This will create a simple job in a `tutorial.py` file. Let's run it:

.. code-block:: shell-session

    $ python tutorial.py
    Hello
    World
    - extract in=1 out=2 [done]
    - transform in=2 out=2 [done]
    - load in=2 [done]

If you have a similar result, then congratulations! You just ran your first |bonobo| ETL job.


Inspect your graph
::::::::::::::::::

The basic building blocks of |bonobo| are **transformations** and **graphs**.

**Transformations** are simple Python callables (like functions) that handle a transformation step for a line of data.

**Graphs** are a set of transformations, with directional links between them to define the data-flow that will happen
at runtime.

To inspect the graph of your first transformation (you must install graphviz first to do so), run:

.. code-block:: shell-session

    $ bonobo inspect --graph tutorial.py | dot -Tpng -o tutorial.png

Open the generated `tutorial.png` file to have a quick look at the graph.

.. graphviz::

    digraph {
        rankdir = LR;
        "BEGIN" [shape="point"];
        "BEGIN" -> {0 [label="extract"]};
        {0 [label="extract"]} -> {1 [label="transform"]};
        {1 [label="transform"]} -> {2 [label="load"]};
    }

This makes the structure of your graph easy to see. For such a simple graph the picture adds little, but it becomes
helpful as you write more complex transformations.


Read the Code
:::::::::::::

Before we write our own job, let's look at the code we have in `tutorial.py`.


Import
------

.. code-block:: python

    import bonobo


The highest level APIs of |bonobo| are all contained within the top level **bonobo** namespace.

If you're a beginner with the library, stick to using only those APIs (they are also the most stable APIs).

If you're an advanced user (and you'll be one quite soon), you can safely use second level APIs.

The third level APIs are considered private, and you should not use them unless you're hacking on |bonobo| directly.


Extract
-------

.. code-block:: python

    def extract():
        yield 'hello'
        yield 'world'

This is a first transformation, written as a Python generator, that will send some strings, one after the other, to its
output.

Transformations that take no input and yield a variable number of outputs are usually called **extractors**. You'll
encounter a few different types, either purely generating the data (like here), using an external service (a
database, for example) or using some filesystem (which is considered an external service too).

Extractors do not need to have their input connected to anything, and will be called exactly once when the graph is
executed.
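
As a minimal sketch of an extractor that reads from a source instead of generating literals, the generator below
yields one row per CSV record. The ``SAMPLE_CSV`` string and the ``extract_people`` name are hypothetical stand-ins
for a real file or database. The example is plain Python and does not use any |bonobo| API.

.. code-block:: python

    import csv
    import io

    # Hypothetical in-memory stand-in for a file or a database result set.
    SAMPLE_CSV = "name,city\nada,london\nlinus,helsinki\n"


    def extract_people():
        """Take no input and yield one (name, city) row per CSV record."""
        reader = csv.reader(io.StringIO(SAMPLE_CSV))
        next(reader)  # skip the header row
        for name, city in reader:
            yield name, city


    if __name__ == '__main__':
        for row in extract_people():
            print(row)

Like ``extract`` above, this callable takes no input and yields a variable number of outputs.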


Transform
---------

.. code-block:: python

    def transform(*args):
        yield tuple(
            map(str.title, args)
        )

This is a second transformation. It is called once for each input row it receives, and applies some logic to the input
to generate the output.

This is the most **generic** case. For each input row, you can generate zero, one or many rows of output.
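
To illustrate the "zero, one or many" point, here is a small sketch of a generator whose number of outputs depends
on its input. The ``split_words`` name is hypothetical and not part of the generated `tutorial.py`.

.. code-block:: python

    def split_words(*args):
        """Yield one output row per whitespace-separated word in the input row."""
        for value in args:
            for word in value.split():
                yield word


    if __name__ == '__main__':
        print(list(split_words('')))             # zero output rows
        print(list(split_words('hello')))        # one output row
        print(list(split_words('hello world')))  # many output rows

Calling the generator directly, as done here, is a convenient way to try a transformation before wiring it into a
graph.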


Load
----

.. code-block:: python

    def load(*args):
        print(*args)

This is the third and last transformation in our "hello world" example. It applies some logic to each row and
produces no output.

Transformations that take input and yield nothing are also called **loaders**. Like extractors, you'll encounter
different types, to work with various external systems.

Please note that, as a convenience and because the cost is marginal, most built-in `loaders` will send their
inputs to their output, so you can easily chain more than one loader, or apply more transformations after a given
loader.
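
The pass-through idea can be sketched in plain Python as follows. This is only an illustration of the concept: the
``load_and_forward`` name is hypothetical, and the way the built-in loaders actually forward their input is not
shown in this part of the tutorial.

.. code-block:: python

    def load_and_forward(*args):
        """Print the row, then yield it unchanged so another step could follow."""
        print(*args)
        yield args


    if __name__ == '__main__':
        # The row is printed as a side effect and is still available afterwards.
        print(list(load_and_forward('Hello')))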


Graph Factory
-------------

.. code-block:: python

    def get_graph(**options):
        graph = bonobo.Graph()
        graph.add_chain(extract, transform, load)
        return graph

All our transformations were defined above, but nothing ties them together, for now.

This "graph factory" function is in charge of the creation and configuration of a :class:`bonobo.Graph` instance, that
will be executed later.

By no means is |bonobo| limited to simple graphs like this one. You can add as many chains as you want, and each chain
can contain as many nodes as you want.
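
To build an intuition for what a chain means, the sketch below pushes rows through a list of callables in plain
Python. It is a deliberately naive, sequential stand-in: the ``run_chain`` helper is hypothetical, it is not how
|bonobo| executes a graph, and it makes no claim about how nodes are scheduled.

.. code-block:: python

    def extract():
        yield 'hello'
        yield 'world'


    def transform(*args):
        yield tuple(map(str.title, args))


    def as_row(value):
        """Wrap a bare value in a tuple so every row can be unpacked with *."""
        return value if isinstance(value, tuple) else (value,)


    def run_chain(first, *rest):
        """Feed every output row of a node to the next node, one node at a time."""
        rows = [as_row(value) for value in first()]
        for node in rest:
            next_rows = []
            for row in rows:
                # A node that returns None (like a loader) produces no rows.
                for out in node(*row) or ():
                    next_rows.append(as_row(out))
            rows = next_rows
        return rows


    if __name__ == '__main__':
        print(run_chain(extract, transform))

The rows leaving ``transform`` are the title-cased strings that ``load`` prints in the tutorial job.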


Services Factory
----------------

.. code-block:: python

    def get_services(**options):
        return {}

This is the "services factory", that we'll use later to connect to external systems. Let's skip this one, for now.

(We'll dive into this topic in :doc:`4-services`.)


Main Block
----------

.. code-block:: python

    if __name__ == '__main__':
        parser = bonobo.get_argument_parser()
        with bonobo.parse_args(parser) as options:
            bonobo.run(
                get_graph(**options),
                services=get_services(**options)
            )

This is where the real work happens.

Without diving into too much detail for now, using the :func:`bonobo.parse_args` context manager will allow our job to
be configurable later, and although we don't really need it right now, it does no harm either.
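
As a rough illustration of how parsed options can reach a factory through ``**options``, the sketch below uses the
standard library's ``argparse`` directly instead of the |bonobo| helpers. The ``--limit`` option, the
``parse_options`` helper and the dictionary returned by this stand-in ``get_graph`` are all hypothetical.

.. code-block:: python

    import argparse


    def get_graph(**options):
        """Stand-in factory: report the option it received instead of building a graph."""
        return {'limit': options.get('limit')}


    def parse_options(argv):
        """Parse command line arguments into a plain dict of options."""
        parser = argparse.ArgumentParser()
        parser.add_argument('--limit', type=int, default=None)  # hypothetical option
        return vars(parser.parse_args(argv))


    if __name__ == '__main__':
        options = parse_options(['--limit', '10'])
        print(get_graph(**options))

The factories in `tutorial.py` accept ``**options`` in the same way, so they can ignore any option they do not need.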

Reading the output
::::::::::::::::::

Let's run this job once again:

.. code-block:: shell-session

    $ python tutorial.py
    Hello
    World
    - extract in=1 out=2 [done]
    - transform in=2 out=2 [done]
    - load in=2 [done]

The console output contains two things.

* First, it contains the real output of your job (what was :func:`print`-ed to `sys.stdout`).
* Second, it displays the execution status (on `sys.stderr`). Each line contains a "status" character, the node name,
  numbers and a human readable status. This status evolves in real time, and lets you follow a job's progress
  while it's running.

  * Status character:

    * “ ” means that the node was not yet started.
    * “`-`” means that the node finished its execution.
    * “`+`” means that the node is currently running.
    * “`!`” means that the node had problems running.

  * Numerical statistics:

    * “`in=...`” shows the input lines count, also known as the number of calls to your transformation.
    * “`out=...`” shows the output lines count.
    * “`read=...`” shows the count of reads applied to an external system, if the transformation supports it.
    * “`write=...`” shows the count of writes applied to an external system, if the transformation supports it.
    * “`err=...`” shows the count of exceptions that happened while running the transformation. Note that an exception
      aborts the current call, but execution moves on to the next row.
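
If you want to check these numbers from a script, a status line can be split into its parts with a few lines of
Python. This is a sketch that assumes lines shaped like the sample output above; the exact format is not specified
here, and ``parse_status_line`` is a hypothetical helper, not a |bonobo| API.

.. code-block:: python

    import re

    # Matches counters such as "in=2" or "err=0".
    COUNTER = re.compile(r'(\w+)=(\d+)')


    def parse_status_line(line):
        """Split one status line into its flag, node name, counters and status text."""
        text = line.strip()
        flag = ' '  # no leading marker: the node was not started yet
        if text[:1] in ('-', '+', '!'):
            flag, text = text[0], text[1:].strip()
        name = text.split()[0]
        counters = {key: int(value) for key, value in COUNTER.findall(text)}
        status = ''
        if '[' in text and ']' in text:
            status = text[text.index('[') + 1:text.rindex(']')]
        return {'flag': flag, 'name': name, 'counters': counters, 'status': status}


    if __name__ == '__main__':
        print(parse_status_line('- transform in=2 out=2 [done]'))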


Moving forward
::::::::::::::

That's all for this first step.

You now know:

* How to create a new job file.
* How to inspect the content of a job file.
* What should go in a job file.
* How to execute a job file.
* How to read the console output.

**Next:** :doc:`2-jobs`