Working on 0.6 documentation.

This commit is contained in:
Romain Dorgueil
2018-01-10 06:18:41 +01:00
parent 0a9a27ae08
commit c1ffbe7b5f
10 changed files with 265 additions and 60 deletions

11
docs/guide/_toc.rst Normal file
View File

@ -0,0 +1,11 @@
.. toctree::
    :maxdepth: 2

    introduction
    transformations
    graphs
    services
    environment
    purity
    debugging
    plugins

0
docs/guide/debugging.rst Normal file
View File

@ -5,6 +5,47 @@ Graphs are the glue that ties transformations together. They are the only data-s
must be acyclic, and can contain as many nodes as your system can handle. In theory the number of nodes can be rather high, but practical use cases usually do not exceed a few hundred nodes, and only in extreme cases.
Each node of a graph will be executed in isolation from the other nodes, and the data is passed from one node to the
next using FIFO queues, managed by the framework. It's transparent to the end-user, though, and you'll only use
function arguments (for inputs) and return/yield values (for outputs).
Each input row of a node will cause one call to that node's callable. Each output is cast internally to a tuple-like
data structure (more precisely, a namedtuple-like data structure), and for a given node, each output row must
have the same structure.
If you return/yield something that is not a tuple, |bonobo| will wrap it in a tuple of one element.
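Conceptually, the wrapping behaves like this (a plain-Python sketch, not |bonobo|'s actual internals):

```python
def ensure_tuple(value):
    # Conceptual sketch of output normalization: anything that is not
    # already a tuple gets wrapped in a tuple of one element.
    if isinstance(value, tuple):
        return value
    return (value,)

print(ensure_tuple('hello'))  # ('hello',)
print(ensure_tuple((1, 2)))   # (1, 2)
```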
Properties
----------
|bonobo| assists you with defining the data-flow of your data engineering process, and then streams data through your
callable graphs.
* Each node call processes one row of data.
* The queues that flow data between nodes are standard Python first-in, first-out (FIFO) :class:`queue.Queue` instances.
* Each node runs in parallel with the others.
* The default execution strategy uses threading, and each node runs in a separate thread.
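These properties can be pictured with a minimal sketch, using only the standard library (this is an illustration of the model, not |bonobo|'s actual execution code): two "nodes" connected by a FIFO queue, each running in its own thread.

```python
import queue
import threading

def extract(out_q):
    # upstream node: produce rows into the FIFO queue
    for row in [('alice',), ('bob',)]:
        out_q.put(row)
    out_q.put(None)  # sentinel: end of stream

def load(in_q, results):
    # downstream node: consume rows until the sentinel arrives
    while (row := in_q.get()) is not None:
        results.append(row)

q = queue.Queue()
results = []
threads = [
    threading.Thread(target=extract, args=(q,)),
    threading.Thread(target=load, args=(q, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [('alice',), ('bob',)]
```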
Fault tolerance
---------------
Node execution is fault tolerant.
If an exception is raised from a node call, that call is aborted, but |bonobo| continues the execution
with the next row (after outputting the stack trace and incrementing the "err" counter for the node context).
This allows ETL jobs to ignore faulty data and do their best to process the valid rows of a dataset.
Some errors are fatal, though.
If you pass a 2-element tuple to a node that takes 3 arguments, |bonobo| will raise an :class:`bonobo.errors.UnrecoverableTypeError` and exit the
current graph execution as fast as it can (finishing the node executions already in progress, but not
starting new ones even if input rows remain).
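The fault-tolerant behavior can be pictured with a simplified loop (a sketch to illustrate the idea, not |bonobo|'s real execution code):

```python
def run_node(node, rows):
    # Simplified picture of fault-tolerant node execution: a failing call
    # aborts only that row, increments the "err" counter, and the loop
    # moves on to the next row.
    outputs, err = [], 0
    for row in rows:
        try:
            outputs.append(node(*row))
        except Exception:
            err += 1  # the real framework also prints the stack trace
    return outputs, err

def upper(name):
    return name.upper()

print(run_node(upper, [('alice',), (42,), ('bob',)]))
# (['ALICE', 'BOB'], 1)
```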
Definitions
:::::::::::

View File

@ -3,13 +3,8 @@ Guides
This section will guide you through your journey with Bonobo ETL.
.. toctree::
    :maxdepth: 2

    introduction
    transformations
    graphs
    services
    environment
    purity
.. include:: _toc.rst

View File

@ -1,61 +1,138 @@
Part 2: Writing ETL Jobs
========================
.. include:: _wip_note.rst
In |bonobo|, an ETL job is a graph with some logic to execute it, like the file we created in the previous section.
What's an ETL job?
::::::::::::::::::
In |bonobo|, an ETL job is a formal definition of an executable graph.
Each node of a graph will be executed in isolation from the other nodes, and the data is passed from one node to the
next using FIFO queues, managed by the framework. It's transparent to the end-user, though, and you'll only use
function arguments (for inputs) and return/yield values (for outputs).
Each input row of a node will cause one call to that node's callable. Each output is cast internally to a tuple-like
data structure (more precisely, a namedtuple-like data structure), and for a given node, each output row must
have the same structure.
If you return/yield something that is not a tuple, |bonobo| will wrap it in a tuple of one element.
Properties
----------
|bonobo| assists you with defining the data-flow of your data engineering process, and then streams data through your
callable graphs.
* Each node call processes one row of data.
* The queues that flow data between nodes are standard Python first-in, first-out (FIFO) :class:`queue.Queue` instances.
* Each node runs in parallel with the others.
* The default execution strategy uses threading, and each node runs in a separate thread.
Fault tolerance
---------------
Node execution is fault tolerant.
If an exception is raised from a node call, that call is aborted, but |bonobo| continues the execution
with the next row (after outputting the stack trace and incrementing the "err" counter for the node context).
This allows ETL jobs to ignore faulty data and do their best to process the valid rows of a dataset.
Some errors are fatal, though.
If you pass a 2-element tuple to a node that takes 3 arguments, |bonobo| will raise an :class:`bonobo.errors.UnrecoverableTypeError` and exit the
current graph execution as fast as it can (finishing the node executions already in progress, but not
starting new ones even if input rows remain).
You can learn more about the :class:`bonobo.Graph` data-structure and its properties in the
:doc:`graphs guide </guide/graphs>`.
Let's write a sample data integration job
:::::::::::::::::::::::::::::::::::::::::
Scenario
::::::::
Let's create a sample application, whose goal will be to integrate some data into various systems.
We'll use an open-data dataset containing all the fablabs in the world. We'll extract this data using an open-data
API, normalize it using a few different rules and, for now, display it on the console. In the next steps, we'll
build on this foundation to write to other targets, like files and databases.
Setup
:::::
We'll change the `tutorial.py` file created in the last step to handle this new scenario.
First, let's remove all boilerplate code, so it looks like this:
.. code-block:: python

    import bonobo

    def get_graph(**options):
        graph = bonobo.Graph()
        return graph

    def get_services(**options):
        return {}

    if __name__ == '__main__':
        parser = bonobo.get_argument_parser()
        with bonobo.parse_args(parser) as options:
            bonobo.run(get_graph(**options), services=get_services(**options))
Your job now contains the logic for executing an empty graph, and we'll complete this with our application logic.
Reading the source data
:::::::::::::::::::::::
Let's add a simple chain to our `get_graph(...)` function, so that it reads from the fablabs open-data API.
The source dataset we'll use can be found on `this site <https://public-us.opendatasoft.com/explore/dataset/fablabs/>`_.
It's licensed under `Public Domain`, which makes it just perfect for our example.
.. note::

    There is a :mod:`bonobo.contrib.opendatasoft` module that makes reading from OpenDataSoft APIs easier, including
    pagination and limits, but for our tutorial, we'll avoid that and build it manually.
Let's write our extractor:
.. code-block:: python

    import requests

    FABLABS_API_URL = 'https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=1000'

    def extract_fablabs():
        yield from requests.get(FABLABS_API_URL).json().get('records')
This extractor will be called once; it queries the API URL, parses the response as JSON, and yields the items from the
"records" list, one by one.
.. note::

    You'll probably want to make this a bit more robust in a real application, to handle the various errors that can
    happen here. What if the server is down? What if it returns a response that is not JSON? What if the data is not
    in the expected format?

    For simplicity's sake, we'll ignore that here, but those are the kinds of questions you should keep in mind when
    writing pipelines.
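As an illustration of that kind of defensiveness (`extract_records` is a hypothetical helper, not part of the tutorial code), the shape of the decoded payload could be validated before yielding anything:

```python
def extract_records(payload):
    # Hypothetical defensive variant: validate the decoded JSON payload
    # instead of assuming it has the expected shape.
    if not isinstance(payload, dict):
        raise ValueError('expected a JSON object, got %r' % type(payload).__name__)
    records = payload.get('records')
    if not isinstance(records, list):
        raise ValueError('missing or invalid "records" list in API response')
    yield from records

print(list(extract_records({'records': [{'name': 'fablab'}]})))
# [{'name': 'fablab'}]
```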
To test our pipeline, let's use a :class:`bonobo.Limit` and a :class:`bonobo.PrettyPrinter`, and change our
`get_graph(...)` function accordingly:
.. code-block:: python

    import bonobo

    def get_graph(**options):
        graph = bonobo.Graph()
        graph.add_chain(
            extract_fablabs,
            bonobo.Limit(10),
            bonobo.PrettyPrinter(),
        )
        return graph
Running this job should output a bit of data, along with some statistics.
First, let's look at the statistics:
.. code-block:: shell-session

    - extract_fablabs in=1 out=995 [done]
    - Limit in=995 out=10 [done]
    - PrettyPrinter in=10 out=10 [done]
It is important to understand that we extracted everything (995 rows) before dropping 99% of the dataset.
This is OK for debugging, but not efficient.
.. note::

    You should always try to limit the amount of data as early as possible, which often means not generating data
    you won't need in the first place. Here, we could have used the `rows=` query parameter in the API URL to avoid
    requesting data we would drop anyway.
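For instance, the query string could be built programmatically so the API returns only what we need (a sketch; `rows` is the same parameter already present in `FABLABS_API_URL` above):

```python
from urllib.parse import urlencode

API_BASE = 'https://public-us.opendatasoft.com/api/records/1.0/search/'

def build_url(dataset, rows):
    # Ask the API for only `rows` records instead of fetching everything
    # and dropping most of it afterwards.
    return API_BASE + '?' + urlencode({'dataset': dataset, 'rows': rows})

print(build_url('fablabs', 10))
# https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=10
```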
Normalize
:::::::::
.. include:: _todo.rst
Output
::::::
We used :class:`bonobo.PrettyPrinter` to output the data.
It's a flexible built-in transformation that helps you display the content of a stream, and you'll probably use it a
lot, for various reasons.
Moving forward
@ -63,6 +140,10 @@ Moving forward
You now know:
* How to ...
* How to use a reader node.
* How to use the console output.
* How to limit the number of elements in a stream.
* How to pass data from one node to another.
* How to structure a graph using chains.
**Next: :doc:`3-files`**
It's now time to jump to :doc:`3-files`.

View File

@ -3,6 +3,49 @@ Part 3: Working with Files
.. include:: _wip_note.rst
Writing to the console is nice, but using files is probably more realistic.
Let's see how to use a few builtin writers and both local and remote filesystems.
Filesystems
:::::::::::
In |bonobo|, files are accessed through a **filesystem** service, which must be an object with the same interface as
`fs's FileSystem objects <https://docs.pyfilesystem.org/en/latest/builtin.html>`_. By default, you'll get an instance
of a local filesystem mapped to the current working directory as the `fs` service. You'll learn more about services in
the next step, but for now, let's just use it.
Writing using the service
:::::::::::::::::::::::::
Although |bonobo| contains helpers to write to common file formats, let's start by writing it manually.
.. code-block:: python

    from bonobo.config import use
    from bonobo.constants import NOT_MODIFIED

    @use('fs')
    def write_repr_to_file(*row, fs):
        with fs.open('output.txt', 'a+') as f:
            print(row, file=f)
        return NOT_MODIFIED
Then, update the `get_graph(...)` function, by adding `write_repr_to_file` just before your `PrettyPrinter()` node.
Let's try to run that and think about what happens.
Each time a row comes to this node, the output file is opened in "append or create" mode, a line is written, and the file
is closed.
This is **NOT** how you want to do things. Let's rewrite it so our `open(...)` call becomes execution-wide.
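The idea, sketched in plain Python (this is a conceptual illustration, not |bonobo|'s actual API; the framework expresses the setup/teardown pair with context processors), is to hold one file handle for the whole execution:

```python
class ReprWriter:
    # Conceptual sketch of an execution-wide writer: open the file once
    # when the execution starts, reuse the handle for every row, and
    # close it once when the execution ends.
    def __init__(self, path):
        self.path = path
        self.file = None

    def start(self):
        self.file = open(self.path, 'a+')

    def __call__(self, *row):
        print(row, file=self.file)
        return row

    def stop(self):
        self.file.close()
```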
* Filesystems
* Reading files
@ -21,4 +64,4 @@ You now know:
* How to ...
**Next: :doc:`4-services`**
It's now time to jump to :doc:`4-services`.

View File

@ -205,4 +205,4 @@ You now know:
* How to ...
**Next: :doc:`5-packaging`**
It's now time to jump to :doc:`5-packaging`.

View File

@ -15,7 +15,6 @@ kind of project structure, as the target structure will be dictated by the hosti
sub-package would perfectly fit a django or flask project, or even a regular package, but it's up to you to choose the
structure of your project.
about using |bonobo| in a pyt
is about set of jobs working together within a project.
Let's see how to move from the current status to a package.
@ -28,3 +27,19 @@ You now know:
* How to ...
That's the end of the tutorial, you should now be familiar with all the basics.
A few appendices to the tutorial explain how to integrate with other systems (we'll use the "fablabs" application
created in this tutorial and extend it):
* :doc:`notebooks`
* :doc:`sqlalchemy`
* :doc:`django`
* :doc:`docker`
Then, you can either jump head-first into your code, or get a better grasp of all the concepts by
:doc:`reading the full bonobo guide </guide/index>`.
Happy data flows!

3
docs/tutorial/_todo.rst Normal file
View File

@ -0,0 +1,3 @@
.. warning::

    This section is missing. Sorry, but stay tuned! It'll be added soon.

16
docs/tutorial/docker.rst Normal file
View File

@ -0,0 +1,16 @@
Working with Docker
===================
.. warning::

    This section does not exist yet, but it is planned to be written quite soon.

    Meanwhile, you can check the source code and other links provided below.
Source code
:::::::::::
https://github.com/python-bonobo/bonobo-docker