Working on 0.6 documentation.
docs/guide/_toc.rst

@@ -0,0 +1,11 @@

.. toctree::
    :maxdepth: 2

    introduction
    transformations
    graphs
    services
    environment
    purity
    debugging
    plugins
docs/guide/debugging.rst

@@ -5,6 +5,47 @@ Graphs are the glue that ties transformations together. They are the only data-structure

must be acyclic, and can contain as many nodes as your system can handle. In theory the number of nodes can be rather
high, but practical use cases rarely exceed a few hundred nodes, and only in extreme cases.

Each node of a graph is executed in isolation from the other nodes, and data is passed from one node to the next using
FIFO queues managed by the framework. This is transparent to the end user, though: you'll only use function arguments
(for inputs) and return/yield values (for outputs).

Each input row of a node causes one call to this node's callable. Each output is cast internally to a tuple-like data
structure (more precisely, a namedtuple-like data structure), and for one given node, each output row must have the
same structure.

If you return/yield something which is not a tuple, bonobo will wrap it in a tuple of one element.
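
The coercion described above can be illustrated in plain Python. This is a conceptual sketch only, not bonobo's actual
internal code; `coerce_to_tuple` and the `Row` type are hypothetical names used for the example:

.. code-block:: python

    from collections import namedtuple

    def coerce_to_tuple(value):
        """Mimic the behaviour described above: non-tuple outputs become 1-tuples."""
        if isinstance(value, tuple):
            return value
        return (value,)

    # A namedtuple-like row structure; every output row of a given node shares it.
    Row = namedtuple('Row', ['name', 'country'])

    print(coerce_to_tuple('hello'))             # a bare value becomes a one-element tuple
    print(coerce_to_tuple(Row('mylab', 'FR')))  # tuples (and namedtuples) pass through unchanged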

Properties
----------

|bonobo| assists you with defining the data flow of your data engineering process, then streams data through your
callable graphs.

* Each node call processes one row of data.
* The queues that flow data between nodes are standard Python first-in, first-out (FIFO) :class:`queue.Queue` instances.
* Each node runs in parallel.
* The default execution strategy uses threading, and each node runs in a separate thread.
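
The FIFO behaviour of those queues can be demonstrated with the standard library alone. This is a simplified sketch of
the mechanism the framework manages for you, not bonobo's real implementation (the sentinel-based shutdown is an
assumption made for the example):

.. code-block:: python

    import queue
    import threading

    q = queue.Queue()

    def producer():
        # An upstream node pushes its output rows, one by one.
        for row in [('a',), ('b',), ('c',)]:
            q.put(row)
        q.put(None)  # sentinel: no more data

    def consumer(results):
        # The downstream node receives rows in the exact order they were sent.
        while True:
            row = q.get()
            if row is None:
                break
            results.append(row)

    results = []
    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer, args=(results,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(results)  # [('a',), ('b',), ('c',)]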

Fault tolerance
---------------

Node execution is fault tolerant.

If an exception is raised from a node call, this call is aborted but bonobo will continue the execution with the next
row (after outputting the stack trace and incrementing the "err" counter on the node context).

This allows ETL jobs to ignore faulty data and do their best to process the valid rows of a dataset.

Some errors are fatal, though.

If you pass a two-element tuple to a node that takes three arguments, |bonobo| will raise a
:class:`bonobo.errors.UnrecoverableTypeError` and exit the current graph execution as fast as it can (finishing the
node executions that are in progress, but not starting new ones even if there are remaining input rows).
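
The per-row, skip-and-continue behaviour can be sketched in plain Python. This is a conceptual illustration, not
bonobo's actual execution loop; `run_node` and the sample rows are hypothetical:

.. code-block:: python

    import traceback

    def run_node(node, rows):
        """Call a node once per input row, skipping rows that raise."""
        outputs, err_count = [], 0
        for row in rows:
            try:
                outputs.append(node(*row))
            except Exception:
                # The faulty row is dropped: the stack trace is printed, the
                # "err" counter is incremented, and execution continues.
                traceback.print_exc()
                err_count += 1
        return outputs, err_count

    upper = lambda name: name.upper()
    outputs, errors = run_node(upper, [('alice',), (None,), ('bob',)])
    print(outputs, errors)  # ['ALICE', 'BOB'] 1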

Definitions
:::::::::::

@@ -3,13 +3,8 @@ Guides

This section will guide you through your journey with Bonobo ETL.

.. toctree::
    :maxdepth: 2

    introduction
    transformations
    graphs
    services
    environment
    purity

.. include:: _toc.rst

@@ -1,61 +1,138 @@

Part 2: Writing ETL Jobs
========================

.. include:: _wip_note.rst

In |bonobo|, an ETL job is a graph with some logic to execute it, like the file we created in the previous section.

What's an ETL job?
::::::::::::::::::

In |bonobo|, an ETL job is a formal definition of an executable graph.

You can learn more about the :class:`bonobo.Graph` data structure and its properties in the
:doc:`graphs guide </guide/graphs>`.


Let's write a sample data integration job
:::::::::::::::::::::::::::::::::::::::::

Let's create a sample application, whose goal will be to integrate some data in various systems.

The goal of this application will be to extract all the fablabs in the world using an open-data API, normalize this
data and, for now, display it. We'll then build on this foundation in the next steps to write to files, databases, etc.

Setup
:::::

We'll change the `tutorial.py` file created in the last step to handle this new scenario.

First, let's remove all the boilerplate code, so it looks like this:

.. code-block:: python

    import bonobo


    def get_graph(**options):
        graph = bonobo.Graph()
        return graph


    def get_services(**options):
        return {}


    if __name__ == '__main__':
        parser = bonobo.get_argument_parser()
        with bonobo.parse_args(parser) as options:
            bonobo.run(get_graph(**options), services=get_services(**options))

Your job now contains the logic for executing an empty graph, and we'll complete this with our application logic.

Reading the source data
:::::::::::::::::::::::

Let's add a simple chain to our `get_graph(...)` function, so that it reads from the fablabs open-data API.

The source dataset we'll use can be found on `this site <https://public-us.opendatasoft.com/explore/dataset/fablabs/>`_.
It's licensed under `Public Domain`, which makes it just perfect for our example.

.. note::

    There is a :mod:`bonobo.contrib.opendatasoft` module that makes reading from OpenDataSoft APIs easier, including
    pagination and limits, but for our tutorial, we'll avoid that and build it manually.

Let's write our extractor:

.. code-block:: python

    import requests

    FABLABS_API_URL = 'https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=1000'

    def extract_fablabs():
        yield from requests.get(FABLABS_API_URL).json().get('records')

This extractor will get called once, query the API URL, parse the response as JSON, and yield the items from the
"records" list, one by one.
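
If `yield from` is new to you, here is a tiny self-contained illustration of how a generator yields list items one by
one (hypothetical data standing in for the parsed JSON payload; no network involved):

.. code-block:: python

    def extract_numbers():
        payload = {'records': [1, 2, 3]}  # stands in for requests.get(...).json()
        yield from payload.get('records')

    gen = extract_numbers()
    print(next(gen))   # 1 — the first item only; the generator is lazy
    print(list(gen))   # [2, 3] — the remaining items, in order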

.. note::

    You'll probably want to make this a bit more verbose in a real application, to handle all kinds of errors that can
    happen here. What if the server is down? What if it returns a response which is not JSON? What if the data is not
    in the expected format?

    For simplicity's sake, we'll ignore that here, but those are the kinds of questions you should keep in mind when
    writing pipelines.

To test our pipeline, let's use a :class:`bonobo.Limit` and a :class:`bonobo.PrettyPrinter`, and change our
`get_graph(...)` function accordingly:

.. code-block:: python

    import bonobo

    def get_graph(**options):
        graph = bonobo.Graph()
        graph.add_chain(
            extract_fablabs,
            bonobo.Limit(10),
            bonobo.PrettyPrinter(),
        )
        return graph

Running this job should output a bit of data, along with some statistics.

First, let's look at the statistics:

.. code-block:: shell-session

    - extract_fablabs in=1 out=995 [done]
    - Limit in=995 out=10 [done]
    - PrettyPrinter in=10 out=10 [done]

It is important to understand that we extracted everything (995 rows) before dropping 99% of the dataset.

This is OK for debugging, but not efficient.

.. note::

    You should always try to limit the amount of data as early as possible, which often means not generating the data
    you won't need in the first place. Here, we could have used the `rows=` query parameter in the API URL to avoid
    requesting the data we would drop anyway.
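
For instance, here is a sketch of how the URL could be built with the limit pushed into the query string. The
`build_url` helper is hypothetical; the base URL and parameter names follow the API URL used above:

.. code-block:: python

    from urllib.parse import urlencode

    API_BASE = 'https://public-us.opendatasoft.com/api/records/1.0/search/'

    def build_url(dataset, rows):
        # Push the limit into the query string, so the server never sends
        # the rows we would drop anyway.
        return API_BASE + '?' + urlencode({'dataset': dataset, 'rows': rows})

    print(build_url('fablabs', 10))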

Normalize
:::::::::

.. include:: _todo.rst

Output
::::::

We used :class:`bonobo.PrettyPrinter` to output the data.

It's a flexible transformation that helps you display the content of a stream, and you'll probably use it a lot, for
various reasons.

@@ -63,6 +140,10 @@ Moving forward

You now know:

* How to use a reader node.
* How to use the console output.
* How to limit the number of elements in a stream.
* How to pass data from one node to another.
* How to structure a graph using chains.

It's now time to jump to :doc:`3-files`.

@@ -3,6 +3,49 @@ Part 3: Working with Files
|
||||
|
||||
.. include:: _wip_note.rst

Writing to the console is nice, but using files is probably more realistic.

Let's see how to use a few builtin writers, on both local and remote filesystems.

Filesystems
:::::::::::

In |bonobo|, files are accessed through a **filesystem** service, which must be something with the same interface as
`fs' FileSystem objects <https://docs.pyfilesystem.org/en/latest/builtin.html>`_. By default, you'll get an instance
of a local filesystem mapped to the current working directory as the `fs` service. You'll learn more about services in
the next step; for now, let's just use it.

Writing using the service
:::::::::::::::::::::::::

Although |bonobo| contains helpers to write to common file formats, let's start by writing manually.

.. code-block:: python

    from bonobo.config import use
    from bonobo.constants import NOT_MODIFIED

    @use('fs')
    def write_repr_to_file(*row, fs):
        with fs.open('output.txt', 'a+') as f:
            print(row, file=f)
        return NOT_MODIFIED

Then, update the `get_graph(...)` function by adding `write_repr_to_file` just before your `PrettyPrinter()` node.

Let's try to run that and think about what happens.

Each time a row comes to this node, the output file is opened in "append or create" mode, a line is written, and the
file is closed.

This is **NOT** how you want to do things. Let's rewrite it so our `open(...)` call becomes execution-wide.
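
The idea can be illustrated in plain Python with a generator that opens the file once, writes every row, then closes
it. This is a conceptual sketch only, not bonobo's API (bonobo has its own context-processor mechanism for this, which
is not shown here); `write_rows` is a hypothetical name:

.. code-block:: python

    import os
    import tempfile

    def write_rows(rows, path):
        # Open once for the whole run, instead of once per row.
        with open(path, 'a+') as f:
            for row in rows:
                print(row, file=f)  # each row still produces one output line
                yield row           # pass the row through, unmodified

    path = os.path.join(tempfile.mkdtemp(), 'output.txt')
    passed_through = list(write_rows([('a',), ('b',)], path))
    with open(path) as f:
        print(f.read())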

* Filesystems
* Reading files

@@ -21,4 +64,4 @@ You now know:

* How to ...

It's now time to jump to :doc:`4-services`.

@@ -205,4 +205,4 @@ You now know:

* How to ...

It's now time to jump to :doc:`5-packaging`.

@@ -15,7 +15,6 @@ kind of project structure, as the target structure will be dictated by the hosting
sub-package would perfectly fit a django or flask project, or even a regular package, but it's up to you to choose the
structure of your project.

Let's see how to move from the current status to a package.
@@ -28,3 +27,19 @@ You now know:

* How to ...

That's the end of the tutorial; you should now be familiar with all the basics.

A few appendixes to the tutorial explain how to integrate with other systems (we'll use the "fablabs" application
created in this tutorial and extend it):

* :doc:`notebooks`
* :doc:`sqlalchemy`
* :doc:`django`
* :doc:`docker`

Then, you can either jump head-first into your code, or get a better grasp of all the concepts by
:doc:`reading the full bonobo guide </guide/index>`.

Happy data flows!

docs/tutorial/_todo.rst

@@ -0,0 +1,3 @@

.. warning::

    This section is missing. Sorry, but stay tuned! It'll be added soon.

docs/tutorial/docker.rst

@@ -0,0 +1,16 @@

Working with Docker
===================

.. warning::

    This section does not exist yet, but we plan to write it quite soon.

    Meanwhile, you can check the source code and other links provided below.

Source code
:::::::::::

https://github.com/python-bonobo/bonobo-docker
