Working on 0.6 documentation.
Part 2: Writing ETL Jobs
========================

.. include:: _wip_note.rst

In |bonobo|, an ETL job is a graph with some logic to execute it, like the file we created in the previous section.

What's an ETL job?
::::::::::::::::::

In |bonobo|, an ETL job is a formal definition of an executable graph.

Each node of a graph is executed in isolation from the other nodes, and data is passed from one node to the
next using FIFO queues, managed by the framework. This is transparent to the end user, though: you'll only use
function arguments (for inputs) and return/yield values (for outputs).

Each input row of a node causes one call to that node's callable. Each output is cast internally to a tuple-like
data structure (or more precisely, a namedtuple-like data structure), and for a given node, each output row must
have the same structure.

If you return/yield something which is not a tuple, |bonobo| will create a tuple of one element.
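
For example, here is a hypothetical transformation node (the name and fields are ours, purely for illustration):
the input row's fields arrive as positional arguments, and each `yield` emits one output row.

.. code-block:: python

    def split_name(fullname):
        # Called once per input row; "fullname" is the row's single field.
        first, _, last = fullname.partition(' ')
        # Each yield emits one output row, cast to a tuple-like structure;
        # yielding a bare string instead would become a one-element tuple.
        yield first, last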

Properties
----------

|bonobo| assists you in defining the data flow of your data engineering process, and then streams data through
your callable graphs.

* Each node call processes one row of data.
* The queues that flow data between nodes are standard first-in, first-out (FIFO) Python :class:`queue.Queue` instances.
* Each node runs in parallel with the others.
* The default execution strategy uses threading, and each node runs in a separate thread (see the sketch below).
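
As an illustration of the last point, a minimal sketch, assuming `bonobo.run(...)` accepts a `strategy` name and
that the usual strategy names are available in your version (verify against the API reference):

.. code-block:: python

    import bonobo

    graph = bonobo.Graph()  # an empty graph, just to have something to run

    # "naive" runs every node sequentially in the current thread, which can
    # make debugging easier than the default threaded strategy.
    bonobo.run(graph, strategy='naive')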

Fault tolerance
---------------

Node execution is fault tolerant.

If an exception is raised by a node call, that call is aborted, but |bonobo| will continue the execution with the
next row (after outputting the stack trace and incrementing the "err" counter for the node's context).

This allows ETL jobs to ignore faulty data and try their best to process the valid rows of a dataset.

Some errors are fatal, though.

If you pass a 2-element tuple to a node that takes 3 arguments, |bonobo| will raise an
:class:`bonobo.errors.UnrecoverableTypeError` and exit the current graph execution as fast as it can (finishing
the node executions already in progress, but not starting new ones even if there are remaining input rows).
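
As a sketch of the recoverable case, consider a hypothetical node (name and fields are ours) that chokes on some
rows:

.. code-block:: python

    def parse_price(label, price):
        # float() raises ValueError on malformed input; bonobo aborts only
        # this call, reports the error, and moves on to the next row.
        yield label, float(price)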

You can learn more about the :class:`bonobo.Graph` data structure and its properties in the
:doc:`graphs guide </guide/graphs>`.

Let's write a sample data integration job
:::::::::::::::::::::::::::::::::::::::::

Scenario
::::::::

Let's create a sample application, whose goal will be to integrate some data into various systems.

We'll use an open-data dataset containing all the fablabs in the world.

We will normalize this data using a few different rules, then write it somewhere.

In this step, we'll focus on getting this data normalized and output to the console. In the next steps, we'll
extend it to other targets, like files and databases.

Setup
:::::

We'll change the `tutorial.py` file created in the last step to handle this new scenario.

First, let's remove all boilerplate code, so it looks like this:

.. code-block:: python

    import bonobo


    def get_graph(**options):
        graph = bonobo.Graph()
        return graph


    def get_services(**options):
        return {}


    if __name__ == '__main__':
        parser = bonobo.get_argument_parser()
        with bonobo.parse_args(parser) as options:
            bonobo.run(get_graph(**options), services=get_services(**options))

Your job now contains the logic for executing an empty graph, and we'll complete this with our application logic.

Reading the source data
:::::::::::::::::::::::

Let's add a simple chain to our `get_graph(...)` function, so that it reads from the fablabs open-data API.

The source dataset we'll use can be found on `this site <https://public-us.opendatasoft.com/explore/dataset/fablabs/>`_.
It's licensed under `Public Domain`, which makes it just perfect for our example.

.. note::

    There is a :mod:`bonobo.contrib.opendatasoft` module that makes reading from OpenDataSoft APIs easier,
    including pagination and limits, but for our tutorial, we'll avoid that and build the extractor manually.

Let's write our extractor:

.. code-block:: python

    import requests

    FABLABS_API_URL = 'https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=1000'


    def extract_fablabs():
        yield from requests.get(FABLABS_API_URL).json().get('records')

This extractor will be called once: it queries the API URL, parses the response as JSON, and yields the items
from the "records" list, one by one.

.. note::

    You'll probably want to make this a bit more defensive in a real application, to handle all kinds of errors
    that can happen here. What if the server is down? What if it returns a response which is not JSON? What if
    the data is not in the expected format?

    For simplicity's sake, we'll ignore that here, but those are the kinds of questions you should keep in mind
    when writing pipelines.
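
Purely as an illustration, a sketch of what that defensiveness could look like (this error-handling policy is our
assumption, not part of the tutorial's code):

.. code-block:: python

    import requests

    FABLABS_API_URL = 'https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=1000'


    def extract_fablabs():
        response = requests.get(FABLABS_API_URL, timeout=30)
        # Fail loudly on HTTP-level errors (4xx/5xx) instead of parsing garbage.
        response.raise_for_status()
        try:
            payload = response.json()
        except ValueError as exc:
            raise ValueError('the fablabs API did not return valid JSON') from exc
        # A missing "records" key means the payload is not in the expected format.
        yield from payload.get('records', [])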

To test our pipeline, let's use a :class:`bonobo.Limit` and a :class:`bonobo.PrettyPrinter`, and change our
`get_graph(...)` function accordingly:

.. code-block:: python

    import bonobo


    def get_graph(**options):
        graph = bonobo.Graph()
        graph.add_chain(
            extract_fablabs,
            bonobo.Limit(10),
            bonobo.PrettyPrinter(),
        )
        return graph

Running this job should output a bit of data, along with some statistics.

First, let's look at the statistics:

.. code-block:: shell-session

    - extract_fablabs in=1 out=995 [done]
    - Limit in=995 out=10 [done]
    - PrettyPrinter in=10 out=10 [done]

It is important to understand that we extracted everything (995 rows) before dropping 99% of the dataset.

This is OK for debugging, but not efficient.

.. note::

    You should always try to limit the amount of data as early as possible, which often means not generating the
    data you won't need in the first place. Here, we could have used the `rows=` query parameter in the API URL
    to avoid requesting data we would drop anyway.
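
For instance, a one-line sketch of limiting at the source (assuming the API honors small `rows=` values the same
way it honors `rows=1000`):

.. code-block:: python

    # Ask the API for only 10 rows up front, instead of fetching 1000
    # and then discarding 99% of them with bonobo.Limit(10).
    FABLABS_API_URL = 'https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=10'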

Normalize
:::::::::

.. include:: _todo.rst


Output
::::::

We used :class:`bonobo.PrettyPrinter` to output the data.

It's a flexible built-in transformation that helps you display the content of a stream, and you'll probably use
it a lot, for various reasons.

Moving forward
::::::::::::::

You now know:

* How to use a reader node.
* How to use the console output.
* How to limit the number of elements in a stream.
* How to pass data from one node to another.
* How to structure a graph using chains.

It's now time to jump to :doc:`3-files`.


Part 3: Working with Files
==========================

.. include:: _wip_note.rst

Writing to the console is nice, but using files is probably more realistic.

Let's see how to use a few builtin writers and both local and remote filesystems.

Filesystems
:::::::::::

In |bonobo|, files are accessed within a **filesystem** service, which must be something with the same interface
as `fs' FileSystem objects <https://docs.pyfilesystem.org/en/latest/builtin.html>`_. By default, you'll get an
instance of a local filesystem mapped to the current working directory as the `fs` service. You'll learn more
about services in the next step, but for now, let's just use it.
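
As an illustration (the exact wiring is our assumption for now; services are covered in the next step), you could
map the `fs` service to another directory using the `fs` library's `open_fs`:

.. code-block:: python

    import fs  # pyfilesystem2


    def get_services(**options):
        # Replace the default "fs" service (current working directory) with
        # another local directory; URLs like 'ftp://...' open remote filesystems.
        return {'fs': fs.open_fs('/tmp')}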

Writing using the service
:::::::::::::::::::::::::

Although |bonobo| contains helpers to write to common file formats, let's start by writing manually.

.. code-block:: python

    from bonobo.config import use
    from bonobo.constants import NOT_MODIFIED


    @use('fs')
    def write_repr_to_file(*row, fs):
        with fs.open('output.txt', 'a+') as f:
            print(row, file=f)
        return NOT_MODIFIED

Then, update the `get_graph(...)` function by adding `write_repr_to_file` just before your `PrettyPrinter()` node.

Let's run that and think about what happens.

Each time a row reaches this node, the output file is opened in "append or create" mode, a line is written, and
the file is closed.

This is **NOT** how you want to do things. Let's rewrite it so our `open(...)` call becomes execution-wide.
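
A possible rewrite, sketched with |bonobo|'s context processors (treat the exact decorator spelling as an
assumption to check against your version's API reference): the file is opened once per execution and handed to
every call.

.. code-block:: python

    from bonobo.config import use, use_context_processor
    from bonobo.constants import NOT_MODIFIED


    @use('fs')
    def with_opened_file(self, context, fs):
        # Runs once per execution: the file stays open for the node's whole
        # lifetime and is closed when the graph finishes.
        with fs.open('output.txt', 'w+') as f:
            yield f


    @use_context_processor(with_opened_file)
    def write_repr_to_file(f, *row):
        print(row, file=f)
        return NOT_MODIFIED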

* Filesystems
* Reading files

You now know:

* How to ...

It's now time to jump to :doc:`4-services`.


You now know:

* How to ...

It's now time to jump to :doc:`5-packaging`.


...kind of project structure, as the target structure will be dictated by the hosting project. A
sub-package would perfectly fit a Django or Flask project, or even a regular package, but it's up to you to
choose the structure of your project.

Let's see how to move from the current status to a package.

You now know:

* How to ...

That's the end of the tutorial; you should now be familiar with all the basics.

A few appendixes to the tutorial explain how to integrate with other systems (we'll use the "fablabs" application
created in this tutorial and extend it):

* :doc:`notebooks`
* :doc:`sqlalchemy`
* :doc:`django`
* :doc:`docker`

Then, you can either jump head-first into your code, or get a better grasp of all the concepts by
:doc:`reading the full bonobo guide </guide/index>`.

Happy data flows!

docs/tutorial/_todo.rst (new file):

.. warning::

    This section is missing. Sorry, but stay tuned! It'll be added soon.

docs/tutorial/docker.rst (new file):

Working with Docker
===================

.. warning::

    This section does not exist yet, but it's in the plans to write it quite soon.

    Meanwhile, you can check the source code and other links provided below.

Source code
:::::::::::

https://github.com/python-bonobo/bonobo-docker