Doc update

This commit is contained in:
Romain Dorgueil
2018-01-01 22:18:21 +01:00
parent 7d4fb1dff0
commit f640e358b4
3 changed files with 49 additions and 32 deletions

View File

@ -249,10 +249,10 @@ That's all for this first step.
You now know:
* How to create a new job file.
* How to inspect the content of a job file.
* How to create a new job (using a single file).
* How to inspect the content of a job.
* What should go in a job file.
* How to execute a job file.
* How to read the console output.
**Jump to** :doc:`2-jobs`
It's now time to jump to :doc:`2-jobs`.

View File

@ -4,31 +4,56 @@ Part 2: Writing ETL Jobs
What's an ETL job?
:::::::::::::::::::
- data flow, stream processing
- each node, first in first out
- parallelism
In |bonobo|, an ETL job is a formal definition of an executable graph.
Each node has input rows; each row triggers one call, and the row is passed to the call as ``*args``.
Each node of a graph will be executed in isolation from the other nodes, and the data is passed from one node to the
next using FIFO queues, managed by the framework. It's transparent to the end-user, though, and you'll only use
function arguments (for inputs) and return/yield values (for outputs).
Each call can have outputs, sent either using return, or yield.
Each input row of a node will cause one call to this node's callable. Each output is cast internally as a tuple-like
data structure (or more precisely, a namedtuple-like data structure), and for one given node, each output row must
have the same structure.
Each output row is stored internally as a tuple (or a namedtuple-like structure), and for a given node, each output row must have the same structure (the same number of fields).
If you return/yield something that is not a tuple, bonobo will create a tuple of one element.
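To make this concrete, here is a minimal sketch of such a job, using the ``Graph``/``run`` API described in these docs (the node names and data are made up for illustration):

.. code-block:: python

    import bonobo

    def extract():
        # A node without inputs: each yield produces one output row.
        yield 'hello'
        yield 'world'

    def transform(value):
        # One call per input row; the row's fields arrive as positional arguments.
        return value.title()

    def load(value):
        # A terminal node: consume rows, produce nothing.
        print(value)

    graph = bonobo.Graph()
    graph.add_chain(extract, transform, load)

    if __name__ == '__main__':
        bonobo.run(graph)

Here ``transform`` returns a plain string, so bonobo wraps it into a one-element row before passing it on to ``load``.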
Properties
----------
By default, exceptions are not fatal in bonobo. If a call raises an error, then bonobo will display the stack trace, increment the "err" counter for this node, and move on to the next input row.
|bonobo| assists you with defining the data-flow of your data engineering process, and then streams data through your
callable graphs.
Some errors are fatal, though. For example, if you pass a 2-element tuple to a node that takes 3 arguments, bonobo will raise an UnrecoverableTypeError and exit the current execution.
* Each node call will process one row of data.
* Queues that move the data between nodes are standard first-in, first-out (FIFO) Python :class:`queue.Queue` instances.
* Each node runs in parallel with the others.
* The default execution strategy uses threading, and each node runs in a separate thread (a rough illustration of this model follows the list).
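As a rough mental model of these properties (and **not** bonobo's actual implementation), you can picture two nodes connected by a standard FIFO queue, each running in its own thread, with one call per input row:

.. code-block:: python

    import queue
    import threading

    END = object()  # sentinel marking the end of the stream

    def extract(out_q):
        for row in [('foo',), ('bar',)]:
            out_q.put(row)  # one output row at a time, in FIFO order
        out_q.put(END)

    def load(in_q):
        while True:
            row = in_q.get()
            if row is END:
                break
            print(*row)  # one call per input row, passed as *args

    q = queue.Queue()
    threads = [
        threading.Thread(target=extract, args=(q,)),
        threading.Thread(target=load, args=(q,)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

The framework takes care of all of this plumbing for you; the sketch only shows where the queues and threads sit.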
Fault tolerance
---------------
Node execution is fault tolerant.
If an exception is raised from a node call, then this call will be aborted, but bonobo will continue the execution
with the next row (after outputting the stack trace and incrementing the "err" counter for the node context).
This allows ETL jobs to ignore faulty data and do their best to process the valid rows of a dataset.
Some errors are fatal, though.
If you pass a 2-element tuple to a node that takes 3 arguments, |bonobo| will raise an :class:`bonobo.errors.UnrecoverableTypeError` and exit the
current graph execution as fast as it can (finishing the node executions that are already in progress, but not
starting new ones even if there are remaining input rows).
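As a concrete (hypothetical) example of a node written with this behaviour in mind:

.. code-block:: python

    def parse_price(raw_price):
        # A deliberately fragile node: float() raises ValueError on malformed
        # input. With the default fault tolerance described above, such a row
        # is reported and skipped, and the job keeps processing the other rows.
        return float(raw_price)

A row carrying ``'n/a'`` would be reported and dropped, while numeric rows keep flowing through the rest of the graph.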
Let's write a sample data integration job
:::::::::::::::::::::::::::::::::::::::::
Let's create a sample application.
The goal of this application will be to extract all the fablabs in the world using an open-data API, normalize this
data and, for now, display it. We'll then build on this foundation in the next steps to write to files, databases, etc.
Let's write one
:::::::::::::::
We'll create a job to do the following (a rough skeleton is sketched after the list):
* Extract all the FabLabs from an open data API
* Apply a bit of formatting
* Geocode the address and normalize it, if we can
* Display it (in the next step, we'll learn about writing the result to a file).
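Here is a rough skeleton of such a job. The API URL, field names and the ``requests`` dependency are placeholders for this sketch, not the tutorial's final code, and the geocoding step is left out for now:

.. code-block:: python

    import bonobo
    import requests

    FABLABS_API_URL = 'https://example.com/fablabs.json'  # placeholder URL

    def extract_fablabs():
        # Yield one row per fablab returned by the (hypothetical) open data API.
        for record in requests.get(FABLABS_API_URL).json():
            yield record

    def format_fablab(record):
        # "A bit of formatting": keep a couple of fields and tidy them up.
        return record.get('name', '').strip(), record.get('address', '').strip()

    def display(name, address):
        print(name, '-', address)

    graph = bonobo.Graph()
    graph.add_chain(extract_fablabs, format_fablab, display)

    if __name__ == '__main__':
        bonobo.run(graph)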
Moving forward

View File

@ -1,9 +1,6 @@
First steps
===========
What is Bonobo?
:::::::::::::::
Bonobo is an ETL (Extract-Transform-Load) framework for Python 3.5. The goal is to define data transformations, with
Python code in charge of handling similarly shaped, independent lines of data.
@ -14,8 +11,7 @@ Bonobo is a lean manufacturing assembly line for data that let you focus on the
Bonobo uses simple Python and should be quick and easy to learn.
Tutorial
::::::::
**Tutorials**
.. toctree::
:maxdepth: 1
@ -26,8 +22,8 @@ Tutorial
4-services
5-packaging
More
::::
**Integrations**
.. toctree::
:maxdepth: 1
@ -36,9 +32,7 @@ More
notebooks
sqlalchemy
What's next?
::::::::::::
**What's next?**
Once you're familiar with all the base concepts, you can...
@ -46,9 +40,7 @@ Once you're familiar with all the base concepts, you can...
* Explore the :doc:`Extensions </extension/index>` to widen the possibilities.
* Open the :doc:`References </reference/index>` and start hacking like crazy.
You're not alone!
:::::::::::::::::
**You're not alone!**
Good documentation is not easy to write.