Doc update
@ -249,10 +249,10 @@ That's all for this first step.
You now know:

* How to create a new job file.
* How to inspect the content of a job file.
* How to create a new job (using a single file).
* How to inspect the content of a job.
* What should go in a job file.
* How to execute a job file.
* How to read the console output.

**Jump to** :doc:`2-jobs`
It's now time to jump to :doc:`2-jobs`.

@ -4,31 +4,56 @@ Part 2: Writing ETL Jobs
What's an ETL job?
::::::::::::::::::

- data flow, stream processing
- each node, first in first out
- parallelism
In |bonobo|, an ETL job is a formal definition of an executable graph.

Each node has input rows; each row triggers one call, and each call receives the input row as *args.
Each node of a graph will be executed in isolation from the other nodes, and the data is passed from one node to the
next using FIFO queues, managed by the framework. This is transparent to the end user, though, and you'll only use
function arguments (for inputs) and return/yield values (for outputs).
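The "arguments in, yield out" contract described above can be sketched with plain Python. This is a simplified stand-in for the framework's plumbing, not bonobo's actual runner; the node names are hypothetical:

```python
def extract():
    # A node with no input: it only produces rows.
    yield "john doe"
    yield "jane doe"

def transform(row):
    # Called once per input row; the row arrives as an argument.
    yield row.title()

def load(row):
    # A terminal node: consumes rows, produces nothing.
    print(row)

# Naive sequential driver standing in for the framework's FIFO queues:
for row in extract():
    for out in transform(row):
        load(out)
```

In a real job, you would only write the three functions; the framework owns the driving loop.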
Each call can have outputs, sent either using return or yield.
Each input row of a node will cause one call to this node's callable. Each output is cast internally to a tuple-like
data structure (or, more precisely, a namedtuple-like data structure), and for one given node, each output row must
have the same structure.

Each output row is stored internally as a tuple (or a namedtuple-like structure), and each output row must have the same structure (same number of fields, same length for tuples).
If you return/yield something which is not a tuple, bonobo will create a tuple of one element.

If you yield something which is not a tuple, bonobo will create a tuple of one element.
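The wrapping rule above can be illustrated with a tiny helper. This is a hypothetical function mimicking the described behaviour, not bonobo's internal code:

```python
def as_row(value):
    # Non-tuple outputs are wrapped into a tuple of one element,
    # so every row has a uniform tuple-like shape downstream.
    if isinstance(value, tuple):
        return value
    return (value,)

print(as_row("foo"))      # a bare value becomes a 1-tuple
print(as_row(("a", 1)))   # a tuple passes through unchanged
```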
Properties
----------

By default, exceptions are not fatal in bonobo. If a call raises an error, then bonobo will display the stack trace, increment the "err" counter for this node and move on to the next input row.
|bonobo| assists you with defining the data flow of your data engineering process, and then streams data through your
callable graphs.

Some errors are fatal, though. For example, if you pass a 2-element tuple to a node that takes 3 arguments, bonobo will raise an UnrecoverableTypeError and exit the current execution.
* Each node call will process one row of data.
* Queues that flow the data between nodes are standard first-in, first-out (FIFO) Python :class:`queue.Queue` instances.
* Each node will run in parallel.
* The default execution strategy uses threading, and each node will run in a separate thread.
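These properties can be demonstrated with the standard library alone. The sketch below is a toy illustration of two nodes running in separate threads, linked by a FIFO :class:`queue.Queue`; the ``END`` sentinel is an assumption of this sketch, not part of bonobo:

```python
import queue
import threading

END = object()  # sentinel telling the consumer the stream is over

def producer(out_q):
    # "Upstream" node: pushes each row into the FIFO queue.
    for i in range(3):
        out_q.put(i)
    out_q.put(END)

def consumer(in_q, results):
    # "Downstream" node: pulls rows in FIFO order, one call per row.
    while True:
        row = in_q.get()
        if row is END:
            break
        results.append(row * 2)

q = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 2, 4]
```

Because the queue is FIFO, rows reach the consumer in the order they were produced, even though the two nodes run concurrently.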
Fault tolerance
---------------

Node execution is fault tolerant.

If an exception is raised from a node call, then this node call will be aborted but bonobo will continue the execution
with the next row (after outputting the stack trace and incrementing the "err" counter for the node context).

This allows ETL jobs to ignore faulty data and do their best to process the valid rows of a dataset.

Some errors are fatal, though.

If you pass a 2-element tuple to a node that takes 3 arguments, |bonobo| will raise an :class:`bonobo.errors.UnrecoverableTypeError` and exit the
current graph execution as fast as it can (finishing the other node executions that are in progress first, but not
starting new ones if there are remaining input rows).
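The per-row fault tolerance described above can be mirrored by a small, hypothetical mini-runner (this is not bonobo's implementation): a failing row is skipped, the "err" counter grows, and processing continues with the next row.

```python
import traceback

def run_node(node, rows):
    stats = {"out": 0, "err": 0}
    results = []
    for row in rows:
        try:
            results.append(node(row))
            stats["out"] += 1
        except Exception:
            traceback.print_exc()  # show the stack trace, as bonobo does...
            stats["err"] += 1      # ...then move on to the next input row
    return results, stats

# The third row raises ZeroDivisionError; the other rows still go through.
results, stats = run_node(lambda x: 10 // x, [1, 2, 0, 5])
print(results, stats)  # [10, 5, 2] {'out': 3, 'err': 1}
```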
Let's write a sample data integration job
:::::::::::::::::::::::::::::::::::::::::

Let's create a sample application.

The goal of this application will be to extract all the fablabs in the world using an open-data API, normalize this
data and, for now, display it. We'll then build on this foundation in the next steps to write to files, databases, etc.

Let's write one
:::::::::::::::

We'll create a job to do the following:

* Extract all the FabLabs from an open-data API
* Apply a bit of formatting
* Geocode the address and normalize it, if we can
* Display it (in the next step, we'll learn about writing the result to a file)

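The overall shape of such a job can be sketched in plain Python. This is a hedged sketch only: the open-data API call and the geocoding step are replaced with hardcoded sample rows, and all names (``extract_fablabs``, ``format_row``, ``display``) are hypothetical:

```python
def extract_fablabs():
    # Real version would query the open-data API here.
    yield {"name": "fablab one", "city": "paris"}
    yield {"name": "fablab two", "city": "lyon"}

def format_row(row):
    # "Apply a bit of formatting": normalize the casing of each field.
    yield {"name": row["name"].title(), "city": row["city"].title()}

def display(row):
    # For now, just print; later steps write to files or databases.
    print(row)

for row in extract_fablabs():
    for formatted in format_row(row):
        display(formatted)
```

Each step stays an independent callable, so swapping the display node for a file writer later does not touch the extract or format code.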
Moving forward
@ -1,9 +1,6 @@
First steps
===========

What is Bonobo?
:::::::::::::::

Bonobo is an ETL (Extract-Transform-Load) framework for Python 3.5. The goal is to define data transformations, with
Python code in charge of handling similarly shaped, independent lines of data.
@ -14,8 +11,7 @@ Bonobo is a lean manufacturing assembly line for data that let you focus on the
Bonobo uses simple Python and should be quick and easy to learn.

Tutorial
::::::::
**Tutorials**

.. toctree::
    :maxdepth: 1
@ -26,8 +22,8 @@ Tutorial
    4-services
    5-packaging

More
::::

**Integrations**

.. toctree::
    :maxdepth: 1
@ -36,9 +32,7 @@ More
    notebooks
    sqlalchemy

What's next?
::::::::::::

**What's next?**

Once you're familiar with all the base concepts, you can...
@ -46,9 +40,7 @@ Once you're familiar with all the base concepts, you can...
* Explore the :doc:`Extensions </extension/index>` to widen the possibilities.
* Open the :doc:`References </reference/index>` and start hacking like crazy.

You're not alone!
:::::::::::::::::
**You're not alone!**

Good documentation is not easy to write.