Doc update
@@ -249,10 +249,10 @@ That's all for this first step.
You now know:

* How to create a new job (using a single file).
* How to inspect the content of a job.
* What should go in a job file.
* How to execute a job file.
* How to read the console output.

It's now time to jump to :doc:`2-jobs`.

@@ -4,31 +4,56 @@ Part 2: Writing ETL Jobs

What's an ETL job?
::::::::::::::::::

In |bonobo|, an ETL job is a formal definition of an executable graph.

Each node of a graph will be executed in isolation from the other nodes, and the data is passed from one node to the
next using FIFO queues, managed by the framework. This is transparent to the end user, though: you'll only use
function arguments (for inputs) and return/yield values (for outputs).

Each input row of a node will cause one call to this node's callable. Each output is cast internally into a tuple-like
data structure (more precisely, a namedtuple-like data structure), and for one given node, each output row must have
the same structure (same number of fields).

If you return or yield something which is not a tuple, bonobo will wrap it in a tuple of one element.

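In plain Python, the wrapping rule reads roughly like this (an illustration of the behaviour described above, not bonobo's actual implementation; ``ensure_tuple`` is a name invented here):

```python
def ensure_tuple(value):
    # Hypothetical helper illustrating the rule above: outputs that are
    # not tuples are wrapped into a tuple of one element.
    return value if isinstance(value, tuple) else (value,)

def extract():
    yield "foo"     # not a tuple: becomes ("foo",)
    yield ("bar",)  # already a tuple: kept as-is

# Every row ends up with the same, tuple-like structure.
rows = [ensure_tuple(row) for row in extract()]  # [("foo",), ("bar",)]
```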
Properties
----------

|bonobo| assists you in defining the data flow of your data engineering process, and then streams data through your
callable graphs.

* Each node call processes one row of data.
* The queues that flow data between nodes are standard first-in, first-out (FIFO) python :class:`queue.Queue` instances.
* Each node runs in parallel with the others.
* The default execution strategy uses threading: each node runs in a separate thread.

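The execution model described in the list above can be sketched with the standard library (a simplified illustration, not bonobo's actual internals; ``run_node`` and the ``END`` sentinel are invented names):

```python
import queue
import threading

# Simplified sketch of the execution model: a node runs in its own thread,
# fed by a FIFO queue, and pushes its outputs to the next queue.
END = object()  # invented sentinel marking the end of the stream

def run_node(node, inbox, outbox):
    for row in iter(inbox.get, END):   # one input row...
        outbox.put(node(row))          # ...causes one call to the node
    outbox.put(END)                    # propagate end-of-stream downstream

q1, q2 = queue.Queue(), queue.Queue()
worker = threading.Thread(target=run_node, args=(str.upper, q1, q2))
worker.start()

for word in ("extract", "transform"):
    q1.put(word)
q1.put(END)
worker.join()

results = list(iter(q2.get, END))  # ["EXTRACT", "TRANSFORM"]
```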
Fault tolerance
---------------

Node execution is fault tolerant.

If an exception is raised from a node call, that call will be aborted, but bonobo will continue the execution
with the next row (after outputting the stack trace and incrementing the "err" counter for the node context).

This allows ETL jobs to ignore faulty data and do their best to process the valid rows of a dataset.

Some errors are fatal, though.

If you pass a two-element tuple to a node that takes three arguments, |bonobo| will raise a
:class:`bonobo.errors.UnrecoverableTypeError` and exit the current graph execution as fast as it can (finishing the
node executions already in progress, but not starting new ones even if there are remaining input rows).

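In plain Python terms, the arity mismatch looks like this (``my_node`` is a hypothetical node; bonobo surfaces the failure as an :class:`bonobo.errors.UnrecoverableTypeError` rather than a bare ``TypeError``):

```python
def my_node(x, y, z):
    # A node whose callable expects three fields per row.
    return x + y + z

row = ("a", "b")  # a two-element row: one field is missing

failed = False
try:
    my_node(*row)  # the row's fields are passed as positional arguments
except TypeError:
    failed = True  # bonobo wraps this kind of failure and aborts the graph
```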
Let's write a sample data integration job
:::::::::::::::::::::::::::::::::::::::::

Let's create a sample application.

The goal of this application will be to extract all the fablabs in the world using an open-data API, normalize this
data and, for now, display it. We'll then build on this foundation in the next steps to write to files, databases,
etc.

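As a preview, the shape of that job can be sketched as three plain-Python callables chained together (hypothetical sample rows stand in for the open-data API call; the actual bonobo graph is built in the next sections):

```python
def extract():
    # Hypothetical sample rows standing in for the open-data API call.
    yield "fablab lyon"
    yield "happylab vienna"

def transform(name):
    # A bit of formatting/normalization.
    return name.title()

def load(name, sink=print):
    # For now, just display the row (the sink is injectable for testing).
    sink(name)

# Chain the three steps; bonobo would wire them into a graph instead.
results = []
for row in extract():
    load(transform(row), sink=results.append)
```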
Moving forward

@@ -1,9 +1,6 @@

First steps
===========

Bonobo is an ETL (Extract-Transform-Load) framework for Python 3.5. The goal is to define data transformations, with
Python code in charge of handling similarly shaped, independent lines of data.

@@ -14,8 +11,7 @@ Bonobo is a lean manufacturing assembly line for data that let you focus on the

Bonobo uses simple Python and should be quick and easy to learn.

**Tutorials**

.. toctree::
    :maxdepth: 1

@@ -26,8 +22,8 @@ Tutorial

    4-services
    5-packaging

**Integrations**

.. toctree::
    :maxdepth: 1

@@ -36,9 +32,7 @@ More

    notebooks
    sqlalchemy

**What's next?**

Once you're familiar with all the base concepts, you can...

@@ -46,9 +40,7 @@ Once you're familiar with all the base concepts, you can...
|
|||||||
* Explore the :doc:`Extensions </extension/index>` to widen the possibilities.
|
* Explore the :doc:`Extensions </extension/index>` to widen the possibilities.
|
||||||
* Open the :doc:`References </reference/index>` and start hacking like crazy.
|
* Open the :doc:`References </reference/index>` and start hacking like crazy.
|
||||||
|
|
||||||
**You're not alone!**

Good documentation is not easy to write.