diff --git a/docs/tutorial/1-init.rst b/docs/tutorial/1-init.rst
index 5a1bbd9..9fc92f5 100644
--- a/docs/tutorial/1-init.rst
+++ b/docs/tutorial/1-init.rst
@@ -249,10 +249,10 @@
 
 That's all for this first step. You now know:
 
-* How to create a new job file.
-* How to inspect the content of a job file.
+* How to create a new job (using a single file).
+* How to inspect the content of a job.
 * What should go in a job file.
 * How to execute a job file.
 * How to read the console output.
 
-**Jump to** :doc:`2-jobs`
+It's now time to jump to :doc:`2-jobs`.
diff --git a/docs/tutorial/2-jobs.rst b/docs/tutorial/2-jobs.rst
index d2bbfe5..e7d4baf 100644
--- a/docs/tutorial/2-jobs.rst
+++ b/docs/tutorial/2-jobs.rst
@@ -4,31 +4,56 @@ Part 2: Writing ETL Jobs
 
 What's an ETL job ?
 :::::::::::::::::::
 
-- data flow, stream processing
-- each node, first in first out
-- parallelism
+In |bonobo|, an ETL job is a formal definition of an executable graph.
 
-Each node has input rows, each row is one call, and each call has the input row passed as *args.
+Each node of a graph will be executed in isolation from the other nodes, and the data is passed from one node to the
+next using FIFO queues, managed by the framework. It's transparent to the end user, though, and you'll only use
+function arguments (for inputs) and return/yield values (for outputs).
 
-Each call can have outputs, sent either using return, or yield.
+Each input row of a node will cause one call to this node's callable. Each output is cast internally to a tuple-like
+data structure (or, more precisely, a namedtuple-like data structure), and for one given node, each output row must
+have the same structure.
 
-Each output row is stored internally as a tuple (or a namedtuple-like structure), and each output row must have the same structure (same number of fields, same len for tuple).
+If you return/yield something which is not a tuple, bonobo will create a tuple of one element.
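The row-passing rules described in the added lines above can be sketched in plain Python. This is a hypothetical simulation of the documented semantics, not bonobo's actual implementation: each input row triggers exactly one call to the node's callable, and any non-tuple output is wrapped into a tuple of one element.

```python
def normalize_output(value):
    """Wrap any non-tuple output into a tuple of one element, as described above."""
    return value if isinstance(value, tuple) else (value,)


def run_node(node, rows):
    """Simulate a node context: one call per input row, outputs collected as tuples."""
    output = []
    for row in rows:
        output.append(normalize_output(node(*row)))
    return output


def upper(name):
    return name.upper()  # a bare string, so it gets wrapped as a 1-element tuple


print(run_node(upper, [("alice",), ("bob",)]))  # [('ALICE',), ('BOB',)]
```

Note how each input row is unpacked as positional arguments (``*row``), which is why every output row of a node must keep the same structure.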
 
-If you yield something which is not a tuple, bonobo will create a tuple of one element.
+Properties
+----------
 
-By default, exceptions are not fatal in bonobo. If a call raise an error, then bonobo will display the stack trace, increment the "err" counter for this node and move to the next input row.
+|bonobo| assists you with defining the data-flow of your data engineering process, and then streams data through your
+callable graphs.
 
-Some errors are fatal, though. For example, if you pass a 2 elements tuple to a node that takes 3 args, bonobo will raise an UnrecoverableTypeError, and exit the current execution.
+* Each node call will process one row of data.
+* The queues that flow data between nodes are standard first-in, first-out (FIFO) python :class:`queue.Queue` instances.
+* Each node will run in parallel with the others.
+* The default execution strategy uses threading, and each node will run in a separate thread.
+
+Fault tolerance
+---------------
+
+Node execution is fault tolerant.
+
+If an exception is raised from a node call, then this call will be aborted, but bonobo will continue the execution
+with the next row (after outputting the stack trace and incrementing the "err" counter for the node context).
+
+This allows ETL jobs to ignore faulty data and try their best to process the valid rows of a dataset.
+
+Some errors are fatal, though.
+
+If you pass a 2-element tuple to a node that takes 3 arguments, |bonobo| will raise a :class:`bonobo.errors.UnrecoverableTypeError` and exit the
+current graph execution as fast as it can (finishing the other node executions that are in progress first, but not
+starting new ones if there are remaining input rows).
+
+
+Let's write a sample data integration job
+:::::::::::::::::::::::::::::::::::::::::
+
+Let's create a sample application.
+
+The goal of this application will be to extract all the fablabs in the world using an open-data API, normalize this
+data and, for now, display it.
+
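Before the full version of this sample job, its shape can be sketched in plain Python. The rows and field names (``name``, ``city``) are hypothetical stand-ins for the open-data API's payload, and the nodes are chained by hand here; in |bonobo| the same callables would instead be wired into a graph (e.g. with ``bonobo.Graph(...)``) and the framework would stream rows between them through FIFO queues.

```python
def extract():
    # Hypothetical stand-in rows; the real job will fetch fablabs from an open-data API.
    yield {"name": "fablab one", "city": "paris"}
    yield {"name": "my fab lab", "city": "lyon"}


def normalize(row):
    # Apply a bit of formatting to each row.
    return {**row, "name": row["name"].title(), "city": row["city"].title()}


def load(row):
    # For now, just display each normalized row.
    print("{name} ({city})".format(**row))


# Chained by hand here; bonobo would run each callable in its own thread instead.
for row in extract():
    load(normalize(row))
```

The extract/transform/load split above mirrors the three kinds of nodes the tutorial builds in the next steps.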
+We'll then build on this foundation in the next steps to write to files, databases, etc.
 
-Let's write one
-:::::::::::::::
 
-We'll create a job to do the following
 
-* Extract all the FabLabs from an open data API
-* Apply a bit of formating
-* Geocode the address and normalize it, if we can
-* Display it (in the next step, we'll learn about writing the result to a file.
 
 Moving forward
diff --git a/docs/tutorial/index.rst b/docs/tutorial/index.rst
index 438d9d8..6f57dc1 100644
--- a/docs/tutorial/index.rst
+++ b/docs/tutorial/index.rst
@@ -1,9 +1,6 @@
 First steps
 ===========
 
-What is Bonobo?
-:::::::::::::::
-
 Bonobo is an ETL (Extract-Transform-Load) framework for python 3.5. The goal is to define data-transformations, with
 python code in charge of handling similar shaped independent lines of data.
 
@@ -14,8 +11,7 @@ Bonobo is a lean manufacturing assembly line for data that let you focus on the
 
 Bonobo uses simple python and should be quick and easy to learn.
 
-Tutorial
-::::::::
+**Tutorials**
 
 .. toctree::
    :maxdepth: 1
@@ -26,8 +22,8 @@
    4-services
    5-packaging
 
-More
-::::
+
+**Integrations**
 
 .. toctree::
    :maxdepth: 1
@@ -36,9 +32,7 @@
    notebooks
    sqlalchemy
 
-What's next?
-::::::::::::
-
+**What's next?**
 
 Once you're familiar with all the base concepts, you can...
 
@@ -46,9 +40,7 @@ Once you're familiar with all the base concepts, you can...
 * Explore the :doc:`Extensions ` to widen the possibilities.
 * Open the :doc:`References ` and start hacking like crazy.
 
-
-You're not alone!
-:::::::::::::::::
+**You're not alone!**
 
 Good documentation is not easy to write.