[docs] rewriting the tutorial.

Introduction
============

The first thing you need to understand before you use |bonobo|, or not, is what it does and what it does not do, so you
can understand whether it could be a good fit for your use cases.

How does it work?
:::::::::::::::::

It is a **data streaming** solution that treats datasets as ordered collections of independent rows, allowing you to
process them "first in, first out" using a set of transformations organized together in a directed graph.

Let's take a few examples.

Simplest linear graph
---------------------

.. graphviz::

   digraph {
      BEGIN -> "A" -> "B" -> "C" -> "END";
   }

One of the simplest, by-the-book cases is an extractor sending to a transformation, itself sending to a loader (hence
the "Extract Transform Load" name).

.. note::

   Of course, |bonobo| aims at real-world data transformations and can help you build all kinds of data flows.

|bonobo| will send an "impulsion" to all transformations linked to the little black dot on the left, here `A`.

`A`'s main job will be to extract data from somewhere (a file, an endpoint, a database...) and generate some output.
As soon as the first row of `A`'s output is available, |bonobo| will start asking `B` to process it. As soon as the
first row of `B`'s output is available, |bonobo| will start asking `C` to process it.

While `B` and `C` are processing, `A` continues to generate data.
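
The row-by-row flow described above can be pictured with plain Python generators. This is only an illustrative sketch of the streaming idea, not |bonobo|'s actual API; all function names here are made up:

```python
def extract():
    """A: generate rows one by one (e.g. read from a file or API)."""
    for row in ["alpha", "beta", "gamma"]:
        yield row

def transform(rows):
    """B: process each row as soon as it arrives."""
    for row in rows:
        yield row.upper()

def load(rows):
    """C: consume rows one at a time (e.g. write to a database)."""
    return [f"saved:{row}" for row in rows]

# Because generators are lazy, rows flow through the chain
# "first in, first out" instead of being materialized as a whole.
result = load(transform(extract()))
print(result)  # ['saved:ALPHA', 'saved:BETA', 'saved:GAMMA']
```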

This approach can be efficient, depending on your requirements, because you may rely on a lot of services that can be
slow to answer or unreliable, and you don't have to handle optimizations, parallelism or retry logic by yourself.

.. note::

   The default execution strategy uses threads, which makes it efficient for I/O-bound tasks. There are plans for other
   execution strategies, based on subprocesses (for CPU-bound tasks) or `dask.distributed` (for big data tasks that
   require a cluster of computers to process in reasonable time).
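
To picture the threaded strategy, here is a minimal hand-rolled sketch (again, not |bonobo| code) of two stages connected by a queue, where the producer keeps generating rows while the consumer processes them:

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream

def producer(out_q):
    """Stage A: push rows as soon as they are generated."""
    for i in range(5):
        out_q.put(i)
    out_q.put(SENTINEL)

def consumer(in_q, results):
    """Stage B: process rows first in, first out."""
    while True:
        row = in_q.get()
        if row is SENTINEL:
            break
        results.append(row * 2)

q = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(q,))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 2, 4, 6, 8]
```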

Graphs with divergence points (or forks)
----------------------------------------

.. graphviz::

   digraph {
      "A" -> "B";
      "A" -> "C";
   }

In this case, any output row of `A` will be **sent to both** `B` and `C` simultaneously, and `A` can keep processing
while `B` and `C` are working.
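
A fork can be sketched in plain Python as well (illustrative only, not |bonobo|'s API): each row produced by `A` is handed to both branches.

```python
def a():
    """The extractor feeding the fork."""
    yield from [1, 2, 3]

def b(row):
    return ("b", row)

def c(row):
    return ("c", row)

# Each output row of A is sent to BOTH branches of the fork.
b_rows, c_rows = [], []
for row in a():
    b_rows.append(b(row))
    c_rows.append(c(row))

print(b_rows)  # [('b', 1), ('b', 2), ('b', 3)]
print(c_rows)  # [('c', 1), ('c', 2), ('c', 3)]
```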

Graphs with convergence points (or merges)
------------------------------------------

.. graphviz::

   digraph {
      "A" -> "C";
      "B" -> "C";
   }

Now, we feed `C` with both `A`'s and `B`'s output. It is not a "join" or "cartesian product": it is just two different
pipes plugged into `C`'s input, and whichever yields data will see this data fed to `C`, one row at a time.
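
The merge behaviour can be sketched like this (illustrative, not |bonobo| code): `C` sees a single stream of rows, in whatever order the two pipes produced them. Here we simulate arrival order by draining `A` then `B`.

```python
from itertools import chain

def a():
    yield from ["a1", "a2"]

def b():
    yield from ["b1", "b2", "b3"]

def c(rows):
    """C just sees one stream of rows, whichever pipe produced them."""
    return [f"c({row})" for row in rows]

# The merge is NOT a join or cartesian product: rows from both
# pipes are simply fed to C one at a time.
merged = chain(a(), b())
print(c(merged))  # ['c(a1)', 'c(a2)', 'c(b1)', 'c(b2)', 'c(b3)']
```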

What is it not?
:::::::::::::::

|bonobo| is not:

* A data science or statistical analysis tool, which needs to treat the dataset as a whole and not as a collection of
  independent rows. If this is your need, you probably want to look at `pandas <https://pandas.pydata.org/>`_.

* A workflow or scheduling solution for independent data-engineering tasks. If you're looking to manage your sets of
  data processing tasks as a whole, you probably want to look at `airflow <https://airflow.incubator.apache.org/>`_.
  Although there is no |bonobo| extension yet that handles that, it does make sense to integrate |bonobo| jobs in an
  airflow (or other similar tool) workflow.

* A big data solution, `as defined by Wikipedia <https://en.wikipedia.org/wiki/Big_data>`_. We're aiming at "small
  scale" data processing, which can still be quite huge for humans, but not for computers. If you don't know whether
  this is sufficient for your needs, it probably means you're not in "big data" land.

Where to jump next?
:::::::::::::::::::

If you have not run through it yet, we highly suggest that you go through the :doc:`tutorial </tutorial/index>` first.

Then, you can jump to the following guides, in no particular order:

.. toctree::
   :maxdepth: 1

   transformations
   graphs
   services
   environment
   purity

.. include:: _next.rst