[docs] rewriting the tutorial.
16
docs/guide/_next.rst
Normal file
@@ -0,0 +1,16 @@
Where to jump next?
:::::::::::::::::::

We suggest that you go through the :doc:`tutorial </tutorial/index>` first.

Then, you can read the guides, either in the suggested order or by picking the chapter that interests you the most at
any given moment:

* :doc:`introduction`
* :doc:`transformations`
* :doc:`graphs`
* :doc:`services`
* :doc:`environment`
* :doc:`purity`
* :doc:`debugging`
* :doc:`plugins`
@@ -1,11 +0,0 @@
.. toctree::
    :maxdepth: 2

    introduction
    transformations
    graphs
    services
    environment
    purity
    debugging
    plugins
@@ -0,0 +1,5 @@
Debugging
=========


.. include:: _next.rst

@@ -127,3 +127,5 @@ function and used to get data from the database.
bonobo.PrettyPrinter(),
)


.. include:: _next.rst

@@ -249,3 +249,5 @@ the CLI, and reading the source you should be able to figure out its usage quite



.. include:: _next.rst


@@ -3,8 +3,14 @@ Guides

This section will guide you through your journey with Bonobo ETL.

.. toctree::
    :maxdepth: 2

.. include:: _toc.rst



    introduction
    transformations
    graphs
    services
    environment
    purity
    debugging
    plugins

@@ -1,8 +1,8 @@
Introduction
============

The first thing you need to understand before deciding whether to use Bonobo is what it does and what it does not do,
so you can judge whether it is a good fit for your use cases.
The first thing you need to understand before deciding whether to use |bonobo| is what it does and what it does not do,
so you can judge whether it is a good fit for your use cases.

How it works?
:::::::::::::
@@ -13,7 +13,10 @@ terminals and source code files.
It is a **data streaming** solution that treats datasets as ordered collections of independent rows, allowing you to
process them "first in, first out" using a set of transformations organized together in a directed graph.

Let's take a few examples:
Let's take a few examples.

Simplest linear graph
---------------------

.. graphviz::

@@ -26,18 +29,35 @@ Let's take a few examples:
        BEGIN -> "A" -> "B" -> "C" -> "END";
    }

One of the simplest, by-the-book cases is an extractor sending rows to a transformation, itself sending rows to a loader.
One of the simplest, by-the-book cases is an extractor sending rows to a transformation, itself sending rows to a loader
(hence the "Extract Transform Load" name).

.. note::

    Of course, |bonobo| is aiming at real-world data transformations and can help you build all kinds of data-flows.

Bonobo will send an "impulsion" to all transformations linked to the `BEGIN` node (shown as a little black dot on the left).

In our example, the only node having its input linked to `BEGIN` is `A`.

Bonobo will send an "impulsion" to all transformations linked to the little black dot on the left, here `A`.
`A`'s main job is to extract data from somewhere (a file, an endpoint, a database...) and generate some output.
As soon as the first row of `A`'s output is available, Bonobo will start asking `B` to process it. As soon as the first
row of `B`'s output is available, Bonobo will start asking `C` to process it.
As soon as the first row of `A`'s output is available, |bonobo| will start asking `B` to process it. As soon as the first
row of `B`'s output is available, |bonobo| will start asking `C` to process it.

While `B` and `C` are processing, `A` continues to generate data.

This approach can be efficient, depending on your requirements, because you may rely on a lot of services that may be
slow to answer or unreliable, and you don't have to handle optimizations, parallelism or retry logic by yourself.

.. note::

    The default execution strategy uses threads, which makes it efficient for I/O-bound tasks. There are plans to add
    other execution strategies, based on subprocesses (for CPU-bound tasks) or `dask.distributed` (for big data tasks
    that require a cluster of computers to process in reasonable time).
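
The streaming behaviour of the linear `A` -> `B` -> `C` graph can be sketched with the standard library alone. This is
only an illustration of the idea, not how Bonobo is implemented: the `END` sentinel, the function names and the use of
`queue.Queue` are assumptions made for this example.

.. code-block:: python

    import queue
    import threading

    END = object()  # hypothetical sentinel marking the end of a stream


    def extract(out):
        """A: generate rows and push them downstream as soon as they exist."""
        for i in range(3):
            out.put({"id": i})
        out.put(END)


    def transform(inp, out):
        """B: process rows one at a time, in arrival order."""
        while (row := inp.get()) is not END:
            out.put({**row, "double": row["id"] * 2})
        out.put(END)


    def load(inp, results):
        """C: consume rows; here they are only collected in a list."""
        while (row := inp.get()) is not END:
            results.append(row)


    def run_linear():
        a_to_b, b_to_c, results = queue.Queue(), queue.Queue(), []
        threads = [
            threading.Thread(target=extract, args=(a_to_b,)),
            threading.Thread(target=transform, args=(a_to_b, b_to_c)),
            threading.Thread(target=load, args=(b_to_c, results)),
        ]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return results

Each node runs in its own thread and hands rows over through a FIFO queue, so `B` can start working as soon as `A` has
produced its first row, and rows reach `C` in the order `A` emitted them.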

Graphs with divergence points (or forks)
----------------------------------------

.. graphviz::

    digraph {
@@ -55,6 +75,9 @@ In this case, any output row of `A`, will be **sent to both** `B` and `C` simult
processing while `B` and `C` are working.
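
The fork semantics can be illustrated with a few lines of plain Python. This is a simplified, sequential sketch under
assumed names (`fan_out`, `run_fork`); it only shows that every row is delivered to every branch, not how Bonobo
schedules the branches.

.. code-block:: python

    def fan_out(rows, *consumers):
        """Send every row to every consumer, preserving the input order."""
        for row in rows:
            for consume in consumers:
                consume(row)


    def run_fork():
        seen_by_b, seen_by_c = [], []
        fan_out(
            [{"id": 1}, {"id": 2}],  # rows produced by A
            seen_by_b.append,        # stands in for B
            seen_by_c.append,        # stands in for C
        )
        return seen_by_b, seen_by_c

Both branches receive the complete stream: nothing is split or load-balanced between `B` and `C`.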


Graph with convergence points (or merges)
-----------------------------------------

.. graphviz::

    digraph {
@@ -71,38 +94,23 @@ processing while `B` and `C` are working.
Now, we feed `C` with the output of both `A` and `B`. It is not a "join" or a "cartesian product". It is just two
different pipes plugged into `C`'s input, and whichever yields data will have that data fed to `C`, one row at a time.
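
To make the difference with a join concrete, here is a small standard-library sketch in which two producers write into
the single input of `C`. The names (`produce`, `run_merge`) and the tuple rows are assumptions for this example, and
the queue only stands in for whatever Bonobo uses internally.

.. code-block:: python

    import queue
    import threading


    def produce(rows, pipe):
        """Push rows into the shared pipe, one at a time."""
        for row in rows:
            pipe.put(row)


    def run_merge():
        c_input = queue.Queue()  # the single input of C
        producers = [
            threading.Thread(target=produce, args=([("A", 1), ("A", 2)], c_input)),
            threading.Thread(target=produce, args=([("B", 1)], c_input)),
        ]
        for thread in producers:
            thread.start()
        for thread in producers:
            thread.join()

        # C sees the rows one by one, in whatever order they arrived.
        received = []
        while not c_input.empty():
            received.append(c_input.get())
        return received

`C` receives three rows here (two from `A` plus one from `B`), not a combination of them. The interleaving between the
two sources is not guaranteed, but the rows coming from a given source keep their relative order.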


What is it not?
:::::::::::::::

**Bonobo** is not:
|bonobo| is not:

* A data science, or statistical analysis tool, which needs to treat the dataset as a whole and not as a collection of
  independent rows. If this is your need, you probably want to look at `pandas <https://pandas.pydata.org/>`_.

* A workflow or scheduling solution for independent data-engineering tasks. If you're looking to manage your sets of
  data processing tasks as a whole, you probably want to look at `airflow <https://airflow.incubator.apache.org/>`_.
  Although there is no Bonobo extension yet that handles that, it does make sense to integrate Bonobo jobs in an airflow
  (or other similar tool) workflow.
  Although there is no |bonobo| extension yet that handles that, it does make sense to integrate |bonobo| jobs in an
  airflow (or other similar tool) workflow.

* A big data solution, `as defined by wikipedia <https://en.wikipedia.org/wiki/Big_data>`_. We're aiming at "small
  scale" data processing, which can be still quite huge for humans, but not for computers. If you don't know whether or
  not this is sufficient for your needs, it probably means you're not in the "big data" land.


Where to jump next?
:::::::::::::::::::

If you did not run through it yet, we highly suggest that you go through the :doc:`tutorial </tutorial/index>` first.

Then, you can jump to the following guides, in no particular order:

.. toctree::
    :maxdepth: 1

    transformations
    graphs
    services
    environment
    purity


.. include:: _next.rst

@@ -15,3 +15,5 @@ enhancers
node
-


.. include:: _next.rst

@@ -147,3 +147,5 @@ a new dict will of course create a new envelope, but the unchanged objects insid

Last thing, copies made in the "pure" approach are explicit, and usually, explicit is better than implicit.


.. include:: _next.rst

@@ -157,3 +157,5 @@ Read more
:::::::::

* See https://github.com/hartym/bonobo-sqlalchemy/blob/work-in-progress/bonobo_sqlalchemy/writers.py#L19 for example usage (work in progress).

.. include:: _next.rst

@@ -233,22 +233,16 @@ bonobo send the data to your transformation.

.. code-block:: python

    from bonobo.constants import BEGIN, END
    from bonobo.execution import NodeExecutionContext

    with NodeExecutionContext(
        JsonWriter(filename), services={'fs': ...}
    ) as context:

        # Write a list of rows, including BEGIN/END control messages.
        context.write(
            BEGIN,
            Bag({'foo': 'bar'}),
            Bag({'foo': 'baz'}),
            END
        context.write_sync(
            {'foo': 'bar'},
            {'foo': 'baz'},
        )

        # Out of the bonobo main loop, we need to call `step` explicitly.
        context.step()
        context.step()

.. include:: _next.rst
