Merge remote-tracking branch 'upstream/develop'
46
docs/_static/custom.css
vendored
@ -1,3 +1,47 @@
|
||||
svg {
|
||||
border: 2px solid green
|
||||
}
|
||||
}
|
||||
|
||||
div.related {
|
||||
width: 940px;
|
||||
margin: 30px auto 0 auto;
|
||||
}
|
||||
|
||||
@media screen and (max-width: 875px) {
|
||||
div.related {
|
||||
visibility: hidden;
|
||||
display: none;
|
||||
}
|
||||
}
|
||||
|
||||
.brand {
|
||||
font-family: 'Ubuntu', 'goudy old style', 'minion pro', 'bell mt', Georgia, 'Hiragino Mincho Pro', serif;
|
||||
font-size: 0.9em;
|
||||
}
|
||||
|
||||
div.sphinxsidebar h3 {
|
||||
margin: 30px 0 10px 0;
|
||||
}
|
||||
|
||||
div.admonition p.admonition-title {
|
||||
font-family: 'Ubuntu', 'goudy old style', 'minion pro', 'bell mt', Georgia, 'Hiragino Mincho Pro', serif;
|
||||
}
|
||||
|
||||
div.sphinxsidebarwrapper {
|
||||
padding: 0;
|
||||
}
|
||||
|
||||
div.note {
|
||||
border: 0;
|
||||
}
|
||||
|
||||
div.admonition {
|
||||
padding: 20px;
|
||||
}
|
||||
|
||||
.last {
|
||||
margin-bottom: 0 !important;
|
||||
}
|
||||
pre {
|
||||
padding: 6px 20px;
|
||||
}
|
||||
|
||||
11
docs/_templates/base.html
vendored
@ -4,17 +4,8 @@
|
||||
{%- block extrahead %}
|
||||
{{ super() }}
|
||||
<style>
|
||||
div.related {
|
||||
width: 940px;
|
||||
margin: 30px auto 0 auto;
|
||||
}
|
||||
@media screen and (max-width: 875px) {
|
||||
div.related {
|
||||
visibility: hidden;
|
||||
display: none;
|
||||
}
|
||||
}
|
||||
</style>
|
||||
<link href="https://fonts.googleapis.com/css?family=Ubuntu" rel="stylesheet">
|
||||
{% endblock %}
|
||||
|
||||
{%- block footer %}
|
||||
|
||||
17
docs/_templates/sidebarintro.html
vendored
@ -1,22 +1,21 @@
|
||||
<h3>About Bonobo</h3>
|
||||
<p>
|
||||
Bonobo is a data-processing toolkit for python 3.5+, with emphasis on simplicity, atomicity and testability. Oh,
|
||||
and performances, too!
|
||||
Bonobo is a data-processing toolkit for Python 3.5+, your Swiss Army knife for everyday data.
|
||||
</p>
|
||||
|
||||
<h3>Other Formats</h3>
|
||||
<p>
|
||||
You can download the documentation in other formats as well:
|
||||
Download the docs...
|
||||
</p>
|
||||
<ul>
|
||||
<li><a href="http://readthedocs.org/projects/bonobo/downloads/pdf/master/">as PDF</a></li>
|
||||
<li><a href="http://readthedocs.org/projects/bonobo/downloads/htmlzip/master/">as zipped HTML</a></li>
|
||||
<li><a href="http://readthedocs.org/projects/bonobo/downloads/epub/master/">as EPUB</a></li>
|
||||
<li><a href="http://readthedocs.org/projects/bonobo/downloads/pdf/master/" title="Bonobo ETL documentation as PDF">... as PDF</a></li>
|
||||
<li><a href="http://readthedocs.org/projects/bonobo/downloads/htmlzip/master/" title="Bonobo ETL documentation as zipped HTML">... as zipped HTML</a></li>
|
||||
<li><a href="http://readthedocs.org/projects/bonobo/downloads/epub/master/" title="Bonobo ETL documentation as EPUB">... as EPUB</a></li>
|
||||
</ul>
|
||||
|
||||
<h3>Useful Links</h3>
|
||||
<ul>
|
||||
<li><a href="https://www.bonobo-project.org/">Bonobo ETL</a></li>
|
||||
<li><a href="http://pypi.python.org/pypi/bonobo">Bonobo ETL @ PyPI</a></li>
|
||||
<li><a href="http://github.com/python-bonobo/bonobo">Bonobo ETL @ GitHub</a></li>
|
||||
<li><a href="https://www.bonobo-project.org/">Bonobo's homepage</a></li>
|
||||
<li><a href="http://pypi.python.org/pypi/bonobo">Package on PyPI</a></li>
|
||||
<li><a href="http://github.com/python-bonobo/bonobo">Source code on GitHub</a></li>
|
||||
</ul>
|
||||
|
||||
8
docs/_templates/sidebarlogo.html
vendored
@ -1,10 +1,12 @@
|
||||
<a href="{{ pathto(master_doc) }}" style="border: none">
|
||||
<h1 style="text-align: center; margin: 0;">
|
||||
<img class="logo" src="{{ pathto('_static/bonobo.png', 1) }}" title="Bonobo" style="width: 48px; height: 48px; vertical-align: bottom"/>
|
||||
Bonobo
|
||||
<img class="logo" src="{{ pathto('_static/bonobo.png', 1) }}" title="Bonobo" style="width: 40px; height: 40px; vertical-align: bottom"/>
|
||||
<span class="brand">
|
||||
Bonobo
|
||||
</span>
|
||||
</h1>
|
||||
</a>
|
||||
|
||||
<p style="text-align: center">
|
||||
<p style="text-align: center" class="first">
|
||||
Data processing for humans.
|
||||
</p>
|
||||
|
||||
131
docs/changelog-0.6.rst
Normal file
@ -0,0 +1,131 @@
|
||||
Bonobo 0.6.0
|
||||
::::::::::::
|
||||
|
||||
* Removes dead snippet. (Romain Dorgueil)
|
||||
* Example datasets are now stored by bonobo minor version. (Romain Dorgueil)
|
||||
* Removing datasets from the repository. (Romain Dorgueil)
|
||||
* For some obscure reason, coverage is broken under python 3.7 making the test suite fail, disabled python3.7 in travis waiting for it to be fixed. (Romain Dorgueil)
|
||||
* [tests] adding a spec to magicmock of nodes to avoid it being seen as partially configured nodes (Romain Dorgueil)
|
||||
* Adds an OrderFields transformation factory, update examples. (Romain Dorgueil)
|
||||
* Check partially configured transformations that are function based (aka transformation factories) on execution context setup. (Romain Dorgueil)
|
||||
* Fix PrettyPrinter, output verbosity is now slightly more discreet. (Romain Dorgueil)
|
||||
* Inheritance of bags and better jupyter output for pretty printer. (Romain Dorgueil)
|
||||
* Documentation cosmetics. (Romain Dorgueil)
|
||||
* Simple "examples" command that just show examples for now. (Romain Dorgueil)
|
||||
* Rewriting Bags from scratch using a namedtuple approach, along with other (less major) updates. (Romain Dorgueil)
|
||||
* Adding services to naive execution (Kenneth Koski)
|
||||
* Fix another typo in `run` (Daniel Jilg)
|
||||
* Fix two typos in the ContextProcessor documentation (Daniel Jilg)
|
||||
* Core: refactoring contexts with more logical responsibilities, stopping to rely on kargs ordering for compat with python3.5 (Romain Dorgueil)
|
||||
* Simplification of node execution context, handle_result is now in step() as it is the only logical place where this will actually be called. (Romain Dorgueil)
|
||||
* Less strict CSV processing, to allow dirty input. (Romain Dorgueil)
|
||||
* [stdlib] Adds Update(...) and FixedWindow(...) to the standard nodes provided with bonobo. (Romain Dorgueil)
|
||||
* Adds a benchmarks directory with small scripts to test performances of things. (Romain Dorgueil)
|
||||
* Moves jupyter extension to both bonobo.contrib.jupyter (for the jupyter widget) and to bonobo.plugins (for the executor-side plugin). (Romain Dorgueil)
|
||||
* Fix examples with new module paths. (Romain Dorgueil)
|
||||
* IOFormats: if no kwargs, then try with one positional argument. (Romain Dorgueil)
|
||||
* Adds a __getattr__ dunder to ValueHolder to enable getting attributes, and especially method calls, on contained objects. (Romain Dorgueil)
|
||||
* Moves ODS extension to contrib module. (Romain Dorgueil)
|
||||
* Moves google extension to contrib module. (Romain Dorgueil)
|
||||
* Moves django extension to contrib module. (Romain Dorgueil)
|
||||
* Update graphs.rst (CW Andrews)
|
||||
* Adds argument parser support to django extension. (Romain Dorgueil)
|
||||
* Trying to understand conda... (Romain Dorgueil)
|
||||
* Trying to understand conda... (Romain Dorgueil)
|
||||
* Trying to understand conda... (Romain Dorgueil)
|
||||
* Update conda conf so readthedocs can maybe build. (Romain Dorgueil)
|
||||
* Working on the new version of the tutorial. Only Step1 implemented. (Romain Dorgueil)
|
||||
* Adds a "bare" template, containing the very minimum you want to have in 90% of cases. (Romain Dorgueil)
|
||||
* Fix default logging level, adds options to default template. (Romain Dorgueil)
|
||||
* Skip failing order test for python 3.5 (temporary). (Romain Dorgueil)
|
||||
* Switch to stable mondrian. (Romain Dorgueil)
|
||||
* Moves timer to statistics utilities. (Romain Dorgueil)
|
||||
* Adds basic test for convert command. (Romain Dorgueil)
|
||||
* [tests] adds node context lifecycle test. (Romain Dorgueil)
|
||||
* Small changes in events, and associated tests. (Romain Dorgueil)
|
||||
* [core] Moves bonobo.execution context related package to new bonobo.execution.contexts package, also moves bonobo.strategies to new bonobo.execution.strategies package, so everything related to execution is now contained under the bonobo.execution package. (Romain Dorgueil)
|
||||
* Remove the sleep() in tick() that causes a minimum execution time of 2*PERIOD, more explicit status display and a small test case for console plugin. (Romain Dorgueil)
|
||||
* [tests] Fix path usage for python 3.5 (Romain Dorgueil)
|
||||
* Adds a test for default file init command. (Romain Dorgueil)
|
||||
* Adds 3.7-dev target to travis runner. (Romain Dorgueil)
|
||||
* Update requirements with first whistle stable. (Romain Dorgueil)
|
||||
* [core] Refactoring to use an event dispatcher in the main thread. (Romain Dorgueil)
|
||||
* Update to mondrian 0.4a0. (Romain Dorgueil)
|
||||
* Fix imports. (Romain Dorgueil)
|
||||
* Removing old error handler. (Romain Dorgueil)
|
||||
* [errors] Move error handling in transformations to use mondrian. (Romain Dorgueil)
|
||||
* [logging] Switching to mondrian, which got all our formatting code. (Romain Dorgueil)
|
||||
* Adds argument parser support in default template. (Romain Dorgueil)
|
||||
* Adds the ability to initialize a package from bonobo init. (Romain Dorgueil)
|
||||
* Still cleaning up. (Romain Dorgueil)
|
||||
* [examples] comments. (Romain Dorgueil)
|
||||
* Update dependencies, remove python-dotenv. (Romain Dorgueil)
|
||||
* Remove unused argument. (Romain Dorgueil)
|
||||
* Remove files in examples that are not used anymore. (Romain Dorgueil)
|
||||
* Refactoring the runner to go more towards standard python, also adds the ability to use bonobo argument parser from standard python execution. (Romain Dorgueil)
|
||||
* Removes cookiecutter. (Romain Dorgueil)
|
||||
* Switch logger setup to mondrian (deps). (Romain Dorgueil)
|
||||
* Module registry reimported as it is needed for "bonobo convert". (Romain Dorgueil)
|
||||
* [core] Simplification: as truthfully stated by Maik at Pycon.DE sprint «lets try not to turn python into javascript». (Romain Dorgueil)
|
||||
* [core] still refactoring env-related stuff towards using __main__ blocks (but with argparser, if needed). (Romain Dorgueil)
|
||||
* [core] Refactoring of commands to move towards a more pythonic way of running the jobs. Commands are now classes, and bonobo "graph" related commands now hooks into bonobo.run() calls so it will use what you actually put in your __main__ block. (Romain Dorgueil)
|
||||
* Minor test change. (Romain Dorgueil)
|
||||
* [core] Change the token parsing part in anticipation of different flags. (Romain Dorgueil)
|
||||
* Support line-delimited JSON (Michael Penkov)
|
||||
* Update Makefile/setup. (Romain Dorgueil)
|
||||
* [tests] simplify assertion (Romain Dorgueil)
|
||||
* Issue #134: use requests.get as a context manager (Michael Penkov)
|
||||
* Issue #134: use requests instead of urllib (Michael Penkov)
|
||||
* update Projectfile with download entry point (Michael Penkov)
|
||||
* Issue #134: update documentation (Michael Penkov)
|
||||
* Issue #134: add a `bonobo download url` command (Michael Penkov)
|
||||
* commands.run: Enable relative imports in main.py (Stefan Zimmermann)
|
||||
* adapt tutorial "Working with files" to the latest develop version (Peter Uebele)
|
||||
* Add a note about the graph variable (Michael Penkov)
|
||||
* [tests] trying to speed up the init test. (Romain Dorgueil)
|
||||
* [tests] bonobo.util.objects (Romain Dorgueil)
|
||||
* [nodes] Removing draft quality factory from bonobo main package, will live in separate personal package until it is good enough to live here. (Romain Dorgueil)
|
||||
* [tests] rename factory test and move bag detecting so any bag is returned as is as an output. (Romain Dorgueil)
|
||||
* [core] Still refactoring the core behaviour of bags, starting to be much simpler. (Romain Dorgueil)
|
||||
* Fix python 3.5 os.chdir not accepting LocalPath (arimbr)
|
||||
* Remove unused shutil import (arimbr)
|
||||
* Use pytest tmpdir fixture and add more init tests (arimbr)
|
||||
* Check if target directory is empty instead of current directory and remove overwrite_if_exists argument (arimbr)
|
||||
* Remove dispatcher as it is not a dependency, for now, and as such breaks the continuous integration (yes, again.). (Romain Dorgueil)
|
||||
* Remove dispatcher as it is not a dependency, for now, and as such breaks the continuous integration. (Romain Dorgueil)
|
||||
* Code formatting. (Romain Dorgueil)
|
||||
* [core] Testing and fixing new args/kwargs behaviour. (Romain Dorgueil)
|
||||
* [core] simplification of result interpretation. (Romain Dorgueil)
|
||||
* [tests] fix uncaptured output in test_commands (Romain Dorgueil)
|
||||
* Documentation for new behaviour. (Romain Dorgueil)
|
||||
* [django, misc] adds create_or_update to djangos ETLCommand class, adds getitem/setitem/contains dunders to ValueHolder. (Romain Dorgueil)
|
||||
* [core] (..., dict) means Bag(..., **dict) (Romain Dorgueil)
|
||||
* [django, google] Implements basic extensions for django and google oauth systems. (Romain Dorgueil)
|
||||
* Test tweak to work for Windows CI. (cwandrews)
|
||||
* Updated requirements files using edgy-project. (cwandrews)
|
||||
* Updated Projectfile to include python-dotenv dependency. (cwandrews)
|
||||
* Add tests for bonobo init new directory and init within empty directory (arimbr)
|
||||
* Update environment.rst (CW Andrews)
|
||||
* Update environment.rst (CW Andrews)
|
||||
* Cast env_dir to string before passing to load_dotenv as passing a PosixPath to load_dotenv raises an exception in 3.5. (cwandrews)
|
||||
* Updated environment documentation in guides to account for env files. (cwandrews)
|
||||
* Added more tests and moved all env and env file testing to classes (it might make more sense to just move them to separate files?). (cwandrews)
|
||||
* Moved env vars tests to class. (cwandrews)
|
||||
* Updated .env >>> .env_one to include in repo (.env ignored). (cwandrews)
|
||||
* [core] Refactoring IOFormats so there is one and only obvious way to send it. (Romain Dorgueil)
|
||||
* Set cookiecutter overwrite_if_exists parameter to True if current directory is empty (arimbr)
|
||||
* [cli/util] fix requires to use the right stack frame, remove --print as "-" does the job (Romain Dorgueil)
|
||||
* [cli] Adds a --filter option to "convert" command, allowing to use arbitrary filters to a command line conversion. Also adds --print and "-" output to pretty print to terminal instead of file output. (Romain Dorgueil)
|
||||
* [cli] convert, remove useless import. (Romain Dorgueil)
|
||||
* [config] adds a __doc__ constructor kwarg to set option documentation inline. (Romain Dorgueil)
|
||||
* [doc] formatting (Romain Dorgueil)
|
||||
* [cli] adds ability to override reader/writer options from cli convert. (Romain Dorgueil)
|
||||
* comparison to None|True|False should be 'if cond is None:' (mouadhkaabachi)
|
||||
* Fixed bug involved in finding env when running module. (cwandrews)
|
||||
* Moved default-env-file tests to class. (cwandrews)
|
||||
* Small adjustment to test parameters. (cwandrews)
|
||||
* Added tests for running file with combinations of multiple default env files, env files, and env vars. Also reorganized environment directory in examples. (cwandrews)
|
||||
* Updated requirements.txt and requirements-dev.txt to include python-dotenv and dependencies. (cwandrews)
|
||||
* default-env-file, default-env, and env-file now in place alongside env. default-env-file and default-env both use os.environ.setdefault so as not to overwrite existing variables (system environment) while env-file and env will overwrite existing variables. All four allow for multiple values (***How might this affect multiple default-env and default-env-file values, I expect that unlike env-file and env the first passed variables would win). (cwandrews)
|
||||
* Further Refactored the setting of env vars passed via the env flag. (cwandrews)
|
||||
* Refactored setting of env vars passed via the env flag. (cwandrews)
|
||||
@ -1,6 +1,25 @@
|
||||
Changelog
|
||||
=========
|
||||
|
||||
Unreleased
|
||||
::::::::::
|
||||
|
||||
* Cookiecutter usage is removed. Because bonobo now uses either a single file (it is up to you to get python
  imports working as you want) or a regular, fully fledged python package, it is not needed anymore.
|
||||
|
||||
New features
|
||||
------------
|
||||
|
||||
Command line
|
||||
............
|
||||
|
||||
* `bonobo download /examples/datasets/coffeeshops.txt` now downloads the coffeeshops example
|
||||
|
||||
Graphs and Nodes
|
||||
................
|
||||
|
||||
* New `LdjsonReader` and `LdjsonWriter` nodes for handling `line-delimited JSON <https://en.wikipedia.org/wiki/JSON_Streaming>`_.
|
||||
|
||||
v.0.5.0 - 5 october 2017
|
||||
::::::::::::::::::::::::
|
||||
|
||||
|
||||
12
docs/conf.py
@ -1,8 +1,9 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
|
||||
import sys
|
||||
import datetime
|
||||
import os
|
||||
import sys
|
||||
|
||||
sys.path.insert(0, os.path.abspath('..'))
|
||||
sys.path.insert(0, os.path.abspath('_themes'))
|
||||
@ -36,8 +37,8 @@ master_doc = 'index'
|
||||
|
||||
# General information about the project.
|
||||
project = 'Bonobo'
|
||||
copyright = '2012-2017, Romain Dorgueil'
|
||||
author = 'Romain Dorgueil'
|
||||
copyright = '2012-{}, {}'.format(datetime.datetime.now().year, author)
|
||||
|
||||
# The version info for the project you're documenting, acts as replacement for
|
||||
# |version| and |release|, also used in various other places throughout the
|
||||
@ -185,3 +186,10 @@ epub_exclude_files = ['search.html']
|
||||
|
||||
# Example configuration for intersphinx: refer to the Python standard library.
|
||||
intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}
|
||||
|
||||
rst_epilog = """
|
||||
.. |bonobo| replace:: **Bonobo**
|
||||
|
||||
.. |longversion| replace:: v.{version}
|
||||
|
||||
""".format(version=version, )
|
||||
|
||||
@ -4,8 +4,6 @@ Jupyter Extension
|
||||
There is a builtin plugin that integrates (somewhat minimalistically, for now) bonobo within jupyter notebooks, so
|
||||
you can read the execution status of a graph within a nice (ok, not so nice) html/javascript widget.
|
||||
|
||||
See https://github.com/jupyter-widgets/widget-cookiecutter for the base template used.
|
||||
|
||||
Installation
|
||||
::::::::::::
|
||||
|
||||
|
||||
@ -23,25 +23,76 @@ simply to use the optional ``--env`` argument when running bonobo from the shell
|
||||
syntax ``VAR_NAME=VAR_VALUE``. Multiple environment variables can be passed by using multiple ``--env`` / ``-e`` flags
|
||||
(i.e. ``bonobo run --env FIZZ=buzz ...`` and ``bonobo run --env FIZZ=buzz --env Foo=bar ...``). Additionally, in bash
|
||||
you can also set environment variables by listing those you wish to set before the `bonobo run` command with space
|
||||
separating the key-value pairs (i.e. ``FIZZ=buzz bonobo run ...`` or ``FIZZ=buzz FOO=bar bonobo run ...``).
|
||||
separating the key-value pairs (i.e. ``FIZZ=buzz bonobo run ...`` or ``FIZZ=buzz FOO=bar bonobo run ...``). Bonobo can
also read environment variables from local '.env' files, so you do not have to pass each key-value pair
individually at runtime. A strict order of priority is followed when setting environment variables, so
it is advisable to read and understand the order listed below.
|
||||
|
||||
|
||||
The order of priority is from lowest to highest, with the higher one "winning" if set:

1. default values

   ``os.getenv("VARNAME", default_value)``

   The user/writer/creator of the graph is responsible for setting these.

2. ``--default-env-file`` values

   Specifies a file to read default env values from. Each env var in the file is used only if it is not already set in the system environment (system environment vars are not overwritten).

3. ``--default-env`` values

   Works like #2, but the default ``NAME=value`` pairs are passed individually, one per ``--default-env`` flag, rather than gathered from a file.

4. system environment values

   Env vars already set at the system level. Note that env vars passed inline via ``NAME=value bonobo run ...`` fall here in the order of priority.

5. ``--env-file`` values

   Env vars specified here are read from a file like those in #2, but these values take priority over those set at the system level.

6. ``--env`` values

   Env vars set using the ``--env`` / ``-e`` flag work like #3 but take priority over all other env vars.
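
The order of priority above can be sketched in plain Python. This is only an illustration of the documented ordering, not bonobo's actual implementation: ``resolve_env`` and its parameters are hypothetical names, and each source is modelled as a simple dict.

```python
def resolve_env(system_env, default_env_file=None, default_env=None,
                env_file=None, env=None):
    """Return the effective environment, applying sources from lowest
    to highest priority (hypothetical helper, for illustration only)."""
    result = {}
    # Levels 2 and 3: defaults from a file, then defaults passed as flags.
    for source in (default_env_file, default_env):
        result.update(source or {})
    # Level 4: values already set in the system environment win over defaults.
    result.update(system_env)
    # Levels 5 and 6: --env-file, then --env, overwrite whatever is set.
    for source in (env_file, env):
        result.update(source or {})
    return result


effective = resolve_env(
    {"FIZZ": "system"},
    default_env_file={"FIZZ": "file-default", "FOO": "file-default"},
    default_env={"FOO": "flag-default"},
    env_file={"BAR": "env-file"},
    env={"BAR": "env-flag"},
)
# Level 1: a default given at read time only applies if nothing else set the name.
missing = effective.get("MISSING", "fallback")
```

Here ``FIZZ`` keeps its system value, ``FOO`` takes the ``--default-env`` value, and ``BAR`` takes the ``--env`` value.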
|
||||
|
||||
|
||||
|
||||
Examples
|
||||
::::::::
|
||||
|
||||
The Examples below demonstrate setting one or multiple variables using both of these methods:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
# Using one environment variable via --env flag:
|
||||
# Using one environment variable via the --env or --default-env flag:
|
||||
bonobo run csvsanitizer --env SECRET_TOKEN=secret123
|
||||
bonobo run csvsanitizer --default-env SECRET_TOKEN=secret123
|
||||
|
||||
# Using multiple environment variables via -e (env) flag:
|
||||
# Using multiple environment variables via -e (env) and --default-env flags:
|
||||
bonobo run csvsanitizer -e SRC_FILE=inventory.txt -e DST_FILE=inventory_processed.csv
|
||||
|
||||
# Using one environment variable inline (bash only):
|
||||
bonobo run csvsanitizer --default-env SRC_FILE=inventory.txt --default-env DST_FILE=inventory_processed.csv
|
||||
|
||||
# Using one environment variable inline (bash-like shells only):
|
||||
SECRET_TOKEN=secret123 bonobo run csvsanitizer
|
||||
|
||||
# Using multiple environment variables inline (bash only):
|
||||
# Using multiple environment variables inline (bash-like shells only):
|
||||
SRC_FILE=inventory.txt DST_FILE=inventory_processed.csv bonobo run csvsanitizer
|
||||
|
||||
*Though not-yet implemented, the bonobo roadmap includes implementing environment / .env files as well.*
|
||||
|
||||
# Using an env file for default env values:
|
||||
bonobo run csvsanitizer --default-env-file .env
|
||||
|
||||
# Using an env file for env values:
|
||||
bonobo run csvsanitizer --env-file '.env.private'
|
||||
|
||||
|
||||
ENV File Structure
|
||||
::::::::::::::::::
|
||||
|
||||
Env files have a very simple structure: the file should contain only ``NAME=value`` pairs,
one pair per line, as in the example below.
|
||||
|
||||
.. code-block:: text
|
||||
|
||||
# .env
|
||||
|
||||
DB_USER='bonobo'
|
||||
DB_PASS='cicero'
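
To make the format concrete, here is a minimal sketch of how such a file could be read in plain Python. It is not the parser bonobo uses: ``parse_env_lines`` is a hypothetical helper, and stripping the surrounding quotes and skipping ``#`` comment lines are assumptions made for this illustration.

```python
def parse_env_lines(lines):
    """Parse NAME=value lines into a dict (hypothetical helper).

    Blank lines and lines starting with '#' are skipped.
    """
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a NAME=value pair; ignored in this sketch
        value = value.strip()
        # Assumption: matching surrounding quotes are not part of the value.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[name.strip()] = value
    return values


sample = ["# .env", "", "DB_USER='bonobo'", "DB_PASS='cicero'"]
parsed = parse_env_lines(sample)
```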
|
||||
|
||||
|
||||
Accessing Environment Variables from within the Graph Context
|
||||
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
|
||||
|
||||
@ -1,9 +1,8 @@
|
||||
Graphs
|
||||
======
|
||||
|
||||
Graphs are the glue that ties transformations together. It's the only data-structure bonobo can execute directly. Graphs
|
||||
must be acyclic, and can contain as much nodes as your system can handle. Although this number can be rather high in
|
||||
theory, extreme practical cases usually do not exceed hundreds of nodes (and this is already extreme, really).
|
||||
Graphs are the glue that ties transformations together. They are the only data-structure bonobo can execute directly. Graphs
|
||||
must be acyclic, and can contain as many nodes as your system can handle. In theory the number of nodes can be rather high, but practical use cases rarely exceed a few hundred nodes, and even then only in extreme cases.
|
||||
|
||||
|
||||
Definitions
|
||||
@ -50,7 +49,7 @@ Non-linear graphs
|
||||
Divergences / forks
|
||||
-------------------
|
||||
|
||||
To create two or more divergent data streams ("fork"), you should specify `_input` kwarg to `add_chain`.
|
||||
To create two or more divergent data streams ("forks"), you should specify the `_input` kwarg to `add_chain`.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
@ -74,12 +73,12 @@ Resulting graph:
|
||||
"b" -> "f" -> "g";
|
||||
}
|
||||
|
||||
.. note:: Both branch will receive the same data, at the same time.
|
||||
.. note:: Both branches will receive the same data and at the same time.
|
||||
|
||||
Convergences / merges
|
||||
Convergence / merges
|
||||
---------------------
|
||||
|
||||
To merge two data streams ("merge"), you can use the `_output` kwarg to `add_chain`, or use named nodes (see below).
|
||||
To merge two data streams, you can use the `_output` kwarg to `add_chain`, or use named nodes (see below).
|
||||
|
||||
|
||||
.. code-block:: python
|
||||
@ -88,7 +87,7 @@ To merge two data streams ("merge"), you can use the `_output` kwarg to `add_cha
|
||||
|
||||
graph = bonobo.Graph()
|
||||
|
||||
# Here we mark _input to None, so normalize won't get the "begin" impulsion.
|
||||
# Here we set _input to None, so normalize won't start on its own but only after it receives input from the other chains.
|
||||
graph.add_chain(normalize, store, _input=None)
|
||||
|
||||
# Add two different chains
|
||||
@ -122,7 +121,7 @@ Resulting graph:
|
||||
Named nodes
|
||||
:::::::::::
|
||||
|
||||
Using above code to create convergences can lead to hard to read code, because you have to define the "target" stream
|
||||
Using the above code to create convergences often leads to code which is hard to read, because you have to define the "target" stream
before the streams that logically go to the beginning of the transformation graph. To overcome that, one can use
|
||||
"named" nodes:
|
||||
|
||||
@ -194,7 +193,7 @@ You can also run a python module:
|
||||
|
||||
$ bonobo run -m my.own.etlmod
|
||||
|
||||
In each case, bonobo's CLI will look for an instance of :class:`bonobo.Graph` in your file/module, create the plumbery
|
||||
In each case, bonobo's CLI will look for an instance of :class:`bonobo.Graph` in your file/module, create the plumbing
|
||||
needed to execute it, and run it.
|
||||
|
||||
If you're in an interactive terminal context, it will use :class:`bonobo.ext.console.ConsoleOutputPlugin` for display.
|
||||
|
||||
@ -41,7 +41,7 @@ instances.
|
||||
class JoinDatabaseCategories(Configurable):
|
||||
database = Service('orders_database')
|
||||
|
||||
def call(self, database, row):
|
||||
def __call__(self, database, row):
|
||||
return {
|
||||
**row,
|
||||
'category': database.get_category_name_for_sku(row['sku'])
|
||||
|
||||
@ -32,6 +32,100 @@ Iterable
|
||||
Something we can iterate on, in python, so basically anything you'd be able to use in a `for` loop.
|
||||
|
||||
|
||||
Concepts
|
||||
::::::::
|
||||
|
||||
Whatever kind of transformation you want to use, there are a few common concepts you should know about.
|
||||
|
||||
Input
|
||||
-----
|
||||
|
||||
All input is retrieved via the call arguments. Each line of input means one call to the callable provided. Arguments
|
||||
will be, in order:
|
||||
|
||||
* Injected dependencies (database, http, filesystem, ...)
|
||||
* Position based arguments
|
||||
* Keyword based arguments
|
||||
|
||||
You'll see below how to pass each of those.
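
As a rough illustration of that argument order, the sketch below calls a plain function with an injected dependency first, then positional and keyword arguments. ``call_transformation`` and ``describe`` are hypothetical names, and a plain dict stands in for a real service; this is not how bonobo wires things internally.

```python
def call_transformation(func, services, args, kwargs):
    """Call func with injected services first, then positional,
    then keyword arguments (hypothetical helper)."""
    return func(*services, *args, **kwargs)


def describe(database, sku, quantity=1):
    # `database` plays the role of an injected dependency (a plain dict here).
    return "{} x{} ({})".format(sku, quantity, database[sku])


catalog = {"A1": "coffee"}
line = call_transformation(describe, [catalog], ("A1",), {"quantity": 2})
```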
|
||||
|
||||
Output
|
||||
------
|
||||
|
||||
Each callable can return/yield different things (all examples will use yield, but if there is only one output per input
|
||||
line, you can also return your output row and expect the exact same behaviour).
|
||||
|
||||
Let's see the rules (first to match wins).
|
||||
|
||||
1. A flag, optionally followed by something else, marks a special behaviour. If the flag supports it, the remaining part of
   the output line will be interpreted using the same rules, and some flags can be combined.
|
||||
|
||||
**NOT_MODIFIED**
|
||||
|
||||
**NOT_MODIFIED** tells bonobo to use the input row unmodified as the output.
|
||||
|
||||
*CANNOT be combined*
|
||||
|
||||
Example:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo import NOT_MODIFIED
|
||||
|
||||
def output_will_be_same_as_input(*args, **kwargs):
|
||||
yield NOT_MODIFIED
|
||||
|
||||
**APPEND**
|
||||
|
||||
**APPEND** tells bonobo to append this output to the input (positional arguments will equal `input_args + output_args`,
|
||||
keyword arguments will equal `{**input_kwargs, **output_kwargs}`).
|
||||
|
||||
*CAN be combined, but not with itself*
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo import APPEND
|
||||
|
||||
def output_will_be_appended_to_input(*args, **kwargs):
|
||||
yield APPEND, 'foo', 'bar', {'eat_at': 'joe'}
|
||||
|
||||
**LOOPBACK**
|
||||
|
||||
**LOOPBACK** tells bonobo that this output must be looped back into our own input queue, which allows you to create the stream
processing version of recursive algorithms.
|
||||
|
||||
*CAN be combined, but not with itself*
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo import LOOPBACK
|
||||
|
||||
def output_will_be_sent_to_self(*args, **kwargs):
|
||||
yield LOOPBACK, 'Hello, I am the future "you".'
|
||||
|
||||
**CHANNEL(...)**
|
||||
|
||||
**CHANNEL(...)** tells bonobo that this output does not use the default channel and is routed through another path.
|
||||
This is something you should probably not use unless your data flow design is complex, and if you're not certain
|
||||
about it, it probably means that it is not the feature you're looking for.
|
||||
|
||||
*CAN be combined, but not with itself*
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo import CHANNEL
|
||||
|
||||
def output_will_be_sent_to_errors_channel(*args, **kwargs):
|
||||
yield CHANNEL("errors"), 'That is not cool.'
|
||||
|
||||
2. Once all flags are "consumed", the remaining part is interpreted.
|
||||
|
||||
* If it is a :class:`bonobo.Bag` instance, then it's used directly.
|
||||
* If it is a :class:`dict` then a kwargs-only :class:`bonobo.Bag` will be created.
|
||||
* If it is a :class:`tuple` then an args-only :class:`bonobo.Bag` will be created, unless its last element is a
  :class:`dict` in which case an args+kwargs :class:`bonobo.Bag` will be created.
|
||||
* If it's something else, it will be used to create a one-arg-only :class:`bonobo.Bag`.
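
The interpretation rules for the flag-free part, together with the **APPEND** merge described earlier, can be sketched without bonobo as follows. This is an illustration under stated assumptions, not the library's code: ``to_args_kwargs`` and ``append`` are hypothetical names, a plain ``(args, kwargs)`` pair stands in for a :class:`bonobo.Bag`, and the "already a Bag" case is left out.

```python
def to_args_kwargs(output):
    """Turn a flag-free output value into (args, kwargs), first match wins."""
    if isinstance(output, dict):
        # A dict becomes keyword arguments only.
        return (), dict(output)
    if isinstance(output, tuple):
        # A tuple ending with a dict becomes args + kwargs.
        if output and isinstance(output[-1], dict):
            return output[:-1], dict(output[-1])
        return output, {}
    # Anything else becomes a single positional argument.
    return (output,), {}


def append(input_args, input_kwargs, output):
    """Mimic APPEND: input_args + output_args, {**input_kwargs, **output_kwargs}."""
    out_args, out_kwargs = to_args_kwargs(output)
    return input_args + out_args, {**input_kwargs, **out_kwargs}


merged = append(("in",), {"eat_at": "home"}, ("foo", "bar", {"eat_at": "joe"}))
```

With the values above, the output's ``eat_at`` replaces the input's, because the output keyword arguments are merged last.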
|
||||
|
||||
Function based transformations
|
||||
::::::::::::::::::::::::::::::
|
||||
|
||||
@ -112,7 +206,7 @@ can be used as a graph node, then use camelcase names:
|
||||
# configurable
|
||||
class ChangeCase(Configurable):
|
||||
modifier = Option(default='upper')
|
||||
def call(self, s: str) -> str:
|
||||
def __call__(self, s: str) -> str:
|
||||
return getattr(s, self.modifier)()
|
||||
|
||||
# transformation factory
|
||||
|
||||
@ -1,20 +1,39 @@
|
||||
Installation
|
||||
============
|
||||
|
||||
|
||||
Create an ETL project
|
||||
:::::::::::::::::::::
|
||||
|
||||
Creating a project and starting to write code should take less than a minute:
|
||||
First, install the framework:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ pip install --upgrade bonobo cookiecutter
|
||||
$ bonobo init my-etl-project
|
||||
$ bonobo run my-etl-project
|
||||
$ pip install --upgrade bonobo
|
||||
|
||||
Once you bootstrapped a project, you can start editing the default example transformation by editing
|
||||
`my-etl-project/main.py`. Now, you can head to :doc:`tutorial/index`.
|
||||
Create a simple job:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo init my-etl.py
|
||||
|
||||
And let's go for a test drive:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ python my-etl.py
|
||||
|
||||
Congratulations, you ran your first Bonobo ETL job.
|
||||
|
||||
Now, you can head to :doc:`tutorial/index`.
|
||||
|
||||
.. note::
|
||||
|
||||
It's often best to start with a single file then move it into a project
|
||||
(which, in python, needs to live in a package).
|
||||
|
||||
You can read more about this topic in the :doc:`guide/packaging` section,
|
||||
along with pointers on how to move this first file into an existing fully
|
||||
featured python package.
|
||||
|
||||
|
||||
Other installation options
|
||||
@ -29,6 +48,12 @@ You can install it directly from the `Python Package Index <https://pypi.python.
|
||||
|
||||
$ pip install bonobo
|
||||
|
||||
To upgrade an existing installation, use `--upgrade`:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ pip install --upgrade bonobo
|
||||
|
||||
|
||||
Install from source
|
||||
-------------------
|
||||
@ -81,18 +106,29 @@ from the local clone.
|
||||
$ git clone git@github.com:python-bonobo/bonobo.git
|
||||
$ cd bonobo
|
||||
$ pip install --editable .
|
||||
|
||||
|
||||
You can develop on this clone, but you probably want to add your own repository if you want to push code back and make pull requests.
|
||||
I usually name the git remote for the main bonobo repository "upstream", and my own repository "origin".
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
|
||||
$ git remote rename origin upstream
|
||||
$ git remote add origin git@github.com:hartym/bonobo.git
|
||||
$ git fetch --all
|
||||
|
||||
Of course, replace my GitHub username with the one you used to fork bonobo. You should be good to go!
|
||||
|
||||
Preview versions
|
||||
----------------
|
||||
|
||||
Sometimes, there are pre-versions available (before a major release, for example). By default, pip does not target
|
||||
pre-versions to avoid accidental upgrades to potentially unstable software, but you can easily opt in:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ pip install --upgrade --pre bonobo
|
||||
|
||||
|
||||
Supported platforms
|
||||
:::::::::::::::::::
|
||||
|
||||
@ -117,4 +153,3 @@ users.
|
||||
We're trying to look into that but energy available to provide serious support on windows is very limited.
|
||||
|
||||
If you have experience in this domain and you're willing to help, you're more than welcome!
|
||||
|
||||
|
||||
@ -16,16 +16,6 @@ Syntax: `bonobo convert [-r reader] input_filename [-w writer] output_filename`
|
||||
to read from csv and write to csv too (or other format) but adding a geocoder filter that would add some fields.
|
||||
|
||||
|
||||
Bonobo Init
|
||||
:::::::::::
|
||||
|
||||
Create an empty project, ready to use bonobo.
|
||||
|
||||
Syntax: `bonobo init`
|
||||
|
||||
Requires `cookiecutter`.
|
||||
|
||||
|
||||
Bonobo Inspect
|
||||
::::::::::::::
|
||||
|
||||
|
||||
@ -1,54 +0,0 @@
|
||||
Internal roadmap notes
|
||||
======================
|
||||
|
||||
Things that should be thought about and/or implemented, but that I don't know where to store.
|
||||
|
||||
Graph and node level plugins
|
||||
::::::::::::::::::::::::::::
|
||||
|
||||
* Enhancers or node-level plugins
|
||||
* Graph level plugins
|
||||
* Documentation
|
||||
|
||||
Command line interface and environment
|
||||
::::::::::::::::::::::::::::::::::::::
|
||||
|
||||
* How do we manage environment ? .env ?
|
||||
* How do we configure plugins ?
|
||||
|
||||
Services and Processors
|
||||
:::::::::::::::::::::::
|
||||
|
||||
* ContextProcessors not clean (a bit better, but still not in love with the api)
|
||||
|
||||
Next...
|
||||
:::::::
|
||||
|
||||
* Release process specialised for bonobo. With changelog production, etc.
|
||||
* Document how to upgrade version, like, minor need change badges, etc.
|
||||
* Windows console looks crappy.
|
||||
* bonobo init --with sqlalchemy,docker; cookiecutter?
|
||||
* logger, vebosity level
|
||||
|
||||
|
||||
External libs that looks good
|
||||
:::::::::::::::::::::::::::::
|
||||
|
||||
* dask.distributed
|
||||
* mediator (event dispatcher)
|
||||
|
||||
Version 0.4
|
||||
:::::::::::
|
||||
|
||||
* SQLAlchemy 101
|
||||
|
||||
Design decisions
|
||||
::::::::::::::::
|
||||
|
||||
* initialize / finalize better than start / stop ?
|
||||
|
||||
Minor stuff
|
||||
:::::::::::
|
||||
|
||||
* Should we include datasets in the repo or not? As they may change, grow, and even eventually have licenses we can't use,
|
||||
it's probably best if we don't.
|
||||
258
docs/tutorial/1-init.rst
Normal file
@ -0,0 +1,258 @@
|
||||
Part 1: Let's get started!
|
||||
==========================
|
||||
|
||||
To get started with |bonobo|, you need to install it in a working python 3.5+ environment (you should use a
|
||||
`virtualenv <https://virtualenv.pypa.io/>`_).
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ pip install bonobo
|
||||
|
||||
Check that the installation worked, and that you're using a version that matches this tutorial (written for bonobo
|
||||
|longversion|).
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo version
|
||||
|
||||
See :doc:`/install` for more options.
|
||||
|
||||
|
||||
Create an ETL job
|
||||
:::::::::::::::::
|
||||
|
||||
Since Bonobo 0.6, it's easy to bootstrap a simple ETL job using just one file.
|
||||
|
||||
We'll start here, and the later stages of the tutorial will guide you toward refactoring this to a python package.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo init tutorial.py
|
||||
|
||||
This will create a simple job in a `tutorial.py` file. Let's run it:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ python tutorial.py
|
||||
Hello
|
||||
World
|
||||
- extract in=1 out=2 [done]
|
||||
- transform in=2 out=2 [done]
|
||||
- load in=2 [done]
|
||||
|
||||
If you have a similar result, then congratulations! You just ran your first |bonobo| ETL job.
|
||||
|
||||
|
||||
Inspect your graph
|
||||
::::::::::::::::::
|
||||
|
||||
The basic building blocks of |bonobo| are **transformations** and **graphs**.
|
||||
|
||||
**Transformations** are simple python callables (like functions) that handle a transformation step for a line of data.
|
||||
|
||||
**Graphs** are a set of transformations, with directional links between them to define the data-flow that will happen
|
||||
at runtime.
|
||||
|
||||
To inspect the graph of your first transformation (you must install graphviz first to do so), run:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ bonobo inspect --graph tutorial.py | dot -Tpng -o tutorial.png
|
||||
|
||||
Open the generated `tutorial.png` file to have a quick look at the graph.
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
rankdir = LR;
|
||||
"BEGIN" [shape="point"];
|
||||
"BEGIN" -> {0 [label="extract"]};
|
||||
{0 [label="extract"]} -> {1 [label="transform"]};
|
||||
{1 [label="transform"]} -> {2 [label="load"]};
|
||||
}
|
||||
|
||||
This makes the structure of your graph easy to see. For such a simple graph, it's pretty much useless, but as
|
||||
you write more complex transformations, it will be helpful.
|
||||
|
||||
|
||||
Read the Code
|
||||
:::::::::::::
|
||||
|
||||
Before we write our own job, let's look at the code we have in `tutorial.py`.
|
||||
|
||||
|
||||
Import
|
||||
------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import bonobo
|
||||
|
||||
|
||||
The highest level APIs of |bonobo| are all contained within the top level **bonobo** namespace.
|
||||
|
||||
If you're a beginner with the library, stick to using only those APIs (they also are the most stable APIs).
|
||||
|
||||
If you're an advanced user (and you'll be one quite soon), you can safely use second level APIs.
|
||||
|
||||
The third level APIs are considered private, and you should not use them unless you're hacking on |bonobo| directly.
|
||||
|
||||
|
||||
Extract
|
||||
-------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def extract():
|
||||
yield 'hello'
|
||||
yield 'world'
|
||||
|
||||
This is a first transformation, written as a python generator, that will send some strings, one after the other, to its
|
||||
output.
|
||||
|
||||
Transformations that take no input and yield a variable number of outputs are usually called **extractors**. You'll
|
||||
encounter a few different types, either purely generating the data (like here), using an external service (a
|
||||
database, for example) or using some filesystem (which is considered an external service too).
|
||||
|
||||
Extractors do not need to have their input connected to anything, and will be called exactly once when the graph is
|
||||
executed.
|
||||
|
||||
|
||||
Transform
|
||||
---------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def transform(*args):
|
||||
yield tuple(
|
||||
map(str.title, args)
|
||||
)
|
||||
|
||||
This is a second transformation. It will get called a bunch of times, once for each input row it gets, and apply some
|
||||
logic on the input to generate the output.
|
||||
|
||||
This is the most **generic** case. You can generate zero, one or many lines of output for each line
|
||||
of input.
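A minimal sketch of that generic case, in plain Python and independent of the generated `tutorial.py`: `split_words` is a hypothetical transformation, written as a generator, that yields zero, one or many output rows depending on its input.

```python
def split_words(line):
    """Yield one output row per word found in the input row."""
    for word in line.split():
        yield word


print(list(split_words('')))             # zero rows: []
print(list(split_words('hello')))        # one row: ['hello']
print(list(split_words('hello world')))  # many rows: ['hello', 'world']
```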
|
||||
|
||||
|
||||
Load
|
||||
----
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def load(*args):
|
||||
print(*args)
|
||||
|
||||
This is the third and last transformation in our "hello world" example. It will apply some logic to each row, and have
|
||||
absolutely no output.
|
||||
|
||||
Transformations that take input and yield nothing are also called **loaders**. Like extractors, you'll encounter
|
||||
different types, to work with various external systems.
|
||||
|
||||
Please note that as a convenience, and because the cost is marginal, most builtin `loaders` will send their
|
||||
inputs to their output unmodified, so you can easily chain more than one loader, or apply more transformations after a
|
||||
given loader.
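The pass-through behaviour can be pictured with a small sketch. It assumes nothing about how the builtin loaders are written: `make_loader` is a hypothetical factory, and the nested loops stand in for the chaining that the graph would normally do.

```python
def make_loader(sink):
    """Build a hypothetical loader that stores rows in `sink` and passes them on."""
    def load(*args):
        sink.append(args)  # the "load" side effect
        yield args         # forward the input unmodified
    return load


first, second = [], []
load_a = make_loader(first)
load_b = make_loader(second)

# Chain the two loaders by hand: whatever load_a forwards is fed to load_b.
for row in [('Hello',), ('World',)]:
    for passed in load_a(*row):
        for _ in load_b(*passed):
            pass

print(first == second)  # True
```

Because each loader forwards what it received, the second loader sees exactly the same rows as the first.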
|
||||
|
||||
|
||||
Graph Factory
|
||||
-------------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def get_graph(**options):
|
||||
graph = bonobo.Graph()
|
||||
graph.add_chain(extract, transform, load)
|
||||
return graph
|
||||
|
||||
All our transformations were defined above, but nothing ties them together, for now.
|
||||
|
||||
This "graph factory" function is in charge of the creation and configuration of a :class:`bonobo.Graph` instance, that
|
||||
will be executed later.
|
||||
|
||||
By no means is |bonobo| limited to simple graphs like this one. You can add as many chains as you want, and each chain
|
||||
can contain as many nodes as you want.
|
||||
|
||||
|
||||
Services Factory
|
||||
----------------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def get_services(**options):
|
||||
return {}
|
||||
|
||||
This is the "services factory", that we'll use later to connect to external systems. Let's skip this one, for now.
|
||||
|
||||
(we'll dive into this topic in :doc:`4-services`)
|
||||
|
||||
|
||||
Main Block
|
||||
----------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = bonobo.get_argument_parser()
|
||||
with bonobo.parse_args(parser) as options:
|
||||
bonobo.run(
|
||||
get_graph(**options),
|
||||
services=get_services(**options)
|
||||
)
|
||||
|
||||
Here, the real thing happens.
|
||||
|
||||
Without diving into too much detail for now, using the :func:`bonobo.parse_args` context manager will allow our job to
|
||||
be configurable later, and although we don't really need it right now, it does no harm either.
|
||||
|
||||
Reading the output
|
||||
::::::::::::::::::
|
||||
|
||||
Let's run this job once again:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ python tutorial.py
|
||||
Hello
|
||||
World
|
||||
- extract in=1 out=2 [done]
|
||||
- transform in=2 out=2 [done]
|
||||
- load in=2 [done]
|
||||
|
||||
The console output contains two things.
|
||||
|
||||
* First, it contains the real output of your job (what was :func:`print`-ed to `sys.stdout`).
|
||||
* Second, it displays the execution status (on `sys.stderr`). Each line contains a "status" character, the node name,
|
||||
numbers and a human readable status. This status evolves in real time, and lets you follow a job's progress
|
||||
while it's running.
|
||||
|
||||
* Status character:
|
||||
|
||||
* “ ” means that the node was not yet started.
|
||||
* “`-`” means that the node finished its execution.
|
||||
* “`+`” means that the node is currently running.
|
||||
* “`!`” means that the node had problems running.
|
||||
|
||||
* Numerical statistics:
|
||||
|
||||
* “`in=...`” shows the input lines count, also known as the number of calls to your transformation.
|
||||
* “`out=...`” shows the output lines count.
|
||||
* “`read=...`” shows the count of reads applied to an external system, if the transformation supports it.
|
||||
* “`write=...`” shows the count of writes applied to an external system, if the transformation supports it.
|
||||
* “`err=...`” shows the count of exceptions that happened while running the transformation. Note that an exception will abort
|
||||
a call, but the execution will move to the next row.
|
||||
|
||||
|
||||
Wrap up
|
||||
:::::::
|
||||
|
||||
That's all for this first step.
|
||||
|
||||
You now know:
|
||||
|
||||
* How to create a new job (using a single file).
|
||||
* How to inspect the content of a job.
|
||||
* What should go in a job file.
|
||||
* How to execute a job file.
|
||||
* How to read the console output.
|
||||
|
||||
It's now time to jump to :doc:`2-jobs`.
|
||||
66
docs/tutorial/2-jobs.rst
Normal file
@ -0,0 +1,66 @@
|
||||
Part 2: Writing ETL Jobs
|
||||
========================
|
||||
|
||||
What's an ETL job ?
|
||||
:::::::::::::::::::
|
||||
|
||||
In |bonobo|, an ETL job is a formal definition of an executable graph.
|
||||
|
||||
Each node of a graph will be executed in isolation from the other nodes, and the data is passed from one node to the
|
||||
next using FIFO queues, managed by the framework. It's transparent to the end-user, though, and you'll only use
|
||||
function arguments (for inputs) and return/yield values (for outputs).
|
||||
|
||||
Each input row of a node will cause one call to this node's callable. Each output is cast internally as a tuple-like
|
||||
data structure (or more precisely, a namedtuple-like data structure), and for one given node, each output row must
|
||||
have the same structure.
|
||||
|
||||
If you return/yield something which is not a tuple, bonobo will create a tuple of one element.
|
||||
|
||||
Properties
|
||||
----------
|
||||
|
||||
|bonobo| assists you with defining the data-flow of your data engineering process, and then streams data through your
|
||||
callable graphs.
|
||||
|
||||
* Each node call will process one row of data.
|
||||
* Queues that carry the data between nodes are first-in, first-out (FIFO) standard python :class:`queue.Queue` instances.
|
||||
* Each node will run in parallel.
|
||||
* The default execution strategy uses threading, and each node will run in a separate thread.
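To make these properties concrete, here is a rough standard-library sketch of nodes running in separate threads and exchanging rows through :class:`queue.Queue` objects. It is not how |bonobo| is implemented: `END`, `node_worker` and `run_chain` are hypothetical names, and the real framework handles many more concerns (statistics, errors, services).

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker


def node_worker(func, inbox, outbox):
    """Call func once per input row and forward each output row."""
    while True:
        row = inbox.get()
        if row is END:
            outbox.put(END)  # propagate the end of the stream
            return
        for out in func(*row):
            # non-tuple outputs become one-element tuples
            outbox.put(out if isinstance(out, tuple) else (out,))


def run_chain(rows, *funcs):
    """Run generator functions as a chain of threads linked by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=node_worker, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()
    for row in rows:
        queues[0].put(row)
    queues[0].put(END)
    for thread in threads:
        thread.join()
    # Drain the last queue up to the end marker.
    results = []
    while True:
        item = queues[-1].get()
        if item is END:
            return results
        results.append(item)


def title(s):
    yield s.title()


def exclaim(s):
    yield s + '!'


result = run_chain([('hello',), ('world',)], title, exclaim)
print(result)  # [('Hello!',), ('World!',)]
```

Row order is preserved here because each node has a single worker thread reading from a FIFO queue.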
|
||||
|
||||
Fault tolerance
|
||||
---------------
|
||||
|
||||
Node execution is fault tolerant.
|
||||
|
||||
If an exception is raised from a node call, then this node call will be aborted but bonobo will continue the execution
|
||||
with the next row (after outputting the stack trace and incrementing the "err" counter for the node context).
|
||||
|
||||
This allows ETL jobs to ignore faulty data and try their best to process the valid rows of a dataset.
|
||||
|
||||
Some errors are fatal, though.
|
||||
|
||||
If you pass a 2-element tuple to a node that takes 3 args, |bonobo| will raise a :class:`bonobo.errors.UnrecoverableTypeError`, and exit the
|
||||
current graph execution as fast as it can (finishing the other node executions that are in progress first, but not
|
||||
starting new ones if there are remaining input rows).
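The recoverable case can be sketched with the standard library alone. This is an illustration of the behaviour described above, not |bonobo| code: `run_node` and its `stats` dictionary are hypothetical, and fatal errors are deliberately left out.

```python
import traceback


def run_node(func, rows):
    """Call func once per row; a failing call is counted and skipped."""
    stats = {'in': 0, 'out': 0, 'err': 0}
    outputs = []
    for row in rows:
        stats['in'] += 1
        try:
            outputs.append(func(*row))
            stats['out'] += 1
        except Exception:
            # Abort this call only: report the error, then move to the next row.
            traceback.print_exc()
            stats['err'] += 1
    return outputs, stats


outputs, stats = run_node(int, [('1',), ('oops',), ('3',)])
print(outputs)  # [1, 3]
print(stats)    # {'in': 3, 'out': 2, 'err': 1}
```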
|
||||
|
||||
|
||||
Let's write a sample data integration job
|
||||
:::::::::::::::::::::::::::::::::::::::::
|
||||
|
||||
Let's create a sample application.
|
||||
|
||||
The goal of this application will be to extract all the fablabs in the world using an open-data API, normalize this
|
||||
data and, for now, display it. We'll then build on this foundation in the next steps to write to files, databases, etc.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Moving forward
|
||||
::::::::::::::
|
||||
|
||||
You now know:
|
||||
|
||||
* How to ...
|
||||
|
||||
**Next: :doc:`3-files`**
|
||||
22
docs/tutorial/3-files.rst
Normal file
@ -0,0 +1,22 @@
|
||||
Part 3: Working with Files
|
||||
==========================
|
||||
|
||||
* Filesystems
|
||||
|
||||
* Reading files
|
||||
|
||||
* Writing files
|
||||
|
||||
* Writing files to S3
|
||||
|
||||
* Atomic writes ???
|
||||
|
||||
|
||||
Moving forward
|
||||
::::::::::::::
|
||||
|
||||
You now know:
|
||||
|
||||
* How to ...
|
||||
|
||||
**Next: :doc:`4-services`**
|
||||
207
docs/tutorial/4-services.rst
Normal file
@ -0,0 +1,207 @@
|
||||
Part 4: Services and Configurables
|
||||
==================================
|
||||
|
||||
|
||||
In the last section, we used a few new tools.
|
||||
|
||||
Class-based transformations and configurables
|
||||
:::::::::::::::::::::::::::::::::::::::::::::
|
||||
|
||||
Bonobo is a bit dumb. If something is callable, it considers it can be used as a transformation, and it's up to the
|
||||
user to provide callables that logically fit in a graph.
|
||||
|
||||
You can use plain python objects with a `__call__()` method, and it will just work.
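For instance, the following plain class needs nothing from |bonobo| to be callable. `Suffixer` is a hypothetical example name; any object with a `__call__()` method would do.

```python
class Suffixer:
    """A plain python object usable wherever a callable is expected."""

    def __init__(self, suffix):
        self.suffix = suffix

    def __call__(self, row):
        return row + self.suffix


add_bang = Suffixer('!')
print(add_bang('hello'))  # hello!
```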
|
||||
|
||||
As a lot of transformations need common machinery, there are a few tools to quickly build transformations, most of
|
||||
them requiring your class to subclass :class:`bonobo.config.Configurable`.
|
||||
|
||||
Configurables give you access to the following features:
|
||||
|
||||
* You can add **Options** (using the :class:`bonobo.config.Option` descriptor). Options can be positional, or keyword
|
||||
based, can have a default value and will be consumed from the constructor arguments.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo.config import Configurable, Option
|
||||
|
||||
class PrefixIt(Configurable):
|
||||
prefix = Option(str, positional=True, default='>>>')
|
||||
|
||||
def __call__(self, row):
|
||||
return self.prefix + ' ' + row
|
||||
|
||||
prefixer = PrefixIt('$')
|
||||
|
||||
* You can add **Services** (using the :class:`bonobo.config.Service` descriptor). Services are a subclass of
|
||||
:class:`bonobo.config.Option`, sharing the same basics, but specialized in the definition of "named services" that
|
||||
will be resolved at runtime (that is, for which we will provide an implementation at runtime). We'll dive more into that
|
||||
in the next section.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo.config import Configurable, Option, Service
|
||||
|
||||
class HttpGet(Configurable):
|
||||
url = Option(default='https://jsonplaceholder.typicode.com/users')
|
||||
http = Service('http.client')
|
||||
|
||||
def __call__(self, http):
|
||||
resp = http.get(self.url)
|
||||
|
||||
for row in resp.json():
|
||||
yield row
|
||||
|
||||
http_get = HttpGet()
|
||||
|
||||
|
||||
* You can add **Methods** (using the :class:`bonobo.config.Method` descriptor). :class:`bonobo.config.Method` is a
|
||||
subclass of :class:`bonobo.config.Option` that allows passing callable parameters, either to the class constructor,
|
||||
or using the class as a decorator.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo.config import Configurable, Method
|
||||
|
||||
class Applier(Configurable):
|
||||
apply = Method()
|
||||
|
||||
def __call__(self, row):
|
||||
return self.apply(row)
|
||||
|
||||
@Applier
|
||||
def Prefixer(self, row):
|
||||
return 'Hello, ' + row
|
||||
|
||||
prefixer = Prefixer()
|
||||
|
||||
* You can add **ContextProcessors**, which are an advanced feature we won't introduce here. If you're familiar with
|
||||
pytest, you can think of them as pytest fixtures, execution wise.
|
||||
|
||||
Services
|
||||
::::::::
|
||||
|
||||
The motivation behind services is mostly separation of concerns, testability and deployability.
|
||||
|
||||
Usually, your transformations will depend on services (like a filesystem, an http client, a database, a rest api, ...).
|
||||
Those services can very well be hardcoded in the transformations, but there are two main drawbacks:
|
||||
|
||||
* You won't be able to change the implementation depending on the current environment (development laptop versus
|
||||
production servers, bug-hunting session versus execution, etc.)
|
||||
* You won't be able to test your transformations without testing the associated services.
|
||||
|
||||
To overcome those caveats of hardcoding things, we define Services in the configurable, which are basically
|
||||
string-options of the service names, and we provide an implementation at the last moment possible.
|
||||
|
||||
There are two ways of providing implementations:
|
||||
|
||||
* Either file-wide, by providing a `get_services()` function that returns a dict of named implementations (we did so
|
||||
with filesystems in the previous step, :doc:`tut02`)
|
||||
* Or directory-wide, by providing a `get_services()` function in a specially named `_services.py` file.
|
||||
|
||||
The first is simpler if you only have one transformation graph in one file, the second lets you group coherent
|
||||
transformations together in a directory and share the implementations.
|
||||
|
||||
Let's see how to use it, starting from the previous service example:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo.config import Configurable, Option, Service
|
||||
|
||||
class HttpGet(Configurable):
|
||||
url = Option(default='https://jsonplaceholder.typicode.com/users')
|
||||
http = Service('http.client')
|
||||
|
||||
def __call__(self, http):
|
||||
resp = http.get(self.url)
|
||||
|
||||
for row in resp.json():
|
||||
yield row
|
||||
|
||||
We defined an "http.client" service, that obviously should have a `get()` method, returning responses that have a
|
||||
`json()` method.
|
||||
|
||||
Let's provide two implementations for that. The first one will be using `requests <http://docs.python-requests.org/>`_,
|
||||
which coincidentally satisfies the described interface:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
import bonobo
|
||||
import requests
|
||||
|
||||
def get_services():
|
||||
return {
|
||||
'http.client': requests
|
||||
}
|
||||
|
||||
graph = bonobo.Graph(
|
||||
HttpGet(),
|
||||
print,
|
||||
)
|
||||
|
||||
If you run this code, you should see some mock data returned by the webservice we called (assuming it's up and you can
|
||||
reach it).
|
||||
|
||||
Now, the second implementation will replace that with a mock, used for testing purposes:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
class HttpResponseStub:
|
||||
def json(self):
|
||||
return [
|
||||
{'id': 1, 'name': 'Leanne Graham', 'username': 'Bret', 'email': 'Sincere@april.biz', 'address': {'street': 'Kulas Light', 'suite': 'Apt. 556', 'city': 'Gwenborough', 'zipcode': '92998-3874', 'geo': {'lat': '-37.3159', 'lng': '81.1496'}}, 'phone': '1-770-736-8031 x56442', 'website': 'hildegard.org', 'company': {'name': 'Romaguera-Crona', 'catchPhrase': 'Multi-layered client-server neural-net', 'bs': 'harness real-time e-markets'}},
|
||||
{'id': 2, 'name': 'Ervin Howell', 'username': 'Antonette', 'email': 'Shanna@melissa.tv', 'address': {'street': 'Victor Plains', 'suite': 'Suite 879', 'city': 'Wisokyburgh', 'zipcode': '90566-7771', 'geo': {'lat': '-43.9509', 'lng': '-34.4618'}}, 'phone': '010-692-6593 x09125', 'website': 'anastasia.net', 'company': {'name': 'Deckow-Crist', 'catchPhrase': 'Proactive didactic contingency', 'bs': 'synergize scalable supply-chains'}},
|
||||
]
|
||||
|
||||
class HttpStub:
|
||||
def get(self, url):
|
||||
return HttpResponseStub()
|
||||
|
||||
def get_services():
|
||||
return {
|
||||
'http.client': HttpStub()
|
||||
}
|
||||
|
||||
graph = bonobo.Graph(
|
||||
HttpGet(),
|
||||
print,
|
||||
)
|
||||
|
||||
Since the `Graph` definition stays exactly the same, you can easily substitute the `_services.py` file depending on your
|
||||
environment (how you do this is outside bonobo's scope and heavily depends on your usual way of managing
|
||||
configuration files on different platforms).
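The idea of naming a dependency and supplying the implementation at the last moment can be shown without |bonobo| at all. The sketch below is only an analogy for what the framework does: `call_with_services` is a hypothetical helper, the stub classes are simplified stand-ins for the ones above, and the URL is a placeholder.

```python
class StubResponse:
    def json(self):
        return [{'id': 1}, {'id': 2}]


class StubHttp:
    def get(self, url):
        return StubResponse()


def http_get(http):
    """Transformation body that only knows the service's interface."""
    for row in http.get('https://example.com/users').json():
        yield row


def call_with_services(func, services, *names):
    """Look up each named service and pass it to func as a positional argument."""
    return func(*(services[name] for name in names))


# Swapping this dict is all it takes to change the implementation.
services = {'http.client': StubHttp()}
rows = list(call_with_services(http_get, services, 'http.client'))
print(rows)  # [{'id': 1}, {'id': 2}]
```

The transformation never imports a concrete HTTP client, which is what makes it testable with a stub.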
|
||||
|
||||
Starting with bonobo 0.5 (not yet released), you will be able to use service injections with function-based
|
||||
transformations too, using the `bonobo.config.requires` decorator to mark a dependency.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo.config import requires
|
||||
|
||||
@requires('http.client')
|
||||
def http_get(http):
|
||||
resp = http.get('https://jsonplaceholder.typicode.com/users')
|
||||
|
||||
for row in resp.json():
|
||||
yield row
|
||||
|
||||
|
||||
Read more
|
||||
:::::::::
|
||||
|
||||
* :doc:`/guide/services`
|
||||
* :doc:`/reference/api_config`
|
||||
|
||||
Next
|
||||
::::
|
||||
|
||||
:doc:`tut04`.
|
||||
|
||||
|
||||
Moving forward
|
||||
::::::::::::::
|
||||
|
||||
You now know:
|
||||
|
||||
* How to ...
|
||||
|
||||
**Next: :doc:`5-packaging`**
|
||||
28
docs/tutorial/5-packaging.rst
Normal file
@ -0,0 +1,28 @@
|
||||
Part 5: Projects and Packaging
|
||||
==============================
|
||||
|
||||
Until now, we have worked with a single file managing a job.
|
||||
|
||||
Real life often involves more complicated setups, with relations and imports between different files.
|
||||
|
||||
This section will describe the options available to move this file into a package, either a new one or something
|
||||
that already exists in your own project.
|
||||
|
||||
Data processing is something a wide variety of tools may want to include, and thus |bonobo| does not enforce any
|
||||
kind of project structure, as the target structure will be dictated by the hosting project. For example, a `pipelines`
|
||||
sub-package would perfectly fit a django or flask project, or even a regular package, but it's up to you to choose the
|
||||
structure of your project.
|
||||
|
||||
about using |bonobo| in a pyt
|
||||
is about set of jobs working together within a project.
|
||||
|
||||
Let's see how to move from the current status to a package.
|
||||
|
||||
|
||||
Moving forward
|
||||
::::::::::::::
|
||||
|
||||
You now know:
|
||||
|
||||
* How to ...
|
||||
|
||||
3
docs/tutorial/django.rst
Normal file
@ -0,0 +1,3 @@
|
||||
Working with Django
|
||||
===================
|
||||
|
||||
@ -1,9 +1,6 @@
|
||||
First steps
|
||||
===========
|
||||
|
||||
What is Bonobo?
|
||||
:::::::::::::::
|
||||
|
||||
Bonobo is an ETL (Extract-Transform-Load) framework for python 3.5. The goal is to define data-transformations, with
|
||||
python code in charge of handling similar shaped independent lines of data.
|
||||
|
||||
@ -14,50 +11,45 @@ Bonobo is a lean manufacturing assembly line for data that let you focus on the
|
||||
|
||||
Bonobo uses simple python and should be quick and easy to learn.
|
||||
|
||||
Tutorial
|
||||
::::::::
|
||||
|
||||
.. note::
|
||||
|
||||
Good documentation is not easy to write. We do our best to make it better and better.
|
||||
|
||||
Although all content here should be accurate, you may feel a lack of completeness, for which we plead guilty and
|
||||
apologize.
|
||||
|
||||
If you're stuck, please come and ask on our `slack channel <https://bonobo-slack.herokuapp.com/>`_, we'll figure
|
||||
something out.
|
||||
|
||||
If you're not stuck but had trouble understanding something, please consider contributing to the docs (via GitHub
|
||||
pull requests).
|
||||
**Tutorials**
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:maxdepth: 1
|
||||
|
||||
tut01
|
||||
tut02
|
||||
tut03
|
||||
tut04
|
||||
1-init
|
||||
2-jobs
|
||||
3-files
|
||||
4-services
|
||||
5-packaging
|
||||
|
||||
|
||||
What's next?
|
||||
::::::::::::
|
||||
**Integrations**
|
||||
|
||||
Read a few examples
|
||||
-------------------
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
* :doc:`../reference/examples`
|
||||
django
|
||||
notebooks
|
||||
sqlalchemy
|
||||
|
||||
Read about best development practices
|
||||
-------------------------------------
|
||||
**What's next?**
|
||||
|
||||
* :doc:`../guide/index`
|
||||
* :doc:`../guide/purity`
|
||||
Once you're familiar with all the base concepts, you can...
|
||||
|
||||
Read about integrating external tools with bonobo
|
||||
-------------------------------------------------
|
||||
* Read the :doc:`Guides </guide/index>` to have a deep dive in each concept.
|
||||
* Explore the :doc:`Extensions </extension/index>` to widen the possibilities.
|
||||
* Open the :doc:`References </reference/index>` and start hacking like crazy.
|
||||
|
||||
* :doc:`../extension/docker`: run transformation graphs in isolated containers.
|
||||
* :doc:`../extension/jupyter`: run transformations within jupyter notebooks.
|
||||
* :doc:`../extension/selenium`: crawl the web using a real browser and work with the gathered data.
|
||||
* :doc:`../extension/sqlalchemy`: everything you need to interract with SQL databases.
|
||||
**You're not alone!**
|
||||
|
||||
Good documentation is not easy to write.
|
||||
|
||||
Although all content here should be accurate, you may feel a lack of completeness, for which we plead guilty and
|
||||
apologize.
|
||||
|
||||
If you're stuck, please come to the `Bonobo Slack Channel <https://bonobo-slack.herokuapp.com/>`_ and we'll figure it
|
||||
out.
|
||||
|
||||
If you're not stuck but had trouble understanding something, please consider contributing to the docs (using GitHub
|
||||
pull requests).
|
||||
|
||||
|
||||
4
docs/tutorial/notebooks.rst
Normal file
@ -0,0 +1,4 @@
|
||||
Working with Jupyter Notebooks
|
||||
==============================
|
||||
|
||||
|
||||
@ -1,11 +0,0 @@
|
||||
Just enough Python for Bonobo
|
||||
=============================
|
||||
|
||||
.. todo::
|
||||
|
||||
This is a work in progress and it is not yet available. Please come back later or even better, help us write this
|
||||
guide!
|
||||
|
||||
This guide is intended to help programmers or enthusiasts to grasp the python basics necessary to use Bonobo. It
|
||||
should definately not be considered as a general python introduction, neither a deep dive into details.
|
||||
|
||||
4
docs/tutorial/sqlalchemy.rst
Normal file
@ -0,0 +1,4 @@
|
||||
Working with SQL Databases
|
||||
==========================
|
||||
|
||||
|
||||
@ -1,8 +1,7 @@
|
||||
Let's get started!
|
||||
==================
|
||||
|
||||
To begin with Bonobo, you need to install it in a working python 3.5+ environment, and you'll also need cookiecutter
|
||||
to bootstrap your project.
|
||||
To get started with Bonobo, you need to install it in a working python 3.5+ environment:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
@@ -14,21 +13,24 @@ See :doc:`/install` for more options.
 Create an empty project
 :::::::::::::::::::::::

-Your ETL code will live in ETL projects, which are basically a bunch of files, including python code, that bonobo
-can run.
+Your ETL code will live in standard python files and packages.

 .. code-block:: shell-session

-    $ bonobo init tutorial
+    $ bonobo create tutorial.py

-This will create a `tutorial` directory (`content description here <https://www.bonobo-project.org/with/cookiecutter>`_).
+This will create a simple example job in a `tutorial.py` file.

-To run this project, use:
+Now, try to execute it:

 .. code-block:: shell-session

-    $ bonobo run tutorial
+    $ python tutorial.py

+Congratulations, you just ran your first ETL job!
+
+
+.. todo:: XXX **CHANGES NEEDED BELOW THIS POINTS BEFORE 0.6** XXX

 Write a first transformation
 ::::::::::::::::::::::::::::
@@ -105,6 +107,9 @@ To do this, it needs to know what data-flow you want to achieve, and you'll use
 The `if __name__ == '__main__':` section is not required, unless you want to run it directly using the python
 interpreter.

+The name of the `graph` variable is arbitrary, but this variable must be global and available unconditionally.
+Do not put it in its own function or in the `if __name__ == '__main__':` section.
+

 Execute the job
 :::::::::::::::
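The note added in the hunk above says the `graph` variable must be global and defined unconditionally. A minimal sketch of why that matters, assuming a runner that loads the job file under a module name other than `__main__` (the `find_graph` helper and the plain list standing in for a graph are hypothetical and are not part of Bonobo's API):

```python
import os
import runpy
import tempfile

# Two hypothetical job files. A plain list stands in for a real graph object.
GOOD = "graph = ['extract', 'transform', 'load']\n"
BAD = (
    "if __name__ == '__main__':\n"
    "    graph = ['extract', 'transform', 'load']\n"
)


def find_graph(source):
    """Execute `source` as a module that is not `__main__` and return its global `graph`, if any."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "job.py")
        with open(path, "w") as handle:
            handle.write(source)
        # run_name="job" means the `if __name__ == '__main__':` guard is false.
        namespace = runpy.run_path(path, run_name="job")
    return namespace.get("graph")


good_graph = find_graph(GOOD)  # defined at module level, so it is found
bad_graph = find_graph(BAD)    # hidden behind the guard, so it is missing

print(good_graph)
print(bad_graph)
```

Under this assumption, only the module-level definition is visible to the loader, which is consistent with the advice to keep `graph` out of functions and out of the `__main__` guard.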
@@ -128,9 +133,9 @@ Rewrite it using builtins
 There is a much simpler way to describe an equivalent graph:

 .. literalinclude:: ../../bonobo/examples/tutorials/tut01e02.py
-    :language: python
+    :language: python

-The `extract()` generator has been replaced by a list, as Bonobo will interpret non-callable iterables as a no-input
+The `extract()` generator has been replaced by a list, as Bonobo will interpret non-callable iterables as a no-input
 generator.

 This example is also available in :mod:`bonobo.examples.tutorials.tut01e02`, and you can also run it as a module:

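The hunk above states that a non-callable iterable is interpreted as a no-input generator. A minimal sketch of that idea in plain Python, assuming nothing about Bonobo's internals (the `as_source` helper is hypothetical and only illustrates the equivalence):

```python
def as_source(node):
    """Hypothetical helper: return a no-input callable that yields the node's items."""
    if callable(node):
        # Already a generator function (or similar): use it as-is.
        return node

    def source():
        # A non-callable iterable is wrapped so it behaves like a no-input generator.
        yield from node

    return source


def extract():
    yield 'foo'
    yield 'bar'
    yield 'baz'


from_generator = list(as_source(extract)())
from_list = list(as_source(['foo', 'bar', 'baz'])())

print(from_generator == from_list)
```

Both forms produce the same sequence of rows, which is why the list can stand in for the `extract()` generator in the tutorial example.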
@@ -59,13 +59,7 @@ available in **Bonobo**'s repository:

 .. code-block:: shell-session

-    $ curl https://raw.githubusercontent.com/python-bonobo/bonobo/master/bonobo/examples/datasets/coffeeshops.txt > `python3 -c 'import bonobo; print(bonobo.get_examples_path("datasets/coffeeshops.txt"))'`
-
-.. note::
-
-    The "example dataset download" step will be easier in the future.
-
-    https://github.com/python-bonobo/bonobo/issues/134
+    $ bonobo download examples/datasets/coffeeshops.txt

 .. literalinclude:: ../../bonobo/examples/tutorials/tut02e01_read.py
     :language: python

@@ -30,7 +30,7 @@ Configurables allows to use the following features:
     class PrefixIt(Configurable):
         prefix = Option(str, positional=True, default='>>>')

-        def call(self, row):
+        def __call__(self, row):
             return self.prefix + ' ' + row

     prefixer = PrefixIt('$')
@@ -48,7 +48,7 @@ Configurables allows to use the following features:
         url = Option(default='https://jsonplaceholder.typicode.com/users')
         http = Service('http.client')

-        def call(self, http):
+        def __call__(self, http):
             resp = http.get(self.url)

             for row in resp.json():
@@ -68,7 +68,7 @@ Configurables allows to use the following features:
     class Applier(Configurable):
         apply = Method()

-        def call(self, row):
+        def __call__(self, row):
             return self.apply(row)

     @Applier
@@ -114,7 +114,7 @@ Let's see how to use it, starting from the previous service example:
         url = Option(default='https://jsonplaceholder.typicode.com/users')
         http = Service('http.client')

-        def call(self, http):
+        def __call__(self, http):
             resp = http.get(self.url)

             for row in resp.json():

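The hunks above rename the transformation method from `call` to `__call__`. As a rough illustration of what that rename means in plain Python, here is a sketch that uses an ordinary class as a stand-in (it does not use Bonobo's `Configurable` or `Option`, so the constructor below is an assumption made only for this example):

```python
class PrefixIt:
    """Plain-Python stand-in for the PrefixIt configurable shown in the diff."""

    def __init__(self, prefix='>>>'):
        self.prefix = prefix

    def __call__(self, row):
        # Defining __call__ makes the instance itself callable, like a function.
        return self.prefix + ' ' + row


prefixer = PrefixIt('$')
result = prefixer('hello')

print(callable(prefixer), result)
```

With `__call__`, the configured instance can be used anywhere a plain function is expected, which appears to be the intent of the rename.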