[docs] rewriting the tutorial.
2
Makefile
@ -1,4 +1,4 @@
|
||||
# Generated by Medikit 0.4.3 on 2018-01-08.
|
||||
# Generated by Medikit 0.4.6 on 2018-01-14.
|
||||
# All changes will be overriden.
|
||||
|
||||
PACKAGE ?= bonobo
|
||||
|
||||
@ -1,131 +0,0 @@
|
||||
Bonobo 0.6.0
|
||||
::::::::::::
|
||||
|
||||
* Removes dead snippet. (Romain Dorgueil)
|
||||
* Example datasets are now stored by bonobo minor version. (Romain Dorgueil)
|
||||
* Removing datasets from the repository. (Romain Dorgueil)
|
||||
* For some obscure reason, coverage is broken under python 3.7 making the test suite fail, disabled python3.7 in travis waiting for it to be fixed. (Romain Dorgueil)
|
||||
* [tests] adding a spec to magicmock of nodes to avoid it being seen as partially configured nodes (Romain Dorgueil)
|
||||
* Adds an OrderFields transformation factory, update examples. (Romain Dorgueil)
|
||||
* Check partially configured transformations that are function based (aka transformation factories) on execution context setup. (Romain Dorgueil)
|
||||
* Fix PrettyPrinter, output verbosity is now slightly more discreet. (Romain Dorgueil)
|
||||
* Inheritance of bags and better jupyter output for pretty printer. (Romain Dorgueil)
|
||||
* Documentation cosmetics. (Romain Dorgueil)
|
||||
* Simple "examples" command that just show examples for now. (Romain Dorgueil)
|
||||
* Rewriting Bags from scratch using a namedtuple approach, along with other (less major) updates. (Romain Dorgueil)
|
||||
* Adding services to naive execution (Kenneth Koski)
|
||||
* Fix another typo in `run` (Daniel Jilg)
|
||||
* Fix two typos in the ContextProcessor documentation (Daniel Jilg)
|
||||
* Core: refactoring contexts with more logical responsibilities, stopping to rely on kargs ordering for compat with python3.5 (Romain Dorgueil)
|
||||
* Simplification of node execution context, handle_result is now in step() as it is the only logical place where this will actually be called. (Romain Dorgueil)
|
||||
* Less strict CSV processing, to allow dirty input. (Romain Dorgueil)
|
||||
* [stdlib] Adds Update(...) and FixedWindow(...) to the standard nodes provided with bonobo. (Romain Dorgueil)
|
||||
* Adds a benchmarks directory with small scripts to test performances of things. (Romain Dorgueil)
|
||||
* Moves jupyter extension to both bonobo.contrib.jupyter (for the jupyter widget) and to bonobo.plugins (for the executor-side plugin). (Romain Dorgueil)
|
||||
* Fix examples with new module paths. (Romain Dorgueil)
|
||||
* IOFormats: if no kwargs, then try with one positional argument. (Romain Dorgueil)
|
||||
* Adds a __getattr__ dunder to ValueHolder to enable getting attributes, and especially method calls, on contained objects. (Romain Dorgueil)
|
||||
* Moves ODS extension to contrib module. (Romain Dorgueil)
|
||||
* Moves google extension to contrib module. (Romain Dorgueil)
|
||||
* Moves django extension to contrib module. (Romain Dorgueil)
|
||||
* Update graphs.rst (CW Andrews)
|
||||
* Adds argument parser support to django extension. (Romain Dorgueil)
|
||||
* Trying to understand conda... (Romain Dorgueil)
|
||||
* Trying to understand conda... (Romain Dorgueil)
|
||||
* Trying to understand conda... (Romain Dorgueil)
|
||||
* Update conda conf so readthedocs can maybe build. (Romain Dorgueil)
|
||||
* Working on the new version of the tutorial. Only Step1 implemented. (Romain Dorgueil)
|
||||
* Adds a "bare" template, containing the very minimum you want to have in 90% of cases. (Romain Dorgueil)
|
||||
* Fix default logging level, adds options to default template. (Romain Dorgueil)
|
||||
* Skip failing order test for python 3.5 (temporary). (Romain Dorgueil)
|
||||
* Switch to stable mondrian. (Romain Dorgueil)
|
||||
* Moves timer to statistics utilities. (Romain Dorgueil)
|
||||
* Adds basic test for convert command. (Romain Dorgueil)
|
||||
* [tests] adds node context lifecycle test. (Romain Dorgueil)
|
||||
* Small changes in events, and associated tests. (Romain Dorgueil)
|
||||
* [core] Moves bonobo.execution context related package to new bonobo.execution.contexts package, also moves bonobo.strategies to new bonobo.execution.strategies package, so everything related to execution is now contained under the bonobo.execution package. (Romain Dorgueil)
|
||||
* Remove the sleep() in tick() that causes a minimum execution time of 2*PERIOD, more explicit status display and a small test case for console plugin. (Romain Dorgueil)
|
||||
* [tests] Fix path usage for python 3.5 (Romain Dorgueil)
|
||||
* Adds a test for default file init command. (Romain Dorgueil)
|
||||
* Adds 3.7-dev target to travis runner. (Romain Dorgueil)
|
||||
* Update requirements with first whistle stable. (Romain Dorgueil)
|
||||
* [core] Refactoring to use an event dispatcher in the main thread. (Romain Dorgueil)
|
||||
* Update to mondrian 0.4a0. (Romain Dorgueil)
|
||||
* Fix imports. (Romain Dorgueil)
|
||||
* Removing old error handler. (Romain Dorgueil)
|
||||
* [errors] Move error handling in transformations to use mondrian. (Romain Dorgueil)
|
||||
* [logging] Switching to mondrian, which got all our formatting code. (Romain Dorgueil)
|
||||
* Adds argument parser support in default template. (Romain Dorgueil)
|
||||
* Adds the ability to initialize a package from bonobo init. (Romain Dorgueil)
|
||||
* Still cleaning up. (Romain Dorgueil)
|
||||
* [examples] comments. (Romain Dorgueil)
|
||||
* Update dependencies, remove python-dotenv. (Romain Dorgueil)
|
||||
* Remove unused argument. (Romain Dorgueil)
|
||||
* Remove files in examples that are not used anymore. (Romain Dorgueil)
|
||||
* Refactoring the runner to go more towards standard python, also adds the ability to use bonobo argument parser from standard python execution. (Romain Dorgueil)
|
||||
* Removes cookiecutter. (Romain Dorgueil)
|
||||
* Switch logger setup to mondrian (deps). (Romain Dorgueil)
|
||||
* Module registry reimported as it is needed for "bonobo convert". (Romain Dorgueil)
|
||||
* [core] Simplification: as truthfully stated by Maik at Pycon.DE sprint «lets try not to turn python into javascript». (Romain Dorgueil)
|
||||
* [core] still refactoring env-related stuff towards using __main__ blocks (but with argparser, if needed). (Romain Dorgueil)
|
||||
* [core] Refactoring of commands to move towards a more pythonic way of running the jobs. Commands are now classes, and bonobo "graph" related commands now hooks into bonobo.run() calls so it will use what you actually put in your __main__ block. (Romain Dorgueil)
|
||||
* Minor test change. (Romain Dorgueil)
|
||||
* [core] Change the token parsing part in prevision of different flags. (Romain Dorgueil)
|
||||
* Support line-delimited JSON (Michael Penkov)
|
||||
* Update Makefile/setup. (Romain Dorgueil)
|
||||
* [tests] simplify assertion (Romain Dorgueil)
|
||||
* Issue #134: use requests.get as a context manager (Michael Penkov)
|
||||
* Issue #134: use requests instead of urllib (Michael Penkov)
|
||||
* update Projectfile with download entry point (Michael Penkov)
|
||||
* Issue #134: update documentation (Michael Penkov)
|
||||
* Issue #134: add a `bonobo download url` command (Michael Penkov)
|
||||
* commands.run: Enable relative imports in main.py (Stefan Zimmermann)
|
||||
* adapt tutorial "Working with files" to the latest develop version (Peter Uebele)
|
||||
* Add a note about the graph variable (Michael Penkov)
|
||||
* [tests] trying to speed up the init test. (Romain Dorgueil)
|
||||
* [tests] bonobo.util.objects (Romain Dorgueil)
|
||||
* [nodes] Removing draft quality factory from bonobo main package, will live in a separate personal package until it is good enough to live here. (Romain Dorgueil)
|
||||
* [tests] rename factory test and move bag detecting so any bag is returned as is as an output. (Romain Dorgueil)
|
||||
* [core] Still refactoring the core behaviour of bags, starting to be much simpler. (Romain Dorgueil)
|
||||
* Fix python 3.5 os.chdir not accepting LocalPath (arimbr)
|
||||
* Remove unused shutil import (arimbr)
|
||||
* Use pytest tmpdir fixture and add more init tests (arimbr)
|
||||
* Check if target directory is empty instead of current directory and remove overwrite_if_exists argument (arimbr)
|
||||
* Remove dispatcher as it is not a dependency, for now, and as such breaks the continuous integration (yes, again.). (Romain Dorgueil)
|
||||
* Remove dispatcher as it is not a dependency, for now, and as such breaks the continuous integration. (Romain Dorgueil)
|
||||
* Code formatting. (Romain Dorgueil)
|
||||
* [core] Testing and fixing new args/kwargs behaviour. (Romain Dorgueil)
|
||||
* [core] simplification of result interpretation. (Romain Dorgueil)
|
||||
* [tests] fix uncaptured output in test_commands (Romain Dorgueil)
|
||||
* Documentation for new behaviour. (Romain Dorgueil)
|
||||
* [django, misc] adds create_or_update to djangos ETLCommand class, adds getitem/setitem/contains dunders to ValueHolder. (Romain Dorgueil)
|
||||
* [core] (..., dict) means Bag(..., **dict) (Romain Dorgueil)
|
||||
* [django, google] Implements basic extensions for django and google oauth systems. (Romain Dorgueil)
|
||||
* Test tweak to work for Windows CI. (cwandrews)
|
||||
* Updated requirements files using edgy-project. (cwandrews)
|
||||
* Updated Projectfile to include python-dotenv dependency. (cwandrews)
|
||||
* Add tests for bonobo init new directory and init within empty directory (arimbr)
|
||||
* Update environment.rst (CW Andrews)
|
||||
* Update environment.rst (CW Andrews)
|
||||
* Cast env_dir to string before passing to load_dotenv as passing a PosixPath to load_dotenv raises an exception in 3.5. (cwandrews)
|
||||
* Updated environment documentation in guides to account for env files. (cwandrews)
|
||||
* Added more tests and moved all env and env file testing to classes (it might make more sense to just move them to separate files?). (cwandrews)
|
||||
* Moved env vars tests to class. (cwandrews)
|
||||
* Updated .env >>> .env_one to include in repo (.env ignored). (cwandrews)
|
||||
* [core] Refactoring IOFormats so there is one and only obvious way to send it. (Romain Dorgueil)
|
||||
* Set cookiecutter overwrite_if_exists parameter to True if current directory is empty (arimbr)
|
||||
* [cli/util] fix requires to use the right stack frame, remove --print as "-" does the job (Romain Dorgueil)
|
||||
* [cli] Adds a --filter option to "convert" command, allowing to use arbitrary filters to a command line conversion. Also adds --print and "-" output to pretty print to terminal instead of file output. (Romain Dorgueil)
|
||||
* [cli] convert, remove useless import. (Romain Dorgueil)
|
||||
* [config] adds a __doc__ constructor kwarg to set option documentation inline. (Romain Dorgueil)
|
||||
* [doc] formatting (Romain Dorgueil)
|
||||
* [cli] adds ability to override reader/writer options from cli convert. (Romain Dorgueil)
|
||||
* comparison to None|True|False should be 'if cond is None:' (mouadhkaabachi)
|
||||
* Fixed bug involved in finding env when running module. (cwandrews)
|
||||
* Moved default-env-file tests to class. (cwandrews)
|
||||
* Small adjustment to test parameters. (cwandrews)
|
||||
* Added tests for running file with combinations of multiple default env files, env files, and env vars. Also reorganized environment directory in examples. (cwandrews)
|
||||
* Updated requirements.txt and requirements-dev.txt to include python-dotenv and dependencies. (cwandrews)
|
||||
* default-env-file, default-env, and env-file now in place alongside env. default-env-file and default-env both use os.environ.setdefault so as not to overwrite existing variables (system environment) while env-file and env will overwrite existing variables. All four allow for multiple values (***How might this affect multiple default-env and default-env-file values, I expect that unlike env-file and env the first passed variables would win). (cwandrews)
|
||||
* Further Refactored the setting of env vars passed via the env flag. (cwandrews)
|
||||
* Refactored setting of env vars passed via the env flag. (cwandrews)
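The environment-related entries above distinguish "default" values (set with ``os.environ.setdefault``, so existing variables are kept) from plain values (which overwrite). A minimal sketch of that difference, using a plain dict in place of ``os.environ``; the ``apply_env`` helper and the order in which defaults and overrides are applied are assumptions made for illustration, not bonobo's actual code:

```python
def apply_env(environ, defaults=(), overrides=()):
    """Hypothetical helper: apply (key, value) pairs to a mapping.

    Defaults use setdefault, so the first value seen for a key wins and
    existing variables are never replaced. Overrides use plain assignment,
    so the last value seen for a key wins.
    """
    for key, value in defaults:
        environ.setdefault(key, value)
    for key, value in overrides:
        environ[key] = value
    return environ


env = {"USER": "system"}  # stands in for the system environment
apply_env(
    env,
    defaults=[("USER", "fallback"), ("DB", "sqlite"), ("DB", "postgres")],
    overrides=[("MODE", "dev"), ("MODE", "prod")],
)
print(env)
```

With ``setdefault`` semantics, the first default passed for a given name is the one that sticks, while for overwriting values the last one passed wins.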
|
||||
@ -1,311 +0,0 @@
|
||||
Changelog
|
||||
=========
|
||||
|
||||
Unreleased
|
||||
::::::::::
|
||||
|
||||
* Cookiecutter usage is removed. Linked to the fact that bonobo now uses either a single file (up to you to get python
|
||||
imports working as you want) or a regular fully fledged python package, we do not need it anymore.
|
||||
|
||||
New features
|
||||
------------
|
||||
|
||||
Command line
|
||||
............
|
||||
|
||||
* `bonobo download /examples/datasets/coffeeshops.txt` now downloads the coffeeshops example
|
||||
|
||||
Graphs and Nodes
|
||||
................
|
||||
|
||||
* New `LdjsonReader` and `LdjsonWriter` nodes for handling `line-delimited JSON <https://en.wikipedia.org/wiki/JSON_Streaming>`_.
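For readers unfamiliar with the format, line-delimited JSON stores one JSON document per line. The following is a standard-library sketch of reading and writing it; the ``read_ldjson`` and ``write_ldjson`` functions are hypothetical names used for illustration and are not the ``LdjsonReader`` / ``LdjsonWriter`` implementation:

```python
import io
import json


def write_ldjson(rows, stream):
    """Write each row as one JSON document followed by a newline."""
    for row in rows:
        stream.write(json.dumps(row) + "\n")


def read_ldjson(stream):
    """Yield one decoded object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


buffer = io.StringIO()
write_ldjson([{"id": 1}, {"id": 2}], buffer)
buffer.seek(0)
rows = list(read_ldjson(buffer))
print(rows)
```

Because each line is independent, rows can be processed one at a time without loading the whole file, which fits a streaming approach.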
|
||||
|
||||
v.0.5.0 - 5 october 2017
|
||||
::::::::::::::::::::::::
|
||||
|
||||
Important highlights
|
||||
--------------------
|
||||
|
||||
* `bonobo.pprint` and `bonobo.PrettyPrint` have been removed, in favor of `bonobo.PrettyPrinter` (BC break).
|
||||
* The `bonobo.config` API has suffered a major refactoring. It has been done carefully and most of your code should
|
||||
work unchanged, but you may have surprises. This was necessary for this API to be more uniform (potential BC break).
|
||||
* bonobo.pprint and bonobo.PrettyPrint have been removed, in favor of new bonobo.PrettyPrinter() generic printer. If
|
||||
you're still using the old versions, time to switch (BC break).
|
||||
* Secondary APIs start to be more uniform (bonobo.config, bonobo.util).
|
||||
|
||||
New features
|
||||
------------
|
||||
|
||||
Graphs & Nodes
|
||||
..............
|
||||
|
||||
* Graphs now have a .copy() method.
|
||||
* New helper transformations arg0_to_kwargs and kwargs_to_arg0.
|
||||
* The unique pretty printer provided by the core library is now bonobo.PrettyPrinter().
|
||||
* Services now have "fs" and "http" configured by default.
|
||||
|
||||
Command line
|
||||
............
|
||||
|
||||
* New `bonobo convert` command now allows to run simple conversion jobs without coding anything.
|
||||
* New `bonobo inspect` command now allows to generate graphviz source for graph visualization.
|
||||
* Passing environment variables to graph executions now can be done using -e/--env. (cwandrews)
|
||||
* Add ability to install requirements from a requirements.txt residing in the same dir (Alex Vykaliuk)
|
||||
|
||||
Preview
|
||||
.......
|
||||
|
||||
* A "transformation factory" makes its first appearance. It is considered a preview unstable feature. Stay
|
||||
tuned.
|
||||
|
||||
Internals
|
||||
---------
|
||||
|
||||
* Configurables have undergone a refactoring, all types of descriptors should now behave in the same way.
|
||||
* An UnrecoverableError exception subclass allows for some errors to stop the whole execution.
|
||||
* Refactoring of Settings (bonobo.settings).
|
||||
* Add a reference to graph context (private) in service container.
|
||||
* Few internal APIs changes and refactorings.
|
||||
|
||||
Bugfixes
|
||||
--------
|
||||
|
||||
* Check if PluginExecutionContext was started before shutting it down. (Vitalii Vokhmin)
|
||||
* Move patch one level up because importlib breaks all the CI tools. (Alex Vykaliuk)
|
||||
* Do not fail in ipykernel without ipywidgets. (Alex Vykaliuk)
|
||||
* Escaping issues (Tomas Zubiri)
|
||||
|
||||
Miscellaneous
|
||||
-------------
|
||||
|
||||
* Windows console output should now be correct. (Parthiv20)
|
||||
* Various bugfixes.
|
||||
* More readable statistics on Ubuntu workstation standard terminal (spagoc)
|
||||
* Documentation, more documentation, documentation again.
|
||||
|
||||
|
||||
v.0.4.3 - 16 july 2017
|
||||
::::::::::::::::::::::
|
||||
|
||||
* #113 - Add flush() method to IOBuffer (Vitalii Vokhmin)
|
||||
* Dependencies updated.
|
||||
* Minor project artifacts updated.
|
||||
|
||||
v.0.4.2 - 18 june 2017
|
||||
::::::::::::::::::::::
|
||||
|
||||
* [config] Implements a "requires()" service injection decorator for functions (api may change).
|
||||
* [core] Execution contexts are now context managers.
|
||||
* [fs] adds a default to current working directory in open_fs(...).
|
||||
* [logging] Adds logging alias for easier imports.
|
||||
* [stdlib] Fix I/O related nodes (especially json), there were bad bugs with ioformat.
|
||||
|
||||
Dependency updates
|
||||
------------------
|
||||
|
||||
* Update bonobo-docker from 0.2.6 to 0.2.8
|
||||
* Update dependencies.
|
||||
* Update fs from 2.0.3 to 2.0.4
|
||||
* Update requests from 2.17.3 to 2.18.1
|
||||
|
||||
v.0.4.0 - 10 june 2017
|
||||
::::::::::::::::::::::
|
||||
|
||||
Important highlights
|
||||
--------------------
|
||||
|
||||
* **BC BREAK WARNING** New IOFORMAT option determines the default expected input and output format of transformations.
|
||||
New default input/output format of transformations is now kwargs-based, instead of first-argument based. The
|
||||
rationale behind this is that it does not make any sense to put a dict as the only argument of a transformation
|
||||
knowing that python has a well supported syntax to do so already. Of course, it may break some of your
|
||||
transformations but you can require the old behaviour by setting the IOFORMAT=arg0 environment variable.
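To make the difference between the two conventions concrete, here is a plain-Python sketch. The ``call_arg0`` and ``call_kwargs`` helpers are hypothetical stand-ins for whatever calls a transformation; they only illustrate the calling convention, not how bonobo dispatches rows internally:

```python
def call_arg0(transformation, row):
    """Old style: the whole dict is passed as the first positional argument."""
    return transformation(row)


def call_kwargs(transformation, row):
    """New style: the dict is unpacked into keyword arguments."""
    return transformation(**row)


def full_name_arg0(row):
    return row["first"] + " " + row["last"]


def full_name_kwargs(first, last):
    return first + " " + last


row = {"first": "Ada", "last": "Lovelace"}
print(call_arg0(full_name_arg0, row))
print(call_kwargs(full_name_kwargs, row))
```

The keyword-based version lets the function signature document which fields a transformation expects, which is the rationale given above.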
|
||||
|
||||
New features
|
||||
------------
|
||||
|
||||
Command line interface
|
||||
......................
|
||||
|
||||
* Allow to run directories or modules using "bonobo run".
|
||||
* Bonobo version command now shows where the package is installed, and an optional "--all/-a" flag shows all
|
||||
extensions in the same way. (#81)
|
||||
* Bonobo run flag "--install/-I" allow to pip install a requirements.txt file if run targets a directory. (#71)
|
||||
* Adds python logging facility configuration in bonobo cli commands.
|
||||
* Bonobo init now uses cookiecutter template.
|
||||
|
||||
Configuration
|
||||
.............
|
||||
|
||||
* `Exclusive(...)` context manager locks an object usage to one thread at a time.
|
||||
(`docs <http://docs-dev.bonobo-project.org/en/develop/guide/services.html#solving-concurrency-problems>`_)
|
||||
|
||||
Standard library
|
||||
................
|
||||
|
||||
* New PrettyPrinter and deprecate old crappy modules.
|
||||
* New pickle reader and writer (thanks @jelloslinger).
|
||||
|
||||
Internals
|
||||
---------
|
||||
|
||||
* ConsoleOutputPlugin now buffers stdout to avoid terminal conflicts. Side effect, output is only done every few tenths
|
||||
of a second.
|
||||
|
||||
Bugfixes
|
||||
--------
|
||||
|
||||
* Fixes jupyter widget.
|
||||
|
||||
Extensions
|
||||
----------
|
||||
|
||||
* First release officially supporting bonobo-docker extension. See https://www.bonobo-project.org/with/docker.
|
||||
* Docker extension can be now installed using the "docker" extra on bonobo (`pip install bonobo[docker]`).
|
||||
* Jupyter widget now displays the status in topological order, like console.
|
||||
|
||||
Miscellaneous
|
||||
-------------
|
||||
|
||||
* Allow "main.py" as well as "__main__.py" to be the main entrypoint of an etl job.
|
||||
* Better error display (329296c).
|
||||
* Better testing.
|
||||
* Code sweeping (ecfdc81).
|
||||
* Dependencies updated.
|
||||
* Filesystem now resolves (expands) ~ in path.
|
||||
* Moving project artifact management (Projectfile) to edgy.project 0.3 format.
|
||||
* Refactoring and fixes around ioformats.
|
||||
* Some really minor changes.
|
||||
|
||||
v.0.3.2 - 10 june 2017
|
||||
::::::::::::::::::::::
|
||||
|
||||
Weekly maintenance release.
|
||||
|
||||
* Updated frozen version numbers in requirements.
|
||||
|
||||
* pytest==3.1.1
|
||||
* requests==2.17.3
|
||||
* sphinx==1.6.2
|
||||
* stevedore==1.22.0
|
||||
|
||||
Note: this does not change anything when used as a dependency if you freeze your requirements, as the setup.py
|
||||
requirement specifiers did not change.
|
||||
|
||||
v.0.3.1 - 28 may 2017
|
||||
:::::::::::::::::::::
|
||||
|
||||
Weekly maintenance release.
|
||||
|
||||
* Updated project management model to edgy.project 0.3 format.
|
||||
* Updated frozen version numbers in requirements.
|
||||
|
||||
* certifi==2017.4.17
|
||||
* chardet==3.0.3
|
||||
* coverage==4.4.1
|
||||
* idna==2.5
|
||||
* nbconvert==5.2.1
|
||||
* pbr==3.0.1
|
||||
* pytest-cov==2.5.1
|
||||
* pytest==3.1.0
|
||||
* requests==2.16.5
|
||||
* sphinx==1.6.1
|
||||
* sphinxcontrib-websupport==1.0.1
|
||||
* testpath==0.3.1
|
||||
* typing==3.6.1
|
||||
* urllib3==1.21.1
|
||||
|
||||
Note: this does not change anything when used as a dependency if you freeze your requirements, as the setup.py
|
||||
requirement specifiers did not change.
|
||||
|
||||
v.0.3.0 - 22 may 2017
|
||||
:::::::::::::::::::::
|
||||
|
||||
Features
|
||||
--------
|
||||
|
||||
* ContextProcessors can now be implemented by getting the "yield" value (v = yield x), shortening the teardown-only
|
||||
context processors by one line.
|
||||
* File related writers (file, csv, json ...) now return NOT_MODIFIED, making it easier to chain something after.
|
||||
* More consistent console output, nodes are now sorted in a topological order before display.
|
||||
* Graph.add_chain(...) now takes _input and _output parameters the same way, accepting indexes, instances or names
|
||||
(subject to change).
|
||||
* Graph.add_chain(...) now allows to "name" a chain, using _name keyword argument, to easily reference its output later
|
||||
(subject to change).
|
||||
* New settings module (bonobo.settings) read environment for some global configuration stuff (DEBUG and PROFILE, for
|
||||
now).
|
||||
* New Method subclass of Option allows to use Configurable objects as decorator (see bonobo.nodes.filter.Filter for a
|
||||
simple example).
|
||||
* New Filter transformation in standard library.
|
||||
|
||||
Internal features
|
||||
-----------------
|
||||
|
||||
* Better ContextProcessor implementation, avoiding to use a decorator on the parent class. Now works with Configurable
|
||||
instances like Option, Service and Method.
|
||||
* ContextCurrifier replaces the logic that was in NodeExecutionContext, that setup and teardown the context stack. Maybe
|
||||
the name is not ideal.
|
||||
* All builtin transformations are of course updated to use the improved API, and should be 100% backward compatible.
|
||||
* The "core" package has been dismantled, and its rare remaining members are now in "structs" and "util" packages.
|
||||
* Standard transformation library has been moved under the bonobo.nodes package. It does not change anything if you used
|
||||
bonobo.* (which you should).
|
||||
* ValueHolder is now more restrictive, not allowing to use .value anymore.
|
||||
|
||||
Miscellaneous
|
||||
-------------
|
||||
|
||||
* Code cleanup, dead code removal, more tests, etc.
|
||||
* More documentation.
|
||||
|
||||
v.0.2.4 - 2 may 2017
|
||||
::::::::::::::::::::
|
||||
|
||||
* Cosmetic release for PyPI package page formatting. Same content as v.0.2.3.
|
||||
|
||||
v.0.2.3 - 1 may 2017
|
||||
:::::::::::::::::::::
|
||||
|
||||
* Positional options now supported, backward compatible. All FileHandler subclasses support their path argument as
|
||||
positional.
|
||||
* Better transformation lifecycle management (still work needed here).
|
||||
* Windows continuous integration now works.
|
||||
* Refactoring the "API" a lot to have a much cleaner first glance at it.
|
||||
* More documentation, tutorials, and tuning project artifacts.
|
||||
|
||||
v.0.2.2 - 28 apr 2017
|
||||
:::::::::::::::::::::
|
||||
|
||||
* First implementation of services and basic injection.
|
||||
* Default service configuration for directories and files.
|
||||
* Code structure refactoring.
|
||||
* Critical bug fix in default strategy causing end of pipeline not to terminate correctly.
|
||||
* Force tighter dependency management to avoid unexpected upgrade problems.
|
||||
* Filesystems are now injected as a service, using new filesystem2 (fs) dependency.
|
||||
|
||||
v.0.2.1 - 25 apr 2017
|
||||
:::::::::::::::::::::
|
||||
|
||||
* Plugins (jupyter, console) are now auto-activated depending on the environment when using bonobo.run(...).
|
||||
* Remove dependencies to toolz (which was unused) and blessings (which caused problems on windows).
|
||||
* New dependency on colorama, which has better cross-platform support than blessings.
|
||||
* New bonobo.structs package containing basic datastructures, like graphs, tokens and bags.
|
||||
* Enhancements of ValueHolder to implement basic operators on its value without referencing the value attribute.
|
||||
* Fix issue with timezone argument of OpenDataSoftAPI (Sanket Dasgupta).
|
||||
* Fix Jupyter plugin.
|
||||
* Better continuous integration, testing and fixes in documentation.
|
||||
* Version updates for dependencies (psutil install problem on windows).
|
||||
|
||||
Initial release
|
||||
:::::::::::::::
|
||||
|
||||
* Migration from rdc.etl.
|
||||
* New cool name (ok, that's debatable).
|
||||
* Only supports python 3.5+, aggressively (which means, we can use async, and we remove all things from python 2/six
|
||||
compat)
|
||||
* Removes all things deprecated and/or not really convincing from rdc.etl.
|
||||
* We want transforms to be simple callables, so refactoring of the harness mess.
|
||||
* We want to use plain python data structures, so hashes are removed. If you use python 3.6, you may even get sorted
|
||||
dicts.
|
||||
* Input/output MUX DEMUX removed, maybe no need for that in the real world. May come back, but not in 1.0
|
||||
* Change dependency policy. We need to include only the very basic requirements (and very required). Everything related
|
||||
to transforms that we may not use (bs, sqla, ...) should be optional dependencies.
|
||||
* Execution strategies, threaded by default.
|
||||
@ -127,7 +127,11 @@ See https://github.com/python-bonobo/bonobo/issues/24
|
||||
Who is behind this?
|
||||
-------------------
|
||||
|
||||
Me (as an individual), and a few great people that helped me along the way. Not commercially endorsed, or supported.
|
||||
`Me (as an individual) <https://romain.dorgueil.net/>`_, and the `growing number of contributors
|
||||
<https://github.com/python-bonobo/bonobo/graphs/contributors>`_ that give of their time to move the project forward.
|
||||
|
||||
|bonobo| is not commercially endorsed, or supported. If your company wants to sponsor parts of |bonobo| development
|
||||
effort, `let's talk <mailto:romain@bonobo-project.org>`_.
|
||||
|
||||
The code, documentation, and surrounding material are created using spare time and may lack a bit of velocity. Feel free
|
||||
to jump in so we can go faster!
|
||||
|
||||
16
docs/guide/_next.rst
Normal file
@ -0,0 +1,16 @@
|
||||
Where to jump next?
|
||||
:::::::::::::::::::
|
||||
|
||||
We suggest that you go through the :doc:`tutorial </tutorial/index>` first.
|
||||
|
||||
Then, you can read the guides, either using the order suggested or by picking the chapter that interests you the most at
|
||||
one given moment:
|
||||
|
||||
* :doc:`introduction`
|
||||
* :doc:`transformations`
|
||||
* :doc:`graphs`
|
||||
* :doc:`services`
|
||||
* :doc:`environment`
|
||||
* :doc:`purity`
|
||||
* :doc:`debugging`
|
||||
* :doc:`plugins`
|
||||
@ -1,11 +0,0 @@
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
introduction
|
||||
transformations
|
||||
graphs
|
||||
services
|
||||
environment
|
||||
purity
|
||||
debugging
|
||||
plugins
|
||||
@ -0,0 +1,5 @@
|
||||
Debugging
|
||||
=========
|
||||
|
||||
|
||||
.. include:: _next.rst
|
||||
|
||||
@ -127,3 +127,5 @@ function and used to get data from the database.
|
||||
bonobo.PrettyPrinter(),
|
||||
)
|
||||
|
||||
|
||||
.. include:: _next.rst
|
||||
|
||||
@ -249,3 +249,5 @@ the CLI, and reading the source you should be able to figure out its usage quite
|
||||
|
||||
|
||||
|
||||
.. include:: _next.rst
|
||||
|
||||
|
||||
@ -3,8 +3,14 @@ Guides
|
||||
|
||||
This section will guide you through your journey with Bonobo ETL.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
.. include:: _toc.rst
|
||||
|
||||
|
||||
|
||||
introduction
|
||||
transformations
|
||||
graphs
|
||||
services
|
||||
environment
|
||||
purity
|
||||
debugging
|
||||
plugins
|
||||
|
||||
@ -1,8 +1,8 @@
|
||||
Introduction
|
||||
============
|
||||
|
||||
The first thing you need to understand before you use Bonobo, or not, is what it does and what it does not, so you can
|
||||
understand if it could be a good fit for your use cases.
|
||||
The first thing you need to understand before you use |bonobo|, or not, is what it does and what it does not, so you
|
||||
can understand if it could be a good fit for your use cases.
|
||||
|
||||
How it works?
|
||||
:::::::::::::
|
||||
@ -13,7 +13,10 @@ terminals and source code files.
|
||||
It is a **data streaming** solution that treats datasets as ordered collections of independent rows, allowing you to process
|
||||
them "first in, first out" using a set of transformations organized together in a directed graph.
|
||||
|
||||
Let's take a few examples:
|
||||
Let's take a few examples.
|
||||
|
||||
Simplest linear graph
|
||||
---------------------
|
||||
|
||||
.. graphviz::
|
||||
|
||||
@ -26,18 +29,35 @@ Let's take a few examples:
|
||||
BEGIN -> "A" -> "B" -> "C" -> "END";
|
||||
}
|
||||
|
||||
One of the simplest, by the book, cases, is an extractor sending to a transformation, itself sending to a loader.
|
||||
One of the simplest, by the book, cases, is an extractor sending to a transformation, itself sending to a loader (hence
|
||||
the "Extract Transform Load" name).
|
||||
|
||||
.. note::
|
||||
|
||||
Of course, |bonobo| is aiming at real-world data transformations and can help you build all kinds of data-flows.
|
||||
|
||||
Bonobo will send an "impulsion" to all transformations linked to the `BEGIN` node (shown as a little black dot on the left).
|
||||
|
||||
On our example, the only node having its input linked to `BEGIN` is `A`.
|
||||
|
||||
Bonobo will send an "impulsion" to all transformations linked to the little black dot on the left, here `A`.
|
||||
`A`'s main topic will be to extract data from somewhere (a file, an endpoint, a database...) and generate some output.
|
||||
As soon as the first row of `A`'s output is available, Bonobo will start asking `B` to process it. As soon as the first
|
||||
row of `B`'s output is available, Bonobo will start asking `C` to process it.
|
||||
As soon as the first row of `A`'s output is available, |bonobo| will start asking `B` to process it. As soon as the first
|
||||
row of `B`'s output is available, |bonobo| will start asking `C` to process it.
|
||||
|
||||
While `B` and `C` are processing, `A` continues to generate data.
|
||||
|
||||
This approach can be efficient, depending on your requirements, because you may rely on a lot of services that may be
|
||||
long to answer or unreliable, and you don't have to handle optimizations, parallelism or retry logic by yourself.
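The row-by-row flow described above can be sketched with plain Python generators. This is only an illustration of the streaming idea under simplifying assumptions (a single thread, pull-based), not how the threaded execution strategy is implemented; the ``extract``, ``transform`` and ``load`` names and the ``trace`` list are made up for the example:

```python
trace = []  # records the order in which the nodes do their work


def extract():
    """A: produce rows one at a time."""
    for row in ("a", "b"):
        trace.append(f"A yields {row}")
        yield row


def transform(rows):
    """B: process each row as soon as it is available."""
    for row in rows:
        trace.append(f"B processes {row}")
        yield row.upper()


def load(rows):
    """C: consume each transformed row."""
    out = []
    for row in rows:
        trace.append(f"C loads {row}")
        out.append(row)
    return out


result = load(transform(extract()))
print(result)
print("\n".join(trace))
```

The trace shows that the first row travels through all three nodes before the second row is even produced: no node waits for the full dataset.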
|
||||
|
||||
.. note::
|
||||
|
||||
The default execution strategy uses threads, and makes it efficient to work on I/O bound tasks. It's in the plans
|
||||
to have other execution strategies, based on subprocesses (for CPU-bound tasks) or `dask.distributed` (for big
|
||||
data tasks that require a cluster of computers to process in reasonable time).
|
||||
|
||||
Graphs with divergence points (or forks)
|
||||
----------------------------------------
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
@ -55,6 +75,9 @@ In this case, any output row of `A`, will be **sent to both** `B` and `C` simult
|
||||
processing while `B` and `C` are working.
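As a rough illustration of a fork, the sketch below hands every row to each downstream consumer. It runs sequentially in one thread, so it shows only the "each row goes to both" semantics, not the simultaneous processing; ``fan_out`` is a hypothetical helper, not part of the library:

```python
def fan_out(rows, *consumers):
    """Send every row to every consumer, preserving row order."""
    for row in rows:
        for consume in consumers:
            consume(row)


seen_by_b = []
seen_by_c = []

fan_out(
    range(3),                               # stands in for A's output
    seen_by_b.append,                       # B keeps rows as they are
    lambda row: seen_by_c.append(row * 10)  # C does its own processing
)
print(seen_by_b, seen_by_c)
```

Both consumers receive the complete stream; the rows are not split between them.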
|
||||
|
||||
|
||||
Graph with convergence points (or merges)
|
||||
-----------------------------------------
|
||||
|
||||
.. graphviz::
|
||||
|
||||
digraph {
|
||||
@ -71,38 +94,23 @@ processing while `B` and `C` are working.
|
||||
Now, we feed `C` with both `A` and `B` output. It is not a "join", or "cartesian product". It is just two different
|
||||
pipes plugged into `C`'s input, and whichever yields data will see this data fed to `C`, one row at a time.
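
To make the "two pipes, one input" idea concrete, here is a minimal standard-library sketch. It assumes a single shared queue as `C`'s input; the names `produce`, `consume` and `END` are hypothetical and do not come from |bonobo|.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, sent once per producer


def produce(rows, outbox):
    """Stands in for `A` or `B`: push rows into the shared input of `C`."""
    for row in rows:
        outbox.put(row)
    outbox.put(END)


def consume(inbox, n_producers):
    """Stands in for `C`: handle rows as they arrive until every producer is done."""
    seen = []
    finished = 0
    while finished < n_producers:
        row = inbox.get()
        if row is END:
            finished += 1
        else:
            seen.append(row)
    return seen


c_input = queue.Queue()
producers = [
    threading.Thread(target=produce, args=(["a1", "a2", "a3"], c_input)),
    threading.Thread(target=produce, args=(["b1", "b2"], c_input)),
]
for thread in producers:
    thread.start()
received = consume(c_input, n_producers=len(producers))
for thread in producers:
    thread.join()
```

`C` receives five rows (three plus two), in whatever order they arrive. A cartesian product would have produced six pairs instead.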
|
||||
|
||||
|
||||
What is it not?
|
||||
:::::::::::::::
|
||||
|
||||
**Bonobo** is not:
|
||||
|bonobo| is not:
|
||||
|
||||
* A data science or statistical analysis tool, which needs to treat the dataset as a whole and not as a collection of
|
||||
independent rows. If this is your need, you probably want to look at `pandas <https://pandas.pydata.org/>`_.
|
||||
|
||||
* A workflow or scheduling solution for independent data-engineering tasks. If you're looking to manage your sets of
|
||||
data processing tasks as a whole, you probably want to look at `airflow <https://airflow.incubator.apache.org/>`_.
|
||||
Although there is no Bonobo extension yet that handles that, it does make sense to integrate Bonobo jobs in an airflow
|
||||
(or other similar tool) workflow.
|
||||
Although there is no |bonobo| extension yet that handles that, it does make sense to integrate |bonobo| jobs in an
|
||||
airflow (or other similar tool) workflow.
|
||||
|
||||
* A big data solution, `as defined by wikipedia <https://en.wikipedia.org/wiki/Big_data>`_. We're aiming at "small
|
||||
scale" data processing, which can still be quite huge for humans, but not for computers. If you don't know whether or
|
||||
not this is sufficient for your needs, it probably means you're not in the "big data" land.
|
||||
|
||||
|
||||
Where to jump next?
|
||||
:::::::::::::::::::
|
||||
|
||||
If you did not run through it yet, we highly suggest that you go through the :doc:`tutorial </tutorial/index>` first.
|
||||
|
||||
Then, you can jump to the following guides, in no particular order:
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
transformations
|
||||
graphs
|
||||
services
|
||||
environment
|
||||
purity
|
||||
|
||||
|
||||
.. include:: _next.rst
|
||||
|
||||
@ -15,3 +15,5 @@ enhancers
|
||||
node
|
||||
-
|
||||
|
||||
|
||||
.. include:: _next.rst
|
||||
|
||||
@ -147,3 +147,5 @@ a new dict will of course create a new envelope, but the unchanged objects insid
|
||||
|
||||
Last thing, copies made in the "pure" approach are explicit, and usually, explicit is better than implicit.
|
||||
|
||||
|
||||
.. include:: _next.rst
|
||||
|
||||
@ -157,3 +157,5 @@ Read more
|
||||
:::::::::
|
||||
|
||||
* See https://github.com/hartym/bonobo-sqlalchemy/blob/work-in-progress/bonobo_sqlalchemy/writers.py#L19 for example usage (work in progress).
|
||||
|
||||
.. include:: _next.rst
|
||||
|
||||
@ -233,22 +233,16 @@ bonobo send the data to your transformation.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo.constants import BEGIN, END
|
||||
from bonobo.execution import NodeExecutionContext
|
||||
|
||||
with NodeExecutionContext(
|
||||
JsonWriter(filename), services={'fs': ...}
|
||||
) as context:
|
||||
|
||||
# Write a list of rows, including BEGIN/END control messages.
|
||||
context.write(
|
||||
BEGIN,
|
||||
Bag({'foo': 'bar'}),
|
||||
Bag({'foo': 'baz'}),
|
||||
END
|
||||
context.write_sync(
|
||||
{'foo': 'bar'},
|
||||
{'foo': 'baz'},
|
||||
)
|
||||
|
||||
# Out of the bonobo main loop, we need to call `step` explicitly.
|
||||
context.step()
|
||||
context.step()
|
||||
|
||||
.. include:: _next.rst
|
||||
|
||||
@ -1,12 +1,12 @@
|
||||
History
|
||||
=======
|
||||
|
||||
**Bonobo** is a full rewrite of **rdc.etl**.
|
||||
|bonobo| is a full rewrite of **rdc.etl**, aimed at modern python versions (3.5+).
|
||||
|
||||
**rdc.etl** is a full python 2.7+ ETL library for which development started in 2012, and was opensourced in 2013 (see
|
||||
`first commit <https://github.com/rdcli/rdc.etl/commit/fdbc11c0ee7f6b97322693bd0051d63677b06a93>`_).
|
||||
**rdc.etl** is a now deprecated python 2.7+ ETL library for which development started in 2012, and was opensourced in
|
||||
2013 (see `first commit <https://github.com/rdcli/rdc.etl/commit/fdbc11c0ee7f6b97322693bd0051d63677b06a93>`_).
|
||||
|
||||
Although the first commit in **Bonobo** happened late 2016, it's based on a lot of code, learnings and experience that
|
||||
Although the first commit in |bonobo| happened late 2016, it's based on a lot of code, learnings and experience that
|
||||
happened because of **rdc.etl**.
|
||||
|
||||
It would have been counterproductive to migrate the same codebase:
|
||||
@ -17,6 +17,5 @@ It would have been counterproductive to migrate the same codebase:
|
||||
* we also wanted to develop something that took advantage of modern python versions, hence the choice of 3.5+.
|
||||
|
||||
**rdc.etl** still runs data transformation jobs, in both python 2.7 and 3, and we reuse whatever is possible to
|
||||
build Bonobo.
|
||||
continue building |bonobo|.
|
||||
|
||||
You can read
|
||||
|
||||
@ -3,7 +3,7 @@ Part 3: Working with Files
|
||||
|
||||
.. include:: _wip_note.rst
|
||||
|
||||
Writing to the console is nice, but using files is probably more realistic.
|
||||
Writing to the console is nice, but real-world jobs will require us to use files or external services.
|
||||
|
||||
Let's see how to use a few builtin writers and both local and remote filesystems.
|
||||
|
||||
@ -11,50 +11,129 @@ Let's see how to use a few builtin writers and both local and remote filesystems
|
||||
Filesystems
|
||||
:::::::::::
|
||||
|
||||
In |bonobo|, files are accessed within a **filesystem** service which must be something with the same interface as
|
||||
`fs' FileSystem objects <https://docs.pyfilesystem.org/en/latest/builtin.html>`_. As a default, you'll get an instance
|
||||
of a local filesystem mapped to the current working directory as the `fs` service. You'll learn more about services in
|
||||
the next step, but for now, let's just use it.
|
||||
In |bonobo|, files are accessed within a **filesystem** service (a `fs' FileSystem object
|
||||
<https://docs.pyfilesystem.org/en/latest/builtin.html>`_).
|
||||
|
||||
By default, you'll get an instance of a local filesystem mapped to the current working directory as the `fs` service.
|
||||
You'll learn more about services in the next step, but for now, let's just use it.
|
||||
|
||||
|
||||
Writing using the service
|
||||
:::::::::::::::::::::::::
|
||||
Writing to files
|
||||
::::::::::::::::
|
||||
|
||||
Although |bonobo| contains helpers to write to common file formats, let's start by writing it manually.
|
||||
To write in a file, we'll need to have an open file handle available during the whole transformation life.
|
||||
|
||||
We'll use a context processor to do so. A context processor is something very much like a
|
||||
:obj:`contextlib.contextmanager`, that |bonobo| will use to run a setup/teardown logic on objects that need to have
|
||||
the same lifecycle as a job execution.
|
||||
|
||||
Let's write one that just handles opening and closing the file:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
from bonobo.config import use
|
||||
from bonobo.constants import NOT_MODIFIED
|
||||
def with_opened_file(self, context):
|
||||
with open('output.txt', 'w+') as f:
|
||||
yield f
|
||||
|
||||
@use('fs')
|
||||
def write_repr_to_file(*row, fs):
|
||||
with fs.open('output.txt', 'a+') as f:
|
||||
print(row, file=f)
|
||||
return NOT_MODIFIED
|
||||
Now, we need to write a `writer` transformation, and apply this context processor on it:
|
||||
|
||||
Then, update the `get_graph(...)` function, by adding `write_repr_to_file` just before your `PrettyPrinter()` node.
|
||||
.. code-block:: python
|
||||
|
||||
Let's try to run that and think about what happens.
|
||||
from bonobo.config import use_context_processor
|
||||
|
||||
Each time a row comes to this node, the output file is open in "append or create" mode, a line is written, and the file
|
||||
is closed.
|
||||
@use_context_processor(with_opened_file)
|
||||
def write_repr_to_file(f, *row):
|
||||
f.write(repr(row))
|
||||
|
||||
This is **NOT** how you want to do things. Let's rewrite it so our `open(...)` call becomes execution-wide.
|
||||
The `f` parameter will contain the value yielded by the context processors, in order of appearance (you can chain
|
||||
multiple context processors).
|
||||
|
||||
Please note that the :func:`bonobo.config.use_context_processor` decorator will modify the function in place, but won't
|
||||
modify its behaviour. If you want to call it out of the |bonobo| job context, it's your responsibility to provide
|
||||
the right parameters (and here, the opened file).
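
The setup/teardown lifecycle can be sketched without |bonobo| at all. The snippet below is only an illustration of the idea, assuming a generator that yields exactly once; `run_with_context`, `with_sink` and `write_repr` are hypothetical names, and a plain list stands in for the opened file.

```python
events = []   # records when setup and teardown happen
written = []  # stands in for the opened file


def with_sink():
    # Hypothetical context processor: setup before the yield, teardown after.
    events.append("open")
    yield written
    events.append("close")


def write_repr(sink, *row):
    # The yielded value is passed first, followed by the row itself.
    sink.append(repr(row))


def run_with_context(processor, transformation, rows):
    """Run setup once, call the transformation for each row, then run teardown."""
    generator = processor()
    resource = next(generator)  # runs the code before the yield
    try:
        for row in rows:
            transformation(resource, *row)
    finally:
        try:
            next(generator)  # resumes after the yield, running teardown
        except StopIteration:
            pass
        else:
            raise RuntimeError("a context processor must yield exactly once")


run_with_context(with_sink, write_repr, [("foo", "bar"), ("foo", "baz")])
```

Setup and teardown each run once for the whole execution, however many rows go through the transformation.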
|
||||
|
||||
|
||||
Using the filesystem
|
||||
::::::::::::::::::::
|
||||
|
||||
We opened the output file using a hardcoded filename and filesystem implementation. Writing flexible jobs includes the
|
||||
ability to change the load targets at runtime, and |bonobo| suggests using the `fs` service to achieve this with files.
|
||||
|
||||
Let's rewrite our context processor to use it.
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def with_opened_file(self, context):
|
||||
with context.get_service('fs').open('output.txt', 'w+') as f:
|
||||
yield f
|
||||
|
||||
The interface does not change much, but this small change allows the end-user to change the filesystem implementation at
|
||||
runtime, which is great to handle different environments (local development, staging servers, production, ...).
|
||||
|
||||
Note that |bonobo| only provides very few services with a default implementation (actually, only `fs` and `http`), but
|
||||
you can define all the services you want, depending on your system. You'll learn more about this in the next tutorial
|
||||
chapter.
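
Why does going through a service help? Because the code that writes no longer cares which implementation it gets. The sketch below shows the principle with the standard library only. It does not use |bonobo|'s service injection, and `InMemoryFS` is a hypothetical stand-in that exposes nothing more than an `open` method.

```python
import contextlib
import io


class InMemoryFS:
    """Hypothetical filesystem stand-in that keeps file contents in a dict."""

    def __init__(self):
        self.files = {}

    @contextlib.contextmanager
    def open(self, path, mode="w"):
        # Only write mode is sketched here; contents are stored on close.
        buffer = io.StringIO()
        try:
            yield buffer
        finally:
            self.files[path] = buffer.getvalue()


def write_rows(rows, services):
    """Write each row's repr on its own line, using whatever `fs` was provided."""
    with services["fs"].open("output.txt", "w") as f:
        for row in rows:
            f.write(repr(row) + "\n")


memory_fs = InMemoryFS()
write_rows([("foo", "bar"), ("foo", "baz")], {"fs": memory_fs})
```

Swapping the value under the `'fs'` key is all it takes to send the output elsewhere, and `write_rows` stays untouched.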
|
||||
|
||||
|
||||
Using a different filesystem
|
||||
::::::::::::::::::::::::::::
|
||||
|
||||
* Filesystems
|
||||
To change the `fs` implementation, you need to provide your implementation in the dict returned by `get_services()`.
|
||||
|
||||
* Reading files
|
||||
Let's write to a remote location, which will be an Amazon S3 bucket. First, we need to install the driver:
|
||||
|
||||
* Writing files
|
||||
.. code-block:: shell-session
|
||||
|
||||
* Writing files to S3
|
||||
pip install fs-s3fs
|
||||
|
||||
* Atomic writes ???
|
||||
Then, just provide the correct bucket to :func:`bonobo.open_fs`:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def get_services(**options):
|
||||
return {
|
||||
'fs': bonobo.open_fs('s3://bonobo-examples')
|
||||
}
|
||||
|
||||
.. note::
|
||||
|
||||
You must provide a bucket for which you have the write permission, and it's up to you to set up your Amazon
|
||||
credentials in such a way that `boto` can access your AWS account.
|
||||
|
||||
|
||||
Using builtin writers
|
||||
:::::::::::::::::::::
|
||||
|
||||
Until now, and to have a better understanding of what happens, we implemented our writers ourselves.
|
||||
|
||||
|bonobo| contains writers for a variety of standard file formats, and you're probably better off using builtin writers.
|
||||
|
||||
Let's use a :obj:`bonobo.CsvWriter` instance instead, by replacing our custom transformation in the graph factory
|
||||
function:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
def get_graph(**options):
|
||||
graph = bonobo.Graph()
|
||||
graph.add_chain(
|
||||
...
|
||||
bonobo.CsvWriter('output.csv'),
|
||||
)
|
||||
return graph
|
||||
|
||||
Reading from files
|
||||
::::::::::::::::::
|
||||
|
||||
Reading from files is done using the same logic as writing, except that you'll probably have only one call to a reader.
|
||||
|
||||
Our example application does not include reading from files, but you can read the file we just wrote by using a
|
||||
:obj:`bonobo.CsvReader` instance.
|
||||
|
||||
|
||||
Atomic writes
|
||||
:::::::::::::
|
||||
|
||||
.. include:: _todo.rst
|
||||
|
||||
|
||||
Moving forward
|
||||
@ -62,6 +141,10 @@ Moving forward
|
||||
|
||||
You now know:
|
||||
|
||||
* How to ...
|
||||
* How to use the filesystem (`fs`) service.
|
||||
* How to read from files.
|
||||
* How to write to files.
|
||||
* How to substitute a service at runtime.
|
||||
|
||||
It's now time to jump to :doc:`4-services`.
|
||||
|
||||
|
||||
@ -30,9 +30,9 @@ requests==2.18.4
|
||||
six==1.11.0
|
||||
snowballstemmer==1.2.1
|
||||
sphinx-sitemap==0.2
|
||||
sphinx==1.6.5
|
||||
sphinx==1.6.6
|
||||
sphinxcontrib-websupport==1.0.1
|
||||
termcolor==1.1.0
|
||||
urllib3==1.22
|
||||
whichcraft==0.4.1
|
||||
yapf==0.20.0
|
||||
yapf==0.20.1
|
||||
|
||||
@ -1,7 +1,7 @@
|
||||
-e .[jupyter]
|
||||
appnope==0.1.0
|
||||
bleach==2.1.2
|
||||
decorator==4.1.2
|
||||
decorator==4.2.1
|
||||
entrypoints==0.2.3
|
||||
html5lib==1.0.1
|
||||
ipykernel==4.7.0
|
||||
@ -19,7 +19,7 @@ markupsafe==1.0
|
||||
mistune==0.8.3
|
||||
nbconvert==5.3.1
|
||||
nbformat==4.4.0
|
||||
notebook==5.2.2
|
||||
notebook==5.3.0rc1
|
||||
pandocfilters==1.4.2
|
||||
parso==0.1.1
|
||||
pexpect==4.3.1
|
||||
@ -28,13 +28,14 @@ prompt-toolkit==1.0.15
|
||||
ptyprocess==0.5.2
|
||||
pygments==2.2.0
|
||||
python-dateutil==2.6.1
|
||||
pyzmq==16.0.3
|
||||
pyzmq==17.0.0b3
|
||||
qtconsole==4.3.1
|
||||
send2trash==1.4.2
|
||||
simplegeneric==0.8.1
|
||||
six==1.11.0
|
||||
terminado==0.8.1
|
||||
testpath==0.3.1
|
||||
tornado==4.5.3
|
||||
tornado==5.0a1
|
||||
traitlets==4.3.2
|
||||
wcwidth==0.1.7
|
||||
webencodings==0.5.1
|
||||
|
||||