[docs] rewriting the tutorial.
@ -188,7 +188,11 @@ epub_copyright = copyright
|
|||||||
epub_exclude_files = ['search.html']
|
epub_exclude_files = ['search.html']
|
||||||
|
|
||||||
# Example configuration for intersphinx: refer to the Python standard library.
|
# Example configuration for intersphinx: refer to the Python standard library.
|
||||||
intersphinx_mapping = {'python': ('https://docs.python.org/3', None)}
|
intersphinx_mapping = {
|
||||||
|
'python': ('https://docs.python.org/3', None),
|
||||||
|
'fs': ('https://docs.pyfilesystem.org/en/latest/', None),
|
||||||
|
'requests': ('http://docs.python-requests.org/en/master/', None),
|
||||||
|
}
|
||||||
|
|
||||||
rst_epilog = """
|
rst_epilog = """
|
||||||
.. |bonobo| replace:: **Bonobo**
|
.. |bonobo| replace:: **Bonobo**
|
||||||
|
|||||||
@ -1,8 +1,6 @@
|
|||||||
Part 3: Working with Files
|
Part 3: Working with Files
|
||||||
==========================
|
==========================
|
||||||
|
|
||||||
.. include:: _wip_note.rst
|
|
||||||
|
|
||||||
Writing to the console is nice, but let's be serious, the real world will require us to use files or external services.
|
Writing to the console is nice, but let's be serious, the real world will require us to use files or external services.
|
||||||
|
|
||||||
Let's see how to use a few builtin writers and both local and remote filesystems.
|
Let's see how to use a few builtin writers and both local and remote filesystems.
|
||||||
|
|||||||
@ -1,201 +1,99 @@
|
|||||||
Part 4: Services and Configurables
|
Part 4: Services
|
||||||
==================================
|
================
|
||||||
|
|
||||||
.. include:: _wip_note.rst
|
All external dependencies (like filesystems, network clients, database connections, etc.) should be provided to
|
||||||
|
transformations as a service. It allows great flexibility, including the ability to test your transformations isolated
|
||||||
|
from the external world, and being friendly to the infrastructure guys (and if you're one of them, it's also nice to
|
||||||
|
treat yourself well).
|
||||||
|
|
||||||
In the last section, we used a few new tools.
|
In the last section, we used the `fs` service to access filesystems. We'll go even further by switching our `requests`
|
||||||
|
call to use the `http` service, so we can switch the `requests` session at runtime. We'll use it to add an http cache,
|
||||||
Class-based transformations and configurables
|
which is a great thing to avoid hammering a remote API.
|
||||||
:::::::::::::::::::::::::::::::::::::::::::::
|
|
||||||
|
|
||||||
Bonobo is a bit dumb. If something is callable, it considers it can be used as a transformation, and it's up to the
|
|
||||||
user to provide callables that logically fits in a graph.
|
|
||||||
|
|
||||||
You can use plain python objects with a `__call__()` method, and it ill just work.
|
|
||||||
|
|
||||||
As a lot of transformations needs common machinery, there is a few tools to quickly build transformations, most of
|
|
||||||
them requiring your class to subclass :class:`bonobo.config.Configurable`.
|
|
||||||
|
|
||||||
Configurables allows to use the following features:
|
|
||||||
|
|
||||||
* You can add **Options** (using the :class:`bonobo.config.Option` descriptor). Options can be positional, or keyword
|
|
||||||
based, can have a default value and will be consumed from the constructor arguments.
|
|
||||||
|
|
||||||
.. code-block:: python
|
|
||||||
|
|
||||||
from bonobo.config import Configurable, Option
|
|
||||||
|
|
||||||
class PrefixIt(Configurable):
|
|
||||||
prefix = Option(str, positional=True, default='>>>')
|
|
||||||
|
|
||||||
def __call__(self, row):
|
|
||||||
return self.prefix + ' ' + row
|
|
||||||
|
|
||||||
prefixer = PrefixIt('$')
|
|
||||||
|
|
||||||
* You can add **Services** (using the :class:`bonobo.config.Service` descriptor). Services are a subclass of
|
|
||||||
:class:`bonobo.config.Option`, sharing the same basics, but specialized in the definition of "named services" that
|
|
||||||
will be resolved at runtime (a.k.a for which we will provide an implementation at runtime). We'll dive more into that
|
|
||||||
in the next section
|
|
||||||
|
|
||||||
.. code-block:: python
|
|
||||||
|
|
||||||
from bonobo.config import Configurable, Option, Service
|
|
||||||
|
|
||||||
class HttpGet(Configurable):
|
|
||||||
url = Option(default='https://jsonplaceholder.typicode.com/users')
|
|
||||||
http = Service('http.client')
|
|
||||||
|
|
||||||
def __call__(self, http):
|
|
||||||
resp = http.get(self.url)
|
|
||||||
|
|
||||||
for row in resp.json():
|
|
||||||
yield row
|
|
||||||
|
|
||||||
http_get = HttpGet()
|
|
||||||
|
|
||||||
|
|
||||||
* You can add **Methods** (using the :class:`bonobo.config.Method` descriptor). :class:`bonobo.config.Method` is a
|
Default services
|
||||||
subclass of :class:`bonobo.config.Option` that allows to pass callable parameters, either to the class constructor,
|
::::::::::::::::
|
||||||
or using the class as a decorator.
|
|
||||||
|
|
||||||
.. code-block:: python
|
By default, |bonobo| provides only two services:
|
||||||
|
|
||||||
from bonobo.config import Configurable, Method
|
* `fs`, a :obj:`fs.osfs.OSFS` object to access files.
|
||||||
|
* `http`, a :obj:`requests.Session` object to access the Web.
|
||||||
|
|
||||||
class Applier(Configurable):
|
|
||||||
apply = Method()
|
|
||||||
|
|
||||||
def __call__(self, row):
|
Overriding services
|
||||||
return self.apply(row)
|
:::::::::::::::::::
|
||||||
|
|
||||||
@Applier
|
You can override the default services, or define your own services, by providing a dictionary to the `services=`
|
||||||
def Prefixer(self, row):
|
argument of :obj:`bonobo.run`:
|
||||||
return 'Hello, ' + row
|
|
||||||
|
|
||||||
prefixer = Prefixer()
|
|
||||||
|
|
||||||
* You can add **ContextProcessors**, which are an advanced feature we won't introduce here. If you're familiar with
|
|
||||||
pytest, you can think of them as pytest fixtures, execution wise.
|
|
||||||
|
|
||||||
Services
|
|
||||||
::::::::
|
|
||||||
|
|
||||||
The motivation behind services is mostly separation of concerns, testability and deployability.
|
|
||||||
|
|
||||||
Usually, your transformations will depend on services (like a filesystem, an http client, a database, a rest api, ...).
|
|
||||||
Those services can very well be hardcoded in the transformations, but there is two main drawbacks:
|
|
||||||
|
|
||||||
* You won't be able to change the implementation depending on the current environment (development laptop versus
|
|
||||||
production servers, bug-hunting session versus execution, etc.)
|
|
||||||
* You won't be able to test your transformations without testing the associated services.
|
|
||||||
|
|
||||||
To overcome those caveats of hardcoding things, we define Services in the configurable, which are basically
|
|
||||||
string-options of the service names, and we provide an implementation at the last moment possible.
|
|
||||||
|
|
||||||
There are two ways of providing implementations:
|
|
||||||
|
|
||||||
* Either file-wide, by providing a `get_services()` function that returns a dict of named implementations (we did so
|
|
||||||
with filesystems in the previous step, :doc:`tut02`)
|
|
||||||
* Either directory-wide, by providing a `get_services()` function in a specially named `_services.py` file.
|
|
||||||
|
|
||||||
The first is simpler if you only have one transformation graph in one file, the second allows to group coherent
|
|
||||||
transformations together in a directory and share the implementations.
|
|
||||||
|
|
||||||
Let's see how to use it, starting from the previous service example:
|
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
from bonobo.config import Configurable, Option, Service
|
|
||||||
|
|
||||||
class HttpGet(Configurable):
|
|
||||||
url = Option(default='https://jsonplaceholder.typicode.com/users')
|
|
||||||
http = Service('http.client')
|
|
||||||
|
|
||||||
def __call__(self, http):
|
|
||||||
resp = http.get(self.url)
|
|
||||||
|
|
||||||
for row in resp.json():
|
|
||||||
yield row
|
|
||||||
|
|
||||||
We defined an "http.client" service, that obviously should have a `get()` method, returning responses that have a
|
|
||||||
`json()` method.
|
|
||||||
|
|
||||||
Let's provide two implementations for that. The first one will be using `requests <http://docs.python-requests.org/>`_,
|
|
||||||
that coincidally satisfies the described interface:
|
|
||||||
|
|
||||||
.. code-block:: python
|
|
||||||
|
|
||||||
import bonobo
|
|
||||||
import requests
|
import requests
|
||||||
|
|
||||||
def get_services():
|
def get_services():
|
||||||
|
http = requests.Session()
|
||||||
|
http.headers = {'User-Agent': 'Monkeys!'}
|
||||||
return {
|
return {
|
||||||
'http.client': requests
|
'http': http
|
||||||
}
|
}
|
||||||
|
|
||||||
graph = bonobo.Graph(
|
Switching requests to use the service
|
||||||
HttpGet(),
|
:::::::::::::::::::::::::::::::::::::
|
||||||
print,
|
|
||||||
)
|
|
||||||
|
|
||||||
If you run this code, you should see some mock data returned by the webservice we called (assuming it's up and you can
|
Let's replace the :obj:`requests.get` call we used in the first steps to use the `http` service:
|
||||||
reach it).
|
|
||||||
|
|
||||||
Now, the second implementation will replace that with a mock, used for testing purposes:
|
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
class HttpResponseStub:
|
from bonobo.config import use
|
||||||
def json(self):
|
|
||||||
return [
|
|
||||||
{'id': 1, 'name': 'Leanne Graham', 'username': 'Bret', 'email': 'Sincere@april.biz', 'address': {'street': 'Kulas Light', 'suite': 'Apt. 556', 'city': 'Gwenborough', 'zipcode': '92998-3874', 'geo': {'lat': '-37.3159', 'lng': '81.1496'}}, 'phone': '1-770-736-8031 x56442', 'website': 'hildegard.org', 'company': {'name': 'Romaguera-Crona', 'catchPhrase': 'Multi-layered client-server neural-net', 'bs': 'harness real-time e-markets'}},
|
|
||||||
{'id': 2, 'name': 'Ervin Howell', 'username': 'Antonette', 'email': 'Shanna@melissa.tv', 'address': {'street': 'Victor Plains', 'suite': 'Suite 879', 'city': 'Wisokyburgh', 'zipcode': '90566-7771', 'geo': {'lat': '-43.9509', 'lng': '-34.4618'}}, 'phone': '010-692-6593 x09125', 'website': 'anastasia.net', 'company': {'name': 'Deckow-Crist', 'catchPhrase': 'Proactive didactic contingency', 'bs': 'synergize scalable supply-chains'}},
|
|
||||||
]
|
|
||||||
|
|
||||||
class HttpStub:
|
@use('http')
|
||||||
def get(self, url):
|
def extract_fablabs(http):
|
||||||
return HttpResponseStub()
|
yield from http.get(FABLABS_API_URL).json().get('records')
|
||||||
|
|
||||||
def get_services():
|
Tadaa, done! You're no longer tied to a specific implementation, but to whatever :obj:`requests`-compatible object the
|
||||||
return {
|
user wants to provide.
|
||||||
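To illustrate what "any `requests`-compatible object" buys you, here is a minimal sketch of exercising the transformation with a stub instead of a real session. `StubHttp`, `StubResponse` and the placeholder URL are hypothetical names introduced for this example only. The function is written without the `@use('http')` decorator so the service can be passed in by hand, which is an assumption about how you might test it, not how |bonobo| injects it.

```python
# Hypothetical stand-ins: StubResponse and StubHttp are not part of bonobo or
# requests; they only mimic the two methods the transformation relies on.
FABLABS_API_URL = 'https://example.invalid/fablabs'  # placeholder URL (assumption)


class StubResponse:
    """Mimics the part of a requests response used below: .json()."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


class StubHttp:
    """Mimics the part of a requests session used below: .get(url)."""

    def __init__(self, payload):
        self._payload = payload
        self.requested_urls = []

    def get(self, url):
        # Record the URL so a test can check what was requested.
        self.requested_urls.append(url)
        return StubResponse(self._payload)


def extract_fablabs(http):
    # Same body as the tutorial's transformation, minus the @use('http')
    # decorator, so the service is an ordinary argument here.
    yield from http.get(FABLABS_API_URL).json().get('records')


stub = StubHttp({'records': [{'name': 'Lab A'}, {'name': 'Lab B'}]})
rows = list(extract_fablabs(stub))
print(rows)
```

No network access is needed, and the stub records which URLs were requested, so a test can assert on both the rows produced and the calls made.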
'http.client': HttpStub()
|
|
||||||
}
|
|
||||||
|
|
||||||
graph = bonobo.Graph(
|
Adding cache
|
||||||
HttpGet(),
|
::::::::::::
|
||||||
print,
|
|
||||||
)
|
|
||||||
|
|
||||||
The `Graph` definition staying the exact same, you can easily substitute the `_services.py` file depending on your
|
Let's demonstrate the flexibility of this approach by adding some local cache for HTTP requests, to avoid hammering the
|
||||||
environment (the way you're doing this is out of bonobo scope and heavily depends on your usual way of managing
|
API endpoint as we run our tests.
|
||||||
configuration files on different platforms).
|
|
||||||
|
|
||||||
Starting with bonobo 0.5 (not yet released), you will be able to use service injections with function-based
|
First, let's install `requests-cache`:
|
||||||
transformations too, using the `bonobo.config.requires` decorator to mark a dependency.
|
|
||||||
|
.. code-block:: shell-session
|
||||||
|
|
||||||
|
$ pip install requests-cache
|
||||||
|
|
||||||
|
Then, let's switch the implementation, conditionally.
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
from bonobo.config import requires
|
def get_services(use_cache=False):
|
||||||
|
if use_cache:
|
||||||
|
from requests_cache import CachedSession
|
||||||
|
http = CachedSession('http.cache')
|
||||||
|
else:
|
||||||
|
import requests
|
||||||
|
http = requests.Session()
|
||||||
|
|
||||||
@requires('http.client')
|
return {
|
||||||
def http_get(http):
|
'http': http
|
||||||
resp = http.get('https://jsonplaceholder.typicode.com/users')
|
}
|
||||||
|
|
||||||
for row in resp.json():
|
Then in the main block, let's add support for a `--use-cache` argument:
|
||||||
yield row
|
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
Read more
|
if __name__ == '__main__':
|
||||||
:::::::::
|
parser = bonobo.get_argument_parser()
|
||||||
|
parser.add_argument('--use-cache', action='store_true', default=False)
|
||||||
|
|
||||||
* :doc:`/guide/services`
|
with bonobo.parse_args(parser) as options:
|
||||||
* :doc:`/reference/api_config`
|
bonobo.run(get_graph(**options), services=get_services(**options))
|
||||||
|
|
||||||
Next
|
And you're done! Now, you can switch the cache on or off using the `--use-cache` argument on the command line when
|
||||||
::::
|
running your job.
|
||||||
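The way the `--use-cache` flag reaches `get_services()` can be sketched with the standard library alone. This is an illustration under stated assumptions: plain `argparse` stands in for `bonobo.get_argument_parser()`, and placeholder strings stand in for the real session objects, so the sketch runs without bonobo, `requests` or `requests-cache` installed.

```python
import argparse


def get_services(use_cache=False):
    # Placeholder strings stand in for the real session objects
    # (requests.Session or requests_cache.CachedSession in the tutorial).
    http = 'cached-session' if use_cache else 'plain-session'
    return {'http': http}


def build_parser():
    # Stand-in for bonobo.get_argument_parser(); plain argparse is used here
    # so the sketch has no third-party dependency.
    parser = argparse.ArgumentParser()
    parser.add_argument('--use-cache', action='store_true', default=False)
    return parser


parser = build_parser()

# argparse converts the '--use-cache' flag into the attribute 'use_cache',
# which matches the keyword argument expected by get_services().
with_flag = vars(parser.parse_args(['--use-cache']))
without_flag = vars(parser.parse_args([]))

services_cached = get_services(**with_flag)
services_plain = get_services(**without_flag)
print(services_cached, services_plain)
```

Note that unpacking the parsed options with `**` only works if every parsed option is a keyword the factory accepts. If you add more arguments to the parser, the factory needs matching parameters (or a `**kwargs` catch-all).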
|
|
||||||
:doc:`tut04`.
|
|
||||||
|
|
||||||
|
|
||||||
Moving forward
|
Moving forward
|
||||||
@ -203,6 +101,9 @@ Moving forward
|
|||||||
|
|
||||||
You now know:
|
You now know:
|
||||||
|
|
||||||
* How to ...
|
* How to use builtin service implementations
|
||||||
|
* How to override a service
|
||||||
|
* How to define your own service
|
||||||
|
* How to tune the default argument parser
|
||||||
|
|
||||||
It's now time to jump to :doc:`5-packaging`.
|
It's now time to jump to :doc:`5-packaging`.
|
||||||
|
|||||||
@ -1,32 +1,67 @@
|
|||||||
Part 5: Projects and Packaging
|
Part 5: Projects and Packaging
|
||||||
==============================
|
==============================
|
||||||
|
|
||||||
.. include:: _wip_note.rst
|
|
||||||
|
|
||||||
Until now, we worked with one file managing a job.
|
Until now, we worked with one file managing a job.
|
||||||
|
|
||||||
Real life often involves more complicated setups, with relations and imports between different files.
|
Real life often involves more complicated setups, with relations and imports between different files.
|
||||||
|
|
||||||
This section will describe the options available to move this file into a package, either a new one or something
|
|
||||||
that already exists in your own project.
|
|
||||||
|
|
||||||
Data processing is something a wide variety of tools may want to include, and thus |bonobo| does not enforce any
|
Data processing is something a wide variety of tools may want to include, and thus |bonobo| does not enforce any
|
||||||
kind of project structure, as the targert structure will be dicated by the hosting project. For example, a `pipelines`
|
kind of project structure, as the target structure will be dictated by the hosting project. For example, a `pipelines`
|
||||||
sub-package would perfectly fit a django or flask project, or even a regular package, but it's up to you to choose the
|
sub-package would perfectly fit a django or flask project, or even a regular package, but it's up to you to choose the
|
||||||
structure of your project.
|
structure of your project.
|
||||||
|
|
||||||
is about set of jobs working together within a project.
|
|
||||||
|
|
||||||
Let's see how to move from the current status to a package.
|
Imports mechanism
|
||||||
|
:::::::::::::::::
|
||||||
|
|
||||||
|
|bonobo| does not enforce anything on how the python import mechanism works. In particular, it won't add anything to your
|
||||||
|
`sys.path`, unlike some popular projects, because we're not sure that's something you want.
|
||||||
|
|
||||||
|
If you want to use imports, you should move your script into a python package, and it's up to you to have it set up
|
||||||
|
correctly.
|
||||||
|
|
||||||
|
|
||||||
|
Moving into an existing project
|
||||||
|
:::::::::::::::::::::::::::::::
|
||||||
|
|
||||||
|
The first, and quite popular, option is to move your ETL job file into a package that already exists.
|
||||||
|
|
||||||
|
For example, it can be your existing software, possibly using some frameworks like django, flask, twisted, celery...
|
||||||
|
Name yours!
|
||||||
|
|
||||||
|
We suggest, but nothing is compulsory, that you decide on a namespace that will hold all your ETL pipelines and move all
|
||||||
|
your jobs in it. For example, it can be `mypkg.pipelines`.
|
||||||
|
|
||||||
|
|
||||||
|
Creating a brand new package
|
||||||
|
::::::::::::::::::::::::::::
|
||||||
|
|
||||||
|
If you're starting a project with the data-engineering part, you may not have a python package yet. As
|
||||||
|
it can be a bit tedious to set up right, there is a helper, using `Medikit <http://medikit.rdc.li/en/latest/>`_, that
|
||||||
|
you can use to create a brand new project:
|
||||||
|
|
||||||
|
.. code-block:: shell-session
|
||||||
|
|
||||||
|
$ bonobo init --package pipelines
|
||||||
|
|
||||||
|
Answer a few questions, and you should now have a `pipelines` package, with an example transformation in it.
|
||||||
|
|
||||||
|
You can now follow the instructions on how to install it (`pip install --editable pipelines`), and the import mechanism
|
||||||
|
will work "just right" in it.
|
||||||
|
|
||||||
|
|
||||||
|
Common stuff
|
||||||
|
::::::::::::
|
||||||
|
|
||||||
|
Probably, you'll want to separate the `get_services()` factory from your pipelines, and just import it, as the
|
||||||
|
dependencies may very well be project wide.
|
||||||
|
|
||||||
|
But hey, it's just python! You're at home, now!
|
||||||
|
|
||||||
|
|
||||||
Moving forward
|
Moving forward
|
||||||
::::::::::::::
|
::::::::::::::
|
||||||
|
|
||||||
You now know:
|
|
||||||
|
|
||||||
* How to ...
|
|
||||||
|
|
||||||
That's the end of the tutorial, you should now be familiar with all the basics.
|
That's the end of the tutorial, you should now be familiar with all the basics.
|
||||||
|
|
||||||
A few appendixes to the tutorial can explain how to integrate with other systems (we'll use the "fablabs" application
|
A few appendixes to the tutorial can explain how to integrate with other systems (we'll use the "fablabs" application
|
||||||
@ -40,6 +75,9 @@ created in this tutorial and extend it):
|
|||||||
Then, you can either jump head-first into your code, or you can get a better grasp of all concepts by
|
Then, you can either jump head-first into your code, or you can get a better grasp of all concepts by
|
||||||
:doc:`reading the full bonobo guide </guide/index>`.
|
:doc:`reading the full bonobo guide </guide/index>`.
|
||||||
|
|
||||||
|
You should also `join the slack community <https://bonobo-slack.herokuapp.com/>`_ and ask all your questions there! No
|
||||||
|
need to stay alone, and the only stupid question is the one nobody asks!
|
||||||
|
|
||||||
Happy data flows!
|
Happy data flows!
|
||||||
|
|
||||||
|
|
||||||
|
|||||||