From 9af5d801717aaa5761fcca23a82f94a95e6e6b5a Mon Sep 17 00:00:00 2001 From: Romain Dorgueil Date: Sun, 14 Jan 2018 15:26:04 +0100 Subject: [PATCH] [docs] rewriting the tutorial. --- docs/conf.py | 6 +- docs/tutorial/3-files.rst | 2 - docs/tutorial/4-services.rst | 227 ++++++++++------------------------ docs/tutorial/5-packaging.rst | 62 ++++++++-- 4 files changed, 119 insertions(+), 178 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index 0e394fb..1cf5f21 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -188,7 +188,11 @@ epub_copyright = copyright epub_exclude_files = ['search.html'] # Example configuration for intersphinx: refer to the Python standard library. -intersphinx_mapping = {'python': ('https://docs.python.org/3', None)} +intersphinx_mapping = { + 'python': ('https://docs.python.org/3', None), + 'fs': ('https://docs.pyfilesystem.org/en/latest/', None), + 'requests': ('http://docs.python-requests.org/en/master/', None), +} rst_epilog = """ .. |bonobo| replace:: **Bonobo** diff --git a/docs/tutorial/3-files.rst b/docs/tutorial/3-files.rst index 25fe613..7306726 100644 --- a/docs/tutorial/3-files.rst +++ b/docs/tutorial/3-files.rst @@ -1,8 +1,6 @@ Part 3: Working with Files ========================== -.. include:: _wip_note.rst - Writing to the console is nice, but let's be serious, real world will require us to use files or external services. Let's see how to use a few builtin writers and both local and remote filesystems. diff --git a/docs/tutorial/4-services.rst b/docs/tutorial/4-services.rst index 6a25914..967fa51 100644 --- a/docs/tutorial/4-services.rst +++ b/docs/tutorial/4-services.rst @@ -1,201 +1,99 @@ -Part 4: Services and Configurables -================================== +Part 4: Services +================ -.. include:: _wip_note.rst +All external dependencies (like filesystems, network clients, database connections, etc.) should be provided to +transformations as a service. 
It allows great flexibility, including the ability to test your transformations isolated +from the external world, and being friendly to the infrastructure guys (and if you're one of them, it's also nice to +treat yourself well). -In the last section, we used a few new tools. - -Class-based transformations and configurables -::::::::::::::::::::::::::::::::::::::::::::: - -Bonobo is a bit dumb. If something is callable, it considers it can be used as a transformation, and it's up to the -user to provide callables that logically fits in a graph. - -You can use plain python objects with a `__call__()` method, and it ill just work. - -As a lot of transformations needs common machinery, there is a few tools to quickly build transformations, most of -them requiring your class to subclass :class:`bonobo.config.Configurable`. - -Configurables allows to use the following features: - -* You can add **Options** (using the :class:`bonobo.config.Option` descriptor). Options can be positional, or keyword - based, can have a default value and will be consumed from the constructor arguments. - - .. code-block:: python - - from bonobo.config import Configurable, Option - - class PrefixIt(Configurable): - prefix = Option(str, positional=True, default='>>>') - - def __call__(self, row): - return self.prefix + ' ' + row - - prefixer = PrefixIt('$') - -* You can add **Services** (using the :class:`bonobo.config.Service` descriptor). Services are a subclass of - :class:`bonobo.config.Option`, sharing the same basics, but specialized in the definition of "named services" that - will be resolved at runtime (a.k.a for which we will provide an implementation at runtime). We'll dive more into that - in the next section - - .. 
code-block:: python - - from bonobo.config import Configurable, Option, Service - - class HttpGet(Configurable): - url = Option(default='https://jsonplaceholder.typicode.com/users') - http = Service('http.client') - - def __call__(self, http): - resp = http.get(self.url) - - for row in resp.json(): - yield row - - http_get = HttpGet() +In the last section, we used the `fs` service to access filesystems. We'll go even further by switching our `requests` +call to use the `http` service, so we can switch the `requests` session at runtime. We'll use it to add an http cache, +which is a great way to avoid hammering a remote API. -* You can add **Methods** (using the :class:`bonobo.config.Method` descriptor). :class:`bonobo.config.Method` is a - subclass of :class:`bonobo.config.Option` that allows to pass callable parameters, either to the class constructor, - or using the class as a decorator. +Default services +:::::::::::::::: - .. code-block:: python +By default, |bonobo| provides only two services: - from bonobo.config import Configurable, Method +* `fs`, a :obj:`fs.osfs.OSFS` object to access files. +* `http`, a :obj:`requests.Session` object to access the Web. - class Applier(Configurable): - apply = Method() - def __call__(self, row): - return self.apply(row) +Overriding services +::::::::::::::::::: - @Applier - def Prefixer(self, row): - return 'Hello, ' + row - - prefixer = Prefixer() - -* You can add **ContextProcessors**, which are an advanced feature we won't introduce here. If you're familiar with - pytest, you can think of them as pytest fixtures, execution wise. - -Services -:::::::: - -The motivation behind services is mostly separation of concerns, testability and deployability. - -Usually, your transformations will depend on services (like a filesystem, an http client, a database, a rest api, ...). 
-Those services can very well be hardcoded in the transformations, but there is two main drawbacks: - -* You won't be able to change the implementation depending on the current environment (development laptop versus - production servers, bug-hunting session versus execution, etc.) -* You won't be able to test your transformations without testing the associated services. - -To overcome those caveats of hardcoding things, we define Services in the configurable, which are basically -string-options of the service names, and we provide an implementation at the last moment possible. - -There are two ways of providing implementations: - -* Either file-wide, by providing a `get_services()` function that returns a dict of named implementations (we did so - with filesystems in the previous step, :doc:`tut02`) -* Either directory-wide, by providing a `get_services()` function in a specially named `_services.py` file. - -The first is simpler if you only have one transformation graph in one file, the second allows to group coherent -transformations together in a directory and share the implementations. - -Let's see how to use it, starting from the previous service example: +You can override the default services, or define your own services, by providing a dictionary to the `services=` +argument of :obj:`bonobo.run`: .. code-block:: python - from bonobo.config import Configurable, Option, Service - - class HttpGet(Configurable): - url = Option(default='https://jsonplaceholder.typicode.com/users') - http = Service('http.client') - - def __call__(self, http): - resp = http.get(self.url) - - for row in resp.json(): - yield row - -We defined an "http.client" service, that obviously should have a `get()` method, returning responses that have a -`json()` method. - -Let's provide two implementations for that. The first one will be using `requests `_, -that coincidally satisfies the described interface: - -.. 
code-block:: python - - import bonobo import requests def get_services(): + http = requests.Session() + http.headers = {'User-Agent': 'Monkeys!'} return { - 'http.client': requests + 'http': http } - graph = bonobo.Graph( - HttpGet(), - print, - ) +Switching requests to use the service +::::::::::::::::::::::::::::::::::::: -If you run this code, you should see some mock data returned by the webservice we called (assuming it's up and you can -reach it). - -Now, the second implementation will replace that with a mock, used for testing purposes: +Let's replace the :obj:`requests.get` call we used in the first steps to use the `http` service: .. code-block:: python - class HttpResponseStub: - def json(self): - return [ - {'id': 1, 'name': 'Leanne Graham', 'username': 'Bret', 'email': 'Sincere@april.biz', 'address': {'street': 'Kulas Light', 'suite': 'Apt. 556', 'city': 'Gwenborough', 'zipcode': '92998-3874', 'geo': {'lat': '-37.3159', 'lng': '81.1496'}}, 'phone': '1-770-736-8031 x56442', 'website': 'hildegard.org', 'company': {'name': 'Romaguera-Crona', 'catchPhrase': 'Multi-layered client-server neural-net', 'bs': 'harness real-time e-markets'}}, - {'id': 2, 'name': 'Ervin Howell', 'username': 'Antonette', 'email': 'Shanna@melissa.tv', 'address': {'street': 'Victor Plains', 'suite': 'Suite 879', 'city': 'Wisokyburgh', 'zipcode': '90566-7771', 'geo': {'lat': '-43.9509', 'lng': '-34.4618'}}, 'phone': '010-692-6593 x09125', 'website': 'anastasia.net', 'company': {'name': 'Deckow-Crist', 'catchPhrase': 'Proactive didactic contingency', 'bs': 'synergize scalable supply-chains'}}, - ] + from bonobo.config import use - class HttpStub: - def get(self, url): - return HttpResponseStub() + @use('http') + def extract_fablabs(http): + yield from http.get(FABLABS_API_URL).json().get('records') - def get_services(): - return { - 'http.client': HttpStub() - } +Tadaa, done! 
You're no longer tied to a specific implementation, but to whatever :obj:`requests`-compatible object the +user wants to provide. - graph = bonobo.Graph( - HttpGet(), - print, - ) +Adding cache +:::::::::::: -The `Graph` definition staying the exact same, you can easily substitute the `_services.py` file depending on your -environment (the way you're doing this is out of bonobo scope and heavily depends on your usual way of managing -configuration files on different platforms). +Let's demonstrate the flexibility of this approach by adding some local cache for HTTP requests, to avoid hammering the +API endpoint as we run our tests. -Starting with bonobo 0.5 (not yet released), you will be able to use service injections with function-based -transformations too, using the `bonobo.config.requires` decorator to mark a dependency. +First, let's install `requests-cache`: + +.. code-block:: shell-session + + $ pip install requests-cache + +Then, let's switch the implementation, conditionally. .. code-block:: python - from bonobo.config import requires + def get_services(use_cache=False): + if use_cache: + from requests_cache import CachedSession + http = CachedSession('http.cache') + else: + import requests + http = requests.Session() - @requires('http.client') - def http_get(http): - resp = http.get('https://jsonplaceholder.typicode.com/users') + return { + 'http': http + } - for row in resp.json(): - yield row +Then in the main block, let's add support for a `--use-cache` argument: +.. code-block:: python -Read more -::::::::: + if __name__ == '__main__': + parser = bonobo.get_argument_parser() + parser.add_argument('--use-cache', action='store_true', default=False) -* :doc:`/guide/services` -* :doc:`/reference/api_config` + with bonobo.parse_args(parser) as options: + bonobo.run(get_graph(**options), services=get_services(**options)) -Next -:::: - -:doc:`tut04`. +And you're done! 
Now, you can turn the cache on or off using the `--use-cache` argument on the command line when +running your job. Moving forward @@ -203,6 +101,9 @@ Moving forward You now know: -* How to ... +* How to use builtin service implementations +* How to override a service +* How to define your own service +* How to tune the default argument parser It's now time to jump to :doc:`5-packaging`. diff --git a/docs/tutorial/5-packaging.rst b/docs/tutorial/5-packaging.rst index 68bc66d..fc53b91 100644 --- a/docs/tutorial/5-packaging.rst +++ b/docs/tutorial/5-packaging.rst @@ -1,32 +1,67 @@ Part 5: Projects and Packaging ============================== -.. include:: _wip_note.rst - Until then, we worked with one file managing a job. Real life often involves more complicated setups, with relations and imports between different files. -This section will describe the options available to move this file into a package, either a new one or something -that already exists in your own project. - Data processing is something a wide variety of tools may want to include, and thus |bonobo| does not enforce any -kind of project structure, as the targert structure will be dicated by the hosting project. For example, a `pipelines` +kind of project structure, as the target structure will be dictated by the hosting project. For example, a `pipelines` sub-package would perfectly fit a django or flask project, or even a regular package, but it's up to you to chose the structure of your project. - is about set of jobs working together within a project. -Let's see how to move from the current status to a package. +Imports mechanism +::::::::::::::::: + +|bonobo| does not enforce anything on how the python import mechanism works. In particular, it won't add anything to your +`sys.path`, unlike some popular projects, because we're not sure that's something you want. + +If you want to use imports, you should move your script into a python package, and it's up to you to have it set up +correctly. 
+ + +Moving into an existing project +::::::::::::::::::::::::::::::: + +The first, and quite popular, option is to move your ETL job file into a package that already exists. + +For example, it can be your existing software, possibly using a framework like django, flask, twisted, celery... +Name yours! + +We suggest, but nothing is compulsory, that you decide on a namespace that will hold all your ETL pipelines and move all +your jobs into it. For example, it can be `mypkg.pipelines`. + + +Creating a brand new package +:::::::::::::::::::::::::::: + +If you're starting a project with the data-engineering part, you may not have a python package yet. As +it can be a bit tedious to set up right, there is a helper, using `Medikit `_, that +you can use to create a brand new project: + +.. code-block:: shell-session + + $ bonobo init --package pipelines + +Answer a few questions, and you should now have a `pipelines` package, with an example transformation in it. + +You can now follow the instructions on how to install it (`pip install --editable pipelines`), and the import mechanism +will work "just right" in it. + + +Common stuff +:::::::::::: + +Probably, you'll want to separate the `get_services()` factory from your pipelines, and just import it, as the +dependencies may very well be project-wide. + +But hey, it's just python! You're at home, now! Moving forward :::::::::::::: -You now know: - -* How to ... - That's the end of the tutorial, you should now be familiar with all the basics. A few appendixes to the tutorial can explain how to integrate with other systems (we'll use the "fablabs" application @@ -40,6 +75,9 @@ created in this tutorial and extend it): Then, you can either to jump head-first into your code, or you can have a better grasp at all concepts by :doc:`reading the full bonobo guide `. +You should also `join the slack community `_ and ask all your questions there! 
No +need to stay alone, and the only stupid question is the one nobody asks! + Happy data flows!