Working on 0.6 documentation.
Part 2: Writing ETL Jobs
========================

.. include:: _wip_note.rst

In |bonobo|, an ETL job is a graph with some logic to execute it, like the file we created in the previous section.

What's an ETL job?
::::::::::::::::::

In |bonobo|, an ETL job is a formal definition of an executable graph.

Each node of a graph is executed in isolation from the other nodes, and data is passed from one node to the
next using FIFO queues, managed by the framework. This is transparent to the end user, though: you'll only use
function arguments (for inputs) and return/yield values (for outputs).

Each input row of a node causes one call to that node's callable. Each output is cast internally to a tuple-like
data structure (or more precisely, a namedtuple-like data structure), and for a given node, each output row must
have the same structure.

If you return/yield something which is not a tuple, |bonobo| will create a tuple of one element.
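
For example, here is a hypothetical transformation node (the name and fields are ours, purely for illustration):
the input row's fields arrive as positional arguments, and each `yield` emits one output row.

.. code-block:: python

    def split_name(fullname):
        # Called once per input row; "fullname" is the row's single field.
        first, _, last = fullname.partition(' ')
        # Each yield emits one output row, cast to a tuple-like structure;
        # yielding a bare string instead would become a one-element tuple.
        yield first, last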

Properties
----------

|bonobo| assists you in defining the data flow of your data engineering process, and then streams data through
your callable graphs.

* Each node call processes one row of data.
* The queues that flow data between nodes are standard first-in, first-out (FIFO) Python :class:`queue.Queue` instances.
* Each node runs in parallel with the others.
* The default execution strategy uses threading, and each node runs in a separate thread (see the sketch below).
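
As an illustration of the last point, a minimal sketch, assuming `bonobo.run(...)` accepts a `strategy` name and
that the usual strategy names are available in your version (verify against the API reference):

.. code-block:: python

    import bonobo

    graph = bonobo.Graph()  # an empty graph, just to have something to run

    # "naive" runs every node sequentially in the current thread, which can
    # make debugging easier than the default threaded strategy.
    bonobo.run(graph, strategy='naive')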

Fault tolerance
---------------

Node execution is fault tolerant.

If an exception is raised by a node call, that call is aborted, but |bonobo| will continue the execution with the
next row (after outputting the stack trace and incrementing the "err" counter for the node's context).

This allows ETL jobs to ignore faulty data and try their best to process the valid rows of a dataset.

Some errors are fatal, though.

If you pass a 2-element tuple to a node that takes 3 arguments, |bonobo| will raise an
:class:`bonobo.errors.UnrecoverableTypeError` and exit the current graph execution as fast as it can (finishing
the node executions already in progress, but not starting new ones even if there are remaining input rows).
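
As a sketch of the recoverable case, consider a hypothetical node (name and fields are ours) that chokes on some
rows:

.. code-block:: python

    def parse_price(label, price):
        # float() raises ValueError on malformed input; bonobo aborts only
        # this call, reports the error, and moves on to the next row.
        yield label, float(price)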

You can learn more about the :class:`bonobo.Graph` data structure and its properties in the
:doc:`graphs guide </guide/graphs>`.

Let's write a sample data integration job
:::::::::::::::::::::::::::::::::::::::::

Scenario
::::::::

Let's create a sample application, whose goal will be to integrate some data into various systems.

We'll use an open-data dataset containing all the fablabs in the world.

We will normalize this data using a few different rules, then write it somewhere.

In this step, we'll focus on getting this data normalized and output to the console. In the next steps, we'll
extend it to other targets, like files and databases.

Setup
:::::

We'll change the `tutorial.py` file created in the last step to handle this new scenario.

First, let's remove all boilerplate code, so it looks like this:

.. code-block:: python

    import bonobo


    def get_graph(**options):
        graph = bonobo.Graph()
        return graph


    def get_services(**options):
        return {}


    if __name__ == '__main__':
        parser = bonobo.get_argument_parser()
        with bonobo.parse_args(parser) as options:
            bonobo.run(get_graph(**options), services=get_services(**options))

Your job now contains the logic for executing an empty graph, and we'll complete this with our application logic.

Reading the source data
:::::::::::::::::::::::

Let's add a simple chain to our `get_graph(...)` function, so that it reads from the fablabs open-data API.

The source dataset we'll use can be found on `this site <https://public-us.opendatasoft.com/explore/dataset/fablabs/>`_.
It's licensed under `Public Domain`, which makes it just perfect for our example.

.. note::

    There is a :mod:`bonobo.contrib.opendatasoft` module that makes reading from OpenDataSoft APIs easier,
    including pagination and limits, but for our tutorial, we'll avoid that and build the extractor manually.

Let's write our extractor:

.. code-block:: python

    import requests

    FABLABS_API_URL = 'https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=1000'


    def extract_fablabs():
        yield from requests.get(FABLABS_API_URL).json().get('records')

This extractor will be called once: it queries the API URL, parses the response as JSON, and yields the items
from the "records" list, one by one.

.. note::

    You'll probably want to make this a bit more defensive in a real application, to handle all kinds of errors
    that can happen here. What if the server is down? What if it returns a response which is not JSON? What if
    the data is not in the expected format?

    For simplicity's sake, we'll ignore that here, but those are the kinds of questions you should keep in mind
    when writing pipelines.
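
Purely as an illustration, a sketch of what that defensiveness could look like (this error-handling policy is our
assumption, not part of the tutorial's code):

.. code-block:: python

    import requests

    FABLABS_API_URL = 'https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=1000'


    def extract_fablabs():
        response = requests.get(FABLABS_API_URL, timeout=30)
        # Fail loudly on HTTP-level errors (4xx/5xx) instead of parsing garbage.
        response.raise_for_status()
        try:
            payload = response.json()
        except ValueError as exc:
            raise ValueError('the fablabs API did not return valid JSON') from exc
        # A missing "records" key means the payload is not in the expected format.
        yield from payload.get('records', [])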

To test our pipeline, let's use a :class:`bonobo.Limit` and a :class:`bonobo.PrettyPrinter`, and change our
`get_graph(...)` function accordingly:

.. code-block:: python

    import bonobo


    def get_graph(**options):
        graph = bonobo.Graph()
        graph.add_chain(
            extract_fablabs,
            bonobo.Limit(10),
            bonobo.PrettyPrinter(),
        )
        return graph

Running this job should output a bit of data, along with some statistics.

First, let's look at the statistics:

.. code-block:: shell-session

    - extract_fablabs in=1 out=995 [done]
    - Limit in=995 out=10 [done]
    - PrettyPrinter in=10 out=10 [done]

It is important to understand that we extracted everything (995 rows) before dropping 99% of the dataset.

This is OK for debugging, but not efficient.

.. note::

    You should always try to limit the amount of data as early as possible, which often means not generating the
    data you won't need in the first place. Here, we could have used the `rows=` query parameter in the API URL
    to avoid requesting data we would drop anyway.
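
For instance, a one-line sketch of limiting at the source (assuming the API honors small `rows=` values the same
way it honors `rows=1000`):

.. code-block:: python

    # Ask the API for only 10 rows up front, instead of fetching 1000
    # and then discarding 99% of them with bonobo.Limit(10).
    FABLABS_API_URL = 'https://public-us.opendatasoft.com/api/records/1.0/search/?dataset=fablabs&rows=10'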

Normalize
:::::::::

.. include:: _todo.rst


Output
::::::

We used :class:`bonobo.PrettyPrinter` to output the data.

It's a flexible built-in transformation that helps you display the content of a stream, and you'll probably use
it a lot, for various reasons.

Moving forward
::::::::::::::

You now know:

* How to use a reader node.
* How to use the console output.
* How to limit the number of elements in a stream.
* How to pass data from one node to another.
* How to structure a graph using chains.

It's now time to jump to :doc:`3-files`.


Part 3: Working with Files
==========================

.. include:: _wip_note.rst

Writing to the console is nice, but using files is probably more realistic.

Let's see how to use a few builtin writers and both local and remote filesystems.

Filesystems
:::::::::::

In |bonobo|, files are accessed within a **filesystem** service, which must be something with the same interface
as `fs' FileSystem objects <https://docs.pyfilesystem.org/en/latest/builtin.html>`_. By default, you'll get an
instance of a local filesystem mapped to the current working directory as the `fs` service. You'll learn more
about services in the next step, but for now, let's just use it.
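
As an illustration (the exact wiring is our assumption for now; services are covered in the next step), you could
map the `fs` service to another directory using the `fs` library's `open_fs`:

.. code-block:: python

    import fs  # pyfilesystem2


    def get_services(**options):
        # Replace the default "fs" service (current working directory) with
        # another local directory; URLs like 'ftp://...' open remote filesystems.
        return {'fs': fs.open_fs('/tmp')}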

Writing using the service
:::::::::::::::::::::::::

Although |bonobo| contains helpers to write to common file formats, let's start by writing manually.

.. code-block:: python

    from bonobo.config import use
    from bonobo.constants import NOT_MODIFIED


    @use('fs')
    def write_repr_to_file(*row, fs):
        with fs.open('output.txt', 'a+') as f:
            print(row, file=f)
        return NOT_MODIFIED

Then, update the `get_graph(...)` function by adding `write_repr_to_file` just before your `PrettyPrinter()` node.

Let's run that and think about what happens.

Each time a row reaches this node, the output file is opened in "append or create" mode, a line is written, and
the file is closed.

This is **NOT** how you want to do things. Let's rewrite it so our `open(...)` call becomes execution-wide.
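
A possible rewrite, sketched with |bonobo|'s context processors (treat the exact decorator spelling as an
assumption to check against your version's API reference): the file is opened once per execution and handed to
every call.

.. code-block:: python

    from bonobo.config import use, use_context_processor
    from bonobo.constants import NOT_MODIFIED


    @use('fs')
    def with_opened_file(self, context, fs):
        # Runs once per execution: the file stays open for the node's whole
        # lifetime and is closed when the graph finishes.
        with fs.open('output.txt', 'w+') as f:
            yield f


    @use_context_processor(with_opened_file)
    def write_repr_to_file(f, *row):
        print(row, file=f)
        return NOT_MODIFIED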

* Filesystems
* Reading files

You now know:

* How to ...

It's now time to jump to :doc:`4-services`.


You now know:

* How to ...

It's now time to jump to :doc:`5-packaging`.


...kind of project structure, as the target structure will be dictated by the hosting project. A
sub-package would perfectly fit a Django or Flask project, or even a regular package, but it's up to you to
choose the structure of your project.

Let's see how to move from the current status to a package.

You now know:

* How to ...

That's the end of the tutorial; you should now be familiar with all the basics.

A few appendixes to the tutorial explain how to integrate with other systems (we'll use the "fablabs" application
created in this tutorial and extend it):

* :doc:`notebooks`
* :doc:`sqlalchemy`
* :doc:`django`
* :doc:`docker`

Then, you can either jump head-first into your code, or get a better grasp of all the concepts by
:doc:`reading the full bonobo guide </guide/index>`.

Happy data flows!

docs/tutorial/_todo.rst (new file):

.. warning::

    This section is missing. Sorry, but stay tuned! It'll be added soon.

docs/tutorial/docker.rst (new file):

Working with Docker
===================

.. warning::

    This section does not exist yet, but it's in the plans to write it quite soon.

    Meanwhile, you can check the source code and other links provided below.

Source code
:::::::::::

https://github.com/python-bonobo/bonobo-docker