Two weeks ago I started working on Telemetry Analysis Framework.
We are simplifying the MapReduce workflow to be as flexible yet as easy to use and debug as possible. Jonas has been developing TaskCluster for a while and he came up with the idea of porting the analysis to it.
What is TaskCluster
TaskCluster is a set of components that manages task queuing, scheduling, execution and provisioning of resources. It is designed to run automated builds and tests at Mozilla. You can imagine it like the diagram below:TaskCluster for our MapReduce Analysis
When you talk about MapReduce it is usually referring to a directed graph workflow like the one below. Taskcluster provides task graphs that can easily be used as a MapReduce workflow.As mentioned above we want this framework to be:
- simple to use
- flexible
- easy to debug
Simple to use
Simplicity in this case means the programmer has to specify as little as possible:
Docker Image
For the purpose of this application you can consider a Docker image as a lightweight virtual machine. In it you store the setup for the analysis: needed utilities and custom code.Because we know that starting with something new can be annoying we provide a base image where you only need to write your custom code. Also we provide dead easy documentation for each step. More information about Docker can be found here.
Filter.json
Filter.json is the file given so we can extract from S3 bucket the files needed for the analysis.
By providing these two elements the framework will proceed as follows:
The Analysis Framework will parse the Filter.json, make a request to the index database, split the response in sublists of files and start a map task for each batch of files.
A map task would look as follows:
For a mapper task we would need to specify the Docker image (line 25), a command (line 27) and the files that need to be processed (lines 28 and 29).
The output of this task will be an artifact /bin/mapper/result.txt and will be uploaded to an intermediate bucket on Amazon.
Flexible
We call this framework flexible because you can customize all the layers provided.The framework comes with a downloader utility, several mapper helpers (python, javascript) that can decompress the files and feed the result to the loaded custom map function.
If you would like a different Docker image you can customize that too. The framework includes a default image and also the Docker file out of which you can build the image. By modifying that file you can easily get another Docker image that suits your needs.
If you are on MacOS you probably need some way to work with Docker. By installing Vagrant and adding the Vagrantfile to your directory you can work easily with Docker on your machine.
Easy to debug
Until now the analysis that we performed had close to zero logging. This had to change so we could get a robust and easy way to debug the workflow. With the approach we are trying to develop right now we have logging at each step. The developer has the option to see how many rows were processed, what were the files downloaded with success, what were the ones decompressed successfully and if an error occurred. With this information some retry policies could be implemented.Last but not least
I will be blogging about this next week and also provide you with the link to the official repo. Right now a lot of changes happen in my playground :) ..