Monday, July 28, 2014

AF (analysis framework) and the task graph story

Analysis Framework is an application built on top of TaskCluster that provides a way to execute Telemetry MapReduce jobs.

AF comprises two modules:
  

Telemetry-analysis-base


This module contains an example of a MapReduce node that can be executed over TaskCluster.
It contains:
  •  a custom specification for a Docker image
  •  a Makefile for creating a Docker image and pushing it to a registry
  •  a Vagrantfile for developers working on Mac OS X
  •  custom code for map/reduce jobs
  •  an encryption module used to decrypt STS temporary credentials

 Getting a task up and running
 

In order to run a task on TaskCluster you need a Docker container and custom code to execute in it.
 A Docker container is a Docker image in execution. To build a custom image you can use the Docker specification in the telemetry-analysis-base repository.
 As TaskCluster needs a container to run your task in, you need to push your image to the TaskCluster registry. The Makefile in the repository takes care of building the image and pushing it to the registry.

Because Telemetry jobs work with data that is not open to the public, you need a set of credentials to access the files in S3. The Docker container expects a set of encrypted temporary credentials in an environment variable called CREDENTIALS.
As these credentials are visible in the task description on TaskCluster, they are encrypted and base64 encoded. The temporary credentials used are STS Federation Token credentials. They expire after 36 hours and can be obtained only by AWS users that hold a policy allowing their generation.
After the credentials are obtained they are encrypted with a symmetric key. The symmetric key is in turn encrypted with a public key, and both are sent together as an environment variable to the Docker instance. Inside the Docker container the credentials are decrypted and used to make calls to S3.
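As a rough illustration, the hybrid scheme above could look like the following Node.js sketch. This is not the actual AF code: the cipher choice, key sizes and wire format are assumptions.

    // Hypothetical sketch of the hybrid encryption described above.
    var crypto = require('crypto');

    function encryptCredentials(stsCredentials, publicKeyPem) {
      var symmetricKey = crypto.randomBytes(32);   // fresh AES-256 key
      var iv = crypto.randomBytes(16);
      var cipher = crypto.createCipheriv('aes-256-cbc', symmetricKey, iv);
      var payload = Buffer.concat([
        cipher.update(JSON.stringify(stsCredentials), 'utf8'),
        cipher.final()
      ]);
      return {
        // the symmetric key, readable only with the container's private key
        key: crypto.publicEncrypt(publicKeyPem, symmetricKey).toString('base64'),
        iv: iv.toString('base64'),
        credentials: payload.toString('base64')    // ends up in CREDENTIALS
      };
    }

Inside the container the inverse steps run: decrypt the symmetric key with the private key, then decrypt the credentials with it.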

Custom code

Inside the container resides the custom code, which receives as arguments a set of file names in S3.

Mapper

After decrypting the credentials, the mapper takes the file list and starts downloading batches of files in parallel. As files finish downloading they are stored in a directory called s3/<path in S3> and their names are passed as arguments to the mapperDriver.
The mapperDriver first reads the specification of the job from analysis-tools.yml. The configuration file specifies whether the files need to be decompressed, which mapper function to run, and the language the mapper is written in.
Next, as the example provided in the repository is in Python, the driver spawns another process that executes python-helper-mapper, which reads the files, decompresses them, loads the mapper function and feeds the decompressed files to it line by line.
The mapper function writes its output to result.txt. This file is an artifact of the task that ran.
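The repository example is written in Python, but to give a feel for the contract here is a minimal mapper sketched in JavaScript. The helper API and names are illustrative; only "output goes to result.txt" comes from the description above.

    // Minimal mapper sketch: feed each decompressed line to map() and
    // collect result.txt as the task artifact.
    var fs = require('fs');
    var readline = require('readline');

    var out = fs.createWriteStream('result.txt');  // task artifact

    function map(line) {
      // Illustrative logic: emit one key/value record per input line.
      out.write('lineCount\t1\n');
    }

    // Stand-in for the helper: stream a local file through map().
    var rl = readline.createInterface({
      input: fs.createReadStream(process.argv[2])
    });
    rl.on('line', map);
    rl.on('close', function () { out.end(); });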

Reducer

The reducer task requires an environment variable named INPUT_TASK_IDS specifying all the mapper task ids. Holding the list of all mappers, the reducer makes calls to fetch the result files of every mapper. As the files finish downloading they are stored in a folder called mapperOutput.
The reducerDriver then reads the specification of the job from analysis-tools.yml, which contains the reducer function name and the language it is written in.
In the example provided in the repository the reducer is also written in Python, so it uses an intermediary module called python-helper-reducer. This module loads the reducer, removes all empty lines from the result files and feeds them to the reducer function.
The output is written to the result file, which is an artifact of the reducer task. After writing the result file, the reducer sends an email to the owner of the task. This mail contains a link to the output of the MapReduce job. The email address is given in an environment variable called OWNER.
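Mirroring the mapper sketch above, a reducer fed the non-empty lines of every mapper result could look roughly like this (the directory layout and record format are assumptions):

    // Reducer sketch: read every mapper result from mapperOutput/,
    // drop empty lines, aggregate, then write the result artifact.
    var fs = require('fs');
    var counts = {};

    function reduce(line) {
      var parts = line.split('\t');
      counts[parts[0]] = (counts[parts[0]] || 0) + Number(parts[1]);
    }

    fs.readdirSync('mapperOutput').forEach(function (name) {
      fs.readFileSync('mapperOutput/' + name, 'utf8')
        .split('\n')
        .filter(function (l) { return l.trim() !== ''; })
        .forEach(reduce);
    });
    fs.writeFileSync('result.txt', JSON.stringify(counts));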



Telemetry-analysis




This module constructs a task graph and posts it to TaskCluster.
At this point a set of credentials is needed to run a task graph:
  • credentials in AWS allowing Federation Token generation. To obtain them you need to specify a policy enabling STS credential generation.
  • the public key associated with the private one residing in the running Docker container
  • the symmetric key used to encrypt the Federation Token credentials
  • access to IndexDB
  • credentials for TaskCluster (in the future)


 Example call:

./AF.js Filter.json "registry.taskcluster.net/aaa" '{"OWNER" : "unicorn@mozilla.com", "BLACK" : "black"}'


 AF takes as arguments a Filter.json, a Docker image (registry.taskcluster.net/aaa) and optionally some other arguments that will be passed as environment variables to the Docker container.
AF executes the following:

  • makes a call for a Federation Token, encrypts the credentials as described above and provides them, base64 encoded, in the CREDENTIALS environment variable of the Docker container
  • using Filter.json, AF queries IndexDB to get the specific file names and file sizes
  • creates skeletons for mapper tasks and adds their payload (file names from IndexDB)
  • pushes each task definition into the graph skeleton
  • creates the reducer task and gives it the labels of the tasks it depends on as dependencies
  • posts the graph
  • gets the graph definition, prints it together with a link to a simple monitor page
  • as the graph finishes execution, that page will contain links to the results
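For illustration, the graph AF posts might look roughly like the skeleton below. The field names only approximate the TaskCluster task-graph schema and should be treated as assumptions.

    // Hypothetical task-graph skeleton: N mapper tasks plus one reducer
    // that names the mapper labels as its dependencies.
    var graph = {
      tasks: [
        { label: 'mapper-0',
          requires: [],
          task: { image: 'registry.taskcluster.net/aaa',
                  command: ['./mapper', '<file names from IndexDB>'],
                  env: { CREDENTIALS: '<encrypted blob>',
                         OWNER: 'unicorn@mozilla.com' } } },
        { label: 'reducer',
          requires: ['mapper-0' /* , 'mapper-1', ... */],
          task: { image: 'registry.taskcluster.net/aaa',
                  command: ['./reducer'],
                  env: { INPUT_TASK_IDS: '<mapper task ids>',
                         OWNER: 'unicorn@mozilla.com' } } }
      ]
    };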

Last but not least


Analysis Framework is a really interesting and fun project. It can easily be extended or reused, it is designed as an example that can be customized, and it has some documentation too. :p




Sunday, June 29, 2014

Like all magnificent things, it's very simple.

While working on the Dashboard I learned about the analysis workflow. This is a diagram stolen from Jonas:




Two weeks ago I started working on Telemetry Analysis Framework.
We are simplifying the MapReduce workflow to be as flexible yet as easy to use and debug as possible. Jonas has been developing TaskCluster for a while and he came up with the idea of porting the analysis to it.


What is TaskCluster

TaskCluster is a set of components that manages task queuing, scheduling, execution and provisioning of resources. It is designed to run automated builds and tests at Mozilla. You can imagine it like the diagram below:



 

TaskCluster for our MapReduce Analysis

When people talk about MapReduce they are usually referring to a directed graph workflow like the one below. TaskCluster provides task graphs that can easily be used as a MapReduce workflow.

     
As mentioned above we want this framework to be:
  • simple to use
  • flexible
  • easy to debug


Simple to use 
Simplicity in this case means the programmer has to specify as little as possible:

 

Docker Image
For the purpose of this application you can consider a Docker image as a lightweight virtual machine. In it you store the setup for the analysis: needed utilities and custom code.

Because we know that starting with something new can be annoying, we provide a base image where you only need to write your custom code. We also provide dead-easy documentation for each step. More information about Docker can be found here.

 

Filter.json 

Filter.json is the file we provide so the framework can extract from the S3 bucket the files needed for the analysis.
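For the sake of illustration, a Filter.json might look something like this; the exact field names depend on the index schema and are assumptions here:

    {
      "channel": "nightly",
      "version": "32",
      "from": "20140601",
      "to": "20140615"
    }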


By providing these two elements the framework will proceed as follows:


The Analysis Framework will parse the Filter.json, make a request to the index database, split the response into sublists of files and start a map task for each batch of files.
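The batching step itself is simple; a sketch (the batch size is an arbitrary choice):

    // Split the file list returned by the index database into
    // fixed-size batches, one map task per batch.
    function makeBatches(files, batchSize) {
      var batches = [];
      for (var i = 0; i < files.length; i += batchSize) {
        batches.push(files.slice(i, i + batchSize));
      }
      return batches;
    }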

A map task would look as follows:


 
 For a mapper task we need to specify the Docker image, a command and the files that need to be processed.
The output of this task will be an artifact, /bin/mapper/result.txt, uploaded to an intermediate bucket on Amazon.
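A rough sketch of such a mapper task (field names approximate the TaskCluster schema of the time and are assumptions):

    // Hypothetical mapper task definition: a Docker image, a command,
    // and the batch of files to process.
    var mapperTask = {
      payload: {
        image: 'registry.taskcluster.net/telemetry-analysis-base',
        command: ['./mapper', 's3://bucket/file-a', 's3://bucket/file-b'],
        env: { CREDENTIALS: '<encrypted blob>' },
        artifacts: { 'result.txt': '/bin/mapper/result.txt' }
      }
    };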

Flexible

We call this framework flexible because you can customize all the layers provided.
The framework comes with a downloader utility and several mapper helpers (Python, JavaScript) that can decompress the files and feed the result to the loaded custom map function.
If you would like a different Docker image you can customize that too. The framework includes a default image and also the Dockerfile from which you can build it. By modifying that file you can easily get another Docker image that suits your needs.

If you are on Mac OS X you probably need some way to work with Docker. By installing Vagrant and adding the Vagrantfile to your directory you can work easily with Docker on your machine.

Easy to debug

Until now the analysis that we performed had close to zero logging. This had to change so we could get a robust and easy way to debug the workflow. With the approach we are developing right now we have logging at each step. The developer can see how many rows were processed, which files were downloaded successfully, which were decompressed successfully and whether an error occurred. With this information some retry policies could be implemented.

Last but not least
I will be blogging about this next week and will also provide the link to the official repo. Right now a lot of changes are happening in my playground :) ..



Tuesday, June 10, 2014

“Alice:How long is forever? White Rabbit:Sometimes, just one second.”

Last week I finally pushed the new dashboard to production.

The problem with bugs is that ...the later you find them the harder they bite your ass. I won't talk about bugs because you don't speak ill of the dead :p

Telemetry Dashboards measure how Firefox behaves on desktop and mobile in real-world use. This application provides the aggregate view of the data collected.

One can hit http://telemetry.mozilla.org/ in three ways:

  • request for the default page (no cookie set)
  • request with a cookie set (cookie expires after 3 days)
  • request for a specific state (a url encoding a particular state)


The page itself has two states:

Single filter: when you look at one version at a time you get histogram/table plots and summary details.







Several filters



By adding multiple filters to your plot you switch from one mode to the other. If you want to get back to the single-filter state you just have to remove all the filters you added.

Locked/Unlocked state

There are two ways you can use the filters:

Locked State

You want all the measures to be in sync with the first filter. Whatever change is made in your first filter will be propagated to the ones below. If the filters below don't have data for that particular measure they fall back to a default measure.





Unlocked State

Use the unlock button to compare different measures.




Legend

The filters can get very long. In some cases it doesn't make sense to show the full state because in the locked state we only have one measure across multiple versions. That's why we show the title and details of the measure and simplify the legend to show only the versions.





In the case of mixed filters (i.e. the unlocked state) we show only the part of the filter that fits on one line; for more information you can look at the tooltip.







Caching

When you add another filter we instantiate a new selector but we don't make a request. Requests are made only if the data is not already in the cache.
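A minimal sketch of that cache (names are illustrative):

    // Cache keyed by the request parameters so repeated filter
    // selections reuse earlier responses instead of re-requesting.
    var cache = {};
    function cachedGet(key, fetch, callback) {
      if (cache.hasOwnProperty(key)) {
        callback(cache[key]);
        return;
      }
      fetch(function (result) {
        cache[key] = result;
        callback(result);
      });
    }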

Url and TinyUrl

All the page state (filters, lock/unlock, selector state such as percentiles, table/histogram) is preserved in the url. This way the state is saved for a later return via url or cookie session.
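A sketch of restoring filters from a hash of the form nightly/32/CYCLE_COLLECTOR&nightly/25/CYCLE_COLLECTOR (the full state format carries more fields than shown here):

    // Parse one filter descriptor per '&'-separated segment.
    function parseHash(hash) {
      return hash.replace(/^#/, '').split('&').map(function (part) {
        var bits = part.split('/');
        return { channel: bits[0], version: bits[1], measure: bits[2] };
      });
    }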





The end... :)



Tuesday, May 27, 2014

It's the little things that matter

Little things called bugs...

Silent crash on specific data


I started the past week by investigating a bug. It can be reproduced on the current dashboard:






Updating to the latest versions of d3 and nvd3 didn't do the trick, so I spent some time looking at the data and the call stack. This plot uses a library called nvd3 that is built on top of d3. The plot calls the voronoi method (voronoi layouts). This method returns an array of polygons, one for each input vertex in the specified data array. If any vertices are coincident or have NaN positions, the behavior of this method is undefined: most likely, invalid polygons will be returned.

In our case we've got points that are coincident, so we get undefined behavior. The fix is to add some tiny random noise to the data.
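A sketch of that fix (the noise magnitude is an arbitrary choice):

    // Nudge every point by a tiny random amount so no two are exactly
    // coincident before the data reaches nvd3's voronoi layout.
    function jitter(values) {
      return values.map(function (v) {
        return { x: v.x + (Math.random() - 0.5) * 1e-6, y: v.y };
      });
    }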

 Tooltip undesired behavior

 Sometimes, on the current dashboard, the tooltip doesn't show up.  After testing for a while I realized that if you refresh the page (and get an empty svg) it works, but if you don't the tooltip might not show up. Emptying the svg resolves this problem. 
Memoization
 Until now we were preprocessing the data on every filter entry, whether or not we had duplicated data (x_x). It was about time to add some cache support. I added caching in both dashboard.js and telemetry.js. In dashboard.js we cache all the data that we prepared for plotting and in telemetry.js we cache all the requests made (patch here).
TinyUrl
Because we sync with all the events on the page the url may get really long, so "tiny url" is a nice-to-have feature. I looked at the free options and picked bitly for its JSONP call support. Because of the same-origin policy you can't make plain requests to a different domain, so you either implement CORS or stick to JSONP.
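A generic JSONP helper of the kind this relies on (the bitly endpoint and parameters are left out; this only shows the mechanism):

    // Inject a <script> tag pointing at the remote API; the server wraps
    // its JSON response in a call to our temporary global callback, which
    // is how JSONP sidesteps the same-origin policy.
    function jsonp(url, callback) {
      var cbName = 'jsonp_' + Date.now();
      window[cbName] = function (data) {
        delete window[cbName];
        script.parentNode.removeChild(script);
        callback(data);
      };
      var script = document.createElement('script');
      script.src = url + (url.indexOf('?') === -1 ? '?' : '&') +
                   'callback=' + cbName;
      document.head.appendChild(script);
    }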
New legend
A nice clean legend for several series was on my list last week. I tried to implement it with d3 and didn't get the result I wanted, so I switched to jQuery and added some d3 formatting.

Last but not least... 

Fresh air (or steam) comes with fresh ideas. Check out Yellowstone :)





     

Tuesday, May 20, 2014

A step backward, after making a wrong turn, is a step in the right direction..

The version* feature

As a nice-to-have feature we could aggregate all the data into a single series for all nightly/aurora/etc.
I spent one day on this matter and ..it can be done ...but let me explain why not:
At this point when we make a request for a specific version/series/etc we get three JSON files:
* filter-tree.json
* histograms.json
* [measure]-[by-build-date/-by-submission-date].json


In histograms.json we get all the measures, while filter-tree.json contains a tree of filters.
After traversing the filter tree and selecting the filter ids, we take the list of ids and look at each entry in [measure]-[by-build-date/-by-submission-date].json. Whenever we find a corresponding id we append the data, which is afterwards plotted.
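A sketch of that client-side walk (the tree layout here is a simplification of the real filter-tree.json):

    // Walk the filter tree collecting the ids of matching filters; the
    // per-measure JSON is then scanned for entries with those ids.
    function collectIds(node, ids) {
      if (node._id !== undefined) ids.push(node._id);
      Object.keys(node).forEach(function (key) {
        if (key !== '_id' && typeof node[key] === 'object') {
          collectIds(node[key], ids);
        }
      });
      return ids;
    }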

Imagine doing that for all versions and a large quantity of data. Remember that the processing is done on the client side (maybe on a slow device?)... it might end up in flames :(...
End of the version* story for now... (it could always be done on the server side)

Several series. Synchronize with top filter

If you want to preserve all your selections for another version you can do that by synchronizing with the first filter, and the selection will get propagated to all other filters.

Because I added "sync with hash", now you can land on the dashboard in two ways:

 * by specific state (example:/index.html#nightly/32/CYCLE_COLLECTOR&nightly/25/CYCLE_COLLECTOR) 

In this case the filters are not synchronized. If you want to synchronize them you just click the lock button.

* by a default landing

On the default page the sync with first is enabled. If you want to disable the sync you just push the lock button once.



Sync with state

In this version you can preserve your options for another time by saving the url from the dashboard and opening it in a browser afterwards. This week's version will come with a refactored and more detailed one using some jQuery magic. It will get a tinyUrl option too so it can easily be dropped on irc.


After adding multiple versions my dashboard started dying at some point. I began profiling and I got lucky.

In the previous version we used a library called Bootstrap: we had a default implementation for a simple selector that was extended with some shiny Bootstrap magic. All was awesome till we started adding several selectors ...
Multiple things might be the cause:

 * big overhead for computing and aggregating the data
 * d3/nvd3 lib 
 * selectors
 * bad synchronization 
 * me writing horrible and unoptimized code

The last three were a quick fix :D
After switching to a different library, modifying the selector and resolving some bugs I ended up with this comparison:



If you take the same mix of series on load and look at the time spent in the selector you get interesting findings:



Select2 FTW!

Temporary fixes and D3

On the production dashboard the plot comes with a very nice legend that turns unusable when you add several versions. In this week's version I had to fix that. I removed the legend and put it aside in a scrollable svg. It turns out that even so it is cluttered. I removed the information that doesn't make sense for multiple versions: Summary & Histogram. The functionality is still there but is visible only when there is one filter or when you land on the default page.

This week

Today I had an interesting discussion with one of the users and it turns out that the Summary and Histogram would be nice to have for multiple versions too. A zoom in the histogram would be nice :D.
I think I figured out why D3 crashes when it gets a particular type of data.. (production bug)





I went for a quick fix, but that is not a long-term solution. Still looking for some other options. My guess is that this bug triggers the undesired tooltip behavior too. Getting to the bottom of this might fix more than one problem.




Monday, May 12, 2014

The Good, the Bad and the Ugly

I just finished my second week as "the intern" that works on Telemetry Dashboard.

After my first blog post I got some feedback that made me switch between tasks.
Although dygraphs sounds like an option and d3 can be a pain in the ass, I switched back to the latter for flexibility reasons (d3 can be pretty powerful, you know :p).

I spent a day or two trying to figure out a way to make the Bootstrap selectors for versions/measures take multiple versions. The selector was written "one version/measure at a time" and because all the filters were tightly linked together the code got uglier and uglier...
till my eyes almost started bleeding so I said... enough!... let's find a fast workaround... and maybe on the way we get things done even better :p

The "right in the face" option was to add the same kind of selector for each new version dynamically. 
I needed some new elements and functionality: add/remove buttons.




This way we comply with the feature request "add multiple versions on the same plot" and we can also add different measures for each specific version (should that be the case).
The next step was to add a new multiple selector so that we can automatically select the submission/percentile/etc over several versions:



Next steps this week will be trying to add "the * feature", adding support in telemetry.js for aggregating all data over all versions, and some other details that make sense now but don't in the context of multiple versions (histogram/summary).
 If I get the time I would also like to investigate why d3 crashes in some cases (maybe because of the data).

Monday, May 5, 2014

Teh foxy telemetry dashboard...a first week at Mozilla

My story starts with the Telemetry Dashboard at Mozilla...


I just started working as an intern at Mozilla for the Performance team and my first task was to improve and add some features to the Telemetry Dashboard.



Keeping the story short


Telemetry submissions gather into large files and are then uploaded to S3. The analysis framework then downloads them, unpacks them and runs an aggregation over the submissions, one file at a time.
Once a set of compressed files from S3 containing telemetry submissions has been aggregated, the aggregates are merged into a public S3 bucket.
 
The telemetry dashboard fetches the aggregated results from the public S3 bucket. To make it easy to consume this in a browser, everything is stored in JSON, but for storage efficiency reasons (and bandwidth) the storage format is not very intuitive.

To allow people to access the data without having to read a complicated JSON format one can use telemetry.js. The idea is that people who write dashboards and present the results of the aggregated data should access it through this library. That way, if ever wanted, one can change the server-side storage format (this is something we will be doing) and still have a working dashboard.
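Usage is along these lines (method names recalled from the v1-era telemetry.js; treat the exact API as illustrative):

    // Load the aggregated data for one version/measure pair without
    // ever touching the raw JSON storage format.
    Telemetry.init(function () {
      var version = Telemetry.versions()[0];  // e.g. 'nightly/32'
      Telemetry.loadEvolutionOverBuilds(version, 'CYCLE_COLLECTOR',
        function (evolution) {
          evolution.each(function (date, histogram) {
            console.log(date, histogram.median());
          });
        });
    });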
 

Usability, use cases and feature requests

 

Feature request one:
To see aggregated results one has to specify channel, version and measure.
There are two plots on the side; the top one shows the evolution of the measure over "build dates" or "calendar dates". It does so by plotting mean, submissions, and percentiles over time.
However, there is currently no way to show the evolution over multiple versions.
 
Feature request two:
It would be useful to have the latest build version as a default one so there wouldn't be an additional step of selecting from all the nightly versions to get the latest updates on the current version.
 
Feature request three:
Get all the selected options as parameters in the url.
 
Feature request four:
Add checkAll/uncheckAll buttons so you don't need to select all the features one by one.
 

First week updates

 

As I started looking at the code from dashboard.js, and after talking to my mentor, we came to the conclusion that some major changes might be needed, as the dashboard itself is designed to support only one version at a time.
I started looking for some open source libraries (a nice post on this topic can be found here) and played with javascript as it is new to me. I looked at the dygraphs library and it seemed like a reasonable one to use.
I started by changing the histogram evolution diagram.
 
 
 I also removed the range selector as the new diagram provides zoom by selecting a range.
 

The next step was to dynamically add checkboxes for selecting the specific features that we need to plot.. (that made me learn how damn neat jQuery can be :p).. it is not that pretty yet but it works.

 

I added two buttons so that checking/unchecking can be a click away for the all features selected/unselected.


By Friday I added (test) data for several versions on the graph. (Because the same duplicated series is given as input, the graph looks like it is drawing measures for only one version, but actually, if you look at the labels, two identical series are present.)


Bottom line: I started doing some neat stuff last week at Mozilla aaaand got a photo with the Fox at the Firefox 29 release :D



 This week I will be refactoring the code and implementing the missing parts. I am pretty sure I've got a loooooot to work on/learn this week :D.
BTW my live playground is hosted here.