Efforts to Ease "Insights Analytics" Setup on Open edX

February 20, 2016

Deploying Insights Analytics on Open edX is extremely complicated. Several teams within edX and the consultancy OpenCraft have started a collaboration “to address some of the pain points around Insights Analytics setup, deployment, maintenance, and deployment”. The plans are outlined in this Google document. Below is a summary of the difficulties it lists, starting with those from the Analytics Team:

Maintaining jobs on the scheduler is a highly manual and rather difficult process
Jobs fail periodically; we should identify all common causes and resolve them
Schema changes are very painful (see the process above)
The AWS configuration is rather complex and difficult to replicate
The pipeline should be installed like every other component in the edX infrastructure. Currently it is not.
We should seriously consider deprecating edx-analytics-configuration and just merging it into the edx/configuration monolith.
The analyticstack (devstack) lags behind quite a bit, and generating new versions of it takes some manual intervention. It also doesn’t support Elasticsearch 1.5, which is used by currently-in-development features in Insights. We’d like to move this into Docker.
Centralize event collection. We should probably be using Kafka or something similar.
Non-AWS configuration is rather complex and difficult to set up, which is very painful for the open source community.

From OpenCraft:

Lack of documentation
Problems setting up edX Analytics Devstack (process took a long time, was impossible to complete for one team member; overall complexity of the stack made it difficult to distribute work to additional team members as needed)
Problems with Hadoop version conflicts (fixed at the time via a couple of PRs: #128, #127), not really an issue anymore
No (straightforward) way to run acceptance tests for edx-analytics-pipeline
Using Analytics in production:
1. Many steps required to install the stack (partly due to Ansible scripts making assumptions about, e.g., AWS regions)
2. Many steps required to configure Jenkins (manually creating jobs and setting parameters/interval for each Analytics task, etc.)
The number of PRs required to implement major changes slows work down (these types of changes often require PRs in four different repos; see “Dependencies” in this example)
Not being able to merge PRs implementing work done for clients; having to maintain changes separately
Deciding where to add different types of functionality (instructor dashboard vs. Insights) was not straightforward in some cases

– Related post: Insights Analytics installed for the first time at GW’s Open edX instance.

Success! Our #OpenEdX instance has a fully functional #analytics pipeline. Who else has come this far? @OpenEdX pic.twitter.com/meRTy9jJ4R

— Lorena Barba (@LorenaABarba) December 3, 2015
