This post was written by Robert On from The Agency Fund. This is part of an ongoing series of posts as we think, design and build an open source data platform. You can read more about the history of this project in our introductory blog post.
There are many potential applications of data science methods for advancing social good: descriptive statistics and visualizations, predictive analytics enabled by machine learning and artificial intelligence, and causal evidence generated through experimental methods or reinforcement learning.
However, many examples of applying these methods remain exactly that: one-time examples. Giving an organization ongoing analytical capacity requires a certain level of internal infrastructure to make these methods practical and effective. Enter Data Engineering.
Data Engineering is the field that makes the data generated by an organization's information systems usable for analysis. These analytical use cases can be fed back into the organization's operations, leveraged to develop or improve products and services, shared with a broader research community, or used to satisfy stakeholder reporting requirements. Before tackling any of these use cases with the methods mentioned above, the data needs to be made easy to work with. Following the lead of the technology sector, we lay out a basic Development Data Stack (DDS) below:
The Development Data Stack
There are three primary components to a data stack, commonly abbreviated as ETL: Extract, Transform, Load. These three phases, explained below, describe the core operations of a data engineering workflow. While they are often described in this specific order, in practice the order may be modified and stages may be repeated.
We would like the first phases of the ETL to be context independent, meaning that they consist of general data transformations we would apply to all sources of data, requiring no context-specific knowledge of the organization's operations.
Data first needs to be extracted from existing information systems to be processed by later stages in the data pipeline. The job of extraction is meant to accomplish the following:
- Extract data from the original systems that generate them. The mechanisms available are typically one of the following:
- Connecting to an existing database
- Pulling data from an API
- Scraping data from the web
- Copying the data to a “Data Lake”, a centralized repository that allows you to store all your structured and unstructured data. It is important in this step to ensure that the data is completely unprocessed and comes as-is directly from the source systems. This could happen in batch, at scheduled intervals, or streamed in real time. If the source systems store state that can be overwritten, it is important to timestamp the extractions so that updates to the same data can be tracked across time.
- Catalog metadata about each of the sources (basic description, when last updated, how frequently updated, size, format, etc)
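The extraction steps above can be sketched in a few lines. This is a minimal illustration, not part of any specific stack: the source function, file layout, and catalog fields are all hypothetical stand-ins for a real API or database connection.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def fetch_from_source():
    """Stand-in for a real API call or database query; returns raw records as-is."""
    return [{"id": 1, "enrolled": "2023-01-05"}, {"id": 2, "enrolled": "2023-01-07"}]

def extract_to_lake(lake_root: Path, source_name: str) -> dict:
    """Write an unprocessed, timestamped snapshot to the lake; return a catalog entry."""
    records = fetch_from_source()
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = lake_root / source_name / f"{ts}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # No cleaning or typing at this stage: the lake holds data exactly as extracted.
    target.write_text(json.dumps(records))
    # Metadata catalog entry: description fields would be richer in practice.
    return {
        "source": source_name,
        "path": str(target),
        "extracted_at": ts,
        "row_count": len(records),
        "format": "json",
    }

lake = Path(tempfile.mkdtemp())
entry = extract_to_lake(lake, "beneficiary_db")
```

Timestamping each snapshot (rather than overwriting a single file) is what preserves history when the source system mutates its own state.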
The first level of transformation consists of a set of operations executed on the data that are largely independent of the specific context in which the organization operates. This includes the following:
- Cleaning / Typing: Cleaning obviously erroneous data such as inconsistent or invalid date formats, ensuring consistency of missing values, fixing obvious typos, and casting fields to their appropriate data type.
- Joining / Unnesting / Flattening: Joining different data sets based on a predefined schema, unnesting hierarchical data, and reshaping the data into an accessible tabular format.
- Cataloging: Organizing the data in a structure that is intuitive to access (by source, date, etc).
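As a small sketch of the cleaning, typing, and flattening steps above (the records and date formats are invented for illustration):

```python
from datetime import datetime, date

# Raw extract: nested, inconsistently formatted, exactly as it came from the source.
raw = [
    {"id": "1", "profile": {"age": "34"}, "enrolled": "2023/01/05"},
    {"id": "2", "profile": {"age": None}, "enrolled": "05-01-2023"},
]

def parse_date(value):
    """Normalize the date formats seen in this (hypothetical) source; None if unparseable."""
    for fmt in ("%Y/%m/%d", "%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date()
        except (TypeError, ValueError):
            continue
    return None

def flatten_and_type(record):
    """Unnest the profile and cast each field to its appropriate type."""
    return {
        "id": int(record["id"]),
        "age": int(record["profile"]["age"]) if record["profile"]["age"] else None,
        "enrolled": parse_date(record["enrolled"]),
    }

rows = [flatten_and_type(r) for r in raw]
```

The output is flat, consistently typed, and tabular: ready to load into a warehouse.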
Loading to a Data Warehouse
We make this tabular data available in a quick-to-query data warehouse. PostgreSQL can handle small-to-medium use cases, while columnar formats such as Parquet are better geared to data with a very large number of columns (100+). BigQuery might be a good option for higher volumes of data.
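A minimal sketch of the load step, using Python's built-in sqlite3 purely as a stand-in for the warehouse (in practice this would be PostgreSQL or BigQuery, via their own client libraries):

```python
import sqlite3

# In-memory sqlite3 stands in for the warehouse; schema and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE beneficiaries (id INTEGER PRIMARY KEY, age INTEGER, enrolled TEXT)"
)

# Cleaned, typed, tabular rows from the transform stage.
rows = [(1, 34, "2023-01-05"), (2, None, "2023-01-05")]
conn.executemany("INSERT INTO beneficiaries VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM beneficiaries").fetchone()[0]
```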
The phases below map to the specific data requirements of the organization. This means mapping fields of data into program-meaningful measures of onboarding, program take-up and implementation, and the proximal and long-term outcomes of its beneficiaries.
Create some basic tools to help analysts at the organization make sense of the data in the data warehouse. Because querying is fast (near real-time) and relatively easy (no nested structures, etc), analysts can develop a set of queries or expressions that construct organization-specific fields and KPIs. This work documents how these metrics are constructed and leads to automating their generation over time.
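One simple way to document KPI construction is to keep the queries as named, version-controlled expressions. The sketch below assumes a session table and two KPIs that are entirely made up for illustration, again with sqlite3 standing in for the warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (beneficiary_id INTEGER, completed INTEGER, day TEXT)"
)
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [(1, 1, "2023-01-05"), (1, 0, "2023-01-06"), (2, 1, "2023-01-06")],
)

# Named, documented KPI expressions: the workbench output that later feeds automation.
KPIS = {
    "active_beneficiaries": "SELECT COUNT(DISTINCT beneficiary_id) FROM sessions",
    "completion_rate": "SELECT AVG(completed) FROM sessions",
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in KPIS.items()}
```

Because each KPI is a named query rather than an ad-hoc spreadsheet step, the definitions are self-documenting and can be run on a schedule later.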
This same workbench can be leveraged to do advanced visualization or analysis on the data. This could mean:
- Building predictive models via Machine Learning to forecast future items that require action
- Executing and evaluating experiments run within the program
- Building new products or features based on these models/analyses
- Optimizing resource allocations based on available data (routes, inventory, allocation of staff, etc)
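As a toy example of the last item, resource allocation can start from something as simple as ranking beneficiaries by inactivity so that limited staff time goes where it is most needed. The names, dates, and risk rule below are all invented for illustration:

```python
from datetime import date

# Hypothetical activity log: last session date per beneficiary.
last_seen = {
    "amina": date(2023, 1, 2),
    "ben": date(2023, 1, 9),
    "chen": date(2022, 12, 20),
}
today = date(2023, 1, 10)

def days_inactive(last_date):
    """Days since the beneficiary's most recent session."""
    return (today - last_date).days

# Rank most-at-risk first; a real deployment might swap in a learned dropout model.
outreach_order = sorted(
    last_seen, key=lambda name: days_inactive(last_seen[name]), reverse=True
)
```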
Transform – Mapping/Constructing KPIs
Take insights from the metric mapping and construction and feed them into an ongoing data pipeline that produces these metrics automatically, on a regular basis. We would like only limited aggregation to happen here, to maintain flexibility for future uses.
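A sketch of what "limited aggregation" can mean in practice: roll events up only to the day level, leaving finer slices (by region, cohort, etc.) possible downstream. The event rows are illustrative:

```python
from collections import defaultdict

# Event-level rows out of the warehouse (hypothetical).
events = [
    {"day": "2023-01-05", "completed": 1},
    {"day": "2023-01-05", "completed": 0},
    {"day": "2023-01-06", "completed": 1},
]

def daily_metrics(rows):
    """Aggregate to the day level only, so later pipelines can still re-slice."""
    counts = defaultdict(lambda: {"sessions": 0, "completions": 0})
    for r in rows:
        counts[r["day"]]["sessions"] += 1
        counts[r["day"]]["completions"] += r["completed"]
    return dict(counts)

metrics = daily_metrics(events)
```

Run on a schedule, a function like this replaces the analyst's manual query with an automated metric feed.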
Loading back to the organization
Feed data back into systems that drive organizational change or adaptation. This means sending the data to be used in:
- A visualization tool (Microsoft Power BI, Apache Superset, Metabase, Data Studio, etc) for other staff and stakeholders in the organization to view and respond to.
- Directly back into the technology used to interface with beneficiaries.
- Reports and data for research with external stakeholders (researchers, donors, etc)