2 Key Changes that Unlocked Huge Scale in our Machine Learning Data Pipeline 🚀

Zarif Aziz
6 min read · Sep 13, 2022

In a world of big data, we need more data engineering stories! This is why I wanted to share the story of how our AI Engineering team increased data throughput on a key production pipeline by 40X, and the architectural changes behind it. Hopefully, you can use these ideas as an example for building your own scalable and reliable AI systems.

"Picasso style painting of a software engineer" - DALL·E

It's been over 1 year since I joined Faethm's AI Engineering team, and I'm proud of the unique system we've engineered that provides analytics to power a group of Faethm's products.

What is Faethm?

Faethm is a data analytics platform that allows companies to understand their workforce and plan for jobs and skills that will be needed for the future.

About the machine learning pipeline

One of the key pipelines that drive Faethm's insights is JADAI (the Job Advertisement AI). This lets us understand skill trends, job trends and the skills that each job requires.

Overall Cloud Architecture of our JADAI pipeline. We ingest and store hundreds of millions of job ads in our database. The distributed Model Pipeline then extracts insights from this data, which gets aggregated and is ready to be displayed in our products.

To achieve this:

  • we ingest and store hundreds of millions of jobs in our database
  • using containers, a distributed model pipeline extracts insights from the data
  • the outputs of the model pipeline go through various post-processing steps, and the aggregated result is ready to be displayed

There is a lot of data engineering that makes this possible. The previous pipeline pulled skills from 6 million job ads from a few countries. Our latest release now sources 240 million job ads from all over the world. 🌍

So how did we do it?

Focusing on the data engineering

The section highlighted in red below is where the magic happens.

The focus of this article is our data extraction and preparation step.

And if we expand it… this is what it looked like initially.

Detailed view of the V1 Job Ad data extraction and storing process

From the diagram above, you can see that the job ads are coming from multiple data providers. API calls are made to download data and store it in our data lake in raw JSON format.
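
As a rough illustration, the V1 ingestion step looked something like the sketch below. The provider endpoint, bucket, and key names are hypothetical, not our actual configuration.

# Minimal sketch of the V1 ingestion step: call a provider's API and store the
# raw JSON response in the data lake unchanged. Names and endpoints are hypothetical.
import json
import boto3
import requests

s3 = boto3.client("s3")

def ingest_provider(provider: str, url: str, country: str, date: str) -> None:
    # Download one batch of job ads from the provider's API.
    response = requests.get(url, params={"country": country, "date": date}, timeout=60)
    response.raise_for_status()
    # Store the raw payload as-is in the data lake.
    key = f"raw/{provider}/{country}/{date}/job_ads.json"
    s3.put_object(Bucket="data-lake-bucket", Key=key, Body=json.dumps(response.json()))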

Then, there is another application to clean the data and store it in an appropriate format, i.e. parquet files, in our data warehouse. Each data provider has a slightly different format, which requires custom code to clean and process. Python has mature NLP (natural language processing) libraries for language detection and text cleaning, so it was a natural choice for this step.
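
A simplified sketch of that cleaning step is below; the langdetect dependency, column names, and provider mapping are illustrative assumptions rather than our actual schema.

# Simplified sketch of the V1 'Data Cleaning and Normalisation' step.
# The langdetect dependency, column names, and provider mapping are assumptions.
import pandas as pd
from langdetect import detect

# Each provider has a slightly different schema, so map it to a common one first.
PROVIDER_COLUMN_MAP = {"provider_a": {"job_title": "title", "job_text": "description"}}

def clean_ads(raw_ads: list[dict], provider: str) -> pd.DataFrame:
    df = pd.DataFrame(raw_ads).rename(columns=PROVIDER_COLUMN_MAP[provider])
    # Basic text cleaning: strip HTML tags and collapse whitespace.
    df["description"] = (
        df["description"]
        .str.replace(r"<[^>]+>", " ", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    # Keep only ads whose text is detected as English.
    return df[df["description"].map(detect) == "en"]

# The cleaned frame is then written to the warehouse as parquet, e.g.
# clean_ads(ads, "provider_a").to_parquet("s3://warehouse-bucket/job_ads.parquet")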

We also followed best practices for partitioning our data warehouse on S3. We partitioned the data by data-provider / country / year / month, which restricts the amount of data scanned by each query. Bucket sizes were also optimised according to Athena's limitations.
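
For reference, such a layout might look like the sketch below, assuming Hive-style key=value prefixes and a hypothetical bucket name.

# Sketch of a partition layout (bucket name is hypothetical). Hive-style
# key=value prefixes let Athena prune partitions and scan only the requested slice.
def warehouse_prefix(provider: str, country: str, year: int, month: int) -> str:
    return (
        "s3://warehouse-bucket/job_ads/"
        f"data_provider={provider}/country={country}/year={year}/month={month:02d}/"
    )

# warehouse_prefix("provider_a", "AU", 2022, 7)
# -> "s3://warehouse-bucket/job_ads/data_provider=provider_a/country=AU/year=2022/month=07/"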

Additionally, the number of job ads that go into each file, i.e. the file size, needs to be controlled. Having a large file size is great for Athena because it wastes only a small proportion of time in opening and reading the file. However, the bulk of our Model Pipeline runs on AWS Spot Instances, which can be interrupted at any time. We don't want big files being interrupted and losing a big chunk of work. Hence, we set the final file size based on the average time an AWS Spot instance stays active.
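
As a back-of-the-envelope illustration of that sizing logic (the numbers below are made up, not our real figures):

# Back-of-the-envelope file sizing based on Spot interruption risk (hypothetical numbers).
ads_per_second = 50                   # model pipeline throughput on one worker
avg_spot_lifetime_s = 2 * 60 * 60     # average time a Spot instance stays active
safety_factor = 0.5                   # headroom so a file finishes well before interruption

ads_per_file = int(ads_per_second * avg_spot_lifetime_s * safety_factor)
print(ads_per_file)  # 180000 job ads per output file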

The Scaling Problem

At first glance, this looks very reasonable. The process was straightforward and easy to understand, and it worked well for us when we were processing around 6 million job ads per year, which amounted to 0.5 GB of data every month. With this volume of data, the 'Data Cleaning and Normalisation' step took around 1 hour to run locally.

Soon, things changed. Our customers asked us if we could bring in job ads from more countries.

This meant sourcing 240 million job ads per year: 25 GB of data every month. With our existing architecture, processing one month of data would now take 2 days, not 1 hour. 🤯

The increase in the scale of the problem revealed some challenges with our original design. A historical set of 3 years of data also needed to be processed for the new countries, which would take approximately 72 days. Even with running processes in parallel, this was clearly not scalable for current and future demands.

The Solutions

We made two fundamental changes to solve the scaling problem:

Highlighting the changes we needed to make our data engineering process scalable.

The first key change was restructuring the 'Data Cleaning and Normalisation' process to run in Athena instead of Python. The Python-based extraction process was moved into the next stage of processing, i.e. the Model Pipeline.

The trade-off was that the Model Pipeline would take a hit and become slightly slower. However, it also meant that we could fine-tune the extraction method without having to reprocess all the data after a model change.

Previously, we had done all of the text cleaning through Python NLP libraries. Most of that still existed and moved to the Model Pipeline; however, we pushed as much of the cleaning as possible into the Athena SQL queries using regex.
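
As a hedged sketch of what that looks like, the query below uses regexp_replace to strip HTML and normalise whitespace inside Athena. The table, column, database, and bucket names are assumptions, not our actual schema.

# Sketch of regex-based cleaning pushed into an Athena SQL query,
# executed through the Athena API. All names are assumptions.
import boto3

CLEANING_QUERY = """
SELECT
  job_id,
  country,
  -- strip HTML tags, then collapse repeated whitespace
  trim(regexp_replace(regexp_replace(description, '<[^>]+>', ' '), '\\s+', ' ')) AS description
FROM job_ads_raw
WHERE year = '2022' AND month = '07'
"""

boto3.client("athena").start_query_execution(
    QueryString=CLEANING_QUERY,
    QueryExecutionContext={"Database": "job_ads_db"},
    ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
)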

The second key change we made was changing the format of the stored data from JSON to Newline Delimited JSON (NDJSON), and configuring AWS Athena to read directly from this data.

Previously, we stored the job ads in raw JSON format and processed this in Python later. This alone would have been responsible for most of the 72 days of compute time in our calculations.

When converting the records to NDJSON, the structure becomes one row per job ad:

# JSON example. Athena has to open the entire file before reading its contents
{
  "some": "thing",
  "foo": 17,
  "bar": false,
  "quux": true,
  "may": {
    "include": "nested",
    "objects": [
      "and",
      "arrays"
    ]
  }
}

# NDJSON example. Athena can directly read each row in the file.
{"some":"thing"}
{"foo":17,"bar":false,"quux":true}
{"may":{"include":"nested","objects":["and","arrays"]}}
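
Generating NDJSON on the write side is straightforward. Here is a minimal sketch; the bucket, key, and record contents are hypothetical.

# Minimal sketch of writing job ads as NDJSON: one complete JSON object per line,
# never pretty-printed. Bucket, key, and record contents are hypothetical.
import json
import boto3

ads = [{"title": "Data Engineer", "country": "AU"}, {"title": "Nurse", "country": "AU"}]

# Join the records with newlines, one compact JSON object per line.
ndjson_body = "\n".join(json.dumps(ad, separators=(",", ":")) for ad in ads) + "\n"

boto3.client("s3").put_object(
    Bucket="warehouse-bucket",
    Key="job_ads/data_provider=provider_a/country=AU/year=2022/month=07/ads.ndjson",
    Body=ndjson_body,
)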

In the AWS documentation on best practices for reading JSON data, they recommend that you "make sure that each JSON-encoded record is represented on a separate line, not pretty-printed".

Converting data to NDJSON allowed Athena to query the data directly. There was no loading required into intermediary Athena tables. This was a massive productivity gain for the team.
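
Querying the files directly just means pointing an external table (metadata only, no data loading) at the NDJSON prefix. Below is a sketch of the DDL, assuming the OpenX JSON SerDe and hypothetical table, column, and bucket names.

# Sketch of an Athena external table over the NDJSON files (names are assumptions).
# Executed through the Athena API in the same way as the cleaning query above;
# new partitions are then registered with ALTER TABLE ... ADD PARTITION or MSCK REPAIR TABLE.
CREATE_NDJSON_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS job_ads_raw (
  job_id string,
  title string,
  description string,
  posted_date string
)
PARTITIONED BY (data_provider string, country string, year string, month string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://warehouse-bucket/job_ads/'
"""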

The Result

Needless to say, processing with Athena became much faster than it had been in Python. With its parallel compute capability, Athena assigns up to 200 workers to active DML queries. Making these changes brought a new scale to the JADAI pipeline.

Detailed view of the V2 Job Ad data extraction and storing process

Now the job ads are downloaded once and processed once, then saved in two formats: the raw JSON data for our data lake and the NDJSON data for the data warehouse.

3 years of data from multiple countries could now be ingested in 2 days, instead of 72. A 97% reduction in processing time. ⚔️

Final words

This post touched on the main data engineering improvements that were needed to scale JADAI. I hope it gave a sense of how important it is to pay attention to data pipelines and to use the correct tools.

I stuck to the key changes we've made, but there were certainly additional complexities that the team had to overcome. The increase in scale also led us to re-engineer the 'Model Pipeline' phase and our GPU cloud infrastructure, as well as how the insights are generated. These topics would make good discussions for a future post.

As job listings accelerate around the world and we add more countries and data sources to our pipeline, our innovative approach to architecture now allows us to respond to customer requirements with the throughput and flexibility we need.

I look forward to the challenges ahead as we continue to improve our AI systems and drive innovation in our products. I hope you enjoyed reading it!
