How Pinterest Built an Experimentation Data Engineering Pipeline That Scales to 500 Million Users
Every day, Pinterest runs thousands of experiments to determine what features to implement to improve the experience for their 454 million users and drive business value.
From deciding which layout to use for the homepage, to which pins to recommend, to where buttons should be placed in the interface, Pinterest relies heavily on A/B testing to make data-driven product decisions.
Billions of records are processed for the daily 10 a.m. deadline so that teams who depend on experiment results can hit the ground running and make well-informed decisions (Wang, 2017). At the heart of this capability is a reliable and scalable data engineering system.
Between 2015 and 2017, Pinterest's user base grew from 100 million to 150 million, and in the same period the number of running experiments doubled. This rapid growth meant that the legacy system, which used Hadoop to process data stored in Hive, had reached the end of its useful life and could no longer process the data in time for the daily deadline.
Pinterest needed to revamp their legacy experimentation data pipeline into one that could handle the load and scale beyond.
There are three main requirements the system needed to meet:
- Extensibility: store the raw data so that new metrics can be extracted easily.
- Scalability: process and store the large volumes of data generated by a rapidly growing user base and a growing number of experiments.
- Accuracy and timeliness: compute and serve metrics to both real-time and batch dashboards from historical as well as recent data, while monitoring incoming data for inconsistencies and drift.
So how did Pinterest design a system to meet these challenges in 2017?
Architecture
They used the Lambda architecture, which is composed of three layers: a batch layer for batch data processing, a stream layer for real-time processing, and a serving layer that delivers results to an analytics dashboard. Each layer uses a different technology stack, and both the batch and stream layers feed the serving layer - in this case, a Grafana dashboard.
The main benefits of the Lambda architecture are:
- Lower latency: Batch processing can take hours, so there is a delay before that data is available to query. The real-time pipeline makes recent data queryable almost immediately.
- Data consistency: Distributed systems are prone to inconsistencies because data is replicated across nodes, and a failed node may miss updates. The Lambda architecture mitigates this by keeping an immutable master dataset in the batch layer and periodically recomputing views from it, so discrepancies in the real-time views are eventually corrected.
- Fault tolerance: Each component of the system is built with fault tolerance in mind. So if a component fails in either the batch or stream layer then the other layers will continue to process data.
- Scalability: The pipeline is designed using scalable technologies, so if it needs to be extended, more nodes can be added to the system.
The batch processing layer allows ad-hoc analysis of historical experiment data through the analytics dashboard, while the real-time layer provides up-to-date metrics, monitoring, data-drift alerts and notifications when data fails to ingest.
One issue with the Lambda architecture is that each pipeline in the system requires its own codebase. Since this implementation in 2017, a newer architecture, Kappa, has gained popularity; it solves this problem by using the same technology stack for both pipelines, and Pinterest has likely adopted it since.
Components
The first two applications the data flows through are Apache Flink and Kafka. They communicate using read-committed mode, in which consumers only read events from transactions that have been committed. If events are not committed because Flink fails, they will not be read, and Kafka consumers wait until Flink recovers and commits. This is what gives the whole system exactly-once processing semantics.
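As a minimal sketch of the consumer side of this contract (using the confluent-kafka Python client, with a placeholder broker address, consumer group, topic and processing function rather than Pinterest's actual configuration), a downstream reader can opt into read-committed isolation so that uncommitted or aborted events are never delivered:

```python
from confluent_kafka import Consumer


def process(payload: bytes) -> None:
    # Placeholder for the real metric-extraction logic.
    print(payload)


# isolation.level=read_committed means the consumer only receives records from
# committed transactions, which is what preserves exactly-once semantics when
# the upstream job writes to Kafka transactionally.
consumer = Consumer({
    "bootstrap.servers": "kafka-broker:9092",  # placeholder broker address
    "group.id": "experiment-metrics",          # placeholder consumer group
    "isolation.level": "read_committed",       # skip aborted/uncommitted records
    "enable.auto.commit": False,               # commit offsets only after processing
})

consumer.subscribe(["experiment_events"])      # placeholder topic name

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        raise RuntimeError(msg.error())
    process(msg.value())
    consumer.commit(message=msg)               # mark the record as handled
```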
The other technologies used in the system are:
- AWS S3: S3 is a highly scalable and reliable file store. The alternative is to store the data locally but AWS offers many benefits like automatic encryption of the data, automatic scaling and reliable uptime. Using the cloud enables consistent access and management of the data across the whole organisation.
- HBase: HBase is designed for big data, with extremely fast read/write operations and support for regularly overwriting and inserting data. The alternative, the Hadoop Distributed File System (HDFS), has a more rigid, append-only structure and does not support this kind of record-level updating. Because different experiments are run on the data at different times, HDFS is not a good fit.
- Spark: Compared to Hadoop MapReduce, which reads and writes intermediate data to disk, Spark executes jobs much more quickly by caching data in memory across multiple parallel operations. This gives Spark the edge in parallelism and better CPU utilisation (a caching sketch appears after this list).
- Airflow: A workflow orchestration tool that makes it easy to author, monitor and schedule data pipeline workflows (a minimal DAG example follows this list). An alternative, which Pinterest uses, is Pinball. Pinball is built on Python 2 and, at the time of writing, had not been updated in two years, whereas Airflow runs on Python 3 and receives updates almost daily. Backed by the Apache Software Foundation, Airflow has a huge community and has become the industry standard for orchestrating workflows.
- Prometheus: Prometheus can ingest exported metrics from many different systems, has built-in monitoring and allows data to be queried for real-time analytics (an exporter sketch follows this list). Originally built at SoundCloud, it is now open source, low-latency, robust and highly scalable, capable of serving millions of metrics every second.
- Grafana: Grafana makes it easy to create and edit dashboards for any metrics exported by Prometheus, giving flexibility in what the system monitors. It is an excellent tool for visualising data, well equipped to make sense of complicated datasets with visualisations such as heatmaps, histograms and geo maps. Multiple dashboards can be customised to serve both the batch and real-time processing layers, and Grafana is constantly being improved and integrates easily with Prometheus.
- AWS RDS: RDS provides excellent scalability; storage resources can be increased with the click of a button or an API call. It is built on highly available AWS infrastructure and is fast enough to support demanding applications. With AWS IAM roles, access can be restricted to only the right users, and data can be encrypted, keeping user data safe.
- FastAPI: FastAPI can currently handle up to around 7,500 API requests per second, making it one of the fastest Python API frameworks. It is high performance, intuitive to use and robust enough for production-quality code. Another option would be Flask, but FastAPI has some key advantages: it automatically generates documentation as you build an API, speeding up development, and it integrates data validation through the Python Pydantic library, which makes for more robust APIs (an endpoint sketch follows this list).
- Kafka: Kafka is built with real-time processing, fault tolerance and scalability in mind and can handle terabytes of data with little overhead. Because Kafka retains and replicates data, the same stream can feed both the batch processing pipeline and the real-time pipeline.
- Spark Streaming: Spark Streaming supports processing huge volumes of data with very low latency and is an industry standard for streaming data in real time. Its benefits include load balancing, fast recovery from failure, interactive querying and integration with advanced processing libraries for SQL, machine learning and graph processing.
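To illustrate the in-memory point from the Spark item above, here is a small PySpark sketch. The S3 path and column names are invented for illustration and are not Pinterest's actual schema; caching the DataFrame once lets several metric aggregations reuse it without re-reading from S3:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("experiment-metrics-batch").getOrCreate()

# Hypothetical raw experiment logs exported from Kafka to S3 (path is a placeholder).
events = spark.read.parquet("s3://example-bucket/experiment_events/dt=2017-06-01/")

# Cache once in memory; every aggregation below reuses the cached partitions
# instead of re-reading and re-parsing the files from S3.
events.cache()

# Users per experiment group (column names are assumptions for this sketch).
group_sizes = events.groupBy("experiment_id", "group").agg(
    F.countDistinct("user_id").alias("users")
)

# A second metric computed from the same cached data.
events_per_user = events.groupBy("experiment_id", "group").agg(
    (F.count("*") / F.countDistinct("user_id")).alias("events_per_user")
)

group_sizes.show()
events_per_user.show()
```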
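The Airflow item above mentions authoring and scheduling workflows. The following is a minimal, hypothetical DAG showing what a nightly experiment workflow could look like; the task names, schedule and callables are placeholders, not Pinterest's actual jobs:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def compute_user_groups(**context):
    # Placeholder: in practice this would launch a Spark job that assigns
    # users to experiment groups from the raw logs in S3.
    print("computing experiment user groups")


def compute_metrics(**context):
    # Placeholder: in practice this would launch a Spark job that computes
    # experiment metrics and writes them to HBase.
    print("computing experiment metrics")


with DAG(
    dag_id="nightly_experiment_metrics",
    start_date=datetime(2017, 1, 1),
    schedule_interval="0 2 * * *",  # placeholder: run nightly, ahead of the 10 a.m. deadline
    catchup=False,
) as dag:
    user_groups = PythonOperator(
        task_id="compute_user_groups", python_callable=compute_user_groups
    )
    metrics = PythonOperator(
        task_id="compute_metrics", python_callable=compute_metrics
    )

    # Metric computation depends on group assignment finishing first.
    user_groups >> metrics
```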
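For the Prometheus item above, here is a minimal exporter sketch using the official prometheus_client library; the metric names, port and simulated values are assumptions for illustration only:

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

# Placeholder metrics the pipeline could expose for Prometheus to scrape.
RECORDS_INGESTED = Counter(
    "experiment_records_ingested_total",
    "Number of experiment events processed by the streaming job",
)
PIPELINE_LAG_SECONDS = Gauge(
    "experiment_pipeline_lag_seconds",
    "Seconds between event creation and metric availability",
)

if __name__ == "__main__":
    # Expose /metrics on port 8000; Prometheus scrapes this endpoint on a schedule.
    start_http_server(8000)
    while True:
        # Simulated work standing in for the real streaming job.
        RECORDS_INGESTED.inc()
        PIPELINE_LAG_SECONDS.set(random.uniform(0.5, 3.0))
        time.sleep(1)
```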
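And for the FastAPI item, here is a sketch of a metrics-serving endpoint with Pydantic validation. The route and model fields are invented for illustration, and a real service would query HBase or RDS rather than return an empty list:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


# Pydantic model: requests that do not match these types are rejected
# automatically with a 422 response, and the schema appears in the
# auto-generated docs at /docs.
class MetricQuery(BaseModel):
    experiment_id: str
    metric: str
    start_date: str
    end_date: str


@app.post("/metrics/query")
def query_metrics(query: MetricQuery) -> dict:
    # Placeholder response; a real implementation would read from HBase/RDS.
    return {"experiment_id": query.experiment_id, "metric": query.metric, "values": []}
```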
Data Flow
1. Messages are logged from the Pinterest application servers and API to Kafka, which then streams the data to both the real-time streaming layer (step 2) and the batch processing layer (step 3).
2. Kafka serves the data to Spark Streaming, which processes it in real time and inserts the metric data into an RDS database. The RDS data is then exported to Prometheus, which serves the Grafana dashboard. From the dashboard, the data is monitored in real time so users are alerted to any failures in the pipeline, such as inconsistencies. (A minimal streaming sketch follows this list.)
3. Kafka sends the raw logs to S3. Nightly experiment workflows, orchestrated by Airflow and executed by Spark, process the data to determine experiment user groups and metrics. The output of the Spark jobs is then inserted into HBase, ready to be served to the analytics dashboard, Grafana, via Prometheus.
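To make the real-time branch concrete, the sketch below reads events from Kafka with Spark Structured Streaming and computes per-experiment counts. The broker address, topic name and event schema are assumptions for illustration; in the pipeline described above, such aggregates would be written to RDS (for example via foreachBatch and a JDBC sink) rather than to the console:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("experiment-metrics-streaming").getOrCreate()

# Assumed JSON layout of a logged experiment event.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("experiment_id", StringType()),
    StructField("group", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw event stream from Kafka (broker and topic are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")
    .option("subscribe", "experiment_events")
    .load()
)

events = raw.select(
    F.from_json(F.col("value").cast("string"), event_schema).alias("e")
).select("e.*")

# Per-minute event counts per experiment group, tolerating late data.
counts = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "1 minute"), "experiment_id", "group")
    .count()
)

# For this sketch the results go to the console; the real pipeline would
# write each micro-batch to RDS instead.
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```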
Summary
This pipeline design meets the three main criteria: it serves metrics in both real time and batch, it scales to a huge and rapidly growing user base, and it can be extended to support new metrics.
The addition of Spark and its in-memory processing capabilities allowed Pinterest to reduce the time taken for experiments to run, from an average of four hours to well under two hours. This improvement in processing time allows teams to make decisions faster and run more experiments as a result.
The combination of batch and real-time processing lets teams make quick and effective decisions based on both historical data and incoming real-time data, and results in a pipeline that is more robust, versatile and easier to debug.
References
- Wang, C. (2017). Scalable A/B experiments at Pinterest.