The Data Engineering Stack Behind Uber's Ad Processing System

The miracle of modern advertising is that machine learning optimises which ads are shown to each person. For this to work, each purchase needs to be attributed to a particular ad shown at a particular time.

UberEats collects order events from thousands of restaurants and shows ads across many platforms. Building the data engineering system that aggregates this data into accurate performance metrics in real time is naturally challenging. Why real time? Restaurants on Uber's platform want fast metric reports that can help grow their business.

Uber uses EOS processing because it is accurate and robust

To solve this challenge, Uber implements near-real-time Exactly-Once Semantics (EOS) processing: the ability to execute a read-process-write operation exactly once. The coordination this requires between the messaging system, which ingests and delivers the data, and the application, which processes that data, is not easy to achieve. It demands high availability, low latency and a robust communication process that doesn't break when one piece fails.

Each event is tagged with a unique ID to prevent duplicates, and a backlog of all actions is stored, so if one part of the system goes down, the backlog of failed actions automatically runs once the system is back up. EOS processing also makes writing applications easier and expands the space of applications that can use a given messaging system (Mehta and Chen, KIP-98).
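The idea of tagging each event with a unique ID so that retries and replays are harmless can be sketched as follows. This is a minimal, hypothetical illustration, not Uber's actual code; the event shape and function name are invented for the example.

```python
# Minimal sketch of idempotent event processing: an event whose ID we have
# already seen is skipped, so redelivery after a failure never double-counts.
# (Illustrative only; not Uber's actual implementation.)

def process_events(events, seen_ids=None):
    """Aggregate click counts per ad, ignoring duplicate event IDs."""
    seen_ids = seen_ids if seen_ids is not None else set()
    counts = {}
    for event in events:
        if event["id"] in seen_ids:
            continue  # duplicate delivery (e.g. a retry after a crash)
        seen_ids.add(event["id"])
        counts[event["ad"]] = counts.get(event["ad"], 0) + 1
    return counts

events = [
    {"id": "e1", "ad": "ad-42"},
    {"id": "e2", "ad": "ad-42"},
    {"id": "e1", "ad": "ad-42"},  # replayed after a failure; must not count twice
]
print(process_events(events))  # {'ad-42': 2}
```

The key property is that processing the same input twice yields the same result, which is what turns at-least-once delivery into effectively-exactly-once processing.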

Uber's implementation builds on four open-source applications: Apache Flink, Apache Kafka, Apache Pinot and Apache Hive. Below, we examine the system architecture.

UML diagram of Uber's Real-Time Exactly-Once Processing Data Engineering System

Components

The first two applications the data flows through are Apache Flink and Kafka. They communicate in read-committed mode: consumers see only events from transactions that have been committed. If a transaction is not committed because Flink malfunctions, its events are not read, and Kafka waits until Flink recovers. This process ensures Exactly-Once Semantics is met across the whole system.
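Read-committed consumption can be sketched with a toy transaction log. This is a hypothetical simulation of the isolation behaviour, not the real Kafka client API: messages only become visible to a consumer once their producing transaction has committed.

```python
# Toy simulation of read-committed isolation: a consumer sees only messages
# whose producing transaction has committed. Messages from an open or aborted
# transaction stay invisible, so a half-finished write is never read.
# (Illustrative sketch only; not the actual Kafka consumer API.)

def read_committed(log, committed_txns):
    """Return messages whose transaction ID is in the committed set."""
    return [msg for txn, msg in log if txn in committed_txns]

log = [
    ("txn-1", "click:ad-1"),
    ("txn-2", "click:ad-2"),   # txn-2 never committed (producer crashed mid-write)
    ("txn-1", "click:ad-3"),
]
committed = {"txn-1"}
print(read_committed(log, committed))  # ['click:ad-1', 'click:ad-3']
```

When the failed producer recovers and re-runs txn-2 to completion, its messages appear exactly once, which is the guarantee the Flink–Kafka handshake relies on.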

There are three other technologies used in the system:

  1. Apache Hive: Apache Hive is a data warehouse system that facilitates reading, writing, and managing large datasets, with rich tooling that allows the data to be queried via SQL. Alternatives to Hive include Cloudera Impala and Presto, which would make queries faster; one might still choose Hive for its easier implementation and greater maturity.
  2. Apache Pinot: Pinot is an OLAP (OnLine Analytical Processing) datastore whose main characteristic is fast query processing on large datasets, allowing users to obtain their performance metrics in real time. Alternatives to Pinot include ClickHouse and Druid; a good reason to prefer Pinot is how easy it is to upsert data, which makes it a great fit for this system.
  3. Docstore schemaless table: Docstore is Uber Engineering's scalable datastore. Uber uses it because it combines the flexibility of a schemaless database with the schema enforcement of a traditional relational database. Cassandra is an alternative to Docstore, but it doesn't scale as efficiently.
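The upsert behaviour mentioned for Pinot, where a later record with the same primary key replaces the earlier one, can be sketched like this. The record shape and function are hypothetical; this only illustrates the semantics, not Pinot's implementation.

```python
# Sketch of upsert semantics: each record carries a primary key, and the
# latest record for a key wins. Duplicate or corrected records are absorbed
# instead of double-counted. (Illustrative only; not Pinot's implementation.)

def upsert(records):
    """Keep only the latest record per primary key."""
    table = {}
    for record in records:
        table[record["key"]] = record  # later records overwrite earlier ones
    return table

records = [
    {"key": "ad-1", "clicks": 10},
    {"key": "ad-2", "clicks": 3},
    {"key": "ad-1", "clicks": 12},  # corrected/duplicate row for ad-1
]
print(upsert(records))
```

This is why upsert support matters for reliability here: if a metric row is ever delivered twice, the second delivery simply overwrites the first rather than inflating the totals.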

Data Flow

  1. The mobile ads events are stored in an Apache Hive data warehouse. The events data is received from (2), and once stored in this warehouse, it becomes available to teams of data scientists and business analysts.
  2. Mobile ads events, representing both ad clicks and impressions, are ingested by Kafka. As mentioned above, Kafka is essential to ensuring Exactly-Once Semantics across the system. From this Kafka cluster, the data goes to the Hive data warehouse (1) and to the aggregation Apache Flink job (3).
  3. The mobile ads events are processed in real time using Apache Flink. The job in this step first cleans, validates and deduplicates the data, then aggregates it every minute using fixed-size windows called tumbling windows. In parallel, the deduplicated data is sent to a Docstore schemaless table (6). Importantly, each aggregated record is assigned a UUID key that will later be used in (12) and (13).
  4. Aggregated events are sent to this Kafka sink, which both exchanges data with clients (5) and forwards the data to another Flink job responsible for unioning events from regions A and B (10).
  5. Clients can be restaurant owners who need to query the status of a delivery, or a financial dashboard that reports delivery fees. That data will eventually be associated and stored with the corresponding ad.
  6. During step (3), the deduplicated data is sent here, to a Docstore schemaless table, Uber Engineering's scalable datastore. One reason for using Docstore is that it allows setting a time-to-live on rows, which supports the tumbling windows used in step (3).
  7. This Kafka topic stores all data for UberEats orders and sends it to the attribution job in (8).
  8. The first step in this Flink job is to filter out invalid orders. It then associates the orders with their corresponding ad events from Docstore. Here we see how important it is to have deduplicated entries; otherwise, the metrics would be erroneous.
  9. The processed data from the attribution Flink job is ingested into this attributed Kafka topic.
  10. The system uses two regions for robustness. In this Flink job, the outputs from the aggregation in regions A and B are unioned, so if a system in one region fails, the other region can still send metrics to (12) and (13).
  11. This Kafka topic ingests the unioned data and splits it between Hive (12) and Pinot (13).
  12. This Hive data warehouse contains ad metrics that data scientists can access to obtain insights from the retrieved data. Once again, it is important that the data arriving here is not duplicated, which is achieved thanks to step (3).
  13. The data ingested from the Kafka topic in (11) is useful both to data scientists seeking insights about the ads and to merchants interested in understanding the impact of their ads. Moreover, the system exploits Pinot's upsert feature, which improves reliability when dealing with possible duplicate data.
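The aggregation in step (3) and the attribution join in step (8) can be sketched together. This is a simplified, hypothetical model in plain Python rather than Flink's windowing APIs; the event and order shapes, and the "latest ad per user" attribution rule, are assumptions made for the example.

```python
import uuid
from collections import defaultdict

# Sketch of steps (3) and (8): events are bucketed into fixed one-minute
# tumbling windows, each aggregate is tagged with a UUID key, and orders are
# then attributed to the latest ad event for the same user.
# (Illustrative only; not Uber's actual Flink jobs.)

WINDOW_SECONDS = 60

def tumbling_aggregate(events):
    """Count events per (ad, window); tag each aggregate with a UUID key."""
    buckets = defaultdict(int)
    for e in events:
        window_start = (e["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
        buckets[(e["ad"], window_start)] += 1
    return [
        {"uuid": str(uuid.uuid4()), "ad": ad, "window": w, "count": n}
        for (ad, w), n in buckets.items()
    ]

def attribute(orders, ad_events):
    """Associate each valid order with the latest ad event for its user."""
    latest = {}
    for e in sorted(ad_events, key=lambda e: e["ts"]):
        latest[e["user"]] = e["ad"]  # later events overwrite earlier ones
    return [
        {"order": o["id"], "ad": latest[o["user"]]}
        for o in orders
        if o["valid"] and o["user"] in latest  # invalid orders filtered out
    ]

clicks = [
    {"ad": "ad-1", "ts": 5, "user": "u1"},
    {"ad": "ad-1", "ts": 59, "user": "u2"},
    {"ad": "ad-1", "ts": 61, "user": "u1"},  # falls into the next window
]
aggregates = tumbling_aggregate(clicks)
orders = [{"id": "o1", "user": "u1", "valid": True},
          {"id": "o2", "user": "u3", "valid": False}]
print(attribute(orders, clicks))  # [{'order': 'o1', 'ad': 'ad-1'}]
```

The UUID on each aggregate plays the role described in steps (12) and (13): downstream sinks can recognise a record they have already seen, and Pinot can upsert on it instead of counting it twice.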

Summary

In this review, we introduced the technologies that Uber uses to process data in near real time, attributing the events generated by ads to the corresponding orders.

Uber uses Exactly-Once Semantics (EOS) processing to coordinate these events and avoid duplication and data mismatches.

They use four main technologies to achieve this: Kafka and Flink, which are responsible for ingesting and processing the input; Hive, which stores the data in a warehouse that data scientists can later use; and Pinot, which enables fast querying of the results to generate dashboards for users.

With this system, Uber can process data from different sources in an accurate, reliable, and fast way using open source technology.

References

  1. Mehta, A. and Chen, B., KIP-98: Exactly Once Delivery and Transactional Messaging
