Storing trillions of Discord messages is harder than you might think.

In this age of ubiquitous connectivity, it is easy to forget that making it all possible is actually difficult. Enter Discord, a voice, video, and text chat app, and its tale of woe concerning the storage and retrieval of the trillions of messages generated by its users.

The company's struggles are detailed in a blog post by Bo Ingram, a senior software engineer at Discord.

On a large public Discord server, most of the messages displayed are recent and cached, so performance is not an issue. However, on smaller servers used by a few friends, old messages that are not cached are often displayed and read from the database each time a user logs in.

According to 2019 figures, Discord was processing 25 billion messages per month, which works out to roughly 850 million per day or 600,000 per minute. The numbers are likely even larger now.
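As a quick sanity check on those figures, the arithmetic can be sketched as follows. The 30-day month is an assumption made here for illustration, which is why the results land slightly below the rounded numbers quoted above.

```python
# Rough check of the 2019 message-rate figures quoted above.
MESSAGES_PER_MONTH = 25_000_000_000
DAYS_PER_MONTH = 30          # assumption: an average month of 30 days
MINUTES_PER_DAY = 24 * 60

per_day = MESSAGES_PER_MONTH / DAYS_PER_MONTH
per_minute = per_day / MINUTES_PER_DAY

print(f"per day:    {per_day:,.0f}")     # about 833 million
print(f"per minute: {per_minute:,.0f}")  # about 579 thousand
```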

Discord initially used a MongoDB database to support this activity, but later migrated to Cassandra. The appeal of Cassandra is that it is a NoSQL database that supports clustering: multiple instances run together as a single database, which improves performance.

Apple and Netflix, among other major tech companies, are said to use Cassandra. In other words, it is trusted by blue-chip companies. Initially it served Discord well too: after the company solved some early problems, the Cassandra database wrote new messages in less than 1 millisecond and read old messages back to users in 5 milliseconds. That is fast enough for a typical text chat.

But the good times didn't last: Discord's monthly active users reached 150 million, and by early 2022, Discord was running 177 Cassandra nodes that stored trillions of messages.

As a result, certain patterns of users sending and reading messages on a Discord server created hot spots in the database.

The way Cassandra handles deletions also dramatically increased the latency of reading messages: data is first marked for deletion and only purged in a later cleanup pass (compaction). Roughly speaking, the tombstones that mark data awaiting deletion slowed down reads of the live data.
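Why tombstones hurt reads can be illustrated with a toy model. This is only a sketch, not Cassandra's actual storage engine; the `Cell` type and `read_latest` function are hypothetical names invented for this example. The point is that a read has to step over every tombstone it meets before it finds enough live messages.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cell:
    message_id: int
    body: Optional[str]  # None means the cell is a tombstone (deleted, not yet purged)


def read_latest(cells: List[Cell], limit: int) -> Tuple[List[Cell], int]:
    """Return up to `limit` newest live cells and how many cells were scanned."""
    live: List[Cell] = []
    scanned = 0
    for cell in reversed(cells):  # newest first
        scanned += 1
        if cell.body is not None:
            live.append(cell)
            if len(live) == limit:
                break
    return live, scanned


# A channel with 1,000 messages and nothing deleted.
clean = [Cell(i, f"msg {i}") for i in range(1000)]
# The same channel after the newest 900 messages were deleted but not yet purged.
deleted = [Cell(i, f"msg {i}" if i < 100 else None) for i in range(1000)]

_, scanned_clean = read_latest(clean, limit=50)
_, scanned_deleted = read_latest(deleted, limit=50)
print(scanned_clean, scanned_deleted)  # 50 950
```

Both reads return the same number of messages, but the second one touches 19 times as many cells to do so.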

Discord's solution was to migrate to a different database, ScyllaDB. It is compatible with Cassandra but written in C++ rather than Java, which brings several advantages, including better performance and no garbage collector to pause the database.

The switch to ScyllaDB took place last May, and it paid off: according to Ingram, latency has improved significantly compared with the tail end of the Cassandra deployment.

"It's a quieter, better behaved database (I can say this because I'm not on-call this week). No weekend-long shootouts, no manipulation of cluster nodes to maintain uptime; from 177 Cassandra nodes to 72 ScyllaDB nodes, we are a much more efficient database."

"Tail latency has also improved dramatically. For example, fetching messages in the past was 40-125 ms p99 in Cassandra, while in ScyllaDB, p99 has settled at a nice 15 ms, and message insert performance was 5-70 ms p99 in Cassandra, while ScyllaDB has a stable p99 of 5 ms."
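For readers unfamiliar with the term, "p99" is the latency that 99 percent of requests stay at or below, so it captures the slow tail that an average or median hides. A minimal sketch using the nearest-rank method is shown below; the `percentile` helper and the sample latencies are made up for illustration and are not Discord's measurements.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [5] * 98 + [70, 125]

print(percentile(latencies_ms, 50))  # 5
print(percentile(latencies_ms, 99))  # 70
```

The median looks healthy at 5 ms, while the p99 exposes the slow requests that some users actually experience.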

So keeping a message service from collapsing under the weight of millions of users is a little more complicated than you might think. You can read more on Discord's official blog.
