This blog post seems to blame GC heavily, but if you look back at their earlier blog post [0], it seems to be more shortcomings in either how they're using Cassandra or how Cassandra handles heavy deletes, or some combination:
"It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel. If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it)."
And although the blog post talks about GC tuning, there's mention here [1] that they didn't do much tuning and were actually running on an old version of Cassandra (and presumably JVM) - having just switched over from CMS (!).
The funny part is ScyllaDB still uses tombstones for deletions, though they do have configurable compaction strategies and iirc Discord uses Scylla's Incremental Compaction Strategy that I suppose solves the specific issue they were dealing with. iirc that compaction strategy will trigger a compaction once a certain threshold of a partition is tombstones and then the table is rebuilt without the tombstoned content (which effectively pauses writes on that specific node and that specific table and partition for the duration of that process). Compacting a massive partition is really expensive. Scylla defaults to warning you that a partition is too large if it has at least 100,000 rows in it. My guess is when they moved to ScyllaDB they also adopted a new strategy for partitioning messages in a channel that keeps partition sizes reasonable so compactions don't take a super long time.
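To make the tombstone cost concrete, here is a toy model. It is not how Cassandra or Scylla actually store data, and the names (`Partition`, `read_live`, `compact`) are made up for illustration. The only point is that a delete is a write of a marker, so a read has to walk past every marker until a compaction rewrites the partition. Real systems also wait out a grace period before dropping tombstones, which this sketch ignores.

```python
from dataclasses import dataclass, field

TOMBSTONE = object()  # sentinel marking a deleted row


@dataclass
class Partition:
    """Toy partition: message_id -> body, or TOMBSTONE if deleted."""
    rows: dict = field(default_factory=dict)

    def write(self, message_id, body):
        self.rows[message_id] = body

    def delete(self, message_id):
        # A delete is itself a write: the row is shadowed, not removed.
        self.rows[message_id] = TOMBSTONE

    def read_live(self):
        """Return (live rows, number of entries the read had to scan)."""
        scanned = 0
        live = []
        for message_id in sorted(self.rows):
            scanned += 1
            if self.rows[message_id] is not TOMBSTONE:
                live.append((message_id, self.rows[message_id]))
        return live, scanned

    def tombstone_ratio(self):
        if not self.rows:
            return 0.0
        dead = sum(1 for v in self.rows.values() if v is TOMBSTONE)
        return dead / len(self.rows)

    def compact(self):
        """Rewrite the partition without the tombstoned entries."""
        self.rows = {k: v for k, v in self.rows.items() if v is not TOMBSTONE}


def demo(n=100_000):
    """Write n messages, delete all but the last, read before and after compaction."""
    p = Partition()
    for i in range(n):
        p.write(i, f"msg {i}")
    for i in range(n - 1):
        p.delete(i)
    before = p.read_live()
    ratio = p.tombstone_ratio()
    p.compact()
    after = p.read_live()
    return before, ratio, after
```

In `demo`, the read before compaction returns one live message but scans every entry in the partition; after `compact()` it scans one. A threshold-triggered strategy like the one described above would amount to calling `compact()` once `tombstone_ratio()` crosses some configured value.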
I don't see anything here that looks untoward. They increased their data storage by 3 orders of magnitude and decided to use a different DB system. Fair enough, maybe they've learned more about the nature of their data.
But that logic isn't sound. When dealing with huge amounts of data there are going to be trade-offs. Picking a system that makes different trade-offs to an existing system is not automatically helpful. Yes you don't have the old problems. However, you are about to discover new problems. There is always something of a gamble around which will be more of a problem to your business.
This is really interesting. CMS was removed in Java 14 after being replaced by G1GC in Java 9. They were probably running an antiquated Java 8 or 11 runtime. So that means that in 2022 they were either running a 4 year old Java 11 runtime or an 8 year old Java 8 runtime. They were really leaving a lot of performance on the table.
They could also have gone the commercial route and gotten Zing with their pauseless GC. It’s been around forever and they even cover Cassandra in their marketing.
That services layer reminds me of a big, fancy, distributed Varnish Cache... they don't mention caching and they chose the word coalesce so I assume it doesn't do much actual caching. But it made me think of Varnish's "grace mode" and its use to prevent the thundering herd problem (which is where I first heard of 'request coalescing') https://varnish-cache.org/docs/6.1/users-guide/vcl-grace.htm...
Also love to see consistent hashing come up again and again. It's a great piece of duct tape that has proven useful in many similar situations. If you know where something should be then you know where everything is gonna come look for it!
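For anyone who hasn't seen it, a minimal consistent hash ring looks something like the sketch below. This is a generic textbook version, not Discord's routing code; `HashRing`, `node_for`, and the virtual-node count of 64 are my own choices for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash (Python's built-in hash() is randomized per process).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = []
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((_hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Owner is the first point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The useful property: every caller computes the same owner for a given key without coordination, and adding a node only moves the keys that the new node takes over. Everything else stays where it was.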
Varnish does call it coalescing. Grace is used for a specific situation: When a previously cached object has expired, Varnish won't evict it from the cache immediately, but will continue to serve the old content, while sending exactly 1 request to the background to refetch. How long an object can live after expiring is called the grace. The HTTP standard calls this behaviour "stale-while-revalidate".
Grace mode itself doesn’t prevent thundering herd; varnish coalesces all requests automatically and grace mode is used to increase the likelihood of clients receiving cached (albeit stale) responses.
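A rough sketch of what coalescing means, separate from grace: concurrent requests for the same key wait on a single in-flight backend fetch instead of each hitting the backend. This is not Varnish's or Discord's implementation; `Coalescer` and the `fetch` callback are hypothetical names, and there is no caching or stale serving here at all.

```python
import threading
from concurrent.futures import Future


class Coalescer:
    """Concurrent requests for the same key share one backend fetch."""

    def __init__(self, fetch):
        self._fetch = fetch          # hypothetical backend call: key -> value
        self._lock = threading.Lock()
        self._inflight = {}          # key -> Future for the fetch in progress

    def get(self, key):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                # First caller for this key becomes the leader and does the fetch.
                fut = Future()
                self._inflight[key] = fut
        if leader:
            try:
                fut.set_result(self._fetch(key))
            except Exception as exc:
                # Followers see the same failure as the leader.
                fut.set_exception(exc)
            finally:
                with self._lock:
                    del self._inflight[key]
        # Followers block here until the leader's fetch completes.
        return fut.result()
```

Once the fetch finishes, the next request for that key triggers a new fetch, which is why this by itself is not a cache. Grace mode would sit on top of this: serve the expired object to everyone while exactly one such fetch runs in the background.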
Some additional nuggets by ScyllaDB co-founder:
- Discord couldn't complete repair with Cassandra. Not the case with Scylla
- Scylla has a lot in common with Cassandra, for good reason, like the LSM tree, compaction etc. However, Scylla has unique CPU and I/O schedulers which allow us to prioritize queries over compaction, and defer compaction to the half millisecond where we have enough idle bandwidth. We have plenty of articles about it
- Scylla has a new (1.5 years) tombstone_gc=repair - a much safer mode
- Scylla's new architecture of Raft and tablets was recently launched and is the next big thing for our users. Watch the cool youtube video of those tablet load balancing
This whole problem wouldn't exist if we used distributed chat protocols which have been around for over 40 years (IRC).
With the added benefit of having an open specification and multiple implementations. No walled gardens.
And if you think IRC is too old for the modern world take a look at matrix or xmpp.
How did we let discord take over is a mystery to me, or rather a tragedy.
IRC does not store messages, it only relays them to clients. You need an add-on solution to store chat history, something we've been taking for granted for ~30 years.
IRC all but requires using a bouncer to follow a conversation from more than a single device.
IRC does not encrypt messages, only (optionally) the client<->server connection. Without E2EE, you have no privacy against the server/operator, which is an easily targeted SPOF.
Matrix (the protocol) is still in flux, and the implementations are lagging behind the spec. If you're not using Element, you're behind on features and security.
XMPP is (similarly to IRC) relying on optional protocol add-ons for basic things, like E2EE, which clients may or may not support fully or correctly.
2013/Snowden happened 11 years ago. E2EE should by now be considered a basic feature, a commodity, something we should be calling for as relentlessly as we did for HTTPS. (Discord of course does not implement E2EE.)
Truth is, E2EE isn't a "basic thing". It's an add-on feature that most people don't want. It is impossible to have E2EE that doesn't leak into the UX, and most people would rather have a streamlined UX than deal with key management. It is also much more complex to have robust E2EE in a group chat.
The thing that sets E2EE apart from HTTPS is that HTTPS requires nothing from the end user. It just works. And as a site owner, you just set it up once and forget about it.
> It is impossible to have E2EE that doesn't leak into the UX
True, but one is also free to study the UX solutions implemented on platforms such as iMessage, WhatsApp, and Signal, which all have strong E2EE and see plenty of mainstream usage.
> [...] HTTPS requires nothing from the end user.
Depends on how you define "nothing". We've collectively put an insane amount of work to bring HTTPS to where it is today. Also, HTTPS continues to rely heavily on each server operator's skills and diligence.
There's also plenty of edge cases where HTTPS clients need to go an extra mile, such as containers (many base images do not include a cacert bundle), IoT/retrocomputing/other underpowered devices, and so on. There's always a cost, but it's usually worth it.
On iMessage, your keys are managed by Apple. You effectively fully trust them (which seems to be the assumption in most of Apple products anyway). I wouldn't call this a "real" E2EE implementation.
In WhatsApp, you're limited to one device logged into your account, and the rest are proxied through it. And message backups, those are annoying.
In Signal, you have all those stupid backups too, and while you're able to log into multiple devices (it seems), your past messages don't load "for your own security", and there's also this stupid time component so you get logged out on your computer if you haven't used the Signal desktop app for some weeks (which I don't).
Whereas on Discord, Telegram, Slack and other IM services without end-to-end encryption, you log in on a new device and that's it. You instantly get access to all your messages since the beginning of time, and stay logged in forever.
> On iMessage, your keys are managed by Apple. You effectively fully trust them (which seems to be the assumption in most of Apple products anyway).
I'd argue there are many scenarios in which this might be preferable to a lengthier/wider supply chain. Personally I'd sooner trust Apple than Microsoft+(Lenovo/HP/Dell/...)+(Intel/AMD/Qualcomm/Broadcom/...)+(every device with DMA (PCIe/TB), unless you trust your IOMMU)+(.../...)... (you get the point). And the alternatives to Microsoft are each its own kitchen sink.
> In Signal [...] your past messages don't load "for your own security" [...]
I agree that this is quite annoying. HTTPS clients resolved a somewhat similar problem (usage of self-signed certificates) by trusting the user to make an informed choice. I wish Signal would trust their user base to make their own choices there as well.
> Whereas on Discord, Telegram, Slack and other IM services without end-to-end encryption, you log in on a new device and that's it. You instantly get access to all your messages since the beginning of time, and stay logged in forever.
Same with iMessage. Whether this is a feature or a bug, depends on your threat model.
But we're in a situation where we don't even get to make an informed choice - every solution (as you pointed out) comes with its own bag of UX shortcomings. These trade-offs should be user choices, not something the vendor forces upon you. But these are not fundamental shortcomings of E2EE as a concept, but particular issues with its different implementations. WhatsApp shows you can restore messages from a backup; Signal shows you can have "real" multi-device presence; etc. If we could spend 1/100th of the effort we did to push HTTPS everywhere, E2EE could be just as ubiquitous today.
Just spitballing, but couldn't you have a new device login as three fields: username, password, and encryption key? Then if you don't add the encryption key you don't get the history, but still access the account. Then if password managers really saved all three, that would simplify it for more people (at least those with password managers). But there still has to be a cultural shift for a lot of people to password managers asking non-tech people
> On iMessage, your keys are managed by Apple. You effectively fully trust them
Not really? You can choose whether to upload your recovery key to iCloud or not. The software abstracts over the details of course, but Signal does that too. Unless you're arguing that it's impossible for closed source software to have "true E2EE", which may have some merit, but Discord is proprietary, and something is better than nothing.
> IRC does not encrypt messages, only (optionally) the client<->server connection. Without E2EE, you have no privacy against the server/operator, which is an easily targeted SPOF.
Same as Discord.
> Matrix (the protocol) is still in flux, and the implementations are lagging behind the spec. If you're not using Element, you're behind on features and security.
Discord also only has one reference client, but for me even with that client Matrix/Element was not as reliable. I still use and like it, but it's not a like for like in that regard.
> XMPP is (similarly to IRC) relying on optional protocol add-ons for basic things, like E2EE, which clients may or may not support fully or correctly.
But if you use current clients like Conversations or Dino or the likes it does work. There is no point in counting the clients that don't support it if these aren't the reference or biggest ones. The problem here is more that it's not meant to be used like Discord in any way. Not for big group chats/channels nor for big voice chats (not even sure this possible).
> IRC does not encrypt messages, only (optionally) the client<->server connection. Without E2EE, you have no privacy against the server/operator, which is an easily targeted SPOF.
FWIW this point isn't relevant to the IRC vs Discord discussion, since Discord is also very much not E2EE. That said, XMPP is my preferred protocol that checks all of the boxes.
I have stated that at the end of my original comment. I'm not advocating for Discord (merely enumerating IRC's and XMPP's shortcomings), but I would like to point out once again, that post-2013 any solution that does not enable strong E2EE by default should not be advocated for - at all.
> That said, XMPP is my preferred protocol that checks all of the boxes.
Read up on soatok's breakdown of the design & status of OMEMO. I'm not a cryptographer, but I do trust a cryptographer when they say some protocol's design/crypto is broken.
Maybe for your use. For my use, not a single thing that goes over Discord is something I'd object to being posted on a public website. That includes DMs. Not having E2EE means something isn't a solution for actually private conversations, but a lot of conversations happen in settings that are not actually private in any sense.
But Discord & IRC aren't generally private spaces. They're no different to web forums in that you would reasonably expect that something you write today would be accessible without reference to you in 10 years hence.
That's a very different proposition to a private/group message exchange in WhatsApp/iMessage etc.
I get that, I wasn't passing judgement. You guys must be super sensitive to be downvoting me for just sharing another point of view.
Personally, I find xmpp and IRC to be easier ways to talk to friends and interest groups when they use those networks. The software is simpler, faster, and a better experience for me.
Matrix is a bit of an exception where it's slow and buggy and barely hanging on.
But me and my friends don't care about Discord stickers or nitro or giphy links or the Discord store or any of that kinda stuff that you go to Discord to use. And that's fine if you do.
People can want and enjoy different things and also "want to easily talk to their friends or interest groups without having to worry about it."
I do consider it a feature, in hindsight. Learning to program by asking "dumb" questions was great, because chats were ephemeral, nobody cared if the same question was asked for the 10 millionth time or risk of embarrassment being like 12 years old and asking greybeards for help.
Nobody also felt bad saying "RTFM" because, whatever, it blows over in a minute, there's no permanent record of having a harsh moment, more free to just move on.
The same old questions being asked due to no search also provided more opportunities to answer those questions, so, newbies could start to learn by teaching.
So, yeah, I think something beneficial was lost, even if I wouldn't go back to that approach- it's more of a tradeoff than a definitive improvement
> I do consider it a feature, in hindsight. Learning to program by asking "dumb" questions was great, because chats were ephemeral, nobody cared if the same question was asked for the 10 millionth time or risk of embarrassment being like 12 years old and asking greybeards for help.
I pity the new generations for not having this kind of opportunity: the opportunity to make mistakes, say dumb stuff and goof off with all these things vanishing in a matter of minutes, hours at most.
I miss the old internet: at any point you could pick a new nickname and get a fresh and clean new email address from many of the webmail providers and just start a new online life.
And it was considered normal. It was actually a "best practice" to never use nicknames.
This approach simply doesn’t work when users are allowed to vote or have any sort of scoring mechanism. Since bad actors will also create multiple “online lives” and manipulate those systems with a few clicks
Remember when phrases like "Never use your real name online" used to be near universal? Yeah, this is something I also miss about the old Internet.
Like, even back then you could absolutely tie your IRL identity up with your online identity, but the difference of course was that it wasn't a requirement of existing online, like it is now. Like yeah, you can stay anonymous but a) it's super difficult since the modern day assumption is that you're not doing that and b) that you're up to no good, because why would you be hiding who you are, unless you were doing something shady. And now even "normal" people lament just where we went wrong and what happened to online privacy. To the aware, privacy dying like this was clear as day, but I suppose most just didn't hear, or chose to ignore, the alarm bells.
And now everything is logged, analysed, and associated with the people who produced the messages and other sundry content. There is no ephemera, we need laws just to be forgotten by services (as an EU citizen, I'm glad about the law existing here, but it shouldn't need to be a law, it ideally should be assumed), and we're constantly getting watched by both states and surveillance capitalists alike. Not actively in most cases, mind you, but passively, with our movements, our interactions online, and just what we do, just getting aggregated into these humongous data sets of Big Data, to train statistical models on. Mostly to surveil us even harder, or to manipulate us in the form of advertisement, which can be even more insidious in some ways.
I'm sure that stuff like the Cambridge Analytica fiasco could have occurred even without this destruction of privacy, anonymity, and ephemeral content, but I posit that it would have been way more difficult had people not been encouraged to put everything about themselves into services that would log them and build evermore complex models about them and their thoughts. And now this kind of stuff can be used to destroy democracies, and as alluded to earlier, manipulate for example our spending habits. And now we all wonder just where this all went wrong.
> How did we let discord take over is a mystery to me, or rather a tragedy.
The fact that you're baffled why discord took over is exactly why it took over. You can't even acknowledge that the user experience is 10x better and it's suitable for a general non-technical audience.
New quest available!
Buy nitro for stickers!
Buy nitro as a gift!
New quest available!
New quest available!
Restart to update.
New quest available!
Look at the new emojis you could use with nitro!
New update available!
New update available again!
Third update today!
New quest available!
Look at these profile decorations you could use with nitro!
Boost this server!
*NEW QUEST AVAILABLE*
I’m a huge IRC fan and I dislike Discord, but all these other services are way too clunky and IRC is really only usable through IRCCloud that has a relatively okay mobile app these days.
Recently a very technical group I’m part of migrated from Telegram to Matrix and the user experience is just not very good. The apps are buggy, don’t look good, then in the new “Element” app SSO isn’t supported so I can’t use my account with it. There’s lots of paper cuts that are okay for someone like me who likes to figure it out but I’d never try to convince my friends to use it.
For telegram refugees then maybe SimpleX is an option, except it has no bots nor other options for clients at the moment.
What I personally use is the nostr protocol through a client like Amethyst or OxChat. Messages and groups can be E2EE private, or you can just use the public groups.
The biggest advantage is that you are joining a bigger community of apps and services built on top of the same protocol, rather than joining some isolated island (again).
I recently listened to a nostr podcast and even people working on it said it would not be reasonable to recommend it as a secure messaging app at this point. Just because very early things like metadata leaking are not addressed yet. So not really an alternative.
I don't know what podcast you are mentioning or the context. Anyone can say anything on youtube.
We are talking about a transition from telegram, and when comparing to that platform NOSTR is undoubtedly more secure, given that telegram doesn't even encrypt conversations by default and users aren't informed of this. Whereas in NOSTR you are made aware when a conversation is private between both parties.
Metadata is fetchable for 99% of messaging apps out there. If you'd ask me about making a more secure app then this involves continuous streaming of data, padding of messages to avoid content guessing and avoid the usage of internet as data channel.
So it really depends on what you consider secure and what it is compared against. Compared to Telegram it is more secure. Compared to a piece of paper encrypted with a custom algorithm and delivered by a trusted human transporter? Not really.
*Some* geeks. Specifically those who are into encryption.
There is nothing wrong with wanting an application to just work, especially when it's significantly better than what came before (contemporary competitors were Skype and IRC)
You download an exe, install it, make an account and it runs. Just like that. Everybody can do it.
There is tons of useful and great software out there. Most of it is not easy for the public. Some (most?) of it doesn't even have a GUI. People would rather sell their identity and even pay than suffer through too many hops.
> How did we let discord take over is a mystery to me, or rather a tragedy.
Anyone can set up or join a Discord server. If you give users the choice between a complex open platform and an easy proprietary solution, they will pick the latter every time.
There’s no lack of open chat protocols and federated services but those have mostly torpedoed themselves: by usability and discoverability problems, holier–than–you attitudes, and plain nerd attention wars. Such as XMPP (used a lot until around 2010 but easily dragged into the mud because XML and overengineering), Mastodon (saw a surge as twitter was faltering but then seemingly stopped being everyone‘s darling as its limitations became obvious, among them Mastodon admins taking their audience hostage; also ActivityPub fans going around advertising it for each and everything when RSS is just fine for web sites, damaging news feeds altogether in the process).
Where spamming, or the systematic exploitation of digital communication by the „ad industry“, was killing it in the past (Usenet, and arguably the web), today there‘s also the problem of being consumed by LLMs to push non-public messaging. Though I‘m not sure the latter is really a concern for many, as developers not only are giving away their code, but their entire activity log/issues and their solutions on github such that they can easily be digested and replaced by coding assistant LLMs, git being a distributed system in the first place.
> among them Mastodon admins taking their audience hostage
I was excited first hearing all the "fediverse" stuff, but having to hand over control of your online identity to a particular node forever felt a little bit like "old boss, same as the new boss."
(Yes, I know some folks are working on the identity issue.)
Reminds me of when I joined the largest mastodon server for my country. Advertised by the owner as a bastion for free speech, democracy and fair treatment. Then in 2020 it started mass banning everyone "that went against science" on the covid fraudemia in our country.
Twitter in those days was bad, but that mastodon server sure became even worse. Nowadays I've found a breath of fresh air and innovation with Nostr. No more servers with your data and followers locked inside.
You can silence the people you don't want to hear, but you won't hold them hostage in forced silence any longer.
Mastodon means you can at least pick your boss, be your own boss, and take your identity and followers to a new boss. (Possibly even taking your content too, though maybe not links)
Picking a ‘boss’ in a system where the average ‘employee’ has no credible way of assessing or evaluating them, or their superiors, and zero prospects of ever getting a face to face meeting with, is effectively no different to having the boss picked by an anonymous shareholder meeting in SF.
If all of the potential bosses have roughly the same degree of accessibility… which is the case for Mastodon for anything over a few hundred users.
Did they ever address the problem of migration from a bad server?
For example, a scenario where your server dies and does not return. Or a malicious actor takes over and bans the user base. Or a honeypot encouraging user account migration, followed by bans.
In all 3 cases, you are effectively screwed the moment you migrate to a malicious server, or your server becomes malicious.
I remember blue sky trying to address this by tying your identity to a DNS record or something, but it's a severe limitation in anything trying to be decentralized
The other reply goes to airplanes but there are much more common ways to get disconnected. Locking my phone or closing my laptop lid disconnects me from IRC. A lot of Discord users have desktops that are always on (since Discord originally advertised to gamers), but a lot of Discord users don’t.
Discord is fundamentally a very versatile platform. If you lose one seemingly unimportant feature, you lose a lot of versatility. Maybe I’ll write a blog post just with examples of how I’ve used it. It replaces IRC, but it also replaces Facebook groups, Skype, a lot of group texts, and a lot of email for me.
It does alter the meaning of chat tremendously. In discord, often things become heavy, because we're not talking, we're accumulating information, and you have to stay on purpose so data is manageable and seekable.
The few times I join IRC I know we're only here to chat, it's semi-transient (a little bit more if logs are stored) and I feel lighter.
Is it really that much of a jump to say "I would like to see the chat that has happened between my friends between the time I got on a plane and then got back off"? Does that sound odd?
Imagine if you couldn't receive e-mail while you were offline!
This isn't to disparage IRC and friends too much, obviously there's huge value in it existing as a synchronous chat room. Just... async chat is a thing that totally happens for most people.
a non-technical person wouldn't consider the implications of a history log with regards to security or data hoarding, they just see it work and think of it as a convenience.
this value sell shifts in the mind of the non-technical person once they're told that the feature they want implies non-ephemeral data that will be systematically sifted through either for legal or financial benefit by a third party.
in other words : the reason why 'async chat is a thing that totally happens for most people.' is because a vast majority of people are simply unqualified to even see the problem, much less seek alternatives or solutions to the data hoarding that they must comply with.
this creates a social effect and pulls everyone into Discord, regardless of their beliefs on the matter, simply because it has become 'the only game in town'.
regardless of personal preference, centralization of these kind of things is BAD for the user in nearly all circumstances aside from convenience.
Please stop pretending that "data hoarding" didn't / doesn't happen on IRC. There's nothing inherently friendly to security or privacy in the protocol; if anything, it's quite the opposite.
That you can, with augmentation and diligent op-sec, get something a bit better than Discord isn't a great selling point unless you have the time and resources and buy-in already, not just for yourself but from everyone in your group. At which point, there are still better options than IRC.
For decades now, the main draw of IRC has remained a fetish for conspicuous configuration, as it embodies a sort of brutalist architecture of communication software. The excuses change every few years, but the love for cobbling together a barely workable system from parts remains core.
Sure, the advantages of async communication are obvious but the crucial difference is that in that case vendor has to store your data somewhere in the data center. Reusing that data for unsolicited purposes is what many people will have a concern with.
But logs are stored on IRC as well. It’s not part of the standard protocol, but a lot of IRC servers can do that automatically and there are bots which do that, not to mention personal archives.
The difference is that end-users don’t have easy access to these logs. And on Discord they do (because it is a part of the protocol).
How about a secure async chat where the vendor simply stores a list of message IDs, and then the client requests if anyone has a copy of any message you haven't received yet from the other users in chat when you log on
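Something like the following, as a very rough sketch of that idea. Everything here is hypothetical (`Vendor`, `Client`, `catch_up`), and I'm adding one assumption that wasn't in the comment: message IDs are hashes of the ciphertext, so a client can check that a copy handed over by a peer is the message the vendor listed. Networking, encryption, and peers being offline at catch-up time are all left out.

```python
import hashlib


def message_id(ciphertext: bytes) -> str:
    # Assumption: IDs are content hashes, so copies from peers can be verified.
    return hashlib.sha256(ciphertext).hexdigest()


class Vendor:
    """Stores only the ordered message IDs for a chat, never the content."""

    def __init__(self):
        self.ids = []

    def announce(self, mid):
        self.ids.append(mid)


class Client:
    def __init__(self):
        self.store = {}  # message ID -> ciphertext

    def send(self, vendor, online_peers, ciphertext):
        mid = message_id(ciphertext)
        vendor.announce(mid)
        self.store[mid] = ciphertext
        for peer in online_peers:  # offline peers miss the message for now
            peer.store[mid] = ciphertext
        return mid

    def catch_up(self, vendor, peers):
        """Fetch missed messages from peers; return the IDs still missing."""
        for mid in vendor.ids:
            if mid in self.store:
                continue
            for peer in peers:
                blob = peer.store.get(mid)
                # Accept a copy only if it hashes to the ID the vendor listed.
                if blob is not None and message_id(blob) == mid:
                    self.store[mid] = blob
                    break
        return [mid for mid in vendor.ids if mid not in self.store]
```

The obvious weakness shows up in the return value of `catch_up`: if nobody who holds a message is reachable, it stays missing, which is the availability problem a central store solves.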
Such a vendor would have a hard time finding a business model, since plenty of chat services already exist on the market and all of them have access to the data of their users in one way or another. Thus I don't know what other type of leverage they would be able to pull off to sustain their business.
> How did we let discord take over is a mystery to me, or rather a tragedy.
I think I'm reasonably technically competent, and I also dislike Discord's issues with privacy, data sovereignty, siloing information away from the open web, etc.
But you know what I think whenever I click a Matrix link, or IRC? I just don't want to deal with it. You get a list of apps you've never heard of, some of which may not be feature-complete, some with more than one version, some which are advertised using words like "GNOME", "Rust", "Qt5", and "C++" that have no meaning or relation to actually using them as a chat app, and all of which I guess are different and would need to be tried and learned separately. Then picking and clicking one tries to open an outside program which probably isn't installed and I don't want to install because I don't really know/care what it is. And if at that point, out of the dozen or so app options it showed you, you happened to choose one with a web version like Element, and you figure out you can click the "Continue in your browser" button out of the four or five unexplained buttons that pop up as a result ("XDG-Open", "Cancel", "FlatHub", "Download", and "Continue in Browser")— You get a static screen that shows just enough message history to not be useful, with a confusing UI you can't seem to interact with, hidden behind a login wall that still hasn't really explained what in the Internet tubes you're actually looking at.
If you try to Google "What is Matrix"— You get pages about math. So then you Google "What is Matrix chat". And all the results harp on using words like "open network", "decentralised", "protocol", "real-time communication", "open standard", "federated"— Which, again, may be technically interesting if you're into that, but doesn't actually have anything to do with how it directly serves the user as a chat app and how you can use it or sign up for it.
It takes way too many clicks, and you get bombarded with way too much information… To still not end up using the app, and in fact end up more confused than before about what a "Matrix" even is. Let's say you lose 15% of incoming users at each step. That rapidly scares off most of the mainstream, before they've even tried it. Maybe Matrix and Element are great. But it just seems like such an ordeal.
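To put numbers on that 15% guess (which is only a guess, not measured data), the loss compounds per step. `retained` is just a helper name for this illustration:

```python
def retained(steps: int, drop_per_step: float = 0.15) -> float:
    """Fraction of users left after `steps` steps, each losing `drop_per_step`."""
    return (1.0 - drop_per_step) ** steps


for steps in (1, 3, 5, 8):
    print(f"{steps} steps: {retained(steps):.0%} remain")
```

With a 15% drop per step, five steps leave roughly 44% of the people who clicked the link, and eight steps leave roughly 27%.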
Compare that with Discord. You click a link. And then either you're already in the server, or it has a single text box and a single button you click to funnel you through making an account and joining the server.
It doesn't try to convince you to install a Desktop app until you're already fully using it in the web version. You get clear answers and reasons to use it if you search "What is Discord" or go to the website. It doesn't overwhelm you with options and then hound you with technical explainers that you didn't ask for.
IRC goes the other way in usability. People want voice chat, message history, different channels in the same "server", PM channels, etc.
Because the voice chat function is leaps and bounds better than anything else out there, and it was primarily used for that, to game in real time. The text was an afterthought for gamers.
There are loads of comments exactly like OP's, and they always make the mistake of mentioning IRC alongside XMPP and Matrix. Inevitably repliers can't help themselves and spend their replies discussing IRC's unsuitability for modern IM and how it's not federated. When IRC is mentioned, commenters ignore XMPP and Matrix and attack the point in terms of IRC. (Though this thread in particular is better than average).
Matrix and XMPP are the far more appropriate competitors for Discord, we need to steer the conversation toward them.
I deliberately never mention IRC when I make these types of comments so people don't latch onto it and ignore everything else I said.
My takeaway from this is maybe somewhat different from what the authors intended:
> The last one? Our friend, cassandra-messages. [...] To start with, it’s a big cluster. With trillions of messages and nearly 200 nodes, any migration was going to be an involved effort.
To me, that's a surprisingly small number of nodes for message storage, given the size of Discord. I had honestly expected a much more intricate architecture, engineered towards quick scalability, involving a lot more moving parts. I'm sure the complexity is higher than stated in the article, but it makes me wonder, given that I've been partially responsible for more than 200 physical nodes that did less, how much of modern cloud architecture is over-engineered.
They are talking about 177 database nodes, which is not an indicator of architecture complexity. I assume they have dozens/hundreds of services consisting of multiple highly available nodes each across various geographies.
Having seen a much smaller set of Cassandra nodes used to store billions (rather than trillions) of records, I can say that Cassandra was definitely a total PITA for on-call, and a cause of several major outages.
> ...how much of modern cloud architecture is over engineered.
I would wager a good majority of it is. The Stack Overflow architecture[0] sticks out to me in this regard as an example on the other end of the spectrum.
Very well-written article. I'm happy for them that part of the solution was switching from Cassandra to drop-in replacement Scylla, rather than having to deal with something entirely different.
I do think there is a balance to be struck, because directed communication means the recipients of old messages are also stakeholders, such that maintaining a consistent record by default is a fundamental part of the "service" they offer. The message contents are different from e.g. secretly hoovering up click patterns. Matrix had some thoughts when they faced the same questions:
The key question boils down to whether Matrix should be considered more like email (where people would be horrified if senders could erase their messages from your mail spool), or should it be considered more like Facebook (where people would be horrified if their posts were visible anywhere after they avail themselves of their right to erasure).
Solving this requires making a judgement call, which we've approached from two directions: firstly, considering what the spirit of the GDPR is actually trying to achieve…
In Discord culture, indeed, users usually share a shit-ton of PII in "introduction" messages from images to specific hobbies to medical information (EG "support" communities).
The problem from a GDPR perspective is that Discord makes it impossible to delete those, since once they detect your interest in trying to delete any of your accounts' data, they will try to "anonymize" it. Then at least publicly your username is disconnected from those messages, but they can still be traced back to specific persons. Now if this is also done server side, then they would be in a situation where you'd either have to go through a ton of messages or bulk delete all past messages to enforce the GDPR demands of a user wanting their PII deleted.
EU Parliament is not a real Parliament in the sense that ONLY the Commission can propose new laws, and the elected parliament basically just votes on those. Who controls the Commission if not the people? The US State Department. Newsguard and non-Musk US bigtechs including Discord are in the same poli-financial bed of the establishment here. And they are full of previous state department workers.*
Unless there is public outrage, the EU-level bodies at least will probably be owned. But public opinion is controlled by the cyberpunk establishment that trains their LLMs & targets their campaign ads using that illegal Discord data to get political advantage.
You in my view ought to "worry" about the fact that it's possible there will sooner or later no longer be an escape from a permanent establishment, Orwell-style. This goes along with the theme that "cybersecurity" at the United States government level has been a "war against hate speech" for years, with "hate speech" of course meaning "censorship of internal and external enemy speech."
Budd Dwyer, if I recall correctly, shot himself on TV after writing to Biden (???) that under some conditions (that became true), the Department of Justice should have "Justice" removed from its name.
---
Most of this I hold only at 50+% confidence of being broadly correct. Take with lots of salt.
Cassandra is essentially an append-mostly distributed fault-tolerant hash table. If you need specifically that with high write throughput, it's a good choice. I don't understand why people use it as a database. You run into its limitations immediately and the pain of trying to use it like a database only gets worse with scale.
> In Cassandra, reads are more expensive than writes.
This makes it insane as a message store for a chat server to me. It seems appropriate for a logging destination for a distributed system, one where you want lots of clients to dump data but most of the time you don't even need to audit the logs, so the number of reads for a given item is less than one. This is obviously not true for Discord messages.
The sentence makes it sound like Cassandra and Scylla are slow for reads, which isn't the case at all. It's just that writes require a bit less I/O. Reads are still very fast. If reads were slow, nobody would use Cassandra and Scylla for the purposes that they're being used for.
Not too sure - I would have guessed that most of the messages are written once, read by the constant number of participants (say 1-100 or so) and then they disappear off the screen and are never accessed again, ever. Maybe a few people will scroll or search, or use some custom extension to load and export the history, but very rarely.
All the Cassandra documentation and the website say it is a database. You can't blame anyone for getting confused. In my experience, I have never seen a project that started to use it continue to use it after a year or so. It may take a year to run into its limitations before having to replace it with a database like Postgres.
How is it that they can’t just shard the thing? Isn’t each Discord ‘server’ isolated from the others (you can’t send a message from one to the other)? Why can’t they address trillions of messages by having thousands of shards that each handle billions?
> In an afternoon, we extended our data service library to perform large-scale data migrations. It reads token ranges from a database, checkpoints them locally via SQLite, and then firehoses them into ScyllaDB. We hook up our new and improved migrator and get a new estimate: nine days!
How many machines was this migrator running on? One? :D Sounds absurdly amazing!
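For anyone wondering what "reads token ranges, checkpoints them locally via SQLite" might look like, here is a minimal sketch. It is not Discord's migrator: `split_token_ring`, `migrate`, `read_range`, and `write_rows` are hypothetical names, the source and sink are plain Python stand-ins rather than real Cassandra or ScyllaDB clients, and the tiny 0..99 ring is only for the demo.

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple

TokenRange = Tuple[int, int]
Row = Tuple[int, str]


def split_token_ring(parts: int, lo: int = -2**63, hi: int = 2**63 - 1) -> List[TokenRange]:
    """Split the inclusive range [lo, hi] into `parts` contiguous inclusive ranges."""
    step = (hi - lo + 1) // parts
    ranges = []
    start = lo
    for i in range(parts):
        end = hi if i == parts - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def migrate(conn: sqlite3.Connection,
            ranges: Iterable[TokenRange],
            read_range: Callable[[int, int], List[Row]],
            write_rows: Callable[[List[Row]], None]) -> int:
    """Copy every range not yet checkpointed; return how many ranges this run copied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS done (lo INTEGER, hi INTEGER, PRIMARY KEY (lo, hi))"
    )
    finished = set(conn.execute("SELECT lo, hi FROM done"))
    migrated = 0
    for lo, hi in ranges:
        if (lo, hi) in finished:
            continue  # already copied in an earlier run
        write_rows(read_range(lo, hi))
        # Checkpoint only after the write succeeded, so a crash re-copies
        # at most one range instead of skipping it.
        conn.execute("INSERT INTO done (lo, hi) VALUES (?, ?)", (lo, hi))
        conn.commit()
        migrated += 1
    return migrated


# Demo with in-memory stand-ins for the two databases.
source = {token: f"msg-{token}" for token in range(0, 100, 7)}
sink: List[Row] = []


def read_range(lo: int, hi: int) -> List[Row]:
    return [(t, m) for t, m in source.items() if lo <= t <= hi]


conn = sqlite3.connect(":memory:")
ranges = split_token_ring(4, lo=0, hi=99)
first_run = migrate(conn, ranges, read_range, sink.extend)
second_run = migrate(conn, ranges, read_range, sink.extend)  # resumes, finds nothing left
print(first_run, second_run, len(sink))
```

Because the checkpoint is written after the copy, a restart may redo one range, so this only works if writes to the destination are idempotent. Since each range is independent, the same loop could presumably be fanned out across workers or machines.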
Also, people need to keep in mind that those trillions of messages are archived nowhere. Thanks to the walled gardens we're obsessed with building, far-future anthropologists will know more about Pompeii and Machu Picchu than San Francisco.
Secondly, how would such an archive work? Who would pay for it? How would it be safeguarded in such a way that it can be read by 'far future anthropologists' but not the people paying for the storage?
If we're only talking about public chat rooms, it shouldn't be difficult to archive the content of those.
There are open repositories of the entire internet text content (common crawl). These scrapes are periodically repeated. That's orders of magnitude more data than all discord messages ever.
So technically it's not a problem making such an archive. The financing is of course always an issue, but not because the costs are large.
I don't think every single individual message ever needs to be archived. Every text, every email, every post-it, every poke, every emoji, every reaction GIF...
Well, considering the constant push for "let's resolve the issue on Discord", it's very annoying. With things like GitHub issues you can search for a problem and find a solution. Even ancient mailing lists have archives most of the time. Not so much with all those fancy "realtime" chats :/
I agree with the sentiment but GitHub issues is not a good replacement. First, it’s also owned by a corporation and is available on the open web today because they let us (is it even scrape/api available today? Can people build tooling on top?). Anyway, this “openness” can easily be changed once the “value extraction knob” is turned.
Secondly, GitHub is a developer platform, not a user/enjoyer platform. Issue reports are high-barrier even for devs. People get upset if you’re asking a random question, don’t check for duplicates, etc. Some people even get upset about issues without a PR.
Again, I’m all for good open alternatives but when HN is like “you just configure Gentoo and type 30 commands” we don’t stand a chance to actually win users over, gotta accept reality before we can improve it…
Definitely not everything, but it's still wild to me that so many products and services have all their troubleshooting and customer support in a discord server.
It makes sense to me. The number of people who actually create useful open source software is so vanishingly small compared to the number of people who use OSS, it seems obvious that we should optimize for their time, not the other way around. I agree with you that using mailing lists or GitHub issues or whatnot would be globally more efficient, but if I’m working on a product, I’m going to work in the way that is most efficient for my time. I owe my “customers” nothing because they are not paying for my work. We keep seeing discord as a means to communicate about products because devs see it as the best use of their time. The fact that so many people use it should be an indictment on the alternatives, not the devs who choose to use discord.
Sadly, I can understand why Discord doesn't have a lot of incentive to do this. Maybe the community should popularize an open-source, free or low-cost bot and hosting solution for exported chat? (I couldn't find one in a few minutes of searching).
I’m lost at why a DB (Cassandra) with better write performance than read performance was ever selected for a messaging system. I feel like it’s obvious that a message will be read more than it is written (once).
The fact that it has better write speed than read speed doesn't mean that it has bad read speed. It just happens to have even better write speed.
It's like how I connect my phone to my home's cable connection to send a big file. It is better at downloading than uploading, but that doesn't mean it's not the best solution for uploading.
ScyllaDB scales horizontally on a shard-per-core architecture with a ballpark throughput of 12,500 reads and 12,500 writes per second per shard. If you're running Scylla across a total of 64 cores (maybe on 4 VMs with 16 vCPUs each), you can get up to 800k reads and 800k writes per second of throughput, with p99 writes of <500us and p99 reads of <2ms.
You will not be able to get that performance out of Postgres and the write scaling will also be impossible on a non-sharded DB.
If you're a company like Discord and are running dozens (70-something?) of ScyllaDB nodes, likely each with 32 or 64 vCPUs, you've got capacity for 50M+ reads/writes per second across the cluster assuming your read/write workloads are evenly balanced across shards.
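The back-of-envelope math above, spelled out. The 12,500 per-shard figure is the ballpark from this comment; treating one vCPU as one shard and using 70 nodes as a stand-in for "70-something" are assumptions for illustration.

```python
PER_SHARD_OPS = 12_500  # ballpark reads (or writes) per second per shard, from the comment


def cluster_ops_per_sec(nodes: int, vcpus_per_node: int, per_shard: int = PER_SHARD_OPS) -> int:
    """Ideal aggregate throughput, assuming one shard per vCPU and
    a workload spread perfectly evenly across shards."""
    return nodes * vcpus_per_node * per_shard


print(cluster_ops_per_sec(4, 16))    # 800000   (the 64-core example)
print(cluster_ops_per_sec(70, 32))   # 28000000
print(cluster_ops_per_sec(70, 64))   # 56000000
```

So the "50M+" estimate only holds at the 64-vCPU end of the guess, and only as an upper bound for an evenly balanced workload.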
Fwiw the benchmarked numbers are for writing very small rows. When doing the messages migration, with no read traffic, and the cluster/compaction settings tuned for writes we only managed approx 3m inserts/sec while fully saturating the Scylla cluster.
Interesting, we've got to 5M+ reads/sec in realistic simulated benchmarks and ~2M reads/sec of real-world-throughput on our clusters that are <10 nodes (though really high density). I don't think I've pushed writes beyond 1M QPS in real-world or simulated loads yet though. Thankfully our partitioning schemes are super well distributed though and our rows are very small (generally 1-5k) so I don't think we'd have a problem hitting some big numbers.
How about per-node memory pressure, did it change in favor of Scylla? I ask because I would legitimately expect that a GC-based system would put more pressure on the memory subsystem.
Scylla just eats all the ram it can with cache. So it's hard to say really. On Cassandra we allocated half the ram to the JVM which it gladly used up and left the other half to the OS for disk cache. On Scylla, since it uses direct io, there is no need for OS disk cache.
Okay but this is where I get confused. Why does Discord need a single database system when discord servers are independent, right?
And the volume of traffic per Discord server must be human-processable or what would the point be? A Discord server doing 800k writes per second makes no sense.
So why not a RDBMS per Discord server, and if you want to ship all that out to a warehouse for analytics you do that as a separate problem?
Or is it that spinning up a Postgres instance per Discord server ends up being significantly more expensive than these mega distributed database systems?
There are, ballpark, a few hundred million Discord servers... do you really want to run that many Postgres instances? And even so what do you do about DM/GDMs? Easier to just run one big mega cluster for messages.
Okay so the latter then - economies of scale. Surprised to hear that few hundred million figure - I thought it'd be 1/10th of that at most! Wow.
Although I did expect there'd be a very long tail, and you might choose to host a bunch of servers on a single RDBMS, at that scale yeah it wouldn't solve much.
I'd guess that Discord's storage systems lean towards processing a lot more writes than reads. Postgres and other databases that use B-tree indexing are ideally suited for read heavy workloads. LSM based databases like Cassandra/Scylla are designed for write intensive workloads and have very good horizontal scaling properties built into the system.
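To make the LSM point concrete, here is a toy sketch of why writes are cheap and reads can touch several structures. `TinyLSM` is a made-up class for illustration, not how Cassandra or Scylla are implemented; real systems add commit logs, bloom filters, and compaction to keep the read path short.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    """Toy LSM tree: writes go to an in-memory table, which is flushed
    to an immutable sorted run ("SSTable") once it fills up."""

    def __init__(self, memtable_limit: int = 4) -> None:
        self.memtable: Dict[str, str] = {}
        self.sstables: List[List[Tuple[str, str]]] = []  # oldest first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        # A write never touches existing SSTables.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> Tuple[Optional[str], int]:
        """Return (value, number of structures probed)."""
        probes = 1
        if key in self.memtable:
            return self.memtable[key], probes
        # Newest SSTable first, so the most recent write wins.
        for table in reversed(self.sstables):
            probes += 1
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1], probes
        return None, probes


db = TinyLSM(memtable_limit=4)
for n in range(10):
    db.put(f"k{n}", f"v{n}")

print(db.get("k9"))       # still in the memtable
print(db.get("k0"))       # has to fall through to the oldest SSTable
print(db.get("missing"))  # worst case: every structure is probed
```

Every `put` costs the same regardless of how much data already exists, while a `get` for an old or missing key walks every run. That asymmetry is what "reads are more expensive than writes" refers to, even though both can still be fast.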
When you send a message, afaik it sends to all people looking at it at the time. So there is no read when in a conversation, and maybe the reads are batched when reading multiple.
Read traffic is much higher than write traffic due to mobile clients needing to sync chat history more often as their sessions are much shorter lived. Also search queries execute 1 query per result. And don't forget people doing GDPR data dump requests. It adds up.
I’m not sure if Postgres would have enough horizontal scaling to accommodate the insane volume of reads and writes.
I would be super interested to be proven wrong though… anyone know of a cluster being run at that scale?
Many companies have products that operate at “scale”. They manage to do so with pretty boring techniques (sharding, autoscaling) and technologies (postgres, cloud storage).
Because of the insane blog-driven tech culture, many of these teams get questioned by clueless leadership (who read these blogs) asking why the company isn’t using Cassandra or some other hot technology. And it always causes much consternation and waste.
Anyone wanting to introduce $new/$other language, database, library, deployment system, build system into a large enough system that doesn't solve any actual problem is a nightmare for someone working at this scale.
I don't mind the scale, I like it. I don't like having to fend off questions and complaints why we aren't deploying the latest shiny new thing in our core this week.
This was a “double-pump” migration to a faster database and building a caching service. There’s nothing particularly fancy or creative about their solutions. The migration efforts and working out issues with the reverse table scan were probably way more creative, but they didn’t get into that unfortunately.
I think I can understand the appeal, but it's just not there for me. I have enough complicated problems outside of work, some of which are even fun to solve.
I'm happy I'm currently not working at this scale. I'm not happy when idiots (including one of our self-important ex-Google VP's) set this as a benchmark for backend interviews (for careers that 99% likely will never come close to such problems).
I have stared into the abyss and seen the eyes of Cthulhu. I am much happier writing embedded drivers than I was trying to make sense of why previous devs thought it was a good idea to move bounded tunable server side api calls to the client, allowing it to effectively write arbitrary sql calls across multiple databases.
Fortunately the web is starting (very slowly) to return to sanity, pushing back towards the simpler server-rendered pattern with Javascript being relegated to specific use cases.
Which is precisely what is meant by specific use cases. We don't have to throw out the first 25 years of the web and reimplement all of our business logic in a minified JS blob. Even when client side code is necessary, the trend of pushing rendered HTML rather than JSON that must be parsed and rendered keeps us as close to browser primitives as possible.
Once you move beyond basic CRUD, business requirements work their way into the UI. For instance, making fields read-only based on access level. Adding additional form fields, etc. Conditionally hiding and showing entire portions of the UI. All of which requires you to either pass around UI-directives in your data or implement business logic in your client code. Better to just ship HTML, and if we're worried about full page loads, just use one of the many over-the-wire options to only change small bits of a page.
This is before we get into having to implement application primitives like authentication on the client, and all of the state management that goes with. The absolute amount of scaffolding and plumbing we've built up just to save a few ms is always worth questioning. Doesn't mean the answer is no, just that we need to ask the question and not assume the default is carved in stone.
That’s your generation that happens once. The browser still needs to render it. Sure, rendering it on the client may cost the client a bit more, but the client generally has the computational power to spare.
Which becomes a far more important issue when dealing with bandwidth or CPU constrained devices, or artificially imposed constraints due to data usage costs.
Honestly, 77 nodes doesn’t sound like a terrific scale? The more I scale things up, the more I realize that the tone of the problems doesn’t really change. You just get more layers to your data structures.
The blog post shows how great the technical expertise is at Discord. I work in IT and in my company devs are so incompetent, they don't even know how to create an M365/Azure dev tenant and constantly request *.ReadWrite.All on our production tenant. I'm so envious!
On the other hand, the HOME/END keys jump to the beginning of the input field rather than the line and the frontend devs are unable to fix this non-default behaviour for years, which makes it a fucking pain in the ass to use the Posts feature within a Discord channel. I believe the budget for the backend geniuses meant that frontend had to be juniors only.
Hiring well is probably the most important thing for a company and also one of the hardest problems. I have seen a team of competent engineers outperform their sibling teams by 5-10x as long as each member of the team is good enough. Just 2 bad hires will slow down a team drastically. One terrible hire can do -5x the work of a normal engineer.
I haven't found a difference to AWS, for example. They are all terrible in their own ways. But if one or the other is what you earn your money with, then at least put in the effort to be proficient with it, and not a complete dumbass. (Not you as in "you!")
When you get to scale like this, I wonder if the access patterns of the application and its data might be best served by a custom data retrieval and storage application.
I may be wrong but I just wonder if efficiency is lost to the generalized nature of any data storage system.
The other question that comes to mind is, to what extent have the developers made a systematic effort to optimize how data is stored and retrieved? If you’re building a gigantic back end system and simply accepting that the system load is what it is then you might be missing a chance to dramatically impact the size of the task of managing that data.
They did give one example, if someone does a @everyone in a big channel, they specifically optimized their architecture to make that efficient using their custom data services.
Interesting read on one hand, a bit disappointing on the other: when the solution is just "we moved to this other product" it smells of a lack of serious and rigorous investigation.
Also, having worked with the JVM and with GC issues I don't buy the "GC problems" point: there are a number of improvements in recent JVM releases, the main one being ZGC (and generational ZGC in particular).
ZGC is great, I've personally witnessed sub-millisecond GC pauses (and i mean sub-millisecond stop-the-world pauses) on machines serving millions of requests per second. Garbage Collection is largely a solved problem in the industry as of today, thanks to ZGC.
Other than this, comparing latencies for machines with 9TB disks against machines with 4TB disks is a bit like comparing apples and oranges: we will never know if issues at the storage layer were affecting tail latencies. Were the nodes having, I don't know, filesystem fragmentation issues? Does the 9TB storage configuration deliver higher IOPS than the previous 4TB storage configuration? Is it the same kind of hardware underneath (same disk type? same disk bus? or are we talking SSD vs NVMe?).
As somebody that's been doing performance engineering for work, this piece is a bit appalling.
GC is a problem, and it always will be at some level. You can improve it but that doesn’t mean it is not a problem. Memory allocation and management is a problem even in C/C++ programs if you want to optimize your program. There is no universe where GC is not a problem.
My love of embedded stuff is growing. I'm self-teaching C and assembly... to get better at low level programming and interactions with hardware, but it all seems much simpler than the big data systems. Granted, I'm sure it can all be broken down into steps and issues to solve like any programming issue, but I'm happy focusing on low level stuff for now.
Just wondering if anyone considered using Postgres or another relational db. I understand it won’t do multi master replication as well but it is much more stable and predictable if you give it right amount of traffic. I guess the team had to do that part anyways for ScyllaDB
I don't think anyone runs Postgres at that scale (unless with a very specialized sharding setup). Given the choice between using ScyllaDB like everyone else and using Postgres in a super specialized best-in-the-world setup, the choice becomes clear. Also keep in mind that Discord is not a huge super profitable company, so for them to develop something like Vitess for Postgres would not make sense. For a small company with huge data like Discord, using existing data solutions makes a lot more sense.
They could use vitess, citus or alloydb. They could use read replicas for read operations and single master in a shard for write. They would get many SQL features (upgrades, referential integrity etc) for free. It would allow them to extend their business logic considerably.
Interesting technical read, but I appreciated the lighthearted jokes/comments the author threw in as well. Felt like they struck the right balance - nice work!
Did they go with ScyllaDB just because it was compatible with Cassandra? Would it make sense to use a totally different solution altogether if they didn't start with that.
Yes, we wanted to migrate all our data stores away from Cassandra due to stability and performance issues. Moving to something that didn't have those issues (or at least had a different set of less severe issues) while also not having to rewrite a bunch of code was a positive.
Did you guys end up redesigning the partitioning scheme to fit within Scylla's recommended partition sizes? I assume the tombstone issue didn't disappear with a move to Scylla but incremental compaction and/or STCS might have helped a bunch?
Nope. Didn't change the schema, mainly added read coalescing and used ICS. I think the big thing is when Scylla is processing a bunch of tombstones it's able to do so in a way that doesn't choke the whole server. Latest Scylla version also can send back partial/empty pages to the client to limit the amount of work per query that is run.
Oh that's pretty neat. Did you just end up being okay with large partitions? I've been really afraid to let partition sizes grow beyond 100k rows even if the rows themselves are tiny but I'm not really sure how much of a real-world performance impact it has. It definitely complicates the data model to break the partitions up though.
I usually start projects with Postgres these days. I have reached the tens of millions of rows threshold without breaking a sweat, but is there any good reason Postgres can't handle billions or trillions? Any well known products at that scale that are known to use Postgres?
The raw amount of data alone is not enough of a metric to judge whether Postgres is "enough".
They seem to value horizontal scalability e.g. in terms of write throughput, which is easier to handle with something like their solution compared to postgres.
Postgres can pretty easily scale to billions or trillions of rows. It forces you to think carefully about how you query that data, though, and I think most beginners would find themselves in deep trouble jumping into the deep end.
The problem with Postgres is that you have to read the doc (boring), sometimes read database books beyond chapter 2 (lmao nerds).
This filters out 99% of software "engineers".
So it's better to use KookaburaDB, version 0.2 just got released. It's written in Rust, and it's modern of course (whatever that fucking means: config written in YAML I guess? Complicated build and deployment?)
Thank you prostgresfan, it's nice to see a completely unbiased source on database technologies.
Jokes aside, you're largely right. Postgres really does cover 99.9% of database usecases. But, I think Discord might still fall outside of that.
The problem is that Discord's scale pretty much requires heavy sharding. While you can make this work with Postgres, you can tell it was never designed out of the gate for this.
IMO, Discord isn't even taking it far enough. Working under the assumption every server is its own isolated pocket, I see no reason not to have 1 database (or database-like thing) per server. Then it's truly a distributed system, which matches Discord's business use cases. I often find matching business use cases to technology like this can greatly simplify architecture and reduce friction.
No cache. Just read coalescing. There is a big difference. Coalescing just ensures that while a query is executing if an identical query arrives, rather than sending the same query as an already executing query to the database it will wait for the existing query to complete and duplicate the result. If after this the same query arrives again, it will be issued against the database.
This means we don't have to deal with cache invalidation/consistency issues while also being able to handle thundering herds, for example a large server pinging @everyone and having a bunch of people click into the channel or launch their apps in response.
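A minimal sketch of read coalescing as described above, using asyncio. This is not Discord's data service: `Coalescer` and `slow_query` are hypothetical names, and the "database" is a coroutine that sleeps.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class Coalescer:
    """Share one in-flight query among all callers asking for the same key.
    Nothing is kept once the query finishes, so this is not a cache."""

    def __init__(self, fetch: Callable[[str], Awaitable[Any]]) -> None:
        self._fetch = fetch
        self._in_flight: Dict[str, "asyncio.Task[Any]"] = {}

    async def get(self, key: str) -> Any:
        existing = self._in_flight.get(key)
        if existing is not None and not existing.done():
            # An identical query is already running: wait for its result.
            return await existing
        task = asyncio.create_task(self._fetch(key))
        self._in_flight[key] = task
        try:
            return await task
        finally:
            # Forget the query once it completes, so a later request
            # goes to the database again.
            if self._in_flight.get(key) is task:
                del self._in_flight[key]


async def demo():
    calls = 0

    async def slow_query(key: str) -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for a database round trip
        return f"rows for {key}"

    coalescer = Coalescer(slow_query)
    # 100 clients open the same channel at once (the @everyone case).
    results = await asyncio.gather(*(coalescer.get("channel:42") for _ in range(100)))
    after_herd = calls
    await coalescer.get("channel:42")  # arrives after completion: new query
    return results, after_herd, calls


results, after_herd, total = asyncio.run(demo())
print(after_herd, total)
```

The herd of 100 identical requests results in one query, and the request that arrives afterwards triggers a second one, which is the difference from a cache: there is nothing stale to invalidate.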
The post appears to consistently use past tense for things that were true in the past at time of writing, and present tense for things that are true in the present or are always true. So the use of tense appears to be valid, though not following commonly prescribed style.
He's walking us through the process of designing the solution. Why wouldn't present tense work for this? We're discovering things with him as he takes us along for the journey.
I think it’s annoying that they interview engineers like they are Google when, reading the blog, they made it up and learned some basic “pitfalls” as they went along.
Having used Discord in the past, most of the conversations were just shit posts. Nothing serious. Why even bother storing a trillion messages of garbage in the first place?
Many people within niches have discord servers for researching and discussing specific things. There is a large wealth of information locked away behind them that can be lost pretty much whenever discord decides to start pursuing different monetization strategies.
How do you sort the good from the bad? I'm sure most of my conversations were shit posts as well but some weren't, especially when it came to figuring out how something new worked or how to fix a problem.
That's why I laugh when people say discord content needs to be indexed on the web so things are more discoverable. 99% is garbage and the useful messages are scattered across channels.
I'm not trying to be a smartarse but doesn't this describe the entire internet? The good stuff is rare and scattered, and that's why search is so important.
At least with forums, there are dedicated pages for whatever is being discussed. Discord is just a collection of channels with topics being split up across multiple messages and shitposts in the middle.
"It was at that moment that it became obvious they deleted millions of messages using our API, leaving only 1 message in the channel. If you have been paying attention you might remember how Cassandra handles deletes using tombstones (mentioned in Eventual Consistency). When a user loaded this channel, even though there was only 1 message, Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it)."
And although the blog post talks about GC tuning, there's mention here [1] that they didn't do much tuning and were actually running on an old version of Cassandra (and presumably JVM) - having just switched over from CMS (!).