I'm fairly amazed by some of the choices - in particular the single huge user table that is hit on every profile page view to map url to shard. I'm guessing it has hot backups and similar - but wow - talk about a single point of failure. I guess having it like that actually makes it easier. If it fails, it should be trivial to get it back up quickly on a backup machine.
The whole setup seems a lot simpler than nearly every 'web scale' solution I've heard discussed.
EDIT: I'm not saying that is a bad thing - it's a good thing. Most of the articles around scaling on HN often deal with very complex Cassandra style technical solutions. This article is the opposite with a very 'just do it the simplest way' feel to it. One big mysql table for all users (probably with some memcache in front) is the sort of thing you can put together in an hour or two - yet it is the very core of their system.
I (and I bet a lot of other engineers reading HN) try to focus too much on perfect implementations - if someone asked me to design the shard look-up for something as heavily hit as this, I'd probably reach for a distributed in-memory DB with broadcasted updates - and I'd probably be wrong to do so. The mysql/memcache approach is just going to be simpler to keep running.
You don't really want to get into the alternatives. It's pretty common to shard everything but the core user data (email, PBKDF2 hash, username) and scale that with slaves and vertical upgrades. SSDs can take you a long way if you're fastidious about offloading everything else from the machine responsible for keeping user accounts consistent.
I agree. It is usually better to scale the user data vertically, even if it seems painful. The alternatives, like partitioning, aren't really that easy either; there is pain involved as well.
I think, of the existing solutions, one is not better than the other. The choice in the end is which one fits best with your team and infrastructure, and especially which disadvantages you are willing to put up with.
That’s not what they’re doing. They are sharding horizontally on the user. IDs are 64 bits, with the db location (logical shard id) in the first 16 bits. This lets anyone with an ID go to the right db for querying, provided they have the logical-to-physical shard mappings.
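For illustration, here is a minimal sketch of that ID layout, assuming only what the comment above states (a 16-bit logical shard id in the top bits of a 64-bit integer); how the remaining 48 bits are split is a guess, not something from the article.

    # Minimal sketch of a 64-bit ID with the logical shard id in the top 16 bits.
    # The split of the remaining 48 bits is an assumption for illustration.
    SHARD_BITS = 16
    LOCAL_BITS = 64 - SHARD_BITS

    def make_id(shard_id, local_id):
        """Pack a logical shard id and a shard-local id into one 64-bit id."""
        assert 0 <= shard_id < (1 << SHARD_BITS)
        assert 0 <= local_id < (1 << LOCAL_BITS)
        return (shard_id << LOCAL_BITS) | local_id

    def shard_of(global_id):
        """Recover the logical shard id from a 64-bit global id."""
        return global_id >> LOCAL_BITS

    def local_of(global_id):
        """Recover the shard-local id from a 64-bit global id."""
        return global_id & ((1 << LOCAL_BITS) - 1)

    # Example: object 12345 living on logical shard 42
    obj_id = make_id(42, 12345)
    assert shard_of(obj_id) == 42 and local_of(obj_id) == 12345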
User table is unsharded. They just use a big database and have a uniqueness constraint on user name and email. Inserting a User will fail if it isn’t unique. Then they do a lot of writes in their sharded database.
Get the user name from the URL. Go to the single huge database to find the user ID.
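As a sketch of that lookup path (with the memcache layer other commenters mention in front of the big unsharded MySQL user table): something like the following could map a profile username to its ID. The table, column, cache-key names and TTL are all invented for illustration, not taken from the article.

    # Hypothetical sketch: memcache in front of the single big user table.
    import pylibmc      # assumed memcached client
    import MySQLdb      # assumed MySQL client

    mc = pylibmc.Client(["127.0.0.1"])
    db = MySQLdb.connect(host="users-master", user="app", passwd="secret", db="users")

    def user_id_for(username):
        """Resolve a profile username to its 64-bit user ID."""
        key = "user_id:" + username
        cached = mc.get(key)
        if cached is not None:
            return int(cached)

        cur = db.cursor()
        # The user table is assumed to have UNIQUE constraints on username and email.
        cur.execute("SELECT id FROM users WHERE username = %s", (username,))
        row = cur.fetchone()
        if row is None:
            return None
        user_id = int(row[0])
        mc.set(key, str(user_id), time=3600)  # arbitrary one-hour TTL
        return user_id

    # The shard can then be decoded from the high bits of the returned ID,
    # as in the ID-layout sketch above.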
This might be a silly thing to ask, but why don't they save their data in flat files on a shared filesystem?
They wouldn't need Memcached, since the OS caches files anyway; replication with rsync becomes easy; they don't need transactions anyway; and they wouldn't have so much software to manage.
It's a real pain when you want to inspect, delete, or copy the files.
Try taking 300,000 files and copying them somewhere, then copy one file the size of the 300,000 combined. The single file is MUCH faster (it's also why we usually tar things before copying them, even if they're already compressed). Any database that's not a toy will usually lay the 300,000 records out in a single file (depending on settings, sizes and filesystem limits).
The 300,000 files end up scattered all over the drive, and disk seeks kill you at run-time. This may not be true for an SSD, but I don't have any evidence to suggest it either way.
Even if the physical storage is fine with this, I suspect you may run into filesystem issues when you lay out millions, if not hundreds of millions, of files across a directory tree and then hit it hard.
I have played with 1,000,000 files before when experimenting with crawling/indexing, and it becomes a real management pain. It may seem cleaner to lay each record out as a single file, but in the long run, if you reach a large size, the benefits aren't worth it.
It doesn't have good querying utilities. You'd have to build your own indexer and query engine. Since you can't put billions of files in a single directory, you'd have to split them into a directory tree (a sketch of that layout follows below). That alone requires some basic indexing functionality and rebalancing tools (in case a single directory grows too large or too small). And this is without any more sophisticated querying capabilities like “where X=val” where X isn't the object ID/filename.
Write performance is going to be very, very poor.
Existing tools for managing, backing up, restoring, performance optimisation and monitoring aren't suited to handling a huge number of small files, or a subset of them (given certain criteria related to the data itself).
You could build specialized tools to resolve all of these issues, but in the end, you'd end up with some kind of database after hundreds of man-years anyway.
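To make the directory-tree point above concrete, here is a minimal sketch of the usual workaround (hash the object ID into a couple of nested directories). The two-level 256x256 fan-out and the mount point are arbitrary choices for illustration.

    # Hypothetical sketch of a hashed directory tree for flat-file storage.
    import hashlib
    import os

    ROOT = "/data/objects"  # assumed shared-filesystem mount point

    def path_for(object_id):
        """Map an object ID to a nested path so no single directory gets huge."""
        digest = hashlib.md5(object_id.encode("utf-8")).hexdigest()
        return os.path.join(ROOT, digest[:2], digest[2:4], object_id)

    def write_object(object_id, payload):
        path = path_for(object_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(payload)

    # Even with this layout you still have no indexes, no "where X=val"
    # queries, and rebalancing means rewriting paths by hand.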
Since they're MySQL pros, they could put that on a MySQL Cluster and get high availability (up to 30 or so machines in the cluster, a larger replica count, drive mirroring on top, etc.). Also, it's a read-heavy workload, so hide it behind memcached.
I really like how when you look at it, most of their scaling was removing things from the environment and really honing the platform down to what it needed to be.
> Architecture is doing the right thing when growth can be handled by adding more of the same stuff. You want to be able to scale by throwing money at a problem, which means throwing more boxes at a problem as you need them. If your architecture can do that, then you’re golden.
My understanding is that the term of art is horizontal scalability.
How does SOA improve scaling? I've never had to build anything as big as Pinterest, and it seems interesting to me.
I don't quite understand how it reduces connections. If you have your front-end tier connecting to your service tier, you'll still have the same amount of requests going to the database, just with a little middle-man... If I understand correctly?
I'm no expert, but hopefully by abstracting different services out of your core app and making them available through some connection protocol (Thrift, a message queue, HTTP, whatever), you can more easily scale up and scale out systems that become hotspots, and consumers of that service will be none the wiser, since they only interact with it through the published API.
Looking at it another way, when I talk to the GitHub API, I have no idea what sort of database I'm ultimately talking to. Their interface abstracts away all of that complexity. Each request they receive from me might be handled by a different machine for all I know. It makes no difference.
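As a toy illustration of that encapsulation (the service name, endpoint and JSON shape are made up, not from the article): the front end only sees a small client API, and whatever sits behind it, one database or fifty, can change without callers noticing. Connection pooling to the backing store also moves into the service, so the database sees a handful of service instances instead of every web process.

    # Hypothetical sketch of a thin service client: the caller talks to a
    # published API and never learns what datastore sits behind it.
    import json
    import urllib.request

    class FollowerServiceClient:
        def __init__(self, base_url="http://follower-service.internal:8080"):
            self.base_url = base_url  # assumed internal endpoint

        def follower_count(self, user_id):
            """Ask the follower service for a count; the caller has no idea
            (and doesn't care) whether it comes from MySQL, Redis, or a cache."""
            url = "%s/users/%d/follower_count" % (self.base_url, user_id)
            with urllib.request.urlopen(url) as resp:
                return int(json.load(resp)["count"])

    # Usage from a web handler: one small API call instead of a direct DB connection.
    # count = FollowerServiceClient().follower_count(12345)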
By itself SOA doesn't do anything, but it does free you up to do other things that help you scale, like read-write separation and event-driven or fan-out caching. One of the most important things SOA does is quickly stop you from doing dirty hacks that work right now but are killers for future work. It's encapsulation at the application level.
Service orientation allows specialization of architectural components. If you want to build at scale, it is easier to replicate specialized components than a generalized one. It also benefits from better fault tolerance.
Additionally, the service layer forms an interface between the client and the service provider, which makes it easier to scale.
2 questions:
1) What do they do with the Pinterest-generated ID? Do they store it with each row in addition to the local ID?
2) Why randomly select a shard? Shouldn't you do this based on DB size across all boxes, or at least based on physical space left across all boxes?
What is most interesting to me is that in many ways they have reinvented Couchbase. I think that the only reason they didn't go with this technology was the financial cost for their scale was too high.
I wonder if their dislike for Cassandra is based on previous versions pre-2.0. From my perspective of looking at it as it stands now, it's pretty compelling.
Not if you want strong consistency. Cassandra's performance suffers in comparison with the likes of MongoDB or Couchbase when reading with strong consistency, since the clients have no idea of the server topology.
Umm, what? Cassandra is just as fast or faster (depending on both configuration and load) compared to MongoDB with consistent reads/writes. Definitely with writes, though reads get tricky.
Firstly, these sorts of applications are always going to be read heavy, so the reads matter more. Secondly, Cassandra cannot and will not be as good as something like Couchbase, since the client libraries are not aware of the server topology and so cannot make a direct connection to the server hosting the data. This means that, depending on your consistency requirements, Cassandra will range from merely okay to occasionally terrible, depending on whether you care about the 99th percentile. This behaviour was one of the reasons my company moved away from Cassandra.
No, the benchmark is not pure marketing. Why would you claim that it is? Apart from Astyanax, which clients are token aware?
That paper is very useful so thanks for posting the link but it has a number of issues as I see it.
1) It considers Cassandra, Redis, VoltDB, Voldemort, HBase and MySQL. It does not cover either MongoDB or Couchbase.
2) Latency values are given as average and do not show p95/p99. In my experience, Cassandra in particular is susceptible to high latency at these values.
3) Even considering average values, the read latency of Cassandra is higher than you would see with either MongoDB or Couchbase.
4) Cassandra does not deal well with ephemeral data. There are issues while GC'ing a large number of tombstones, for example, that will hurt a long-running system.
The long and short of it is that Cassandra is a fantastic system for write-heavy situations. What it is not good at are read-heavy situations where deterministic low latency is required, which is pretty much what the Pinterest guys were dealing with.
It is marketing, because Couchbase is a featured customer of Altoros, the company that did the benchmark. And the rule of thumb is: never trust a benchmark done by someone related to one of the benchmarked systems. Obviously they wouldn't publish it if Couchbase lost the benchmark; they'd have to be insane to.
Another reason it is marketing is that it lacks essential information on the setup of each benchmarked system. E.g. for Cassandra, I don't even know which version they used, what the replication factor was, what consistency level they read at, whether they enabled the row cache (which decreases latency a lot), etc. Cassandra has improved read throughput and latency by a huge factor since version 0.6 and is constantly improving, so the version really matters.
First, let me concede that Cassandra has had a storied history of terrible read performance. However, if the last time anyone looked at Cassandra for read performance was 0.8 or used size-tiered compaction, I'd encourage them to take another look.
The p95 latency issues were largely caused by GC pressure from having a large amount of relatively static data on-heap. In 1.2, the two largest of these (bloom filters and compression metadata) were moved off-heap. It's my experience that with 1.2, most of the p95 latency is now caused by network and/or disk latency, as it should be.
I'm not going to compare it with other data stores in this comment, but I'd encourage people to consider that Cassandra is designed for durable persistence and larger-than-RAM datasets.
As for #4, this is mostly false. Tombstones (markers for deleted rows/columns) CAN cause issues with read performance, but "issues while GC'ing a large number of tombstones" is a bit of a hand-wavy statement. The situation in which poor performance results from tombstone pile-up is when you have rows where columns are constantly inserted and then removed before GC grace (10 days by default). Tombstones sit around until GC grace, so effectively consider any data you insert to live for at least 10 days, unless of course you do something about it.
Usually people just tune down GC grace, as the default is extremely conservative. It's also much better to use row-level deletes if possible. If the data is time-ordered and needs to be trimmed, a row-level delete with the timestamp of the trim point can improve performance dramatically, because a row-level tombstone will cause reads to skip any SSTables with max_timestamp < the tombstone. It also means compaction will quickly obsolete anything superseded by the row-level tombstone.
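For concreteness, here's a rough sketch of that trim pattern. The keyspace, table and column names, the retention window, and the use of the DataStax Python driver are all assumptions made for illustration, not details from the comment.

    # Hypothetical sketch: trimming a time-ordered partition with a single
    # row-level delete at the trim point.
    import time
    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect("app")  # assumed contact point and keyspace

    user_id = 12345
    trim_point_us = int((time.time() - 7 * 86400) * 1000000)  # e.g. keep the last 7 days

    # One partition-level tombstone at trim_point_us covers everything written
    # before it; reads can then skip SSTables whose max timestamp is older than
    # the tombstone, and compaction quickly drops the superseded data.
    session.execute(
        "DELETE FROM timeline USING TIMESTAMP %(ts)s WHERE user_id = %(uid)s",
        {"ts": trim_point_us, "uid": user_id},
    )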
Here's a graph of p99 latency as observed from the application for wide-row reads (involving ~60 columns on average, CL.ONE) from a real 12-node hi1.4xlarge Cassandra 1.2.3 cluster running across 3 EC2 availability zones. The p99 RTTs between these hosts are ~2ms.
This also happens to be on data that is "ephemeral" as our goal is to keep it bounded at ~100 columns. The read:write ratio is about even. It has a mix of row and column-level deletes, LeveledCompactionStrategy, and the standard 10 day GC grace.
DataStax Enterprise 2.0 shipped with Cassandra 1.0
DataStax Enterprise 3.0 shipped with Cassandra 1.1
DataStax Enterprise is a different product than Cassandra. Cassandra is one of its components, but there are more things bundled, e.g. Solr and Hadoop.
From what I've read, it's Python and "heavily modified" Django. I recall one of their engineers saying that, if they were to do it again, they would've gone with a more lightweight Python framework because of how many heavy modifications they made.
I have read that also (multiple times, actually), and the statement never made sense to me. Of course in hindsight you would start with something like Flask, because you ran into all of the scaling problems with Django, but there would be no way to fix these problems before they occurred; and if you did try to fix them beforehand, it would be premature optimization. It seems like an example of what Taleb calls the "Narrative Fallacy".
What happens if one shard contains a whole slew of large companies (Macy's, Gap, etc.), or over time the users on one shard grow to the point that they cannot be contained within one physical machine? Are they just assuming that this will be very unlikely?
No, not at all. We already have this problem. Right now we are able to take the DB hit, but we will eventually attack this problem from the cache perspective: we will have replicated caches for this scenario. In the worst case, if a shard is overloaded, we stop creating new users on it and put it on a dedicated physical machine. I don't think we will let the situation get there.
A simply awesome article that I will refer back to again and again. Can someone refine it with their own comments as well? It would be really helpful to point out any mistakes Pinterest made at any point; what I want to know is what they could have done better. Also, business is a dynamic environment, so some things may have changed: can someone also point out where something was right when Pinterest did it but there are now better alternatives? Thanks a lot in advance. I will be reading many of the comments carefully.