Wow, the title of this post is very calm compared to what is actually happening.
CloudSQL Postgres is running with a misconfigured OS OOM killer that crashes the postmaster randomly, even when memory use is below the instance spec. GCP closed this bug report as "Won't fix".
This is a priority 1 issue. Seeing a wontfix for this has completely destroyed my trust in their judgement. The bug report states that they have been in contact with support since February.
Unbelievable attitude towards fixing production-critical problems with their platform that affect all customers.
So many GCP products are surprisingly terrible. Certainly not all of them; some are really good, like GKE, Cloud Storage and Cloud Load Balancer. But Cloud SQL is pretty weak, and products like Cloud Logging, Cloud Metrics and Cloud Tracing are legitimately terrible. Cloud NAT is pretty sketchy too, and can lead to a lot of egress issues if not configured perfectly.
My current workplace uses GCP, my last workplace used AWS, and personally I’ve found AWS to have much higher average quality. At my current workplace we’ve stopped using Cloud SQL, and moved our Postgres usage to Aiven (with VPC peering). Aiven seem to do a much better job operating Postgres than GCP do.
Basically, their Cloud Tracing product is broken for modern Node/Postgres (in terms of showing PG queries and whatnot in traces); users have found the issue (and a seemingly super simple fix), but it’s been over a year and Google still haven’t fixed it. Google’s response is “yeah, we know pretty core functionality of this product is broken, but we’re not fixing it in the near future.” Or maybe ever? Many of their products feel semi-abandoned like this, especially in their observability stack - major bugs and/or performance issues that they never fix, and extremely limited features.
Cloud SQL isn’t terrible, but at least the Postgres version is one of the weaker managed Postgres offerings out there. And their whole observability stack (Logging/Monitoring/Tracing/Error Reporting) is legit terrible compared to competing products. Compared to other products I’ve used in the space, Cloud Logging is unbelievably worse than Sumo Logic, Cloud Metrics soooo much worse than Grafana+Prometheus, Cloud Tracing way worse than offerings from Datadog or New Relic, Cloud Error Reporting is ridiculously far behind Sentry, etc.
The GCP options are often quite cheap, but it shows in their extremely limited features, poor performance and plentiful bugs. Go with GCP for the things they do well, but don’t bother adopting their solution for everything simply to stick with one platform, as so many of their products are just so poor compared to competitors.
> Cloud SQL isn’t terrible, but at least the Postgres version is one of the weaker managed Postgres offerings out there. And their whole observability stack (Logging/Monitoring/Tracing/Error Reporting) is legit terrible compared to competing products. Compared to other products I’ve used in the space, Cloud Logging is unbelievably worse than Sumo Logic, Cloud Metrics soooo much worse than Grafana+Prometheus, Cloud Tracing way worse than offerings from Datadog or New Relic, Cloud Error Reporting is ridiculously far behind Sentry, etc.
Fair enough. We never used CloudWatch at my previous company (we used AWS for most infra, just not CloudWatch), so I don’t have much experience with it. But we do use the Cloud ops suite (a.k.a. Stackdriver) at my current co, and man, I miss the more standalone observability tools we used at my previous co - Sumo, Prometheus/Grafana, Sentry and New Relic. They’re sooooooo far ahead it’s not even funny.
I'm gonna use your dissatisfaction with your log management tool for a little self-promotion. I'm working as part of the Wrble.com team; it's a fast logging platform that is priced way lower than hosting your own open source stack. We are based on Lucene technology. I believe we provide the same service that existing players do at an 80% reduction in spending. Give it a shot and let me know what you think.
Looks interesting, but can you only do very structured queries of JSON logs? We like to be able to do full text search on the whole log, i.e. find every log with a specific UUID in it, regardless of where in the log the UUID is.
Also, looks like no log aggregation? i.e. no SQL-style queries on logs, which you can do in products like Sumo Logic.
GREAT pricing, but my first impression is that it’s lacking some key features we’re looking for. Seems like you guys are going for the low-priced, bare-bones solution, and it’s literally orders of magnitude cheaper than a really feature-rich solution like Sumo, but I think it’s too stripped down for us.
Not really surprising if you consider their likely motivations.
Google isn't in the business of selling things to end users, they're in the business of selling ads. The only thing GCP gives them (outside of getting wall streeters off their backs a few years ago when everyone and their brother was starting a cloud service) is a credit to their own infrastructure cost by selling excess to random joes.
Therefore I'm not surprised that AWS continues to be the de facto choice; they do sell things to end users. I'm not surprised that Azure is growing quickly, either, since MS also sells things to end users and they needed a way to transition their on-premise stuff to the wires.
I mean, GCP is a decent source of revenue for them. E.g., last quarter:
* Alphabet did ~$55 billion in revenue overall last quarter, and ~$4 billion of that was from "Cloud", which is GCP + Workspace (I don't think they disclose how much is GCP alone?). For now it's a money loser for them, though: Cloud had an operating loss of ~$1 billion. But the operating losses are shrinking over time, so it'll become profitable eventually
* In contrast, Amazon did ~$108 billion in revenue overall last quarter, and ~$13.5 billion was from AWS. Unlike GCP, though, AWS is highly profitable: ~$4 billion in operating income for the quarter, which is almost half of Amazon's total operating income
But AWS isn't THAT much higher a percentage of Amazon's revenue than GCP is of Alphabet's revenue. And in terms of COSTS, AWS is actually spending less, relative to their overall revenue (Amazon spending ~$9.5 billion of $108 billion total revenue on AWS, Google spending ~$5 billion of $55 billion total revenue on "Cloud").
AWS has been around longer than GCP, and they've certainly spent more absolute dollars, so it makes sense it's further ahead and more polished. Yeah, AWS is more used to selling things to end users than Google, and they may have a better culture for quality there, but Google invests heavily in GCP, and it's a pretty significant revenue stream for them. I'm guessing their motivations are similar: both see Cloud offerings as a big revenue stream first and foremost.
It's interesting that you are satisfied with GKE. Do you rely on the k8s API being highly available?
We were using the API as our source of truth for Patroni, but we had to configure some really high timeouts in order to compensate for regular multi-minute API downtimes.
We need the services we run inside K8s to be highly available (as well as K8s ingress), but the K8s API we care less about. We haven't noticed any K8s API downtime issues, but I guess we mostly hit the K8s API during deploys, which for us are likely not frequent enough to notice the downtime you're talking about.
This doesn't really surprise me. We use CloudSQL at my work (Sourcegraph) and have run into all sorts of weird issues actually putting it into production, e.g. segmentation faults when turning on Query Insights (which, lol, is supposed to give insight into why your DB might be behaving poorly.)
For the most part it works okay and is fine, but there have definitely been a fair number of quirks...
It exposes a very problematic communication pattern. The engineering team doesn't respond to the support team (accidentally or deliberately). The support team then just decides to close the issue instead of prodding the engineering team for an actual response (even if it's just "Yeah, we're not fixing it").
Now the issue is just in limbo and the only one who feels the pain is the customer.
Another "fun" interaction pattern: User reports a bug (or a feature request), several others subscribe to and/or vote for this to be solved, and then a service rep closes the issue because there wasn't any recent activity.
I've observed this with Atlassian: I wanted to report a Jira bug, but found that it had already been opened some years before, more than a hundred people had subscribed, and the bug was still closed as "no activity, must not be relevant". I just found the exact same bug reported for Jira Cloud (I had observed it in the on-prem version): https://jira.atlassian.com/browse/JSWCLOUD-8865 and it was closed there for the very same reason.
I didn't leave a comment because the original report described the issue perfectly, and adding a "me too" comment is just noise in the bug tracker. Guess I'll be noise in future :-(
I don't think a 'me too' on this bug would be noise right now. Last activity was in 2019; having someone state that the problem is still real in 2021 could affect whether it gets fixed.
Not for Atlassian. I'm subscribed to several serious issues, ranging from usability and compatibility problems to actual data loss, going back to about 2015. Each week I'll get "me too"s and "+1"s in my mailbox, but nothing from Atlassian.
Thankfully, they've deprecated their on-premise products, leading us to finally find a better alternative.
I concur with the other comments: Cloud SQL is a very mediocre service at best. Lots of weird issues, and the engineering team doesn't seem to care. We also hit the segfault due to Query Insights.
The fact that you can't upgrade your database version without creating a new instance and restoring a backup is just bad.
I also suggest Aiven as an alternative; it works very well and the cost is reasonable.
I migrated off CloudSQL even when they tried to pay me to use it (startup credits). It's not worth risking your business with GCP. Sad, but that's what I've learned...
I'd consider Aiven if I were still on GCP and looking for a solid managed Postgres provider. As it is, I'm now on DigitalOcean and fairly happy with their managed Postgres offering, but there are a few rough edges so I'm actually still looking at Aiven even though everything else I have is on DO...
Are there any good/recommended books or resources for someone who wants to learn how to run postgresql well? E.g., what defaults to change and when, settings for the host OS (such as in the parent linked article), overall tips/insights/recommendations.
Are there recommendations for learning about Linux kernel memory management? Two anecdata:
* I had some compute servers that were up for 200 days. The customers noticed that they were half as fast as identical hardware just booted. Dropping the file system cache ("echo 3 | sudo dd of=/proc/sys/vm/drop_caches") brought the speed back up to that of the newly deployed servers. WTF? File system caches are supposed to be zero-cost discards as soon as processes ask for RAM - but something else is going on. I suspect the kernel is behaving badly with overpopulated RAM management data (TLB entries?), but I don't know how to measure that.
* If that is actually the problem, then a solution might be to shrink that management data by reserving explicit hugepages (setting /proc/sys/vm/nr_hugepages to a non-zero value). I'd love to see recommendations on when to use that.
I don’t remember details now, but I’ve seen a situation where a Java app was running slower on a box with more RAM (and probably a bigger heap size), compared to a box with the same CPU but half the RAM. I suspected that the TLB cache was the reason, but didn’t have time to test this.
Explicit hugepages on x86 are difficult to manage. Most people using off-the-shelf software can only take advantage of it by configuring, for example, innodb buffer pools to use them. However if your compute server really is a database, then you'll find the performance benefit is well worth the configuration.
For other processes you'll need a hugepage-aware allocator such as tcmalloc (the new one, not the old one) and transparent hugepages enabled. Again, the benefits of this may be enormous, if page table management is expensive on your services.
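For a sense of the mechanism, here's a minimal sketch (my own illustration, not tcmalloc's actual code) of what "hugepage-aware" boils down to: grab a large anonymous mapping and madvise it so the kernel backs it with transparent hugepages (2 MiB on x86). It assumes THP is set to "madvise" or "always" in /sys/kernel/mm/transparent_hugepage/enabled.

    /* Sketch: ask the kernel to back a large anonymous mapping with
     * transparent hugepages. Assumes THP is enabled in "madvise" or
     * "always" mode. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL * 1024 * 1024;   /* 64 MiB, a multiple of 2 MiB */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hint that this region should use transparent hugepages. */
        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        memset(p, 0, len);   /* touch it so the pages actually get faulted in */
        puts("check AnonHugePages in /proc/self/smaps (press enter to exit)");
        getchar();
        munmap(p, len);
        return 0;
    }

Explicit (reserved) hugepages go through mmap with MAP_HUGETLB or hugetlbfs instead, which is the part that's fiddly to manage.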
You will find a great many blogs on the web recommending disabling transparent hugepages. These people are all misled. Hugepages are a major benefit.
For workloads that use forking and CoW sharing, like Redis or CRuby, it negates the entire benefit of CoW, since touching a single byte copies the entire huge page.
That's what used to happen, but since kernel 5.8, anonymous shared pages that are dirtied by child processes are instead divided into normal pages, in the same way they would be if they were named (file-backed) mappings.
3rd party closed source software; I think it's using the C library malloc - which uses sbrk for small things, but uses mmap for >= 128k. Fun historical fact: the Red Hat/CentOS 5 kernel ulimit didn't limit mmap allocations :-/
This is likely due to a kernel bug that was caused by the way cgroup slab management is handled. Upgrade to 5.10 or later, and it should be fixed. I’d be interested to see if the problem continues.
I recently managed to crash a GCP CloudSQL Postgres 12 host by running a rather heavy interactive query (an OOM, based on the error logs).
It surprised me because up until that point I had never crashed a whole host just by executing a query - now I'm wondering if this misconfiguration is the cause.
I'd like to thank the author for their clear, simple explanation. I haven't had to think about allocating memory since university and am not practiced at thinking about it in my software, but now I feel like I have useful ways to think about why processes just disappear sometimes.
GCP CloudSQL has a lot of issues. There was one with Query Insights being enabled causing segfaults on `LEFT JOIN` operations. It's since been patched, but really shitty.
Some programs allocate a lot of virtual memory and then don't use it.
Also, Linux's forking model can result in a lot of virtual memory being allocated if a heavy-weight program tries to fork+exec a lot of smaller programs, since fork+exec is not atomic and briefly doubles the virtual memory usage of the original program.
I think there are better ways to spawn programs that don't suffer from this problem now...
If you have programs that are written to allocate virtual memory sparingly (like postgres) then that should be fine.
However, there is a second way you can be caught out: even if you disable overcommit, your program can still be OOM killed for violating cgroup limits, since cgroup limits always behave as though over-commit is enabled (ie. they allow you to allocate more than you are allowed, and then you get OOM killed when you try to use the allocated memory). This means you'd have to be really careful running eg. postgres inside a kubernetes pod.
This behaviour really sucks IMO. I would like it if you could set overcommit on a per-program basis, so that eg. postgres can say "I know what I'm doing - when I allocate virtual memory I want you to really allocate it (and tell me now if you can't...)". I think you can somewhat achieve this with memory locking, but that prevents it from being paged out at all...
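To make the distinction concrete, here's a toy sketch (mine, nothing to do with postgres itself): allocate a chunk, then touch it page by page. With vm.overcommit_memory=2 an oversized malloc fails up front and you can handle it; under a cgroup memory limit (which is what a k8s pod limit is) the malloc happily succeeds and the OOM killer takes you out partway through the touch loop.

    /* Toy demo of the two failure modes. Size in MiB is taken from argv.
     * Under strict overcommit (vm.overcommit_memory=2), an oversized malloc
     * returns NULL and we can report it. Under a cgroup memory limit, e.g.
     *   systemd-run --scope -p MemoryMax=256M ./a.out 1024
     * the malloc succeeds and the process is killed in the touch loop. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        size_t mib = argc > 1 ? strtoul(argv[1], NULL, 10) : 1024;
        size_t len = mib * 1024 * 1024;

        char *p = malloc(len);
        if (p == NULL) {   /* reachable with strict overcommit accounting */
            fprintf(stderr, "malloc(%zu MiB) failed, handling it gracefully\n", mib);
            return 1;
        }
        printf("malloc(%zu MiB) succeeded, touching pages...\n", mib);

        for (size_t off = 0; off < len; off += 4096)
            p[off] = 1;    /* under a cgroup limit, the OOM kill happens here */

        puts("survived");
        free(p);
        return 0;
    }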
The fork issue is solved by adding swap. Making sure you have plenty of swap solves these issues, and I'd like to argue that it is more reliable than using overcommit.
It is certainly one way to solve that specific issue, assuming the program was written to take advantage of it. As mentioned, there are several other reasons a program may use a lot of virtual memory though.
Consider this scenario: a process runs a fork(), and shortly after it runs an exec(). Normally, the extra fork only uses a tiny amount of extra memory, because the memory is shared between the parent and the child, until one of them writes to it (copy-on-write).
With overcommit disabled, the kernel must reserve enough space to copy the whole writable RAM of a process when it forks.
So you have a 16GB machine, and an 8.1GB process cannot spawn any other program through the usual fork + exec routine (workarounds exist, like forking before allocating lots of memory and using IPC to instruct the low-memory fork to fork again and launch, but that's way more complicated than a simple fork + exec).
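Here's a sketch of that failure mode (my own toy example; the size just mirrors the 8 GB case above, adjust for your box): with strict overcommit, fork() from a process with a large dirty heap can fail with ENOMEM even though the child would exec() immediately. posix_spawn() is the usual workaround, since on Linux/glibc it's implemented with a vfork-style clone and doesn't need a second copy of the commit.

    /* Sketch: fork+exec under strict overcommit (vm.overcommit_memory=2).
     * fork() must reserve commit for a full copy of the parent's writable
     * pages, so a big parent can fail to spawn even a tiny helper.
     * posix_spawn() avoids the duplicate reservation. */
    #include <errno.h>
    #include <spawn.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int main(void) {
        size_t len = 8UL * 1024 * 1024 * 1024;  /* pretend to be an 8 GiB process */
        char *heap = malloc(len);
        if (heap) memset(heap, 1, len);         /* dirty it so it's really committed */

        pid_t pid = fork();
        if (pid < 0) {
            fprintf(stderr, "fork failed: %s\n", strerror(errno));  /* ENOMEM */

            /* Fallback: posix_spawn doesn't duplicate the parent's commit. */
            char *args[] = {"/bin/true", NULL};
            if (posix_spawn(&pid, "/bin/true", NULL, NULL, args, environ) != 0) {
                perror("posix_spawn");
                return 1;
            }
        } else if (pid == 0) {
            execl("/bin/true", "/bin/true", (char *)NULL);
            _exit(127);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }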
So if you have a dedicated DB host and you know that your DB engine is very carefully engineered to work with disabled overcommit, you can disable it. On a general-purpose machine a disabled overcommit will waste lots of RAM that's sitting unused.
If you mean “why isn’t it disabled by default on Linux installs”: most programs don’t expect malloc(2) to ever return NULL. Those programs will just assume the return value from malloc(2) is valid memory. In the best case, they’ll immediately write to it and protection-fault. In the worst case, they’ll hold onto this NULL pointer for a while, passing it around, until eventually something else somewhere distant in the program blows up some unknown amount of time later.
Even those programs that are “malloc(2) error aware”, often do something stupid and counterproductive in response, like attempting to allocate more memory for an exception object / stack trace / error string.
Programs that do something useful in response to a NULL malloc(2) return result — useful for the stability of the system as a whole, better than what the OOM killer gets you — are rare, even on servers. Usually it’s only stateful, long-running, DBMS-like daemons that 1. bother, and 2. have the engineering effort put into them to do the right thing.
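For contrast, a rough sketch of what the rare "do it right" case looks like (generic, not modeled on any particular daemon): treat the failed allocation as a per-request error, report it using a buffer that was allocated up front, and keep the process alive.

    /* Generic sketch: a long-running daemon that treats malloc failure as a
     * per-request error instead of dying. The error buffer is preallocated,
     * so reporting the failure doesn't itself require a fresh allocation. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char oom_msg[256];   /* preallocated, usable even when malloc fails */

    /* Hypothetical request handler: 0 on success, -1 if this one request had
     * to be dropped because memory was unavailable. */
    static int handle_request(size_t payload_size) {
        char *buf = malloc(payload_size);
        if (buf == NULL) {
            snprintf(oom_msg, sizeof oom_msg,
                     "dropping request: could not allocate %zu bytes\n", payload_size);
            fputs(oom_msg, stderr);
            return -1;          /* caller rolls back just this one request */
        }
        memset(buf, 0, payload_size);   /* ...do the actual work... */
        free(buf);
        return 0;
    }

    int main(void) {
        for (int i = 1; i <= 4; i++)    /* keep serving even if some requests fail */
            handle_request((size_t)i * 512 * 1024 * 1024);
        return 0;
    }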
Most of these were likely managed-language programs.
Programs written for managed language runtimes will have a language-runtime-level abort on malloc(2) failure, which usually is well written, in the sense that it will clean up language-runtime-level resources, and emit a language-runtime-level error message.
But this language-runtime-level abort usually isn’t exposed to the application in any hookable way, so from the developer’s perspective, it’s basically the same as being OOM killed. There’s no option to clean up e.g. an individual transaction’s resources in order to keep going. There’s no hooks for libraries to use to e.g. properly send close messages on sockets (if the language runtime doesn’t do that itself as part of managing socket lifetimes.) Etc.
These managed runtimes (e.g. the JVM) may expose a catchable exception for OOM errors, but these are for internal, language-runtime level OOM errors, triggered by the runtime itself under certain conditions, rather than in response to a syscall failure. When malloc(2) fails, it’s basically “too late” from these runtimes’ perspectives — they no longer have the resources required to allow the user to run any more code.
« Most of these were likely managed-language programs. »
Please don't guess. They weren't.
It is true that a program that aborts as soon as malloc returns failure isn't doing any special cleanup or attempting to keep going.
But that's not at all the same as « Those programs will just assume the return value from malloc(2) is valid memory. In the best case, they’ll immediately write to it and protection-fault. », which is what I'm informing you is too pessimistic.
I'm not guessing. I'm just answering you by ignoring/disregarding your personal experience, and instead treating you as a random variable sampling the population of people who use all possible software, and then talking about what that random variable would look like.
Why? Because we don't need anecdotes to know this particular thing — we have the data. We know what the random variable actually looks like. (How? Because people have downloaded "all of Github" or "the entire Debian package archive", and run Valgrind over it, and uploaded the resulting dataset to BigQuery!) By the Law of Large Numbers, we can actually do stats about, effectively, what "all software" looks like.
By volume, the majority of POSIX software that calls malloc(2), is incompetently-written, with no checks on the return result of malloc(2). This is an objective, verifiable fact.
By volume, the majority of POSIX software that has a call to malloc(2) that does check the return value, exists as the result of a managed-language compiler emitting a language-runtime-level check into the compiled binary, rather than as an explicit source-level check. Another objective, verifiable fact.
-----
It so happens that the software making up the "backbone" of an OS / average LAMP server is more competently-written, because it's had a lot more attention and engineering time put into it.
But the same "power law of features" from e.g. Microsoft Office applies here — there's a core set of stuff everyone uses, but every user also has some weird stuff they are in the small minority of users for. And that stuff is what breaks.
As it happens, that lesser-used stuff is also usually mission-critical to the operation of a business; otherwise people wouldn't be driven to use such not-a-lot-of-engineering-effort-put-in software in the first place. People are using this stuff "in anger", if they're using it at all.
Which means that, sadly — insofar as most developers creating business-process IPC pipelines, don't already have the hard-won experience to build in fault-tolerance for individual processes within that pipeline — we see production systems where these malloc(2) failures are Single Points of Failure for the entire system. The flakiness of these long-tail programs, drags down the reliability of most systems-as-a-whole.
Because many programs "unnecessarily" allocate a lot of virtual memory and then never actually page it in, disabling overcommit will start killing processes due to lack of memory even though most of that memory isn't actually being used.
It's extremely useful to be able to map more virtual memory space than exists physical memory in your computer. This is what makes e.g. mmap'ed access to large files possible.
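A minimal sketch of that ("/data/huge.log" is just a placeholder path): map a file that can be far bigger than RAM and scan it. Only address space is reserved up front; the kernel faults pages in on access and can drop clean ones again under memory pressure.

    /* Sketch: mmap a file that may be much larger than physical RAM.
     * "/data/huge.log" is a placeholder; point it at any big file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/data/huge.log", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Works even if st_size is, say, 200 GiB on a 16 GiB machine. */
        const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)   /* pages come and go on demand */
            if (data[i] == '\n') lines++;

        printf("%ld lines\n", lines);
        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }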