Look, it's likely we just come from different backgrounds. Most of my perf-sensitive work was optimizing inner loops with SIMD, letting the compiler inline hot functions, designing data structures that make better use of the CPU cache, etc. Frame pointer prologue overhead was measurable in most of our use cases. I have less experience profiling systems where calls trace across multiple processes, so maybe I haven't felt this pain enough. I still think the onus should be on teams to be comfortable recompiling not the whole world, but at least some part of it. After all, a lot of tuning can only be done through compile flags, such as disabling code paths or capabilities you don't need.
I wasn't exaggerating about recompiling the world, though. Even if I'm only interested in profiling my own application, a single library compiled without frame pointers renders useless any sample where code from that library was at the top of the stack. I've seen that library be libc, OpenSSL, some random Node module or JNI thing, etc. You can't just throw out those samples, because they might still be your application's problem. In those situations I would have needed to recompile most of the packages we pulled from both the OS distro and our supplemental package repo.
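To make the "useless samples" point concrete, here's a toy model in Python of how frame-pointer-based unwinding fails. It's purely illustrative (hypothetical frame records and function names, not real memory layout): the walker follows saved-frame-pointer links, so a frame whose code was built without frame pointers never saved that link, and the walk stops right there, discarding all caller context.

```python
# Toy model of frame-pointer-based stack unwinding. Each dict stands in for
# a stack frame: "fn" is the function, "saved_fp" is the link to the caller's
# frame that the prologue would have saved. Code compiled without frame
# pointers never stores that link, so the chain is broken at that frame.
# (In this toy, the bottom-of-stack frame also has saved_fp=None.)

def unwind(top_frame):
    """Walk the saved-frame-pointer chain from the top of the stack down."""
    stack = []
    frame = top_frame
    while frame is not None:
        stack.append(frame["fn"])
        frame = frame["saved_fp"]
    return stack

# Application built WITH frame pointers: the full call chain is recoverable.
main = {"fn": "main", "saved_fp": None}
app = {"fn": "handle_request", "saved_fp": main}
print(unwind(app))  # ['handle_request', 'main']

# A crypto library built WITHOUT frame pointers: the sample lands in it,
# the caller link was never saved, and all application context is lost.
ssl = {"fn": "SHA256_Update", "saved_fp": None}
print(unwind(ssl))  # ['SHA256_Update']
```

The second sample is exactly the problematic kind: you can see the library function that was hot, but not which part of your application called it, so you can't attribute the cost without recompiling that library.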
My experience is in performance-tuning the other side you mention: cross-application, cross-library, whole-system, daemons, and so on. Basically, "the whole OS as it's shipped to users".
For my case, I need the whole system set up correctly before it even starts to be useful. For your case, you only need the specific library or application compiled correctly; the rest of the system is negligible and probably not even exercised. Who would optimize SIMD routines next to function calls anyway?