I'm surprised Facebook still uses software to do video encoding.
Most big companies with millions of hours of video uploaded each day have realised it's cheaper to stick a bunch of hardware video encoding chips onto an accelerator board and transcode 100 HD streams simultaneously into all the formats and resolutions you need to host.
The power savings over CPUs pay for the custom hardware in a matter of months.
It does reduce flexibility when new video formats get released though.
>It does reduce flexibility when new video formats get released though.
A new video codec is released and adopted roughly once every ten years, so I don't think that would be a problem. Hardware encoding also trades compression quality for speed, and that gets compensated for with a slightly higher bitrate. It would also reduce their incentive to improve open-source encoders.
Although I think even Netflix switched to BEAMR (can't really blame them, though).
I think the next frontier, for both Audio and Video will be Codec designed with LiveStreaming / Low Latency in mind.
( Or pretty much everything in computing. I wish we could focus on latency everywhere: hardware input, display, network, disk, etc. Apple is certainly moving in that direction without talking about it. )
I was talking with a Netflix product manager a few years ago and, IIUC, he said Netflix re-encodes their entire catalog monthly to take advantage of new encoder optimizations. Disk is cheap. They want to optimize network transfers to both save money and improve user experience.
Netflix has about 36,000 hours of content. On Youtube that much content is uploaded every 1 hour and 20 minutes! Or to put it another way, Netflix re-encodes every month what YouTube encodes in just over an hour.
I don't know the reliability of these sources, but according to one article[1], as of 1 year ago, US Netflix had ~5.8K titles, totaling 36K hours of content. Another more recent source[2] claims that worldwide, Netflix has ~15K titles, so assuming the same distribution of runtimes, it would be closer to 93K hours of content. So, while you're correct, it's still the same order of magnitude.
Back-of-the-envelope calculation: a slacker suffering from insomnia watching 5 streams of Netflix at once × 20 hours/day × 365 days non-stop would otherwise watch the whole catalog in less than a year.
If we unpack your calculation to slightly more typical human behaviour: Watching 1 stream × 10 hours/day × 365 days non-stop, one person would need 10 years to watch the whole catalog.
FWIW, that conversation about Netflix re-encoding their catalog monthly was from about seven years ago! Their process has surely changed since then. :)
Also, IIUC, they re-encoded about once a month, but the re-encoding didn't necessarily take one month of compute time.
Any links for this? My experience shows that Ampere encoding blocks still aren't close to x264 / slow preset when it comes to saving bandwidth and delivering quality.
Even with power savings it was usually more economically efficient to run encodes on a large (12+ core) machine than to deal with the limited number of NVENC slots on GPUs.
NVENC beats x264 medium. Not quite up to the level of the "slow" presets yet but you have to throw a huge amount of hardware at it to match them let alone beat them. Basically the number I came up with a few weeks ago from playing around with x264 settings on ffmpeg was between 6 and 12 cores to keep up with NVENC at 720p and 1080p, depending on framerate and the quality preset.
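If anyone wants to reproduce that kind of comparison, something along these lines works with ffmpeg (filenames and quality values are placeholders, preset names vary a bit between ffmpeg versions, and NVENC's -cq scale isn't directly comparable to x264's CRF, so judge with a metric or your eyes):

```
# Software x264 at the medium preset (CPU-bound)
ffmpeg -i input.mp4 -c:v libx264 -preset medium -crf 23 -c:a copy x264_medium.mp4

# NVENC (needs an NVIDIA GPU and an ffmpeg build with nvenc support)
ffmpeg -i input.mp4 -c:v h264_nvenc -preset slow -rc vbr -cq 23 -c:a copy nvenc.mp4
```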
What sort of bitrate are we talking about? I haven't tried using NVENC for years, and the last time I checked it was clearly missing many details that x264 tries to preserve.
NVENC is good at cleaning up noise and encoding fast (basically, game streaming), which isn't what you want for movie encoding.
> but you have to throw a huge amount of hardware at it to match them let alone beat them.
Sure, and I think hardware encoders are great when you need speed, certainly for real-time video. But in other cases, well, the medium preset sucks. I always encode videos at `veryslow`, and there's just no way to get close to that with a gpu.
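For concreteness, that kind of encode is just the following (input name and CRF value are hypothetical; lower CRF means higher quality and a larger file):

```
ffmpeg -i input.mp4 -c:v libx264 -preset veryslow -crf 18 -c:a copy out.mp4
```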
Video encoding isn't GPU-parallelizable. It's a good fit for either CPU+SIMD or custom ASICs. It's just a kind of compression, which means it's based on unpredictable if-statements, which is just what GPUs don't do.
You can massively parallelize it by encoding different clips on different CPUs, which is more efficient because it has less communication overhead.
No idea, maybe it will change some day. But right now, if you want the best quality at the smallest filesize, nothing seems to come close to the best software encoders. Maybe x264 and x265 are just really good.
They're unusually good because (1) the encoding research wasn't done on a business schedule, (2) it wasn't done purely against objective metrics (the main ones used, SSIM/PSNR, suck, so you have to use human raters), and (3) the test cases were more varied and weirder (some pirated movies, some video game screen recordings, more anime).
There are a lot of other free video encoding tools, like AviSynth plugins, that are just better than all the professional tools. I'm not sure why this is; maybe customers aren't sophisticated enough.
Hardware encoders for newer codecs tend to be simply less bit-efficient. They can be more limited in features. My desktop's GPU can handle quite a few 1080p H.264 streams in parallel. But it can't do average-bitrate-capped CRF, which I think is the preferred type of encoding for streaming services.
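For reference, the closest thing in software is CRF with a maxrate/bufsize cap, which x264 handles fine via ffmpeg; a rough sketch with made-up numbers:

```
# Quality-targeted CRF, but never exceeding ~4 Mbit/s averaged over the 8 Mbit buffer window
ffmpeg -i input.mp4 -c:v libx264 -preset slow -crf 21 \
  -maxrate 4M -bufsize 8M -c:a copy capped_crf.mp4
```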
Like someone already pointed out, this whole announcement could be in response to Google announcing its new Argos chipset approach to transcoding this week.
The problem is that most big tech companies think they're smarter than the thousands of codec engineers who worked on libx264, so they try to reinvent the wheel. They realize too late that all the money they threw into building their own stuff (or variants) does far worse in the general case than baseline libx264. They just don't want to admit it.
I don’t think it’s that they don’t want to admit it, it’s that the cost savings from something disruptive may make it worth it at scale. Or that somebody wants to risk their career on it haha and big tech has tons of cash to throw around.
Absolutely. In fact, I interviewed a guy last week who comes from very big tech, and he admitted that they're basically wasting time and money chasing an unknown "ideal" codec while those projects almost always hit a dead end. He loves the money, I'm sure, but he wants out anyway. :)
I've been to a project where, because the development and "hardening" budgets were separate, we knowingly pushed out buggy code so that we could take advantage of the latter.
Had I known that half of the job would be to game the system, I wouldn't have joined.
I would be surprised if FB doesn't use hardware encoders. That said, hardware encoders are fast, but they don't match the quality of software (CPU) encoding. Slower CPU encoding still provides the best quality.
I can't speak for all of the encoding options, but I can tell you that Dolby Hybrik, one of the main competitors in the encoding space, uses relatively cheap EC2 instances.
Based on the speed I see from Telestream and Elemental, I think they're software-based as well.
YouTube and Facebook have a much different encoding problem than, say, HBO/Netflix/Hulu, etc.
YT & FB have a few popular videos and a ginormous long tail. That long tail needs to be encoded cheaply, as it might not get many views.
The TV/movie VOD-over-IP industry stands to benefit from optimizing the encoding for the smallest filesize at the highest quality, spending considerable CPU cycles to find that best-quality path.
For those like Netflix, why would you invest in ASICs? You likely don't have a huge infra bill (relatively) when it comes to encoding, and compared with CPUs, the ASICs mean you lose the flexibility to really fine-tune the quality and use the latest codecs.
For those like YT/FB, you just gotta encode that long tail. Getting something cheap that's 80% of the quality of CPU-based encoding is good enough for 99% of the content.
I could see lots of reasons to do software encoding at a company like Facebook which are probably not mentioned in the article, including e.g. extracting the optical flow which can be reused as part of various machine learning workflows.
For example, if you want to only annotate 1 in 100 frames and interpolate with optical flow, you can get the flow in the process of doing video compression.
I don't have inside info but there are various other things like this that I can think of for wanting to do it in software.
Also, hardware encoders suck if the company making the hardware decides to suddenly EOL the product. With software encoders you can easily scale the backend whenever you want, at any time in the future, and without succumbing to supply chain issues.
Counterpoint: if you're big enough to have your own bona fide cloud, then you might have more spare CPU capacity that can be flexibly allocated than available machines with dedicated GPUs.
What was mentioned in the article seem like very basic, obvious considerations. Meanwhile, Youtube has custom ASICs which efficiently produce multiple formats simultaneously.
And yet, despite YouTube's abundantly clear motivation for doing so it took them years and years to develop the hardware and it only recently hit production. Could it be that acceleration for video encoding isn't as easy as people are making it sound?
Of course I could;) It would just be hacky, ugly, unreliable, scale badly, use either GPUs or inadequate FPGAs, and fall over once it exceeded 10 users. My resume doesn't need to know that, though:)
If you break videos up into short chunks, then you could simply encode those chunks on demand into the perfect encoding for the requester (rough ffmpeg sketch after the lists below).
The advantages are:
- You don't waste CPU encoding video into formats that won't be used.
- You can use a standard caching solution to reuse those chunks.
- Everyone gets the perfect encoding, always.
- If most people watch the first 2 minutes and then give up on a 20 minute video, you don't waste CPU encoding the other 18 minutes.
- You can introduce new encodings instantly for all videos, without going back and re-encoding historical videos.
- You don't waste storage on video chunks in formats that will never be used.
- It's really simple.
The disadvantages would be:
- You have more unpredictable load if a lot of people start watching (different) videos at once (although the common case of a lot of people watching the same video is still fine) and you could "cap" the load by switching to a fallback format to avoid becoming overloaded.
- There might be an initial delay when playing or sweeping a video whilst the first chunk is encoded. On the other hand, it can't get much worse than it is already, and you could make sure these initial chunks are prioritised, or else serve the initial chunks via a fallback format.
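The per-chunk encode itself could be as dumb as a single ffmpeg invocation per request; a very rough sketch (timestamps, ladder rung and filenames are all hypothetical, and all the hard coordination/caching problems live outside this command):

```
# Encode only the requested ~4 s chunk at the requested rung, on demand
ffmpeg -ss 120 -t 4 -i source.mp4 \
  -c:v libx264 -preset veryfast -crf 23 -vf scale=-2:720 \
  -c:a aac chunk_0030.mp4
```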
Have you ever written a stateless transcoder like this? Of course it can be done, but saying “you could simply encode those chunks” and “It’s really simple” is pretty misleading especially if you are changing frame rates or sample rates or audio codecs during the encoding process.
That said, if there is someone that could do this at scale it would be Facebook.
Also this would mess up ABR streaming at least for the first people to watch the video which would not really guarantee “the perfect encoding always”.
I have written such a transcoder [0] and while it is definitely not "simple," it has definitely never been easier to achieve than today.
If the input video source has been prepared properly (i.e., constant framerate, truly compliant VBR/ABR, fixed-GOP), or if your input is a raw/Y4M, then segmenting each GOP into its own x264 bytestream is rather trivial.
If the input is not prepared for immediate segmentation, it is also somewhat easy now to fix this before segmenting for processing. Using hardware acceleration a transcoder could decode the non-conforming input to Y4M (yuv4mpegpipe) or FFV1, which can then have a proper GOP set.
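As a plain-ffmpeg illustration of that "fix the GOP, then split" flow (segment length, CRF and names are placeholders):

```
# Force a keyframe every 4 s, then cut segments on those boundaries
ffmpeg -i input.mp4 -c:v libx264 -preset slow -crf 20 \
  -force_key_frames "expr:gte(t,n_forced*4)" \
  -f segment -segment_time 4 -reset_timestamps 1 -c:a aac chunk_%04d.mp4
```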
It's not that simple, especially if you deal with videos that have open GOPs and multiple B-frames per GOP; encoders don't do such a great job in those cases. Breaking videos up into short chunks is also easier said than done: you need to understand where it makes sense to split the video, make sure the I-frames are aligned, and generally try to keep a consistent segment size. With very dynamic videos encoded by different types of source encoders, that can result in very inconsistent encoder performance across the chunks. For that reason, it's always best to have some sort of two/three-pass encoding where the analysis step is integral, and the actual split & encode is performed based on it. Which, of course, does not work for low-latency live-streaming scenarios.
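For the I-frame alignment part specifically, when you control all the renditions the usual trick is a fixed GOP with scene-cut keyframes disabled, so every rung cuts in the same places; e.g. a 2 s GOP at 24 fps (values hypothetical):

```
ffmpeg -i input.mp4 -c:v libx264 -preset slow -crf 20 \
  -g 48 -keyint_min 48 -sc_threshold 0 -c:a copy aligned.mp4
```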
That's not always reasonable, though. If I upload a video at 4K, there needs to be some "baseline" encoding so that when the video is published, there's something playable without streaming 4K to, say, cell phones with a resolution smaller than 4K.
Even then, chewing that 4K file down into a 1080p video "on-demand" for a desktop user on high-speed internet is no small task. First and foremost, you need to assume concurrency: if two people request the video at the same time, there's a complex problem of coordinating the encoding in a large distributed system so that the video is encoded once (or a very small number of times). You also need to do the encoding _faster than the video can be played_, or at least faster than the baseline/fallback version can be retrieved and sent. You also need to queue up the next chunk(s) of video, so you're not watching a chunk, buffering, watching a chunk, etc.
In a system at the scale of FB, it's not smooth sailing for compute jobs like this: you're subject to network latency, noisy neighbors, failures (disk/network/software/power/etc.). The case where you're able to stream the original file from storage, start encoding it and streaming the output back to storage and to everyone around the world who is requesting it _at that moment_, and coordinating the encoding of the next chunk is actually not very likely.
Want to talk about weird failure modes?
- I start watching your video and click all over the seek bar. Am I DOSing your compute cluster?
- Two users on opposite sides of the world request the same location of the same file with the same fidelity. Does one of those users get a dirt-slow experience, or do I double my compute costs?
- A thousand users start watching the same video at roughly the same time. A software bug causes the encoder(s) to crash. Do 1000 users suddenly have a broken experience, or does the video pause while your coordination software realizes there was a failure, releases the lock, and restarts that encoding job from the top while your users all get in line for the new job?
I'd argue that this is the _least simple_ approach. In the happy path, you get a nice outcome while reducing compute cost, and users get high-quality video. In the unhappy path, users get slow loading from slow encoding instead of reduced resolution, or you start to need to trade performance for compute (do you encode twice in two datacenters, or move compute further from the viewer?).
As other people have pointed out, video encoding isn't stateless. You can chunk and encode in parallel, but there are tradeoffs.
The biggest problem is that you've taken what is effectively a CDN problem (serve the first chunk of a video) into a CDN and a CPU scheduling problem
Serving video is cheap, apart from the bandwidth. So anything that reduces the number of bytes transferred yields savings. Real time encoders are not as efficient as "slow" encoders.
For low-volume videos (i.e. 99.5% of all video) the biggest cost is storage, so storing things in a high-quality or, worse still, the original codec makes storage expensive. Not only that, you still have to transcode on the way in, or support every codec ever made, in real time.
In short, yes, for some applications this approach might work, but for Facebook or YouTube it won't.
It seems that this process will be incredibly stateful in the encoder component.
Most codecs targeted at low bandwidth for mobile streaming track scene changes, and if you make chunks in a naive way (split at I-frame borders and encode them independently of each other), the reassembled final video will look choppy due to broken scene-change relations.
So after encoding each chunk you will have to carefully save relevant parts of encoder state and reuse it for the next chunk. Seems doable, but tricky to get right?
This is kind of simplistic. For instance, if ‘most people’ only watch the first 2 minutes of a 20 minute video, you still have to encode all of it for that minority that does watch the whole video. Also consider that very large groups of people use very similar hardware and connections.
Anyway, of course videos are already chopped into chunks that are stored separately. It’s much easier to distribute and cache these independent chunks. On demand encoding doesn’t change that.
Except Facebook web always shows me 1080p on a mobile data connection. I have reported it many times, but I guess they don't care about user feedback. If I select 480p on YouTube, it will never show me 1080p unless I switch back to 1080p. On Facebook, every video defaults to 1080p. Even if I don't watch the video, scrolling the news feed also preloads 1080p. This is a huge problem with a limited mobile data plan.
> An encoding family requires a minimum set of resolutions to be made available before we can deliver a video. [...] For example, having one video with all of its VP9 lanes adds more value than 10 videos with incomplete (and therefore, undeliverable) VP9 lanes.
I don't see why this constraint is in place; you can absolutely serve certain resolutions only in certain codecs (YouTube certainly does this).
I assume Facebook has this requirement for usability reasons. They don’t want a user to receive a video through one of its many share features and then not be able to view it.
This is (was?) a big problem with google photos. If you upload a video and immediately share it with someone they'll just see a black screen. A few minutes later they might be able to watch a super-low res encoded version, and only much later does a HD version become available. I started waiting 10 minutes between generating a "share" link before sending it to my family because of all the confusion it was causing.
You need baseline codecs for baseline visibility, sure. But the article is about selectively choosing which videos to encode with more advanced parameters, and there's no reason to hold back delivering your 1080p VP9 just because the 720p VP9 isn't ready yet - just serve something else.
What if you were streaming 1080 VP9 and then your player decides to downgrade to 720 because your connection became slower? At that point the missing 720 would make you unable to watch the video.
In MPEG-DASH codec is specified at the representation level so with a compliant packager and a compliant player the player should be able to switch down to the "fast h.264" rendition that exists before they start trying to create VP9 renditions.
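Roughly what that looks like with ffmpeg's DASH muxer, keeping the H.264 and VP9 renditions in separate adaptation sets so the player can fall back to whichever exists (stream layout and bitrates are made up, and a real packager would add a full ABR ladder):

```
ffmpeg -i input.mp4 \
  -map 0:v -c:v:0 libx264    -b:v:0 3M \
  -map 0:v -c:v:1 libvpx-vp9 -b:v:1 2M \
  -map 0:a -c:a aac \
  -f dash -adaptation_sets "id=0,streams=0 id=1,streams=1 id=2,streams=2" out.mpd
```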
It seems like Facebook does a thing that doesn't scale, in the interest of time: feed a minimally processed video until other streams are available. That makes sense, but I wonder if it also means that video quality on Facebook generally suffers when high-profile events are taking place.
Context on that wonder: I've always noticed that news outlets seem to carry downright horrible quality user-generated video clips of rallies, protests, and the like. Where everybody's carrying around stellar-quality video gear in their pockets these days, I've never figured out why that is.
1) Livestreaming can have very poor quality when bandwidth constrained (e.g. at an overloaded cell site at a rally)
2) Viral videos get reencoded many times in many formats. The cumulative encoding errors are not only limited by the lowest quality reencode, but also by the defects in all previous encodes with various codecs.
I bet you just answered my question - we aren't talking about a 5mbps video uploaded after the fact on home Wi-Fi, but a 384kbps video shoved down a 512kbps, TCP-unfriendly pipe as it's being shot. Thank you for that!
Related question: is there any decent open-source software that can intelligently pick encoding parameters that preserve quality while minimizing size?
I recently tried to implement video uploading for an open source project, but naively choosing ffmpeg parameters can often result in noticeable quality loss / large output file size / long encoding time. And easily all three of those at the same time.
This is a known and hard problem. A lot of companies are trying to internally do some sort of analysis of their own content libraries and intelligently tune their encoders. Netflix is definitely the most known for their tech related to video analysis and introduction of VMAF[1], but other metrics also exist which enable you to compare the original/master and the encoded variant (PSNR, SSIM, etc). Bottom line is that you need a lot of trial and error and fitting different curves on your bitrate/quality graphs and often times what metrics consider to be good quality human visual system doesn't agree 100% with. It's a very interesting problem nonetheless, and I recommend [2] if you want to learn more.
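If you want to play with that, ffmpeg can compute these metrics directly when built with libvmaf (filenames hypothetical; the first input is the encode, the second the reference):

```
# VMAF
ffmpeg -i encoded.mp4 -i original.mp4 -lavfi libvmaf -f null -

# SSIM and PSNR in one pass
ffmpeg -i encoded.mp4 -i original.mp4 -lavfi "ssim;[0:v][1:v]psnr" -f null -
```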
> but naively choosing ffmpeg parameters can often result in noticeable quality loss
Welcome to the world of video compression. If you're not able to take the time to learn the ins and outs of a codec, as well as take each incoming video's specifics into consideration, then you'll need to borrow someone's middle-of-the-road presets. Dedicated settings make decisions based on the frame size, bitrate restrictions, things like HLS vs. download/play, 1-pass vs. multi-pass, etc. All of that determines GOP size, reference frames, and so on.
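For the 1-pass/multi-pass part, a typical two-pass ABR run looks something like this (target bitrate and filenames are placeholders):

```
ffmpeg -y -i input.mp4 -c:v libx264 -preset slow -b:v 2M -pass 1 -an -f null /dev/null
ffmpeg    -i input.mp4 -c:v libx264 -preset slow -b:v 2M -pass 2 -c:a aac out.mp4
```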
The problem really is encoding from compressed video to compressed video again. It really isn't great whatever you do (Facebook/Twitter definitely haven't solved this problem either).
The quality from going from source -> 15mbit/sec h265 -> 2mbit/sec h264 (like on a classic social media or whatever site) is absolutely terrible compared to going from source -> 2mbit/sec h264.
Facebook/YouTube don't care how their compressed video looks. They just need it in a format they can get in front of their millions of users' eyeballs. I'd be willing to bet that >95% of users don't "care" about compression quality; they just want the content, hence the decisions. There's just no way to properly encode that much content with a "we care about compression" mindset.
Not really - "raw" 1080p video is ~3 Gbit/s, so going from 3000 to 15 is already a huge amount of compression.
The problem when you go from 15 to 2 in my example is that the encoder spends all its time trying to encode the artefacts. It really doesn't work well at all.
Hm, that's too bad. Is there a better way to compress already-compressed streams? I find having to do that a lot due to a use case that's not very relevant here.
Handbrake [0] is a pretty user-friendly GUI with some good presets. It's not quite the automatic optimizing tool you're describing but does a good job for most basic tasks.
The parameters for x264 present everything you need along the right dimensions (speed, compatibility, quality, type of content like animation/film/screen recording). The problem is solved; the issue is that people built frontends on top of it that present the options in the wrong way and ruin them.
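For the content-type dimension specifically, x264 exposes it directly as tunes, orthogonal to the speed presets; for example (input and CRF hypothetical):

```
ffmpeg -i input.mp4 -c:v libx264 -preset slow -crf 20 -tune animation -c:a copy out.mp4
```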
Yes, the x264 params are all useful; it's selecting the optimal combination for a given use case that presents the challenge. Vimeo's video quality (and therefore filesize) is much higher than FB's, because their use cases are completely different.
I am not sure if this is still the case, but be aware that a specific CRF on x264 vs x265 might mean a different (perceived) quality. It was certainly the case a few years ago.
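Right, the scales aren't interchangeable. The usual rule of thumb from the ffmpeg encoding guides is that x265 CRF 28 lands visually near x264 CRF 23, so a fairer A/B is something like (values illustrative; verify on your own content):

```
ffmpeg -i input.mp4 -c:v libx264 -preset slow -crf 23 -c:a copy h264.mp4
ffmpeg -i input.mp4 -c:v libx265 -preset slow -crf 28 -c:a copy h265.mp4
```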
I have been interested in this and to my surprise I was able to find just a single project that does something similar: https://github.com/master-of-zen/Av1an
It has the ability to do trial compression (w/ scene splitting) and evaluate quality loss up to a desired factor.
> A relatively small percentage (roughly one-third) of all videos on Facebook generate the majority of overall watch time.
Surprising to me that it’s so large, I’d expect something like 5% or less account for half of watch time. I wonder how it compares across platforms (YouTube for instance would almost certainly be in the <5% bucket, I’d imagine)
Maybe you and the author are considering different periods, e.g.
A. what % of all videos generated the majority of overall watch time over the last week?
B. what % of all videos generated the majority of overall watch time over the last year?
If the watch time is growing by a high percentage month-over-month, perhaps the numbers would be similar. But, if overall watch time is flat, then you'd expect A to be much lower than B.
Facebook does a garbage job of encoding, I wouldn't be using them as a reference. Half the videos people send me on messenger are either terrible quality or just a frozen frame with audio.
Both when played on my phone or via the web page.
Videos sent by the same people on other services like Signal, Slack or Messages are fine.
Videos over messaging are a different problem space from the one outlined in this article. The number of people watching a messaged video is low; latency might be the topline metric the team optimizes for, since it has a big effect on whether people stay in the conversation or leave.
In fact, prioritizing latency might be the reason for the garbage quality you're seeing: reduce the bitrate, and people will receive the video more quickly and more reliably. Whether that's a smart decision is another question.
I feel the same way about their terrible UIs. Messenger is almost unusable on both web and Android, and new bugs are introduced frequently, but somehow their shithouse technology (React) has become the industry standard.
Is there an example of a video that can be encoded at 5MB with H.264 that can be encoded at an equivalent quality with VP9 at 3MB? There are a lot of things to say about this article, but that seems most glaring.
The first draft of H.264 is over 20 years old, and in most of these comparisons they could be using anything from Baseline to High Profile. Modern VP9 can compete with HEVC; a 40% bitrate reduction is very reasonable between generations, as long as it is not some insanely low bitrate that doesn't compress well.
That may be what's intended. But 5 -> 3 is a world apart from 5 -> 5.95. One is consumer-friendly (and groundbreaking information theory), and the other is just about reducing margins while pretending to be consumer-friendly.
Is it too early to think about using the AI inferencing chips provided on mobile phones for decoding video? There are lots of papers that use deep CNNs for compression/decompression, and Facebook / Google would be the best companies to productionize it first.
At least here, the video quality is not good on FB. I have a 120 Mbit/s connection and YouTube plays good-quality 4K just fine. I don't think I've ever seen 1080p served by Facebook; both resolution and bitrate are very low.
Nothing is as bad as Reddit video, which loads random segments instead of consecutive ones, or Twitter video, which gives you 3 seconds of 4K, 20 seconds of 144p, and 5 more seconds of buffering, or Vimeo, which buffers all the time on a 100 Mbps connection.
Moral of the story: video is hard. Apart from YouTube, Twitch, Netflix and Amazon Prime, I don't know any service that plays video flawlessly. OK, maybe some ad networks too.
All possible. But FB video (meaning also Instagram) is a popular target for cheap ISPs that don't rate a peer to constrict since it consumes such a large portion of users' total bandwidth, especially on mobile
I find it odd that this is a kind of mathematical article but doesn’t use the existing terms in codec literature - why does it say “cost/benefit” instead of “rate/distortion”?
A related tangent is how FB handles different photo requests by saving 4 copies at different resolutions at the time of upload. It's even worse, because according to [1]:
" The number of Haystack photos written is 12 times the number of photos uploaded since the application scales each image to 4 sizes and saves each size in 3 different locations. "
Keep in mind that the referenced paper that this page is based on is over a decade old now. Many things have changed since then, as you can imagine. Look for later papers and engineering blog posts for more about what’s changed since then.
Can anyone show me a decent-looking video on Facebook? Whenever I see anything from my friends it's just a horrible pixelated mess, even on the gaming channels I've seen.
Why would I read this? Every video I share on Messenger gets crushed beyond any usefulness, I don't even bother anymore. Whatever FB does is only useful as a case study in failure.
These are the kinds of articles that show why Facebook has different problems to almost every other tech company. The complexity of these kinds of solutions is mind-boggling.
Just imagine if all that ingenuity was focused on solving humanity’s problems, instead of sharing conspiracy theories and advertising.
Being able to efficiently serve video is one of humanity's problems. You may not like the host or video content, but that doesn't mean the problem is not a useful one to solve.
I don't think this is difficult to engineer. It's a good example of ML/AI being used for optimization, which I believe is the most legit use case and what everyone should be aiming for. Still, the problem is how we get to the scale where we desperately need stuff like this...
I wish people spent more time on Facebook. Is that what FB employees think? Or are they like the rest of us and think fuck this guy, but I'll take his money.
Other firms pay just as much, so there's plenty of choice in employment. Rather, working at Facebook just isn't an ethical problem for many of its employees.
My engineering colleagues that voraciously use and consume Messenger and Instagram don't see a problem either.
There are a lot of "bad" companies. Certain advertising, defense, pharma, law, etc. firms are shady or morally bankrupt. It doesn't stop them from finding people to do the work.
I have found Twitter videos have improved in quality recently, but there's a disconcerting effect where, when you upload your own video, you first see a very low-quality version, and then a much better version replaces it. So you are led to believe that your video has been horribly recompressed, especially if you don't wait a bit, when it seems there are actually several quality levels available.
So they are using feedback on how many people are watching a video to decide where to spend more CPU on advanced compression. There is nothing innovative here: we want to make sure we provide resources to something that's viral and not to all videos. What's innovative about that? In the cloud, and everywhere else, companies already use metrics to optimize their resources.
They collect feedback metrics to judge usage and throw resources at optimization, rather than blindly optimizing every single video.
These techniques are very simple, but they make sense when you have such high demand.
Still, I would expect more from Facebook.