tl;dr: If your Postgres autovacuum_vacuum_cost_limit is < 2000, set it to 2000:
The Problem
Vacuuming is an important part of keeping a database healthy, and if it's not happening, things will start to go wrong:
- The database will bloat, consuming more and more disk space and never giving it back
- Performance will suffer, and giving the server more resources won't help
- Performance issues will be really hard to nail down, as none of the usual suspects will really fit
How vacuuming actually works is beyond my ability to explain, but thankfully the vast majority of the problems caused by insufficient vacuuming can be solved by simply making it faster. Here's a story to explain why:
The story of Craig the autovacuum worker š§āš§
Think of your Postgres database as a movie theatre. You've got a bunch of individual cinemas (tables), and lots of patrons coming and going (rows) and leaving popcorn crumbs everywhere (dead tuples). A critical part of an efficient cinema is vacuuming up all the popcorn once someone leaves so that the next person can use that seat. This job is done by the autovacuum workers. Since you have a small cinema you have one worker and his name is Craig.
Craig decides when it is time to vacuum a specific cinema by looking at how many people have come and gone and estimating how many seats must at this point have popcorn on them. If this percentage is greater than autovacuum_vacuum_scale_factor then he will start vacuuming that particular cinema.
As your cinema gets more popular Craig is going to get busier and busier and at some point he isn't going to be able to keep up. Fortunately the management has a good solution to keep the patrons coming in: More seats! A new policy is enacted that if a patron comes in and there aren't any clean seats, we build them a new seat and just keep expanding the cinema. Problem solved! Management doesn't care if we're expanding the cinema because there aren't enough total seats, or because there aren't enough clean seats, they just add more seats.
Also part of this new policy is that once we've added a seat we never get rid of it again. It's too much hassle to get rid of a seat, only to build a new one if you happen to need it again. That makes sense right...?
Note: The above is the reason why Postgres databases never get smaller even if you delete all the data in it. In this case building the seats means requesting disk space from the OS. But it's true that it takes a really long time compared to using disk space you already have, and it actually doesn't ever give it back (except in a VACUUM FULL, or a drop table).
With this new policy it means that nobody ever has to wait for Craig to clean the seat before they can sit down, if there aren't any clean ones they just get a new seat.
A week goes by and things are great. Craig is super busy, it takes him longer and longer to clean each cinema since the cinemas are getting larger and larger, with more and more dirty chairs, but he loves his job so he doesn't complain.
However after another few weeks management start to notice that the cinema really is starting to get awfully big, and they aren't actually serving any more patrons. They realise that they are going to have to solve their cleaning problem. To this they weigh up two options:
- Allow Craig's vacuum cleaner to draw more power: This will mean that Craig can work faster, but we'll have to be careful. If we let him draw too much power he will literally clean so fast there is no power for the cinema to run. He really loves his job. It'll be clean, but we don't really want all the projectors to switch off when Craig decides to clean a cinema at lightspeed
- Hire more workers: This seems like the obvious choice, but all the workers share the same power limit for their vacuuming, so it will just mean that all of them work slower unless we also increase the power limit.
Despite their previous record for poor management decisions with the whole "add more chairs" debacle, management actually make a good decision here. They do both! Firstly they increase the amount of power that vacuum cleaners can draw by 10x (autovacuum_vacuum_cost_limit) since they realised that their first limits were way too low. They also realised however that sometimes Craig has other things to do than vacuuming. Sometimes he has to wait for people to get out of the way and he does have to take bathroom breaks. So they also hired two more staff to ensure that all of this new power can be utilised even if Craig is stuck doing something else.
Summary: Hopefully this explains the purpose of autovacuum_vacuum_cost_limit and why it's important. While the other settings like the number of workers, and the settings that regulate when/how often a vacuum should be triggered can help to optimise the vacuuming process, but the reality is that no amount of tuning these parameters can help if you're in a situation where you are cleaning popcorn slower then it's coming in.
Further reading:
- https://www.datadoghq.com/blog/postgresql-vacuum-monitoring
- https://www.2ndquadrant.com/en/blog/autovacuum-tuning-basics (cached)
Implement It Yourself
If you're the sort of person who blindly copy-pastes things from blog posts, then just copy the setting at the top of this post and add it to your postgresql.conf, and power to you. If not though, I've built a tool that analyses your logs to determine exactly how busy your "Craig" is so that you can see the effect that changing autovacuum_vacuum_cost_limit has had, and if you need to do it at all. You can find that tool here:
https://github.com/dylanratcliffe/pg_autovacuum_cost
Good luck, and may all your cinemas be clean.
* Overmind is now available to try for free. Get started by signing up and creating an account here.