Resources and Rightsizing for the Docker CPM given that some jobs time out
Issue
You're planning to install the Docker CPM, but are unsure how to size the host.
Environment

Docker CPM

A VM or EC2 type of host
Resolution
Our requirement is a multicore CPU with 2.5 GiB of memory per CPU core, on a host dedicated to the CPM with no other projects running alongside.
We recommend a CPU that can sustain a consistent workload. A burstable CPU may not work reliably once its burst balance is used up.
Memory must be at least 5 GiB, which corresponds to 2 CPU cores at 2.5 GiB each. An m5.large instance on EC2 is a good starting point: it has 8 GiB of memory and works well with 2 heavy workers (equal to the number of CPU cores).
In practice, plan to support 2+1 heavy workers, because our health check job, which runs every 5 minutes, also counts as a heavy worker.
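The memory guidance above can be turned into a quick check. This is only a minimal sketch, not an official sizing tool: the function names are hypothetical, and it assumes the extra heavy worker used by the health check needs the same 2.5 GiB as any other, which the guidance above does not state explicitly.

```python
GIB_PER_CORE = 2.5  # memory requirement per cpu core, from the text above


def min_memory_gib(cpu_cores: int) -> float:
    """Minimum memory for a host with the given number of cpu cores."""
    return GIB_PER_CORE * cpu_cores


def fits_with_health_check(cpu_cores: int, host_memory_gib: float) -> bool:
    """Check whether memory also covers one extra heavy worker.

    Assumption (not from the source): the health check's heavy worker
    needs the same 2.5 GiB as a regular heavy worker.
    """
    return host_memory_gib >= GIB_PER_CORE * (cpu_cores + 1)


# An m5.large has 2 vCPUs and 8 GiB of memory.
print(min_memory_gib(2))             # 5.0
print(fits_with_health_check(2, 8))  # True
```

Under that assumption, 2+1 heavy workers need 7.5 GiB, which fits within the 8 GiB of an m5.large.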
Also worth considering is how many jobs the minion needs to process per minute, which can vary over time. Ideally you want enough heavy workers such that each heavy worker can spend the maximum time processing a job without causing the jobs to start queuing up.
The default timeout is 65 seconds for ping jobs and 180 seconds for nonping jobs. The formula below gives the number of jobs each heavy worker must process per minute.
After inputting your values for number of nonping monitors, average monitor frequency, and number of hosts, rearrange the formula to solve for the number of heavy workers that would be needed to allow each one 180 seconds per nonping job.
jobs per heavy worker per minute = number of nonping monitors / (average monitor frequency in minutes * number of heavy workers per host * replicas or hosts)
For example, assume 192 nonping monitors with an average frequency of 6 minutes, running on 2 hosts, each with 4 CPU cores (4 heavy workers per host).
192 / ( 6 * 4 * 2 ) = 4 jobs per heavy worker per minute.
This is doable, but not ideal if some jobs last all the way to timeout. At 4 jobs per heavy worker per minute, the queue will grow if jobs start taking longer than 15 seconds (60 / 4) to complete.
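The calculation above can be sketched in a few lines of Python. The function name and parameter names are hypothetical, chosen only for illustration. The sketch takes the formula at face value and assumes one heavy worker per cpu core.

```python
def jobs_per_worker_per_minute(monitors, frequency_min, workers_per_host, hosts):
    """Jobs each heavy worker must process per minute to keep up."""
    return monitors / (frequency_min * workers_per_host * hosts)


# Values from the example: 192 monitors, 6-minute frequency,
# 4 heavy workers per host, 2 hosts.
rate = jobs_per_worker_per_minute(192, 6, 4, 2)

# Time available per job before the queue starts to grow.
seconds_per_job = 60 / rate

print(rate)             # 4.0
print(seconds_per_job)  # 15.0
```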
To make the CPM more reliable and robust, size it to accommodate jobs that take the full 180 seconds to complete. If every job timed out, each heavy worker would be occupied for 180 seconds, or 3 minutes, per job. Flipping that around gives 1/3 jobs per heavy worker per minute. Solving for x, the total number of heavy workers across all hosts:
1/3 = 192 / ( 6 * x )
1/3 * 6x = 192
6x = 576
x = 96 heavy workers
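The same rearrangement can be written as a small function. This is a sketch with hypothetical names. It uses exact fractions so that a result like 96 is not pushed up to 97 by floating-point rounding, and it rounds up because a fractional heavy worker cannot be provisioned.

```python
import math
from fractions import Fraction


def workers_if_all_time_out(monitors: int, frequency_min: int, timeout_s: int = 180) -> int:
    """Total heavy workers needed if every job runs until timeout."""
    # A worker busy for timeout_s seconds per job completes
    # 60 / timeout_s jobs per minute (1/3 for a 180-second timeout).
    target_rate = Fraction(60, timeout_s)
    # Rearranged formula: x = monitors / (frequency * target rate).
    return math.ceil(Fraction(monitors) / (frequency_min * target_rate))


print(workers_if_all_time_out(192, 6))  # 96
```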
At 96 heavy workers, the CPM could accommodate timeouts on all 192 monitors. More realistically, assume that 10% of jobs time out at 180 seconds and the remaining 90% complete within 30 seconds (2 jobs per heavy worker per minute).
# 10% of jobs time out
192 * 10% = 19.2 jobs
96 * 10% = 9.6 heavy workers
# 90% complete within 30 seconds
2 = ( 192 - 19.2 ) / ( 6 * x )
2 * 6x = 172.8
6x = 86.4
x = 14.4 heavy workers
# total heavy workers needed
9.6 + 14.4 = 24
With 2 hosts, each with 4 CPU cores, the 8 available heavy workers will not be sufficient. Results depend greatly on the average job completion time and the average monitor frequency. In this scenario, adding 16 more heavy workers (for a total of 24) will allow the CPM to run smoothly and accommodate the 10% of jobs that time out. That could be a mix of monitors that time out 10% of the time, or 10% of monitors that time out 100% of the time.
If the average monitor frequency is adjusted to 10 minutes, the average job completion time to 15 seconds, and the share of monitors that time out to 5%, only 8 heavy workers would be needed (7.44, rounded up).
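Both scenarios can be reproduced with one function that splits the jobs into a share that times out and a share that completes in a typical time. This is a minimal sketch under the assumptions stated above. The function and parameter names are hypothetical, durations are whole seconds, and the timeout share is passed as a whole-number percentage so the arithmetic stays exact.

```python
import math
from fractions import Fraction


def heavy_workers_needed(monitors, frequency_min, timeout_percent,
                         typical_job_s, timeout_s=180):
    """Total heavy workers when timeout_percent of jobs run until timeout."""

    def workers_for(job_count, job_seconds):
        # Jobs one heavy worker completes per minute at this duration.
        rate = Fraction(60, job_seconds)
        # Rearranged formula: x = jobs / (frequency * rate).
        return job_count / (frequency_min * rate)

    slow_jobs = Fraction(monitors * timeout_percent, 100)
    fast_jobs = monitors - slow_jobs
    total = workers_for(slow_jobs, timeout_s) + workers_for(fast_jobs, typical_job_s)
    # Round up: a fractional heavy worker cannot be provisioned.
    return math.ceil(total)


# 10% time out, the rest finish within 30 seconds, 6-minute frequency.
print(heavy_workers_needed(192, 6, 10, 30))   # 24

# 5% time out, the rest finish within 15 seconds, 10-minute frequency.
print(heavy_workers_needed(192, 10, 5, 15))   # 8
```

Trying a few values for frequency and typical job duration shows how strongly the result depends on both.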
See below for a Google Sheet that calculates the above values.