Some Notes on Monitoring and Testing Load Balanced Resources

, ,

One of the challenges faced when designing end-to-end checks is encountered when you are facing a load balancing arrangement. It is necessary to repeat part of the check more than once so that you can be reasonably sure that you hit all the resources that are being load balanced. Repeat too many times and you push your resources (ah, nothing like stress testing a live production system). Repeat too few times, and node 12 will keep serving that critical error you thought you'd fixed.


Perhaps the most common case is random or semi-random balancing. If the balancing is fair (that is if the probability of a request being assigned to a given resource is constant across all resources), this is a case of the coupon collectors problem. Essentially the problem is the same as someone collecting a full set of coupons (or baseball cards, world cup stickers or pokemon etc.) - the first few are easy to get but it becomes progressively more difficult to get new ones as your collection grows.

In this case, the problem scales as O(N ⋅ log(N)), where N is the number of resources. To calculate the number of times we need to repeat the checks to cover all the resources(C), we first calculate it's expectation:

⟨C⟩ = N ⋅ HN
= N ⋅ loge N + γN + o(1⁄N2)
≅ N ⋅ loge N + γN as N → ∞

Hn is the Harmonic Number of N, and γ is the Euler-Mascheroni constant. For small N we can find Hn by hand.

Hn = 1/1 + 1/2 + 1/3 + … + 1/n

We can then use the markov inequality to find a bound on the probability of coverage, which should be a satisfactory guide.

P(C ≥ xnHn) ≤ 1⁄x

Round Robin

The other common load balancing algorithm is set up for round robin balancing, when the resources are ordered in a list, and the balancer chooses them in sequence, restarting from the top when necessary.

In this case we will need to know the number of 'organic' requests coming through between one test and the next. We can calculate this if we know the request probability density L(t); which is something you really want to be keeping track of anyway. If we can assume our L(t) is cyclostationary, the expectation of the number of requests between checks that take with period T at time 0 is:

0NT L(t)/N dt
= L(t)T as NT → 0.

We can find the probability that no requests come through by using the markov inequality again, if needed.

If you are running the checks in a quiet period (or running them quickly enough) then you get full coverage at C = N with probability 1-PO, where PO is the probability of an an organic request. Unfortunately simply adding an extra check or two will not help; you'll have add another N-1, to get probability 1-PO2. If you do your checks at busy times L(t)T >> N, when it's almost certain that several requests have come through, you can delay or randomise the timing of the checks so that either

T >> σ(L(t))
σ(T) >> L(t)N

in which case it reduces to random.

Other Algorithms

If the load balancing is set up to favour 'least loaded' resources in some way, then we have a different kettle of fish. In fact, unless we're load balancing requests that always take the same time to complete this is usually a bad thing to have in place anyway, and even then it's highly questionable whether this added complexity helps.

Sometimes Less really is More

, ,

Today I made some computers go faster by taking out some of their memory and halving the number of CPUs in each one.

No seriously I did. There were three VMware machines on a dual core 3G Xeon running ESX that we couldn't coax above 150 MHz if there was activity on the NICs, no matter what we fiddled with. Bizarrely an almost identical (virtual) machine on the same (physical) box ran fine. We couldn't find anything different enough between them to warrant this massive difference.

In desparation VMware support in Cork suggested we try cutting the allocated memory and for good measure reducing them to one core each. Amazingly this punitive measure worked; performance leapt by an order of magnitude. Developers were jubilant, managers sighed with relief. Network engineers were mostly apologised to.

The ESX hypervisor will seemingly starve itself of RAM in some cases unless you leave a portion of RAM free, although exactly how much you need to be on the safe side we have yet to determine. For some reason this caused the 'lucky' (virtual) machine to be scheduled to run with far greater likelihood than the other three when the nic was busy. This could well be (and I'm guessing here) an artefact of the way ESX gathers entropy - in lieu of anything better a lot of RNGs take their cues from packet arrival times. It's probable that the number of CPUs is a bit of a red herring - but apparently there are scheduling weirdnesses that can kick in where some of the VMs begin core-squatting (for want of a better term - though the behaviour of taking a resource in order to do nothing with it fits more landlords than real-life squatters).

Coincidentally I learned a wonderful chinese idiom yesterday pronounced roughly "joe joe gwee yee" meaning literally "nine nine return-to one"; which basically means 'integer overflow - carry the one', and is used colloquially to express the idea that taking too much is as bad as never having enough (which is important to much of Chinese philosophy). In the west I suppose we have the game of 'chicken' as a similar cultural reference.

Like most 21st century techies I'm reasonably confident these days working in the bizarro world of virtual hardware, but this was the most counter-intuitive result so far.

21:49 01 Feb 2008 /articles/systems/ less_is_more

Things for linux users to bear in mind when starting to use a mac (part 1: hardware)

, ,

So I (finally) agreed to see what the fuss was about. I must confess to being a little, um, underwhelmed. I'm not going to write off something on the basis of less than a weeks' experience, but my initial impression hasn't been wonderful. In short - if you don't mind messing about with libraries, installing many things by hand and generally being prepared to do a lot of learning (the kind of learning that doesn't help solve your problems), by all means use a Mac. If you want it to 'just work', stick to Linux.

Some caveats; there are really three exeriences for me to have first impressions of here: hardware, software, and the integration between the two. We will consider OSX in a separate article, which will be imaginatively titled 'part 2'. I am using a 2007 15" macbook pro with OSX 10.4 (Tiger). It should also be borne in mind that I am only interested in it as a software development laptop; the ease with which one can back up their iPod is not relevant.

Having said that, I can report that as expected the integration component of my first impressions was excellent. With almost complete control over the hardware environment, apple can restrict what kernel development needs to happen to the essentials, and get them right, much like Sun did with sparc Solaris. I was expecting to find this wonderful as this has always been the weakest part of the Linux user experience. Other things which linux isn't great at like proprietary codecs and JVM implementations 'just work' as well. It's worth pointing out that in fairness there are economic and legal forces outside the Linux kernel developers' and distribution creators' control here.

The hardware is of excellent build quality, but very quirky. I am very impressed with the power connector - so many otherwise functional laptops are lost to damaged AC connectors which nobody will admit to being able to repair.

The lack of a proper mouse at first seemed to be an infuriating hinderance, but later the OSX environment obviates this by making impossible most of what you'd use one additional buttons for. The keyboard is bizarrely laid out - requiring 'Alt-3' to get a hash symbol (the § and ± are conveniently located where the escape key normally is, for some reason). The enter key is so small that I regularly hit several of my fingers off the side of the keyboard trying to press it. The caps lock key is much larger and more conveniently located (this is a bad thing). Inexplicably there is second enter button concealed between the left-arrow key and one of the two logo keys - which have replaced ctrl in most of the software. The keyboard warms up rapidly, to the extent that it is actually uncomfortable to use. In fact the whole machine warms up the room quite considerably.

Replacing the keyboard and mouse with external usb devices is better, but leaves no free usb ports (though there are two flavours of firewire).

00:47 09 Sep 2007 /articles/systems/ apple_imp1

quick links




(RSS) (atom) whole site
(RSS) (atom) this bit


Email 'steve' at this domain.
My public key (gpg)