Some Notes on Monitoring and Testing Load Balanced Resources

, ,

One of the challenges faced when designing end-to-end checks is encountered when you are facing a load balancing arrangement. It is necessary to repeat part of the check more than once so that you can be reasonably sure that you hit all the resources that are being load balanced. Repeat too many times and you push your resources (ah, nothing like stress testing a live production system). Repeat too few times, and node 12 will keep serving that critical error you thought you'd fixed.


Perhaps the most common case is random or semi-random balancing. If the balancing is fair (that is if the probability of a request being assigned to a given resource is constant across all resources), this is a case of the coupon collectors problem. Essentially the problem is the same as someone collecting a full set of coupons (or baseball cards, world cup stickers or pokemon etc.) - the first few are easy to get but it becomes progressively more difficult to get new ones as your collection grows.

In this case, the problem scales as O(N ⋅ log(N)), where N is the number of resources. To calculate the number of times we need to repeat the checks to cover all the resources(C), we first calculate it's expectation:

⟨C⟩ = N ⋅ HN
= N ⋅ loge N + γN + o(1⁄N2)
≅ N ⋅ loge N + γN as N → ∞

Hn is the Harmonic Number of N, and γ is the Euler-Mascheroni constant. For small N we can find Hn by hand.

Hn = 1/1 + 1/2 + 1/3 + … + 1/n

We can then use the markov inequality to find a bound on the probability of coverage, which should be a satisfactory guide.

P(C ≥ xnHn) ≤ 1⁄x

Round Robin

The other common load balancing algorithm is set up for round robin balancing, when the resources are ordered in a list, and the balancer chooses them in sequence, restarting from the top when necessary.

In this case we will need to know the number of 'organic' requests coming through between one test and the next. We can calculate this if we know the request probability density L(t); which is something you really want to be keeping track of anyway. If we can assume our L(t) is cyclostationary, the expectation of the number of requests between checks that take with period T at time 0 is:

0NT L(t)/N dt
= L(t)T as NT → 0.

We can find the probability that no requests come through by using the markov inequality again, if needed.

If you are running the checks in a quiet period (or running them quickly enough) then you get full coverage at C = N with probability 1-PO, where PO is the probability of an an organic request. Unfortunately simply adding an extra check or two will not help; you'll have add another N-1, to get probability 1-PO2. If you do your checks at busy times L(t)T >> N, when it's almost certain that several requests have come through, you can delay or randomise the timing of the checks so that either

T >> σ(L(t))
σ(T) >> L(t)N

in which case it reduces to random.

Other Algorithms

If the load balancing is set up to favour 'least loaded' resources in some way, then we have a different kettle of fish. In fact, unless we're load balancing requests that always take the same time to complete this is usually a bad thing to have in place anyway, and even then it's highly questionable whether this added complexity helps.

Sometimes Less really is More

, ,

Today I made some computers go faster by taking out some of their memory and halving the number of CPUs in each one.

No seriously I did. There were three VMware machines on a dual core 3G Xeon running ESX that we couldn't coax above 150 MHz if there was activity on the NICs, no matter what we fiddled with. Bizarrely an almost identical (virtual) machine on the same (physical) box ran fine. We couldn't find anything different enough between them to warrant this massive difference.

In desparation VMware support in Cork suggested we try cutting the allocated memory and for good measure reducing them to one core each. Amazingly this punitive measure worked; performance leapt by an order of magnitude. Developers were jubilant, managers sighed with relief. Network engineers were mostly apologised to.

The ESX hypervisor will seemingly starve itself of RAM in some cases unless you leave a portion of RAM free, although exactly how much you need to be on the safe side we have yet to determine. For some reason this caused the 'lucky' (virtual) machine to be scheduled to run with far greater likelihood than the other three when the nic was busy. This could well be (and I'm guessing here) an artefact of the way ESX gathers entropy - in lieu of anything better a lot of RNGs take their cues from packet arrival times. It's probable that the number of CPUs is a bit of a red herring - but apparently there are scheduling weirdnesses that can kick in where some of the VMs begin core-squatting (for want of a better term - though the behaviour of taking a resource in order to do nothing with it fits more landlords than real-life squatters).

Coincidentally I learned a wonderful chinese idiom yesterday pronounced roughly "joe joe gwee yee" meaning literally "nine nine return-to one"; which basically means 'integer overflow - carry the one', and is used colloquially to express the idea that taking too much is as bad as never having enough (which is important to much of Chinese philosophy). In the west I suppose we have the game of 'chicken' as a similar cultural reference.

Like most 21st century techies I'm reasonably confident these days working in the bizarro world of virtual hardware, but this was the most counter-intuitive result so far.

21:49 01 Feb 2008 /articles/systems/ less_is_more

Ellsberg's Wager

, , , , , ,

There is a bag containing at least one token. Tokens must be either black or white. One token will be drawn at random from the bag and then returned to it, say at noon on Sunday. Now imagine that everyone in some community has one black token and one white token. They can place one (and only one) on a Magic Box before noon, and if it's colour matches the token that is drawn out at noon, then a Desirable Prize is dispensed. There's no penalty for getting it wrong. Which token should they choose? For those who are interested we're playing a game where the payoff matrix is identity.

token choice
the bag's outcomeblack10

Obviously the best strategy depends on what the bag contains. Suppose it's a balanced bag that contains an equal number of white tokens and black tokens. Even small children will tell you that either token is equally good in this case, and when psychologists have investigated people's choices they about 50% of people choose black and 50% of people choose white. What if we have no idea what's in the bag except that nothing but a white or a black token will be pulled out? It's a bag of mystery! Again either token has a 50-50 chance. And given the choice, in this case too people will choose roughly equal proportions of each token.

So lets's be clear - the strategies for a one-off game are identical for balanced and mystery bags. Now suppose we offer people a choice of matching their token against a token drawn from either the balanced bag or the mystery bag. Most people choose to play against the balanced bag. Most of the people choosing the balanced bag will not switch to the mystery bag even if the game is repeated. This is known as Ellsberg's paradox. It's named after Daniel Ellsberg, better known for leaking the Pentagon Papers in an attempt to stop the Vietnam war, by the way. He's now an outspoken opponent of the USA's war in Iraq, and the frequently threatened attack on Iran.

See, in the case of the repeated game, the mystery bag offers the opportunity of learning to exploit any imbalance that may exist in the proportion of token colours, but in the balanced case you're always stuck at 50% and can never improve upon it. Ellsberg's paradox caused much consternation amongst many economists as their theories are frequently based on the assumption (spoken or unspoken, and frequently attended by hordes of similar assumptions) that all agents act rationally (ie. in their best interests). Psychologists aren't really sure why so many people choose the balanced bag - could it be a commitment to fairness, a preference for not making hard choices, fear of the unknown...? Nobody really knows for certain. But maybe we can make an educated guess.

Let's look more closely at how strategies for the one-off game are arrived at. In the case of the balanced bag we have an probability estimate of 50%, and a maximum confidence in that estimate - given that token colour is a random variable we can't do better. The balanced probability is considered to be unconditional. In the case of the bag of mystery we have an initial probability estimate of 50% but no confidence at all in it - it's really a representation of our own ignorance, what Bayesians would call a 'prior'.

A mathematical formalism that addresses this notion of confidence is called Dempster-Shafer theory, a generalisation of Bayesian statistics. Take the set ω of outcomes. In the case above ω = {black, white}. We are interested in the 'power set' or set-of-all-subsets of ω. We write this as 2ω = {ø, {black}, {white}, ω}. Now let's define a function called the mass m, which will be a little bit like probability in that it adds up to 1, but it's defined not on ω but on 2ω. We'll define m(ø) = 0 and then we can ignore the empty set completely.

The mass of a set s is m(s) and represents the 'weight of evidence' that the outcome is in that set, but not evidence for a particular member of that set. At the beginning of our game:

m Balanced Mystery
ω 0 1
{white} ½ 0
{black} ½ 0

As the game progresses, we will receive evidence for the mystery bag's internal state (the particular outcomes of each draw) - and gradually we will see m(ω) going down, and m({black}) and m({white}) rising. While we will never have the rock-solid certainty of the balanced bag system, we could conceivably find that one or other of the outcomes is in the lead and adapt our strategy accordingly.

For decision making purposes, we usually don't want to work with the masses themselves, but with an interval of probabilities. The Support S(p) where p ⊆ ω is defined as the sum of m(x) where x ∈ 2p. The Plausibility of p is the extent to which evidence against p leaves room for the possibility that the event is in p and is given by Pl(p) = 1 - S(¬ p). Also note that S(p) ≤ Pl(p).

To look at our table again:

Balanced Mystery
m S Pl m S Pl
ω 0 0 0 1 1 1
{white} ½ ½ ½ 0 0 1
{black} ½ ½ ½ 0 0 1

This sort of reasoning has lead some people assert the nonexistence (or sometimes non-utility) of probability altogether (whatever that turns out to mean). Judea Pearl, an AI dude from UCLA, has argued that the support is actually a probability of provability, where provability is contrasted with truth. Judea Pearl is also the originator of much of the science of belief propagation (by 'belief' he means 'support') and causality research (I've become very interested in this recently).

As evidenced by the confused thinking on the subject of decision under uncertainty it appears a pervasive human cognitive bias is preference for the experience of irreducible risk over the experience of having to deal with our own ignorance. Knowing this bad mental habit is certainly useful in itself, but I was interested to find out that there is an entire branch of decision theory that is based on it, for those situations where it is actually appropriate. They call it 'info-gap theory' which is just about the worst name I could have imagined. It comes as no surprise that the 1950s are ultimately responsible for this crime against nomenclature.

The appropriate domain is wherever the precautionary principle applies wherever making the wrong decision leads to irreversible harm and there is no further opportunity to make the decision over again- medicine, biodiversity, etc. In info-gap theory we need make decisions based on optimising robustness under failure, not the expectation of 'utility' (or just 'payoff' in pure game theory). Pascal's wager is a fun example. Pascal's wager is the argument that you have nothing to lose from believing in a God because if (s)He doesn't exist you'll still be as just as dead as if you hadn't. There is a similar, but subtler, argument attributed to the Buddha.

There is an obvious problem with Pascal's approach; the assumption that Gods will invariably reward belief and/or punish denial. We don't know this. Now we're almost back to our bag of tokens; we have two levels of uncertainty. Drawing any conclusions, that is, deciding whether to believe in Gods, deciding which set of Gods to pick, deciding whether to decide whether to believe in Gods at all, and adapting your strategy to account for potential past or future lives is left as an exercise for the reader.

19:32 20 Jan 2008 /articles/ ellsbergswager is live


The campaign to save the Mill Road area of Cambridge from the insipid, depressing big-brand mediocrity that supermarkets bring has well and truly kicked off. Tesco plan to open a metro store on the road, inevitably destroying the smaller merchants and sucking all the character out of the place. Metro is a special code-word that magically reclassifies tesco supermarkets as non-supermarkets (convenience stores) thereby allowing them to bypass the usual competition laws. Tesco control 30% of the UK grocery market - but over 5% comes from metros so the magic 25% that would otherwise incur the wrath of the competition comission is avoided.

The main site has links to all the relevant resources such as tescopoly, indymedia and even a facebook group.

There's an excellent blog highlighting the many problems with the decades-long supermarket chain explosion the UK is undergoing. In fact criticism of Tesco in particular is so widespread it even has it's own wikipedia page. Another site worth a read is tesco employees' own Very Little Helps.

There is a lively protest planned for this weekend at the proposed site, from 10am on Saturday.

I fully endorse this product and/or event.

20:54 03 Oct 2007 /articles/action/ tesco1

Things for linux users to bear in mind when starting to use a mac (part 1: hardware)

, ,

So I (finally) agreed to see what the fuss was about. I must confess to being a little, um, underwhelmed. I'm not going to write off something on the basis of less than a weeks' experience, but my initial impression hasn't been wonderful. In short - if you don't mind messing about with libraries, installing many things by hand and generally being prepared to do a lot of learning (the kind of learning that doesn't help solve your problems), by all means use a Mac. If you want it to 'just work', stick to Linux.

Some caveats; there are really three exeriences for me to have first impressions of here: hardware, software, and the integration between the two. We will consider OSX in a separate article, which will be imaginatively titled 'part 2'. I am using a 2007 15" macbook pro with OSX 10.4 (Tiger). It should also be borne in mind that I am only interested in it as a software development laptop; the ease with which one can back up their iPod is not relevant.

Having said that, I can report that as expected the integration component of my first impressions was excellent. With almost complete control over the hardware environment, apple can restrict what kernel development needs to happen to the essentials, and get them right, much like Sun did with sparc Solaris. I was expecting to find this wonderful as this has always been the weakest part of the Linux user experience. Other things which linux isn't great at like proprietary codecs and JVM implementations 'just work' as well. It's worth pointing out that in fairness there are economic and legal forces outside the Linux kernel developers' and distribution creators' control here.

The hardware is of excellent build quality, but very quirky. I am very impressed with the power connector - so many otherwise functional laptops are lost to damaged AC connectors which nobody will admit to being able to repair.

The lack of a proper mouse at first seemed to be an infuriating hinderance, but later the OSX environment obviates this by making impossible most of what you'd use one additional buttons for. The keyboard is bizarrely laid out - requiring 'Alt-3' to get a hash symbol (the § and ± are conveniently located where the escape key normally is, for some reason). The enter key is so small that I regularly hit several of my fingers off the side of the keyboard trying to press it. The caps lock key is much larger and more conveniently located (this is a bad thing). Inexplicably there is second enter button concealed between the left-arrow key and one of the two logo keys - which have replaced ctrl in most of the software. The keyboard warms up rapidly, to the extent that it is actually uncomfortable to use. In fact the whole machine warms up the room quite considerably.

Replacing the keyboard and mouse with external usb devices is better, but leaves no free usb ports (though there are two flavours of firewire).

00:47 09 Sep 2007 /articles/systems/ apple_imp1

quick links




(RSS) (atom) whole site
(RSS) (atom) this bit


Email 'steve' at this domain.
My public key (gpg)