Booth's algorithm vs Russian peasant algorithm for multiplying two binary numbers
-
If I want to compute A*B, then in the Russian peasant algorithm: declare and define two new helper variables, C and P (for the product), initialized to 0, i.e. all bits are 0. NOW:

1. Copy A to C.
2. Select one random bit of B whose value is 1 and left-shift C by the number of bits to the right of the selected bit, i.e. if B = 0000100, then C should be left-shifted 2 bits, because there are 2 bits to the right of the single 1 bit of B. The new bits of C after the left shift are set to 0.
3. Add C to P.
4. Set the selected bit to 0. If B has become 0, i.e. all its bits are 0, then P is the product of A and B, i.e. P = A*B; otherwise repeat steps 1-4.

Is that correct? If yes, then why are CPUs/ALUs designed to use Booth's algorithm instead of the Russian peasant algorithm? In some cases the Russian peasant algorithm can be much easier to perform than Booth's algorithm, namely when most bits of the multiplier B are 0 and only a few are 1.
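For concreteness, here is a minimal C sketch of the procedure described above. The names A, B, C, P mirror the description; it picks the lowest set bit each time rather than a random one, and `__builtin_ctz` is a GCC/Clang builtin that counts trailing zeros:

```c
#include <stdint.h>

/* Sketch of the procedure described above: for each set bit i of B,
   add (A << i) into the running product P. */
uint64_t peasant_multiply(uint32_t A, uint32_t B)
{
    uint64_t P = 0;                      /* product, all bits 0 */
    while (B != 0) {
        int i = __builtin_ctz(B);        /* position of a 1 bit of B */
        uint64_t C = (uint64_t)A << i;   /* steps 1+2: copy A, shift left */
        P += C;                          /* step 3: add C to P */
        B &= B - 1;                      /* step 4: clear the selected bit */
    }
    return P;                            /* P == A * B once B reaches 0 */
}
```

The loop body runs once per 1 bit of B (its popcount), which is exactly why this looks attractive when the multiplier is sparse.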
-
That is correct (though usually you wouldn't select a random 1 but a *particular* 1, e.g. going from low to high or from high to low; you can use some additional tricks that way). Typical hardware implementations don't use this because it takes a lot of steps. They typically don't really use Booth's algorithm either, but some of its ideas are reused.

Yes, it *may* take very few steps, but it can also take many steps (consider -1 * -1). It's bad enough that it can take many steps, but the variation is really bad as well - it throws a wrench into the usual pipelining and forwarding tricks. I suppose you might pipeline it by unrolling the entire loop, which is not very pleasant in terms of area, and then it has multiple exits, which on top of being generally annoying means you can't do the usual thing of dispatching an instruction the cycle *before* its inputs will be ready and then relying on forwarding, because you don't know when the multiplication ends until it does (OK, you kind of could, but it sucks). It's horrible.

So other algorithms were invented, such as the Dadda multiplier and the Wallace tree. The idea (for both of them) is to first compute all pairwise single-bit products, then add them up using as many simple adders as possible (not full word adders, but the 3:2 compressor kind). That works out really well: the first step (the single-bit products) is completely independent, and then you get a ton of product bits that you can reduce to fewer bits with *independent* full adders. It takes some number of layers of such adders, and the layers are dependent, but within a layer everything is parallel. Then you still need a multi-bit adder at the end, where you'd use any fast adder design you like. You only need one of them, so go crazy - unlike an unrolled Russian peasant multiplication (RPM), where you'd need a metric ton of big adders (you could use ripple carry, but then it sucks). So it's smaller and faster than unrolled RPM, and pipelinable unlike iterative RPM.

That's not the end though. For example, you can use slightly larger modules that add up 5 bits into 3 in a particular way and build the same sort of layered tree out of those, which is a bit faster. And you can shuffle the connections of those modules around a bit so that you connect slow bits (the modules don't calculate their outputs all at the same time) to "fast inputs" (the delay through a module is not the same for every input) and vice versa, to compensate for the timing differences.

Now Booth's encoding, specifically the modified Booth encoding (radix-4), is where Booth's idea survives: it recodes the multiplier into overlapping 3-bit groups, each selecting one of -2A, -A, 0, A or 2A as a partial product, which roughly halves the number of rows the reduction tree has to sum.
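To make the reduction-tree idea concrete, here's a small value-level model in C. It's my own illustrative sketch of the arithmetic, not of the actual wiring: it only counts how many 1 bits sit in each column, whereas real hardware routes fixed wires through fixed adders, but the 3:2 compression works the same way:

```c
#include <stdint.h>
#include <stdio.h>

#define N 8  /* operand width; the product spans 2*N columns */

/* Value-level model of a Wallace/Dadda-style multiplier.
   Step 1: form all N*N single-bit partial products (all independent).
   Step 2: reduce in layers; each full adder (3:2 compressor) turns
           three bits of weight 2^i into one sum bit of weight 2^i
           plus one carry bit of weight 2^(i+1). Within a layer, all
           adders operate in parallel.
   Step 3: one carry-propagate adder resolves the final two rows
           (an ordinary addition stands in for it here). */
uint32_t tree_multiply(uint8_t a, uint8_t b)
{
    int col[2 * N] = {0};  /* how many 1 bits sit in each column */

    /* step 1: pairwise single-bit products (a_i AND b_j) at weight i+j */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            col[i + j] += ((a >> i) & 1) & ((b >> j) & 1);

    /* step 2: layered 3:2 compression until no column is taller than 2 */
    for (;;) {
        int tall = 0;
        for (int i = 0; i < 2 * N; i++)
            if (col[i] > 2) tall = 1;
        if (!tall) break;

        int nxt[2 * N] = {0};
        for (int i = 0; i < 2 * N; i++) {
            int fa = col[i] / 3;          /* full adders used this layer */
            nxt[i] += col[i] % 3 + fa;    /* leftover bits + sum bits */
            if (i + 1 < 2 * N)
                nxt[i + 1] += fa;         /* carry bits, one weight up */
        }
        for (int i = 0; i < 2 * N; i++)
            col[i] = nxt[i];
    }

    /* step 3: final carry-propagate addition of the two remaining rows */
    uint32_t result = 0;
    for (int i = 0; i < 2 * N; i++)
        result += (uint32_t)col[i] << i;
    return result;
}

int main(void)
{
    printf("%u\n", tree_multiply(203, 177));  /* prints 35931 */
    return 0;
}
```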
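And a sketch of the radix-4 Booth recoding just mentioned, again as an illustration rather than a hardware description; in a real multiplier the 16 recoded partial products would feed a reduction tree like the one above instead of being accumulated in a loop:

```c
#include <stdint.h>
#include <stdio.h>

/* Radix-4 (modified) Booth recoding: scan the multiplier b in
   overlapping 3-bit windows (b[2i+1], b[2i], b[2i-1]), with an
   implicit 0 below bit 0. Each window recodes to a digit in
   {-2,-1,0,1,2} selecting the partial product -2a, -a, 0, a or 2a
   at weight 4^i, so a 32-bit multiplier yields only 16 partial
   products instead of 32. Multiplying by a power of two below
   avoids left-shifting negative values (undefined behavior in C). */
int64_t booth_multiply(int32_t a, int32_t b)
{
    uint32_t ub = (uint32_t)b;  /* bit access without sign issues */
    int64_t product = 0;
    int prev = 0;                                  /* implicit b[-1] */
    for (int i = 0; i < 32; i += 2) {
        int b0 = (ub >> i) & 1;
        int b1 = (ub >> (i + 1)) & 1;
        int digit = -2 * b1 + b0 + prev;           /* one of -2..2 */
        product += (int64_t)digit * a * ((int64_t)1 << i);
        prev = b1;                                 /* windows overlap */
    }
    return product;                                /* == (int64_t)a * b */
}

int main(void)
{
    printf("%lld\n", (long long)booth_multiply(-7, 13));   /* -91 */
    printf("%lld\n", (long long)booth_multiply(1234, -5)); /* -6170 */
    return 0;
}
```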
-
Thanks for the quick and detailed answer. Now I am satisfied.