How to map a matrix index to a memory location
-
+ y; Thus lookup is the offset to the start of each row in memory. So, my question is how can we do this as fast and simply as possible? Is there a quicker way?
-
+ y; Thus lookup is the offset to the start of each row in memory. So, my question is how can we do this as fast and simply as possible? Is there a quicker way?
The performance of integer multiplications depend on the used CPU architecture and generation. With most x86 and compatible CPUs integer multiplications are 2 - 4 times slower than additions which can be usually ignored. Using a lookup table will be then not significantly faster or even slower. However, with other (mainly RISC) CPU's, the operation can be up to 40 times slower than additions.
Quote:
The quickest way I can think of is to use a look up table
The quickest way is using a shift operation because that is executed on registers while a lookup table requires loading from memory. Just compare the methods (in pseudo code):
; Classic method
load r1, x
mul r1, total_number_of_columns
add r1, y; Shift
load r1, x
shl r1, shift_count
add r1, y; Lookup table
load r1, x
; Assuming item size is a power of 2
; Otherwise a multiplication is used here!
shl r1, sizeof(item)
; EDIT: Must be of course according to the table item size
shl r1, TABLE_ITEM_SIZE_SHIFT
load r1, [lookup_table + r1]
add r1, yNote that some of the above operations might require two assembly instructions (especially on RISC CPU's) when operations can be only performed on registers and not with memory addresses.
-
The performance of integer multiplications depend on the used CPU architecture and generation. With most x86 and compatible CPUs integer multiplications are 2 - 4 times slower than additions which can be usually ignored. Using a lookup table will be then not significantly faster or even slower. However, with other (mainly RISC) CPU's, the operation can be up to 40 times slower than additions.
Quote:
The quickest way I can think of is to use a look up table
The quickest way is using a shift operation because that is executed on registers while a lookup table requires loading from memory. Just compare the methods (in pseudo code):
; Classic method
load r1, x
mul r1, total_number_of_columns
add r1, y; Shift
load r1, x
shl r1, shift_count
add r1, y; Lookup table
load r1, x
; Assuming item size is a power of 2
; Otherwise a multiplication is used here!
shl r1, sizeof(item)
; EDIT: Must be of course according to the table item size
shl r1, TABLE_ITEM_SIZE_SHIFT
load r1, [lookup_table + r1]
add r1, yNote that some of the above operations might require two assembly instructions (especially on RISC CPU's) when operations can be only performed on registers and not with memory addresses.
Thank you.
-
+ y; Thus lookup is the offset to the start of each row in memory. So, my question is how can we do this as fast and simply as possible? Is there a quicker way?
There may be, depending on the processor and number of columns and size of elements. Some processors give fast access to separate bytes of a register (or in the first place form a larger register by adjoining several smaller registers), then perhaps you can form an address almost for free by creating `y` in the low byte and `x` in the high byte of the address (likely with an offset otherwise the matrix has to start at 0). This applies to z80 and 8086 and probably some other old CISC architectures (not modern x86 though - it's still possible but no longer actually fast). On a RISC architecture that sort of trick usually doesn't exist. It also typically wastes a lot of memory (you can use the gaps of course but they're fragmented).
-
+ y; Thus lookup is the offset to the start of each row in memory. So, my question is how can we do this as fast and simply as possible? Is there a quicker way?
Before using such micro-optimisations, I would ensure that we have used all higher-level optimisations, first. For example: 1. How are the data received? Are they stored in memory in order of reception, or does some processing (e.g. address calculation) need to be performed? 2. How are the data processed? Is the access pattern sequential? random? If a single pass through the data is performed, is it possible to store the data (see question 1 above) in the order of processing, and thereby avoid all indexing? You may be able to think of other optimisations, based on your knowledge of the hardware and the problem.
Ad astra - both ways!