Simple Question regarding Pointer Arithmetic
-
I am writing a template container class similar to a vector or list, and I am optimizing the class for performance, not size. I want to know, relatively speaking, how much faster example 2 is than example 1. Since the work done inside the loops is simple, it stands to reason that the conditional test (in the while clause) has a measurable cost from a performance standpoint, as compared to a loop where complex assignments are occurring.

Example 1:

T* dst1;                   // points to a legitimate memory address.
T* src1;                   // points to a legitimate memory address.
T* dst2;                   // points to a legitimate memory address.
T* src2;                   // points to a legitimate memory address.
T* stop = dst1 - 1000000;  // points to a legitimate address within the same array dst1 points to.

while (dst1-- > stop) {
    *dst1 = *--src1;
}

stop = dst2 - 1000000;
while (dst2-- > stop) {
    *dst2 = *--src2;
}

Example 2:

T* dst1;                   // points to a legitimate memory address.
T* src1;                   // points to a legitimate memory address.
T* dst2;                   // points to a legitimate memory address.
T* src2;                   // points to a legitimate memory address.
T* stop = dst1 - 1000000;  // points to a legitimate address within the same array dst1 points to.

while (dst1-- > stop) {
    *dst1 = *--src1;
    *--dst2 = *--src2;
}

Obviously example 2 is faster, but by how much? In other words, compared to the 1,000,000 occurrences of each of the decrement, dereference, and assignment operations inside the loops, how much do the extra 1,000,000 conditional tests in example 1 affect performance?

Thanks
-
Without knowing the microprocessor architecture and compiler optimization specifics, it's hard to give a "how much" answer. Working off a super-simplified instruction set, the first example can be generalized to the following. We'll assume that the architecture has branch prediction and that compiler optimization has rearranged the memory accesses so they effectively take just one spot in the pipeline.

loop 1, x1 million:
    decrement dst1
    compare/branch dst1 to stop
    decrement src1
    read memory at src1
    write memory at dst1

loop 2, x1 million:
    decrement dst2
    compare/branch dst2 to stop
    decrement src2
    read memory at src2
    write memory at dst2

The second example generalizes to:

loop, x1 million:
    decrement dst1
    compare/branch dst1 to stop
    decrement src1
    read memory at src1
    write memory at dst1
    decrement src2
    decrement dst2
    read memory at src2
    write memory at dst2

With this oversimplification, the second example runs in about 9/10 the time of the first: 9 operations per iteration in the single fused loop versus 10 per iteration across the two separate loops. If memory accesses are the limiting factor, then the difference between the examples will be negligible. Also, depending on the size of the data and the amount of physical memory available to your process, the first example might actually run faster, since it better obeys the principle of spatial locality. If branching is the limiting factor, then the second example would perform even better (as in less time) than 9/10 the time of the first, since it executes 1 million fewer branch instructions.

The best way to check is to implement both and time a few typical data runs through each. :-)

FTS Technology Solutions
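To follow up on timing it yourself: below is a minimal benchmarking sketch of the two variants, assuming a C++14 compiler, int elements, and 1,000,000-element std::vector buffers standing in for your container's storage (the function and variable names are just placeholders, not anything from your class). It times each variant with std::chrono; compile with optimizations enabled (e.g. -O2) and run it a few times, since the relative cost of the extra loop condition depends heavily on your compiler and CPU.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

// Variant 1: two separate loops -- two loop conditions per pair of copies.
template <typename T>
void copy_two_loops(T* dst1, const T* src1, T* dst2, const T* src2, std::size_t n) {
    T* stop = dst1 - n;
    while (dst1-- > stop) { *dst1 = *--src1; }
    stop = dst2 - n;
    while (dst2-- > stop) { *dst2 = *--src2; }
}

// Variant 2: one fused loop -- a single loop condition per pair of copies.
template <typename T>
void copy_one_loop(T* dst1, const T* src1, T* dst2, const T* src2, std::size_t n) {
    T* stop = dst1 - n;
    while (dst1-- > stop) {
        *dst1   = *--src1;
        *--dst2 = *--src2;
    }
}

int main() {
    const std::size_t n = 1000000;
    std::vector<int> a(n), b(n), c(n), d(n);
    std::iota(b.begin(), b.end(), 0);   // give the source buffers some data to copy
    std::iota(d.begin(), d.end(), 0);

    // All pointers start one past the last element, matching the loops above.
    auto time_it = [&](auto fn) {
        auto start = std::chrono::steady_clock::now();
        fn(a.data() + n, b.data() + n, c.data() + n, d.data() + n, n);
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(end - start).count();
    };

    double t1 = time_it(copy_two_loops<int>);
    double t2 = time_it(copy_one_loop<int>);

    // Touch the results so the optimizer cannot discard the copies entirely.
    std::printf("check: %d %d\n", a[0], c[0]);
    std::printf("two loops: %.3f ms\none loop:  %.3f ms\n", t1, t2);
}

One caveat about the pattern itself: like the loops in the question, the post-decrement in the while condition leaves the destination pointer one element before the start of the buffer once the loop exits, so a production version of the container would want a cleaner termination condition.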