Thanks to all the replies, it turned out that a easier way of optimizing such code is to make the data structures as cache friendly as possible making the nodes smaller and careful allocation, so one can forget about iterative traversal for another bit...