Heap stack allocation

trønderen

(I brought this up as a sidetrack in a Lounge thread. Even though I ask out of curiosity only, not any 'need' or immediate problem, the question is on the technical side, so I move it over here:) There is no real reason why stack frames are allocated edge to edge! The stack frame head 'previous frame' pointer could go anywhere. Stack relative addressing never goes outside the current frame. Traditionally, with a million objects, each running its own thread, each of them has allotted a stack for its maximum call depth. This can bind up quite a few megabytes that are hardly if ever used, most certainly not all at the same time. Never ever are every single thread preempted at its very deepest nesting of calls at exactly the same time. In a non-preemptive, co-routine based system, they won't all yield (or whatever it is called in your favorite co-routine package) at the maximum stack depth, all at the same time. Stack frames could be allocated from the heap, with space is occupied only as long as a function call is active. Then, a given amount of RAM would be capable of handling a much larger number of threads. Especially in event driven architectures, a large fraction of threads idle at a low stack level. They receive an event and go into a nest of calls for handling it, and then return to the low stack level waiting for the next event. The thread might do some sort of yield halfway, but the great majority of threads would be back to 'ground base', or at a moderate call nesting level, most of the time. The argument against heap allocation is of course the processing cost. There are (/were) machines offering micro coded entry point instructions doing stack allocation from a buddy heap, essentially unlinking the top element from the free list for the appropriate frame size. Function return links the frame back to the top of the list. This requires a couple of memory cycles for each operation. What is that on the total instruction budget for a modern style method call? Even though you with desktop PCs can just plug in another 64 GiB of RAM to satisfy stack needs, in other systems (such as embedded), this is not always possible. Yet, lots of programmers recoil in horror over buddy allocation: It causes lots internal fragmentation! 25% with binary buddies! Well, but how much can you save in stack requirements? Besides, lots of architectures demand word, doubleword or paragraph stack alignment anyway - that leads to internal fragmentation/waste as well! If the allocation cost worries you, lots of optimization

jschell

Lot of text there with no context.

trønderen wrote:

There is no real reason why stack frames are allocated edge to edge!

In C/C++ there were library calls that allowed one to chop the stack off. To reset it to a specific point. I never used it, so I don't know why it existed. Might have been performance or error handling. That call would have certainly been easier (and faster) if frames were next to each other. However I also believe that is not the case for Java. I believe I looked at the source code for that at one point and it was just allocating on the heap.

trønderen wrote:

Stack relative addressing never goes outside the current frame

False. In C and C++ one can overwrite the frame. I have in fact done so in the past. Since both Java and C# support native code overwriting the frame is also possible there.

trønderen wrote:

Stack frames could be allocated from the heap, with space is occupied only as long as a function call is active

That is somewhat simplistic. Allocations of any sort still require a mechanism to deallocate it. So for example a standard in C/C++ (at least when I looked at source code) was that local variable were allocated. Might have even been in the stack frame. Then the deallocate was when the frame was freed on exit. That is why they were (or are) on the stack frame. Because the deallocation is already there. Heap deallocation on the other hand is a different mechanism.

trønderen wrote:

Then, a given amount of RAM would be capable of handling a much larger number of threads.

As stated that is wrong. Memory is not the only limitation for the number of threads. Not to mention that no developer should ever consider the idea that an unlimited or large number of threads is a 'good' idea.

trønderen wrote:

Even though you with desktop PCs can just plug in another 64 GiB of RAM to satisfy stack needs

Just noting that I have never seen a computer that actually allows for the full addressable range of 64 bits. Certainly no desktop PC does that. So one cannot just keep plugging it in. And cloud servers are always limited also. Sometimes to very surprising low limits.

trønderen wrote:

(such as e

Lost User

[Branching stacks aka spaghetti stacks](https://en.wikipedia.org/wiki/Parent\_pointer\_tree) are used in the implementations of some functional languages. They work basically how you suggest they could work, plus the added detail of allowing "sibling frames" to exist simultaneously. Go supposedly used to use segmented/split stacks (which also work roughly as you suggested, not on a per-frame basis exactly but more as a list of blocks), but later apparently switched to just copying the whole damn thing to keep it contiguous (which still leaves "the stack(TM)" as essentially a bunch of arraylists that exist on the heap). Other userspace-threading solutions may use similar techniques for their stacks for "fake threads", but I don't really pay attention to what happens in that space, hell I don't really know what Go is doing either, I just remembered that it did something funny and spent 5 minutes on google to look it up. By the way, the way stacks normally use, is by reserving a bunch of contiguous virtual address space, and lazily commit pages of actual physical memory only as needed. Which is why large stack allocations involve a call to [_chkstk](https://learn.microsoft.com/en-us/windows/win32/devnotes/-win32-chkstk), to ensure that every page is touched in order, to avoid missing the guard page. So already with the stack working the way it normally does, you're probably not paying the full memory cost for each stack. But there is no mechanism to shrink the stack. Once a page has been touched once, you keep it.

trønderen

jschell wrote:

Lot of text there with no context.

What kind of 'context' are you requesting? A call stack is a general concept, employed in practically speaking all current programming languages. Do you want the context limited to one specific language? One specific hardware platform?

In C/C++ there were library calls that allowed one to chop the stack off. To reset it to a specific point. I never used it, so I don't know why it existed.

I never knew of anything like that, and your reference is so unspecific (no function names etc.) that it is difficult to do a search. From what you write, it may sound like a mechanism intended for threads requiring a lot of stack space during initialization, and then enter a 'working' state with small stack requirements, so the stack limit is reduced after init, to make the address space above it available to other threads (started at a later time). I am guessing now, and may be completely wrong, but I can't think of many other uses that makes sense (based on the information you provide).

trønderen wrote: Stack relative addressing never goes outside the current frame False. In C and C++ one can overwrite the frame.

I am not sure what you mean by 'overwriting', suspecting that you refer to something similar to out-of-bounds array indexing: If you put a somearray[1] at the very top of the stack, you may index it beyond the limit of the current frame. That is as bad as any other out-of-bounds indexing! Conceptually, you are still within the same stack frame, you are just pretending that the frame is larger than initially allocated. There is no guarantee that the space you are trying to (mis)use is available; you might hit the stack limit. In languages with a static link (I haven't been working with any more recent one than Pascal; most newer languages don't have a static link), code of an inner function may address locations in frames lower on the stack. However, they do not do this by negative offsets (or positive, if the stack is growing downwards) from their own frame pointer. Rather, they use the static link to find the stack frame of the outer function, and do relative addressing from that frame base address - as if it was a pointer to a struct. The addressing never goes outside that frame.

Allocations of any sort still require a mechanism to deallocate it. [...] That is why they

trønderen

If you do dynamic allocation of physical pages to a stack, each stack has a minimum memory cost of one RAM page (typically 4 Kibytes), and a similar increment. If the typical thread nests deeply, and you have a few threads, this is OK, but if you rather go for tiny, lightweight threads and lots of them, 4 Ki might lead to a lot of internal fragmentation. One disadvantage: If you start new threads very frequently, and each start operation leads to a stack page fault, causing allocation of a new physical page and updating of MMS tables, then the cost of starting new threads, as well as the delay before the new thread is running, will increase noticeable, to phrase it politely. Btw: I have seen such 'on demand' allocation of physical memory pages not just for stacks, but also for heap memory. Small association: A long time ago, CPU architects were arguing whether stacks ought to grow upwards or downwards. One of the downwards arguments was that if you put the stack at the bottom of the address range, a stack overflow could be detected by an attempt to address below address zero, by the sign bit. If that logic was still in use, it would prevent an 'on demand extension' of the stack.

jschell

trønderen wrote:

From what you write, it may sound like a mechanism intended for threads requiring a lot of stack space during initialization,

Incorrect. C existed before threads. The call stack existed then only for controlling the method calling process. Even now threads on C are more of a add on. So threads must work within the way C does stuff.

trønderen wrote:

suspecting that you refer to something similar to out-of-bounds array indexing:

No I wrote what I meant. You are referring to reading. I was talking about writing.

trønderen wrote:

There may certainly be other limitations on the number of threads, but memory is a significant one.

For every resource available to a running program memory is the one of which there is the 'most' available. Every other resource of any kind is going to be much much smaller. As a specific example of that it doesn't matter if a language provides a way to provide for an unlimited number of threads when the OS will not. And never will because nothing in computing is unlimited.

trønderen wrote:

You are certainly confusing 64 address bits with 64 gigabyte,

No confusion on my part. You suggested that someone could just plug in an additional 64 gig of memory. That would of course only be to a 64 bit (or higher) CPU. However, as I said, there are very real practical limits on existing computers (the physical boxes and cloud ones) which mean that your suggestion is not generally true.

trønderen wrote:

I find it slightly funny that you spend a lot of space and energy to tell me how non-workable my proposal for a heap allocated stack is

I addressed specific points that you made, because as I pointed out in the very first sentence of my response, you long post seemed to ramble on quite a bit. So for example in the first part of my response you claimed there was never any reason for doing it - so I showed that there was. I based that on my understanding the history of programming and more specifically on the history of the C programming language. But in general you claimed that no one does heap based stack allocations but I pointed out that they do. So certainly nothing contradictory in my statements in posting a link that says tha

jschell

trønderen wrote:

One disadvantage: If you start new threads very frequently, and each start operation leads to a stack page fault, causing allocation of a new physical page and updating of MMS tables, then the cost of starting new threads, as well as the delay before the new thread is running, will increase noticeable, to phrase it politely.

That however is a architecture/design problem. Not a technical one. Computers have limits and always will limits. So one can never create an architecture/design that is based on unlimited resources. In fact one should plan on smaller than actual resources to give one room for unexpected problems.

jsc42

trønderen wrote:

In languages with a static link (I haven't been working with any more recent one than Pascal; most newer languages don't have a static link), code of an inner function may address locations in frames lower on the stack. However, they do not do this by negative offsets (or positive, if the stack is growing downwards) from their own frame pointer. Rather, they use the static link to find the stack frame of the outer function, and do relative addressing from that frame base address - as if it was a pointer to a struct. The addressing never goes outside that frame.

In the late 1970s, I had a project to optimise expression evaluation in Pascal (based on the Pascal PCode compiler). As you surmised, in the PCode compiler, accessing higher nested values from lower nested functions did skip down the stack frame back pointers (one skip per depth of nesting); but that is an implementation technique, it is not part of the language definition. I guessed that it could be improved by maintaining a vector of nested stack frame start locations to access stack specific offsets directly using this vector. This proved to be a useful optimisation for evaluating expressions. However, when I ran comparisons on a large Pascal program (my test large program was the original version of the PCode compiler) I discovered that whilst the overheads for evaluating expressions were reduced, the number of times that the test program actually accessed variables in higher scopes was so low that the overhead of maintaining the vector outweighed the advantages of having the vector.