Code Project - C / C++ / MFC
Embedded and dynamic memory allocation

17 Posts, 6 Posters
honey the codewitch (#1):
I'm not discussing this on r/embedded for fear of starting a flame war. For those of you who do embedded: you're probably told not to allocate memory dynamically, but rather to allocate up front. I cheat.

The problem is that with things like TrueType and SVG, the memory requirements can vary wildly from application to application depending on the complexity of the content, and firmware is difficult (though not necessarily impossible) to profile adequately on the embedded hardware itself. These features are typically used on systems with at least 360KB of RAM, but will run in as little as 192KB in my testing, at least for some applications (I haven't gathered exhaustive metrics on their memory usage).

What I've done is allocate dynamically in these situations and then test my app, usually for days, to make sure it's stable. I've also run this graphics code (which I use in all my projects) through its paces with Deleaker, so I'm confident in it. I could use a custom pool and allocate from that, but I'd still have to know up front how much, say, TTF would use for a given font - which requires instrumenting the application, and that costs money.

One advantage of dynamic allocation is that my fonts, for example, only use memory while rendering. It is therefore possible to use that memory for other (temporary) things while not rendering. I rarely do this in practice, but it's definitely doable if you're careful about fragmentation. In these cases you have to test thoroughly, but again, doable.

How terrible is this approach? I'm self-taught and was sort of drafted into embedded, so I'm unsure of myself here.
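The render-time-only pattern described above - allocate a scratch buffer for the duration of an operation, then give it all back - might look something like this minimal C sketch (all names here are hypothetical, not from the gfx library):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch of render-scoped allocation: the scratch buffer
   exists only while rendering, so the same RAM is free for other
   temporary uses between frames. */
typedef struct {
    unsigned char *scratch;
    size_t size;
} render_ctx_t;

int render_begin(render_ctx_t *ctx, size_t needed) {
    ctx->scratch = (unsigned char *)malloc(needed);
    if (ctx->scratch == NULL) {
        return -1; /* caller can fall back to a slower, allocation-free path */
    }
    ctx->size = needed;
    return 0;
}

void render_end(render_ctx_t *ctx) {
    free(ctx->scratch); /* released immediately: nothing small lingers */
    ctx->scratch = NULL;
    ctx->size = 0;
}
```

Because the allocation and free are paired around one operation, the heap returns to the same state after every render, which is what keeps fragmentation in check.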

    Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

Mircea Neacsu (#2):

      honey the codewitch wrote:

      How terrible is this approach? I'm self taught and was sort of drafted into embedded so I'm unsure of myself here.

Not so terrible, rest assured; don't damage your self-esteem :D

The biggest issue with dynamic allocation, and where this mantra of "no mallocs" came from, is the potential for memory fragmentation. You don't know how well the memory manager behaves, and you risk running out of big memory chunks. However, if you think about it, this is a phenomenon specific to multi-tasking, where each task is oblivious to the other tasks that may exist in the system. That is not what happens in an embedded system, where you have only one "task" (footnote: yes, there are exceptions to this scenario). You certainly have multiple threads, but those threads cooperate rather than compete for resources.

If you had an anal boss who wanted to enforce the "no malloc" rule, you could allocate all the available memory and run a memory manager inside your app that hands out chunks of this pool. From the outside the system would behave the same way, because there is only one task/app. Again, I'm over-simplifying here: there is still a network stack that needs some buffers, but you get the gist. In the embedded world, quoting Taylor Swift: "This is our place, we make the call".
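The "grab it all and manage it yourself" idea above can be sketched with a fixed pool and a bump allocator - the simplest possible in-app memory manager. This is an illustrative sketch, not anyone's production code; a real pool manager would support individual frees:

```c
#include <stddef.h>
#include <stdint.h>

/* A fixed pool carved out at link time, handed out by a bump
   allocator that can only be reset wholesale - so there is
   nothing left to fragment. */
#define POOL_SIZE (16 * 1024)

static uint8_t pool[POOL_SIZE];
static size_t pool_used = 0;

void *pool_alloc(size_t n) {
    /* round up to 8-byte alignment, as malloc would */
    size_t aligned = (n + 7u) & ~(size_t)7u;
    if (pool_used + aligned > POOL_SIZE)
        return NULL;
    void *p = &pool[pool_used];
    pool_used += aligned;
    return p;
}

void pool_reset(void) {
    pool_used = 0; /* free everything at once between operations */
}
```

Since the pool is a static array, the "allocation" is decided at link time and the heap is never touched - which is exactly why an auditor enforcing "no malloc" would be satisfied.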

      Mircea

honey the codewitch (#3):

        Mircea Neacsu wrote:

        you can allocate all the available memory and run a memory manager inside your app that allocates chunks of this pool

The trouble here is determining the pool size up front without just making it the size of available memory. As I mentioned, it varies wildly from application to application and from content to content. One thing I am doing is playing hot potato with my RAM to avoid fragmentation: I do not keep little allocations around. Allocations that are kept around are allocated up front. I think that's what saves me here.

        Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

Mircea Neacsu (#4):

          honey the codewitch wrote:

          without just making it the size of available memory

My point was that you can make it the size of available memory. What else is going to use it? Can the user run a second app behind your back? And if I'm right, and you make it the size of available RAM and manage it inside your app, how is this different from allocating it only when needed? The only difference is that you use your own memory manager instead of the system one. Is your memory manager going to be better than the standard one? Probably only marginally so.

          Mircea

honey the codewitch (#5):

It depends on the platform. The ESP-IDF, for example, likes to allocate temporarily to do just about everything.

            Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

Greg Utas (#6):

              honey the codewitch wrote:

              The trouble here is determining the pool size up front without just making it the size of available memory.

              One way around this is to allocate in slabs, adding a new slab when needed. A small system might allocate one slab, whereas a bigger one might allocate, say, four.
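Slab growth as suggested above could be sketched like this - a fixed-size slab is added only when the current ones are exhausted, so a small system stops at one slab while a heavier workload grows to several (sizes and names are illustrative assumptions, not from any particular system):

```c
#include <stdlib.h>
#include <stddef.h>

/* Grow-on-demand slab allocator sketch: bump-allocate within the
   newest slab, add another slab only when the request won't fit. */
#define SLAB_SIZE 4096
#define MAX_SLABS 8

static unsigned char *slabs[MAX_SLABS];
static size_t slab_count = 0;
static size_t offset = 0; /* bump pointer within the newest slab */

void *slab_alloc(size_t n) {
    if (n > SLAB_SIZE) return NULL;
    if (slab_count == 0 || offset + n > SLAB_SIZE) {
        if (slab_count == MAX_SLABS) return NULL;
        slabs[slab_count] = (unsigned char *)malloc(SLAB_SIZE);
        if (slabs[slab_count] == NULL) return NULL;
        slab_count++;
        offset = 0;
    }
    void *p = slabs[slab_count - 1] + offset;
    offset += n;
    return p;
}
```

The appeal is that the footprint adapts to the workload without anyone having to profile the exact peak in advance - the open question the thread starts from.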

              honey the codewitch wrote:

              One thing I am doing is playing hot potato with my RAM to avoid fragmentation. I do not keep little allocations around.

              If the memory manager uses buddy allocation and you allocate temporary memory and then free it all before allocating more, fragmentation should be avoided altogether.

              honey the codewitch wrote:

              Allocations that are kept around are allocated up front.

              An excellent strategy that is also important in systems where latency must be predictable.

              Robust Services Core | Software Techniques for Lemmings | Articles
              The fox knows many things, but the hedgehog knows one big thing.

honey the codewitch (#7):

                Greg Utas wrote:

                One way around this is to allocate in slabs

                Right, but you're still allocating. Initially I didn't realize the issue was strictly fragmentation. My TTF rasterizer already uses a custom pool/heap for allocations.

                Greg Utas wrote:

                If the memory manager uses buddy allocation and you allocate temporary memory and then free it all before allocating more, fragmentation should be avoided altogether.

Typically I allocate for the span of an operation in order to improve performance. Sometimes I fall back to a slower method if I can't allocate, so the memory gets used like a cache. It gets freed once the operation completes. This is the case in every situation in my graphics library, save a couple of very explicit things like "large bitmaps" (which are actually composed of a bunch of smaller bitmaps because of heap fragmentation) and things like SVG documents (which are actually kind of small).
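The "cache if you can, fall back if you can't" pattern described here is roughly the following shape in C. The process_fast/process_slow routines are hypothetical stand-ins for a buffered fast path and an allocation-free slow path:

```c
#include <stdlib.h>
#include <string.h>

/* Fast path needs a temporary buffer; if the heap can't supply one,
   a slower allocation-free path runs instead. Failure to allocate
   degrades performance, never correctness. */
static int process_fast(const char *in, char *tmp, size_t n) {
    memcpy(tmp, in, n); /* work on a scratch copy */
    return 1;           /* report: fast path taken */
}

static int process_slow(const char *in, size_t n) {
    (void)in; (void)n;  /* e.g. byte-at-a-time, no scratch needed */
    return 0;           /* report: slow path taken */
}

int process(const char *in, size_t n) {
    char *tmp = (char *)malloc(n);
    if (tmp != NULL) {
        int r = process_fast(in, tmp, n);
        free(tmp);      /* freed as soon as the operation completes */
        return r;
    }
    return process_slow(in, n);
}
```

Treating the allocation as optional is what makes this style tolerable on constrained targets: out-of-memory becomes a slow frame rather than a crash.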

                Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

CPallini (#8):

I do not use dynamic memory allocation on my systems. Then again, most of my targets 'feature' 4 KB of SRAM, so malloc would be overkill. I am interested in trying dynamic allocation on a larger system (embedding Lua, for example, would require it), but I've never done it before.

                  "In testa che avete, Signor di Ceprano?" -- Rigoletto


honey the codewitch (#9):

                    For some reason I simply can't imagine calling malloc on a 4kB system. :-D

                    Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

CPallini (#10):

                      By default, the IDE (PSoC Creator) sets the heap size to 128 bytes.

                      "In testa che avete, Signor di Ceprano?" -- Rigoletto


trønderen (#11):

                        Greg Utas wrote:

                        If the memory manager uses buddy allocation and you allocate temporary memory and then free it all before allocating more, fragmentation should be avoided altogether.

Every time I mention buddy allocation, someone stands up shouting about internal fragmentation. That usually turns into a lengthy discussion about the fraction of allocations that are powers of two, how far you should allow yourself to fine-tune memory use by, say, changing a long to an int to bring an allocation size down from 516 to 512 bytes, and so on. Usually the shouters end up admitting that an average of 25% internal fragmentation is overly pessimistic. Buddy allocation is certainly not much valued among my professional friends. I love it.
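To make the 516-versus-512 example concrete: a buddy allocator rounds each request up to the next power of two, so a 516-byte request occupies a 1024-byte block (about 49% internal fragmentation), while 512 bytes fits a 512-byte block exactly. A minimal sketch of that rounding:

```c
#include <stddef.h>

/* Round a request up to the next power of two, as a buddy
   allocator does; the difference is the internal fragmentation. */
size_t buddy_round(size_t n) {
    size_t b = 1;
    while (b < n)
        b <<= 1;
    return b;
}
```

So shaving four bytes off a structure can halve its real memory footprint when it straddles a power-of-two boundary - which is why the fine-tuning argument keeps coming up.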

Greg Utas (#12):

                          I think one reason buddy allocation originally became popular was its very low management overhead. There may be better schemes, but they would use more management overhead to reduce fragmentation and would need a background task (like a garbage collector) to merge adjacent free areas, something that buddy allocation does on the fly.

                          Robust Services Core | Software Techniques for Lemmings | Articles
                          The fox knows many things, but the hedgehog knows one big thing.

trønderen (#13):

I'd prefer not to do buddy recombination on the fly, i.e. when allocating/deallocating a block. I want neither allocation nor deallocation to be delayed by block recombination.

If you run out of space for bigger blocks, the freelists for smaller blocks may be long. Finding buddies fit for combination is similar to sorting the list. It would be inefficient to go to the first smaller list (below the requested size) for the search; it may have viable candidates, but one block's buddy may live as two not-yet-combined halves in the second smaller list. So recombination should start at the smallest unit and proceed upwards.

This you can leave to a very low-priority task. The task unhooks a freelist for a given size, possibly leaving a couple of entries for other threads' requests while the combination is being done. The task sorts the unhooked freelist to make combinable buddies list neighbors. If the heap start address is a multiple of the largest block that can be allocated, an XOR between the neighbors' addresses leaves the single bit that is the block size being processed; that is super fast. The two are unlinked, and the lower one put into a separate list. Forget the upper one from now on - it is merged into the lower. On reaching the end of the list, the task is left with two lists: the reduced one (of the size being processed) and a list of recombined next-size blocks, to be added to that freelist.

At the start, the task must reserve one freelist head while unhooking; that is done in a couple of instructions. At cleanup, it must again reserve the freelist head to hook the reduced list back in, and then add its entries to the next bigger freelist. Both operations are done in a couple of instructions. There is no hurry: no one is hurt if the task must wait in line before getting to add its entries to the two list heads. The order doesn't matter; if one head is busy, it can do the other one while it waits.

The task processes one size at a time. Each size usually has a moderately sized freelist, so the time spent doing the job is moderate. Doing a real list sort might be a good idea: with allocation done from the list head, after the combination the lowest available block is allocated for the following requests. If a bigger buddy must be split, it will be the lowermost of those. This strategy tends to gather active heap blocks at lower addresses, leaving the higher addresses for larger blocks (without the risk of a few stray small-block allocations in the upper range 'polluting' a large heap area).
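The XOR step mentioned above is worth seeing in isolation: if the heap base is aligned to the largest block size, a block's buddy at a given size differs from it only in the single address bit equal to that size. A sketch, working on offsets relative to the heap base:

```c
#include <stdint.h>

/* With a suitably aligned heap, the buddy of a block at a given
   offset and power-of-two block size is found by flipping one bit. */
uintptr_t buddy_of(uintptr_t offset, uintptr_t block_size) {
    return offset ^ block_size;
}
```

Two free blocks are combinable exactly when they are each other's buddies at the current size, so after sorting a freelist, one XOR per adjacent pair identifies every merge candidate.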

Greg Utas (#14):

                              I do the splitting and recombining during allocations and deallocations. It's fairly efficient, and the code already has the heap mutex. A background task has to lock out users for longer. I just make all users share the cost, because I can't see how offloading the work to a background task can make things any more efficient.

                              Robust Services Core | Software Techniques for Lemmings | Articles
                              The fox knows many things, but the hedgehog knows one big thing.

                                T Offline
                                trønderen
                                wrote on last edited by
                                #15

It all depends on OS/hardware context. "Background" doesn't always mean a background task in the OS sense. You can offer your applications a heap manager providing allocate/deallocate as intrinsics / inline functions working on freelists, with operations synchronized by hardware reservation (available in several architectures), with no need for software semaphores. That is all your applications see.

Behind the scenes, your heap manager can have a clock interrupt handler doing the recombination. Even if the clock handler must maintain a list of timed events (it probably has one already!), regular activation of a recombination task is orders of magnitude cheaper than using OS mechanisms to start a background process in the .NET sense, or in most other OSes. Your apps won't know why memory is so tidy all the time. They will get the memory blocks they ask for in a small handful of instructions; release takes even fewer. Your apps are never delayed by any (lengthy?) recombination of buddies. The risk of heap pollution from small allocations blocking large recombinations is greatly reduced.

You do not provide details about your procedures: when deallocating, do you search for buddy blocks for recombination at the same size only, or do you recurse to larger blocks, possibly up to the maximum size available? Do you keep freelists sorted for efficient identification of buddies, or are they unsorted? Corollary: on new allocations, will you allocate new blocks so as to minimize pollution of the heap by small blocks preventing allocation of large blocks? This is more or less implicit with sorted freelists, but if your freelists are not sorted, how do you maintain it?
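The split of responsibilities described above can be sketched in plain C, assuming a single size class and no real interrupt machinery (all names here are illustrative): the app-facing calls are constant-time list operations, and a `heap_tick()` stand-in for the clock handler batches the deferred work, which is where buddy recombination would occur.

```c
#include <assert.h>
#include <stddef.h>

typedef struct blk { struct blk *next; } blk;

static blk *pending = NULL;   /* blocks freed since the last tick */
static blk *freelist = NULL;  /* blocks available for allocation  */

/* App-facing release: constant time, just push onto the pending stack. */
void heap_free(blk *b) {
    b->next = pending;
    pending = b;
}

/* App-facing allocation: constant time, pop the freelist head. */
blk *heap_alloc(void) {
    blk *b = freelist;
    if (b)
        freelist = b->next;
    return b;
}

/* Clock-handler stand-in: drain the pending stack into the freelist.
 * A real buddy heap would search for each block's buddy and merge
 * here; returns how many blocks were processed. */
int heap_tick(void) {
    int moved = 0;
    while (pending) {
        blk *b = pending;
        pending = b->next;
        b->next = freelist;
        freelist = b;
        ++moved;
    }
    return moved;
}
```

Note the trade-off this sketch makes visible: freed blocks are invisible to `heap_alloc()` until the next tick, so the tick period bounds how stale the freelist can get.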

                                • T trønderen

It all depends on OS/hardware context. "Background" doesn't always mean a background task in the OS sense. You can offer your applications a heap manager providing allocate/deallocate as intrinsics / inline functions working on freelists, with operations synchronized by hardware reservation (available in several architectures), with no need for software semaphores. That is all your applications see. Behind the scenes, your heap manager can have a clock interrupt handler doing the recombination. Even if the clock handler must maintain a list of timed events (it probably has one already!), regular activation of a recombination task is orders of magnitude cheaper than using OS mechanisms to start a background process in the .NET sense, or in most other OSes. Your apps won't know why memory is so tidy all the time. They will get the memory blocks they ask for in a small handful of instructions; release takes even fewer. Your apps are never delayed by any (lengthy?) recombination of buddies. The risk of heap pollution from small allocations blocking large recombinations is greatly reduced. You do not provide details about your procedures: when deallocating, do you search for buddy blocks for recombination at the same size only, or do you recurse to larger blocks, possibly up to the maximum size available? Do you keep freelists sorted for efficient identification of buddies, or are they unsorted? Corollary: on new allocations, will you allocate new blocks so as to minimize pollution of the heap by small blocks preventing allocation of large blocks? This is more or less implicit with sorted freelists, but if your freelists are not sorted, how do you maintain it?

                                  Greg UtasG Offline
                                  Greg Utas
                                  wrote on last edited by
                                  #16

Most memory allocated at run-time is actually taken from object pools[^] allocated when the system initializes. Each pool supports a major framework class, and the number of blocks in each pool is engineered based on the system's object model. The buddy heap[^] is seldom used at run-time; it is used to allocate byte buckets for IP message payloads, or when using the framework to implement a standalone application. Most allocation occurs in the pools. The buddy heap has a queue for each of its block sizes (powers of 2). Allocation can recursively split a block, and deallocation can recursively merge blocks. Both memory managers verify that a pointer references a valid block, support debug tools that can be enabled to trace activity, and collect usage statistics.
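The object-pool approach can be illustrated with a small sketch (names here are hypothetical, not the Robust Services Core API): all blocks are carved out of a static array at initialization, allocation and release are O(1) list operations, and release validates that the pointer actually belongs to the pool, as the post describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All storage is reserved at init, so run-time allocation can never
 * fragment or exhaust the heap. */
#define POOL_BLOCKS 8
#define BLOCK_SIZE  32  /* bytes per object; must hold a pointer */

typedef union slot {
    union slot *next;            /* link while free            */
    uint8_t bytes[BLOCK_SIZE];   /* payload while allocated    */
} slot;

static slot pool[POOL_BLOCKS];
static slot *free_head;

void pool_init(void) {
    for (int i = 0; i < POOL_BLOCKS - 1; ++i)
        pool[i].next = &pool[i + 1];
    pool[POOL_BLOCKS - 1].next = NULL;
    free_head = &pool[0];
}

void *pool_alloc(void) {
    slot *s = free_head;
    if (s)
        free_head = s->next;
    return s;                    /* NULL when the pool is empty */
}

/* Reject pointers that don't reference a block in this pool. */
int pool_free(void *p) {
    uintptr_t a = (uintptr_t)p;
    if (a < (uintptr_t)pool || a >= (uintptr_t)(pool + POOL_BLOCKS))
        return 0;
    slot *s = (slot *)p;
    s->next = free_head;
    free_head = s;
    return 1;
}
```

Sizing `POOL_BLOCKS` per class is exactly the "engineered based on the system's object model" step: the pool never grows, so exhaustion is a design-time decision rather than a run-time surprise.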

                                  Robust Services Core | Software Techniques for Lemmings | Articles
                                  The fox knows many things, but the hedgehog knows one big thing.


                                  • H honey the codewitch

                                    I'm not discussing this on r/embedded for fear of starting a flame war. For those of you that do embedded, you're probably told to not dynamically allocate memory, but rather to allocate up front. I cheat. The problem is with things like TrueType and SVG depending on the complexity of the content the memory requirements can vary wildly application to application, and firmware is difficult (but not necessarily impossible) to adequately profile on the embedded hardware itself. These features are typically used on systems with at least 360KB of RAM but will run in as little as 192KB that I've tried, at least for some applications (I haven't gotten exhaustive metrics on the memory usage of these things) What I've done is dynamically allocate in these situations, and then test my app, usually for days to make sure it's stable. I've also run this graphics code (which I use in all my projects) through its paces with deleaker so I'm confident in it. I could use a custom pool and allocate from that, but I'd still have to know how much TTF for example, would use up front for that given font - which requires instrumenting the application, which costs money. One advantage to dynamic allocation is my fonts for example, only use it while rendering. It is therefore possible to (while not rendering) use that memory for other (temporary) things. I rarely do this in practice, but it's definitely doable, if you're careful about fragmentation. In these cases, you have to test thoroughly, but again, doable. How terrible is this approach? I'm self taught and was sort of drafted into embedded so I'm unsure of myself here.

                                    Check out my IoT graphics library here: https://honeythecodewitch.com/gfx And my IoT UI/User Experience library here: https://honeythecodewitch.com/uix

                                    L Offline
                                    leon de boer
                                    wrote on last edited by
                                    #17

Embedded includes very large systems and processors these days, and dynamic allocation has become very common on those. Even with an RTOS on a smaller micro, you will typically have dynamic allocation available, with several schemes supported. FreeRTOS is typical, and as you can see it offers five options: FreeRTOS - Memory management options for the FreeRTOS small footprint, professional grade, real time kernel (scheduler)[^] Perhaps to give a more informed answer, we need to know what your intended system is.

                                    In vino veritas
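Of the FreeRTOS schemes mentioned above, the simplest (heap_1) is an allocate-only bump allocator over a static arena: deterministic and fragmentation-free, at the cost of never reclaiming memory. A rough sketch of the idea in plain C follows; the names are illustrative, not the real pvPortMalloc implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed arena reserved at link time; nothing is ever returned to it. */
#define ARENA_SIZE 1024
static uint8_t arena[ARENA_SIZE];
static size_t next_free = 0;

void *bump_alloc(size_t n) {
    n = (n + 7u) & ~(size_t)7u;        /* round up to 8-byte alignment */
    if (n > ARENA_SIZE - next_free)
        return NULL;                   /* arena exhausted */
    void *p = &arena[next_free];
    next_free += n;
    return p;
}
```

Because there is no free(), worst-case behavior is trivially analyzable, which is why allocate-once-at-startup schemes like this are the default advice in safety-minded embedded work.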
