Building a relational graph from a large build enlistment
-
Hello there, I'm starting a little personal project and I'm not sure of the best way to tackle it. I'm hoping someone can give me a few pointers and/or some ideas on questions I should answer before I dig in.

The problem I'm trying to solve is determining file and project dependencies in a big, mature, sprawling codebase with tens of thousands of files. This codebase has evolved over time and can be quite unwieldy. To build even a small tool I need to build most of the build tree, because it's almost impossible to figure out what all my dependencies are.

I found a library that will let me watch file activity: file opens, creates, and so forth. My thought is that if I run a full clean build and monitor and record all of the file activity, I can generate a dependency graph of the entire project, and/or automatically create a command file that builds any project I want along with everything it needs, in the proper order. I'm sure there will be other interesting things I can do with this information. I'd also like to try to visualize the entire project, and maybe create a change heat map from it by looking at the source server info.

My quandary is how best to record the file activity so that I can build a dependency tree. Ideally I'd like to do this in a multi-threaded way and be able to tell which files need to be built before others, which files are grouped in a project, and so on. My current proposed approach is to record all file activity to a logfile and then post-process it to generate the dependency graphs. I'm still a little hazy about all the data I need to record; I'm currently thinking I'll figure that out as I go, when I find I'm missing some important information.

Any pointers or thoughts would be greatly appreciated. Thanxx, Adam
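P.S. To make the post-processing idea a bit more concrete, here's a very rough sketch of what I have in mind, in Python. The one-record-per-line "pid op path" log format is just a placeholder I made up, not what any particular monitoring library actually emits:

import collections

def load_events(log_path):
    # Parse one "pid op path" record per line into per-process
    # read and write sets.
    reads = collections.defaultdict(set)
    writes = collections.defaultdict(set)
    with open(log_path) as log:
        for line in log:
            pid, op, path = line.rstrip("\n").split(" ", 2)
            if op in ("open", "read"):
                reads[pid].add(path)
            elif op in ("create", "write"):
                writes[pid].add(path)
    return reads, writes

def build_edges(reads, writes):
    # Every file a process wrote depends on every file that same
    # process read: "src must exist before out can be produced".
    edges = set()
    for pid, outputs in writes.items():
        for out in outputs:
            for src in reads[pid]:
                if src != out:
                    edges.add((src, out))
    return edges

The edge set is the raw material for the dependency graph; everything else would be queries over it.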
-
Adam, Apologies if I'm just rephrasing your question: view what you want as a set of graph nodes and edges, where different edge types represent the reason for a link (dependency) between the nodes. Try drawing a few views on paper to see what you, as the user, want to see, e.g. "library used by" or "header used by". This should give you the answer to your question.
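To illustrate (the node names and edge labels here are just made up), a "view" is then nothing more than a filter on the edge label:

# One graph, many views: each edge carries the reason for the link.
edges = [
    ("app.exe", "util.lib", "library used by"),
    ("app.exe", "util.h", "header used by"),
    ("util.lib", "util.h", "header used by"),
]

def view(edges, label):
    # Keep only the links of one kind, e.g. the "header used by" view.
    return [(a, b) for a, b, kind in edges if kind == label]

print(view(edges, "header used by"))
# [('app.exe', 'util.h'), ('util.lib', 'util.h')]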
-
I hope you don't mind me using this group as a sounding board. I'm posting my thoughts partly to help solidify them in my head, and partly to see if anyone out there has interesting observations or experience that can help. :-)

I have been thinking about what I and my customers would like to get from this tool, and have come up with a great many pieces of information I could/should track. But I'm thinking I just need to find the core pieces of information and then use queries against the information store to figure out the rest. For example, some of the things I'd like to be able to do are:

1. Determine the build order of files/projects
2. Determine which projects can be built simultaneously
3. See all the dependencies of file X/project Y
4. See all the dependencies on file X/project Y
5. Find duplicate files
6. See which projects create which files, and which other projects then consume those files

I have some ideas on what to track; I just need to implement it and see how it works. The more I think about this, the more I suspect the harder part will be actually collecting the information. Ideally I'd be collecting it while multiple instances of the compilation tools are executing at once. I'm having some fun wrapping my head around having a single process monitor multiple processes while maintaining the build ordering/hierarchy inside and outside of a project. I'm wondering if it would be best to do this in two passes.

Then there's the question of the best design for processing all of those file open/creation transactions. My current thought is to have one thread per compilation process. I only expect to have 4-8 processes going at once, so that doesn't seem like it should be too resource intensive.

And just so you know, this is the biggest project I've ever tackled, so I'm a bit nervous about getting my approach organized before I start and making sure I'm starting in the correct place. Thanxx for your input, Adam
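P.S. To convince myself the "core data plus queries" idea holds up, here's a rough sketch of how queries 1, 2 and 5 might fall out of a bare project-level edge list (all names invented):

import collections
import hashlib

def build_levels(edges):
    # Kahn's algorithm, grouped into levels: everything in one level
    # has all its prerequisites in earlier levels, so each level can
    # be built in parallel (queries 1 and 2 at once).
    deps = collections.defaultdict(set)        # project -> prerequisites
    dependents = collections.defaultdict(set)  # project -> who needs it
    nodes = set()
    for before, after in edges:
        nodes.update((before, after))
        deps[after].add(before)
        dependents[before].add(after)
    levels = []
    ready = {n for n in nodes if not deps[n]}
    while ready:
        levels.append(sorted(ready))
        next_ready = set()
        for done in ready:
            for d in dependents[done]:
                deps[d].discard(done)
                if not deps[d]:
                    next_ready.add(d)
        nodes -= ready
        ready = next_ready
    if nodes:
        raise ValueError("dependency cycle among: %s" % sorted(nodes))
    return levels

print(build_levels([("zlib", "net"), ("zlib", "ui"),
                    ("net", "app"), ("ui", "app")]))
# [['zlib'], ['net', 'ui'], ['app']]

def duplicate_files(paths):
    # Query 5: group paths whose contents hash identically.
    by_hash = collections.defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            by_hash[hashlib.sha256(f.read()).hexdigest()].append(p)
    return [group for group in by_hash.values() if len(group) > 1]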
-
In my opinion, storing the data produced by monitoring the file activity, and processing it, should be kept apart. In that case you need a format that both the Data-Storer and the Data-Processor can work with, and it would seem wise to pick a format that can be extended easily: if you start storing data A, B and C but later realise you need D as well for each file, it should be relatively easy to add. Coding any of the 'knowledge' about how processing occurs into the Storer side would seem to be wrong. This will let you work on the Data-Storing side of the code independently of the Processing.

Presuming you store a certain set of data in this format, from that data you will produce some information, or at least rearrange it into a form more suitable for the user, or for a second stage of processing at least. This makes me think that breaking your Data-Processing down into smaller units should be worth considering.

Personally, I'd leave out for now how you are going to implement it, i.e. leave the problems of multiple processes and threads until later, and settle on what you are going to do first. To be able to debug this, what about starting with everything single-threaded, processing files/data in serial rather than in parallel, to prove that your idea of what needs to be done is correct?
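As a sketch of that extensibility (the field names here are just examples), you could store one self-describing record per line, say JSON, so the Storer and Processor agree only on the container format, and a new field D doesn't break a Processor that only knows A, B and C:

import json

def store_event(log, pid, op, path, **extra):
    # Storer side: knows nothing about processing; just appends records.
    record = {"pid": pid, "op": op, "path": path}
    record.update(extra)  # new fields slot in later without a schema change
    log.write(json.dumps(record) + "\n")

def read_events(log_path):
    # Processor side: reads the fields it understands, ignores the rest.
    with open(log_path) as log:
        for line in log:
            yield json.loads(line)

with open("build.log", "w") as log:
    store_event(log, 1234, "open", "src/util.h")
    store_event(log, 1234, "create", "obj/util.obj", thread=7)  # extra field

for event in read_events("build.log"):
    print(event["pid"], event["op"], event["path"])

A Processor written against pid/op/path keeps working unchanged when the thread field appears, which is the kind of decoupling I mean.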