table construction/design question

Jim Crafton

I'm an SQL nebie, so bear with me: I need to collect the following data after parsing a bunch of c/c++ files and then storing the parsed data as a serious of records that store the symbol type (is it a function class/struct/union, variable, enum etc), the symbol name, offset, and a few other values. I would like to enter this data into one or more tables for easy searching later on. My question is: is it better to design multiple tables, for example one table for classes/structs/unions, another table for functions, another for variables/member vars, and a fourth for miscellaneous stuff, or one single table? I would prefer whatever results in a faster return of the results. If I have multiple tables, what is the proper way to glob all the results (from selecting across multiple tables) into one result set? ¡El diablo está en mis pantalones! ¡Mire, mire! Real Mentats use only 100% pure, unfooled around with Sapho Juice(tm)! SELECT * FROM User WHERE Clue > 0 0 rows returned Save an Orange - Use the VCF!

Rob Graham

My initial reaction is that this would best be in several tables. I would create a small symbol_types table with an identity field as primary key, and a unique index on the symbol_type field (NVarChar) similarly a Files table with the file path and an ID (identity) field. The main table would contain symbol_name, TypeID (ID from symbol_types), PathID ( ID from files) and the rest of the story (but look for other opportunities to segregate repeated string data into a lookup table like "symbol_types"). This table should likely have a compound primary key on Symbol_name,TypeId and PathID to allow for multiple occurences of the same symbol_name (either that or an identity field for primary key, and just a unique index on the previously mentioned three fields). The idea here is to consolidate repeated strings into a single table so they can be represented by the integer key for the row that contains the one instance of the string. This makes it much easier and less expensive to update a symbol type name (to correct spelling for instance), and reduces the record size in the main table by replacing the repeated strings with an integer value. This makes for much faster selection queries, since extracting groups of records from the main table ( all symbols of a certain type, for instance) a match on integer id's rather than a string comparison) - "Select [symbol name] from [symbol table] inner join [symbol types] on [symbol name].[TypeID] = [symbol types].id where [symbol types].[typename] = 'member'"). The proper way to "glob them all together" is to use an inner join query " select [S.symbol name], [T.Typename],[P.Path]...from [symbol table] as S inner join [symbol_types] as T on S.TypId = T.ID inner join Files as P on s.Pathid = P.ID" (warning: syntax may not be quite correct, check documentation for your database on proper way to make inner join). Note that you often cannot directy update the database using the resulting recordset, but must instead update the underlying base tables one by one. Absolute faith corrupts as absolutely as absolute power Eric Hoffer All that is necessary for the triumph of evil is that good men do nothing. Edmund Burke -- modified at 23:15 Monday 23rd January, 2006

Jim Crafton

Wow! Thanks for the info. I appreciate the detailed response, since I don't know that much about SQL! ¡El diablo está en mis pantalones! ¡Mire, mire! Real Mentats use only 100% pure, unfooled around with Sapho Juice(tm)! SELECT * FROM User WHERE Clue > 0 0 rows returned Save an Orange - Use the VCF!

Damodar Periwal

If your use cases are such that at most of the places you need to lookup objects of one particular type (e.g., give me a class by this name, give me all the variables with this property, ...), you may be better off storing objects of each type in a separate table. This will help in better performance because less number of records needs to be filtered during a query for a particular type of object. This will also provide better data normalization if the information for each type is quite different than that for the other ones. If you are developing your application in C# or some other managed language, you may use an OR-Mapping product to totally avoid dealing with tedious low-level infrastructire code in ADO.NET or OleDb for data integration. For example, with NJDX OR-Mapper, you may do the following: 1- Define your domain model classes (say AClass, AStruct, AUnion, AVariable, AnEnum, etc.) with properties for name, offset, etc.). For example, class AClass { string name; int offset; ... } If some classes are very similar, you may define them in a class-hierarchy. 2- Define OR-Mapping declaratively like: CLASS AClass TABLE CLASSES PRIMARY_KEY name ; CLASS AUnion TABLE UNIONS PRIMARY_KEY name ; 3- Create the database schema using NJDXSchema tool. This will create tables CLASSES, UNIONS, etc, with proper columns and primary keys. 4- Write your application using NJDX APIs. For example, the following code will insert a new object c1 of type AClass in the database: njdx.insert(c1, 0, null); The following code will fetch an AUnion object having name="someUnion" // oid below can be built dynamically using program variables Object oid = ObjectId.createObjectId("AUnion;name=someUnion"); AUnion myUnion = njdx.getObjectById(oid, true, 0, null); The following code will fetch all AClass objects into the myClasses variable. ArrayList myClasses = njdx.query("AClass", null, -1, 0, null); If you have defined some of your classes in a hierarchy, you can fetch all the qualifying objects of all the classes in that hierarchy with one query call. Essentially, your code will be more object-oriented and easier to evolve. And you will avoid all the complexities of SQL. Damodar Periwal Software Tree, Inc. Simplify Data Integration http://www.softwaretree.com -- modified at 18:55 Tuesday 24th January,

Jim Crafton

The database is intended to hold a standard description of potententially 10s of thousands of class types. It's going to be populated after scanning a directory (and potentially sub directories) and parsing all the c/c++ files there. So creating a table per class might get a little unwieldy. ¡El diablo está en mis pantalones! ¡Mire, mire! Real Mentats use only 100% pure, unfooled around with Sapho Juice(tm)! SELECT * FROM User WHERE Clue > 0 0 rows returned Save an Orange - Use the VCF!

Damodar Periwal

I did not mean creating one table per end-user class. In my example, the CLASSES table will hold information for all the instances of the AClass class. So essentially, you would be dealing with only a handful of tables - one for all the end-user classes, one for all the end-user structs, one for all the end-user unions, etc. Damodar Periwal Software Tree, Inc. Simplify Data Integration http://www.softwaretree.com

Jim Crafton

Well I tried this technique with SQLite. And got all the data entered, my main symbols table had over 11,000 entries, and the functions table over 7200. And getting things to join takes FOREVER (about 20 secs)!!! It's insanely slow (this on a P4 3 GHz with HT). So while I much prefer the organization of this method, it seems like it just isn't going to work well with sqlite. I suspect that this is a problem with sqlite. Just for kicks here's how I created the tables:

CREATE TABLE Types (
TypeID INTEGER PRIMARY KEY
, TypeInfo CHAR(125)
);

CREATE TABLE Templates (
TemplateID INTEGER PRIMARY KEY
, TemplateSignature CHAR(125)
);

CREATE TABLE Files (
FileID INTEGER PRIMARY KEY
, Path CHAR(255)
);

CREATE TABLE Symbols (
Id INTEGER PRIMARY KEY
, Kind INTEGER
, Name CHAR(125)
, FileID INTEGER
, TypeID INTEGER
, Offset INTEGER
, LineNumber INTEGER
, TemplateID INTEGER DEFAULT 0 NOT NULL
, Parent INTEGER DEFAULT 0
);

CREATE TABLE Functions (
FuncID INTEGER PRIMARY KEY
, Signature CHAR(125)
, SymbolID INTEGER
, ReturnTypeID INTEGER NOT NULL
);

I left out the FK stuff as sqlite ignores it apparently. For the query I tried:

select symbols.name,functions.signature from symbols, functions where
functions.symbolid = symbols.id;

It does eventually return, but it's impossible to use, so for now I think I'm just going to go with the single table approach. The query is a little more difficult to formulate, but it's *really* fast. ¡El diablo está en mis pantalones! ¡Mire, mire! Real Mentats use only 100% pure, unfooled around with Sapho Juice(tm)! SELECT * FROM User WHERE Clue > 0 0 rows returned Save an Orange - Use the VCF!

Jim Crafton

Well it apepars I spoke too soon. The problem was that I didn't have an index built for the functions.sybolid column. Once I did that the query completed instaneously! Woot! :) ¡El diablo está en mis pantalones! ¡Mire, mire! Real Mentats use only 100% pure, unfooled around with Sapho Juice(tm)! SELECT * FROM User WHERE Clue > 0 0 rows returned Save an Orange - Use the VCF!