XML and XSD into a SQL Database

.NET (Core and Framework)
#1 binarymax wrote:

Hello, through all my searching I can't seem to find the best way to do something that hopefully shouldn't be as hard as I am finding it to be. I have a set of open schema specifications (SportsML to be specific), an open database architecture (SportsDB), and lots of XML files that fit the schema and should intuitively map to SportsDB. What I want to do is load all the XML into the DB. I don't know the quickest/simplest route to have all this XML loaded as mapped rows and columns in the database. Here is the path I am on: generate typed datasets using XSD.exe (which is in itself a challenge, since I am having trouble with XSD on multiple schema files), hand-code all the mappings from all the DataSets to the DB, and then hand-code a SQL OPENXML command for each dataset. There are 124 individual tables in SportsDB. Is there an easier way? Is there a better way? I am interested in any feedback or ideas. Here is what I am using: Visual Studio 2008, SQL Server 2008. I am writing C# but I don't mind receiving suggestions in VB. Also, to make it easier for people in the future, I plan on documenting all this, making it open source, and putting it up as an article on this site. Many thanks!
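For context, one of those hand-coded OPENXML commands would look roughly like the sketch below. The row pattern and the dbo.team columns are illustrative guesses, not the real SportsML/SportsDB names; that mapping is exactly the part that would have to be repeated for all 124 tables.

    // Rough shape of one hand-coded OPENXML insert driven from C#.
    // The '/sports-content/team/team-metadata' path and the dbo.team columns
    // are placeholders, not the actual SportsML/SportsDB names.
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;

    static void LoadTeams(SqlConnection conn, string xmlPath)
    {
        const string sql = @"
            DECLARE @h int;
            EXEC sp_xml_preparedocument @h OUTPUT, @xml;

            INSERT INTO dbo.team (team_key, full_name)
            SELECT team_key, full_name
            FROM OPENXML(@h, '/sports-content/team/team-metadata', 1)
            WITH (team_key  varchar(128) '@team-key',
                  full_name varchar(128) 'name/@full');

            EXEC sp_xml_removedocument @h;";

        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            // sp_xml_preparedocument accepts the document as (n)text/(n)varchar.
            cmd.Parameters.Add("@xml", SqlDbType.NText).Value = File.ReadAllText(xmlPath);
            cmd.ExecuteNonQuery();
        }
    }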

#2 dojohansen wrote:

It sort of depends, doesn't it? I would probably go down a dynamic code generation route, partly because it's fun to do (it's real easy in .NET to compile code at run time and then run it), but also because it results in the fastest possible execution while being very flexible.

For example, you could take a reflection-based approach. What matters to you isn't so much the database types used, but the .NET types they map to. As far as I know there's no simple way to query this mapping from the ADO.NET providers, so the easy way is to do a "select top 0 * from [table]" and fill a DataTable, then loop through its columns and check the DataType property, which is the .NET type and not the native database type. Then you can generate an SQL command, such as an INSERT statement, knowing little more than the names of the columns and the table: INSERT [table] (col1, ..., colN) VALUES (@col1, ..., @colN); Of course, for this to work you do need to know about any identity or computed columns that cannot be inserted.

Then, generate C# code to create an SQL command, use the SQL as the CommandText, and add parameters to it for all the columns. And generate a *separate* method that takes an XmlNode and assigns the parameters of the command object. This will allow you to insert N records creating only one command instance (roughly as sketched at the end of this post).

What I've done before is have the C# code generator use reflection to see if the type used for each column has a Parse(string) method and also if it has the Parse(string, IFormatProvider) overload. If it has the latter I generate code to use it, passing CultureInfo.InvariantCulture, as I want all our files to be culture-independent; only *presentation* should depend on this. (I've had enough of Microsoft's French "CSV" (comma separated values) files. Because the comma is the decimal point in the French culture they changed the *file* format and use a semicolon (or whatever the "list separator" is in Windows Regional Settings) instead, meaning users of the English Excel version cannot open files saved by the French version unless you go and change the default settings...)

Obviously the generation itself is not super quick, depending on reflection and having to compile the results into an in-memory ("dynamic") assembly before use. But it's quick enough that I find it perfectly acceptable not to pre-compile the generated code, just cache the compiled type so it can be reused for the rest of the app's lifetime.

Other alternatives exist of course. If you have or want to have a business layer
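Here is a rough, non-generated sketch of that one-command-many-rows idea, just to show the shape. An already-open connection, the absence of identity/computed columns, and the convention that XML attribute names match column names are all assumptions made for the sketch; none of them hold for SportsML/SportsDB out of the box.

    // Sketch of the "select top 0 *" approach: discover the .NET column types,
    // build one parameterised INSERT, then re-bind it for every XML element.
    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Globalization;
    using System.Linq;
    using System.Xml;

    class XmlTableLoader
    {
        readonly SqlCommand cmd;
        readonly DataColumn[] cols;

        public XmlTableLoader(SqlConnection conn, string table)
        {
            // "select top 0" returns schema only; DataColumn.DataType is the .NET type.
            DataTable schema = new DataTable();
            using (var da = new SqlDataAdapter("SELECT TOP 0 * FROM [" + table + "]", conn))
                da.Fill(schema);

            cols = schema.Columns.Cast<DataColumn>().ToArray();
            string names  = string.Join(", ", cols.Select(c => "[" + c.ColumnName + "]").ToArray());
            string values = string.Join(", ", cols.Select(c => "@" + c.ColumnName).ToArray());

            // One command instance, reused for every record.
            cmd = new SqlCommand("INSERT [" + table + "] (" + names + ") VALUES (" + values + ")", conn);
            foreach (DataColumn c in cols)
                cmd.Parameters.Add(new SqlParameter("@" + c.ColumnName, DBNull.Value));
        }

        // The separate "assign the parameters from an XML node" step.
        public void Insert(XmlElement row)
        {
            for (int i = 0; i < cols.Length; i++)
            {
                string raw = row.GetAttribute(cols[i].ColumnName);
                // Convert.ChangeType stands in for the generated Parse() calls;
                // InvariantCulture keeps parsing culture-independent.
                cmd.Parameters[i].Value = raw.Length == 0
                    ? (object)DBNull.Value
                    : Convert.ChangeType(raw, cols[i].DataType, CultureInfo.InvariantCulture);
            }
            cmd.ExecuteNonQuery();
        }
    }

In the generated version, a class body like this would be emitted as a string per table and compiled once with CSharpCodeProvider.CompileAssemblyFromSource, then cached, so the per-row work is just straight-line generated code.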


#3 binarymax wrote:

Thanks for the response! I know what you mean about the nested classes and the look of the code. I can target 3.x, so that is definitely good to know. I had used generation before and there are some good tools out there, but none for free that solved this specific problem :) I had written some reflection before for a different problem in which I had only the XML and not any official XSD nor the DB. I might cave and buy an XMLSpy or similar license, but then I'll miss out on learning the trials and tribulations of this whole process. I have used MyGeneration and XSLT for going the other way (SQL DB to XML) and that was fine. But I think the problem this time is that I already have both the schema and the data model, which strangely enough seems harder than only having one of them! Getting them to fit together just seems a bit tedious.


#4 dojohansen wrote:

Also, I don't know the SportsML schemas or the database, but you might want to check whether the native ADO.NET DataSets can be made to load these XML documents. Upon re-reading your original posting it seems to me that your primary goal is to minimize the development effort, and DataSets might work quite well for that purpose (there's a minimal sketch of that route at the end of this post).

Another interesting possibility is to make a SportsML reader that implements the IDataReader interface; then you can use SqlBulkCopy to bulk insert the data. This gives you the best performance and is fairly easy to do if you can base the object on an existing IDataReader implementation.

For example, we wanted to copy some data efficiently from one DB to another. The data included some parent-child relations, so a foreign key in table Child would refer to a row in Parent, and the keys were IDENTITY values provided by each database, so they would differ between databases. To deal with this I wrote a simple TransformingDataReader class based on SqlDataReader, which already implements IDataReader. Since SqlDataReader is sealed I couldn't inherit from it, but I just embed a reader in my class and delegate most of the interface implementation to that object, like this:

    ...
    public void Close()
    {
        sqlDataReader.Close();
    }

    public DataTable GetSchemaTable()
    {
        return sqlDataReader.GetSchemaTable();
    }

    public bool NextResult()
    {
        return sqlDataReader.NextResult();
    }

    public void Dispose()
    {
        sqlDataReader.Dispose();
    }
    ...

The only thing I really wanted to change from SqlDataReader was that, when it reads, I'd like to be able to modify the data of the row just after it's read internally and before it's passed on to the consumer of the reader. This is easily done like this:

    SqlDataReader sqlDataReader;
    object[] currentRow;

    // Subclasses get to rewrite each row before it reaches the consumer.
    public abstract void Transform(object[] currentRow);

    public bool Read()
    {
        if (sqlDataReader.Read())
        {
            // Buffer the row, let the subclass transform it, then expose it.
            if (currentRow == null)
                currentRow = new object[sqlDataReader.FieldCount];
            sqlDataReader.GetValues(currentRow);
            Transform(currentRow);
            return true;
        }
        currentRow = null;
        return false;
    }

I like this solution since it's now very easy to make a reader that does transformations on the fly. For example, you could load a "correspondence" table if you know that in system A values 1, 2, 3 actually mean the same as X, Y, Z do in another system, and by implementing just Transform() you'd be able to do t
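To make the DataSet suggestion at the top of this reply concrete, a minimal sketch could look like the following. The schema and document file names, and the assumption that ReadXmlSchema produces tables whose names and columns line up with SportsDB, are hypothetical; with 124 tables you would loop rather than hard-code, and anything that doesn't match by name would need explicit ColumnMappings.

    // Minimal sketch: let ADO.NET infer the relational shape from the XSD,
    // load the XML document into it, then bulk copy each resulting DataTable.
    // File names, table names and the connection string are placeholders.
    using System.Data;
    using System.Data.SqlClient;

    static void LoadSportsMl(string connectionString)
    {
        DataSet ds = new DataSet("SportsML");
        ds.ReadXmlSchema("sportsml-core.xsd");   // hypothetical schema file
        ds.ReadXml("match-report.xml");          // hypothetical document

        using (SqlBulkCopy bulk = new SqlBulkCopy(connectionString))
        {
            foreach (DataTable table in ds.Tables)
            {
                if (table.Rows.Count == 0) continue;

                // Assumes the dataset table name matches the SportsDB table;
                // otherwise set DestinationTableName and the mappings by hand.
                bulk.DestinationTableName = table.TableName;
                bulk.ColumnMappings.Clear();
                foreach (DataColumn col in table.Columns)
                    bulk.ColumnMappings.Add(col.ColumnName, col.ColumnName);

                bulk.WriteToServer(table);
            }
        }
    }

If SportsDB enforces its foreign keys, parent tables would also have to be copied before their children, so the naive loop above is only a starting point.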
