SQL - Handling Large Volumes of Data [modified]

    MrEyes
    #1

More of a high-level question than a specific C# question. I need to put together an application that sits on top of what I consider to be a very large database. This database mainly consists of the following tables:

    Table 1
    id (int, primary key, index)
    value1 (varchar 50)
    value2 (varchar 50)
    value3 (varchar 50)
    value4 (varchar 50)
    value5 (varchar 50)
    value6 (varchar 50)
    value7 (int)
    value8 (int)
    value9 (int)

    Table 2
    id (primary key, index)
    table1_id (int, foreign key to table1 id)
    value1 (varchar 50)
    value2 (varchar 50)
    value3 (int)

Currently table1 has around 350 million rows and table2 has around 900 million rows, so as you can see the volumes of data here are fairly large. My application needs to be able to search these tables on any of the data columns and display the results in a datagrid (or other suitable control); obviously a search could return a large result set. Users should also be able to update a single row or a batch of rows.

The initial version of this application (coded before the database became so huge) just used a simple single-line list view and worked perfectly. Now the volume of data is so large that I have had to restrict search results to 500 rows, which is annoying end users, so I need to find a viable alternative. The questions are as follows:

1) What is the best approach for handling this in a C#-based application? After googling for most of today I have come across a number of "solutions", but I do not want to go steaming down the track of one of them only to discover that it is slow.

2) Although not specifically C# related, could this table structure be optimised in any way? As an example, I am considering normalising the varchar data.

-- modified at 19:37 Wednesday 5th July, 2006


      Alexander Wiseman
      #2

Hello, I don't have an answer to your second question right now, but I can make a suggestion for the first one. My suggestion is to use a grid control that lets you implement a "virtual grid". Basically, a virtual grid does not store the data for any of its cells in memory. Instead, when the data for a cell needs to be displayed, the control makes a request (some sort of callback function) and retrieves the data that way. This would allow you to keep only a small part of your database in memory at a time - namely, just as much data as the user can see at once.

Here is basically how I see it working (I have done a virtual grid pulling from a database before, though my implementation was in C++ and probably not the most efficient one):

1) The application loads and so does the grid.
2) The grid requests data for the first hundred rows.
3) Your app queries the database, gets back a data set with those hundred rows and hands it to the grid view. The SQL pull is fast because it is SQL, and displaying the items in the grid is fast because they are in memory. [Now there are only one hundred rows in memory.]
4) The user scrolls down to the next hundred rows. The first hundred rows are now hidden; the next hundred must be displayed.
5) The grid requests data for the second hundred rows.
6) Your app queries the database, gets back a data set with the second hundred rows and hands it to the grid view. This takes about the same amount of time as the first hundred rows. [There are still only one hundred rows in memory.]

And so on: as the user scrolls or searches, you simply pull back the data that needs to be displayed and pass it to the grid. The great thing about doing it this way is that everything takes roughly the same amount of time and is fairly quick, even as your database grows. Obviously the more rows you wish to display at once, the longer it will take and the more memory it will use, but this isn't really an issue, because the number of rows that can fit on the screen is limited. Now, I haven't built an application that has had to deal with the volume of data you are working with, but conceptually I think the idea works, so please take this simply as a suggestion and not an expert opinion. For a C# grid which supports virtual mode, just do a search on this site. One such grid is SourceGrid.
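A minimal sketch of the virtual-grid idea in C#, assuming a WinForms DataGridView in virtual mode over SQL Server 2005 (ROW_NUMBER() paging); the connection string, page size and the captured_traps columns are placeholder assumptions, not part of the post above:

```csharp
// Virtual-mode DataGridView that fetches one page of rows at a time.
// Assumptions: .NET 2.0 WinForms, SQL Server 2005+, and the captured_traps
// schema described later in the thread; names and sizes are illustrative.
using System;
using System.Data;
using System.Data.SqlClient;
using System.Windows.Forms;

public class TrapGridForm : Form
{
    private const int PageSize = 100;
    private const string ConnStr = "Server=.;Database=Traps;Integrated Security=SSPI";

    private readonly DataGridView grid = new DataGridView();
    private DataTable currentPage;        // cache of the page currently on screen
    private int currentPageIndex = -1;

    public TrapGridForm()
    {
        grid.Dock = DockStyle.Fill;
        grid.VirtualMode = true;          // grid asks for cell values on demand
        grid.ReadOnly = true;
        grid.AllowUserToAddRows = false;
        grid.Columns.Add("oid", "OID");
        grid.Columns.Add("source_ip", "Source IP");
        grid.Columns.Add("trap_entered", "Entered");
        grid.CellValueNeeded += OnCellValueNeeded;
        Controls.Add(grid);

        // Tell the grid how many rows exist without loading any of them.
        grid.RowCount = GetTotalRowCount();
    }

    private static int GetTotalRowCount()
    {
        using (SqlConnection conn = new SqlConnection(ConnStr))
        using (SqlCommand cmd = new SqlCommand("SELECT COUNT(*) FROM captured_traps", conn))
        {
            conn.Open();
            return (int)cmd.ExecuteScalar();
        }
    }

    private void OnCellValueNeeded(object sender, DataGridViewCellValueEventArgs e)
    {
        int pageIndex = e.RowIndex / PageSize;
        if (pageIndex != currentPageIndex)
        {
            currentPage = LoadPage(pageIndex);   // fetch only the visible window
            currentPageIndex = pageIndex;
        }

        int rowInPage = e.RowIndex % PageSize;
        if (rowInPage < currentPage.Rows.Count)
            e.Value = currentPage.Rows[rowInPage][e.ColumnIndex];
    }

    private static DataTable LoadPage(int pageIndex)
    {
        // ROW_NUMBER() paging; needs SQL Server 2005 or later.
        const string sql = @"
            SELECT oid, source_ip, trap_entered FROM (
                SELECT oid, source_ip, trap_entered,
                       ROW_NUMBER() OVER (ORDER BY id) AS rn
                FROM captured_traps
            ) AS t
            WHERE rn BETWEEN @first AND @last";

        using (SqlConnection conn = new SqlConnection(ConnStr))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@first", pageIndex * PageSize + 1);
            cmd.Parameters.AddWithValue("@last", (pageIndex + 1) * PageSize);
            DataTable page = new DataTable();
            new SqlDataAdapter(cmd).Fill(page);
            return page;
        }
    }

    [STAThread]
    static void Main()
    {
        Application.Run(new TrapGridForm());
    }
}
```

The same pattern works with whatever WHERE clause the user builds: count the matches once, set RowCount, and page through the filtered result set as the user scrolls.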


        Daniel Turini
        #3

        MrEyes wrote:

2) Although not specifically C# related, could this table structure be optimised in any way? As an example, I am considering normalising the varchar data.

Probably yes, but without knowing what "value1", "value2", etc. mean, it will be very hard to suggest anything. A lot of varchar(50) columns in a SQL Server database always smells like someone did not spend much time designing it... You have a lot of varchar fields, which would be easier to search if you used a full-text indexing solution, like the one that comes with SQL Server (you didn't mention your specific DBMS).
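As a rough illustration of the full-text suggestion, a minimal sketch of running such a search from C#; it assumes a full-text index has already been created on the column being queried (the table1/value1 names are just the placeholders from the original post), otherwise CONTAINS will fail:

```csharp
// Search one of the varchar columns via SQL Server full-text CONTAINS.
// Assumption: a full-text catalog/index already exists on table1.value1.
using System.Data;
using System.Data.SqlClient;

public static class FullTextSearch
{
    public static DataTable Search(string connStr, string term)
    {
        const string sql =
            "SELECT id, value1 FROM table1 WHERE CONTAINS(value1, @term)";

        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@term", term);   // e.g. "\"linkDown\""
            DataTable results = new DataTable();
            new SqlDataAdapter(cmd).Fill(results);
            return results;
        }
    }
}
```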

        MrEyes wrote:

The initial version of this application (coded before the database became so huge) just used a simple single-line list view and worked perfectly. Now the volume of data is so large that I have had to restrict search results to 500 rows, which is annoying end users, so I need to find a viable alternative.

You can raise this limit a lot (probably easily up to 10,000 or 50,000 rows, but I don't know your hardware) if you use the list view in virtual mode, as you won't need to add items to the list view up front. Bear in mind that giving your users 50,000 rows is a pseudo-solution. It's a lot of data to search manually. If you spend 1 second reading or understanding each row, it can take up to 14 hours to read everything. You'll need to give your users a decent UI, automatically clustering or grouping results, or some way to refine searches.
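A minimal sketch of the virtual-mode ListView idea, assuming WinForms; FetchRow is a hypothetical helper that would page rows out of the database (for example with a ROW_NUMBER-style query), caching a page at a time rather than hitting the database once per row:

```csharp
// ListView in virtual mode: items are created only when they scroll into view.
using System.Windows.Forms;

public static class VirtualListViewSetup
{
    public static void Configure(ListView listView, int totalRows)
    {
        listView.View = View.Details;
        listView.Columns.Add("OID", 200, HorizontalAlignment.Left);
        listView.Columns.Add("Source IP", 120, HorizontalAlignment.Left);
        listView.Columns.Add("Entered", 140, HorizontalAlignment.Left);

        listView.VirtualMode = true;            // supply items on demand
        listView.VirtualListSize = totalRows;   // e.g. 50,000 matching rows

        listView.RetrieveVirtualItem += delegate(object sender, RetrieveVirtualItemEventArgs e)
        {
            // Only rows the user can actually see are ever materialised.
            e.Item = new ListViewItem(FetchRow(e.ItemIndex));
        };
    }

    // Hypothetical paged lookup; in practice cache a page of rows per query.
    private static string[] FetchRow(int index)
    {
        return new string[] { "oid " + index, "source_ip", "trap_entered" };
    }
}
```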

        I don't see dead pixels anymore... Yes, even I am blogging now!


          MrEyes
          #4

          Daniel Turini wrote:

Probably yes, but without knowing what "value1", "value2", etc. mean, it will be very hard to suggest anything. A lot of varchar(50) columns in a SQL Server database always smells like someone did not spend much time designing it... You have a lot of varchar fields, which would be easier to search if you used a full-text indexing solution, like the one that comes with SQL Server (you didn't mention your specific DBMS).

Basically this is an MS SQL database that stores unfiltered incoming SNMP traps from a very large array of devices (6000+). The table model I have inherited is as follows:

          [captured_traps]
          [id] [int] IDENTITY (1, 1) NOT NULL ,
          [snmp_version] [int] NULL
          [community_name] [varchar] (255)
          [packet_type] [varchar] (50)
          [oid] [varchar] (500)
          [source_ip] [varchar] (15)
          [generic] [int] NULL
          [specific] [int] NULL
          [time_stamp] [varchar] (15)
          [trap_entered] [datetime] NULL
          [status] [int] NULL

          [captured_varbinds]
          [id] [int] IDENTITY (1, 1) NOT NULL
          [trap_id] [int] NULL
          [varbind_oid] [varchar] (500)
[varbind_text] [varchar] (500)

Storing the data as strings is, I believe, unavoidable; otherwise it would not be possible to search it. However, the design has a lot of repeated data in some fields, and I believe that normalising the varchars would give a significant performance enhancement. Or, as you suggested, using full-text indexing.

          Daniel Turini wrote:

          Bear in mind that giving your users 50,000 rows is a pseudo-solution. It's a lot of data to search manually. If you spend 1 second reading or understanding each row, it can take up to 14 hours to read everything. You'll need to give your users a decent UI, automatically clustering or grouping results, or some way to refine searches.

This is very true, and most searches do not create massive result sets. However, the app allows users to search on any field or combination of fields within any date range. More often than not the search is for a specific OID on a specific day, but the app needs to be able to support a user running a search for a specific OID received in the last 3 years.


            Daniel Turini
            #5

            MrEyes wrote:

Storing the data as strings is, I believe, unavoidable; otherwise it would not be possible to search it. However, the design has a lot of repeated data in some fields, and I believe that normalising the varchars would give a significant performance enhancement. Or, as you suggested, using full-text indexing.

I don't know SNMP well enough to be sure of what I'm saying without looking at the data, but some fields feel like they deserve a separate table:

packet_type -> maybe a packet type table?
oid -> is it something like a MAC address or a unique object ID? If so, you should have an "object" table.
snmp_version -> this seems like an attribute of the "object" table, as it looks 1:1 related to the OID.
source_ip and community_name -> aren't those somehow related to the OID too?
source_ip -> you can (and maybe should) store IPs as INTs.
time_stamp -> this is a count of seconds since some arbitrary point in time; you could use a numeric field to store it, and it will be way smaller.

This may make your database way faster to search, as it will be much smaller, and you'll only need to search the smaller varchar tables (maybe hundreds or thousands of records) before going to the main table (captured_traps). But logging may become a bit slower; only a test can say...
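As a rough sketch of the "store IPs as INTs" point, assuming IPv4 only (which matches the varchar(15) source_ip column above): pack the dotted-quad string into a 32-bit value before inserting and unpack it for display. Note that addresses of 128.0.0.0 and above come out negative in a signed 32-bit INT column, which is harmless for storage but worth knowing when querying:

```csharp
// Convert between dotted-quad IPv4 strings and a 32-bit int for storage.
using System.Net;

public static class IpAsInt
{
    public static int ToInt32(string dottedQuad)
    {
        byte[] b = IPAddress.Parse(dottedQuad).GetAddressBytes();
        return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
    }

    public static string ToDottedQuad(int value)
    {
        return string.Format("{0}.{1}.{2}.{3}",
            (value >> 24) & 0xFF, (value >> 16) & 0xFF,
            (value >> 8) & 0xFF, value & 0xFF);
    }
}

// Example: IpAsInt.ToInt32("10.1.2.3") == 0x0A010203
```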

            I don't see dead pixels anymore... Yes, even I am blogging now!
