Ultimate Database Fail [updated]
-
I wouldn't think the clustering vs. non-clustering distinction would be a very big problem. The open_quantity and order_type index could be useful if the goal is, for example, to process all orders which have not yet been processed: rather than scan 23,000,000 rows, you'd only have to process the 10 unprocessed orders. You sure you found the problem?
The open_quantity is never a search criterion. Its value is 0 99.99% of the time, and we never search by open_quantity or order_type; if we did, the order_id and date would be included as well. Have you ever inserted large sets of data into a table with a clustered index where the inserted rows had to be physically placed in the middle of the table because of the index? It's quite ugly. I'm not suggesting eliminating the clustered index, but there's no reason the data needs to be physically stored in order of open quantity and order type.
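Roughly the kind of change I mean, as a sketch only; the table and column names below (dbo.orders, order_id, order_date, etc.) are guesses based on the fields mentioned above, the index names are made up, and it assumes order_id is unique and your SQL Server version supports filtered indexes:

-- Assumed: a ~23M-row orders table currently clustered on (open_quantity, order_type).
-- Cluster on the ever-increasing key instead so new rows append at the end of the table,
-- and keep a small filtered nonclustered index for the handful of still-open orders.
DROP INDEX CIX_orders_open_quantity_order_type ON dbo.orders;
CREATE UNIQUE CLUSTERED INDEX CIX_orders_order_id ON dbo.orders (order_id);
CREATE NONCLUSTERED INDEX IX_orders_open ON dbo.orders (order_type, order_date)
    WHERE open_quantity > 0;  -- filtered: only rows still waiting to be processed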
"Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! What a Ride!" — Hunter S. Thompson
-
The open_quantity is never a search criterion. Its value is 0 99.99% of the time, and we never search by open_quantity or order_type; if we did, the order_id and date would be included as well. Have you ever inserted large sets of data into a table with a clustered index where the inserted rows had to be physically placed in the middle of the table because of the index? It's quite ugly. I'm not suggesting eliminating the clustered index, but there's no reason the data needs to be physically stored in order of open quantity and order type.
"Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! What a Ride!" — Hunter S. Thompson
You are right; there's not really a good reason to store that type of data using a clustered index. Still, I don't see anything you've come across that would fix the problem you described. Would inserting a value with a bad clustered index cause all the rows in the table to be moved so that the physical location on the hard drive is strictly enforced (i.e., so rows will be ordered on the disk by the clustered index)? Or is SQL smart enough to insert the data without rearranging it (i.e., by using a tree-like structure)?
-
You are right; there's not really a good reason to store that type of data using a clustered index. Still, I don't see anything you've come across that would fix the problem you described. Would inserting a value with a bad clustered index cause all the rows in the table to be moved so that the physical location on the hard drive is strictly enforced (i.e., so rows will be ordered on the disk by the clustered index)? Or is SQL smart enough to insert the data without rearranging it (i.e., by using a tree-like structure)?
Clustered indexes physically store the rows in the order defined by the index. As far as I know, it can be quite taxing on resources, so I'm not sure what logic they use for it.
"Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! What a Ride!" — Hunter S. Thompson
-
You are right; there's not really a good reason to store that type of data using a clustered index. Still, I don't see anything you've come across that would fix the problem you described. Would inserting a value with a bad clustered index cause all the rows in the table to be moved so that the physical location on the hard drive is strictly enforced (i.e., so rows will be ordered on the disk by the clustered index)? Or is SQL smart enough to insert the data without rearranging it (i.e., by using a tree-like structure)?
A clustered index is by definition used to physically order the table data in the file. Non-clustered indexes are only ordered by the data in the index. http://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics/
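If you want to see that in the engine itself, here's a generic query ('dbo.orders' is just a placeholder for whatever table you're looking at):

-- index_id 0 = heap, 1 = the clustered index (the table data itself), 2+ = nonclustered indexes
SELECT i.name, i.index_id, i.type_desc, p.rows
FROM sys.indexes AS i
JOIN sys.partitions AS p
    ON p.object_id = i.object_id AND p.index_id = i.index_id
WHERE i.object_id = OBJECT_ID('dbo.orders');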
Curvature of the Mind now with 3D
-
A clustered index is by definition used to physically order the table data in the file. Non-clustered indexes are only ordered by the data in the index. http://www.simple-talk.com/sql/learn-sql-server/sql-server-index-basics/
Thanks, that is what I suspected. Looks like the row data is stored with the clustered index. Since the indexes, including the clustered index, are stored as B-trees, there should indeed be no huge performance hit (i.e., all the data isn't going to have to be reorganized on disk if a row is inserted into the middle of the clustered index). The only time a performance hit would occur would be during a reindexing of the table (and even then, the hit would turn an O(1) operation into an O(1 * x) operation, and since x is effectively constant, that reduces to an O(1) operation). Since those shouldn't be occurring very often, that should not cause the problem described by the OP.
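For what it's worth, the reindexing mentioned here would look something like this in SQL Server (table name assumed), and it's the only operation in this picture that rewrites the whole B-tree:

-- REORGANIZE is the lightweight online defrag; REBUILD rewrites the index from scratch.
ALTER INDEX ALL ON dbo.orders REORGANIZE;
-- or, during a maintenance window:
ALTER INDEX ALL ON dbo.orders REBUILD WITH (FILLFACTOR = 90);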
-
Clustered indexes physically store the rows in the order defined by the index. As far as I know, it can be quite taxing on resources, so I'm not sure what logic they use for it.
"Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! What a Ride!" — Hunter S. Thompson
See here. Basically, the row data is stored with the clustered index, which is stored as a B-tree, so insertion is not an expensive operation (i.e., this should not be the cause of your problem).
-
Thanks, that is what I suspected. Looks like the row data is stored with the clustered index. Since the indexes, including the clustered index, are stored as B-trees, there should indeed be no huge performance hit (i.e., all the data isn't going to have to be reorganized on disk if a row is inserted into the middle of the clustered index). The only time a performance hit would occur would be during a reindexing of the table (and even then, the hit would turn an O(1) operation into an O(1 * x) operation, and since x is effectively constant, that reduces to an O(1) operation). Since those shouldn't be occurring very often, that should not cause the problem described by the OP.
What happens is the pages fill up and then get split as records are inserted in the middle. Splitting pages is significantly slower than just appending records at the end. It's not the O(whatever) complexity, it's the big-ass constant cost of re-arranging data on a disk.
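A sketch of the usual mitigations, with made-up index and table names: leave free space in the leaf pages so splits happen less often, and watch the server-wide counter to see how often they're actually happening.

-- Leave 20% free space per leaf page so out-of-order inserts have somewhere to land.
ALTER INDEX CIX_orders ON dbo.orders REBUILD WITH (FILLFACTOR = 80);

-- Watch page-split activity server-wide.
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page Splits/sec';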
Curvature of the Mind now with 3D
-
Thanks, that is what I suspected. Looks like the row data is stored with the clustered index. Since the indexes, including the clustered index, are stored as B-trees, there should indeed be no huge performance hit (i.e., all the data isn't going to have to be reorganized on disk if a row is inserted into the middle of the clustered index). The only time a performance hit would occur would be during a reindexing of the table (and even then, the hit would turn an O(1) operation into an O(1 * x) operation, and since x is effectively constant, that reduces to an O(1) operation). Since those shouldn't be occurring very often, that should not cause the problem described by the OP.
I am only speaking from experience here: inserting large amounts of data at a high rate into the middle of a table with a seemingly random clustered index can really hurt performance. I've done it in the past at ~10-30k rows per second on tables of ~100m to 1b rows. With a clustered index like that, pushing the data into the table made queries nearly unusable, or at the very least lengthy; however, doing time-series-based searches on a table with a time-series clustered index was fine and allowed for a realtime monitoring solution.
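The "good" case there looked roughly like this (table and column names invented): cluster on the time column so every insert appends to the last page instead of landing somewhere random in the middle.

-- Time-ordered inserts append to the end of the clustered index, so no mid-table page splits.
CREATE CLUSTERED INDEX CIX_readings_time
    ON dbo.readings (reading_time, reading_id);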
"Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! What a Ride!" — Hunter S. Thompson
-
What happens is the pages fill up and then get split as records are inserted in the middle. Splitting pages is significantly slower than just appending records at the end. It's not the O(whatever) complexity, it's the big-ass constant cost of re-arranging data on a disk.
A page is 8KB. That's not going to take long to move around. The time to split a page also averages out in the long run to O(1). Say a page holds 50 records: it may split on a single record insertion, but the other 49 insertions would not cause a split, so the average time per insertion would be O(1). Now, if SQL sorts data within the page for each insert, that may slow things down a bit, as the entire page would essentially be rewritten each time randomly ordered data got inserted. Not sure exactly how it works at that low a level. Even in this case, though, SQL should be able to optimize things. For example, if records are inserted in batches, adjacent records need not cause multiple sorts within a page... SQL can figure out the positions within the page and sort all the data at once before writing it to disk. And in the case that many small insertions are performed, those individual inserts shouldn't take long enough to slow other stuff down. Of course, transaction locks and such could bork things up, but that's a different issue.
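The batch case described here is basically this sketch (the staging table name is hypothetical): pre-sort the batch by the clustered key so the engine fills pages sequentially rather than revisiting them in random order.

INSERT INTO dbo.orders WITH (TABLOCK) (order_id, order_date, open_quantity, order_type)
SELECT order_id, order_date, open_quantity, order_type
FROM dbo.orders_staging
ORDER BY order_id;  -- matches the assumed clustered key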
-
I am only speaking from experience here: inserting large amounts of data at a high rate into the middle of a table with a seemingly random clustered index can really hurt performance. I've done it in the past at ~10-30k rows per second on tables of ~100m to 1b rows. With a clustered index like that, pushing the data into the table made queries nearly unusable, or at the very least lengthy; however, doing time-series-based searches on a table with a time-series clustered index was fine and allowed for a realtime monitoring solution.
"Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! What a Ride!" — Hunter S. Thompson
Well, the clustered index you talked about is certainly not ideal and is less performant than it should be. I suppose the magnitude of the problem depends on the exact amount of data you are using and the frequency of inserts/updates/reads/deletes. I guess the "ultimate" fail here would be that the others you work with didn't find the problem sooner.
-
A page is 8KB. That's not going to take long to move around. The time to split a page also averages out in the long run to O(1). Say a page holds 50 records: it may split on a single record insertion, but the other 49 insertions would not cause a split, so the average time per insertion would be O(1). Now, if SQL sorts data within the page for each insert, that may slow things down a bit, as the entire page would essentially be rewritten each time randomly ordered data got inserted. Not sure exactly how it works at that low a level. Even in this case, though, SQL should be able to optimize things. For example, if records are inserted in batches, adjacent records need not cause multiple sorts within a page... SQL can figure out the positions within the page and sort all the data at once before writing it to disk. And in the case that many small insertions are performed, those individual inserts shouldn't take long enough to slow other stuff down. Of course, transaction locks and such could bork things up, but that's a different issue.
It all depends. If the entire table can fit inside the memory of your SQL Server and there isn't a lot of contention, then it doesn't really matter one way or the other. If you have 32 GB of memory in your server and 50 databases on it with about 1 TB of data, then it makes the difference between inserting 5 million records in a few minutes or in a few days. For another server it made the difference between inserting 35,000 records a second for 22 hours straight and starting to slow down after a few hours.
Curvature of the Mind now with 3D
-
Well, the clustered index you talked about is certainly not ideal and is less performant than it should be. I suppose the magnitude of the problem depends on the exact amount of data you are using and the frequency of inserts/updates/reads/deletes. I guess the "ultimate" fail here would be that the others you work with didn't find the problem sooner.
A lot of the time has to do with speculative reads missing and the random read/write access on the disks pushing them from <10ms access times to over 100ms. SQL Server optimizes its accesses for overall throughput, and that can drive latency to ridiculous levels. As much as it doesn't sound like it should be a big deal, I can tell you from real-world experience that it is.
Curvature of the Mind now with 3D
-
It all depends. If the entire table can fit inside the memory of your SQL Server and there isn't a lot of contention, then it doesn't really matter one way or the other. If you have 32 GB of memory in your server and 50 databases on it with about 1 TB of data, then it makes the difference between inserting 5 million records in a few minutes or in a few days. For another server it made the difference between inserting 35,000 records a second for 22 hours straight and starting to slow down after a few hours.
Andy Brummer wrote:
inserting 35,000 records a second for 22 hours straight
Inserting nearly 3 billion records is going to take a while no matter what you do.
Andy Brummer wrote:
it makes the difference between inserting 5 million records in a few minutes or in a few days.
I don't think it would cause a 1,000x slowdown. 10x, maybe... but 1,000x? That doesn't sound right.
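Quick sanity check on the numbers (cast to BIGINT to avoid integer overflow):

-- 35,000 rows/sec sustained for 22 hours really is close to 3 billion rows.
SELECT CAST(35000 AS bigint) * 60 * 60 * 22 AS rows_inserted;  -- 2,772,000,000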
-
I wouldn't think the clustering vs. non-clustering distinction would be a very big problem. The open_quantity and order_type index could be useful if the goal is, for example, to process all orders which have not yet been processed: rather than scan 23,000,000 rows, you'd only have to process the 10 unprocessed orders. You sure you found the problem?
I don't know why this was downvoted; hopefully I corrected it enough.
"Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! What a Ride!" — Hunter S. Thompson