Naming my clustered index column which isn't the PK (and is a non-PK clustered Identity a good idea?)
-
> I believe you're mixing up indexes with keys. Perhaps - let me try again; maybe I didn't convey this well. The index isn't just for lookup speed, it is a unique index (same functionality as a unique constraint unless I'm missing something... I'm more of an 'app dba' than a 'real DBA') to enforce referential integrity because... referential integrity. The addition of the Identity field was due to my concern about page splits. The guid will remain the primary key for the table. The guid is generated within a vendor's application so there isn't much I can do about that at the moment. I know (or at least 'think') that fill factor is only applied at rebuild or creation. This troubled me since guids are random (in the sense that where a newly generated guid falls numerically in relation to an existing list of guids is a random position). Inserts will fill up any fill factor <100 and then splits start (and the splits are equivalent to fill factor = 100 as the table grows). So I came up with the idea of a Clustered Identity as the 'physical sorting field' (numerically increasing won't generate page split) because it seems a better idea than (1) guid inserts causing page splits, and (2) better than making the table a heap (without a clustered index). Someone mentioned the volume being low. Yes, last month was 3,000 but this will be expanded from 1 to 5 categories. This could cause 3,100 entries next month or 31,000 entries. I don't have the time of day of the inserts but it is all first shift and (I think) mostly at the start of the month. If I can cause less work for the db, it feels like I should. One new piece of information that I'll share: testing this I don't see as much of a negative impact from a pk clustered guid as I expected.
JChrisCompton wrote:
The index isn't just for lookup speed, it is a unique index (same functionality as a unique constraint unless I'm missing something... I'm more of an 'app dba' than a 'real DBA') to enforce referential integrity because... referential integrity.
You're not missing anything, but you need to differentiate the purpose. A key, (Unique or primary) needs a unique index for its function, and while it's enough to create a unique index to enforce the uniqueness of a column(s) you should really create a unique (or primary) key to show the intent if nothing else.
From the help file
Creating a unique index guarantees that any attempt to duplicate key values fails. There are no significant differences between creating a UNIQUE constraint and creating a unique index that is independent of a constraint. Data validation occurs in the same manner, and the query optimizer does not differentiate between a unique index created by a constraint or manually created. However, you should create a UNIQUE constraint on the column when data integrity is the objective. This makes the objective of the index clear. (emphasis mine.)
There's also another slight difference, on a unique index you can actually add the option "Ignore duplicate values", this is not allowed for a constraint.
JChrisCompton wrote:
So I came up with the idea of a Clustered Identity as the 'physical sorting field'
This is a good idea, but it would need to be the primary key in that case. Which doesn't matter at all, from a relational point of view there's no difference between a primary key and a unique key. It's only about how the data is stored in the case of clustered tables and magnetic tapes. :)
JChrisCompton wrote:
Someone mentioned the volume being low. Yes, last month was 3,000 but this will be expanded from 1 to 5 categories. This could cause 3,100 entries next month or 31,000 entries
To put it in relation, at my job we insert approximately 300000 rows every week on a clustered table with 300 columns and a semi-random natural key. It takes a couple of minutes.
Wrong is evil and must be defeated. - Jeff Ello
-
JChrisCompton wrote:
I know (or at least 'think')
Specify Fill Factor for an Index | Microsoft Docs[^] :thumbsup:
Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^] "If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
Yes, it says "The fill-factor option is provided for... [when] an index is created or rebuilt..." "I think" was intended to mean "I think fill-factor has no influence except during create/rebuild". The issue that concerns me is page splits. From memory of a MSFT SQL Server Performance Tuning class I took five years ago, I think tables with a clustered index are a B+ tree with doubly linked leaf nodes. (class and school were long ago :–) When an insert happens, and the bottom node of the tree is full, then a split happens. When SSvr splits a node, my understanding is that it just makes a new node - the fill-factor doesn't enter in to it. If I specify my fill factor to be "only fill 1/3" it doesn't matter - a new node is created (instead of 2 extra nodes being created then each adjusted to 1/3 full). This is done without rebalancing, because it is the fastest solution. Is my mental model right, or are there things in newer versions that makes 'real life' different? (maybe some auto tuning feature I've missed)
-
Yes, it says "The fill-factor option is provided for... [when] an index is created or rebuilt..." "I think" was intended to mean "I think fill-factor has no influence except during create/rebuild". The issue that concerns me is page splits. From memory of a MSFT SQL Server Performance Tuning class I took five years ago, I think tables with a clustered index are a B+ tree with doubly linked leaf nodes. (class and school were long ago :–) When an insert happens, and the bottom node of the tree is full, then a split happens. When SSvr splits a node, my understanding is that it just makes a new node - the fill-factor doesn't enter in to it. If I specify my fill factor to be "only fill 1/3" it doesn't matter - a new node is created (instead of 2 extra nodes being created then each adjusted to 1/3 full). This is done without rebalancing, because it is the fastest solution. Is my mental model right, or are there things in newer versions that makes 'real life' different? (maybe some auto tuning feature I've missed)
I never did trees in school.
JChrisCompton wrote:
The issue that concerns me is page splits.
From the page I linked; "When an index is created or rebuilt, the fill-factor value determines the percentage of space on each leaf-level page to be filled with data, reserving the remainder on each page as free space for future growth." "How do you avoid them? There are strategies for doing this, one such strategy is to use a low fill factor when you create and index, but that will be at the expense of read operations and because you usually have more reads to writes then that’s bad." What is a page split? What happens? Why does it happen? Why worry? - Tony Rogerson's ramblings on SQL Server[^]
Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^] "If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
-
JChrisCompton wrote:
The index isn't just for lookup speed, it is a unique index (same functionality as a unique constraint unless I'm missing something... I'm more of an 'app dba' than a 'real DBA') to enforce referential integrity because... referential integrity.
You're not missing anything, but you need to differentiate the purpose. A key, (Unique or primary) needs a unique index for its function, and while it's enough to create a unique index to enforce the uniqueness of a column(s) you should really create a unique (or primary) key to show the intent if nothing else.
From the help file
Creating a unique index guarantees that any attempt to duplicate key values fails. There are no significant differences between creating a UNIQUE constraint and creating a unique index that is independent of a constraint. Data validation occurs in the same manner, and the query optimizer does not differentiate between a unique index created by a constraint or manually created. However, you should create a UNIQUE constraint on the column when data integrity is the objective. This makes the objective of the index clear. (emphasis mine.)
There's also another slight difference, on a unique index you can actually add the option "Ignore duplicate values", this is not allowed for a constraint.
JChrisCompton wrote:
So I came up with the idea of a Clustered Identity as the 'physical sorting field'
This is a good idea, but it would need to be the primary key in that case. Which doesn't matter at all, from a relational point of view there's no difference between a primary key and a unique key. It's only about how the data is stored in the case of clustered tables and magnetic tapes. :)
JChrisCompton wrote:
Someone mentioned the volume being low. Yes, last month was 3,000 but this will be expanded from 1 to 5 categories. This could cause 3,100 entries next month or 31,000 entries
To put it in relation, at my job we insert approximately 300000 rows every week on a clustered table with 300 columns and a semi-random natural key. It takes a couple of minutes.
Wrong is evil and must be defeated. - Jeff Ello
Just so you know (in the event that I ask a question here again) those inserts would be against the same database server which supports thousands of real-time users. I was experimenting because it seemed needlessly inefficient to have a clustered char(36) as the key. :wtf: Turns out that all my questions may be moot (aside from my own learning) :sigh: For review exec sp_helpindex <table name> gives the following description
table #1
clustered, unique, primary key located on PRIMARY guid columntable #2
clustered located on PRIMARY the int column
nonclustered, unique, primary key located on PRIMARY guid columnThere is no detectable difference on insert speed when table size is <100k. There is a difference, but it seems trivial, when the table contains more than two million rows (14 second difference when inserting 300k rows in each table) P.S. Love the magnetic tape reference. P.P.S. > a clustered table with 300 columns and a semi-random natural key Yikes!
-
Just so you know (in the event that I ask a question here again) those inserts would be against the same database server which supports thousands of real-time users. I was experimenting because it seemed needlessly inefficient to have a clustered char(36) as the key. :wtf: Turns out that all my questions may be moot (aside from my own learning) :sigh: For review exec sp_helpindex <table name> gives the following description
table #1
clustered, unique, primary key located on PRIMARY guid columntable #2
clustered located on PRIMARY the int column
nonclustered, unique, primary key located on PRIMARY guid columnThere is no detectable difference on insert speed when table size is <100k. There is a difference, but it seems trivial, when the table contains more than two million rows (14 second difference when inserting 300k rows in each table) P.S. Love the magnetic tape reference. P.P.S. > a clustered table with 300 columns and a semi-random natural key Yikes!
The difference is bigger than you might think. On table #1 you have one index, and on table #2 you have two indexes. Insert performance is basically linear with the number of indexes[^]. So the fact that it doesn't take twice as long to do the inserts on table #2 means that your idea was quite correct after all. And this is on a narrow table, on a wide clustered table the difference would be bigger.
Wrong is evil and must be defeated. - Jeff Ello
-
The difference is bigger than you might think. On table #1 you have one index, and on table #2 you have two indexes. Insert performance is basically linear with the number of indexes[^]. So the fact that it doesn't take twice as long to do the inserts on table #2 means that your idea was quite correct after all. And this is on a narrow table, on a wide clustered table the difference would be bigger.
Wrong is evil and must be defeated. - Jeff Ello
One has twice the indices... well duh. I should have thought of that... but I didn't :laugh: Thanks for sticking with me until I got it!
-
I never did trees in school.
JChrisCompton wrote:
The issue that concerns me is page splits.
From the page I linked; "When an index is created or rebuilt, the fill-factor value determines the percentage of space on each leaf-level page to be filled with data, reserving the remainder on each page as free space for future growth." "How do you avoid them? There are strategies for doing this, one such strategy is to use a low fill factor when you create and index, but that will be at the expense of read operations and because you usually have more reads to writes then that’s bad." What is a page split? What happens? Why does it happen? Why worry? - Tony Rogerson's ramblings on SQL Server[^]
Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^] "If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
Thanks for the article reference, it is good reading. It is great background and wonderfully answers "what is a page split." I do wonder, however, how much of that has changed in SSrv 2008+. I thought, for example, that VarChar columns in SQL Server can be moved off to different storage location to prevent splits due a column changing its size. (row overflow? or is that just for blobs?) I didn't find something to corroborate my thought when I did a quick search, so if someone could let me know whether I'm on/off base in the previous paragraph it would be great. Just specifically about VarChar (not text/blob; that's a different world I'm rarely in).
-
Thanks for the article reference, it is good reading. It is great background and wonderfully answers "what is a page split." I do wonder, however, how much of that has changed in SSrv 2008+. I thought, for example, that VarChar columns in SQL Server can be moved off to different storage location to prevent splits due a column changing its size. (row overflow? or is that just for blobs?) I didn't find something to corroborate my thought when I did a quick search, so if someone could let me know whether I'm on/off base in the previous paragraph it would be great. Just specifically about VarChar (not text/blob; that's a different world I'm rarely in).
JChrisCompton wrote:
It is great background and wonderfully answers "what is a page split."
Better than I could in a single post, and it is nice to have an actual example that you can copy/paste to try for yourself and see :)
JChrisCompton wrote:
I didn't find
Let me share more bookmarks ;P How are varchar values stored in a SQL Server database? - Stack Overflow[^] - first answer. I had already forgotten about the option to compress those fields.
Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^] "If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
-
JChrisCompton wrote:
It is great background and wonderfully answers "what is a page split."
Better than I could in a single post, and it is nice to have an actual example that you can copy/paste to try for yourself and see :)
JChrisCompton wrote:
I didn't find
Let me share more bookmarks ;P How are varchar values stored in a SQL Server database? - Stack Overflow[^] - first answer. I had already forgotten about the option to compress those fields.
Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^] "If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
Thanks again. I believe the term I was looking for is "offrow pages". Here's a recent (Feb 2018) link that talks about it briefly then takes the undocumented DBCC IND command really deep into the RowID and beyond SQL Server Row Data Linking to Off Row Data[^] in case you're interested. I enjoyed it but was a little got lost :)
-
Thanks again. I believe the term I was looking for is "offrow pages". Here's a recent (Feb 2018) link that talks about it briefly then takes the undocumented DBCC IND command really deep into the RowID and beyond SQL Server Row Data Linking to Off Row Data[^] in case you're interested. I enjoyed it but was a little got lost :)