In this comment to one of my posts, Ryan asks…
Can you give any insight as to what life would be like, if I decided to migrate away from Guids as my IDs to Ints?
Sure: it would look exactly the same, except your entities would derive from IntIdentityPersistenceBase instead of GuidIdentityPersistenceBase.
Seriously though, this is a great question, and it goes to the heart of something that I may not have been completely clear on in the screencasts: why do I choose Guids for identity values over the probably more common Int32 data type in the first place?
Followers of my Autumn of Agile screencast series will probably have noticed that I chose Guids as the data type for the identity value in the persistent objects I use to model my domain. Although I thought I had clearly explained the reasoning behind this choice, I went back and skimmed the screencast in which I presented it, and I have to say that my memory of how clear I was and how clear I actually was are apparently two somewhat different things.
Understanding Choices
With NHibernate, you have great freedom to choose whatever data type you want to represent the identity value in your objects. This is one of NHibernate's great strengths: its ability to adapt to the needs of your application rather than force you to do things in one specific way. But one challenge with all this flexibility is to avoid confusing 'flexible' with 'interchangeable': each of the wide array of choices about how to do something comes with its own pros and cons, and that's partly why there is a somewhat dizzying array of ways to do things in NH in the first place!
What’s Wrong With Ints as IDs?
The trouble with integers as identity values is that they have to be assigned by the database. Technically, of course, this isn't entirely true: nothing requires an integer primary key to be assigned by the DB. But it's certainly the 95% case (if not the 99% case!) that when you use Ints as identity values, you also ask the DB (usually MSSQL in most cases) to auto-increment and assign the identity value for you.
NH actually supports either approach quite capably: your identity (whatever the data type) can be an auto-increment value assigned by the DB or a value assigned by your application. But if you choose to take responsibility for assigning it within your application, it becomes really difficult to ensure properly unique identities in a multi-user environment. When many users might be attempting INSERTs into the DB at the same time, each copy of the application (or each session, in a web app) would need some way to ensure it was assigning truly unique Int values for primary keys. This issue (how to 'coordinate' unique IDs across multi-user applications) is, after all, why DBs offer auto-increment PKs in the first place.
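To make the two approaches concrete, here is a minimal hbm.xml identity-mapping sketch; the property and column names are illustrative, not taken from the screencast project:

```xml
<!-- Illustrative NHibernate mapping fragments (names are hypothetical). -->

<!-- DB-assigned: the database auto-increments the integer PK. -->
<id name="Id" column="EmployeeId" type="int">
  <generator class="native" />
</id>

<!-- Application-assigned: NHibernate generates a Guid client-side. -->
<id name="Id" column="EmployeeId" type="guid">
  <generator class="guid.comb" />
</id>
```

Mapping-wise, swapping the generator class is the whole difference between the two strategies; everything that follows in this post is about the runtime consequences of that one choice.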
Example
To better understand the issue, let’s consider the following code snippet (adapted — somewhat — from the project in the screencasts):
```csharp
using (UnitOfWork.Start())
{
    Employee emp = new Employee();
    _employeeRepository.SaveOrUpdate(emp);

    if (CheckSomethingImportant())
        _employeeRepository.DeleteEmployee(emp);

    UnitOfWork.Current.Flush();
}
```
Now, this snippet is obviously contrived; it's hard to imagine needing exactly this sequence in code, but let's pretend it's real. The pattern, at least, is completely real: do something to one or more objects, save them, delete one or more of them, add one or more new ones, and then Flush() the unit of work when you're done. This is the whole point of the Unit of Work pattern in the first place: do anything I want with any of the objects and then (if I'm completely happy with the results) commit the unit of work as one atomic unit.
What’s Wrong with our Code Snippet?
The $64,000 question (or is that the 64KB question?) is: when in the above example does the first DB communication happen?
Well, what we know about the Unit-Of-Work pattern suggests that the semantically-correct answer is in the line…
UnitOfWork.Current.Flush();
…since that’s the whole point of the Unit-Of-Work pattern: nothing actually happens until I commit the UoW. And if you were using application-assigned identity values, you would be correct! If the call to CheckSomethingImportant() returns false then emp will be saved to the DB when I call UnitOfWork.Current.Flush();. And if the call to CheckSomethingImportant() returns true then actually absolutely nothing will happen when I call UnitOfWork.Current.Flush(); — the unit-of-work will be empty since I effectively undid the save by deleting the Employee from the unit-of-work before I commit it.
But if you actually step through this very same code in the debugger and watch SQL Profiler when using auto-increment identity values, you would see something a little unexpected (and disconcerting, at least to me): the emp instance of Employee would actually be saved to the DB in the line…
_employeeRepository.SaveOrUpdate(emp);
…before I commit the Unit-of-Work!
Why?
If you start to think for a moment about what component in your system is responsible for assigning an auto-increment identity value, then you realize why the call to
_employeeRepository.SaveOrUpdate(emp);
results in an INSERT against the database. Once you call this method, NHibernate needs to change your object from transient (not part of the unit-of-work) to persistable-but-not-persisted and to do this it needs to assign it a valid identity value. And the only place to get this value is from the database (by calling INSERT to get the next auto-increment identity value and assign it to the object).
NHibernate needs this object to have a valid identity once you call
_employeeRepository.SaveOrUpdate(emp);
so that any other object you might attach to the session that needs to reference it (like children in a collection that need to store the identity value of their parent) can do so. Since the only way that NH can get an identity value for this object is to INSERT it into the DB and have the DB assign it a primary key, NH has to issue an INSERT outside of the actual Unit of Work (i.e., before you commit the UoW) to get this value.
GUIDs can be Generated Without Talking to the Database
This simple fact: "GUIDs can be generated without talking to the database" is the core reasoning behind my selection of them as identity values in the screencasts and my preference for them in professional practice. They allow my application (and NHibernate, to whom I will delegate the responsibility) to generate and assign Identity values without needing to communicate with the database to find out the next auto-increment identity value to assign to the object.
This makes for a cleaner, less-chatty, more predictable application whose Unit-Of-Work behaves as expected. It also makes for an application that isn’t dependent on what is really a database-specific implementation detail to assign identity to my objects (hint: not every database platform supported by NH supports auto-increment PK values).
For Clarity
In the interests of completeness and 100% accuracy: this capability (to assign identity without asking the database for the next valid value) has absolutely nothing to do, inherently, with using Guids as a datatype. You can accomplish this with any datatype that NH supports for an identity. But since Guids are, for all practical purposes, unique by design, they make an excellent choice for an identity datatype: I can rely on them being unique and assignable by my application without needing to keep track of them somewhere in my running application.
I hope this helps clarify a bit better why I made the choice the way I did. And thanks again for the great question~!
You’ve mentioned the upside of using GUIDs, in terms of avoiding the database access due to identities being generated on the client, but what about the downsides?
GUIDs use more space and consequently reduce the performance of queries, and GUIDs (depending on the algorithm used to generate them) may be “randomly” distributed, which results in fragmentation and the associated issues that come with it.
Note: SQL Server 2005 introduced NEWSEQUENTIALID() to help resolve the fragmentation issues with using GUIDs as keys, and NHibernate’s guid.comb generator is something similar.
In terms of a GUID's uniqueness, there's only a finite number of bits used to represent a GUID, so whilst generating the same GUID more than once may be unlikely, it is possible.
The algorithm used to generate GUIDs will affect when the first non-unique value is generated: 'randomly' distributed GUIDs are a different matter from sequential/comb GUIDs. Just as an example, this quote is from the MSDN documentation for the Guid.NewGuid() method:
“There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.”
From what I’ve read, that’s an example of the “randomly” distributed GUIDs, and you have to resort to using the Windows API or rolling your own for the other GUIDs.
@James:
Good point about not mentioning some of the downsides. Let’s look @ them…
GUIDs use more space
Yes. Obviously nothing can be done about that 😀 Everything is a trade-off, and I will take a few more bytes in my DB over extra hits to the DB any day of the week (and twice on Sundays). But I will absolutely admit that, depending on the size/shape of your data, there may indeed be cases where the reverse ROI conclusion is perfectly valid.
Fragmentation:
This is the reason the Guid.Comb algorithm is (generally) preferred for generating the identity values, as you mention. NH supports both 'real' Guids (effectively random) and the much more efficient Guid.Comb algorithm, which produces 'semi-synthetic' Guids that cause far less fragmentation.
As you say, these are some of the downsides to using GUIDs, and I did focus exclusively on the positives and ignore the negatives. I didn't mean to imply that there are none, but rather that, on balance, I tend to prefer them to integers in most of my work for the reasons illustrated.
Thanks for pointing out that this choice, like all others, comes with BOTH pros and cons~! Good points.
-Steve B.
@Steve:
There is a little more to GUIDs using more space than just the disk space itself, though, in that it also affects the performance of queries on that database.
When querying by ID, or joining foreign/primary keys, the database engine has more work to do to find matching IDs: the IDs use more bytes, hence more time is involved in comparing them. Also, the IDs using more bytes means the data is spread further over the disk(s) storing them, so there are caching/IO considerations too.
I guess what I'm saying is that I'm not sure whether the benefits of the identity inserts outweigh the downsides when it comes to the subsequent querying of the database.
As you probably guessed, generally speaking I prefer using auto-incrementing numbers, i.e. for me GUIDs tend to come into the equation when passing references to external systems. In such cases recently, an auto-incrementing number was used for the primary/foreign keys, and a guid (unique constraint) was used as a look-up for external systems.
Note: GUIDs really come into their own when there’s more than one source creating the values for the identity field, e.g. scenarios where multiple systems are sharing/syncing data.
Thanks for the explanation Steve. I thought it would be as easy as that, but there isn't an implementation of IntIdentityBase in the Microdesk.Domain.Foundation assembly… so I rolled my own, and it worked like a charm. The only real assumption I made is that I don't need to provide the static bool operators == and !=, as this is simple with ints. So far so good.
@James:
I've often heard this concern (query performance) raised again and again re: GUIDs; do you have any references (from actual performance studies) that provide performance metrics and might indicate at what point the choice of GUIDs begins to overwhelm scalability? I'd be very interested in any concrete data that might be available.
Also, I'd be interested in understanding what that performance hit looks like as weighed against all the extra calls to the DB that are required by using auto-increment IDs, since all of these choices obviously come with both their positives and their negatives (as you and I are both pointing out).
Can you provide any research/resources that would help me better see some concrete metrics?
@Ryan:
IntIdentityPersistenceBase was a somewhat tongue-in-cheek response, as I am (of course) aware that there isn't any such class in the assembly… yet 😀 It's my intent to extend the assembly to contain such a thing at some point in the future, but I just haven't gotten around to it yet.
As for not needing to override operator == and operator !=, I actually think that you *DO* still need to override them in the manner demonstrated in the screencasts (although, as you point out, you'd be comparing integer equality). The default behavior of == (and !=) for reference types is a reference comparison of the objects being evaluated, and this isn't the behavior that (I think) you want. Instead, I consider that an equality comparison should be based on a comparison of the identity values of the objects in question, and for objects with int identity this makes for a comparison of int values. I admit that this 'interpretation' of equality (based purely on identity) is somewhat opinionated and may not make sense for everyone, but that's how I prefer to look at it.
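To sketch what I mean, here is an illustration of the idea (the class name follows the screencasts, but the body is a sketch of the technique, not the actual assembly code):

```csharp
using System;

// Sketch: equality is based solely on the identity value. Transient
// instances (identity still Guid.Empty) are never considered equal,
// matching the behavior discussed in this thread.
public abstract class GuidIdentityPersistenceBase
{
    protected Guid _persistenceId = Guid.Empty;

    public Guid Id { get { return _persistenceId; } }

    public override bool Equals(object obj)
    {
        var other = obj as GuidIdentityPersistenceBase;
        if (other == null) return false;

        // Two transient (never-persisted) instances are NOT equal,
        // even if every other property matches.
        if (_persistenceId == Guid.Empty || other._persistenceId == Guid.Empty)
            return false;

        return _persistenceId == other._persistenceId;
    }

    public override int GetHashCode()
    {
        return _persistenceId.GetHashCode();
    }

    public static bool operator ==(GuidIdentityPersistenceBase left, GuidIdentityPersistenceBase right)
    {
        if (ReferenceEquals(left, right)) return true;
        if ((object)left == null || (object)right == null) return false;
        return left.Equals(right);
    }

    public static bool operator !=(GuidIdentityPersistenceBase left, GuidIdentityPersistenceBase right)
    {
        return !(left == right);
    }
}
```

An int-identity variant would look the same, with `int _persistenceId = 0;` standing in for Guid.Empty as the 'not yet persisted' marker.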
@Steve
Hmmm… is there some language nuance going on that I'm not grokking (wouldn't be a first)? It would seem then that to get the behavior you are wanting, your left and right comparisons would have to read more like this (this is from the bool operator == override):
return object.Equals(left._persistenceId, right._persistenceId)
than how they look right now:
return object.Equals(left, right)
Oops nevermind I see it now…
@James,
Jimmy Nilsson did a performance comparison of INTs vs. Guid.comb with NHibernate. http://www.informit.com/articles/printerfriendly.aspx?p=25862
Basic conclusion through testing metrics: The performance differences are negligible. I wouldn’t be too concerned with GUIDs (using guid.comb)
I personally prefer GUIDs. Most guys immediately discard GUIDs because they are too worried that GUIDs _might_ affect performance… but isn't this 'PREMATURE OPTIMIZATION'?
Have a look at this old heated thread from sqlteam.com http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=6136&whichpage=2
There’s this guy quazibubble, his comments are a bit rude, but he does make a lot of sense.
@Ryan:
Yeah, I was going to comment back that you're just not looking in the right place, but I really wanted to wait until I had a chance to get home, fire up my laptop, and inspect my own code to ensure I hadn't made a huge boo-boo without realizing it (though I was pretty sure I hadn't since, and you'll enjoy this, I used TDD to develop that class and so was NEAR-CERTAIN that equality was based solely on identity-value comparison).
Glad you found it on your own (and glad I didn't make a huge mistake in that class!). BTW, GuidIdentityPersistenceBase is going to be replaced with IdentityPersistenceBase<T> soon, which will eliminate all dependence on any SPECIFIC data type per se and give you the flexibility to use whatever type you want for your PKs/identity values.
@Todd:
Thanks for that — I was aware that Jimmy Nilsson was to thank for the Guid.Comb algorithm, but was completely unaware of the article that you linked to there. THAT’S the real-world benchmark data that I was hoping someone could find to help me better understand the actual impact of the choice of ID type one way or the other.
While I don’t think that necessarily means ‘use GUIDs anywhere you want to with complete impunity’, it does strongly suggest that (in general) the ‘rules of thumb’ are indeed:
+ use GUIDs without much concern for performance (so long as you are using the Guid.Comb algorithm for generating them)
+ you still have the storage capacity issue (e.g., 16 bytes instead of 4 for INTs) and there is (of course) nothing you can do about that 😀
Thanks again for the link; nice to have hard data to assist in making one’s design choices~!
@Steve:
I see Todd has pointed you in the direction of an article by Jimmy Nilsson, which contains some statistics/benchmarks. I'm always a little cautious about that kind of thing because, as the saying goes, there are lies, damn lies, and statistics/benchmarks! 🙂
Clearly the article has some value, but to really appreciate the true difference in performance, there’s no benchmark like the real system. Sometimes the complexities of all the caching etc that goes on in modern hardware and software can result in some quite surprising/hard to predict results.
Just to pick on some other considerations: how is performance affected when there are several joins involved, and how are non-clustered indexes affected, given the relationship between them and the clustered index for bookmark look-ups etc.?
Changing subject a little: as you mentioned, the example in your post is somewhat contrived, so, generally speaking, the difference is probably more about when the insert occurs than whether it occurs at all. Whether that's significant (from a performance or functional standpoint) is hard to say out of context; as with most technical questions, the answer is probably 'it depends'.
Note: The ratio of inserts to selects is obviously a significant factor too; if your application leans heavily towards selects, which I'd guess is quite common, then their performance is probably much more significant to the user.
Anyway, I've probably waffled on long enough, especially given that in the application being developed as part of the Autumn of Agile series (for which, admittedly, I'm a few videos behind), there's probably only going to be a fairly small number of rows in each table, so any difference in performance is probably going to go unnoticed either way! 🙂
@James:
I wouldn't want to stifle what I consider to be a very valuable discussion about the pros and cons of this specific design decision, one that every developer who isn't handed a 100%-complete DB by a DBA has to grapple with all the time.
I think the opinions debated here are very valuable for the participants to discuss, for me to read, and, I hope, for other readers of this blog and (thanks to Google) the rest of the internet at large.
I completely agree that "the best benchmark is what it does to YOUR system", and this is actually one of the great things about the approach to development that says "the DB is an implementation detail I want to delay worrying about until the last responsible moment". If I discover later that using GUIDs as identity values is troublesome from a performance standpoint, it's relatively easy to switch to something else (INTs or otherwise) and have that effect localized (mostly) in my persistence layer. [Note that I'm of course talking about BEFORE the app is launched and starts to contain production data, after which such a change is clearly impractical if not impossible 😀]
Your point about the ratio of inserts to selects is a great one too; I would wager that you are right that most common business apps will experience a higher number of SELECTs than INSERTs under most operational conditions (though it's not hard to conceive of order-entry-type apps where the reverse might very well be true).
Caching (as you point out) is another tremendous performance factor that comes into play, as do the use of views as SELECT targets instead of multi-join SELECT statements, and a whole host of other factors that impact DB-interaction performance benchmarks.
There is a real question (in my mind) about whether choosing INTs (or, more accurately, any PK that needs to be generated by a call to the DB) is itself a premature optimization: we KNOW that using auto-increment PK values from the DB is going to increase SQL traffic, but we don't KNOW that GUIDs will impact query performance (yet) without some actual perf tests on the app.
That said, one man's 'premature optimization' is another man's 'I've been down this road before in a similar app and there's no need to make that same mistake again!', so I think this is definitely one where reasonable people will have to agree to disagree based on each of our varied experiences 😀
Again, great debate everyone — I personally found everyone’s input to be very valuable and helpful for illuminating the issues and factors for consideration, even if (in the end) we still didn’t settle the issue convincingly one way or the other 😀
@Steve:
It seems to me like one of those situations where it isn’t as simple as there always being a good and a bad solution, as sometimes there’s really just two different solutions.
The solution you pick is perhaps based on bias/history, or on factors like database support and what's easier to type/remember when debugging; picking either solution based on performance is likely to be 'premature optimization' (or perhaps de-optimization), unless you happen to have knowledge/experience of the specific use cases in question.
Just curious, but has discussing it changed your opinion of either method in any way? I’m probably more open-minded towards using GUIDs (sequential/comb) than I was initially.
@James:
I would have to say ‘yes’ it has changed my mind a bit, and this is kind of where I was headed with my point about the value that I got from this comment thread.
My preference (default design choice) is still GUIDs, but I have to say that, in the past, if I had a poor-performing DB tier I wouldn't really have considered my choice of GUIDs a significant (possible) factor worth investigating; now I would be more skeptical, and more open to the idea that it might be at least a contributing factor to my performance problems.
And your point about not every DB supporting GUIDs is a great one; just as not all DBs support auto-increment PKs, not all DBs support GUIDs either, so whichever choice you make is limiting in *some* way no matter what direction you go.
As they say when making any software architecture choice, it always comes down to three factors: context, context, and context 😀
Hi,
I was wondering why you chose, in GuidIdentityPersistenceBase.Equals, to treat all objects with an unassigned _persistenceId as not comparable. If an object has not been persisted yet, wouldn't it be better to fall back to reference equality in that case, or to generate a Guid when the constructor is called? In my opinion this behavior is not intuitive for a novice user and could introduce subtle bugs if the coder is new to your practices.
I'm new to NHibernate and the Rhino IoC stuff, so I'm sure that I am missing something. Could you please explain this design decision?
@Serial Nightmare:
Thanks for the great question: here’s the summary of why I chose the equality behavior as I did…
I agree with you that it might seem counter-intuitive for a novice user; that's one of the reasons I pointed out in the screencasts that it works the way it does. I may not have been as explicit as I could have been about WHY I chose to make it work this way, though…
The single underlying principle behind my implementation of equality in the base class is "identity is based solely on the value of the identity field and nothing else". This is (if you think about it) the same thing that happens in just about any DB: if two rows have the same primary key, they are considered THE SAME ROW. If two rows have different PK values, they are considered NOT THE SAME ROW, no matter the values of their constituent fields.
I can see how ‘expecting’ value equality (the default where if each public field is equal, then the two instances of the object are considered equal) would seem intuitive if the entity wasn’t yet persisted but this would introduce two completely different evaluations of equality for persisted and transient (non-persisted) entities and this would (IMHO) introduce MORE chances for bugs rather than less.
Re: assigning the identity value in the constructor: as I see it, this doesn't seem to buy me anything I don't already have with my present implementation. If I leave the _persistenceId value as Guid.Empty and say (as I do) that the comparison of two Guid.Empty values is non-equal, then isn't this the same thing I would get if I assigned the identity value in the constructor? Two non-persisted entities would still have (I assume) different Guids assigned to them, and they would still evaluate as non-equal (since their Guids would be non-equal). I may be misunderstanding your suggestion; can you clarify if I'm not getting it…?
Basically what it comes down to is this: I want identity (and by extension, the notion of equality) in my object model to be based on the same rules as my persistence environment (the DB) so that I can never end up with two persistent entities equal as objects but non-equal in the database. Since equality in the DB is purely a function of the PK values, then I want the same ‘rules’ in my domain model and this is what I was shooting for with the persistence base class and the implementation of the equality operator that’s in there.
Does this make sense…?
@sbohlen
I understand and agree that if the Id is equal, that should mean the two objects are equal no matter what the rest of their fields hold. If the rest of the fields were different, that would mean only that one of the objects was loaded at a different time than the other, or was changed after load.
What I meant was that a transient object would not be equal to itself.
For example:

```csharp
Employee emp = new Employee();
Assert(emp.Equals(emp)); // fails: emp is transient, so it is not even equal to itself
```

This would be solved if the Id were assigned in the constructor.
Could this be a problem, or are there some best practices to avoid this?
Do you have to do that because NHibernate compares the Id to the unsaved-value to see whether the object was persisted or not? (I just saw that in your NHibernate screencast, session 1.)
@Serial Nightmare:
You are pointing out an interesting side effect of my approach that I must confess I had not fully considered: that of an object not being equal to itself. Can you demonstrate for me an (actual) case where this comparison would be useful/valuable in a real app? I haven't stumbled across such a case in my experience, though that doesn't even come close to proving there isn't one 😀 Can you help me understand what use case you are positing this would be a hindrance for…?
And, yes, the other reason for leaving _persistenceId at its 'unassigned' default of Guid.Empty is indeed because that's part of how NHibernate decides whether it's dealing with a persistent or a transient instance. This in turn determines whether it needs to do a 'save' (a SQL INSERT statement) or an 'update' (a SQL UPDATE statement).
If we were to assign a Guid on entity construction, then WE would have to take over that responsibility ourselves as developers and ensure that we always invoked EITHER ISession.Save(…) or ISession.Update(…) as needed. With my present implementation, we can nearly always invoke ISession.SaveOrUpdate(…) and let NHibernate decide for itself whether an INSERT or an UPDATE statement needs to be generated (based on the value of the _persistenceId field in the base class).
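For anyone curious what that 'hint' looks like in practice, it's the unsaved-value setting in the id mapping; a sketch (property and column names hypothetical):

```xml
<!-- Sketch of an hbm.xml id mapping: unsaved-value tells NHibernate which
     identity value marks a transient instance, so SaveOrUpdate() can choose
     between INSERT and UPDATE. Names here are hypothetical. -->
<id name="Id" column="EmployeeId" type="guid"
    unsaved-value="00000000-0000-0000-0000-000000000000">
  <generator class="guid.comb" />
</id>
```

The all-zeros value is just Guid.Empty spelled out, which is exactly the default the base class leaves in _persistenceId until the entity is persisted.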
This kind of thing is one of the reasons why (IMHO) persistence ignorance (PI) of our domain entities is a GOAL rather than an absolute. With something like NHibernate, it's actually the case that if 'pure' PI is what you want, there are enough extensibility points in the tool to achieve just that; but if you back off from the 'purity' of PI just a bit, you can let the framework (NH in this case) do more of the heavy lifting for you. Designing your domain model so that NH can be given the right 'hints' about whether to construct an INSERT or an UPDATE statement when the session is flushed is just one such 'concession' to pure PI that I am willing to make in this kind of case.
@sbohlen
For about 2 years I was working on a large legacy C/C++ codebase. I think that this experience might have made me “sensitive” (read: paranoid 🙂 ) to such edge cases.
I can only imagine such scenarios where you have to code defensively in an undisciplined big-ball-of-mud project. If your class is responsible for a list of objects that have to be distinct (different Ids), and someone took one of the items already in the list and gave it back to you as a new item, you could not check whether that object is already there or not.
Or, if the hash function is based on the Guid inside the Id, you could run into problems if you want to put transient objects into a Dictionary: if two transient objects can never be equal, you might not be able to use them as keys.
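Something like this sketch is what I have in mind (assuming Equals and GetHashCode are both based purely on the identity value; the Employee class here is just a stand-in, not the real one):

```csharp
using System;
using System.Collections.Generic;

// Sketch of the Dictionary concern: with identity-only equality, a transient
// object (Id still Guid.Empty) is never Equal to anything, not even itself,
// so it cannot be found again once used as a Dictionary key.
class Employee // hypothetical stand-in for an entity using identity-based equality
{
    public Guid Id = Guid.Empty; // transient: not yet persisted

    public override int GetHashCode() { return Id.GetHashCode(); }

    public override bool Equals(object obj)
    {
        var other = obj as Employee;
        if (other == null) return false;
        if (Id == Guid.Empty || other.Id == Guid.Empty) return false; // transient => never equal
        return Id == other.Id;
    }
}

class Demo
{
    static void Main()
    {
        var titles = new Dictionary<Employee, string>();
        var emp = new Employee();
        titles[emp] = "Developer";

        // Lookup fails: the key hashes to the same bucket, but
        // Equals(storedKey, emp) is false because emp is transient.
        Console.WriteLine(titles.ContainsKey(emp)); // prints: False
    }
}
```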
@Serial Nightmare:
The dictionary example is an interesting one I hadn’t considered — I will have to give that some more thought…
But the first situation is NEVER going to happen — that’s the point of the DDD concept of ‘aggregate root’. If the parent holds a collection of children, then that parent is going to be responsible for managing the lifecycle of its children.
If it were to ‘give’ a transient instance to another part of the app before assigning it an Id (by persisting it), then I think that’s essentially relinquishing control of the item to the other aggregate root which (even if it were a good idea) would probably mean that the item would be REMOVED from the original parent rather than a copy of it given out to the other. Then the responsibility for managing that child would also be passed to the second parent. In essence, handing off a child from one parent to the next is also handing off lifecycle management for that child from one parent to the next.
I agree that it's POSSIBLE that the first parent would relinquish a *copy* of one of its children to another part of the app, but I would also have to say that it seems to me a poor design decision to do so. I would probably REMOVE the item from the first parent's collection of children and then ADD it to the second parent's collection, rather than have Parent1 hand a copy to Parent2 (and create exactly the trouble you are quite correctly pointing out could easily occur). This is just one of the reasons why I tend to shudder inside when I see properties on objects that expose non-read-only collections (the members of that collection are suddenly completely exposed to the world).
I am inclined to concur with your paranoid estimation that this trouble is indeed possible, but to create these conditions would (I think) require me to go against all the rest of the tenets of DDD that I tend to consider good practice. Hopefully, if a dev on the project did follow the pattern you are describing, the rest of the team would catch it in tests, code-reviews, etc., but I think you’re quite right that it represents at least a potential point of contention to watch out for.
Thanks for the points~!
For really detailed and profound background info on why a GUID is a TERRIBLE choice for a clustering key in SQL Server, check out the various articles by Kim Tripp – she’s arguably one of the most prominent experts in SQL Server performance and index tuning there is.
http://www.sqlskills.com/BLOGS/KIMBERLY/post/GUIDs-as-PRIMARY-KEYs-andor-the-clustering-key.aspx
http://sqlskills.com/BLOGS/KIMBERLY/post/The-Clustered-Index-Debate-Continues.aspx
http://www.sqlskills.com/BLOGS/KIMBERLY/post/Ever-increasing-clustering-key-the-Clustered-Index-Debateagain!.aspx
If you really want to (and have to) use GUIDs as the PRIMARY KEY, you should always try to use a separate INT IDENTITY column to cluster the table. The performance differences are indeed MASSIVE (seen first-hand with tables of 20,000 rows or more).
Marc
@Marc Scheuner:
Yep, I completely respect Kim's expertise when it comes to tuning SQL Server for performance, so I'm well aware of what she's suggesting here: that randomly ordered PKs are awful for the performance of SQL Server indexing. I don't dispute this point at all, as all the literature (and my own experience) would certainly back her up on it.
This is why I make use of the Guid.Comb generator for identity (PK) values in my NHibernate-based solutions. Guid.Comb results in a unique yet ever-increasing value for the identity, which is what Kim is suggesting at the end of her post here…
http://www.sqlskills.com/BLOGS/KIMBERLY/post/Ever-increasing-clustering-key-the-Clustered-Index-Debateagain!.aspx
…in which she suggests that under SQL Server 2005 (and also SQL Server 2008, of course) one should choose the newsequentialid() method over the completely random newid() method in the database.
The Guid.Comb identity generator in NHibernate is essentially the non-database-dependent version of the SQL Server newsequentialid() function that can be invoked client-side without requiring a database round-trip just to get a new PK for an entity.
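For the curious, the idea behind the COMB approach can be sketched roughly like this; treat it as an illustration of the technique (after Jimmy Nilsson), not NHibernate's exact source:

```csharp
using System;

// Sketch of a 'COMB' Guid generator: keep most of the bytes random, but
// overwrite the last six with a timestamp so that successive values sort
// in roughly increasing order in SQL Server's uniqueidentifier ordering.
public static class CombGuid
{
    private static readonly DateTime BaseDate = new DateTime(1900, 1, 1);

    public static Guid NewComb()
    {
        byte[] guidBytes = Guid.NewGuid().ToByteArray();

        DateTime now = DateTime.UtcNow;
        TimeSpan days = new TimeSpan(now.Ticks - BaseDate.Ticks);
        TimeSpan msecs = now.TimeOfDay;

        // Two bytes of day count plus four bytes of (milliseconds / 3.333...)
        // to match SQL Server's ~1/300-second datetime resolution.
        byte[] daysBytes = BitConverter.GetBytes(days.Days);
        byte[] msecsBytes = BitConverter.GetBytes((long)(msecs.TotalMilliseconds / 3.333333));

        // SQL Server gives the *last* six bytes of a uniqueidentifier the
        // most significance when sorting, so the timestamp goes at the end.
        Array.Reverse(daysBytes);
        Array.Reverse(msecsBytes);
        Array.Copy(daysBytes, 2, guidBytes, 10, 2);
        Array.Copy(msecsBytes, 4, guidBytes, 12, 4);

        return new Guid(guidBytes);
    }
}
```

Two values generated in quick succession share the same timestamp suffix but differ in their random bytes, so you get near-sequential insert behavior without giving up client-side generation.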
The idea of using a separate INT IDENTITY column in the DB along with the Guid as PK and having the clustered index depend on this INT value is an interesting one I hadn’t considered before — are you saying that you have implemented this in a production system? And if so, how do you keep the index in sync with the data when inserts/deletes occur? Just wondering…
Thanks for the comments~!
Quick question that may be born of ignorance, but I have developed at least two apps where the user expects to be able to find something by ID. Is this a valid reason to use an incrementing int in your table, either as the primary key or alongside a GUID?
It's an amusing thought: you're dealing with a customer who has submitted a ticket in your CRM and you say, "can you read the GUID out to me?".
Fantastic series, Steve, both on Agile and NHibernate. Really valuable to me in my efforts to learn more. As soon as my credit card statement looks a bit healthier, I'm going to donate a few euro 🙂
@Brian:
*Generally* speaking, you are describing the (age-old) database design debate about the difference between ‘natural’ ( http://en.wikipedia.org/wiki/Natural_key ) and ‘surrogate’ ( http://en.wikipedia.org/wiki/Surrogate_key ) primary keys in the table.
Setting aside the fact that this has been the subject of many religious wars over the years, I am a fan of the 'surrogate' key approach, where the PK value has zero semantic business meaning (you can read either of those Wikipedia links for ideas as to why I feel this way), and so my choice of data type for the PK is independent of what a user might expect or want to remember.
Thanks for the feedback, and I'll be glad to accept any donation (big or small) if/when you can; every little bit helps!
Thanks again,
-Steve B.