Maysam Mahfouzi wrote:
The problem is, I don't like to manipulate the initial URLs
If you don't want to manipulate the URLs, doesn't that mean you store each URL exactly as it is, and only check that the exact same URL isn't already there?
Maysam Mahfouzi wrote:
The first thing that came to my mind as a solution was to make URLs canonical before inserting them into the database
One option is to build the canonical version first, store it in a parent table, and then store the unmodified URL in a child table that references it. Something like:
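CREATE TABLE CanonicalUrl (
    CanonicalUrlId int NOT NULL PRIMARY KEY,
    Url varchar(500) NOT NULL UNIQUE   -- canonical form, stored once
);
CREATE TABLE Url (
    UrlId int NOT NULL PRIMARY KEY,
    CanonicalUrlId int NOT NULL REFERENCES CanonicalUrl (CanonicalUrlId),
    Url varchar(500) NOT NULL          -- original URL, exactly as entered
);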
If you want, you could instead add a computed column to the Url table to represent the canonical form. However, all of these options add extra logic to the data handling, so I'm wondering why you want to prevent storing similar URLs at all. Of course, storing exactly the same URL twice may not be wise, but that's easily prevented with a unique constraint on the column.
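To make the parent/child idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: it uses SQLite through the standard sqlite3 module rather than whatever database you actually run, the helper names canonicalize, create_schema and store_url are made up, and the canonicalization rules (lower-case scheme and host, drop the default port and the fragment, treat an empty path as "/") are just one possible definition of "canonical". Your own rules may well differ.

```python
import sqlite3
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rule set: which port is implied by which scheme.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return one possible canonical form of a URL (illustrative rules only).

    Simplification: user info (user:password@) is dropped and IPv6 hosts
    are not handled.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped; the query string is kept unchanged.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the parent (CanonicalUrl) and child (Url) tables in SQLite."""
    conn.executescript(
        """
        CREATE TABLE CanonicalUrl (
            CanonicalUrlId INTEGER PRIMARY KEY,
            Url TEXT NOT NULL UNIQUE
        );
        CREATE TABLE Url (
            UrlId INTEGER PRIMARY KEY,
            CanonicalUrlId INTEGER NOT NULL
                REFERENCES CanonicalUrl (CanonicalUrlId),
            Url TEXT NOT NULL
        );
        """
    )


def store_url(conn: sqlite3.Connection, url: str) -> int:
    """Store the original URL and return the id of its canonical row."""
    canonical = canonicalize(url)
    # The UNIQUE constraint makes this a no-op if the canonical form exists.
    conn.execute(
        "INSERT OR IGNORE INTO CanonicalUrl (Url) VALUES (?)", (canonical,)
    )
    (canonical_id,) = conn.execute(
        "SELECT CanonicalUrlId FROM CanonicalUrl WHERE Url = ?", (canonical,)
    ).fetchone()
    # The child row keeps the URL exactly as it was entered.
    conn.execute(
        "INSERT INTO Url (CanonicalUrlId, Url) VALUES (?, ?)",
        (canonical_id, url),
    )
    return canonical_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    first = store_url(conn, "HTTP://Example.com:80")
    second = store_url(conn, "http://example.com/#top")
    print(first == second)  # both originals share one canonical row
```

With this layout the original URLs are never modified, and "similar" URLs can be found by joining on CanonicalUrlId. The price is the extra canonicalization step on every insert, which is the additional logic mentioned above.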
The need to optimize rises from a bad design.