Detecting similar URLs
-
I have a database table in which I store URLs that users bookmark. Before inserting a URL, I want to make sure it has not already been bookmarked by another user. To do so, I have to search for similar forms of the URL; e.g. if someone has already inserted www.yahoo.com, I want to avoid inserting http://yahoo.com as well, to prevent duplicate entries. The first thing that came to my mind as a solution was to make URLs canonical before inserting them into the database, i.e. remove www from the beginning of the URL (if any) and add http:// to it. This seems like a good workaround. The problem is, I don't like to manipulate the initial URLs. I mean, if a user wants to bookmark www.yahoo.com, I don't want to insert http://yahoo.com into the database, because some URLs will not open if you remove www from the beginning of them. Any ideas?
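To make the canonicalization idea concrete, here is a minimal sketch, assuming Python and its standard urllib.parse module. The function name canonicalize_url is hypothetical, and the rules are only the two mentioned above (add http:// when no scheme is given, drop a leading www.) plus lowercasing the host and treating an empty path as "/". The result is meant as a comparison key only; the URL the user typed is not modified.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a leading scheme such as "http://" or "ftp://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def canonicalize_url(raw: str) -> str:
    """Return a comparison key for raw (hypothetical helper, not a full normalizer)."""
    text = raw.strip()
    # Assumption: a URL typed without a scheme (e.g. "www.yahoo.com") means http.
    if not _SCHEME.match(text):
        text = "http://" + text
    parts = urlsplit(text)  # urlsplit already lowercases the scheme
    # Host names are case-insensitive, so compare them in lower case.
    netloc = parts.netloc.lower()
    # Drop a leading "www." for comparison purposes only.
    if netloc.startswith("www."):
        netloc = netloc[len("www."):]
    # Treat "http://yahoo.com" and "http://yahoo.com/" as the same URL.
    path = parts.path or "/"
    # The fragment is dropped; the query string is kept as typed.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


# Both spellings map to the same key, so the second one can be detected
# as a duplicate without changing what the user entered.
same = canonicalize_url("www.yahoo.com") == canonicalize_url("http://yahoo.com")
```

This does not address the concern that some sites only respond with the www prefix; it only shows that the key used for comparison can be kept separate from the URL that is stored and opened.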
-
Maysam Mahfouzi wrote:
The problem is, I don't like to manipulate the initial URLs
If you don't want to manipulate the URL, doesn't that mean you store each URL exactly as it is, and only check that the exact same URL isn't already present?
Maysam Mahfouzi wrote:
The first thing that came to my mind as a solution was to make URLs canonical before inserting them into the database
One option is to build a canonical version first, store it in a parent table, and then store the unmodified URL in a child table. Something like:

    -- Parent table: one row per canonical form
    CREATE TABLE CanonicalUrl (
        CanonicalUrlId int PRIMARY KEY,
        Url varchar(500)
    );

    -- Child table: the URL exactly as the user entered it
    CREATE TABLE Url (
        UrlId int PRIMARY KEY,
        CanonicalUrl int REFERENCES CanonicalUrl (CanonicalUrlId),
        Url varchar(500)
    );

If you want, you could also add a calculated column to the Url table to represent the canonical form. However, all of these add extra logic to the data handling, so I'm wondering why you want to prevent storing similar URLs at all. Of course, storing exactly the same URL twice may not be wise, but that's easily prevented.
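As a rough illustration of how an insert could work against such a parent/child layout, here is a sketch assuming Python with the standard sqlite3 module and an in-memory database. The function name add_bookmark, the SQLite column types, and the UNIQUE constraint are my assumptions, and the canonical form is passed in as a ready-made string, so any canonicalization rule can be plugged in. It also assumes that a URL whose canonical form is already known should be rejected rather than stored as a second child row.

```python
import sqlite3

# Two-table layout: canonical form in the parent, untouched URL in the child.
SCHEMA = """
CREATE TABLE CanonicalUrl (
    CanonicalUrlId INTEGER PRIMARY KEY,
    Url TEXT NOT NULL UNIQUE
);
CREATE TABLE Url (
    UrlId INTEGER PRIMARY KEY,
    CanonicalUrl INTEGER NOT NULL REFERENCES CanonicalUrl (CanonicalUrlId),
    Url TEXT NOT NULL
);
"""


def add_bookmark(conn, raw_url, canonical_url):
    """Store raw_url unmodified unless its canonical form is already known.

    Returns True if a new bookmark was stored, False if it was a duplicate.
    """
    row = conn.execute(
        "SELECT CanonicalUrlId FROM CanonicalUrl WHERE Url = ?",
        (canonical_url,),
    ).fetchone()
    if row is not None:
        return False  # some spelling of this URL is already bookmarked
    cur = conn.execute(
        "INSERT INTO CanonicalUrl (Url) VALUES (?)", (canonical_url,)
    )
    # The child row keeps the URL exactly as the user typed it.
    conn.execute(
        "INSERT INTO Url (CanonicalUrl, Url) VALUES (?, ?)",
        (cur.lastrowid, raw_url),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = add_bookmark(conn, "www.yahoo.com", "http://yahoo.com/")
second = add_bookmark(conn, "http://yahoo.com", "http://yahoo.com/")
```

Here the first call stores www.yahoo.com as typed, and the second call is rejected because its canonical form is already in the parent table. With concurrent writers, the UNIQUE constraint on the parent table is what ultimately guards against duplicates; the SELECT is only a convenience.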
The need to optimize rises from a bad design.