Algorithm for matching database records?
-
Apologoes if this the wrong forum – could be in “Database”, or even C# - it’s a bit of a cross-over... The scenario: I have a database table holding records of items that are flagged according to various key terms held in another table – eg: keywords table: ID – integer sKeyword – string There are about a dozen of these, which won’t change, with ID’s from 1 to 12. items table: ID – integer iKeywords – Double (other fields) The iKeywords number is calculated simply as the sum of 2 ^ keywords.ID for all relevant keywords. A record may be associated with multiple keywords. Thus it is easy to select records matching keywords criteria, and updating records is similarly easy. The problem: Give the iKeyword value of a particular item, find other items which are matched with some or all of the keywords of this record, ordered by the number of matching keywords. The solution...? Apart from an obvious brute-force approach, I am sure there must be an elegant algorithm (c# or vb .NET preferably) that can do it.... even better a really smart SQL statement, but that’s probably hoping for too much. Or maybe I should be approaching the database structure differently?
-
Apologoes if this the wrong forum – could be in “Database”, or even C# - it’s a bit of a cross-over... The scenario: I have a database table holding records of items that are flagged according to various key terms held in another table – eg: keywords table: ID – integer sKeyword – string There are about a dozen of these, which won’t change, with ID’s from 1 to 12. items table: ID – integer iKeywords – Double (other fields) The iKeywords number is calculated simply as the sum of 2 ^ keywords.ID for all relevant keywords. A record may be associated with multiple keywords. Thus it is easy to select records matching keywords criteria, and updating records is similarly easy. The problem: Give the iKeyword value of a particular item, find other items which are matched with some or all of the keywords of this record, ordered by the number of matching keywords. The solution...? Apart from an obvious brute-force approach, I am sure there must be an elegant algorithm (c# or vb .NET preferably) that can do it.... even better a really smart SQL statement, but that’s probably hoping for too much. Or maybe I should be approaching the database structure differently?
Why is your
iKeywords
field is adouble
if you only have 12 keywords and they're not going to change? A 16-bit integer (smallint
) would be sufficient. The design feels wrong; the standard approach would be to have a many-to-many relationship between the keywords and the items:Keywords: KeywordID [PK], Keyword
Items: ItemID [PK], ...
ItemKeywords: ItemID [PK, FK], KeywordID [PK, FK]This would lead to a fairly simple solution:
SELECT
B.ItemID,
Count(1) As MatchedKeywordCount
FROM
ItemKeywords As A
INNER JOIN ItemKeywords As B
ON A.KeywordID = B.KeywordID
WHERE
A.ItemID = @ItemToMatch
And
B.ItemID != @ItemToMatch
GROUP BY
B.ItemID
ORDER BY
MatchedKeywordCount DESC
;With your current implementation (updated to use an integer type for
iKeywords
), you can identify matching products using bitwise-AND[^]:SELECT
...
FROM
Items
WHERE
(iKeywords & @KeywordsToMatch) != 0
;Counting the number of matching keywords is slightly more complicated. For example:
WITH N0 (X) As -- 2 rows
(
SELECT 1
UNION ALL
SELECT 1
),
N1 (X) As -- 4 rows
(
SELECT 1
FROM N0 A CROSS JOIN N0 B
),
N2 (X) As -- 16 rows
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM N1 A CROSS JOIN N1 B
),
PowersOfTwo (Value) As
(
SELECT Power(2, X)
FROM N2
),
MatchingItems As
(
SELECT
I.ItemID,
I.iKeywords & @KeywordsToMatch As MatchedKeywords
FROM
Items As I
WHERE
(I.iKeywords & @KeywordsToMatch) != 0
)
SELECT
I.ItemID,
Count(1) As MatchedKeywordCount
FROM
MatchingItems As I
INNER JOIN PowersOfTwo As P
ON (I.MatchedKeywords & P.Value) != 0
GROUP BY
I.ItemID
;
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
-
Why is your
iKeywords
field is adouble
if you only have 12 keywords and they're not going to change? A 16-bit integer (smallint
) would be sufficient. The design feels wrong; the standard approach would be to have a many-to-many relationship between the keywords and the items:Keywords: KeywordID [PK], Keyword
Items: ItemID [PK], ...
ItemKeywords: ItemID [PK, FK], KeywordID [PK, FK]This would lead to a fairly simple solution:
SELECT
B.ItemID,
Count(1) As MatchedKeywordCount
FROM
ItemKeywords As A
INNER JOIN ItemKeywords As B
ON A.KeywordID = B.KeywordID
WHERE
A.ItemID = @ItemToMatch
And
B.ItemID != @ItemToMatch
GROUP BY
B.ItemID
ORDER BY
MatchedKeywordCount DESC
;With your current implementation (updated to use an integer type for
iKeywords
), you can identify matching products using bitwise-AND[^]:SELECT
...
FROM
Items
WHERE
(iKeywords & @KeywordsToMatch) != 0
;Counting the number of matching keywords is slightly more complicated. For example:
WITH N0 (X) As -- 2 rows
(
SELECT 1
UNION ALL
SELECT 1
),
N1 (X) As -- 4 rows
(
SELECT 1
FROM N0 A CROSS JOIN N0 B
),
N2 (X) As -- 16 rows
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
FROM N1 A CROSS JOIN N1 B
),
PowersOfTwo (Value) As
(
SELECT Power(2, X)
FROM N2
),
MatchingItems As
(
SELECT
I.ItemID,
I.iKeywords & @KeywordsToMatch As MatchedKeywords
FROM
Items As I
WHERE
(I.iKeywords & @KeywordsToMatch) != 0
)
SELECT
I.ItemID,
Count(1) As MatchedKeywordCount
FROM
MatchingItems As I
INNER JOIN PowersOfTwo As P
ON (I.MatchedKeywords & P.Value) != 0
GROUP BY
I.ItemID
;
"These people looked deep within my soul and assigned me a number based on the order in which I joined." - Homer
Many thanks for this - should be enough to get me going! I used a double only because when I began I wasns't sure of the exact keyword list, so played safe in case there were more. I agree the design "looks wrong" as it is clearly limited to only a relatively small number of keywords. I'll have a play with doing it right!