Question about simple pattern recognition (hopefully the solution is simple too...)

James Shao

Hi guys, I know strictly speaking this is not a database question, but since this has to do with massive data manipulation, and that from past experiences I know there are some really intelligent and creative people here, hopefully bouncing some ideas here could shed some light on a possible solution. ;P Conceptually, I need to compare the pattern of object A, B, C, D, E, F. Each object has 5 attributes (a, b, c, d, e), the value of which can range from 0 to 9. Object A B C D E F a 6 9 1 9 9 3 b 8 3 5 1 1 3 c 7 0 8 3 5 8 d 7 5 6 5 5 2 e 8 9 2 2 8 0 The goal is to find out which objects have the highest match of attribute values. For example, in the above table, D & E have 3 matches, B & D have 2 matches, C & D have 1 match...etc. Of course a simple loop comparing each object's pattern to another would suffice if there's only 6 objects and 5 attributes like the above. But my eventual goal is to categorize some 10000 objects with 20 attributes. :rolleyes: For some reason I thought there must be an elegant solution to this, possibly involving some ingenious/creative data sorting using arrays. But again I've been in programming long enough to know simple concepts sometimes require unbelievably complex programmming. I'd greatly appreciate any suggestions or ideas. Many thanks again. :)

Mycroft Holmes

E has either 2 or 4 matches, otherwise I have missed your logic. Are you counting each col that matches eg A has 1 match with 3 cols, is it one or 3. Do the values of the match impact on your requirement (3x9 is worth more than 3x5) or (4x2 < 2x5) I can think of a few way, all simple and reiterative so I'll not waste your time.

Never underestimate the power of human stupidity RAH

Luc Pattyn

seems to me he is comparing each possible pair of two columns, so for N columns, that's N*(N-1)/2 column comparions, where identical field values constitute a match. And he is contemplating N=1000 :omg: and it looks like a huge job, possibly not a good approach to reach whatever the higher goal might be. :)

Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles]

I only read code that is properly formatted, adding PRE tags is the easiest way to obtain that.

i j russell

You could do what you want using Sql.

USE tempdb
GO

CREATE TABLE #Data
(
ObjectId INT NOT NULL,
AttributeId TINYINT NOT NULL,
Value TINYINT NOT NULL
)
GO

--Object A B C D E F
--a 6 9 1 9 9 3
--b 8 3 5 1 1 3
--c 7 0 8 3 5 8
--d 7 5 6 5 5 2
--e 8 9 2 2 8 0

INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (1, 1, 6);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (1, 2, 8);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (1, 3, 7);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (1, 4, 7);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (1, 5, 8);

INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (2, 1, 9);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (2, 2, 3);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (2, 3, 0);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (2, 4, 5);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (2, 5, 9);

INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (3, 1, 1);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (3, 2, 5);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (3, 3, 8);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (3, 4, 6);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (3, 5, 2);

INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (4, 1, 9);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (4, 2, 1);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (4, 3, 3);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (4, 4, 5);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (4, 5, 2);

INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (5, 1, 9);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (5, 2, 1);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (5, 3, 5);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (5, 4, 5);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (5, 5, 8);

INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (6, 1, 3);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (6, 2, 3);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (6, 3, 8);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (6, 4, 2);
INSERT INTO #Data (ObjectId, AttributeId, Value) VALUES (6, 5, 0);
GO

SELECT SourceId,
MatchId,
COUNT(*) AS TotalMatches
FROM
(
SELECT

James Shao

Thank you Holmes for the reply. :) In essence I am trying to compare one column's pattern against each other, so if column A and column B have three attriutes that are the same, then that's 3 matches for that pair. Luc is right in saying there'll be n(n+1)/2 comparisons, given n columns. Given roughly 10000 columns, that's about roughly 50 million comparisons. However, that is assuming that during each comparison loop you can only compare 1 versus 1 at a time. Now I am trying to figure out whether there's an ingenious way to solve this problem efficiently (1 hour of process time is okay by me). Ultimately, the purpose of this exercise is to write a code that can categorize various random patterns into groups. So for example, after this is done, you can observe and say, "Group 1 has higher attribute a, lower attribute b, lower attribute c", and "Group 2 has below average attrite a, but very high attribute d"...etc. :-D

James Shao

Wow it really worked! Thank you so much for the help Russell. :-D This result looks awesome: Object Object #ofMatches 4 5 3 2 4 2 2 5 2 3 4 1 1 5 1 2 6 1 3 6 1 Just curious, did it take you a long time to figure out how to do this?

i j russell

You're welcome. It took me more time to type in the sample code than it did to work out how to to it and run it. :-\ The Entity-Attribute-Value (EAV) pattern is one that is well known in database design. Many people have tried to use it instead of the standard normalized approach. It is not generally something that I would recommend but it suits your requirements perfectly. One thing to remember is to have a Clustered Index on the ObjectId and AttributeId to prevent table scans. This isn't really an issue with small data sets but will become an issue when you have more objects and attributes. Ian