Merging Inputs and producing ID number

ZeroPun

Hi everyone,

How would I go about taking X number of inputs and combining them all to create an 'output ID'? The key thing about the ID is that it should take all the inputs into account (preferably with weighting, such as having X1 slightly more important than X2) and produce a number that can be used to compare similarity. For example:

A) Inputs 1,2,3 ---> 0123456
B) Inputs 1,2,4 ---> 0123457
C) Inputs 5,3,8 ---> 1354565

Notice that A and B are similar, so their outputs are closer to each other than either is to C's.

Thank you for any help

modified on Friday, July 24, 2009 6:43 PM

Moreno Airoldi

If your inputs are very small integers, for example 1- or 2-digit numbers, and you only have a few of them, then you might simply use a sum of factors. Put your X's in order by weight, so that X1 has the biggest weight and so on, then (supposing you are using 2-digit integers) do something like this (the code is C#, but neutral enough):

long ID = (X1 * 1000000) + (X2 * 10000) + (X3 * 100) + X4;

Comparison with other IDs to find similarity can now be done using a threshold:

if (Math.Abs(ID1 - ID2) <= Threshold) ...

I know, too many "if"s to be usable, but I hope this helps somehow. :)
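A minimal sketch of the sum-of-factors idea, written in Python rather than the C# used in the post. It assumes every input is a non-negative integer below 100; the names `combine_id` and `similar` and the default base of 100 are illustrative, not part of the original suggestion.

```python
def combine_id(inputs, base=100):
    """Pack inputs (most important first) into one integer.

    Each input takes one base-`base` "digit", so with base=100 and four
    inputs this equals X1*1000000 + X2*10000 + X3*100 + X4.
    """
    result = 0
    for x in inputs:
        if not 0 <= x < base:
            raise ValueError(f"input {x} is outside 0..{base - 1}")
        result = result * base + x
    return result


def similar(id1, id2, threshold):
    """Two IDs count as similar when they differ by at most `threshold`."""
    return abs(id1 - id2) <= threshold


id_a = combine_id([1, 2, 3, 0])  # 1020300
id_b = combine_id([1, 2, 4, 0])  # 1020400
id_c = combine_id([5, 3, 8, 0])  # 5030800
```

With a threshold of 100, A and B count as similar and C does not. Note that a change of 1 in X1 moves the ID by 1,000,000, so with any small threshold X1 effectively has to match exactly.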

2+2=5 for very large amounts of 2 (always loved that one hehe!)

Leonardo Muzzi

Good. But maybe you could multiply by smaller numbers that are closer together; otherwise a slightly different X1 will never pass a threshold, and a very different X4 will always pass. I think you could do the same with different weights:

long ID = (X1 * W1) + (X2 * W2) + (X3 * W3) + (X4 * W4);

and vary the weights W1, W2, W3 and W4 and the threshold until you get a good result. As another solution, if your inputs really are 3 numbers, you could treat them as a three-dimensional vector and calculate its length, that is, the modulus of the vector. This would be:

double Modulus = Math.Sqrt(X1*X1 + X2*X2 + X3*X3);

You can also add weights in this formula, like X1*X1*W1 and so on. Then you can compare:

if (Math.Abs(Modulus1 - Modulus2) <= Threshold) ...
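Both formulas can be tried side by side. The sketch below is in Python rather than C#, and the function names and the weights `[3, 2, 1]` are made up for illustration; suitable weights and thresholds would have to come from experimenting with real data.

```python
import math


def weighted_sum(inputs, weights):
    """ID = X1*W1 + X2*W2 + ... (inputs listed most important first)."""
    return sum(x * w for x, w in zip(inputs, weights))


def weighted_modulus(inputs, weights):
    """Vector length with a weight on each squared term: sqrt(W1*X1^2 + ...)."""
    return math.sqrt(sum(w * x * x for x, w in zip(inputs, weights)))


weights = [3, 2, 1]  # hypothetical weights, X1 counts most
a, b, c = [1, 2, 3], [1, 2, 4], [5, 3, 8]

sums = [weighted_sum(v, weights) for v in (a, b, c)]         # [10, 11, 29]
lengths = [weighted_modulus(v, weights) for v in (a, b, c)]  # about 4.47, 5.20, 12.53
```

One thing to keep in mind: both formulas squeeze several numbers into one, so different inputs can end up with the same value. Without weights, for instance, the inputs 1,2,3 and 3,2,1 have exactly the same modulus. Unequal weights make that less likely but cannot rule it out.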

Regards, Leonardo Muzzi

Moreno Airoldi

I particularly like the vector modulus idea, very elegant. :thumbsup: But I see a problem with the two approaches you suggested: you would need dynamic weights to account for the difference in magnitude between the various inputs. If, for example, we have:

W1=3 ; W2=2 ; W3=1

X1=100 ; X2=300 ; X3=10

it's clear that we should increase the value of W1 in order to keep X2 from weighing more than X1, even though its weight W2 is already less than W1 (here W2*X2 = 600 outweighs W1*X1 = 300). With different input sets, the weights may have to be re-adjusted again. This means you would have to go through all the input sets in order to come up with proper weights before you start applying them. Even when possible, this would not be optimal.

I suggested factors of 10 because in most real-world applications it's natural to have inputs whose range is bounded by powers of 10 (for example 0 to 999). Using factors of 10 will retain all of the digits of each input.

If we want to save memory (bits), I think we should switch to factors of 2 and lose the least significant bits. Switching to factors of 2 is trivial: we just left-shift the values with higher weights, while values with lower weights stay to the right (in the least significant bits), and we preserve the logic I suggested with factors of 10 in terms of comparisons with thresholds.

For example, let's say that our inputs will be in the range from 0 to 999. We need 10 bits to hold that range of values, hence if we want to combine four inputs X1...X4 we need 40 bits. Now, it would sure be good to make that 32 bits, so that we can optimize memory usage and processing time by going for a nice standard 32-bit int. To do that, we just drop the two least significant bits of each input:

int ID = ((X1 >> 2) << 24) | ((X2 >> 2) << 16) | ((X3 >> 2) << 8) | (X4 >> 2);
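Here is a rough Python sketch of that packing, assuming four inputs in the range 0 to 999. The fields are merged with bitwise OR: the shifted fields occupy disjoint bits, so ANDing them together would always give 0. The helpers `pack_ids` and `unpack_ids` are hypothetical names used only for this example.

```python
def pack_ids(x1, x2, x3, x4):
    """Keep the top 8 of each input's 10 bits and place them in one 32-bit value."""
    for x in (x1, x2, x3, x4):
        if not 0 <= x <= 999:
            raise ValueError(f"input {x} is outside 0..999")
    return ((x1 >> 2) << 24) | ((x2 >> 2) << 16) | ((x3 >> 2) << 8) | (x4 >> 2)


def unpack_ids(packed):
    """Recover the inputs, minus the two low bits that were dropped."""
    return tuple(((packed >> shift) & 0xFF) << 2 for shift in (24, 16, 8, 0))


packed = pack_ids(999, 500, 10, 3)
restored = unpack_ids(packed)  # (996, 500, 8, 0)
```

Python integers do not overflow, so `packed` here behaves like an unsigned 32-bit value. In a language with a signed 32-bit int, any X1 of 512 or more sets the top bit and the ID turns negative, which would upset threshold comparisons; an unsigned type is probably the safer choice there.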

EDIT: of course the same can be done with factors of 10, by dividing the inputs by 10, 100, etc. in order to use up fewer digits. Looking forward to hearing your thoughts about this. :)


modified on Saturday, August 8, 2009 4:12 AM

Leonardo Muzzi

Hi there! I think dynamic weights should be used if the scenario asks for them. If he always needs X1 to be more valuable than X2, then go with dynamic weights and increase W1 as much as he needs. If the scenario asks for different results based on the input sets, that would be nice!

I think a good approach would be to use dynamic weights, but vary them based on the results. That is, choose a formula (like the sum of factors, the vector modulus or any other), implement the solution, and write a test program that runs as many tests as possible. Then compare the generated results with the desired results, and vary the weights to get closer to the desired ones. The test program could even vary the weights by itself. This would be close to a small neural network solution: the program learns how to proceed based on test data. The problem with this approach is that you need a large set of test data so that the program can "learn" enough.

About the memory usage, a very nice suggestion! I just think that shrinking the more important factors (X1, X2, ...) may compromise the solution, since the least significant bits of X1, for instance, could be more important than the whole of X4. But again, that depends on the scenario.

By the way, I forgot to mention that the vector modulus can be used with any number of inputs, not just 3, simply by adding more terms to the formula. Anyway, I think the author of the post has enough to work with! :)
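To make the "test program that varies the weights" idea concrete, here is a small Python sketch. It is a plain brute-force search over a few candidate weights and thresholds, not a neural network, and the labelled pairs are invented for the example; `score` and `tune` are hypothetical helper names.

```python
import itertools


def weighted_sum(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))


def score(weights, threshold, labelled_pairs):
    """Count the pairs whose similar / not-similar label is reproduced."""
    correct = 0
    for a, b, expected_similar in labelled_pairs:
        diff = abs(weighted_sum(a, weights) - weighted_sum(b, weights))
        if (diff <= threshold) == expected_similar:
            correct += 1
    return correct


def tune(labelled_pairs, weight_choices, threshold_choices, n_inputs):
    """Try every combination and keep the first one with the best score."""
    best = None
    for weights in itertools.product(weight_choices, repeat=n_inputs):
        for threshold in threshold_choices:
            s = score(weights, threshold, labelled_pairs)
            if best is None or s > best[0]:
                best = (s, weights, threshold)
    return best


# Made-up test data: (inputs A, inputs B, should they count as similar?)
pairs = [
    ([1, 2, 3], [1, 2, 4], True),   # small change in X3: still similar
    ([1, 2, 3], [5, 3, 8], False),  # everything changed: not similar
    ([1, 2, 3], [2, 2, 3], False),  # small change in X1: already too different
]
best_score, best_weights, best_threshold = tune(pairs, range(1, 6), range(1, 6), 3)
```

On this tiny data set the search finds weights that reproduce all three labels, and any such solution necessarily gives X1 a larger weight than X3. With real data the search space grows quickly, so a smarter optimiser and much more test data would be needed.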

Regards, Leonardo Muzzi

Moreno Airoldi

Leonardo Muzzi wrote:

Anyway, I think the author of the post has enough to work with! :)

That's for sure hehe, the rest is just for our fun! :P


Moreno Airoldi

Leonardo Muzzi wrote:

About the memory usage, a very nice suggestion! I just think that maybe shrinking the more important factors (X1, X2,...) can compromise the solution, since the least significant bits of X1, for instance, could be more important than the whole X4. But again, that depends on the scenario.

BTW I forgot to mention - you are absolutely right (depending on the scenario, of course, yep, but I think it goes for most)! So it should be changed to something similar to:

int ID = (X1 << 22) | ((X2 >> 1) << 13) | ((X3 >> 3) << 6) | (X4 >> 4);
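The uneven layout keeps 10, 9, 7 and 6 bits of the four inputs, which adds up to 32. A short Python sketch of the same packing, generalised to any layout; `pack_weighted` and `LAYOUT` are names invented for this example, and inputs are assumed to fit in 10 bits.

```python
def pack_weighted(values, kept_bits, input_bits=10):
    """Pack values (most important first), keeping kept_bits[i] high bits of each."""
    packed = 0
    for value, bits in zip(values, kept_bits):
        if not 0 <= value < (1 << input_bits):
            raise ValueError(f"input {value} does not fit in {input_bits} bits")
        # Make room for this field, then append its most significant bits.
        packed = (packed << bits) | (value >> (input_bits - bits))
    return packed


LAYOUT = (10, 9, 7, 6)  # bits kept for X1..X4, 32 in total

x1, x2, x3, x4 = 999, 500, 10, 3
packed = pack_weighted((x1, x2, x3, x4), LAYOUT)
same = (x1 << 22) | ((x2 >> 1) << 13) | ((x3 >> 3) << 6) | (x4 >> 4)
```

Here `packed` and `same` are equal. Because X1 keeps all of its bits, a difference of 1 in X1 always outweighs any difference in the other three inputs, which is the property the more important inputs were meant to have.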
