How to find the similarity between Twitter users? How to design a good, efficient approach?

ldaneil

I am working on a data mining project. My company has given me dummy Twitter data for 6 million customers, and I was assigned to find the similarity between any two users. Could anyone give me some ideas on how to deal with such a large community data set? Thanks in advance.

Problem: I use tweets and hashtags (hashtags are the words highlighted by the user) as the two criteria for measuring the similarity between two users. The number of users is large, and each user may have millions of hashtags and tweets. Can anyone tell me a good way to calculate the similarity between two users quickly? I have tried TF-IDF, but it seems infeasible at this scale. Does anyone have an algorithm or a good idea that would let me quickly find the similarities between all users?

For example:
user A's hashtags = {cat, bull, cow, chicken, duck}
user B's hashtags = {cat, chicken, cloth}
user C's hashtags = {lenovo, Hp, Sony}

Clearly, C has no relation to A, so calculating their similarity is a waste of time. We could filter out all the unrelated users before calculating any similarity. In fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as the criterion to quickly find the group of users potentially similar to A? Is this a good idea, or should we just calculate the similarity between A and every other user directly? What would be the fastest algorithm for this problem?
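
The filtering step described above could be sketched with an inverted index that maps each hashtag to the users who used it. This is only a minimal sketch: the names `build_index` and `candidates` are hypothetical, and it assumes each user's hashtags fit in memory as a Python set.

```python
from collections import defaultdict

def build_index(user_tags):
    """Map each hashtag to the set of users who used it."""
    index = defaultdict(set)
    for user, tags in user_tags.items():
        for tag in tags:
            index[tag].add(user)
    return index

def candidates(user, user_tags, index):
    """Return the users who share at least one hashtag with `user`."""
    found = set()
    for tag in user_tags[user]:
        found |= index[tag]
    found.discard(user)  # a user is not their own candidate
    return found

# The three users from the example above (hashtags lowercased).
user_tags = {
    "A": {"cat", "bull", "cow", "chicken", "duck"},
    "B": {"cat", "chicken", "cloth"},
    "C": {"lenovo", "hp", "sony"},
}
index = build_index(user_tags)
similar_to_a = candidates("A", user_tags, index)
```

With an index like this, the expensive similarity calculation only needs to run on the candidates, so C is never compared with A.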

Pete OHanlon

Is your company going to give your salary to anyone here for solving this? It's your job after all, not ours.

ldaneil

No, I am a university student, and I do not get any salary. I just want to discuss this with some coding pros and smart people. I would really appreciate it if someone could give me some ideas. I think the forum is for discussing programming questions, so that we can help each other and improve our programming skills. I hope some capable coders can give me a few hints. Thanks.

jschell

You should eliminate trivial words like 'a', 'and', etc. Then research matching algorithms. I would start with the following Google search string: algorithms for set matching -string
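
As a rough illustration of this suggestion, the sketch below drops trivial words and then compares two users as sets. The Jaccard index is one common set-matching measure, chosen here as an assumption rather than something named in the thread; `STOP_WORDS`, `keywords` and `jaccard` are hypothetical names, and the stop-word list is deliberately tiny.

```python
# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def keywords(text):
    """Lowercase the text, split on whitespace, and drop trivial words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}

def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

tags_a = {"cat", "bull", "cow", "chicken", "duck"}
tags_b = {"cat", "chicken", "cloth"}
tags_c = {"lenovo", "hp", "sony"}

score_ab = jaccard(tags_a, tags_b)  # 2 shared tags out of 6 distinct tags
score_ac = jaccard(tags_a, tags_c)  # no shared tags
```

The score runs from 0 (nothing in common) to 1 (identical sets), so it can be used to rank the users that survive any earlier filtering.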

April Fans

Well, you could try to find the similarity, or "document distance", between Twitter users by matching their tweets against each other, much like the way one searches for plagiarism. Perhaps that might work. You could start by searching the tweets of a particular Twitter user with some sort of application. If I am not mistaken, I believe Twitter has something like this available. Groups of users can then be compared against each other in the same way, which gives us the similarity or "document distance" between Twitter users.
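
One common way to compute a "document distance" is cosine similarity over TF-IDF vectors, treating all of a user's tweets as one document. The sketch below is an assumption about how that could look, not the method April had in mind; `tfidf_vectors` and `cosine` are hypothetical names, and it uses the plain `log(N / df)` form of IDF.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps user -> list of tokens; returns user -> {term: weight}."""
    n = len(docs)
    df = Counter()  # number of users whose tokens contain each term
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for user, tokens in docs.items():
        tf = Counter(tokens)
        vectors[user] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy token lists standing in for each user's tweets.
docs = {
    "A": ["cat", "cow", "chicken", "cat"],
    "B": ["cat", "chicken", "cloth"],
    "C": ["lenovo", "hp", "sony"],
}
vecs = tfidf_vectors(docs)
sim_ab = cosine(vecs["A"], vecs["B"])
sim_ac = cosine(vecs["A"], vecs["C"])
```

Because the vectors are sparse dicts, users with no terms in common score exactly zero without touching the full vocabulary.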

April Comm100 - Leading Live Chat Software Provider

ldaneil

Yes, I will definitely have to use strings and arrays to process the data. However, I don't know exactly how to do it. The idea is not clear yet. Thanks very much for your reply. ;)

ldaneil

Thanks very much for your suggestion. I will do some research on document distance. For a data set this huge, the normal approach is definitely infeasible, so I have to find a good idea for how to implement it. The project's focus is the idea. The coding should be very simple, but if the idea is poor, the whole project becomes useless. I really appreciate your suggestion.

April Fans

You're very welcome! It was what initially popped into my head, though I believe there is probably a stronger, more suitable way to carry out such a project given the large amount of data you will be dealing with. I find your project quite interesting! Best of luck! With kind regards,

April

Marc Koutzarov

Take a look at the Levenshtein distance.
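
For reference, a minimal sketch of the Levenshtein (edit) distance using the standard dynamic-programming recurrence is shown below. Note that it compares two strings character by character, so in this project it is probably better suited to matching misspelled or near-duplicate hashtags than to comparing whole users; that use is an assumption, not something stated in the thread.

```python
def levenshtein(s, t):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete from s
                current[j - 1] + 1,      # insert into s
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

typo_distance = levenshtein("hastag", "hashtag")
```

Only two rows of the table are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.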