Code Project
How to find the similarity between users in Twitter? How to design a good and efficient approach?

    ldaneil
    wrote on last edited by
    #1

    I am working on a data mining project. My company has given me dummy Twitter data for 6 million customers, and I have been asked to compute the similarity between any two users. Can anyone give me some ideas on how to deal with such a large community data set? Thanks in advance.

    Problem: I use tweets and hashtags (hashtags are the words highlighted by the user) as the two criteria for measuring the similarity between two users. There is a large number of users, and each user may have millions of hashtags and tweets. Can anyone suggest a fast way to calculate the similarity between two users? I have tried using TF-IDF for this, but it seems infeasible at this scale. Does anyone have an algorithm or a good idea that would let me quickly find the similarities between all users?

    For example:
    user A's hashtags = {cat, bull, cow, chicken, duck}
    user B's hashtags = {cat, chicken, cloth}
    user C's hashtags = {lenovo, Hp, Sony}

    Clearly, C has no relation to A, so computing their similarity is a waste of time; we could filter out all such unrelated users before calculating any similarity. In fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags as a criterion to quickly find the group of users potentially similar to A? Is this a good idea, or should we just directly calculate the similarity between A and all other users? What algorithm would be the fastest and best suited to this problem?
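
The pre-filtering idea in this post can be sketched with an inverted index from hashtag to users, so that only users sharing at least one hashtag with A are ever scored. This is a minimal sketch, not a tuned solution: the function names are made up for illustration, and Jaccard similarity is an assumed choice of measure, since the post does not say which one to use.

```python
from collections import defaultdict


def build_index(user_tags):
    """Map each hashtag to the set of users who used it (an inverted index)."""
    index = defaultdict(set)
    for user, tags in user_tags.items():
        for tag in tags:
            index[tag].add(user)
    return index


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_users(user, user_tags, index):
    """Score only the users who share at least one hashtag with `user`."""
    tags = user_tags[user]
    candidates = set()
    for tag in tags:
        candidates |= index[tag]
    candidates.discard(user)  # a user is not a candidate for itself
    return {other: jaccard(tags, user_tags[other]) for other in candidates}


# The three example users from the post.
user_tags = {
    "A": {"cat", "bull", "cow", "chicken", "duck"},
    "B": {"cat", "chicken", "cloth"},
    "C": {"lenovo", "Hp", "Sony"},
}
index = build_index(user_tags)
scores = similar_users("A", user_tags, index)
print(scores)  # only B is scored; C is never compared with A
```

With this layout, the cost of a query depends on how many users share A's hashtags rather than on the total number of users, which is the point of filtering first.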

      Pete OHanlon
      wrote on last edited by
      #2

      Is your company going to give your salary to anyone here for solving this? It's your job after all, not ours.

      *pre-emptive celebratory nipple tassle jiggle* - Sean Ewington

      "Mind bleach! Send me mind bleach!" - Nagy Vilmos

      CodeStash - Online Snippet Management | My blog | MoXAML PowerToys | Mole 2010 - debugging made easier

        ldaneil
        wrote on last edited by
        #3

        No, I am a university student, and I do not get any salary. I just want to discuss this with some coding pros and smart people. I would really appreciate it if someone could give me some ideas. I think the forum is for discussing programming questions, so that we can help each other and improve our programming skills. I hope some capable coders can give me some hints. Thanks.

          jschell
          wrote on last edited by
          #4

          You should eliminate trivial words like 'a', 'and', etc., and then research matching algorithms. I would start with the following Google search string: algorithms for set matching -string
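
The first step suggested here, dropping trivial words before matching, might look like the sketch below. The stop-word list is a tiny illustrative sample and `keywords` is a hypothetical helper name; a real project would likely use a fuller list.

```python
import re

# Illustrative sample only; not a complete stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}


def keywords(tweet):
    """Lowercase the tweet, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z0-9#']+", tweet.lower())
    return {w for w in words if w not in STOP_WORDS}


print(keywords("A cat and a chicken walked to the farm"))
```

The result is a set, so the remaining words can be fed directly into set-based matching such as intersection or union counts.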

            April Fans
            wrote on last edited by
            #5

            Well, you could try to find the similarity, or "document distance", between Twitter users by matching their tweets against each other, much like the way one searches for plagiarism. Perhaps that might work. You could start out by searching the tweets of a particular Twitter user with some sort of application; if I am not mistaken, I believe Twitter has something like this available. Comparisons between groups of users can be carried out in the same way, which gives a comparison of the similarity, or "document distance", of Twitter users.
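
One common way to make "document distance" concrete is to treat each user's combined tweets as a vector of word counts and take the angle between two such vectors. The sketch below assumes that definition, which the post itself does not spell out, and splits words on whitespace only; the function names are illustrative.

```python
import math
from collections import Counter


def word_counts(text):
    """Count how often each whitespace-separated word occurs."""
    return Counter(text.lower().split())


def cosine_similarity(c1, c2):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document is treated as unrelated to everything
    return dot / (norm1 * norm2)


def document_distance(text1, text2):
    """Angle in radians: near 0 for identical word usage, pi/2 for no overlap."""
    sim = cosine_similarity(word_counts(text1), word_counts(text2))
    return math.acos(min(1.0, max(-1.0, sim)))  # clamp against rounding error


print(document_distance("cat chicken cow", "cat chicken cloth"))
print(document_distance("cat chicken", "lenovo sony"))
```

A smaller angle means more similar word usage, so the second pair, which shares no words, lands at the maximum distance.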

            April Comm100 - Leading Live Chat Software Provider

              ldaneil
              wrote on last edited by
              #6

              Yes, I will definitely have to use strings and arrays to process the data. However, I don't know exactly how to do it; the idea is not clear yet. Thanks very much for your reply. ;)

                ldaneil
                wrote on last edited by
                #7

                Thanks very much for your suggestion. I will do some research on document distance. With such a huge amount of data, the normal approach is definitely infeasible, so I have to find a good idea for how to implement it. The project's focus is the idea; the coding should be very simple, but if the idea is poor, the whole project will be useless. I really appreciate your suggestion.

                  April Fans
                  wrote on last edited by
                  #8

                  You're very welcome! It was what initially popped into my head, though I believe there is probably a stronger, more suitable way to carry out such a project given the large amounts of data you will be dealing with. I find your project quite interesting! Best of luck! With kind regards,

                  April Comm100 - Leading Live Chat Software Provider

                    Marc Koutzarov
                    wrote on last edited by
                    #9

                    Take a look at the Levenshtein distance
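
For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal sketch of the standard dynamic-programming version. How to apply it here is an assumption on my part: it compares strings, so it may fit best for spotting near-duplicate hashtags (such as misspellings) rather than for comparing whole users directly.

```python
def levenshtein(s, t):
    """Minimum number of single-character edits needed to turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(levenshtein("hashtag", "hastag"))  # one deleted character
print(levenshtein("kitten", "sitting"))
```

The running time is proportional to the product of the two string lengths, so it is cheap for short hashtags but expensive for long texts.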
