Detecting Information Bias

The Lounge | 7 Posts | 4 Posters
    Lost User wrote (#1):

    Hi, I just stumbled on something and thought I would share it. A while back I mentioned that I was working on a CCC analyzer/solver in my spare time. It's a side project and I haven't finished it. Since I will have some free time towards the end of this year, I am picking the project up again. As part of it I am analyzing crossword puzzles using a skip-gram word embedding with over 2 trillion tokens (3M vocab) evaluated in 500 dimensions. The embedding is trained on parts of the English Gigaword corpus, the Wikipedia dump, and most of the news/science articles from 2011-2017. (Yes, a lot of data!) One of my unit tests checks 100 common nouns in the English language for certain characteristics.

    [Top 10 correlations for Government]

    governments 0.723813
    minister 0.618532
    administration 0.60618
    federal 0.595554
    governmental 0.587466
    cabinet 0.584909
    public 0.583068
    ministry 0.579487
    officials 0.572555
    whitlam 0.565244
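
A minimal sketch of how a nearest-neighbour list like the one above might be produced, assuming the scores are cosine similarities between embedding vectors (the post does not say which measure was used). The names `cosine`, `top_neighbors` and `toy_vectors` are hypothetical, and the three-dimensional toy vectors stand in for the real 500-dimensional embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_neighbors(word, vectors, k=10):
    """Return the k words most similar to `word`, best first, as (word, score) pairs."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data, invented purely for illustration.
toy_vectors = {
    "government": [0.9, 0.1, 0.0],
    "governments": [0.8, 0.2, 0.0],
    "minister": [0.6, 0.5, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

for neighbor, score in top_neighbors("government", toy_vectors, k=3):
    print(f"{neighbor} {score:.6f}")
```

With a measure like this, a rare proper noun can rank highly whenever it appears in much the same contexts as the target word in the training text, regardless of how often people use it in everyday speech.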

    I like to think that I have a good grasp of the English language. However, last night I noticed something that stood out: a word relation that seemed unusual. The word 'Whitlam' was showing up as being very highly related to the word 'government'. I'd never heard of that word before, so I looked up the definition. It's not a word, it's a person's name, but how could the world's population of 7.9 billion use this word at such a high frequency in the context of 'government'? The Spearman[^] and Pearson correlation[^] were so high... it could only mean that the word was being used directly next to


      Greg Utas wrote (#2):

      Wikipedia also needs to be heavily faded for anything that involves political opinion.

      Robust Services Core | Software Techniques for Lemmings | Articles
      The fox knows many things, but the hedgehog knows one big thing.



        ElectronProgrammer wrote (#3):

        Anything using text as a source is hard to process :( I never worked directly with any of that, and frankly I didn't understand any of the more technical terms (zipfian?!? ;P ), but I once worked on a project with people who did, and I caught a few things (hopefully correctly). From what they explained at the time, if my memory is working correctly, they filtered that kind of "temporarily important information" using normalization, word appearance rate and a temporal sliding window. As I remember, the algorithm was something like this: for a certain period of time (the temporal sliding window), calculate the rate of increase or decrease in the count of the target word (the word appearance rate) compared to the previous period, and let it inversely affect the normalization (if the count increases, it has a negative effect on the total count, and the faster the growth, the bigger the impact). Then move the time window forward and repeat. The result was that spikes caused by a temporary increase in usage (for example, due to news articles) were smoothed out, while the overall count of the word did not grow significantly. I hope I made some sense and did not just write something that is a complete lie.
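
The recollection above could be sketched roughly as follows. This is only one possible reading, not the original project's algorithm: the function name `damped_count`, the weight formula `1 / (1 + growth)`, and the choice to leave flat or falling windows unweighted are all assumptions made for illustration.

```python
def damped_count(window_counts):
    """Sum per-window counts of one word, down-weighting windows that grew.

    `window_counts` holds the word's raw count in consecutive time windows.
    Assumption: a window whose count rose relative to the previous window is
    scaled by 1 / (1 + growth), so faster growth means a smaller weight.
    Windows that are flat or falling are counted in full.
    """
    total = 0.0
    previous = None
    for count in window_counts:
        if previous is None or count <= previous:
            weight = 1.0
        else:
            # Relative growth versus the previous window; max() avoids dividing by zero.
            growth = (count - previous) / max(previous, 1)
            weight = 1.0 / (1.0 + growth)
        total += count * weight
        previous = count
    return total


steady = [10, 10, 10, 10]   # no spike
spiky = [10, 10, 100, 10]   # one-window burst, e.g. a news event

print(sum(steady), damped_count(steady))
print(sum(spiky), damped_count(spiky))
```

In this sketch the one-window burst adds far less to the damped total than to the raw sum, while a rise that persists into the following window is counted in full from then on, which matches the "temporary spikes are smoothed out" behaviour described in the post.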


          Lost User wrote (#4):

          Hmmmm, It's interesting to track social interests over time. :) Codeproject[^] over the last 17 years. Github[^] Stack Overflow[^] Something happened on November 9th[^] (probably here in the Lounge) :-O that caused a huge spike in search traffic from Virginia. :-D


            trønderen wrote (#5):

            Statistics covering only the USA are of limited interest outside the USA, unless it is a US-only phenomenon. So maybe we should leave CodeProject, GitHub and Stack Overflow to the USAians, and make something different for the rest of the world.


              ElectronProgrammer wrote (#6):

              Randor wrote:

              It's interesting to track social interests over time

              Yes. But most web sites seem to end up with a curve similar to CodeProject's, and the question becomes how long that tail is. Also interesting is that big gray rectangle on Stack Overflow's map. I had to look it up, and it is Wyoming. Either it has no data or no data was produced. Both equally strange :confused:

              Randor wrote:

              Something happened on November 9th[^] (probably here in the Lounge)

              Sorry. I think I was offline that day and missed it. I went back through the Lounge and couldn't find anything I could connect with that day (but I am not that smart). The link you sent has a C# tag, and a general web search returns the launch of new features for C#.


                Lost User wrote (#7):

                :) Don't mess with me north man. Today is Saturday and Saturday is wine day. Besides, after my third glass I become a black belt in Kung Fu.
