Detecting Information Bias

Lost User

Hi, I just stumbled on something and thought I would share it. A while back I mentioned that I was working on a CCC analyzer/solver in my spare time; it's a side project and I haven't finished it. Since I will have some free time towards the end of this year, I am picking up the project again. As part of the project I am analyzing the crossword puzzles using a skip-gram word embedding with over 2 trillion tokens (3M vocab) evaluated in 500 dimensions. The embedding is trained from parts of the English Gigaword corpus, the Wikipedia dump and most of the news/science articles from 2011-2017. (Yes, a lot of data!) One of my unit tests checks 100 common nouns in the English language for certain characteristics.

[Top 10 correlations for Government]

governments 0.723813
minister 0.618532
administration 0.60618
federal 0.595554
governmental 0.587466
cabinet 0.584909
public 0.583068
ministry 0.579487
officials 0.572555
whitlam 0.565244
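For anyone who wants to reproduce a list like this, here is a minimal sketch of a nearest-neighbour lookup. It assumes the scores above are cosine similarities between embedding vectors, which the post does not actually state. The `toy` vectors and the function names `cosine` and `top_neighbors` are made up for illustration and have nothing to do with the real 500-dimensional model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_neighbors(embeddings, word, k=10):
    """Return the k words most similar to `word`, best match first."""
    target = embeddings[word]
    scores = [
        (other, cosine(target, vec))
        for other, vec in embeddings.items()
        if other != word  # skip the query word itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical 3-dimensional vectors, invented for this example only.
toy = {
    "government": [0.9, 0.1, 0.0],
    "governments": [0.8, 0.2, 0.0],
    "minister": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

for neighbor, score in top_neighbors(toy, "government", k=3):
    print(f"{neighbor} {score:.6f}")
```

With a real model the dictionary would hold millions of entries, so a vectorized or approximate search would be the more practical choice; the plain loop here only shows the idea.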

I like to think that I have a good grasp of the English language. However, last night I noticed something that stood out: a word relation that seemed unusual. The word 'Whitlam' was showing up as very highly related to the word 'government'. I'd never heard of that word before, so I looked up the definition. It's not a word, it's a person's name, but how could the world's population of 7.9 billion use this word at such a high frequency in the context of 'government'? The Spearman[^] and Pearson correlation[^] were so high... it could only mean that the word was being used directly next to

Greg Utas

Wikipedia also needs to be heavily faded for anything that involves political opinion.


ElectronProgrammer

Anything using text as a source is hard to process :( I never worked directly with any of that, and frankly I didn't understand most of the more technical terms (zipfian?!? ;P ), but I did once work on a project with people who did, and I caught a few things (hopefully correctly). From what they explained at the time, if my memory is working correctly, they filtered that kind of "temporarily important information" using normalization, word appearance rate and a temporal sliding window.

As I remember, the algorithm was something like this: for a certain period of time (the temporal sliding window), calculate the rate of increase or decrease in the count of the target word (the word appearance rate) compared to the previous period, and let it inversely affect the normalization (if the count increases, it has a negative effect on the total count, and the faster the growth, the bigger the impact). Then move the time window forward and repeat. The result was that spikes caused by a temporary increase in usage (for example, due to news articles) were smoothed out, while the overall count of the word would not grow significantly. I hope I made some sense and that I did not just write something that is a complete lie.
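The windowed idea above can be sketched roughly as follows. This is only one possible reading of the description, not the algorithm that team actually used: the function name `damped_count`, the weight formula `1 / (1 + growth)` and the example counts are all assumptions made for illustration.

```python
def damped_count(window_counts):
    """Sum per-window counts of a word, damping windows where usage spikes.

    `window_counts` holds the word's raw count in each consecutive time
    window. Growth relative to the previous window shrinks that window's
    contribution; flat or falling counts are added unchanged.
    """
    total = 0.0
    previous = None
    for count in window_counts:
        if previous is None:
            growth = 0.0  # first window has nothing to compare against
        else:
            # Relative growth; max(..., 1) avoids dividing by zero.
            growth = (count - previous) / max(previous, 1)
        # Assumed damping rule: the faster the growth, the smaller the share.
        total += count / (1.0 + max(growth, 0.0))
        previous = count
    return total


steady = [10, 10, 10, 10]    # hypothetical weekly counts, no spike
spiking = [10, 10, 100, 10]  # same word with a one-week news spike

print(damped_count(steady))
print(damped_count(spiking))
print(sum(spiking))          # raw total, for comparison
```

Under this particular weighting, a window that jumps above the previous one contributes only about as much as the previous window did, so a short news-driven burst barely moves the total.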

Lost User

Hmmmm, it's interesting to track social interests over time. :) CodeProject[^] over the last 17 years. GitHub[^] Stack Overflow[^] Something happened on November 9th[^] (probably here in the Lounge) :-O that caused a huge spike in search traffic from Virginia. :-D

trønderen

Statistics covering only the USA are of limited interest outside the USA, unless it is a US-only phenomenon. So maybe we should leave CodeProject, GitHub and Stack Overflow to the USAians, and make something different for the rest of the world.

ElectronProgrammer

Randor wrote:

It's interesting to track social interests over time

Yes. But most web sites seem to end up with a curve similar to CodeProject's, and the question becomes how long that tail is. Also interesting is that big gray rectangle on Stack Overflow's map. I had to look it up, and it is Wyoming. Either it has no data or no data was produced. Both equally strange :confused:

Randor wrote:

Something happened on November 9th[^] (probably here in the Lounge)

Sorry. I think I was offline that day and missed it. I went back to the Lounge and couldn't find anything I could connect with that day (but I am not that smart). The link you sent has a C# tag, and a general web search returns the launch of new features for C#.

Lost User

:) Don't mess with me north man. Today is Saturday and Saturday is wine day. Besides, after my third glass I become a black belt in Kung Fu.