4 petabytes of useless drivel
-
Michael Schubert wrote:
I bet that a simple compression algorithm would squeeze this down
Unlikely. Compression works by looking for repeated patterns, and since each tweet is 140 characters maximum, there's little opportunity for compression. Yes, bolt a thousand tweets together and they'll compress really well, but then I would imagine it's really difficult to process the compressed data the way the site needs to.
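A quick back-of-the-envelope illustration in Python (made-up tweet text; using identical copies overstates the effect, but the direction holds):

import zlib

# One short tweet compresses poorly; 1,000 concatenated tweets compress well.
tweet = b"plz check this out coz it's awesome lol #stuff http://example.com"
batch = tweet * 1000  # stand-in for 1,000 tweets bolted together

alone = len(zlib.compress(tweet, 9))
together = len(zlib.compress(batch, 9))

print(f"one tweet: {len(tweet)} -> {alone} bytes")
print(f"1,000 tweets: {len(batch)} -> {together} bytes ({together / 1000:.1f} bytes per tweet)")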
Compressing the tweet itself is probably an incidental problem. Twitter's 'value' to advertisers and such lies in the meta-information about the tweets.
Software Zen:
delete this;
Fold With Us![^] -
Michael Schubert wrote:
See here: http://www.neowin.net/news/storing-tweets-requires-four-petabytes-of-data-a-year[^] I bet that a simple compression algorithm would squeeze this down to 0.00001% of the original data considering the frequency of "words" like coz, plz, etc.
Just trash every "tweet" older than 1 hour; that should solve it. That's about the average lifespan, according to this.
saru mo ki kara ochiru (even monkeys fall from trees) Usually I'm that monkey. If you want an intelligent answer, don't ask me. To understand Recursion, you must first understand Recursion.
-
Michael Schubert wrote:
I bet that a simple compression algorithm would squeeze this down to 0.00001% of the original data considering the frequency of "words" like coz, plz, etc.
Just remove Stephen Fry's posts - removing his luvviness would squeeze it down to 0.00001% easily.
I have CDO, it's OCD with the letters in the right order; just as they ruddy well should be
Forgive your enemies - it messes with their heads
-
R-tsumami wrote:
Just trash every "tweet" older than 1 hour; that should solve it.
That's what jumped out at me when I read this:

> All that data is being analyzed by Weil and his team to attempt to find information that would be useful to help make Twitter a profitable business. Twitter has been working hard to become profitable

If you're storing all that data, growing at that rate, needing to move to a new data center, and you still aren't profitable, then maybe you need to look at your business model. I might be proven dramatically wrong, and it wouldn't be the first time, but we seem to be in the midst of a Twitter bubble where the perceived value of tweets has gotten out of all proportion. Perhaps if tweets are really worth so much, Twitter should try auctioning them off and make the storage someone else's problem.

Up for auction today, we have the 2009 tweet collection... what am I bid? Seriously? No bids? Years from now, when someone figures out how to make a profit from these things, you could be sitting on a goldmine! Someone start me off at $1? Anyone? Anyone? Bueller? Anyone? Huh! Whaddya know, this $^!* is worthless.

-Rd
-
Gary R. Wheeler wrote:
Twitter's 'value' to advertisers and such lies in the meta-information about the tweets.
I'd love to see what's of value :)
Two heads are better than one.
A lot of it will be indirect. For example, BMW launches a new car via television advertising. Their marketing department would (a) like to know how many people are talking about the new car, and (b) *really, really* like to get hold of their Twitter handles.
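A toy sketch of that kind of query in Python (hypothetical data and names, assuming tweets arrive as (handle, text) pairs):

tweets = [
    ("@alice", "Just saw the new BMW ad, want one"),
    ("@bob", "dinner was great"),
    ("@carol", "that BMW launch spot is everywhere"),
]

# Filter the stream for mentions and collect the handles marketers want.
mentions = [(handle, text) for handle, text in tweets if "bmw" in text.lower()]
print(len(mentions), "people talking about the new car")
print("handles to target:", [handle for handle, _ in mentions])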
-
Electron Shepherd wrote:
Unlikely
And obviously I was exaggerating.
Surely not!
Henry Minute Do not read medical books! You could die of a misprint. - Mark Twain Girl: (staring) "Why do you need an icy cucumber?" “I want to report a fraud. The government is lying to us all.”
-
Richard A. Dalton wrote:
Years from now when someone figures out how to make a profit from these things you could be sitting on a goldmine? Someone start me off at $1? Anyone? Anyone? Bueller? Anyone?
Does that include the 4 petabytes of storage, or is that sold separately?
saru mo ki kara ochiru (even monkeys fall from trees) Usually I'm that monkey. If you want an intelligent answer, don't ask me. To understand Recursion, you must first understand Recursion.
-
Michael Schubert wrote:
I bet that a simple compression algorithm would squeeze this down
Electron Shepherd wrote:
Unlikely. Compression works by looking for repeated patterns, and since each tweet is 140 characters maximum, there's little opportunity for compression.
Solution: build an initial compression table over the whole data set, then use this table for compressing the tweets. Still, you're right in one aspect: the per-message overhead (user ID, datetime, IP? What else?) is significant compared to 140 characters of content.
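For what it's worth, zlib's preset-dictionary support does more or less exactly this. A minimal sketch in Python, assuming a hypothetical shared dictionary seeded with common tweet vocabulary (in practice you'd mine the most frequent substrings from a sample corpus):

import zlib

# Hypothetical shared dictionary; a real deployment would build it from data.
SHARED_DICT = b"coz plz lol omg the and you for with just http://www. @ #"

def compress_tweet(text):
    # Compress a single tweet against the shared preset dictionary.
    c = zlib.compressobj(level=9, zdict=SHARED_DICT)
    return c.compress(text.encode("utf-8")) + c.flush()

def decompress_tweet(blob):
    # Decompression needs the exact same dictionary.
    d = zlib.decompressobj(zdict=SHARED_DICT)
    return (d.decompress(blob) + d.flush()).decode("utf-8")

tweet = "omg plz RT coz this is just the best lol"
packed = compress_tweet(tweet)
print(len(tweet.encode("utf-8")), "->", len(packed), "bytes")
assert decompress_tweet(packed) == tweet

Each tweet stays individually decompressible, so the site can still process one record at a time; the zlib header and checksum still cost a few bytes per message, though, which is exactly the per-message overhead point above.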
Agh! Reality! My Archnemesis![^]
| FoldWithUs! | sighist | WhoIncludes - Analyzing C++ include file hierarchy -
Michael Schubert wrote:
I bet that a simple compression algorithm would squeeze this down to 0.00001% of the original data considering the frequency of "words" like coz, plz, etc.
When an alien race finally finds the ruins of human civilisation, they will ultimately reach the conclusion that we died out because we were all complete f***in' idiots.
Beam it all straight to the incinerator, Number 1.
-
Michael Schubert wrote:
I bet that a simple compression algorithm would squeeze this down to 0.00001% of the original data considering the frequency of "words" like coz, plz, etc.
I don't Tweet, so this may be an obvious answer, but why does that information get saved? Do folks go back and reference it at later dates?
"One man's wage rise is another man's price increase." - Harold Wilson
"Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons
"Man who follows car will be exhausted." - Confucius