4 petabytes of useless drivel

    Michael Schubert
    wrote on last edited by
    #1

    See here: http://www.neowin.net/news/storing-tweets-requires-four-petabytes-of-data-a-year
    I bet that a simple compression algorithm would squeeze this down to 0.00001% of the original data considering the frequency of "words" like coz, plz, etc.
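For scale, here is a quick back-of-the-envelope check of what that percentage would mean. It is only a sketch and assumes decimal units (1 PB = 10^15 bytes); the variable names are made up for illustration.

```python
# Back-of-the-envelope: what would 0.00001% of 4 PB be?
# Assumption: decimal units (1 PB = 10**15 bytes).
PETABYTE = 10**15
original_bytes = 4 * PETABYTE

# 0.00001% is a fraction of 1e-7, i.e. one part in ten million.
RATIO_DENOMINATOR = 10**7
compressed_bytes = original_bytes // RATIO_DENOMINATOR

print(f"{compressed_bytes / 10**6:.0f} MB")  # prints "400 MB"
```

In other words, the quoted ratio would shrink a year of tweets to roughly the size of one large download, which is far beyond what general-purpose text compression achieves.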

      NormDroid
      wrote on last edited by
      #2

      Michael Schubert wrote:

      I bet that a simple compression algorithm would squeeze this down to 0.00001% of the original data considering the frequency of "words" like coz, plz, etc

      I've got a better idea: trash the whole lot. PS - dunno who sniped you with a 1, but my guess is it's the group that frequently uses coz, plz.

      Two heads are better than one.

        Electron Shepherd
        wrote on last edited by
        #3

        Michael Schubert wrote:

        I bet that a simple compression algorithm would squeeze this down

        Unlikely. Compression works by looking for repeated patterns, and since each tweet is 140 characters maximum, there's little opportunity for compression within a single tweet. Yes, bolt 1000 tweets together, and they'll compress really well. But then I would imagine it's really difficult to process the compressed data as the site needs to.

        Server and Network Monitoring
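The point about short messages can be tried out with a rough sketch like the one below. It uses Python's standard `zlib` module; the vocabulary and the `make_tweets` helper are invented sample data, not real tweets, so the exact numbers only illustrate the trend.

```python
import random
import zlib

# Hypothetical sample data: short messages built from a small vocabulary.
VOCAB = ["coz", "plz", "lol", "omg", "u", "r", "gr8", "2day", "thx", "the",
         "cat", "is", "on", "my", "way", "to", "work", "again", "so", "bored"]


def make_tweets(count, seed=42):
    """Build `count` pseudo-random messages of at most 140 characters."""
    rng = random.Random(seed)
    tweets = []
    for _ in range(count):
        words = rng.choices(VOCAB, k=rng.randint(8, 25))
        tweets.append(" ".join(words)[:140].encode("utf-8"))
    return tweets


tweets = make_tweets(1000)
raw_size = sum(len(t) for t in tweets)

# Each message compressed on its own: little repetition to exploit,
# plus a fixed zlib header/checksum cost per message.
individual_size = sum(len(zlib.compress(t)) for t in tweets)

# All messages bolted together and compressed as one stream.
batch_size = len(zlib.compress(b"\n".join(tweets)))

print(f"raw: {raw_size}  one-by-one: {individual_size}  batched: {batch_size}")
```

The trade-off mentioned in the post still applies: the batched stream has to be decompressed before any single message in it can be read.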

          Michael Schubert
          wrote on last edited by
          #4

          Norm .net wrote:

          dunno who sniped you with a 1, but my guess is it's the group that frequently uses coz, plz.

          Probably. Or a tweeting twat. Which is not mutually exclusive.

            NormDroid
            wrote on last edited by
            #5

            :)

            Two heads are better than one.

              Michael Schubert
              wrote on last edited by
              #6

              Electron Shepherd wrote:

              Unlikely

              And obviously I was exaggerating.

                OriginalGriff
                wrote on last edited by
                #7

                Norm .net wrote:

                dunno who sniped you with a 1, but my guess is it's the group that frequently uses coz, plz

                They can read??? :omg:

                Real men don't use instructions. They are only the manufacturer's opinion on how to put the thing together.

                "I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
                "Common sense is so rare these days, it should be classified as a super power" - Random T-shirt

                  Gary R Wheeler
                  wrote on last edited by
                  #8

                  Compressing the tweet itself is probably an incidental problem. Twitter's 'value' to advertisers and such lies in the meta-information about the tweets.

                  Software Zen: delete this;
                  Fold With Us!

                    R tsumami
                    wrote on last edited by
                    #9

                    Just trash every "tweet" older than 1 hour; that should solve it. That's about the average lifespan according to this

                    saru mo ki kara ochiru (even monkeys fall from trees). Usually I'm that monkey. If you want an intelligent answer, don't ask me. To understand Recursion, you must first understand Recursion.

                      NormDroid
                      wrote on last edited by
                      #10

                      It's probably the average attention span of the people who use twatter.

                      Two heads are better than one.

                        Pete OHanlon
                        wrote on last edited by
                        #11

                        Just remove Stephen Fry's posts - removing his luvvieness would squeeze it down to 0.00001% easily.

                        I have CDO, it's OCD with the letters in the right order; just as they ruddy well should be

                        Forgive your enemies - it messes with their heads

                        My blog | My articles | MoXAML PowerToys | Onyx

                          NormDroid
                          wrote on last edited by
                          #12

                          Gary R. Wheeler wrote:

                          Twitter's 'value' to advertisers and such lies in the meta-information about the tweets.

                          I'd love to see what's of value :)

                          Two heads are better than one.

                            Richard A Dalton
                            wrote on last edited by
                            #13

                            R-tsumami wrote:

                            Just trash every "tweet" older than 1 hour; that should solve it.

                            That's what jumped out at me when I read this:

                            > All that data is being analyzed by Weil and his team to attempt to
                            > find information that would be useful to help make Twitter a
                            > profitable business. Twitter has been working hard to become
                            > profitable

                            If you're storing all that data, and growing at that rate, and needing to move to a new data center, and you aren't profitable, then maybe you need to look at your business model.

                            I might be proven dramatically wrong, and it wouldn't be the first time, but we seem to be in the midst of a Twitter bubble where the perceived value of tweets has gotten out of all proportion. Perhaps if tweets are really worth so much, Twitter should try auctioning them off and make the storage someone else's problem.

                            Up for auction today, we have the 2009 tweet collection... what am I bid? Seriously? No bids? Years from now when someone figures out how to make a profit from these things you could be sitting on a goldmine? Someone start me off at $1? Anyone? Anyone? Bueller? Anyone? Huh! Whadya know, this $^!* is worthless.

                            -Rd

                              Electron Shepherd
                              wrote on last edited by
                              #14

                              A lot of it will be indirect. For example, BMW launch a new car via television advertising - their marketing department would a) like to know how many people are talking about the new car, and b) *really really like* to get hold of their Twitter handles.

                              Server and Network Monitoring

                                NormDroid
                                wrote on last edited by
                                #15

                                Good argument :), yes coming from a marketing angle this could be a bonanza.

                                Two heads are better than one.

                                  Henry Minute
                                  wrote on last edited by
                                  #16

                                  Surely not!

                                  Henry Minute
                                  Do not read medical books! You could die of a misprint. - Mark Twain
                                  Girl: (staring) "Why do you need an icy cucumber?"
                                  “I want to report a fraud. The government is lying to us all.”

                                    R tsumami
                                    wrote on last edited by
                                    #17

                                    Richard A. Dalton wrote:

                                    Years from now when someone figures out how to make a profit from these things you could be sitting on a goldmine? Someone start me off at $1? Anyone? Anyone? Bueller? Anyone?

                                    Does that include the 4 petabytes of storage, or is that sold separately?

                                    saru mo ki kara ochiru (even monkeys fall from trees). Usually I'm that monkey. If you want an intelligent answer, don't ask me. To understand Recursion, you must first understand Recursion.

                                      peterchen
                                      wrote on last edited by
                                      #18

                                      Solution: Build an initial compression table over the whole data set, then use this table for compressing the individual tweets. Still, you are right in one aspect: the per-message overhead (user id, datetime, IP? What else???) is significant compared to 140 characters of content.

                                      Agh! Reality! My Archnemesis!
                                      | FoldWithUs! | sighist | WhoIncludes - Analyzing C++ include file hierarchy
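One way to approximate the shared-table idea, though not necessarily what the poster had in mind, is zlib's preset dictionary (the `zdict` argument in Python's `zlib` module). In the sketch below, `SHARED_DICT`, the helper names and the sample message are all hypothetical; a real dictionary would be derived from the actual data set.

```python
import zlib

# Hypothetical shared dictionary: byte sequences expected to be common in
# the data. zlib's documentation suggests putting the most common
# sequences toward the end of the dictionary.
SHARED_DICT = (b" the to is my on way work again so bored"
               b" thx 2day gr8 omg lol plz coz ")


def compress_with_dict(message: bytes, zdict: bytes) -> bytes:
    """Compress one message against a preset dictionary."""
    comp = zlib.compressobj(zdict=zdict)
    return comp.compress(message) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress a message; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


tweet = b"omg plz thx coz my way to work is gr8 2day lol"

plain = zlib.compress(tweet)                       # no shared table
with_dict = compress_with_dict(tweet, SHARED_DICT)  # shared table

# Round trip only works when both sides hold the identical dictionary.
assert decompress_with_dict(with_dict, SHARED_DICT) == tweet

print(f"raw: {len(tweet)}  plain zlib: {len(plain)}  with dict: {len(with_dict)}")
```

Each message stays independently readable, at the cost of keeping the dictionary around for as long as any message compressed with it exists. It does nothing for the per-message metadata overhead the post mentions.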

                                        Mark_Wallace
                                        wrote on last edited by
                                        #19

                                        When an alien race finally finds the ruins of human civilisation, they will ultimately reach the conclusion that we died out because we were all complete f***in' idiots.

                                        Beam it all straight to the incinerator, Number 1.

                                          David Crow
                                          wrote on last edited by
                                          #20

                                          I don't Tweet, so this may be an obvious answer, but why does that information get saved? Do folks go back and reference it at later dates?

                                          "One man's wage rise is another man's price increase." - Harold Wilson

                                          "Fireproof doesn't mean the fire will never come. It means when the fire comes that you will be able to withstand it." - Michael Simmons

                                          "Man who follows car will be exhausted." - Confucius
