Billion rows

Tags: database, question, mysql, design, cloud

jschell (#1)

Just a random question, since I already have a different solution, but I had the following situation. I have a table with a billion rows (actually probably about 1.1 billion). I was trying to span the table, read every row and do an analysis. I certainly couldn't load the entire table, so I was using a paged query (LIMIT with an offset and count). Each page took about 90 minutes for the query itself, so not really something that was going to allow me to do much analysis. Any other ideas on spanning it or speeding it up? At one point I was even considering just dumping the table and writing an app to do the analysis outside of the database; that was about the only other solution I had. The database is MySQL (AWS Aurora, actually). The relevant parts of the table are as follows, and id is the primary key. (I didn't design the table.)

    Table T
    id char(22)
    RefId1 char(22)
    RefId2 char(22)
    ...
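
Roughly, each page was fetched with an offset query along these lines (a sketch only; the real column list and page size differ):

    -- Paged read where the offset grows by the page size on every pass (sketch)
    SELECT id, RefId1, RefId2
    FROM T
    LIMIT 990000, 10000;   -- skip 990,000 rows, read the next 10,000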


Jorgen Andersson (#2)

The obvious question is: What kind of analysis?

Wrong is evil and must be defeated. - Jeff Ello


Jorgen Andersson (#3)

Another question: do you need to keep track of which page you are analysing, or is sequential paging good enough?

Wrong is evil and must be defeated. - Jeff Ello


jschell (#4)

Jörgen Andersson wrote:

The obvious question is: What kind of analysis?

Since I needed to span the database, I'm not sure how that matters, unless you would perhaps suggest random sampling. But in this case I needed to validate that the ids in each row did in fact match my expectations of what they were pointing to.


jschell (#5)

Paging was not part of the solution; it was necessary only because I could not read the entire collection into memory. However, I did need to ensure that all of the data was read and that the same row was not read twice.


Mycroft Holmes (#6)

jschell wrote:

What kind of analysis?

I think that is critical: the design of your analysis is the driving factor, and it is the one area where you can gain the most benefit from optimization.

Never underestimate the power of human stupidity RAH


Jorgen Andersson (#7)

Then I believe I have enough info to recommend that you not use pagination the way you do.

    SELECT * FROM tbl LIMIT 2000,10;

is an extraordinarily inefficient query. Internally it functions like this pseudocode:

    SELECT BOTTOM 10 *
    FROM (
        SELECT TOP (2000+10) *
        FROM T
    )

In short, it selects the first 2010 rows just to throw away the first 2000. If you can store the max(id) from the previous page, I'd recommend trying this instead:

    SELECT id
          ,RefId1
          ,RefId2
          ,...
    FROM T
    WHERE id > @PreviousMaxID
    ORDER BY id
    LIMIT @PageSize

You obviously need to have an index on the id column for this to be fast.
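
For example, the loop might look like this in plain MySQL (a sketch with a made-up page size and placeholder id; as far as I know, LIMIT only accepts a variable inside a prepared statement or stored program, so the client has to substitute the literal values itself):

    -- First page: no lower bound yet
    SELECT id, RefId1, RefId2
    FROM T
    ORDER BY id
    LIMIT 10000;

    -- Every later page: start strictly after the last id seen on the previous page
    SELECT id, RefId1, RefId2
    FROM T
    WHERE id > 'AbCdEfGh012345678901aa'   -- placeholder: max(id) from the previous page
    ORDER BY id
    LIMIT 10000;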

Wrong is evil and must be defeated. - Jeff Ello


jschell (#8)

Jörgen Andersson wrote:

If you can store the max(ID) from the previous page I'd recommend trying this instead

I like the idea. Unfortunately the id is not sequential (again, not my idea). So doing the paged query based on that would require sorting on the primary id and then paging on that, which might be exactly the same or even less efficient.


jschell (#9)

Ok. Per the OP, the table looks like the following:

    Table T
    id char(22)
    RefId1 char(22)
    RefId2 char(22)

I need to do the following:

1. Verify that RefId1 and RefId2 are in a different table, either in that order (1,2) or reversed (2,1).
2. Report if neither or only one id is found.
3. Report if more than one match is found.
4. If found, report whether a different column (not documented above) is the same as in the second table.

The second table also has a billion rows. Both tables have indexes on the id and on the two other columns documented above.
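
To make the (1,2)/(2,1) requirement concrete, the per-row check would be something along these lines (a sketch only; T2, A and B are stand-ins for the real second table and its two reference columns):

    -- For each row of T, count how many rows of the second table match in either order
    SELECT t.id,
           COUNT(m.A) AS matches
    FROM T t
    LEFT JOIN T2 m
        ON (m.A = t.RefId1 AND m.B = t.RefId2)
        OR (m.A = t.RefId2 AND m.B = t.RefId1)
    GROUP BY t.id
    HAVING COUNT(m.A) <> 1;   -- rows to report: no match at all, or more than one match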


Jorgen Andersson (#10)

That's why you need to have an index on the id (or on any unique column or combination of columns). The index is already sorted by definition, so the WHERE id > @PreviousMaxID just walks down the b-tree to the right id and starts counting leaf nodes until it reaches @PageSize. I don't know how AWS Aurora is organized, but if your table is clustered you don't even need to look up the data pages, and Bob's your uncle.

Wrong is evil and must be defeated. - Jeff Ello


Mycroft Holmes (#11)

Oh, that is ugly - the (1,2) / (2,1) matching is going to cost you. Can you do it in a couple of steps: inner join on the two layouts and eliminate them from the process, and possibly even move the non-matching records into another table for your reporting analysis (one assumes the majority have valid matching records)?
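
Roughly what I mean, as a sketch (T2, A and B again stand in for the real second table and its columns, and T_orphans is a hypothetical staging table):

    -- Keep only the rows that match the second table in neither order
    INSERT INTO T_orphans (id, RefId1, RefId2)
    SELECT t.id, t.RefId1, t.RefId2
    FROM T t
    WHERE NOT EXISTS (SELECT 1 FROM T2 m
                      WHERE m.A = t.RefId1 AND m.B = t.RefId2)
      AND NOT EXISTS (SELECT 1 FROM T2 m
                      WHERE m.A = t.RefId2 AND m.B = t.RefId1);

    -- The detailed reporting then only has to run against the (hopefully small) T_orphans table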

Never underestimate the power of human stupidity RAH


jschell (#12)

Mycroft Holmes wrote:

oh that is ugly - the 1,2 - 2,1 is going to cost you.

Yep - not my design.

Mycroft Holmes wrote:

Can you do it in a couple of steps, inner join on the 2 layouts and eliminate them from the process,

I suspect that a join between the two tables will return on the order of a billion rows, either because there are very few rows that are not matched or because there are zero. I am only expecting the former, because with legacy data orphans might occur.

Mycroft Holmes wrote:

possibly even move the non matching records

I have to find them first. Once I find them, I don't consider the analysis to be a problem. Actually, for those that do not match, I consider it likely they are orphans.


jschell (#13)

I do not believe that is how that works. The primary key is indexed, but the index is a b-tree based on a hash. I presume the natural ordering of the index would be to walk the b-tree, but, again, that would be based on the hash. In contrast, "id > previous" is based on the value. I do not have access to the hashing algorithm, and even if I used "hash(id) > hash(previous)", it would end up doing a table scan (or at least an index scan), which would not help at all.


Jorgen Andersson (#14)

A b-tree is not hash based. MySQL DOES support hash-based indexing, but it is not supported on either MyISAM or InnoDB, only for the MEMORY storage engine and NDB clusters. And then you still need to choose between a b-tree OR a hash-based index; hash-based indexes are obviously NOT ordered. I don't know anything about Amazon's databases, so whether your index is hash-based or not is something you need to check. In my very personal opinion, the only reason to use a hash-based index instead of a b-tree is if your data doesn't have any kind of ordering to it, like GUIDs for example. <edit>Curiosity took over; I wondered why your RefId was char(22), and it seems like a base64 encoding of a GUID uses exactly 22 characters.</edit>
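
One quick way to check in standard MySQL (and presumably on Aurora as well, though I can't vouch for what they expose) is the Index_type column of SHOW INDEX, which reports BTREE or HASH for each index:

    -- Look at the Index_type column for the PRIMARY entry
    SHOW INDEX FROM T;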

Wrong is evil and must be defeated. - Jeff Ello


Mycroft Holmes (#15)

Split it into groups by taking the first 1 or 2 characters, and process each set independently. Getting desperate here as I'm running out of ideas.
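
Something like this, as a sketch (the prefix value and column list are just placeholders; a LIKE with a constant prefix can generally use the index on id, whereas wrapping the column in LEFT() cannot):

    -- Process one two-character prefix bucket at a time, e.g. 'AB'
    SELECT id, RefId1, RefId2
    FROM T
    WHERE id LIKE 'AB%'
    ORDER BY id;

    -- Repeat for every possible two-character prefix of the encoding alphabet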

Never underestimate the power of human stupidity RAH


jschell (#16)

Jörgen Andersson wrote:

B-tree is not hash based.

Ok. I stand corrected. So that should work.

Jörgen Andersson wrote:

I don't know anything about Amazons databases

They do not expose the underlying implementation, but it should be a binary equivalent, so your point should hold.

Jörgen Andersson wrote:

and it seems like a base64 encoding of GUIDs use exactly 22 characters

Yes, I believe that is correct. Not my design, but something I have looked at in just the last day would support that.


jschell (#17)

Also an interesting idea. For base64 encoding, two characters give 64 x 64 = 4096 prefixes, so roughly 1,000,000,000 / 4096, or about 244,000 rows per query, and, per the other comment, it should be fast enough to use that as a page. The only limitation there would be if the values were not uniformly distributed.


Mycroft Holmes (#18)

jschell wrote:

if the values were not uniformly distributed

Uhm, I would presume the fields are indexed, and physical location would be irrelevant.

Never underestimate the power of human stupidity RAH


Jorgen Andersson (#19)

It's not uniformly distributed by definition, but the first half of a GUID is the time part, and it's stored in reverse order of significance, so for this purpose it could probably be considered uniformly distributed.

Wrong is evil and must be defeated. - Jeff Ello


Jorgen Andersson (#20)

If it's a b-tree index, it should work. But if it is a hash index, I really don't have a solution, except adding a b-tree index on the table. Hash indexes are faster for lookups but totally useless for ranges.
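
Adding one would be the plain secondary-index route (a sketch; the index name is made up, and building it on a billion rows will take considerable time and space):

    -- Ordinary b-tree index on id, usable for range scans such as id > 'xyz'
    ALTER TABLE T ADD INDEX ix_t_id_btree (id) USING BTREE;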

Wrong is evil and must be defeated. - Jeff Ello
