Billion rows

Mycroft Holmes

oh that is ugly - the 1,2 - 2,1 is going to cost you. Can you do it in a couple of steps, inner join on the 2 layouts and eliminate them from the process, possibly even move the non matching records into another table for your reporting analysis (one assumes the majority have valid matching records).

Never underestimate the power of human stupidity RAH

jschell

Mycroft Holmes wrote:

oh that is ugly - the 1,2 - 2,1 is going to cost you.

Yep - not my design.

Mycroft Holmes wrote:

Can you do it in a couple of steps, inner join on the 2 layouts and eliminate them from the process,

I suspect that a join between the two tables will return on the order of a billion rows, because either very few rows are unmatched or none are. I am expecting the former only because orphans can occur with legacy data.

Mycroft Holmes wrote:

possibly even move the non matching records

I have to find them first. Once I find them I don't consider the analysis to be a problem. Actually for those that do not match I consider it likely they are orphans.
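A minimal sketch of the "find them first" step, assuming hypothetical tables `layout_a(k1, k2)` and `layout_b(k1, k2)` in which a row counts as matched if the same key pair appears in either order (one reading of the "1,2 - 2,1" remark). It uses Python's built-in `sqlite3` only so that it runs standalone; the real schema, the MySQL query plan and the cost at a billion rows will all differ.

```python
import sqlite3

def find_orphans(conn):
    """Return layout_a rows with no partner in layout_b in either key order."""
    sql = """
        SELECT a.k1, a.k2
        FROM layout_a AS a
        WHERE NOT EXISTS (SELECT 1 FROM layout_b AS b
                          WHERE b.k1 = a.k1 AND b.k2 = a.k2)   -- same order
          AND NOT EXISTS (SELECT 1 FROM layout_b AS b
                          WHERE b.k1 = a.k2 AND b.k2 = a.k1)   -- reversed order
        ORDER BY a.k1, a.k2
    """
    return conn.execute(sql).fetchall()

def demo():
    # Hypothetical schema and data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE layout_a (k1 TEXT, k2 TEXT, PRIMARY KEY (k1, k2));
        CREATE TABLE layout_b (k1 TEXT, k2 TEXT, PRIMARY KEY (k1, k2));
        INSERT INTO layout_a VALUES ('1','2'), ('3','4'), ('5','6');
        INSERT INTO layout_b VALUES ('1','2'), ('4','3');
    """)
    return find_orphans(conn)

orphans = demo()
```

Writing the two orientations as separate `NOT EXISTS` probes, rather than one join with an `OR`, keeps each probe a plain equality lookup on the key pair, which is usually friendlier to an index.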

jschell

I do not believe that is how that works. The primary key is indexed. However indexing is a btree and based on a hash. I presume natural ordering on the index would be to walk the btree. But, again, that would be based on the hash. In contrast "id > previous" is based on the value. I do not have access to the hashing algorithm and even if I used "hash(id) > hash(previous)" then it would end up doing a table scan (or at least an index scan) which would not help at all.

Jorgen Andersson

B-tree is not hash based. MySQL DOES support hash-based indexing, but not on MyISAM or InnoDB, only on the Memory storage engine and NDB clusters. Even there you still have to choose between a B-tree OR a hash-based index. Hash-based indexes are obviously NOT ordered. I don't know anything about Amazon's databases, so whether your index is hash-based or not is something you need to check. In my very personal opinion, the only reason to use a hash-based index instead of a B-tree is if your data doesn't have any kind of ordering to it, like GUIDs for example. <edit>Curiosity took over: I wondered why your RefID was char(22), and it seems a base64 encoding of a GUID uses exactly 22 characters.</edit>
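The 22-character figure can be checked with a short sketch. A GUID is 16 bytes, base64 turns that into 24 characters of which the last two are `=` padding, and stripping the padding leaves 22. The helper names and the URL-safe alphabet here are assumptions; the thread does not say which base64 variant the RefID column uses.

```python
import base64
import uuid

def encode_guid(u: uuid.UUID) -> str:
    """Base64-encode the 16 raw bytes of a GUID and drop the '==' padding."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")

def decode_guid(s: str) -> uuid.UUID:
    """Restore the padding and decode back to a GUID."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))

# Arbitrary sample value, for illustration only.
sample = uuid.UUID("12345678-1234-5678-1234-567812345678")
encoded = encode_guid(sample)
```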

Wrong is evil and must be defeated. - Jeff Ello

Mycroft Holmes

Split it into groups by taking the first 1 or 2 characters, process each set independently. Getting desperate here as I'm running out of ideas.

jschell

Jörgen Andersson wrote:

B-tree is not hash based.

Ok. I stand corrected. So that should work.

Jörgen Andersson wrote:

I don't know anything about Amazon's databases

They do not expose the underlying implementation. But it should be a binary equivalent so your point should hold.

Jörgen Andersson wrote:

and it seems like a base64 encoding of GUIDs uses exactly 22 characters

Yes I believe that is correct. Not my design but I believe something I have looked at in just the last day would support that.

jschell

Also an interesting idea. For base64 encoding, a 2-character prefix would give approximately 244,000 rows per query (a billion rows divided by 64 x 64 prefixes) and, per the other comment, that should be fast enough to use as a page. The only limitation would be if the values were not uniformly distributed.
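The arithmetic behind that estimate, as a small sketch. The row count is the thread's round "billion", the standard base64 alphabet is an assumption (the column may use a different variant), and the result only holds if the leading characters are spread evenly.

```python
import itertools
import string

# Standard base64 alphabet; an assumption about how RefID is encoded.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
TOTAL_ROWS = 1_000_000_000  # round figure from the thread, not a measured count

def rows_per_prefix(total_rows: int, prefix_len: int) -> float:
    """Expected rows per prefix if ids are uniform over the alphabet."""
    return total_rows / (len(ALPHABET) ** prefix_len)

one_char = rows_per_prefix(TOTAL_ROWS, 1)
two_char = rows_per_prefix(TOTAL_ROWS, 2)

# Every 2-character prefix, i.e. one independent batch per entry.
prefixes = ["".join(p) for p in itertools.product(ALPHABET, repeat=2)]
```

With these numbers a 1-character split gives 15,625,000 rows per batch and a 2-character split about 244,141 rows across 4,096 batches.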

Mycroft Holmes

jschell wrote:

if the values were not uniformly distributed

Uhm I would presume the fields are indexed and physical location would be irrelevant.

Jorgen Andersson

It's not uniformly distributed by definition, but the first half of a GUID is the time part, and it's stored in reverse order of significance, so for this purpose it could probably be considered uniformly distributed.

Jorgen Andersson

If it's a btree index it should work. But if it is a hash index, I really don't have a solution. Except adding a btree index on the table. Hash indexes are faster for lookups, but totally useless for ranges.
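A sketch of the "id > previous" paging that an ordered index makes possible. The table `refs` and column `ref_id` are hypothetical names, and `sqlite3` stands in for the real database only so that the example runs on its own; the point is that each page is a range seek past the last value seen, which a hash index cannot serve.

```python
import sqlite3

def iter_pages(conn, page_size):
    """Yield pages of ids in index order, seeking past the last id seen."""
    previous = ""  # sorts before every non-empty id
    while True:
        rows = conn.execute(
            "SELECT ref_id FROM refs WHERE ref_id > ? ORDER BY ref_id LIMIT ?",
            (previous, page_size),
        ).fetchall()
        if not rows:
            return
        yield [r[0] for r in rows]
        previous = rows[-1][0]  # resume after the last id of this page

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refs (ref_id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO refs VALUES (?)",
                 [(f"id{n:03d}",) for n in range(10)])
pages = list(iter_pages(conn, 4))
```

Unlike `LIMIT ... OFFSET`, this never re-reads the rows it has already passed, so later pages should cost about the same as the first one.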

jschell

For clarification on that: the value is actually based on the UUID from Java. That UUID would seem to have a uniform distribution because the layout is not strictly ordered by time; the UUID uses the minor time (probably seconds/millis) as the most significant part.
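A small sketch of that layout, using Python's `uuid` module as a stand-in for Java. In a time-based (version 1) UUID the leading 32 bits are the `time_low` field, the fastest-changing part of the timestamp, which is why the leading characters cycle quickly instead of growing slowly. Which UUID version the data actually uses is an assumption here: Java's `UUID.randomUUID()` produces version 4 values, which are random rather than time-based and would be evenly spread for a different reason.

```python
import uuid

def leading_32_bits(u: uuid.UUID) -> int:
    """Return the most significant 32 bits of the 128-bit UUID value."""
    return u.int >> 96

u1 = uuid.uuid1()  # time-based UUID
u4 = uuid.uuid4()  # random UUID

# For version 1, the leading bits are the low (fast-moving) part of the clock.
leads_with_time_low = leading_32_bits(u1) == u1.time_low
```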