Performance question

Kosta Cherry

Here is interesting question: I have ~1000 objects each of which is described by ~60mln records. I need the best read performance in the following condition: - select ~5000 records (by the key) for each object. Which will perform faster: if I have all of these objects inside one table (~60bln records) and run one SELECT from that table, or if I have each object inside it's own table and run ~1000 selects (for each table). Database is either MS SQL, PostgreSQL or MySQL. I don't care about write performance. Database choice as well as layout will depend on the answer to this question. It'll take quite some time to model the situation, so I wanted to see if anyone has reasonable input. Thank you!

Colin Angus Mackay

There is insufficient information to give any reasonable answer. Anything would be pure guess work. You are best prototyping your situation and seeing which comes out top.

Kosta Cherry wrote:

select ~5000 records (by the key) for each object.

What do you mean by that? Do you mean that you want to pass the individual key for 5000 rows to the database, or you have a range (min/max) of key values that results in 5000 rows being returned?

Kosta Cherry wrote:

I have each object inside it's own table

What do you mean by "object"?

Kosta Cherry wrote:

I don't care about write performance.

Have you considered using indexes rather than splitting up tables? (indexes slow down writes but can improve read performance if used correctly)

Kosta Cherry wrote:

It'll take quite some time to model the situation

From what you've indicated so far there is only really one logical table, but you perhaps want to split it up into 1000 physical tables. If the situation can be modeled by only one logical table, what takes up the time in modelling?

Upcoming FREE developer events: * Developer Day Scotland Recent blog posts: * Follow up on hiring a software developer * The Value of Smaller Methods My website | blog

Kosta Cherry

Colin Angus Mackay wrote:

here is insufficient information to give any reasonable answer. Anything would be pure guess work. You are best prototyping your situation and seeing which comes out top. Kosta Cherry wrote: select ~5000 records (by the key) for each object.

I mean SQL like this: select * from mytable where ((timekey > X and timekey < Y) AND/OR some other conditions, but they all involve indexed columns only) The result brings up ~5000 (or so) rows. If all records are within several tables, I will run that query ~1000 times. If all records are within one table, I will add into WHERE additional condition like "objectID in (select objectID from other table where ....)", or may be just a join, or whatever other condition. The problem with modeling will be that selection from many tables will be done from multithreaded application, and I'm not sure how servers will react compared with single-threaded approach when all sits in one table and retrieved with one query; plus, those DB servers should be properly tuned for one approach or another. All of this involves a lot of modeling. I'm kind of leaning towards the "many tables" approach from the point of maintainability (where I can have each table on different table space or even different server).

Colin Angus Mackay wrote:

Kosta Cherry wrote: I have each object inside it's own table What do you mean by "object"?

"Object" here is just named collection of millions of similar records - like, for example, file is collection of bytes, or picture is collection of pixels.

Colin Angus Mackay wrote:

Kosta Cherry wrote: I don't care about write performance. Have you considered using indexes rather than splitting up tables? (indexes slow down writes but can improve read performance if used correctly)

Of course I did. I just don't know which way it'll work faster. On one hand, having all records in one table increases Btree+ index depth by 1-2 levels (depending on number of keys in the leaf), which increases seek time; on another hand, having records in different tables requires additional time for parsing (1000 queries instead of 1); so it's a hard to say what will be faster in the end.

Colin Angus Mackay

Kosta Cherry wrote:

The problem with modeling will be that selection from many tables will be done from multithreaded application, and I'm not sure how servers will react compared with single-threaded approach when all sits in one table and retrieved with one query

A good database server will be expecting to get many queries at the same time. The only restriction is, as I recall, that you can't put down more than one query simultaneously on a single connection. So you have to open up a new connection on each thread. When you complete a query you will obviously close the connection which returns it to the pool and another thread can potentially use it.

Kosta Cherry wrote:

those DB servers should be properly tuned for one approach or another.

Well, to an extent. I would guess that database servers are already tuned to having multiple queries run against it. It is the tuning of the queries to avoid things like deadlocks that you need to be concerned about. If all the queries are effectively read only then I don't see any problems in that area. The problems come when you have queries writing to the database. Since you already said write performance isn't an issue, I'm guessing that insertions and updates don't happen frequently, or happen in batches outside of normal hours.

Kosta Cherry wrote:

I'm kind of leaning towards the "many tables" approach from the point of maintainability (where I can have each table on different table space or even different server).

Personally, I would have thought the many-tables (especially if they are all the same structure) approach would be a maintenance nightmare. Also, I think you are possibly over-optimising at this stage.

Kosta Cherry wrote:

"Object" here is just named collection of millions of similar records - like, for example, file is collection of bytes, or picture is collection of pixels.

Well, if all the rows are using the same structure, are they not ALL similar? Differences by a column value (or small number of column values) isn't introducing dissimilartity as far as I can see.

Kosta Cherry wrote:

Of course I did. I just don't know which way it'll work faster.

In my experience you never fully know until it is running. I've implemented a system on

Paul Conrad

What is the OP even asking? I got lost part way through it :laugh:

"I guess it's what separates the professionals from the drag and drop, girly wirly, namby pamby, wishy washy, can't code for crap types." - Pete O'Hanlon