Slow query when using @variable in Where clause [modified]

Meysam Mahfouzi

I've got a table containing around 1,500,000 rows. This table has a column called "Culture" whose value could be either 0 or 1. Right now, all the values are 1. The following query executes very fast (less than a second):

SELECT TOP 10 * FROM News
WHERE Culture = 1

The following query executes very fast as well:

SELECT TOP 10 * FROM News
WHERE Culture = 0

When I put a variable in where clause, it still executes very fast (less than a sec):

DECLARE @c TINYINT
SET @c = 1

SELECT TOP 10 * FROM News
WHERE Culture = @c

But the following query takes several minutes to complete:

DECLARE @c TINYINT
SET @c = 0

SELECT TOP 10 * FROM News
WHERE Culture = @c

There two things to mention: 1- There is a nonclustered index which contains Culture field. 2- The same thing happens if I execute the four above-mentioned queries wrapped in a stored procedure. I'm really confused :doh: I don't know what the problem is...

Rob Philpott

I had a very similar issue recently myself. Not eaxactly sure why it is but it was suggested that SQL Server can't optimise queries which contain variables - they could change as the query executes. Anyway, the general conclusion and advice I drew was to avoid using variables and join onto other tables where possible. Have you tried inserting @c into a single line temporary table and joining on that? I admit that its rather a grubby solution, but then again it is SQL Server and sometimes you have to hold its hand.

Regards, Rob Philpott.

Ben Fair

Have you compared the execution plans? That should shed some light on how SQL Server is processing them differently and what its doing that is making it take longer.

Hope in one hand and poop in the other; see which fills up first. Hope and change were good slogans, now show us more than words.

Wendelius

The problem is that the data is unevenly distributed. Another part of the problem is that the optimizer doesn't see the values for bind variables as it does when they are literals. Few questions: - Do you have significant differences with row counts having Culture 1 or 0? - Also how many different values for cultures do you have? The first thing you should do is to run UPDATE STATISTICS News. After doing that, could you inform if it helped and answers to those questions. Mika

The need to optimize rises from a bad design.My articles[^]

Meysam Mahfouzi

Hi Mika, As I told before, all the Culture values are 1 at this time but in future, other values like 0, 2 and 3 will be added to possible values. Unfortunately I do not have access to database right now (because I'm at home) but tomorrow morning I'll run UPDATE STATISTICS [TableName] to see if it helps. I forgot to say that I have inserted these 1.5 million rows into local database with a data generator application in 5 minutes and this might be cause of the problem (maybe statistics couldn't be updated correctly). Thanks

Meysam Mahfouzi

Hi Rob, Thanks for suggested solution. I'll test it as soon as I get access to database tomorrow morning.

Meysam Mahfouzi

Hi Ben, I couldn't see the execution plan for the long running query since it didn't finish executing after 25 minutes and I couldn't wait more :)

Wendelius

Ok. Also make sure that you have auto create statistics and auto update statistics on. Even though those database options are on there's still need to update statistics regularly since the algorithm in auto updates isn't very good or precise (thus leading to falsely biased histograms). One thing about your query. You specified TOP 10, but the question is TOP 10 from what order? You didn't have any ORDER BY in the statement. Was this intentional since that would always return first 10 rows from the table matching your criteria and they are most likely always the same rows.

The need to optimize rises from a bad design.My articles[^]

Meysam Mahfouzi

No. it was not intentional. The actual query is using pagination (using ROWNUMBER()) and rows are actually ordered by the Date column. This is just a simple sample query.

Wendelius

That's what I suspected. And now we're actually talking about a whole different story. Now if you modify the query and use a date column, for example something like:

DECLARE @c TINYINT
SET @c = 0
SELECT TOP 10 *
FROM News
WHERE Culture = @c
ORDER BY AdditionDate

you need to add a new index to your table on columns Culture, AdditionDate. If you make that modification to your query, your filter factor will be dramatically different so the index will be efficient. There's a simple rule of thumb (which is correct in most of the cases). If the filter factor regarding an index is less than 5%, using index is efficient. Above that it's not efficient. So in your first example if you had 4 different cultures and let's say that they were evenly distributed, the filter factor was 25%. So in that case using the index (culture) is going to be a very poor solution by the optimizer. But if it does use the index in that scenario (as I suspect it did) you will suffer from poor performance. In such cases actually removing the index makes the performance better (unless the index is used in other scenarios). The reasons why some of your test were performing better was because the data was unevenly distributed. So what I suggest is that you first modify the query to the final version and begin optimization after that (if needed anymore). What I said about UPDATE STATISTICS is still a valid statement.

The need to optimize rises from a bad design.My articles[^]

Ben Fair

It looks like Mika has helped you very well, and hopefully you're on your way to resolving the problem. I should've been more explicit in my message that I was referring to the Estimated Execution Plan rather than the Actual Execution Plan. For the Estimated execution plan, the query is not actually run, it's just parsed and analyzed. It's handy to see how SQL Server plans on executing the query, which may be different than the execution plan that it actually uses. However, I've found it to be a very good resource in situations like this where something is long-running and the actual execution plan is hard to come by.

Hope in one hand and poop in the other; see which fills up first. Hope and change were good slogans, now show us more than words.

Meysam Mahfouzi

Dear Mika, Here is the most similar query to the actual working one in our database:

DECLARE @Start INT, @Count INT

SET @Start = 10
SET @Count = 5

DECLARE @c TINYINT
SET @c = 1;

WITH paging AS (
SELECT id, title, description, ROW_NUMBER() OVER (ORDER BY Date DESC) rownum
FROM News
WHERE Culture = @c
)
SELECT * FROM paging WHERE rownum BETWEEN @Start AND (@Start + @Count - 1)

The above query runs very fast but when I set @c to 0, it becomes a long-running query. From execution plan it's seen that only Date index is used while Culture index is not. I have uploaded an image of the execution plan of the above query here[^] for you to have a look at. I also ran "UPDATE STATISTICS News" which made no difference. Thanks for any help

modified on Wednesday, January 28, 2009 2:49 AM

Meysam Mahfouzi

Rob Philpott wrote:

Have you tried inserting @c into a single line temporary table and joining on that?

Dear Rob, I used the following query but it didn't make any difference:

DECLARE @Start INT, @Count INT

SET @Start = 10
SET @Count = 5

DECLARE @c TINYINT
SET @c = 0;

WITH paging AS (
SELECT id, title, [description], ROW_NUMBER() OVER (ORDER BY Date DESC) rownum
FROM news
JOIN (SELECT @C AS Cult) AS CC ON news.Culture = CC.Cult
--WHERE Culture = @c
)
SELECT * FROM paging WHERE rownum BETWEEN @Start AND (@Start + @Count - 1)

modified on Wednesday, January 28, 2009 3:03 AM

Wendelius · modified on Wednesday, January 28, 2009 2:49 AM

How many rows do you have in the table that have Culture 0 and how many for Culture 1? Also is the estimated execution plan same in both cases?

The need to optimize rises from a bad design.My articles[^]

Meysam Mahfouzi

Right now all culture values in database are 1. And yes, the Estimated Execution Plan is exactly the same for both @c = 1 and @c = 0.

Wendelius

Okay, first you could add an index with two columns: Culture, Date DESC:

CREATE INDEX X_Test ON News (Culture, Date DESC);

After that, see what happens for the execution plan (and the performance). After doing that, you could experiment a change to the query (I suspect that the above modification will not be enough, but if it is, the skip the rest):

SELECT TOP(@Count)
id, title, description
FROM News a
WHERE Culture = @c
AND @Start <= (SELECT COUNT(*)
FROM News b
WHERE b.Date < a.Date
AND b.Culture = @c)
ORDER BY Date DESC

Again see execution plan and performance. Also is the ID column defined as IDENTITY and are the records added to the table in date order. In other words what I'm after, if the identity is greater on one row is it always true that its a newer post based on date column? If you could post the results for those two operations.

The need to optimize rises from a bad design.My articles[^]

Meysam Mahfouzi

Dear Mika, Adding the following index:

CREATE INDEX X_Test ON News (Culture, Date DESC);

didn't solve the problem. But after I disabled the index which only was on Date column, both queries ran very fast. (that is, disabling an index apparently solved the problem). The index containing the two columns (Culture, Date) was not being used as long as there was an index on Date column. So now, this query runs very fast for both @c = 0 and @c = 1

DECLARE @Start INT, @Count INT

SET @Start = 1
SET @Count = 5

DECLARE @c TINYINT
SET @c = 1;

WITH paging AS (
SELECT id, title, [description], ROW_NUMBER() OVER (ORDER BY Date DESC) rownum
FROM News
WHERE Culture = @c
)
SELECT * FROM paging WHERE rownum BETWEEN @Start AND (@Start + @Count - 1)

But it turned out to be a temporary cause of happiness! If I start changing the value of @start parameter, things will start to change. When @c is set to 1, the query runs very fast for the following @start values: 1, 10, 100, 1000, 10,000 But when I set @start to 100,000, the query become a long-running one (It took 2:27 mins to execute). After executing the query again (with @start set to 100,000) it executed very fast (probably due to previous execution). Now when I set @start to 200,000 again it goes to sleep... When @c is set to 0, it will execute very fast for any value of @start variable. Regarding your question about Date column: As I told you, I've inserted 1,500,000 rows randomly into News table. Therefor, even though the ID column is an incrementing Identity column, the Date is not necessary greater for greater ID columns. In the real-world running database though, the Date value is greater for greater ID columns for sure. Here is the execution plan of my query with the new two-column index: look[^] and here is the execution plan for your query: look[^] ** p.s. I also noted that the order of columns in Index is important in this case. If I move the Culture to become the second column in index, the index becomes of no use.

Wendelius

Maysam Mahfouzi wrote:

after I disabled the index which only was on Date column, both queries ran very fast

Yeah, that's on of those annoying features for SQL Server optimizer. It isn't very good in index selection.

Maysam Mahfouzi wrote:

But when I set @start to 100,000, the query become a long-running one

Yes, that was predictable since the further you go, the more rows have to be read before you're "in the right position". And I suspect that this was the way both query versions behaved?

Maysam Mahfouzi wrote:

In the real-world running database though, the Date value is greater for greater ID columns for sure.

Is that something we can rely on? If it is, it would be possible to change the query to be based on id (however in that case you should regenerate the test environment data to correspond actual data). Another option is that can you change the parameters? Instead of defining start point as number of previous rows is ot ok to use some other mechanism (...thinking about the solution...)

The need to optimize rises from a bad design.My articles[^]

Meysam Mahfouzi

Mika Wendelius wrote:

And I suspect that this was the way both query versions behaved?

It only happened for @c = 1

Mika Wendelius wrote:

Is that something we can rely on?

Probably, but I think it would be better to write a query independent of Date values. Frankly, I think my query is not much complicated and everybody now and then my find himself dealing with such queries. It's really sad that SQL Server is not able to handle it. Thanks for trying to help me Mika, please let me know if you come up with any solution.

Wendelius

What about the parameter question I asked? I was thinking about a solution where you would define a starting point as float. Initial starting point is 0 and you pass how many rows you want per page (for example 10). Next time you call this query, you would specify the starting point as the float found on the last row of the page you got and then again how many rows you want. So what this means in action is that first you create a new column:

ALTER TABLE News
ADD StartPoint AS ((Culture * 100000000) + CONVERT(float, Date))
PERSISTED

Then you index it:

CREATE NONCLUSTERED INDEX X_StartPoint ON News (StartPoint)

Now the query would be like:

DECLARE @Start float, @Count INT
SET @Start = 0
SET @Count = 5
DECLARE @c TINYINT
SET @c = 1;
SELECT TOP(@Count)
id, title, description, StartPoint
FROM News a
WHERE Culture = @c
AND StartPoint > @Start
AND StartPoint < ((@c + 1) * 100000000)
ORDER BY Start

The need to optimize rises from a bad design.My articles[^]