Faster way of filling a Dataset
-
Hi guys. I suppose this question has been asked numerous times. I have a PostgreSQL database with about 20 columns and about 1.4 million records. I select data from the database between two given dates, and it takes about 4 minutes to fill the dataset. I would like to speed this up. I am doing no updates, just using the data to compile a report. Is there a faster way to do this? The code below is what I have. The results are OK, but filling the dataset is a bit of a problem in that it takes +- 4 to 5 minutes.
string sql = @"SELECT * FROM datbase WHERE bus_received_time_stamp BETWEEN @startDate AND @endDate";

// Making use of a PostgreSQL connection helper
NpgsqlConnection conn = new NpgsqlConnection(conns);

// Format the dateTimePicker dates for use with the database
string fromDate = dtStartDate.Text;
string toDate = dtEndDate.Text;

// Instantiate an NpgsqlCommand, connect to the DB and fill a DataSet
NpgsqlCommand cmd = new NpgsqlCommand();
cmd.Connection = conn;
cmd.CommandText = sql;
cmd.Parameters.AddWithValue("@startDate", fromDate + " 00:00:01");
cmd.Parameters.AddWithValue("@endDate", toDate + " 23:59:59");

conversionRate = txtConversionRate.Text;
double conRate = Convert.ToDouble(conversionRate);

try
{
    // Open the DB connection
    conn.Open();
    setText(this, "Connection Established");
    NpgsqlDataAdapter da = new NpgsqlDataAdapter(cmd);
    setText(this, "Collecting Data for Processing!");
    da.Fill(dts);
    // The time between the two setText calls is a good 4 to 5 minutes
    setText(this, "Define Data To be Used");

From here the processing takes a few minutes, as there's quite a bit of matching to do, etc. However, the main thing is that filling the DataSet takes too much time. I would like to cut this down to a few seconds, maybe a minute. I have added an index on the date column on the DB to try and speed things up, but I'm not sure if this is correct. Any ideas??
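One incidental detail worth noting in the query above: building the range by appending " 00:00:01" and " 23:59:59" as strings skips the first second of each day and misses any timestamps with fractional seconds at the end of the last day. A half-open range avoids both edge cases; a sketch of the adjusted query, using the same table and column names as the snippet:

```sql
-- Half-open range: everything from the start of the first day
-- up to (but not including) the start of the day after the last day.
SELECT *
FROM datbase
WHERE bus_received_time_stamp >= @startDate
  AND bus_received_time_stamp <  @endDate;
```

On the C# side that would mean passing typed DateTime values (e.g. the start date at midnight, and the day after the end date at midnight) instead of concatenated strings, so the timestamp index can be used without string-to-timestamp conversion.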
Excellence is doing ordinary things extraordinarily well.
-
Sounds like you are retrieving a huge dataset and doing calculations in your client code? If those statistics or calculations can be done on the SQL server, it will likely increase performance (because SQL servers are really good at aggregating stats) and it will decrease network load, which I suspect is your problem in the first place.

Getting 1.4 million rows from the database will always take a long time, because all that data has to be transferred over the network. If you only had one boolean column in your table, 1.4 million rows would be about 1.3MB of data; with your 20 columns, "filling the dataset" could easily be a 20MB+ download, and that's never gonna be real fast.

To recap: the solutions involve reducing the amount of data you need to transfer. Ideally, only transfer exactly what needs to be seen by the user and nothing more. Even if your query is super fast, transferring the data over the network isn't. Your index on the date column is about all you can do to speed up the query itself, since it's your only selection criterion, but again, finding the data probably isn't the issue.
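As a sketch of what moving the work to the server could look like (the grouping and the `billed_units` column here are hypothetical, since the post doesn't show what the report actually contains):

```sql
-- Return one pre-aggregated row per day instead of 1.4M raw rows.
SELECT date_trunc('day', bus_received_time_stamp) AS day,
       count(*)                                   AS record_count,
       sum(billed_units)                          AS total_units  -- hypothetical column
FROM   datbase
WHERE  bus_received_time_stamp >= @startDate
  AND  bus_received_time_stamp <  @endDate
GROUP  BY 1
ORDER  BY 1;
</x>
```

A month of data then comes back as ~30 rows instead of millions, and the network transfer stops being the bottleneck.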
-
Filling the dataset is always going to be the bottleneck in your code. That being said... a few "band-aid" fixes:

1) Your query is pretty basic, but you *might* shave off a bit of time by using a stored procedure instead.
2) You *might* shave off a bit of time by indexing the bus_received_time_stamp column.
3) Do you have an enterprise-grade SQL server? Or are you running it on some re-purposed podunk PC?
4) Are you connecting to the server via gigabit Ethernet?

With those band-aid fixes out of the way:

1) The real cause of your problem is that you are asking for a LOT of data... 1.4M rows x 20 columns = 28M "cells". If each cell is only 1 byte, that's 28M bytes = roughly 26MB of data you are asking for. More likely your average cell is 10 bytes or even 100 bytes; now you are asking for 260MB to 2.6GB of data.
2) It's ***highly*** doubtful you need all that data. What are you doing with it? Displaying it to a user? That's great... but what's a person going to do with 1.4M rows of data? Nothing; a human can't process that amount of data.
3) Are you doing some kind of calculation on the data? If so, you can likely do that on the server and then retrieve only the result... or you can set it to run as a job at midnight or something and have it cached.
4) You could also page the data if the user really needs all of it, which again I find **highly** doubtful.

You really need to define your problem better.
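For reference, the first two band-aid fixes might look like this in PostgreSQL (the function name is made up; PostgreSQL expresses "stored procedures" as functions):

```sql
-- B-tree index on the filter column, so the date-range query can do a
-- range scan instead of reading the whole table.
CREATE INDEX idx_datbase_received
    ON datbase (bus_received_time_stamp);

-- Set-returning function wrapping the query; callers get a planned,
-- parameterised entry point instead of ad-hoc SQL.
CREATE FUNCTION get_records(p_start timestamp, p_end timestamp)
RETURNS SETOF datbase
LANGUAGE sql STABLE AS $$
    SELECT *
    FROM datbase
    WHERE bus_received_time_stamp >= p_start
      AND bus_received_time_stamp <  p_end;
$$;
```

Note that neither change reduces the amount of data sent over the wire, which is the real cost identified above.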
-
SledgeHammer01 wrote:
- Its ***highly*** doubtful you need all that data. What are you doing with it? Displaying it to a user? That's great... but whats a person going to do with 1.4M rows of data? Nothing... a human can't process that amount of data.
Post said it's for a report; but good point :)
Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^]
-
As has been said, that's a lot of data. Do you really need all of it, though? If this is for a report, do you really need all 20 columns? If you reduce what you're retrieving, it should help speed things up.
I was brought up to respect my elders. I don't respect many people nowadays.
CodeStash - Online Snippet Management | My blog | MoXAML PowerToys | Mole 2010 - debugging made easier
-
If someone plopped a report with 1.4M rows on it on your desk, what are you gonna do with it? I'd toss it in the recycle bin without even looking at it. If he needs all that data to generate a usable report, that's another story. In that case, he should probably do what another poster suggested and run the stats on the server instead of downloading all the data and crunching it on the client.
-
SledgeHammer01 wrote:
If someone plopped a report with 1.4M rows on it on your desk, what are you gonna do with it? I'd toss it in the recycle bin without even looking at it.
I'd do that with any report; then again, sometimes some bureaucrat needs a paper trail from here to the moon, just to be able to say to his boss that the evidence has been archived. Yes, I'll come up with the "filter, don't load all" text, but that's more appropriate for grids/displaying than for reporting. For reporting, I'd state that time matters little.
SledgeHammer01 wrote:
If he needs all that data to generate a usable report, thats another story.
He'll be the one to answer that question, and eventually, you'd still end up writing that report. Keyword here is "needs".
SledgeHammer01 wrote:
In that case, he should probably do what another poster suggested and run the stats on the server instead of downloading all the data and doing it on the client.
It would still need databinding, whether you do it on the client or on the server. I'd suggest doing so on the client; I'd hate to see fifty people abusing the server while they have an idle desktop.
Bastard Programmer from Hell :suss: If you can't read my code, try converting it here[^]
-
It all depends on his requirements. I'm kind of skeptical that he needs to generate *real-time* reports over such massive amounts of data. In that case, I'd have a scheduled job that runs on the server at 12:01am, generates the report for the previous day and caches it somewhere.

If he needs to generate real-time reports for an arbitrary date range, yeah... there isn't really a way around that; it's going to take 4 to 5 minutes. It would still be a clever design to have a service that runs at 12:01am and pulls the data for the previous day and caches it somewhere, perhaps in compressed form :) since you can't compress data over the SQL protocol (well, you can with 3rd-party software, I guess). Then his client knows to pull 05202013.zip, 05212013.zip and 05222013.zip and concatenate them for those 3 days, for example. You'd probably shrink the data down from 2.6GB to less than a gig.

You'd have to test it out, of course :)... the time you save pulling the zips might be balanced out by the time it takes to unzip + concat + load into datasets. Then again, depending on his application, the nightly service might be able to do some preprocessing steps as well.
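A sketch of the server-side part of that nightly export in PostgreSQL (the output path is made up, and `COPY` to a file runs on the database server and needs appropriate privileges, so treat this as an illustration, not a drop-in job):

```sql
-- Nightly job: dump yesterday's rows to a CSV file on the server,
-- which the scheduler can then compress and publish for clients.
COPY (
    SELECT *
    FROM datbase
    WHERE bus_received_time_stamp >= date_trunc('day', now() - interval '1 day')
      AND bus_received_time_stamp <  date_trunc('day', now())
) TO '/var/cache/reports/yesterday.csv' WITH (FORMAT csv, HEADER);
```

The half-open date range means each day's file is exact, with no overlap or gap between consecutive nights.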
-
Thanks Pete and all the other guys. I need all the data to generate a report (a summary billing report), so client-side calculations etc. are a must. What I have found helpful was creating a stored procedure on the DB. It is a local DB, not a network DB; we're in the concept phase before business would throw money at a server. I did a heck of a lot of reading up on Postgres, as this is my DB. For now I am sitting at 2 million records and it runs fine: it takes about 2 minutes to collect the data and do all the relevant calculations etc. Indexing the date column helped with performance as well, as I use it to create the report for a specific date range. Just for interest's sake, 2 million records is only one month's worth of data, which is what I am extracting from an Oracle database that carries way more data than what I am carrying to do the billing. So, in short, it works for now. Will see how the performance degrades as the local DB grows.
Excellence is doing ordinary things extraordinarily well.
-
Thanks Sledge. I did all of the above before returning to the thread, after doing a whole lot of background work on Postgres databases etc. The stored procedure definitely helped, and so did the indexing. I am also in the process of looking at using a refcursor. But the proof will be in the pudding on Sunday, when I have to run the billing for the month of May. Just over 2 million records now, and it takes around 2 to 4 minutes to complete everything, calculations included, and create the PDF output reports.
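Since the post mentions looking at a refcursor: a minimal sketch of that approach (the function and cursor names are made up), where the client fetches rows in batches inside one transaction instead of materialising the whole result set at once:

```sql
-- Function that opens a cursor over the billing rows and hands back
-- its name; the client then FETCHes from it in manageable chunks.
CREATE FUNCTION get_billing_cursor(p_start timestamp, p_end timestamp)
RETURNS refcursor
LANGUAGE plpgsql AS $$
DECLARE
    c refcursor := 'billing_cur';
BEGIN
    OPEN c FOR
        SELECT *
        FROM datbase
        WHERE bus_received_time_stamp >= p_start
          AND bus_received_time_stamp <  p_end;
    RETURN c;
END;
$$;

-- Usage (must stay inside one transaction):
-- BEGIN;
-- SELECT get_billing_cursor('2013-05-01', '2013-06-01');
-- FETCH 10000 FROM billing_cur;
-- ...
-- COMMIT;
```

This won't reduce the total transfer time, but it keeps client memory flat and lets processing start before the last row arrives.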
Excellence is doing ordinary things extraordinarily well.