Question on performance...
-
Not so much a programming question, but more of a what's-your-opinion / experience question. A customer of mine is importing CSV files through a website hosted in Microsoft Azure. The file typically has around 2,000 to 2,500 lines and 20 to 25 values per line (depending on the type of import). However, something is currently very wrong in my software or database, and more often than not my customer is greeted by a timeout error (well, a "something went wrong" message, because I do proper error handling ;)). Last week it took 17 tries to import one file. The funny thing is, it ultimately works every time.

Of course it wasn't always like that, but the import table in Azure SQL has over 2 million records now and I need to validate for double records, calculate a couple of other fields for which I need additional data, create some grouped data, etc. Since I've added some stuff in the past year it's gotten slower and slower. Not so much an issue in my development environment, but apparently a bottleneck in production. Good thing they're calling me now that it's out of control, rather than sooner when it was more manageable :laugh:

Anyway, I've currently got it down to four to six seconds, which is still an eternity I think. Inserting so many records into the database, as well as fetching a good amount, just takes some time, apparently. I'm doing everything synchronously (well, async, but awaited) and I haven't checked indexes yet, so maybe I could get it a bit faster still. Perhaps upgrading my database in Azure could help a bit too. If I really wanted to, I could make it instant, handle everything async, and give the user live updates. They've got like five imports a day, so it's not like these five seconds are that big of a deal. Other than that the system is pretty fast and they're very satisfied :D

So, my question: how long should such an action take, according to you? Is five seconds alright and should I just show a "busy" message, or is five seconds completely unacceptable (considering the scenario)?
Best, Sander Azure DevOps Succinctly (free eBook) Azure Serverless Succinctly (free eBook) Migrating Apps to the Cloud with Azure arrgh.js - Bringing LINQ to JavaScript
I would implement a tmp table with a twist: a diff. I would create a table, let's call it "ImportData", with 3 columns:

ID: I assume there is a field in the CSV that is unique
Data: the complete CSV line
Status: an import status depending on the operation

I like char instead of int with "magical numbers" for these kinds of statuses; a single char will not hurt performance. Then:

1: Set the status to '?' before importing
2: Import to this table row by row
3: Have a SQL trigger that compares the value of column 'Data'. If it changed, set Status to 'U' (Updated); if it did not change, set Status to 'N' (NOP, no operation)
4: At the end, set all rows still at '?' to 'D' (Deleted); these are the rows not present in the CSV
5: Process all rows with a status different from 'N' (the changed ones). If 'U', insert or update in the main table; if 'D', delete from the main table

This will give you a differential update with only the changed rows being written to your main data table. It will also give you a 'snapshot' of all data received in the last CSV file, which is very useful for debugging. I have several imports done this way, and they are usually very fast since (in my case) most CSV rows do not change. If the import data is XML or JSON you can do the same by saving the XML node or JSON node in the 'Data' column.
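For illustration, a minimal C# sketch of that flow, assuming a hypothetical ImportData(Id, Data, Status) table and an AFTER UPDATE trigger (not shown) that sets Status to 'U' or 'N' depending on whether the Data value changed. All table and column names here are made up for the example:

```csharp
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

public static class DiffImport
{
    // csvRows: (unique id taken from the CSV, the complete raw CSV line)
    public static void Run(string connectionString, IEnumerable<(string Id, string Line)> csvRows)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();

        // 1. Flag every existing staging row as "not seen in this file yet".
        using (var reset = new SqlCommand("UPDATE ImportData SET Status = '?'", conn))
            reset.ExecuteNonQuery();

        // 2. Upsert each CSV line. The AFTER UPDATE trigger on ImportData (not shown)
        //    compares the old and new Data values and sets Status to 'U' or 'N'.
        const string upsert = @"
            UPDATE ImportData SET Data = @data WHERE Id = @id;
            IF @@ROWCOUNT = 0
                INSERT INTO ImportData (Id, Data, Status) VALUES (@id, @data, 'U');";
        foreach (var (id, line) in csvRows)
        {
            using var cmd = new SqlCommand(upsert, conn);
            cmd.Parameters.AddWithValue("@id", id);
            cmd.Parameters.AddWithValue("@data", line);
            cmd.ExecuteNonQuery();
        }

        // 3. Anything still flagged '?' was not in this file: mark it as deleted.
        using (var del = new SqlCommand("UPDATE ImportData SET Status = 'D' WHERE Status = '?'", conn))
            del.ExecuteNonQuery();

        // 4. Rows where Status <> 'N' are the differential set to apply to the main table.
    }
}
```

Step 2 could also be done set-based (bulk-load into a scratch table and MERGE into ImportData), but the row-by-row version keeps the trigger idea visible.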
-
Daniel Pfeffer wrote:
Indexes are used to maintain the sort order of the records in a table. Searching an index is much faster than searching an unindexed table, and best of all - multiple search criteria may be stored as multiple indexes.
Bit of mansplaining there ;) But yeah, I actually checked my indexes and found one with a very high impact on performance. Removed an unused one too. I'm not going to add a unique key, as there are around 20 fields that, together, should be unique. I feel like that would hurt performance more than it would add to it. That's why I'm checking that manually, because I can search for potentially double data on a specific index, which I need to do anyway to let the user know they're uploading double data (which is now actually allowed, since last month). Removed some data analysis at this step too. I'm storing some redundant data so I don't have to analyze it later. Turns out, with a well placed index, the analysis is instant later on.
Best, Sander Azure DevOps Succinctly (free eBook) Azure Serverless Succinctly (free eBook) Migrating Apps to the Cloud with Azure arrgh.js - Bringing LINQ to JavaScript
A few other ideas:

* Check for index fragmentation. If it's high, that will hurt performance. In SSMS, right-click on Indexes and select either Rebuild or Reorganize to see what the fragmentation values are, then OK to perform the action.
* Even if you need 20 columns to make a unique index, do it as a composite key. That will still perform better than doing the check manually, and manual checks can have race conditions between the check and the insert.
* When creating indexes, don't forget about included columns. These are columns that are not part of the index key but are stored with it, so the index stays small and fast while still returning the data you need without an extra lookup. (Both kinds of index are sketched below.)
* Use the Execution Plan in SSMS to see where your bottlenecks are on the database side. Sometimes it will also offer index suggestions.

Enjoy!
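To make the composite-key and included-column points concrete, here is a hedged sketch that creates both kinds of index from application code. The table and column names are hypothetical, not the real schema:

```csharp
using Microsoft.Data.SqlClient;

public static class IndexSetup
{
    public static void CreateImportIndexes(string connectionString)
    {
        const string ddl = @"
            -- Composite unique index: the database now enforces 'no duplicate lines'
            -- atomically, so no separate check-then-insert round trip is needed.
            CREATE UNIQUE INDEX UX_ImportLines_Natural
                ON ImportLines (OrderNumber, LineDate, ProductCode, Quantity);

            -- Narrow search index with included columns: the key stays small, but
            -- queries filtering on LineDate get Price and GroupCode without a key lookup.
            CREATE INDEX IX_ImportLines_LineDate
                ON ImportLines (LineDate)
                INCLUDE (Price, GroupCode);";

        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(ddl, conn);
        cmd.ExecuteNonQuery();
    }
}
```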
Bond Keep all things as simple as possible, but no simpler. -said someone, somewhere
-
How do you handle validation feedback to your users? A user imports a file and wants to know why it failed (e.g. some value is invalid, or the file was already (partly) imported).
Best, Sander Azure DevOps Succinctly (free eBook) Azure Serverless Succinctly (free eBook) Migrating Apps to the Cloud with Azure arrgh.js - Bringing LINQ to JavaScript
Well, since these updaters run as an automated process, the customer gets an email. Internally, all errors, warnings, and performed actions are logged, so if it were a manually initiated process, we could provide a full report to the user on the website as well.
Latest Article:
Create a Digital Ocean Droplet for .NET Core Web API with a real SSL Certificate on a Domain
-
This is a huge area with all sorts of possible ways of solving it. I currently work on a rather huge codebase, part of which performs automated imports and exports in a batching system; that would probably be overkill for what you are wanting. There isn't one solution that fits every case like this; however, a staging table is always a good place to start. Import the data into a staging table and you can decide on a strategy from there. The strategy might involve selectively picking records that are then fed into your main tables, or even letting the user choose the data. The advantage of a staging table is that you can isolate any issues within it, and you can query and analyse the data there without affecting your main tables.

As for timings: for something like an import there should really be no guarantee of how many seconds it takes, because there are so many variables out of your control (connection speed, number of records, record sizes, etc.). What you can do is provide some sort of progress bar to let the user know what stage things are at (I actually deliberately slow the progress bar in some applications so that users get a sense of something happening, rather than something flashing on and off the screen).
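As a small illustration of the progress idea, a sketch that pushes per-stage updates to the UI through IProgress<T> while the import runs; the stages here are placeholders, not the real import code:

```csharp
using System;
using System.Threading.Tasks;

public static class ImportProgressDemo
{
    // Reports coarse progress per stage so the UI can show a moving bar
    // instead of a frozen page. The stage bodies below are placeholders.
    public static async Task RunImportAsync(IProgress<(string Stage, int Percent)> progress)
    {
        progress.Report(("Staging CSV rows", 10));
        await Task.Delay(500);            // placeholder: bulk-load into a staging table

        progress.Report(("Checking for duplicates", 50));
        await Task.Delay(500);            // placeholder: set-based duplicate check

        progress.Report(("Writing to main tables", 80));
        await Task.Delay(500);            // placeholder: insert/update the main tables

        progress.Report(("Done", 100));
    }
}

// Usage: Progress<T> posts the callback to the context it was created on (e.g. the UI thread).
// await ImportProgressDemo.RunImportAsync(
//     new Progress<(string Stage, int Percent)>(p => Console.WriteLine($"{p.Stage}: {p.Percent}%")));
```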
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
-
This falls under the 'checking to make sure I'm not leaving my drink on the roof of my car before driving off' category of suggestions. You may want to make sure the database connection is not closing and reconnecting between each record. This is something I've run into back in my old Access days. :wtf:
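For illustration, a sketch of the difference, with a hypothetical ImportLines table; the second version opens one connection (and one transaction) for the whole file:

```csharp
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

public static class ConnectionReuse
{
    // Anti-pattern: a new SqlConnection per row means a pool check-out (or, without
    // pooling, a full handshake) plus setup work 2,500 times per file.
    public static void InsertRowsSlow(string connectionString, IEnumerable<string> rows)
    {
        foreach (var row in rows)
        {
            using var conn = new SqlConnection(connectionString);
            conn.Open();
            using var cmd = new SqlCommand("INSERT INTO ImportLines (Data) VALUES (@data)", conn);
            cmd.Parameters.AddWithValue("@data", row);
            cmd.ExecuteNonQuery();
        }
    }

    // Better: open once and reuse the connection (and, here, one transaction) for the whole file.
    public static void InsertRowsFaster(string connectionString, IEnumerable<string> rows)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var tx = conn.BeginTransaction();
        using var cmd = new SqlCommand("INSERT INTO ImportLines (Data) VALUES (@data)", conn, tx);
        var data = cmd.Parameters.Add("@data", System.Data.SqlDbType.NVarChar, 4000);
        foreach (var row in rows)
        {
            data.Value = row;
            cmd.ExecuteNonQuery();
        }
        tx.Commit();
    }
}
```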
-
I had a similar issue where I needed to import tens of thousands of records every day, most of them duplicates. We only inserted a record if it didn't already exist, and that check was a serious performance bottleneck. I generated hashes for all of the existing rows and added an index on the new hash column. Now I calculate the hash of the values in each new row to be inserted and compare it against the existing rows, and I skip the insert if an identical row already exists. Using the index solved the insert performance problem for me. I don't recommend using MD5 or similarly weak hashes, because it's too easy to end up with colliding values.
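A minimal sketch of that approach, assuming a hypothetical ImportLines table with an indexed RowHash column, and SHA-256 rather than MD5 as suggested:

```csharp
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;
using Microsoft.Data.SqlClient;

public static class HashedImport
{
    // Hash all values of a row in a fixed order; the result goes into an indexed
    // RowHash column so "does this row already exist?" becomes a single index seek.
    public static byte[] ComputeRowHash(IEnumerable<string> values)
    {
        // The unit separator avoids "ab" + "c" colliding with "a" + "bc".
        var canonical = string.Join("\u001F", values);
        using var sha = SHA256.Create();
        return sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
    }

    public static void InsertIfNew(SqlConnection openConnection, string rawLine, byte[] rowHash)
    {
        // Relies on an index on ImportLines(RowHash); table and column names are hypothetical.
        const string sql = @"
            IF NOT EXISTS (SELECT 1 FROM ImportLines WHERE RowHash = @hash)
                INSERT INTO ImportLines (Data, RowHash) VALUES (@data, @hash);";
        using var cmd = new SqlCommand(sql, openConnection);
        cmd.Parameters.AddWithValue("@hash", rowHash);
        cmd.Parameters.AddWithValue("@data", rawLine);
        cmd.ExecuteNonQuery();
    }
}
```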
-
No, you just need to read the CSV file into an IEnumerable of sorts and connect it to an EntityDataReader that you use as input to SqlBulkCopy. EntityDataReader is part of System.Data.EntityClient. Or you can use a CSV-Reader[^] that you connect directly to SqlBulkCopy.
Wrong is evil and must be defeated. - Jeff Ello
The BulkInsert is using SqlBulkCopy internally. Using SqlBulkCopy directly is about equally fast.
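For reference, a minimal SqlBulkCopy sketch with hypothetical column names. It buffers the CSV in a DataTable for brevity; the CSV-reader/IDataReader route mentioned above streams the rows instead:

```csharp
using System.Data;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class BulkLoader
{
    // Loads a CSV straight into a staging table with SqlBulkCopy.
    public static async Task LoadAsync(string connectionString, string csvPath)
    {
        var table = new DataTable();
        table.Columns.Add("OrderNumber", typeof(string));   // hypothetical columns,
        table.Columns.Add("Quantity", typeof(int));          // not the real schema
        table.Columns.Add("Price", typeof(decimal));

        foreach (var line in File.ReadLines(csvPath))
        {
            var parts = line.Split(';');
            table.Rows.Add(parts[0], int.Parse(parts[1]), decimal.Parse(parts[2]));
        }

        using var conn = new SqlConnection(connectionString);
        await conn.OpenAsync();
        using var bulk = new SqlBulkCopy(conn)
        {
            DestinationTableName = "ImportStaging",   // hypothetical staging table
            BatchSize = 1000
        };
        // Without explicit ColumnMappings, columns are matched by ordinal position.
        await bulk.WriteToServerAsync(table);
    }
}
```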
Best, Sander Azure DevOps Succinctly (free eBook) Azure Serverless Succinctly (free eBook) Migrating Apps to the Cloud with Azure arrgh.js - Bringing LINQ to JavaScript
-
1. Figure out the bottleneck. Since it does not happen in dev, but does in production, I suspect it is the duplicate checking. If you size dev the same as production, will it reproduce in dev? Also, do you have more fault tolerance / active nodes in production? If you are doing a few thousand small transactions in a high-availability setup, it will take longer.

Do you perform all of your duplicate checks in a single query, or one import row at a time? If you set up good unique indexes, you can skip the duplicate checking and let the DB do it for you. This would point you toward a commit per import row. We had an import feature that ended up with unnecessary exponential complexity; it was fine during testing but started getting really slow at only 150 rows.
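To illustrate the single-query duplicate check, a hedged sketch (with hypothetical ImportStaging and ImportLines tables and key columns) that lets the server find the new rows in one set-based statement instead of one query per CSV row:

```csharp
using Microsoft.Data.SqlClient;

public static class SetBasedDuplicateCheck
{
    // One round trip instead of one query per CSV row: everything is already
    // bulk-loaded into ImportStaging, and the server joins it against the main
    // table to insert only the rows that do not exist yet. Names are hypothetical.
    public static int InsertNewRows(string connectionString)
    {
        const string sql = @"
            -- If duplicates need to be reported back to the user, a similar join
            -- (s JOIN ImportLines m ON the same key columns) returns them first.

            INSERT INTO ImportLines (OrderNumber, LineDate, ProductCode, Quantity, Price)
            SELECT s.OrderNumber, s.LineDate, s.ProductCode, s.Quantity, s.Price
            FROM ImportStaging s
            WHERE NOT EXISTS (
                SELECT 1 FROM ImportLines m
                WHERE m.OrderNumber = s.OrderNumber AND m.LineDate = s.LineDate);";

        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(sql, conn);
        return cmd.ExecuteNonQuery();   // number of genuinely new rows inserted
    }
}
```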
-
charlieg wrote:
these values are text? numeric?
Text, numbers, decimals, dates.
charlieg wrote:
Just how large is the overall file?
I have an initial file of 248 KB.
charlieg wrote:
What's the connection speed between the web site and the azure based system?
They both run in Azure in the same region, so I suspect the connection is fast.
charlieg wrote:
Does your azure system have sufficient resources?
Yeah, we don't have the fastest database (50 DTUs), but it's plenty sufficient for everything else.
charlieg wrote:
see where you are spending your time.
Inserting 2500 lines.
await context.BulkInsertAsync(lines);
is the exact line ;) Although I suspect there may be some other long-running queries in production, possibly getting all the lines for a specific date. I'm looking at my new code now (and added an index) and the inserting takes the longest by far (a rewrite of this functionality was necessary for other reasons too, I just gave it more priority because of this).
charlieg wrote:
Solve the timeout issue, and I'd give it 50/50 your performance issues go away.
The timeout issue is the performance issue :laugh:
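A crude way to confirm where the time actually goes, as a sketch; the stage names and the duplicate-check method are placeholders, while the bulk insert call is the line quoted above:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ImportTiming
{
    // Wrap each stage of the import so the slow one shows up in the logs.
    public static async Task RunTimedAsync(string stageName, Func<Task> stageWork)
    {
        var sw = Stopwatch.StartNew();
        await stageWork();
        Console.WriteLine($"{stageName}: {sw.ElapsedMilliseconds} ms");
    }
}

// Usage (CheckDuplicatesAsync is a hypothetical stage method):
// await ImportTiming.RunTimedAsync("Bulk insert", () => context.BulkInsertAsync(lines));
// await ImportTiming.RunTimedAsync("Duplicate check", CheckDuplicatesAsync);
```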
Best, Sander Azure DevOps Succinctly (free eBook) Azure Serverless Succinctly (free eBook) Migrating Apps to the Cloud with Azure arrgh.js - Bringing LINQ to JavaScript
Sander Rossel wrote:
Yeah, we don't have the fastest database (50 DTUs), but it's plenty sufficient for everything else.
I ran some testing of our application with Azure SQL some time ago and found it to be very poor for an OLAP-style, IO-bound workload, and checking a big index for uniqueness is probably similar. 50 DTUs on the Basic or Standard tier is no more than 200 IOPS: DTU-based purchasing model - Azure SQL Database | Microsoft Learn[^]. For us it was better to use SQL Server on an Azure virtual machine; then we could do optimisations like striping the database across multiple smaller (unmanaged) HDDs using Windows Storage Spaces. Nowadays there are probably better managed-disk options with SSDs etc. to get you a decent IO performance level: Configure storage for SQL Server VMs - SQL Server on Azure VMs | Microsoft Learn[^]
-
50 DTUs on the Standard tier is more than enough for everything else we have. It's not an OLAP application, it's more administrative. It has some master data, you can import and edit orders (it's this importing that's causing me headaches), and there are some one-off reports and other functionality. The import lines (about 2,500 a file) are grouped, some prices are calculated, and those prices are added per group in code, which is super fast. Other than that, the lines are printed on a report. All in all it's a pretty small application that ties the production and the financial software together. The database currently stores 3.38 GB of data, of which this import table is about two thirds (and that's all we do with those lines) :sigh:
Best, Sander Azure DevOps Succinctly (free eBook) Azure Serverless Succinctly (free eBook) Migrating Apps to the Cloud with Azure arrgh.js - Bringing LINQ to JavaScript