SQL vs Code

Andy Brummer

If that had allowed us to scale out, it would have been an option. However, we were so resource constrained on the database servers that it probably would not have scaled better. The application was a web log processing application which processed anywhere between 10 and billions of records a day. We calculated rolling aggregates over the past week for various attributes to make predictions, and had another process to apply future data backwards to detect additional fraud. The original design used things like materialized views to generate the aggregates; however, that was not able to scale with the hardware constraints that we had. The next design calculated late-loaded aggregates with a set of indexes on the data table. However, inserts became too slow. It is possible that CLR aggregation would have been able to help, but I was flabbergasted to find that opening up multiple threads and pulling chunks of raw data straight out of the main table, while building aggregates on the fly with .NET dictionaries, was tremendously fast. It allowed us to load up a week of data in under a minute, hitting 70% utilization of a gigabit network, and then process 3,000 to 5,000 transactions a second. For some of the larger clients we had a special build which was able to hit 25,000 to 30,000 with 10 GB of data in memory. Not bad for a little C# console app. In fact, the biggest bottleneck was normalizing cookie data.
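
For readers who want to picture the approach, here is a minimal sketch of "several threads pull chunks of the raw table and aggregate into dictionaries". It is written in Python with SQLite rather than the C# and database server of the original application, and the `log` table, its columns, the chunk size, and the function names are all assumptions made for illustration. In Python the threads mostly overlap I/O waits; the original .NET app would also have gained CPU parallelism.

```python
import sqlite3
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def build_sample_db(path, rows=1000):
    """Create a hypothetical raw log table: (id, client, hits)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE log (id INTEGER PRIMARY KEY, client TEXT, hits INTEGER)"
    )
    conn.executemany(
        "INSERT INTO log (id, client, hits) VALUES (?, ?, ?)",
        ((i, f"client{i % 3}", 1) for i in range(1, rows + 1)),
    )
    conn.commit()
    conn.close()


def aggregate_chunk(path, lo, hi):
    """Read the id range [lo, hi) on its own connection and total it in a dictionary."""
    totals = Counter()
    conn = sqlite3.connect(path)  # one connection per worker call
    try:
        query = "SELECT client, hits FROM log WHERE id >= ? AND id < ?"
        for client, hits in conn.execute(query, (lo, hi)):
            totals[client] += hits
    finally:
        conn.close()
    return totals


def aggregate_parallel(path, max_id, chunk_size=250, workers=4):
    """Split the id space into chunks, aggregate each in a thread, merge the results."""
    ranges = [
        (lo, min(lo + chunk_size, max_id + 1))
        for lo in range(1, max_id + 1, chunk_size)
    ]
    combined = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(lambda r: aggregate_chunk(path, *r), ranges):
            combined.update(partial)  # Counter.update adds the partial totals
    return combined
```

Calling `build_sample_db(path)` on a scratch file and then `aggregate_parallel(path, max_id=1000)` returns one merged dictionary of per-client totals. The database only serves simple range scans here; all of the aggregation work happens in the client process.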

I can imagine the sinking feeling one would have after ordering my book, only to find a laughably ridiculous theory with demented logic once the book arrives - Mark McCutcheon

rghubert

We find LINQ and CLR to be extremely interesting innovations that enable flexible options -- in combination with stored procedures/SQL -- to solve more problems fast. SQL is mature, LINQ and CLR are just getting started, so give them a chance, and a try...

Pete OHanlon

We do a lot of spatial work, which we previously had to delegate out to MapInfo to process; now we do it directly in Oracle Spatial and it works a treat.

"WPF has many lovers. It's a veritable porn star!" - Josh Smith

As Braveheart once said, "You can take our freedom but you'll never take our Hobnobs!" - Martin Hughes.

Dark Yak

This question cannot be answered without knowing the details of the implementation. These are general remarks to consider:

* From a performance perspective: SQL is generally much faster at processing whole datasets than at row-by-row computation. But as Andy said, there are caveats. In one of my projects I did all the computation in the database. It worked fine at around 10 million rows, but when we reached 60 million the total duration grew exponentially, because the procedure starved the server of resources (memory, paging, ...). I think there is a balance point, in terms of the size of the dataset being computed, where SQL is better than code. You just have to find it :-)
* From a design perspective: In my opinion, moving business logic (computation, calculation) into the database should be avoided if possible. I usually use the database for data storage and a few basic queries (basic aggregation). Complex data manipulation with business logic should be done in the BAL (Business Access Layer).

Conclusion:
- You have to determine what is best for you: architecture design versus performance design.
- You have to determine which approach is best: whole dataset processing, chunked processing, or streaming.
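
The three approaches named in the conclusion can be sketched side by side. This is only an illustration in Python with an in-memory SQLite database; the `orders` table, the chunk size, and the function names are made up for the example, and which variant wins in practice depends on data volume and server resources, as discussed above.

```python
import sqlite3


def make_db(n=10):
    """Hypothetical table with amounts 1..n."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)", ((i,) for i in range(1, n + 1))
    )
    return conn


def total_whole_dataset(conn):
    """Whole dataset: the database computes the result in one set-based statement."""
    return conn.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]


def total_chunked(conn, chunk_size=4):
    """Chunked: the code pulls fixed-size pages by key and sums each page itself."""
    total, last_id = 0, 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM orders WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            return total
        total += sum(amount for _, amount in rows)
        last_id = rows[-1][0]  # resume after the last id seen


def total_streaming(conn):
    """Streaming: iterate over the result one row at a time, holding only the total."""
    total = 0
    for (amount,) in conn.execute("SELECT amount FROM orders"):
        total += amount
    return total
```

All three functions return the same total for the same table. What differs is where the work happens and how much data is held in memory at once.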

Mycroft Holmes

Pete O'Hanlon wrote:

We do a lot of spatial work

90% of my work is CRUD financial calcs, so SQL deals with that nicely. Tell me, does Oracle have an equivalent of the HierarchyID that SQL Server uses? We have approx 4 TB of data to work on and I have opted for Oracle rather than SQL Server as the database.

Never underestimate the power of human stupidity RAH

Fabio Franco

I'm with you on that. I've always favored pure SQL for performance. On one project I warned the project leader about the possible performance pitfalls of using LINQ to SQL. Guess what? The system got slow as hell and a lot of effort had to be put in to improve performance. The productivity argument fell apart in this case. I believe the opposite can be true when no complex operations are performed on data from the database; then productivity might be good without affecting performance much.

Andy Brummer

Dark Yak wrote:

In my opinion, moving business logic (computation, calculation) into the database should be avoided if possible. I usually use the database for data storage and a few basic queries (basic aggregation). Complex data manipulation with business logic should be done in the BAL (Business Access Layer).

As far as scaling to large datasets goes, there are other solutions like Greenplum, cubes, Hadoop, etc. that might provide more benefit than just writing custom code. So when it comes down to it, there is no clear-cut answer to the question. What works in one context can easily be a horrible solution in another. There are so many choices now that it's hard to pick the right one.

JasonPSage

Sounds like the right choice to me, bro! If "it bites you later", it's because something has changed... you have options... refine the SQL or go to the mid tier. Sounds like you hit a home run if you ask me!

Know way too many languages... master of none!

Dan Neely

But your successor won't have that option to do his bogotesting?

3x12=36 2x12=24 1x12=12 0x12=18

Ishmael Turner

I don't know your domain or the requirements of the query, but you mentioned trees, recursion, and CTEs. Maybe you already know about the Nested Set model? There is a good article about it at http://dev.mysql.com/tech-resources/articles/hierarchical-data.html. It can improve the performance of queries on hierarchical data, at a cost during insert and update. It can also massively simplify the SQL you use to query. Sorry if this is all well known... maybe you're already using it! :-O
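
For anyone who has not seen the model, here is a small sketch of the idea, run through Python's sqlite3 module. Each node stores a left and right number, and a node's descendants are exactly the rows whose left value falls between its own left and right. The `category` table, the sample tree, and the function names are assumptions chosen for illustration, not taken from the original poster's schema.

```python
import sqlite3


def make_tree():
    """Hypothetical category tree stored as nested sets.

    Electronics (1, 10)
      Televisions (2, 5)
        LCD (3, 4)
      Audio (6, 9)
        MP3 (7, 8)
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE category (name TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER)"
    )
    conn.executemany(
        "INSERT INTO category VALUES (?, ?, ?)",
        [
            ("Electronics", 1, 10),
            ("Televisions", 2, 5),
            ("LCD", 3, 4),
            ("Audio", 6, 9),
            ("MP3", 7, 8),
        ],
    )
    return conn


def subtree(conn, root):
    """Return root and all its descendants with one join, no recursion."""
    rows = conn.execute(
        """
        SELECT node.name
        FROM category AS node
        JOIN category AS parent
          ON node.lft BETWEEN parent.lft AND parent.rgt
        WHERE parent.name = ?
        ORDER BY node.lft
        """,
        (root,),
    )
    return [name for (name,) in rows]


def path_to(conn, leaf):
    """Return the path from the top of the tree down to the given node."""
    rows = conn.execute(
        """
        SELECT parent.name
        FROM category AS node
        JOIN category AS parent
          ON node.lft BETWEEN parent.lft AND parent.rgt
        WHERE node.name = ?
        ORDER BY parent.lft
        """,
        (leaf,),
    )
    return [name for (name,) in rows]
```

With this sample data, `subtree(conn, "Televisions")` gives `['Televisions', 'LCD']` and `path_to(conn, "MP3")` gives `['Electronics', 'Audio', 'MP3']`. The cost mentioned above shows up on writes: inserting or moving a node means renumbering the `lft` and `rgt` values of other rows.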

YSLGuru

"Can't you use C# in SQL Server now?" Sure you can. There are lots of ways to get around using native SQL (T-SQL for SQL Server or PL/SQL for Oracle) and let the procedural programmer avoid learning or using a set-based language, but they all have the same downside: they are procedural solutions. So the answer to the second half of your question, "Wouldn't those C# optimizations work with the DB itself?", is 'Probably, but they would still not perform nearly as well as native SQL.'

YSLGuru

Well, if that poor sap is as adamant about forcing a round peg into a square hole, then yeah, he probably will have a hard time. Everyone else who uses the application built on this SQL solution, though, will be very happy that the solution was chosen based not on what the programmer wanted to do but on what was best in terms of performance. We already have enough examples of bad coding for relational database back ends, where the creator of the solution either did not want to or could not understand how to use a set-based language, and instead kept looking for ways to make procedural answers work on set-based problems; or how to make a square peg fit in a round hole.
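
To make the set-based versus procedural distinction concrete, here is a small sketch using Python's sqlite3 module. The `sales` table, its data, and the function names are invented for the example. The first function states the result as a single set-based query; the second gets the same answer the procedural way, looping over groups and rows. It says nothing about timings on any particular server, it only shows the two styles.

```python
import sqlite3


def make_db():
    """Hypothetical sales table used by both variants."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10), ("south", 5), ("north", 7), ("south", 1), ("east", 3)],
    )
    return conn


def totals_set_based(conn):
    """Set-based: describe the result, let the engine decide how to compute it."""
    return dict(
        conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )


def totals_row_by_row(conn):
    """Procedural, cursor-style: one query per region, then add up row by row."""
    totals = {}
    regions = [r for (r,) in conn.execute("SELECT DISTINCT region FROM sales")]
    for region in regions:
        total = 0
        for (amount,) in conn.execute(
            "SELECT amount FROM sales WHERE region = ?", (region,)
        ):
            total += amount
        totals[region] = total
    return totals
```

Both functions return the same dictionary of totals per region. The second one issues one extra query per region and moves every row across the client boundary, which is the overhead the set-based form avoids.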

YSLGuru

Kudos for doing what's best in terms of performance and the end users instead of what's easier or more favored/fun.

Sinisa Hajnal

We use CLRs for statistics on warehouse orders and internal transactions. On any given day there are several thousand item rows, several hundred to a few thousand documents, and a similar number of locations to track. And we have data collected over ten years. Normally a day or month analysis takes a few seconds, but we had to produce a few yearly and all-time statistics ('trends') and 'same month over years' comparisons. Nothing beats CLR, but yes, it needs several tweaks on the database to enable it. (We tried CTEs, too slow; same with table variables and temporary tables and cursors)

YSLGuru

"(We tried CTEs, too slow; same with table variables and temporary tables and cursors)" Well, that was your problem. Cursors are nothing but procedural methodology done within T-SQL, so it's no wonder you got bad performance. As for CLRs, the only scenario where a CLR will perform better than standard T-SQL that uses a proper set-based approach (that means NO cursors) is where text manipulation is involved at some measurable level. For example, if you want to find a keyword within a very large amount of text, then a CLR will do better. If, however, you are looking for aggregate values over some set of data, then CLRs will not outperform properly written T-SQL code.

I work with millions of rows on a regular basis and, in a few queries, hundreds of millions of rows, so I have some applicable experience in dealing with performance issues. There's no way I'd ever use a cursor for any of the processes we do, and I would only consider a CLR if the process was heavy in text manipulation.

CLRs have their place, and when used properly they're great. The problem is that they are often used in the wrong scenario, just as cursors are used when they should not be. This is because, when your background is in a procedural language, it's much easier to use and understand a cursor in T-SQL than to work out a pure set-based solution (that means NO cursors). I totally understand why programmers opt to use cursors so often, because I've done procedural programming, but that doesn't change the fact that a pure set-based solution, except when heavy text manipulation is involved, will perform better 99% of the time, if not more.

Sinisa Hajnal

YSLGuru wrote:

Well, that was your problem. Cursors are nothing but procedural methodology done within T-SQL, so it's no wonder you got bad performance.

Yes, we know, that's why we DO NOT do it unless it needs to be done. Our DBA has over 20 years of experience with relational databases, and we take over only when he says he cannot do anything more to optimize query time. THEN we try CLR :) The most complicated thing I did was a tariff calculation for goods transport that included dynamic items depending on distance, fuel consumption, client, special privileges, etc. Initially done with CTEs, it proved slow. Table variables filled independently for each step and combined near the end proved a good solution, with calculation time going from 20 seconds to 0.4 seconds.
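
The "fill one intermediate table per step, combine at the end" pattern can be sketched roughly as follows. This uses SQLite temporary tables from Python as a stand-in for T-SQL table variables, and the schema, the rates (2 per km, 3 per litre), and the discount rule are all invented for illustration; the real tariff rules are not given in the post.

```python
import sqlite3


def make_db():
    """Hypothetical clients and shipments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE client (id INTEGER PRIMARY KEY, discount_pct INTEGER);
        CREATE TABLE shipment (id INTEGER PRIMARY KEY, client_id INTEGER,
                               distance_km INTEGER, fuel_litres INTEGER);
        INSERT INTO client VALUES (1, 10), (2, 0);
        INSERT INTO shipment VALUES (1, 1, 100, 10), (2, 2, 50, 4);
        """
    )
    return conn


def tariff_staged(conn):
    """Compute each tariff component into its own temp table, then join once."""
    # Step 1: distance charge (assumed rate: 2 per km).
    conn.execute(
        "CREATE TEMP TABLE step_distance AS "
        "SELECT id, distance_km * 2 AS charge FROM shipment"
    )
    # Step 2: fuel charge (assumed rate: 3 per litre).
    conn.execute(
        "CREATE TEMP TABLE step_fuel AS "
        "SELECT id, fuel_litres * 3 AS charge FROM shipment"
    )
    # Step 3: client discount looked up per shipment.
    conn.execute(
        "CREATE TEMP TABLE step_discount AS "
        "SELECT s.id, c.discount_pct FROM shipment AS s "
        "JOIN client AS c ON c.id = s.client_id"
    )
    # Final step: combine the independently filled tables.
    rows = conn.execute(
        """
        SELECT d.id, (d.charge + f.charge) * (100 - x.discount_pct) / 100
        FROM step_distance AS d
        JOIN step_fuel AS f ON f.id = d.id
        JOIN step_discount AS x ON x.id = d.id
        ORDER BY d.id
        """
    ).fetchall()
    # Clean up so the function can be called again on the same connection.
    for table in ("step_distance", "step_fuel", "step_discount"):
        conn.execute(f"DROP TABLE {table}")
    return dict(rows)
```

Each step is a plain set-based statement whose result is materialized once, so the final join works on small, already computed tables instead of re-evaluating every component inside one large nested expression.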