How NOT to Code - Challenge 11

Feb 09, 2010

Today's challenge is in SQL. So what's wrong with this SQL statement? We'll assume it's running on SQL Server, but I suspect this is bad across most if not all database engines


SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons INNER JOIN Orders ON Persons.P_Id=Orders.P_Id
WHERE Persons.PhoneNumber like '404%' or Orders.Processed = 1
ORDER BY Persons.LastName

Comments

Dale Fraser

Dale Fraser wrote on 02/09/10 9:24 AM

Well you dont really tell us what the query is supposed to be doing.

But assuming you want a lost of all people where they have a a 404 phone numer or a processed order it looks fine.

However it doesnt make much sense that you would want such a list.
John Mason

John Mason wrote on 02/09/10 10:27 AM

The query is simply providing data nothing more and like all of these examples will work. Don't get too hooked on the type of data itself. There's one particular issue with the logic of this SQL.
spills

spills wrote on 02/09/10 11:26 AM

I really don't see the problem with the query but, maybe my morning coffee hasn't kicked in. The one potential issue may be the "Persons.PhoneNumber like '404%'" statement, and whether or not PhoneNumber is stored as a "number" or "text" datatype.
Mike

Mike wrote on 02/09/10 1:30 PM

One issue could be that when viewing the result set, you would not be able to determine why each result was returned. Did it return because of a matched phone number? Did it return from a processed order?
Alan Geleynse

Alan Geleynse wrote on 02/09/10 1:31 PM

The order of the WHERE clauses should be reversed.

Checking "Orders.Processed = 1" is a very quick operation, while "Persons.PhoneNumber like '404%'" requires string processing.

By reversing them you can take advantage of short circuiting and avoid processing the phone number any time there is an order that was processed.
lordb8r

lordb8r wrote on 02/09/10 2:23 PM

Don't mess much w/ SQL, but for the logic, why is it an "or" query? Wouldn't you want both the 404 "AND" orders processed? Otherwise, you get results back that show both...
Andreas Schuldhaus

Andreas Schuldhaus wrote on 02/09/10 2:23 PM

For me it doesn't make sense to filter for a phone number OR a processed order. IMHO it should rather be AND
John Mason

John Mason wrote on 02/09/10 2:39 PM

I probably should have made the SQL more cryptic to avoid concern over data. The 'OR' is the problem but not in the ways stated so far.
lordb8r

lordb8r wrote on 02/09/10 3:37 PM

I'll try again...is the issue that persons.PhoneNumber is being compared for either the 404, OR, it is matching against orders.processed that are equal to 1?
Andreas Schuldhaus

Andreas Schuldhaus wrote on 02/10/10 12:03 AM

When translating the inner join to a where clause (it might be easier to read)

WHERE
Persons.P_Id=Orders.P_Id
AND Persons.PhoneNumber like '404%'
OR Orders.Processed = 1

then you can see, that the OR would break the join an return all rows where orders.processed = 1, regardless if there is a corresponding P_Id. The OR part must be put in parenthesis.

AND (Persons.PhoneNumber like '404%'
OR Orders.Processed = 1 ) to avoid that.
Jeff Price

Jeff Price wrote on 02/12/10 1:59 PM

Well...as everyone else have stated, not knowing WHAT you are expecting back kinda makes it difficult :)

There could be an error using an INNER JOIN. If a person has a 404 phone, but no orders they won't be returned. Ditto for an orphaned processed order (has no person if your db design would allow that).
John Mason

John Mason wrote on 02/12/10 2:41 PM

Ok, so here's the problem. It's a performance issue. With my SQL examples, more times than not they'll be a performance related problem.

"Or" calls are very interesting because it adds complexity to any logic statement. In programming languages, this simply adds an additional step in processing which isn't a real problem. As habit, programmers tend to not worry over "or" calls. In modeling languages like SQL, adding complexity can be a very real problem. Now, I'm not saying stop using "or" in your where SQL statements. In most cases, there's not a problem at all.

However, this example is where an "or" call can cause a big problem. When you have a table join where the "or" is stepping between the parent/child tables the database engine doesn't handle this very well. If you run a query like this one and look into the execution plan for example in SQL Server, you should see very fat and expensive clustered index scans (if the indexing is more or less correct).

So what is the solution? Use a sql "union" instead.

...
where WHERE Persons.PhoneNumber like '404%'

union

...
Where Orders.Processed = 1

Run this logic and view the execution plan, there will be a big improvement.
Dale Fraser

Dale Fraser wrote on 02/12/10 3:46 PM

I wouldnt use your syntax in a fit, no coder would understand it. Can you post the full union code not just ...'s

Im not convinced this is a good solution on many levels.
John Mason

John Mason wrote on 02/12/10 4:56 PM

@Dale, certainly. Here ya go..

SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons INNER JOIN Orders ON Persons.P_Id=Orders.P_Id
WHERE Persons.PhoneNumber like '404%'
UNION
SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons INNER JOIN Orders ON Persons.P_Id=Orders.P_Id
WHERE Orders.Processed = 1
ORDER BY Persons.LastName

Let me know what you don't like.
Peter Boughton

Peter Boughton wrote on 02/12/10 7:50 PM

Don't know about Dale, but I don't like duplicating all except one line of a query, just because Microsoft's database engine is a pile of shit that can't handle a simple OR. :/
John Mason

John Mason wrote on 02/12/10 9:34 PM

@Peter - I bet Oracle would have similar results.
Dale Fraser

Dale Fraser wrote on 02/12/10 9:38 PM

I wouldnt write this query, it would seriously be unreadable to most developers and SQL people.

If I saw that code before this post, I would honestly have no idea what it did.

I've never had any performance issues using or.
John Mason

John Mason wrote on 02/12/10 10:48 PM

@Dale - there is far more complex (and possibly unreadable) sql statements. This one is very simple and readable in my opinion. It's completely possible you will not have an issue doing the 'or' in this case, but it is very inefficient for the underlining database engine. Like most performance issues it really depends on load and size. SQL is a very different beast from normal programming - just one of those things.
Peter Boughton

Peter Boughton wrote on 02/13/10 8:48 AM

Have you tested Oracle and proved that?

I don't want this to come across the wrong way, but you're kinda going "OMG DON'T USE OR!!!" but combined with vague get-outs that clearly confuses people, rather than communicating your intended point.

Above all else, developers should write code that is concise and readable.

*IF* there is a measurable performance issue for a specific query, they should know how to analyse and improve it.

That applies to an SQL statement as much as any code/language, but the point (I think) you want to make here is that relational databases work in a different way to other programming engines, and that an operation which is 'cheap' in CF or JS might not be in an RDBMS (and vica versa).

If so, my suggestion would be that - instead of doing it as a challenge - this is a topic which would be better covered as a more general article discussing those differences, and (most importantly) how a developer might check for and analyse problems like this (i.e. how to generate and understand execution plans).

Anyway, hopefully that's all received constructively, and gives helpful thoughts. :)
John Mason

John Mason wrote on 02/13/10 1:27 PM

To everyone - thanks for the feedback and questions. One of the points here is to have dialogue. So thanks.

@Peter, no I said I bet it would be a similar problem for Oracle. You're free to test it on different engines if you like.

I just did a simple test again this morning with a couple of tables in SQL Server. They're around 1/2 million rows each. Here are the stats from the IO statistics report for the two..

With the 'or'
Logical Reads
23746
22496

With the 'union'
Logical Reads
11248
11873

The 'union' saved me in I/O with a lot less rows to scan to get the data I needed from the tables. Please carefully read what I'm saying - I'm not saying 'omg stop using ors'. This performance issue may not be a problem for you in particular. Again, load and data size are big factors when it comes to SQL. Also the indexing strategy of the tables will have a big impact on this as well.

Probably the best policy to take from this is to try different SQL strategies and compare. For example you can get the IO stats with the "set statistics IO on" command and also view the "actual execution plan" via the tab on SSMS. Be careful of the "estimated" numbers though - the sql server reporting is far from perfect and can be plain wrong.

Yes, SQL as a language is completely different, which is one of the main points to this. It's simply a statement that the engine uses to match a set of data to. That underlining operation is completely different from say a processing language like Java or CF. Probably one of the reasons DBAs hate seeing SQL written by us programmers ;)

There are plenty of blog posts and articles explaining execution plans. I really don't think I could add anything new there.

This series is simply various gotcha's in programming. If there are questions like this, I felt it best to work through them in the comments. I may add a blog post here or there depending on things.
Larry

Larry wrote on 06/18/13 4:53 PM

John Mason is absolutely correct with his example using a "union" instead of an "or".

Once you get past a few hundred thousand records, the performance is vastly improved with a union.

Actually one of the few times a union improves performance.

Write your comment



(it will not be displayed)