SQL Best Practice On Joining Tables?

Fallen Soul
Solved by Franck


Hi, 

I am reasonably new to SQL scripting in both Transact-SQL and MySQL and would like to know the best practice for writing table joins.

One of the guys at work says you put all the join conditions and filters in the JOIN statement, but I have a mate who says you put them in the WHERE clause. Which is best practice?


Example:
 

SELECT
    t1.*
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 AS t2
    ON t1.id = t2.table1_id
WHERE t2.name LIKE 'test1%'
  AND t2.remote = 'y'

 

Or would you write it like this:

SELECT
    t1.*
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 AS t2
    ON t1.id = t2.table1_id
   AND t2.name LIKE 'test1%'
   AND t2.remote = 'y'

 


Use the WHERE clause. The trick with SQL is to write things in a predictable manner so that the generated query plan is sensible. The issue with putting filters on joins is that the server handles those conditions as part of the join step in the query plan, whereas WHERE filters can be applied as a distinct step, often before the join even happens, resulting in less data being read.
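A quick way to check this yourself (just a sketch using the tables from the example above, and assuming SQL Server, where SET SHOWPLAN_TEXT prints the estimated plan instead of running the query):

-- Show the estimated plan as text instead of executing the queries
SET SHOWPLAN_TEXT ON;
GO

-- Version 1: filter in the WHERE clause
SELECT t1.*
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 AS t2
    ON t1.id = t2.table1_id
WHERE t2.name LIKE 'test1%'
  AND t2.remote = 'y';
GO

-- Version 2: filter in the join's ON clause
SELECT t1.*
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 AS t2
    ON t1.id = t2.table1_id
   AND t2.name LIKE 'test1%'
   AND t2.remote = 'y';
GO

SET SHOWPLAN_TEXT OFF;
GO

For a plain INNER JOIN the optimizer will often produce the same plan either way, so comparing the two outputs is the quickest way to see whether the placement actually matters for your query.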



For MySQL I can't tell whether it's that critical, since I rarely use it; I need a much faster and more robust database. But on MSSQL, for example, the WHERE clause can be read before anything executes, which lets MSSQL use its estimates to find records much faster. When the condition is embedded in the join, it has to be evaluated row by row against the row store, so the filter runs many, many times. With a low record count that's a non-issue, but when you join billions of records to billions of records the performance impact is huge.

 

BUT a filter in the join can be applied right away and leave more room in RAM, whereas filtering in the WHERE clause forces larger chunks to be loaded at a time, which uses more RAM. The filter is still applied to the result set row by row, but each row carries the complete set of fields, possibly full of NULLs, which are far heavier in memory than, say, a simple bit that was already filtered out. With a bad table design you can use more than double the RAM of what the query result actually contains when you filter in the WHERE clause.

 

So if you are RAM limited, try to put as many conditions as possible inside the join itself. You will have a smaller RAM footprint, which lets more user queries get through, but they will be slower.

If RAM is not an issue, filter in the WHERE clause instead, which makes the query faster but takes a bigger RAM footprint.

 

Now this is just a general guideline, not a cure-all. Each query, each server, each database engine is different. At the end of the day you should test your query with an evaluator such as the query plan or query statistics, try real-life scenarios, and adapt the query to the REAL use case. For example, on MSSQL if you run your query back to back, everything gets loaded into memory and suddenly a 20-second query drops to 2 seconds because it no longer reads the table from disk but from the copy cached in RAM, which is much faster. That 2 seconds wouldn't be a good number to assume if your query runs once every hour, but it would be if it runs once every 5 seconds.
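If you want to reproduce that warm-cache effect yourself, a rough sketch on MSSQL would look something like this (test servers only: DBCC DROPCLEANBUFFERS flushes the buffer pool for the whole instance and needs elevated permissions):

-- Report I/O and elapsed time for every statement
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Force a cold cache so the first run reads from disk
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

-- First run: pages come from disk (the "20 second" case)
SELECT t1.*
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 AS t2
    ON t1.id = t2.table1_id
WHERE t2.name LIKE 'test1%'
  AND t2.remote = 'y';

-- Second run without flushing: pages come from the buffer pool in RAM
-- (the "2 second" case), which is not representative of a query that
-- only runs once an hour
SELECT t1.*
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 AS t2
    ON t1.id = t2.table1_id
WHERE t2.name LIKE 'test1%'
  AND t2.remote = 'y';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;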


Your first query is correct and will give you the results you are looking for. Joins are usually the quickest compared to subqueries or CTEs, so I think speed is the least of your concerns here. If you are working with large tables you can create temporary tables that are filtered on some parameters, and then perform joins on those. A CTE is great if you will be using it more than once. I have experience with MS SQL and serverless AWS databases, and the speed also comes down to what you are using. You don't want to be running a query over and over if speed matters, so if you don't need up-to-date data you can run your query with the joins, save the result in a temp table, and then use that temp table.
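As a sketch of that temp-table idea with the tables from the original post (the #filtered_table2 name and the column list are just assumptions for illustration):

-- Filter the second table once and keep only the columns you need
SELECT t2.table1_id, t2.name, t2.remote
INTO #filtered_table2
FROM dbo.table2 AS t2
WHERE t2.name LIKE 'test1%'
  AND t2.remote = 'y';

-- Reuse the small, pre-filtered temp table in as many joins as you like
SELECT t1.*
FROM dbo.table1 AS t1
INNER JOIN #filtered_table2 AS f
    ON t1.id = f.table1_id;

-- The same filter as a CTE, handy when it is referenced more than once
-- within a single statement
WITH filtered AS (
    SELECT table1_id
    FROM dbo.table2
    WHERE name LIKE 'test1%'
      AND remote = 'y'
)
SELECT t1.*
FROM dbo.table1 AS t1
INNER JOIN filtered AS f
    ON t1.id = f.table1_id;

DROP TABLE #filtered_table2;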

