SQL WHERE List Matches Any or All

I saw a cool post recently from Jon Galloway called "Passing lists to SQL Server 2005 with XML Parameters".  This is a pattern I've used several times while building the new version of Channel 9.  If you'd like to learn how to pass in lists to stored procedures, check out Jon's post.
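For readers who want a feel for the pattern before reading Jon's post, here is a minimal sketch of turning an XML parameter into a table of IDs. The XML shape (`<ids><id>…</id></ids>`), the variable name @ForumIDs, and the table variable @ForumList are assumptions made for illustration; they are not taken from Jon's post or from our actual stored procedures. The `nodes()` and `value()` methods are the built-in xml data type methods in SQL Server 2005.

```sql
-- Hypothetical input: in a stored procedure this would be a parameter of type xml.
DECLARE @ForumIDs xml
SET @ForumIDs = '<ids><id>12</id><id>34</id></ids>'

-- Table variable that will hold one row per ID in the list (name is an assumption).
DECLARE @ForumList TABLE (ForumID bigint)

-- nodes() yields one row per <id> element; value() reads its text as a bigint.
INSERT INTO @ForumList (ForumID)
SELECT n.c.value('.', 'bigint')
FROM @ForumIDs.nodes('/ids/id') AS n(c)

-- The list is now queryable like any other table.
SELECT ForumID FROM @ForumList
```

Once the IDs are in a table, the rest of the query never needs to know how many items were in the list.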

One of the places I've used it is to search our database for all entries from two of our forums.  For this example, we'll say Techoff and Sandbox.  Once you have a temp table holding the two forum IDs (forums in our system are actually just tags, too), you can use a WHERE ... IN clause like the following:

SELECT e.* FROM Entry e INNER JOIN EntryForum ef ON e.EntryID = ef.EntryID WHERE ef.ForumID IN (SELECT ForumID FROM ForumList)

Note: This is all pseudo code to represent the basics of how we do this.  This is not the exact code.

This selects all the entries (or posts) from our database that are from the list of forums I passed into the ForumList temp table.  WHERE IN specifies that all rows be returned that match ANY of the records in my temp table.  The following statement is equivalent and returns exactly the same rows.

SELECT e.* FROM Entry e INNER JOIN EntryForum ef ON e.EntryID = ef.EntryID WHERE ef.ForumID = @ForumID1 OR ef.ForumID = @ForumID2

Note: In the above example, @ForumID1 and @ForumID2 have the values that were stored in the ForumList temp table in the example above that one.

This works pretty well.  The other thing we do with lists, though, is select only the entries that match ALL (not ANY) of the items in the list we pass in to the stored procedure.  The typical example is searching by multiple tags.  So for instance, you want to search our site for all content that is tagged with both WPF AND WCF.  The previous example won't work, because it returns entries that have either tag.  Conceptually, it would need to be something like this…

SELECT e.* FROM Entry e INNER JOIN EntryTag wpf ON e.EntryID = wpf.EntryID INNER JOIN EntryTag wcf ON e.EntryID = wcf.EntryID WHERE wpf.TagID = @TagIDWPF AND wcf.TagID = @TagIDWCF

Using WHERE IN, we can't do this (at least I couldn't find anything in the docs or by searching the internet to say otherwise).  Duncan helped figure out the idea for how to do this, and here is the implementation I came up with:

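DECLARE @TagCount int
DECLARE @Tags TABLE (TagID bigint)       -- filled from the XML parameter (not shown)
DECLARE @Entries TABLE (EntryID bigint)  -- receives the entries that match ALL tags

-- Count distinct tags so a duplicated ID in the list can't make a full match impossible
SELECT @TagCount = COUNT(DISTINCT TagID) FROM @Tags;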
WITH Entries(EntryID, MatchCount) AS
(
    SELECT
        e.EntryID,
        COUNT(DISTINCT t.TagID) AS MatchCount
    FROM
        Entry e
            INNER JOIN
        EntryTag et
            ON
                e.EntryID = et.EntryID
            INNER JOIN
        @Tags t
            ON
                et.TagID = t.TagID
    GROUP BY
        e.EntryID
)
INSERT INTO @Entries (EntryID) SELECT EntryID FROM Entries WHERE MatchCount = @TagCount

What is this code doing?  First, it counts the tags that were passed in (again, from XML turned into a table variable) and stores the count in @TagCount.  Then it creates a Common Table Expression (CTE) around a query that returns every entry matching at least one tag in the list, along with how many of those tags it matches.  If you're not familiar with CTEs, they're basically a named wrapper around a query so you can write another query against it.  Kind of like a subquery, but much more organized.  Recursive CTEs are particularly powerful and cool, but that's another blog post.

After the CTE, the INSERT fills another table variable with everything from the CTE where MatchCount equals the number of tags originally passed in.  An entry can only reach that count if it has ALL of the tags associated with it, so @Entries ends up holding only the entries that match every tag in the list I passed in (stored in @Tags).  I hope this helps someone.  🙂
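To see the counting idea on real rows, here is a small self-contained sketch you can paste into a query window.  The sample data and the table variable @EntryTag are made up for illustration (in our schema EntryTag is a real table), and it uses GROUP BY with HAVING instead of a CTE, which is just one compact way to express the same comparison, not necessarily the best one.

```sql
-- Made-up sample data: tag 100 stands in for WPF, tag 200 for WCF.
DECLARE @EntryTag TABLE (EntryID bigint, TagID bigint)
INSERT INTO @EntryTag (EntryID, TagID)
SELECT 1, 100 UNION ALL   -- entry 1 has WPF
SELECT 1, 200 UNION ALL   -- entry 1 has WCF
SELECT 2, 100 UNION ALL   -- entry 2 has WPF only
SELECT 3, 200 UNION ALL   -- entry 3 has WCF
SELECT 3, 300             -- entry 3 also has an unrelated tag

-- The list that was passed in: WPF and WCF.
DECLARE @Tags TABLE (TagID bigint)
INSERT INTO @Tags (TagID)
SELECT 100 UNION ALL
SELECT 200

DECLARE @TagCount int
SELECT @TagCount = COUNT(DISTINCT TagID) FROM @Tags

-- Keep only entries whose number of matched tags equals the size of the list.
SELECT et.EntryID
FROM @EntryTag et
    INNER JOIN @Tags t ON et.TagID = t.TagID
GROUP BY et.EntryID
HAVING COUNT(DISTINCT t.TagID) = @TagCount
-- Returns EntryID 1: the only entry carrying both tags.
```

Entries 2 and 3 each match one tag from the list, so their count is 1 and they are filtered out.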

UPDATE: Check out the first comment from Bryan.  He points out a slightly better implementation.  Thanks, Bryan!

ROW_NUMBER() OVER Not Fast Enough With Large Result Set

So I'm working on improving the performance of the next version of Channel 9.  For those of you not familiar, our team created a new platform for our department to build community sites (blogs, forums, videos, tagging, etc).  You can see an example of the platform running on 10.  Until now there hasn't been a whole lot of data in the 10 database, so we've been able to get away with murder from a performance standpoint.  Now that I've imported all the data from Community Server (Channel 9) to 10 (EvNet Platform), we've been finding that our current code doesn't perform very well against 250,000 rows.

The architecture of our platform is pretty simple.  There's an Entry table that houses Entries, Threads & Comments and the relationships between them.  When you're viewing a blog, you're looking at this table.  When you're looking at a forum, you're looking at this table.  We differentiate everything in our system by tags.  There's one for each forum, each blog, and content tags to describe what's in each entry.

Currently, we've been doing paging with ROW_NUMBER() OVER.  It's a new feature of SQL Server 2005 and is very simple to use.  Unfortunately, it didn't perform very well for us with a lot of rows (250,000, for example).  I did some searching and came across this gem of an article.  It uses an interesting SET ROWCOUNT trick: first set the row count to the position of the first row on the page you want and run the query to capture that row's sort value, then set the row count to the page size and run the query again for rows whose values are greater than or equal to that starting value.  And man, is it snappy.  Do check it out if you're having trouble with the performance of ROW_NUMBER() OVER.
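Here is a rough sketch of the two paging approaches side by side, so the difference is easier to see.  This is not our production code or the article's exact code: the @Entry table variable, its ten sample rows, and paging by EntryID are all assumptions for illustration, and the SET ROWCOUNT version only works like this when the sort column is unique.  It also leans on the variable ending up with the last value read in ORDER BY order, which is how the trick is normally written, so test it against your own data.

```sql
-- Made-up sample data: ten entries with IDs 10, 20, ... 100.
DECLARE @Entry TABLE (EntryID bigint PRIMARY KEY, Title nvarchar(50))
DECLARE @i int
SET @i = 1
WHILE @i <= 10
BEGIN
    INSERT INTO @Entry (EntryID, Title)
    VALUES (@i * 10, 'Entry ' + CAST(@i AS varchar(10)))
    SET @i = @i + 1
END

DECLARE @PageNumber int, @PageSize int, @FirstRow int
SET @PageNumber = 2
SET @PageSize = 3
SET @FirstRow = (@PageNumber - 1) * @PageSize + 1   -- position of the page's first row

-- Approach A: ROW_NUMBER() OVER numbers every row, then filters to the page.
;WITH Numbered AS
(
    SELECT EntryID, Title, ROW_NUMBER() OVER (ORDER BY EntryID) AS RowNum
    FROM @Entry
)
SELECT EntryID, Title
FROM Numbered
WHERE RowNum BETWEEN @FirstRow AND @FirstRow + @PageSize - 1
ORDER BY RowNum

-- Approach B: the SET ROWCOUNT trick.
-- Step 1: read only up to the page's first row; the variable keeps the last value read.
DECLARE @FirstEntryID bigint
SET ROWCOUNT @FirstRow
SELECT @FirstEntryID = EntryID FROM @Entry ORDER BY EntryID

-- Step 2: read one page's worth of rows starting at that value.
SET ROWCOUNT @PageSize
SELECT EntryID, Title
FROM @Entry
WHERE EntryID >= @FirstEntryID
ORDER BY EntryID

-- Always reset, or later statements in the session stay limited.
SET ROWCOUNT 0
```

Both queries are meant to return the same page (rows 4 through 6 of the sorted set).  The second approach has a chance of being faster because step 2 can seek straight to the starting value on an index instead of numbering every row that comes before the page.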

Oh yeah, I forgot to mention how much of an improvement this change made.  Before the change, queries were taking about 8 seconds on average.  After the change, they take less than 1 second.  Depending on whether SQL Server already has the data cached, it's pretty much instantaneous.