Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I have a database where one table "Common" is referenced by several other tables. I wanted to see what records in the Common table were orphaned (i.e., had no references from any of the other tables).
I ran this query:
select *
from Common
where common_id not in (select common_id from Table1)
and common_id not in (select common_id from Table2)
I know that there are orphaned records, but no records were returned. Why not?
Update:
These articles in my blog describe the differences between the methods in more detail:
NOT IN
vs. NOT EXISTS
vs. LEFT JOIN / IS NULL
: SQL Server
NOT IN
vs. NOT EXISTS
vs. LEFT JOIN / IS NULL
: PostgreSQL
NOT IN
vs. NOT EXISTS
vs. LEFT JOIN / IS NULL
: Oracle
NOT IN
vs. NOT EXISTS
vs. LEFT JOIN / IS NULL
: MySQL
When table1.common_id
is not nullable, all these queries are semantically the same.
When it is nullable, NOT IN
is different, since IN
(and, therefore, NOT IN
) return NULL
when a value does not match anything in a list containing a NULL
.
This may be confusing but may become more obvious if we recall the alternate syntax for this:
common_id = ANY
SELECT common_id
FROM table1 t1
The result of this condition is a boolean product of all comparisons within the list. Of course, a single NULL
value yields the NULL
result which renders the whole result NULL
too.
We never cannot say definitely that common_id
is not equal to anything from this list, since at least one of the values is NULL
.
Suppose we have these data:
common
table1
LEFT JOIN / IS NULL
and NOT EXISTS
will return 3
, NOT IN
will return nothing (since it will always evaluate to either FALSE
or NULL
).
In MySQL
, in case on non-nullable column, LEFT JOIN / IS NULL
and NOT IN
are a little bit (several percent) more efficient than NOT EXISTS
. If the column is nullable, NOT EXISTS
is the most efficient (again, not much).
In Oracle
, all three queries yield same plans (an ANTI JOIN
).
In SQL Server
, NOT IN
/ NOT EXISTS
are more efficient, since LEFT JOIN / IS NULL
cannot be optimized to an ANTI JOIN
by its optimizer.
In PostgreSQL
, LEFT JOIN / IS NULL
and NOT EXISTS
are more efficient than NOT IN
, sine they are optimized to an Anti Join
, while NOT IN
uses hashed subplan
(or even a plain subplan
if the subquery is too large to hash)
–
–
–
–
If you want the world to be a two-valued boolean place, you must prevent the null (third value) case yourself.
Don't write IN clauses that allow nulls in the list side. Filter them out!
common_id not in
select common_id from Table1
where common_id is not null
–
The short answer:
There is a NULL within the collection returned by your subquery. You can solve the problem by removing that NULL value before finishing the subquery or to use NOT EXISTS
predicate instead of NOT IN
, as it does it implicitly.
The long answer (From T-SQL Fundamentals, Third edition, by Itzik Ben-Gan)
This is an example: Imagine there is a order with a NULL orderid inside Sales.Orders table, so the subquery returns some integers, and a NULL value.
SELECT custid, companyname
FROM Sales.Customers
WHERE custid NOT IN(SELECT O.custid
FROM Sales.Orders AS O);
The explanation on why the query from above returns an empty set:
Obviously, the culprit here is the NULL
customer ID you added to the Orders table. The NULL
is one of the elements returned by the subquery.
Let’s start with the part that does behave like you expect it to. The IN predicate returns TRUE for a customer who placed orders (for example, customer 85), because such a customer is returned by the subquery. The NOT operator negates the IN
predicate; hence, the NOT TRUE
becomes FALSE
, and the customer is discarded. The expected behavior here is that if a customer ID is known to appear in the Orders table, you know with certainty that you do not want to return it.
However (take a deep breath), if a customer ID from Customers doesn’t appear in the set of non-NULL customer IDs in Orders, and there’s also a NULL
customer ID in Orders, you can’t tell with certainty that the customer is there—and similarly you can’t tell with certainty that it’s not there. Confused? I hope I can clarify this explanation with an example.
The IN
predicate returns UNKNOWN
for a customer such as 22 that does not appear in the set of known customer IDs in Orders. That’s because when you compare it with known customer IDs you get FALSE, and when you compare it with a NULL you get UNKNOWN
. FALSE
OR UNKNOWN
yields UNKNOWN
. Consider the expression 22 NOT IN (1, 2, <other non-22 values>, NULL)
. This expression can be rephrased as NOT 22 IN (1, 2, …, NULL)
. You can expand this expression to NOT (22 = 1 OR 22 = 2 OR … OR 22 = NULL)
. Evaluate each individual expression in the parentheses to its truth value and you get NOT (FALSE OR FALSE OR … OR UNKNOWN)
, which translates to NOT UNKNOWN
, which evaluates to UNKNOWN
.
The logical meaning of UNKNOWN
here, before you apply the NOT
operator, is that it can’t be
determined whether the customer ID appears in the set, because the NULL
could represent that
customer ID. The tricky part here is that negating the UNKNOWN
with the NOT
operator still yields
UNKNOWN
. This means that in a case where it is unknown whether a customer ID appears in a set, it
is also unknown whether it doesn’t appear in the set. Remember that a query filter discards rows that
get UNKNOWN
in the result of the predicate.
In short, when you use the NOT IN
predicate against a subquery that returns at least one NULL
,
the query always returns an empty set. So, what practices can you follow to avoid such trouble? First,
when a column is not supposed to allow NULLs
, be sure to define it as NOT NULL
. Second, in all
queries you write, you should consider NULLs and the three-valued logic. Think explicitly about
whether the query might process NULLs, and if so, whether SQL’s treatment of NULLs is correct for
you. When it isn’t, you need to intervene. For example, our query returns an empty set because of the
comparison with the NULL
. If you want to check whether a customer ID appears only in the set of
known values, you should exclude the NULLs—either explicitly or implicitly. To exclude them explicitly,
add the predicate O.custid
IS NOT NULL to the subquery, like this:
SELECT custid, companyname
FROM Sales.Customers
WHERE custid NOT IN(SELECT O.custid
FROM Sales.Orders AS O
WHERE O.custid IS NOT NULL);
You can also exclude the NULLs implicitly by using the NOT EXISTS
predicate instead of NOT IN
,
like this:
SELECT custid, companyname
FROM Sales.Customers AS C
WHERE NOT EXISTS
(SELECT *
FROM Sales.Orders AS O
WHERE O.custid = C.custid);
Recall that unlike IN
, EXISTS
uses two-valued predicate logic. EXISTS
always returns TRUE
or
FALSE
and never UNKNOWN
. When the subquery stumbles into a NULL
in O.custid
, the expression
evaluates to UNKNOWN
and the row is filtered out. As far as the EXISTS
predicate is concerned, the
NULL
cases are eliminated naturally, as though they weren’t there. So EXISTS
ends up handling only
known customer IDs. Therefore, it’s safer to use NOT EXISTS
than NOT IN
.
The information above is taken from Chapter 4 - Subqueries, T-SQL Fundamentals, Third edition
select *
from Common
where common_id not in (select common_id from Table1 where common_id is not null)
and common_id not in (select common_id from Table2 where common_id is not null)
–
select *
from Common c
where not exists (select t1.commonid from table1 t1 where t1.commonid = c.commonid)
and not exists (select t2.commonid from table2 t2 where t2.commonid = c.commonid)
select c.commonID, t1.commonID, t2.commonID
from Common c
left outer join Table1 t1 on t1.commonID = c.commonID
left outer join Table2 t2 on t2.commonID = c.commonID
where t1.commonID is null
and t2.commonID is null
I ran a few tests and here were my results w.r.t. @patmortech's answer and @rexem's comments.
If either Table1 or Table2 is not indexed on commonID, you get a table scan but @patmortech's query is still twice as fast (for a 100K row master table).
If neither are indexed on commonID, you get two table scans and the difference is negligible.
If both are indexed on commonID, the "not exists" query runs in 1/3 the time.
–
–
–
FROM Common T
LEFT JOIN Table1 T1 ON T.common_id = T1.common_id
LEFT JOIN Table2 T2 ON T.common_id = T2.common_id
WHERE T1.common_id IS NULL
AND T2.common_id IS NULL
–
We want the row in Common to return, because it doesn't exist in any of the other tables. However, the null throws in a monkey wrench.
With those values, the query is equivalent to:
select *
from Common
where 1 not in (2)
and 1 not in (3, null)
That is equivalent to:
select *
from Common
where not (1=2)
and not (1=3 or 1=null)
This is where the problem starts. When comparing with a null, the answer is unknown. So the query reduces to
select *
from Common
where not (false)
and not (false or unkown)
false or unknown is unknown:
select *
from Common
where true
and not (unknown)
true and not unkown is also unkown:
select *
from Common
where unknown
The where condition does not return records where the result is unkown, so we get no records back.
One way to deal with this is to use the exists operator rather than in. Exists never returns unkown because it operates on rows rather than columns. (A row either exists or it doesn't; none of this null ambiguity at the row level!)
select *
from Common
where not exists (select common_id from Table1 where common_id = Common.common_id)
and not exists (select common_id from Table2 where common_id = Common.common_id)
Also you can visit the following link to know Anti join
select department_name,department_id from hr.departments dep
where not exists
(select 1 from hr.employees emp
where emp.department_id=dep.department_id
order by dep.department_name;
DEPARTMENT_NAME DEPARTMENT_ID
Benefits 160
Construction 180
Contracting 190
.......
But if we use NOT IN
in that case we do not get any data.
select Department_name,department_id from hr.departments dep
where department_id not in (select department_id from hr.employees );
no data found
This is happening as (select department_id from hr.employees
) is returning a null value and the entire query is evaluated as false. We can see it if we change the SQL slightly like below and handle null values with NVL function.
select Department_name,department_id from hr.departments dep
where department_id not in (select NVL(department_id,0) from hr.employees )
Now we are getting data:
DEPARTMENT_NAME DEPARTMENT_ID
Treasury 120
Corporate Tax 130
Control And Credit 140
Shareholder Services 150
Benefits 160
Again we are getting data as we have handled the null value with NVL function.
–
select *,
(select COUNT(ID) from ProductMaster where ProductMaster.CatID = CategoryMaster.ID) as coun
from CategoryMaster
Your statement:
where common_id not in (select common_id from Table1)
and common_id not in (select common_id from Table2)
looks for common_id
not in Table1
AND not in Table2
.
What if it's not in Table1 but in Table2 - although it violates your "orphan" definition.
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
SQL query seems to work for 'AND T1.email_address_ IN (subquery)', but returns 0 rows for 'AND T1.email_address_ NOT IN (subquery)'
See more linked questions