So... So you think you can sort?
The perils of ORDER BY and NLS_SORT
Like with many articles, this one found its origin in an application bug. A developer came to me saying that the ORDER BY
in his SQL
gave incorrect results. After some investigation, this article was the result.
ORDER BY
is so easy to add at the end of a query. And I expect you also know about ASC
end DESC
. Good, but that is not all there is to say about sorting (in Oracle).
Do you know that there is a session parameter that influences the result of an ORDER BY
enormously? Since it is a session parameter, you as a developer can not know what value the user will have set it to (knowingly, but usually quite unknowingly). Therefore you, as a developer, should set this parameter whenever you write code that depends on the sort order of a query. Unless you want two distinct runs of the same code to yield different results. Or unless you want the code that behaves perfectly in your environment to crash on the environment of your colleague / the tester / the code reviewer / production / the end user / ...
The parameter in question is NLS_SORT
. Note that, according to the Oracle documentation, NLS_SORT
"is derived from NLS_LANGUAGE
". What this means is that whenever you change NLS_LANGUAGE
, you will also have changed NLS_SORT
. Therefore make sure to set NLS_LANGUAGE
first and then set NLS_SORT
:
alter session set NLS_LANGUAGE='ENGLISH';
alter session set NLS_SORT='BINARY';
So what is it all about?
Sorting per letter.
Let's sort a table with 4 records, each of 1 character: dot, underscore, lowercase a, uppercase A.
Let's try with 12 settings of NLS_SORT
. Since some variants turn out to give the same result, they only get one column in the output:
UNICODE_BINARY
gives the same result asBINARY
.
I guess that might be different on databases that are not in a Unicode character set and where the text expression being sorted contains Unicode values.The X-versions of the languages (
DUTCH
vsXDUTCH
/FRENCH
andXFRENCH
/GERMAN
andXGERMAN
/WEST_EUROPEAN
vsXWESTEUROPEAN
) give the same result
(UNICODE_)BINARY | EEC_EURO | (X)DUTCH | (X)FRENCH | (X)GERMAN | LATIN | (X)WEST_EUOPEAN | PUNCTUATION | |
1 | . | . | . | . | . | . | . | . |
2 | A | _ | _ | _ | _ | _ | _ | _ |
3 | _ | A | A | A | a | A | A | A |
4 | a | a | a | a | A | a | a | a |
Did you know that in German lowercase comes before uppercase?
Do you think this is all you need to know about sorting? Well, look what happens when sorting longer strings... (below)
Did you know that trying to set NLS_SORT
to 'AMERICAN'
or 'ENGLISH'
gives an error? I guess that it is because of the A
in ASCII
.
Sorting strings with punctuation (e.g. "codes")
Here I sort eight 3-letter codes composed of A, a, B, b and the underscore.
(UNICODE_)BINARY | EEC_EURO | (X)DUTCH | (X)FRENCH | (X)GERMAN | LATIN | (X)WEST_EUROPEAN | PUNCTUATION | |
1 | AAA | A_A | A_A | A_A | a_a | A_A | A_A | A_A |
2 | AAB | A_B | a_a | a_a | A_A | a_a | a_a | a_a |
3 | A_A | AAA | AAA | AAA | aaa | AAA | AAA | A_B |
4 | A_B | AAB | aaa | aaa | AAA | aaa | aaa | a_b |
5 | a_a | a_a | AAB | AAB | aab | AAB | AAB | AAA |
6 | a_b | a_b | aab | aab | AAB | aab | aab | aaa |
7 | aaa | aaa | A_B | A_B | a_b | A_B | A_B | AAB |
8 | aab | aab | a_b | a_b | A_B | a_b | a_b | aab |
Note how in DUTCH
, XDUTCH
, FRENCH
, XFRENCH
, LATIN
, WEST_EUROPEAN
and XWEST_EUROPEAN
there is an _
below an a
, while the test above shows _
being sorted above the a
! And at the same time there is also an _
above an A
, meaning that there are A
and a
between two _
?!?!!??!?
What is going on? Well, My first thought was: The _
are removed from the string before sorting. Let's visualise that by removing the _
from the sorted list (DUTCH
):
DUTCH | |
1 | AA |
2 | aa |
3 | AAA |
4 | aaa |
5 | AAB |
6 | aab |
7 | AB |
8 | ab |
Clearly, even that is not fully correct: Why is AAA
below aa
when A
is above a
in our first test?
My second guess is that not only are the _
removed, the (first) ordering is done without looking at case (so all letters uppercased / or all letters lowercased).
And then, if there are strings with equal sort order, they are compared again using/including case.
It is normal/acceptable that a short string (like AA
) comes before a longer string that starts with that short string (like AAA
) So both aa
and AA
end up before aaa
and AAA
: aa
and AA
with order 1; aaa
and AAA
as order 2.
In the second step AA
gets 1.1 and aa
1.2, while AAA
gets 2.1 and aaa
2.2.
GERMAN
and XGERMAN
also show letters between underscores: the same thing is happening as with (X)DUTCH
, (X)FRENCH
, LATIN
and (X)WEST_EUROPEAN
.
The only difference with those is that in German lowercase comes before uppercase (as seen before).
Let's add in the .
, resulting in twelve strings (and let's remove half of the seemingly equal sorting schemes)
(UNICODE)BINARY | EEC_EURO | DUTCH | GERMAN | PUNCTUATION | |
1 | A.A | A.A | A.A | a.a | A.A |
2 | A.B | A.B | A_A | a_a | a.a |
3 | AAA | A_A | a.a | A.A | A.B |
4 | AAB | A_B | a_a | A_A | a.b |
5 | A_A | AAA | AAA | aaa | A_A |
6 | A_B | AAB | aaa | AAA | a_a |
7 | a.a | a.a | AAB | aab | A_B |
8 | a.b | a.b | aab | AAB | a_b |
9 | a_a | a_a | A.B | a.b | AAA |
10 | a_b | a_b | A_B | a_b | aaa |
11 | aaa | aaa | a.b | A.B | AAB |
12 | aab | aab | a_b | A_B | aab |
What you should learn from this:
Sorting is not as straightforward as it seems!
If you have trouble sorting (for example) names because of uppercase/lowercase, try one of the language-specific sorts: They tend to ignore case until sorting equal strings, giving you exactly what you probably want.
If you need to sort any kinds of codes (which may include punctuation like
_
and.
), stay away from the language-specific sorts: they give unexpected results.BINARY
,EEC_EURO
andPUNCTUATION
are all quite different but "logical" nevertheless.