So... So you think you can sort?
The perils of ORDER BY and NLS_SORT
Like with many articles, this one found its origin in an application bug. A developer came to me saying that the ORDER BY in his SQL gave incorrect results. After some investigation, this article was the result.
ORDER BY is so easy to add at the end of a query. And I expect you also know about ASC end DESC. Good, but that is not all there is to say about sorting (in Oracle).
Do you know that there is a session parameter that influences the result of an ORDER BY enormously? Since it is a session parameter, you as a developer can not know what value the user will have set it to (knowingly, but usually quite unknowingly). Therefore you, as a developer, should set this parameter whenever you write code that depends on the sort order of a query. Unless you want two distinct runs of the same code to yield different results. Or unless you want the code that behaves perfectly in your environment to crash on the environment of your colleague / the tester / the code reviewer / production / the end user / ...
The parameter in question is NLS_SORT. Note that, according to the Oracle documentation, NLS_SORT "is derived from NLS_LANGUAGE". What this means is that whenever you change NLS_LANGUAGE, you will also have changed NLS_SORT. Therefore make sure to set NLS_LANGUAGE first and then set NLS_SORT:
alter session set NLS_LANGUAGE='ENGLISH';
alter session set NLS_SORT='BINARY';
So what is it all about?
Sorting per letter.
Let's sort a table with 4 records, each of 1 character: dot, underscore, lowercase a, uppercase A.
Let's try with 12 settings of NLS_SORT. Since some variants turn out to give the same result, they only get one column in the output:
UNICODE_BINARYgives the same result asBINARY.
I guess that might be different on databases that are not in a Unicode character set and where the text expression being sorted contains Unicode values.The X-versions of the languages (
DUTCHvsXDUTCH/FRENCHandXFRENCH/GERMANandXGERMAN/WEST_EUROPEANvsXWESTEUROPEAN) give the same result
(UNICODE_)BINARY | EEC_EURO | (X)DUTCH | (X)FRENCH | (X)GERMAN | LATIN | (X)WEST_EUOPEAN | PUNCTUATION | |
| 1 | . | . | . | . | . | . | . | . |
| 2 | A | _ | _ | _ | _ | _ | _ | _ |
| 3 | _ | A | A | A | a | A | A | A |
| 4 | a | a | a | a | A | a | a | a |
Did you know that in German lowercase comes before uppercase?
Do you think this is all you need to know about sorting? Well, look what happens when sorting longer strings... (below)
Did you know that trying to set NLS_SORT to 'AMERICAN' or 'ENGLISH' gives an error? I guess that it is because of the A in ASCII.
Sorting strings with punctuation (e.g. "codes")
Here I sort eight 3-letter codes composed of A, a, B, b and the underscore.
(UNICODE_)BINARY | EEC_EURO | (X)DUTCH | (X)FRENCH | (X)GERMAN | LATIN | (X)WEST_EUROPEAN | PUNCTUATION | |
| 1 | AAA | A_A | A_A | A_A | a_a | A_A | A_A | A_A |
| 2 | AAB | A_B | a_a | a_a | A_A | a_a | a_a | a_a |
| 3 | A_A | AAA | AAA | AAA | aaa | AAA | AAA | A_B |
| 4 | A_B | AAB | aaa | aaa | AAA | aaa | aaa | a_b |
| 5 | a_a | a_a | AAB | AAB | aab | AAB | AAB | AAA |
| 6 | a_b | a_b | aab | aab | AAB | aab | aab | aaa |
| 7 | aaa | aaa | A_B | A_B | a_b | A_B | A_B | AAB |
| 8 | aab | aab | a_b | a_b | A_B | a_b | a_b | aab |
Note how in DUTCH, XDUTCH, FRENCH, XFRENCH, LATIN, WEST_EUROPEAN and XWEST_EUROPEAN there is an _ below an a, while the test above shows _ being sorted above the a! And at the same time there is also an _ above an A, meaning that there are A and a between two _ ?!?!!??!?
What is going on? Well, My first thought was: The _ are removed from the string before sorting. Let's visualise that by removing the _ from the sorted list (DUTCH):
DUTCH | |
| 1 | AA |
| 2 | aa |
| 3 | AAA |
| 4 | aaa |
| 5 | AAB |
| 6 | aab |
| 7 | AB |
| 8 | ab |
Clearly, even that is not fully correct: Why is AAA below aa when A is above a in our first test?
My second guess is that not only are the _ removed, the (first) ordering is done without looking at case (so all letters uppercased / or all letters lowercased).
And then, if there are strings with equal sort order, they are compared again using/including case.
It is normal/acceptable that a short string (like AA) comes before a longer string that starts with that short string (like AAA) So both aa and AA end up before aaa and AAA: aa and AA with order 1; aaa and AAA as order 2.
In the second step AA gets 1.1 and aa 1.2, while AAA gets 2.1 and aaa 2.2.
GERMAN and XGERMAN also show letters between underscores: the same thing is happening as with (X)DUTCH, (X)FRENCH, LATIN and (X)WEST_EUROPEAN.
The only difference with those is that in German lowercase comes before uppercase (as seen before).
Let's add in the ., resulting in twelve strings (and let's remove half of the seemingly equal sorting schemes)
(UNICODE)BINARY | EEC_EURO | DUTCH | GERMAN | PUNCTUATION | |
| 1 | A.A | A.A | A.A | a.a | A.A |
| 2 | A.B | A.B | A_A | a_a | a.a |
| 3 | AAA | A_A | a.a | A.A | A.B |
| 4 | AAB | A_B | a_a | A_A | a.b |
| 5 | A_A | AAA | AAA | aaa | A_A |
| 6 | A_B | AAB | aaa | AAA | a_a |
| 7 | a.a | a.a | AAB | aab | A_B |
| 8 | a.b | a.b | aab | AAB | a_b |
| 9 | a_a | a_a | A.B | a.b | AAA |
| 10 | a_b | a_b | A_B | a_b | aaa |
| 11 | aaa | aaa | a.b | A.B | AAB |
| 12 | aab | aab | a_b | A_B | aab |
What you should learn from this:
Sorting is not as straightforward as it seems!
If you have trouble sorting (for example) names because of uppercase/lowercase, try one of the language-specific sorts: They tend to ignore case until sorting equal strings, giving you exactly what you probably want.
If you need to sort any kinds of codes (which may include punctuation like
_and.), stay away from the language-specific sorts: they give unexpected results.BINARY,EEC_EUROandPUNCTUATIONare all quite different but "logical" nevertheless.