So... So you think you can sort?

The perils of ORDER BY and NLS_SORT

Like with many articles, this one found its origin in an application bug. A developer came to me saying that the ORDER BY in his SQL gave incorrect results. After some investigation, this article was the result.

ORDER BY is so easy to add at the end of a query. And I expect you also know about ASC end DESC. Good, but that is not all there is to say about sorting (in Oracle).

Do you know that there is a session parameter that influences the result of an ORDER BY enormously? Since it is a session parameter, you as a developer can not know what value the user will have set it to (knowingly, but usually quite unknowingly). Therefore you, as a developer, should set this parameter whenever you write code that depends on the sort order of a query. Unless you want two distinct runs of the same code to yield different results. Or unless you want the code that behaves perfectly in your environment to crash on the environment of your colleague / the tester / the code reviewer / production / the end user / ...

The parameter in question is NLS_SORT. Note that, according to the Oracle documentation, NLS_SORT "is derived from NLS_LANGUAGE". What this means is that whenever you change NLS_LANGUAGE, you will also have changed NLS_SORT. Therefore make sure to set NLS_LANGUAGE first and then set NLS_SORT:

alter session set NLS_LANGUAGE='ENGLISH';
alter session set NLS_SORT='BINARY';

So what is it all about?

Sorting per letter.

Let's sort a table with 4 records, each of 1 character: dot, underscore, lowercase a, uppercase A.

Let's try with 12 settings of NLS_SORT. Since some variants turn out to give the same result, they only get one column in the output:

  • UNICODE_BINARY gives the same result as BINARY.
    I guess that might be different on databases that are not in a Unicode character set and where the text expression being sorted contains Unicode values.

  • The X-versions of the languages (DUTCH vs XDUTCH / FRENCH and XFRENCH / GERMAN and XGERMAN / WEST_EUROPEAN vs XWESTEUROPEAN) give the same result

(UNICODE_)BINARYEEC_EURO(X)DUTCH(X)FRENCH(X)GERMANLATIN(X)WEST_EUOPEANPUNCTUATION
1........
2A_______
3_AAAaAAA
4aaaaAaaa

Did you know that in German lowercase comes before uppercase?

Do you think this is all you need to know about sorting? Well, look what happens when sorting longer strings... (below)

Did you know that trying to set NLS_SORT to 'AMERICAN' or 'ENGLISH' gives an error? I guess that it is because of the A in ASCII.

Sorting strings with punctuation (e.g. "codes")

Here I sort eight 3-letter codes composed of A, a, B, b and the underscore.

(UNICODE_)BINARYEEC_EURO(X)DUTCH(X)FRENCH(X)GERMANLATIN(X)WEST_EUROPEANPUNCTUATION
1AAAA_AA_AA_Aa_aA_AA_AA_A
2AABA_Ba_aa_aA_Aa_aa_aa_a
3A_AAAAAAAAAAaaaAAAAAAA_B
4A_BAABaaaaaaAAAaaaaaaa_b
5a_aa_aAABAABaabAABAABAAA
6a_ba_baabaabAABaabaabaaa
7aaaaaaA_BA_Ba_bA_BA_BAAB
8aabaaba_ba_bA_Ba_ba_baab

Note how in DUTCH, XDUTCH, FRENCH, XFRENCH, LATIN, WEST_EUROPEAN and XWEST_EUROPEAN there is an _ below an a, while the test above shows _ being sorted above the a! And at the same time there is also an _ above an A, meaning that there are A and a between two _ ?!?!!??!?

What is going on? Well, My first thought was: The _ are removed from the string before sorting. Let's visualise that by removing the _ from the sorted list (DUTCH):

DUTCH
1AA
2aa
3AAA
4aaa
5AAB
6aab
7AB
8ab

Clearly, even that is not fully correct: Why is AAA below aa when A is above a in our first test?
My second guess is that not only are the _ removed, the (first) ordering is done without looking at case (so all letters uppercased / or all letters lowercased).
And then, if there are strings with equal sort order, they are compared again using/including case.
It is normal/acceptable that a short string (like AA) comes before a longer string that starts with that short string (like AAA) So both aa and AA end up before aaa and AAA: aa and AA with order 1; aaa and AAA as order 2.
In the second step AA gets 1.1 and aa 1.2, while AAA gets 2.1 and aaa 2.2.

GERMAN and XGERMAN also show letters between underscores: the same thing is happening as with (X)DUTCH, (X)FRENCH, LATIN and (X)WEST_EUROPEAN.
The only difference with those is that in German lowercase comes before uppercase (as seen before).

Let's add in the ., resulting in twelve strings (and let's remove half of the seemingly equal sorting schemes)

(UNICODE)BINARYEEC_EURODUTCHGERMANPUNCTUATION
1A.AA.AA.Aa.aA.A
2A.BA.BA_Aa_aa.a
3AAAA_Aa.aA.AA.B
4AABA_Ba_aA_Aa.b
5A_AAAAAAAaaaA_A
6A_BAABaaaAAAa_a
7a.aa.aAABaabA_B
8a.ba.baabAABa_b
9a_aa_aA.Ba.bAAA
10a_ba_bA_Ba_baaa
11aaaaaaa.bA.BAAB
12aabaaba_bA_Baab

What you should learn from this:

  1. Sorting is not as straightforward as it seems!

  2. If you have trouble sorting (for example) names because of uppercase/lowercase, try one of the language-specific sorts: They tend to ignore case until sorting equal strings, giving you exactly what you probably want.

  3. If you need to sort any kinds of codes (which may include punctuation like _ and .), stay away from the language-specific sorts: they give unexpected results. BINARY, EEC_EURO and PUNCTUATION are all quite different but "logical" nevertheless.