Resources:
- GitHub (Sort code, data.py)
- Timing Data
- Raw Timing Data (CSV files)

Overview

  • Pick a pivot element.
  • Partition the array into 3 parts:
    • First part: all elements in this part are less than the pivot.
    • Second part: the pivot itself (only one element!)
    • Third part: all elements in this part are greater than or equal to the pivot.
  • Then apply the quicksort algorithm recursively to the first and third parts (as sketched below).
    Source: me.dt.in.th
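
A minimal Python sketch of this scheme (an illustration, not the ./quicksort program timed below; it naively uses the leftmost element as the pivot, and the later sections swap other choices into that slot):

def quicksort(a, lo=0, hi=None):
    """Sort a[lo..hi] in place using the 3-part scheme above."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = partition(a, lo, hi)   # index where the pivot lands
    quicksort(a, lo, p - 1)    # first part: elements < pivot
    quicksort(a, p + 1, hi)    # third part: elements >= pivot

def partition(a, lo, hi):
    """Partition around a[lo]; return the pivot's final index."""
    pivot = a[lo]
    last = lo                  # boundary of the "< pivot" region
    for i in range(lo + 1, hi + 1):
        if a[i] < pivot:
            last += 1
            a[last], a[i] = a[i], a[last]
    a[lo], a[last] = a[last], a[lo]   # drop the pivot between the parts
    return last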

Scripts

In analysing an algorithm’s performance, we should always consider the average case.
For sorting, this would mean a randomised array of values.

Since I was conducting my tests on a non-CSE computer, I couldn't use the ./gen program provided in lab06 (not that I used it last time either).

RNG

python3 -c 'from random import shuffle; (lambda s: [print(s), [print(i) for i in (lambda a: shuffle(a) or a)(list(range(s)))]])(100000)'

Have I told you I love my one-liner scripts :)

You can't put multiple statements inside a lambda, nor separate them with semicolons… Instead, we can evaluate multiple expressions by putting them in a list (and shuffle(a) or a works because shuffle returns None).
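
For the curious, the same logic unrolled into ordinary statements looks like this (using the same size of 100,000 as the one-liner):

from random import shuffle

size = 100000
values = list(range(size))
shuffle(values)   # shuffle() returns None, hence the "or a" trick above

print(size)       # first line: how many values follow
for value in values:
    print(value)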

More RNG

Or I could repurpose my sort detective helper script.

#!/bin/python3
import subprocess
import sys
from random import shuffle

# Returns the mean of a list of items (to 3 decimal places)
def average(items):
    return round(sum(items) / len(items), 3)

n = 100
while True:
    print(str(n) + ",", file=sys.stderr, end=" ")

    nums = list(range(n))                              # Ascending
    if sys.argv[1] == "D": nums = nums[::-1]           # Descending

    times = []
    for x in range(10):

        # Fixed from last time, where the shuffled order was always the same!
        if sys.argv[1] == "R": shuffle(nums)           # Random
        string = "\n".join(map(str, [n] + nums)).encode()

        with subprocess.Popen(["/usr/bin/time", "-f%U", "./quicksort", "-p<INSERTYOURMODEFLAGHERE>", "-q"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE) as process:
            comm = process.communicate(string)
            times.append(float(comm[1].decode().strip()))   # %U is printed to stderr

    avg = average(times)
    print(str(n) + ",", avg)
    print(avg, file=sys.stderr)

    if avg > 1.0: break
    if n == 500000: break

    if n < 10000: n += 100
    else: n += 2000

This script generates timings for a large range of sizes (100 to 10,000 in increments of 100, then up to 500,000 in increments of 2,000, stopping early once a sort averages over a second), taking the average time over 10 runs at each size.

Data


Download Timings

Analysis

Note: the x-axis increases in increments of 100, then in increments of 2000 after 10,000 items.

This graph was made from the timings of each sort method on a randomised array.

The graph above suggests that all three sorts perform very similarly.

In this data set, the random pivot method was more performant than the naive and median methods.

This could be a result of the naive and median methods requiring more operations (the nature of a naive pivot, and the extra comparisons and swaps of a median-of-three pivot).

The timings for the median method were (surprisingly) more inconsistent than the naive sort's, as evident in the visible timing spikes.

At a dataset of about 364,000 items, both the naive and median sort methods spiked in sort time.

// Aside: it would have been better to run all the tests on the same randomised numbers

Naive Pivot


Note: the x-axis increases in increments of 100, then in increments of 2000 after 10,000 items.

This algorithm uses the left- or right-most value of the array as the pivot.
It is naive because no thought is given to which value is picked as the pivot.

This algorithm hits its worst-case time complexity when sorting an already-sorted array: every partition then splits off an empty side, so the recursion goes n levels deep and the total work is O(n²), as the toy run below demonstrates.
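
A small, self-contained demonstration of that worst case, reusing the partition() sketch from the overview (the array size here is arbitrary):

def partition(a, lo, hi):          # as in the overview sketch
    pivot, last = a[lo], lo
    for i in range(lo + 1, hi + 1):
        if a[i] < pivot:
            last += 1
            a[last], a[i] = a[i], a[last]
    a[lo], a[last] = a[last], a[lo]
    return last

a = list(range(1000))              # already sorted
lo, hi, depth = 0, len(a) - 1, 0
while lo < hi:
    p = partition(a, lo, hi)       # the minimum is the pivot: it stays at lo...
    lo = p + 1                     # ...so the "less than" part is always empty
    depth += 1
print(depth)                       # 999 -- one level per element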

Median of Three Pivot


Note: the x-axis increases in increments of 100, then in increments of 2000 after 10,000 items.

The median of three pivot algorithm compares the first, middle and last values of an array and uses the median value as the pivot, whilst also sorting the three values (lo, mid, hi) in place. This has the advantage that the sort always pivots around a value which most likely has both lesser and greater values either side of it.
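
A sketch of the selection step, assuming (as in the overview sketch) that the partition reads its pivot from a[lo]:

def median_of_three(a, lo, hi):
    """Order a[lo], a[mid], a[hi] in place, then move the median into the pivot slot."""
    mid = (lo + hi) // 2
    if a[mid] < a[lo]:
        a[lo], a[mid] = a[mid], a[lo]
    if a[hi] < a[mid]:
        a[mid], a[hi] = a[hi], a[mid]
    if a[mid] < a[lo]:
        a[lo], a[mid] = a[mid], a[lo]
    a[lo], a[mid] = a[mid], a[lo]   # the median is now the pivot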

Sorting a randomised array of numbers (expectedly) takes more time than an already-sorted ascending array.

The performance of the median method on descending items is interesting, as in my head it should perform similarly to a randomised array.
Overall, the median method has better sort performance than the naive sort, which bottlenecks on ascending and descending arrays.

Random Pivot


Note: the x-axis increases in increments of 100, then in increments of 2000 after 10,000 items.

Despite knowing nothing about the nature of the values in the array, the random pivot method gives the best average-case performance.
This algorithm selects a random index in the array as the pivot.
It can sort an array of 500,000 random values in roughly 0.11s (on my laptop at least).
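
Here the selection step is just one swap (again assuming the pivot is read from a[lo]):

from random import randrange

def random_pivot(a, lo, hi):
    """Swap a uniformly random element into the pivot slot."""
    r = randrange(lo, hi + 1)      # random index in [lo, hi]
    a[lo], a[r] = a[r], a[lo]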

For a quicksort on an ascending or descending array of numbers, the random pivot method takes slightly longer than the median method. This is to be expected, as no thought is given to picking the pivot, so on average more partitions will be performed.
Like the median method, the random pivot method is faster than the naive sort.

Gotta go faster!

By implementing a different sorting algorithm within another, for instance an Insertion Sort within the Quick Sort, we can optimise the performance of our sort. This is because an Insertion Sort may perform fewer comparisons, swaps and recursive calls for a small number of values.
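
For reference, the Insertion Sort used in the hybrid sketches below could look like this (a sketch over a subarray a[lo..hi], not necessarily the timed implementation):

def insertion_sort(a, lo, hi):
    """Insertion-sort the slice a[lo..hi] in place."""
    for i in range(lo + 1, hi + 1):
        item = a[i]
        j = i - 1
        while j >= lo and a[j] > item:   # shift larger items right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = item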

For later comparison: it takes the unoptimised (non-hybrid) Quick Sort an average of 0.206s to sort a randomised array of 1,000,000 values.

In-place Insertion Sort

If a partition is smaller than a threshold M, then an Insertion Sort may be performed instead of the Quick Sort for that partition.

“switch to insertion sort for partitions smaller than the threshold M”
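
A sketch of this variant, reusing partition() and insertion_sort() from the earlier sketches (M = 9 turns out to be the best threshold below):

M = 9

def hybrid_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= M:                 # small partition: insertion sort it
        insertion_sort(a, lo, hi)
        return
    p = partition(a, lo, hi)
    hybrid_quicksort(a, lo, p - 1)
    hybrid_quicksort(a, p + 1, hi)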

For a randomised array of 1,000,000 values, piped into:
/usr/bin/time --format="%U" ./quicksort -pr -q -sa<THRESHOLD>

Threshold (M) | Avg (s) | Run 1 | Run 2 | Run 3 | Run 4 | Run 5
0             | 0.208   | 0.21  | 0.21  | 0.20  | 0.21  | 0.21
1             | 0.210   | 0.20  | 0.21  | 0.21  | 0.21  | 0.22
2             | 0.208   | 0.21  | 0.21  | 0.21  | 0.20  | 0.21
3             | 0.208   | 0.22  | 0.20  | 0.20  | 0.21  | 0.21
4             | 0.200   | 0.20  | 0.19  | 0.19  | 0.20  | 0.22
5             | 0.202   | 0.20  | 0.20  | 0.20  | 0.20  | 0.21
6             | 0.200   | 0.20  | 0.20  | 0.20  | 0.20  | 0.20
7             | 0.198   | 0.20  | 0.20  | 0.20  | 0.19  | 0.20
8             | 0.192   | 0.19  | 0.19  | 0.19  | 0.19  | 0.20
9             | 0.184   | 0.18  | 0.18  | 0.18  | 0.19  | 0.19
10            | 0.196   | 0.20  | 0.19  | 0.19  | 0.20  | 0.19
100           | 0.196   | 0.20  | 0.19  | 0.20  | 0.19  | 0.20
1000          | 0.442   | 0.46  | 0.45  | 0.43  | 0.44  | 0.43
10000         | 3.208   | 3.38  | 3.27  | 3.16  | 3.06  | 3.17

This optimisation (threshold M=9) shaves off 0.206 - 0.184 = 0.022s of execution time.

Deferred Insertion Sort

Another way to reduce the number of Insertion Sorts performed is to do just one Insertion Sort at the very end, after all the Quick Sort operations.

“do not sort partitions smaller than the threshold M; leave them unsorted, and do an insertion sort at the end”
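
A sketch of this variant, again reusing the earlier helpers (M = 100 turns out to be the best threshold below). It works because every element ends up within M positions of its final spot, which is the near-sorted case an Insertion Sort handles well:

M = 100

def quicksort_skip_small(a, lo, hi):
    if hi - lo + 1 <= M:                 # leave small partitions unsorted
        return
    p = partition(a, lo, hi)
    quicksort_skip_small(a, lo, p - 1)
    quicksort_skip_small(a, p + 1, hi)

def deferred_quicksort(a):
    quicksort_skip_small(a, 0, len(a) - 1)
    insertion_sort(a, 0, len(a) - 1)     # one final pass tidies up the stragglers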

For a randomised array of 1,000,000 values, piped into:
/usr/bin/time --format="%U" ./quicksort -pr -q -sb<THRESHOLD>

Threshold (M) | Avg (s) | Run 1 | Run 2 | Run 3 | Run 4 | Run 5
0             | 0.206   | 0.20  | 0.21  | 0.20  | 0.21  | 0.21
1             | 0.208   | 0.21  | 0.21  | 0.21  | 0.20  | 0.21
2             | 0.206   | 0.20  | 0.21  | 0.21  | 0.21  | 0.20
3             | 0.206   | 0.21  | 0.21  | 0.21  | 0.19  | 0.21
4             | 0.204   | 0.21  | 0.19  | 0.20  | 0.21  | 0.21
5             | 0.198   | 0.20  | 0.20  | 0.20  | 0.19  | 0.20
6             | 0.194   | 0.20  | 0.20  | 0.19  | 0.19  | 0.19
7             | 0.198   | 0.20  | 0.20  | 0.20  | 0.20  | 0.19
8             | 0.200   | 0.20  | 0.20  | 0.20  | 0.20  | 0.20
9             | 0.198   | 0.20  | 0.19  | 0.20  | 0.20  | 0.20
10            | 0.188   | 0.19  | 0.19  | 0.18  | 0.19  | 0.19
100           | 0.186   | 0.19  | 0.19  | 0.18  | 0.18  | 0.19
1000          | 0.440   | 0.42  | 0.44  | 0.44  | 0.43  | 0.47
10000         | 3.186   | 3.31  | 3.07  | 3.27  | 3.05  | 3.23

This optimisation (threshold M=100) shaves off 0.206 - 0.186 = 0.020s of execution time.

Appendix