Development • 21. Jan ’20

Numerische Vergleiche ohne Überlauf

Wenn man Zahlen verarbeitet, die nicht aus vertrauenswürdigen Quellen stammen, sollte man immer damit rechnen, dass diese Zahlen nicht in den Datentypen passen, den man verwendet. Solche numerischen Überläufen können schnell zu subtilen Bugs oder sogar Sicherheitsproblemen führen.

Vor kurzem habe ich Code geschrieben, der die Gültigkeit von JSON Web Tokens überprüft. JWTs enthalten einen sogenannten Claim ("exp"), der das Verfallsdatum in Sekunden seit der Epoche (00:00:00 UTC am 1. Januar 1970) angibt. Ich habe das in Perl geschrieben, und es ist nicht ausgeschlossen, dass es noch Perls gibt, die eine Integergröße von 32 Bites haben.

But 32 bit integers for the epoch will overflow in January 2038 and then start again at -2147483648 (2³¹). On a 32 bit system, the seconds since the epoch in 2039 will be negative, and so they are considered less than the current time (if you read this before January 2038).

Let's see how arbitrarily large decimal integers can be compared in C. Like most comparison functions, ours will return a negative value, when the first number is less than the second, a positive one when it is greater, and zero, when both numbers are equal.

The Naive Approach With `strcmp()`

The function strcmp() compares two strings lexically. And because the ASCII code of the number 0 is 48, while that of 9 is 57, why not just compare the two numbers as strings:

int
compare_integers(const char *num1, const char *num2)
{
    return strcmp(num1, num2);
}

But that has several issues:

Integers may have a leading + or -.
The string "23" is considered greater than "1303" because "2" is greater than "1" as a string.
Integers may have gratuitous leading zeros which leads to more problems.

So that method will not work.

Checking Whether a String is an Integer

The first step should be to check whether the input strings are really integers. That is easy with regular expressions. In C it is a little bit more complicated:

static int
is_integer(const char *ptr)
{
    if (ptr[0] == '-') {
        ++ptr;
    } else if (ptr[0] == '+') {
        ++ptr;
    }

    if (ptr[0] == '\0') return 0;

    while (ptr[0] != '\0') {
        if (!isdigit(ptr[0])) return 0;

        ++ptr;
    }

    return 1;
}

First a possible leading - or + sign is skipped.

In line 10, we check whether the first character after the possible sign is a null byte which terminates strings in C. If that is the case, the "integer" was either an empty string or a lone + or - without any digits.

The while loop beginning in line 12 checks whether all remaining characters are decimal digits.

The Comparison Function

If both input strings are valid interpretations of integers, we can compare them.

The comparison is very simple if one number is positive and the other negative. So we first check for a possible + or - sign:

static int
compare_integers(const char *num1, const char *num2)
{
    char sign1 = '+', sign2 = '+';

    if (*num1 == '-' || *num1 == '+') {
        sign1 = *num1;
        ++num1;
    }

    if (*num2 == '-' || *num2 == '+') {
        sign2 = *num2;
        ++num2;
    }

    /* ... */
}

Numbers without a sign are positive. So we initialize the two variables with a + sign.

If either of the strings starts with a sign, we store it and skip the sign. In C, incrementing a pointer to a string has the effect that the first character is skipped.

while (num1[0] == '0') ++num1;
if (!num1[0]) --num1;

while (num2[0] == '0') ++num2;
if (!num2[0]) --num2;

Next we skip all leading zeros, again by incrementing the pointer to the string. But that would turn the string "0" into an empty string. So we undo the last skip, if the resulting string is empty.

Note: The if clause if (!num1[0]) checks, whether the string num1 is empty. If the first character of the string is a null byte, the string is terminated there, and that means it is empty.

There is one edge case that has to be solved. The numbers "-0" and "+0" are equal. We therefore change "-0" into "+0" before we start comparing:

if (sign1 == '-' && num1[0] == '0' && num1[1] == '\0')
    sign1 = '+';

if (sign2 == '-' && num2[0] == '0' && num2[1] == '\0')
    sign2 = '+';

This can be read as: if the sign is - and the first character is a 0 and the second character is the terminating null byte, change the minus into a plus sign.

Now that everything is normalized, there is an early exit, when the two signs differ:

if (sign1 == '-' && sign2 == '+') {
    return -1;
}

if (sign1 == '+' && sign2 == '-') {
    return -1;
}

Before we continue with the comparison, we have to handle the case that both numbers are negative. 100 is greater than 10 but -100 is less than 10. In other words, if both numbers are negative, the result of the comparison must be inverted, +1 must become -1 and vice versa. This can be easily achieved by multiplying the result with -1 later.

So we check whether the numbers are negative:

int result_sign = sign1 == '-' ? -1 : +1;

There is no need to check that sign2 is - because at this point we know that sign1 and sign2 are the same.

And now comes the actual trick! Even without converting the strings into real integers (which can cause overflows) it is trivial to decide which of the two strings represent the greater integer. We just compare the lengths of the strings:

ssize_t l1 = strlen(num1);
size_t l2 = strlen(num2);

if (l1 > l2) {
    return result_sign * +1;
} else if (l1 < l2) {
    return result_sign * -1;
}

A "longer" numbers is always greater than a "shorter" number, at least after a possible leading sign or gratuitous leading zeros have been removed. This is the case here.

And what if two numbers have equal length? Then you can just use the already mentioned function strcmp() and do a string comparison. "2304" is greater than "1303", both as a string and as a number:

return result_sign * strcmp(num1, num2);

By the way, since Sunday, September 9th 2001 at 01:46:40 AM UTC Unix timestamps have 10 digits. This will last until Saturday, November 20th 2286 05:46:39 PM UTC, leap seconds ignored. That means that for that period of time it will work to compare Unix timestamps as strings! But you are still a lousy developer, if you do so! 😁

Floating Point Numbers

Comparing floating point numbers without overflow is left as an exercice to the reader because it is simple. You first compare the part before the dot, and if that is equals you always compare the fractional parts as strings.

Source Code

The complete source code (download link) follows below. Save it as compare-integers.c and create a binary executable compare-integers from it with the command make compare-integers.

You can test it like ./compare-integers 1303 2304.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#include <sys/types.h>>
static int is_integer(const char *str);
static int compare_integers(const char *num1, const char *num2);

int
main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "Usage: %s INTEGER1 INTEGER2\n", argv[0]);
        exit(1);
    }

    if (!is_integer(argv[1])) {
        fprintf(stderr, "'%s' is not an integer.\n", argv[1]);
        exit(1);
    }

    if (!is_integer(argv[2])) {
        fprintf(stderr, "'%s' is not an integer.\n", argv[2]);
        exit(1);
    }

    int result = compare_integers(argv[1], argv[2]);

    if (result > 0) {
        printf("%s is greater than %s.\n", argv[1], argv[2]);
    } else if (result < 0) {
        printf("%s is less than %s.\n", argv[1], argv[2]);
    } else {
        printf("%s equals %s.\n", argv[1], argv[2]);
    }

    return 0;
}

static int
compare_integers(const char *num1, const char *num2)
{
    char sign1 = '+', sign2 = '+';

    if (*num1 == '-' || *num1 == '+') {
        sign1 = *num1;
        ++num1;
    }

    if (*num2 == '-' || *num2 == '+') {
        sign2 = *num2;
        ++num2;
    }

    while (num1[0] == '0') ++num1;
    if (!num1[0]) --num1;

    while (num2[0] == '0') ++num2;
    if (!num2[0]) --num2;

    if (sign1 == '-' && num1[0] == '0' && num1[1] == '\0')
        sign1 = '+';

    if (sign2 == '-' && num2[0] == '0' && num2[1] == '\0')
        sign2 = '+';

    if (sign1 == '-' && sign2 == '+') {
        return -1;
    }

    if (sign1 == '+' && sign2 == '-') {
        return -1;
    }

    int result_sign = sign1 == '-' ? -1 : +1;

    size_t l1 = strlen(num1);
    size_t l2 = strlen(num2);

    if (l1 > l2) {
        return result_sign * +1;
    } else if (l1 < l2) {
        return result_sign * -1;
    }

    return result_sign * strcmp(num1, num2);
}

static int
is_integer(const char *ptr)
{
    if (ptr[0] == '-') {
        ++ptr;
    } else if (ptr[0] == '+') {
        ++ptr;
    }

    if (ptr[0] == '\0') return 0;

    while (ptr[0] != '\0') {
        if (!isdigit(ptr[0])) return 0;

        ++ptr;
    }

    return 1;
}

Kommentar hinterlassen

Die Angabe der E-Mail-Adresse ist freiwillig. Bitte bedenke aber, dass ohne gültige E-Mail-Adresse keine Benachrichtigung über eine Antwort möglich ist. Die Adresse wird nicht zusammen mit dem Kommentar angezeigt!

Guido Flohr

Numerische Vergleiche ohne Überlauf

The Naive Approach With `strcmp()`

Checking Whether a String is an Integer

The Comparison Function

Floating Point Numbers

Source Code

Kommentar hinterlassen

Lerna Mono-Repos mit internen Abhängkeiten

Postfix, Amavis und ClamAV extrem langsam

Filtern und Suchen in sichtbarem Content mit AngularJS (Teil 1/2)

Google Maps API-Schlüssel gratis nutzen

Bulgarischer Joghurt

Angular - Bootstrap modale Dialoge mit Ngrx-Effects

Die einfache Würfelmethode verbessern

Die einfachste Methode den Zauberwürfel zu lösen

JSON.stringify() missbrauchen

Tücken von JavaScript `for...in`-Schleifen

Elektronische Rechnungen mit freier und quelloffener Software erzeugen

Dynamische Angular-Konfiguration

Categories

Tags

Numerische Vergleiche ohne Überlauf

The Naive Approach With strcmp()

Checking Whether a String is an Integer

The Comparison Function

Floating Point Numbers

Source Code

Kommentar hinterlassen

Categories

Tags

The Naive Approach With `strcmp()`