Discogs Collection Clean-up

As I continue my 12 apps in 12 months journey, this month’s app is, by my own admission, a low-effort submission! That’s not to say it isn’t useful (it is to me, at least), but let’s just say it’s not going to win any programming awards.

What is Discogs?

For those who aren’t aware, Discogs is an online platform that lets you catalogue your media collection (mainly vinyl in my case), and it also has a marketplace for buying and selling the same. When you add an item, you can also record the grading of the item and its sleeve on a scale that runs from Poor to Mint.

When I was first starting out with Discogs, I was less than diligent about recording the grading of my items, so I have a number of items that are missing this information. Hence the need for Discogs Clean-up. It will also force me to listen to these items in order to grade them.

Discogs Clean-up

Like many software services, Discogs has its own API, which gives you access to information on your collection, although not all of it: no price information is exposed, for example. And unlike, say, Evernote, Discogs has a reasonable rate limit of about 60 requests per minute for authenticated requests and 25 per minute for unauthenticated ones. Even better, the current limits are included in the response headers (X-Discogs-Ratelimit, X-Discogs-Ratelimit-Used and X-Discogs-Ratelimit-Remaining), making it easy to stay within them. This turned out to be important for this project, where the whole of my collection was going to be parsed.

With about 1,100 items to scan, a computer can clearly work through them faster than the 60 per minute rate limit allows, so without intervention you will hit the limit. The answer is to inspect the headers after each call, work out how close you are to the limit, and pause when you get close. As we don’t know when the period started, we can’t know how long is left before the count resets, so the only safe thing to do is to pause for the whole period, which is a minute. For this, I wrote the following function:

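// Parses the raw response header string for the X-Discogs-Ratelimit* headers
// and sleeps for a full minute when fewer than 5 requests remain.
// Expects a DEBUG constant to be defined elsewhere; stops the script if the
// rate limit headers are missing.
function handleDiscogsRateLimit($headers)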
{
    $rateLimit = null;
    $rateLimitRemaining = null;
    $rateLimitUsed = null;
    foreach (explode("\r\n", $headers) as $header) {
        if (stripos($header, 'X-Discogs-Ratelimit:') === 0) {
            $rateLimit = (int)trim(substr($header, strlen('X-Discogs-Ratelimit:')));
        }
        if (stripos($header, 'X-Discogs-Ratelimit-Remaining:') === 0) {
            $rateLimitRemaining = (int)trim(substr($header, strlen('X-Discogs-Ratelimit-Remaining:')));
        }
        if (stripos($header, 'X-Discogs-Ratelimit-Used:') === 0) {
            $rateLimitUsed = (int)trim(substr($header, strlen('X-Discogs-Ratelimit-Used:')));
        }
    }

    // If close to the rate limit, sleep for the whole 60-second period
    if ($rateLimit !== null && $rateLimitRemaining !== null && $rateLimitUsed !== null) {
        if (DEBUG) echo "Rate Limit: {$rateLimit}, Remaining: {$rateLimitRemaining}, Used: {$rateLimitUsed}" . PHP_EOL;
        if ($rateLimitRemaining < 5) {
            if (DEBUG) echo "Approaching Discogs rate limit. Sleeping for 60 seconds..." . PHP_EOL;
            sleep(60);
        }
    } else {
        echo "Rate limit headers not found or incomplete." . PHP_EOL;
        echo $headers . PHP_EOL;
        die();
    }
}
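
As a rough sketch of how the header string might be obtained and handed to the function above, assuming cURL is used for the requests: the helper names splitRawResponse and discogsGet, the User-Agent string and the token-based Authorization header are my own illustrative choices here rather than the project’s actual code, so check them against the Discogs API documentation before relying on them.

```php
<?php
// Hypothetical helper (not from the project): split a raw cURL response,
// fetched with CURLOPT_HEADER enabled, into its header and body parts.
function splitRawResponse(string $raw, int $headerSize): array
{
    return [substr($raw, 0, $headerSize), substr($raw, $headerSize)];
}

// Hypothetical wrapper around a single authenticated GET request.
// It relies on handleDiscogsRateLimit() from the article being loaded.
function discogsGet(string $url, string $token): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true, // include response headers in the result
        CURLOPT_HTTPHEADER     => ["Authorization: Discogs token={$token}"], // assumed auth scheme
        CURLOPT_USERAGENT      => 'DiscogsCleanupSketch/0.1', // placeholder value
    ]);

    $raw = curl_exec($ch);
    if ($raw === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    // cURL reports how many bytes of the result are headers.
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    [$headers, $body] = splitRawResponse($raw, $headerSize);

    // Pause here if we are getting close to the limit.
    handleDiscogsRateLimit($headers);

    return $body;
}
```

Calling the rate limit check inside the wrapper means every request is checked, so no individual call site can forget to do it.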

Slowing things down brings its own issues, such as hitting PHP’s execution time limit (max_execution_time) if you run the database population routine from the web. The simplest way around this is to run the script from the command line, where PHP applies no execution time limit by default, and send it to the background so it can get on with its work:

nohup php recache.php > recache.log 2>&1 &
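
Since the script is meant to run from the command line, one option is a small guard at the top of recache.php that refuses to run under a web server. This is only a sketch: the function name isCommandLine is hypothetical, and I’m assuming the script is otherwise reachable from the web.

```php
<?php
// Hypothetical guard (not from the project): true when running under the CLI SAPI.
function isCommandLine(string $sapi = PHP_SAPI): bool
{
    return $sapi === 'cli';
}

if (!isCommandLine()) {
    // Under a web server the execution time limit applies, so bail out early.
    http_response_code(403);
    exit('recache.php must be run from the command line.' . PHP_EOL);
}
```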

Kicking off a Recache

It is unlikely that your collection changes that frequently, but you are going to want to update at some point. One obvious way is to schedule it with cron, for example at midnight every Sunday:

0 0 * * 0 php /var/www/discogs-cleanup/src/recache.php > ~/recache.log 2>&1

However, maybe you’ve spent the day listening to records and updating Discogs, and can’t wait until Sunday? The following code starts the process in the background, and you can kick it off by clicking on the # in the footer.

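    // Absolute path to the script
    $script = __DIR__ . '/recache.php';

    // Escape the path for safety
    $escapedScript = escapeshellarg($script);

    // Command to run in background; $php is the PHP CLI executable set in config.php.
    // Output is discarded and the trailing & detaches the process from the web request.
    $cmd = "$php $escapedScript > /dev/null 2>&1 &";
    $output = [];
    $out = '';
    $return_var = 0;

    // Because the command is backgrounded, exec() returns straight away, so
    // $return_var reflects only the shell's attempt to launch the command,
    // not the outcome of the recache itself.
    $last_line = exec($cmd, $output, $return_var);
    if ($return_var !== 0) {
        $out = "<p>Recache failed with return code $return_var.</p>";
    } else {
        $out = "<p>Recache started successfully.</p>";
    }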

However, you need to jump through some hoops to get it working:

  • The PHP CLI binary (php) must be available on your server and in the system PATH.
  • The web server user (e.g. www-data on Apache) must have permission to execute the CLI script.
  • You must detach the script from the web process so it runs in the background.
  • The path to the php executable must be set in config.php.
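
Some of these requirements can be checked up front before attempting the exec() call. The following is a minimal sketch, assuming the value in config.php is a full path to the PHP binary (a bare php resolved through PATH would fail the is_executable() check); the function name recachePreflight is hypothetical and not part of the project.

```php
<?php
// Hypothetical pre-flight check (not from the project).
// Returns a list of problems; an empty array means nothing obvious is wrong.
function recachePreflight(string $php, string $script): array
{
    $problems = [];

    // exec() can be switched off via disable_functions in php.ini.
    if (!function_exists('exec')) {
        $problems[] = 'exec() is not available.';
    }

    // Checked as the current user, i.e. the web server user when run from the web.
    if (!is_executable($php)) {
        $problems[] = "PHP binary is not executable: {$php}";
    }

    if (!is_readable($script)) {
        $problems[] = "Script is not readable: {$script}";
    }

    return $problems;
}
```

If the returned array is not empty, the page could show those messages instead of reporting that the recache started.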

For me, the number of times I need to do this is so small that I’m not sure it’s worth the effort, especially as I can log in to the server and start the process from the command line. However, it is there if you need it.

Download the latest code from GitHub.
