Creating a Word Cloud from Your Day One Entries

A while back I wrote a post about querying the Day One database in which I teased more to come with this statement:

I wanted more detail than Day One offered and had ideas of how I could mine my entries in cool ways (more on that in a later post).

Well, this is that “later post”! I wanted to visualise my entries as a word cloud.

Massaging the data

To produce the cloud I needed to do two things: extract the data and then display it. Sounds simple, doesn’t it? I quickly found a number of issues with this, mainly that the data needed to be massaged before it was useful. I explain how I did that here, but if you just want to skip straight to the code you can find it on its dedicated GitHub page.

Stop words

I knew that without some processing the output would be swamped by small words such as “and”, “a”, “it” and so on, which are known as stop words. Therefore, I built an array of the words that I wanted rid of. You can add to and remove from this list to suit your own requirements.

// build an array of stop words to remove
$stopWords = array('\'s', 'a', 'all', 'also', 'an', 'but', 'by', 'the', 'in', 'on', 'is', 'and', 'to', 'he', 'she', 'they', 'my', 'as', 'we', 'of', 'I', 'at', 'had', 'us', 'it', 'so', 'or', 'was', 'that', 'for', 'with', 'up', 'out', 'have', 'went', 'be', 'which', 'this', 'back', 'where', 'get', 'me', 'do', 'when', 'got', 'go', 'off', 'then', 'were', 'our', 'not', 'did');
// join them into one case-insensitive pattern that only matches whole words
$pattern = '/\b(?:' . implode('|', $stopWords) . ')\b/i';

Next, we open the database and cycle through every row in the ZENTRY table, processing each one to remove any Day One image links, then punctuation and, finally, those stop words.

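// remove Day One image links of the form ![](dayone-moment://...)
$return = preg_replace('/!\[\]\(dayone-moment:\/\/[^\)]+\)/', '', $row['ZMARKDOWNTEXT']);
// remove everything that is not a letter, a digit or whitespace
$return = preg_replace('/[^\p{L}\p{N}\s]/u', '', $return);
// remove the stop words
$return = preg_replace($pattern, '', $return);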
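To see the three steps working together on a single piece of text, here is a minimal sketch. The `cleanEntry` function, the sample sentence and the short stop-word list are hypothetical and only there for illustration; the regular expressions are the ones shown above, with `preg_quote` added as a precaution in case a stop word ever contains a regex metacharacter.

```php
<?php
// Hypothetical helper (not part of the original script): applies the three
// clean-up steps to the text of one entry.
function cleanEntry(string $text, array $stopWords): string
{
    // Escape each stop word so it is matched literally inside the pattern.
    $escaped = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $stopWords);
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/i';

    // 1. Strip Day One image links such as ![](dayone-moment://ABC123).
    $text = preg_replace('/!\[\]\(dayone-moment:\/\/[^\)]+\)/', '', $text);
    // 2. Strip everything that is not a letter, a digit or whitespace.
    $text = preg_replace('/[^\p{L}\p{N}\s]/u', '', $text);
    // 3. Strip the stop words (whole words only, case-insensitive).
    return preg_replace($pattern, '', $text);
}

// Made-up sample entry and a shortened stop-word list.
$sample = "We went to the beach! ![](dayone-moment://ABC123) It was sunny.";
echo cleanEntry($sample, array('we', 'went', 'to', 'the', 'it', 'was')) . "\n";
```

Note that the removed words leave their surrounding spaces behind, so the result contains runs of whitespace that need to be dealt with when the string is split into words.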

This gives a space-separated string, which is converted to an array and then written to a file, one word per line.
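One way that conversion could be done is sketched below. The `appendWords` function and the temporary file are hypothetical; the real script writes to words.txt. Splitting with `PREG_SPLIT_NO_EMPTY` is my assumption for coping with the extra spaces left behind by the removed stop words.

```php
<?php
// Hypothetical helper: splits a cleaned string into words and appends them
// to $path, one word per line. Returns the number of words written.
function appendWords(string $cleaned, string $path): int
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops the empty
    // strings that leading, trailing or repeated spaces would produce.
    $words = preg_split('/\s+/', $cleaned, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0;
    }
    file_put_contents($path, implode("\n", $words) . "\n", FILE_APPEND);
    return count($words);
}

// Example run against a temporary file (a stand-in for words.txt).
$wordFile = tempnam(sys_get_temp_dir(), 'words');
appendWords("   beach    sunny  beach ", $wordFile);
echo file_get_contents($wordFile);
```

Appending rather than overwriting means the function can be called once per entry while looping over the table.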

Next, the file is sorted into alphabetical order, meaning all the same words will be together. I initially did this in PHP but found it was too slow, so I get the OS to do it.

exec('sort -o words.txt words.txt');

The next section opens this sorted file and cycles through it one line at a time, counting all occurrences of the same word and writing the result to a new comma-separated file with the count as the first column and the word as the second. I then drop back down to the OS to get this file sorted into descending numerical order.

exec("sort -t',' -k1 -n -r -o words_count_out.txt words_count.txt");
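The counting pass relies on the file already being sorted, so identical words sit on consecutive lines and only a running count needs to be kept in memory. A rough sketch of that idea follows; the `countSortedWords` function and the temporary file names are hypothetical rather than taken from the actual script.

```php
<?php
// Hypothetical helper: reads an already-sorted file of one word per line and
// writes "count,word" rows to $outPath, one row per distinct word.
function countSortedWords(string $inPath, string $outPath): void
{
    $in = fopen($inPath, 'r');
    $out = fopen($outPath, 'w');
    $current = null; // the word whose run is being counted
    $count = 0;

    while (($line = fgets($in)) !== false) {
        $word = trim($line);
        if ($word === '') {
            continue; // ignore blank lines
        }
        if ($word === $current) {
            $count++; // same word as the previous line: extend the run
            continue;
        }
        if ($current !== null) {
            fwrite($out, $count . ',' . $current . "\n"); // run ended: flush it
        }
        $current = $word;
        $count = 1;
    }
    if ($current !== null) {
        fwrite($out, $count . ',' . $current . "\n"); // flush the final run
    }
    fclose($in);
    fclose($out);
}

// Example run on a small, already-sorted list.
$sorted = tempnam(sys_get_temp_dir(), 'sorted');
$counted = tempnam(sys_get_temp_dir(), 'counted');
file_put_contents($sorted, "beach\nbeach\nsunny\n");
countSortedWords($sorted, $counted);
echo file_get_contents($counted);
```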

We now have a file with all punctuation and stop words removed and with every unique word counted – just what is needed for the word cloud.

The final step is to open another PHP script that displays the word cloud itself. The work is split simply because the first part takes too long to run in a browser and times out, so it is best run from the command line. Make sure you change the location to be correct for your server.

exec('open "http://dayone.local:8888/display.php"');

Displaying the data

I did look for a PHP package to create the word cloud but nothing came up that was really suitable. In the end, I found a JavaScript package, wordcloud2.js, which did exactly what I was looking for.

One quirk of wordcloud2.js is that it expects each word in the list to come not with its frequency but with the font size you want that word displayed in, which has led to this very inelegant frig. I am going to look for a much better solution when I can get my head around it.

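// each line is "count,word"
$ret = explode(",", $ln);
$count = (int)$ret[0];
// scale the larger counts down so the most common words still fit on the page
if ($count > 1000) {
     $size = $count / 100;
} elseif ($count > 100) {
     $size = $count / 10;
} else {
     $size = $count;
}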
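One possible alternative, which is not what the script currently does, is to map the counts onto a fixed range of font sizes using a logarithmic scale. This keeps the ordering intact, so a more frequent word is never drawn smaller than a less frequent one. The `fontSize` function and its default sizes of 10 and 100 are assumptions chosen purely for illustration.

```php
<?php
// Hypothetical helper: maps a word count onto a font size between $minSize
// and $maxSize, where $maxCount is the count of the most frequent word.
function fontSize(int $count, int $maxCount, float $minSize = 10.0, float $maxSize = 100.0): float
{
    if ($count <= 1 || $maxCount <= 1) {
        return $minSize; // log(1) is 0, so single occurrences get the minimum
    }
    // Position of this count on a log scale from 1 to $maxCount, capped at 1.
    $ratio = min(1.0, log($count) / log($maxCount));
    return $minSize + $ratio * ($maxSize - $minSize);
}

// Print the size for a few sample counts, with 1000 as the largest count.
foreach (array(1, 10, 100, 1000) as $count) {
    printf("%4d -> %.1f\n", $count, fontSize($count, 1000));
}
```

Because the top line of words_count_out.txt holds the highest count, `$maxCount` could be read from there before looping over the rest of the file.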

With the data massaged once again, it could be passed to the JavaScript as a JSON string and the word cloud displayed.

// Get the word cloud data from PHP
var db = <?php echo $wordCloudJson; ?>;

// Build the list of [word, size] pairs for wordcloud2.js
var list = [];
for (var i in db) {
    list.push([db[i]["word"], db[i]["freq"]]);
}
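On the PHP side, the JSON string can be produced with `json_encode`. The sketch below is an assumption about how `$wordCloudJson` might be built: the `word` and `freq` keys match what the JavaScript above reads, but the `buildWordCloudJson` function is hypothetical and the font size step is left out to keep it short.

```php
<?php
// Hypothetical helper: turns "count,word" lines into a JSON array of
// objects with "word" and "freq" keys.
function buildWordCloudJson(array $lines): string
{
    $data = array();
    foreach ($lines as $ln) {
        // Limit the split to two parts: the count and the word.
        $ret = explode(',', trim($ln), 2);
        if (count($ret) !== 2) {
            continue; // skip malformed or empty lines
        }
        $data[] = array('word' => $ret[1], 'freq' => (int)$ret[0]);
    }
    return json_encode($data);
}

// Example with two made-up rows.
$wordCloudJson = buildWordCloudJson(array("2,beach", "1,sunny"));
echo $wordCloudJson . "\n";
```

In the real page the lines would come from words_count_out.txt, for example via `file()`, rather than from a hard-coded array.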

Putting it all together

To run the code you need to do the following:

  1. Download the code from GitHub.
  2. Make a copy of your Day One database (DayOne.sqlite) and place it in the same folder as the code. Read about how to find your database in this article.
  3. Run the code: php index.php

The script prints output as it runs and can take quite a while, depending on the number of entries you have in Day One. Once it is done, a page will open in your default browser showing the word cloud.

Wordcloud2.js comes with a lot of options that let you change how the output is formatted, so take a look at the documentation if you want to change that.

There are a couple of improvements that can be made to the display that I will look at over the next few weeks but this is doing exactly what I wanted and allows me to better visualise my data.
