How We Manage to Remove Unused Compiled Nuxt.js Files on AWS S3


TL;DR: I wrote a script that removes old compiled files generated by Nuxt.js after they have been uploaded to AWS S3. The script is intended to be used in a Continuous Integration environment (we use Gitlab CI). If you are only looking for the solution, scroll to the bottom of this post. If you want the story, keep reading.

…My journey of learning to write shell scripts…

To give you some perspective, at Backstreet Academy we deploy changes to our website frequently, to both staging and production: around 15 deployments per day on average. This creates a lot of compiled files in our S3 buckets. Since we implemented automated CDN deployment, hundreds of compiled JavaScript and CSS files created by Nuxt.js have been sitting in our S3 buckets, and only a small percentage of them are actually used by the website. Given this situation, we needed something that deletes the unused compiled files generated by Nuxt.js. I also mentioned this idea in my previous post.

To remove something that is old or unused, we first need an identifier that tells us which files those are. At first we wanted to implement some storage that saves the state of each deployment; it could be a simple static file in AWS S3 or a database. We could then use that state to determine which files can be removed. But this idea is a little complex to implement, and I didn’t want that. Then we thought, “Why not just use the S3 bucket prefix as the identifier?” This approach seemed simpler: we just need to fetch the prefixes and determine which compiled files can be removed.

Setting Up Identifier

We did a quick search for a hash algorithm whose output can be sorted from newest to oldest, and we couldn’t find one. Maybe we didn’t spend enough time on it; if you know of one, please let us know. So we decided to use a Unix timestamp as the identifier. This timestamp is generated and stored in an environment variable file during docker build. This is what it looks like in our build script.
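A minimal sketch of how such a timestamp could be generated and stored. Only the variable name BUILD_TIME comes from this post; the file name `.env` and the `KEY=value` format are assumptions for illustration.

```shell
#!/bin/sh
# Assumption: the environment file is called .env; adjust to your build setup.
ENV_FILE="${ENV_FILE:-.env}"

# date +%s prints the seconds since the Unix epoch,
# so a later build always gets a larger value.
BUILD_TIME="$(date +%s)"

# Append the value so the build tooling can read it later.
echo "BUILD_TIME=${BUILD_TIME}" >> "$ENV_FILE"
```

Because the value only ever grows, sorting the prefixes numerically is enough to find the newest build.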

Then we can use BUILD_TIME as the directory/prefix in the S3 bucket in our gulpfile.js, the script we use for compiling the files and uploading them to the CDN. If you have seen my previous post, I had one gulp task that uploads everything. So at first I tried to upload everything (including the images) to the S3 bucket with the BUILD_TIME prefix, which means an asset URL looks like this: https://xyz.cloudfront.net/website/1234567890/app.js.

This increased the website build time, because every build uploaded all the static files, including the images, and I didn’t want that. Among those static files, only the CSS and JavaScript files change frequently between builds (e.g. when fixing bugs or adding features); images rarely change. So we decided to separate the images from the CSS and JavaScript, which means we needed two different gulp tasks: one for images and one for CSS and JavaScript files.

This is what the gulpfile.js looks like:

Remove the files

Now that we have the identifier, we need to figure out how to remove those directories from the S3 bucket. Since a Unix timestamp is used as the identifier, we can sort the values and exclude the newest one, because that static directory is the one being used by the website. The removal happens after the deployment stage, and we wanted to use our existing deployment docker image (backstreetacademy/docker-aws), which is also used for the build and deployment stages on our Gitlab CI. The image is based on Alpine Linux, which means I needed to write a shell script for this task.

Writing shell scripts is kind of a journey for me, because I rarely do it, and whenever I do, I feel like I have discovered a new world with so many things to explore in Linux. The first idea that came to mind was to grab the timestamps from the file paths in the S3 bucket and put them in an array that can be sorted from newest to oldest. Then we drop the newest timestamp from the list and remove all files under the older timestamp prefixes.

But how do we grab the timestamp value from the prefix?

All the CSS and JavaScript file names uploaded to the S3 bucket have been replaced with a hash (e.g. 1234abcd.js), so it is kind of difficult to pick out an individual file. There are two options: either we add BUILD_TIME to the file name (e.g. abcd.1234567890.js), or we have a file with a constant name that can be used in the fetching process. We chose the latter, which means we create a marker file with a constant name to make this process easier. This file is generated during the build stage of our Gitlab CI pipeline. So I changed the gulpfile.js that I explained above by adding the line below to the deploy-css-js task.

The code snippet above basically creates a static file named build, containing the value of BUILD_TIME, in the dist directory. The file is then uploaded during the build stage to that build's directory in the S3 bucket. Using this file, we can now easily grab each individual file path from the S3 bucket using the AWS CLI, which is already available in the backstreetacademy/docker-aws image. At first I tried to use the aws s3 ls command to grab those file paths, but I couldn’t figure out how to do it properly. What I wanted to get was paths like this:

/static/1234567890/build
/static/0987654321/build

Instead, I only got the content of the prefix, like this:

/image-one.jpg
/image-two.jpg
/image-three.jpg

When I ran the command with the recursive option, the result showed all of the files in the bucket, which is too much. Even though I could use grep to filter the result, the process would take time. So I decided to use s3api, which exposes much more of the AWS S3 API than the s3 command does.

With s3api, I can get the list of build files using list-objects-v2. Here is the full command that I ran to get the list of build files:

$ aws s3api list-objects-v2 --bucket my-bucket --prefix "static" --query "Contents[?contains(Key, 'build')]" --output json

The data from the command then needs to be parsed in order to get the list of build file paths. That’s where jq comes in (jq is also already installed in our backstreetacademy/docker-aws image). We just need to pipe the result to this command: jq -r '.[] | .Key'. It basically says: take each item of the array in the JSON output and return its Key value. The result ends up like this:

$ aws s3api list-objects-v2 --bucket my-bucket --prefix "static" --query "Contents[?contains(Key, 'build')]" --output json | jq -r '.[] | .Key'
static/1111111111/build
static/2222222222/build
static/3333333333/build

We now have the list of file paths, and what we need to do is extract the timestamp from each path. I did a quick search and found out that we can use sed to do the job. We just need to remove the prefix (static/) and the suffix (/build) from the file path.

$ aws s3api list-objects-v2 --bucket my-bucket --prefix "static" --query "Contents[?contains(Key, 'build')]" --output json | jq -r '.[] | .Key' | sed -e 's/^static\///' | sed -e 's/\/build$//'
1111111111
2222222222
3333333333
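As a side note, the two sed calls can be folded into a single invocation. The sketch below runs on sample keys instead of a real bucket, and uses `|` as the sed delimiter so the slashes do not need escaping; the variable names are only illustrative.

```shell
#!/bin/sh
# Sample keys standing in for the jq output (illustrative values).
KEYS="static/1111111111/build
static/2222222222/build
static/3333333333/build"

# One sed process, two expressions:
# strip the leading "static/" and the trailing "/build".
TIMESTAMPS="$(printf '%s\n' "$KEYS" | sed -e 's|^static/||' -e 's|/build$||')"

printf '%s\n' "$TIMESTAMPS"
```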

The line is too long for me, so I decided to put the result from s3api into a variable to make it look clearer.
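One way this could look, as a sketch: the long call is wrapped in a small function and its output captured in OBJECTS. The function name `fetch_build_objects` and the BUCKET variable are assumptions for illustration; the aws arguments are the ones used in the commands earlier in this post.

```shell
#!/bin/sh
# Assumption: the bucket is called my-bucket, as in the example commands.
BUCKET="my-bucket"

# Hypothetical helper: keeps the long s3api call in one place.
fetch_build_objects() {
  aws s3api list-objects-v2 \
    --bucket "$BUCKET" \
    --prefix "static" \
    --query "Contents[?contains(Key, 'build')]" \
    --output json
}

# Usage (needs AWS credentials and jq):
#   OBJECTS="$(fetch_build_objects)"
#   echo "$OBJECTS" | jq -r '.[] | .Key'
```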

After getting the timestamps, I needed to store them in an array so that they could be sorted later, as in my initial idea. After some trial and error, I couldn’t figure out how to create an array in the shell. Then I found out that arrays are only available in shells such as Bash, and our backstreetacademy/docker-aws image uses Alpine Linux as its base image, which doesn’t have Bash (/bin/bash) installed by default; instead it ships a Bourne-compatible shell (BusyBox ash, available as /bin/sh).

“Duh”

Since we cannot use an array as temporary storage, I had to find another way to manage the timestamp values. There is a command called sort that does the sorting, and to sort in reverse order you pass the -r flag. Why reverse? Because the newer the timestamp, the larger its value. Pipe the result to sort -r and we get the timestamps sorted from newest to oldest.

Once we have the sorted result, we need to remove the first value, because the newest timestamp is the prefix currently used in the website's static file paths. I found out that we can do that with tail. I use tail a lot when analyzing server logs, but I had only just learned that passing -n +2 makes it start its output at the second line, which excludes the first line. It seems meant for a different purpose, but hey, it removes the first value from the result!
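The sort and tail steps can be tried on sample data without touching S3. In this sketch the `-n` flag (numeric sort) is my addition; plain `sort -r` gives the same order as long as every timestamp has the same number of digits.

```shell
#!/bin/sh
# Sample timestamps in arbitrary order (illustrative values).
TIMESTAMPS="2222222222
3333333333
1111111111"

# sort -rn   : numeric sort, newest (largest) value first.
# tail -n +2 : start the output at line 2, which drops the newest build
#              (the one the website is still using).
OLD_BUILDS="$(printf '%s\n' "$TIMESTAMPS" | sort -rn | tail -n +2)"

printf '%s\n' "$OLD_BUILDS"
```

What remains in OLD_BUILDS is the list of prefixes that are safe to clean up.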

Now we have the prefixes of the files that can be removed from the S3 bucket. All we need to do is remove all the files inside each prefix/directory. We can do this in two steps: first get all of the file paths under that prefix, then remove them. Both steps require s3api.

To get the list of files inside a prefix, we need to feed the result to aws s3api list-objects-v2. But for the command to consume piped values, we need to wrap it in a shell function that reads them from the pipeline output.
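A minimal sketch of such a wrapper, assuming the `static/<timestamp>/` key layout used in this post. The function is called `list` to match the pipeline in this post; the BUCKET variable and the `Contents[]` query are assumptions.

```shell
#!/bin/sh
# Assumption: bucket name as in the example commands.
BUCKET="my-bucket"

# Reads one timestamp per line from stdin and prints, as JSON,
# the objects stored under each "static/<timestamp>/" prefix.
list() {
  while read -r build_time; do
    # </dev/null keeps the aws call from consuming the loop's stdin.
    aws s3api list-objects-v2 \
      --bucket "$BUCKET" \
      --prefix "static/${build_time}/" \
      --query "Contents[]" \
      --output json </dev/null
  done
}

# Usage: printf '1111111111\n' | list
```

The `while read` loop is what lets a plain /bin/sh function receive values from a pipe, one line at a time, without needing arrays.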

Once we get the result, we need to extract the Key from the JSON output, which means we need to use jq again here.

# …other process
echo "$OBJECTS" | jq -r '.[] | .Key' | sed -e 's/^static\///' | sed -e 's/\/build$//' | sort -r | tail -n +2 | list | jq -r '.[] | .Key'

The last thing is to remove the files using the extracted Key values. Again, this removal needs to be wrapped inside a shell function in order to receive the piped result from jq.
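A sketch of what the removal wrapper could look like. The function name `remove` and the BUCKET variable are hypothetical; `aws s3api delete-object` with `--bucket` and `--key` deletes a single object.

```shell
#!/bin/sh
# Assumption: bucket name as in the example commands.
BUCKET="my-bucket"

# Hypothetical helper: reads one object key per line from stdin
# and deletes each object from the bucket.
remove() {
  while read -r key; do
    # </dev/null keeps the aws call from consuming the loop's stdin.
    aws s3api delete-object --bucket "$BUCKET" --key "$key" </dev/null
  done
}

# Usage: printf 'static/1111111111/app.js\n' | remove
```

For a dry run, the aws call can be swapped for an `echo "$key"` to see which objects would be deleted.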

That’s it. This is the script that handles the removal of old compiled files in our S3 bucket. Before we used it in production, I ran into a small problem when testing the script in our staging environment. I had assumed that the output of s3api is always JSON, because that is what I got when I tested it locally. In that environment, however, the output was plain text, so I needed to pass --output json explicitly. The first run took a while, since we had hundreds of old compiled files previously stored in our S3 buckets. But that was only the first run; subsequent runs are pretty quick.

Putting the script on Gitlab CI Pipeline

The script above is just the bare minimum. To make it usable in our Gitlab CI pipeline, we need to take the bucket name as an argument of the script. The end result looks like this:

Since this script must run after the deployment has completed, we added an additional stage to our pipeline: the clean stage.
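A sketch of how such a script could be put together from the steps described in this post, with the bucket name passed as the first argument. The function names, the single-sed form, and `sort -rn` are my assumptions, not necessarily what runs in our pipeline.

```shell
#!/bin/sh
# Sketch of a cleanup script. Usage: ./clean.sh <bucket-name>
# Assumptions: keys look like static/<timestamp>/..., and every build
# uploads a marker file named "build" under its prefix.

# Print the timestamps of all builds except the newest one.
old_builds() {
  aws s3api list-objects-v2 --bucket "$1" --prefix "static" \
    --query "Contents[?contains(Key, 'build')]" --output json |
    jq -r '.[] | .Key' |
    sed -e 's|^static/||' -e 's|/build$||' |
    sort -rn |
    tail -n +2
}

# Print every key stored under the given build timestamp.
list_keys() {
  aws s3api list-objects-v2 --bucket "$1" --prefix "static/$2/" \
    --query "Contents[]" --output json </dev/null |
    jq -r '.[] | .Key'
}

# Delete all objects that belong to old builds.
clean() {
  bucket="$1"
  old_builds "$bucket" | while read -r build_time; do
    list_keys "$bucket" "$build_time" | while read -r key; do
      aws s3api delete-object --bucket "$bucket" --key "$key" </dev/null
    done
  done
}

# Only run when a bucket name was passed on the command line.
if [ -n "${1:-}" ]; then
  clean "$1"
fi
```

It would be invoked as, for example, `./clean.sh my-bucket`, where the script file name is hypothetical.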

That’s all. This is the story of how we found a way to remove old compiled files generated by Nuxt.js from our S3 buckets. I hope this helps you solve your problem, or at least that you learned something from this article. If you have ideas that could improve this, let me know, because I might have missed something important. See you in my next post!

Lead Software Engineer at Mekari — Empower businesses and professionals to progress effortlessly (https://mekari.com/)