Storing Images with SeaweedFS

November 22, 2016

Background

I have been working on a web app¹ that stores and displays screenshots (and related test metadata) taken from automated UI tests. Its goal is to reduce the debugging time of these tests. It also enables live monitoring of tests and provides a sort of audit.

A key requirement for the app is to store lots of images (10-100 million).

Issue with MongoDB GridFS

An early prototype used MongoDB to store the images using GridFS. However, one major problem made me search for an alternative.

After deleting old images, the disk space they had used was not automatically reclaimed. Apart from being annoying and requiring extra maintenance, this caused a problem once the disk was more than half full: it was no longer possible to reclaim the space.

GridFS also seemed less efficient and less architecturally clean, because the image content had to be read from the database and then served by the Python server.

SeaweedFS vs. Others

After some research I found SeaweedFS (previously WeedFS), which seemed to closely fit my needs for this project.

Here is a quick comparison of SeaweedFS against alternatives I considered.

Besides efficiently storing a lot of files, SeaweedFS also aims to serve files fast, mostly with only one disk read operation.
The other benefit is deployment. SeaweedFS is written in Go, so it is compiled and statically linked, and deploying it only requires a single file. There’s nothing else to install.

Amazon S3

S3 is more complex than required
It is cloud only - traffic would have to go via the internet even if everything else is local.

MongoDB’s GridFS

No automatic reclaiming of unused space
- Reclaiming also fails once more than half of the disk space is in use
More complex than Seaweed
Requires custom code to serve images

GlusterFS

Too many features that this project did not require
Relatively complex to install (the “Quick Start” has around 700 words)

MogileFS

More complex than Seaweed

Compared to MogileFS’s three-layer structure (dbnode, tracker, storage node), SeaweedFS only has two layers (master node, volume node). SeaweedFS is basically a key->file store.
Relatively complex to install: you may need to install Perl, libraries, databases, etc.
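To make the two-layer, key->file idea concrete, here is a minimal sketch of a write using only the Python standard library. It assumes a master node on its default port 9333 and follows the assign-then-upload flow from the SeaweedFS README; the helper names (`parse_assign`, `file_url`, `multipart_body`, `upload`) are my own and not part of any SeaweedFS client.

```python
import json
import urllib.request
import uuid

MASTER = "http://localhost:9333"  # assumption: master node on the default port


def parse_assign(body: str) -> tuple[str, str]:
    """Extract (fid, volume node address) from the master's /dir/assign reply."""
    reply = json.loads(body)
    if reply.get("error"):
        raise RuntimeError(reply["error"])
    return reply["fid"], reply["url"]


def file_url(fid: str, volume: str) -> str:
    """URL on the volume node where the file is written and later read."""
    return f"http://{volume}/{fid}"


def multipart_body(filename: str, data: bytes, content_type: str) -> tuple[bytes, str]:
    """Build a one-file multipart/form-data body and its Content-Type header."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload(data: bytes, filename: str, content_type: str = "image/png",
           master: str = MASTER) -> str:
    """Store one image and return its fid (the key to keep with the metadata)."""
    # Layer 1: the master node hands out a file id and names a volume node.
    with urllib.request.urlopen(f"{master}/dir/assign") as resp:
        fid, volume = parse_assign(resp.read().decode("utf-8"))
    # Layer 2: the file itself is sent straight to that volume node.
    body, header = multipart_body(filename, data, content_type)
    request = urllib.request.Request(
        file_url(fid, volume), data=body,
        headers={"Content-Type": header}, method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        resp.read()
    return fid
```

Reading an image back is a plain HTTP GET on `file_url(fid, volume)`, so the browser can fetch it from the volume node directly rather than through the Python server.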

Architectural Trade-Offs

As with any architectural choice there are trade-offs.

Decoupling the storage of images from the DB lets the two scale independently. However, it does mean extra work to manage the link between the data and the files in Seaweed, extra parts in the overall deployment, and one more system to learn.

Advantages

Storage capacity and read load scale independently from the DB
Faster and more efficient
Less custom code to serve the images

Disadvantages

Custom code to ensure link between metadata and images
Separate backup strategy required
Extra component vs. custom file system or database storage
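The first disadvantage, keeping metadata and images linked, could look roughly like the sketch below. It is an illustration only: `FakeImageStore` and `FakeMetadataDb` are hypothetical in-memory stand-ins for SeaweedFS and MongoDB, and the function names are assumptions rather than code from the app.

```python
from dataclasses import dataclass, field


@dataclass
class FakeImageStore:
    """Hypothetical in-memory stand-in for SeaweedFS: file id -> image bytes."""
    files: dict = field(default_factory=dict)
    next_id: int = 1

    def put(self, data: bytes) -> str:
        fid = f"1,{self.next_id:08x}"
        self.next_id += 1
        self.files[fid] = data
        return fid

    def delete(self, fid: str) -> None:
        self.files.pop(fid, None)


@dataclass
class FakeMetadataDb:
    """Hypothetical in-memory stand-in for the metadata collection."""
    docs: dict = field(default_factory=dict)

    def insert(self, doc_id: str, doc: dict) -> None:
        self.docs[doc_id] = doc

    def remove(self, doc_id: str):
        return self.docs.pop(doc_id, None)


def save_screenshot(db, store, test_id: str, data: bytes) -> str:
    """Write the image first, then record its fid alongside the metadata."""
    fid = store.put(data)
    try:
        db.insert(test_id, {"test_id": test_id, "fid": fid})
    except Exception:
        # Roll back so a failed insert does not leave an unreferenced file.
        store.delete(fid)
        raise
    return fid


def delete_screenshot(db, store, test_id: str) -> None:
    """Remove the metadata record, then the file it pointed at."""
    doc = db.remove(test_id)
    if doc is not None:
        store.delete(doc["fid"])


def find_orphans(db, store) -> set:
    """Files that no metadata record references (candidates for cleanup)."""
    referenced = {doc["fid"] for doc in db.docs.values()}
    return set(store.files) - referenced
```

Writing the image before the metadata means a crash in between leaves at worst an orphaned file, which a periodic `find_orphans` pass can clean up, rather than a record pointing at a missing image.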

¹ A single-page app and a JSON-over-HTTP (REST) API written with Flask (a Python micro-framework) and MongoDB.