Storing Images with SeaweedFS
Background
I have been working on a web app that stores and displays screenshots (and related test metadata) taken from automated UI tests. Its goal is to reduce the time spent debugging these tests. It also enables live monitoring of tests and provides a kind of audit trail.
A key requirement for the app is to store a lot of images (10-100 million).
Issue with MongoDB GridFS
An early prototype used MongoDB to store the images using GridFS. However, one major problem made me search for an alternative.
After deleting old images, the disk space used was not automatically reclaimed. Apart from being annoying and requiring extra maintenance, this caused a problem once the disk was more than half full: it was no longer possible to reclaim the space.
It also did not seem as efficient or as architecturally clean, because the image content had to be read from the database and then served by the Python server.
SeaweedFS vs. Others
After some research I found SeaweedFS (previously WeedFS), which seemed to closely fit my needs for this project.
Here is a quick comparison of SeaweedFS against alternatives I considered.
SeaweedFS
- Besides storing a large number of files efficiently, SeaweedFS also aims to serve them fast, usually with a single disk read per file.
- The other benefit is deployment. It is written in Go, so it is compiled and statically linked, and deployment only requires a single file. There is nothing else to install.
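To illustrate the read path, here is a minimal sketch of how an app could turn a stored file id (fid) into a URL that the browser fetches straight from a volume server, without the image passing through the Python server. It assumes the master's `/dir/lookup` endpoint and its JSON reply as described in the SeaweedFS README. The host names, the sample fid and the helper names are made up for illustration.

```python
import json


def volume_id(fid: str) -> str:
    """Return the volume id, i.e. the part of a fid before the comma."""
    vid, sep, rest = fid.partition(",")
    if not sep or not vid.isdigit() or not rest:
        raise ValueError(f"not a valid fid: {fid!r}")
    return vid


def lookup_url(master: str, fid: str) -> str:
    """URL to ask the master which volume servers hold this fid's volume."""
    return f"http://{master}/dir/lookup?volumeId={volume_id(fid)}"


def image_url(lookup_body: str, fid: str) -> str:
    """Build a direct image URL from the master's lookup reply (JSON text)."""
    locations = json.loads(lookup_body).get("locations", [])
    if not locations:
        raise LookupError(f"no volume server found for {fid!r}")
    # Any listed location can serve the file; this sketch takes the first.
    return f"http://{locations[0]['publicUrl']}/{fid}"
```

The lookup result changes rarely, so an app would probably cache it per volume id rather than ask the master for every image.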
Amazon S3
- S3 is more complex than required
- It is cloud-only: traffic would have to go via the internet even if everything else is local.
MongoDB’s GridFS
- No automatic reclaiming of unused space
- Reclaiming space also fails once more than half of the disk is in use
- More complex than Seaweed
- Requires custom code to serve images
GlusterFS
- Too many features that this project did not require
- Relatively complex to install (the “Quick Start” has around 700 words)
MogileFS
- More complex than Seaweed. Compared to MogileFS’s three-layer structure (dbnode, tracker, storage node), SeaweedFS has only two layers (master node, volume node). SeaweedFS is basically a key->file store.
- Relatively complex to install: you may need to install Perl, libraries, databases, etc.
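The two-layer, key->file design shows up directly in how a file is written: ask the master for a file id, then send the bytes to the volume server it names. The following is a rough sketch of that flow using only the Python standard library. It assumes the `/dir/assign` endpoint and reply fields (`fid`, `url`) as shown in the SeaweedFS README. The function names and the `image/png` default are illustrative choices, not part of SeaweedFS.

```python
import json
import urllib.request
import uuid


def parse_assign(body: str) -> tuple[str, str]:
    """Extract (fid, upload_url) from the master's /dir/assign JSON reply."""
    reply = json.loads(body)
    if "fid" not in reply or "url" not in reply:
        raise RuntimeError(f"unexpected assign reply: {body!r}")
    return reply["fid"], f"http://{reply['url']}/{reply['fid']}"


def multipart_body(filename: str, data: bytes, content_type: str = "image/png"):
    """Encode one file as multipart/form-data; returns (body, Content-Type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def store_image(master: str, filename: str, data: bytes) -> str:
    """Store one image and return its fid (needs a running SeaweedFS)."""
    # Step 1: the master picks a volume and hands out a new file id.
    with urllib.request.urlopen(f"http://{master}/dir/assign") as resp:
        fid, upload_url = parse_assign(resp.read().decode())
    # Step 2: the bytes go directly to the volume server, not via the master.
    body, ctype = multipart_body(filename, data)
    req = urllib.request.Request(
        upload_url, data=body, headers={"Content-Type": ctype}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return fid
```

With a local master on its default port, `store_image("localhost:9333", "shot.png", png_bytes)` would return something like `3,01637037d6`, which is the only thing the app's database needs to remember about the image.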
Architectural Trade-Offs
As with any architectural choice there are trade-offs.
By decoupling the storage of images from the DB, the two can scale independently. However, this does mean extra work to manage the link between data and files in Seaweed, extra parts in the overall deployment, and having to learn how Seaweed works.
Advantages
- Storage capacity and read load scale independently from the DB
- Faster and more efficient, since images are no longer read from the database and passed through the Python server
- Less custom code to serve the images
Disadvantages
- Custom code needed to maintain the link between metadata and images
- Separate backup strategy required
- Extra component to deploy and learn, compared with a custom file system or database storage
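The custom code for the metadata-to-image link can be quite small. Below is one possible sketch, not the app's actual code: the metadata record stores the image's file id, and the order of operations is chosen so that a failure leaves at worst an orphaned image, never a record pointing at a missing one. `InMemoryBlobStore` and the `records` dict are hypothetical stand-ins for SeaweedFS and the database so the example runs on its own.

```python
class InMemoryBlobStore:
    """Stand-in for SeaweedFS: maps a generated file id to image bytes."""

    def __init__(self):
        self._blobs = {}
        self._counter = 0

    def put(self, data: bytes) -> str:
        self._counter += 1
        fid = f"1,{self._counter:08x}"  # made-up id in the "volume,key" shape
        self._blobs[fid] = data
        return fid

    def delete(self, fid: str) -> None:
        self._blobs.pop(fid, None)

    def __contains__(self, fid: str) -> bool:
        return fid in self._blobs


class ScreenshotRepo:
    """Keeps metadata records and stored images in step with each other."""

    def __init__(self, blobs):
        self.blobs = blobs
        self.records = {}  # stand-in for a DB collection: test_id -> metadata

    def add(self, test_id: str, image: bytes, **meta) -> str:
        fid = self.blobs.put(image)  # store the image first
        try:
            # A real DB write can fail; the dict here only models it.
            self.records[test_id] = {"fid": fid, **meta}
        except Exception:
            self.blobs.delete(fid)  # do not leave an orphaned image behind
            raise
        return fid

    def remove(self, test_id: str) -> None:
        # Drop the metadata first: if the image delete then fails, the result
        # is wasted space rather than a record with a broken image link.
        record = self.records.pop(test_id)
        self.blobs.delete(record["fid"])
```

A periodic job that compares stored file ids against the database could clean up any orphans this ordering leaves behind.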