How to use Amazon’s S3 web service for Scaling Image Hosting

Most startups have been there – you have a simple site, and you want to have users upload photos of themselves or something else to share.  We were there as well just a few years ago, when building out the very first versions of TeachStreet.  While previously working at Amazon, I worked on a few image hosting solutions and already knew some of the pitfalls and challenges of building out a system to scale.

Here were some of our high-level requirements:

  • Keep redundant copies of images in case of failure
  • Allow dynamic resizing and cropping of images (so we don’t have to pre-generate them)
  • Must be fast (but cheap)
  • Must scale independently of our core web application

Having previously worked on keeping source images in sync between multiple hosts, we knew that it could be a challenge and, when a host fails, a huge pain.  Right around that time, S3 gained traction, and it solved our redundant-copies problem.  We could push our images to Amazon and never have to worry about backing them up or keeping extra copies in case of hardware failure (this became Amazon’s problem).

Our Solution

We chose to write a separate Rails app to serve these images and handle the resizing, cropping, or any other effects we needed.  RMagick (a Ruby binding for ImageMagick) applies these transformations for us, and the app serves the resulting image back to the user.  The process is as follows:

  1. Handle request
  2. Fetch original source image from S3
  3. Resize/apply effects
  4. Return result back to user
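
The four steps above can be sketched roughly as follows.  This is a minimal illustration, not our actual app: `ImageRequest`, `parse_request`, `serve_image`, and the `/<id>/<width>x<height>.jpg` URL layout are all hypothetical, and the `fetch_source` and `resize` callables are stand-ins for whatever wraps the S3 client and RMagick.

```ruby
# Hypothetical sketch of the request pipeline; names and URL layout are assumptions.
ImageRequest = Struct.new(:id, :width, :height)

# Step 1: parse a path such as "/abc123/100x75.jpg" into an ImageRequest.
# Returns nil when the path does not match the assumed layout.
def parse_request(path)
  match = %r{\A/([A-Za-z0-9\-]+)/(\d+)x(\d+)\.jpg\z}.match(path)
  return nil unless match

  ImageRequest.new(match[1], match[2].to_i, match[3].to_i)
end

# Steps 2-4: fetch the original, transform it, and return [status, bytes].
# fetch_source and resize are injected so the storage backend (S3) and the
# image library (RMagick) stay out of this sketch.
def serve_image(path, fetch_source:, resize:)
  request = parse_request(path)
  return [404, nil] if request.nil?

  source = fetch_source.call(request.id)
  return [404, nil] if source.nil?

  [200, resize.call(source, request.width, request.height)]
end
```

Keeping the fetch and resize steps behind plain callables also makes it easy to put a cache in front of either one later.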

Image Server

Now we need to go back and optimize for the “fast” requirement.  Going to S3 (and resizing) on every request takes time.  For performance, each of our image servers caches the source image, and every resized version, on local disk.  Since images are never updated (only created), and each one gets a unique ID, we don’t have to worry about cache invalidation, only expiration.  We can then write a simple script that removes files from this disk cache once their last access time is older than a certain threshold (say 30 days).  That way, if we change from one thumbnail size to another, the old thumbnail sizes will eventually get purged.
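
As a rough sketch of the disk cache and its expiration script, assuming a cache directory keyed by relative file path: `read_through` and `purge_stale_cache` are hypothetical names, and the block passed to `read_through` stands in for the S3 fetch or the resize.

```ruby
require "fileutils"

# Return the cached bytes for key, or run the block, store its result on
# disk, and return it. The block stands in for the slow S3 fetch or resize.
def read_through(cache_dir, key)
  path = File.join(cache_dir, key)
  return File.binread(path) if File.file?(path)

  bytes = yield
  FileUtils.mkdir_p(File.dirname(path))
  File.binwrite(path, bytes)
  bytes
end

# Delete cached files whose last access time is older than max_age_days.
# Returns the list of removed paths. The 30-day default mirrors the
# threshold suggested in the text.
def purge_stale_cache(cache_dir, max_age_days: 30, now: Time.now)
  cutoff = now - (max_age_days * 24 * 60 * 60)
  stale = Dir.glob(File.join(cache_dir, "**", "*")).select do |path|
    File.file?(path) && File.atime(path) < cutoff
  end
  stale.each { |path| File.delete(path) }
  stale
end
```

Note that this depends on the filesystem recording access times; on a volume mounted with noatime, the script would need a different signal, such as modification time.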

Performance Optimizations

Implementation-wise, we also have a few more tricks up our sleeves to eke out performance.  First, since we front our Rails app with nginx, we can use X-Sendfile (which nginx implements as the X-Accel-Redirect header) to return just the file location to nginx.  This spares the Rails app from streaming the file data back, and on subsequent requests the app only does a file lookup on disk (it doesn’t have to read the contents of the file).  Also, we can ensure that all files are converted to JPEGs, then optimized and stripped of any extra header info.  This minimizes the file size before sending them back, which improves overall latency and throughput for later requests.
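
To illustrate the hand-off to nginx, here is a minimal Rack-style sketch.  The function name `accel_redirect_response` and the `/cache/` internal location prefix are assumptions for illustration; nginx would need a matching location marked `internal` that maps onto the disk cache.

```ruby
# Build a Rack-style response that delegates file delivery to nginx.
# The X-Accel-Redirect header names an internal nginx location (not a raw
# filesystem path); nginx then serves the file itself, so the app returns
# an empty body. The "/cache/" prefix is an assumption for this sketch.
def accel_redirect_response(cache_key, internal_prefix: "/cache/")
  headers = {
    "Content-Type" => "image/jpeg",
    "X-Accel-Redirect" => File.join(internal_prefix, cache_key)
  }
  [200, headers, []]
end
```

The app still decides which file to serve (and generates it on a cache miss), but the bytes themselves never pass through the Ruby process.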

Lastly, on the client side, we can trick browsers a bit further.  By creating extra DNS entries for the same servers, we can make the browser think that these are different servers.  Many modern browsers allow a maximum of four simultaneous requests per host.  Our web app is then responsible for distributing the requests: by hashing the URL and taking the result modulo four, we can evenly distribute the images across four hostnames.  This lets browsers parallelize requests where they are able to, at the slight cost of extra DNS lookups.
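
The hash-and-mod step might look something like this.  The hostnames and the helper name `image_url` are illustrative assumptions, and CRC32 is just one convenient stable hash; any deterministic hash would do.

```ruby
require "zlib"

# Hostnames are assumed for illustration; all of them resolve to the same servers.
IMAGE_HOSTS = %w[
  images1.teachst.com
  images2.teachst.com
  images3.teachst.com
  images4.teachst.com
].freeze

# Map an image path to one of the hostnames. CRC32 gives the same value in
# every process, so a given image always gets the same URL and stays
# cacheable by the browser.
def image_url(path, hosts: IMAGE_HOSTS)
  host = hosts[Zlib.crc32(path) % hosts.size]
  "http://#{host}#{path}"
end
```

Ruby’s built-in `String#hash` would be a poor choice here because it is seeded differently in each process, so the same image could end up on a different hostname after every restart.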

By leveraging Amazon’s S3 web service technology, we’ve been able to reduce our overhead in having to build/manage a redundant file store.

Example requests

Further steps/more optimization?

Still, there are more steps we could take to optimize this further, if needed.  First, if we know the commonly requested image sizes & effects, we could prime the cache on image upload.  This would avoid the extra round trip to S3 except in a failure case.  If our caches begin to get very large (as we scale), we could use DNS to map hostnames to different servers, increase the number of DNS entries (modding out to a larger set), or route to different servers based on URL (for different image sizes, etc.).
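
Priming the cache on upload could be sketched as below.  This is only an outline under assumptions: `prime_cache`, the `COMMON_SIZES` values, and the `<cache_dir>/<image_id>/<width>x<height>.jpg` layout are made up for illustration, and `resize` is an injected stand-in for the real image library.

```ruby
require "fileutils"

# Sizes we expect to be requested often; these values are made up for the sketch.
COMMON_SIZES = [[50, 50], [100, 75], [400, 300]].freeze

# Write the source image and each common size into the local disk cache at
# upload time, so the first request for them does not have to go back to S3.
# Returns the paths of the resized files that were written.
def prime_cache(cache_dir, image_id, source_bytes, resize:, sizes: COMMON_SIZES)
  image_dir = File.join(cache_dir, image_id)
  FileUtils.mkdir_p(image_dir)
  File.binwrite(File.join(image_dir, "original"), source_bytes)

  sizes.map do |width, height|
    path = File.join(image_dir, "#{width}x#{height}.jpg")
    File.binwrite(path, resize.call(source_bytes, width, height))
    path
  end
end
```

With several image servers, this only warms the cache on the server that handled the upload; the others would still fall back to S3 on their first request.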

Right now, most of our users are in the US.  If we had an international site, we might consider using different S3 backends for storage (in Singapore, Hong Kong, Japan, or Europe), as well as using a CDN to front images.  Generally speaking, CDNs are quite expensive for scrappy little startups like us.  We could even consider using Amazon CloudFront as our CDN.

Alternatives?

Other alternatives we’ve seen to this problem have varied.  Paperclip is a great plugin that provides much of the same functionality, but it doesn’t provide on-the-fly resizing, and it is usually attached to a database model (our solution relies on external GUIDs for each image).  Cassandra (or MongoDB with GridFS) could also be an alternative backend to S3 if the latency on non-cached requests needs further improvement.

This entry was posted in Building TeachStreet, Engineering.
  • waloeiii

    Instead of managing the cache yourself you could store these in memcache and retrieve them with the nginx memcache module. You could also do something similar with Varnish.

  • http://blog.daryn.net daryn

    Good suggestion.

    In this case, however, individual images are long-lived, and accessed infrequently, so disk-based caching made more sense than in-memory caching. That would certainly be a smart design for images that are in more demand, like on the TeachStreet homepage.

  • http://blog.sentientmonkey.com Scott Windsor

    Daryn's right for our use cases; unless we really wanted to eke out better performance for certain scenarios, memcache wouldn't help much. We also don't have much hardware, so running a big memcache instance locally might steal RAM away from our mongrels and other apps on the machines. I haven't had a chance to play with the nginx memcache module yet, but I've heard good things.

    Varnish looks pretty cool as well – I'll have to check that out. We've mostly stuck to nginx so far because #1 it's really simple, and #2 it's really fast.

  • Matt

    In our application we did something similar, however we didn't use S3. We wrote a custom nginx module to md5 hash the path (which contained resizing parameters) and let nginx find the file on the filesystem, falling back to resizing, caching and serving the resized image when it didn't exist. We found it sped things up considerably because it bypassed the application altogether, and nginx just served the image from the filesystem.

    Here's the module we made, it only works on our legacy version of nginx (0.6.35), but you may be able to get it to work on new versions…

    http://github.com/kaleidomedallion/interface_nginx

  • http://blog.sentientmonkey.com Scott Windsor

    Nice! I haven't been hardcore enough to write a custom nginx module yet. :-)

    This could be a nice optimization for us – using x-sendfile frees up our app for streaming and the return, but it still has to hit the rails stack. For images that we've already generated, it would be pretty awesome to serve them directly from nginx.

  • http://www.facebook.com/jeffiel Jeff Lawson

    Seems like there should be a webservice out there that specializes in image hosting. I.e., you POST an image, with some params such as rescaled sizes you want, and it gives you back a set of URLs for different sizes and formats. Any takers?

  • http://www.nosnivelling.com daveschappell

    Check out http://transloadit.com/

    Someone on HackerNews mentioned it today, and it looks like just what the doctor ordered!

  • http://cyberfox.com/blog Morgan

    Greetings,
    I talked about something like this on the S3 forums a while ago; we used S3 as the image store, but what we did was pre-resize images to a set of sizes. The user uploaded directly to S3 (jquery forms rock!), and was redirected to our site on completion. We triggered a background job off of that to pull down the image, resize it, and push the resized images back up.

    The truth is we didn't need it to be an arbitrary number of different sizes; we knew what our thumbnail sizes would be in advance, and S3 storage is sufficiently cheap that we didn't think twice about pushing up shrunken versions. We also kept the originals, but in 'private' storage, so only our app could access them, in case we needed to re-resize later.

    The thumbnails (shrunken versions) we marked as public on S3, so we could serve them directly from Amazon, and never needed to touch our servers. Finish up with a few rotatable CNAMEs to point to our S3 bucket (like your asset hosts) and image hosting upload or view never touches our app server bandwidth, and is fast for the end user.

    It's a fun problem, and your solution makes sense for your constraints.

    – Morgan

  • http://blog.sentientmonkey.com Scott Windsor

    Nice! I like the idea of pushing directly to & serving from S3, but it didn't quite work for our constraints (on-the-fly resizing). It would be pretty cool though, if there was a good solution for batch-resizing existing images directly in S3. Maybe a map-reduce or ec2 task? :-)

  • http://cloudberrylab.com cloudberryman

    I always enjoy learning what other people think about Amazon Web Services and how they use them. Check out my very own tool, CloudBerry Explorer, which helps to manage S3 on Windows. It is freeware. http://s3.cloudberrylab.com/

  • http://twitter.com/cowboyx CowboyX

    I have a question about your CNAMEs. My understanding was that you could only apply one CNAME to each S3 bucket (e.g. if “images1.teachst.com”, Amazon would automatically pull content from the single bucket named “images1.teachst.com”). Are you using separate buckets, or did you figure out a trick around this?

  • teachstreet

    I believe you are correct about 1 CNAME per bucket.
    images*.teachst.com point to our own servers that run the Rails app
    that Scott is talking about here.

    This, by the way, is another optimization: we used teachst.com instead of
    teachstreet.com so that we wouldn't have the overhead of any cookies we are
    using in the main web-app being sent in requests and responses, when they
    aren't needed at all for any of the image requests.

  • http://blog.sentientmonkey.com Scott Windsor

    Thanks! If I ever ran windows, I'd check it out. Right now I only ever use it for IE testing. :-)

  • http://blog.sentientmonkey.com Scott Windsor

    Also, we just heard about http://www.uploadjuicer.com/ which is a local Seattle Startup that *just* launched. Cool stuff.

  • Bucketexp

    You can try the Bucket Explorer tool, available for Mac, Linux and Windows. You can upload images to S3 and view them using our Web URL generator.
    http://www.bucketexplorer.com/documentation/amazon-s3–how-to-generate-url-for-amazon-s3-file.html