James Tauber

Serving Up User Contributed Media From A Separate Server

One commonly recommended practice in Django (although applicable elsewhere) is to serve up your static media from a different server than the one running Django for dynamic pages.

This becomes a slight challenge when you have user-contributed media (like allowing users to upload photos).

Here are some possibilities I can think of:

I guess S3-based solutions add some extra issues but ideas 2, 3 and 4 would be applicable.

Anyone have experiences (good or bad) with any of these? Any possibilities I'm missing?

Categories: django, web

Comments (16)

Filip Salomonsson on Jan. 26, 2009:

Well, it doesn't necessarily have to be two physically separate servers.

If you run apache+mod_python to serve dynamic content, you can use lighttpd/nginx/whatnot on the same machine to serve static media.

(I'd say the main reason for the recommendation is that it's potentially a big waste of resources to use an apache process with a full python interpreter for a "dumb" task like serving static files.)

Eduardo Padoan on Jan. 26, 2009:

@Filip: The main reason for it is that if your server instance can handle N requests per day and each page has 5 static files, you will end up serving only N/6 page views per day (each page view costs 6 requests: 1 for the dynamic content, 5 for the static content).
http://www.b-list.org/weblog/2008/jun/23/media/
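
The arithmetic in this comment can be written out as a small sketch. The function name and the example numbers are made up for illustration; the only assumption is that the same server answers every request, dynamic and static alike.

```python
def page_view_capacity(request_capacity, static_files_per_page):
    """Page views a server can deliver when it serves every request itself."""
    # Each page view costs one dynamic request plus one request per static file.
    requests_per_page_view = 1 + static_files_per_page
    return request_capacity / requests_per_page_view


# Illustrative numbers only: 600,000 requests of capacity, 5 static files per page.
print(page_view_capacity(600_000, 5))
```

Moving the 5 static requests to another server leaves the full request capacity for dynamic pages.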

James Tauber on Jan. 26, 2009:

For the purposes of my question, assume it IS two physically (or virtually) separate servers.

daaku on Jan. 26, 2009:

Another option is to let users upload direct to S3:

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1434

Wes Winham on Jan. 26, 2009:

I'm in the process of trying to solve this problem with S3. Our current high-level idea is to:

1) Upload the files to a specific upload directory
2) Have a management command run in the background on cron to upload local files to an S3 bucket
3) On successful upload, set a flag on the model to indicate that it should use the S3Storage (not sure if we'll keep two FileFields on the model for this or exactly how the implementation will look)
4) Until the flag is set, serve the file through apache on the same server

So basically, 2 + 4 in your list. It seems to be a reasonable solution.
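
A minimal sketch of the flag-based switch described in steps 2 to 4, in plain Python rather than as a Django model. The class, the URL prefixes and the `push_to_s3` callable are all hypothetical stand-ins; a real implementation would presumably use a model field and a storage backend instead.

```python
from dataclasses import dataclass

# Hypothetical URL prefixes, for illustration only.
LOCAL_MEDIA_URL = "/media/uploads/"
S3_MEDIA_URL = "https://example-bucket.s3.amazonaws.com/"


@dataclass
class Upload:
    filename: str
    on_s3: bool = False  # the flag from step 3


def media_url(upload):
    """Step 4: serve locally until the flag says the file is on S3."""
    base = S3_MEDIA_URL if upload.on_s3 else LOCAL_MEDIA_URL
    return base + upload.filename


def sync_pending(uploads, push_to_s3):
    """Step 2: what a cron-driven command might do.

    push_to_s3 is a hypothetical callable that uploads one file and
    returns True on success.
    """
    for upload in uploads:
        if not upload.on_s3 and push_to_s3(upload.filename):
            upload.on_s3 = True  # step 3: flip the flag only after success
```

Because the flag is flipped only after a successful transfer, a failed or slow sync just means the file keeps being served from the app server a little longer.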

Simon Willison on Jan. 26, 2009:

Serving from two physically separate servers shouldn't be needed for the vast majority of sites. I host all of my Django stuff on Apache/mod_wsgi behind an nginx reverse proxy, and have nginx serve up any static files directly. nginx is lightning fast at this. You can improve front-end performance by serving static files from a separate subdomain, but even that can be handled using just a single nginx for everything using virtual hosts.

If you do need to serve from a physically separate server, it's probably because you're dealing with high scale. If you ARE, you might want to look at a simple distributed filesystem like MogileFS (created to solve this problem at LiveJournal). If you want something a bit simpler, a message queue of some sort to trigger worker processes that move files between servers without living inside your Django processes would be a good idea. Or you could just use S3.

Anders Pearson on Jan. 26, 2009:

I've always used the last one (with a simple CGI script running on the media server that the app server uploads to and that returns the URL where it has stored the file). I've been doing that for years, not so much for performance reasons but because our django/turbogears apps run on virtual servers that I like to keep small and lean, and our IT department maintains the big central media servers (that only have a stripped-down apache available), so anything I can get onto their servers I no longer have to worry about.

I'm still trying to figure out how to get this approach working with sorl.thumbnail. The other downside is that uploads take a bit longer since it has to do browser -> app server -> media server, but the second transfer is on the LAN at least and can often be done asynchronously.

!thinking on Jan. 26, 2009:

Can you give the second server the responsibility of handling the image upload, since it is the one that is going to serve it? It would then have a URL that Django calls with some GUID to associate the upload with the model.

jens persson on Jan. 26, 2009:

Perhaps let the Django server serve the media on another port and then let the media server fetch anything that 404s through that "backchannel".
You would get some hits, but the vast majority would be offloaded to the media server(s), and the media servers would be dead simple without any logic.
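
A rough sketch of that fetch-on-miss idea, in Python for illustration (in practice this would more likely be a proxy or cache setting on the media server). The function name and the `fetch_from_origin` callable are hypothetical; the callable stands in for an HTTP GET against the Django server's back-channel port.

```python
import os


def get_media(relative_path, cache_dir, fetch_from_origin):
    """Return file bytes, pulling from the origin server on a local miss.

    fetch_from_origin is a hypothetical callable that returns the file's
    bytes, or None if the origin itself answers 404.
    """
    root = os.path.abspath(cache_dir)
    local_path = os.path.abspath(os.path.join(root, relative_path))
    # Refuse paths that escape the media directory (e.g. "../secret").
    if local_path == root or os.path.commonpath([root, local_path]) != root:
        return None
    if os.path.isfile(local_path):
        with open(local_path, "rb") as f:
            return f.read()  # the common case: no hit on the Django server
    data = fetch_from_origin(relative_path)
    if data is None:
        return None
    # Keep a local copy so later requests never reach the origin.
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, "wb") as f:
        f.write(data)
    return data
```

Only the first request for each file reaches the Django server; every later request is answered from the media server's own disk.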

Carl Meyer on Jan. 26, 2009:

Agreed with Simon on the rarity of needing separate physical servers, but taking that as a given: another possibility would be to write your own Django file upload handler that directly streams the upload through to your media server as it comes. I think this is the approach taken by the S3 storage backends, but it seems like it would be possible with non-S3 media servers as well.
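
The shape of such a handler might look like the sketch below. It is deliberately not a real Django class: it only mirrors the `receive_data_chunk` and `file_complete` hooks of Django's `FileUploadHandler`, and the `send_chunk` and `finish` callables are hypothetical stand-ins for whatever transport reaches the media server.

```python
class ForwardingUploadSketch:
    """Stand-in for a FileUploadHandler subclass that forwards chunks as they arrive."""

    def __init__(self, send_chunk, finish):
        self.send_chunk = send_chunk  # hypothetical: pushes bytes to the media server
        self.finish = finish          # hypothetical: ends the transfer, returns the media URL
        self.bytes_sent = 0

    def receive_data_chunk(self, raw_data, start):
        # Forward each chunk immediately instead of buffering the whole file.
        self.send_chunk(raw_data)
        self.bytes_sent += len(raw_data)
        # In Django, returning None keeps later handlers from seeing the chunk.
        return None

    def file_complete(self, file_size):
        # A real handler would return an UploadedFile-like object here;
        # this sketch just returns whatever the transport reports.
        return self.finish()
```

The app server never holds more than one chunk at a time, so large uploads do not have to fit in its memory or on its disk.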

Enrique Pérez on Jan. 26, 2009:

Another possibility would be a mechanism like that of tramline, by Infrae (http://www.infrae.com/products/tramline). In essence, it is a Python module for an Apache server sitting in front of the app server. From their website:
"""
Tramline makes sure uploaded files (in a form POST) don’t appear at the appserver but go directly into the filesystem. The only thing the appserver sees is a unique identifier of the uploaded file, so that the appserver can access it when needed. The binary data is gone at the time the POST reaches the appserver. You can check whether Tramline is in use by checking the 'tramline' header in the request, though frequently there’s no need to do so.

The appserver can control whether it accepts the uploaded file(s) in the output response header; if a 'tramline_ok' header is present, the uploaded files will be moved into the repository, ‘committing’ the upload. If it’s absent, the uploaded files will be removed, ‘aborting’ the upload.

Tramline also can handle downloads. The appserver can signal in the response headers that Tramline should push a file out of the filesystem to the end user, by adding a 'tramline_file' response header. The data of the file body as received during upload, containing the unique identifier of the file, should be sent back as the response body. Again the appserver does not see the binary data but only sends out an identifier to make the file be served by Apache.
"""

Filip Wasilewski on Jan. 26, 2009:

I think that the answer to this problem mostly depends on the requirements, available infrastructure and expected user experience.

In one of my projects I was using S3 to store big media files. Since the app was hosted outside of the Amazon cloud, the browser-to-app-to-S3 transfer was unacceptable due to browser timeouts and bad user experience. On the other hand, a direct upload to S3 was problematic because of the lack of content validation, poor error handling, poor support for multiple-file upload via Flash and many other glitches. To cut a long story short, a good enough solution was moving the app and the upload handler to EC2, taking advantage of the blazing fast transfers inside the Amazon cloud.

So, if you ever need to split an app and a media server, I would recommend starting with the simplest reasonable and available solution, i.e. go for an NFS or other webservice approach (preferably taking advantage of Django's streaming upload capabilities) to save files on the media server, without engaging background processes or storing extra information in the db (which requires additional logic and makes things like caching pages problematic). If that for some reason doesn't meet the needs of the users, there will at least be a clue as to which parameters of the system should be improved.

Mark Nottingham on Jan. 26, 2009:

Stick a caching accelerator in front of it; that way the most popular media gets cached, as well as anything you choose to make cacheable.

Graham Dumpleton on Jan. 28, 2009:

I personally question the use of Tramline. Using mod_python for writing Apache input/output filters is actually quite an inefficient way of doing it. If you wanted that sort of thing, you would be better off using a dedicated Apache module written in C for doing the same thing.

For downloading files back to the client, if you are using Apache/mod_wsgi you would be better off using the wsgi.file_wrapper extension to WSGI, which should perform better than Tramline as it will use whatever sendfile or memory-mapping techniques are available to do it efficiently. In Tramline I don't believe you can avoid reading the file into normal memory, with traditional Apache in-memory bucket brigade objects still being used.
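
For reference, a minimal sketch of a WSGI application that uses `wsgi.file_wrapper` when the server provides it. The factory name and block size are arbitrary choices for illustration; falling back to `wsgiref.util.FileWrapper` is an assumption made so the sketch also runs on servers without the extension.

```python
import os
from wsgiref.util import FileWrapper


def make_download_app(path, block_size=8192):
    """Build a WSGI app that streams one file from disk."""

    def app(environ, start_response):
        f = open(path, "rb")
        size = os.fstat(f.fileno()).st_size
        start_response("200 OK", [
            ("Content-Type", "application/octet-stream"),
            ("Content-Length", str(size)),
        ])
        # Prefer the server's optimized wrapper; otherwise use the plain
        # chunked reader from the standard library.
        wrapper = environ.get("wsgi.file_wrapper", FileWrapper)
        return wrapper(f, block_size)

    return app
```

The application only hands an open file to the server, which leaves the server free to pick the most efficient way to send it.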

As to uploads, use of Tramline only makes sense if you are targeting applications written in Python and in other languages at the same time. If not, you could just as easily implement equivalent functionality as a WSGI middleware wrapper. Or even just have the application handle it as per normal.

Although I haven't tried it, an even more efficient way of handling downloads might be to generate X-Sendfile headers in the response and have them handled by an nginx front end. This would be better as it frees up the Apache thread handling the request straight away. The nginx server, being event driven and not thread based, would then serve up the file much more efficiently than Apache could.
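
A sketch of what the application side of this could look like. Note that nginx's own name for this mechanism is the `X-Accel-Redirect` header, which points at a location marked `internal` in the nginx configuration. The `/protected/` prefix and the factory name are assumptions for illustration, and the matching nginx location is not shown.

```python
def make_accel_app(internal_prefix="/protected/"):
    """Build a WSGI app that asks an nginx front end to send the file itself."""

    def app(environ, start_response):
        # Use only the last path segment so the client cannot pick directories.
        name = environ.get("PATH_INFO", "").rsplit("/", 1)[-1]
        if not name:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found"]
        start_response("200 OK", [
            ("Content-Type", "application/octet-stream"),
            # nginx intercepts this header and serves the file from the
            # matching internal location; the body here is left empty.
            ("X-Accel-Redirect", internal_prefix + name),
        ])
        return []

    return app
```

The application returns as soon as the headers are written, so the thread that handled the request is free while nginx streams the file.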

Andrew Tunnell-Jones on Jan. 30, 2009:

Perhaps the Nginx upload module combined with NFS would be a good fit. The Nginx module would accept the upload on your front-end machine, replace the file in the request with its path/filename and then pass the POST on to your back-end machine. Your back-end machine's Django process could then execute a move/rename operation on the file to a final location via an NFS export from the front-end machine.

http://www.grid.net.ru/nginx/upload.en.html

Ben Walton on Feb. 3, 2009:

James, what solution did you go for in the end? I'm currently setting this up and was planning on using NFS, but found out I couldn't as I don't have root access on the servers, so I was thinking about writing my own Django file handler to scp / rsync the files across. Would be interested to know how you got on. Tweet me @benwalton.

Created: Jan. 26, 2009
Last Modified: Jan. 26, 2009
Author: James Tauber