Serving large files with Tornado safely without blocking
We need to take care of two things while serving large files using Tornado:
- It should not eat up the RAM
- It should not block the server
To do that, we'll have to read and send the files in chunks. What that means is we'll read a few megabytes and send them, then read the next few megabytes and send them and we'll keep doing that until we've read and sent the whole file.
Before moving on, it should go without saying that Tornado isn't recommended for serving large files. A specialized server like Nginx should always be preferred for this purpose when possible.
Serving the files safely so they don't eat up the RAM
We'll read the file in chunks, write each chunk to the response, and flush it to the network socket.
Reading in chunks and flushing the data to the network ensures that we don't run out of RAM.
Here's a code example:
from tornado import web, iostream


class DownloadHandler(web.RequestHandler):
    async def get(self, filename):
        # chunk size to read
        chunk_size = 1024 * 1024 * 1  # 1 MiB

        with open(filename, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                try:
                    self.write(chunk)   # write the chunk to the response
                    await self.flush()  # send the chunk to the client
                except iostream.StreamClosedError:
                    # this means the client has closed the connection,
                    # so break the loop
                    break
                finally:
                    # deleting the chunk is very important because
                    # if many clients are downloading files at the
                    # same time, the chunks in memory would keep
                    # piling up and eat up the RAM
                    del chunk
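If you want to try this out, here's one way you might wire the handler into an application. The URL pattern and port below are illustrative choices of mine, not part of the original example. Also note that passing a client-supplied filename straight to open() is fine for a demo but unsafe in production; validate the path first.

from tornado import ioloop, web

def make_app():
    # map /download/<filename> to the handler above
    return web.Application([
        (r"/download/(.+)", DownloadHandler),
    ])

if __name__ == "__main__":
    make_app().listen(8888)
    ioloop.IOLoop.current().start()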
Preventing our server from blocking
When we await self.flush(), Tornado writes the pending data to the network socket. In theory, that means our coroutine should pause at the await self.flush() statement, because writing to a socket takes some time. That little pause should allow the ioloop to run other handlers asynchronously, which means our server shouldn't block.
But that is not always the case. self.flush() can complete very quickly if the client's network is also fast. The delay is then so small that our coroutine keeps running without pausing, and so it blocks the server.
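You can see this mechanism in isolation with plain asyncio (which Tornado 5 runs on). The sketch below is not Tornado's flush internals, just a minimal illustration that awaiting an already-completed future never hands control back to the event loop:

import asyncio

async def greedy():
    loop = asyncio.get_running_loop()
    for _ in range(3):
        fut = loop.create_future()
        fut.set_result(None)
        await fut  # already done: returns immediately, no pause
    print("greedy finished without ever yielding")

async def polite():
    print("polite finally got a turn")

async def main():
    asyncio.ensure_future(polite())  # scheduled, but can't run yet
    await greedy()                   # hogs the loop the whole time
    await asyncio.sleep(0)           # only now does polite get a turn

asyncio.run(main())

greedy finishes before polite ever prints, even though polite was scheduled first. An await self.flush() that completes instantly behaves the same way.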
A foolproof way to make our code non-blocking is to put the coroutine to sleep for a nanosecond right after flush(). That pause is enough for the ioloop to run other handlers. In fact, it doesn't have to be a nanosecond; an even smaller value would do. But for this example, I'll go with a nanosecond's pause.
UPDATE: I asked about this issue on Tornado's mailing list and Ben Darnell (Tornado's maintainer) shared some very good tips. You can find the thread here. Do read it; he also posted a code example about "metered usage" which you can use to serve your clients fairly.
Example:
from tornado import web, iostream, gen


class DownloadHandler(web.RequestHandler):
    async def get(self, filename):
        ...
        while True:
            ...
            try:
                self.write(chunk)
                await self.flush()
            except iostream.StreamClosedError:
                break
            finally:
                del chunk

            # pause the coroutine so other handlers can run
            await gen.sleep(0.000000001)  # 1 nanosecond
This approach is pretty effective because even a fast client connection is still far slower than Tornado writing to the socket. So pausing for a nanosecond, and serving other clients in the meantime, doesn't add any delay that matters.
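I won't reproduce Ben's code here, but the gist of the metered-usage idea is to sleep in proportion to how much data you just sent, which caps each connection at a fair rate. Here's my own rough sketch of that idea; the helper and the rate cap are mine, not from the thread:

import time
from tornado import gen

async def metered_flush(handler, chunk, max_rate=16 * 1024 * 1024):
    # send the chunk, then sleep just long enough that this
    # connection never exceeds max_rate bytes per second
    start = time.monotonic()
    handler.write(chunk)
    await handler.flush()
    target = len(chunk) / max_rate  # how long the chunk should take
    elapsed = time.monotonic() - start
    if target > elapsed:
        await gen.sleep(target - elapsed)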
Now that we've made the DownloadHandler asynchronous, we can serve multiple clients in a non-blocking way. Even if several users download different files at the same time, our server won't block.
Benchmarks
Tornado isn't meant for serving large files. When you have the option, you should use Nginx instead. The benchmarks below clearly show that.
----------------------------+------------+---------------+---------------
Server | 1 request | 10 concurrent | 100 concurrent
| | requests | requests
----------------------------+------------+---------------+---------------
Nginx (w/ sendfile) | 0.130 sec | 0.978 sec | 15.790 sec
Nginx (w/o sendfile) | 0.155 sec | 1.472 sec | 22.424 sec
Tornado 5.0 (w/o sendfile) | 0.419 sec | 3.782 sec | 44.289 sec
It's quite apparent that Tornado can't keep up with Nginx.
What's sendfile?
sendfile is a function available on Linux (and other Unix systems) which copies a file to a socket at the kernel level. This is far faster than what we're doing: reading the file into our process and writing it to the socket. While Python supports this via os.sendfile, Tornado doesn't use it. But there's an issue on GitHub about adding this support to Tornado.
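For the curious, here's roughly what os.sendfile looks like with a plain socket. This is a standalone sketch of the syscall, not something you can drop into a Tornado handler today:

import os
import socket

def send_file(conn: socket.socket, path: str) -> None:
    with open(path, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # the kernel copies straight from the file descriptor to
            # the socket; the bytes never pass through user space
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break  # peer closed the connection
            offset += sent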
Some notes on performance
Nginx is so fast because it's optimized for serving files. While I don't know the inner workings of Nginx, I can safely say that part of the reason for its speed is that it's written in C.
There's still room for optimization in Tornado: using sendfile to write a file to the socket would make it a little faster, but it still wouldn't be as fast as Nginx. So, serving large files with Tornado should be reserved for special cases where using Nginx is not possible.