[http] Python sample: Downloading a file over HTTP with multiple threads

February 14th, 2011 by bettermanlu

Free Download Manager (FDM) is a popular tool that integrates with IE and Firefox to download files via HTTP, HTTPS and FTP. One of its highlights is “download acceleration”: FDM splits a file into several sections and downloads them simultaneously. Have you ever been curious about how this is implemented? This article sheds some light on the basic mechanism behind it.

1. HTTP HEAD Request and the HTTP Response Content-Length & Accept-Ranges Headers
The HEAD method is a standard HTTP method that acts as if I’ve made a GET request, but it returns only the headers and not the body. This allows me to find out some information about the resource without actually taking the time or using the bandwidth to download it.

For example, I can read the corresponding HTTP Response Content-Length header and determine the size of the resource.
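
As a minimal sketch of that idea, the snippet below sends a HEAD request with Python 3's standard http.client module and reads Content-Length from the response headers. The function names head_headers and content_length are made up for illustration and are not part of any library.

```python
import http.client


def head_headers(host, path, port=80, timeout=10):
    """Send a HEAD request and return (status, headers) without downloading the body."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        response.read()  # a HEAD response has no body; this just completes the exchange
        # Header names are case-insensitive, so normalise them to lower case.
        headers = {name.lower(): value for name, value in response.getheaders()}
        return response.status, headers
    finally:
        conn.close()


def content_length(headers):
    """Return the resource size in bytes, or None if Content-Length is missing or malformed."""
    try:
        size = int(headers.get("content-length"))
    except (TypeError, ValueError):
        return None
    return size if size >= 0 else None
```

A call such as head_headers("www.example.com", "/file.bin") (a placeholder host and path) returns the status code and a dictionary of headers, which can then be passed to content_length.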

Another important response header is Accept-Ranges.

This header tells the Web client whether the server can handle range requests. HTTP/1.1 defines bytes as its only range unit, so in practice the header appears in one of two forms:

Accept-Ranges: bytes
Accept-Ranges: none

The first indicates that the Web server accepts byte-range requests; the second indicates that it does not.

If the Web server supports range requests, the client can use the Range header, described in the next section, to download partial content.
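
The check itself is small enough to sketch as a pure function. The name supports_byte_ranges is hypothetical. Comparing case-insensitively and tolerating a comma-separated list are defensive choices made here, not something the article requires.

```python
def supports_byte_ranges(accept_ranges):
    """Return True only if an Accept-Ranges header value advertises the bytes unit."""
    if accept_ranges is None:
        return False  # header absent: treat range support as not advertised
    units = [unit.strip().lower() for unit in accept_ranges.split(",")]
    return "bytes" in units


print(supports_byte_ranges("bytes"))  # True
print(supports_byte_ranges("none"))   # False
```

Treating a missing header as "no" is the conservative option: the client then falls back to a single ordinary GET.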

2. HTTP GET Request – Range Header
The Range header allows the HTTP client to request partial content, rather than the usual full content, by specifying a range of bytes it seeks to receive.

For example, to request the first 500 bytes of content, the following Range header should be included in the request:

Range: bytes=0-499

A successful response to a range request has the status 206 Partial Content.
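
A rough sketch of such a request with Python 3's http.client follows. The helper names range_header and fetch_range are assumptions made for this example. Raising an error on any status other than 206 is one possible policy, chosen so that a server that ignores the Range header and replies 200 OK with the full body is not mistaken for a partial response.

```python
import http.client


def range_header(start, end):
    """Build the Range header for an inclusive byte range, e.g. bytes=0-499."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    return {"Range": "bytes=%d-%d" % (start, end)}


def fetch_range(host, path, start, end, port=80, timeout=10):
    """Return bytes start..end of the resource; raise if the server ignores the range."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path, headers=range_header(start, end))
        response = conn.getresponse()
        body = response.read()
        if response.status != 206:
            # For example, 200 OK means the server sent the whole resource instead.
            raise RuntimeError("expected 206 Partial Content, got %d %s"
                               % (response.status, response.reason))
        return body
    finally:
        conn.close()
```

Both bounds are inclusive, so bytes=0-499 covers exactly 500 bytes.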

With these key points in hand, we can write a Python script.

3. Code sample: Multi-threaded file download over HTTP

Basic workflow:
(1) Send an HTTP HEAD request to check whether the Web server supports range requests.
(2) If it does (“Accept-Ranges: bytes”), read the “Content-Length” header.
(3) Split the whole file into blocks (100,000 bytes per block) and start one HttpPartialDownloadThread per block to download each part.
(4) After all threads terminate, merge all the partial files into one big file.
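
Step (3) is where off-by-one mistakes are easiest to make, so here it is sketched in isolation. The function name split_into_blocks is made up for this example. It returns inclusive (start, end) pairs that can be used directly in a Range header.

```python
def split_into_blocks(content_length, block_size):
    """Return inclusive (start, end) byte ranges that together cover content_length bytes."""
    if content_length < 0 or block_size <= 0:
        raise ValueError("content_length must be >= 0 and block_size must be > 0")
    blocks = []
    for start in range(0, content_length, block_size):
        # The last block is usually shorter, and may be a single byte.
        end = min(start + block_size, content_length) - 1
        blocks.append((start, end))
    return blocks


print(split_into_blocks(250, 100))  # [(0, 99), (100, 199), (200, 249)]
```

Note the edge case of a one-byte final block: for 201 bytes and a block size of 100, the last range is (200, 200), where the start and end bytes are equal.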

############
# http_get_rangeRequest_multithread.py demo
# Download Fiddler2Setup.exe from www.getfiddler.com/dl/Fiddler2Setup.exe with multiple threads
# copyright: bettermanlu@gmail.com
########

import http.client
import shutil
from threading import Thread


# start of download thread class
class HttpPartialDownloadThread(Thread):
    def __init__(self, hostURL, resourceURL, startByte, endByte, threadIndex):
        Thread.__init__(self)
        self.hostURL = hostURL
        self.resourceURL = resourceURL
        self.startByte = startByte
        self.endByte = endByte
        self.threadIndex = threadIndex
        self.done = False  # set to True once this part has been saved

    def run(self):
        print('thread %s is running' % self.threadIndex)
        self.partialDownload()

    def partialDownload(self):
        conn = http.client.HTTPConnection(self.hostURL)
        try:
            conn.request("GET", self.resourceURL,
                         headers={"Range": "bytes=%s-%s" % (self.startByte, self.endByte)})
            r1 = conn.getresponse()
            print(r1.status, r1.reason)
            # only a 206 Partial Content response carries the requested byte range
            if r1.status == 206:
                with open("part_%s" % self.threadIndex, "wb") as partFile:
                    partFile.write(r1.read())
                self.done = True
        finally:
            conn.close()
# end of class


def mergeRanges(fileName, partialFileCount):
    # concatenate part_0, part_1, ... in order into the output file
    with open(fileName, 'wb') as fout:
        for i in range(partialFileCount):
            with open("part_%s" % i, 'rb') as fin:
                shutil.copyfileobj(fin, fout, 65536)


def getContentLength(conn, resourceURL):
    # send "HEAD" request to get the basic information of the resourceURL
    conn.request("HEAD", resourceURL)
    r1 = conn.getresponse()
    print(r1.status, r1.reason)
    # Note that you must read the whole response before you can send a new request to the server,
    # otherwise you will get an http.client.ResponseNotReady error, even if you don't need the body.
    r1.read()
    content_length = 0
    # read "accept-ranges" header to see if the server supports range requests
    accept_ranges = r1.getheader("accept-ranges")
    if accept_ranges == "bytes":
        # read "content-length" header to get the length of the content section of the HTTP message in bytes
        content_length = int(r1.getheader("content-length", 0))
    return content_length


def getRangeFileTest():
    hostURL = "www.getfiddler.com"
    resourceURL = "/dl/Fiddler2Setup.exe"
    conn = http.client.HTTPConnection(hostURL)
    contentLength = getContentLength(conn, resourceURL)
    conn.close()
    print(contentLength)
    BLOCK_SIZE = 1000 * 100  # 100,000 bytes per block
    if contentLength > 0:
        # split the content into several parts: BLOCK_SIZE bytes per block
        threads = []
        for startByte in range(0, contentLength, BLOCK_SIZE):
            # byte ranges are inclusive, so the last block may be a single byte
            endByte = min(startByte + BLOCK_SIZE, contentLength) - 1
            downloadThread = HttpPartialDownloadThread(hostURL, resourceURL, startByte, endByte, len(threads))
            downloadThread.start()
            threads.append(downloadThread)
        # wait for every download thread to terminate
        for downloadThread in threads:
            downloadThread.join()
        if all(t.done for t in threads):
            print('Now merge them to one file')
            mergeRanges("test.exe", len(threads))
        else:
            print('Some parts failed to download; nothing was merged')


if __name__ == '__main__':
    getRangeFileTest()

Ref:
1. Book “HTTP Developer’s Handbook” By Chris Shiflett
2. http://benramsey.com/archives/206-partial-content-and-range-requests/
