Recently I wrote a Python script to upload files using the Requests library. The script needed about 7 minutes to upload a file about 3 GB in size, while curl took less than a minute.
According to the server-side log, the time between the server accepting the socket and closing it was less than a minute even when we used the Python script, so most of the 7 minutes was spent on the client side, outside the actual transfer.
Where does the Requests library spend its time?
After reading the Requests code, I found that the library reads the full file content into memory and encodes it in the multipart/form-data MIME format.
The function _encode_files is used to process file upload requests. At line 159 it reads the full content of the file, and then at line 169 it calls the encode_multipart_formdata() function to encode the data.
After this, the library can calculate the length of the request body and add the 'Content-Length' header.
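To make the cost of this approach concrete, here is a minimal sketch of eager multipart encoding using only the standard library. It is not the Requests implementation; the function name encode_multipart_in_memory and the fixed application/octet-stream part type are assumptions made for illustration.

```python
import io
import uuid


def encode_multipart_in_memory(field_name, filename, fileobj):
    """Build a complete multipart/form-data body in memory (eager approach).

    Hypothetical helper for illustration; not part of Requests.
    """
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # Part header describing the uploaded file
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'.encode()
    )
    body.write(b"Content-Type: application/octet-stream\r\n\r\n")
    # The whole file is read here, so a 3 GB file needs at least 3 GB of RAM
    body.write(fileobj.read())
    # Closing boundary
    body.write(f"\r\n--{boundary}--\r\n".encode())
    data = body.getvalue()  # a second full copy of the body
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Content-Length": str(len(data)),
    }
    return data, headers
```

With this design the Content-Length is only known after every byte of the file has been read and copied, which is where the time goes for large files.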
Why does curl upload the file so fast?
An HTTP client needs to send the 'Content-Length' header before sending the body, so the body size has to be calculated first. When uploading a file, loading the file content into memory, encoding the body, and then measuring its size is the straightforward solution.
So is the difference in time caused by the performance difference between Python and C?
After reading curl's code, I found that programming language performance is not the only reason.
When curl builds a request to upload a file, it builds a mimepart structure to describe the file. It does not start reading the file content until it begins to send the HTTP body. When sending the HTTP body, curl calls the structure's read method to read the data. The related code is here.
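The same lazy idea can be sketched in Python with a generator. This is not curl's code, only an illustration of deferring file reads until the body is actually consumed; the name iter_multipart_body, the chunk size, and the part headers are assumptions.

```python
import os


def iter_multipart_body(path, field_name, boundary, chunk_size=64 * 1024):
    """Yield a multipart/form-data body piece by piece.

    Hypothetical helper for illustration. Nothing is read from disk
    until the consumer starts iterating.
    """
    filename = os.path.basename(path)
    # Part header: built from metadata only, no file access needed
    yield (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    # The file is opened only once the consumer asks for the next piece
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
    # Closing boundary
    yield f"\r\n--{boundary}--\r\n".encode()
```

Memory use stays bounded by chunk_size regardless of the file size, because only one chunk is held at a time.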
To calculate the HTTP body size before sending the body content, we only need the file size: the length of the encoded content does not depend on what the file contains. The related code is here.
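As a rough illustration of that observation, the body length of a single-file multipart upload can be computed from the part framing plus the size reported by the filesystem. The helper names multipart_framing and multipart_content_length are hypothetical, and the part headers are an assumed minimal set rather than what curl emits.

```python
import os


def multipart_framing(filename, field_name, boundary):
    """Return the bytes that surround the file content in the body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head, tail


def multipart_content_length(path, field_name, boundary):
    """Compute Content-Length without reading the file content."""
    head, tail = multipart_framing(os.path.basename(path), field_name, boundary)
    # os.path.getsize asks the filesystem for the size; no data is read
    return len(head) + os.path.getsize(path) + len(tail)
```

Because multipart/form-data sends the file bytes unmodified, the file contributes exactly its on-disk size to the total.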
Finally, I used pycurl to refactor my script, and its performance is now the same as the curl command.
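For reference, a pycurl upload can look roughly like the sketch below. This is not my original script; the function name upload_with_pycurl, the default field name "file", and the URL passed in are placeholders, and it assumes the server accepts a multipart/form-data POST.

```python
from io import BytesIO


def upload_with_pycurl(url, path, field_name="file"):
    """Upload a file as multipart/form-data and return (status, body).

    Hypothetical helper for illustration.
    """
    # Third-party dependency, imported here so the sketch can be loaded
    # even where pycurl is not installed
    import pycurl

    response = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(pycurl.URL, url)
        # FORM_FILE tells libcurl to stream the file from disk itself
        c.setopt(pycurl.HTTPPOST, [(field_name, (pycurl.FORM_FILE, path))])
        # Collect the server's response body in memory
        c.setopt(pycurl.WRITEDATA, response)
        c.perform()
        status = c.getinfo(pycurl.RESPONSE_CODE)
    finally:
        c.close()
    return status, response.getvalue()
```

A call would look like upload_with_pycurl("http://example.com/upload", "/path/to/big.bin"), where both arguments are placeholders.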