Rishi added the comment:

My observation is that a file with a higher-than-normal number of line-feed 
characters (exact numbers below) takes far too long to parse, presumably because 
the default implementation reads the payload line by line and every LF terminates a read.

I tried porting the above patch to my default branch, but it has some boundary 
and CRLF/LF issues; more importantly, it relies on seeking the file object, 
which in the real world is stdin (for uploads coming from web browsers) and 
hence cannot be seeked in that environment.

I have attached a patch that is based on the same principle Chui mentioned, 
i.e. reading a large buffer, but this patch does not deal with line feeds at all. 
Instead, it searches for the entire boundary within the large buffer.

The cgi module relies only on the file object's read and readline functionality, 
so I created a wrapper class around read and readline that introduces buffering 
(attached as a patch).
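
Conceptually, the wrapper behaves like the following minimal sketch (the class 
name, buffer size, and details are illustrative and not taken from the attached 
patch):

class BufferedReader:
    # Hypothetical sketch of the buffering-wrapper idea, not the attached patch.
    BUFSIZE = 1 << 16  # refill the internal buffer 64 KB at a time

    def __init__(self, fp):
        self.fp = fp       # underlying file object (e.g. sys.stdin.buffer)
        self.buf = b''     # data read from fp but not yet consumed

    def _fill(self):
        chunk = self.fp.read(self.BUFSIZE)
        self.buf += chunk
        return bool(chunk)

    def read(self, size=-1):
        if size < 0:
            while self._fill():
                pass
            data, self.buf = self.buf, b''
            return data
        while len(self.buf) < size and self._fill():
            pass
        data, self.buf = self.buf[:size], self.buf[size:]
        return data

    def readline(self):
        while b'\n' not in self.buf and self._fill():
            pass
        nl = self.buf.find(b'\n')
        end = len(self.buf) if nl < 0 else nl + 1
        data, self.buf = self.buf[:end], self.buf[end:]
        return data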
 
When multipart boundaries are being searched for, the patch fills a large buffer, 
as in the original solution. It searches for the entire boundary and returns 
a large chunk of the payload in one call, rather than line by line.

The search has corner cases (a boundary overlapping two consecutive buffers) 
and CRLF issues, and a boundary may itself contain repeating characters, which 
adds to the search complexity. 
To overcome this, the patch uses simple regular expressions without any 
expanding or wildcard characters. If the boundary is not found, it returns a 
chunk of the buffer's length minus the CRLF prefixes, to ensure that no 
boundary straddles two consecutive buffers undetected. The expressions take 
care of the CRLF issues. 
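
The search step could look roughly like this (the helper name read_to_boundary 
is made up for illustration and assumes the buffering wrapper sketched above; 
the attached patch may differ in its details):

import re

def read_to_boundary(reader, boundary):
    # `reader` is assumed to be a buffering wrapper like the sketch above;
    # `boundary` is the raw delimiter, e.g. b'--frontier'.
    # Escape the boundary literally so repeated characters inside it cannot
    # act as regex operators; allow an optional preceding CRLF/LF.
    pattern = re.compile(rb'(?:\r\n|\n)?' + re.escape(boundary))
    keep_back = len(boundary) + 2   # covers a split boundary plus its CRLF prefix

    while True:
        m = pattern.search(reader.buf)
        if m:
            # Emit everything before the boundary and leave the rest buffered.
            yield reader.buf[:m.start()]
            reader.buf = reader.buf[m.start():]
            return
        if len(reader.buf) > keep_back:
            # No boundary here: emit all but the last few bytes, so a boundary
            # straddling two consecutive buffers is still found next round.
            yield reader.buf[:-keep_back]
            reader.buf = reader.buf[-keep_back:]
        if not reader._fill():
            # EOF without a closing boundary: emit whatever is left.
            if reader.buf:
                yield reader.buf
                reader.buf = b''
            return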

When read and readline are called, the patch looks for the data in the internal 
buffer first and returns it from there.

There is an overall performance improvement for large files, and a very 
significant one for files with a very high number of LF characters.

To begin with, I created a 20 MB file in which 20% of the bytes are line feeds. 

File - 20MB.bin
size - 20 MB
description - file filled with 20% (~4 MB) '\n'
Parse time with default cgi module - 53 seconds
Parse time with patch - 0.4 seconds
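
For reference, a file of roughly that shape can be produced along these lines 
(an illustration only, not the attached test script):

import os

# Build a ~20 MB file in which roughly every fifth byte is an LF, i.e. ~20% '\n'.
size = 20 * 1024 * 1024
with open('20MB.bin', 'wb') as f:
    written = 0
    while written < size:
        block = bytearray(os.urandom(1024))
        block[::5] = b'\n' * len(block[::5])   # force ~20% of the bytes to LF
        f.write(block)
        written += len(block)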

For the default module, this parse time increases linearly with the number of 
LFs, i.e. keeping the size fixed at 20 MB and doubling the proportion of LFs to 
40% doubles the parse time. 

I also tried a normal large binary file that I found on my machine.
size - 88 MB
description - a binary executable; the binary image contains ~140k LF characters
Parse time with default cgi module - 2.7 seconds
Parse time with patch - 0.7 seconds

I have tested with a few other files and noticed that the parse time is cut at 
least in half for large files.


Note: 
These numbers are consistent over multiple observations.
I tested this using the attached script, and also against my localhost server.
The time taken is obtained by running the following code.

import time, cProfile, cgi

t1 = time.time()
cProfile.run("fs = cgi.FieldStorage()")   # parse the multipart POST from stdin
print(str(len(fs['datafile'].value)))     # size of the uploaded field, as a sanity check
t2 = time.time()
print(str(t2 - t1))

I have tried to keep the patch compatible with the current module. However, I 
have introduced a ValueError exception in the module for the case where the 
boundary is very large, i.e. more than 1024 bytes. RFC 2046 specifies a maximum 
boundary length of 70 characters.
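
The guard itself is trivial; conceptually something like this (the constant and 
function names are made up for illustration):

MAX_BOUNDARY_LEN = 1024   # RFC 2046 allows at most 70 characters, so 1024 is generous

def _check_boundary_length(boundary):
    # Reject absurdly long boundaries up front instead of feeding them
    # into the buffered search.
    if len(boundary) > MAX_BOUNDARY_LEN:
        raise ValueError('Invalid boundary in multipart form: %r' % (boundary,))
    return boundary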

----------
keywords: +patch
nosy: +rishi.maker.forum
Added file: http://bugs.python.org/file36895/issue1610654.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue1610654>
_______________________________________