new sorting algorithm
*Dear, Sir/Madam* Let me first tell you briefly who we are and where we are from, what we do. My name is Nas (full name Nasipa Bayedil) from Kazakhstan. In December 2020, we registered a company online in Dover, Delaware, the United States, because all major corporations are based in the United States. Direction of the company: research and development technologies of sorting and searching in databases and in Big Data. Our research and developments should be of interest to a rapidly growing tech market. My father Nurgali has a mathematics education, he has been interested in mathematics' and physics all his life, now he is retired and continues to study his favorite mathematics, when he became a pensioner to this love an interest in programming was added. And this new interest in programming led him to invent his own method of developing sorting algorithms. *In a nutshell about what we are doing.* We have developed several variants of sorting algorithms based on new ideas and improvements to well-known ideas. Our development is carried out in the application of the idea of the article Peter McIlroy's 1993 paper "Optimistic Sorting and Information Theoretic Complexity", which was implemented by *Tim Peters TimSort in a sorting algorithm in 2002 for use in Python.* The variant implemented by Tim Peters is improved by us as it does not take full advantage of Peter McIlroy's idea. This has been achieved through the development of new methods: 1. data recognition 2. data fusion We also used the research work Sorting N-Elements Using Natural Order: A New Adaptive Sorting Approach** but added some new ideas of our own. As a result, we got a hybrid NDsort algorithm, which uses the above particular algorithms. Existing sorting algorithms that are used in practice, very fast, I will not list their advantages; for improvement let's just indicate that some of them start to work slowly on certain types of data. The author tried to deal with this from a general, *mathematical position*, and not from the point of view of programming techniques, *and achieved the goal*. Fragments of the general theory of sorting by comparison have been developed, *several facts have been discovered that are subject to patenting *and which allow developing various variants of sorting algorithms. We believe that using this method to develop completely new, fast algorithms, approaching the speed of the famous *QuickSort*, the speed of which cannot be surpassed, but its drawback can be circumvented, in the sense of stack overflow, on some data. Additionally our general sorting algorithm can be applied to design data sorting chips. This is, of course, the business of specialists in this field, but we believe that our algorithm opens up a new approach to this problem. The fact is that there are currently two ways to develop data sorting chips. The first is that data is entered in parallel; the second is that data is entered sequentially. Our sorting algorithm allows parallel-sequential data input, which would speed up the entire sorting process. I hope our research will pique your interest. The distinctive features of our sorting algorithms are the minimization of the number of comparisons, a new approach to the sorting problem, the maximum use of common sense, and the use of a principle from philosophy "Occam's razor". We use C# (Visual Studio 2019 demo version) for research and write fragments of algorithms. We would like to ask for help from Python.org engineers in implementation of our idea for the Python language together. We will be glad to hear from you any comments and questions about our idea. Kind regards, Nas Bayedil Twitter: @NasBayedil Website: www.bazony.com.kz **June 2010 Journal of Computer Science 6(2) Project: Algorithm Analysis, Authors: Shamim Akhter, International University of Business Agriculture and Technology M. Tanveer Hasan. -- https://mail.python.org/mailman/listinfo/python-list
Re: tail
Something like this is OK? import os def tail(f): chunk_size = 100 size = os.stat(f.fileno()).st_size positions = iter(range(size, -1, -chunk_size)) next(positions) chunk_line_pos = -1 pos = 0 for pos in positions: f.seek(pos) chars = f.read(chunk_size) chunk_line_pos = chars.rfind(b"\n") if chunk_line_pos != -1: break if chunk_line_pos == -1: nbytes = pos pos = 0 f.seek(pos) chars = f.read(nbytes) chunk_line_pos = chars.rfind(b"\n") if chunk_line_pos == -1: line_pos = pos else: line_pos = pos + chunk_line_pos + 1 f.seek(line_pos) return f.readline() This is simply for one line and for utf8. -- https://mail.python.org/mailman/listinfo/python-list
Re: new sorting algorithm
I suppose you should write to python-...@python.org , or in https://discuss.python.org/ under the section Core development -- https://mail.python.org/mailman/listinfo/python-list
Re: new sorting algorithm
This probably should start out as a module on Pypi. Is the sorting stable? Python guarantees that. On Sun, May 1, 2022 at 8:53 AM Nas Bayedil wrote: > *Dear, Sir/Madam* > > > Let me first tell you briefly who we are and where we are from, what we do. > > My name is Nas (full name Nasipa Bayedil) from Kazakhstan. > > In December 2020, we registered a company online in Dover, Delaware, the > United States, because all major corporations are based in the United > States. Direction of the company: research and development technologies of > sorting and searching in databases and in Big Data. Our research and > developments should be of interest to a rapidly growing tech market. > > My father Nurgali has a mathematics education, he has been interested in > mathematics' and physics all his life, now he is retired and continues to > study his favorite mathematics, when he became a pensioner to this love an > interest in programming was added. And this new interest in programming led > him to invent his own method of developing sorting algorithms. > > *In a nutshell about what we are doing.* > > We have developed several variants of sorting algorithms based on new ideas > and improvements to well-known ideas. > > Our development is carried out in the application of the idea of the > article Peter McIlroy's 1993 paper "Optimistic Sorting and Information > Theoretic Complexity", which was implemented by *Tim Peters TimSort in a > sorting algorithm in 2002 for use in Python.* > > The variant implemented by Tim Peters is improved by us as it does not take > full advantage of Peter McIlroy's idea. > > This has been achieved through the development of new methods: > > 1. data recognition > > 2. data fusion > > > > We also used the research work Sorting N-Elements Using Natural Order: A > New Adaptive Sorting Approach** but added some new ideas of our own. As a > result, we got a hybrid NDsort algorithm, which uses the above particular > algorithms. > > > > Existing sorting algorithms that are used in practice, very fast, I will > not list their advantages; for improvement let's just indicate that some of > them start to work slowly on certain types of data. The author tried to > deal with this from a general, *mathematical position*, and not from the > point of view of programming techniques, *and achieved the goal*. > > Fragments of the general theory of sorting by comparison have been > developed, *several facts have been discovered that are subject to > patenting *and which allow developing various variants of sorting > algorithms. > > We believe that using this method to develop completely new, fast > algorithms, approaching the speed of the famous *QuickSort*, the speed of > which cannot be surpassed, but its drawback can be circumvented, in the > sense of stack overflow, on some data. > > Additionally our general sorting algorithm can be applied to design data > sorting chips. This is, of course, the business of specialists in this > field, but we believe that our algorithm opens up a new approach to this > problem. The fact is that there are currently two ways to develop data > sorting chips. The first is that data is entered in parallel; the second is > that data is entered sequentially. Our sorting algorithm allows > parallel-sequential data input, which would speed up the entire sorting > process. > > I hope our research will pique your interest. The distinctive features of > our sorting algorithms are the minimization of the number of comparisons, a > new approach to the sorting problem, the maximum use of common sense, and > the use of a principle from philosophy "Occam's razor". > > We use C# (Visual Studio 2019 demo version) for research and write > fragments of algorithms. We would like to ask for help from Python.org > engineers in implementation of our idea for the Python language together. > > We will be glad to hear from you any comments and questions about our idea. > > > > > > Kind regards, > > Nas Bayedil > > > > Twitter: @NasBayedil > > Website: www.bazony.com.kz > > > > **June 2010 Journal of Computer Science 6(2) Project: Algorithm Analysis, > > Authors: Shamim Akhter, International University of Business Agriculture > and Technology M. Tanveer Hasan. > -- > https://mail.python.org/mailman/listinfo/python-list > -- https://mail.python.org/mailman/listinfo/python-list
Re: new sorting algorithm
On Mon, 2 May 2022 at 01:53, Nas Bayedil wrote: > We believe that using this method to develop completely new, fast > algorithms, approaching the speed of the famous *QuickSort*, the speed of > which cannot be surpassed, but its drawback can be circumvented, in the > sense of stack overflow, on some data. Hmm, actually TimSort *does* exceed the speed of quicksort for a lot of real-world data. For instance, if you take a large sorted list, append a handful of (unsorted) items to it, and then sort the list, TimSort can take advantage of the fact that the bulk of the list is sorted. It ends up significantly faster than re-sorting the entire list. How well does your new variant handle this case? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 01May2022 18:55, Marco Sulla wrote: >Something like this is OK? [...] >def tail(f): >chunk_size = 100 >size = os.stat(f.fileno()).st_size I think you want os.fstat(). >positions = iter(range(size, -1, -chunk_size)) >next(positions) I was wondering about the iter, but this makes sense. Alternatively you could put a range check in the for-loop. >chunk_line_pos = -1 >pos = 0 > >for pos in positions: >f.seek(pos) >chars = f.read(chunk_size) >chunk_line_pos = chars.rfind(b"\n") > >if chunk_line_pos != -1: >break Normal text file _end_ in a newline. I'd expect this to stop immediately at the end of the file. >if chunk_line_pos == -1: >nbytes = pos >pos = 0 >f.seek(pos) >chars = f.read(nbytes) >chunk_line_pos = chars.rfind(b"\n") I presume this is because unless you're very lucky, 0 will not be a position in the range(). I'd be inclined to avoid duplicating this code and special case and instead maybe make the range unbounded and do something like this: if pos < 0: pos = 0 ... seek/read/etc ... if pos == 0: break around the for-loop body. >if chunk_line_pos == -1: >line_pos = pos >else: >line_pos = pos + chunk_line_pos + 1 >f.seek(line_pos) >return f.readline() > >This is simply for one line and for utf8. And anything else where a newline is just an ASCII newline byte (10) and can't be mistaken otherwise. So also ASCII and all the ISO8859-x single byte encodings. But as Chris has mentioned, not for other encodings. Seems sane. I haven't tried to run it. Cheers, Cameron Simpson -- https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, May 1, 2022 at 3:19 PM Cameron Simpson wrote: > On 01May2022 18:55, Marco Sulla wrote: > >Something like this is OK? > Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine. But what about Unicode? Are all 10 bytes newlines in Unicode encodings? If not, and you have a huge file to reverse, it might be better to use a temporary file. -- https://mail.python.org/mailman/listinfo/python-list
Re: new sorting algorithm
On Sun, May 1, 2022 at 1:44 PM Chris Angelico wrote: > On Mon, 2 May 2022 at 06:43, Dan Stromberg wrote: > > On Sun, May 1, 2022 at 11:10 AM Chris Angelico wrote: > >> > >> On Mon, 2 May 2022 at 01:53, Nas Bayedil wrote: > >> > We believe that using this method to develop completely new, fast > >> > algorithms, approaching the speed of the famous *QuickSort*, the > speed of > >> > which cannot be surpassed, but its drawback can be circumvented, in > the > >> > sense of stack overflow, on some data. > >> > >> Hmm, actually TimSort *does* exceed the speed of quicksort for a lot > >> of real-world data. For instance, if you take a large sorted list, > >> append a handful of (unsorted) items to it, and then sort the list, > >> TimSort can take advantage of the fact that the bulk of the list is > >> sorted. It ends up significantly faster than re-sorting the entire > >> list. > > > > > > In fact, Timsort is O(n) for already-sorted data, while many quicksorts > are O(n^2) for already-sorted data. > > > > Quicksort can be salvaged by using a median-of-3 partitioning, but it's > still O(n^2) in the (less common) worst case. > > > > This is true, but purely sorted data isn't a very practical case. The > case of mostly-sorted data IS practical, though, so it's a quite big > win that it can be close to O(n), and still faster than inserting each > item individually. > You seem to be of the impression that nearly-sorted data isn't an uphill battle with a straightforward quicksort. I'm having a hard time convincing myself of that. -- https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 2 May 2022 at 09:19, Dan Stromberg wrote: > > On Sun, May 1, 2022 at 3:19 PM Cameron Simpson wrote: > > > On 01May2022 18:55, Marco Sulla wrote: > > >Something like this is OK? > > > > Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine. > > But what about Unicode? Are all 10 bytes newlines in Unicode encodings? Most absolutely not. "Unicode" isn't an encoding, but of the Unicode Transformation Formats and Universal Character Set encodings, most don't make that guarantee: * UTF-8 does, as mentioned. It sacrifices some efficiency and consistency for a guarantee that ASCII characters are represented by ASCII bytes, and ASCII bytes only ever represent ASCII characters. * UCS-2 and UTF-16 will both represent BMP characters with two bytes. Any character U+xx0A or U+0Axx will include an 0x0A in its representation. * UTF-16 will also encode anything U+000xxx0A with an 0x0A. (And I don't think any codepoints have been allocated that would trigger this, but UTF-16 can also use 0x0A in the high surrogate.) * UTF-32 and UCS-4 will use 0x0A for any character U+xx0A, U+0Axx, and U+A (though that plane has no characters on it either) So, of all the available Unicode standard encodings, only UTF-8 makes this guarantee. Of course, if you look at documents available on the internet, UTF-8 the encoding used by the vast majority of them (especially if you include seven-bit files, which can equally be considered ASCII, ISO-8859-x, and UTF-8), so while it might only be one encoding out of many, it's probably the most important :) In general, you can *only* make this parsing assumption IF you know for sure that your file's encoding is UTF-8, ISO-8859-x, some OEM eight-bit encoding (eg Windows-125x), or one of a handful of other compatible encodings. But it probably will be. > If not, and you have a huge file to reverse, it might be better to use a > temporary file. Yeah, or an in-memory deque if you know how many lines you want. Either way, you can read the file forwards, guaranteeing correct decoding even of a shifted character set (where a byte value can change in meaning based on arbitrarily distant context). ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: new sorting algorithm
On Mon, 2 May 2022 at 09:20, Dan Stromberg wrote: > > > On Sun, May 1, 2022 at 1:44 PM Chris Angelico wrote: >> >> On Mon, 2 May 2022 at 06:43, Dan Stromberg wrote: >> > On Sun, May 1, 2022 at 11:10 AM Chris Angelico wrote: >> >> >> >> On Mon, 2 May 2022 at 01:53, Nas Bayedil wrote: >> >> > We believe that using this method to develop completely new, fast >> >> > algorithms, approaching the speed of the famous *QuickSort*, the speed >> >> > of >> >> > which cannot be surpassed, but its drawback can be circumvented, in the >> >> > sense of stack overflow, on some data. >> >> >> >> Hmm, actually TimSort *does* exceed the speed of quicksort for a lot >> >> of real-world data. For instance, if you take a large sorted list, >> >> append a handful of (unsorted) items to it, and then sort the list, >> >> TimSort can take advantage of the fact that the bulk of the list is >> >> sorted. It ends up significantly faster than re-sorting the entire >> >> list. >> > >> > >> > In fact, Timsort is O(n) for already-sorted data, while many quicksorts >> > are O(n^2) for already-sorted data. >> > >> > Quicksort can be salvaged by using a median-of-3 partitioning, but it's >> > still O(n^2) in the (less common) worst case. >> > >> >> This is true, but purely sorted data isn't a very practical case. The >> case of mostly-sorted data IS practical, though, so it's a quite big >> win that it can be close to O(n), and still faster than inserting each >> item individually. > > > You seem to be of the impression that nearly-sorted data isn't an uphill > battle with a straightforward quicksort. > > I'm having a hard time convincing myself of that. > The median-of-three partitioning technique makes that work reasonably well, so it won't be pathologically slow. It's hardly Quicksort's best feature, but it could easily be a lot worse. I'd have to check, but I think it still manages to be O(n log n). Merge sort, of course, is a lot more consistent, but the asymptotic cost is still broadly the same. But Timsort manages to be close to O(n) for sorted data, reversed data, nearly-sorted or nearly-reversed data, etc. Makes it very handy for jobs like "click a heading to sort by it", where you might add multiple sort keys. (Plus, Python's implementation has some cool tricks for small collections that make it quite efficient.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 01May2022 23:30, Stefan Ram wrote: >Dan Stromberg writes: >>But what about Unicode? Are all 10 bytes newlines in Unicode encodings? > It seems in UTF-8, when a value is above U+007F, it will be > encoded with bytes that always have their high bit set. Aye. Design festure enabling easy resync-to-char-boundary at an arbitrary point in the file. > But Unicode has NEL "Next Line" U+0085 and other values that > conforming applications should recognize as line terminators. I disagree. Maybe for printing things. But textual data records? I would hope to end them with NL, and only NL (code 10). Cheers, Cameron Simpson -- https://mail.python.org/mailman/listinfo/python-list
Difference in Setup Between Windows 10 Running Python 3.9 and Windows 11 Running Python 3.10
Hello, I was recently running a Windows 10 machine Python 3.9. I simply created a batch file titled "Start-AIG.bat" which simply contained the following: "pythonw AIG.py". It started a python program titled "AIG.py" and the Python dialog box was displayed on my screen, running all day and night. I set up Windows to run this batch file upon startup and it worked fine. I remember having to do a bunch of research before I learned that I needed to put "pythonw AIG.py" in the batch file as opposed to "python AIG.py". However, my old windows 10 desktop crashed and I have a new Windows 11 machine. I installed the latest stable version of Python, which is 3.10. Now, when I try to execute the same batch file, it doesn't launch the app. It could be because I'm using Windows 11 or could be because I now have a newer version of Python. Does anyone have any ideas what I should do to get the Python script running on my new machine? Thank you! Brent -- https://mail.python.org/mailman/listinfo/python-list
Re: Difference in Setup Between Windows 10 Running Python 3.9 and Windows 11 Running Python 3.10
On 2022-05-02 02:56, Brent Hunter wrote: Hello, I was recently running a Windows 10 machine Python 3.9. I simply created a batch file titled "Start-AIG.bat" which simply contained the following: "pythonw AIG.py". It started a python program titled "AIG.py" and the Python dialog box was displayed on my screen, running all day and night. I set up Windows to run this batch file upon startup and it worked fine. I remember having to do a bunch of research before I learned that I needed to put "pythonw AIG.py" in the batch file as opposed to "python AIG.py". However, my old windows 10 desktop crashed and I have a new Windows 11 machine. I installed the latest stable version of Python, which is 3.10. Now, when I try to execute the same batch file, it doesn't launch the app. It could be because I'm using Windows 11 or could be because I now have a newer version of Python. Does anyone have any ideas what I should do to get the Python script running on my new machine? The problem is probably that the Python folder isn't in Windows' search path, but the recommended thing to do nowadays on Windows is to use the Python Launcher "py.exe" (or "pyw.exe" for no console window) instead: pyw AIG.py -- https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 2 May 2022 at 11:54, Cameron Simpson wrote: > > On 01May2022 23:30, Stefan Ram wrote: > >Dan Stromberg writes: > >>But what about Unicode? Are all 10 bytes newlines in Unicode encodings? > > It seems in UTF-8, when a value is above U+007F, it will be > > encoded with bytes that always have their high bit set. > > Aye. Design festure enabling easy resync-to-char-boundary at an > arbitrary point in the file. Yep - and there's also a distinction between "first byte of multi-byte character" and "continuation byte, keep scanning backwards". So you're guaranteed to be able to resynchronize. (If you know whether it's little-endian or big-endian, UTF-16 can also resync like that, since "high surrogate" and "low surrogate" look different.) > > But Unicode has NEL "Next Line" U+0085 and other values that > > conforming applications should recognize as line terminators. > > I disagree. Maybe for printing things. But textual data records? I would > hope to end them with NL, and only NL (code 10). > I'm with you on that - textual data records should end with 0x0A only. But if there are text entities in there, they should be allowed to include any Unicode characters, potentially including other types of whitespace. ChrisA -- https://mail.python.org/mailman/listinfo/python-list