On Wed, Dec 11, 2019 at 1:57 AM songbird <songb...@anthive.com> wrote:
>
> Chris Angelico wrote:
> > On Tue, Dec 10, 2019 at 12:15 PM songbird <songb...@anthive.com> wrote:
> >>
> >> Chris Angelico wrote:
> >> ...
> >> >
> >> > Here's an example piece of code.
> >> >
> >> > sock = socket.socket(...)
> >> > name = input("Enter your username: ")
> >> > code = input("Enter the base64 code: ")
> >> > code = base64.b64decode(code)
> >> > sock.write("""GET /foo HTTP/1.0
> >> > Authentication: Demo %s/%s
> >> >
> >> > """ % (name, code))
> >> > match = re.search(r"#[A-Za-z0-9]+#", sock.read())
> >> > if match: print("Response: " + match.group(0))
> >> >
> >> > Your challenge: Figure out which of those strings should be a byte
> >> > string and which should be text. Or alternatively, prove that this is
> >> > a hard problem. There are only a finite number of types - two, to be
> >> > precise - so by your argument, this should be straightforward, right?
> >>
> >> this isn't a process of looking at isolated code. this
> >> is a process of looking at the code, but also the test cases
> >> or working examples. so the inputs are known and the code
> >> itself gives clues about what it is expecting.
> >
> > Okay. The test cases are also written in Python, and they use
> > unadorned string literals to provide mock values for input() and the
> > socket response. Now what?
>
> wouldn't there be clues in how that string is used in
> the program itself (either calls to converters or when
> the literal is assigned to some variable or used in a
> print statement)?
>
>
> > What if the test cases are entirely ASCII characters?
>
> it all goes utf in that case and the string is not
> binary.
>
>
> > What if the test cases are NOT entirely ASCII characters?
>
> if the program has more than one language then you may
> have to see what the character set falls into. is it hex,
> is it octal or binary or some language.
> i'd guess there will be clues in the code as to how that string is
> used later.
>
>
> >> regular expressions can be matched in finite time as well
> >> as a fixed length text of any type can be scanned as a match
> >> or rejected.
> >>
> >> if you examined a thousand uses of match and found the
> >> pattern used above and then examined what those programs did
> >> with that match what would you select as the first type, the
> >> one used the most first, if that doesn't work go with the 2nd,
> >> etc.
> >>
> >
> > That's not really the point. Are your regular expressions working with
> > text or bytes? Does your socket return text or bytes?
>
> clues in the program again. you're not limited to looking
> only at the string itself, but the context of the entire
> program. i'm sure patterns are there to be found if you
> can scan enough programs they'll start showing up. once
> you've found a viable pattern then you have a way to
> generate a test case to see if it works or not.
>
>
> > I've deliberately chosen these examples because they are hard. And I
> > didn't even get into an extremely hard problem, with the inclusion of
> > text inside binary data inside of text inside of bytes. (It does
> > happen.)
> >
> > These problems are fundamentally hard because there is insufficient
> > information in the source code alone to determine the programmer's
> > intent.
>
> that is why we would be running the program itself and
> examining test case results.
>
> none of these programs run in isolation, information is
> known what they expect as input or produce as output.
>
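
A minimal sketch of why running ASCII-only test cases may not settle the
question, assuming Python 3. The mock values and helper names below are
hypothetical, made up for illustration; they are not part of the code quoted
above. Both readings of the regex step pass the same test, and the header
formatting runs without error even though it almost certainly produces the
wrong bytes on the wire.

```python
import base64
import re

# Hypothetical ASCII-only mock values standing in for input() and the socket.
NAME = "alice"
CODE = "aGVsbG8="  # base64 encoding of b"hello"
RESPONSE = "HTTP/1.0 200 OK\r\n\r\n#abc123#"

PATTERN = r"#[A-Za-z0-9]+#"


def find_tag_as_text(response: str):
    """Reading 1: the socket data and the pattern are both text."""
    match = re.search(PATTERN, response)
    return match.group(0) if match else None


def find_tag_as_bytes(response: bytes):
    """Reading 2: the socket data and the pattern are both bytes."""
    match = re.search(PATTERN.encode("ascii"), response)
    return match.group(0) if match else None


# With ASCII-only data, both readings find the tag, so the test case
# cannot tell us which one the programmer meant.
text_result = find_tag_as_text(RESPONSE)
bytes_result = find_tag_as_bytes(RESPONSE.encode("ascii"))

# Mixing the two readings is rejected by the re module.
try:
    re.search(PATTERN, RESPONSE.encode("ascii"))
    mixed_ok = True
except TypeError:
    mixed_ok = False

# b64decode returns bytes. Formatting bytes into a text template does not
# fail; it silently embeds the repr of the bytes object instead.
decoded = base64.b64decode(CODE)
header = "Authentication: Demo %s/%s" % (NAME, decoded)

print(text_result, bytes_result, mixed_ok)
print(header)
```

Nothing in the mock data says whether the header should carry the decoded
bytes, their text decoding, or the original base64 string; that choice has
to come from the programmer's intent.
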

Go do it. Then come back and revisit your assumptions here.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list