Finding a text in raw data(size nearly 10GB) and Printing its memory address using python
I have a raw data of size nearly 10GB. I would like to find a text string and print the memory address at which it is stored.

This is my code

import os
import re
filename="filename.dmp"
read_data=2**24
searchtext="bd:mongo:"
he=searchtext.encode('hex')
with open(filename, 'rb') as f:
    while True:
        data= f.read(read_data)
        if not data:
            break
        elif searchtext in data:
            print "Found"
            try:
                offset=hex(data.index(searchtext))
                print offset
            except ValueError:
                print 'Not Found'
        else:
            continue

The address I am getting is
#0x2c0900
#0xb62300

But the actual positioning is
# 652c0900
# 652c0950

-- https://mail.python.org/mailman/listinfo/python-list
Re: Finding a text in raw data(size nearly 10GB) and Printing its memory address using python
On Monday, April 23, 2018 at 11:01:39 PM UTC+5:30, Chris Angelico wrote:
> On Tue, Apr 24, 2018 at 3:24 AM, Hac4u wrote:
> > I have a raw data of size nearly 10GB. I would like to find a text string
> > and print the memory address at which it is stored.
> >
> > This is my code
> >
> > import os
> > import re
> > filename="filename.dmp"
> > read_data=2**24
> > searchtext="bd:mongo:"
> > he=searchtext.encode('hex')
>
> Why encode it as hex?
>
> > with open(filename, 'rb') as f:
> >     while True:
> >         data= f.read(read_data)
> >         if not data:
> >             break
> >         elif searchtext in data:
> >             print "Found"
> >             try:
> >                 offset=hex(data.index(searchtext))
> >                 print offset
> >             except ValueError:
> >                 print 'Not Found'
> >         else:
> >             continue
>
> You have a loop that reads a slab of data from a file, then searches
> the current data only. Then you search that again for the actual
> index, and print it - but you're printing the offset within the
> current chunk only. You'll need to maintain a chunk position in order
> to get the actual offset.
>
> Also, you're not going to find this if it spans across a chunk
> boundary. May need to cope with that.
>
> ChrisA

I was encoding it to try something; you can ignore that line.

Yes, I was not maintaining the chunk position. Can you help me out with a link? I am out of ideas, and this is my first time dealing with memory codes.

Regards
Samaksh
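The numbers in the original post fit the chunk-position explanation. As a small sketch (Python 3; the helper name `absolute_offset` is made up for illustration), splitting the expected address by the 2**24 read size gives back exactly the in-chunk offset that was printed:

```python
CHUNK_SIZE = 2**24  # read size used in the original post


def absolute_offset(chunk_index, offset_in_chunk, chunk_size=CHUNK_SIZE):
    """File offset of a match found offset_in_chunk bytes into the
    chunk_index-th read (counting from 0)."""
    return chunk_index * chunk_size + offset_in_chunk


# The post printed 0x2c0900 where 0x652c0900 was expected.
expected = 0x652c0900
chunk_index, offset_in_chunk = divmod(expected, CHUNK_SIZE)

print(chunk_index, hex(offset_in_chunk))                    # 101 0x2c0900
print(hex(absolute_offset(chunk_index, offset_in_chunk)))   # 0x652c0900
```

So the first hit sits in read number 101 (0x65), and the printed value only lacked the 0x65 * 2**24 bytes that had already been read.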
Re: Finding a text in raw data(size nearly 10GB) and Printing its memory address using python
On Tuesday, April 24, 2018 at 1:28:07 AM UTC+5:30, Paul Rubin wrote:
> Hac4u writes:
> > I have a raw data of size nearly 10GB. I would like to find a text
> > string and print the memory address at which it is stored.
>
> The simplest way is probably to mmap the file and use mmap.find:
>
> https://docs.python.org/2/library/mmap.html#mmap.mmap.find

Thanks a lot, buddy. And yes, I will try to convert it to mmap.

Your code helped a lot, but I have a few doubts:

1. What is the use of overlap?
2. Your code does not end. It does not break even after searching through the entire file.

By the way, I modified your code:

import os
import re
filename="E:/bitdefender/test.vmem"
read_data=2**24
offset=0
chunk_start=0
searchtext=b"bd:mongo:"
search_length=len(searchtext)
overlap_length = search_length - 1
he=searchtext.encode('hex')
with open(filename, 'rb') as f:
    while True:
        data= f.read(read_data)
        if not data:
            break
        while True:
            offset=data.find(searchtext,offset)
            # print offset
            if offset < 0:
                break
            print "Found at",hex(chunk_start+offset)
            offset+=search_length
        chunk_start += len(data)
        data=data[read_data:]
        offset=0
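For reference, the mmap suggestion quoted above might look roughly like the sketch below. It is written for Python 3 rather than the Python 2 used in this thread, and the function name `find_all_mmap` is invented for illustration. It assumes a 64-bit Python, since a 32-bit process cannot map a file of about 10GB in one piece.

```python
import mmap


def find_all_mmap(path, needle):
    """Yield the file offset of every non-overlapping occurrence of
    needle (a bytes object) in the file at path."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file read-only. Mapping an empty file
        # raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                yield pos
                # Resume just past this match, as the chunked code
                # in this thread does.
                pos = mm.find(needle, pos + len(needle))
```

Usage would be something like `for pos in find_all_mmap("filename.dmp", b"bd:mongo:"): print("Found at", hex(pos))`. Because the operating system pages the file in on demand, there are no chunk boundaries to worry about and the offsets returned are already absolute.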
Re: Finding a text in raw data(size nearly 10GB) and Printing its memory address using python
On Tuesday, April 24, 2018 at 12:54:43 AM UTC+5:30, MRAB wrote:
> On 2018-04-23 18:24, Hac4u wrote:
> > I have a raw data of size nearly 10GB. I would like to find a text string
> > and print the memory address at which it is stored.
> >
> > This is my code
> >
> > import os
> > import re
> > filename="filename.dmp"
> > read_data=2**24
> > searchtext="bd:mongo:"
> > he=searchtext.encode('hex')
> > with open(filename, 'rb') as f:
> >     while True:
> >         data= f.read(read_data)
> >         if not data:
> >             break
> >         elif searchtext in data:
> >             print "Found"
> >             try:
> >                 offset=hex(data.index(searchtext))
> >                 print offset
> >             except ValueError:
> >                 print 'Not Found'
> >         else:
> >             continue
> >
> >
> > The address I am getting is
> > #0x2c0900
> > #0xb62300
> >
> > But the actual positioning is
> > # 652c0900
> > # 652c0950
> >
> Here's a version that handles overlaps.
>
> Try to keep in mind the distinction between bytestrings and text
> strings. It doesn't matter as much in Python 2, but it does in Python 3.
>
>
> filename = "filename.dmp"
> chunk_size = 2**24
> search_text = b"bd:mongo:"
> chunk_start = 0
> offset = 0
> search_length = len(search_text)
> overlap_length = search_length - 1
> data = b''
>
> with open(filename, 'rb') as f:
>     while True:
>         # Read in more data.
>         data += f.read(chunk_size)
>         if not data:
>             break
>
>         # Search this chunk.
>         while True:
>             offset = data.find(search_text, offset)
>             if offset < 0:
>                 break
>
>             print "Found at", hex(chunk_start + offset)
>             offset += search_length
>
>         # We've searched this chunk. Discard all but a portion of overlap.
>         chunk_start += len(data) - overlap_length
>
>         if overlap_length > 0:
>             data = data[-overlap_length : ]
>         else:
>             data = b''
>
>         offset = 0

Thanks a lot for the code.

I have two questions:

1. Why did you use overlap? And in what condition can it be counted on?
2. Your code does not end. It keeps on looking for something, though it worked well.

So, thanks a lot for the code.

Here is my modified code (written with help from your code):

import os
import re
filename="filename.dmp"
read_data=2**24
offset=0
chunk_start=0
searchtext=b"bd:mongo:"
search_length=len(searchtext)
overlap_length = search_length - 1
he=searchtext.encode('hex')
with open(filename, 'rb') as f:
    while True:
        data= f.read(read_data)
        if not data:
            break
        while True:
            offset=data.find(searchtext,offset)
            # print offset
            if offset < 0:
                break
            print "Found at",hex(chunk_start+offset)
            offset+=search_length
        chunk_start += len(data)
        data=data[read_data:]
        offset=0
Re: Finding a text in raw data(size nearly 10GB) and Printing its memory address using python
On Tuesday, April 24, 2018 at 4:13:17 AM UTC+5:30, MRAB wrote:
> On 2018-04-23 22:11, Hac4u wrote:
> > On Tuesday, April 24, 2018 at 12:54:43 AM UTC+5:30, MRAB wrote:
> >> On 2018-04-23 18:24, Hac4u wrote:
> >> > I have a raw data of size nearly 10GB. I would like to find a text
> >> > string and print the memory address at which it is stored.
> >> >
> >> > This is my code
> >> >
> >> > import os
> >> > import re
> >> > filename="filename.dmp"
> >> > read_data=2**24
> >> > searchtext="bd:mongo:"
> >> > he=searchtext.encode('hex')
> >> > with open(filename, 'rb') as f:
> >> >     while True:
> >> >         data= f.read(read_data)
> >> >         if not data:
> >> >             break
> >> >         elif searchtext in data:
> >> >             print "Found"
> >> >             try:
> >> >                 offset=hex(data.index(searchtext))
> >> >                 print offset
> >> >             except ValueError:
> >> >                 print 'Not Found'
> >> >         else:
> >> >             continue
> >> >
> >> >
> >> > The address I am getting is
> >> > #0x2c0900
> >> > #0xb62300
> >> >
> >> > But the actual positioning is
> >> > # 652c0900
> >> > # 652c0950
> >> >
> >> Here's a version that handles overlaps.
> >>
> >> Try to keep in mind the distinction between bytestrings and text
> >> strings. It doesn't matter as much in Python 2, but it does in Python 3.
> >>
> >>
> >> filename = "filename.dmp"
> >> chunk_size = 2**24
> >> search_text = b"bd:mongo:"
> >> chunk_start = 0
> >> offset = 0
> >> search_length = len(search_text)
> >> overlap_length = search_length - 1
> >> data = b''
> >>
> >> with open(filename, 'rb') as f:
> >>     while True:
> >>         # Read in more data.
> >>         data += f.read(chunk_size)
> >>         if not data:
> >>             break
> >>
> >>         # Search this chunk.
> >>         while True:
> >>             offset = data.find(search_text, offset)
> >>             if offset < 0:
> >>                 break
> >>
> >>             print "Found at", hex(chunk_start + offset)
> >>             offset += search_length
> >>
> >>         # We've searched this chunk. Discard all but a portion of overlap.
> >>         chunk_start += len(data) - overlap_length
> >>
> >>         if overlap_length > 0:
> >>             data = data[-overlap_length : ]
> >>         else:
> >>             data = b''
> >>
> >>         offset = 0
> >
> >
> >
> > Thanks alot for the code.
> >
> > I have two questions
> >
> > 1. Why did u use overlap. And, In what condition it can be counted on?
>
> Suppose you're searching for b"bd:mongo:".
>
> What happens if a chunk ends with b"b" and the next chunk starts with
> b"d:mongo:"? Or b"bd:m" and b"ongo:"? Or b"bd:mongo" and b":"?
>
> It wouldn't find a match that's split across chunks.
>
> > 2. Your code does not end. It keep on looking for sth ..Though it worked
> > well.
> >
> > So, Thanks alot for the code.
> >
> Here's my code with a bug fix:
>
> filename = "filename.dmp"
> chunk_size = 2**24
> search_text = b"bd:mongo:"
> chunk_start = 0
> offset = 0
> search_length = len(search_text)
> overlap_length = search_length - 1
> data = b''
>
> with open(filename, 'rb') as f:
>     while True:
>         # Read in more data.
>         data += f.read(chunk_size)
>         if len(data) < search_length:
>             break
>
>         # Search this chunk.
>         while True:
>             offset = data.find(search_text, offset)
>             if offset < 0:
>                 break
>
>             print "Found at", hex(chunk_start + offset)
>             offset += search_length
>
>         # We've searched this chunk. Discard all but a portion of overlap.
>         chunk_start += len(data) - overlap_length
>
>         if overlap_length > 0:
>             data = data[-overlap_length : ]
>         else:
>             data = b''
>
>         offset = 0

Got it. Thanks a ton for the explanation.
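To see the effect of the overlap on a small input, here is a sketch of the same idea as a Python 3 generator. This is a rewrite for illustration, not MRAB's code: the name `find_all_chunked` is made up, it stops on an empty read instead of checking the buffer length, and it reports every occurrence, including ones that overlap each other. A deliberately tiny chunk size forces both matches to straddle chunk boundaries.

```python
import io


def find_all_chunked(f, needle, chunk_size=2**24):
    """Yield the absolute offset of every occurrence of needle in the
    binary file object f, reading chunk_size bytes at a time."""
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1   # longest tail that could begin a match
    data = b""
    chunk_start = 0             # file offset of data[0]
    while True:
        block = f.read(chunk_size)
        if not block:
            break               # EOF: the leftover tail is shorter than needle
        data += block
        pos = data.find(needle)
        while pos >= 0:
            yield chunk_start + pos
            pos = data.find(needle, pos + 1)
        # Keep only the tail that might start a match completed by the next read.
        keep = min(overlap, len(data))
        chunk_start += len(data) - keep
        data = data[len(data) - keep:]


sample = b"xxxbd:mongo:yyybd:mongo:"
hits = list(find_all_chunked(io.BytesIO(sample), b"bd:mongo:", chunk_size=5))
print([hex(h) for h in hits])   # ['0x3', '0xf']
```

With `chunk_size=5` no single read contains the whole 9-byte string, so a version that discards each chunk after searching it would report nothing here. Keeping the last `len(needle) - 1` bytes is what lets both matches be found, and since that tail is too short to hold a complete match, nothing is reported twice.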