Finding a text in raw data(size nearly 10GB) and Printing its memory address using python

2018-04-23 Thread Hac4u
I have a raw data file of nearly 10 GB. I would like to find a text string in
it and print the memory address at which it is stored.

This is my code

import os
import re
filename="filename.dmp"
read_data=2**24
searchtext="bd:mongo:"
he=searchtext.encode('hex')
with open(filename, 'rb') as f:
    while True:
        data= f.read(read_data)
        if not data:
            break
        elif searchtext in data:
            print "Found"
            try:
                offset=hex(data.index(searchtext))
                print offset
            except ValueError:
                print 'Not Found'
        else:
            continue


The addresses I am getting are
#0x2c0900
#0xb62300

But the actual positions are
# 652c0900
# 652c0950
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Finding a text in raw data(size nearly 10GB) and Printing its memory address using python

2018-04-23 Thread Hac4u
On Monday, April 23, 2018 at 11:01:39 PM UTC+5:30, Chris Angelico wrote:
> On Tue, Apr 24, 2018 at 3:24 AM, Hac4u  wrote:
> > I have a raw data of size nearly 10GB. I would like to find a text string 
> > and print the memory address at which it is stored.
> >
> > This is my code
> >
> > import os
> > import re
> > filename="filename.dmp"
> > read_data=2**24
> > searchtext="bd:mongo:"
> > he=searchtext.encode('hex')
> 
> Why encode it as hex?
> 
> > with open(filename, 'rb') as f:
> >     while True:
> >         data= f.read(read_data)
> >         if not data:
> >             break
> >         elif searchtext in data:
> >             print "Found"
> >             try:
> >                 offset=hex(data.index(searchtext))
> >                 print offset
> >             except ValueError:
> >                 print 'Not Found'
> >         else:
> >             continue
> 
> You have a loop that reads a slab of data from a file, then searches
> the current data only. Then you search that again for the actual
> index, and print it - but you're printing the offset within the
> current chunk only. You'll need to maintain a chunk position in order
> to get the actual offset.
> 
> Also, you're not going to find this if it spans across a chunk
> boundary. May need to cope with that.
> 
> ChrisA
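
A minimal sketch of the first point above, written for Python 3 (the code in
this thread is Python 2). It keeps a running count of the bytes already read
and adds it to the index found inside the current chunk. The helper name
find_first_offset and the in-memory sample are assumptions made for
illustration, and the sketch deliberately ignores a match that spans a chunk
boundary.

```python
import io

def find_first_offset(f, needle, chunk_size=2**24):
    """Return the absolute offset of the first match of needle in f, or -1.

    Illustrative only: a match split across two chunks is not detected.
    """
    chunk_start = 0  # absolute file offset of the first byte of the chunk
    while True:
        data = f.read(chunk_size)
        if not data:
            return -1
        index = data.find(needle)
        if index >= 0:
            # Offset within the chunk plus the bytes consumed before it.
            return chunk_start + index
        chunk_start += len(data)

# Small in-memory stand-in for the dump file (hypothetical data).
sample = io.BytesIO(b"x" * 35 + b"bd:mongo:" + b"y" * 5)
print(hex(find_first_offset(sample, b"bd:mongo:", chunk_size=16)))  # prints 0x23
```

Printing only the index within the chunk, as the original code does, would
report 0x3 here instead of 0x23, which is the same kind of discrepancy as
0x2c0900 versus 652c0900.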



I was encoding it just to try something. You can ignore that line.

Yes, I was not maintaining the chunk position. Can you help me out with a
link? I am out of ideas, and this is my first time dealing with memory dumps.

Regards
Samaksh


Re: Finding a text in raw data(size nearly 10GB) and Printing its memory address using python

2018-04-23 Thread Hac4u
On Tuesday, April 24, 2018 at 1:28:07 AM UTC+5:30, Paul Rubin wrote:
> Hac4u  writes:
> > I have a raw data of size nearly 10GB. I would like to find a text
> > string and print the memory address at which it is stored.
> 
> The simplest way is probably to mmap the file and use mmap.find:
> 
> https://docs.python.org/2/library/mmap.html#mmap.mmap.find
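
A rough sketch of that suggestion, assuming Python 3 (the link above is to the
Python 2 documentation) and a 64-bit build, since a 32-bit process does not
have enough address space to map a 10 GB file. The function name find_all_mmap
and the temporary demo file are made up for illustration. Because the whole
file is searched as one buffer, there are no chunk boundaries to worry about.

```python
import mmap
import os
import tempfile

def find_all_mmap(path, needle):
    """Return the absolute offsets of every non-overlapping match of needle."""
    offsets = []
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(needle, pos + len(needle))
    return offsets

# Build a small throwaway file to stand in for the real dump (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 100 + b"bd:mongo:" + b"\x00" * 50 + b"bd:mongo:")
    demo_path = tmp.name

demo_offsets = find_all_mmap(demo_path, b"bd:mongo:")
os.remove(demo_path)
print([hex(o) for o in demo_offsets])
```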

Thanks a lot, buddy.

And yes, I will try to convert it to use mmap.

Your code helped a lot, but I have a few doubts:
1. What is the use of the overlap?
2. Your code does not end; it does not break even after searching through the
entire file.

By the way, I modified your code:

import os
import re
filename="E:/bitdefender/test.vmem"
read_data=2**24
offset=0
chunk_start=0
searchtext=b"bd:mongo:"
search_length=len(searchtext)
overlap_length = search_length - 1 
he=searchtext.encode('hex')
with open(filename, 'rb') as f:
    while True:
        data= f.read(read_data)
        if not data:
            break
        while True:
            offset=data.find(searchtext,offset)
            # print offset
            if offset < 0:
                break
            print "Found at",hex(chunk_start+offset)
            offset+=search_length

        chunk_start += len(data)
        data=data[read_data:]
        offset=0



Re: Finding a text in raw data(size nearly 10GB) and Printing its memory address using python

2018-04-23 Thread Hac4u
On Tuesday, April 24, 2018 at 12:54:43 AM UTC+5:30, MRAB wrote:
> Here's a version that handles overlaps.
> 
> Try to keep in mind the distinction between bytestrings and text 
> strings. It doesn't matter as much in Python 2, but it does in Python 3.
> 
> 
> filename = "filename.dmp"
> chunk_size = 2**24
> search_text = b"bd:mongo:"
> chunk_start = 0
> offset = 0
> search_length = len(search_text)
> overlap_length = search_length - 1
> data = b''
> 
> with open(filename, 'rb') as f:
>     while True:
>         # Read in more data.
>         data += f.read(chunk_size)
>         if not data:
>             break
>
>         # Search this chunk.
>         while True:
>             offset = data.find(search_text, offset)
>             if offset < 0:
>                 break
>
>             print "Found at", hex(chunk_start + offset)
>             offset += search_length
>
>         # We've searched this chunk. Discard all but a portion of overlap.
>         chunk_start += len(data) - overlap_length
>
>         if overlap_length > 0:
>             data = data[-overlap_length : ]
>         else:
>             data = b''
>
>         offset = 0



Thanks a lot for the code.

I have two questions:

1. Why did you use an overlap, and in what conditions can it be counted on?
2. Your code does not end. It keeps on looking for something, though it worked
well.

So, thanks a lot for the code.

Here is my modified code (written with help from yours):

import os
import re
filename="filename.dmp"
read_data=2**24
offset=0
chunk_start=0
searchtext=b"bd:mongo:"
search_length=len(searchtext)
overlap_length = search_length - 1 
he=searchtext.encode('hex')
with open(filename, 'rb') as f:
    while True:
        data= f.read(read_data)
        if not data:
            break
        while True:
            offset=data.find(searchtext,offset)
            # print offset
            if offset < 0:
                break
            print "Found at",hex(chunk_start+offset)
            offset+=search_length

        chunk_start += len(data)
        data=data[read_data:]
        offset=0


Re: Finding a text in raw data(size nearly 10GB) and Printing its memory address using python

2018-04-23 Thread Hac4u
On Tuesday, April 24, 2018 at 4:13:17 AM UTC+5:30, MRAB wrote:
> On 2018-04-23 22:11, Hac4u wrote:
> > Thanks a lot for the code.
> > 
> > I have two questions:
> > 
> > 1. Why did you use an overlap, and in what conditions can it be counted on?
> 
> Suppose you're searching for b"bd:mongo:".
> 
> What happens if a chunk ends with b"b" and the next chunk starts with 
> b"d:mongo:"? Or b"bd:m" and b"ongo:"? Or b"bd:mongo" and b":"?
> 
> It wouldn't find a match that's split across chunks.
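
The split described above can be reproduced on a small in-memory buffer. This
is only an illustrative Python 3 sketch: the function name find_in_chunks, the
16-byte chunk size, and the sample data are assumptions, not part of the code
posted in this thread. It searches the same bytes twice, once with no overlap
and once carrying the last len(needle) - 1 bytes into the next chunk.

```python
def find_in_chunks(data, needle, chunk_size, overlap):
    """Search data chunk by chunk, carrying `overlap` bytes between chunks."""
    found = []
    tail = b""  # bytes kept from the end of the previous buffer
    pos = 0     # absolute offset of the next chunk
    while pos < len(data):
        chunk = data[pos:pos + chunk_size]
        buf = tail + chunk
        buf_start = pos - len(tail)  # absolute offset of buf[0]
        i = buf.find(needle)
        while i >= 0:
            found.append(buf_start + i)
            i = buf.find(needle, i + 1)
        pos += len(chunk)
        tail = buf[-overlap:] if overlap > 0 else b""
    return found

needle = b"bd:mongo:"
# The first 16-byte chunk ends with b"bd:m", the second starts with b"ongo:".
data = b"x" * 12 + needle + b"x" * 11

print(find_in_chunks(data, needle, chunk_size=16, overlap=0))                # prints []
print(find_in_chunks(data, needle, chunk_size=16, overlap=len(needle) - 1))  # prints [12]
```

An overlap of len(needle) - 1 bytes is enough because the carried tail is one
byte too short to hold a whole match, so nothing is reported twice, yet any
match that starts in the tail is completed by the next chunk.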
> 
> > 2. Your code does not end. It keeps on looking for something, though it
> > worked well.
> > 
> > So, thanks a lot for the code.
> > 
> Here's my code with a bug fix:
> 
> filename = "filename.dmp"
> chunk_size = 2**24
> search_text = b"bd:mongo:"
> chunk_start = 0
> offset = 0
> search_length = len(search_text)
> overlap_length = search_length - 1
> data = b''
> 
> with open(filename, 'rb') as f:
>     while True:
>         # Read in more data.
>         data += f.read(chunk_size)
>         if len(data) < search_length:
>             break
>
>         # Search this chunk.
>         while True:
>             offset = data.find(search_text, offset)
>             if offset < 0:
>                 break
>
>             print "Found at", hex(chunk_start + offset)
>             offset += search_length
>
>         # We've searched this chunk. Discard all but a portion of overlap.
>         chunk_start += len(data) - overlap_length
>
>         if overlap_length > 0:
>             data = data[-overlap_length : ]
>         else:
>             data = b''
>
>         offset = 0
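
The reason the first version never ended appears to be that, at end of file,
the buffer still holds the overlap bytes carried over from the previous chunk,
so "if not data" is never true; the fix stops once the buffer is shorter than
the search string. For readers on Python 3, the same idea can be sketched as a
generator that stops as soon as read() returns nothing. The name iter_offsets
and the in-memory sample are hypothetical, and unlike the code above this
sketch also reports matches that overlap each other.

```python
import io

def iter_offsets(f, needle, chunk_size=2**24):
    """Yield the absolute offset of every match of needle in f."""
    if not needle:
        raise ValueError("needle must not be empty")
    overlap = len(needle) - 1
    buf = b""
    buf_start = 0  # absolute offset of buf[0]
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break  # end of file, even though buf may still hold the overlap
        buf += chunk
        pos = buf.find(needle)
        while pos >= 0:
            yield buf_start + pos
            pos = buf.find(needle, pos + 1)
        # Keep only the tail that could start a match finished by the next chunk.
        keep = min(overlap, len(buf))
        buf_start += len(buf) - keep
        buf = buf[len(buf) - keep:]

# In-memory stand-in for the dump; both matches straddle a 16-byte boundary.
sample = io.BytesIO(b"x" * 12 + b"bd:mongo:" + b"x" * 20 + b"bd:mongo:" + b"x" * 3)
print([hex(o) for o in iter_offsets(sample, b"bd:mongo:", chunk_size=16)])
```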

Got it.

Thanks a ton for the explanation.