Max, all,

Thank you for the pointers and help. I'm pleased to say that I seem to have recovered my data. Still to be found are the customizations I made to the standard report, but if what I have recovered stands up to some checks against known bank balances, etc., then I won't be too far from where I was a month ago.

What I have been trying to sift through (to recap a bit) is the result of a recovery done with testdisk/photorec, which left a blizzard of files and file fragments on a multi-terabyte hard drive. By and large the filenames were lost (though not in all cases), and photorec uses a list of known file signatures to try to append the appropriate file extension. This largely works, but not always. Finally, if I had known of a definitive file signature *before* I started the recovery, that might have helped; but for text-oriented files (vs. JPEGs, PDFs, executables, etc.) such a signature isn't always reliable or available.
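
As an aside, for anyone in a similar spot in the future: if you still have a known-good copy of the file type you care about, it's easy to look at its leading bytes before starting a recovery, to see whether there's a signature worth feeding to the recovery tool. A minimal sketch (the filenames here are just examples, not my actual files):

  # dump the first 64 bytes of a sample file in hex and ASCII
  head -c 64 sample.gnucash | xxd

  # check whether a file starts with the gzip magic bytes 1f 8b
  head -c 2 sample-compressed.gnucash | xxd -p

As Max points out below, an uncompressed GnuCash file shows the <gnc-v2 marker near the start, and a compressed one starts with 1f 8b.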

Fortunately, photorec seems to recognize XML and xml.gz formatted files. Diving headfirst into a pool I hadn't been in before, I came up with bash functions (this is a Linux machine I'm working on) to do recursive searches. Basically, I would open a terminal window, run

  gedit ~/.bashrc

and add the following to the end:


# Recursively search every .ods spreadsheet's content.xml for a string (case-insensitive).
function odsgrep(){
  term="$1"
  echo "Start search : $term"
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.ods"); do
    echo "$file";
    # extract the spreadsheet's content.xml to stdout, tidy it, and grep for the term
    unzip -p "$file" content.xml | tidy -q -xml 2> /dev/null | grep -i -F "$term" > /dev/null;
    if [ $? -eq 0 ]; then
      echo "FOUND FILE $file";
    fi;
  done
  IFS="$OIFS"
  echo "Finished search : $term"
}

# Recursively search the text (and HTML metadata) of every .pdf for a string, showing matches.
function mattpdfgrep(){
  term="$1"
  echo "Start search : $term"
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.pdf"); do
    #echo "$file";
    pdftotext -htmlmeta "$file" - | grep --with-filename --label="$file" --color -i -F "$term";
    if [ $? -eq 0 ]; then
      echo "$file";
      pdfinfo "$file";
    fi;
  done
  IFS="$OIFS"
  echo "Finished search : $term"
}

# Recursively search .xlsx and .xls spreadsheets for a string by converting them to CSV first.
function mattxlsgrep(){
  term="$1"
  echo "Start search : $term"
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.xlsx"); do
    #echo "$file";
    xlsx2csv "$file" | grep --with-filename --label="$file" --color -i -F "$term";
    if [ $? -eq 0 ]; then
      echo "$file";
    fi;
  done
  for file in $(find . -name "*.xls"); do
    #echo "$file";
    xls2csv "$file" | grep --with-filename --label="$file" --color -i -F "$term";
    if [ $? -eq 0 ]; then
      echo "$file";
    fi;
  done
  IFS="$OIFS"
  echo "Finished search : $term"
}

# Recursively search gzipped XML (*.xml.gz) files for a string.
function mattxmlgzgrep(){
  term="$1"
  echo "Start search : $term"
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.xml.gz"); do
    #echo "$file";
    gunzip -c "$file" | tidy -q -xml 2> /dev/null | grep -i -F "$term" > /dev/null;
    if [ $? -eq 0 ]; then
      echo "FOUND FILE $file";
    fi;
  done
  IFS="$OIFS"
  echo "Finished search : $term"
}

# Recursively search plain .txt files for a string.
function matttxtgrep(){
  term="$1"
  echo "Start search : $term"
  OIFS="$IFS"
  IFS=$'\n'
  for file in $(find . -name "*.txt"); do
    #echo "$file";
    grep -i -F "$term" "$file" > /dev/null;
    if [ $? -eq 0 ]; then
      echo "FOUND FILE $file";
    fi;
  done
  IFS="$OIFS"
  echo "Finished search : $term"
}
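
(One note for anyone copying these: after saving ~/.bashrc the new functions are only available in shells started afterwards; in an already-open terminal you can pull them in with

  source ~/.bashrc

and then call them by name from whatever directory you want to search under.)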

These custom commands (built from a 'net search that turned up a variant of the first one) allow for recursive file searches as well as the subsequent unzipping and string-search operations. Importantly, they attempt to look inside spreadsheets and PDFs, which aren't otherwise "grep-able".
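
For example, from the top of the recovered-files tree (the path and search string below are just placeholders, not my actual ones):

  cd /mnt/recovered
  odsgrep "Opening Balances"

walks every *.ods file below the current directory and prints a FOUND FILE line for each one whose content.xml contains the string, case-insensitively.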

To find the data, I used the mattxmlgzgrep routine to search *backwards* in time, starting with the following:
  <ts:date>2017-06

It found no files, which was expected, since I had last worked on this account in March or April, around US tax season. The next search, for
  <ts:date>2017-05
also turned up nothing. But searching for <ts:date>2017-04 turned up one hit, and <ts:date>2017-03 turned up a large number. So even though the timestamp on each recovered file was simply the date of the recovery, by searching backwards for dated entries I was able to narrow things down.
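
The same backwards sweep can be scripted as a small loop, something along these lines (newest month first):

  for month in 2017-06 2017-05 2017-04 2017-03; do
    mattxmlgzgrep "<ts:date>$month"
  done

The first month that produces FOUND FILE lines is the most recent one with real entries, which is what narrows the candidates down.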

Examining the file in GnuCash (it seemed to have been pulled in cleanly) showed all the categories, accounts, data, etc. that I expected to see.

It would be great to find the files related to the standard report customizations, and I'll spend a little time trying to do that. I'm not sure yet what would be a suitable "marker", but I think I have a candidate or two (see the one-liner below). After that I need to find the other records that made up some of this workflow; fortunately, they were all digital to begin with, and I believe I still have access to them online.
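
Once I have a candidate marker string, the same approach should work on the as-yet-unidentified text files; something like the one-liner below (the marker here is only a placeholder, since I don't yet know what to search for):

  grep -r -l -i -F "CANDIDATE_MARKER" .

would list every file under the current directory that contains it, regardless of extension.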

Thanks again to everyone who helped. If there's anything I can share in return, let me know.

Matt


On 2017-06-30 14:06, m...@hyre.net wrote:
Dear Matt:

The problem is that the recovery operation (using
Testdisk/Photorec) results in files and file fragments
that may or may not be correctly identified by file
extensions.

   It sounds like what you want is a magic number (file-format ID:
https://en.wikipedia.org/wiki/File_format#Magic_number) for .gnucash
files.  Looking at my file it appears that ``<gnc-v2'' starting at the
41st character in the file would do it.  (I presume the `2' in
``-v2'' is a version number, and could change at some future date, but
for now that's not a problem.)

   It would be nice if the recovery program lets you add to the
file-ID list, otherwise you're back to grep.  I hope that it
recognizes gzipped files (possible GNUCash files, compressed), but if
not you want to look for the first two characters = 0x1f 0x8b.  Of
course, then you'll have to unzip them to see whether they're really
what you want.  :-/

   Gurus:  Is this right?  For future-proofing, can we assume the
magic number will always be in position 41?  Is there an actual,
designated, magic number for GNUCash files somewhere?

   Hope this makes sense/helps...


       Best wishes,

           Max Hyre