I forgot to preface my previous response with this:

The first code example processes your pages sequentially, one page at a time.
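
For what it's worth, the same one-page-at-a-time flow can probably be written more plainly with async/await. This is only a rough sketch: convertPage and recognize below are hypothetical stand-ins I made up so the snippet runs on its own, in place of pdfImage.convertPage and Tesseract.recognize, and processPagesInOrder is just an illustrative name.

```javascript
// Hypothetical stand-ins for pdfImage.convertPage and Tesseract.recognize,
// so this sketch runs without either library installed.
const convertPage = async (pageIndex) => `page-${pageIndex}.png`;
const recognize = async (imagePath) => ({ data: { text: `text of ${imagePath}` } });

// Process pages strictly one at a time: each await completes before the
// next page starts, so only one conversion/OCR job is in flight.
async function processPagesInOrder(numberOfPages) {
  const results = [];
  for (let pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
    const pageImage = await convertPage(pageIndex);
    const { data: { text } } = await recognize(pageImage);
    // perform other synchronous processing with the text here
    results.push(text);
  }
  return results;
}
```

Calling processPagesInOrder(numberOfPages) returns a promise for the page texts in page order, and since nothing overlaps, memory use should stay close to that of a single page.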

On Mon, Jun 8, 2020 at 10:07 AM Brian Yeh <heyna...@gmail.com> wrote:
>
> import Tesseract from 'tesseract.js';
> import pdfimage from 'pdf-image';
>
>
> var PDFImage = pdfimage.PDFImage;
> var pdfImage = new PDFImage(pdfFilePath, { convertOptions: { "-density": "196" }});
>
>
> function recursiveFunction(pageIndex, numberOfPages)
> {
>  if (pageIndex >= numberOfPages) {
>     //all pages are done (pages are indexed from 0 to numberOfPages - 1)
>     return;
>  }
>  pdfImage.convertPage(pageIndex).then(function (pageImage) {
>    Tesseract.recognize(pageImage, 'eng').then(({ data: { text } }) => {
>      console.log(text);
>      //perform other synchronous processing with the text before moving to the next page...
>      //only start the next page once this one has finished
>      recursiveFunction(pageIndex + 1, numberOfPages);
>    });
>  });
> }
>
> recursiveFunction(0, numberOfPages)
>
> Additionally, if processing a page does not depend on the results of
> earlier pages, I would not go the sequential route. If you need the results
> ordered as if they had been produced sequentially, you can simulate this
> with a preallocated array where each page has its own index. See below:
>
> import Tesseract from 'tesseract.js';
> import pdfimage from 'pdf-image';
>
>
> var PDFImage = pdfimage.PDFImage;
> var pdfImage = new PDFImage(pdfFilePath, { convertOptions: { "-density": "196" }});
>
> var acc = [];
> for(var pageIndex = 0; pageIndex < numberOfPages; pageIndex++)
> {
>     //Make sure you do this!! node closures will reference the wrong pageIndex otherwise!
>     let currentPageIndex = pageIndex;
>     pdfImage.convertPage(pageIndex).then(function (pageImage) {
>        Tesseract.recognize(pageImage, 'eng').then(({ data: { text } }) => {
>           console.log(text);
>           //perform other processing with the text before moving to the next page...
>           //Save the result at this page's index.
>           acc[currentPageIndex] = text;
>        });
>     });
> }
>
> The end result of the above code is that all processes will be done async but 
> when everything is finished your results will be saved in an array in an 
> ordered format from the first page to the last.
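
If starting every page at once turns out to be too heavy on memory, a middle ground might be to cap how many pages are in flight while still storing results by page index. A rough sketch under my own assumptions: ocrPage is a hypothetical stand-in for the convertPage + recognize step, and mapWithLimit is a helper name I made up, not something from either library.

```javascript
// Hypothetical stand-in for converting and OCR-ing a single page.
const ocrPage = async (pageIndex) => `text of page ${pageIndex}`;

// Run `task` for every index in [0, count) with at most `limit` tasks in
// flight. Each result is written at its own index, so the returned array
// is ordered by page no matter which task finishes first.
async function mapWithLimit(count, limit, task) {
  const results = new Array(count);
  let next = 0;
  async function runner() {
    while (next < count) {
      // The index is claimed synchronously, so no two runners share a page.
      const index = next++;
      results[index] = await task(index);
    }
  }
  const runners = Array.from({ length: Math.min(limit, count) }, runner);
  await Promise.all(runners);
  return results;
}
```

For example, mapWithLimit(numberOfPages, 2, ocrPage) would keep at most two pages in progress at a time and resolve to the texts in page order.
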
>
> Btw I'm not a node expert, just a generalist programmer, honestly not even 
> that into node... so someone else may have a better idea how to do this 
> idiomatically.
>
>
> On Monday, June 8, 2020 at 2:25:05 AM UTC-7, Matthew Hamilton wrote:
>>
>> I'm relatively new to JavaScript programming for node.js, and I've been 
>> reading about this for 5 hours and cannot wrap my head around it, so here I 
>> am... I am trying to get the text from a PDF for searching. I need page 
>> numbers, line numbers, and character positions of the results. It appears 
>> that pdf.js cannot keep line breaks at the very least. I wonder if it will 
>> keep multiple sequential spaces, but the line breaks are a dealbreaker, so 
>> I've moved on. Now I'm using pdf-image to convert the pdf document to a png 
>> for each page. Then I want to use tesseract.js to run OCR on the png files 
>> to get the text as it appears in the pdf including line breaks and extra 
>> spaces. The problem is if the pdf document is more than 5-10 pages, then 
>> execution kills my laptop. The process of converting the pdf to png's 
>> consumes over 12GB of RAM and never finishes to even move on to the OCR
>> which has to be worse. The average number of pages in the pdf documents I am 
>> processing is 300-500, so I have to batch process. The problem I have is 
>> that pdf-image and tesseract.js both use promises for async processing. It's 
>> really the async that's killing my laptop. I just want to get the number of 
>> pages, loop over each page one at a time, convert it to png, then perform 
>> the OCR, then finish some other synchronous processing before moving on to 
>> the next page. The code I have right now that doesn't work is:
>>
>> import Tesseract from 'tesseract.js';
>> import pdfimage from 'pdf-image';
>>
>> var PDFImage = pdfimage.PDFImage;
>> var pdfImage = new PDFImage(pdfFilePath, { convertOptions: { "-density": "196" }});
>>
>> for(var pageIndex = 0; pageIndex < numberOfPages; pageIndex++)
>> {
>> pdfImage.convertPage(pageIndex).then(function (pageImage) {
>> Tesseract.recognize(pageImage, 'eng').then(({ data: { text } }) => {
>> console.log(text);
>> //perform other synchronous processing with the text before moving to the next page...
>> });
>> });
>> }
>>
>> I can process one page without the for loop in about 200ms, but when I try 
>> to loop everything gets messed up. I'm not sure how to proceed with 
>> processing these promises synchronously. I know promises are supposed to be 
>> more efficient, but sometimes order is important and resources are limited 
>> for unchecked parallel processing... Like file type conversion and OCR for 
>> 300-500 page documents.
>>
>> As a nice-to-have, I would also like to figure out how to load and 
>> initialize tesseract.js once and then just call the recognize method. I have 
>> tried the following code to achieve that, but I think it loads and 
>> initializes then reloads and initializes when it calls the recognize method. 
>> Controlling that behavior may not be possible, but I figured I'd throw it 
>> out there.
>>
>> (async () =>
>> {
>> await Tesseract.load();
>> await Tesseract.loadLanguage('eng');
>> await Tesseract.initialize('eng');
>> });
>>
>> //then perform the convertPage then recognize as shown in the first code 
>> block above...
>>
>> Thank you!
>
> --
> Job board: http://jobs.nodejs.org/
> New group rules: 
> https://gist.github.com/othiym23/9886289#file-moderation-policy-md
> Old group rules: 
> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> ---
> You received this message because you are subscribed to the Google Groups 
> "nodejs" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to nodejs+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/nodejs/d973eb4b-e3e8-4394-bf82-f49e73efca70o%40googlegroups.com.
