I'm relatively new to JavaScript programming for Node.js. I've been reading about this for 5 hours and cannot wrap my head around it, so here I am...

I am trying to get the text from a PDF for searching. I need page numbers, line numbers, and character positions of the results. It appears that pdf.js cannot preserve line breaks, at the very least. I wonder whether it would keep multiple sequential spaces, but the missing line breaks are a dealbreaker, so I've moved on.

Now I'm using pdf-image to convert the PDF document to a PNG for each page. Then I want to use tesseract.js to run OCR on the PNG files to get the text as it appears in the PDF, including line breaks and extra spaces. The problem is that if the PDF document is more than 5-10 pages, execution kills my laptop. Converting the PDF to PNGs consumes over 12GB of RAM and never finishes, so it never even moves on to the OCR, which has to be worse. The average number of pages in the PDF documents I am processing is 300-500, so I have to batch process.

The problem I have is that pdf-image and tesseract.js both use promises for async processing. It's really the async that's killing my laptop. I just want to get the number of pages, loop over each page one at a time, convert it to PNG, then perform the OCR, then finish some other synchronous processing before moving on to the next page. The code I have right now that doesn't work is:

    import Tesseract from 'tesseract.js';
    import pdfimage from 'pdf-image';

    var PDFImage = pdfimage.PDFImage;
    var pdfImage = new PDFImage(pdfFilePath, {
      convertOptions: { "-density": "196" }
    });

    for (var pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
      pdfImage.convertPage(pageIndex).then(function (pageImage) {
        Tesseract.recognize(pageImage, 'eng').then(({ data: { text } }) => {
          console.log(text);
          // perform other synchronous processing with the text before moving to the next page...
        });
      });
    }

I can process one page without the for loop in about 200ms, but when I try to loop, everything gets messed up. I'm not sure how to proceed with processing these promises synchronously. I know promises are supposed to be more efficient, but sometimes order is important and resources are too limited for unchecked parallel processing, as with file type conversion and OCR for 300-500 page documents.

As a nice-to-have, I would also like to figure out how to load and initialize tesseract.js once and then just call the recognize method. I have tried the following code to achieve that, but I think it loads and initializes, then reloads and initializes again when it calls the recognize method. Controlling that behavior may not be possible, but I figured I'd throw it out there.

    (async () => {
      await Tesseract.load();
      await Tesseract.loadLanguage('eng');
      await Tesseract.initialize('eng');
    });
    // then perform the convertPage then recognize as shown in the first code block above...

Thank you!
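
For reference, here is a minimal sketch of the one-page-at-a-time flow described above, written with async/await so that each page finishes before the next one starts. The helper name processPagesSequentially and its three callback parameters (convertPage, recognize, handleText) are hypothetical names introduced only for illustration; they are not part of pdf-image or tesseract.js. The callbacks are passed in so the loop itself does not depend on either library.

```javascript
// Hypothetical helper (not part of pdf-image or tesseract.js).
// Runs the per-page steps strictly in order, one page at a time.
//   convertPage(pageIndex) -> Promise resolving to a page image (e.g. a PNG path)
//   recognize(pageImage)   -> Promise resolving to the recognized text
//   handleText(text, i)    -> synchronous processing for that page
async function processPagesSequentially(numberOfPages, convertPage, recognize, handleText) {
  const results = [];
  for (let pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
    // Each await suspends the loop, so page N+1 is not started
    // until page N has been converted and recognized.
    const pageImage = await convertPage(pageIndex);
    const text = await recognize(pageImage);
    results.push(handleText(text, pageIndex));
  }
  return results;
}

// Possible wiring with the objects from the post (assumes pdfImage,
// Tesseract and numberOfPages are set up as in the original code):
//
//   processPagesSequentially(
//     numberOfPages,
//     (i) => pdfImage.convertPage(i),
//     (img) => Tesseract.recognize(img, 'eng').then(({ data: { text } }) => text),
//     (text, i) => console.log(i, text)
//   );
```

The original for loop starts every convertPage call immediately because .then() does not block the loop; awaiting inside the loop is what keeps only one page in flight. On the nice-to-have: tesseract.js also documents a createWorker API intended for reusing one worker across many recognize calls, but the exact setup calls differ between versions, so the documentation for the installed version is the place to check.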