Date: 2020-09-27

Yeah, yeah, I know all the arguments against just-throw-more-cores-at-the-problem - Amdahl's Law and all that. But...

I have a shiny new Fujitsu SP-1425 scanner, with automatic document feeder. Which means I can, if I so choose, digitise old magazines and what-not pretty quickly. Which is neat. But if I want the benefits of digitisation, I have to OCR the resulting images. Which isn't a problem: Tesseract is free-as-in-speech-and-beer and does a fantastic job. Eventually.

Running across a 300dpi magazine scan on my Ryzen 2700X, Tesseract takes about 132 seconds per page. Which adds up when you're scanning chunky magazines.

The solution: parallelising the problem with GNU Parallel. I've got eight cores, 16 threads - so I can run 16 Tesseract workers at once. The output of GNU Parallel hammers home the difference that makes:

Code:
local:16/1/100%/132.0s

That's after the first page has finished: 16 jobs running, one done, and an average time per job of 132 seconds exactly.

Code:
local:0/36/100%/11.4s

That's after the last page has finished: all 36 jobs done, and an average time per job of 11.4 seconds. That's an 11.6x speed-up (132 divided by 11.4) over running a single Tesseract worker. Okay, it's not a linear sixteenfold increase, but then I haven't got 16 cores - I've got eight cores, each of which runs two threads, so every bit of extra performance above an eightfold boost is effectively a bonus.

And there we go, nearly twelve times faster than if I were using Tesseract without GNU Parallel to speed things up: a full-text-searchable PDF. Noice.
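For anyone who wants to try the same one-worker-per-thread idea without GNU Parallel, here's a rough Python sketch. To be clear, this is not the GNU Parallel invocation used above (the post doesn't show it); it swaps in a thread pool that launches one `tesseract` process per page. The names `tesseract_command`, `ocr_pages` and `speedup` are made up for illustration, and it assumes `tesseract` is on the PATH and that each page is a separate image file. It writes one searchable PDF per page and leaves merging them into a single file as a separate step.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def tesseract_command(image: Path) -> List[str]:
    """Build the command line for one page (hypothetical helper).

    'tesseract <image> <output base> pdf' writes <output base>.pdf,
    a searchable PDF for that single page.
    """
    return ["tesseract", str(image), str(image.with_suffix("")), "pdf"]


def ocr_pages(
    images: Iterable[Path],
    workers: Optional[int] = None,
    runner: Callable = subprocess.run,
) -> None:
    """Run one Tesseract process per page, 'workers' at a time."""
    # Assumption: default to one worker per hardware thread,
    # e.g. 16 on an eight-core, 16-thread CPU.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads are enough here: each one only waits on its own
        # tesseract process. list() drains the results so that any
        # failed page raises instead of passing silently.
        list(pool.map(
            lambda image: runner(tesseract_command(image), check=True),
            images,
        ))


def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Ratio of average time per page, serial versus parallel."""
    return serial_seconds / parallel_seconds
```

Something like `ocr_pages(sorted(Path("scans").glob("*.png")))` would chew through a folder of page images (the `scans` directory is just a placeholder), and `speedup(132.0, 11.4)` reproduces the roughly 11.6x figure quoted above.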