Yeah, yeah, I know all the arguments against just-throw-more-cores-at-the-problem - Amdahl's Law and all that. But... I have a shiny new Fujitsu SP-1425 scanner, with automatic document feeder. Which means I can, if I so choose, digitise old magazines and what-not pretty quickly. Which is neat.

But if I want the benefits of digitisation, I have to OCR the resulting images. Which isn't a problem - Tesseract is free-as-in-speech-and-beer and does a fantastic job. Eventually. Running across a 300dpi magazine scan on my Ryzen 2700X, Tesseract takes about 132 seconds per page. Which adds up when you're scanning chunky magazines.

The solution: parallelising the problem with GNU Parallel. I've got eight cores, 16 threads - so I can run 16 Tesseract workers at once. The output of GNU Parallel hammers home the difference that makes:

Code:
local:16/1/100%/132.0s

That's after the first page has finished: average job speed of 132 seconds exactly.

Code:
local:0/36/100%/11.4s

That's after the last page has finished: average job speed of 11.4 seconds. That's an 11.6x speed-up over running a single Tesseract worker. Okay, it's not a linear sixteenfold increase, but then I haven't got 16 cores - I've got eight cores, each of which runs two threads, so every bit of extra performance above an eightfold boost is effectively a bonus.

And there we go, nearly twelve times faster than if I were using Tesseract without GNU Parallel to speed things up: a full-text-searchable PDF. Noice.
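For reference, the invocation is pretty much a one-liner - this is a sketch, assuming the scans are JPEGs in the current directory (the filenames here are hypothetical):

```shell
# One Tesseract worker per logical CPU (GNU Parallel's default),
# turning every scanned page into a matching searchable PDF.
# {} is the input file; {.} is the same name with the extension stripped.
parallel tesseract {} {.} pdf ::: page-*.jpg
```

The trailing "pdf" asks Tesseract for searchable-PDF output rather than its default plain text.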
We do the same thing at work with some of our image analytics on our massive datasets, just throw as many workers at it as possible! Had a 128-thread server crunching through it all - was a noisy beast, though!
I can imagine! The really nice thing about GNU Parallel, compared to something like xargs, is that you can set up remote workers - and all you need is SSH and a copy of GNU Parallel (and, y'know, the tool you're actually wanting to use) on each. Then when you execute a job, it runs it locally on all logical cores *and* across all logical cores on all accessible remote systems - copying the file it's working on to the remote system, working on it, and copying the result back to your local system. All pretty much invisibly.

I don't use it much 'cos my desktop's the one with all the cores - my server's a dual-core and most of the rest of the hardware which gets left running all the time are some flavour of Raspberry Pi or other - but it's nice to have the option. Also means you can shove a noisy 128-core beast somewhere else and not be bothered by the noise!
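A sketch of that transfer-and-return behaviour, assuming a remote box reachable over SSH as user@bigbox (hostname hypothetical):

```shell
# -S lists the workers; ':' is GNU Parallel's shorthand for "this machine too".
# --trc FILE is shorthand for --transfer --return FILE --cleanup: copy each
# input to the worker, fetch the named result back, then tidy up remotely.
parallel -S :,user@bigbox --trc {.}.pdf tesseract {} {.} pdf ::: page-*.jpg
```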
Oh really? I did not know that! I'll have to keep that in mind for the future! Would be useful to be able to do it distributed across all of our workstations if needed!
Yup - pretty easy to use, too. You can either specify remote machines as you're constructing the command line or create a config file with 'em already in there.
Well, turns out that "132 seconds per page" may not have been quiiiiite accurate. I've never used Tesseract before, so I had no idea what to expect from its performance - and, as I usually do for batch jobs, I threw it at GNU Parallel right away. By default, GNU Parallel creates as many workers as "logical CPUs" (i.e. threads). So, parallel tesseract spawns 16 jobs. In total, it took about 6m50s to complete the recognition job.

That sounded like a long time, so I ran a for i in *jpg; do tesseract; done instead. Which finished in 2m10s. A third the time of the parallel version. Yeah. Not really proving my point so much, there.

Figured maybe there's something about Tesseract that doesn't like running on "logical" CPUs, so I tried parallel -j8 tesseract to limit it to eight workers. The result: 0m34s. We're still looking at just shy of a fourfold increase, but nowhere near the twelvefold I thought I was enjoying.

Oh, well, ne'er mind. It'll be interesting to see if things are any different on a longer PDF...
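Spelled out, the three runs were along these lines - a sketch, since the actual filenames and arguments will have differed:

```shell
# 1. GNU Parallel's default: one worker per logical CPU (16 here) - ~6m50s
parallel tesseract {} {.} ::: *.jpg

# 2. Plain serial loop, one page at a time - ~2m10s
for i in *.jpg; do tesseract "$i" "${i%.jpg}"; done

# 3. Capped at one worker per physical core - ~0m34s
parallel -j8 tesseract {} {.} ::: *.jpg
```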
Okay, this was irritating me so I did some investigation - because while I could see SMT having little to no impact on performance, it shouldn't hurt it that badly. To recap:

That's not what you'd expect to see. That's *definitely* not what you'd expect to see.

Turns out Tesseract isn't, as I falsely assumed, single-threaded. It has its own multithreading. Which is crap. Seriously, a four percent performance boost on an 8C16T CPU? Pfft. So, what happens if you disable it?

That. That happens. 16 workers with threading disabled is demonstrably the fastest mode, as I had expected would be the case. It's 653 percent faster than the default mode with Tesseract's internal multithreading active, and 28 percent faster than running a single-threaded worker per physical CPU core. Itch scratched!

EDIT: Decided to run one more test, to get a true view of the speed-up: the Apricot GW-BASIC Manual, 297 pages including covers. It took 319s for Tesseract alone, using its in-built multithreading. It took 37.5s for 16 Tesseract workers via GNU Parallel, with multithreading disabled. That's an 8.5x speedup. Can't complain at that!
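The post doesn't name the switch, but Tesseract's internal multithreading comes from OpenMP, so the usual way to disable it is the OMP_THREAD_LIMIT environment variable - I'm assuming that's the mechanism used here:

```shell
# Cap Tesseract's internal OpenMP threading at one thread per process
# (the variable is inherited by every job GNU Parallel spawns), and let
# GNU Parallel provide the parallelism: 16 single-threaded workers.
OMP_THREAD_LIMIT=1 parallel -j16 tesseract {} {.} pdf ::: *.jpg
```

The same effect can be had permanently by building Tesseract without OpenMP support.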