CPU Why I Love Many Core Computing

Discussion in 'Hardware' started by Gareth Halfacree, 27 Sep 2020 at 14:23.

    Yeah, yeah, I know all the arguments against just-throw-more-cores-at-the-problem - Amdahl's Law and all that. But...

    I have a shiny new Fujitsu SP-1425 scanner, with automatic document feeder. Which means I can, if I so choose, digitise old magazines and what-not pretty quickly. Which is neat.

    But if I want the benefits of digitisation, I have to OCR the resulting images. Which isn't a problem: Tesseract is free-as-in-speech-and-beer and does a fantastic job. Eventually. Running across a 300dpi magazine scan on my Ryzen 2700X, Tesseract takes about 132 seconds per page. Which adds up when you're scanning chunky magazines.

    The solution: parallelising the problem with GNU Parallel. I've got eight cores, 16 threads - so I can run 16 Tesseract workers at once. The output of GNU Parallel hammers home the difference that makes:

    Code:
    local:16/1/100%/132.0s
    That's after the first page has finished: an average job time of 132 seconds exactly.

    Code:
    local:0/36/100%/11.4s
    That's after the last page has finished: an average job time of 11.4 seconds.
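
    For anyone wanting to try the same thing, here's a rough sketch of the sort of invocation that gives progress lines like the ones above. It's an assumption rather than the exact command used: the page-*.png file names and the ocr_pages helper are hypothetical, and it presumes both tesseract and GNU Parallel are on the PATH. The --progress flag is what prints the running/completed/average-time figures, and --jobs 16 matches the 16 hardware threads.

```shell
# Sketch only: OCR each page image to its own searchable PDF, 16 jobs at a time.
# ocr_pages is a hypothetical helper name, not part of either tool.
ocr_pages() {
    # {}  = the input file; {.} = the same name with its extension removed,
    # so page-001.png becomes page-001.pdf (Tesseract appends the .pdf itself).
    parallel --progress --jobs 16 tesseract {} {.} pdf ::: "$@"
}

# Usage, assuming the scans are named page-001.png, page-002.png, ...:
#   ocr_pages page-*.png
```

    This leaves one PDF per page; stitching them into a single file is a separate step not shown here.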

    That's an 11.6x speed-up over running a single Tesseract worker. Okay, it's not a linear sixteenfold increase, but then I haven't got 16 cores - I've got eight cores each of which runs two threads, so every bit of extra performance above an eightfold boost is effectively a bonus.
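
    The sums behind that, as a quick sketch using only the two figures quoted above (132 and 11.4 seconds per page). The variable names are just for illustration.

```shell
# Figures quoted in the post: average seconds per page, before and after.
single=132.0
parallel_avg=11.4
threads=16
cores=8

# Overall speed-up, to one decimal place.
speedup=$(awk -v a="$single" -v b="$parallel_avg" 'BEGIN { printf "%.1f", a / b }')
# Speed-up as a percentage of a perfect sixteenfold increase.
per_thread=$(awk -v a="$single" -v b="$parallel_avg" -v n="$threads" 'BEGIN { printf "%.0f", 100 * a / b / n }')
# Speed-up relative to a perfect eightfold (one job per physical core) increase.
per_core=$(awk -v a="$single" -v b="$parallel_avg" -v n="$cores" 'BEGIN { printf "%.2f", a / b / n }')

echo "speed-up: ${speedup}x"                       # speed-up: 11.6x
echo "share of ideal 16x: ${per_thread}%"          # share of ideal 16x: 72%
echo "relative to ideal 8x: ${per_core}x"          # relative to ideal 8x: 1.45x
```

    So roughly 72% of the ideal sixteenfold figure, or about 1.45 times what a perfect eightfold boost would give.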

    And there we go, nearly twelve times faster than if I were using Tesseract without GNU Parallel to speed things up:

    [Image: upload_2020-9-27_14-23-32.png]

    A full-text-searchable PDF. Noice.
     
    We do the same thing at work with some of our image analytics on our massive datasets: just throw as many workers at it as possible! Had a 128-thread server crunching through it all. It was a noisy beast, though!
     
    I can imagine!

    The really nice thing about GNU Parallel, compared to something like xargs, is that you can set up remote workers - and all you need is SSH and a copy of GNU Parallel (and, y'know, the tool you're actually wanting to use) on each. Then when you execute a job, it runs it locally on all logical cores *and* across all logical cores on all accessible remote systems - copying the file it's working on to the remote system, working on it, and copying the result back to your local system. All pretty much invisibly.
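
    As a rough sketch of what that looks like on the command line, assuming SSH access without a password prompt and tesseract plus GNU Parallel installed on every machine. The host names pi4.local and server.local and the ocr_pages_remote helper are made up for illustration.

```shell
# Sketch only: spread the OCR jobs over the local machine plus remote hosts via SSH.
# ocr_pages_remote is a hypothetical helper; the host names are placeholders.
ocr_pages_remote() {
    # --sshlogin takes a comma-separated host list; ":" means "this machine too".
    # --trc FILE is shorthand for --transfer --return FILE --cleanup:
    # copy the input over, bring FILE back, then delete the remote copies.
    parallel --progress --sshlogin :,pi4.local,server.local \
        --trc {.}.pdf tesseract {} {.} pdf ::: "$@"
}

# Usage:
#   ocr_pages_remote page-*.png
```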

    I don't use it much 'cos my desktop's the one with all the cores - my server's a dual-core and most of the rest of the hardware which gets left running all the time is some flavour of Raspberry Pi or other - but it's nice to have the option. Also means you can shove a noisy 128-thread beast somewhere else and not be bothered by the noise!
     
    Oh really? I did not know that! I'll have to keep that in mind for the future! Would be useful to be able to do it distributed across all of our workstations if needed!
     
    Yup - pretty easy to use, too. You can either specify remote machines as you're constructing the command line or create a config file with 'em already in there.
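
    A minimal sketch of the config-file route, with made-up host names and a demo file written to the current directory so nothing real gets overwritten. The file is one worker per line; GNU Parallel reads it via --sshloginfile. Copying it to ~/.parallel/sshloginfile should let you use the special name ".." instead of a path. The ocr_pages_from_file helper is hypothetical.

```shell
# Sketch only: a demo sshloginfile listing the workers, one per line.
slf=./sshloginfile.example
cat > "$slf" <<'EOF'
# ":" is the local machine
:
# the "4/" prefix tells GNU Parallel this host has four CPU cores
4/pi4.local
user@server.local
EOF

# Hypothetical helper: same OCR job as before, hosts taken from the file.
ocr_pages_from_file() {
    parallel --progress --sshloginfile "$slf" \
        --trc {.}.pdf tesseract {} {.} pdf ::: "$@"
}

# Usage:
#   ocr_pages_from_file page-*.png
```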
     
