Pass Large Array To Node Child Process

November 30, 2023

I have complex CPU-intensive work I want to do on a large array. Ideally, I'd like to pass this to the child process.

var spawn = require('child_process').spawn;
// dataAsNumbers

Solution 1:

With such a massive amount of data, I would look into using shared memory rather than copying the data into the child process (which is what is happening when you use a pipe or pass messages). This will save memory, take less CPU time for the parent process, and be unlikely to bump into some limit.

shm-typed-array is a very simple module that seems suited to your application. Example:

parent.js

"use strict";

const shm = require('shm-typed-array');
const fork = require('child_process').fork;

// Create shared memory
const SIZE = 20000000;
const data = shm.create(SIZE, 'Float64Array');

// Fill with dummy data
Array.prototype.fill.call(data, 1);

// Spawn child, set up communication, and give shared memory
const child = fork("child.js");
child.on('message', sum => {
    console.log(`Got answer: ${sum}`);

    // Demo only; ideally you'd re-use the same child
    child.kill();
});
child.send(data.key);

child.js

"use strict";

const shm = require('shm-typed-array');

process.on('message', key => {
    // Get access to shared memory
    const data = shm.get(key, 'Float64Array');

    // Perform processing
    const sum = Array.prototype.reduce.call(data, (a, b) => a + b, 0);

    // Return processed data
    process.send(sum);
});

Note that we are only sending a small "key" from the parent to the child process through IPC, not the whole data. Thus, we save a ton of memory and time.

Of course, you can change 'Float64Array' (i.e. 64-bit doubles) to whatever typed array your application requires. Note that this library only handles one-dimensional typed arrays, but that should be only a minor obstacle.

Solution 2:

I too was able to reproduce the delay you were experiencing, though perhaps not as badly as you. I used the following:

// main.js
const fork = require('child_process').fork
const child = fork('./getStats.js')

const dataAsNumbers = Array(100000).fill(0).map(() => Array(100).fill(0).map(() => Math.round(Math.random() * 100)))

child.send({
  dataAsNumbers: dataAsNumbers,
})

And

// getStats.js
process.on('message', function (data) {
  console.log('data is ', data)
  process.exit(0)
})

node main.js 2.72s user 0.45s system 103% cpu 3.045 total

I'm generating 100k elements, each composed of 100 numbers, to mock your data. Make sure you are using the message event on process. Your child processes may be more complex than mine, which might be the reason for the failure; it also depends on the timeout you set on your query.

If you want to get better results, what you could do is chunk your data into multiple pieces that will be sent to the child process and reconstructed to form the initial array.
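A minimal sketch of that chunking idea follows. The helper names (chunkArray, sendInChunks, createReceiver) and the message shape ({ type, rows }) are assumptions made for illustration, not part of the original code, and the sketch assumes messages arrive in the order they were sent.

```javascript
// Split an array into slices of at most chunkSize elements.
function chunkArray(items, chunkSize) {
  const chunks = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}

// Parent side: send each chunk as its own message, then a final "done" marker.
// `child` is anything with a send() method, e.g. the object returned by fork().
function sendInChunks(child, items, chunkSize) {
  for (const chunk of chunkArray(items, chunkSize)) {
    child.send({ type: 'chunk', rows: chunk });
  }
  child.send({ type: 'done' });
}

// Child side: collect chunks until "done", then pass the rebuilt array on.
function createReceiver(onComplete) {
  const rows = [];
  return message => {
    if (message.type === 'chunk') {
      for (const row of message.rows) rows.push(row);
    } else if (message.type === 'done') {
      onComplete(rows);
    }
  };
}
```

In main.js this could be called as sendInChunks(child, dataAsNumbers, 1000), and in getStats.js the handler could be registered with process.on('message', createReceiver(data => { /* process data */ })). The chunk size of 1000 is an arbitrary starting point that would need tuning.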

Another possibility is to use a third-party library or protocol, even if it's a bit more work. You could have a look at messenger.js, or even something like an AMQP queue, which would let the two processes communicate through a pool with a guarantee that each message is acknowledged by the subprocess. There are a few Node implementations of it, like amqp.node, but it would still require a bit of setup and configuration work.

Solution 3:

Why do you want to make a subprocess? Sending the data to a subprocess is likely to cost more CPU and wall-clock time than you would save by moving the processing out of the main process.

Instead, I would suggest that, for maximum efficiency, you consider doing your statistics calculations in a worker thread that shares memory with the Node.js main process.

You can use NAN (Native Abstractions for Node.js) to write C++ code that you post to a worker thread, and then have that worker thread post the result and an event back to the Node.js event loop when done.

The benefit is that you don't spend extra time sending the data to a different process. The downside is that you have to write a bit of C++ code for the threaded work, but NAN should take care of most of the difficult parts for you.
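For comparison, the same "worker thread in the same memory" idea can be sketched without C++ by swapping in Node's built-in worker_threads module and a SharedArrayBuffer. This is not the NAN approach described above, just a rough illustration of the same principle; the helper name sumInWorker and the inline worker source are assumptions made for this example.

```javascript
const { Worker } = require('worker_threads');

// Worker body kept as a string so the sketch fits in one file (eval: true).
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  // View the shared buffer directly; the numbers are not copied.
  const data = new Float64Array(workerData);
  let sum = 0;
  for (let i = 0; i < data.length; i++) sum += data[i];
  parentPort.postMessage(sum);
`;

// Hypothetical helper: sums the doubles stored in a SharedArrayBuffer
// on a worker thread and resolves with the result.
function sumInWorker(sharedBuffer) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: sharedBuffer,
    });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// Demo: 1000 doubles, all set to 1, summed off the main thread.
const shared = new SharedArrayBuffer(1000 * Float64Array.BYTES_PER_ELEMENT);
new Float64Array(shared).fill(1);
sumInWorker(shared).then(sum => console.log(`Got answer: ${sum}`));
```

Only the small result travels back as a message; the array itself stays in memory that both threads can see.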

Solution 4:

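Use an in-memory cache like https://github.com/ptarjan/node-cache, and let the parent process store the array contents under some key; the child process would then retrieve the contents through that key. Note that an in-process cache is private to the process that created it, so for this to work the cache has to live somewhere both processes can reach.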

Solution 5:

You could consider using OS pipes (you'll find a gist here) as an input to your Node child application.
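As a rough illustration of feeding a child through a pipe, the sketch below spawns a second Node process and writes the data to its stdin. It is not the gist referred to above; the helper name sumViaPipe, the inline child program, and the choice of JSON as the wire format are all assumptions made for this example.

```javascript
const { spawn } = require('child_process');

// Child program passed inline with -e so the sketch fits in one file.
// It reads all of stdin, parses it as JSON, and prints the sum.
const childSource = `
  let input = '';
  process.stdin.setEncoding('utf8');
  process.stdin.on('data', chunk => { input += chunk; });
  process.stdin.on('end', () => {
    const numbers = JSON.parse(input);
    const sum = numbers.reduce((a, b) => a + b, 0);
    process.stdout.write(String(sum));
  });
`;

// Hypothetical helper: pipes an array of numbers into the child's stdin
// and resolves with whatever number the child writes to stdout.
function sumViaPipe(numbers) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', childSource], {
      stdio: ['pipe', 'pipe', 'inherit'],
    });
    let output = '';
    child.stdout.setEncoding('utf8');
    child.stdout.on('data', chunk => { output += chunk; });
    child.on('error', reject);
    child.on('close', () => resolve(Number(output)));
    // Write the payload and close stdin so the child sees end-of-input.
    child.stdin.end(JSON.stringify(numbers));
  });
}

sumViaPipe([1, 2, 3, 4]).then(sum => console.log(`Got answer: ${sum}`));
```

The data is still copied through the pipe, so this mainly helps when the child can start working on the input as it streams in rather than waiting for one large message.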

I know this is not exactly what you're asking for, but you could use the cluster module (included in Node). This way you can run as many instances as your machine has cores to speed up processing. Also, consider using streams if you don't need to have all the data available before you start processing. If the data to be processed is too large, I would store it in a file so you can reinitialize if there is an error during processing. Here is an example of clustering.

var cluster = require('cluster');
var numCPUs = 4;

if (cluster.isMaster) {
    for (var i = 0; i < numCPUs; i++) {
        var worker = cluster.fork();
        console.log('id', worker.id)
    }
} else {
    doSomeWork()
}

function doSomeWork() {
    for (var i=1; i<10; i++){
        console.log(i)
    }
}

For more information on sending messages across workers, see question 8534462.
