NodeJs - CALL_AND_RETRY_LAST Allocation failed - process out of memory - mobile agent

Steps:
1. Get everything running well —

  • Windows 8.1 x64 machine with 16GB of spare RAM, recent processors.
  • Installed NodeJs and all prerequisites such that I could get video successfully uploaded to the server with the device on a “fast-wifi” wifi connection.
  • USB3 connection working fine on latest drivers. On one test in particular there is a 30MB image to download.
  • Did not install any “wifi chooser” application. Chose the wifi configuration by hand on the device and verified it stays the same.
  • Running latest wpt code from git repository.
  • Able to get hundreds of results from this configuration.

2. Set Nexus 9 to connect to “slow-wifi” wifi —

  • We have a separate internal wifi connection configured to throttle internet speed on the device simulating a 4G connection, “slow-wifi.”

3. Rerun same tests —

  • By stopwatch, the worst page now takes ~1 minute to finish loading.

Expected:
Nexus 9 shows a longer waterfall chart due to taking longer to load some elements.

Actual:
After test concludes, this error message appears (below). The process is stuck after visiting the page for the first time, even when not capturing video:

wpt\agent\js\src\devtools.js:166 DevTools.onMessage_ : Command response id: 8 wpt\agent\js\src\process_utils.js:481 Function.cb : Callback for Tracing.end FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory wpt\agent\js\src\agent_main.js:149 Agent.<anonymous> : wd_server child process exit code 3221226505 signal null

Based on other NodeJs forums I began to experiment with playing directly with the command for starting node as follows:

C:\xyzzy\wpt\agent\js>node --max-old-space-size=4000000 src\agent_main --serverUrl http://xyzzy-private-instance.zillow.local --location XYZZY --browser browser_android_chrome.BrowserAndroidChrome --deviceSerial HTXYZZY --captureDir "C:\xyzzy\wpt\agent\js\lib\capture" --chromedriver "C:\xyzzy\wpt\agent\js\lib\webdriver\chromedriver\Win32\chromedriver-2.4.exe"

The process starts off well enough, connects, and takes on a test request. It still dies without apparently even trying to use the extra memory on offer.

Actually, when I began to try setting the memory limits by hand, the process dies (gets stuck with a “process out of memory” error) even when handling pages using the “fast-wifi”.

“–max-old-space-size=4000000” - Tried various powers of 2 from 64 up to 4 billion each with the same effect.

Sharing notes in case I can solve this issue and for anyone else down the road.
If you have already encountered and solved this, please share.

Thank you.

Strange that I’m not seeing the error on the public instance that also runs on Windows though memory use has been the bane of the node agent for a while. What version of node are you running? IIRC the max-old-space-size flag was a fairly recent addition.

There are a couple of things that might be able to help the issue:

1 - I recently changed the timeline code to write to disk instead of sending it all over IPC. We should be able to have the agent write the events to disk as they come in instead of all at once which might help reduce the overhead though it still loads the whole file to POST the result so there are a lot of places that would need to be fixed. If we wrote it out gzipped then it might be enough to keep the rest of the pipeline happy.

2 - I’m working on using netlog events for generating the waterfalls. If that works out then we’ll be able to turn off most of the tracing events and just get the network events (when timeline and trace are not enabled for a test). That will result in MUCH smaller traces. Probably a few weeks before the switch is done completely though.

Node v0.12.7 on Windows x64.

Is there a flag for this, or is this a pull request in the making?

Thank you, Patrick

Quick update: I was able to workaround or “unstuck” myself from this locally by running as follows, only when using an Administrator command prompt:

node src\agent_main --serverUrl http://xyzzy-private-instance.zillow.local --location XYZZY_LOCATION --browser browser_android_chrome.BrowserAndroidChrome --deviceSerial HTXYZZY --captureDir "C:\xyzzy\wpt\agent\js\lib\capture" --chromedriver "C:\xyzzy\wpt\agent\js\lib\webdriver\chromedriver\Win32\chromedriver-2.4.exe" --max-old-space-size 8192

Also, I can only get accurate results when running “First View and Repeat View” on the longest-running page.
When running “First View Only” the test works perfectly (can view results and explore them), then upon finishing the test, the NodeJs agent crashes or hangs with a few final words (below; full version attached as first-view-only-…txt). SIGINT to the process causes the test to mark itself as failed as the process exits.

D Sep_08_14:05:16.817 C:\xyzzy\wpt\agent\js\src\multipart.js:92 Multipart.exports.Multipart.addPart : addPart: name=error body=Page load timeout headers=undefined
[...]
D Sep_08_14:05:16.822 C:\xyzzy\wpt\agent\js\src\multipart.js:115 Multipart.exports.Multipart.addFilePart : addFilePart: name=file filename=1_results.zip size=31707
I Sep_08_14:06:10.489 C:\xyzzy\wpt\agent\js\src\agent_main.js:149 Agent.<anonymous> : wd_server child process exit code 0 signal null

[At 14:10 I could explore the results and view the result timeline on /result/150908_AR_J9V. Eventually, I send a SIGINT.]

A Sep_08_14:15:40.286 C:\xyzzy\wpt\agent\js\src\wpt_client.js:422 Client.onSignal_ : Received SIGINT, will exit after the current run is killed.
[......]
D Sep_08_14:18:11.194 C:\xyzzy\wpt\agent\js\src\multipart.js:92 Multipart.exports.Multipart.addPart : addPart: name=testerror body=Page load timeout headers=undefined
A Sep_08_14:18:12.673 C:\xyzzy\wpt\agent\js\src\wpt_client.js:417 Client.<anonymous> : Exiting due to SIGINT.

I have only seen this happen so far when doing “First View Only” against the NodeJs agent.

Out of curiosity, any reason you’re running the node command directly instead of the batch file? It shouldn’t be an issue but I’m not 100% sure about all of the flags that get passed through since the batch file does a bit of path discovery (like for the webdriver location).

In the debug output before you send the SIGINT, was there a call stack anywhere? One of my biggest pain points with the node agent is random exceptions throwing the state machine out of whack. I’m not sure why kids these days love functional programming so much, at least for a procedural app like the agent it is a nightmare.

[quote=“pmeenan, post:5, topic:9547”]
Out of curiosity, any reason you’re running the node command directly instead of the batch file?[/quote]
I reasoned to give different inputs to node, I should try calling node directly.
It does appear to be working. I exported/imported the environment variables between command prompts and checked that the variables made sense, as well.

[quote=“pmeenan, post:5, topic:9547”]
In the debug output before you send the SIGINT, was there a call stack anywhere? One of my biggest pain points with the node agent is random exceptions throwing the state machine out of whack.[/quote]
Call stack anywhere?
No. None.
In my full command-line output there was never a call stack shown.

From when I was debugging:
If I run node in “–debug” mode, the application does hit a call stack and exit early (attached). Perhaps that is interesting.

I never attempted to configure ipfw traffic shaping nor any of the TODO items from https://sites.google.com/a/webpagetest.org/docs/private-instances/node-js-agent/setup. Traffic shaping is done by connecting to a unique wifi solution already provided.

As I understand, the V8 command line arguments. must occur before the script, otherwise, they are treated as command line arguments to the script. So in this case, the --max-old-space-size has no effect.

BTW: I ran into a similar OOM issue that I addressed by removing the command line argument V8 command line argument --max-old-space-size.