At Wuha, we are currently developing an app that indexes the files on a user's computer and analyzes them in our search engine.
It works like the Google Drive or Dropbox sync apps: it watches file additions, deletions and changes in real time and sends them to the Wuha search engine.
In the first version, we simply instantiated a ReadStream for the file. In Node.js, a Stream is an object that data can be read from and/or written to. A ReadStream lets us read data from the hard drive and send it over HTTP to our server.
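To make this first version concrete, here is a minimal sketch of an unthrottled upload using only Node.js core modules. The helper name uploadFile, the PUT method and the callback shape are assumptions made for illustration; our actual upload code is not shown in this post.

```javascript
const fs = require('node:fs');
const http = require('node:http');

// Hypothetical helper: streams the file at `filePath` to `url` over HTTP.
// Nothing limits the rate, so data flows as fast as disk and network allow.
function uploadFile(filePath, url, callback) {
  const request = http.request(url, { method: 'PUT' }, (response) => {
    response.resume(); // drain the response body so the socket is released
    response.on('end', () => callback(null, response.statusCode));
  });
  request.on('error', callback);

  const file = fs.createReadStream(filePath);
  // A read failure destroys the request, which reports it through 'error' above.
  file.on('error', (err) => request.destroy(err));
  file.pipe(request); // pipe() ends the request once the file is fully read
}
```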
The issue with this approach is that the read stream is very fast, close to the read speed of the drive, so the upload saturates the user's internet bandwidth.
This is where Transform streams come in. To continue with the stream analogy, a Transform stream is like a hydroelectric dam on a river: a stream of water flows in and we transform it into power. Here, data flows in, we apply a transformation, and we output the result to another stream.
In our case, the dam simply regulates the flow speed. We used the throttle library, an implementation of a Transform stream that counts the incoming data and pauses regularly with setTimeout to achieve the required speed.
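The idea can be sketched with a hand-written Transform. This is not the throttle library's actual code, only an illustration of the same principle; the names ThrottleStream and delayFor are hypothetical.

```javascript
const { Transform } = require('node:stream');

// Milliseconds to wait so that `totalBytes` sent since the start do not
// exceed `bytesPerSecond`, given that `elapsedMs` have already passed.
function delayFor(totalBytes, elapsedMs, bytesPerSecond) {
  const expectedMs = (totalBytes / bytesPerSecond) * 1000;
  return Math.max(0, Math.ceil(expectedMs - elapsedMs));
}

// Hypothetical Transform: passes data through unchanged, but delays each
// chunk with setTimeout so the average rate stays under `bytesPerSecond`.
class ThrottleStream extends Transform {
  constructor(bytesPerSecond) {
    super();
    this.bytesPerSecond = bytesPerSecond;
    this.totalBytes = 0;
    this.startTime = null;
  }

  _transform(chunk, encoding, callback) {
    if (this.startTime === null) this.startTime = Date.now();
    this.totalBytes += chunk.length;
    const elapsedMs = Date.now() - this.startTime;
    const delay = delayFor(this.totalBytes, elapsedMs, this.bytesPerSecond);
    // Calling back late applies backpressure: no new chunk is processed
    // until the timer fires, so the source ends up being paused.
    setTimeout(() => callback(null, chunk), delay);
  }
}

// Usage, assuming `request` is a writable HTTP request:
// fs.createReadStream(path).pipe(new ThrottleStream(100 * 1024)).pipe(request);
```

Because a file ReadStream emits fairly large chunks by default, this sketch regulates the average speed rather than producing a perfectly smooth flow.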
Here is a Gist to illustrate that system:
Note that while form-data can usually guess the metadata (name, size…) of a file ReadStream, this is not the case for a generic stream. You have to provide that metadata manually for the upload to work correctly.
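As a rough sketch of what providing the metadata manually can look like, the snippet below reads the name and size from the file system and passes them through the filename and knownLength options of form.append. The helper names fileOptions and buildForm and the field name 'file' are assumptions for illustration, and form-data is a third-party package that must be installed separately.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Metadata that form-data cannot infer once the file stream is wrapped
// in another stream: the file name and the total size in bytes.
function fileOptions(filePath) {
  return {
    filename: path.basename(filePath),
    knownLength: fs.statSync(filePath).size,
  };
}

// Hypothetical helper: `throttle` is any Transform stream, for example an
// instance created with the throttle library.
function buildForm(filePath, throttle) {
  const FormData = require('form-data'); // third-party, loaded lazily
  const form = new FormData();
  const throttled = fs.createReadStream(filePath).pipe(throttle);
  // 'file' is an assumed field name; use the one your server expects.
  form.append('file', throttled, fileOptions(filePath));
  return form;
}
```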
Thanks to some simple stream manipulation, we managed to control the file upload speed and index our users' documents without hogging their connection.