Web scraping with Node.js
Over the last few days I built a web scraper app that fetches and stores data from sites and lets me download that data as CSV. I used Node.js, Express, MongoDB and a few other packages.
Things that surprised me:
- I used monk as the database layer for MongoDB and ran into various problems with queries I took from its website. I had to fiddle around a lot just to get simple read queries working. But maybe I did something wrong.
- MongoDB was not only a bit buggy, it also used several hundred MB of disk space for nearly no data, and I found no way to fix this.
- Callbacks: They force the programmer to use the result of the higher-order function inside the callback. This teaches you something about variable scopes, especially if you’re coming from a getter/setter background. The upside is the asynchronous processing (you can force sequential execution where necessary, e.g. with the async package from npm).
- Related to this: I can’t use the OOP patterns I know from Scala/Java. This may even be a good thing, but the code gets a bit messy.
- Why, when nearly everything I do with request, fs etc. is asynchronous, is Node.js not built to control processes? It hasn’t bothered me so far, but what about bigger computations?
- Debugging and error messages are not that helpful, but this too may be down to my lack of experience.
- I encountered two major bugs in the first five or six hours. This really shouldn’t happen. There is a workaround for both, but it still took me some googling to find them. (When something doesn’t work, I expect it to be my mistake.)
- Had to download an npm package to encode a URL in windows-1250. Seriously?
- I didn’t get the feeling of functional programming that I get in Scala and Python. The freedom to build things your way is not the same in Node.js.
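To illustrate the callback point from the list above, here is a minimal sketch. fetchPage is a made-up stand-in for an asynchronous call such as request or fs.readFile, not a real API, and the URL is just a placeholder. It shows why the result has to be used inside the callback: the code after the call runs before the callback does.

```javascript
// Hypothetical stand-in for an asynchronous call (request, fs.readFile, ...).
// It hands its result to the callback on a later turn of the event loop.
function fetchPage(url, callback) {
  setImmediate(function () {
    callback(null, '<html>' + url + '</html>');
  });
}

let body; // will only be assigned once the callback has run

fetchPage('http://example.com', function (err, html) {
  if (err) throw err;
  body = html; // the result is available here, inside the callback
  console.log('inside the callback:', body.length, 'characters');
});

// This line runs first: the callback has not been called yet.
console.log('right after the call:', body);
```

Anything that depends on the page content therefore has to live inside the callback or be called from it.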
Pros
- I liked Express/Jade, as far as I used them.
- No need to type a semicolon every time. There are some exceptions, though, where leaving it out causes errors or subtle flaws. I know it’s bad style, but I still find it painful to type a semicolon after every statement.
- I like the cheerio package from npm, which provides a jQuery-like API for the scraped HTML. However, you can’t use .prop(); this has already been reported.
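One of the semicolon exceptions mentioned above, as a small sketch (the function names are made up for illustration): a line break directly after return ends the statement, so the object on the next line is never returned. Another well-known case is a line that starts with ( or [, which gets glued to the previous line.

```javascript
// A newline right after `return` triggers automatic semicolon insertion,
// so this function returns undefined. The braces below are parsed as a block.
function brokenResult() {
  return
  { ok: true }
}

// Keeping the opening brace on the same line as `return` avoids the problem.
function fixedResult() {
  return {
    ok: true
  }
}

console.log(brokenResult());
console.log(fixedResult());
```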
What surprised me most: since when does JavaScript have methods like Array.map(), Array.filter(), iterators etc.? They are even trying to introduce arrow functions with “=>” and destructuring assignment (both still experimental). JavaScript is becoming a real language. No more need to alter the DOM for short and readable code.
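A short sketch of how these features read together. The pages array and its fields are invented sample data, and arrow functions and destructuring need a runtime that supports them:

```javascript
// Invented sample data standing in for scraped results.
const pages = [
  { url: 'http://example.com/a', status: 200 },
  { url: 'http://example.com/b', status: 404 },
  { url: 'http://example.com/c', status: 200 },
];

// filter keeps the successful pages, map extracts their URLs.
// ({ status }) and ({ url }) destructure each object in the parameter list.
const okUrls = pages
  .filter(({ status }) => status === 200)
  .map(({ url }) => url);

// Array destructuring: first element and the remaining ones.
const [first, ...rest] = okUrls;

console.log(okUrls);
console.log(first, rest.length);
```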
To summarise: there are two kinds of things, the ones that work and the ones that don’t. The things that work are fast and easy to use; the others need workarounds and a lot of googling… I don’t know. It doesn’t really feel grown-up? I’m missing consistency, and the implementation is still not as intuitive as it should feel (at least to me). But this may get better.
Don’t forget that I wouldn’t call myself an expert in these things.