Toby Segaran
Programming Collective Intelligence: Building Smart Web 2.0 Applications
August 2007, 360 pages
O’Reilly.
ISBN: 978-0-596-52932-1
This book is a collection of methods for extracting additional information from publicly available data. The author argues that Web 2.0 should not be seen only as a vast container of user-generated content. The term should be understood more broadly: content published by independent site owners is also a source of hidden information, which can be extracted by analyzing such pages in bulk. This second, less well-known aspect is exploited by Google, for example, when it ranks pages using the external links that point to each page.
The first eleven chapters lead the reader from basic algorithms to more difficult ones, and the twelfth and last chapter summarizes all the algorithms covered in the book. The appendices cover the maths used in the algorithms and contain links to useful external libraries.
The description flows along with the programming code; the author uses Python. I find this combination neither very productive nor convenient for the reader. The choice of language is not the problem: whichever programming language readers prefer, they will understand Python syntax and be able to reproduce the logic in their own preferred language. The point is that when you need to learn (or refresh) the idea of an algorithm, you cannot do so without skipping over descriptions that explain the code rather than the algorithm. The structure of an algorithm is expressed through the code, which adds many unnecessary details, such as routines for reading files and converting their contents into an in-memory matrix. The intent to accommodate readers with any level of Python knowledge and any level of general programming skill leads the author to avoid SQL servers and use SQLite instead. That makes the dependence on code even deeper and results in less readable code, since it has to include more instructions which, again, have no direct relation to the algorithm.
The presence of Chapter 12, Algorithm Summary, smooths the sharp corners of the preceding code-heavy chapters. Personally, I would prefer more descriptive algorithms in the main chapters and the code in an appendix.
Anyway, the book gives a good introduction to modern data mining techniques as they apply to today's internet. It is useful both as a guide for those who would like to learn how to deal with Web 2.0 data and as further reading for those who wish to broaden their understanding of the topic.
I would recommend it to anyone starting a web service or site that involves processing large amounts of data (both textual and numeric), and to those interested in modern methods of processing such data. It also helps to understand how popular online services (ranking, filtering, recommending, etc.) work.
4 stars. ★ ★ ★ ★