Monday, September 12, 2011

Applying a dynamic threshold to improve cluster detection of LSI






Program source code essentially fulfills two functions: telling the computer what it should do and telling human readers what the program is supposed to do. Unfortunately, human readers are not as good as compilers at understanding what a programmer wishes to convey. Latent semantic indexing (LSI) is commonly employed to cluster source code identifiers based on their frequency across a set of source files. One compelling application of LSI is to give a meaning to the terms used in source code.

Current LSI-based approaches require an arbitrary fixed threshold to decide whether two identifiers are semantically related. One limitation of a fixed threshold is that it cannot identify clusters when they are asymmetrically balanced. According to the authors, the dynamic hybrid cut "improves the effectiveness of LSI for detecting concerns in source code".
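
To make the role of the threshold concrete, here is a minimal sketch of the fixed-threshold step only. It does not reproduce the dynamic hybrid cut from the article. The term vectors are hypothetical values standing in for identifiers already projected into an LSI space, and the names `cosine`, `related_pairs` and `VECTORS` are mine, not the authors'. I also assume that "asymmetric" means one group of identifiers is much tighter than another.

```python
import math

# Hypothetical term vectors, standing in for identifiers already projected
# into a 2-dimensional LSI space (illustrative values, not from the article).
VECTORS = {
    "open": (1.0, 0.0),    # tight group: open / close
    "close": (0.99, 0.05),
    "parse": (0.0, 1.0),   # loose group: parse / token
    "token": (0.6, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def related_pairs(vectors, threshold):
    """Return the identifier pairs whose similarity reaches a fixed threshold."""
    names = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine(vectors[a], vectors[b]) >= threshold
    ]


if __name__ == "__main__":
    for threshold in (0.9, 0.7, 0.5):
        print(threshold, related_pairs(VECTORS, threshold))
```

With these made-up vectors, a strict threshold (0.9) keeps only the tight pair and misses the loose one, while a lax threshold (0.5) also links identifiers across the two groups. Only an intermediate value happens to separate them, which is why the choice of a single fixed value is so delicate.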

The approach is reasonable and intuitive. I particularly like the fact that it has been tried out on two industrial case studies. The article is easy to read, although it requires some knowledge of LSI techniques.

Thursday, February 17, 2011

Continuous delivery




"Continuous delivery" is about releasing reliable software through continuous and automated build, test and deployment. The principles of software delivery given in the book were drawn from the authors' repeated exposure to painful and costly maintenance issues. As the authors state, the book is primarily recommended for developers, systems administrators, test and software managers.

This book is not recommended for readers hungry for academic and theoretical problems and solutions. It adopts a practical stance toward the problems regularly encountered by practitioners in the field.

The book is organized into 3 parts (foundations, the deployment pipeline and the delivery ecosystem) and totals 15 chapters. The first part gives the bit of theory that introduces the causes of reliability failures and motivates the need for strong discipline among practitioners. The second part is about how to automate the process that takes software from version control to deployment. The final part addresses the technical obstacles encountered in practice, including managing infrastructure and environments, handling dependencies between software components and getting a good command of version control systems.
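
As a rough illustration of what such an automated process looks like, here is a minimal sketch of a pipeline that runs its stages in order and stops at the first failure. It is not taken from the book: the stage names and the `run_pipeline` helper are hypothetical, and each stage is a stub where a real pipeline would call a version control client, a build tool or a deployment script.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[List[str], bool]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that passed and an overall success flag.
    """
    passed: List[str] = []
    for name, step in stages:
        if not step():
            # A failing stage blocks every later stage, so a broken
            # build is never deployed.
            return passed, False
        passed.append(name)
    return passed, True


# Hypothetical stages; each stub only simulates an outcome.
EXAMPLE_STAGES: List[Stage] = [
    ("checkout", lambda: True),
    ("build", lambda: True),
    ("unit tests", lambda: False),  # simulated test failure
    ("deploy", lambda: True),
]

if __name__ == "__main__":
    passed, ok = run_pipeline(EXAMPLE_STAGES)
    print("passed:", passed, "success:", ok)
```

The point of the design is that deployment is only reachable through every earlier stage, so a change that breaks the tests stops before it reaches users.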

By directly addressing the problems practitioners face today, the book describes technical tools such as version control systems. These tools are however vulnerable to evolution and to newly emerging tools, so some chapters may not stay relevant. For example, CVS and Subversion are described, but not Git, a newer distributed version control system. The spread of Git among projects will probably make Subversion obsolete one day, just as Subversion replaced CVS. Even if some chapters may age badly, the problems and solutions given in the book will remain the same. Another weak point is the short description of Hudson, a popular and widely adopted continuous integration server. A few pages on how to install and use it would have reduced the book's dependency on external sources of information.

I highly recommend this book. It crystallizes, in an easy-to-read manner, the difficulties of achieving software reliability and maintenance. The problems and the approaches to solving them apply to a very wide range of software types. Despite what the authors' biographies may suggest, this book does not apply solely to large industrial software but also to small academic open-source projects. As soon as you start writing code, even a few classes or functions, tests and a delivery process have to be made explicit.