Monday, September 12, 2011

Applying a dynamic threshold to improve cluster detection of LSI






Program source code essentially fulfills two functions: telling the computer what it should do and telling human readers what the program is supposed to do. Unfortunately, human readers are not as good as compilers at understanding what a programmer wishes to convey. Latent semantic indexing (LSI) is commonly employed to cluster source code identifiers based on their frequency across a set of source files. One compelling application of LSI is to give meaning to the terms used in source code.

Current LSI-based approaches require an arbitrary, fixed threshold to decide whether two identifiers are semantically related. One limitation of a fixed threshold is that it fails to identify clusters when they are asymmetrically balanced. Dynamic hybrid cut, the technique proposed in the article, "improves the effectiveness of LSI for detecting concerns in source code".
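To make the fixed-threshold limitation concrete, here is a minimal sketch in Python. It is not the dynamic hybrid cut from the article, and it does not compute LSI itself: the identifier names and similarity values are made up for illustration, and it assumes that "asymmetric balancing" can be read as one cluster being much more cohesive than another. Identifiers are linked whenever their similarity reaches the threshold, and connected groups are reported as clusters.

```python
from itertools import combinations


def clusters_at_threshold(names, similarity, threshold):
    """Link every pair whose similarity >= threshold and return the
    connected groups that contain at least two identifiers."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        # Pairs missing from the table are treated as unrelated (similarity 0).
        if similarity.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Hypothetical identifiers and similarity values (not taken from the article).
names = ["openAccount", "closeAccount", "accountBalance", "sendPacket", "recvPacket"]
similarity = {
    # A tight "account" cluster.
    frozenset(("openAccount", "closeAccount")): 0.92,
    frozenset(("openAccount", "accountBalance")): 0.90,
    frozenset(("closeAccount", "accountBalance")): 0.88,
    # A looser "packet" cluster.
    frozenset(("sendPacket", "recvPacket")): 0.55,
    # A spurious cross-cluster link, stronger than the packet cluster's own link.
    frozenset(("accountBalance", "sendPacket")): 0.60,
}

strict = clusters_at_threshold(names, similarity, 0.7)
lenient = clusters_at_threshold(names, similarity, 0.5)
print(strict)   # only the account cluster; the packet cluster is missed
print(lenient)  # all five identifiers merged into a single cluster
```

With these numbers, a threshold of 0.7 finds only the tight cluster, while 0.5 merges everything into one group. No single cut-off recovers both clusters, which is the kind of situation where a threshold that adapts to each cluster would presumably help.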

The approach is reasonable and intuitive. I particularly like the fact that it has been tried out on two industrial case studies. The article is easy to read, although it requires some familiarity with LSI techniques.