softwareclones.org

University of Bremen, Software Engineering Group

iClones Documentation

This document describes how to use our incremental clone detection tool iClones to extract clone evolution data from a program's history. What you need is access to the source code of your program's different versions (for example, access to the repository). What you get in the end is a file in rcf (Rich Clone Format) that contains the clone evolution data. You can then use our tool cyclone or the rcf Java API to analyze the data.

Please note that you can also use iClones for traditional single-version clone detection, just like other token-based clone detectors. In general, the (incremental) analysis consists of the following steps:

  1. Preparing the source code (only incremental analysis)
  2. Cleaning up the source code
  3. Creating change information (only incremental analysis)
  4. Running the incremental detection
  5. Using the data

Checklist

There are a couple of things you might want to ensure before you start:

Ubuntu. We are developing and using iClones with Ubuntu. To maximize the probability that things will work, you might want to use Ubuntu as well. Since iClones is written in Java, it should, at least in principle, work with any other operating system, too.

Ruby. Ruby is not absolutely required but we provide scripts that can make your life easier. If you want to use them, you will need Ruby.

Memory. iClones requires a considerable amount of main memory. How much depends primarily on the size of the system you want to analyze and the number of versions. For medium-size analyses, we suggest at least 4 GB. Smaller projects or just testing things out might be feasible with 1 GB or 2 GB. For larger analyses you definitely need more than 4 GB. Unfortunately, we cannot provide a formula for calculating the memory requirement. We recommend keeping an eye on iClones' memory consumption during execution.

1. Preparing the source code

Our incremental clone detection tool iClones reads the relevant source code from the file system. That is, there is no direct reading from a repository yet, because getting files from a repository is usually fairly slow. iClones expects a single directory d in your file system as input. The top-level directory d contains all the source files relevant for your analysis and has to contain exactly one subdirectory for each version of your program that you want to analyze. iClones sorts the versions lexicographically according to the names of the respective directories, so make sure that you choose directory names whose lexicographic order matches the chronological order of the versions (for example, by zero-padding version numbers). The organization of the directories is shown in the following figure. The directory d is what you later pass to iClones.


Organization of directories for the analysis.
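
Because the version order depends purely on the directory names, it can be worth checking that order before starting a long analysis. The following Python sketch lists the subdirectories of d in plain lexicographic order. It assumes that iClones compares the names as ordinary strings; the helper names version_order and padded_name are made up for this illustration and are not part of iClones.

```python
import os


def version_order(d):
    """Return the names of the subdirectories of d in lexicographic order."""
    return sorted(entry.name for entry in os.scandir(d) if entry.is_dir())


def padded_name(number, width=4, prefix="v"):
    """Build a directory name whose string order follows the numeric order."""
    return f"{prefix}{number:0{width}d}"
```

With version directories named v1, v2, and v10, version_order returns v1, v10, v2, which is not the chronological order. Names such as those produced by padded_name (v0001, v0002, v0010) avoid the problem.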

The simplest way is to put everything that belongs to a version in the respective directory. To save space on your file system, you may want to delete everything that is not a source file relevant to your analysis. This can, for example, be done with find like this (in case of C code):

            find path/to/d -type f ! -name '*.c' ! -name '*.h' -delete

The above command removes all files whose names do not end in .c or .h.

Attention! Be careful with this command, because there is no way back.
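
Since the deletion cannot be undone, a dry run that only lists the affected files may be useful first. The following Python sketch mirrors the selection made by the find command above without deleting anything. The function name files_to_delete and its parameters are hypothetical.

```python
from pathlib import Path


def files_to_delete(root, keep_suffixes=(".c", ".h")):
    """List regular files below root whose names do not end in a kept suffix.

    Nothing is deleted; the result is only meant for inspection.
    """
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file()
        and not path.is_symlink()
        and not path.name.endswith(tuple(keep_suffixes))
    )
```

Printing the result of files_to_delete("path/to/d") shows what the find command would remove, so you can review the list before running the real command.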

2. Cleaning up the source code

For iClones to work correctly, the source code has to be cleaned. The clean-up includes removing invisible control characters, ensuring a new line at the end of each file, and replacing tab characters with spaces. You may want to try running iClones without cleaning up first, but be aware that this might cause problems.

For cleaning up, we provide a script iClean written in Ruby. The script expects two or more parameters. The first parameter determines how many spaces are inserted for each tab character. All other parameters are directories in which ALL files are cleaned recursively.

Attention! The script performs the cleaning in every file. You may want to create a backup copy before you apply the script.

A sample call to iClean where every tab character is replaced by four spaces looks like the following:

            iclean 4 path/to/d

Feel free to modify the script so that it fits your own needs.
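
To make the three clean-up steps concrete, here is a minimal Python sketch of them. It is not the iClean script itself. In particular, which characters count as invisible control characters is an assumption (here: all ASCII control characters except tab and line feed, so carriage returns are removed as well), and the names clean_text and clean_file are hypothetical.

```python
import re

# Assumption: ASCII control characters other than tab (\x09) and line feed (\x0a).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def clean_text(text, spaces_per_tab=4):
    """Apply the three clean-up steps to the contents of one file."""
    text = CONTROL_CHARS.sub("", text)                  # remove control characters
    text = text.replace("\t", " " * spaces_per_tab)     # replace tabs with spaces
    if text and not text.endswith("\n"):                # ensure a final newline
        text += "\n"
    return text


def clean_file(path, spaces_per_tab=4):
    """Clean one file in place; latin-1 keeps every byte value intact."""
    with open(path, encoding="latin-1", newline="") as handle:
        cleaned = clean_text(handle.read(), spaces_per_tab)
    with open(path, "w", encoding="latin-1", newline="") as handle:
        handle.write(cleaned)
```

Like iClean, clean_file overwrites the file, so the same advice about backups applies. Empty files are left empty in this sketch.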

3. Creating change information

iClones also requires information about which files have changed from one version to the next. In the directory of each version, there has to be a file named changes that contains one line for each file that has changed between the previous and the current version. iClones recognizes three different types of changes: A, D, and M.

            A "relative/path/to/file"

tells iClones that the corresponding file has been added, that is, it appears for the first time in this version.

            D "relative/path/to/file"

tells iClones that the corresponding file has been deleted, that is, it no longer exists in this version.

            M "relative/path/to/file"

tells iClones that the corresponding file has been modified.

There is one more type of change, R.

            R "old/path/to/file" "new/path/to/file"

tells iClones that the corresponding file is now known under a different name or path. This is needed so that iClones can track clones even when files are moved or renamed. Please note that this change does not imply that the file is modified.

Any line in the file that is empty or starts with # is ignored. For the very first version of your system, the changes file contains one line starting with A for each source file that is to be analyzed. Attention! You are responsible for ensuring consistency. For example, if a file is modified in a given version, you have to ensure that it has been added before. iClones tries to ignore such inconsistencies but may still crash in some situations.
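
Because consistency is your responsibility, a small check before the actual run can save time. The Python sketch below parses changes files in the format described above and reports changes that refer to files unknown at that point. It is an illustration only and not part of iClones; the function names are hypothetical, and the rule that changes within one version are processed in file order is an assumption.

```python
import shlex


def parse_changes(text):
    """Parse the contents of a changes file into tuples such as ("A", path)."""
    changes = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # empty lines and comment lines are ignored
        kind, *paths = shlex.split(line)  # handles the quoted paths
        expected = 2 if kind == "R" else 1
        if kind not in ("A", "D", "M", "R") or len(paths) != expected:
            raise ValueError(f"malformed change line: {line!r}")
        changes.append((kind, *paths))
    return changes


def check_consistency(versions):
    """Check a list of parsed changes files, oldest version first.

    Returns a list of human-readable problem descriptions.
    """
    known = set()
    problems = []
    for index, changes in enumerate(versions):
        for kind, *paths in changes:
            source = paths[0]
            if kind == "A":
                if source in known:
                    problems.append(f"version {index}: {source} added twice")
                known.add(source)
                continue
            if source not in known:
                problems.append(f"version {index}: {kind} on unknown file {source}")
            if kind == "D":
                known.discard(source)
            elif kind == "R":
                known.discard(source)
                known.add(paths[1])
    return problems
```

An empty result from check_consistency means that every D, M, and R change refers to a file that was added (or renamed to that path) earlier.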

Creating the change information can be quite time-consuming. Consequently, we also provide a script named iChanges to do the work for you. iChanges takes a language and the root directory of your analysis. The language is used only to determine which files are relevant based on the files' names. iChanges currently understands ada, c, cpp, cobol, and java. If you need other languages or have files without the default file endings, you can easily adjust the script. You can invoke the script, for example, as follows:

            ichanges java path/to/d

iChanges takes every pair of consecutive versions in d and creates the change information. Basically, it hashes the paths of the files in the old and new version to check which files exist in both versions. For every file that is found in both versions, iChanges executes diff to check whether there are any changes. If there are, the output of diff is parsed and the corresponding line is added to the changes file. If a file is found in only one of the versions, it is marked as either added or deleted.

Please note that iChanges cannot detect file movements (changes of type R), because the files' paths are exactly what iChanges uses to determine whether two files from different versions are the same. If files are moved, this will appear as a deletion of the file's old version and an addition of the file's new version. This also affects the clones that are later found in these files, that is, they will not be mapped. You may, however, manually post-process the changes files and replace two corresponding changes of type D and A with a single change of type R.
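
If you need to adapt this procedure, the core comparison can be sketched in a few lines of Python. This is not the iChanges script: it compares file contents with filecmp instead of running diff, and it assumes that the paths in the changes file are relative to the version directory. The names relative_files and change_lines are hypothetical.

```python
import filecmp
import os


def relative_files(version_dir, suffixes):
    """Collect the paths (relative to version_dir) of all relevant files."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(version_dir):
        for name in filenames:
            if name.endswith(tuple(suffixes)):
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, version_dir))
    return found


def change_lines(old_dir, new_dir, suffixes=(".java",)):
    """Create the lines of the changes file for new_dir.

    Pass old_dir=None for the very first version, so that every file is
    reported as added.
    """
    old = relative_files(old_dir, suffixes) if old_dir else set()
    new = relative_files(new_dir, suffixes)
    lines = []
    for path in sorted(old | new):
        if path not in old:
            lines.append(f'A "{path}"')
        elif path not in new:
            lines.append(f'D "{path}"')
        elif not filecmp.cmp(os.path.join(old_dir, path),
                             os.path.join(new_dir, path), shallow=False):
            lines.append(f'M "{path}"')
    return lines
```

As with iChanges, a moved file shows up here as a D line and an A line, because files are matched by path only.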

Store only changed files

This is an extra for the ambitious user. You may have noticed that a lot of source code is stored redundantly, since iClones only needs the source code of added and modified files for each version. In fact, you can delete all the files that have not been changed in a given version. Or, even better, you can avoid checking out all these unchanged files in the first place. However, in the latter case, you have to create the changes files yourself. The script iChanges will not work, as it is intended to compare two complete versions. Still, with the help of your version control system's log, this is feasible. An additional advantage is that the log can probably tell you about file movements (R) as well.
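
As one possible way to derive the change information from a version control system, the following Python sketch converts the output of git diff --name-status -M (one tab-separated record per line, with renames reported as R followed by a similarity score, the old path, and the new path) into lines for a changes file. Git is only used as an example here; the function name is hypothetical. Emitting an additional M line for renamed files whose content also changed is an assumption about what iClones expects, so verify this against your own data.

```python
def changes_from_name_status(text):
    """Convert `git diff --name-status -M` output into changes-file lines."""
    lines = []
    for record in text.splitlines():
        if not record.strip():
            continue
        status, *paths = record.split("\t")
        kind = status[:1]
        if kind in ("A", "D", "M") and len(paths) == 1:
            lines.append(f'{kind} "{paths[0]}"')
        elif kind == "R" and len(paths) == 2:
            old, new = paths
            lines.append(f'R "{old}" "{new}"')
            if status != "R100":
                # Assumption: a rename with changed content is expressed as
                # a rename followed by a modification of the new path.
                lines.append(f'M "{new}"')
        # Other statuses (for example C or T) are skipped in this sketch.
    return lines
```

Remember to restrict the result to the files that are relevant for your analysis, just as iChanges does based on the file endings.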

4. Running the incremental detection

Now you are ready to run the real clone detection itself. All the previous steps have to be done only once, whereas you have to repeat this step every time you change your parameter settings.

iClones is written in Java and can be obtained from here. No installation is needed; you can simply extract the archive. Inside, you'll find a script iclones.sh (or iclones.bat). Just execute the script and the detection will start.

There are different options available to configure the clone detection process, most importantly the location of the source code that you want to analyze. iClones looks for the value of an option in three different places. These are, with increasing precedence:

  1. A default value which is encoded in iClones
  2. A value specified in the file named .iclones.config in your home directory (key-value-pairs should be given in the form key=value one per line)
  3. A value passed to the script iclones (as -key value)
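
The precedence rules can be illustrated with a short Python sketch. It is not the actual option handling of iClones: the parsing details (such as trimming whitespace) are assumptions, the function names are hypothetical, and the default values are the ones listed in the options table of this document.

```python
# Defaults as listed in the options table of this document.
DEFAULTS = {
    "informat": "single",
    "language": "java",
    "minblock": "20",
    "minclone": "100",
    "outformat": "text",
}


def parse_config(text):
    """Parse key=value pairs, one per line, as used in .iclones.config."""
    options = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def parse_arguments(arguments):
    """Parse command-line arguments given as -key value pairs."""
    return {key.lstrip("-"): value
            for key, value in zip(arguments[0::2], arguments[1::2])}


def effective_options(config_text, arguments):
    """Later sources override earlier ones: defaults, config file, arguments."""
    return {**DEFAULTS, **parse_config(config_text), **parse_arguments(arguments)}
```

For example, if .iclones.config sets minclone=50 and the script is called with -minclone 30, the effective value is 30, while options that appear in neither place keep their defaults.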

You may also use a combination of these methods to tell iClones which options it should use. The following table summarizes the options that iClones currently supports.

Key        Values                  Default          Description
informat   single, directory       single           The input format of the sources. This parameter determines how the value passed to parameter input is interpreted. For incremental detection choose directory, for single-version detection choose single.
input      String                  -                Determines from where sources and change information are read.
language   java, c++, ada          java             The language of the sources that are analyzed. c++ can also be used for C. In single-version detection, the language is used to determine which files are relevant based on their endings. It is, however, also relevant for incremental detection.
minblock   Integer                 20               Minimum length of identical token sequences that are used to merge near-miss clones. If set to 0, only identical clones are detected.
minclone   Integer                 100              Minimum length of clones measured in tokens.
outformat  none, text, rcf, xml    text             Format used for writing clone data. none disables output.
output     String                  Standard output  Destination for writing the clone data.
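
Putting the options together, the following Python sketch assembles the command line for an incremental run that writes an rcf file. The option names are the ones documented above; the location of the script (./iclones.sh), the output file name, and the function name iclones_command are assumptions made for this example.

```python
def iclones_command(input_dir, output_file, script="./iclones.sh", **options):
    """Build the argument list for an incremental run with rcf output.

    Additional options (for example minclone=50) are appended as -key value.
    """
    settings = {
        "informat": "directory",   # incremental detection
        "input": input_dir,
        "outformat": "rcf",
        "output": output_file,
        **options,
    }
    command = [script]
    for key, value in settings.items():
        command += [f"-{key}", str(value)]
    return command
```

The resulting list can be passed to subprocess.run, for example subprocess.run(iclones_command("path/to/d", "clones.rcf", language="c++"), check=True), provided the script exists at the assumed location.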

Once started, iClones analyzes all the versions of your program. Please note that the first version usually takes significantly longer than the other versions. iClones will inform you about its progress. When iClones has finished, the output file contains the clone evolution data. Make sure that you have set outformat to rcf if you want to continue with the next step, since the default output format is text.

5. Using the data

Now that you have an rcf file, you can use our tool cyclone to inspect the data or use our Java API to write your own analysis for the clone data.

Finally, if there is anything on your mind, feel free to contact us.