iClones Documentation
This document describes how to use our incremental clone detection tool iClones to extract clone evolution data from a program's history. What you need is access to the source code of your program's different versions (for example, access to the repository). What you get in the end is a file in rcf (Rich Clone Format) that contains the clone evolution data. You can then use our tool cyclone or the rcf Java API to analyze the data.
Please note that iClones also supports traditional single-version clone detection, just like other token-based clone detectors. In general, the (incremental) analysis consists of the following steps:
- Preparing the source code (only incremental analysis)
- Cleaning up the source code
- Creating change information (only incremental analysis)
- Running the incremental detection
- Using the data
Checklist
There are a couple of things you might want to ensure before you start:
Ubuntu. We are developing and using iClones with Ubuntu. To maximize the probability that things will work, you might want to use Ubuntu as well. Since iClones is written in Java, it should, at least in principle, work with any other operating system as well.
Ruby. Ruby is not absolutely required but we provide scripts that can make your life easier. If you want to use them, you will need Ruby.
Memory. iClones requires a considerable amount of main memory. The amount depends primarily on the size of the system you want to analyze and the number of versions. For medium-size analyses, we suggest at least 4GB. Smaller projects or just testing things out might be feasible with 1GB or 2GB. For larger analyses you definitely need more than 4GB. Unfortunately, we cannot provide a formula for calculating the memory requirement. We recommend keeping an eye on iClones' memory consumption during execution.
1. Preparing the source code
Our incremental clone detection tool iClones reads the relevant source code from the file system. That is, there is no direct reading from a repository yet. The reason is that getting files from a repository is usually fairly slow.
iClones expects a single directory d in your file system as input. The top-level directory d contains all the source files relevant for your analysis. d has to contain exactly one subdirectory for each version of your program that you want to analyze. iClones sorts the versions lexicographically according to the names of the respective directories, so choose names whose lexicographic order matches the chronological order of the versions. The organization of the directories is shown in the following figure. The directory d is what you later pass to iClones.
- Organization of directories for the analysis.
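Because versions are ordered lexicographically by directory name, names with unpadded numbers can end up in an unintended order. The following minimal Python sketch illustrates the effect and one possible remedy. The version names and the helper pad_numbers are hypothetical and not part of iClones.

```python
import re


def pad_numbers(name, width=4):
    """Zero-pad every run of digits in a directory name to a fixed width.

    Hypothetical helper for illustration; it is not part of iClones.
    """
    return re.sub(r"\d+", lambda match: match.group().zfill(width), name)


# Hypothetical version directory names.
versions = ["v1", "v2", "v10"]

# Plain lexicographic order: "v10" sorts before "v2".
lexicographic = sorted(versions)

# After padding, lexicographic order matches the intended version order.
padded = sorted(pad_numbers(version) for version in versions)

print(lexicographic)
print(padded)
```

Date-based names such as 2011-03-01 sort correctly without any padding, which may be the simpler choice when versions are taken at fixed intervals.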
The simplest way is to put everything that belongs to a version in the respective directory. To save space on your file system, you may want to delete everything that is not a source file relevant to your analysis. This can, for example, be done with find like this (in case of C code):
find path/to/d -type f ! -name '*.c' ! -name '*.h' -delete
The above command removes all files whose names do not end in .c or .h. The patterns are quoted so that the shell passes them to find unexpanded.
Attention! Be careful with this command, because there is no way back.
2. Cleaning up the source code
For iClones to work correctly, the source code has to be cleaned. The clean-up includes removing invisible control characters, ensuring a new line at the end of each file, and replacing tab characters with spaces. You may want to try running iClones without cleaning up first, but be aware that this might cause problems.
For cleaning up, we provide a script iClean written in Ruby. The script expects two or more parameters. The first parameter determines how many spaces are inserted for each tab character. All other parameters are directories in which ALL files are cleaned recursively.
Attention! The script performs the cleaning in every file. You may want to create a backup copy before you apply the script. A sample call to iClean where every tab character is replaced by four spaces looks like the following:
iclean 4 path/to/d
Feel free to modify the script so that it fits your own needs.
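To make the three clean-up steps concrete, here is a minimal Python sketch of what such a clean-up could look like. It is not the iClean script itself: the exact set of control characters that iClean removes is not documented here, so the character class below is an assumption, as are the function names.

```python
import re
from pathlib import Path

# Assumption: "invisible control characters" means ASCII control characters
# other than tab, line feed, and carriage return.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(text, spaces_per_tab=4):
    """Apply the three clean-up steps described above to one file's text."""
    text = _CONTROL.sub("", text)                     # drop control characters
    text = text.replace("\t", " " * spaces_per_tab)   # tabs -> spaces
    if not text.endswith("\n"):                       # ensure final newline
        text += "\n"
    return text


def clean_file(path, spaces_per_tab=4):
    """Clean one file in place; make a backup first, as there is no undo."""
    raw = Path(path).read_bytes()
    # latin-1 maps every byte to one character and back, so bytes that are
    # not touched by clean_text are written back unchanged.
    cleaned = clean_text(raw.decode("latin-1"), spaces_per_tab).encode("latin-1")
    if cleaned != raw:
        Path(path).write_bytes(cleaned)
```

Working on bytes rather than decoded text avoids accidental line-ending conversion. Files in encodings that contain zero bytes, such as UTF-16, would be damaged by this sketch.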
3. Creating change information
iClones also requires information about which files have changed from one version to the next. In the directory of each version, there has to be a file named changes that contains one line for each file that has changed between the previous and the current version. iClones recognizes three different types of changes: A, D, and M.
A "relative/path/to/file"
tells iClones that the corresponding file has been added, that is, it appears for the first time in this version.
D "relative/path/to/file"
tells iClones that the corresponding file has been deleted, that is, it no longer exists in this version.
M "relative/path/to/file"
tells iClones that the corresponding file has been modified.
There is one more type of change, R.
R "old/path/to/file" "new/path/to/file"
tells iClones that the corresponding file is now known under a different name or path. This is needed so that iClones can track clones even when files are moved or renamed. Please note that this change does not imply that the file is modified.
Any line in the file that is empty or starts with # is ignored. For the very first version of your system, the changes file contains one line starting with A for each source file that is to be analyzed. Attention! You are responsible for ensuring consistency. For example, if a file is modified in a given version, you have to ensure that it has been added before. iClones tries to ignore such inconsistencies but may still crash in some situations.
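Since consistency is the user's responsibility, it can help to check the changes files before starting a long analysis. The following Python sketch parses the format described above and replays the changes version by version. The function names are hypothetical, and using shlex for the quoted paths is an assumption about quoting rules that this document does not spell out.

```python
import shlex


def parse_changes(text):
    """Parse the contents of one changes file into (type, path, ...) tuples."""
    changes = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Empty lines and lines starting with # are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        fields = shlex.split(line)  # assumption: shell-like double quoting
        kind, paths = fields[0], fields[1:]
        expected = 2 if kind == "R" else 1
        if kind not in ("A", "D", "M", "R") or len(paths) != expected:
            raise ValueError(f"line {number}: malformed change: {line!r}")
        changes.append((kind, *paths))
    return changes


def check_consistency(per_version_changes):
    """Replay the changes of all versions in order and collect problems.

    Returns a list of (version index, message) pairs; empty if consistent.
    """
    known = set()      # files that exist after the changes seen so far
    problems = []
    for index, changes in enumerate(per_version_changes):
        for kind, *paths in changes:
            path = paths[0]
            if kind == "A":
                if path in known:
                    problems.append((index, f"A of already known file: {path}"))
                known.add(path)
                continue
            if path not in known:
                problems.append((index, f"{kind} of unknown file: {path}"))
            if kind == "D":
                known.discard(path)
            elif kind == "R":
                known.discard(path)
                known.add(paths[1])
    return problems
```

An empty result from check_consistency only means that every M, D, and R refers to a file that was added earlier; it does not verify that the files actually exist on disk.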
Creating the change information can be quite time-consuming. Consequently, we also provide a script named iChanges to do the work for you. iChanges takes a language and the root directory of your analysis. The language is used only to determine which files are relevant based on the files' names. iChanges currently understands ada, c, cpp, cobol, and java. If you need other languages or have files without the default file endings, you can easily adjust the script. You can invoke the script, for example, as follows:
ichanges java path/to/d
iChanges takes every pair of consecutive versions in d and creates the change information. Basically, it hashes the paths of the files in the old and new version to check which files exist in both versions. For every file that is found in both versions, iChanges executes diff to check whether there are any changes. If there are, the output of diff is parsed and the corresponding line is added to the changes file. If a file is found in only one of the versions, it is marked as either added or deleted.
Please note that iChanges cannot detect file movements (changes of type R), because the files' paths are exactly what iChanges uses to determine whether two files from different versions are the same. If files are moved, this will appear as a deletion of the file's old version and an addition of the file's new version. This also affects the clones that are later found in these files, that is, they will not be mapped. You may, however, manually post-process the changes files and replace two corresponding changes of type D and A with a single change of type R.
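The comparison of two consecutive versions can be sketched in Python as follows. This is not the iChanges script: the function names are hypothetical, and the sketch compares file contents byte by byte with filecmp instead of running diff and parsing its output. Like iChanges, it matches files by path and therefore reports a moved file as D plus A.

```python
import filecmp
import os


def source_files(root, suffixes):
    """Return the relative paths below root whose names end in a given suffix."""
    found = set()
    for directory, _, names in os.walk(root):
        for name in names:
            if name.endswith(suffixes):
                relative = os.path.relpath(os.path.join(directory, name), root)
                found.add(relative.replace(os.sep, "/"))
    return found


def change_lines(old_root, new_root, suffixes=(".java",)):
    """Return changes-file lines for new_root relative to old_root.

    Pass old_root=None for the very first version: every file is then added.
    """
    old = source_files(old_root, suffixes) if old_root else set()
    new = source_files(new_root, suffixes)
    lines = []
    for path in sorted(old | new):
        if path not in old:
            lines.append(f'A "{path}"')
        elif path not in new:
            lines.append(f'D "{path}"')
        elif not filecmp.cmp(os.path.join(old_root, path),
                             os.path.join(new_root, path), shallow=False):
            lines.append(f'M "{path}"')
    return lines
```

Writing the returned lines to a file named changes inside new_root would complete the step. The suffixes tuple plays the role of the language parameter of iChanges.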
Store only changed files
This is an extra for the ambitious user. You may have noticed that a lot of source code is stored redundantly, since iClones only needs the source code of added and modified files for each version. In fact, you can delete all the files that have not been changed in a given version. Or, even better, not check out all these unchanged files in the first place. However, in the latter case, you have to create the changes files yourself. The script iChanges will not work, as it is intended to compare two complete versions. Still, with the help of your version control system's log, this is feasible. An advantage is that the log can probably tell you about file movements (R) as well.
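As an illustration of deriving the change information from a version control system, the following Python sketch converts the output of git diff --name-status between two revisions into changes-file lines. Git is only an assumed example, since this document does not prescribe a version control system. The function name is hypothetical, and emitting an additional M line for a renamed file whose content also changed is an untested assumption about what iClones accepts.

```python
def changes_from_name_status(output):
    """Convert `git diff --name-status` output into changes-file lines.

    Each input record is tab-separated: a status (A, D, M, or R followed by
    a similarity score such as R087) and one or two paths.
    """
    lines = []
    for record in output.splitlines():
        if not record.strip():
            continue
        status, *paths = record.split("\t")
        kind = status[0]
        if kind in ("A", "D", "M") and len(paths) == 1:
            lines.append(f'{kind} "{paths[0]}"')
        elif kind == "R" and len(paths) == 2:
            old, new = paths
            lines.append(f'R "{old}" "{new}"')
            # R alone does not imply a modification. Assumption: a rename
            # with changed content is expressed as R followed by M.
            if status != "R100":
                lines.append(f'M "{new}"')
        else:
            # Copies, type changes, and unmerged entries are not mapped here.
            raise ValueError(f"unsupported status record: {record!r}")
    return lines
```

The sketch does not filter by file ending, so records for files that are irrelevant to the analysis would have to be dropped separately. Paths containing unusual characters are quoted by git in this output format and are not handled here.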
4. Running the incremental detection
Now you are ready to run the real clone detection itself. All the previous steps have to be done only once, whereas you have to repeat this step every time you change your parameter settings.
iClones is written in Java and can be obtained from here. No installation is needed; you can simply extract the archive. Inside, you'll find a script iclones.sh (or iclones.bat). Just execute the script and the detection will start.
There are different options available to configure the clone detection process, most importantly the location of the source code that you want to analyze. There are three different ways in which iClones scans for options. These are, in order of increasing precedence:
- A default value which is encoded in iClones
- A value specified in the file named .iclones.config in your home directory (key-value pairs should be given in the form key=value, one per line)
- A value passed to the script iclones (as -key value)
You may also use a combination of these methods to tell iClones which options it should use. The following table summarizes the options that iClones currently supports.
Key | Values | Default | Description |
---|---|---|---|
informat | single, directory | single | The input format of the sources. This parameter determines how the value passed to parameter input is interpreted. For incremental detection choose directory, for single-version detection choose single. |
input | String | – | Determines from where sources and change information are read. |
language | java, c++, ada | java | The language of the sources that are analyzed. c++ can also be used for C. In single-version detection, the language is used to determine which files are relevant based on their endings. It is, however, also relevant for incremental detection. |
minblock | Integer | 20 | Minimum length of identical token sequences that are used to merge near-miss clones. If set to 0, only identical clones are detected. |
minclone | Integer | 100 | Minimum length of clones measured in tokens. |
outformat | none, text, rcf, xml | text | Format used for writing clone data. none disables output. |
output | String | Standard output | Destination for writing the clone data. |
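The precedence of the three option sources can be illustrated with a short Python sketch. It is not the option handling of iClones itself: the function names are hypothetical, the defaults are copied from the table above, and details such as whitespace handling in .iclones.config are assumptions.

```python
# Defaults as listed in the table above (input has no default).
DEFAULTS = {
    "informat": "single",
    "language": "java",
    "minblock": "20",
    "minclone": "100",
    "outformat": "text",
}


def read_config(text):
    """Parse key=value lines as used in .iclones.config."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()
    return options


def parse_arguments(argv):
    """Parse command-line arguments given as -key value pairs."""
    if len(argv) % 2 != 0:
        raise ValueError("every -key needs a value")
    options = {}
    for key, value in zip(argv[0::2], argv[1::2]):
        if not key.startswith("-"):
            raise ValueError(f"expected -key, got {key!r}")
        options[key[1:]] = value
    return options


def effective_options(config_text, argv):
    """Merge the three sources; later sources override earlier ones."""
    return {**DEFAULTS, **read_config(config_text), **parse_arguments(argv)}
```

For example, with minclone=50 in the configuration file and -minclone 80 on the command line, the effective value is 80, while options that neither source mentions keep their defaults.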
Once you have started iClones, it will analyze all the versions of your program. Please note that the first version usually takes significantly longer than the other versions. iClones will inform you about its progress. When iClones has finished, the output file contains the clone evolution data in rcf, provided that outformat is set to rcf and output names a file.
5. Using the data
Now that you have an rcf file, you can use our tool cyclone to inspect the data or use our Java API to write your own analysis for the clone data.
Finally, if there is a thing that's on your mind and starts with
- “I've done exactly as written in this document, but...”
- “I followed your instructions, but I can't find information on...”
- “It works, but wouldn't it be better if...”
- “I want to analyze..., but...”
feel free to contact us.