Introduction
The repository has some features for analysis of its performance which are intended to help a developer identify opportunities for optimization. This page describes these features and how to use them.
Note that this page was written relative to a particular version of the repository code:
/vesta/vestasys.org/vesta/repos/85
If you haven't yet done so, it's probably best to start by reading the "Performance Tuning" section in the repository(8) man page.
Building with Performance Debug
Even when not turned on, the performance debug data recording features have some small negative impact on performance. For that reason, the performance debugging features are not compiled into the repository by default.
To enable them at build time, ./vesta/repos/perf_debug must be set to TRUE. The repository package has several top-level models with names ending in "_perf.main.ves" which turn this feature on. Alternatively, you can add this to your top-level model:
. ++= [ vesta/repos/perf_debug = TRUE ];
This affects the repository build in two ways:
- It compiles the performance debug data recording features into the repository server itself.
- It causes three additional programs to be built:
vperfdebug is used to enable and disable the data recording
read_timing is used to analyze fine-grained timing data recorded during the processing of NFS transactions
read_lock_timing is used to analyze data recorded about the acquisition and release of several central locks
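For example, here's a rough sketch of building the repository package with one of those models (the model name and working directory below are only illustrative; use whichever *_perf.main.ves model actually exists in your checkout of the repos package):

# build with performance debugging compiled in;
# "linux_i386_perf.main.ves" is a hypothetical example name
cd /vesta-work/yourname/repos
vesta linux_i386_perf.main.ves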
Recording Data
Once you're running a repository with the data recording features, you'll need to specifically enable them. This allows you to record data only during periods of interesting activity.
There are two types of data which can be recorded:
- Fine-grained timing of the processing of NFS calls. This helps to identify where the repository is spending time and why certain transactions are slow.
- The timing of the acquisition and release of several central locks (StableLock, VolatileRootLock, and userGroup_lock).
Normally you want to record both types of data, but it is possible to enable and disable the recording of them independently.
To enable both types of data recording:
% vperfdebug -n -l
Enabled fine-grained NFS call timing
Enabled central lock timing
To disable the data recording:
% vperfdebug -N -L
Disabled fine-grained NFS call timing
Disabled central lock timing
(Note that enabling and disabling the performance data recording requires administrator access, so this must be done as vadmin, root, or vwizard.)
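Putting this together, a typical recording session might look like the following sketch (run as one of the administrative users mentioned above; the sleep is just a stand-in for however long you want to capture interesting activity). Since -n/-N control the NFS call timing and -l/-L control the lock timing, either pair can also be used on its own if you only want one kind of data:

# turn on both kinds of data recording
vperfdebug -n -l
# let the repository serve the workload you want to measure
sleep 3600
# turn recording back off
vperfdebug -N -L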
The timing data gets stored in files in sub-directories below [Repository]metadata_root. These directories have names which include the date, for example "timings.2005-12-04". Within these directories there are separate files for the two types of data recorded. Each thread gets its own file, and a new file is started periodically (by default once per hour, but this can be changed by setting [Repository]timing_file_minutes). The filenames include the process ID, the thread ID, and the hour and minute of the period which the file is for. Specifically, the filename format is:
"nfs_calls.pid_123.thread_456.HH_MM"
read_timing processes these files
"rwlocks.pid_123.thread_456.HH_MM"
read_lock_timing processes these files
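For example (the metadata_root path and date below are placeholders; substitute the [Repository]metadata_root setting from your vesta.cfg and the day of interest), the files recorded during a particular hour can be picked out with a simple glob:

# go to the timing directory for the day of interest
cd /some/metadata_root/timings.2005-12-04
# list the files recorded by all processes and threads between 14:00 and 15:00
ls nfs_calls.pid_*.thread_*.14_* rwlocks.pid_*.thread_*.14_*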
Analyzing the Data
The recorded data is in a binary format (mainly to keep the repository from spending much time writing it). To get human-readable information out of it, you need to use the read_timing and read_lock_timing programs.
They each produce a text file containing a report and one or more files of input for gnuplot to generate graphs visualizing the data.
Here's how you run each of these tools:
read_timing [-o output_prefix] nfs_calls.file1 nfs_calls.file2 ...
- By default it creates files in the current directory beginning with "read_timing.out". Use the -o option to change this.
read_lock_timing [-o output_prefix] [-t slow_msecs] rwlocks.file1 rwlocks.file2 ...
- By default it creates files in the current directory beginning with "rwlock_timing.out". Use the -o option to change this.
- Further analysis is done on lock acquisitions which took more than slow_msecs milliseconds. This defaults to 1000 (1 second).
- If not enough slow acquisitions are found, you can use the -t option to lower this threshold so that shorter lock acquisitions are given scrutiny.
- If too many slow lock acquisitions are found, you can use the -t option to raise the length considered "slow" and exclude some of them.
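As a concrete sketch (the output prefixes and file globs here are only examples), an analysis run over one period's data might look like this:

# analyze the NFS call timing data; output files will start with "nfs_analysis"
read_timing -o nfs_analysis nfs_calls.pid_*.thread_*.14_*
# analyze the lock timing data, treating acquisitions over 500 ms as "slow"
# instead of the default 1000 ms
read_lock_timing -o lock_analysis -t 500 rwlocks.pid_*.thread_*.14_*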
After running both tools, you'll want to run gnuplot on each of the files ending in ".gnuplot" which were written. Here's a simple shell script which will do that for you:
for f in *.gnuplot; do cat $f | /usr/bin/gnuplot; done
You might find that you want larger images than the default size to get more detail in the graphs. You can increase the size by a factor of 1.5 with:
for f in *.gnuplot; do ( echo "set size 1.5,1.5" ; cat $f ) | /usr/bin/gnuplot; done
(You can use other numbers, but 2 is probably the highest useful value unless you have a very large monitor.)
Reading the Report from read_timing
The report generated by read_timing (named "read_timing.out.report" unless you used the "-o" command-line flag) has one section for each NFS procedure. Each section first contains aggregate statistics for all calls of that type in the data set:
count: The total number of calls to this procedure
total time: The total amount of time spent processing all calls to this procedure
slowest time: How long the longest call to this procedure took to service
average time: The average time it took to service a call to this procedure
After this is data which pertains only to the slowest 1% of calls to this procedure. These calls are given more analysis because usually they demonstrate the causes of poor performance and represent the best opportunities for optimization.
There are four groups of information given about the slowest 1% of calls:
- The range of time spent in those calls (from the fastest of the slowest 1% to the slowest of the slowest 1%). This makes it easier to see how far these are from the average across all calls.
- Three groups of data about where precisely in the code the slowest calls spent their time. These are given in terms of source intervals. There are specific points in the code where time records are made during the processing of NFS calls. (In the code, a time point is recorded at each RECORD_TIME_POINT; statement.) This makes it possible for read_timing to measure the amount of time taken between two locations in the source code.
slowest intervals in slowest 1%: For each of the slowest 1% of calls, the source interval which took the longest time is found. This section lists those source intervals along with a count of the number of calls in which each was the slowest interval and the percentage of the slowest 1% in which it took the longest time. The intervals are sorted by how often each was the slowest.
total time spent in all intervals in slowest 1%: This lists each source interval found in the slowest 1% and the aggregate amount of time spent in each. It also lists what percentage of the total time taken to process all calls in the slowest 1% was spent in each interval. The intervals are sorted by the total amount of time each took across all of the slowest 1% of calls.
range of time spent in all intervals in slowest 1%: For each source interval found in the slowest 1%, this section shows the range of time spent in that interval (from the smallest to the largest amount of time spent in it). The ranges are sorted by the longest amount of time spent in each.
Finally, some calls (currently lookup and readdir) list specific paths in the repository which required a long time during certain source intervals. After "Slowest StableLock LongId intervals recorded", a source interval is printed followed by a number of records of the amount of time taken in that interval and a specific path in the repository, printed both as a raw LongId and as a path into the repository. These can be used to identify particular places in the repository which may exhibit poor performance (e.g. due to a very large directory). Note that read_timing always uses the local repository to translate the recorded LongIds back into paths. LongIds are not the same across different repositories, so you should run it against the same repository where the data was recorded.
One important thing to be aware of when looking at the source intervals is that some of them represent the acquisition of different locks. To improve those, you need to identify what other thread or threads were holding the lock in question and reduce the amount of time they hold that lock.
Reading Graphs from read_timing
There are three types of graphs produced by read_timing:
all_procs_histo (e.g. read_timing.out.all_procs_histo.png): shows a histogram of the time taken to process all NFS requests. One line is drawn for each NFS procedure and another line is drawn for the sum across all procedures. The Y axis is the number of calls and is logarithmic. The X axis is a logarithmic binning of call times. Specifically, it is the logarithm base 2 of the time in milliseconds rounded up to the next integer. (So the point at 1 represents the number of calls which took longer than 2^0 milliseconds and up to 2^1 milliseconds, the point at 2 represents the number of calls which took longer than 2^1 milliseconds and up to 2^2 milliseconds, and so on. For example, a call that took 37 milliseconds falls at 6, since 2^5 = 32 < 37 <= 64 = 2^6.)
procedure_histo (e.g. read_timing.out.create_histo.png): shows just the histogram for the time to process a single procedure. The axes are the same as for all_procs_histo, but it's a little easier to read.
slow_procedure_histo (e.g. read_timing.out.slow_create_histo.png): shows a histogram of time spent in different source intervals during the processing of the slowest 1% of calls to this procedure. This gives another way to look at the source interval data for slow calls. The axes are the same as the other histograms, but one line is drawn for each source interval which has at least some values in more than just the 0 group on the X axis. The source intervals are sorted such that the first one listed has the highest amount of time in an individual call.
The histogram across all procedures can give you a sense of which procedures are taking the most time. The histogram for a single procedure can make it a little easier to see whether there's a gradual trend or whether the long calls are statistical outliers. The source interval histogram for slow calls can help with understanding where the calls are spending their time.
Reading the Report from read_lock_timing
The report generated by read_lock_timing (named "rwlock_timing.out.report" unless you used the "-o" command-line flag) lists information about the acquisition and release of central locks. StableLock and VolatileRootLock control directory structure in the repository. userGroup_lock controls access to the user and group information used to determine permissions and access. These are instances of the class ReadersWritersLock which allows for multiple simultaneous readers or a single writer. StableLock is used for /vesta and /vesta-work: anything which reads directory contents must acquire a read lock and anything which changes directory contents must acquire a write lock. VolatileRootLock controls just the volatile root (/vesta-work/.volatile), but tends to see a higher rate of locking and unlocking as individual volatile directories are created and deleted quickly as builds run.
The data recording for these locks includes the time each thread spends waiting to acquire the lock, the time each thread spends holding the lock, and in most cases the reason it was locked (e.g. "NFS:lookup" for servicing an NFS lookup call).
For any lock acquisitions that are considered "slow" (i.e. take longer than the time specified with the -t option to read_lock_timing), additional information is gathered. All other operations on the same lock from other threads that overlap in time with the slow acquisition are found. Information about all of these operations is reported, including how long each thread took to acquire the lock, how long it held the lock, how long it took to release the lock, where in time it occurred relative to the other operations, and the reason it was locked. This makes it possible to see which operations in other threads contributed to the long lock acquisition time. Sometimes a single thread holding a lock is obviously the cause. In other cases several smaller operations may have occurred in the right order to have a combined effect that caused a long acquisition time.
Reading Graphs from read_lock_timing
For each "slow" lock acquisition, read_lock_timing) produces a graph to visually show the sequence of events. Each vertical line shows one thread performing an operation using the lock during the slow lock acquisition. Different colors represent periods spent acquiring, holding, and releasing the lock. The colors for read and write locks are different. Threads are shown left to right in the same order they are listed top to bottom in the text report.