\chapter{Experiments}
We performed experiments to determine which parts of the software package consume the most time and resources. The PETSc profiler was used to find how much time each function takes and how frequently it is called. In the following sections, we explain the experimental strategies adopted, along with the rationale for each, to extract more detailed information.
\section{Experiment 1: Instrument the Code}
We started with the example codes in the Dendro package. Dendro is designed to solve image-registration problems by solving elasticity problems. The elasticity solver was instrumented with \texttt{MPI\_Wtime} to obtain timing information.
\begin{figure*}
\centering
\includegraphics[scale=0.5]{pngs/scal.png}
\caption{Scalability Study of the Multigrid Solver of Dendro}
\label{timing-scal}
\end{figure*}
We have plotted the timing distribution for the four basic blocks in Figure~\ref{timing-scal} for runs on a varying number of processors, keeping the number of local points per processor fixed at 1000. We can see that the \textbf{Solve} part of the routine dominates the timing. To investigate the performance of the solver (\texttt{newElasSolver}), the number of points local to each processor is taken as an input (a \textbf{weak-scalability} approach); this is equivalent to assigning the same workload to each processor.
We observe that if the number of local points per processor is kept constant, the timing and the flop counts do not vary much even as the number of processors increases. From Figure~\ref{timing-scal}, it is evident that Solve (\texttt{DAMGSolve} in PETSc) takes most of the time. So, we went deeper into the modules of the multigrid solver to find which functions contribute most of the time. The issues are time, efficiency, and scalability. I/O overheads can be an impediment to the scalability of any application; the I/O overheads and parallel I/O issues are analyzed in Section~\ref{ioex} of the Appendix.
\section{Experiment 2: Performance Statistics of Dendro}
We have used the PETSc profiler output to analyze the performance statistics. From Figure~\ref{timing-scal}, we can say that the solver takes most of the time. We have analyzed the performance statistics for the Dendro multigrid solver and plotted them in Figure~\ref{stats}.
\begin{figure}
\centering
\includegraphics[scale=0.6]{pngs/multigrid.png}
\caption{Performance Statistics of the Multigrid Solver of Dendro}
\label{stats}
\end{figure}
\paragraph{}
The units for the curves in the figure are not the same. The legends in the plot indicate how much each quantity has been scaled so that all quantities can be placed on the same graph to facilitate comparison. It is also interesting to know which components of the solver take more time.
\begin{figure}
\centering
\includegraphics[scale=0.5]{pngs/perf32.png}
\caption{Breakdown of the timing and performance of the Multigrid Solver of Dendro}
\label{timing-perf}
\end{figure}
From Figure~\ref{timing-perf}, we can clearly observe that \texttt{MatMult()} and \texttt{KSPSolve()} take most of the time. The performance statistics of the underlying functions also need to be analyzed. For convenience of analysis, we have classified the functions into the following three groups and plotted them separately.
\begin{enumerate}
\item \textbf{Matrix functions}
\begin{itemize}
\item MatMult
\item elMultFinest
\end{itemize}
\item \textbf{Vector functions}
\begin{itemize}
\item VecDot
\item VecAXPY
\item VecAYPX
\end{itemize}
\item \textbf{Krylov solver}
\begin{itemize}
\item KSPSolve
\item PCApply
\end{itemize}
\end{enumerate}
\paragraph{}
We have plotted the performance statistics for the matrix-multiplication routines in Figure~\ref{matperf}. It is interesting to observe that \texttt{MatMult} and \texttt{elMultFinest} have similar flop and timing patterns, but there is a significant difference in the number of reductions they perform. \texttt{elMultFinest} is a part of the \texttt{MatMult} operation being performed.
\begin{figure}
\centering
\includegraphics[scale=0.5]{pngs/mat-perf.png}
\caption{Performance Statistics for Matrix functions}
\label{matperf}
\end{figure}
In Figure~\ref{vecperf}, the performance statistics for the vector operations are plotted, with the required scaling applied to fit the data in the plot.
\begin{figure}
\centering
\includegraphics[scale=0.5]{pngs/vec-perf.png}
\caption{Performance Statistics for Vector functions}
\label{vecperf}
\end{figure}
Figure~\ref{ksperf} contains the performance statistics for the \texttt{KSPSolve} and \texttt{PCApply} routines.
\begin{figure}
\centering
\includegraphics[scale=0.5]{pngs/ksp-perf.png}
\caption{Performance Statistics for Krylov Solver}
\label{ksperf}
\end{figure}
\texttt{KSPSolve} and \texttt{PCApply} have almost identical characteristics, as shown in Figure~\ref{ksperf}.
The impact of the memory requirements of different functions on performance also needs to be addressed in order to optimize performance. In the next section we study the memory requirements of the different modules of Dendro.
\section{Experiment 3: Memory Requirements and Performance Statistics}
In this section, we plot the number of times each class of operation is performed, together with its associated memory requirement. The operations plotted are as follows:
\begin{enumerate}
\item \texttt{Mat}
\item \texttt{Vec}
\item Krylov solver
\end{enumerate}
In Figure~\ref{matreq}, we have plotted the number of matrix (\texttt{Mat}) creations/destructions performed, together with the memory required to perform them.
\begin{figure*}
\centering
\includegraphics[scale=0.55]{pngs/matrix-plot.png}
\caption{Memory Requirements for \texttt{Mat} Operations}
\label{matreq}
\end{figure*}
In Figure~\ref{vecreq}, we have plotted the number of vector (\texttt{Vec}) creations/destructions performed, together with the memory required to perform them.
\begin{figure*}
\centering
\includegraphics[scale=0.6]{pngs/vec-plot.png}
\caption{Memory Requirements for \texttt{Vec} Operations}
\label{vecreq}
\end{figure*}
\paragraph{}
Figure~\ref{vecreq} shows that, as expected, the memory requirement for \texttt{Vec} operations grows linearly with the problem size. In Figure~\ref{ksreq}, we have plotted the count and memory requirement for performing the Krylov solve.
\begin{figure*}
\centering
\includegraphics[scale=0.6]{pngs/ks-plot.png}
\caption{Memory Requirements for the Krylov Solver}
\label{ksreq}
\end{figure*}
\paragraph{}
Here, however, the memory requirement does not grow appreciably as the problem size (the number of points local to any processor) increases from 1000 to 1500.
\section{Analyzing the Matrix-Vector Multiplication (Matrix-free)}\label{analyze}
From the experiments, it was clear that \texttt{KSPSolve} is the solution module that takes most of the time, and that \texttt{elMult} and \texttt{MatMult} account for the majority of the solution time. So, we analyzed the code that performs \texttt{elMult} and \texttt{MatMult}. It is a small block of code (fewer than 50 lines) that performs a matrix-vector multiplication in a matrix-free fashion and is called hundreds of times for a reasonable problem size per processor. \texttt{elMult} and \texttt{MatMult} are called 540 times when the number of local points is 1000. The count of \texttt{elMult} indicates how many times the matrix-vector multiplication routine \texttt{ElasticityMatMult()} is executed. If the number of points is $N$, the approximate operation count is $\Theta(8N^2)$. Thus, on a 2.3~GHz PowerPC (assuming one operation per cycle), it should take
\begin{equation}
\mathrm{Time} = 540\times 8N^2\times\frac{1}{2.3}\times 10^{-9}~\text{seconds}
\end{equation}
So, for a problem size (number of local points) of 1000, the time required is
\begin{equation}
\mathrm{Time} = 540\times(8\times 1000^2)\times\frac{1}{2.3}\times 10^{-9}~\text{seconds} = 1.878~\text{seconds}
\end{equation}
With the default build parameters (no optimization), \texttt{elMult} takes 74.175 seconds. That means it could be around 39 times faster. With compiler optimization (\texttt{-O3}), for a problem size of 1000 points per processor, \texttt{elMult} takes around 5.575 seconds, as shown in Figure~\ref{o1fig}.
\begin{figure}
\centering
\includegraphics[scale=0.55]{pngs/O1.png}
\caption{Performance Improvement from Using \texttt{-O3} over the Default Dendro Elasticity Solver (\texttt{newElasSolver})}
\label{o1fig}
\end{figure}
That means the code could still be about 2.97 times faster when the number of points per processor is 1000. Here, we neglect memory-access effects: the vectors need to be loaded, so cache performance will limit the theoretically achievable time. This analysis helps us understand how much time such an operation should theoretically take and, by comparison with the experimental results, determine whether the achieved performance is reasonable.