SSD6 Exercise5 心得

我叫默

浏览: 4525 次
性别:
来自: 武汉

最近访客更多访客>>

wingler

outlwier

Javens66458

神仙麽麽茶

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

2009-01 ( 1)
更多存档...

博客分类：

我叫默大三事件簿

Cache C C++C#算法

引用

Cache Lab: Improving Program Locality
INTRODUCTION
This exercise deals with optimizing memory-intensive code. Image processing is one area that benefits greatly from such optimizations.

In this exercise we'll be optimizing two functions: rotate , a function designed to rotate an image 90 degrees clockwise, and smooth , a function designed to smooth (or more precisely to blur) an image. Your goal is to maximize the cache hit rate of these functions on a simulated L1 cache. We provide you with a cache simulator that simulates the performance of a computer's cache.

For our purposes, we will consider an image to be represented by a two-dimensional matrix M, where Mi,j denotes the (i ,j )th pixel of M. To simplify things, this assignment will deal only with square images. Rows and columns are numbered in zero-indexed fashion (like arrays), so rows and columns number from 0 to N-1, where N is the width/height of the image matrix. Pixel values consist of four 1-byte fields representing red, green, blue, and alpha.

LOGISTICS
The files needed for this assignment can be downloaded here . Once you've extracted the zip file to its own directory, you'll see a number of C source and header files:

File: Function:

cache.c Contains the code used for the cache simulation

cache.h Header file for cache simulator

defs.h Contains commonly used definitions and structures

cache.vcproj Visual C++ project file

driver.c The main program for testing various functions

rotate.c Contains the rotate functions. You will modify this file.

smooth.c Contains the smooth functions. You will modify this file.

The only files you need to change, and the only files you will submit, are rotate.c and smooth.c . You're free to change the driver program as you see fit, but such changes won't be submissible.

IMPLEMENTATION DETAILS
Data Representation:
The fundamental data structure of our images is the pixel structure, shown below:

typedef struct {
    unsigned short red   : 8;
    unsigned short green : 8;
    unsigned short blue : 8;
    unsigned short alpha : 8;
} pixel;
The structure definition above defines a 32-bit pixel, with 8 bits for each of the red, green, blue and alpha (opacity) components.

A two-dimensional square image of width n is stored in a one-dimensional array of pixel s; the (i, j)th pixel of the image is at Img[PIXEL(i,j,n)] , and PIXEL is defined as follows:

#define PIXEL(i,j,n) ((i)*(n)+(j))
In order to use the cache simulator, we call it indirectly through use of the COPY and SMOOTH macros defined in defs.h . You must use these macros for doing your COPY and SMOOTH operations.

These are all defined in defs.h .

Cache Structure:
The cache we will be simulating is a 16 KB direct-mapped cache, with 32 byte cache lines. You may wish to refer back to the notes to determine how best to optimize for such a configuration.

Rotate:
The following C function takes a source image, src , of size dim x dim and puts a rotated copy into the destination image dst .

void rotate_naive(int dim, pixel* src, pixel* dst) {
        int i, j;
        for(i=0; i<dim; i++) {
                for(j=0; j<dim; j++) {
                        COPY(&dst[PIXEL(dim-1-j,i,dim)],&src[PIXEL(i,j,dim)]);
                }
        }
        return;
}
This code traverses the rows of the source image, copying each into a column of the destination image. Your task is to try to maximize the number of cache hits by adjusting the algorithm to take advantage of the cache.

Smooth:
The following C function takes a source image of size dim x dim that is specified by src , and puts a 'smoothed' copy into the destination image dst . The actual smoothing is done by the SMOOTH macro, which takes in first the address of the destination pixel, and then the addresses of the source pixel and the 8 pixels surrounding it. For cases on the border of the image, COPY the pixel straight from the source to the destination, so as to not have to deal with the special case of not having 8 surrounding pixels.

void smooth_naive(int dim, pixel *src, pixel *dst) {
        int i, j;
        for(i=0; i<dim;i++) {
                COPY(&dst[PIXEL(i,0,dim)], &src[PIXEL(i,0,dim)]);
                COPY(&dst[PIXEL(i,dim-1,dim)], &src[PIXEL(i,dim-1,dim)]);
        }
        for(j=1; j<dim-1;j++) {
                COPY(&dst[PIXEL(0,j,dim)], &src[PIXEL(0,j,dim)]);
                COPY(&dst[PIXEL(dim-1,j,dim)], &src[PIXEL(dim-1,j,dim)]);
        }
        for(i=1; i<dim-1; i++) {
                for(j=1; j<dim-1; j++) {
                        SMOOTH(&dst[PIXEL(j,i,dim)],
                                        &src[PIXEL(j,i,dim)],
                                        &src[PIXEL(j-1,i,dim)],
                                        &src[PIXEL(j+1,i,dim)],
                                        &src[PIXEL(j,i+1,dim)],
                                        &src[PIXEL(j,i-1,dim)],
                                        &src[PIXEL(j-1,i-1,dim)],
                                        &src[PIXEL(j+1,i+1,dim)],
                                        &src[PIXEL(j-1,i+1,dim)],
                                        &src[PIXEL(j+1,i-1,dim)]);
                }
        }
        return;
}
The code first takes care of the edge cases and does a straight copy for the border. It then traverses the image in standard fashion, smoothing each pixel as it comes. As with rotate, your goal is to maximize cache hitrate by improving locality.

Evaluation:
The improved algorithms you submit will be graded based on the cache simulator included in the zip file you downloaded earlier. Functions will be run on images of a number of different sizes (listed below), and for each size will be given a hitrate equal to the total number of cache hits divided by the number of cache attempts in the image. (Higher numbers are better.) A function's 'hit score' will be determined by taking the geometric mean (explained below) of the ratios produced by dividing your function's hitrate by the naive implementation's hitrate.

For both rotate and smooth, a geometric mean of 5 numbers is computed by taking the 5th root of the product of those numbers, so for the five dimensions listed below the formula would be:

hit score = (ratio64 * ratio128 * ratio256 * ratio512 * ratio1024 )1/5

Assumptions:
To make optimization easier, you may assume that the image dimensions will always be a multiple of 32. Your code must be able to correctly rotate for all dimensions that are multiples of 32, but your performance scores will be determined based solely upon the values listed below:

64 128 256 512 1024

SETUP
Versioning
The rotate.c and smooth.c that you unzip contain only two functions each: the original naive implementation of their respective function, and a "register" function. This function provides an easy way to compare multiple functions at the same time, and is called by the driver program before testing.

void register_rotate_functions() {
        add_rotate_function(&rotate, rotate_descr);
}
The function contains one or more calls to add_rotate_function. In the above example, add_rotate_function registers the function rotate along with a string rotate_descr which is an ASCII description of what the function does. See rotate.c to see how to create the string descriptions? The string can be at most 256 characters. The functions for smooth work analogously.

Testing
To test your functions, open the project file in Visual C++ and build it. You can add this project to a pre-existing (or new) VC++.Net Solution.

GRADING
This is how grading for the exercise will work:

Correctness: Your solutions must be 100% correct for any square image matrix with edge dimensions that are a multiple of 32. (The driver program will check correctness and will tell you if a particular implementation is incorrect.

Speed improvement: Your solutions will earn credit based on reaching a certain threshold, according to the tables below:

Rotate: Hit Score: Credit:
1.60 70/70
1.45 60/70
1.30 55/70
1.25 40/70
1.10 25/70
Smooth: Hit Score: Credit:
1.30 30/30
1.25 25/30
1.20 20/30
1.15 15/30
1.10 10/30

HINTS
The rotate function focuses on spatial locality: because each pixel is used only once, you should focus on using any pixels put into the cache by a previous pixel operation.

The smooth function benefits from spatial locality, but also reuses pixels it has read previously. Consequently, you should consider trying to improve temporal locality as well.

Try a large number of different functions. There is a FIND_BEST_OF_MANY #define flag in driver.c that can be used to find out which function provides the highest hit rate for each problem.

Just because your image is square doesn't mean you have to deal with the image in square pieces.

Remember the way things will be laid out in memory and how this affects what is put into the cache.

模拟缓存的属性如下：

/* cache parameters */
#define CACHE_SIZE	 16384	 /* 16KB cache */
#define SET_ASSOCIATIVITY 1 /* direct-map */
#define BLOCK_SIZE 32	/* 32-byte lines */

像素结构如下：

typedef struct {
   unsigned short red	: 8;
   unsigned short green	: 8;
   unsigned short blue	: 8;
   unsigned short alpha	: 8;
} pixel;

现总结如下：
缓存有2^14byte大小，是直接缓存，有2^14/2^5 = 2^9行，每行32 = 2^5byte。
每个像素4字节，也就是缓存中一个块可以存8个像素。

题目中第二个Smooth太简单，暂时就不讨论了。现在我们来研究下第一个要求，顺时针旋转图像。
虽然在实现上这个二维图像是按顺序保存在一维数组里的，但是应用起来和二维数组差不多。
下面是从像素数组获取像素的宏：

#define PIXEL(i,j,n) ((i%n)*(n)+(j%n))//图像长宽都是n

顺时针旋转图像的时候，实际上是同时在对两个像素数组进行操作。下面是原来的拷贝代码。其中dim是图像的长宽数值。（单位：像素）

void rotate(int dim, pixel *src, pixel *dst) {
	int i, j;

	for(i=0; i < dim; i++) {
		for(j=0; j < dim; j++) {
			COPY(&dst[PIXEL(dim-1-j,i,dim)], &src[PIXEL(i,j,dim)]);
		}
	}

	return;
}

由于缓存是写分配的，所以行优先的src数组不命中率是1/8，但是列优先的dst数组命中率就是几乎是0（除非dim非常的小——这种假设几乎没有意义）。
再分析一下：src数组的不命中率是正常的，但是dst数组——因为转了90度——就完全不能命中了。很显然，矩阵分块是一个很容易被想到的方案。假设已经知道了块应该分多大，嵌套循环的算法应该怎么设计呢？
在保证在分好的矩阵块内src数组仍然是行优先的前提下，dst数组要保证写入第一列的时候尽管全部miss，但是可以把之后要写入的若干列一起载入缓存；在扫描src的第二行时就可以被命中。这时会有一个问题：如果扫描src是从第一行开始，对应的就是同时把它转到dst中最右边的一列。这时要回想起来：图像是按照行优先的顺序保存在一个一维数组里的，所以对于dst[0][n-1]——这里要基于一个地址映射的假设：dst[0][n-1]是被映射到缓存块里面偏移量为0的位置。对于大小比较友好的图片（2的幂）来说，这个假设是很容易被保证的——，当它被读入缓存的时候，在缓存块里跟在它后面的那几个并不是dst[0][n-2]、dst[0][n-3]等等等，而是第二行的dst[1][0]、dst[1][2]。对dst[1][n-1]也是类似的情况——这样一来读入缓存的就不是我们想要的dst数组最右边的一个正方形矩阵，而是最右边的一列加上左边偏了一行的长方形矩阵。这样一来对src第二行进行扫描的时候，dst右边数第二列仍然没有在缓存中。
因此我们要修改一下循环的顺序：对于一个已经划好的矩阵，让src从最下面一行开始，按照行优先的顺序从下向上扫描。你可以自己画图看看，这样一来对应的dst矩阵就符合我们的要求了。扫描完这个矩阵之后，只要以矩阵为单位在进行扫描就行了。
代码如下：

void BlockedMatrixRotate(int dim,pixel * src, pixel * dst){
	int bsize = XXX;//这个是blockSize，划分的块矩阵边长还没定。
	int i, j, ii,jj;

	for(jj = 0; jj < dim; jj += bsize)	
		for (ii = 0; ii < dim; ii += bsize)		
			for (j = jj + bsize - 1;j >= jj;j-- )
				for (i = ii; i < ii + bsize; i++)
					COPY(&dst[PIXEL(dim-1-j,i,dim)], &src[PIXEL(i,j,dim)]);

}

现在解决了算法——勉强算是算法吧——的问题，我们再来看最重要的一个问题：划分的矩阵块应该定多大？

我最初的想法是这样的：“我们应该充分利用缓存，让src和dst的块矩阵加起来刚好或者几乎可以把缓存填满。”
按照方程2*x*x*4 = 16*1024，算出来x约等于45。
但是实际的执行结果却很不满意，具体的结果我就不列出来了，总之比没修改过的时候只提高了大概1%命中率，这离我们的要求——对边长为64、128、256、512、1024像素的图像命中率平均提高60%差的太远。
仔细考虑了一下，我才想起来以“充分利用缓存”作为目标是不对的，因为我们无法用块的大小来决定每个像素会被放到缓存的什么位置上去——它通过地址映射，在运行之前就已经决定好了。
基于对应每个缓存块可以存放的元素的大小，我分别尝试了blockSize = 8和4。可以猜想这两个值应该相当友好。果然，取8的时候命中率达到了原来的1.52倍，而取4的时候更是达到了1.64倍。（见附件图1）
不过既然一个缓存块刚好可以存放8个像素，为什么用8的时候平均提高倍数没有用4的时候高呢？
从图中可以看到，块大小选择8时，对应大小为128 256 512的图，其命中率足足提升了一倍，但是对应1024的图命中率完全没有提升！而块大小为4时，对这4种大小的图命中率都提升了0.85到0.86倍，因此平均下来块大小为4效果比较好。
首先，我们来分析为什么块大小为8的时候效果起初不错，但是对大图又不理想了：
千万不要忘记：内存中的数据应该存放在缓存的什么位置，这是由内存中的地址已经定好的！对src矩阵来说，不命中率为1/8是很显然的。重点在于dst矩阵。我们知道，本系统中的缓存是大小为16K的直接缓存——也就是如果组被映射到同一组，那么先读入的就会被覆盖掉。我们来算一下读入多少个像素之后，会开始覆盖之前读入的像素：对直接缓存来说，如果按照行优先存入，那么刚好把缓存存满之后，再读入就会产生覆盖。也就是按照地址顺序读入了大小为16KB的像素之后。可以算得，缓存中刚好可以存放4096个像素——对512大小的图像来说是8行，刚刚好没有覆盖，于是效率很高——对256和128大小的图像更是如此——但是对1024大小的图像来说，就是4行。这样就可以解释了：由于dst是列优先读入的，当读入4个的时候，刚好“读满”（伪），再想读入这一列的下一个像素时就会覆盖读入的第一列的那行——于是第5至8行就会把读入的第1~4行完全覆盖，这时再想读入第一行的第二个像素又会miss而覆盖到刚读入第五行的那个缓存快——于是一直抖动到结束，dst矩阵命中率为0。
然后，我们来分析为什么块大小为4的时候效果不错：
对src来说，对第一个4*4矩阵的扫描读入了4*8个像素到缓存，其中第一列的4个miss而之后的3个都命中。这个矩阵结束后矩阵向右移动，刚才读入的4*8个像素中右边的4*4个都可以用，全部命中，因此不命中率仍然是1/8不变，之后也是一样，不必再重复。而对于dst图，对最右边一列的矩阵进行扫描的时候，会同时把最左边下面一行开始的4*4像素读入缓存，但是并没有用——这无伤大雅，至少dst矩阵有了3/4的命中率。于是总命中率就是:
(7/8 + 3/4) / 2 = 0.8125，比原来的(7/8 + 0) / 2 = 0.4375 大了不少吧。

不过从图中可见，当图大小大于512时，命中率不都是0.8125，关於这点我没有自己去算，不过大概是由于src和dst两个矩阵在图中间的位置产生了少量冲突造成的吧。