(20 points) Assume that each core has a 2 GHz clock frequency and CPIs of 2, 8, and 8 for arithmetic, load/store, and branch instructions, respectively. A program requires the execution of 3.69 × 10^9 arithmetic instructions, 3.69 × 10^? load/store instructions, and 2.6 × 10^8 branch instructions on a single core. When this program is parallelized to run over p cores, the numbers of arithmetic and load/store instructions per core are each divided by 0.5 × p, but the number of branch instructions per core remains the same.

(a) (12 points) Find the total execution time of this program when running on a single core and when running on 4 cores. Show the speedup of the 4-core result relative to the single-core result. (Please show the calculation procedure.)

(b) (4 points) For a single core to match the performance of 4 cores that use the original CPI values, what should the CPI of load/store instructions be reduced to? (Please show the calculation procedure.)

(c) (4 points) If the CPI of arithmetic instructions has already been reduced to 1, and the branch instructions keep their original CPI, what should the CPI of load/store instructions be reduced to for a single core to match the performance of 4 cores? (Please show the calculation procedure.)
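
The calculation procedure requested in parts (a) to (c) can be sketched in Python as follows. This is only a minimal illustration, not an official solution. The names `exec_time`, `speedup`, and `required_load_store_cpi` are hypothetical, and instruction counts are passed as arguments rather than hard-coded. The sketch assumes that the single-core case (p = 1) runs the original instruction counts unchanged, that the 0.5 × p divisor applies only when p > 1, and that the 4-core reference in parts (b) and (c) always uses the original CPI values. The counts in the usage lines are small made-up numbers chosen for easy checking, not the values from the problem.

```python
CLOCK_HZ = 2e9  # 2 GHz clock, from the problem statement
ORIGINAL_CPI = {"arith": 2, "load_store": 8, "branch": 8}


def exec_time(n_arith, n_load_store, n_branch, p=1,
              cpi=ORIGINAL_CPI, clock_hz=CLOCK_HZ):
    """Execution time in seconds on p cores.

    Assumption: for p == 1 the original counts are used as-is; for p > 1
    the arithmetic and load/store counts per core are divided by 0.5 * p,
    while the branch count per core is unchanged.
    """
    divisor = 1.0 if p == 1 else 0.5 * p
    cycles = (cpi["arith"] * n_arith / divisor
              + cpi["load_store"] * n_load_store / divisor
              + cpi["branch"] * n_branch)
    return cycles / clock_hz


def speedup(n_arith, n_load_store, n_branch, p):
    """Speedup of the p-core run relative to the single-core run."""
    t1 = exec_time(n_arith, n_load_store, n_branch, p=1)
    tp = exec_time(n_arith, n_load_store, n_branch, p=p)
    return t1 / tp


def required_load_store_cpi(n_arith, n_load_store, n_branch, p,
                            arith_cpi=ORIGINAL_CPI["arith"],
                            branch_cpi=ORIGINAL_CPI["branch"],
                            clock_hz=CLOCK_HZ):
    """Load/store CPI that lets one core match the p-core time.

    The p-core target time is computed with the original CPI values.
    Solves: arith_cpi*n_arith + x*n_load_store + branch_cpi*n_branch
            = target_cycles, for x.
    A non-positive result means no load/store CPI can reach the target.
    """
    target_cycles = exec_time(n_arith, n_load_store, n_branch, p=p) * clock_hz
    remaining = target_cycles - arith_cpi * n_arith - branch_cpi * n_branch
    return remaining / n_load_store


# Illustrative counts only (NOT the problem's values).
counts = (1000, 1000, 100)
t_single = exec_time(*counts, p=1)                        # part (a), one core
t_four = exec_time(*counts, p=4)                          # part (a), four cores
s_four = speedup(*counts, p=4)                            # part (a), speedup
cpi_b = required_load_store_cpi(*counts, p=4)             # part (b)
cpi_c = required_load_store_cpi(*counts, p=4, arith_cpi=1)  # part (c)
```

Substituting the problem's actual instruction counts for `counts` would give the quantities asked for in each part; the written answer should still show each intermediate cycle count.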

