Document:       BusSpeed.txt
File Group:     Classic Benchmarks
Creation Date:  30 October 1997
Revision Date:  26 November 1997

Title:          Memory Bus Speed Test

Keywords:       BENCHMARK PERFORMANCE MEMORY BUS

Abstract:       This document contains details of a program designed
                to measure data transfer rates of memory busses and
                associated CPU speeds. Versions are supplied to run
                via DOS, Windows 95 or NT and OS/2. Besides measuring
                performance, the program can be used for reliability
                testing.

Note:           The program is still under test and should be treated
                as beta test software. Please submit feedback directly
                to Roy, or as a posting on the Section 12 message
                board.

Contributor:    Roy_Longbottom@compuserve.com (101323,2241)

Latest Revision: BAT files modified to use 256 KB memory. See
                README.TXT for other revisions.


                      MEMORY BUS SPEED TEST

1. INTRODUCTION

This benchmark program is based on work by Verne Tice and suggestions
from Scott Taylor. It loads data from an array via L1 cache, L2 cache
or RAM using 32-bit words (4 bytes). The high data transfer speeds of
the program are produced via assembly language instructions that just
move data from memory to CPU registers. The first test only loads the
first four bytes out of each 32, where it can normally be assumed that
this will cause a burst read from memory of all 32 bytes.

As the concept of loading data and doing nothing with it is a
completely artificial situation and can produce misleading performance
comparisons, further tests have been included to load all data and
process it. Some routines are designed to give an indication of peak
performance where more than one integer arithmetic unit is provided
within the CPU. These features were added particularly to help in
identifying reasons for performance differences when running the
Classic Benchmarks.

Versions of the program are supplied to run via DOS, Windows 95 or NT
and OS/2, compiled using the Watcom C/C++ Version 11 compiler.
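As a rough illustration of the two access patterns described above
(one 4 byte load per 32 bytes, and loading every word and processing
it), the following is a minimal sketch in portable C++. It is not the
program's code, which uses Watcom inline assembly. The names
strided_loads and and_check are hypothetical, and a compiler is free
to schedule these loops differently from the hand-written
instructions, so the sketch shows the idea only and is not suitable
for timing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of BUSSPEED.CPP): load one 32-bit word
// every strideBytes bytes and return the number of loads issued.
// The volatile sink stops the compiler from discarding the loads.
std::size_t strided_loads(const std::vector<std::uint32_t>& data,
                          std::size_t strideBytes)
{
    assert(strideBytes >= sizeof(std::uint32_t));
    const std::size_t step = strideBytes / sizeof(std::uint32_t);
    volatile std::uint32_t sink = 0;
    std::size_t loads = 0;
    for (std::size_t i = 0; i < data.size(); i += step) {
        sink = data[i];   // the load itself; the value is not used
        ++loads;
    }
    return loads;
}

// Hypothetical helper: AND every word into an accumulator seeded with
// the expected pattern. If every word equals the pattern, the result
// equals the pattern; a cleared bit anywhere changes the result.
std::uint32_t and_check(const std::vector<std::uint32_t>& data,
                        std::uint32_t pattern)
{
    std::uint32_t acc = pattern;
    for (std::uint32_t word : data) {
        acc &= word;
    }
    return acc;
}
```

For a 4 KB buffer (1024 words), a 32 byte stride issues 128 loads and
a 4 byte stride issues 1024, which is why the bytes loaded to
registers differ so much between the two kinds of test even when the
same volume of memory is covered.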
The programs can be run simply by clicking on an icon or entering the
program name at a command line prompt to run the 32 byte test. Run
time parameters can also be included to run extended sequences of more
complex tests. Facilities provided via the run time parameters are:

  - Running time up to 600 seconds, with automatic calibration,
    default 5 seconds.

  - Volume of data to use, 2KB to 4096KB, defaults 4, 64, 2048.

  - Address increment of 32 bytes (default). With the first 4 bytes
    loaded to a CPU register, on loading from L2 cache or RAM, 32
    bytes are likely to be passed via the connecting bus.

  - Address increment of 4 bytes to load all data to registers.

  - Address offset of 0, 1, 2, 3, 4 to 64 bytes to show effects of
    data not aligned to a 32 byte boundary.

  - Loading to 1 register (default) or 2 registers.

  - User defined alternating data pattern, default A5A5A5A5 and
    5A5A5A5A (binary 1010... and 0101...). Alternates every 32 bytes.

  - User defined data pattern, same pattern to all bytes.

  - Checking data via AND or AND/XOR as opposed to loading via the MOV
    instruction, using 1 or 2 registers. These combinations identify
    significant performance variations between different CPUs.

  - Results appended to log file (BUSREAD.TXT or user defined file)
    with option to omit headings on running a series of tests.

Programs BUSNT95.EXE, BUSDOS.EXE and BUSOS2.EXE are supplied to run
via Windows 95 or NT, MS DOS and OS/2 respectively. W32BUSRD.BAT,
DOSBUSRD.BAT and OS2BUSRD.CMD, the corresponding batch files, are
provided to run a series of tests. Other batch files with a TXT
extension are also supplied to run other tests - see README.TXT.

Results are calculated in terms of millions of load instructions per
second (MIPS) and millions of bytes loaded to registers per second
(MB/Sec).

2. PROGRAM DETAILS

The same C++ program (BUSSPEED.CPP) was compiled for all versions.
This includes assembly language used via #pragmas, each of which
contains a sequence of 64 instructions to load data.

a. LOADING WITH 32 BYTE INCREMENTS

Executed via function readmem32(x, mempasses, offset)

Register edx is loaded with the start address of a data array
int x[1048640] (>4MB), eax with a pass count and ecx with an address
byte offset (0 to 64). Each mov translates to a single machine code
instruction, the address addition being carried out during instruction
decoding. Each mov can be expected to complete within 1 CPU clock
cycle but some CPUs can produce two results per clock, with MIPS
results approaching 2 x CPU MHz when testing via L1 cache.

    #pragma aux readmem32 = \
        "add edx,ecx" \
        "lp: mov ebx,[edx]" \
        "mov ebx,[edx+32]" \
        "mov ebx,[edx+64]" \
        "mov ebx,[edx+96]" \
        "mov ebx,[edx+128]" \
        "mov ebx,[edx+160]" \
        "mov ebx,[edx+192]" \
        "mov ebx,[edx+224]" \
        "mov ebx,[edx+256]" \
        "mov ebx,[edx+288]" \
        "mov ebx,[edx+320]" \
        "mov ebx,[edx+352]" \
        "mov ebx,[edx+384]" \
        "mov ebx,[edx+416]" \
        "mov ebx,[edx+448]" \
        "mov ebx,[edx+480]" \
        "mov ebx,[edx+512]" \
        "mov ebx,[edx+544]" \
        "mov ebx,[edx+576]" \
        "mov ebx,[edx+608]" \
        "mov ebx,[edx+640]" \
        "mov ebx,[edx+672]" \
        "mov ebx,[edx+704]" \
        "mov ebx,[edx+736]" \
        "mov ebx,[edx+768]" \
        "mov ebx,[edx+800]" \
        "mov ebx,[edx+832]" \
        "mov ebx,[edx+864]" \
        "mov ebx,[edx+896]" \
        "mov ebx,[edx+928]" \
        "mov ebx,[edx+960]" \
        "mov ebx,[edx+992]" \
        "mov ebx,[edx+1024]" \
        "mov ebx,[edx+1056]" \
        "mov ebx,[edx+1088]" \
        "mov ebx,[edx+1120]" \
        "mov ebx,[edx+1152]" \
        "mov ebx,[edx+1184]" \
        "mov ebx,[edx+1216]" \
        "mov ebx,[edx+1248]" \
        "mov ebx,[edx+1280]" \
        "mov ebx,[edx+1312]" \
        "mov ebx,[edx+1344]" \
        "mov ebx,[edx+1376]" \
        "mov ebx,[edx+1408]" \
        "mov ebx,[edx+1440]" \
        "mov ebx,[edx+1472]" \
        "mov ebx,[edx+1504]" \
        "mov ebx,[edx+1536]" \
        "mov ebx,[edx+1568]" \
        "mov ebx,[edx+1600]" \
        "mov ebx,[edx+1632]" \
        "mov ebx,[edx+1664]" \
        "mov ebx,[edx+1696]" \
        "mov ebx,[edx+1728]" \
        "mov ebx,[edx+1760]" \
        "mov ebx,[edx+1792]" \
        "mov ebx,[edx+1824]" \
        "mov ebx,[edx+1856]" \
        "mov ebx,[edx+1888]" \
        "mov ebx,[edx+1920]" \
        "mov ebx,[edx+1952]" \
        "mov ebx,[edx+1984]" \
        "mov ebx,[edx+2016]" \
        "add edx,2048" \
        "dec eax" \
        "jnz lp" \
        parm [edx] [eax] [ecx] \
        modify [ebx];

Before the timed run, a calibration phase could be expected to
pre-load data into L1 or L2 cache, if it will fit.

The volume of data covered by the test is determined by a variable
loaded into register eax. Each pass of the loop covers 2048 bytes, so
the values are, for example, 2, 32 and 1024 for 4KB, 64KB and 2048KB.
These values are used by the default version as they currently provide
measurements via L1 cache, L2 cache and RAM.

b. LOADING DATA WITH 4 BYTE INCREMENTS

    #pragma aux readmem4 = \
        "add edx,ecx" \
        "lp: mov ebx,[edx]" \
        "mov ebx,[edx+4]" \
        "mov ebx,[edx+8]" \
        . . .
        "mov ebx,[edx+248]" \
        "mov ebx,[edx+252]" \
        "add edx,256" \
        "dec eax" \
        "jnz lp" \
        parm [edx] [eax] [ecx] \
        modify [ebx];

Each pass of this loop covers 256 bytes, so register eax values are,
for example, 16, 256 and 8192 for 4KB, 64KB and 2048KB.

c. LOADING DATA WITH 4 BYTE INCREMENTS TO 2 REGISTERS

    #pragma aux readmem4b = \
        "add edx,ecx" \
        "lp: mov ebx,[edx]" \
        "mov ecx,[edx+4]" \
        "mov ebx,[edx+8]" \
        "mov ecx,[edx+12]" \
        . . .
        "mov ebx,[edx+248]" \
        "mov ecx,[edx+252]" \
        "add edx,256" \
        "dec eax" \
        "jnz lp" \
        parm [edx] [eax] [ecx] \
        modify [ebx];

d. LOADING DATA WITH 4 BYTE INCREMENTS AND CHECKING

Function checkmem1(x, mempasses, offset, result) is used for tests
that check the data. For the first of these, the data is identical in
all words loaded. The result parameter provides the address of the
number used. This address is loaded to register edi and the value into
ebx. The AND instruction is used with all data loaded. Assuming no
data has been corrupted, ebx will contain the same number at the end
and this is returned to the result variable, where it is checked by
the C++ code.

    #pragma aux checkmem1 = \
        "add edx,ecx" \
        "mov ebx,[edi]" \
        "lp: and ebx,[edx]" \
        "and ebx,[edx+4]" \
        "and ebx,[edx+8]" \
        . . .
        "and ebx,[edx+244]" \
        "and ebx,[edx+248]" \
        "and ebx,[edx+252]" \
        "add edx,256" \
        "dec eax" \
        "jnz lp" \
        "mov [edi],ebx" \
        parm [edx] [eax] [ecx] [edi] \
        modify [ebx];

e. LOADING DATA WITH 4 BYTE INCREMENTS AND CHECKING, 2 REGISTERS

This stores the number in ebx and ecx, ANDing them at the end before
returning the result.

    #pragma aux checkmem1b = \
        "add edx,ecx" \
        "mov ebx,[edi]" \
        "mov ecx,ebx" \
        "lp: and ebx,[edx]" \
        "and ecx,[edx+4]" \
        "and ebx,[edx+8]" \
        . . .
        "and ecx,[edx+244]" \
        "and ebx,[edx+248]" \
        "and ecx,[edx+252]" \
        "add edx,256" \
        "dec eax" \
        "jnz lp" \
        "and ebx,ecx" \
        "mov [edi],ebx" \
        parm [edx] [eax] [ecx] [edi] \
        modify [ebx];

f. CHECKING DATA ALTERNATING PATTERNS

For these, data is inverted every 32 bytes e.g. FFFF0000 to 0000FFFF.
AND and XOR are used to collect the result.

    #pragma aux checkmem2 = \
        "add edx,ecx" \
        "mov ebx,[edi]" \
        "lp: and ebx,[edx]" \
        "and ebx,[edx+4]" \
        "and ebx,[edx+8]" \
        "and ebx,[edx+12]" \
        "and ebx,[edx+16]" \
        "and ebx,[edx+20]" \
        "and ebx,[edx+24]" \
        "and ebx,[edx+28]" \
        "xor ebx,[edx+32]" \
        "xor ebx,[edx+36]" \
        . . .
        "and ebx,[edx+216]" \
        "and ebx,[edx+220]" \
        "xor ebx,[edx+224]" \
        "xor ebx,[edx+228]" \
        "xor ebx,[edx+232]" \
        "xor ebx,[edx+236]" \
        "xor ebx,[edx+240]" \
        "xor ebx,[edx+244]" \
        "xor ebx,[edx+248]" \
        "xor ebx,[edx+252]" \
        "add edx,256" \
        "dec eax" \
        "jnz lp" \
        "mov [edi],ebx" \
        parm [edx] [eax] [ecx] [edi] \
        modify [ebx];

g. CHECKING DATA ALTERNATING PATTERNS, 2 REGISTERS

    #pragma aux checkmem2b = \
        "add edx,ecx" \
        "mov ebx,[edi]" \
        "mov ecx,ebx" \
        "lp: and ebx,[edx]" \
        "and ecx,[edx+4]" \
        "and ebx,[edx+8]" \
        "and ecx,[edx+12]" \
        "and ebx,[edx+16]" \
        "and ecx,[edx+20]" \
        "and ebx,[edx+24]" \
        "and ecx,[edx+28]" \
        "xor ebx,[edx+32]" \
        "xor ecx,[edx+36]" \
        . . .
        "xor ebx,[edx+248]" \
        "xor ecx,[edx+252]" \
        "add edx,256" \
        "dec eax" \
        "jnz lp" \
        "and ebx,ecx" \
        "mov [edi],ebx" \
        parm [edx] [eax] [ecx] [edi] \
        modify [ebx];

h. C PROGRAM

The C program has a longhand structure to avoid unexpected timing
overheads, for example:

    if (increment == 4)
    {
        if (check == 0)
        {
            if (reg == 2)
            {
                timea = millisecs();
                for (j=0; j