Document:        MMXspeed.txt
File Group:      Classic Benchmarks
Creation Date:   30 September 1997
Revision Date:   14 October 1997

Title:           MMX Instruction Speed Tests

Keywords:        BENCHMARK PERFORMANCE MMX

Abstract:        This document contains details of pre-compiled C
                 programs that measure MMX instruction execution
                 speeds. Windows 95/NT, OS/2 and DOS versions are
                 supplied. The main purposes are to identify whether
                 MMX facilities are available and to produce
                 performance ratings that can be used for broad
                 non-application dependent comparisons.

Note:            The programs are still under test and should be
                 treated as beta test software. Please submit feedback
                 directly to Roy, or as a posting on the Section 12
                 message board.

                 The programs have been tested via Windows 95, DOS and
                 OS/2 Warp 3 on a PC with a 200 MHz Intel MMX CPU.

Contributor:     Roy_Longbottom@compuserve.com

Changes:         The DOS version is now 1.1. Version 1 was found to be
                 slow when running with extended parameters.


                    MMX INSTRUCTION SPEED TESTS

0. SUMMARY

The tests were compiled using Watcom C/C++ 11, which has MMX
facilities. The program runs five pairs of routines containing MMX
instructions: compare, multiply/add, pack, shift and or. Each pair
comprises one routine using standard Watcom macros and one using
extended macros that contain far more MMX instructions. A final test,
with less dependency on cache/memory speeds, executes all MMX
instructions.

The program produces speed ratings in Millions of Operations Per
Second (MOPS) and data transfer rates in MegaBytes per Second (MB/S).
The latter can be compared with MEMSPEED results. The default version
uses 8 KB of memory to fit within an L1 cache, but run time parameters
can be used to force data transfers from L2 cache or RAM. Running time
should be less than two minutes for the default version but can be
longer with larger memory demands.

The program uses a Watcom function to detect whether MMX features are
available and, where not, an appropriate message is displayed.
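
As an illustration of this kind of MMX detection, here is a minimal
sketch in C. It is not the Watcom function used by the benchmark, which
is not named in this document. It assumes a GCC or Clang compiler on an
x86 target and uses the compiler's <cpuid.h> wrapper to read CPUID leaf
1, where bit 23 of EDX is the MMX feature flag. The function names
has_mmx_from_edx and cpu_has_mmx are hypothetical, chosen only for this
example.

```c
#include <stdint.h>

/* CPUID leaf 1 returns feature flags in EDX; bit 23 is the MMX flag. */
#define CPUID_EDX_MMX_BIT 23u

/* Hypothetical helper: decide MMX support from a raw EDX feature word. */
int has_mmx_from_edx(uint32_t edx)
{
    return (int)((edx >> CPUID_EDX_MMX_BIT) & 1u);
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

/* Returns 1 if the CPU reports MMX, 0 if it does not or if CPUID
   leaf 1 is not supported. */
int cpu_has_mmx(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return has_mmx_from_edx(edx);
}
#else
/* Assumption: on other compilers or targets this sketch reports "no". */
int cpu_has_mmx(void)
{
    return 0;
}
#endif
```

A caller would test cpu_has_mmx() once at start-up and display a
message instead of running the MMX tests when it returns 0.
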
Note that the Watcom detection facility did not work on a Cyrix CPU,
where it produced a system-generated invalid operation code message.

Pre-compiled versions MMXNT95.EXE, MMXOS2.EXE and MMXDOS.EXE are
provided for Windows 95 or NT, OS/2 and MS DOS. Windows 3.1 users can
run the DOS version.

Results are displayed as the program is running and saved in file
MMXRES.TXT, which should appear in the same directory as the EXE
files. A run time option allows a user defined log file to be used as
an alternative. The DOS version requires DOS4GW.EXE, the protected
mode run-time program.

Before running, all other applications should be closed. To run, click
on the appropriate EXE icon or enter the program name (and optional
run time parameters) at the command prompt. The most consistent speed
measurements are likely to be produced when the PC is booted directly
with DOS.

Results should be sent to Roy_Longbottom@compuserve.com and details of
the system under test should be included. The configuration details
can be provided via program SYSTEMxx.EXE as supplied with the Classic
Benchmarks.

1. INTRODUCTION

With the introduction of MMX CPU instructions, it was considered
appropriate to produce a simple synthetic benchmark to complement the
Classic Benchmarks and other performance tests in the Library.

After reading various reports indicating that assembly language
programming was necessary to obtain the best performance via MMX, and
noting that compilers were available with MMX capabilities, I first
upgraded my Watcom compiler to Version 11, which has the new
facilities. After some experimentation, I decided to include the
following in the benchmark:

  1. identification of whether MMX facilities are present in a
     particular CPU;
  2. use of standard compiler facilities;
  3. use of the in-line assembler to produce enhanced performance;
  4. measurement of a range of different functions;
  5. production of execution speeds where some comparison with speed
     via conventional instructions might be possible;
  6.
     production of data transfer rates that can be compared with
     those from the memory speed test (MEMSPEED);
  7. production of a default version using data in L1 cache for
     maximum speed;
  8. inclusion of optional run time parameters for a user defined log
     file and data in L2 cache or RAM.

2. MMX INSTRUCTIONS

There are more than 40 MMX instructions for load, pack/unpack,
add/subtract/multiply, or/and/xor, shift right/left, compare and
control. These operate on eight 64 bit registers (mm0 to mm7). Some
execute in the conventional manner on all 64 bits. Others use Single
Instruction Multiple Data (SIMD) operation, executing the same type of
instruction on 2, 4 or 8 independent sets of data (32 bits, 16 bits or
8 bits long). Examples used in the benchmark are described later.

3. PROGRAM DETAILS

The program manipulates two data arrays, a[] and b[], each declared as
131072 (128K) elements of 64 bits. With the default parameters, only
512 elements of each array (2 x 4 Kbytes) are used, to try to ensure
that data is in the L1 cache for maximum speed of operation.

MMX facilities in the compiler are in the form of macros. Assembly
code for the first one used is:

    #pragma aux (__m_binary) _m_pcmpeqb = \
                ".586"                    \
                "movq mm0,[eax]"          \
                "movq mm1,[edx]"          \
                "pcmpeqb mm0,mm1"         \
                "movq [esi],mm0"

The two movq instructions load the input data into registers mm0 and
mm1. The pcmpeqb instruction compares each byte independently for
equality. Where a byte of mm1 is equal to the corresponding byte of
mm0, that byte of the result in mm0 is set to all ones; otherwise it
is set to all zeros. The last movq returns the result to memory.

TEST LOOP 1

With the associated overheads of calling such routines, performance
expectations are relatively poor. To minimise further overheads due to
loop control, the loops in the C++ program are partially unrolled,
with the following format for timing purposes:

    timea = millisecs();
    for (i=0; i