13.6 Consistency and Semantics
Advice to users. Any sequence of operations containing the collective routines MPI_FILE_SET_SIZE and MPI_FILE_PREALLOCATE is a write sequence. As such, sequential consistency in nonatomic mode is not guaranteed unless the conditions in Section 13.6.1, page 437, are satisfied. (End of advice to users.)
File pointer update semantics (i.e., file pointers are updated by the amount accessed) are only guaranteed if file size changes are sequentially consistent.
Advice to users. Consider the following example. Given two operations made by separate processes to a file containing 100 bytes: an MPI_FILE_READ of 10 bytes and an MPI_FILE_SET_SIZE to 0 bytes. If the user does not enforce sequential consistency between these two operations, the file pointer may be updated by the amount requested (10 bytes) even if the amount accessed is zero bytes. (End of advice to users.)
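As an illustration, the race described in the advice might arise in code like the following minimal sketch. The file name "datafile", its initial 100-byte size, and the use of separate opens on MPI_COMM_SELF (so that the collective MPI_FILE_SET_SIZE involves only the calling process) are assumptions of this sketch, not part of the standard's example:

/* Minimal sketch of the conflicting-access scenario above.
   Assumes "datafile" already contains 100 bytes. */
MPI_File fh ;
MPI_Status status ;
char buf[10] ;
int rank ;

MPI_Comm_rank( MPI_COMM_WORLD, &rank ) ;
MPI_File_open( MPI_COMM_SELF, "datafile",
               MPI_MODE_RDWR, MPI_INFO_NULL, &fh ) ;
if (rank == 0)
    MPI_File_read( fh, buf, 10, MPI_BYTE, &status ) ;  /* requests 10 bytes */
else if (rank == 1)
    MPI_File_set_size( fh, 0 ) ;                       /* truncates to 0 bytes */
/* With no ordering enforced between the read and the truncation, the
   file pointer on process 0 may advance by the 10 bytes requested even
   if 0 bytes were actually read. */
MPI_File_close( &fh ) ;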
13.6.10 Examples
The examples in this section illustrate the application of the MPI consistency and semantics guarantees. These address
• conflicting accesses on file handles obtained from a single collective open, and
• all accesses on file handles obtained from two separate collective opens.

The simplest way to achieve consistency for conflicting accesses is to obtain sequential consistency by setting atomic mode. For the code below, process 1 will read either 0 or 10 integers. If the latter, every element of b will be 5. If nonatomic mode is set, the results of the read are undefined.
/* Process 0 */
MPI_File fh0 ;
MPI_Status status ;
int i, a[10] ;
int TRUE = 1 ;

for ( i=0; i<10; i++ )
    a[i] = 5 ;
MPI_File_open( MPI_COMM_WORLD, "workfile",
               MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh0 ) ;
MPI_File_set_view( fh0, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
MPI_File_set_atomicity( fh0, TRUE ) ;
MPI_File_write_at( fh0, 0, a, 10, MPI_INT, &status ) ;
/* MPI_Barrier( MPI_COMM_WORLD ) ; */ /* uncommenting the barriers forces the write to precede the read */

/* Process 1 */
MPI_File fh1 ;
MPI_Status status ;
int b[10] ;
int TRUE = 1 ;

MPI_File_open( MPI_COMM_WORLD, "workfile",
               MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh1 ) ;
MPI_File_set_view( fh1, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
MPI_File_set_atomicity( fh1, TRUE ) ;
/* MPI_Barrier( MPI_COMM_WORLD ) ; */
MPI_File_read_at( fh1, 0, b, 10, MPI_INT, &status ) ;
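As a side note (a sketch, not part of the standard's example), process 1 can determine which of the two permitted outcomes occurred by querying the status returned by the read:

/* After the MPI_File_read_at on process 1: in atomic mode the read
   is permitted to return either 0 or 10 integers; the status records
   the amount actually read. */
int count ;
MPI_Get_count( &status, MPI_INT, &count ) ;  /* count is 0 or 10 */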
A user may guarantee that the write on process 0 precedes the read on process 1 by imposing temporal order with, for example, calls to MPI_BARRIER.

Advice to users. Routines other than MPI_BARRIER may be used to impose temporal order. In the example above, process 0 could use MPI_SEND to send a 0-byte message, received by process 1 using MPI_RECV. (End of advice to users.)
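For instance, the zero-byte handshake mentioned in the advice might look like the following sketch; the tag value 0 is an arbitrary choice of this sketch:

/* Process 0, after its MPI_File_write_at: send an empty message. */
MPI_Send( NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD ) ;

/* Process 1, before its MPI_File_read_at: wait for it. */
MPI_Status status ;
MPI_Recv( NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status ) ;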
Alternatively, a user can impose consistency with nonatomic mode set:
/* Process 0 */
MPI_File fh0 ;
MPI_Status status ;
int i, a[10] ;

for ( i=0; i<10; i++ )
    a[i] = 5 ;
MPI_File_open( MPI_COMM_WORLD, "workfile",
               MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh0 ) ;
MPI_File_set_view( fh0, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
MPI_File_write_at( fh0, 0, a, 10, MPI_INT, &status ) ;
MPI_File_sync( fh0 ) ;
MPI_Barrier( MPI_COMM_WORLD ) ;
MPI_File_sync( fh0 ) ;

/* Process 1 */
MPI_File fh1 ;
MPI_Status status ;
int b[10] ;

MPI_File_open( MPI_COMM_WORLD, "workfile",
               MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh1 ) ;
MPI_File_set_view( fh1, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
MPI_File_sync( fh1 ) ;
MPI_Barrier( MPI_COMM_WORLD ) ;
MPI_File_sync( fh1 ) ;
MPI_File_read_at( fh1, 0, b, 10, MPI_INT, &status ) ;
The "sync-barrier-sync" construct is required because:
• The barrier ensures that the write on process 0 occurs before the read on process 1.
• The first sync guarantees that the data written by all processes is transferred to the storage device.
• The second sync guarantees that all data which has been transferred to the storage device is visible to all processes. (This does not affect process 0 in this example.)

The following program represents an erroneous attempt to achieve consistency by eliminating the apparently superfluous second "sync" call for each process.
/* ---------------- THIS EXAMPLE IS ERRONEOUS --------------- */
/* Process 0 */
MPI_File fh0 ;
MPI_Status status ;
int i, a[10] ;

for ( i=0; i<10; i++ )
    a[i] = 5 ;
MPI_File_open( MPI_COMM_WORLD, "workfile",
               MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh0 ) ;
MPI_File_set_view( fh0, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
MPI_File_write_at( fh0, 0, a, 10, MPI_INT, &status ) ;
MPI_File_sync( fh0 ) ;              /* the second sync, after the barrier, is missing */
MPI_Barrier( MPI_COMM_WORLD ) ;

/* Process 1 */
MPI_File fh1 ;
MPI_Status status ;
int b[10] ;

MPI_File_open( MPI_COMM_WORLD, "workfile",
               MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh1 ) ;
MPI_File_set_view( fh1, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
MPI_Barrier( MPI_COMM_WORLD ) ;
MPI_File_sync( fh1 ) ;              /* the first sync, before the barrier, is missing */
MPI_File_read_at( fh1, 0, b, 10, MPI_INT, &status ) ;
/* ---------------- THIS EXAMPLE IS ERRONEOUS --------------- */
The above program also violates the MPI rule against out-of-order collective operations and will deadlock for implementations in which MPI_FILE_SYNC blocks.
Advice to users. Some implementations may choose to implement MPI_FILE_SYNC as a temporally synchronizing function. When using such an implementation, the "sync-barrier-sync" construct above can be replaced by a single "sync." The results of using such code with an implementation for which MPI_FILE_SYNC is not temporally synchronizing are undefined. (End of advice to users.)
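Under that implementation-specific assumption, and only under it, the construct could shrink as in this deliberately non-portable sketch:

/* NON-PORTABLE sketch: correct only if this implementation's
   MPI_File_sync is temporally synchronizing. */
/* Process 0 */
MPI_File_write_at( fh0, 0, a, 10, MPI_INT, &status ) ;
MPI_File_sync( fh0 ) ;

/* Process 1 */
MPI_File_sync( fh1 ) ;
MPI_File_read_at( fh1, 0, b, 10, MPI_INT, &status ) ;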
Asynchronous I/O
The behavior of asynchronous I/O operations is determined by applying the rules specified above for synchronous I/O operations.
The following examples all access a preexisting file "myfile." Word 10 in myfile initially contains the integer 2. Each example writes and reads word 10.
First consider the following code fragment:
MPI_File fh ;
MPI_Request reqs[2] ;
MPI_Status statuses[2] ;
int a = 4, b, TRUE = 1 ;

MPI_File_open( MPI_COMM_WORLD, "myfile",
               MPI_MODE_RDWR, MPI_INFO_NULL, &fh ) ;
MPI_File_set_view( fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
/* MPI_File_set_atomicity( fh, TRUE ) ;   Use this to set atomic mode. */
MPI_File_iwrite_at( fh, 10, &a, 1, MPI_INT, &reqs[0] ) ;
MPI_File_iread_at( fh, 10, &b, 1, MPI_INT, &reqs[1] ) ;
MPI_Waitall( 2, reqs, statuses ) ;
For asynchronous data access operations, MPI specifies that the access occurs at any time between the call to the asynchronous data access routine and the return from the corresponding request complete routine. Thus, executing either the read before the write, or the write before the read is consistent with program order. If atomic mode is set, then MPI guarantees sequential consistency, and the program will read either 2 or 4 into b. If atomic mode is not set, then sequential consistency is not guaranteed and the program may read something other than 2 or 4 due to the conflicting data access.
Similarly, the following code fragment does not order file accesses:
MPI_File fh ;
MPI_Request reqs[2] ;
MPI_Status status ;
int a = 4, b, TRUE = 1 ;

MPI_File_open( MPI_COMM_WORLD, "myfile",
               MPI_MODE_RDWR, MPI_INFO_NULL, &fh ) ;
MPI_File_set_view( fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
/* MPI_File_set_atomicity( fh, TRUE ) ;   Use this to set atomic mode. */
MPI_File_iwrite_at( fh, 10, &a, 1, MPI_INT, &reqs[0] ) ;
MPI_File_iread_at( fh, 10, &b, 1, MPI_INT, &reqs[1] ) ;
MPI_Wait( &reqs[0], &status ) ;
MPI_Wait( &reqs[1], &status ) ;
If atomic mode is set, either 2 or 4 will be read into b. Again, MPI does not guarantee sequential consistency in nonatomic mode.
On the other hand, the following code fragment:
MPI_File fh ;
MPI_Request reqs[2] ;
MPI_Status status ;
int a = 4, b ;

MPI_File_open( MPI_COMM_WORLD, "myfile",
               MPI_MODE_RDWR, MPI_INFO_NULL, &fh ) ;
MPI_File_set_view( fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
MPI_File_iwrite_at( fh, 10, &a, 1, MPI_INT, &reqs[0] ) ;
MPI_Wait( &reqs[0], &status ) ;
MPI_File_iread_at( fh, 10, &b, 1, MPI_INT, &reqs[1] ) ;
MPI_Wait( &reqs[1], &status ) ;
defines the same ordering as:
MPI_File fh ;
MPI_Status status ;
int a = 4, b ;

MPI_File_open( MPI_COMM_WORLD, "myfile",
               MPI_MODE_RDWR, MPI_INFO_NULL, &fh ) ;
MPI_File_set_view( fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL ) ;
MPI_File_write_at( fh, 10, &a, 1, MPI_INT, &status ) ;
MPI_File_read_at( fh, 10, &b, 1, MPI_INT, &status ) ;
Since
• nonconcurrent operations on a single file handle are sequentially consistent, and
• the program fragments specify an order for the operations,
MPI guarantees that both program fragments will read the value 4 into b. There is no need to set atomic mode for this example.
Similar considerations apply to conflicting accesses of the form:
MPI_File_write_all_begin( fh, ... ) ;
MPI_File_iread( fh, ..., &req ) ;
MPI_Wait( &req, ... ) ;   /* MPI_Wait takes the request returned by MPI_File_iread, not the file handle */
MPI_File_write_all_end( fh, ... ) ;
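One way to remove the conflict, as a sketch, is to serialize the two accesses so that the split collective completes before the read begins; nonconcurrent operations on a single file handle are sequentially consistent:

/* Complete the split collective write before starting the read. */
MPI_File_write_all_begin( fh, ... ) ;
MPI_File_write_all_end( fh, ... ) ;
MPI_File_iread( fh, ..., &req ) ;
MPI_Wait( &req, ... ) ;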
Recall that constraints governing consistency and semantics are not relevant to the following: