synchronous send: The sender sends a request-to-send message. The receiver stores this request. When a matching receive is posted, the receiver sends back a permission-to-send message, and the sender now sends the message.

standard send: The first protocol may be used for short messages, and the second protocol for long messages.

buffered send: The sender copies the message into a buffer and then sends it with a nonblocking send (using the same protocol as for standard send).

Additional control messages might be needed for flow control and error recovery. Of course, there are many other possible protocols.

Ready send can be implemented as a standard send. In this case there will be no performance advantage (or disadvantage) for the use of ready send.

A standard send can be implemented as a synchronous send. In such a case, no data buffering is needed. However, users may expect some buffering.

In a multi-threaded environment, the execution of a blocking communication should block only the executing thread, allowing the thread scheduler to de-schedule this thread and schedule another thread for execution. (End of advice to implementors.)
3.5 Semantics of Point-to-Point Communication

A valid MPI implementation guarantees certain general properties of point-to-point communication, which are described in this section.

Order Messages are non-overtaking: If a sender sends two messages in succession to the same destination, and both match the same receive, then this receive operation cannot receive the second message if the first one is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message if the first one is still pending. This requirement facilitates matching of sends to receives. It guarantees that message-passing code is deterministic, if processes are single-threaded and the wildcard MPI_ANY_SOURCE is not used in receives. (Some of the calls described later, such as MPI_CANCEL or MPI_WAITANY, are additional sources of nondeterminism.)

If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multi-threaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads. The operations are logically concurrent, even if one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.
Example 3.6 An example of non-overtaking messages.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_BSEND(buf1, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_BSEND(buf2, count, MPI_REAL, 1, tag, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(buf1, count, MPI_REAL, 0, MPI_ANY_TAG, comm, status, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF
The message sent by the first send must be received by the first receive, and the message sent by the second send must be received by the second receive.
Progress If a pair of matching send and receive operations has been initiated on two processes, then at least one of these two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is satisfied by another message, and completes; the receive operation will complete, unless the message sent is consumed by another matching receive that was posted at the same destination process.
Example 3.7 An example of two intertwined matching pairs.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_BSEND(buf1, count, MPI_REAL, 1, tag1, comm, ierr)
    CALL MPI_SSEND(buf2, count, MPI_REAL, 1, tag2, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(buf1, count, MPI_REAL, 0, tag2, comm, status, ierr)
    CALL MPI_RECV(buf2, count, MPI_REAL, 0, tag1, comm, status, ierr)
END IF
Both processes invoke their first communication call. Since the first send of process zero uses the buffered mode, it must complete, irrespective of the state of process one. Since no matching receive is posted, the message will be copied into buffer space. (If insufficient buffer space is available, then the program will fail.) The second send is then invoked. At that point, a matching pair of send and receive operations is enabled, and both operations must complete. Process one next invokes its second receive call, which will be satisfied by the buffered message. Note that process one received the messages in the reverse order in which they were sent.
Fairness MPI makes no guarantee of fairness in the handling of communication. Suppose that a send is posted. Then it is possible that the destination process repeatedly posts a receive that matches this send, yet the message is never received, because it is each time overtaken by another message, sent from another source. Similarly, suppose that a receive was posted by a multi-threaded process. Then it is possible that messages that match this receive are repeatedly received, yet the receive is never satisfied, because it is overtaken by other receives posted at this node (by other executing threads). It is the programmer's responsibility to prevent starvation in such situations.
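To make the hazard concrete, the following sketch (not taken from the standard; the variables nmsgs, buf, count, and tag, and the assumption of three or more processes, are illustrative) shows a pattern in which fairness matters. Rank 0 matches receives with the wildcard MPI_ANY_SOURCE; nothing obliges the implementation ever to select a message from rank 1 while messages from the other senders keep arriving.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
!   Repeatedly receive from any source. MPI does not guarantee that a
!   message from any particular sender is ever selected; a message from
!   rank 1 may be overtaken indefinitely by traffic from other ranks.
    DO i = 1, nmsgs
        CALL MPI_RECV(buf, count, MPI_REAL, MPI_ANY_SOURCE, tag, comm, status, ierr)
    END DO
ELSE
!   Every other rank competes for the same wildcard receives.
    DO i = 1, nmsgs
        CALL MPI_SEND(buf, count, MPI_REAL, 0, tag, comm, ierr)
    END DO
END IF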
Resource limitations Any pending communication operation consumes system resources that are limited. Errors may occur when lack of resources prevents the execution of an MPI call. A quality implementation will use a (small) fixed amount of resources for each pending send in the ready or synchronous mode and for each pending receive. However, buffer space may be consumed to store messages sent in standard mode, and must be consumed to store messages sent in buffered mode, when no matching receive is available. The amount of space
available for buffering will be much smaller than program data memory on many systems. Then, it will be easy to write programs that overrun available buffer space.

MPI allows the user to provide buffer memory for messages sent in the buffered mode. Furthermore, MPI specifies a detailed operational model for the use of this buffer. An MPI implementation is required to do no worse than implied by this model. This allows users to avoid buffer overflows when they use buffered sends. Buffer allocation and use is described in Section 3.6.

A buffered send operation that cannot complete because of a lack of buffer space is erroneous. When such a situation is detected, an error is signalled that may cause the program to terminate abnormally. On the other hand, a standard send operation that cannot complete because of lack of buffer space will merely block, waiting for buffer space to become available or for a matching receive to be posted. This behavior is preferable in many situations. Consider a situation where a producer repeatedly produces new values and sends them to a consumer. Assume that the producer produces new values faster than the consumer can consume them. If buffered sends are used, then a buffer overflow will result. Additional synchronization has to be added to the program so as to prevent this from occurring. If standard sends are used, then the producer will be automatically throttled, as its send operations will block when buffer space is unavailable.
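A minimal sketch of this producer-consumer pattern (illustrative, not part of the standard; the subroutines produce and consume and the bound nvalues are hypothetical): because the producer uses standard-mode MPI_SEND, its sends block whenever the implementation runs out of buffer space, which paces it to the consumer without any extra synchronization.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    DO i = 1, nvalues
!       Produce the next batch of values, then send in standard mode.
!       If no buffer space is available, the send blocks until the
!       consumer posts its matching receive: the producer is throttled.
        CALL produce(val, count)
        CALL MPI_SEND(val, count, MPI_REAL, 1, tag, comm, ierr)
    END DO
ELSE IF (rank.EQ.1) THEN
    DO i = 1, nvalues
        CALL MPI_RECV(val, count, MPI_REAL, 0, tag, comm, status, ierr)
        CALL consume(val, count)
    END DO
END IF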
In some situations, a lack of buffer space leads to deadlock situations. This is illustrated by the examples below.
Example 3.8 An exchange of messages.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF
This program will succeed even if no buffer space for data is available. The standard send operation can be replaced, in this example, with a synchronous send.
Example 3.9 An errant attempt to exchange messages.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 1, tag, comm, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_RECV(recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
    CALL MPI_SEND(sendbuf, count, MPI_REAL, 0, tag, comm, ierr)
END IF
The receive operation of the first process must complete before its send, and can complete only if the matching send of the second process is executed. The receive operation of the second process must complete before its send, and can complete only if the matching send of the first process is executed. This program will always deadlock. The same holds for any other send mode.
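One deadlock-free way to rewrite the exchange of Example 3.9 (a sketch) is to use the combined send-receive operation MPI_SENDRECV (Section 3.10), which lets each process issue its send and its receive as a single operation, so that neither process has to choose an order between them.

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank.EQ.0) THEN
!   Send to the peer and receive from the peer in one combined call.
    CALL MPI_SENDRECV(sendbuf, count, MPI_REAL, 1, tag, recvbuf, count, MPI_REAL, 1, tag, comm, status, ierr)
ELSE IF (rank.EQ.1) THEN
    CALL MPI_SENDRECV(sendbuf, count, MPI_REAL, 0, tag, recvbuf, count, MPI_REAL, 0, tag, comm, status, ierr)
END IF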