Introduction
A. What is parallel processing?
1. Two or more computers cooperating to accomplish one task
B. Why parallel processing?
1. To accomplish tasks too big for one computer alone
C. Basic paradigms
1. Shared memory
i. Some people find it easier to use, conceptually and practically
ii. Requires special hardware (e.g. dual processors, really expensive specialized machines)
iii. Out of our price range
2. Message-passing
i. Generally considered harder to program
ii. No special hardware needed--networks of workstations
iii. Probably the most useful paradigm for YOU, valued listener, which is why I'm covering it.
3. There are other paradigms, and combinations of paradigms (for example, a virtual shared memory system actually implemented with message-passing), but these tend to be too slow to bother with and/or not used outside academia. Also, I don't have enough time to cover them.
Introducing MPICH
A. MPICH is a Free (speech) implementation of the MPI (Message Passing Interface) standard
1. It's just a C library.
i. If you know C, you can use MPICH.
ii. Works on any architecture that has a C compiler
B. Getting MPICH
1. http://www-unix.mcs.anl.gov/mpi/mpich
C. Installing MPICH
1. http://heather.cs.ucdavis.edu/~matloff/MPI/NotesMPICH.NM.html
D. Running MPICH/common traps and how to avoid them
1. MPICH function calls are verbose and picky
2. Fortunately, you can mostly do them cookbook-style, and MPICH defines a bunch of macros to help you on your way
3. To compile your program: mpicc <flags> <programname>
i. Standard cc flags accepted (e.g. -g, -o)
4. To run your program:
i. Have a process group file in your working directory. This file tells MPICH how to find the other computers you're using.
a. IMPORTANT! All nodes must be in every other node's .rhosts file. If not, your program will die with "Permission denied". This problem is extremely common.
b. The process group file looks like this: (see the sample file at the end of this list)
ii. Start your program with 'mpirun -p4pg <your-process-group-file> <executable-name>'.
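A sample process group file, assuming two remote machines (the hostnames and path below are made up; substitute your own). The first line says to start 0 additional processes on the local machine, since mpirun already starts the local one; each remaining line is "hostname  number-of-processes  full-path-to-executable":

    local 0
    node1.example.edu 1 /home/you/myprog
    node2.example.edu 1 /home/you/myprog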
E. The Distinguished Node
1. It's important to remember that the SAME program runs on all nodes, and therefore that the same instructions get executed on all nodes.
2. But some jobs only need to be done once per program run. Printing out results, say.
3. You just put in a little test in the program--if I'm node 0, then I print out the results. Node 0 now has a special job to do--it's distinguished.
4. In fact, having a distinguished node is frequently very handy! Sure, we could just have it print out the results. But we could also give it more special work to do. We could make it a master node and put it in charge of overseeing all the other nodes, for example.
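A minimal sketch of that "little test" (the variable names match the program skeleton later in these notes):

    /* 'me' holds this node's rank, as returned by MPI_Comm_rank() */
    if (me == 0) {
        /* distinguished node: only node 0 reports the results */
        printf("total = %d\n", total);
    }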
Planning Your Program
A. Is parallel processing right for this task?
1. If the task isn't absofrickinglutely huge, probably not.
2. Example: finding all primes less than N--> parallel processing outperformed a single processor only when N > 1e6
B. Divvying up the work
1. Data parallelism: divide up the data among processors
i. Task pool
a. Every node grabs a chunk of data from a task pool, works on it, and grabs another one, until there are no more chunks left
b. You'll usually have a distinguished node in charge of handing out work and collecting results. (See the task-pool sketch at the end of this section.)
c. Responsive--new tasks can be added to the task pool dynamically
d. Example: graphics rendering, SETI@home
ii. Blocks of data
a. Every node does the same task on a pre-specified chunk or chunks of data
b. Requires static data, but if enough is hard-coded in advance, you might be able to get useful work out of the distinguished node
c. Example: keyspace searches, matrix operations
2. Functional parallelism: divide up the job among processors
i. Pipeline
a. Every node does a different step of a larger task
b. Ideally, each step requires roughly the same time/computing power
c. Think of a production line
3. Granularity
i. Fine-grained: smaller chunks of data/smaller tasks per node
a. More responsive, better load balancing (usually)
b. Higher communication costs--> performance hit
ii. Coarse-grained: bigger chunks of data/larger tasks per node
a. Less responsive, possible load balancing problems
b. Less communication needed--> performance gain
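A rough sketch of the task-pool pattern mentioned above, written with MPI calls that are covered later in these notes. do_chunk() and the integer chunk numbers are stand-ins for real work, and the sketch assumes there are no more workers than chunks:

    #include <stdio.h>
    #include "mpi.h"

    #define WORK_MSG 1   /* message tags: the tag tells a worker what */
    #define STOP_MSG 0   /* kind of message it just received          */

    /* hypothetical worker routine -- replace with your real computation */
    int do_chunk(int chunk) { return chunk * chunk; }

    int main(int argc, char **argv)
    {
        int me, numnodes, chunk, result, total = 0, nchunks = 100;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numnodes);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        if (me == 0) {                      /* distinguished node: hand out work */
            int sent = 0, received = 0, node;
            /* prime the pump: one chunk per worker */
            for (node = 1; node < numnodes && sent < nchunks; node++, sent++)
                MPI_Send(&sent, 1, MPI_INT, node, WORK_MSG, MPI_COMM_WORLD);
            /* collect a result, then hand the now-idle worker the next chunk */
            while (received < sent) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &status);
                total += result;
                received++;
                if (sent < nchunks) {
                    MPI_Send(&sent, 1, MPI_INT, status.MPI_SOURCE, WORK_MSG,
                             MPI_COMM_WORLD);
                    sent++;
                } else {
                    MPI_Send(&sent, 1, MPI_INT, status.MPI_SOURCE, STOP_MSG,
                             MPI_COMM_WORLD);
                }
            }
            printf("total = %d\n", total);
        } else {                            /* worker: grab chunks until told to stop */
            while (1) {
                MPI_Recv(&chunk, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
                if (status.MPI_TAG == STOP_MSG) break;
                result = do_chunk(chunk);
                MPI_Send(&result, 1, MPI_INT, 0, WORK_MSG, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }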
C. Spotting Opportunities for Parallelizing Your Task
1. Look at your loops
i. If your loop is executed N times, could you either have:
a. many processors each take a fraction of N (e.g. node 1 takes iterations 0..25, node 2 takes iterations 26..50, etc.)? (see the sketch after this list)
b. many processors each take a part of the work done in each iteration (e.g. node 1 takes instructions 1..5, etc.)?
ii. If you split up the work in one of those ways, how much communication would be required between nodes? The ideal is no communication at all, which happens for problems that parallelize exceptionally well ("embarrassingly parallel" problems).
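A sketch of option (a), with each node taking a contiguous block of iterations. Here 'me' and 'numnodes' come from the MPI calls described later in these notes, and work() is a made-up per-iteration function:

    int chunk = N / numnodes;               /* iterations per node */
    int start = me * chunk;
    int end   = (me == numnodes - 1) ? N : start + chunk;  /* last node takes the leftovers */
    int i;
    for (i = start; i < end; i++)
        work(i);                            /* hypothetical per-iteration work */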
D. Problems that do not parallelize well
1. Anything that requires extensive user input or I/O
i. Such tasks perform particularly poorly under message-passing; performance may be OK under shared memory (e.g. graphics accelerators)
Anatomy of your average MPICH program
A. Data types in MPICH
1. Basically the same as in C, they just have different names.
i. Portability reasons--there's also MPICH for Fortran
2. You must use the MPI names in MPI function calls.
3. You don't need to cast, say, int to MPI_INT, or vice versa.
B. Basic structure
1. #include "mpi.h"
2. Any constants you want to use?
i. I suggest you #define constants for the types of messages you want to send. You can use them to get your nodes to figure out what kind of message they've just gotten and what action to take.
ii. For example, you can #define WORK_MSG 1 and #define STOP_MSG 0, and tell your worker nodes to return if they get a message with tag STOP_MSG.
3. In the main() function
i. MPI_Init(&argc, &argv): required for all MPICH programs. You must call this function before trying to access command-line args.
ii. MPI_Comm_size(MPI_COMM_WORLD, &numnodes): finds out how many nodes there are and puts that in numnodes.
iii. MPI_Comm_rank(MPI_COMM_WORLD, &me): finds out the "rank" of this node (its unique id among all nodes used) and puts it in me.
iv. Parting of the Ways
a. Now that everything's set up, time to get everyone working.
b. if (me == 0) { hand_out_work(); collect_results(); } else { do_work(); send_results(); }
v. The End: MPI_Finalize(); also required for all MPICH programs. (A minimal skeleton putting all of this together follows.)
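Here is that skeleton, with the master/worker calls left as comments (placeholders for your own functions):

    #include <stdio.h>
    #include "mpi.h"

    #define WORK_MSG 1
    #define STOP_MSG 0

    int main(int argc, char **argv)
    {
        int me, numnodes;

        MPI_Init(&argc, &argv);                    /* required before anything else */
        MPI_Comm_size(MPI_COMM_WORLD, &numnodes);  /* how many nodes?               */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);        /* which one am I?               */

        if (me == 0) {
            /* distinguished node: hand_out_work(); collect_results(); */
            printf("running on %d nodes\n", numnodes);
        } else {
            /* worker node: do_work(); send_results(); */
        }

        MPI_Finalize();                            /* required at the end */
        return 0;
    }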
Writing Your Program
A. How nodes communicate
1. TCP/IP packets... with all the flexibility and overhead you've come to expect from TCP/IP.
i. Internode communication is your bottleneck
B. Blocking vs. nonblocking I/O
1. Blocking I/O (synchronous)
i. When a node wishes to talk to another node, it waits (blocks) until both parties are ready to talk
2. Nonblocking I/O (asynchronous)
i. When a node wants to talk to another node, it just fires off its message and carries on
3. Which to use?
i. Use nonblocking sends & receives if you don't care exactly when a message is received. But if you DO care (say you need information from another node before you can do something), use blocking receives.
4. Send format: MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int messagetag, MPI_Comm commhandle);
5. Receive format: MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
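A blocking exchange using the formats above, assuming 'me' from MPI_Comm_rank and the WORK_MSG tag #define'd earlier: node 1 sends one int, node 0 waits for it.

    int value;
    MPI_Status status;

    if (me == 1) {
        value = 42;
        /* blocking send: returns once MPI has taken the buffer */
        MPI_Send(&value, 1, MPI_INT, 0, WORK_MSG, MPI_COMM_WORLD);
    } else if (me == 0) {
        /* blocking receive: waits until a WORK_MSG arrives from node 1 */
        MPI_Recv(&value, 1, MPI_INT, 1, WORK_MSG, MPI_COMM_WORLD, &status);
        printf("got %d from node 1\n", value);
    }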
C. Scatter, Gather, and Broadcast
1. Up 'til now, we've discussed point-to-point communications. Now for some collective communication patterns specialized for parallel processing.
2. Broadcast/multicast
i. One node sends the same message to all other nodes
ii. Say each node is responsible for iterations x through y of a certain loop. Instead of hardcoding x or y, you can calculate the chunk size at runtime and send that.
iii. MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root_id, MPI_Comm comm);
iv. This is more elegant than having N sends/receives, but I didn't use it much. I don't know why not.
3. Scatter
i. A node has an array. It sends the ith element of the array to node i.
ii. You can use this to send different information to each node.
iii. MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root_id, MPI_Comm comm);
iv. Remember that EVERYONE executes the scatter. MPI is smart enough to figure out who's sending and who's receiving.
4. Gather
i. A node has an array. The information sent to it by node i is put into the ith element.
ii. Suppose each node has a subtotal, say, of the prime-finding problem. You can do a gather to put each node's subtotal in an array resident at the distinguished node, and then add 'em all up.
iii. MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root_id, MPI_Comm comm);
iv. Everyone executes the gather. (see above)
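A sketch combining broadcast and gather along the lines of the prime-counting example above. MAX_NODES, N, and count_primes_in_my_range() are made up for illustration; 'me' and 'numnodes' come from the earlier setup calls:

    #define MAX_NODES 64          /* assumed upper bound on node count */
    #define N 1000000             /* assumed problem size              */

    int chunksize, subtotal, totals[MAX_NODES];

    /* node 0 computes the chunk size once, then every node gets a copy */
    if (me == 0) chunksize = N / numnodes;
    MPI_Bcast(&chunksize, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* count_primes_in_my_range() is a made-up worker function */
    subtotal = count_primes_in_my_range(me, chunksize);

    /* node i's subtotal lands in totals[i] on node 0 */
    MPI_Gather(&subtotal, 1, MPI_INT, totals, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (me == 0) {
        int i, total = 0;
        for (i = 0; i < numnodes; i++) total += totals[i];
        printf("%d primes found below %d\n", total, N);
    }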
D. Locks and barriers
1. Locks
i. Protect critical sections, avoid race conditions
a. There are sections in practically every program where you don't want more than one process at once.
b. For example, the "commit" phase of a database change operation, writing results to disk, etc.
c. Place a lock at the start of the critical section, and an unlock at the end
ii. How to use them efficiently
a. Keep critical sections as short as possible! Every second spent in a critical section is a second when, potentially, no one else can work
b. If you find that an awful lot of your code is in critical sections, re-consider whether parallel processing is an efficient way to approach your task.
2. Barriers
i. Stop here and wait for everyone to catch up
ii. Mostly useful at the end of a program--you usually put a barrier before aggregating or printing out results, for example--but you may find them useful elsewhere, too.
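A sketch of that typical use, assuming 'me' and a result variable 'total' as in the earlier examples:

    /* wait until every node reaches this point before reporting */
    MPI_Barrier(MPI_COMM_WORLD);
    if (me == 0)
        printf("all nodes done; total = %d\n", total);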
Debugging and Optimizing Your Program
A. Debugging a parallel program is a giant pain
1. Using printf statements
i. BEWARE! You cannot assume that they appear on your screen in the order that they were generated.
2. Attaching gdb to a running process
i. A more elegant solution, but I haven't had great success with it. Cumbersome.
ii. In your program, have a statement like "foo = 1; while (foo) {};"
iii. Start them all up and make note of the pids.
iv. Attach gdb on each node with 'gdb <executable> <pid>'
v. Within gdb on each node, set foo to 0 (e.g. 'set var foo = 0')
vi. You should now be able to step through the program from within gdb.
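A standalone sketch of the spin-wait trick (not MPI-specific); 'volatile' keeps the compiler from optimizing the empty loop away:

    #include <stdio.h>
    #include <unistd.h>

    volatile int foo = 1;   /* volatile: gdb will change this behind the compiler's back */

    int main(void)
    {
        printf("pid %d waiting for debugger...\n", (int)getpid());
        while (foo)
            sleep(1);       /* park here until, inside gdb:            */
                            /*   gdb ./a.out <pid>                     */
                            /*   (gdb) set var foo = 0                 */
                            /*   (gdb) next     -- then step as usual  */
        printf("released; continuing\n");
        return 0;
    }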
B. Deadlock
1. If your program hangs, look for the usual, mundane program-hanging culprits first.
2. But also consider deadlock--two or more nodes waiting, forever, for the other one(s) to give up waiting.
3. Example: Two nodes each wish to send the other a piece of data and receive a piece of data from the other. Each does a blocking receive and then a send. What happens? Each waits, forever, for the other to be ready to send data.
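One common fix, sketched for the two-node scenario above: order the calls by rank so one side sends first while the other receives first. (MPI's combined MPI_Sendrecv call is another way around this; the WORK_MSG tag and variable names are illustrative.)

    int mine = me, theirs, partner = 1 - me;   /* two nodes: 0 and 1 */
    MPI_Status status;

    if (me == 0) {          /* node 0 sends first, then receives */
        MPI_Send(&mine, 1, MPI_INT, partner, WORK_MSG, MPI_COMM_WORLD);
        MPI_Recv(&theirs, 1, MPI_INT, partner, WORK_MSG, MPI_COMM_WORLD, &status);
    } else {                /* node 1 receives first, then sends -- no deadlock */
        MPI_Recv(&theirs, 1, MPI_INT, partner, WORK_MSG, MPI_COMM_WORLD, &status);
        MPI_Send(&mine, 1, MPI_INT, partner, WORK_MSG, MPI_COMM_WORLD);
    }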
C. Load balancing
1. In a perfect world, all nodes would do an equal share of the work
2. In the real world, it's not uncommon to find that one node, or, less commonly, a subset of nodes, is doing most/all of the work.
3. Common symptoms of this: your parallel program takes as long as or longer than a single-processor program would; you put printfs in the worker function and all the printfs come from the same node.
4. To fix: Look long and hard at the way tasks are handed out.
Miscellaneous/Wrap-up
A. For more information
1. Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers, by Barry Wilkinson and Michael Allen
2. Parallel Programming with MPI, by Peter Pacheco
3. ECS 158, Parallel Programming, won't be offered this academic year. UCD's own Prof. Norm Matloff has a *great* MPI page at http://heather.cs.ucdavis.edu/~matloff/mpi.html .