name: title layout: true class: center, middle, title count: false --- ##How Can MPI Fit Into Today's Big Computing

Jonathan Dursi
Senior Research Associate
Centre for Computational Medicine
The Hospital for Sick Children
https://github.com/ljdursi/EuroMPI2016 --- name: my-background-1 layout: false .left-column[ ## Who Am I? ### Old HPC Hand... ] .right-column[ Ex-astrophysicist turned large-scale computing. - Large-scale high-speed adaptive reactive fluid flows - DOE ASCI Center at Chicago - ASCI Red - ASCI Blue - ASCI White - FORTRAN, MPI, Oct-tree regular adaptive mesh ] ??? Before we start, let me tell you a little about where I'm coming from. This story is probably pretty familiar to many of you - I started off doing science with computing, and ended up drifting to the other side of that divide, doing computing to support science. I was doing pretty standard HPC stuff - high speed reactive flows (explicit methods), adaptive grid, FORTRAN with some C and python, that sort of thing. -- .right-column-cont[ - Joined HPC centre after postdoc - Worked with researchers on wide variety of problems - Got interested in genomics: - Large computing needs - Very interesting algorithmic challenges ] ??? After doing a postdoc in Toronto, I moved into the HPC Centre there, working with them and Compute Canada - a little like PRACE or XSEDE - and working with a lot of different researchers doing a lot of different problems. Before too long, I started becoming interested in genomics - partly because it was the new frontier, with fascinating and deep algorithmic challenges, but also because it required very large-scale computing to accomplish its promise. --- name: my-background-2 layout: false .left-column[ ## Who Am I? ### Old HPC Hand... ### Gone Into Genomics ] .right-column[ Started looking into Genomics in ~2013, made move in 2014 - Ontario Institute for Cancer Research - Working with Jared Simpson, author of ABySS (amongst other things) - First open-source human-scale de novo genome assembler - MPI-based ] ??? I began working in genomics with Jared Simpson, who amongst other accomplishments was the author of ABySS, one of the very early large-scale de novo genome assemblers from what we now call short reads. It tackled large - including human - genomes by using distributed memory, using MPI. So it sounds like I found a pretty good niche in Genomics for an HPCer, right? -- .right-column-cont[ - ABySS 2.0 just came out, with a new non-MPI mode ] ??? Not quite. ABySS 2.0 just came out, with a new non-MPI mode which will almost certainly become the default after the kinks are worked out. There are absolutely no plans by any of the development team to develop new MPI-based algorithms or tools. --- name: my-background-2 layout: false .left-column[ ## Who Am I? ### Old HPC Hand... ### Gone Into Genomics ] .right-column[ In the meantime, one of the de-facto standards for genome analysis, GATK, has just announced that version 4 will support distributed cluster computing — using Apache Spark.
] ??? In the meantime, one of the de-facto standards for genome analysis, GATK, has just announced that version 4 will support distributed cluster computing — using Apache Spark. How did we get to the point where an entire field of science - and biology is much larger than physics - simply looks at MPI, shrugs, and then keeps going? My concern is the following: --- name: begin layout: false class: center, middle, inverse ## MPI's Place In Big Computing Is Small, and Getting Smaller ??? MPI's Place In Big Computing Is Small, and Getting Smaller I say this as someone who has used MPI happily and successfully in multiple disciplines, and sees little place for it in my new line of work. I'm also someone who disagrees with very little of Mark Parsons' or Bill Gropp's assessment of the past successes of MPI, but disagrees quite a bit about the extrapolation to the future. --- ##MPI is For...
.large-text[
1990s: All multi-node cluster computing

2000s: All multi-node **scientific** computing

2006ish: All multi-node scientific **simulation**

2010s: All multi-node **physical sciences** simulation

2016: **Much** multi-node physical sciences simulation
] ??? But I have concerns. Again, these are the concerns of a recently active MPI user who would like to see the current HPC expertise built up around MPI used effectively in the future, and for MPI to succeed. My concern is that without any decision being consciously made that I'm aware of, the ambition of MPI has been continuously whittled down, and the applicability of MPI is being relentlessly de-scoped. By and large, clarity of focus is to be admired. But as Prof. Gropp has mentioned, “completeness” is important for a communications tool, and a message passing library that is specialized to a very particular form of technical computing is one that can, increasingly easily, simply not be chosen. And it's never been clear to me why a parallel-computing library that transports data for data-crunching should have a speciality in the kinds of data being crunched. --- layout: false .left-column[ ## Top500 vs The Cloud ### 2013 ] .right-column[ .right[
] ] ??? Because right now that specialty is a shrinking portion of a rapidly growing pie. When I started looking at genomics scientific computing in 2013, the top500, where most MPI tasks are run, was already starting to be a fairly small fraction of the available computing out there... --- layout: false .left-column[ ## Top500 vs The Cloud ### 2013 ### 2016 ] .right-column[ .right[
] ] ??? And it's only gotten smaller as time goes on. --- name: begin layout: false class: center, middle, inverse ### MPI's Place In Big Computing Is Small, and Getting Smaller ## but it doesn't have to be this way ??? The underlying phenomenon is _good_ _news_ — this is the most exciting period of time ever in large-scale computing, with new disciplines, new data sources, and new types of hardware sprouting up everywhere in just a decade. There's so much to do! I claim it doesn't have to be this way. During this big-computing Cambrian explosion we're experiencing, MPI's shall we say "steadfastness" has been welcome but is holding it back from being part of this growth. But that need not be the case - the MPI community broadly has much to learn and much to contribute to these other models **if it chooses to do so**. --- layout: false .left-column[ ## Outline ] .right-column[ - A tour of some common big-data computing problems - Genomics and otherwise - Not so different from complex simulations - A tour of programming models to tackle them - Spark - Dask - Distributed TensorFlow - Chapel - A tour of data layers the models are built on - Akka - GASNET - TCP/UDP/Ethernet: Data Plane Development Kit - Libfabric/UCX - Where MPI is: programming model at the data layer - Where MPI can be ] ??? It's very difficult to describe what place MPI does and can occupy in the landscape without drawing a map, so what I want to do for the beginning of our time together is to walk through these new territories with you. I want to talk about some common big data analysis tasks and point out how similar they are to familiar simulation tasks. Then I'd like to show how some of the programming models that are springing up are tackling these problems. These higher level programming models themselves take advantage of lower-level communications frameworks to implement the distributed computation. Understanding what these frameworks provide and how they enable the higher level models is crucial to understanding MPI's potential role. Once we've sketched out the map, I want to describe the place I see MPI as occupying now — and the attractive neighbouring vacant territories that we can occupy if we want. --- layout: false class: center, middle, inverse ## The Problems Big Data Frameworks Aim To Solve --- .left-column[ ## Big Data Problems ] .right-column[ Big Data problems same as HPC, if in different context - Large scale network problems - Graph operations - Similarity computations, clustering, optimization, - Linear algebra - Tree methods - Time series - FFTs, smoothing, ... Main difference: hit highly irregular, unstructured, dynamic problems earlier in data analysis than simulation (but Exascale...) - Irregular data lookups: - Key-Value Stores ] ??? Ultimately Big Data problems are more or less the same problems we face in HPC although in a different context. The Big Data crowd hit problems of dealing with wildly irregular, unstructured, dynamic data structures first - irregularity to a degree that we've up til now been able to avoid in simulation... but may not be able to on the road to Exascale. --- .left-column[ ## Big Data Problems ### Key-value stores ] .right-column[ In simulation, you normally have the luxury of knowing exactly where needed data is, because you put it there. Maintained with global state, distributed state, or implicitly (structured mesh). .center[
] ] --- .left-column[ ## Big Data Problems ### Key-value stores ] .right-column[ With faults, or rapid dynamism, or too-large scale to keep all state nearby, less true. Data analysis: much of genomics involves lookups in indices too large for one node's memory - Good/simple/robust distributed hash tables within an application would be extremely valuable. Begins to look a lot like distributed key-value stores in big data. Highly irregular access (don't know where you're sending to, receiving from, when requests are coming, or how many messages you will receive/send) is not MPI's strong suit. .center[
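For concreteness, here is a toy sketch of the difference, in plain Python with nothing actually distributed; the node count and k-mer keys are made up. In a block-decomposed mesh the owner of a cell follows from arithmetic alone, while in a key-value index the owner is only known once the key is hashed:

```python
# Toy sketch (plain Python): contrast "I know where the data is because I put
# it there" with "I only know where it is once I hash the key".
NNODES = 16

def block_owner(i, npts):
    """Block-decomposed array: cell i's owner follows from arithmetic alone."""
    chunk = (npts + NNODES - 1) // NNODES
    return i // chunk

def kv_owner(key):
    """Hash-partitioned index: the owner is wherever the key hashes to.
    (Python's built-in hash is salted per process; a real DHT would use a
    stable hash.)"""
    return hash(key) % NNODES

print(block_owner(123456, npts=10**6))    # known without any communication

requests = {}
for kmer in ["ACGTAGGC", "TTGACCAA", "GGCATCGA"]:
    requests.setdefault(kv_owner(kmer), []).append(kmer)
print(requests)   # irregular: senders/receivers only discovered at runtime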
] ] ??? But having a global view of data locations becomes less feasible even for simulations at large enough scale. Data analysis: much of genomics for instance involves looking data up in string indices or hash tables. And data sources get more numerous and cheaper - and so less reliable - an awful lot of other data analysis will look like this too Begins to look a lot like distributed key-value stores - a fundamental primitive in many big data applications. Highly irregular access (don't know where you're sending to, receiving from, when requests are coming, or how many messages you will receive/send) is not MPI's strong suit. The users and developers of in-memory NoSQL databases view performance as crucial, and so teams are enormously concerned with latency - especially since a query (and especially an update) may involve several messages between participating nodes. The discussions surrounding latency, bandwidth, cache utilisation, threading, etc would all be extremely familiar to those who work on HPC software. --- .left-column[ ## Big Data Problems ### Key-value stores ### Linear algebra ] .right-column[ Almost any sort of numeric computation requires linear algebra. In many big-data applications, the linear algebra is _extremely_ sparse and unstructured; say doing similarity calculations of documents, using a bag-of-words model. If looking at ngrams, cardinality can be enormous, no real pattern to sparsity .center[
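A tiny illustration of how this looks in practice (scipy.sparse, with three made-up "documents"; a real pipeline would hash words or ngrams into an enormous, patternless column space):

```python
# Toy bag-of-words similarity with scipy.sparse; the documents are invented.
import numpy as np
from scipy.sparse import csr_matrix

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "assemble the genome from short reads"]
vocab = {w: j for j, w in enumerate(sorted({w for d in docs for w in d.split()}))}

rows, cols, vals = [], [], []
for i, d in enumerate(docs):
    for w in d.split():
        rows.append(i); cols.append(vocab[w]); vals.append(1.0)
X = csr_matrix((vals, (rows, cols)), shape=(len(docs), len(vocab)))

S = (X @ X.T).toarray()                 # very sparse matrix-matrix product
norms = np.sqrt(np.diag(S))
print(S / np.outer(norms, norms))       # cosine similarities between documents
```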
] ] ??? Almost any sort of numeric computation requires linear algebra. That's just mathematics, and is inescapable. In many big-data applications, the linear algebra is _extremely_ sparse and unstructured; say doing similarity calculations of documents, using a bag-of-words model, where order is ignored and documents are described simply as occurrence counts of words --- very sparse integer vectors. Words are one thing, but if looking at ngrams, cardinality can be enormous, no real pattern to sparsity --- .left-column[ ## Big Data Problems ### Key-value stores ### Linear algebra ### Graph problems ] .right-column[ As with other problems - big data graphs are like HPC graphs, but more so. Very sparse, very irregular: nodes can have enormously varying degrees, _e.g._ social graphs .center[
] https://en.wikipedia.org/wiki/Social_graph ] ??? As with other problems - big data graphs are like HPC graphs, but more so. For graphs in physical science simulation - say unstructured grids - or analysis, nodes typically have the same number of neighbours to within, say, a factor of two or an order of magnitude. Not so with (say) social media graphs: very sparse, very irregular, and nodes can have enormously varying degrees. Inexplicably, Justin Bieber has far more than twice as many followers as I do on twitter --- .left-column[ ## Big Data Problems ### Key-value stores ### Linear algebra ### Graph problems ] .right-column[ Generally decomposed in similar ways. Processing looks very much like neighbour exchange on an unstructured mesh; can map unstructured mesh computations onto (very regular) graph problems. .center[
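Stripped of any framework, the basic pattern is just "send along edges, reduce at nodes, apply a local update"; a toy sketch (the graph and the update rule are invented for illustration):

```python
# Framework-free toy of the neighbour-exchange / Pregel pattern.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
values = {0: 1.0, 1: 2.0, 2: 4.0, 3: 8.0}

def superstep(values):
    inbox = {v: [] for v in values}
    for u, v in edges:                     # "send" to both endpoints
        inbox[u].append(values[v])
        inbox[v].append(values[u])
    # reduce incoming messages, then update: relax towards neighbour average
    return {v: 0.5 * (values[v] + sum(m) / len(m)) if m else values[v]
            for v, m in inbox.items()}

for _ in range(3):
    values = superstep(values)
print(values)
```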
] https://flink.apache.org/news/2015/08/24/introducing-flink-gelly.html ] --- .left-column[ ## Big Data Problems ### Key-value stores ### Linear algebra ### Graph problems ] .right-column[ Calculations on (_e.g._) social graphs are typically very low-compute intensity: - Sum - Min/Max/Mean So that big-data graph computations are often _more_ latency sensitive than more compute-intensive technical computations ⇒ lots of work done and in progress to reduce communication/framework overhead .center[
] https://spark.apache.org/docs/1.2.1/graphx-programming-guide.html ] --- .left-column[ ## Big Data Problems ### Key-value stores ### Linear algebra ### Graph problems ### Commonalities ] .right-column[ The problems big-data practitioners face are either: - The same as in HPC - The same as HPCish large-scale scientific data analysis - Or what data analysis/HPC will be facing towards exascale - Less regular/structured - More dynamic So it's worth examining the variety of tools that community has built, - To see if they can benefit our community - To see what makes them tick - To see if we can improve them ] --- layout: false class: center, middle, inverse ## Big Data Programming Models --- layout: false class: center, middle, inverse ## Spark: http://spark.apache.org --- .left-column[ ## Spark ### Overview ] .right-column[ Hadoop came out in ~2006 with MapReduce as a computational engine, which wasn't that useful for scientific computation. * One pass through data * Going back to disk every iteration However, the ecosystem flourished, particularly around the Hadoop file system (HDFS) and new databases and processing packages that grew up around it. .center[
] ] --- .left-column[ ## Spark ### Overview ] .right-column[ Spark (2012) is in some ways "post-Hadoop"; it can happily interact with the Hadoop stack but doesn't require it. Built around the concept of in-memory resilient distributed datasets * Tables of rows, distributed across the job, normally in-memory * Immutable * Restricted to certain transformations Used for database, machine learning (linear algebra, graph, tree methods), _etc._ .center[
] ] --- layout: false .left-column[ ## Spark ### Overview ### RDDs ] .right-column[ Spark RDDs prove to be a very powerful abstraction. Key-Value RDDs are a special case - a pair of values, where the first is the key and the second is the value associated with it. (A little like Linda tuple spaces, which underlie Gaussian.) Can easily use join, _etc._ to bring all values associated with a key together: - Like all stencil terms that contribute at a particular grid point .center[
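A minimal PySpark sketch of that pattern (it assumes a local Spark install, and the key-value pairs are invented):

```python
# reduceByKey and join bring everything associated with a key together.
from pyspark import SparkContext

sc = SparkContext("local[2]", "kv-sketch")

# e.g. stencil-like terms tagged by the grid point they contribute to
terms = sc.parallelize([(0, 1.0), (1, 0.5), (1, -0.25), (2, 0.75), (2, 0.25)])
summed = terms.reduceByKey(lambda a, b: a + b)

labels = sc.parallelize([(0, "boundary"), (1, "interior"), (2, "interior")])
print(summed.join(labels).collect())     # [(key, (summed_term, label)), ...]

sc.stop()
```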
] ] --- .left-column[ ## Spark ### Overview ### RDDs ### Execution graphs ] .right-column[ Operations on Spark RDDs can be: * Transformations, like map, filter, join... * Actions like collect, foreach, .. You build a Spark computation by chaining together transformations; but no data starts moving until part of the computation is materialized with an action.
] --- .left-column[ ## Spark ### Overview ### RDDs ### Execution graphs ] .right-column[ Delayed computation + view of entire algorithm allows optimizations over the entire computation graph. So for instance here, nothing starts happening in earnest until the `plot_data()` call (Spark notebook 1)

```python
# Main loop: For each iteration,
#  - calculate terms in the next step
#  - and sum
for step in range(nsteps):
    data = data.flatMap(stencil) \
               .reduceByKey(lambda x, y: x+y)

# Plot final results in black
plot_data(data, usecolor='black')
```

Knowledge of lineage of every shard of data also means recomputation is straightforward in case of node failure ] --- .left-column[ ## Spark ### Overview ### RDDs ### Execution graphs ### Dataframes ] .right-column[ But RDDs are also building blocks. Spark Dataframes are lists of columns, like pandas or R data frames. Can use SQL-like queries to perform calculations. But this allows bringing the entire mature machinery of SQL query optimizers to bear, allowing further automated optimization of data movement and computation. (Spark Notebook 2)
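A rough sketch of the same idea with Dataframes (not the talk's notebook; the columns and values are made up, and it assumes a local Spark 2.x-style session):

```python
# The query is declarative; Spark's optimizer plans the actual execution.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").appName("df-sketch").getOrCreate()

df = spark.createDataFrame(
    [("chr1", 1000, 0.92), ("chr1", 2000, 0.81), ("chr2", 1500, 0.67)],
    ["chrom", "pos", "score"])

df.groupBy("chrom") \
  .agg(F.count("*").alias("n"), F.avg("score").alias("mean_score")) \
  .show()

# ...or the same thing expressed as SQL
df.createOrReplaceTempView("calls")
spark.sql("SELECT chrom, COUNT(*) AS n, AVG(score) AS mean_score "
          "FROM calls GROUP BY chrom").show()

spark.stop()
```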
] --- .left-column[ ## Spark ### Overview ### RDDs ### Execution graphs ### Dataframes ### Graphs ] .right-column[ Graph library — [GraphX](http://spark.apache.org/graphx/) — has also been implemented on top of RDDs. Many interesting features, but for us: [Pregel](http://blog.acolyer.org/2015/05/26/pregel-a-system-for-large-scale-graph-processing/)-like algorithms on graphs. Nodes pass messages to their neighbours along edges.
] --- .left-column[ ## Spark ### Overview ### RDDs ### Execution graphs ### Dataframes ### Graphs ] .right-column[ This makes implementing unstructured mesh methods extremely straightforward (Spark notebook 4):

```scala
def step(g:Graph[nodetype, edgetype]) : Graph[nodetype, edgetype] = {
    val terms = g.aggregateMessages[msgtype](
        // Map
        triplet => {
            triplet.sendToSrc(src_msg(triplet.attr, triplet.srcAttr, triplet.dstAttr))
            triplet.sendToDst(dest_msg(triplet.attr, triplet.srcAttr, triplet.dstAttr))
        },
        // Reduce
        (a, b) => (a._1, a._2, a._3 + b._3, a._4 + b._4, a._5 + b._5, a._6 + b._6, a._7 + b._7)
    )

    val new_nodes = terms.mapValues((id, attr) => apply_update(id, attr))
    return Graph(new_nodes, graph.edges)
}
```

.center[
] ] ??? This makes implementing unstructured mesh methods extremely straightforward, and there's a little example with the talk that implements one in 50 lines of code or so. The argument, to be clear, isn't that this is how researchers should be doing unstructured mesh codes; the argument is that the problems these tools are solving for their users aren't some completely alien and irrelevant forms of analyses. --- .left-column[ ## Spark ### Overview ### RDDs ### Execution graphs ### Dataframes ### Graphs ### Adoption ] .right-column[ Adoption has been rapid and substantial - Biggest open-source big-data platform - Huge private-sector adoption, which has been hard for HPC - Selected in industry because of performance and ease of use .center[
] .center[http://insidebigdata.com/2015/11/17/why-is-apache-spark-so-hot/] ] ??? Adoption has been rapid and substantial - Biggest open-source big-data platform - Huge private-sector adoption, which has been hard for HPC - Selected in industry because of performance and ease of use --- .left-column[ ## Spark ### Overview ### RDDs ### Execution graphs ### Dataframes ### Graphs ### Adoption ] .right-column[ Adoption has been enormous - Huge interest by programmers .center[
] .center[Google Search] .center[
] .center[Questions on Stack Overflow] ] --- .left-column[ ## Spark ### Overview ### RDDs ### Execution graphs ### Dataframes ### Graphs ### Adoption ### Pros/Cons ] .right-column[ **Cons** * JVM Based (Scala) means C interoperability always fraught. * Not much support for high-performance interconnects (although that's coming from third parties - [HiBD group at OSU](http://hibd.cse.ohio-state.edu)) * Very little explicit support for multicore yet, which leaves some performance on the ground. * Doesn't scale _down_ very well; very heavyweight **Pros** * Very rapidly growing * Performance improvements version to version * Easy to find people willing to learn ] ??? Looking over Spark, it's easy to see some Pros and Cons (from the point of view of _our_ use cases; Spark has been designed and optimized for its community, and Big Data users would have a different list of Pros/Cons) A big one from our point of view is that the relatively heavyweight infrastructure around individual Spark processes means that it scales *down* fairly poorly. But the point here isn't that I'm proposing that our simulation codes should be rewritten in Spark - that would be silly. The point is that other communities are building high-productivity tools for doing number-crunching for a user community that cares a great deal about performance and scale - not least of which because they pay real money for cycles - and they're not building it on top of MPI. --- class: center, middle, inverse count: false ## Dask: http://dask.pydata.org/ --- .left-column[ ## Dask ### Overview ] .right-column[ Dask is a python parallel computing package * Very new - 2015 * As small as possible * Scales down very nicely * Works very nicely with NumPy, Pandas, Scikit-Learn * Is definitely nibbling into MPI "market share" * For traditional numerical computing on few nodes * For less regular data analysis/machine learning on large scale * (likely siphoning off a little uptake of Spark, too) Used for very general data analysis (linear algebra, trees, tables, stats, graphs...) and machine learning ] --- .left-column[ ## Dask ### Overview ### Task Graphs ] .right-column[ Allows manual creation of quite general parallel computing data flows (making it a great way to prototype parallel numerical algorithms):

```python
from dask import delayed, value

@delayed
def increment(x, inc=1):
    return x + inc

@delayed
def decrement(x, dec=1):
    return x - dec

@delayed
def multiply(x, factor):
    return x*factor

w = increment(1)
x = decrement(5)
y = multiply(w, x)
z = increment(y, 3)

from dask.dot import dot_graph
dot_graph(z.dask)

z.compute()
```
] --- .left-column[ ## Dask ### Overview ### Task Graphs ] .right-column[ Once the graph is constructed, computing means scheduling either across threads, processes, or nodes * Redundant tasks (recomputation) pruned * Intermediate tasks discarded after use * Memory use kept low * If guesses wrong, task dies, scheduler retries * Fault tolerance .center[
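A small sketch of that flexibility (the delayed functions are invented, and it assumes a reasonably recent Dask where `compute()` accepts a `scheduler=` argument): the same task graph can be handed to different schedulers.

```python
# One task graph, several ways to run it.
from dask import delayed

@delayed
def load(i):
    return list(range(i * 1000, (i + 1) * 1000))

@delayed
def mean(chunk):
    return sum(chunk) / len(chunk)

@delayed
def combine(means):
    return sum(means) / len(means)

graph = combine([mean(load(i)) for i in range(8)])

print(graph.compute(scheduler="threads"))      # thread pool in this process
print(graph.compute(scheduler="processes"))    # local process pool
# with dask.distributed, the same graph can be sent to a cluster via a Client
```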
] .center[http://dask.pydata.org/en/latest/index.html] ] --- .left-column[ ## Dask ### Overview ### Task Graphs ### Dask Arrays ] .right-column[ Array support also includes a small but growing number of linear algebra routines. Dask allows out-of-core computation on arrays (or dataframes, or bags of objects): will be increasingly important in NVM era * Graph scheduler automatically pulls only chunks necessary for any task into memory * New: intermediate results can be spilled to disk

```python
file = h5py.File(hdf_filename,'r')
mtx = da.from_array(file['/M'], chunks=(1000, 1000))
u, s, v = da.linalg.svd(mtx)
u.compute()
```

] --- .left-column[ ## Dask ### Overview ### Task Graphs ### Dask Arrays ] .right-column[ Arrays have support for guardcells, which make certain sorts of calculations trivial to parallelize (but lots of copying right now): (From dask notebook)

```python
subdomain_init = da.from_array(dens_init, chunks=((npts+1)//2, (npts+1)//2))

def dask_step(subdomain, nguard=2):
    # `advect` is just an operator on a numpy array
    return subdomain.map_overlap(advect, depth=nguard, boundary='periodic')

with ResourceProfiler(0.5) as rprof, Profiler() as prof:
    subdomain = subdomain_init
    nsteps = 100
    for step in range(0, nsteps//2):
        subdomain = dask_step(subdomain)
    subdomain = subdomain.compute(num_workers=2, get=mp_get)
```
] --- .left-column[ ## Dask ### Overview ### Task Graphs ### Dask Arrays ### Diagnostics ] .right-column[ Comes with several very useful performance profiling tools which will be instantly familiar to HPC community members
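A sketch of what that looks like locally, using `dask.diagnostics` (`visualize()` assumes bokeh is installed, and the array sizes here are arbitrary):

```python
# Local task-stream and resource profiling of a dask.array computation.
import dask.array as da
from dask.diagnostics import Profiler, ResourceProfiler, visualize

x = da.random.random((8000, 8000), chunks=(1000, 1000))
y = (x @ x.T).mean()

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    y.compute()

visualize([prof, rprof])   # task timeline plus CPU/memory traces, HPC-profiler style
```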
] --- .left-column[ ## Dask ### Overview ### Task Graphs ### Dask Arrays ### Diagnostics ### Pros/Cons ] .right-column[ **Cons** * Performance: Aimed at analysis tasks (big, more loosely coupled) rather than simulation * Scheduler+TCP: 200μs per-task overhead, orders of magnitude larger than an MPI message * Not a replacement in general for large-scale tightly-coupled computing * Python **Pros** * Trivial to install, start using * Outstanding for prototyping parallel algorithms * Out-of-core support baked in * With Numba+Numpy, very reasonable single-core performance * Automatically overlaps communication with computation: 200μs might not be so bad for some methods * Scheduler, communications all in pure python right now, rapidly evolving: * **Much** scope for speedup ] --- class: center, middle, inverse count: false ## TensorFlow: http://tensorflow.org --- .left-column[ ## TensorFlow ### Overview ] .right-column[ TensorFlow is an open-source dataflow for numerical computation with dataflow graphs, where the data is always in the form of “tensors” (n-d arrays). Very new: Released November 2015 From Google, who uses it for machine learning. Lots of BLAS operations and function evaluations but also general numpy-type operations, can use GPUs or CPUs. .center[
] ] --- .left-column[ ## TensorFlow ### Overview ### Graphs ] .right-column[ As an example of how a computation is set up, here is a linear regression example. TensorFlow notebook 1
] --- .left-column[ ## TensorFlow ### Overview ### Graphs ] .right-column[ Linear regression is already built in, and doesn't need to be iterative, but this example is quite general and shows how it works. Variables are explicitly introduced to the TensorFlow runtime, and a series of transformations on the variables are defined. When the entire flowgraph is set up, the system can be run. The integration of tensorflow tensors and numpy arrays is very nice.
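A minimal sketch of the pattern in the 2016-era graph API (TF 1.x style, with synthetic data; this is not the notebook itself): declare variables, define the flowgraph, then run it in a Session.

```python
import numpy as np
import tensorflow as tf

x_data = np.random.rand(100).astype(np.float32)
y_data = 3.0 * x_data + 1.0

W = tf.Variable(0.0)
b = tf.Variable(0.0)
y = W * x_data + b                    # numpy arrays mix directly into the graph

loss = tf.reduce_mean(tf.square(y - y_data))
train = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(200):
        sess.run(train)
    print(sess.run([W, b]))           # approaches [3.0, 1.0]
```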
] --- .left-column[ ## TensorFlow ### Overview ### Graphs ### Mandelbrot ] .right-column[ All sorts of computations on regular arrays can be performed. Some computations can be split across GPUs, or (eventually) even nodes. All are multi-threaded. .center[
] ] --- .left-column[ ## TensorFlow ### Overview ### Graphs ### Mandelbrot ### Wave Eqn ] .right-column[ All sorts of computations on regular arrays can be performed. Some computations can be split across GPUs, or (eventually) even nodes. All are multi-threaded. .center[
] ] --- .left-column[ ## TensorFlow ### Overview ### Graphs ### Mandelbrot ### Wave Eqn ### Distributed ] .right-column[ As with laying out the computations, distributing the computations is still quite manual:

```python
with tf.device("/job:ps/task:0"):
    weights_1 = tf.Variable(...)
    biases_1 = tf.Variable(...)

with tf.device("/job:ps/task:1"):
    weights_2 = tf.Variable(...)
    biases_2 = tf.Variable(...)

with tf.device("/job:worker/task:7"):
    input, labels = ...
    layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
    logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
    # ...
    train_op = ...

with tf.Session("grpc://worker7.example.com:2222") as sess:
    for _ in range(10000):
        sess.run(train_op)
```

Communication is done using [gRPC](http://www.grpc.io), a high-performance RPC library based on what Google uses internally. ] --- .left-column[ ## TensorFlow ### Overview ### Graphs ### Mandelbrot ### Wave Eqn ### Distributed ### Adoption ] .right-column[ Very rapid adoption, even though targeted very narrowly: deep learning All threaded number crunching on arrays and communication of results of those array calculations .center[
] ] --- .left-column[ ## TensorFlow ### Overview ### Graphs ### Mandelbrot ### Wave Eqn ### Distributed ### Adoption ### Pros/Cons ] .right-column[ **Cons** * N-d arrays only means limited support for, e.g., unstructured meshes, hash tables (bioinformatics) * Distribution of work remains limited and manual (but is expected to improve - Google uses this) **Pros** * C++ - interfacing is much simpler than Spark * Fast * GPU, CPU support, not unreasonable to expect Phi support shortly * Great for data processing, image processing, or computations on n-d arrays ] ??? This one is a little closer to our wheelhouse. Computationally-intensive operations on n-dimensional arrays of floating point values, communicating the results of those computations, CPU and GPU, C++. Very nice high-productivity layer above the hard-core numerics. Performance is very important for these computations - but again, the underlying data transport is very definitely not MPI. --- layout: false class: center, middle, inverse ## Chapel: http://chapel.cray.com --- .left-column[ ## Chapel ### Overview ] .right-column[ Chapel was one of several languages funded through DARPA HPCS (High Productivity Computing Systems) project. Successor of [ZPL](http://research.cs.washington.edu/zpl/home/). A PGAS language with global view; that is, code can be written as if there was only one thread (think OpenMP)

```fortran
config const m = 1000, alpha = 3.0;

const ProblemSpace = {1..m} dmapped Block({1..m});

var A, B, C: [ProblemSpace] real;

B = 2.0;
C = 3.0;

A = B + C;
```

`$ ./a.out --numLocales=8 --m=50000` ] --- .left-column[ ## Chapel ### Overview ### Domain Maps ] .right-column[ Chapel, and ZPL before it: * Separate the expression of the concurrency from that of the locality. * Encapsulate _layout_ of data in a "Domain Map" * Express the concurrency directly in the code - programmer can take control * Allows "what ifs", different layouts of different variables. What distinguishes Chapel from HPF (say) is that it has these maps for other structures - and user can supply domain maps:
http://chapel.cray.com/tutorials/SC09/Part4_CHAPEL.pdf ] --- .left-column[ ## Chapel ### Overview ### Domain Maps ### Jacobi ] .right-column[ Running the Jacobi example shows a standard stencil-on-regular grid calculation:

```bash
$ cd ~/examples/chapel_examples
$ chpl jacobi.chpl -o jacobi
$ ./jacobi
Jacobi computation complete.
Delta is 9.92124e-06 (< epsilon = 1e-05)
no of iterations: 60
```
] --- layout: false .left-column[ ## Chapel ### Overview ### Domain Maps ### Jacobi ### Tree Walk ] .right-column[ Lots of things do stencils on fixed rectangular grids well; maybe more impressively, concurrency primitives allow things like distributed tree walks simply, too:
] --- .left-column[ ## Chapel ### Overview ### Domain Maps ### Jacobi ### Tree Walk ### Pros/Cons ] .right-column[ **Cons** * Compiler still quite slow * Domain maps are static, making (say) AMR a ways away. * (dynamic associative arrays would be a _huge_ win in bioinformatics) * Irregular domain maps are not as mature **Pros** * Growing community * Developers very interested in "onboarding" new projects * Open source, very portable * Using mature approach (PGAS) in interesting ways ] --- .left-column[ ## Common Themes ### Higher-Level Abstractions ] .right-column[ MPI — with collective operations and MPI-IO — has taught us well-chosen higher-level abstractions provide: - User: **both**: - higher performance (performance portability, autotuning) - higher productivity (less code, fewer errors) - Toolbuilder: Clear, interesting targets for: - Algorithmic development (research) - Implementation tuning (development) Better deal for both. .center[
] ] --- .left-column[ ## Common Themes ### Higher-Level Abstractions ] .right-column[ - Spark: Resilient distributed data set (table), upon which: - Graphs - Dataframes/Datasets - Machine learning algorithms (which require linear algebra) - Mark of a good abstraction is you can build lots atop it! - Dask: - Task Graph - Dataframe, array, bag operations - TensorFlow: - Data flow - Certain kinds of “Tensor” operations - Chapel: - Domains - Locales ] --- .left-column[ ## Common Themes ### Higher-Level Abstractions ### Data Flow ] .right-column[ All of the approaches we've seen implicitly or explicitly constructed dataflow graphs to describe where data needs to move. Then can build optimization on top of that to improve data flow, movement These approaches are extremely promising, and already completely usable at scale for some sorts of tasks. Already starting to attract attention in HPC, e.g. [PaRSEC at ICL](http://icl.utk.edu/parsec/):
] --- .left-column[ ## Common Themes ### Higher-Level Abstractions ### Data Flow ### Data Layers ] .right-column[ There has been a large amount of growth of these parallel programming environments for data crunching in the past few years. None of the big data models described here is even four years old as a public project. For all of them, performance is a selling point for their audience. None use MPI. ] --- name: begin layout: false class: center, middle, inverse ## Data Layers --- .left-column[ ## Data layers ] .right-column[ Even if you don't like the particular ones listed, programming models/environments like the ones above would be useful for our scientists! The ones above are tuned for iterative data analysis - Less tightly coupled than much HPC simulation - But that's a design choice, not a law of physics - And science has a lot of data analysis to do anyway They aren't competitors for MPI, they're the sort of things we'd like to have implemented atop MPI, even if we didn't use them ourselves. ⇒ Worth examining the data-movement layers of these stacks to see what features they require in a communications framework. ] --- name: begin layout: false class: center, middle, inverse ## Spark, Flink: Akka http://akka.io --- .left-column[ ## Akka ### Overview ] .right-column[ Akka is modeled after [Erlang](http://erlang.org):

```scala
class Ping(pong: ActorRef) extends Actor {
  var count = 0
  def incrementAndPrint { count += 1; println("ping") }
  def receive = {
    case StartMessage =>
      incrementAndPrint
      pong ! PingMessage
    case PongMessage =>
      incrementAndPrint
      if (count > 99) {
        sender ! StopMessage
        context.stop(self)
      } else {
        sender ! PingMessage
      }
  }
}

object PingPongTest extends App {
  val system = ActorSystem("PingPongSystem")
  val pong = system.actorOf(Props[Pong], name = "pong")
  val ping = system.actorOf(Props(new Ping(pong)), name = "ping")
  ping ! StartMessage
}
```

http://alvinalexander.com/scala/scala-akka-actors-ping-pong-simple-example ] ??? Are people here familiar with Erlang? Erlang is a message-passing language from Ericsson that went open-source at about the same time as MPI became available. It's widely used in telecommunications, and other soft-real-time systems like many control systems. The soft-real-time community cares a lot about latency, but also about fault tolerance. Erlang is a functional programming language with asynchronous, active message passing that triggers actions via pattern matching. We can see this with the simple ping-pong example in Akka. If the "Ping" actor gets a start message, it sends a Ping message to the pong actor; if it gets a pong back, it either pings again or exits after enough iterations. --- .left-column[ ## Akka ### Overview ] .right-column[ Akka is a Scala-based concurrency package that powers Spark, [Flink](http://flink.apache.org): Actors are message passing - Active messages (RPC) - Messages trigger code - Asynchronous communication and synchronous (w/ timeouts) - Unreliable transport: - At most once - But in-order guarantee for successful messages - Failure detection - Several backends - HTTP, TCP, or UDP - Support for persistence/checkpointing - Support for migration - Support for streaming: persistent data flow Also support for futures, etc. ] --- .left-column[ ## Akka ### Overview ### Benefits for Spark ] .right-column[ - Active Messages/Actors support very irregular executions discovered at runtime - Akka + JVM allows easy deployment of new code (à la Erlang) - Fault tolerance supports the resiliency goal .center[
] http://zhpooer.github.io/2014/08/03/akka-in-action-testing-actors/ ] --- name: begin layout: false class: center, middle, inverse ## UPC, Chapel, CAF, _etc_.: GASNET https://gasnet.lbl.gov --- .left-column[ ## GASNet ### Overview ] .right-column[ History: - 2002: Course Project (Dan Bonachea, Jaein Jeong) - Refactored UPC network stack into library for other applications Features: - Core API: Active messages - Extended API: RMA, Synchronization - Guaranteed delivery but unordered - Wide range of backends Used by UPC, Coarray Fortran, OpenSHMEM reference implementation, Legion, Chapel... .center[
] .center[https://people.eecs.berkeley.edu/~bonachea/upc/gasnet-project.html] ] --- .left-column[ ## GASNet ### Overview ### Benefits for PGAS languages ] .right-column[ RMA is very efficient for random access to large distributed mutable state - So not immediately helpful for Spark RDDs - Very useful for HPC - Compilers are very capable at reordering slow memory access to hide latency Active messages greatly simplify irregular communications patterns, starting tasks remotely .center[
] .center[https://xstackwiki.modelado.org/DEGAS] ] --- name: begin layout: false class: center, middle, inverse ## TensorFlow, Dask, and one or two other things: ## TCP/UDP/Ethernet --- name: begin layout: false class: center, middle, inverse ## TensorFlow, Dask, and one or two other things: ## TCP/UDP/Ethernet ## No, seriously --- .left-column[ ## TCP/UDP Ethernet ] .right-column[ High-Frequency traders who [_lay undersea cables_](http://www.telegraph.co.uk/technology/news/8753784/The-300m-cable-that-will-save-traders-milliseconds.html) to avoid latency use [TCP](http://www.openonload.org), often even internally. User-space networking stacks can have true zero-copies, avoid context switches, and so can be very performant, eg: - [Cisco UCS usNIC](http://blogs.cisco.com/performance/ultra-low-latency-ethernet-questions-and-answers) - [Teclo](https://www.snellman.net/blog/archive/2015-08-25-tcp-optimization-in-mobile-networks/) - [mTCP](https://github.com/eunyoung14/mtcp) - [lkl](https://github.com/libos-nuse/lkl-linuxmTCP), [libuinet](https://github.com/pkelsey/libuinet) Can be specialized to LAN only, optimized .center[
] .center[http://github.com/eunyoung14/mtcp] ] --- .left-column[ ## TCP/UDP Ethernet ## Benefits ] .right-column[ Very mature Lots of experience with RPC, eg gRPC from Google (used in TensorFlow); unreliable transport available Fault-tolerant [Data Plane Development Kit (DPDK)](http://dpdk.org) - userspace ethernet; ~80 cycles/packet ~45% of current top 500 is 10G or 1G ethernet But no RDMA (or slower). .center[
] .center[https://grpc.io] ] --- .left-column[ ## TCP/UDP Ethernet ## Benefits ] .right-column[ Example of this: [ScyllaDB](http://www.scylladb.com) NoSQL database, much faster than competitors - C++ - Sockets/Ethernet based, - Their own user-space TCP - and/or DPDK Based on underlying [SeaStar](http://www.seastar-project.org) framework - Very interesting C++ concurrent/parallel tools - Futures/Promises: Continuations - On-node message passing - Off-node with explicit networking .center[
] .center[http://scylladb.com] ] --- name: begin layout: false class: center, middle, inverse ## Upcoming: LibFabric, UCX --- .left-column[ ## Libfabric, UCX ### Overview ] .right-column[ Many different types of projects have to re-invent the wheel of network-agnostic layer - MPI Implementations - GASNet - High performance file systems - Control plane Projects like Libfabric, UCX, and CCI (now less active) aim to package up this layer and expose it to a variety of consumers Support - Active messages - Passive messages - RMA - Atomics (CAS,...) .center[
] ] --- .left-column[ ## Libfabric, UCX ### Overview ### Libfabric ] .right-column[ Grew out of the OpenFabrics alliance Hardware abstraction, not network: quite low-level Scalable addressing support (minimal memory usage) Lightweight Reliable or unreliable transport Substantial support: DOE, DOD, NASA, Intel, Cray, Cisco, Mellanox, IBM, UNH Some OpenSHMEM implementations, OpenMPI OFI MTL, MPICH channel, GASNet OFI conduit... .center[
] ] --- .left-column[ ## Libfabric, UCX ### Overview ### Libfabric ### UCX ] .right-column[ Started as UCCS, based on OpenMPI BTL/MTLs Aims at being a higher-level API than Libfabric but in truth there's much overlap Model is more that a single job/task has a UCX "universe" IBM, UT, Mellanox, NVIDIA, ORNL, Pathscale, UH Reliable but out-of-order delivery MPICH, an OpenSHMEM implementation, OpenMPI... .center[
] ] --- .left-column[ ## Libfabric, UCX ### Overview ### Libfabric ### UCX ### Summary ] .right-column[ Two solid, rapidly maturing projects Clearly capable of supporting higher-level data layers (MPI, GASNet, OpenSHMEM), likely high-enough level for programming models to compile down to Will greatly reduce the barrier to writing high-performance cluster applications There are slight differences of focus, but is there room for both efforts? - Time will tell - Competition is good ] --- .left-column[ ## Data layers ### Summary ] .right-column[ The data layers that support these richer programming models have a few things in common. - Relaxed transport reliability semantics - Relaxing one or both of in-order and arrival guarantees, at least optionally - Sometimes it's more natural to handle that at higher levels - In some other cases, not necessary - In these cases, can get higher performance ] -- .right-column-cont[ - Fault Tolerance - More natural with relaxed semantics - Crucial for commodity hardware, large-scale ] -- .right-column-cont[ - Active messages/RPC - Makes more dynamic and/or irregular communication patterns much more natural - Allows much more complicated problems ] -- .right-column-cont[ - RMA - Used in some but not all - Very useful for handling large distributed mutable state - Not all problem domains above require this, but very useful for simulation ] --- name: begin layout: false class: center, middle, inverse ## Where does (and could) MPI fit in --- .left-column[ ## Whither MPI ### There's so much going on! ] .right-column[ Most exciting time in large-scale technical computing maybe ever. “Cambrian Explosion” of new problems, tools, hardware: - This is what we signed up for! MPI has much to contribute: - Great implementations - Great algorithmic work - Dedicated community which “eats its dog food” But neither MPI implementations nor expertise is being used How Can MPI Fit Into Today's Big Computing? ] --- .left-column[ ## Whither MPI ### There's so much going on! ] .right-column[ "MPI" is a lot of things: - Standard - Implementations - Algorithm development community - Software development community Lots of places to fit in! ] --- .left-column[ ## Whither MPI ### There's so much going on! ### Standards ] .right-column[ MPI 4.0 is an enormous opportunity - It's becoming clear what the needs are for - Large scale scientific data analysis - Large scale "big data"-type analysis - Towards exascale - x.0 releases are precious things - Breaking backwards compatibility is allowed - Breaking backwards compatibility is expected ] --- .left-column[ ## Whither MPI ### There's so much going on! ### Standards ] .right-column[ For the API, it's fairly clear what the broader community needs: Relaxed reliability - Strict is primarily useful for scientists writing MPI code - Scientists shouldn't be writing MPI code - Allow applications that can handle out-of-order or dropped messages the performance win by doing so ] -- .right-column-cont[ Fault tolerance - Heroic attempts at this for years - Increasingly, vitally necessary - Maybe only allow at relaxed reliability levels? ] -- .right-column-cont[ Active messages - Really really important for dynamic, irregular problems (all at large enough scale) ] --- .left-column[ ## Whither MPI ### There's so much going on! 
### Standards ] .right-column[ More Access to MPI Runtimes - Implementations have fantastic, intelligent runtimes - Already make many decisions for the user - Some bits are exposed through tools interface - Embrace the runtime, allow users more interaction ] --- .left-column[ ## Whither MPI ### There's so much going on! ### Standards ] .right-column[ Such an MPI could easily live between (say) Libfabric/UCX and rich programming models Delivering real value: - Through API - Through Datatypes - Through Algorithms - Collectives - IO - Other higher-level primitives? - Through intelligent runtimes Could be a network-agnostic standard not just in HPC but elsewhere ] --- .left-column[ ## Whither MPI ### There's so much going on! ### Standards ### Algorithms ] .right-column[ Collectives over unreliable transport What does something like Dask or Spark — maybe built on MPI-4 — look like when dealing with NVM? - External-memory algorithms become cool again! - Migrating pages of NVM with RDMA? Increased interest in execution graph scheduling - But centralized scheduler is bottleneck - What does work-stealing look like between schedulers? - Google Omega Spark has put a lot of work into graph and machine-learning primitives - But for RDDs - What can you improve on with mutability? ] --- .left-column[ ## Whither MPI ### There's so much going on! ### Standards ### Algorithms ### Coding ] .right-column[ Projects exist that could greatly reduce time-to-science immediately, while MPI-4 is being sorted out, and eventually take advantage of MPI-4's capabilities Dask: - Try a high-performance network backend (libfabric/UCX?) - Distributed scheduling TensorFlow: - Higher-level DSL that generates the data flow graphs - Extend support to other accelerators Chapel: - Port code over, help find performance issues - Help with partial collectives (_e.g._ not `MPI_COMM_WORLD`) - Lots of compiler work to be done: LLVM, optimizations, IDE ] --- ## The Future is Wide Open A possible future exists where every scientist — and most data scientists and big data practitioners — rely on MPI every day while only a few write MPI code -- Sockets for high-performance technical computing -- The platform to build important communications and numerical algorithms, rich programming and analysis environments -- There's much to be done, and the opportunities are everywhere -- But the world won't wait -- The world isn't waiting -- Good Luck!