%%%include "default.mgp" %deffont "standard" xfont "/usr/share/fonts/default/TrueType/arialn.ttf" %deffont "thick" xfont "/usr/share/fonts/default/TrueType/ariblk.ttf" %deffont "typewriter" xfont "/usr/share/fonts/default/TrueType/courbd.ttf" %deffont "titlepage" xfont "/usr/share/fonts/default/TrueType/coprgtb.ttf" %deffont "header" xfont "/usr/share/fonts/default/TrueType/ariblk.ttf" %deffont "italic" xfont "/usr/share/fonts/default/TrueType/courbd.ttf" %default 1 leftfill, size 1 %default 2 size 6, prefix " ", fore "black", font "header" %default 3 left, size 6, bar "#34575b", font "standard" %default 4 size 5, fore "black", vgap 30, prefix " ", font "standard" %tab 1 size 4, vgap 30, prefix " ", icon box "#34585b" 50 %tab 2 size 3, vgap 20, prefix " ", icon arc "#54787b" 50 %tab 3 size 2, vgap 20, prefix " ", icon delta3 "black" 40 %default 1 bimage "/home/becker/images/slidebg.jpg" 1512x1134 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %page %nodefault %%bgrad 0 0 16 0 0 "black" "yellow" %bimage "/home/becker/images/slidebg.jpg" 1512x1134 %center, size 9, font "titlepage", fore "black" Scalable Performance Clustering %center, size 9, font "titlepage", fore "black" State of the Art %center, size 6, font "titlepage", fore "black" and the future %center, size 9, font "titlepage", fore "black" %size 6, fore "#34585b", center Donald Becker %size 5, fore "#54787b", center Scyld Software %font "italic", size 3, fore "#54787b", center A Penguin Computing Company becker@scyld.com %size 3 Presented with MagicPoint %%image "/usr/share/doc/magicpoint-1.07a/sample/mgp3.xbm" %page %font "standard" Definitions %size 6, prefix " ", fore "black", font "header" Cluster: %size 6, prefix " ", fore "black", font "standard" A widely used term meaning Independent computers Combined into a unified system Through software and networking %pause %size 6, prefix " ", fore "black", font "header" Cluster Types: Scalable Performance Cluster High Availability (Fail-over) Cluster Resource Access Cluster %page What is Beowulf? Beowulf is Scalable Performance Clusters based on Commodity hardware Private system network Open source software (Linux) infrastructure %page What is Beowulf? Scalable Performance Clusters Improving performance proportionally with added machines Commodity hardware Mass-market, stand-alone compute nodes Private system network Nodes dedicated to computation Predictable, efficient and a simple security model Open source software (Linux) infrastructure Core software is verifiable %page Why clusters? Price for Performance Obvious initial reason Business market pays for engineering Efficient distribution and service channels As-need Scalability New machines can be automatically added New, faster machines can replace older machines Architecture and software remains the same Investment preserved Commodity platforms Performance growth rate Better continuity and availability Long-term viability %page Advantages of Commodity Systems Commodity CPUs Always available Many vendors Multiple CPU development teams Rapid improvements Common environment software HPC software traditionally "utilitarian" Software differences a barrier to understanding and use %page Very Brief Beowulf History Beowulf Project The Beowulf Project was started at NASA in 1994 Beowulf was intended to supplement supercomputers "Beowulf" was an apt project name Linux continues to be the dominant cluster OS %pause Scyld Beowulf Scyld was started in 1998 Redesigned for ease of use and deployment Scyld Beowulf is the Scyld product Innovative new generation cluster software "Scyld" was the father of Beowulf %page Cluster Software What is important in a cluster software system? %pause Well, what are the problems? %pause Complexity System management model %pause Installation First use learning curve %pause Applications to use the system Tool availability %pause Maintenance Maturity and continuity %page What is the goal? Understanding the goal helps show the path %pause Uniform virtual environment %pause Single System Image %pause Single System Illusion Creating the illusion of a single standard machine No performance impact allowed Changed+simpler is not necessarily simpler %page Previous Solutions How have these problems been addressed in the past? %pause Classic Beowulf Full OS installation on all nodes Supports user login on any node Administration by scripts Consistency and synchronization tools Cluster monitoring GUI %page New-generation Solution How the world gotten better? %pause New-generation Beowulf Full OS installation only on "master" Compute nodes designed as a computational resource Single point administration Single point updates Single process space view Centralized monitoring and job control %page Scyld Beowulf A standard, supported Beowulf cluster operating system Simplifies integration and administration Targeting deployment of complex applications %pause What it is not: Automatic parallelization A new language, or An integrated development environment. %page Scyld Beowulf Features "Install once, execute everywhere" Administration and use is very similar to a single machine Dynamically adding compute nodes is fast and automatic Scalable to over a thousand compute nodes Eliminates software version skew Based on Linux Open Source software infrastructure %page Design Philosophy and Goals Administrators Simplicity Minimal new cluster-specific tools Users Application users should not need to know they are on a cluster Administration should require little new knowledge Developers Need to be sophisticated only in application area Compile-run development cycle, not compile-copy-run Deployment with a single executable %page System Model "Master" front-end Multiple "Slave node" compute machines Booting and configuration controlled from a master Master Full operating system installation All standard tools and utilities available unchanged Supports user login Provides OS, drivers, libraries and applications Slaves Tuned kernel No required file system No user logins or system services No required executables! %page Master-Controlled Cluster System Model %center %image "/home/becker/tutorial/diagram/arch2-2.jpg" %page Why this System Model? Combines the advantages of A standard user environment Ability to run unchanged applications Specialized compute systems %page Scyld Beowulf Single System Image Single Installation Single point upgrade Kernel, drivers, system libraries User applications, user libraries No version skew Zero-installation scaling New nodes take seconds to provision Full performance on compute nodes File system semantics selected At system integration, or By administrator Unified process space %page Operational Details Nodes are added dynamically PXE or Scyld Beoboot booting Provisioning takes as little as one second Next started job may use new node A heartbeat is used to detect missing and failed systems Immediately removed from scheduling Eventually running processes reported as crashed Detection of lost system connection Compute node default is rebooting after 30 seconds Configurable behavior %page Subsystems Overview Booting and Provisioning Clusters Unified Process Space Beowulf Name Services Scheduling %page Booting and Provisioning Functionality not needed with a grid Tightly tied with the architecture and management Some systems mistakenly assume it "just happens" An opportunity and challenge %page Booting Clusters Booting has long been a hot topic Various boot media Disk-based and Disk-less models Safe system software updates problematic Multiple reporting points for boot problems %page BeoBoot: Scyld Booting and Provisioning Boot requirements Reliable network boot Dynamic new node addition All run-time components from a controlling system Configuration from a central point Solution Unchanging boot code Scyld developed BeoBoot Stage 1 Now ubiquitous PXE network boot Flow controlled boot server Kernel and minimal system from a boot server Configuration and provisioning from a master %page Beoboot Stage 2: Initial Provisioning Beoboot has a minimal initial system Identifies network devices Loads network device driver Contacts server for identity information Connects to master for configuration The magic is... %size 4, prefix " ", fore "yellow", font "standard" This Space Intentionally Blank %page Beoboot Final Stage Concept: Configure for specific need For a compute node: Master sets time of day Master mounts file systems Master starts any application or services %page Compute nodes with Scyld Beowulf Base system model is "diskless administrative" Only 10-50MB of required cached data Default environment supports most applications Visible /lib/* libraries /etc/ is mostly empty /etc/passwd and /etc/group are not needed! /etc/mtab exists only so that 'df' works. Name services (hostname, password) are usually bypassed. No /bin or other Recommended but optional local disks Used for databases and additional caching Optionally mounted and checked on startup %page Unified Process Space Problems: Starting jobs on a dynamic cluster Monitoring and controlling running processes Allowing interactive and scheduler-based usage Opportunity: Clusters jobs are issued from designated masters That master has exactly the required environment We already have a POSIX job control model Tight communication %page Solution: A Beowulf Cluster Process Space Create a cluster-wide Unified Process Space Control processes with a local process table entry Forward signals and exit status Precise process creation through migration Remote fork or execute to create processes Implement using checkpoint/restart migration Magic trick: make it fast and efficient Result All jobs appear to exist on the front-end "master". Job control and process monitoring work as expected! Control-Z suspends all jobs, "bg" starts all running The 'ps' and 'top' programs work unchanged %page Performance Characteristics Start-up Under 10 msec. to complete a remote job! 10X faster than rsh, 20-30X ssh No run time performance impact System calls and paging are local Process status update to master is compact and low-rate Only fork(), signals and exit() require a round-trip interaction Compare to transparent process migration of Mosix %page How BProc works BProc is a "Directed Process Migration" Mechanism BProc has architectural elements of Remote Fork Process migration Checkpoint / restart Design details VMA dump and restart -- essentially "checkpoint" to a socket/stream In general, files and sockets are closed stdin, stdout, and stderr may remain connected Process environment info (process ID) appears unchanged Preserves POSIX process family semantics Signals (SIG*) are forwarded both ways. Slave updates state to master. Resource usage on exit %page How can this be Fast? Cached libraries ("VMA regions") Copy on changed pages in known VMA regions Copy unknown VMA regions Improvements Better dynamic caching of objects Caching selection of RAM (default) or local disk Pluggable transport selection e.g. TCP or Myrinet Detach process and re-master node %page Name Service / Directory Service "Name Service" and "Directory Service" mean the same thing. A directory service Maps a name to a value, or Provides a list of names. Specific Examples User names Password and user information Host names IP addresses and Ethernet MAC addresses Network groups A list of similar hosts %page Benefits of Cluster Nameservices Why are cluster nameservices important? Simplicity Eliminates per-node configuration files Automates scaling and updates Performance Avoid the serialization of network name lookups. Avoid communicating with a busy server Avoid failures from server overload Avoid the latency of consulting large databases %page Cluster Name Service Opportunities Why can we do a better job? Clusters have a single set of users User credentials available at job initiation point New nodes will have predictable names Cluster nodes are granted similar access permissions %page Solution: BeoNSS, Beowulf Name Services BeoNSS is a mechanism that Caches, Computes or Avoids name lookup %page Hostnames Cluster hostnames have the form . Syntax does not conflict Compare with DNS and local hostnames Special names for "self" and "master" Current machine is ".-2" or "self". Master is known as ".-1" Aliases of "master" and "master0". Cluster nodes start at ".0" Zero based for flexibility Do not assign ".0" for 1-based naming Extend to maximum node e.g. ".31" Maximum resolvable number defined. %page User Name lookups Names are reported as password table entry 'pwent' Processes are moved with their user information BeoNSS reports only the current user and root Cluster jobs do not need to know other users Much faster than scanning large lists %page Other name services Netgroups to automate file server export security Services and Protocols databases All common, fixed values Frequency of use analysis to select and sort entries %%page %%nodefault %%bimage "/home/becker/images/slidebg.jpg" 1512x1134 %%center, size 11, font "titlepage", fore "black" %% Cluster Scheduling and Job Mapping %page Scheduling on Grids vs. Clusters Similar words and concepts are used Opportunities and thus architecture differ Clusters support interactive and administrative use %page Definitions Scheduling: A combination of concepts about running jobs Queuing: Delaying jobs until resources are available Backfill: Reordering queue for better utilization Mapping: Assigning processes of a job to nodes Environment Configuration: Making files, etc. available Job Initiation: Creating processes on specified nodes Job Control: Stopping, resuming and killing processes Reporting: Tracking resource usage and exit status Environment Clean-up: Undoing configuration %page Scyld Beowulf Scheduling Support Queuing: BBQ, or external scheduler Backfill: Only with external scheduler Mapping: Beomap, NPR, or external scheduler Environment Configuration: Ad hoc, responsibility of job Job Initiation: BProc Job Control: BProc Reporting: BProc Environment Clean-up: Ad hoc %page Scyld Scheduling Scyld supports external schedulers, Differences between Scyld and External Schedulers Scyld programs call library functions for a map Extensible by dynamic loading libraries into the application External Schedulers provide a daemon that schedules jobs Extensible by loading dynamic libraries into the daemon %page Scyld Scheduler Interface Scyld provides centralized scheduler support Use Beostat library node capability: processor count, speed, memory status: load average, free memory Use BProc library for node state Node state is up Permission for user execution Option to force scheduler-only job submissions Set node group ownership to scheduler Set execute permission only for group %page BeoMap, the Scyld Mapping System Beomap is a layered job mapping system Programs call beomap functions Scripts call 'beomap' program Thin wrapper for mapping function NODE=beomap --no-local --np 1 Mapping interacts with node status Node state -- only use 'up' nodes Node information -- need free memory %page BeoMap Implementation Layer The BeoMap system allows "pluggable" schedulers Looks for system- or user-provided dynamic library Library function is passed a key (program name) Default scheduler is good for most uses Looks for least-loaded nodes Prefers grouping processes on SMP nodes Sorts node list by node number Extended schedulers Have access to program name, BeoStat library and BProc state %page BeoMap BeoMap: a better approach Other schedulers are daemon-based Loading dynamic libraries is more efficient and flexible Users and administrators may install customized rules Complex network topologies may be handled Why is this possible in Scyld? Kernel-enforced node ownership mechanism Invalid mappings simply fail. Daemon-based schedulers must be closed systems May not execute arbitrary user code Must use only their internal statistics and job monitoring %page %nodefault %bimage "/home/becker/images/slidebg.jpg" 1512x1134 %center, size 9, font "titlepage", fore "black" Using Scyld Beowulf Cluster %page Application Server Cluster Compute nodes used as server nodes Inverts traditional cluster network Server nodes connect to master and Internet Master is firewalled by server nodes %page Application Server Security Highly Secure Server Nodes No network services to exploit No OS password information No local executables Applications "locked" to not migrate from node %page Example Script Script Run on master at start Uses standard *NIX process concepts %font "typewriter", size 2 while true; do NODE=`beomap --no-local --np 1` bpsh $NODE appserver logger -t appserver Exited with status $? done %page Using Beowulf Process Space (BProc) Calls Cluster applications can be very simple. Basic call is bproc_move(), bproc_rfork() Remote move or fork semantics Takes a numeric destination node ID. Available node ID may be found from the NPR or beomap library Resulting processes controlled with *NIX interface See 'modprobe' for a great example Reads dependency file from the master Reads kernel symbols from the slave Reads driver module from the master Loads module into slave kernel %page Pragmatic Issues Scyld is a complete supported OS distribution Ships as installation/upgrade CDs Provides isolation from unexpected "upgrade" changes Allows delivering a real cluster OS Automated installation part of consistent deploys Avoid system changes with re-install Don't confuse installation time with learning! %page Implications of Single System Image Single System Image and ad hoc installations are fundamentally at odds One kernel over cluster Consider a Filesystem or NIC driver update One set of utilities over the cluster Node specialization not a conflict with this principle Selecting an optimized kernel is not automatic Per-machine library and kernel optimizations problematic Breaks singles point, single file updates Breaks application portability and repeatability %page Node Specialization Specialization allowed by Specific machine (MAC address) Position in cluster (node number) Hardware resources Heterogeneity support Range of support hardware is a common question Instruction set must be the same Cannot run the application otherwise Installed hardware detection is automatic Drivers installed based on e.g. PCI ID %page Operation Issues Dynamically, automatically scalable New nodes assigned a permanent node number Based on MAC address Manual intervention to renumber New nodes take only seconds to provision 750 msec for base system Disk detection and file system mounts extra %page Hardening System Tools All system tools should be "hard" and single layered Don't rely on interpreters: "Perl v5.6.1.33 only" Reserve interpreters for end site use Provide language bindings for system interfaces Single layer implementation Human-oriented text configuration files Trace problems back to the original configuration Generated configuration files are a potential disaster Libraries, shell, command line and GUI interfaces GUIs Provide an efficient monitoring interface Example: Keep vital state mappable shared memory %page Future Even better boot and failure analysis system PXE-based CPU and NIC detection Complete boot state (failure) reporting Environment and kernel fault reporting Multiple master architectures Different structures are possible Automatic detection and configuration needed Integrated process mirroring Extension to existing migration Opportunity with InfiniBand and other RDMA Client pull may increases scalability Reliability trade-off Complete virtual environment creation and mirroring Too inefficient today Extend process migration to environment migration %page Deployment and Support And now the commercial message Training available on-site or scheduled Northrop Grumman hosts training in D.C. area Scyld Beowulf is available on GSA and SEWP Integrated clusters, integration services, professional services Penguin Computing provides standard clusters Most common commercial cluster deployments: AMD Operton w/ gigabit Ethernet or Infiniband Intel Xeon on racks or blades