TN-03_EvaluationApproachesPerformancePortability
================================================

.. meta::
   :description: technical note

   :keywords: T/NA086/20,Code,structure,and,coordination,Report,2047358-TN-03,Evaluation,of,Approaches,to,Performance,Portability,Steven,Wright,,Ben,Dudson,,Peter,Hill,,and,David,Dickinson,University,of,York,Gihan,Mudalige,University,of,Warwick,December,8,,2021,Contents,1,Introduction,1.1,Method,of,Evaluation,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2,Application,Evaluations,2.1,TeaLeaf,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.1.1,Performance,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.1.2,Portability,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.2,miniFE,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.2.1,Performance,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.2.2,Portability,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.3,Laghos,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.3.1,Performance,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.3.2,Portability,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.4,CabanaPIC,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.4.1,Performance,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.5,VPIC,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.5.1,Performance,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.5.2,Portability,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.6,EMPIRE-PIC,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.6.1,Performance,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,2.6.2,Portability,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,3,Conclusions,2,3,6,6,6,7,11,11,12,14,15,15,17,17,18,18,19,21,21,22,26,3.1,Limitations,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,28,References,29,1,1,Introduction,The,focus,of,the,code,structure,and,coordination,work,package,is,to,establish,a,series,of,“best,practices”,on,how,to,develop,simulation,applications,for,Exascale,systems,that,are,able,to,obtain,high,performance,on,each,archite
cture,(i.e.,are,performance,portable),without,signiﬁcant,manual,porting,eﬀorts.,In,the,past,decade,,a,large,number,of,approaches,to,developing,performance,portable,code,have,been,developed.,In,this,report,we,will,begin,to,report,on,our,evaluation,of,some,of,these,approaches,through,the,execution,of,a,small,number,of,mini-applications,that,implement,methods,similar,to,those,likely,to,be,required,in,NEPTUNE.,These,applications,are,detailed,in,report,2047358-TN-02,,but,are,summarised,below,for,convenience:,TeaLeaf,A,ﬁnite,diﬀerence,mini-app,that,solves,the,linear,heat,conduction,equation,on,a,regular,grid,using,a,5-point,stencil1.,miniFE,A,ﬁnite,element,mini-app,,and,part,of,the,Mantevo,benchmark,suite2.,Laghos,A,high-order,curvilinear,ﬁnite,element,scheme,on,an,unstructured,mesh3.,CabanaPIC,A,structured,PIC,code,built,using,the,CoPA,Cabana,library,for,particle-,based,simulations4.,VPIC/VPIC,2.0,A,general,purpose,PIC,code,for,modelling,kinetic,plasmas,in,one,,two,or,three,dimensions,,developed,at,Los,Alamos,National,Laboratory5.,EMPIRE-PIC,An,unstructured,PIC,code,that,uses,the,ﬁnite-element,method.,1http://uk-mac.github.io/TeaLeaf/,2https://github.com/Mantevo/miniFE,3https://github.com/CEED/Laghos,4https://github.com/ECP-copa/CabanaPIC,5https://github.com/lanl/vpic,2,The,selected,applications,broadly,represent,the,algorithms,of,interest,for,the,NEPTUNE,project,and,fall,in,to,two,categories,–,ﬂuid-methods,and,particle-,methods.,Within,the,ﬂuid-method,tranche,,the,applications,are,available,im-,plemented,in,a,wide,range,of,programming,models,,allowing,us,a,good,op-,portunity,to,evaluate,the,eﬀect,of,programming,model,on,the,performance,,and,importantly,the,performance,portability,of,that,particular,approach,to,ap-,plication,development.,There,are,a,relatively,small,number,of,particle-in-cell,mini-applications,available,,and,thus,the,selected,particle-methods,applications,are,only,available,implemented,using,Kokkos.,However,,this,still,allows,us,an,opportunity,to,evaluat
e,the,appropriateness,of,Kokkos,as,a,programming,model,for,performance,portable,application,development.,1.1,Method,of,Evaluation,As,stated,previously,,we,will,evaluate,the,performance,portability,of,these,ap-,plications,using,the,metric,introduced,by,Pennycook,et,al.,[1],,and,use,the,visualisation,techniques,outlined,by,Sewall,et,al.,[2].,The,Pennycook,metric,allows,us,to,calculate,the,performance,portability,of,an,application,according,to,Equation,(1).,H,|,|,1,ei(a,,p),(cid:88)i,H,∈,0,if,i,is,supported,H,i,∀,∈,(1),otherwise,PP(a,,p,,H),=,,,,In,the,equation,,the,performance,portability,(PP),of,an,application,a,,solving,problem,p,,on,a,given,set,of,platforms,H,,is,calculated,by,ﬁnding,the,harmonic,mean,of,an,application’s,performance,eﬃciency,(ei(a,,p)).,The,performance,ef-,ﬁciency,for,each,platform,can,be,calculated,by,comparing,achieved,performance,against,the,best,recorded,(possibly,non-portable),performance,on,each,individ-,ual,target,platform,(i.e.,the,application,eﬃciency,,or,by,comparing,the,achieved,performance,against,the,theoretical,maximum,performance,achievable,on,each,the,architectural,eﬃciency).,Should,the,application,fail,to,run,on,one,of,the,target,platforms,,a,performance,portability,score,of,0,individual,platform,(i.e.,is,awarded.,While,Equation,(1),provides,a,formal,deﬁnition,for,performance,portability,,3,this,single,value,metric,may,not,answer,all,questions,a,developer,might,have,about,their,application.,In,recognising,this,,in,this,report,we,use,two,visuali-,sation,techniques,introduced,by,Sewall,et,al.,[2].,These,visualisations,are,best,explained,with,an,example.,Figure,1,presents,a,simple,synthetic,data,set,for,six,implementations,of,an,appli-,cation,running,across,10,platforms.,These,implementations,are:,unportable,with,high,performance,on,a,single,platform,,but,not,portable,to,any,other,platform;,single,target,with,high,performance,on,a,single,platform,,but,low,performance,on,all,others;,multi,target,achieving,high,performance,on,s
ome,platforms,,and,low,performance,on,others;,inconsistent,showing,a,range,of,performance,across,all,platforms;,and,consistent,showing,consistent,low,(30%),or,high,(70%),performance,across,all,platforms.,Figure,1:,Synthetic,data,set,for,six,implementations,running,across,10,platforms,taken,from,Sewall,et,al.,[2],We,could,simply,apply,the,performance,portability,metric,in,Equation,(1),to,this,synthetic,data,but,this,may,mean,that,we,lose,some,information,about,how,the,performance,portability,is,spread,across,platforms,,and,how,the,metric,changes,as,we,add,and,remove,platforms,from,the,evaluation,set.,Figure,2(a),addresses,this,ﬁrst,concern,,showing,not,only,the,median,eﬃciency,of,an,application,,but,also,the,spread,of,eﬃciencies,(and,any,outliers).,The,second,concern,is,addressed,in,the,cascade,plot,in,Figure,2(b),,where,the,applications,performance,portability,and,eﬃciency,are,plotted,as,platforms,are,added,to,the,4,unportablesingletargetmultitargetconsistent(30%)inconsistentconsistent(70%)ABCDEFGHIJ100%0%0%0%0%0%0%0%0%0%100%10%10%10%10%10%10%10%10%10%100%10%100%10%100%10%100%10%100%10%30%30%30%30%30%30%30%30%30%30%10%20%30%40%50%60%70%80%90%100%70%70%70%70%70%70%70%70%70%70%Higherisbetter20406080100Fig.1:Heatmapofsyntheticapplicationefﬁciencies.adistinctefﬁciencydistributionthatcouldrealisticallyariseduringapplicationdevelopment;takentogether,thesedatasetscanbeusedasalitmustestfortheintuitiongainedfromdifferentmetricsandvisualizations.Webelievethatsyntheticdistributionslikethesecouldserveasimilarroletomathemat-icalbasisfunctionsorclassiﬁersinthespaceofperformanceefﬁciencydistributions,butleavethisinvestigationtofuturework.Adetaileddescriptionofeachdatasetisgivenbelow.Unportable:Applicationsinthisclassarewrittentosupportonlyasingletargetplatform,possiblyusingaproprietaryprogrammingmodelthatexcludessomeclassesofplatforms.Thisisrepresentedbyanefﬁciencyofzeroononeormoreplatforms(wheretheapplicationdoesnotrun).SingleTarget:Applicationsmaybewritteninaportablemodelbutwith
signiﬁcantoptimizationeffortsappliedtoalimitedsetofplatforms.Forexample,ifanapplicationisonlyeverexpectedtorunononeplatform,thatplatformwillbethedevelopmentteam’soptimizationfocus.However,theapplicationremainscapableofachievingnon-zeroefﬁciencies(i.e.itwillruncorrectly)ontheremainingplatforms.Multi-Target:Amulti-targetapplicationsharesmanysim-ilaritieswithonetargetingasingleplatform,butexpandsthetargettoalargersetofplatforms.Forexample,theapplicationmightbeoptimizedspeciﬁcallyforoneclassofarchitectures(suchasGPUs).Theresultisoftenabimodaldistributionofefﬁciencies,withtwodistinctgroups:onehighandonelow.Consistent:Someapplicationsachievesimilarperformanceefﬁcienciesonallplatforms.Thismayresultfromthenatureoftheapplication(e.g.amicro-benchmarkintendedtostressaparticulararchitecturalfeature)orprojectgoals(e.g.adesireforacodetoattainsimilarlyhighlevelsofperformanceacrossallplatforms[17]).Weusetwodistributionstorepresentsuchapplications,separatingcaseswheretheefﬁciencyisconsistentlylow(30%)andhigh(70%),todeterminewhethertheapproachestestedcandifferentiatebetweenthem.Inconsistent:AnapplicationthatprioritizesportabilityoverTABLEII:“Averages”ofsyntheticefﬁciencydata.UnportableSingleMultiConsistentInconsistentConsistentTargetTarget(30%)(70%)Minimum0.010.0010.0030.010.0070.0Arith.Mean10.019.0055.0030.055.0070.0Geo.Mean0.012.5931.6230.045.2970.0Har.MeanNaN10.9918.1830.034.1470.0Median0.010.0055.0030.055.0070.0PP0.010.9918.1830.034.1470.0performanceisunlikelytoachieveconsistentlevelsofper-formanceefﬁciency.Suchapplicationsmaybecontinuallyadaptedastheymovebetweenplatformsanddevelopersaddnewfeatures,resultinginwide-rangingresultswithhighvari-ance.Werepresentthisasalinearlyincreasingperformanceefﬁciency,butcouldhavedrawnthesenumbersfromauniformdistribution.III.SINGLENUMBERMETRICSRepresentingtheperformanceportabilityofanapplicationusingasinglenumericvalueishighlydesirable,providingasimplewaytocomparethedistributionsofperformanceefﬁciencyresultsacrossapplications.Whenconfrontedwithas
lewofdata,itistemptingtoreachforfamiliarstatisticalmeasures(e.g.averageandstandarddeviation);Question2andQuestion3maybeanswereddi-rectly,withQuestion1answeredbydeﬁningavalue(orrangeofvalues)representingasufﬁcientlyperformanceportableapplication.However,aswewillsee,commonstatisticaltoolsmaynotprovidetheinsightwe’relookingfor.Itisalsohighlyunlikelythatanysinglenumbercouldanswerallthreeofthesequestionssimultaneously;combiningmultiplemetrics–orusingdifferentmetricsfordifferentanalyses–maybenecessary.A.AveragePerformanceEfﬁciencyTableIIcomparesseveral“averages”computedforoursyn-theticdataset.Inadditiontostandardstatisticalaverages(arithmeticmean,geometricmean,harmonicmean,median),weincludetheminimumvalueandthevalueoftheperfor-manceportabilitymetricfrom[1],[2],calculatedas:PP(a,p,H)=8>><>>:|H|Pi2H1ei(a,p)ifiissupported8i2H0otherwisethatis,theharmonicmeanofapplicationa’sperformanceefﬁciencye(a,p)whenexecutingproblemponasetofplatformsH.PPdiffersfromtheharmonicmeanonlyinthatthelatterisnotdeﬁnediftheinputdatacontainsazero(sinceitwouldrequirethecomputationof1/0).Therankingofthesyntheticdatasetsaccordingtotheperformanceportabilitymetricisthemostalignedwiththeauthors’intuitionofwhichhypotheticalapplicationisthemostperformanceportable(fromleasttomost):1)unportable;,evaluation,set,in,descending,order,of,eﬃciency.,A,more,in-depth,analysis,of,these,visualisation,techniques,can,be,found,in,Sewall,et,al.,[2].,(a),Box,plot,(b),Cascade,plot,Figure,2:,Example,plots,for,the,synthetic,data,provided,in,Figure,1,Where,possible,,performance,data,has,been,taken,from,previously,published,works.,Where,no,data,exists,,the,data,has,been,collected,from,the,UK’s,Tier-,2,platforms,,in,particular,Isambard’s,Multi-Architecture,Comparison,System,(MACS),,ThunderX2,system,and,A64FX,system.,As,many,of,the,applications,,libraries,and,programming,models,used,in,this,re-,port,are,under,active,development,,the,data,presented,here,is,subject,to,change.,New,data,is,being,collected,all,the,time,and,analysed,,and,
will,be,updated,in,the,future,where,necessary.,This,document,should,therefore,be,considered,a,liv-,ing,document,,reﬂecting,the,current,state,of,performance,portable,application,development,focused,on,applications,of,interest,for,the,simulation,of,plasma,physics.,5,Fig.2:Boxplotsofsyntheticperformancedata.togetherandcompared;andissimpleenoughfordeveloperstointuittheperformanceportabilityofanapplicationataglance.Suchvisualizationswouldenabledeveloperstoquickly(albeitsubjectively)answerQuestion1,Question2andQuestion3.A.BoxPlotsBoxplotsareacommon,well-understoodﬁgureshowingthespreadofdataaroundthemedian,andareanobviouscandidateforsummarizingthedistributionofperformanceefﬁciencydata.Thegraphconsistsofaboxformedbythelowerandupperquartiles,whichisdividedbythemedian.Manysoftwarepackagesproduceboxplotswithwhiskersat1.5timestheinterquartilerangefromtheboxedge,andplotoutliersbeyondthisrangeascircles.Figure2showsboxplotsforthesyntheticdataintroducedinSectionII-C.Thisdataisintendedtostresstesttheapproaches,andwecanclearlyseethattheboxplotsfailtoshowusefulinformationformanyofthesedatasets.Inthemulti-targetcase,thefactthatthedatahastwoclustersisnotrepresentedinanyway.Thelargeboxshowsthatthedataisspreadfarfromthemedian,butdoesn’tprovideinsightintotheamountofdataaroundthesepoints.Likewise,theinconsistentdatashowsafairlylargeboxandasimilarmedianvaluetothemulti-targetdata,aswefoundinTableII.Thedataisevenlyspreadthroughouttheentirerange,butthisisnotrepresented.Theconsistentdatasetsdonotutilizethevisiblespaceonthegraphwell,butthelackofvisibleboxesconveysthatthedataishighlyclusteredaroundthemedian.Additionally,thedifferenceinabsoluteperformancebetweenthetwoconsistentdatasets(30%and70%)isclearlyrepresented.Fortheunportableandsingletargetdatasets,thelackofboxesreﬂectstheclusteringaroundthelowperformanceefﬁciencyvalues.Thesingleplatformswithhighperformanceefﬁciencyarerepresentedasoutliers,reﬂectingthattheseresultsarenotcharacteristicofthisapplication–however,itisimportanttonotethatthedecisi
ontolabeltheseresultsasoutliersisunderusercontrol,andthereforesubjecttoabuse.Figure3showsboxplotsforthereal-worldapplicationsdescribedinSectionII-C.Eachchartintheﬁgurepertainstoonecode,withdifferentboxplotsforeachapplication(programmingmodel).TheﬁrsttwoboxplotsforBabelStreamshowclearlythatmuchoftheefﬁciencydataisconsistentlyhigh;however,itiseasytomissthatsomeplatformsdidnotrun(representedbytheoutliersatzero),andthenumberofunsupportedplatformsisobscured(bynatureofalloutliersbeingatthesamepoint).Theotherplotsforthiscodedonotyieldmuchinformationastothequalityofperformanceportability;theboxesallcoverthecompleterange[0,100],andweareleftonlywiththemediantomakecomparisons.Manyoftheboxplotsshowndrawthemedianlineatzero:mostoftheefﬁciencyresultsareclassiﬁedasnotportable(i.e.mostapplicationsdidnotrunonmostplatforms).Itisdifﬁculttoseeresultswherethedataisnon-zero.Whenperformanceefﬁcienciesareclusteredaroundthemedian,boxplotsintuitivelyrepresenttheextentofthatclus-tering.However,inmoregeneralcasesitcanbechallengingtounderstandthenumberandeffectofoutliers.Inparticu-lar,bimodaldistributions(likemulti-target)appearseverelydistortedandindistinguishablefromotherdistributions.BoxplotsthereforesufferfrommanyofthesameproblemsasthemetricsdiscussedinSectionIII,anddonotprovideaclearwaytointuitarankingofapplications.B.HistogramsAnotherclassicwaytovisualizethedistributionofdataistoproduceahistogram.Dataaregroupedintocategories(bins)andplottedasabarchartshowingthenumberofitemsineachbin,highlightingwhichbinsarehighlypopulated.Ahistogramalsoshowsallthedatadirectly(albeitsmoothedintocategories),preservingoutliersandintermediatevaluesoccurringbetweenregionsofhighdensity.Inselectingthebins,itisimportanttorememberthemeaningthatwehaveascribedto0%performanceefﬁciency(i.e.thatanapplicationdidnotrunorproducedanincorrectresult).Thisisdistinctfrom(0+✏)%,whichindicatesthatanapplicationrancorrectly,butwithverylowefﬁciency.Assuch,werecommendseparating“didnotrun”resultsintotheirownbin,soastodistinguishthemfr
omlowefﬁciencies.Thisisaspecialcaseofacommonprobleminconstructinghistograms:usingtoofewbinshidesusefulinformation;butusingtoomanybinsdoesnothingtosummarizethedata.Thesigniﬁcanceofbeinginonebinoranotherisalsoopentointerpretation:onemightfeelthatefﬁcienciesof69%and71%areequivalent,yettheseresultsmayfallintodistinctbins.HistogramsforthesyntheticdataareshowninFigure4a.Weshowallthedatasetsonthesamegraphforbrevityalsotoallowdirectcomparisonbetweenthem.Giventhelimitedrangeofthedata,itisimportanttoplotthedifferentdatasetsasindependentbarsside-by-sideonthechart;inpractice,overlayingthemalmostalwaysobscuresdatapoints.Thesehistogramscapturethecharacteristicsofthedatasetseffectively.Thetwoconsistentdatasetsshowstrongpeaksinthebinscorrespondingto30%and70%efﬁciency,andthetwopeaksofthemulti-targetdatasetaresimilarlyintuitive.ThepresenceofmanylowfrequencybinsfortheinconsistentdataFig.7:Efﬁciencycascadeplotforsyntheticdatasets,alongwithplatformchart.minimumefﬁciencyinE0;werecordthisminE0andthecardinalityofH0,thenremoveanyoneplatformamongthosewiththeminimumefﬁciencytoconstructanewsetofplatformsH1withcorrespondingefﬁcienciesE1.Wecontinueforn=|H0|stepsinthisfashionuntilweobtainHn=;.Wethenplot|Hn 1|,|Hn 2|,...,|H0|againstminEn 1,minEn 2,...,minH0(i.e.increasingnumberofplatformsvs.minimumefﬁcienciesamongeachsubset)toobtainavisualizationofhowprecipitouslyanapplication’ssupportforvariousplatformsdegrades.Thisisnecessarilynon-increasing,andsowedesignatetheseplotsasefﬁciencycascadeplots.ItistrivialtocomputePPusingEiforeach|Hi|andtosuperimposethiswiththeefﬁciencycascade.Multipleapplications/problemsmaybeaggregatedontoasingleefﬁciencycascadeplotbywinnowingtheplatformsetsasdescribedaboveindividually;theHiateachtickonthex-axiswillnotnecessarilybethesameacrossapplications,buttheplottedcardinalitiesareshared.Figure7demonstratessuchaplotforthesyntheticdatasets,withsolidanddashedlinesrepresentingtheminimumefﬁciencyandPPvaluesrespectively.Efﬁciencycascadeplotsmaybeeasilyconstructedbyindividuallysorti
ngeachapplication’sefﬁcienciesacrossallplatformsindecreasingorder,thenplottingthempiecewise-linearagainstthesequencenumberoftheplatforminthisordering.Readingaplatformnumberfromthex-axisandconsultingtheplottedefﬁciency(foraspeciﬁcapplication)givesthenumberofplatformswithatleastthatlevelofefﬁciency.Conversely,readingrightfromanefﬁciencyorPPvalueonthey-axistowhereitinterceptsaplottedvaluegivesthenumberofplatformsthathaveanefﬁciencyorPPgreaterthanthechosenyvalue.Becausetheplatformsetsforeachapplicationaresortedseparately,wemustresistthetemptationtodrawanyconclu-sionsaboutspeciﬁccomparativeplatformperformanceacrossapplicationsinefﬁciencycascadeplots.Theexceptionisfortherightmostpointforeachapplication,whichshowstheminimumefﬁciencyandPPcalculatedacrossallplatforms.PlatformChartsItispossibletoincludeinformationabouttheHichosenbyaddingarowofcolor-codedboxesforeachapplicationforeach|Hi|plottedonanefﬁciencycascade;thiswetermaplatformchart.Whereanapplicationdoesnotrunonaplatform,thespaceisleftblank.Duetotheconstructionofefﬁciencycascades,wecanalwaysexpecttheblankareasforanygivenapplicationtobecontiguousandontheright-handsideoftheplot.Whenapplicationefﬁciencyisusedandthedatasetis“closed”(i.e.thepeakefﬁciencyforeachplatformiscontainedintheshowndata,asinthedatausedhere)weexpecttheefﬁ-ciencycascadetobeginwithaseriesofpeakefﬁciencyvalues.Similarly,wecanexpectthattheleftmostappearance(s)ofaplatforminaplatformchartcontainsthepeakefﬁciencyforthatplatformamongthedata.Bynatureoftheharmonicmean,thePPforanapplicationonagivensetofplatformsisneverlowerthantheminimumefﬁciency.Fortheconsistentdatasets,thePPandefﬁciencyasseeninFigure7areidentical.Thebimodalnatureofthesingletargetandmulti-targetdatasetsarealsoreﬂectedintheefﬁciencycascade,withcleartransitionsbetweentwolevelsofsupportmarkedbysharpdropsinefﬁciency.Figure8showsefﬁciencycascadeplotsforthereal-worlddata.Therearenumerousdistinctpatternsthathelptoquicklyassessapplication(i.e.language/framework)behavior.Thenumberofsupportedplatformsism
arkedbyadroptozeroef-ﬁciency.Therearesomeapplicationsthatshowhighefﬁciencyforasubsetofplatformsafterwhichefﬁciencyprecipitouslydrops,reminiscentofthesingletargetdataset.Otherobservationsarenotableforrequiringsubjectiveevaluation.Inalldatasets,OpenMPleadsortiesallotherapplicationsthroughmostoftheplatforms.Insomecases,Kokkossupportsmoreplatforms,orsupportslatterplatformswithhigherefﬁciencythanOpenMP.Foradditionaldiscourseonthissubject,wereferthereadertothestudyofDeakinetal.[17].Itisaninterestingexerciseforthedeveloperorapplicationusertoconsiderwhethertheypreferperformanceorportability;insomecases,itismostimportantthatasmanyplatformsbesupportedaspossible;whileinothercases,thehigherefﬁciencymaybemoredesirable.Bychoosingcolorsintheplatformchartthatconveymean-ingfulgroupings,wecangaininsightintohowindividualapplicationshandledifferenttypesofplatforms.InFigure8,wehavechosendistinctcolorfamiliesforGPUsandCPUs,highlightingthatCUDAsupportsonlyGPUs,whilebothOpenCLandOpenACChaveweaksupportforCPUsinthedatapresented.Itisreasonabletowonderhowthedatapresentedinanefﬁciencycascadeplotcoincides(ordoesnot)withone’sideaofperformanceportability,quantitativeorqualitative.SincePPisfeaturedintheplots,thesequestionsaresimpletoanswer:thehighestpointintherightmostcolumnistheapplicationwiththehighestPPacrossallplatformsinH0.Comparingpointstotheleftcanbemisleading,sincetheHiateachofthesepointsisnotnecessarilythesameacrossapplications.Insightsintothequalitativequestionarereadilyavailablein,2,Application,Evaluations,In,this,section,we,present,performance,data,for,a,number,of,mini-applications,,across,a,range,of,architectural,platforms,,using,a,range,of,diﬀerent,approaches,to,performance,portability.,The,applications,chosen,in,each,case,are,broadly,representative,of,some,of,the,algorithms,of,interest,to,NEPTUNE.,In,particular,,the,ﬂuid-method,based,mini-,apps,implement,algorithms,that,range,from,ﬁnite-diﬀerence,(like,Bout++,[3]),to,high-order,ﬁnite,element,or,spectral,element,(like,Nektar++,[4]).,S
imilarly,,the,particle-methods,mini-apps,all,implement,the,particle-in-cell,method,(like,EPOCH,[5]).,2.1,TeaLeaf,TeaLeaf,is,a,ﬁnite,diﬀerence,mini-app,that,solves,the,linear,heat,conduction,equation,on,a,regular,grid,using,a,5-point,stencil,,developed,as,part,of,the,UK-MAC,(UK,Mini-App,Consortium),project.,It,has,been,used,extensively,in,studying,performance,portability,already,[6,,7,,8,,9],,and,is,available,implemented,using,CUDA,,OpenACC,,OPS,,RAJA,,and,Kokkos,,among,others6.,The,results,in,this,section,are,extracted,from,two,of,these,studies,,namely,one,by,Kirk,et,al.,[7],and,one,by,Deakin,et,al.,[6].,In,both,studies,,the,largest,test,problem,size,(tea,bm,5.in),is,used,,a,4000,grid.,×,4000,2.1.1,Performance,The,study,by,Kirk,et,al.,shows,the,execution,of,8,diﬀerent,implementations/-,conﬁgurations,of,TeaLeaf,across,3,platforms,,a,dual,Intel,Broadwell,system,,an,Intel,KNL,system,and,an,NVIDIA,P100,system.,The,runtime,for,each,imple-,mentation/conﬁguration,is,presented,in,Figure,37.,Note,that,in,the,study,,some,results,are,missing,due,to,incompatibility,(e.g.,CUDA,on,Broadwell/KNL).,6http://uk-mac.github.io/TeaLeaf/,7Hybrid,represents,the,best,performing,conﬁguration,of,a,MPI/OpenMP,hybrid,execution,6,1,500,1,000,500,),s,(,e,m,i,t,n,u,R,Broadwell,KNL,Platform,MPI,Hybrid,CUDA,OpenMP,OpenACC,OPS,Kokkos,RAJA,P100,Figure,3:,TeaLeaf,runtime,data,from,Kirk,et,al.,[7],The,study,by,Deakin,et,al.,is,more,recent,,using,a,C-based,implementation,of,TeaLeaf,as,its,base.,It,consequently,evaluates,fewer,programming,models,,but,over,a,wider,range,of,hardware,,including,a,dual,Intel,Skylake,system,,both,NVIDIA,P100,and,V100,systems,,AMDs,Naples,CPU,,and,the,Arm-based,ThunderX2,platform.,Runtime,results,are,provided,in,Figure,4.,2.1.2,Portability,Both,studies,evaluate,some,portable,and,non-portable,implementations.,In,most,cases,,there,is,a,non-portable,implementation,that,achieves,the,lowest,runtime,,however,this,places,a,restriction,on,the,hardware,that,it,can,target.,For,study,by,
Kirk,et,al.,[7],,Figures,5,and,6,allow,us,to,visualise,the,per-,formance,portability,of,each,approach,to,application,development.,Figure,5,shows,a,clear,divide,between,the,non-portable,approaches,(CUDA,,OpenMP,,MPI,,Hybrid,and,OpenACC),,and,the,portable,approaches,(Kokkos,,OPS,and,RAJA),,whereby,each,of,the,non-portable,approaches,span,the,full,range,from,0.0,eﬃciency,up,to,1.0,eﬃciency,,while,the,three,portable,approaches,each,span,7,),s,(,e,m,i,t,n,u,R,800,600,400,200,Skylake,Naples,Power9,TX2,KNL,P100,V100,Platform,CUDA,OpenMP,OpenACC,Kokkos,Figure,4:,TeaLeaf,runtime,data,from,Deakin,et,al.,[6],a,much,smaller,range,of,eﬃciencies.,The,cascade,plot,in,Figure,6,better,shows,how,the,performance,portability,of,each,implementation,changes,as,new,platforms,are,added,to,the,evaluation,set.,Almost,all,approaches,(except,OpenMP),achieve,more,than,80%,application,eﬃciency,on,at,least,one,platform,,and,in,the,case,of,RAJA,and,OPS,,per-,formance,above,60%,application,eﬃciency,is,maintained,across,the,three,plat-,forms.,Referring,back,to,Figure,3,,we,can,see,that,on,the,Intel,KNL,system,,the,Kokkos,performance,is,double,that,of,other,performance,portable,approaches,,and,thus,skews,its,portability,calculation.,It,is,likely,that,this,is,the,result,an,unidentiﬁed,issue,in,TeaLeaf,or,Kokkos,at,the,time,of,evaluation.,Otherwise,,these,three,programming,models,each,achieve,similar,levels,of,performance,and,,importantly,,portability,across,diﬀerent,architectures.,Figures,7,and,8,show,the,same,visualisations,for,the,data,from,Deakin,et,al.,[6].,Again,,the,non-portable,programming,model,(CUDA),achieves,the,highest,per-,formance,on,its,target,architecture.,For,CPU,architectures,OpenMP,produces,the,highest,result,,and,using,oﬄoad,directives,,portability,is,available,to,GPU,devices.,It,should,be,noted,that,to,support,the,use,of,GPU,devices,,there,are,8,Figure,5:,Box,plot,visualisation,of,performance,portability,from,Kirk,et,al.,[7],Figure,6:,Cascade,visualisation,of,performance,portabilit
y,from,Kirk,et,al.,[7],9,CUDAOpenMPMPIHybridOpenACCKokkosOPSRAJA0.00.20.40.60.81.0Efficiency123#,of,platforms0.00.20.40.60.81.0App,PP,(dashed)/efficiency,(solid)CUDA,eff.CUDA,PPOpenMP,eff.OpenMP,PPMPI,eff.MPI,PPHybrid,eff.Hybrid,PPOpenACC,eff.OpenACC,PPKokkos,eff.Kokkos,PPOPS,eff.OPS,PPRAJA,eff.RAJA,PPBroadwellKNLP100,Figure,7:,Box,plot,visualisation,of,performance,portability,from,Deakin,et,al.,[6],Figure,8:,Cascade,visualisation,of,performance,portability,from,Deakin,et,al.,[6],10,CUDAOpenACCKokkosOpenMP0.00.20.40.60.81.0Efficiency1234567#,of,platforms0.00.20.40.60.81.0App,PP,(dashed)/efficiency,(solid)CUDA,eff.CUDA,PPOpenACC,eff.OpenACC,PPKokkos,eff.Kokkos,PPOpenMP,eff.OpenMP,PPSkylakeNaplesPower9TX2KNLP100V100,two,OpenMP,implementations,that,must,be,maintained,(with,and,without,of-,ﬂoad,directives),,though,these,results,are,presented,together,here.,Much,like,in,the,previous,study,,the,performance,portability,of,Kokkos,is,aﬀected,by,an,anomalous,result,on,the,Intel,KNL,platform.,2.2,miniFE,miniFE,is,a,ﬁnite,element,mini-app,,and,part,of,the,Mantevo,benchmark,suite,[10,,11,,12,,13].,It,implements,an,unstructured,implicit,ﬁnite,element,method,and,has,versions,available,in,CUDA,,Kokkos,,OpenMP,(3.0+,and,4.5+),and,SYCL8.,While,there,are,a,number,of,data,sources,for,miniFE,data,,many,of,these,are,limited,in,scope,,and,so,to,ensure,consistency,,all,data,presented,in,this,section,has,been,newly,gathered.,In,all,cases,,a,256,256,256,problem,size,has,been,used,,and,all,runs,have,been,conducted,on,the,platforms,available,on,Isambard.,×,×,2.2.1,Performance,The,raw,runtime,results,for,these,runs,can,be,seen,in,Figure,9.,In,many,of,the,miniFE,ports,available,,only,the,conjugate,solver,has,been,parallelised,eﬀec-,tively,,so,the,results,presented,here,represent,only,the,timing,from,this,kernel.,It,should,be,noted,that,the,SYCL,data,is,gathered,from,a,miniFE,port,that,can,be,found,as,part,of,the,oneAPI-DirectProgramming,github,repository9;,this,port,has,been,generated,using,Inte
l’s,DPC++,Compatibility,tool,,which,translates,CUDA,to,DPC++,,and,is,compiled,using,hipSYCL,and,GCC.,Results,have,not,yet,been,collected,for,the,ARM-based,system,with,SYCL,,due,to,the,unavail-,ability,of,an,appropriate,compiler.,The,OpenMP,with,oﬄoad,variant,of,miniFE,runs,successfully,on,both,AMD,Rome,and,Cavium,ThunderX2,platforms,,but,the,runtimes,are,several,orders,of,magnitude,greater,than,all,other,platforms,(likely,due,to,an,bug,in,the,compiled,code),,and,so,have,been,removed.,Figure,9,shows,that,SYCL,performance,on,the,KNL,and,Rome,platforms,is,far,8https://github.com/Mantevo/miniFE,9https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/miniFE-sycl,11,80,60,40,20,),s,(,e,m,i,t,n,u,R,CSL,KNL,Rome,TX2,Platform,A64FX,P100,V100,MPI,OpenMP,w/,Oﬄoad,CUDA,SYCL,Kokkos,OpenMP,Figure,9:,miniFE,runtime,data,in,excess,of,any,other,execution,(with,the,exception,of,OpenMP,w/,Oﬄoad,on,Rome,which,is,not,shown),,and,on,the,GPU,platforms,the,SYCL,runtime,is,on,par,with,OpenMP,w/,oﬄoad.,This,is,likely,due,to,the,hipSYCL,compiler,generating,OpenMP,w/,Oﬄoad,syntax,for,the,SYCL,code,,and,so,it,is,unsur-,prising,that,performance,is,similar.,Otherwise,,the,fastest,performance,on,most,CPU-based,platforms,comes,from,the,native,MPI,variant,of,miniFE,,and,the,fastest,performance,on,the,GPU-based,platforms,comes,from,CUDA.,2.2.2,Portability,Figures,10,and,11,present,visualisations,of,the,performance,portability,of,miniFE,,through,various,approaches.,The,highest,median,performance,comes,from,the,non-portable,MPI,approach,,since,it,is,the,best,(or,near,best),performing,implementation,on,all,of,the,CPU,platforms;,however,,it,is,not,portable,to,the,two,GPU,systems.,Conversely,,Figure,10,shows,that,CUDA,has,the,worst,lowest,median,performance,,because,12,Figure,10:,Box,plot,visualisation,of,performance,portability,of,miniFE,Figure,11:,Cascade,visualisation,of,performance,portability,of,miniFE,13,CUDAMPIOpenMPOpenMP,w/,OffloadSYCLKokkos0.00.20.40.60.81.0Efficiency1234567#,of,platfo
rms0.00.20.40.60.81.0App,PP,(dashed)/efficiency,(solid)CUDA,eff.CUDA,PPMPI,eff.MPI,PPOpenMP,eff.OpenMP,PPOpenMP,w/,Offload,eff.OpenMP,w/,Offload,PPSYCL,eff.SYCL,PPKokkos,eff.Kokkos,PPCascadeLakeKNLRomeVoltaPascalThunderX2A64FX,it,only,runs,on,the,two,GPU,systems,,but,is,the,best,performing,on,each.,The,boxplots,for,both,OpenMP,w/,Oﬄoad,and,SYCL,are,very,similar,,and,this,is,likely,an,artefact,of,SYCL,being,translated,to,OpenMP,w/,Oﬄoad,at,compile-time,by,hipSYCL.,The,only,programming,model,to,run,across,all,plat-,forms,currently,is,Kokkos,,but,on,some,platforms,this,may,mean,sacraﬁcing,a,signiﬁcant,proportion,of,performance.,Figure,11,better,shows,how,the,performance,portability,of,miniFE,evolves,as,more,platforms,are,added,for,each,programming,model.,For,CUDA,,MPI,,OpenMP,and,Kokkos,,there,are,at,least,two,platforms,where,they,achieve,over,80%,eﬃciency,,and,in,the,case,of,MPI,and,OpenMP,,this,eﬃ-,ciency,holds,up,until,we,reach,the,GPU,platforms,,while,CUDA,does,the,inverse,of,showing,the,best,eﬃciency,on,the,GPU,platforms.,SYCL,and,OpenMP,w/,oﬄoad,oﬀer,poor,performance,in,our,current,data,,and,hence,achieve,less,than,40%,of,peak,application,performance,across,all,platforms;,this,is,likely,due,to,the,use,of,the,hipSYCL,compiler,and,lack,of,platform,speciﬁc,optimisations.,As,Kokkos,is,the,only,programming,model,we,have,full,data,for,,it,is,the,only,pro-,gramming,model,that,spans,all,platforms;,however,,the,performance,eﬃciency,decreases,as,more,platforms,are,added,to,the,evaluation,set.,While,it,is,clearly,a,portable,approach,,it,is,not,clear,whether,it,is,performance,portable,at,this,time.,2.3,Laghos,Laghos,is,a,mini-app,that,is,part,of,the,ECP,Proxy,Applications,suite,[14,,15,,13].,It,implements,a,high-order,curvilinear,ﬁnite,element,scheme,on,an,unstruc-,tured,mesh.,The,majority,of,the,computation,is,performed,by,the,HYPRE,and,MFEM,libraries,,and,can,thus,use,any,programming,model,that,is,available,for,these,libraries10.,The,results,presented,below,have,all,been,c
ollected,from,the,Isambard,platform.,Footnote,10:,https://github.com/CEED/Laghos,2.3.1,Performance,Figure,12,shows,the,runtime,for,Laghos,,running,problem,#1,(Sedov,blast,wave),,in,three,dimensions,,up,to,1.0,second,of,simulated,time,,using,partial,assembly,(i.e.,,./laghos,-p,1,-dim,3,-rs,2,-tf,1.0,-pa,-f).,Across,the,six,platforms,evaluated,,RAJA,performance,is,typically,in,line,with,the,fastest,non-portable,approach,(MPI,and,CUDA).,Since,the,parallelisation,in,Laghos,is,in,the,MFEM,and,HYPRE,shared,libraries,,which,were,developed,at,LLNL,alongside,RAJA,,it,is,perhaps,not,surprising,that,these,routines,are,well,optimised,in,RAJA.,Figure,12:,Laghos,runtime,data,[plot:,runtime,(s),per,platform,(CSL,,KNL,,Rome,,A64FX,,P100,,V100),for,MPI,,CUDA,,RAJA,and,OpenMP],2.3.2,Portability,Portability,visualisations,of,each,implementation,of,Laghos,are,provided,in,Figures,13,and,14.,Figure,13,demonstrates,the,remarkable,efficiency,of,the,RAJA,MFEM,and,HYPRE,implementations,,showing,consistently,above,80%,performance,efficiency.,Figure,13:,Box,plot,visualisation,of,performance,portability,of,Laghos,Figure,14:,Cascade,visualisation,of,performance,portability,of,Laghos,[series:,CUDA,,MPI,,OpenMP,,RAJA;,platform,order,CascadeLake,,KNL,,Rome,,A64FX,,Pascal,,Volta],In,contrast,to,some,of,our,previous,results,,OpenMP,performs,poorly,across,most,platforms,(except,KNL).,The,difference,between,OpenMP,and,RAJA,on,the,CPU,platforms,suggests,that,either,the,RAJA,parallelisation,on,these,systems,is,achieved,through,SIMD,and,Threading,Building,Blocks,(TBB),,or,that,there,are,performance,issues,in,the,OpenMP,implementation.,On,the,GPU,platforms,,CUDA,does,marginally,outperform,RAJA,,but,this,is,perhaps,to,be,expected,,given,the,potential,overhead,in,using,a,third-party,performance,library.,2.4,CabanaPIC,CabanaPIC,is,a,structured,PIC,demonstrator,application,built,using,the,CoPA/Caba
na,library,for,particle-based,simulations,[13].,CoPA/Cabana,provides,algorithms,and,data,structures,for,particle,data,,while,the,remainder,of,the,application,is,built,using,Kokkos,as,its,programming,model,for,on-node,parallelism,and,GPU,use,,and,MPI,for,off-node,parallelism,(footnote,11).,2.4.1,Performance,Since,there,is,only,a,single,implementation,of,CabanaPIC,,it,is,not,possible,for,us,to,evaluate,how,the,programming,model,affects,its,performance,portability;,however,,we,can,show,how,the,performance,changes,between,architectures.,Figure,15,shows,the,achieved,runtime,for,CabanaPIC,across,four,of,Isambard’s,platforms,,running,a,simple,1D,2-stream,problem,with,6.4,million,particles.,Approximately,equivalent,performance,can,be,seen,on,the,CascadeLake,,Rome,and,V100,systems.,As,with,our,TeaLeaf,Kokkos,results,,the,runtime,on,KNL,is,significantly,worse,than,expected,,possibly,indicating,a,Kokkos,bug,or,a,configuration,issue.,Otherwise,,performance,is,similar,on,all,platforms,in,terms,of,the,raw,runtime.,Given,the,significantly,higher,peak,performance,of,the,NVIDIA,V100,system,,it,is,perhaps,surprising,that,its,performance,is,not,significantly,better.,This,may,be,due,to,serialisation,caused,by,atomics,,or,significant,data,movement,between,the,host,and,the,accelerator;,further,investigation,is,necessary,to,identify,the,cause,of,this,loss,of,efficiency.,Footnote,11:,https://github.com/ECP-copa/CabanaPIC,Figure,15:,CabanaPIC,data,[plot:,runtime,(s),on,CSL,,KNL,,Rome,and,V100,for,Kokkos],2.5,VPIC,Vector,Particle-in-Cell,(VPIC),is,a,general-purpose,PIC,code,for,modelling,kinetic,plasmas,in,one,,two,or,three,dimensions,,developed,at,Los,Alamos,National,Laboratory,[16].,VPIC,is,parallelised,on-core,using,vector,intrinsics,and,on-node,through,a,choice,of,pthreads,or,OpenMP.,It,can,additionally,be,executed,across,a,cluster,using,MPI,(footnote,12).,Recently,,VPIC,2.0,[17],has,been,developed,,adding,support,for,heterogeneity,by,using,Kokkos,to,optimise,the,data,layout,and,allow,execution,on,accelerator,devices.,2
.5.1,Performance,Figure,16,shows,the,runtime,for,the,three,variants,of,the,VPIC,code,running,on,seven,platforms,(footnote,13).,This,data,is,taken,from,the,VPIC,2.0,study,,comparing,the,non-vectorised,,vectorised,and,Kokkos,variants,of,the,VPIC,code.,In,each,case,,the,runtime,is,the,time,taken,for,500,time,steps,,with,66,million,particles.,Footnote,12:,https://github.com/lanl/vpic,Footnote,13:,https://globalcomputing.group/assets/pdf/sc19/SC19_flier_VPIC.pptx.pdf,Figure,16:,VPIC,runtime,data,from,Bird,et,al.,[17],[plot:,runtime,(s),on,Skylake,,KNL,,TX2,,Naples,,Rome,,Power9,and,V100,for,the,Original,,Kokkos,and,SIMD,variants],In,Figure,16,we,can,observe,that,the,SIMD,vectorised,implementations,are,always,the,fastest,on,each,platform;,however,,it,should,be,noted,that,each,of,these,is,hand-optimised,for,an,individual,instruction,set,(i.e.,every,implementation,is,platform,specific).,This,means,that,,alongside,the,additional,coding,effort,of,writing,an,implementation,for,each,platform,,potential,additions,or,fixes,must,also,be,applied,to,each,implementation,individually,,harming,not,only,the,performance,portability,,but,also,the,productivity.,While,the,Kokkos,implementation,is,typically,the,slowest,on,each,platform,,performance,is,usually,in,line,with,the,unvectorised,original,VPIC,application,,suggesting,that,the,slowdown,is,caused,by,the,inability,of,the,compiler,to,autovectorise.,2.5.2,Portability,In,terms,of,the,performance,portability,of,VPIC,,we,can,see,that,the,original,and,vectorised,variants,are,only,viable,on,the,CPU,architectures.,Figures,17,and,18,visualise,how,the,performance,portability,varies,as,more,platforms,are,evaluated.,Figure,17:,Box,plot,visualisation,of,performance,portability,of,VPIC,Figure,18:,Cascade,visualisation,of,performance,portability,of,VPIC,[series:,SIMD,,Ref,,Kokkos;,platform,order,Skylake,,KNL,,TX2,,Naples,,Rome,,Power9,,V100],The,highest,performance,on,each,of,t
he,CPU,platforms,comes,from,the,vectorised,variant,of,VPIC,(except,on,the,ThunderX2,,where,no,data,is,provided).,However,,as,Figure,17,shows,,when,evaluating,the,entire,set,of,platforms,its,performance,portability,would,be,0,,due,to,non-execution,on,the,V100,platform.,Figure,18,shows,that,while,Kokkos,performs,worse,than,the,vectorised,implementation,,its,performance,is,similar,to,the,non-vectorised,variant,,and,it,is,also,capable,of,execution,on,the,V100,platform.,It,should,be,noted,that,this,data,is,from,a,study,based,on,the,initial,implementation,of,VPIC,using,Kokkos.,It,is,likely,that,these,performance,figures,will,be,improved,in,future,,potentially,closing,the,performance,gap,to,the,vectorised,implementation,,while,maintaining,portability,to,heterogeneous,architectures.,2.6,EMPIRE-PIC,EMPIRE-PIC,is,the,particle-in-cell,solver,central,to,the,ElectroMagnetic,Plasma,In,Realistic,Environments,(EMPIRE),project,[18].,It,solves,Maxwell’s,equations,on,an,unstructured,grid,using,a,finite-element,method,,and,implements,the,Boris,push,for,particle,movement.,EMPIRE-PIC,makes,extensive,use,of,the,Trilinos,library,,and,consequently,uses,Kokkos,as,its,parallel,programming,model,[19,,20].,2.6.1,Performance,The,EMPIRE-PIC,application,is,export,controlled,,and,thus,the,results,in,this,section,come,from,the,study,by,Bettencourt,et,al.,[19],,looking,specifically,at,the,particle,kernels,within,EMPIRE-PIC.,Figure,19,shows,the,runtime,of,the,Accelerate,,Weight,Fields,,Move,and,Sort,kernels,within,EMPIRE-PIC,for,an,electromagnetic,problem,with,16,million,particles,(8,million,H+,,8,million,e-).,The,geometry,for,this,problem,is,the,tetrahedral,mesh,shown,in,Figure,7,of,Bettencourt,et,al.,[19].,Figure,19:,EMPIRE-PIC,runtime,data,[plot:,runtime,(s),of,the,Accelerate,,Weight,Fields,,Move,and,Sort,kernels,on,BDW,,CSL,,KNL,,TX2,,P100,and,V100],2.6.2,Portability,While,there,is,only,a,single,programming,model,implementation,of,EMPIRE-PIC,,we,can,use,th
e,equations,given,in,Table,2,of,Bettencourt,et,al.,[19],to,calculate,the,FLOP/s,achieved,and,compare,this,to,each,machine’s,maximum,floating-point,performance,,thus,calculating,the,architectural,efficiency.,The,equations,presented,assume,the,best-case,performance,,whereby,particles,are,evenly,distributed,across,the,domain,,there,is,no,particle,migration,throughout,the,simulation,,and,the,particles,are,sorted,at,the,start,of,the,simulation.,Nevertheless,,they,provide,a,useful,opportunity,to,analyse,the,performance,portability,of,Kokkos,for,particle-based,kernels.,Figures,20,and,21,provide,visualisations,of,EMPIRE-PIC’s,performance,portability,across,six,platforms,(footnote,14).,It,is,important,to,note,that,although,Figure,20,shows,very,low,efficiency,,this,is,relative,to,each,platform’s,peak,performance,,where,a,vectorised,fused,multiply-add,instruction,must,be,executed,each,clock,cycle.,Achieving,less,than,10%,of,this,peak,performance,is,not,unusual,for,a,real,application.,In,the,case,of,the,Sort,kernel,,the,efficiency,is,lower,still,,as,this,is,not,a,kernel,that,is,bound,by,floating-point,performance.,Footnote,14:,Please,note,that,the,y-axis,in,each,of,these,figures,has,been,scaled,,since,the,architectural,efficiency,is,very,low.,Figure,20:,Box,plot,visualisation,of,performance,portability,for,four,particle,kernels,in,EMPIRE-PIC,Figure,21:,Cascade,visualisation,of,performance,portability,for,four,particle,kernels,in,EMPIRE-PIC,[series:,Sort,,Weight,Fields,,Accelerate,,Move;,platform,order,BDW,,CSL,,KNL,,TX2,,P100,,V100],What,is,clear,from,Figures,20,and,21,is,that,the,variance,in,achieved,efficiency,between,platforms,is,not,large,,indicating,that,Kokkos,is,able,to,achieve,a,similar,portion,of,the,available,performance,for,EMPIRE-PIC’s,particle,kernels,on,each,platform.,Achieved,efficiency,is,higher,on,the,ThunderX2,and,Broadwell,sys
tems,,due,to,less,reliance,on,well-vectorised,code,,and,a,lower,available,peak,performance.,The,data,suggests,that,EMPIRE-PIC,is,not,able,to,fully,exploit,the,on-core,parallelism,available,through,vectorisation.,Figure,22,shows,roofline,models,for,four,of,these,platforms,,with,the,four,particle,kernels,plotted,according,to,their,arithmetic,intensity,and,achieved,FLOP/s.,In,all,cases,,we,can,see,that,the,application,is,not,successfully,using,vectorisation,(and,this,is,confirmed,by,compiler,reports).,As,stated,in,Bettencourt,et,al.,[19],,the,control,flow,required,to,handle,particles,crossing,element,boundaries,leads,to,warp,divergence,on,GPUs,and,makes,achieving,vectorisation,difficult,on,CPUs.,Nonetheless,,on,the,Cascade,Lake,and,ThunderX2,platforms,,we,are,within,an,order,of,magnitude,of,the,non-vectorised,peak,performance,for,the,three,main,kernels,,and,the,Sort,kernel,(with,low,arithmetic,intensity),is,heavily,affected,by,main,memory,bandwidth.,For,the,two,many-core,architectures,(KNL,and,V100),,floating-point,performance,is,further,from,the,peak,,and,the,performance,of,each,kernel,is,further,hindered,by,the,DRAM/HBM,bandwidth.,Roofline,analyses,,like,Figure,22,,are,effective,at,demonstrating,how,vital,to,performance,it,is,to,balance,efficient,memory,accesses,with,arithmetic,intensity.,This,is,especially,important,in,PIC,codes,,where,some,of,the,kernels,are,relatively,low,in,arithmetic,intensity,when,compared,to,the,number,of,bytes,that,need,to,be,moved,to,and,from,main,memory,(e.g.,the,Boris,push,algorithm,requires,many,data,accesses,,but,performs,relatively,few,mathematical,operations).,An,alternative,approach,to,the,FEM-PIC,method,has,been,explored,using,EMPIRE-PIC,by,Brown,et,al.,[20],,whereby,complex,particle,shapes,are,supported,using,virtual,particles,based,on,quadrature,rules.,Using,virtual,particles,in,this,manner,increases,the,arithmetic,intensity,of,particle,kernels,without,requiring,significantly,more,data,to,be,moved,to,and,from,main,memory.,
Figure,22:,Roofline,plots,on,four,platforms,,gathered,using,the,Empirical,Roofline,Toolkit,[21],[panels:,(a),Cascade,Lake,,(b),ThunderX2,,(c),Knights,Landing,,(d),V100;,axes:,GFLOP/sec,against,FLOPs/Byte;,kernels,plotted:,Move,,Accelerate,,Weight,Fields,,Sort],3,Conclusions,This,report,serves,as,a,living,document,of,the,performance,of,applications,that,implement,algorithms,of,interest,to,the,NEPTUNE,project.,For,each,of,the,applications,in,this,report,,there,are,typically,a,number,of,alternative,implementations,,solving,the,same,algorithm,but,using,a,different,parallel,programming,model.,This,allows,us,an,opportunity,to,assess,these,programming,models,and,their,appropriateness,for,the,NEPTUNE,project,,with,the,goal,of,creating,a,set,of,best,practices,for,developing,plasma,physics,applications,that,are,both,performant,and,portable.,The,results,presented,in,the,previous,section,show,that,in,many,cases,,OpenMP,and/or,MPI,provide,the,best,performance,on,CPU,platforms,,while,CUDA,typically,provides,the,best,performance,on,NVIDIA,GPUs.,However,,these,programming,models,significantly,affect,the,portability,of,these,applications,,with,the,former,unable,to,use,accelerators,,and,the,latter,unable,to,use,host,platforms.,Developing,an,application,that,can,exploit,all,available,parallelism,that,is,lik
ely,to,be,present,on,post-Exascale,systems,would,therefore,require,developers,to,maintain,multiple,implementations,of,a,code,,potentially,one,for,each,class/generation,of,host,or,accelerator,platforms.,For,fluid,codes,,there,are,a,number,of,domain-specific,languages,(DSLs),that,provide,abstractions,for,grid-based,algorithms.,OPS,is,one,such,DSL,targeted,at,structured,mesh,applications,,and,capable,of,code,generation,targeting,MPI,,OpenMP,,OpenACC,,CUDA,and,HIP.,Our,study,with,TeaLeaf,shows,that,it,is,able,to,provide,performance,that,in,many,cases,is,on,par,with,native,OpenMP,and,MPI,,and,within,2×,of,native,CUDA,performance,on,a,P100.,However,,such,DSLs,often,reduce,the,flexibility,afforded,to,a,developer.,Besides,code,generation,from,a,higher-level,abstraction,,GPUs,can,be,targeted,using,pragma-based,language,extensions,such,as,OpenMP,4.5,and,OpenACC.,Both,offer,similar,functionality,,but,only,OpenMP,4.5,allows,portability,between,accelerator,and,non-accelerator,platforms.,However,,our,evaluation,has,shown,that,although,OpenMP,4.5,allows,us,to,target,GPUs,,different,pragmas,are,often,required,to,achieve,sufficient,performance,on,accelerators,than,on,host,systems,,meaning,that,multiple,implementations,would,likely,need,to,be,maintained.,This,is,well,demonstrated,by,our,miniFE,results,,where,the,OpenMP,with,offload,code,does,successfully,execute,on,the,CPU,architectures,but,offers,significantly,worse,performance,than,OpenMP,itself.,The,template,libraries,Kokkos,and,RAJA,are,both,capable,of,providing,full,portability,across,all,architectures,,and,in,most,cases,offer,good,performance.,The,significant,exception,in,our,results,is,the,Intel,Knights,Landing,platform,,where,Kokkos,performance,is,typically,poor.,This,performance,gap,is,likely,the,result,of,a,bug,or,memory,configuration,issue,,but,will,not,be,investigated,further,due,to,the,discontinuation,of,the,KNL,architecture.,Regardless,,where,we,are,able,to,compare,Kokkos,or,RAJA,to,a,native,programming,model,,th
ey,are,typically,able,to,achieve,a,runtime,that,is,no,more,than,20%,greater,than,that,of,the,native,programming,model,on,CPUs,,and,no,more,than,50%,greater,on,GPUs,,but,from,a,single,code,base.,Another,approach,that,is,gaining,traction,is,that,of,SYCL/DPC++.,In,our,current,benchmark,set,,only,a,single,application,(miniFE),has,a,SYCL,implementation,,and,that,implementation,has,been,generated,using,Intel’s,DPC++,Compatibility,Toolkit.,The,resulting,application,is,portable,across,platforms,but,in,most,cases,has,performance,that,is,only,slightly,better,than,the,available,OpenMP,4.5,implementation.,This,warrants,additional,exploration,to,account,for,this,performance,difference;,for,such,an,immature,programming,model,,it,is,likely,that,the,choice,of,compiler,and,some,very,simple,optimisations,will,bring,performance,more,in,line,with,other,approaches,to,portability.,As,this,project,progresses,,hopefully,more,applications,will,be,available,for,evaluation,,and,compiler,support,will,evolve.,The,particle,methods,tranche,of,applications,is,predominantly,available,using,Kokkos,as,a,parallel,programming,model.,This,does,allow,portable,execution,across,all,available,platforms,,but,makes,it,difficult,to,compare,performance,against,native,implementations.,In,the,case,of,VPIC,,we,can,see,that,Kokkos,provides,performance,that,is,in,line,with,the,original,,unvectorised,implementation,on,all,platforms,,and,allows,us,to,extend,our,platform,set,to,include,GPU,devices.,However,,the,greatest,performance,comes,from,using,non-portable,vector,intrinsics,,which,in,this,case,means,maintaining,an,implementation,for,each,set,of,vector,instructions,(e.g.,SSE,,AVX,,AVX2,,Altivec).,3.1,Limitations,The,work,presented,in,this,report,represents,our,initial,evaluation,of,approaches,to,performance,portability.,We,intend,this,document,to,be,continually,updated,as,new,data,becomes,available,,and,as,applications,and,implementations,are,developed.,Curre
ntly,,the,data,in,this,report,has,a,few,limitations,that,we,aim,to,rectify,in,future.,Firstly,,due,to,its,immaturity,relative,to,other,approaches,,there,is,a,lack,of,relevant,fluid,and,particle-in-cell,applications,available,that,use,the,SYCL/DPC++,programming,model.,This,means,that,,with,the,exception,of,miniFE,,it,is,difficult,to,assess,its,appropriateness,as,an,approach,to,performance,portable,application,development.,A,recent,study,by,Reguly,et,al.,has,shown,that,for,a,computational,fluid,dynamics,application,SYCL,may,be,able,to,achieve,comparable,performance,,though,this,may,require,different,code,paths,for,different,hardware,[22].,Secondly,,the,PIC,codes,assessed,in,this,report,all,use,the,Kokkos,programming,model.,Again,,this,limits,our,ability,to,reason,about,the,appropriateness,of,this,approach,for,PIC,codes,,but,we,can,use,the,VPIC,data,to,show,that,while,Kokkos,cannot,match,native,,hand-vectorised,performance,,it,can,provide,performance,that,is,similar,to,the,original,implementation,,and,can,be,extended,to,heterogeneous,architectures.,Finally,,we,have,not,currently,evaluated,performance,on,any,AMD,Radeon,Instinct,or,Intel,Xe,hardware,,due,to,the,limited,availability,of,test,platforms.,We,aim,to,add,these,platforms,in,the,near,future,,when,available,,either,through,the,COSMA8,system,at,Durham,University,,or,through,Amazon,EC2,instances.,References,[1],S.J.,Pennycook,,J.D.,Sewall,,and,V.W.,Lee.,Implications,of,a,metric,for,performance,portability.,Future,Generation,Computer,Systems,,92:947–958,,2019.,[2],Jason,Sewall,,S.,John,Pennycook,,Douglas,Jacobsen,,Tom,Deakin,,and,Simon,McIntosh-Smith.,Interpreting,and,visualizing,performance,portability,metrics.,In,2020,IEEE/ACM,International,Workshop,on,Performance,,Portability,and,Productivity,in,HPC,(P3HPC),,pages,14–24,,2020.,[3],B,D,Dudson,,M,V,Umansky,,X,Q,Xu,,P,B,Snyder,,and,H,R,Wilson.,BOUT++:,A,framework,for,parallel,plasma,fluid,simulations.,Computer,Physics,Communications,,180:1467–1480,,2009.,[4],C.D.,Cant
well,,D.,Moxey,,A.,Comerford,,A.,Bolis,,G.,Rocco,,G.,Mengaldo,,D.,De,Grazia,,S.,Yakovlev,,J.-E.,Lombard,,D.,Ekelschot,,B.,Jordi,,H.,Xu,,Y.,Mohamied,,C.,Eskilsson,,B.,Nelson,,P.,Vos,,C.,Biotto,,R.M.,Kirby,,and,S.J.,Sherwin.,Nektar++:,An,open-source,spectral/hp,element,framework.,Computer,Physics,Communications,,192:205–219,,2015.,[5],T,D,Arber,,K,Bennett,,C,S,Brady,,A,Lawrence-Douglas,,M,G,Ramsay,,N,J,Sircombe,,P,Gillies,,R,G,Evans,,H,Schmitz,,A,R,Bell,,and,C,P,Ridgers.,Contemporary,particle-in-cell,approach,to,laser-plasma,modelling.,Plasma,Physics,and,Controlled,Fusion,,57(11):113001,,September,2015.,[6],Tom,Deakin,,Simon,McIntosh-Smith,,James,Price,,Andrei,Poenaru,,Patrick,Atkinson,,Codrin,Popa,,and,Justin,Salmon.,Performance,portability,across,diverse,computer,architectures.,In,2019,IEEE/ACM,International,Workshop,on,Performance,,Portability,and,Productivity,in,HPC,(P3HPC),,pages,1–13,,2019.,[7],R.,O.,Kirk,,G.,R.,Mudalige,,I.,Z.,Reguly,,S.,A.,Wright,,M.,J.,Martineau,,and,S.,A.,Jarvis.,Achieving,Performance,Portability,for,a,Heat,Conduction,Solver,Mini-Application,on,Modern,Multi-core,Systems.,In,2017,IEEE,International,Conference,on,Cluster,Computing,(CLUSTER),,pages,834–841,,September,2017.,[8],Matthew,Martineau,,Simon,McIntosh-Smith,,and,Wayne,Gaudin.,Assessing,the,performance,portability,of,modern,parallel,programming,models,using,TeaLeaf.,Concurrency,and,Computation:,Practice,and,Experience,,29(15):e4117,,2017.,[9],Simon,McIntosh-Smith,,Matthew,Martineau,,Tom,Deakin,,Grzegorz,Pawelczak,,Wayne,Gaudin,,Paul,Garrett,,Wei,Liu,,Richard,Smedley-Stevenson,,and,David,Beckingsale.,TeaLeaf:,A,Mini-Application,to,Enable,Design-Space,Explorations,for,Iterative,Sparse,Linear,Solvers.,In,2017,IEEE,International,Conference,on,Cluster,Computing,(CLUSTER),,pages,842–849,,2017.,[10],Richard,Frederick,Barrett,,Li,Tang,,and,Sharon,X.,Hu.,Performance,and,Energy,Implications,for,Heterogeneous,Computing,Systems:,A,MiniFE,Case,Study.,December,2014.,[11],Alan,B.,Williams.,Cuda/GPU,version,
of,miniFE,mini-application.,February,2012.,[12],Meng,Wu,,Can,Yang,,Taoran,Xiang,,and,Daning,Cheng.,The,research,and,optimization,of,parallel,finite,element,algorithm,based,on,minife.,CoRR,,abs/1505.08023,,2015.,[13],David,F.,Richards,,Yuri,Alexeev,,Xavier,Andrade,,Ramesh,Balakrishnan,,Hal,Finkel,,Graham,Fletcher,,Cameron,Ibrahim,,Wei,Jiang,,Christoph,Junghans,,Jeremy,Logan,,Amanda,Lund,,Danylo,Lykov,,Robert,Pavel,,Vinay,Ramakrishnaiah,,et,al.,FY20,Proxy,App,Suite,Release.,Technical,Report,LLNL-TR-815174,,Exascale,Computing,Project,,September,2020.,[14],J.,C.,Camier.,Laghos,summary,for,CTS2,benchmark.,Technical,Report,LLNL-TR-770220,,Lawrence,Livermore,National,Laboratory,,March,2019.,[15],Robert,Anderson,,Julian,Andrej,,Andrew,Barker,,Jamie,Bramwell,,Jean-Sylvain,Camier,,Jakub,Cerveny,,Veselin,Dobrev,,Yohann,Dudouit,,Aaron,Fisher,,Tzanio,Kolev,,Will,Pazner,,Mark,Stowell,,Vladimir,Tomov,,Ido,Akkerman,,Johann,Dahm,,David,Medina,,and,Stefano,Zampini.,MFEM:,A,modular,finite,element,methods,library.,Computers,&,Mathematics,with,Applications,,81:42–74,,2021.,Development,and,Application,of,Open-source,Software,for,Problems,with,Numerical,PDEs.,[16],K.,J.,Bowers,,B.,J.,Albright,,B.,Bergen,,L.,Yin,,K.,J.,Barker,,and,D.,J.,Kerbyson.,0.374,Pflop/s,Trillion-Particle,Kinetic,Modeling,of,Laser,Plasma,Interaction,on,Roadrunner.,In,Proceedings,of,the,2008,ACM/IEEE,Conference,on,Supercomputing,,SC,’08.,IEEE,Press,,2008.,[17],Robert,Bird,,Nigel,Tan,,Scott,V,Luedtke,,Stephen,Harrell,,Michela,Taufer,,and,Brian,Albright.,VPIC,2.0:,Next,Generation,Particle-in-Cell,Simulations.,IEEE,Transactions,on,Parallel,and,Distributed,Systems,,pages,1–1,,2021.,[18],Matthew,T.,Bettencourt,and,Sidney,Shields.,EMPIRE,Sandia’s,Next,Generation,Plasma,Tool.,Technical,Report,SAND2019-3233PE,,Sandia,National,Laboratories,,March,2019.,[19],Matthew,T.,Bettencourt,,Dominic,A.,S.,Brown,,Keith,L.,Cartwright,,Eric,C.,Cyr,,Christian,A.,Glusa,,Paul,T.,Lin,,Stan,G.,Moore,,Duncan,A.,O.,McGregor,,Roger,P.,Pawlowski,,
Edward,G.,Phillips,,Nathan,V.,Roberts,,Steven,A.,Wright,,Satheesh,Maheswaran,,John,P.,Jones,,and,Stephen,A.,Jarvis.,EMPIRE-PIC:,A,Performance,Portable,Unstructured,Particle-in-Cell,Code.,Communications,in,Computational,Physics,,x(x):1–37,,March,2021.,[20],Dominic,A.S.,Brown,,Matthew,T.,Bettencourt,,Steven,A.,Wright,,Satheesh,Maheswaran,,John,P.,Jones,,and,Stephen,A.,Jarvis.,Higher-order,particle,representation,for,particle-in-cell,simulations.,Journal,of,Computational,Physics,,435:110255,,2021.,[21],Yu,Jung,Lo,,Samuel,Williams,,Brian,Van,Straalen,,Terry,J.,Ligocki,,Matthew,J.,Cordery,,Nicholas,J.,Wright,,Mary,W.,Hall,,and,Leonid,Oliker.,Roofline,Model,Toolkit:,A,Practical,Tool,for,Architectural,and,Program,Analysis.,In,Stephen,A.,Jarvis,,Steven,A.,Wright,,and,Simon,D.,Hammond,,editors,,High,Performance,Computing,Systems.,Performance,Modeling,,Benchmarking,,and,Simulation,,pages,129–148.,Springer,International,Publishing,,2015.,[22],Istvan,Z.,Reguly,,Andrew,M.,B.,Owenson,,Archie,Powell,,Stephen,A.,Jarvis,,and,Gihan,R.,Mudalige.,Under,the,Hood,of,SYCL,–,An,Initial,Performance,Analysis,with,An,Unstructured-Mesh,CFD,Application.,In,Bradford,L.,Chamberlain,,Ana-Lucia,Varbanescu,,Hatem,Ltaief,,and,Piotr,Luszczek,,editors,,High,Performance,Computing,,pages,391–410.,Springer,International,Publishing,,2021.

:pdfembed:`src:_static/TN-03_EvaluationApproachesPerformancePortability.pdf, height:1600, width:1100, align:middle`