1. Description

      This benchmark is the simulation of an electronic device
      using a particle-mesh (PM) method, often also called a
      particle-in-cell (PIC) simulation. In each timestep the
      electric and magnetic fields on an (LMAX x MMAX) mesh are
      advanced explicitly in time using Maxwell's equations, and
      the particles (electrons) are advanced in the fields using
      Newton's equations.

      The benchmark is described as local because the time scale
      is such that the fields may be computed explicitly, using
      fields only local to each mesh point. Four benchmark cases
      are provided (NBEN3=1,2,3,4), giving four problem sizes
      described by the size factor alpha=1,2,4,8 and mesh numbers
      (75*alpha,33). The number of particles at the end of the
      run of 1 picosecond is given empirically by


      As the number of mesh-points increases for the same physical
      dimension, the time-step must be reduced to satisfy the
      Courant-Friedrichs-Lewy (CFL) stability criterion.  This effect
      has an important influence on the meaning of the performance
      metrics. The performance is expressed in several different
      metrics (and units) for comparison purposes.  As well as the
      traditional Speedup and Efficiency, we give the Temporal
      (tstep/s), Simulation (sim-ps/s), and Benchmark
      (Mflop/s(LPM1)) performance, which are much more meaningful
      and useful measures.
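
      The effect of mesh refinement on the time-step can be sketched
      as follows. This is only an illustration: it assumes the usual
      CFL bound for an explicit two-dimensional leapfrog field update,
      dt <= 1/(c*sqrt(1/dx**2 + 1/dy**2)), which may differ from the
      rule actually coded in LPM1. The physical dimensions LX and LY
      and the function name cfl_timestep are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s

def cfl_timestep(dx, dy, c=C):
    """Largest stable timestep under the standard 2-D CFL bound (assumed here)."""
    return 1.0 / (c * math.sqrt(1.0 / dx**2 + 1.0 / dy**2))

# Hypothetical physical dimensions of the device, NOT taken from the benchmark.
LX, LY = 1.0e-3, 0.5e-3  # metres

for alpha in (1, 2, 4, 8):
    lmax, mmax = 75 * alpha, 33            # mesh numbers (75*alpha, 33)
    dt = cfl_timestep(LX / lmax, LY / mmax)
    print(f"alpha={alpha}: mesh {lmax}x{mmax}, dt <= {dt:.3e} s")
```

      Only the first mesh dimension grows with alpha, so in this
      sketch the bound shrinks as dx shrinks while dy stays fixed.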

      Parallelisation is by one-dimensional domain decomposition,
      in the first coordinate. Each processor is responsible for
      a slab of space, and stores the mesh-points and coordinates
      of particles in its region of space. During each timestep,
      particle coordinates are transferred between processors as
      the particles move from region to region.
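
      The decomposition described above can be sketched serially as
      follows. This is a minimal illustration, not the benchmark's
      PARMACS code: the function names slab_bounds, owner and migrate
      are hypothetical, slabs are assumed to be split as evenly as
      possible, and the message passing between processors is replaced
      by moving tuples between Python lists.

```python
def slab_bounds(lmax, nproc):
    """Split mesh columns 0..lmax-1 into nproc contiguous slabs [start, stop)."""
    base, extra = divmod(lmax, nproc)
    bounds, start = [], 0
    for rank in range(nproc):
        width = base + (1 if rank < extra else 0)  # first 'extra' slabs get one more column
        bounds.append((start, start + width))
        start += width
    return bounds

def owner(x, bounds):
    """Rank whose slab contains first coordinate x (in mesh units)."""
    for rank, (lo, hi) in enumerate(bounds):
        if lo <= x < hi:
            return rank
    raise ValueError("particle is outside the mesh")

def migrate(particles_by_rank, bounds):
    """Re-bin (x, y) particles after a push so each sits on the rank owning x."""
    new = [[] for _ in bounds]
    for plist in particles_by_rank:
        for x, y in plist:
            new[owner(x, bounds)].append((x, y))
    return new

bounds = slab_bounds(75, 4)
# Two particles held by rank 0; the second has just crossed into rank 1's slab.
after = migrate([[(18.5, 1.0), (19.2, 2.0)], [], [], []], bounds)
print(bounds, after)
```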

      Error Check
      Because the simulation uses random numbers, the multi-processor
      calculation cannot be expected to give identical results to the
      uni-processor calculation. However, the percentage differences
      in particle number, NP, and average B-field, BAV, in the last
      timestep should not exceed a few percent.
      Calculations are accepted if the differences are < 10%.
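
      The acceptance test can be sketched as below. The function
      names and the sample values of NP and BAV are hypothetical;
      only the 10% threshold comes from the text above.

```python
def percent_diff(value, reference):
    """Absolute difference from the reference, as a percentage of the reference."""
    return 100.0 * abs(value - reference) / abs(reference)

def accept(np_par, bav_par, np_ref, bav_ref, tol=10.0):
    """True if both NP and BAV differ from the 1-processor values by less than tol percent."""
    return (percent_diff(np_par, np_ref) < tol
            and percent_diff(bav_par, bav_ref) < tol)

# Hypothetical last-timestep values: (NP, BAV) for a parallel run vs. the reference.
print(accept(1030, 0.98, 1000, 1.00))  # both differences are a few percent
print(accept(1200, 1.00, 1000, 1.00))  # NP differs by 20 percent
```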

      Temporal Performance
      Temporal performance is the inverse of the execution time,
      here expressed in units of timesteps per second (tstep/s).
      This is the fundamental metric of performance, because it 
      is in absolute units and one can guarantee that the code with
      the highest temporal performance executes in the least time.

      Speedup and Efficiency
      Speedup, Sp, has the traditional definition of the ratio of
      1-processor to n-processor execution time, and Efficiency, Ep, is
      Speedup per processor. Because Speedup is a relative 
      measure, the program with the highest Speedup may not 
      execute in the least time! Be warned.
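
      The warning can be made concrete with a small sketch. The two
      codes "A" and "B" and their timings are invented for
      illustration; the function names are hypothetical.

```python
def temporal_performance(nsteps, seconds):
    """Timesteps completed per wall-clock second (tstep/s)."""
    return nsteps / seconds

def speedup(t1, tn):
    """Ratio of 1-processor to n-processor execution time."""
    return t1 / tn

def efficiency(t1, tn, nproc):
    """Speedup per processor."""
    return speedup(t1, tn) / nproc

NSTEPS = 1000  # hypothetical run length in timesteps
# Hypothetical execution times in seconds on (1 processor, 4 processors).
timings = {"A": (400.0, 110.0), "B": (200.0, 80.0)}

for name, (t1, t4) in timings.items():
    print(name,
          f"Sp={speedup(t1, t4):.2f}",
          f"Ep={efficiency(t1, t4, 4):.2f}",
          f"tstep/s={temporal_performance(NSTEPS, t4):.2f}")
```

      With these invented numbers, code A shows the larger Speedup
      but code B has the larger Temporal performance, i.e. B finishes
      first.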

      Simulation Performance
      This metric measures the amount of simulated time computed
      in one real wall-clock second. It is the most meaningful
      metric for a simulation, because it is what the user actually
      wishes to maximise. For this benchmark, the units are 
      simulated picoseconds per second (sim-ps/s). In this metric
      larger problems with more mesh points run slower (which in
      fact they do), although they generate more Speedup and
      Mflop/s! This metric also includes the fact that problems 
      with a smaller space step often must use a smaller timestep,
      and therefore take more timesteps to cover the same amount
      of simulated time. 
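
      A minimal sketch of the metric follows. The timestep sizes,
      step counts and wall-clock times are hypothetical and chosen
      only to show the trend; they are not benchmark results.

```python
def simulation_performance(dt_ps, nsteps, seconds):
    """Simulated picoseconds advanced per wall-clock second (sim-ps/s)."""
    return dt_ps * nsteps / seconds

# Two hypothetical runs covering the same 1 ps of simulated time.
# The finer mesh needs half the timestep, hence twice the steps,
# and each step is assumed to cost twice as much.
coarse = simulation_performance(dt_ps=1.0e-3, nsteps=1000, seconds=10.0)
fine = simulation_performance(dt_ps=0.5e-3, nsteps=2000, seconds=40.0)
print(f"coarse mesh: {coarse:.4f} sim-ps/s, fine mesh: {fine:.4f} sim-ps/s")
```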

      Benchmark Performance
      This metric is calculated from the nominal number of
      floating-point operations needed to perform the benchmark
      on a single processor.  For the one-nanosecond benchmark
      setup here, the average number of floating-point operations
      per timestep is defined to be:

             F_b(alpha) = 46*75*33*alpha + 58*628*alpha**1.172

      where the size factor alpha=1,2,4,8 for cases NBEN3=1,2,3,4.
      The first term above is the work to update the fields on the
      mesh, and the second term is the work to move the particles.
      Then the benchmark performance is
              R_b(alpha,p) = F_b(alpha)/Tp(alpha,p)
      where Tp(alpha,p) is the average execution time per timestep
      on p processors.

      Performance calculated in this way has the units 
      Mflop/s(LPM1). Different parallel implementations may,
      in fact, perform more or fewer operations than the above, but
      they are only credited with the number given by the formula.
      Because F_b is fixed for all codes, we can guarantee that the
      code with the highest benchmark performance executes in the
      least time.
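
      The formula for F_b can be evaluated directly, as in the sketch
      below. The function names are hypothetical, and the sketch
      assumes that Tp is the execution time per timestep in seconds,
      so that dividing by 1.0e6 gives Mflop/s(LPM1).

```python
def nominal_flops_per_step(alpha):
    """F_b(alpha): nominal floating-point operations per timestep."""
    field_work = 46 * 75 * 33 * alpha          # field update on the (75*alpha, 33) mesh
    particle_work = 58 * 628 * alpha ** 1.172  # particle push
    return field_work + particle_work

def benchmark_performance(alpha, seconds_per_step):
    """R_b in Mflop/s(LPM1), assuming Tp is wall-clock seconds per timestep."""
    return nominal_flops_per_step(alpha) / seconds_per_step / 1.0e6

for nben3, alpha in enumerate((1, 2, 4, 8), start=1):
    print(f"NBEN3={nben3} alpha={alpha} "
          f"mesh={75 * alpha}x33 F_b={nominal_flops_per_step(alpha):.0f}")
```

      Because the particle term grows as alpha**1.172, F_b grows
      slightly faster than linearly in alpha.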

Operating Instructions
  To compile and link the benchmark, type `make' for the distributed
  version or `make slave' for the single-node version.

  To run a recompiled program (e.g. on Intel iPSC), type: 

                  getcube -t4      ! to allocate cube
                  lpm1             ! to run benchmark.

  In some systems the allocate command may not be necessary.

  Then answer one question:

  (1) Number of nodes for mimd run is, at maximum, equal to
      the number of nodes allocated by getcube (4 in the above example).
      This is the number of nodes (processors) to be used in the
      calculation. Value may be:  1, 2, ... , maximum nodes (here 4)

 Note: For every problem size, the 1-processor calculation must be
       performed once, to obtain the reference time for the Speedup
       measure. The timing results are stored in the four check
       result files:
                 res1p.size1, ... , res1p.size4.
       We recommend that your first run is with 1 processor; otherwise
       Speedup will be printed as zero. You do not have to rerun
       the 1-processor case when you change the number of processors
       allocated.

  The results for the four problem sizes, cases 1, 2, 3 and 4, and
  different numbers of processors are put automatically in
  different output files, with notation (for example):

  lpm1c3p25 - output for lpm1 benchmark, case 3 for 25 processors
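
  For post-processing scripts, the naming scheme can be built and
  parsed as sketched below. The helper names are hypothetical and
  are not part of the benchmark distribution; the pattern is
  inferred from the example name above.

```python
import re

# Pattern inferred from the example name: lpm1c<case>p<processors>.
_PATTERN = re.compile(r"lpm1c(\d+)p(\d+)")

def output_name(case, nproc):
    """Build the output file name for a given case and processor count."""
    return f"lpm1c{case}p{nproc}"

def parse_output_name(name):
    """Return (case, nproc) from an output file name, or raise ValueError."""
    m = _PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"not an lpm1 output file name: {name!r}")
    return int(m.group(1)), int(m.group(2))

print(output_name(3, 25))
print(parse_output_name("lpm1c4p100"))
```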

  If you wish to put the files elsewhere there is a prompt to 
  tell you when to do it with a Unix cp command.

      lpm1.u  -  host program, contains PARMACS for host.
      node.u  -  node main program and all communication interface
                 routines, therefore all node PARMACS calls are here.
      benctl.f - benchmark control, may be changed to modify
                 output, but usually left alone. No PARMACS here.
      lpm1bk.f - body of benchmark code. Not to be touched.
      res1p.size1 - correct results on one processor for standard
                    size problem, case1, (75x33) mesh.
      res1p.size2 - results for case2 problem (150x33) mesh.
      res1p.size3 - results for case3 problem (300x33) mesh.
      res1p.size4 - results for case4 problem (600x33) mesh.
      secowa.f - LPM1 program second timer, which calls
      timer.f  - the standard benchmark system timer
      header.f - standard header information
      setdat.f - puts date on results
      setdtl.f - compiler and system details

      lpm1c4p100 - etc, output files generated by program

