[yt-svn] commit/yt-doc: 7 new changesets

Bitbucket commits-noreply at bitbucket.org
Wed Nov 16 12:26:57 PST 2011


7 new commits in yt-doc:


https://bitbucket.org/yt_analysis/yt-doc/changeset/031c8ad940c9/
changeset:   031c8ad940c9
user:        sskory
date:        2011-11-11 22:41:03
summary:     Improving the discussion of parallel computation and such.
Not finished yet.
affected #:  3 files

diff -r ece182e303d4ebb80ea339fb983039c2b8ea9e70 -r 031c8ad940c982298ac1b9b047d48b107b02427f source/advanced/creating_derived_quantities.rst
--- a/source/advanced/creating_derived_quantities.rst
+++ b/source/advanced/creating_derived_quantities.rst
@@ -1,3 +1,5 @@
+.. _creating_derived_quantities:
+
 Creating Derived Quantities
 ---------------------------
 


diff -r ece182e303d4ebb80ea339fb983039c2b8ea9e70 -r 031c8ad940c982298ac1b9b047d48b107b02427f source/advanced/parallel_computation.rst
--- a/source/advanced/parallel_computation.rst
+++ b/source/advanced/parallel_computation.rst
@@ -16,27 +16,33 @@
 Currently, YT is able to
 perform the following actions in parallel:
 
- * Projections
- * Slices
- * Cutting planes (oblique slices)
- * Derived Quantities (total mass, angular momentum, etc)
- * 1-, 2- and 3-D profiles
- * Halo finding
- * Merger tree
- * Two point functions
+ * Projections (:ref:`how-to-make-projections`)
+ * Slices (:ref:`how-to-make-slices`)
+ * Cutting planes (oblique slices) (:ref:`how-to-make-oblique-slices`)
+ * Derived Quantities (total mass, angular momentum, etc) (:ref:`creating_derived_quantities`,
+   :ref:`derived-quantities`)
+ * 1-, 2-, and 3-D profiles (:ref:`generating-profiles-and-histograms`)
+ * Halo finding (:ref:`halo_finding`)
+ * Merger tree (:ref:`merger_tree`)
+ * Two point functions (:ref:`two_point_functions`)
+ * Volume rendering (:ref:`volume_rendering`)
+ * Radial column density
+ * Isocontours & flux calculations
 
 This list covers just about every action YT can take!  Additionally, almost all
 scripts will benefit from parallelization without any modification.  The goal
 of Parallel-YT has been to retain API compatibility and abstract all
-parallelism.  
+parallelism.
 
 Setting Up Parallel YT
 ----------------------
 
-To run scripts in parallel, you must first install `mpi4py <http://code.google.com/p/mpi4py>`_.
-Instructions for doing so are provided on the MPI4Py website.  Once that has
+To run scripts in parallel, you must first install
+`mpi4py <http://code.google.com/p/mpi4py>`_.
+Instructions for doing so are provided on the mpi4py website.  Once that has
 been accomplished, you're all done!  You just need to launch your scripts with
-``mpirun`` and signal to YT that you want to run them in parallel.
+``mpirun`` (or equivalent) and signal to YT
+that you want to run them in parallel.
 
 For instance, the following script, which we'll save as ``my_script.py``:
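
The script body itself is elided by the diff context here, so as a purely
illustrative sketch (the dataset name is hypothetical, and it assumes the
yt 2.x conventions used elsewhere in this document, with the ``--parallel``
command-line flag as the signal mentioned above):

.. code-block:: python

   # Illustrative stand-in for the elided my_script.py; launch with, e.g.:
   #   mpirun -np 4 python my_script.py --parallel
   from yt.mods import *

   pf = load("DD0152")  # hypothetical dataset name
   dd = pf.h.all_data()
   # Derived quantities such as Extrema are computed in parallel with no
   # changes to the script itself.
   print dd.quantities["Extrema"]("Density")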
 
@@ -62,7 +68,7 @@
 
 .. warning:: If you manually interact with the filesystem, not through YT, you
    will have to ensure that you only execute your functions on the root
-   processor.  You can do this with the function :func:only_on_root.
+   processor.  You can do this with the function :func:`only_on_root`.
 
 It's important to note that all of the processes listed in `capabilities` work
 -- and no additional work is necessary to parallelize those processes.
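
As a sketch of guarding root-only filesystem work (the import path below is
an assumption based on yt 2.x's parallel tools; adjust for your version):

.. code-block:: python

   # Sketch: run a filesystem-touching function on the root processor only.
   # Import path is assumed for yt 2.x.
   from yt.utilities.parallel_tools.parallel_analysis_interface import \
       only_on_root

   def write_log(fn, text):
       f = open(fn, "w")
       f.write(text)
       f.close()

   # write_log executes on the root task; all other tasks skip it.
   only_on_root(write_log, "run_log.txt", "analysis complete\n")
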
@@ -89,9 +95,199 @@
 of parallelism is overall less efficient than grid-based parallelism, but it
 has been shown to obtain good results overall.
 
+The following operations use spatial decomposition:
+
+  * Projections
+  * Slices
+  * Cutting planes
+  * Halo finding
+  * Merger tree
+  * Two point functions
+  * Volume rendering
+  * Radial column density
+
 Grid Decomposition
 ++++++++++++++++++
 
 The alternative to spatial decomposition is a simple round-robin of the grids.
 This process allows YT to pool data access to a given Enzo data file, which
 ultimately results in faster read times and better parallelism.
+
+The following operations use grid decomposition:
+
+  * Derived Quantities
+  * 1-, 2-, and 3-D profiles
+  * Isocontours & flux calculations
+
+Object-Based
+++++++++++++
+
+In a fashion similar to grid decomposition, computation can be parallelized
+over objects. This is especially useful for
+`embarrassingly parallel <http://en.wikipedia.org/wiki/Embarrassingly_parallel>`_
+tasks where the items to be worked on can be split into separate chunks and
+saved to a list. The list is then split up and each MPI task performs parts of
+it independently.
+
+Parallelizing Your Analysis
+---------------------------
+
+It is easy within YT to parallelize a list of tasks, as long as those tasks
+are independent of one another.
+Using object-based parallelism, the function :func:`parallel_objects` will
+automatically split up a list of tasks over the specified number of processors
+(or cores).
+Please see this heavily-commented example:
+
+.. code-block:: python
+   
+   # This is necessary to prevent a race-condition where each copy of
+   # yt attempts to save information about datasets to the same file on disk,
+   # simultaneously. This will be fixed, eventually...
+   from yt.config import ytcfg; ytcfg["yt","serialize"] = "False"
+   # As always...
+   from yt.mods import *
+   import glob
+   
+   # fns is a list of all Enzo hierarchy files in directories one level down.
+   fns = glob.glob("*/*.hierarchy")
+   fns.sort()
+   # This dict will store information collected in the loop, below.
+   # Inside the loop each task will have a local copy of the dict, but
+   # the dict will be combined once the loop finishes.
+   my_storage = {}
+   # In this example, because the storage option is used in the
+   # parallel_objects function, the loop yields a tuple, which gets used
+   # as (sto, fn) inside the loop.
+   # In the loop, sto is essentially my_storage, but a local copy of it.
+   # If data does not need to be combined after the loop is done, the line
+   # would look like:
+   #       for fn in parallel_objects(fns, 4):
+   # The number 4, below, is the number of processes to parallelize over, which
+   # is generally equal to the number of MPI tasks the job is launched with.
+   for sto, fn in parallel_objects(fns, 4, storage = my_storage):
+       # Open a data file, remembering that fn is different on each task.
+       pf = load(fn)
+       dd = pf.h.all_data()
+       # This copies fn and the min/max of density to the local copy of
+       # my_storage
+       sto.result_id = fn
+       sto.result = dd.quantities["Extrema"]("Density")
+       # Makes and saves a plot of the gas density.
+       pc = PlotCollection(pf, [0.5, 0.5, 0.5])
+       pc.add_projection("Density", 0)
+       pc.save()
+   # At this point, as the loop exits, the local copies of my_storage are
+   # combined such that all tasks now have an identical and full version of
+   # my_storage. Until this point, each task is unaware of what the other
+   # tasks have produced.
+   # Below, the values in my_storage are printed by only one task. The other
+   # tasks do nothing.
+   if ytcfg.getint("yt", "__topcomm_parallel_rank") == 0:
+       for fn, vals in sorted(my_storage.items()):
+           print fn, vals
+
+The example above can be modified to loop over anything that can be stored in
+a Python list: halos, data files, arrays, and more.
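
For instance, a minimal sketch of the same pattern applied to a list of
sphere radii rather than files (illustrative only; the dataset name is
hypothetical, and the :func:`parallel_objects` and storage semantics are
exactly those shown above):

.. code-block:: python

   from yt.mods import *

   pf = load("DD0152")  # hypothetical dataset name
   my_storage = {}
   # Any list works: each task gets a share of the radii and measures the
   # density extrema inside a sphere of that radius at the domain center.
   for sto, radius in parallel_objects([0.1, 0.2, 0.3, 0.4], 4,
                                       storage = my_storage):
       sp = pf.h.sphere([0.5, 0.5, 0.5], radius)
       sto.result_id = radius
       sto.result = sp.quantities["Extrema"]("Density")
   # As before, every task holds the combined my_storage after the loop.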
+
+Parallel Performance, Resources, and Tuning
+-------------------------------------------
+
+Optimizing parallel jobs in YT is difficult; there are many parameters
+that affect how well and quickly the job runs.
+In many cases, the only way to find out what the minimum (or optimal)
+number of processors is, or amount of memory needed, is through trial and error.
+However, this section will attempt to provide some insight into good
+starting values for a given parallel task.
+
+Grid Decomposition
+++++++++++++++++++
+
+In general, these types of parallel calculations scale very well with the number of
+processors.
+They are also fairly memory-conservative.
+The two limiting factors are therefore the number of grids in the dataset,
+and the speed of the disk the data is being read off of.
+There is no point in running a parallel job of this kind with more processors
+than grids, because the extra processors will do absolutely nothing, and will
+in fact probably just serve to slow down the whole calculation due to the extra
+overhead.
+The speed of the disk is also a consideration: if it is not a high-end parallel
+file system, adding more tasks will not speed up the calculation if the disk
+is already swamped with activity.
+
+The best advice for these sorts of calculations is to run with as many processors
+as practical taking the machine's capability into account, as well as the time
+it takes to start the job in the job scheduler. Start with a few tens of
+processors and go from there.
+
+Object-Based
+++++++++++++
+
+As with grid decomposition, it does not help to run with more processors than
+the number of objects to be iterated over.
+There is also the matter of the kind of work being done on each object, and
+whether it is disk-, CPU-, or memory-intensive.
+It is up to the user to figure out what limits the performance of their script,
+and to allocate resources accordingly.
+
+Disk-intensive jobs are limited by the speed of the file system, as above,
+and extra processors beyond its capability are likely counter-productive.
+It may require some testing or research (e.g. supercomputer documentation)
+to find out what the file system is capable of.
+
+If the job is CPU-intensive, it's best to use as many processors as
+practical.
+
+For a memory-intensive job, each task needs to be able to allocate enough
+memory, which may mean using fewer than the maximum number of tasks per compute
+node and increasing the number of nodes.
+Estimate the memory used per task and compare it to the memory available on
+each compute node; the ratio dictates how many tasks can run per node.
+For example, if each task needs roughly 8 GB and each node provides 32 GB,
+run at most four tasks per node.
+After that, the number of processors used overall is dictated by the 
+disk system or CPU-intensity of the job.
+
+Domain Decomposition
+++++++++++++++++++++
+
+The various types of analysis that utilize domain decomposition use it in
+ways different enough that they are discussed separately.
+
+**Halo-Finding**
+
+Halo finding, along with the merger tree that uses halo finding, operates
+on the particles in the volume, and is therefore mostly grid-agnostic.
+Generally, the biggest concern for halo finding is the amount of memory needed.
+There is a subtle art to estimating the amount of memory needed for halo
+finding, but a rule of thumb is that Parallel HOP (:func:`parallelHF`) is the
+most memory-intensive, followed by plain HOP (:func:`HaloFinder`), with
+Friends of Friends (:func:`FOFHaloFinder`) being the most memory-conservative.
+It has been found that :func:`parallelHF` needs roughly 1 MB of memory per
+5,000 particles (for example, a 512^3-particle dataset would need roughly 26 GB
+in aggregate), although recent work has improved this and the memory requirement
+is now smaller than this. But this is a good starting point for beginning to
+calculate the memory required for halo-finding.
+
+**Projections, Slices, and Cutting Planes**
+
+
+**Two point functions**
+
+Please see :ref:`tpf_strategies` for more details.
+
+**Volume Rendering**
+
+The simplest way to think about volume rendering, and the radial column density
+module that uses it, is that it load-balances over the grids in the dataset.
+Each processor is given roughly the same-sized volume to operate on.
+In practice, there are just a few things to keep in mind when doing volume
+rendering.
+First, it only uses a power-of-two number of processors.
+If the job is run with 100 processors, only 64 of them will actually do anything.
+Second, the absolute maximum number of processors is the number of grids.
+But in order to keep work distributed evenly, typically the number of processors
+should be no greater than one-eighth or one-quarter the number of processors
+that were used to produce the dataset.
+
+
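
A trivial sketch of choosing a task count under that power-of-two
constraint:

.. code-block:: python

   # The volume renderer only uses a power-of-two number of processors, so
   # pick the largest power of two not exceeding what is available.
   def usable_tasks(n_available):
       n = 1
       while n * 2 <= n_available:
           n *= 2
       return n

   print usable_tasks(100)  # prints 64, matching the example above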


diff -r ece182e303d4ebb80ea339fb983039c2b8ea9e70 -r 031c8ad940c982298ac1b9b047d48b107b02427f source/analysis_modules/two_point_functions.rst
--- a/source/analysis_modules/two_point_functions.rst
+++ b/source/analysis_modules/two_point_functions.rst
@@ -570,6 +570,8 @@
      If there were other output fields, they would be named
      ``bin_edges_01_outfield1``, ``bin_edges_02_outfield2`` respectively.
 
+.. _tpf_strategies:
+
 Strategies for Computational Efficiency
 ---------------------------------------
 



https://bitbucket.org/yt_analysis/yt-doc/changeset/53a62f492351/
changeset:   53a62f492351
user:        MatthewTurk
date:        2011-11-12 14:14:38
summary:     Adding about projections, slices, cutting planes.
affected #:  1 file

diff -r 031c8ad940c982298ac1b9b047d48b107b02427f -r 53a62f49235158d5c2df2a85c48da20f17274995 source/advanced/parallel_computation.rst
--- a/source/advanced/parallel_computation.rst
+++ b/source/advanced/parallel_computation.rst
@@ -221,6 +221,37 @@
 it takes to start the job in the job scheduler. Start with a few tens of
 processors and go from there.
 
+**Projections, Slices, and Cutting Planes**
+
+Projections, slices and cutting planes are the most common methods of creating
+two-dimensional representations of data.  All three have been parallelized in a
+grid-based fashion.
+
+ * Projections: projections are parallelized utilizing a quad-tree approach.
+   Data is loaded for each processor, typically by a process that consolidates
+   open/close/read operations, and each grid is then iterated over and cells
+   are deposited into a data structure that stores values corresponding to
+   positions in the two-dimensional plane.  This provides excellent load
+   balancing, and in serial is quite fast.  However, as of yt 2.3, the
+   operation by which quadtrees are joined across processors scales poorly;
+   while memory consumption scales well, the time to completion does not.  As
+   such, projections can often by very fast when operating only on a single
+   processor!  The quadtree algorithm can be used inline (and, indeed, it is
+   for this reason that it is slow.)  It is recommended that you attempt to
+   project in serial before projecting in parallel; even for the very largest
+   datasets (Enzo 1024^3 root grid with 7 levels of refinement) in the absence
+   of IO the quadtree algorithm takes only three minute or so on a decent
+   processor.
+ * Slices: to generate a slice, grids that intersect a given slice are iterated
+   over and their finest-resolution cells are deposited.  The grids are
+   decomposed via standard load balancing.  While this operation is parallel,
+   **it is almost never necessary to slice a dataset in parallel**, as all data is
+   loaded on demand anyway.  The slice operation has been parallelized so as to
+   enable slicing when running *in situ*.
+ * Cutting planes: cutting planes are parallelized exactly as slices are.
+   However, in contrast to slices, because the data-selection operation can be
+   much more time consuming, cutting planes often benefit from parallelism.
+
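
As a concrete sketch of the serial-first advice for projections (reusing the
``PlotCollection`` pattern from the example earlier; the dataset name is
hypothetical):

.. code-block:: python

   from yt.mods import *

   pf = load("DD0152")  # hypothetical dataset name
   pc = PlotCollection(pf, [0.5, 0.5, 0.5])
   # Try this in serial first; only move to mpirun --parallel if the
   # quadtree join cost described above is outweighed by parallel IO gains.
   pc.add_projection("Density", 0)
   pc.save()
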
 Object-Based
 ++++++++++++
 
@@ -247,6 +278,7 @@
 After that, the number of processors used overall is dictated by the 
 disk system or CPU-intensity of the job.
 
+
 Domain Decomposition
 ++++++++++++++++++++
 
@@ -269,9 +301,6 @@
 is now smaller than this. But this is a good starting point for beginning to
 calculate the memory required for halo-finding.
 
-**Projections, Slices, and Cutting Planes**
-
-
 **Two point functions**
 
 Please see :ref:`tpf_strategies` for more details.



https://bitbucket.org/yt_analysis/yt-doc/changeset/e13057dcff4a/
changeset:   e13057dcff4a
user:        sskory
date:        2011-11-14 23:38:06
summary:     Merging.
affected #:  3 files

diff -r 4c7040708b895a28733073966c35558e2a204c4d -r e13057dcff4a8626dacfc3fd34537c60253b2f88 source/advanced/creating_derived_quantities.rst
--- a/source/advanced/creating_derived_quantities.rst
+++ b/source/advanced/creating_derived_quantities.rst
@@ -1,3 +1,5 @@
+.. _creating_derived_quantities:
+
 Creating Derived Quantities
 ---------------------------
 


diff -r 4c7040708b895a28733073966c35558e2a204c4d -r e13057dcff4a8626dacfc3fd34537c60253b2f88 source/advanced/parallel_computation.rst
--- a/source/advanced/parallel_computation.rst
+++ b/source/advanced/parallel_computation.rst
@@ -16,27 +16,33 @@
 Currently, YT is able to
 perform the following actions in parallel:
 
- * Projections
- * Slices
- * Cutting planes (oblique slices)
- * Derived Quantities (total mass, angular momentum, etc)
- * 1-, 2- and 3-D profiles
- * Halo finding
- * Merger tree
- * Two point functions
+ * Projections (:ref:`how-to-make-projections`)
+ * Slices (:ref:`how-to-make-slices`)
+ * Cutting planes (oblique slices) (:ref:`how-to-make-oblique-slices`)
+ * Derived Quantities (total mass, angular momentum, etc) (:ref:`creating_derived_quantities`,
+   :ref:`derived-quantities`)
+ * 1-, 2-, and 3-D profiles (:ref:`generating-profiles-and-histograms`)
+ * Halo finding (:ref:`halo_finding`)
+ * Merger tree (:ref:`merger_tree`)
+ * Two point functions (:ref:`two_point_functions`)
+ * Volume rendering (:ref:`volume_rendering`)
+ * Radial column density
+ * Isocontours & flux calculations
 
 This list covers just about every action YT can take!  Additionally, almost all
 scripts will benefit from parallelization without any modification.  The goal
 of Parallel-YT has been to retain API compatibility and abstract all
-parallelism.  
+parallelism.
 
 Setting Up Parallel YT
 ----------------------
 
-To run scripts in parallel, you must first install `mpi4py <http://code.google.com/p/mpi4py>`_.
-Instructions for doing so are provided on the MPI4Py website.  Once that has
+To run scripts in parallel, you must first install
+`mpi4py <http://code.google.com/p/mpi4py>`_.
+Instructions for doing so are provided on the mpi4py website.  Once that has
 been accomplished, you're all done!  You just need to launch your scripts with
-``mpirun`` and signal to YT that you want to run them in parallel.
+``mpirun`` (or equivalent) and signal to YT
+that you want to run them in parallel.
 
 For instance, the following script, which we'll save as ``my_script.py``:
 
@@ -62,7 +68,7 @@
 
 .. warning:: If you manually interact with the filesystem, not through YT, you
    will have to ensure that you only execute your functions on the root
-   processor.  You can do this with the function :func:only_on_root.
+   processor.  You can do this with the function :func:`only_on_root`.
 
 It's important to note that all of the processes listed in `capabilities` work
 -- and no additional work is necessary to parallelize those processes.
@@ -89,9 +95,228 @@
 of parallelism is overall less efficient than grid-based parallelism, but it
 has been shown to obtain good results overall.
 
+The following operations use spatial decomposition:
+
+  * Projections
+  * Slices
+  * Cutting planes
+  * Halo finding
+  * Merger tree
+  * Two point functions
+  * Volume rendering
+  * Radial column density
+
 Grid Decomposition
 ++++++++++++++++++
 
 The alternative to spatial decomposition is a simple round-robin of the grids.
 This process allows YT to pool data access to a given Enzo data file, which
 ultimately results in faster read times and better parallelism.
+
+The following operations use grid decomposition:
+
+  * Derived Quantities
+  * 1-, 2-, and 3-D profiles
+  * Isocontours & flux calculations
+
+Object-Based
+++++++++++++
+
+In a fashion similar to grid decomposition, computation can be parallelized
+over objects. This is especially useful for
+`embarrassingly parallel <http://en.wikipedia.org/wiki/Embarrassingly_parallel>`_
+tasks where the items to be worked on can be split into separate chunks and
+saved to a list. The list is then split up and each MPI task performs parts of
+it independently.
+
+Parallelizing Your Analysis
+---------------------------
+
+It is easy within YT to parallelize a list of tasks, as long as those tasks
+are independent of one another.
+Using object-based parallelism, the function :func:`parallel_objects` will
+automatically split up a list of tasks over the specified number of processors
+(or cores).
+Please see this heavily-commented example:
+
+.. code-block:: python
+   
+   # This is necessary to prevent a race-condition where each copy of
+   # yt attempts to save information about datasets to the same file on disk,
+   # simultaneously. This will be fixed, eventually...
+   from yt.config import ytcfg; ytcfg["yt","serialize"] = "False"
+   # As always...
+   from yt.mods import *
+   import glob
+   
+   # fns is a list of all Enzo hierarchy files in directories one level down.
+   fns = glob.glob("*/*.hierarchy")
+   fns.sort()
+   # This dict will store information collected in the loop, below.
+   # Inside the loop each task will have a local copy of the dict, but
+   # the dict will be combined once the loop finishes.
+   my_storage = {}
+   # In this example, because the storage option is used in the
+   # parallel_objects function, the loop yields a tuple, which gets used
+   # as (sto, fn) inside the loop.
+   # In the loop, sto is essentially my_storage, but a local copy of it.
+   # If data does not need to be combined after the loop is done, the line
+   # would look like:
+   #       for fn in parallel_objects(fns, 4):
+   # The number 4, below, is the number of processes to parallelize over, which
+   # is generally equal to the number of MPI tasks the job is launched with.
+   for sto, fn in parallel_objects(fns, 4, storage = my_storage):
+       # Open a data file, remembering that fn is different on each task.
+       pf = load(fn)
+       dd = pf.h.all_data()
+       # This copies fn and the min/max of density to the local copy of
+       # my_storage
+       sto.result_id = fn
+       sto.result = dd.quantities["Extrema"]("Density")
+       # Makes and saves a plot of the gas density.
+       pc = PlotCollection(pf, [0.5, 0.5, 0.5])
+       pc.add_projection("Density", 0)
+       pc.save()
+   # At this point, as the loop exits, the local copies of my_storage are
+   # combined such that all tasks now have an identical and full version of
+   # my_storage. Until this point, each task is unaware of what the other
+   # tasks have produced.
+   # Below, the values in my_storage are printed by only one task. The other
+   # tasks do nothing.
+   if ytcfg.getint("yt", "__topcomm_parallel_rank") == 0:
+       for fn, vals in sorted(my_storage.items()):
+           print fn, vals
+
+The example above can be modified to loop over anything that can be stored in
+a Python list: halos, data files, arrays, and more.
+
+Parallel Performance, Resources, and Tuning
+-------------------------------------------
+
+Optimizing parallel jobs in YT is difficult; there are many parameters
+that affect how well and quickly the job runs.
+In many cases, the only way to find out what the minimum (or optimal)
+number of processors is, or amount of memory needed, is through trial and error.
+However, this section will attempt to provide some insight into good
+starting values for a given parallel task.
+
+Grid Decomposition
+++++++++++++++++++
+
+In general, these types of parallel calculations scale very well with the number of
+processors.
+They are also fairly memory-conservative.
+The two limiting factors are therefore the number of grids in the dataset,
+and the speed of the disk the data is being read off of.
+There is no point in running a parallel job of this kind with more processors
+than grids, because the extra processors will do absolutely nothing, and will
+in fact probably just serve to slow down the whole calculation due to the extra
+overhead.
+The speed of the disk is also a consideration: if it is not a high-end parallel
+file system, adding more tasks will not speed up the calculation if the disk
+is already swamped with activity.
+
+The best advice for these sorts of calculations is to run with as many processors
+as practical taking the machine's capability into account, as well as the time
+it takes to start the job in the job scheduler. Start with a few tens of
+processors and go from there.
+
+**Projections, Slices, and Cutting Planes**
+
+Projections, slices and cutting planes are the most common methods of creating
+two-dimensional representations of data.  All three have been parallelized in a
+grid-based fashion.
+
+ * Projections: projections are parallelized utilizing a quad-tree approach.
+   Data is loaded for each processor, typically by a process that consolidates
+   open/close/read operations, and each grid is then iterated over and cells
+   are deposited into a data structure that stores values corresponding to
+   positions in the two-dimensional plane.  This provides excellent load
+   balancing, and in serial is quite fast.  However, as of yt 2.3, the
+   operation by which quadtrees are joined across processors scales poorly;
+   while memory consumption scales well, the time to completion does not.  As
+   such, projections can often by very fast when operating only on a single
+   processor!  The quadtree algorithm can be used inline (and, indeed, it is
+   for this reason that it is slow.)  It is recommended that you attempt to
+   project in serial before projecting in parallel; even for the very largest
+   datasets (Enzo 1024^3 root grid with 7 levels of refinement) in the absence
+   of IO the quadtree algorithm takes only three minute or so on a decent
+   processor.
+ * Slices: to generate a slice, grids that intersect a given slice are iterated
+   over and their finest-resolution cells are deposited.  The grids are
+   decomposed via standard load balancing.  While this operation is parallel,
+   **it is almost never necessary to slice a dataset in parallel**, as all data is
+   loaded on demand anyway.  The slice operation has been parallelized so as to
+   enable slicing when running *in situ*.
+ * Cutting planes: cutting planes are parallelized exactly as slices are.
+   However, in contrast to slices, because the data-selection operation can be
+   much more time consuming, cutting planes often benefit from parallelism.
+
+Object-Based
+++++++++++++
+
+As with grid decomposition, it does not help to run with more processors than
+the number of objects to be iterated over.
+There is also the matter of the kind of work being done on each object, and
+whether it is disk-, CPU-, or memory-intensive.
+It is up to the user to figure out what limits the performance of their script,
+and to allocate resources accordingly.
+
+Disk-intensive jobs are limited by the speed of the file system, as above,
+and extra processors beyond its capability are likely counter-productive.
+It may require some testing or research (e.g. supercomputer documentation)
+to find out what the file system is capable of.
+
+If the job is CPU-intensive, it's best to use as many processors as
+practical.
+
+For a memory-intensive job, each task needs to be able to allocate enough
+memory, which may mean using fewer than the maximum number of tasks per compute
+node and increasing the number of nodes.
+Estimate the memory used per task and compare it to the memory available on
+each compute node; the ratio dictates how many tasks can run per node.
+For example, if each task needs roughly 8 GB and each node provides 32 GB,
+run at most four tasks per node.
+After that, the number of processors used overall is dictated by the 
+disk system or CPU-intensity of the job.
+
+
+Domain Decomposition
+++++++++++++++++++++
+
+The various types of analysis that utilize domain decomposition use it in
+ways different enough that they are discussed separately.
+
+**Halo-Finding**
+
+Halo finding, along with the merger tree that uses halo finding, operates
+on the particles in the volume, and is therefore mostly grid-agnostic.
+Generally, the biggest concern for halo finding is the amount of memory needed.
+There is a subtle art to estimating the amount of memory needed for halo
+finding, but a rule of thumb is that Parallel HOP (:func:`parallelHF`) is the
+most memory-intensive, followed by plain HOP (:func:`HaloFinder`), with
+Friends of Friends (:func:`FOFHaloFinder`) being the most memory-conservative.
+It has been found that :func:`parallelHF` needs roughly 1 MB of memory per
+5,000 particles (for example, a 512^3-particle dataset would need roughly 26 GB
+in aggregate), although recent work has improved this and the memory requirement
+is now smaller than this. But this is a good starting point for beginning to
+calculate the memory required for halo-finding.
+
+**Two point functions**
+
+Please see :ref:`tpf_strategies` for more details.
+
+**Volume Rendering**
+
+The simplest way to think about volume rendering, and the radial column density
+module that uses it, is that it load-balances over the grids in the dataset.
+Each processor is given roughly the same-sized volume to operate on.
+In practice, there are just a few things to keep in mind when doing volume
+rendering.
+First, it only uses a power-of-two number of processors.
+If the job is run with 100 processors, only 64 of them will actually do anything.
+Second, the absolute maximum number of processors is the number of grids.
+But in order to keep work distributed evenly, typically the number of processors
+should be no greater than one-eighth or one-quarter the number of processors
+that were used to produce the dataset.
+
+


diff -r 4c7040708b895a28733073966c35558e2a204c4d -r e13057dcff4a8626dacfc3fd34537c60253b2f88 source/analysis_modules/two_point_functions.rst
--- a/source/analysis_modules/two_point_functions.rst
+++ b/source/analysis_modules/two_point_functions.rst
@@ -570,6 +570,8 @@
      If there were other output fields, they would be named
      ``bin_edges_01_outfield1``, ``bin_edges_02_outfield2`` respectively.
 
+.. _tpf_strategies:
+
 Strategies for Computational Efficiency
 ---------------------------------------
 



https://bitbucket.org/yt_analysis/yt-doc/changeset/1d51f3f29a94/
changeset:   1d51f3f29a94
user:        sskory
date:        2011-11-15 01:47:07
summary:     A bit more touchups and content to the parallel_computation docs.
affected #:  1 file

diff -r e13057dcff4a8626dacfc3fd34537c60253b2f88 -r 1d51f3f29a94c307c6b28db3033d347503ed6b7c source/advanced/parallel_computation.rst
--- a/source/advanced/parallel_computation.rst
+++ b/source/advanced/parallel_computation.rst
@@ -207,7 +207,7 @@
 processors.
 They are also fairly memory-conservative.
 The two limiting factors are therefore the number of grids in the dataset,
-and the speed of the disk the data is being read off of.
+and the speed of the disk the data is stored on.
 There is no point in running a parallel job of this kind with more processors
 than grids, because the extra processors will do absolutely nothing, and will
 in fact probably just serve to slow down the whole calculation due to the extra
@@ -218,8 +218,8 @@
 
 The best advice for these sorts of calculations is to run with as many processors
 as practical taking the machine's capability into account, as well as the time
-it takes to start the job in the job scheduler. Start with a few tens of
-processors and go from there.
+it takes to start the job in the job scheduler. Start with just a few processors
+and go from there.
 
 **Projections, Slices, and Cutting Planes**
 
@@ -235,12 +235,12 @@
    balancing, and in serial is quite fast.  However, as of yt 2.3, the
    operation by which quadtrees are joined across processors scales poorly;
    while memory consumption scales well, the time to completion does not.  As
-   such, projections can often by very fast when operating only on a single
+   such, projections can often be done very fast when operating only on a single
    processor!  The quadtree algorithm can be used inline (and, indeed, it is
    for this reason that it is slow.)  It is recommended that you attempt to
    project in serial before projecting in parallel; even for the very largest
    datasets (Enzo 1024^3 root grid with 7 levels of refinement) in the absence
-   of IO the quadtree algorithm takes only three minute or so on a decent
+   of IO the quadtree algorithm takes only three minutes or so on a decent
    processor.
  * Slices: to generate a slice, grids that intersect a given slice are iterated
    over and their finest-resolution cells are deposited.  The grids are
@@ -319,4 +319,58 @@
 should be no greater than one-eighth or one-quarter the number of processors
 that were used to produce the dataset.
 
+Additional Tips
+---------------
 
+  * Don't be afraid to change how a parallel job is run. Change the
+    number of processors, or the memory allocated, and see if things work better
+    or worse. After all, it's just a computer; it doesn't pass moral judgment!
+
+  * Similarly, human time is more valuable than computer time. Try increasing
+    the number of processors, and see if the runtime drops significantly.
+    There will be a sweet spot between run speed and waiting time in
+    the job scheduler queue; it may be worth trying to find it.
+
+  * It is impossible to tune a parallel operation without understanding what's
+    going on. Read the documentation, look at the underlying code, or talk to
+    other yt users. Get informed!
+    
+  * Sometimes it is difficult to know if a job is CPU-, memory-, or
+    disk-intensive, especially if the parallel job utilizes several of the kinds of
+    parallelism discussed above. In this case, it may be worthwhile to put
+    some simple timers in your script (as below) around different parts.
+    
+    .. code-block:: python
+    
+       from yt.mods import *
+       import time
+       
+       pf = load("DD0152")
+       t0 = time.time()
+       bigstuff, hugestuff = StuffFinder(pf)
+       BigHugeStuffParallelFunction(pf, bigstuff, hugestuff)
+       t1 = time.time()
+       for i in range(1000000):
+           tinystuff, ministuff = GetTinyMiniStuffOffDisk("in%06d.txt" % i)
+           array = TinyTeensyParallelFunction(pf, tinystuff, ministuff)
+           SaveTinyMiniStuffToDisk("out%06d.txt" % i, array)
+       t2 = time.time()
+       
+       print "BigStuff took %.5e sec, TinyStuff took %.5e sec" % (t1 - t0, t2 - t1)
+  
+  * Remember that if the script handles disk IO explicitly, rather than through
+    a built-in yt function, care must be taken to avoid
+    `race conditions <http://en.wikipedia.org/wiki/Race_conditions>`_.
+    Be explicit about which MPI task writes to disk, using a construction
+    like this:
+    
+    .. code-block:: python
+       
+       if ytcfg.getint("yt", "__topcomm_parallel_rank") == 0:
+           file = open("out.txt", "w")
+           file.write(stuff)
+           file.close()
+
+
+  
\ No newline at end of file



https://bitbucket.org/yt_analysis/yt-doc/changeset/ca0e9e0e9309/
changeset:   ca0e9e0e9309
user:        sskory
date:        2011-11-15 01:52:30
summary:     Removing a contradiction.
affected #:  1 file

diff -r 1d51f3f29a94c307c6b28db3033d347503ed6b7c -r ca0e9e0e930904553b386a92d3a42b0ee7ae8f85 source/advanced/parallel_computation.rst
--- a/source/advanced/parallel_computation.rst
+++ b/source/advanced/parallel_computation.rst
@@ -216,10 +216,9 @@
 file system, adding more tasks will not speed up the calculation if the disk
 is already swamped with activity.
 
-The best advice for these sorts of calculations is to run with as many processors
-as practical taking the machine's capability into account, as well as the time
-it takes to start the job in the job scheduler. Start with just a few processors
-and go from there.
+The best advice for these sorts of calculations is to run
+with just a few processors and go from there, seeing if the runtime
+improves noticeably.
 
 **Projections, Slices, and Cutting Planes**
 



https://bitbucket.org/yt_analysis/yt-doc/changeset/6586f434e9cb/
changeset:   6586f434e9cb
user:        sskory
date:        2011-11-16 18:18:32
summary:     Adding another tip.
affected #:  1 file

diff -r ca0e9e0e930904553b386a92d3a42b0ee7ae8f85 -r 6586f434e9cb49f1ec0683b1a753a2787c6bfc90 source/advanced/parallel_computation.rst
--- a/source/advanced/parallel_computation.rst
+++ b/source/advanced/parallel_computation.rst
@@ -371,5 +371,15 @@
            file.write(stuff)
            file.close()
 
-
+  * Many supercomputers allow users to ssh into the nodes that their job is
+    running on.
+    Job schedulers often include the names of those nodes in their
+    notification emails, and a command like ``qstat -f NNNN``, where
+    ``NNNN`` is the job ID, will also show this information.
+    By ssh-ing into the nodes, the memory usage of each task can be viewed in
+    real time as the job runs (using ``top``, for example), which gives
+    valuable feedback about the resources the job requires.
+    
+    
   
\ No newline at end of file



https://bitbucket.org/yt_analysis/yt-doc/changeset/0a8d3453d157/
changeset:   0a8d3453d157
user:        sskory
date:        2011-11-16 21:22:50
summary:     Small additions on advice of Jeff.
affected #:  1 file

diff -r 6586f434e9cb49f1ec0683b1a753a2787c6bfc90 -r 0a8d3453d15772454bef74e04df0fc40f8d2d40c source/advanced/parallel_computation.rst
--- a/source/advanced/parallel_computation.rst
+++ b/source/advanced/parallel_computation.rst
@@ -149,6 +149,10 @@
    from yt.mods import *
    import glob
    
+   # The number 4, below, is the number of processes to parallelize over, which
+   # is generally equal to the number of MPI tasks the job is launched with.
+   num_procs = 4
+   
    # fns is a list of all Enzo hierarchy files in directories one level down.
    fns = glob.glob("*/*.hierarchy")
    fns.sort()
@@ -162,10 +166,8 @@
    # In the loop, sto is essentially my_storage, but a local copy of it.
    # If data does not need to be combined after the loop is done, the line
    # would look like:
-   #       for fn in parallel_objects(fns, 4):
-   # The number 4, below, is the number of processes to parallelize over, which
-   # is generally equal to the number of MPI tasks the job is launched with.
-   for sto, fn in parallel_objects(fns, 4, storage = my_storage):
+   #       for fn in parallel_objects(fns, num_procs):
+   for sto, fn in parallel_objects(fns, num_procs, storage = my_storage):
        # Open a data file, remembering that fn is different on each task.
        pf = load(fn)
        dd = pf.h.all_data()
@@ -355,7 +357,8 @@
            SaveTinyMiniStuffToDisk("out%06d.txt" % i, array)
        t2 = time.time()
        
-       print "BigStuff took %.5e sec, TinyStuff took %.5e sec" % (t1 - t0, t2 - t1)
+       if ytcfg.getint("yt", "__topcomm_parallel_rank") == 0:
+           print "BigStuff took %.5e sec, TinyStuff took %.5e sec" % (t1 - t0, t2 - t1)
   
   * Remember that if the script handles disk IO explicitly, and does not use
     a built-in yt function to write data to disk,

Repository URL: https://bitbucket.org/yt_analysis/yt-doc/
