<div>Hi Stephen, Pragneshkumar</div><div><br></div><div>Let me know if I should move this discussion off the yt-dev list.</div><div><br></div>The box size is only 80 Mpc, or 56 Mpc/h, so load balancing probably helps with peak memory more than it hinders, but I could still try turning it off.<div>

<br></div><div>Would decreasing the number of cores alleviate some of the symptoms, or would that just increase the computation time too much?  I haven't done any scaling tests on parallelHF.<br><br></div><div>Incidentally, I killed the job after seeing the error below. It looks like MPI was having some buffer trouble; could the heavy IO be the cause?</div>

<div>.</div><div>.</div><div>.</div><div><br></div><div><div>MPI WARNING: Could not allocate an internal buffer in the last 30 seconds</div><div>on rank 503.  Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.</div>

<div>MPI WARNING: Could not allocate an internal buffer in the last 30 seconds</div><div>on rank 505.  Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.</div><div>MPI WARNING: Could not allocate an internal buffer in the last 30 seconds</div>

<div>on rank 509.  Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.</div><div>MPI WARNING: Could not allocate an internal buffer in the last 30 seconds</div><div>on rank 511.  Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.</div>

<div>Traceback (most recent call last):</div><div>  File "ParallelHaloProfiler.py", line 17, in <module></div><div>    rearrange=True, safety=1.5, premerge=True)</div><div>  File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/analysis_modules/halo_finding/halo_objects.py", line 1861, in __init__</div>

<div>    root_points = self._subsample_points()</div><div>  File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/analysis_modules/halo_finding/halo_objects.py", line 1991, in _subsample_points</div><div>    root_points = self._mpi_concatenate_array_on_root_double(my_points[0])</div>

<div>  File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/utilities/parallel_tools/parallel_analysis_interface.py", line 187, in passage</div><div>    return func(self, data)</div><div>  File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/utilities/parallel_tools/parallel_analysis_interface.py", line 794, in _mpi_concatenate_array_on_root_double</div>

<div>    data = na.concatenate((data, new_data))</div><div>ValueError: negative dimensions are not allowed</div></div><div><br></div><div>Can I get a refund of the SU on that job?</div><div><br></div><div>From</div><div>G.S.</div>
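<div><br></div><div>P.S. One guess about the "negative dimensions" ValueError, which I have not checked against the yt source: with 3200^3 particles, any count that passes through a signed 32-bit integer would wrap around to a negative number. A minimal sketch of that arithmetic (as_int32 is just an illustrative helper, not a yt or mpi4py function):</div>

```python
INT32_MAX = 2**31 - 1


def as_int32(n):
    """Return n as it would read back from a signed 32-bit integer (two's complement)."""
    return ((n + 2**31) % 2**32) - 2**31


# Particle count quoted in this thread.
n_particles = 3200 ** 3

print("particle count:     ", n_particles)
print("largest int32 value:", INT32_MAX)
print("count as int32:     ", as_int32(n_particles))
```

<div>The total particle count does not fit in 32 bits, so this seems worth ruling out before blaming the MPI buffer settings alone.</div>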

<div><br><div class="gmail_quote">On Fri, Sep 23, 2011 at 10:51 AM, Stephen Skory <span dir="ltr"><<a href="mailto:s@skory.us">s@skory.us</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Geoffrey,<br>

<div class="im"><br>

> (1) How much data are read/written by program ?<br>

> - After all the particles (3200^3 of them) are read in they are linked with<br>

> a fortran KDtree if they satisfy some conditions.<br>

><br>

> (2) How many parallel readers/writers are used by program ?<br>

> - It is reading using 512 cores from my submission script.  The amount of<br>

> write to disk depends on the distribution of the particle haloes across<br>

> processors, if they exist across processors then there will be more files<br>

> written out by write_particle_lists.<br>

> (3) Do you use MPI_IO ? or something else ?<br>

> - Yes, the program uses mpi4py-1.2.2 installed in my home directory<br>

> The details of the code can be found at:<br>

> <a href="http://yt-project.org/doc/analysis_modules/running_halofinder.html#halo-finding" target="_blank">http://yt-project.org/doc/analysis_modules/running_halofinder.html#halo-finding</a><br>

> under the section "Parallel HOP"<br>

<br>

</div>The main thing you got wrong is that we do not use MPI_IO. The IO is<br>

done primarily through a custom HDF5 reader written in C, and each<br>

thread does its own reading.<br>

<br>

The issue that Pragnesh is probably seeing, and what Geoffrey alludes<br>

to, is how load balancing is done. Because of the details of how Enzo<br>

stores its data, it is difficult to know where to send the data for<br>

load balancing without first reading it all in. Out of<br>

convenience, once the layout is established, the data is read in again<br>

(instead of distributed via communication), this time by the tasks<br>

that have been assigned the data. Furthermore, the data as assigned<br>

may come from several files, meaning that each task will be<br>

opening/closing multiple files multiple times.<br>
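<br>
A toy, serial illustration of the two-pass pattern described above. This is only a sketch under simplifying assumptions (a 1D split, in-memory lists standing in for files, made-up file names), not the actual Parallel HOP code:<br>

```python
import bisect
from collections import defaultdict


def two_pass_read(files, n_tasks):
    """files maps a (hypothetical) file name to the x-positions it stores."""
    opens = defaultdict(int)  # how many times each file gets opened

    # Pass 1: read everything once, only to learn where the particles are.
    positions = []
    for name, xs in files.items():
        opens[name] += 1
        positions.extend(xs)

    # Load balancing: pick cut points so each task owns about the same count.
    positions.sort()
    n = len(positions)
    edges = [positions[(i * n) // n_tasks] for i in range(1, n_tasks)]

    def owner(x):
        # Index of the slab that contains x.
        return bisect.bisect_right(edges, x)

    # Pass 2: each task re-reads every file holding particles in its slab.
    assigned = {task: [] for task in range(n_tasks)}
    for task in range(n_tasks):
        for name, xs in files.items():
            mine = [x for x in xs if owner(x) == task]
            if mine:
                opens[name] += 1
                assigned[task].extend(mine)
    return assigned, dict(opens)


files = {
    "cpu0000": [0.05, 0.10, 0.55, 0.60],
    "cpu0001": [0.15, 0.20, 0.65, 0.90],
}
assigned, opens = two_pass_read(files, n_tasks=2)
print({task: len(xs) for task, xs in assigned.items()})
print(opens)
```

In this toy case both tasks need particles from both files, so every file is opened once in the first pass and once more per task in the second.<br>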

<br>

If all these IO calls are causing a problem, I could see about putting<br>

in some kind of IO wait (configurable by the user) that basically<br>

slows down the reading part of the process.<br>
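<br>
A minimal sketch of what such a user-configurable IO wait might look like. The names read_with_io_wait, read_one and io_wait are hypothetical and do not correspond to any existing yt option:<br>

```python
import time


def read_with_io_wait(paths, read_one, io_wait=0.0, sleep=time.sleep):
    """Read each path in turn, pausing io_wait seconds between reads.

    read_one is whatever callable actually reads a file; sleep can be
    swapped out (for example in tests) and defaults to time.sleep.
    """
    results = []
    for i, path in enumerate(paths):
        # Pause before every read except the first one.
        if i > 0 and io_wait > 0:
            sleep(io_wait)
        results.append(read_one(path))
    return results


# Stand-in reader for the example: just report the length of the name.
sizes = read_with_io_wait(["a.h5", "bb.h5"], len, io_wait=0.01)
print(sizes)
```

With io_wait left at 0.0 the loop never sleeps, so the default behavior is unchanged.<br>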

<br>

p.s. Geoffrey - what is the cosmological size of your box? If it's<br>

above about 300 Mpc/h, load balancing is probably not necessary, which<br>

should roughly halve the IO required.<br>

<br>

--<br>

<font color="#888888">Stephen Skory<br>

<a href="mailto:s@skory.us">s@skory.us</a><br>

<a href="http://stephenskory.com/" target="_blank">http://stephenskory.com/</a><br>

<a href="tel:510.621.3687" value="+15106213687">510.621.3687</a> (google voice)<br>


</font></blockquote></div><br></div>