Spacepy: Reading SWMF .outs files

Hey! Now you can read *.outs files using Spacepy’s PyBats module! You can load a *.outs file using the same old classes (e.g., Bats2d, MagGridFile, etc.), pick what frame to read, and switch frames on the fly. Further, any self.calc_* functions will update when you switch frames, so you don’t need to re-calculate values upon loading a different data frame. Information about the number of frames, the time and iteration of each, and info on the currently loaded frame are all given in self.attrs.

Note that as of now, this new capability works for binary files only. ASCII *.outs file reading is coming soon.

This is a big capability improvement that changed a lot of guts of PyBats’ IdlFile class. Because this is my first Spacepy-related post on this blog, let’s start with some background to demonstrate why this is such a big improvement. We’ll then cover what the new syntax looks like and how it functions. Finally, we’ll go over a real-world example that contrasts the old and new ways of working with SWMF output.

If you’re an experienced SWMF or Spacepy user, click here to jump to see how to use the new capability. Otherwise, the background info will get you up to speed about why this new feature matters.

The updated code is available on my fork of Spacepy on the outs branch. A PR has been submitted, so this feature should be on merged into main soon.

Background

Most of the output files from the SWMF are dumped as single-epoch files. For example, 2D cuts through the BATS-R-US domain are dumped to disk at a set frequency, say, one file every 60 seconds. Both BATS-R-US and other codes leverage IDL-formatted files, typically with the suffix *.out. Because this approach can make many, many files over the course of a long simulation, the ability to post-process like files into a combined, multi-epoch file (*.outs files) has long been leveraged. These files contain many frames of data, stacked one after the other. They are more convenient in many ways, notably that they prevent your run directories from looking like this:

Reducing the number of output files from tens of thousands to ten or less is especially desirable on distributed file systems like Lustre where having many individual files results in a performance hit (and typically come with hard quotas on disk usage).

PyBats was originally developed to handle the single-frame *.out files (amongst other file formats.) Handling multi-frame *.outs files may seem like a simple step, but there are catches: unless you are on a powerful machine with a lot of RAM, there’s no way you can load an entire typical *.outs file that can be many gigabytes in size. We need a way to only load part of the file at one time and to nimbly move around the file without taking up a lot of time.

Switching to be able to handle *.outs files also required much re-writing of how IDL-formatted files are loaded. Reading the header had to be separated from loading the data and broken into several sub-functions so that we could efficiently jump from frame to frame. There is also a lot of information pulled from the file name itself; all of the regexes needed to be updated to handle the long, compound formatting of *.outs file names. Finally, there are considerations concerning how to handle new values derived from a given frame- should the derived values be wiped when we get the data from a different frame? Do we trust the user to re-calculate everything? Addressing these issues took some thought, but I’m quite happy with the end solution.

Handling *.outs files in Spacepy

Open *.outs files the same way you would any other file:

from spacepy.pybats import IdlFile
from spacepy.pybats.bats import MagGridFile, Bats2d

# Load an arbitrary IDL-formatted file naively:
raw = IdlFile('./y=0_mhd_1_e20150315-200000-000_20150315-235900-013.outs')

# Use subclasses to get output-specific functionality:
mhd = Bats2d('./y=0_mhd_1_e20150315-200000-000_20150315-235900-013.outs')
mag = MagGridFile('./mag_grid_e20150315-200000_20150318-131500.outs', iframe=10)

Note the optional use of iframe when opening the file. The default behavior is to load the data in the first data frame; iframe=x loads the xth frame.

To get information about the file, including about the available data frames, view the object-level datamodel attributes via the attrs dictionary:

mag.attrs
{'file': './mag_grid_e20150315-200000_20150318-131500.outs',
 'format': 'bin',
 'iter_range': [0, 0],
 'runtime_range': [0.0, 234900.070945],
 'time_range': [datetime.datetime(2015, 3, 15, 20, 0), datetime.datetime(2015, 3, 18, 13, 15)],
 'nframe': 784,
 'iters': array([0, 0, 0, 0, ... , 0, 0], dtype=int32),
 'runtimes': array([     0., 300.,  ... 234900.07094579]),
 'times': array([datetime.datetime(2015, 3, 15, 20, 0), ...
        datetime.datetime(2015, 3, 18, 13, 15)], dtype=object),
 'iter': 0,
 'runtime': 0.0,
 'ndim': 2,
 'nparam': 0,
 'nvar': 15,
 'header': 'Magnetometer grid (MAG) [deg] dB (North-East-Down) [nT]',
 'iframe': 10,
 'time': datetime.datetime(2015, 3, 15, 20, 0),
 'strtime': '0000h00m00.000s'}

There’s a lot more info in there than in previous versions! Each IdlFile object and all subclasses now have a new set of attrs to describe the data frames contained within (even if there is a single frame, as is the case for all *.out files). If the relevant information is missing from the file, dummy values are added — this is the case above, where iteration information is not found in the magnetometer grid file. Let’s break down the new additions:

AttributeDescription
iframeThe current frame number, zero-based indexing
nframeThe total data frames available in the file
iter, time, and runtimeThe iteration, time as a datetime, and the run time as seconds from the start of the simulation for the current frame
iters, times, runtimesNumpy arrays of all iterations/times/run times for each data frame within the file.
iter_range, etc.Two-element lists of the range of iteration/time/run time found within the file.

Switching frames is done through the switch_frame object method. Data is read from the disk and loaded into the object, replacing the previous data. switch_frame supports negative indexing, so you can use mhd.switch_frame(-2) to jump to the second-to-last frame. Let’s look at the first and the last frame in our Y=0 slice file:

# Load matplotlib, create a figure:
import matplotlib.pyplot as plt
fig=plt.figure(figsize=[10,5])

# Plot pressure at first frame:
mhd.add_pcolor('x','z','p', target=fig, loc=121)

# Switch to last frame, plot pressure:
mhd.switch_frame(-1)
mhd.add_pcolor('x','z','p', target=fig, loc=122)

# Clean up and show:
fig.tight_layout()
plt.show()

It is critical to remember that switching frames updates only the values that are contained within the file except for calc_* type methods. If you were to do something like this: mhd['new'] = 2*mhd['bx'], the values within mhd['new'] will not update when the data frame is switched. However, a new special decorator called @calc_wrapper can be used to create functions that update when the frame is switched. Any object methods that start with calc_ will update when the frame is switched:

# Return to first frame, calculate number density:
mhd.switch_frame(0)
mhd.calc_ndens()

# Create a figure and plot our new value:
fig=plt.figure(figsize=[10,5])
mhd.add_pcolor('x','z','n', target=fig, loc=121)

# Jump to last frame and plot.
# Because our value is a 'calc_' method, the values update automatically!
mhd.switch_frame(-1)
mhd.add_pcolor('x','z','N', target=fig, loc=122)

# Tidy up and show figure:
fig.tight_layout()
plt.show()

…and that’s pretty much it.

Real World Example: Visualizing the Magnetosphere

Let’s see this in action by performing a common task: visualizing the magnetosphere by making the same plot for all frames. We’ll loop over many Y=0 slices of the MHD output, trace magnetic field lines, and plot contours of number density. Number density is not a standard output; we must calculate it for each file we open.

The old way: looping over all *.out files

The “old way” is to loop over each *.out file, typically using glob.glob to find those files:

# Start with required imports
from glob import glob
import matplotlib.pyplot as plt
from spacepy.pybats import bats

# Loop over all y*.out files
for f in glob('./y*.out'):
    # Load the file, calculate number density.
    mhd = bats.Bats2d(f)
    mhd.calc_ndens()

    # Create plot and add field lines:
    fig, ax, contour, cbar = mhd.add_contour('x','z','N', xlim=[-30,15], ylim=[-25,25])
    mhd.add_b_magsphere(ax)

    # Save figure as PNG and close figure.
    fig.savefig(f'plot_it{mhd.attrs["iter"]:010d}.png')
    plt.close(fig)

The new way: looping over frames

The new way is very similar with a couple of key changes:

  1. We don’t need to loop over many files, we just loop through the frames and use switch_frame. This means we don’t need the glob module.
  2. We only need to calc_ number density once. It will update when we switch frames.

# Imports:
import matplotlib.pyplot as plt
from spacepy.pybats import bats

# Open our file and calculate number density:
mhd = bats.Bats2d('./y=0_mhd_1_e20150315-200000-000_20150315-235900-013.outs')
mhd.calc_ndens()

# Loop over total number of frames:
for i in range(mhd.attrs['nframe']):
    # Switch frame:
    mhd.switch_frame(i)

    # Create figure:
    fig, ax, contour, cbar = mhd.add_contour('x','z','N', xlim=[-30,15], ylim=[-25,25])
    mhd.add_b_magsphere(ax)

    # Save and close:
    fig.savefig(f'plot_it{mhd.attrs["iter"]:010d}.png')
    plt.close(fig)

Both of the above scripts are functionally equivalent: they created a plot like the one below for each file/frame. We could stitch them together into a gif using Imagemagick or FFMPEG or the like to get an animation.

Limitations and Closing Thoughts

The biggest issue is that right now this only works with binary files. Changing this to work with ASCII *.outs files isn’t a big deal, but binary files are used far more often so I wanted to get that working first.

In a perfect world, we could access multiple frames simultaneously to be able to interpolate in time easily. This is something I may play with in the future. Right now, if you want to do this, your best bet is to open the file twice and loop over both but with a one-frame offset so that you have two at once.

Another good thought care of @drSteve1663 is to allow the objects to act as a Python iterator so that you can loop over frames more easily:

for frame in mhd:
   #do something to each frame

This wouldn’t be too hard (just give the class __next__ and __iter__ methods), but would allow for far more streamlined loops. I would want to make sure it wouldn’t interfere with any useful behavior it already has now that uses the object’s dictionary-like keys.

A quick note about time: using *.outs files is not slower than keeping the files separate. In fact, I found a slight speed improvement (~5-10%) in my early tests using ~250 files on an M.2 solid state drive. The improvement is likely more pronounced on slower drives. This was a relief; I had guessed that the overhead of scanning the file to collect frame info would be more than keeping things separate.

That wraps things up for this new feature. Try it out and send feedback; I’m eager to find and fix issues. Folks have been requesting this capability for a while; I am hopeful it lowers barriers for those transitioning from IDL to Python.

Leave a Reply

Your email address will not be published. Required fields are marked *