January 7, 2008

Data Data Everywhere, and not a drop to drink

This is a blog entry I've been meaning to write for a while.

messydesk2.jpg

I am the guilty party who manages this messy desk. I've been working on this pile for the better part of a year. If you stopped by and asked me for something, I could easily root through one of the piles and locate it for you. While the model works, it is not scalable, nor is it sustainable over time. At some point the fire inspector is going to show up.

How many of us approach our long term data storage in a similar manner? In an earlier blog I touched on the subject of long term data management. While NSF currently does not have a specific policy regarding data retention, there exists a Council on Governmental Relations policy on access to and retention of research data. NSF will be requiring a data plan and that part of the research grant be set aside for data retention. While researching information for this blog I came across a good example of such a plan, the Community Climate System Model (CCSM) Data Management Plan. I expect this will be typical for large grants.

We can buy oodles and oodles of disk and tape but without a clear and concise strategy, data management and access will look like my desk, frustrating not only ourselves but all those who have a rightful access to the data.

So while I have been procrastinating on writing about this subject, Joe at scalability.org wrote an excellent piece on long term data storage. In his blog Joe picks up on something posted a few days ago by Robin Harris at StorageMojo about Microsoft eliminating "easy" access to old file formats.

Joe makes some excellent points on the long term viability of stored data. What good is storing it if we don't have the means to access it in the future? As Robin points out, we can't trust closed source solutions. If you have only a couple of dozen files maybe you can remember to resave them in the new format every few years, but this is an administrative nightmare we do not need.

Not to pick on Microsoft; other vendors and our own practices are just as guilty. From a personal perspective, nearly 20 years ago (gasp, has it been that long??) I wrote my master's thesis using a then new-fangled text processor called Sprint from Borland. Using Sprint made sense at the time. However, today I can no longer access any of the text data. Don't even ask me to open the original data files or even try to open the plot files. I still have my old 8088 machine around that I used to create my text and data, but I doubt it will even start up, let alone read in these files. (In case you're wondering, I use that old machine to hold up some stacks of paper in my basement. Don't ask why...)

With the coming data explosion we're going to need a different approach than business as usual when it comes to data management. The storage part is going to be easy. All that takes is money. A fundamental question for the storage part of the equation will be whether it is cheaper to store the data or to rerun the experiment. Disks are cheap and getting cheaper, and like CPU power the cost eventually will approach 'zero'.
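The store-or-rerun question can be framed as a simple break-even comparison. The sketch below is only an illustration: the function name, the cost model (a per-terabyte yearly price that falls by a fixed fraction each year), and every number in the usage line are hypothetical placeholders, not figures from any real budget.

```python
def cheaper_to_store(size_tb, cost_per_tb_year, years, rerun_cost,
                     yearly_price_drop=0.0):
    """Compare the cost of keeping data against rerunning the experiment.

    All inputs are hypothetical planning numbers supplied by the caller.
    Returns (total_storage_cost, True if storing is cheaper).
    """
    store_cost = 0.0
    price = cost_per_tb_year
    for _ in range(years):
        store_cost += size_tb * price
        # Assumption: storage gets cheaper by a fixed fraction every year.
        price *= (1.0 - yearly_price_drop)
    return store_cost, store_cost < rerun_cost


# Made-up example: 10 TB kept for 5 years at 100 per TB-year,
# versus 10000 to rerun the experiment.
cost, keep = cheaper_to_store(10, 100.0, 5, 10000.0)
print(cost, keep)
```

A real comparison would also have to count the people and migration effort behind the storage, which is where the long term management challenge comes in.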

It is in the long term management where the challenge lies. What file formats do we use? What are the storage and search criteria we use so we can reference our data both now and in the future? My prediction for the retrieval and access part is we'll turn to some type of keyword-based search engine form of management with access rights built on top. We're going to need some help from our friends in Digital Library Technologies.

As for file formats, let's be careful out there. Two that pop to mind are Common Data Format and Hierarchical Data Format. These are not the end-all and be-all of storage formats but I firmly believe our thinking should be along these lines.

After we spend a good hunk of our careers collecting terabytes of fMRI data we should have a way to access it 20 years hence.

Microsoft's bricking some old data formats that are not even old enough to vote should be a stark reminder that we cannot continue business as usual.

The Community Climate System Model (CCSM) Data Management Plan sure looks like a good model to follow.

Comments, as always, are welcome!

January 4, 2008

New Year, New Promise

Call me a slacker.

No blogs were added since late September despite a backlog of stuff I have happened to write. So for the new year I will make it a point to keep Tales From the Run Time active and involved. However I still need you... provide comments (real ones, please) to let me know what material you want to see more of. Until then it feels like I am talking in an empty echo chamber!

September 12, 2007

storage and memory

In HPC we traditionally have been faced with how to acquire and operate machines that run faster and faster, with bigger and faster memory and more processors to allow for bigger science.

The fast machines have left us with vast amounts of data. Two important items are involved here: where to store these vast volumes of data, and how to catalog and access it quickly and easily.

I just read an article on the New York Times web site, "Redefining the Architecture of Memory," that has implications for not only system memory but fast storage.

The implications for HPC are huge.

August 9, 2007

more lionxc, more benchmarks

I discovered a great way to waste a lot of time and get nothing done. It's called making a service call on faulty hardware.

During the HPL runs for the lionxc IBA acceptance testing we ran across a couple of problems -- rack 2 would think it was overheating and shut down the nodes, one node had a bad power switch, one had a drive fail, and one went on permanent vacation.

The rack 'overheat' issue was vexing since the rack was not in an overheat state. Ambient temps were running consistently around 20C. Lionxc, for those who do not know, uses sealed, self-contained water cooled racks. This way we don't need to cool the room to cool the cluster and we eliminate a lot of hotspots within the rack. However, one of the rack temperature sensors thought it was overheating and took 'preventative' measures, which include automatically opening the rack doors, requiring the room air conditioner to help cool the nodes. Now with the loss of rack cooling the nodes really were starting to overheat and would shut themselves down to protect themselves.

While we were looking into the phantom overheat problem I decided to get a few faulty nodes repaired. One had a bad hard drive, for which a replacement was easy enough to obtain. One had a bad power switch, which forced us to pull and reinsert the power cord every time we needed to power-cycle the node. After some gyrations with IBM service they were convinced we needed a new switch.

The real fun began with the node on permanent vacation. It would boot up, run for a while, and then like a bad employee just go away, with no outward indication that nobody was home. It would happen at random times. We've enough experience to know this is either a faulty system board, CPU, or VRM. Well, the phone guy wasn't convinced and had me play all kinds of games with the SCSI backplane. After an hour and a half of getting nowhere (despite my repeatedly asking for a replacement system) he decided to ship parts. It turns out they still weren't the right parts: he sent us a new drive, SCSI backplane, and cable. Well, our naughty machine was still up to its old tricks, and after another hour on the phone we were right where I wanted to be 5 minutes into the first hour-long call: they are sending me a new main board and processors (the VRM is integrated onto the board). We'll see where this goes tomorrow but I can say with 99% confidence it'll be running.

So what was I supposed to be doing all that time instead of on the phone getting nowhere fast?

Compiling and running the new SpecMPI 2007 benchmarks for both the IBA and gigE networks. Since I had never used the SPEC benchmarks before, the config file and commands took some getting used to. I tried to block out a few contiguous hours to work uninterrupted but alas Service had different plans for my time.

I've started a few preliminary benchmarks but the numbers so far don't mean much to me. Once a complete sweep is done look here for the published numbers. Let's hope I don't get too sidetracked by other issues to stay on top of SPEC!

August 7, 2007

More lionxc HPL numbers

On Monday we scheduled an all-day system downtime for the xc cluster to finish the software installation of the IBA network and to run HPL across the entire cluster. Since using the entire cluster would create an odd geometry not suitable for a good HPL run, I picked a 22x22 size for 484 cores or 121 of the 125 available nodes.
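The 22x22 choice can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 4 cores per node (which is what 484 cores on 121 nodes implies) and assuming the goal is the largest square process grid that uses whole nodes only; the function name is made up for illustration.

```python
import math


def largest_square_grid(nodes, cores_per_node):
    """Find the largest P x P process grid that fits on whole nodes.

    Returns (P, cores_used, nodes_used).
    """
    total_cores = nodes * cores_per_node
    side = math.isqrt(total_cores)  # largest P with P*P <= total_cores
    # Assumption: we only want grids that fill complete nodes.
    while side > 0 and (side * side) % cores_per_node != 0:
        side -= 1
    used_cores = side * side
    return side, used_cores, used_cores // cores_per_node


print(largest_square_grid(125, 4))  # (22, 484, 121)
```

With all 125 nodes there are 500 cores, and 22x22 = 484 is the largest square that fits, which leaves 4 nodes idle.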

80% of memory resulted in an N of 288000. I ran out of time before trying the 90% N of 324000. NB was left at 65.

The end result was 3.2 TF, or appx. 55% of the theoretical peak of 5.8 TF. While not as good as the 65% achieved when using only a single rack, the numbers are still good. Lack of time prevented me from running additional, better-tuned runs for better numbers.
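For anyone who wants to check the percentage, here is a rough sketch of the peak and efficiency arithmetic. The 3.0 GHz clock and 4 double-precision flops per cycle per core are my assumptions for the Woodcrest parts; they are not stated in this post, but they do reproduce the 5.8 TF peak quoted above for 484 cores.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores x clock x flops per cycle."""
    return cores * clock_ghz * flops_per_cycle


def efficiency(measured_gflops, peak):
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / peak


# Assumed hardware numbers (see note above): 3.0 GHz, 4 flops/cycle/core.
peak = peak_gflops(cores=484, clock_ghz=3.0, flops_per_cycle=4)
print(peak)                                    # 5808.0 GFLOPS, about 5.8 TF
print(round(100 * efficiency(3200.0, peak)))   # 55
```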

Now to get some downtime to run a comparison just using the gigE network.....

August 2, 2007

Lionxc and Infiniband

Over the last week we installed QLogic InfiniBand into our newest cluster, lionxc. Since lionxc uses closed water-cooled racks, running the IBA cables was an exercise in special. After the first cabinet containing nodes lionxc1-lionxc41 was wired I started running the High Performance Linpack (HPL) benchmark. HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers.

I turned on as much optimization as I could for Lionxc's Woodcrest processors ( -O3 -march=core -msse3 -mcpu=core -fno-math-errno -ffast-math) and ran in several configurations. Since Lionxc contains nodes with 8 GB and 16 GB of memory, the largest memory size I could set for HPL was based on the smaller 8 GB memory. Using the 40 nodes wired in the first cabinet, I sized N for 90% of the total aggregate memory, which resulted in an N size of 185000. Experiments with NB led to an optimal result at size NB=65, with P=10 and Q=16.
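A common rule of thumb for sizing N is that the N x N matrix of 8-byte doubles should fill the chosen fraction of aggregate memory, with N rounded down to a multiple of NB. The sketch below shows that rule only; it is not necessarily how the N above was picked, and the function name and the choice of GiB units are my own assumptions.

```python
import math


def hpl_max_n(nodes, gib_per_node, mem_fraction, nb):
    """Upper bound on HPL problem size N for a given memory fraction.

    Rule of thumb: 8 * N^2 bytes <= mem_fraction * total memory,
    then round N down to a multiple of the block size NB.
    """
    total_bytes = nodes * gib_per_node * 2**30
    usable_doubles = int(mem_fraction * total_bytes) // 8
    n = math.isqrt(usable_doubles)
    return (n // nb) * nb


# 40 nodes at 8 GiB each, 90% of memory, NB=65 as in the runs above.
print(hpl_max_n(40, 8, 0.9, 65))  # 196560
```

By this rule the ceiling is a bit above the N of 185000 used here, so that run leaves some headroom for the OS and the MPI buffers.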

Over 160 cores Lionxc achieved a computation rate of 1243 GF, or roughly 64.7% of the peak theoretical performance of the first 40 machines. When using only 2 cores per machine (and roughly one process or core per socket, but not always guaranteed) the result dipped to appx. 680 GF, or appx. 54.7% of the performance of using all the cores per node. This is a pretty good showing for a multicore MPI-based application.

Now if only real world applications would run so well!

I'll post cluster-wide HPL and possibly other benchmark numbers once they become available.

July 26, 2007

How to Install a Black Box

173600185-S.jpg

OK, so this black box is really white.....

I recently blogged about Sun's Project Blackbox, or a datacenter in a can and its first customer installation at the Stanford Linear Accelerator Center (SLAC).

Blackbox, or in this case a black box painted white, is a 20-foot container, packed with more than 250 servers and an integrated cooling system. It arrived on a flatbed truck and was hoisted on a crane and lowered into position on a concrete pad behind Building 50 at SLAC. All SLAC needs to do is hook up the power, networking, and chilled water lines.

Here's a link to a time-lapse video of the installation: http://www.slac.stanford.edu/~boeheim/bbcam/0714.mov

While Blackbox was announced last October, with the first customer unit installed at SLAC this July, a Canadian data center provider, eNation, has begun marketing customized Blackboxes that it will configure and install for customers, most of whom are in the casino industry.

It's been a while

It has been almost a month since the last entry appeared here on Tales From the Run Time. It has been a busy month and time just escaped.

I took some time off earlier in the month to work on the greatest all-volunteer 4th of July show in the USA. That was a lot of fun and I look forward to next year's show.

Right after that came the big move. The Pleiades Cluster moved from the Computer Bldg. to its temporary quarters in the DLT machine room. It was a monumental effort by a lot of dedicated folks. The move itself took only 3 days but a lot of planning went into that. The best part was we only lost a network cable and 2 hard drives to failure. Not bad considering that machine has been in production nearly 4 years without ever being powered off, let alone uncrated, packed, driven across campus, unpacked and reracked!

Not even a single screw was lost. In fact, we had a net gain on screws as all the ones we 'lost' into the bowels of the racks when Pleiades was first constructed all shook out of their hiding places during the bumpy ride in the back of the moving truck.

During the move I also installed all the IBA HCA hardware into the lionxc cluster. We're getting the IBA switch wired in as this is written. The first rack of 40 nodes is wired up and the drivers installed. The first HPL numbers are in and the results look good so far. More on this in a later blog.

Wrapping up the busy month was an all-day workshop on the TotalView debugger hosted by TotalView Tech (formerly Etnus). We had great attendance and I look forward to hosting a similar workshop next year.

Coming up next week is a meeting with folks from RapidMind. The RapidMind Development Platform achieves breakthrough performance without the challenges of understanding the processor hardware or sophisticated parallel programming techniques. The RapidMind platform makes programming these processors as easy as single-threaded, single core programming, yet takes full advantage of all available resources. This looks similar to the product we were working with from PeakStream, the company bought by Google and removed from the public. We hope to get a feel from the RM folks for what RM can do, and hopefully it will be pretty close to where we were with PS, and then some. More later!

That's all I have time to write now as the laptop battery is getting low. More will follow and I promise it will not be another month before the next log entry!

June 29, 2007

Congressional Panel Favors Access to Publicly Funded Research

Abstracted from Gary Price's ResourceShelf:

Congressional Panel Favors Access to Publicly Funded Research
From the news release:

Public access to NIH-funded research took a major step forward this week with Senate Appropriations Committee agreement to direct the National Institutes of Health (NIH) to require that its funded research be made publicly available on the Internet.

This milestone was immediately praised by the Alliance for Taxpayer Access (ATA), a coalition of patient groups, researchers, consumers, and libraries that has long called for such a step.

"The momentum is real and Congress understands the public's interest," said Heather Joseph, Executive Director of SPARC (the Scholarly Publishing and Academic Resources Coalition, an ATA founding member).

Source: ATA

Read the original at URL http://www.resourceshelf.com/2007/06/29/access-to-info-congressional-panel-favors-access-to-publicly-funded-research/

Do you have your data plans ready?

June 28, 2007

More Constellation Fun Facts

Apologies to Mark Hamilton for stealing from his blog:

# The Sun Constellation System includes the world's largest InfiniBand switch, 3456 fully non-blocking IB ports using the latest Mellanox InfiniScale III switch chips.
# The Sun Constellation System requires only 1/6 as many IB cables as current solutions. In a 3456 node cluster, this is estimated to be a weight savings of over 8 tons in the cables alone!
# A single 3456 port switch replaces 300 individual switches in a traditional IB fabric design (288 24-port switches and 12 288-port switches).
# The Sun Constellation System supports AMD, Intel, and SPARC CPUs.
# The Sun Constellation System can be configured with up to an Exabyte of storage! Of course since the Sun Constellation System uses ZFS, there will be no limits in file system scalability as disk drive density increases to enable even greater storage capacity.
# The Sun Constellation System scales down to starter systems, starting with a single rack of 48 blades (192 sockets/768 cores) and up to 13,824 blades with 4 switches.
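The numbers in this list hang together, which a few lines of arithmetic can confirm. This is just a quick check, with 4 sockets per blade and 4 cores per socket inferred from the 48 blades / 192 sockets / 768 cores figures rather than stated outright; the function name is made up.

```python
def blade_totals(blades, sockets_per_blade=4, cores_per_socket=4):
    """Return (sockets, cores) for a given number of blades.

    The per-blade and per-socket defaults are inferred from the
    single-rack figures in the list above.
    """
    sockets = blades * sockets_per_blade
    return sockets, sockets * cores_per_socket


# Single rack: 48 blades.
print(blade_totals(48))           # (192, 768)

# Traditional fabric: 288 small switches plus 12 large ones.
print(288 + 12)                   # 300

# Four 3456-port switches, one port per blade.
print(4 * 3456)                   # 13824
```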