How and where can one efficiently store ~ 1000s of output files produced by an app


#1

We are developing a Task in the Collab that executes a neuron synthesis code and, depending on the input selected by the user, it can produce up to thousands of h5 files. The problem is that storing these on the document service is slow, but if the user wants to store them on GPFS they have to specify an output directory in advance of running the job in the collab, which is a bit odd.

Is there a way for a user to set an output directory on GPFS from within a collab?


#2

I am not sure I got the question. What exactly do you mean by “set an output directory on GPFS from within a collab”?

In general there is no relation between any collab and folders in GPFS.


#3

As Yury said, a GPFS folder is not related to a collab as of today, as GPFS is very specific to our infrastructure. But I think what you are looking for is a proper way to speed things up by using GPFS to store files.

I can think of asking the user to enter a GPFS path as an argument. Your task can then create the folder on the fly in the current HPC project if it does not exist. To keep it portable, I would suggest falling back to the document service for the case where a user has no access to our GPFS resource.
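
A minimal sketch of that idea, assuming a plain Python helper inside the task; the `upload_fallback` callable is a stand-in for whatever document-service client the task already uses, not a real API:

```python
import os

def store_outputs(files, gpfs_output_path=None, upload_fallback=None):
    """Write result files to GPFS when a path was given, otherwise hand
    them to a fallback uploader (e.g. a document-service client)."""
    if gpfs_output_path:
        # Create the folder on the fly inside the current HPC project.
        os.makedirs(gpfs_output_path, exist_ok=True)
        for name, payload in files.items():
            with open(os.path.join(gpfs_output_path, name), "wb") as out:
                out.write(payload)
    elif upload_fallback is not None:
        # Portable fallback for users without access to our GPFS resource.
        for name, payload in files.items():
            upload_fallback(name, payload)
    else:
        raise ValueError("Need either a GPFS output path or an uploader")
```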

One downside of doing that is that the files won’t be visible from our storage unless you manually register them as links in the document service. That doesn’t sound so simple…

Maybe Mike, Jean-Denis or Juan would have a nice idea?


#4

The morphology release task is doing a similar thing: we take a “gpfs output path” as a parameter so that the morphologies are stored in that output path at the end of the task.

So the “virtual file” can be stored and visible in your collab storage, but the physical files are stored in GPFS.


#5

In general it may not be possible to create a folder in GPFS from the task, since we don’t have impersonation.

So the obvious solution is to pass the GPFS output path as an argument, for example:
https://bbpcode.epfl.ch/browse/code/platform/tests/SampleTask/tree/gpfs_output_task/gpfs_output_task.py?h=HEAD

But as I understood it, one issue is that the user can’t be sure the path exists in advance, and another is that you would like to distribute these files across several subfolders, right?
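
For reference, both of those points can be handled inside the task itself with standard library calls; a rough sketch, with purely illustrative subfolder names:

```python
import os

def prepare_output_tree(gpfs_output_path, subfolders):
    """Create the parent output path and each subfolder if missing,
    returning the resulting directories keyed by subfolder name."""
    tree = {}
    for sub in subfolders:
        path = os.path.join(gpfs_output_path, sub)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        tree[sub] = path
    return tree

# e.g. prepare_output_tree("/gpfs/bbp.cscs.ch/project/projNN/myoutput",
#                          ["accepted", "rejected"])  # names illustrative
```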


#6

That is right. We would need the task to create sub-folders, and we’d like the whole thing to be read-only after the task is completed, otherwise provenance can be broken. Also, this requires that the user knows about GPFS. I am not sure Abigail can be expected to have that kind of knowledge.
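
If only accidental modification needs to be prevented, the task could at least strip write permissions from its own output when it finishes; a rough sketch (this is best effort, not enforced immutability, since the owner can always chmod it back):

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def make_tree_read_only(root):
    """Remove write permission from every file and folder under root so
    the finished output cannot be modified by accident."""
    for dirpath, _, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            os.chmod(path, os.stat(path).st_mode & ~WRITE_BITS)
        os.chmod(dirpath, os.stat(dirpath).st_mode & ~WRITE_BITS)
```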


#7

Hi,
thanks for the quick answers. This question came up in a discussion with Juan, but he is away until Monday, so maybe I should wait until he gets back to continue. The aim is, as you say, to speed up the writing of large numbers of files.
Talk again soon,
Julian


#8

Regarding read-only: there are no mechanisms in place for that in GPFS at the moment.

A side thought: I think it is a bit strange to return 1000 files from a job. Nobody will look at these files one by one; as I understood it, the only reason to produce them is to pass them to the next job. So what is the reason to create 1000 entities in the document service or in GPFS? Why don’t you like the option of zipping them? Then you can put this zip either in the document service (if its size is reasonable) or on GPFS.
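
For example, a bundling step like the following could run at the end of the job, so only one entity ends up in the document service or on GPFS:

```python
import os
import zipfile

def bundle_outputs(output_dir, archive_path):
    """Pack every file under output_dir into one zip archive, keeping
    the relative folder layout, so only a single entity needs to be
    registered in the document service or copied to GPFS."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _, filenames in os.walk(output_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, arcname=os.path.relpath(full, output_dir))
```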

About users knowing about GPFS: in general, when somebody runs a job he/she should be aware of HPC, projects and so on. Passing a path like /gpfs/bbp.cscs.ch/project/projNN/myoutput doesn’t seem like too much to ask. I agree that it could be more user friendly, but it’s not as if files will appear in the right place magically.


#9

“Nobody will look at these files one by one.”

I wasn’t making that assumption. My assumption is that different applications downstream will want to access different subsets of files. For example, the validation job could create further bundles for files that pass or fail different criteria. A brain builder job may need a bundle of files that satisfy yet another set of criteria. But, as I said, that is my assumption. Maybe it is OK to treat the output of a job as a single entity. But maybe Julian can comment on that.


#10

We don’t particularly care about GPFS. We just want safe, fast storage for thousands of morphology files, with provenance and with some kind of “bundle” or “collection” concept. And I assume brain building will have similar requirements. In our running scenario, a simulation job creates sub-folders, so the user only needs to specify the parent output path. The different output paths are used to identify different types of morphology, because the HDF5 format doesn’t have metadata fields we can use, and traditionally the morphology type is not encoded in the file name.
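
A minimal sketch of that layout, assuming the task knows each morphology’s type at write time (the mapping format here is made up for illustration):

```python
import os

def store_by_type(morphologies, parent_output_path):
    """Write each morphology into a subfolder named after its type,
    since the type is encoded neither inside the HDF5 file nor in its
    file name. `morphologies` maps file name -> (type name, bytes)."""
    for filename, (morph_type, payload) in morphologies.items():
        type_dir = os.path.join(parent_output_path, morph_type)
        os.makedirs(type_dir, exist_ok=True)
        with open(os.path.join(type_dir, filename), "wb") as out:
            out.write(payload)
```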


#11

I know it can be done. But there are at least two issues:

  1. From a UX point of view, it may be confusing to have a task output location on the document service that can be selected from the UI, and a separate output location on GPFS that is an input parameter.

  2. If we pass a writable directory path, we have to start worrying about protecting the output from further modification, checking that a job doesn’t overwrite previously existing files, etc. This is extra complexity that we’d rather do without (see the sketch after this list).
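
For what it’s worth, the overwrite check in point 2 is cheap to do inside the task; a rough sketch:

```python
import os

def claim_output_path(path):
    """Accept an output path only if it does not yet contain files, so
    a new job cannot silently overwrite the results of a previous run."""
    if os.path.isdir(path) and os.listdir(path):
        raise RuntimeError("Output path %r is not empty; refusing to "
                           "overwrite previous results" % path)
    os.makedirs(path, exist_ok=True)
    return path
```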


#12

So if I understand and summarize this request correctly, it’s a statement that the Document service has the right balance of safety and web accessibility for your use case; you just need a more performant path for writing large collections of files.

I’m trying to summarize it like this because we’ll do another Platform Developer Survey soon, and I would add this as one of the items.

Cheers,
Jeff


#13

Correct, but we also need a performant path for reading these large collections of files. Reading isn’t stated explicitly, but given that we plan to read many times and write only once, I would argue that efficient reading has a higher impact than efficient writing.

There may be other requirements, not needed for M30. Should I start separate threads for those?


#14

@palacios I have to state this for clarity: honestly, there won’t be any miracles next month. Yury is working with you to get GPFS working as it should today, and we are listening for future releases. But for M30, you will have to live with GPFS and its current strengths and weaknesses.


#15

@olivier Yes, that is clear. We have an ugly work-around for M30. And Yury’s help is very much appreciated.


#16

@olivier Yes, this is an important point. There won’t be any changes before M30, but we’ll see if other people we survey prioritize improving performance for a similar use case, and then decide whether we should address it after M30.