GRID Superscalar FAQs

GRID superscalar tools

I have several log files in my workers' home directory. They are named gram_job_mgr_<number>.log
When a Globus job fails, it usually leaves information in a log named gram_job_mgr_<number>.log. If you don't need that information, you can safely delete these files. Depending on your Globus installation, they may be created always, only when errors arise, or never. Your system administrator can tell you which applies.
 

When I use gsstubgen I get this output: Warning: renaming file app-stubs.c to app-stubs.c~ / app-worker.c to app-worker.c~ / app.h to app.h~. What is this for?
In this case gsstubgen has made backups of the files previously generated from your IDL definition. These backups end with the '~' character. You can remove them by hand. Next time, if you don't want backups to be generated, use the -n flag.


GS Master

When I set GS_DEBUG to 10 or 20, the output of my main program seems to appear in really weird places. What is happening?
Standard output is buffered: what you print is held in a buffer so that the output of several calls can be written at once. It is therefore normal that your program's output sometimes appears in unexpected places among the debug messages.

 
When I redirect all output given from the master to a file, sometimes at the end some information is missing. Why?
Again, output buffering is misleading you. You may also notice that the order of some messages differs between printing to the screen and printing to a file. That is normal. You can repeat the execution printing to the screen to see how it ends.
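
If you want to reduce these buffering effects when redirecting to a file, one possible approach is sketched below. It relies on stdbuf from GNU coreutils, which only affects programs that use the default C stdio buffering. Whether that holds for your master is an assumption, and the helper name run_line_buffered is hypothetical.

```shell
# Run a command with line-buffered stdout and unbuffered stderr.
# Falls back to running the command unchanged when stdbuf is not installed.
run_line_buffered() {
    if command -v stdbuf >/dev/null 2>&1; then
        stdbuf -oL -e0 "$@"
    else
        "$@"
    fi
}
```

A possible invocation is: run_line_buffered ./app > master.log 2>&1 (master.log is only an example file name).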

 
I get a message like this when trying to run the master: ERROR activating Globus modules. Check that you have started your user proxy with grid-proxy-info
You forgot to start your Globus proxy, or its lifetime has expired. Run the Globus command grid-proxy-info to see whether you have started it. If you have not, run grid-proxy-init. If it has expired, run grid-proxy-destroy and then grid-proxy-init again.
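
The check described above can be scripted. The sketch below assumes that grid-proxy-info -timeleft prints the remaining lifetime in seconds and prints nothing useful on standard output when no proxy exists. The function names proxy_state and check_proxy are hypothetical.

```shell
# Classify a proxy from the value printed by `grid-proxy-info -timeleft`.
proxy_state() {
    secs="$1"
    case "$secs" in
        ''|*[!0-9-]*) echo "missing" ;;  # empty or non-numeric: no proxy found
        -*|0)         echo "expired" ;;  # zero or negative lifetime
        *)            echo "valid" ;;
    esac
}

# Query the real proxy (requires the Globus client tools to be installed).
check_proxy() {
    proxy_state "$(grid-proxy-info -timeleft 2>/dev/null)"
}
```

If check_proxy prints "missing", run grid-proxy-init. If it prints "expired", run grid-proxy-destroy followed by grid-proxy-init.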

 
The master ends with this message (or similar): ./app: error while loading shared libraries: libGS-master.so.0: cannot open shared object file: No such file or directory
Add the location of your GRID superscalar libraries to your LD_LIBRARY_PATH environment variable.
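
A minimal sketch of how this could be done from a shell start-up file follows. The function name add_lib_dir is hypothetical, and the directory you pass to it must be the one where your own installation placed libGS-master.so.0.

```shell
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
add_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}
```

For example, add_lib_dir /opt/gs/lib, where /opt/gs/lib is only a placeholder path. Afterwards, ldd ./app should no longer report libGS-master.so.0 as "not found".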


I get this message: ERROR: Check environment variables values. But I have all variables defined and GS_SHORTCUTS is set to 0
Some of your environment variables have invalid values or values that are too small. For example, GS_SOCKETS can only be set to 0 or 1. We have set lower limits on some variables so that your master runs correctly. See chapter 3.3, "Defining environment variables", of the GRID superscalar manual.
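
As an illustration, the one rule stated above (GS_SOCKETS must be 0 or 1) can be checked before launching the master. The function name check_gs_sockets is hypothetical, and the limits for the other variables are not reproduced here because they are only given in the manual.

```shell
# Print "ok" when the given value is a legal GS_SOCKETS setting, "invalid" otherwise.
check_gs_sockets() {
    case "${1:-}" in
        0|1) echo "ok" ;;
        *)   echo "invalid" ;;
    esac
}
```

For example: check_gs_sockets "$GS_SOCKETS"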


When working with GS_SOCKETS set to 1 I get a segmentation fault at the master. More precisely, this happens when a previous execution ends (prematurely or not) and I try to launch the master immediately
The problem is that some job managers from the previous execution are still running on the worker machines, because the socket version of the run-time doesn't wait for them to finish (in order to be faster than the file version). Before executing again, make sure that no Globus process remains on the workers, or simply wait 30 seconds (the longest time the job managers keep running after the worker ends).


I get this message: ****** ERROR AT TASK 0 !!! *******; ******** MACHINE <hostname.domain> *********; the job manager could not stage in a file
The cause may be that the gsiftp service on your master is not started or is not reachable. Make sure its port is open. You can check by connecting to that port with telnet (the default is 2811):



localhost> telnet localhost 2811
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
220 localhost GridFTP Server 1.5 GSSAPI type Globus/GSI wu-2.6.2
(gcc32dbg, 1032298778-28) ready.

 
If you don't get this output (or something similar), contact your system administrator and report that the gsiftp service is not working.

 
I get this message: ERROR: Submitting a job to hostname. Globus error: the connection to the server failed(check host and port)
One of your workers cannot run Globus jobs because the service called "gatekeeper" is not started or its port is blocked by a firewall. You can check it like this:
 

localhost> telnet localhost 2119
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

Replace localhost with the name of the worker you suspect is failing. The connection should stay open until you type 'quit'. If you get a "Connection refused" message, tell your system administrator that Globus is not working properly because the gatekeeper is not started or is unreachable.
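
When several machines have to be checked, the telnet tests for ports 2811 (gsiftp) and 2119 (gatekeeper) can be approximated with a small script. This is only a sketch: it assumes bash with /dev/tcp support and the coreutils timeout command are available, and the function name port_state is hypothetical. An "open" result only shows that something accepts connections on that port, not that the Globus service behind it works.

```shell
# Print "open" if a TCP connection to host:port succeeds within 5 seconds,
# "closed" otherwise (refused, filtered by a firewall, or timed out).
port_state() {
    host="$1"
    port="$2"
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
    fi
}
```

For example: port_state localhost 2811 for the gsiftp service on the master, or port_state hostname 2119 for the gatekeeper on a worker.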

 
When the master is going to end I get this message: ERROR: REMOTE DELETION OF FILES IN MACHINE hostname HAS FAILED. Globus error: (error from system). Checkpoint file erased for safety reasons. What happened?
When the master ends, it retrieves all result files and erases the temporary files on all the workers involved in the computation. If this final step fails, the master is left in an inconsistent state and cannot recover from the checkpoint file. You can collect your results and erase the temporary files by hand, or start your execution again from the beginning. The most common cause of this error is not having enough quota on the master to receive the result files, but check the "Globus error" text to find the exact cause.

I get an error like this when trying to run the master: License Manager Error: Your license expired on 23/02/2004. Please contact Rosa M. Badia (rosabatac [dot] upc [dot] edu). What is all this stuff about licenses? I haven't acquired any
We usually generate GRID superscalar distributions with an expiration date, so that you can benefit from new versions of GRID superscalar with new features and bug fixes. It is not good to stay with the same old (and possibly not bug-free) version forever.


GS Workers

The first task to execute returns an error of this kind: ****** ERROR AT TASK 0 !!! *******. When I look at the log files on the worker side I find this in ErrTask0.log: ../appworker: error while loading shared libraries: libGS-worker.so.0: cannot open shared object file: No such file or directory
You, probably with good intentions, deleted the line in workerGS.sh that sets the LD_LIBRARY_PATH environment variable so that the GS-worker library can be loaded. You cannot remove it unless your GRID superscalar library is installed in a standard location. Just put it back.

 
I get this message when I try to execute a remote task: ******** ERROR AT TASK 0 !!! *********; ********MACHINE hostname *********; the executable file permissions do not allow execution
You must check that the workerGS.sh file in the worker named hostname has execute permission. To change permissions you can run "chmod ugo+x workerGS.sh".

 
The first task ends with an error, but now when I look on the worker I find this in ErrTask0.log: workerGS.sh: ../appworker: No such file or directory
You have not compiled the worker on this machine.

 
Once more my first task fails but my log files are empty. That's crazy!
Make sure that the paths for finding the worker executable are correctly defined in broker.cfg, and that nobody has deleted the last line of workerGS.sh. It has to contain this: "../app-worker "$@""

 
I always get errors when trying to run a task on a worker. Is it Globus's fault? Is it GRID superscalar's fault? Is it my fault?
The first thing you can do when remote executions fail is to run a simple test to check that Globus can run jobs. For example:
 

globus-job-run worker1 /bin/date


Check whether this returns the current date and time. If it fails, contact your system administrator and report that you cannot use Globus to run your jobs.
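
To repeat the same test on every worker, a loop such as the following sketch may help. The function name check_workers and the GLOBUS_JOB_RUN override variable are hypothetical. The override exists only so that the sketch can be tried without a Globus installation.

```shell
# Run /bin/date on each listed worker through Globus and report the result per host.
# GLOBUS_JOB_RUN (hypothetical) may name an alternative command for testing.
check_workers() {
    runner="${GLOBUS_JOB_RUN:-globus-job-run}"
    for worker in "$@"; do
        if "$runner" "$worker" /bin/date >/dev/null 2>&1; then
            echo "$worker: ok"
        else
            echo "$worker: FAILED"
        fi
    done
}
```

For example: check_workers worker1 worker2 (the host names are placeholders for the workers listed in your own configuration).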

 
I receive this message at the master: ERROR: Submitting a job to hostname. Globus error: the cache file could not be opened in order to relocate the user proxy
Check whether there is disk space available on that worker machine. This error can leave some .gram_scratch_<random_name> subdirectories on the affected worker.

 
I receive this message at the master: ERROR: Submitting a job to hostname. Globus error: the job manager failed to create the temporary stdout filename
This can also be a quota problem on hostname.


I get this message: ERROR: Submitting a job to hostname. Globus error: data transfer to the server failed
The reason could be that you don't have enough quota on the worker machine to transfer your input files. Check this with the "quota" command.


After having a quota problem in a worker, I see some temporary files remaining. How can I manage to erase them correctly?
You can erase all subdirectories named .gram_scratch_<random_name>. Some input files may also remain (you will recognize their names). The remaining temporary files are described in section 4.4 of the GRID superscalar manual.
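
The removal of the scratch subdirectories can be sketched as follows. It assumes they sit directly inside the directory you pass (by default your home directory on the worker), which may differ in your Globus installation. The function name clean_gram_scratch is hypothetical.

```shell
# Print and remove every .gram_scratch_* subdirectory found directly in a directory.
# Assumption: the scratch directories are in $HOME unless another directory is given.
clean_gram_scratch() {
    dir="${1:-$HOME}"
    find "$dir" -maxdepth 1 -type d -name '.gram_scratch_*' -print -exec rm -rf {} +
}
```

To review what would be deleted before removing anything, run the same find command with only -print. Leftover input files are not touched by this sketch and have to be removed by name.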