Still here

Yes, I’m alive. Sometimes I seriously doubt it, but then I take a look at my mailbox and change my mind. Why? Because hell can’t be worse than this, and I’m definitely not in heaven. That is, assuming I’m a Christian, which I’m not. I also think about other possibilities: if I’m in a simulation, it’s a very bad one. I mean, are they trying to evolve us or something? Are we the Pokémon of 15-dimensional intelligent beings? And if I’m a dream, I’m definitely a nightmare. Where did I leave the intensity and the risk, that sensation of being alive, of doing something nobody has done before? Is it this job I currently have?

Or maybe it’s just about time for a crisis. Anyway, I will survive. I’m a very bad pessimist, too.

 


Ideas

I have a lot of ideas. Really. The problem is, I don’t have time to unfold them. There’s the one about the eternal city, a low-tech version of Trantor from Asimov’s Foundation series. There’s the one that is a first-person (“watashi”) version of Trainspotting, but from the point of view of a successful, old-glory engineer. There’s the one about a mission to Ceres, with a lot of biological, real-world hard sci-fi, but no big drama or assassination like in Hollywood movies. Or the crime story on an alien planet that turns out to be our Earth in the end. A little bit like 3000 years after Stargate. And then, again, I need, I think I really need, to revisit and translate my old, published short stories. We’ll see how it goes. Maybe a blog is not the right way to give out all of these images. Maybe one day… maybe one day I can just ask my brain to record it all, and edit the result. Maybe. I hope so.

Entertainment

I got this drone that can record HD video. It also lets you stream VR video… unfortunately it doesn’t come with a phone powerful enough to cope with it. Anyway, I’m not so sure I want to go beyond normal usage here: as a hardware/software specialist, this kind of thing reminds me that I need to work. And although one works for a reason, that reason is usually not fun.

My entertainments are quite common. I’d like to have a new hobby, but I’m too lazy to start one from scratch. So I keep doing the same things: browsing, playing, thinking about how to improve our little world.

Slurm user cheatsheet

It’s time to compile the knowledge acquired after running our SLURM cluster for several months. I will dump here the information we have so far. I know it’s available everywhere, but on principle I want to have my own version of it 🙂

What it does | Example
List cluster load (jobs running, pending, completing) | squeue
List all current jobs for a user | squeue -u username
List all running jobs for a user | squeue -u username -t RUNNING
List all nodes and their usage | sinfo
List all nodes and their usage, in detail | sinfo -lN
List detailed information for a job (useful for troubleshooting) | scontrol show jobid -dd JOBID
Log in interactively (minimum resources, no node specified) | srun --pty bash --login
Log in interactively (no node specified, asking for a GPU) | srun --constraint=gpu --pty bash --login
Log in interactively on NODENAME (minimum resources) | srun --nodelist=NODENAME --pty bash --login
Log in interactively on NODENAME, asking for 40 tasks | srun -n 40 --nodelist=NODENAME --pty bash --login
Submit job.sh (default values) | sbatch job.sh
Submit job.sh requesting 40 tasks | sbatch --ntasks=40 job.sh
Submit job.sh to 4 nodes, not specifying which | sbatch --nodes=4 job.sh
Submit job.sh to node NODENAME | sbatch --nodelist=NODENAME job.sh
Submit job.sh asking for 30 GB of RAM per node | sbatch --mem=30720 job.sh
Submit job.sh asking for 8 GB of RAM per CPU | sbatch --mem-per-cpu=8192 job.sh
Submit job.sh asking for 30 GB of RAM per CPU | sbatch --mem-per-cpu=30720 job.sh
Submit job.sh to a GPU node, requesting 2 Titan X Pascal cards | sbatch --gres=gpu:TXP:2 --constraint=gpu job.sh
Cancel the job with ID JOBID | scancel JOBID

The options to allocate resources MUST be put together in a single command. For example, for submission:

 sbatch --ntasks=9 --nodes=3 \
    --ntasks-per-node=3 --gres=gpu:TXP:2 --constraint=gpu job.sh

and for interactive login:

 srun -n 40 --nodelist=NODENAME \
    --gres=gpu:TXP:2 --constraint=gpu --pty bash --login

Failing to do so will result in overbooking of resources, or a CRASH because of the lack of them.

The options can also be included in your script, by using the #SBATCH directive. For example:

#SBATCH --mem-per-cpu=2048
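
Several directives can be stacked at the top of the script, mirroring the combined submission example above. A minimal sketch (the shebang and the trailing placeholder command are mine, not a full submission script):

#!/bin/bash
#SBATCH --ntasks=9
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=3
#SBATCH --gres=gpu:TXP:2
#SBATCH --constraint=gpu
#SBATCH --mem-per-cpu=2048

# ... your actual commands go here, e.g. srun ./my_program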

I don’t think at this stage you need a full sample submission script, so I will not post one 🙂 but most of the ones you can find by googling will work, assuming you have all the software installed properly. In the future, I hope I can write a post about the possible error messages you may get, and what they mean… but so far, so good!

Marta and Alicia

Marta and Alicia obviously knew each other before coming here. There’s this intimacy that I can’t break. But they both seem to be comfortable with me. After a rather varied questioning about my view of life, they seem to relax, and start to get closer to each other. They look lovely together. I do try to get involved, and I partially manage, although I feel like I’m destined to get only the remnants of the banquet. I feel excluded, but still with hope of joining them. It’s a funny feeling, a mix of resignation and envy of the intimacy they have. They caress each other. They caress me. There are now three levels in our world: one is the one they stand on, the second is the one I’m on, and the third, far below, is the rest of the camp.

The sangria runs out, and we go straight for the wine. Now we are very happy; we live in a bubble that exists only for us. We sit on a blanket that came from I don’t know where. None of us has a tent. The wine bottle is close to us. We talk about life, about what we remember from when we were at school. The snow, the school trip. One we didn’t take together, but we ended up going to the same place. The mountains. The fascination with the palace. Our friends at that time. We ignore the people around us, and they ignore us, and at some point they melt away, they disappear, and it’s only us and the fire, which goes from shy to full blaze, illuminating us, as if it wanted to be part of it too. As if it felt the growing sexual tension.

Kubernetes, kubernetes

Now that we have our cluster more or less running with SLURM (although the configuration tuning is not settled, and will be discussed in another post), it’s time to play with it.

Reading about what you can do with a cluster, I stumbled upon Kubernetes. It’s supposed to be a generic container platform that will allow me to deploy, if needed, an application that will run on my cluster, doing what I want. If you read its overview, it says that with those monsters you can (a rough command-line sketch follows the list):

  • Deploy a containerized application on a cluster
  • Scale the deployment
  • Update the containerized application with a new software version
  • Debug the containerized application
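
To make those bullets a bit more concrete, this is roughly what they map to with kubectl, assuming a working Minikube or cluster. The deployment name, registry and image tags are made up for illustration:

# Deploy a containerized application on a cluster
kubectl create deployment my-analysis --image=registry.example.com/my-analysis:1.0

# Scale the deployment
kubectl scale deployment my-analysis --replicas=4

# Update the containerized application with a new software version
kubectl set image deployment/my-analysis my-analysis=registry.example.com/my-analysis:1.1

# Debug the containerized application
kubectl get pods
kubectl logs deployment/my-analysis
kubectl describe deployment my-analysis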

I thought I knew programming, but if I continue reading, it says “the tutorials use Katacoda to run a virtual terminal in your web browser that runs Minikube, a small-scale local deployment of Kubernetes that can run anywhere.” That’s amazing, but it sounds pretty strange to me.

So what can Kubernetes do for me? Let’s say I have a software module that runs a program on my cluster. That’s usually not the end of the road, and afterwards I need to interpret the results with yet another module, which demands different resources. If I had a Kubernetes setup that did that for me, pulling the resources it needs at every step and delivering the final results of the analysis chain to me, it would be fantastic! Of course, the input to such a magical tool needs to be meaningful, and that’s maybe the problem.

Maybe I’m trying to kill a mosquito with a cannon, but I like to play. Actually, I can imagine that, if I need to serialize my SLURM jobs, I can already do it by writing the right script and collecting the right signals from the ongoing analysis. Which signals, and what counts as right, is still to be decided. So we will see how far I get.
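
For the plain SLURM route, job dependencies already give a crude form of serialization. A minimal sketch, assuming two hypothetical scripts run_analysis.sh and interpret_results.sh:

# Submit the first step and capture its job ID
JOBID=$(sbatch --parsable run_analysis.sh)

# Submit the second step, to start only if the first one finishes successfully
sbatch --dependency=afterok:${JOBID} interpret_results.sh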

System X from Lenovo

We have a new bunch of these guys at home. They are really gorgeous machines: you can get a GPU server + data together for a handful of thousands, excluding the price of the GPUs, the RAM, and the hard disks. The connectivity is great, and they look clean when you open them, not like some SysGen servers I have seen (I don’t want to give names) that, whatever their final performance, look cramped with components, heavy and noisy. The finish is plastic, except for the cover, but even so, they look beautiful.

The user experience is very good, too. But I don’t want to sell you one. What I want to sell you is the remote management they have. There is a form that, once filled in, allows a machine under warranty to “call home” when it has a problem. If you allow that, the customer service for your area will contact the person named in the form and tell them what to do to repair the server, even if you hadn’t noticed the problem was there. What the hell, they even sell you the spare part, if you state you are brave enough to exchange it yourself (a RAM module, for example). So far so good. The problem is, the issue is not always solved, and since IBM has a big customer support center, it could be that your tickets get crossed in cyberspace or something, and you end up repairing something that is no longer giving errors. Yes, complaining is free. The fact is, one of our servers had its motherboard exchanged by a technician a week ago, and although I think he did everything right, the server is still not in a fully operational state. That is, it doesn’t work. And I’m still exchanging tickets with support.

If this happens to only one server out of 300, that’s fine, but I don’t keep the statistics. We will see; I will edit this post when the problem is solved.

EDIT: problem solved. A technician came and replaced the riser cards and one CPU. All that after already replacing 3 RAM modules and the motherboard. But if it works, it works. The way of collecting the logs was interesting. I was asked to run a script called

lnvgy_utl_dsa_dsala5z-10.2_portable_rhel7_x86-64.bin

The output looks like:

Lenovo Dynamic System Analysis

(C) Copyright Lenovo Corp. 2004-2015. (C) Copyright IBM Corp. 2004-2015. All Rights Reserved.
[...something here...]
Extracting...

Executing...

Logging level set to Status

Copying Schema...

cp: cannot stat ‘/etc/sysconfig/network/*-usb*’: Not a directory
cp: cannot stat ‘/etc/sysconfig/network-scripts/*-usb*’: No such file or directory
Dynamic System Analysis Version 10.2.A5Z
(C) Copyright Lenovo Corp. 2004-2015. All Rights Reserved. 
(C) Copyright IBM Corp. 2004-2015. All Rights Reserved.
Running DSA IMM plug-ins pass 1.
   IMM: Integrated Management Module Collector 
Running DSA IMM plug-ins pass 2.
   IMM: Integrated Management Module Collector 
Running DSA IPMI plug-ins pass 1.
Running DSA IPMI plug-ins pass 2.
Running DSA collector plug-ins pass 1.
... some stuff here...
...
Adding DSA log entries to XML file.
Writing XML data to file /var/log/Lenovo_Support/SOMETHING_LONG.xml.gz

DSA capture completed successfully.
cp: failed to access ‘/etc/sysconfig/network/’: Not a directory
cp: cannot stat ‘*-usb*’: No such file or directory

Please press ANY key to continue ...

I don’t know if the output of this was useful or not. But I’m happy my cluster is complete again 🙂