About bitsanddragons

A traveller, an IT professional and a casual writer

GPFS : force deletion of a node

In the spirit of this being my notebook ❤️ I want to write down my experience forcing the deletion of a GPFS node. The disk of the node ‘deadnode’ died without warning (or was it that I wasn’t looking?), so I didn’t have a chance to stop GPFS properly first. Let’s go and force the deletion:

quorum ~ ## > mmdelnode -N deadnode
Verifying GPFS is stopped on all affected nodes ...
The authenticity of host 'deadnode (1.2.3.4)'
can't be established.
Are you sure you want to continue connecting (yes/no)? yes
deadnode: bash: /bin/ksh: No such file or directory
mmdelnode: Unable to confirm that GPFS is stopped
on all of the affected nodes.
Nodes should not be removed from the cluster
if GPFS is still running.
Make sure GPFS is down on all affected
nodes before continuing.
If not, this may cause a cluster outage.
mmdelnode: If the affected nodes are permanently down,
they can be deleted with the --force option.
mmdelnode: Command failed.
Examine previous error messages to determine cause.
quorum ~ ## > mmdelnode -N --force deadnode
mmdelnode: Incorrect extra argument: deadnode
Usage:
mmdelnode {-a | -N {Node[,Node...] | NodeFile | NodeClass}}
quorum ~ ## > mmdelnode -N deadnode --force
Verifying GPFS is stopped on all affected nodes ...
deadnode: bash: /bin/ksh: No such file or directory
mmdelnode: Unable to confirm that GPFS is stopped
on all of the affected nodes.
Nodes should not be removed from the cluster
if GPFS is still running.
Make sure GPFS is down on all affected nodes before continuing.
If not, this may cause a cluster outage.
Do you want to continue? (yes/no) yes
mmdelnode: Command successfully completed
mmdelnode: Propagating the cluster configuration data
to all affected nodes.
This is an asynchronous process.
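
Before reaching for --force it seems worth double-checking that the node is really dead, since mmdelnode itself warns about a cluster outage otherwise. Below is a minimal sketch of such a guard, not a GPFS tool: the function name force_delete_cmd and the PROBE variable are made up for this note, and using ping as the liveness test is an assumption (a node can answer ping with GPFS stopped, or be firewalled). It only prints the command, with the argument order that worked above, so nothing gets deleted by accident.

```shell
#!/bin/sh
# Sketch only: PROBE and force_delete_cmd are hypothetical names, not GPFS tools.
# PROBE is the command used to decide whether the node is still alive.
PROBE="${PROBE:-ping -c 2 -W 2}"

# Print the force-delete command for a node, but only if the node is unreachable.
force_delete_cmd() {
    node="$1"
    if $PROBE "$node" >/dev/null 2>&1; then
        echo "$node still answers, stop GPFS there first" >&2
        return 1
    fi
    # The node list must follow -N directly; --force goes at the end
    echo "mmdelnode -N $node --force"
}
```

Calling force_delete_cmd deadnode on a quorum node prints the mmdelnode line to copy and run by hand; if the node still answers, it prints a warning and returns an error instead.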

If you don’t know what a GPFS filesystem is, just search this very blog. There’s plenty of information here. Maybe too much information. Maybe too much. 😁😁😁

kernel:NMI watchdog: BUG: soft lockup on CentOS 7.X

I have seen the soft lockup bug before, but I never wrote down the fix. On my AMD servers, I sometimes see a lot of messages like the following printed on the shell:

May 25 07:23:59 XXXXXXX kernel: [13445315.881356] BUG: soft lockup - CPU#16 stuck for 23s! [yyyyyyy:81602]

The date, the CPU and how long it is stuck vary. Your analysis will still run and the machine is still usable, but the messages can become so annoying (especially if you have a lot of CPUs) that they may not let you work. I found this post for SUSE that describes something really similar to what happens on CentOS, so I gave it a try. First I check that watchdog_thresh is defined:

hostname:~ # more /proc/sys/kernel/watchdog_thresh
10

Then I do as suggested:

hostname:~ # echo 20 > /proc/sys/kernel/watchdog_thresh

And so far the message has not come back. Problem solved! 😊😊
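
Writing to /proc only changes the running kernel, so the value goes back to 10 after a reboot. Here is a possible way to keep it, sketched under assumptions: the same setting is exposed as the sysctl key kernel.watchdog_thresh, and drop-in files under /etc/sysctl.d/ are read at boot. The function name and the file name 99-watchdog.conf are hypothetical. As far as I understand the kernel documentation, the soft lockup message fires after about twice this value, which would fit the 23s seen with the default of 10.

```shell
#!/bin/sh
# Sketch only: persist_watchdog_thresh and 99-watchdog.conf are hypothetical names.
# $1 = threshold in seconds, $2 = drop-in file to write (optional)
persist_watchdog_thresh() {
    thresh="$1"
    conf="${2:-/etc/sysctl.d/99-watchdog.conf}"
    # Refuse anything that is not a plain non-negative integer
    case "$thresh" in
        ''|*[!0-9]*)
            echo "threshold must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    # One "key = value" line, the format sysctl reads at boot
    printf 'kernel.watchdog_thresh = %s\n' "$thresh" > "$conf"
}
```

As root, persist_watchdog_thresh 20 followed by sysctl --system should load the new file without waiting for the next reboot.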

A prehistoric tantrum

My small one is in general lovely, but lately he has started to throw tantrums more and more frequently. He starts to cry loudly and goes off in search of a dark corner to hide in: a cupboard, behind the sofa, and the like. As a scientist I’d like to understand not only why this is happening but, more specifically, what tantrums are and when they are most likely to happen. I found out (I think this is no secret) that he is more likely to have a tantrum when he’s tired, like at the end of the day. Also, as with adults, when a lot of things have gone wrong for him during the day. So what’s the origin of this defensive mechanism? Since I’m no expert I can only speculate. Let’s go back to the Stone Age and imagine the following scene. There are two young hunters independently chasing the same prey, let’s say a boar. Wild, of course, like everything in those days. For simplicity, let’s say both hunters belong to the same tribe and speak the same language, if any, but they are not necessarily friends.

“Ugh” says the first, and throws his spear at the boar before hiding again in case the animal charges in his direction. “Ugh ugh” shouts the second a little afterwards, calling the attention of the first. He then leaves his hideout, just to find out that there are two spears piercing the dead boar’s belly. Obviously we don’t know which one came first, at least with this equipment. “Ugh ugh ugh!” mumbles the first, questioning the scene. “Ugh ugh ugh uuuugh!” insists the other. They both stare at the body in the mud, and repeat their messages. “Ugh ugh ugh!” and “Ugh ugh ugh!”, because of course they belong to the same tribe. Then the second, being the second in this story, starts to shout outright. The first one, annoyed at first and scared afterwards, turns his back for a moment to check whether there’s a mammoth or a tiger coming because of the noise, in the same way a present-day mother or father will check whether there’s someone around likely to complain. Only to find out, on turning back to the fallen prey, that the other has gone into hiding with the catch, in a tantrum. His catch? We’ll never know. And so the evolutionary use of the tantrum is born. Please leave your comments below 😁😁😁.

The robot land

I’m on a road trip. To where I don’t remember, but who cares. Our cabriolet is red, and probably rented. The wind blows gently, but it’s hot. The road is now crossing black, volcanic terrain. I’m not driving and actually I don’t know my driver. Meaning I don’t know anyone like him in the real world. An Italian, by the accent. We stop on a gentle slope, at some kind of lookout, and as if previously agreed, we park and open the trunk lid to get some cold beers and two pairs of binoculars.

“There” he says, after sweeping the land for a good five minutes, while I stare into infinity. “I found a group. Look with me.” I take the binoculars and look in the direction he’s pointing. After focusing, I see them. Machines. Barely humanoid, similar to the Star Wars attack drones, but less clumsy and with more limbs. “Are they them?” I ask. “Yes they are,” he answers. “Do you see how they dig with the first set of arms? I believe those are actually the mouths. The second set they use to build their offspring. They grow them on their backs. But it’s a collaborative effort, one makes a piece, then he passes it to another. You get it.” I sip my beer. He leaves the binoculars hanging over his chest. “Isn’t the Government unhappy about them?” He chuckles. “The Government? They don’t know what to do. If it’s not going to produce money, they are not interested.” I sip my beer, and decide to throw the empty can as far as I can, hoping maybe to hit one of the robots on the head. Of course I don’t manage, but my gesture speaks for itself. “But they are self-replicators, right? Aren’t they dangerous?” My driver starts heading back to the car. I follow him. He takes a soda, and drinks it without looking at me. With a sad voice he says “We hope not. Actually, some of us believe that they are repairing the land, that they are some kind of long-lost pollution cleaning system. If this is true, they will simply deactivate after the job is done.”

I look at the robots again. They don’t look especially dangerous, in fact. “I see.” I say. “So you’re letting them clean your mess.” My driver looks at me, angry. “My mess? It’s their mess! The politicians created this wasteland! They didn’t care about us! They only wanted more money, and they didn’t care about the consequences. So we are grateful, and if needed, we will even give them resources, if we manage to communicate with them. We now have a name for this area, la terra dei robot, the land of the robots.”

How to use the HP_RECOVERY partition on Windows 7

Things get broken, things get tainted… Windows 7 is no longer supported but I still have some PCs running it. And, stupid me, one of them didn’t have a backup copy of the initial working system, since it was installed in 2014. I did find the original CD but it was… let’s say… unreadable. Actually, the metallic layer of the CD had peeled off.

Time to recover it. My first attempt was to install a new W.7 system on a new disk and use the activation code on the sticker to activate it. Unfortunately an OA (OEM Activation) Windows can’t be re-installed that way, so I ended up with a brand new, unregistered Windows 7. I tried with different dumped Windows 7 versions and with a Windows 10, with partial success. But I don’t want to write about that. The thing is, I was trying to solve the problem the wrong way. With a working OS I can read the “damaged” system disk, and I see that it comes with a partition called HP_RECOVERY. How should I have known? I never used the computer! They never called me to check it before! No one cared until now! 😔😔😔

Anyway, the super-expensive hardware is connected to an HP desktop, so the best option is to use the HP system recovery for Windows 7. Basically, since I can neither boot nor recover my install from a restore point, I’m forced to perform a full HP recovery. This takes time and it’s destructive, so first I clone the original disk, in case I need to extract some license file from it, then I boot the computer pressing F11 repeatedly. At some point an HP splash screen appears and I see some shell scripts running, then I need to wait (maybe you don’t) until a new window pops up. Sorry guys, no screenshot, but there are not many options in the new window: repair by wiping everything, or cancel. I choose the first, click on I accept all the risks (nice) and wait for a few hours until I magically have a W.7 back. This time, with a license. Now I can tell the user to install his programs to control the hardware, or to contact the company that supplies it. And I wonder… for how long is this agony called W.7 going to be with us? 😔😔😔

ERROR: MediaWiki not showing up after a reboot, skin missing

The missing skin message

I like the motto If something works don’t touch it. Unfortunately in the era of CI/CD we need to touch it, and sometimes we even need to reboot it. This is what happened to me. There’s this nice MediaWiki 1.31 install that has been running for years in my team, not as a docker container or a fancy VM but as a real web server, with its httpd and mariadb services. I run weekly backups but I don’t update it, precisely because I was afraid of breaking it. For a reason. After a stupid power cut, when I managed to get the web server back online, I found out that the httpd and mariadb services were running but MediaWiki was showing a Blank Page. As described in the link above, I go to my /var/www/html/ to edit LocalSettings.php. Right at the top (just below the opening <?php) I copy:

error_reporting( E_ALL );
ini_set( 'display_errors', 1 );

and run systemctl restart httpd. Instead of the Blank Page, the error now reads like this:

Fatal error: Uncaught Exception: /var/www/html/skins/MonoBook/skin.json 
does not exist! in /var/www/html/

and similar. I check and, for some weird reason (an update? the other sysadmin?), the skins folder is gone. I then try disabling the skins by commenting them out in LocalSettings.php and restarting httpd. It sort of works. I get the wiki content unskinned, with an error at the top like this:

The wrong skin message

What to do now? I downloaded the whole wiki zip again (the same version), copied the skins folder over, uncommented the skin lines in LocalSettings.php and restarted httpd. And the wiki is back. What to do next? An invisible migration to a more reliable setup, maybe a kubernetes pod. We’ll see if I have time. Or mood. Or actually, both 😉.
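
To catch this before the next reboot does, a small check can compare what LocalSettings.php loads with what is actually on disk. This is only a sketch: the function name missing_skins is made up, and it assumes skins are enabled with lines like wfLoadSkin( 'MonoBook' ); and that each skin ships a skin.json under skins/, which is what the error message above complains about. Lines that are commented out are ignored.

```shell
#!/bin/sh
# Sketch only: missing_skins is a hypothetical helper, not part of MediaWiki.
# $1 = wiki root (the post uses /var/www/html)
missing_skins() {
    root="$1"
    # Pull the skin name out of active lines such as: wfLoadSkin( 'MonoBook' );
    sed -n "s/^[[:space:]]*wfLoadSkin([[:space:]]*['\"]\([^'\"]*\)['\"].*/\1/p" \
        "$root/LocalSettings.php" |
    while read -r skin; do
        # Report every loaded skin whose skin.json is not on disk
        [ -f "$root/skins/$skin/skin.json" ] || echo "$skin"
    done
}
```

Running missing_skins /var/www/html prints one skin name per line for every skin that is loaded but missing, and nothing at all when the wiki is healthy.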

Cleaning the hill

If you are an experienced dreamer you know how complicated it is to remember the whole movie, even after retaining the picture. My picture today is a green hill stuck between two small mountains, with a granite bridge over it, the bridge carrying a road or similar. In the middle, coming from a hole in the hill, a river around 50 meters wide flows peacefully, and we are on one of the shores. The road on the bridge goes around most of the mountainous border of the city, and it is supported by enormous concrete pillars. Nothing fancy: square, mid-twentieth century, I would say. Nothing supernatural or extravagant. My small son, some other people and I are gathered on our shore to observe some operation going on on the other shore of the unnamed river. There are two guys dressed in climbing gear, one with a yellow helmet, the other with a red one. A small truck, fancily designed, is parked close to them with rather obscure equipment. I hear the other people murmuring around me; unfortunately, I don’t seem to understand the local language.

After carefully examining the situation, the guy with the yellow helmet starts to climb a tree that grows almost below the bridge. The red helmet walks away to the cartoonish truck, where he seems to take a collection of metallic poles and a toolbox. With a pole, the red helmet hands something that looks like a telescopic manipulator to the yellow helmet. He takes it and extends it somehow until it reaches the underside of the road, between the pillars, at a point out of our sight. The people around us start to point and shout “Troll, troll, troll”. My son seems to be very excited too. I don’t remember why we came, but now I know.

Someone or something crawls out of the shadows under the bridge. A dark shadow with six limbs, covered in mud and what looks like green seaweed. It becomes blurry as we look at it, until it vanishes completely. I hear “ohs” behind us and more talking in their language. We can’t see everything from here, but the yellow helmet has probably pricked some structure that was hanging under the road, and now he’s removing parts of it with another pole. The pieces fall into the river with a splash. The children clap with enthusiasm. There is some cheering and whistling too. The guy with the red helmet waves at us. No one seems to care about the creature that left the nest, as if it were a disturbing wasp or something equally common. Maybe it is something common. A troll under the bridge, I mean.

Swilkers

The next lines are reconstructed from an old file of mine. I’m not sure I wrote it, so apologies if you are the author. In fact, please leave a comment if you are the author.

Without a doubt I am again offline. He would have to start over, long and without excuses. But not like that, you know, dragging short circuits and sloppily, but rather from the beginning, proposing it as a sporting feat. With nothing to lose and everything to gain.

You played it on me in a cruel and bizarre way. We were lying in the sun playing table tennis when a cloud covered it. Then another came and more and more. All black. Mysterious and black. Suddenly a drop. Then a flake, between glass and tennis ball. They came and surrounded us, but you escaped, or so I thought innocently, because I did not know about your AZ connections: they allow you to disintegrate your printed circuits while your codes are sent to a stratospheric satellite. When the switch is closed, the reverse process is carried out, the satellite sends the codes and your robot comes. Well, you escaped and I stayed there between swilkers and black clouds.

Swilkers? I thought they became extinct decades ago, fought over their metal nets, which they could not weave unanimously. Some around here, supporters of related lace, and willing to die for it, while those of the embolillao point, drunk as bats, did not give a shit about the future. And there were also small cut-and-make redoubts, the aces of the only one, the authentic and incomparable, manila shawl. But anyway, the fact is that they appeared there, and I couldn’t get out of my astonishment.

What is offered to you? A small pimpinelo, with crushed ice? They didn’t have the faces of our good friends, and I knew they were bastards when they wanted to be. They started doing the cannabis dance and I heard it, that intense but soft aroma. To all of this the sky increasingly black, with points of purple light causing hallucinogenic sensations. I plunged into a dark world, dominated by hatred, you were the goddess and the swilkers your subjects. A scale of power in which there were only two levels: tyrant or bastard. Like the queen bee fucking her workers.

The truth is that my simplest manifestation of disagreement would have provoked my immediate lynching, therefore I chose to camouflage myself and impersonate a swilker. But of course, my circuits were too complex to hide and work like a fucking swilker. As long as they did not leave the nets, I was not going to get out of that horrifying world, and how well AZ connections would have been for me, but I was not going to manufacture them without caustic resin, as you can for sure understand.

So I began to contract my circuits in order to emulate the working methods of the disgusting beings. I could last very little time like this, and my morale was weakening more and more. You were screwing with me, and I didn’t know you were the number one tyrant in this world. I couldn’t even suspect it. I was no longer in my right mind and that made me as crazy as hell. What type of being would have done something like this to me? I was manufactured to never lose my temper, or at least that’s what I thought when the force of the sun bathed my metallic body. In this other world I have landed there was no light, and my energies were running out.

How did I get back? No, little cat, now I have transcended and I do not share it. I realised your nightmare by mere intuition. I reached an untenable point where I just discovered everything, although I am still not 100% sure. But I have convinced myself of it.

End of the swilkers fragment. I know it doesn’t match my style, but I need to confront the blank page. So thanks to the anonymous donor of this text – from now on everything will be easier. At least for me. 😉❤️.

The perfect storm

Looks like I’m busier than before. It’s the start of a new academic year, and the unofficial end of the holiday season. I’ve been away on a business trip for a few days in a place with such a sh#$@ internet connection that I was barely able to delete my emails, only to come back and find out that half of our services are down. And now I am recovering everything little by little, while people, as usual, keep coming to ask what’s going on, or simply to request another service. The good news is, I’m going on holiday soon, for real. Good news for me, maybe not for you, dear reader. We’ll see if I manage to dream a little dream over there. Anyway, thanks for passing by, see you soon ❤️❤️.

GPFS : cannot delete file system / Failed to read a file system descriptor / Wrong medium type

A nice and mysterious title. As usual. 😁. But you know what I’m talking about. I had one GPFS share that crashed due to a failing NVMe disk. The GPFS share was composed of several disks, and configured without safety, as a scratch disk. Being naive, I thought that simply replacing the failed disk with a new one (which I will call Disk here) and rebooting everything would bring my GPFS share back. It didn’t work, and I learned some new things that I’d like to show you.

To recover my share gpfsshare after replacing the disk, I first tried changing the disk names in the StanzaFile and recreating the NSD disks by running mmcrnsd -F StanzaFile. I could add the new disk Disk, but the share was not yet usable.

Then I tried to change the mount point of gpfsshare from /mountpoint to /newmountpoint:

root@gpfs ~ ## > mmchfs gpfsshare -T /newmountpoint
Verifying file system configuration information ...
Disk Disk: Incompatible file system descriptor version or not formatted.
Failed to read a file system descriptor.
Wrong medium type
mmchfs: Failed to collect required file system attributes.
mmchfs: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

The next thing I thought of was deleting Disk from gpfsshare. That way I would end up with a GPFS share smaller than the original, but functional.

root@gpfs ~ ## > mmdeldisk gpfsshare Disk
Verifying file system configuration information ...
Too many disks are unavailable.
Some file system data are inaccessible at this time.
Check error log for additional information.
Too many disks are unavailable.
Some file system data are inaccessible at this time.
mmdeldisk: Failed to collect required file system attributes.
mmdeldisk: Unexpected error from reconcileSdrfsWithDaemon.
Return code: 1
mmdeldisk: Attention:
File system gpfsshare may have some disks
that are in a non-ready state.
Issue the command:
mmcommon recoverfs gpfsshare
mmdeldisk: Command failed.
Examine previous error messages to determine cause.

Let’s list our NSD disks to see what we have. We can list them with mmlsnsd, with or without the -X argument. This is my output (edited):

root@gpfs ~ ## > mmlsnsd

File system | Disk name | NSD servers
-----------------------------------------------
gpfsshare Disk node1.domain.org
gpfsshare Disk_old1 node2.domain.org
gpfsshare Disk_old2 node3.domain.org
(free disk) Disk_A node4.domain.org
(free disk) Disk_B node5.domain.org
(free disk) Disk_C node6.domain.org

Since everything looks awful here, I will delete my gpfsshare filesystem and create it again. Actually I need to force the deletion. Let me show you.

root@gpfs ## > mmdelfs gpfsshare
Disk Disk: Incompatible file system descriptor version or not formatted.
Failed to read a file system descriptor.
Wrong medium type
mmdelfs: tsdelfs failed.
mmdelfs: Command failed. Examine previous error messages to determine cause.
root@gpfs ## > mmdelfs gpfsshare -p
Disk Disk: Incompatible file system descriptor version or not formatted.
Failed to read a file system descriptor.
Wrong medium type
mmdelfs: Attention: Not all disks were marked as available.
mmdelfs: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

We managed to delete the filesystem! But what happened to our NSD disks? Let’s use mmlsnsd to see where they stand:

root@gpfs ~ ## > mmlsnsd

File system | Disk name | NSD servers
-----------------------------------------------
(free disk) Disk node1.domain.org
(free disk) Disk_old1 node2.domain.org
(free disk) Disk_old2 node3.domain.org
(free disk) Disk_A node4.domain.org
(free disk) Disk_B node5.domain.org
(free disk) Disk_C node6.domain.org

So the disks are still there. The filesystem gpfsshare is gone, so they are now marked as (free disk), meaning they belong to no filesystem. Let’s then delete the NSD disks. I do it one by one. I show the output for the disk Disk.

root@gpfs ## > mmdelnsd Disk
mmdelnsd: Processing disk Disk
mmdelnsd: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
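
With more than a handful of disks, typing one mmdelnsd per disk gets tedious. The sketch below turns the mmlsnsd listing into the list of commands to run. It is only a helper for these notes: the name free_disk_cmds is made up, and it assumes the column layout of the (edited) output shown above, where free disks start with the two words (free disk) followed by the disk name. It prints the commands instead of running them, so the list can be reviewed first.

```shell
#!/bin/sh
# Sketch only: free_disk_cmds is a hypothetical helper, not a GPFS command.
# Reads mmlsnsd output on stdin and prints one mmdelnsd line per free disk.
free_disk_cmds() {
    # Free disks are the rows whose first two fields are "(free" and "disk)";
    # the third field is then the NSD name
    awk '$1 == "(free" && $2 == "disk)" { print "mmdelnsd " $3 }'
}
```

Used as mmlsnsd | free_disk_cmds, it prints the commands; once the list looks right, the same pipeline can be sent on to sh.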

Once we have removed all our disks (check with mmlsnsd) we can add them again using the original StanzaFile (mmcrnsd -F StanzaFile). If a disk has already been processed, don’t worry: mmcrnsd will work anyway. A typical output for our disks looks like this:

root@gpfs ## > mmcrnsd -F StanzaFile
mmcrnsd: Processing disk Disk_A
mmcrnsd: Disk name Disk_A is already registered for use by GPFS.
mmcrnsd: Processing disk Disk_B
mmcrnsd: Disk name Disk_B is already registered for use by GPFS.
mmcrnsd: Processing disk Disk_C
mmcrnsd: Disk name Disk_C is already registered for use by GPFS.
mmcrnsd: Processing disk Disk
mmcrnsd: Processing disk Disk_old1
mmcrnsd: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

Time to re-create the GPFS filesystem. We can give it the same name as before 😉, if we like. It works now! I have changed nothing with respect to the command I originally used to create my GPFS share gpfsshare. This one:

mmcrfs gpfsshare -F StanzaFile -T /mountpoint -m 2 -M 3 -i 4096 -A yes -Q no -S relatime -E no --version=5.X.Y.Z

If you get an error like this:

Unable to open disk 'Disk_A' on node node4.domain.org
No such device

Check that GPFS is running on the corresponding node (in this example, node4.domain.org) and run mmcrfs again.

I hope you have learned something. For my own notes: first delete the filesystem (mmdelfs gpfsshare -p), second delete the NSD disks (mmdelnsd Disk), then add the disks again (mmcrnsd -F StanzaFile), then create a new filesystem (mmcrfs). Take care!
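
The same four steps can be kept as a tiny dry-run script, so the order is not forgotten next time. Again a sketch under assumptions: rebuild_plan is a made-up name, it only echoes the commands used in this post, and the mmcrfs line carries just the -F and -T options, so the remaining options from the original creation command have to be added by hand.

```shell
#!/bin/sh
# Sketch only: rebuild_plan is a hypothetical helper that prints, never runs.
# Usage: rebuild_plan <filesystem> <stanza file> <mount point> <disk>...
rebuild_plan() {
    fs="$1"; stanza="$2"; mnt="$3"
    shift 3
    # 1. force the deletion of the broken filesystem
    echo "mmdelfs $fs -p"
    # 2. delete every NSD, one command per disk
    for disk in "$@"; do
        echo "mmdelnsd $disk"
    done
    # 3. register the disks again from the stanza file
    echo "mmcrnsd -F $stanza"
    # 4. create the filesystem again (add the other mmcrfs options by hand)
    echo "mmcrfs $fs -F $stanza -T $mnt"
}
```

For the case in this post the call would be rebuild_plan gpfsshare StanzaFile /mountpoint Disk Disk_old1 Disk_old2, which prints the commands in the order they have to be run.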