Data analysis of a big storage

The boss bought a PB storage solution. It has a limited number of inodes and is optimized for files of a certain size, let’s say 200 MB. In principle it is a very nice improvement over NFS-mounted network drives, with system redundancy (2 servers per task), a disk replacement service, and a backup plan. But I have a couple of problems with it. The first is that my boss doesn’t like quotas, so I need to scan the storage on a regular basis to see who is taking an indecent amount of space. The second is the limited number of inodes. If you didn’t google it already, an inode “for the illiterate” roughly corresponds to one file or directory, so even though we have 10 PB of storage, we can only store as many files as we have inodes, let’s say 40 million.
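A quick back-of-the-envelope check makes the problem concrete (using the hypothetical 10 PB / 40 million figures from above):

```shell
# Hypothetical figures from the text: 10 PB of capacity, 40 million inodes.
TOTAL_MB=$((10 * 1024 * 1024 * 1024))   # 10 PB expressed in MB
INODES=$((40 * 1000 * 1000))            # 40 million inodes
# Average file size needed to fill the storage before running out of inodes:
echo "$((TOTAL_MB / INODES)) MB per file"   # → 268 MB per file
```

Which is why the vendor’s ~200 MB sweet spot makes sense: fill the share with small files and the inodes run out long before the disks do.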

What I want, then, is a script that tells me how many inodes we have remaining, and how many small files a user has in a specific folder. The first problem has an easy solution for the general case. The command “df -ih” will give me the inodes in use, so I can grep its output by the share name and dump the result to a file in a specific format, like, for example:

echo `date +%m_%d`" "`df -ih | grep pbshare | awk '{print $3}'` \
>> inode_log.txt
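The same one-liner pattern can log disk occupancy as well; a minimal sketch, assuming the share also shows up as “pbshare” in the `df -h` output, where the used space is the third column:

```shell
# Companion logger: current month_day plus used space on the share.
# Assumes "pbshare" appears in the df -h output; $3 is the "Used" column.
echo `date +%m_%d`" "`df -h | grep pbshare | awk '{print $3}'` \
>> occupancy_log.txt
```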

The file inode_log.txt will log the current month and day, followed by the number of inodes currently occupied by pbshare. You can also run it for the occupancy (df -h) or for whatever you want. Now comes the interesting part: really counting the files per user. For that I will loop over the users of mygroup, provided they have a folder with their username in it, and use find. My analysis loop looks like this:

GROUPLIST=$(getent group mygroup | cut -d: -f4 | tr ',' ' ')
for member in $GROUPLIST; do
   if [[ $(ls -d *"$member"* 2>/dev/null) ]]; then
       ## echo "there are folders"
       VAR1=$member
       VAR2=$(for i in $(find . -maxdepth 1 -name "*$member*" -type d); do
                  find "$i" -type f -size -100c 2>/dev/null
              done | wc -l)
       VAR3=$(for i in $(find . -maxdepth 1 -name "*$member*" -type d); do
                  find "$i" -type f -size -250M 2>/dev/null
              done | wc -l)
       VAR4=$(for i in $(find . -maxdepth 1 -name "*$member*" -type d); do
                  find "$i" -type f -size +250M 2>/dev/null
              done | wc -l)
       VAR5=$((VAR2 + VAR3 + VAR4))
       echo "$VAR1 $VAR2 $VAR3 $VAR4 $VAR5"
   fi
done

Explanation: the variables VARX are printed after the analysis for each user. The analysis loops over all the folders named after a specific user (*$member*). Note that the -250M bucket also contains the tiny files counted by -100c, so VAR5 is an overcount rather than an exact total. You could also try it without the inner for loop, but the glob may fail if it expands to too many folders. A simpler version is:

VAR2=$(find *"$member"* -type f -size -100c 2>/dev/null | wc -l)
VAR3=$(find *"$member"* -type f -size -250M 2>/dev/null | wc -l)
VAR4=$(find *"$member"* -type f -size +250M 2>/dev/null | wc -l)

Try it. Again, I print the values, but they can go to a file each time the script runs, and that file can be plotted. For that you can use MATLAB, Excel, or whatever you want. I decided to go for a web solution, so I used Google Charts. The result, although crude, seems to fulfill its purpose, that is, to display the data. Now, time to chase people and decide what to clean or what NOT to clean. Wish me luck!
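For the charting step, something like this could turn the log into a CSV that Google Charts (or Excel) ingests directly; a minimal sketch, assuming inode_log.txt lines look like “06_15 1.2M”:

```shell
# Turn "MM_DD value" log lines into a two-column CSV for plotting.
printf 'date,inodes\n' > inode_log.csv
awk '{print $1 "," $2}' inode_log.txt >> inode_log.csv
```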



About bitsanddragons

A traveller, an IT professional and a casual writer
This entry was posted in bits, hardware, linux, slurm. Bookmark the permalink.
