Sensors are used in many industrial applications to measure properties of a process, such as temperature, pressure, humidity and density. In manufacturing plants, the output of sensors or transducers is sampled regularly to check for breaks in functionality. Quality standards mandate that parameters such as temperature and pressure be maintained around an optimal point, failing which the product benchmarks can be affected. Traditionally, control bands have been used to capture any anomalous change in the value of a metric. This method fails in most applications involving time series data, because serial autocorrelation makes control bands give incorrect results. Since sensor data is time series data, we need to look at other alternatives. The aim of this post is to:

- Look at a typical sensor network
- Understand the Histogram Based Outlier Scoring (HBOS) algorithm
- Create a Python function for categorical data
- Create a Python function for numeric data
- Analyse the output of HBOS
- Look at other use cases where HBOS can be applied

**Part 1: Wireless Sensor Network**

A wireless sensor network is a group of sensors organised to monitor and record data. The data represents physical properties of a process, which are generally monitored to ensure seamless operations. The sensors can measure sound, temperature, pressure, flow rate etc. They are generally integrated with a terminal where the recorded data is stored and analysed. A typical sensor network can be represented by the following diagram.

The architecture typically has the following parts:

- Motes: a sensor node is also known as a mote. It can gather and process information and communicate with other nodes
- Network managers: people dedicated to managing the network topology and ensuring that it is up and running. They are also responsible for scaling the network up or down based on the requirements of the process, and they manage the exchange of data with the host application
- Customer software: it interacts with the network manager to get sensor data. The software analyses the feed from the sensor network to study the process and other parameters

**Part 2: Understanding Histogram Based Outlier Scoring Algorithm (HBOS)**

We will explore how to create functions in Python to analyse the sensor data. The code below demonstrates the usage of various libraries and functions in Python.
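Before building this step by step in pandas, the core idea of HBOS can be sketched in a few lines of NumPy. This is a simplified, univariate sketch with fixed-width bins, not the exact equal-frequency binning used later in this post:

```python
import numpy as np

# A minimal sketch of the HBOS idea: build a histogram, normalise bin
# heights so the tallest bin has height 1, and score each point as
# -log(normalised height of its bin). Rare values fall into short bins
# and therefore receive high scores.
def hbos_sketch(values, bins=10):
    heights, edges = np.histogram(values, bins=bins)
    norm = heights / heights.max()            # tallest bin -> 1.0
    # np.digitize against the inner edges assigns each value a bin index
    idx = np.clip(np.digitize(values, edges[1:-1]), 0, bins - 1)
    return -np.log(norm[idx])

scores = hbos_sketch(np.array([10.0, 11, 10, 9, 12, 11, 10, 60]))
# the isolated value 60 gets the largest score
```

Points in a value's own bin always have a count of at least one, so the logarithm is always defined.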


**Analyzing a Categorical Variable**

```python
import numpy as np
import pandas as pd
import math

df = pd.DataFrame(np.array(['A', 'B', 'A', 'A', 'B']), columns=['Feature'])

# Writing the steps to score this column using the HBOS method:
# proportion of each category relative to the most frequent one
df1 = (df['Feature'].value_counts() / max(df['Feature'].value_counts())).reset_index()
df['Index_val'] = range(0, df.shape[0])   # remember the original row order
df1.columns = ['Feature', 'Proportion']
df2 = pd.merge(df, df1, how='inner', on='Feature').sort_values('Index_val')
df3 = pd.concat([df2['Feature'], np.log(df2['Proportion'])], axis=1)['Proportion']
```

**Function for scoring a Categorical Column using HBOS**

```python
def Hbos_cat(col_nm):
    df = pd.DataFrame(col_nm, columns=['Feature'])
    df1 = (df['Feature'].value_counts() / max(df['Feature'].value_counts())).reset_index()
    df['Index_val'] = range(0, df.shape[0])
    df1.columns = ['Feature', 'Proportion']
    df2 = pd.merge(df, df1, how='inner', on='Feature').sort_values('Index_val')
    df3 = pd.concat([df2['Feature'], np.log(df2['Proportion'])], axis=1)['Proportion']
    return df3

# The result of the scores will be stored in Hbos_score
Hbos_score = Hbos_cat(df['Feature'])
```
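For reference, the same categorical scoring can be written more compactly with `value_counts` and `map`. This is a sketch that is equivalent in logic to `Hbos_cat`, not a replacement for it:

```python
import numpy as np
import pandas as pd

# Compact equivalent of Hbos_cat: relative frequency of each category,
# normalised by the most frequent one, then log-transformed
s = pd.Series(['A', 'B', 'A', 'A', 'B'])
counts = s.value_counts()
scores = s.map(np.log(counts / counts.max()))
# 'A' (3 of 5) is the mode -> score 0; 'B' (2 of 5) -> log(2/3)
```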

**Analyzing a Numeric Variable**

```python
# Now we need a method for handling a numeric column.
# Let's have the values stored in x
x = [12, 11, 10, 9, 8, 10, 11, 14, 17, 20, 50, 60, 70]

# First: identify the total number of records in the numeric variable
N = len(x)

# Second: decide the number of bins to divide the data into, k = sqrt(N)
k = math.sqrt(N)

# Next, the value of N/k gives the number of records in each group
records_Each_Group = round(N / k)

hb = pd.DataFrame(x)
hb.columns = ['x']
hb['ID1'] = range(1, len(x) + 1)          # original order
hb = hb.sort_values('x')
hb['ID2'] = range(1, len(x) + 1)          # rank after sorting

# Assign consecutive ranks to groups of records_Each_Group records
hb['Group'] = ['G' + str(int(i) + 1) for i in hb['ID2'] / (records_Each_Group + 1)]
g = hb.groupby('Group')['x']

# Highest and lowest value within each group
# https://pandas.pydata.org/pandas-docs/stable/groupby.html
hb['Highest'] = g.transform('max')
hb['Lowest'] = g.transform('min')

# Creating the difference column for Highest and Lowest.
# If Highest - Lowest = 0, its value will be set to 1.
# Creating the function using lambda
cond_diff = lambda a, b: a - b if a != b else 1

# Checking it on dummy data
list(map(cond_diff, [1, 2], [1, 3]))      # [1, -1]

hb['Diff_Flag'] = list(map(cond_diff, hb['Highest'], hb['Lowest']))

# Height of each bin: records in each group divided by Diff_Flag
hb['Height'] = records_Each_Group / hb['Diff_Flag']

# Calibrated height: replace zero heights with 1
cond_diff2 = lambda v: v if v != 0 else 1
hb['Height2'] = list(map(cond_diff2, hb['Height']))

# Normalise Height2 by dividing it by max(Height2)
hb['Normalised_Height'] = hb['Height2'] / max(hb['Height2'])

# HBOS score: log of the normalised height
hb['hb_Score'] = np.log(hb['Normalised_Height'])

# Sort back into the original order using ID1
hb_final = hb.sort_values('ID1')['hb_Score']
```
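As a cross-check, the manual equal-frequency grouping above can also be sketched with pandas' `qcut`, which does the sorting and splitting in one call. This is an alternative formulation: the height here uses the true bin width rather than the Highest-Lowest difference, so the exact scores differ slightly from `hb_Score`:

```python
import math
import numpy as np
import pandas as pd

# Equal-frequency binning via pd.qcut — an alternative sketch of the
# manual grouping above, not the post's exact grouping scheme
x = [12, 11, 10, 9, 8, 10, 11, 14, 17, 20, 50, 60, 70]
s = pd.Series(x)
k = round(math.sqrt(len(s)))                     # number of bins

# labels=False returns each value's bin code; retbins gives the edges
codes, edges = pd.qcut(s, q=k, labels=False, retbins=True, duplicates='drop')

counts = codes.value_counts().sort_index()       # records per bin
widths = np.diff(edges)                          # width of each bin
height = counts / widths                         # histogram "height"
per_bin = np.log(height / height.max()).to_numpy()

score = per_bin[codes.to_numpy()]                # score per original record
# Values 50, 60, 70 fall into the widest, sparsest bin -> lowest scores
```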

**Function for scoring a Numeric Column using HBOS**

```python
def Hbos_num(col_nm):
    # First: identify the total number of records in the numeric variable
    N = col_nm.shape[0]

    # Second: decide the number of bins, k = sqrt(N)
    k = math.sqrt(N)

    # N/k gives the number of records in each group
    records_Each_Group = round(N / k)

    hb = pd.DataFrame(col_nm)
    hb.columns = ['x']
    hb['ID1'] = range(1, N + 1)
    hb = hb.sort_values('x')
    hb['ID2'] = range(1, N + 1)
    hb['Group'] = ['G' + str(int(i) + 1) for i in hb['ID2'] / (records_Each_Group + 1)]
    g = hb.groupby('Group')['x']

    # Highest and lowest value within each group
    # https://pandas.pydata.org/pandas-docs/stable/groupby.html
    hb['Highest'] = g.transform('max')
    hb['Lowest'] = g.transform('min')

    # Highest - Lowest, with 0 replaced by 1
    cond_diff = lambda a, b: a - b if a != b else 1
    hb['Diff_Flag'] = list(map(cond_diff, hb['Highest'], hb['Lowest']))

    # Height of each bin: records in each group divided by Diff_Flag
    hb['Height'] = records_Each_Group / hb['Diff_Flag']

    # Calibrated height: replace zero heights with 1
    cond_diff2 = lambda v: v if v != 0 else 1
    hb['Height2'] = list(map(cond_diff2, hb['Height']))

    # Normalise and take the log
    hb['Normalised_Height'] = hb['Height2'] / max(hb['Height2'])
    hb['hb_Score'] = np.log(hb['Normalised_Height'])

    # Sort back into the original order
    hb_final = hb.sort_values('ID1')['hb_Score']
    return hb_final

# Creating a data frame to verify the functionality
df = pd.DataFrame(x, columns=['Feature'])
Hbos_num(df['Feature'])
```


The function works as expected.


**Part 3: Analyzing the HBOS Output**

Now we import the sensor data set from the following link:

https://docs.google.com/spreadsheets/d/1ADPWzs1s-yNqtqbw-VLuPgPKpKXTAWIYHgO3H6grNO0/edit#gid=1682334191

```python
dr = "name of the path"
df = pd.read_csv(dr + "\\humidity.0.csv")
df.columns
df.shape[0]

# Using the Hbos_num function
Hbos_Score = Hbos_num(df['Humidity'])

# Merging the Hbos_Score with df
df_new = pd.concat([df, Hbos_Score], axis=1)
df_new.head()

# Looking at the histogram of the HBOS scores
import matplotlib.pyplot as plt
plt.hist(df_new['hb_Score'], bins='auto')   # arguments are passed to np.histogram
plt.title("Histogram with 'auto' bins")
plt.show()
```

The histogram is shown below. The records in the red dotted oval are of interest to us, as these represent readings with a very large negative HBOS score.

```python
# Analysing the records with the highest scores
df_new.describe()
df_new.sort_values('hb_Score').head(50)

# Entries with a score below -2.6 come up as the top exceptions
df_exceptions = df_new[df_new['hb_Score'] < -2.6]
df_exceptions.shape   # 24 records

df_normal = df_new[df_new['hb_Score'] >= -2.6]
df_normal.shape       # 496 records
```
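The -2.6 cutoff was read off manually from the sorted scores. A common alternative (a sketch using made-up scores, not the humidity data) is to flag a fixed share of the lowest scores with a percentile threshold:

```python
import numpy as np

# Hypothetical scores standing in for df_new['hb_Score']
scores = np.array([0.0, -0.1, -0.2, -0.1, 0.0, -3.1, -0.3, -2.9, -0.2, -0.1])

threshold = np.percentile(scores, 5)        # flag the lowest 5% of scores
exceptions = scores[scores < threshold]
```

This removes the manual inspection step, at the cost of fixing the expected share of exceptions in advance.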

To understand this further, we need to do some exploratory data analysis on top of it, for example looking at the times at which the readings with the highest HBOS scores were made.

**Part 4: Other Use Cases**

- Financial transaction outlier detection
- Sentiment analysis
- Improving data quality
