Sunday, March 31, 2013

IT & BA LAB Session 10: 26/03/2013

Assignment 1: Create three vectors x, y, and z with random values, ensuring they are of equal length, and combine them with
T <- cbind(x, y, z)
Create 3-dimensional plots of the result (all three types taught in class).

Commands :
> Random1<-rnorm(30,mean=0,sd=1)
> Random1
> x<-Random1[1:10]
> x
> y<-Random1[11:20]
> y
> z<-Random1[21:30]
> z
> T<-cbind(x,y,z)
> T
> plot3d(T[,1:3])


> plot3d(T[,1:3],col=rainbow(64))

> plot3d(T[,1:3],col=rainbow(64),type= 's')
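For reference, plot3d() is provided by the rgl package, which has to be loaded before the commands above will run. A self-contained sketch of the three plot variants (default points, coloured points, and spheres):

# plot3d() comes from the rgl package
# install.packages("rgl")   # if not already installed
library(rgl)

Random1 <- rnorm(30, mean = 0, sd = 1)
x <- Random1[1:10]
y <- Random1[11:20]
z <- Random1[21:30]
T <- cbind(x, y, z)

plot3d(T[, 1:3])                                  # type 1: default points
plot3d(T[, 1:3], col = rainbow(64))               # type 2: coloured points
plot3d(T[, 1:3], col = rainbow(64), type = "s")   # type 3: spheres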
Screenshots:



Assignment 2:
Read the documentation of rnorm and pnorm.
Create two random variables.
Create the following plots:
1. X vs Y
2. X vs Y given Z (introduce a variable z with 5 different categories and cbind it to x and y; hint: ?factor)
3. A colour-coded version of the graph
4. A smoothed/best-fit line for the curve

Commands :
> x<-rnorm(200,mean=5,sd=1)
> y<-rnorm(200,mean=3,sd=1)
> z1<-sample(letters,5)
> z2<-sample(z1,200,replace=TRUE)
> z<-as.factor(z2)
> t<-cbind(x,y,z)
> qplot(x,y)

> qplot(x,z,alpha=I(2/10))

> qplot(x,z)

> qplot(x,y,geom=c("point","smooth"))



> qplot(x,y,colour=z)



> qplot(log(x),log(y),colour=z)
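Note that qplot() is part of the ggplot2 package, so it must be loaded before these commands will run. A condensed, self-contained version of the steps above:

# qplot() comes from the ggplot2 package
# install.packages("ggplot2")   # if not already installed
library(ggplot2)

x  <- rnorm(200, mean = 5, sd = 1)
y  <- rnorm(200, mean = 3, sd = 1)
z1 <- sample(letters, 5)                  # 5 random category labels
z2 <- sample(z1, 200, replace = TRUE)     # assign a category to each point
z  <- as.factor(z2)

qplot(x, y)                               # plot 1: x vs y
qplot(x, y, colour = z)                   # plots 2 and 3: x vs y, colour-coded by z
qplot(x, y, geom = c("point", "smooth"))  # plot 4: points plus a smoothed best-fit line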



Screenshots:


Friday, March 22, 2013

Business Applications Lab Session #9 on 19th Mar 2013

Data Visualization Using Tableau

There is a great deal of discussion about the value of analytics and big data management in the technology industry today. In my view, data is only useful if it helps provide insight into what customers want (for finding opportunities), what customers do, and which patterns are not obvious from the raw data; such insights can be the difference between leading and losing.

In my research for this task I came across a number of tools that may be interesting. One of them is Tableau. Tableau is a data visualization tool that allows users to easily connect to data sets ranging from simple flat files (.csv, .xls, .txt) to complex SQL data structures (Hadoop, SQL Server, Oracle, etc.). Tableau can analyze data while it stays in the repository, or the data can be imported into Tableau for offline processing.

Some of the useful things you can do with Tableau are showcased here using small caselets:
Real Estate Industry
The real estate industry thrives on data. Your ability to get insight from it can set you apart.
  • Monitor trends for home prices, sales volume and foreclosures.
  • Do detailed site analysis using demographic data and Tableau's built-in mapping capabilities.
  • Provide clients with customized market reports.

 Real estate report

Retail Industry
Lots of data is already available to retailers to make good decisions – from loyalty programs and web analytics to third-party information and point-of-sale details. But there’s a big gap between having the data and putting it to work for you. Tableau’s analytical depth and visualization capabilities can help improve your retail analytics by allowing you to:
  • Create interactive dashboards that support real-time decisions
  • Incorporate geographical-based data for targeted segmentation
  • Blend multiple data sources for more robust analysis
Retail segmentation analysis

Health Care Analytics

Healthcare costs can quickly spin out of control. Misallocation of resources can quickly bring down quality of care. To keep efficiency and profitability moving in the right direction, you need to see all your key healthcare reporting metrics across hospitals, programs, and regions. You need to cut that data many different ways and share it with key employees in order to manage your business more effectively. Use Tableau to:
  • Understand profitability by specialties, HRGs (Healthcare Resource Groups), gender, and age.
  • Identify patterns of cost and profitability by admission method and specialty.
  • Provide interactive, web-based dashboards to staff so they can get exactly the data they need right on the floor and in real time.
Patient cycle time dashboard

Government reporting
Government data is complex and enormous, and so are the challenges facing those who work with it. With Tableau Desktop, you can query millions of rows of data in seconds, drag-and-drop to visualize any dataset, and even publish your analysis to Tableau Public to meet transparency reporting requirements. Governments and public-private organizations use Tableau to:
  • Present enormous countrywide datasets clearly and allow drill-down to local areas.
  • Provide online access to public data without programming.

government transparency dashboard
Banking Analytics
Banks distinguish themselves by the quality of their service. With Tableau you can offer customers a new level of insight and stand out from the competition. Customers from RBC Wealth Management to the Macquarie Group to Fifth Third Bank use Tableau for their banking analytics. Banks use Tableau to:
  • Provide web-based tools for clients and salespeople to track the value of savings and investments
  • Provide what-if analysis to help clients understand the effects of changes in investment decisions
  • Monitor loans and manage risk across geographies with interactive banking dashboards
  • Dynamically produce reports on outstanding accounts that require attention
investment dashboard

Regardless of role or industry, Tableau is rapid-fire business intelligence that equips anyone to analyze data quickly. Its intuitive user interface means there is no need for canned reports, dashboard widgets, or templates to get started. All you need is your data and the questions you want to answer.

In my personal experience as well, we used Tableau at the McKinsey Knowledge Center to deliver reports to various clients by linking it directly to our project management tool. Tableau works wonders when reports are linked to live data, and it makes it easy to set up reports at different levels for use throughout the client organisation.

Divij Sharma

Friday, March 15, 2013

Business Applications Lab Session #8 on 12th Mar 2013

Session # 8 :


In this session we learnt about panel data and its various estimation models.

Panel data combines cross-sectional and time-series data: the same entities are observed over several time periods. The basic function used for panel data estimation is plm(), from the plm package.

The data set used in this session is "Produc".

The variables in the data set are described below.

- state : the state
- year : the year
- pcap: public capital stock
- hwy: highways and streets
- pc: private capital stock
- gsp: gross state product
- emp: labor input measured by employment in non-agricultural payrolls
- unemp: state unemployment rate

Use the data set "Produc", a panel data set included in the plm package, for the panel estimations.
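A minimal sketch for loading the package and the data set (assuming plm is already installed):

library(plm)                      # provides plm(), pFtest(), plmtest(), phtest()
data("Produc", package = "plm")   # load the Produc panel data set
head(Produc)                      # inspect the first few rows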




Assignment:
Estimate all three models and decide which one best fits the data set for panel estimation.

Solution:
Step 1: Estimate the pooling model.
Step 2: Estimate the fixed-effects (within) model.
Step 3: Estimate the random-effects model.
A sketch of these three estimations is given below.
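This is a sketch only, assuming the regression of interest is log gross state product on the capital, employment and unemployment variables (the specification used in the plm documentation); the object names match the tests that follow.

form <- log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp   # assumed specification

# Step 1: pooling model (ignores the panel structure)
pooled  <- plm(form, data = Produc, index = c("state", "year"), model = "pooling")

# Step 2: fixed-effects ("within") model
fixed1  <- plm(form, data = Produc, index = c("state", "year"), model = "within")

# Step 3: random-effects model
random1 <- plm(form, data = Produc, index = c("state", "year"), model = "random")

summary(pooled)
summary(fixed1)
summary(random1)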

To choose the model that best fits the "Produc" data set, we run pairwise hypothesis tests among the three models and select the best fit at the end.

Test 1:
Pooling model vs fixed-effects model

Command:
pFtest(fixed1, pooled)
Test details:
H0 (null): the individual and time-based parameters are all zero.
Alternative hypothesis: at least one of the individual or time-based parameters is non-zero.

The p-value is very low, so the null hypothesis is rejected: the individual effects are significant.

Hence the fixed-effects model is better than the pooling model.


Test 2:
Pooling model vs random-effects model

Command:
plmtest(pooled)

Test details:
H0 (null): the individual and time-based parameters are all zero (pooling model).
Alternative hypothesis: at least one of the individual or time-based parameters is non-zero (random-effects model).
The p-value is very low, so the null hypothesis is rejected.

Hence the random-effects model is better than the pooling model.


Test 3:
Fixed-effects model vs random-effects model

Command:

We use the Hausman test:
phtest(random1, fixed1)
Test details:
H0 (null): the individual effects are not correlated with any regressor (random-effects model).
Alternative hypothesis: the individual effects are correlated with the regressors (fixed-effects model).
If the null is rejected, the random-effects estimator is inconsistent. The p-value is very low, so the null hypothesis is rejected.

Hence the fixed-effects model is better than the random-effects model.


Conclusion:
The fixed-effects model best fits the "Produc" panel data set, i.e. there is significant correlation between the individual effects and the regressors, and individual (index) effects exist.
Hence we choose the fixed-effects model to estimate the panel data in the "Produc" data set.






Wednesday, February 13, 2013

Business Applications Lab Session #6 on 12th Feb 2013


Assignment: 

Create log-returns data and calculate its historical volatility.

Commands : 

1) (log(S[t]) - log(S[t-1])) / log(S[t-1])
OR
2) log((S[t] - S[t-1]) / S[t-1])

Create an ACF plot for the log returns, run the ADF test, and analyse the results.
Data :
NSE Index – Jan 2012 to Jan 2013
NIFTY data – closing prices

Commands:-

> niftychart<-read.csv(file.choose(),header=T)
> closingval<-niftychart$Close

> closingval.ts<-ts(closingval,frequency=252)
> plot(log( closingval.ts))
> minusone.ts<-lag(closingval.ts,k=-1)
> plot(log( minusone.ts))
> z<-log(closingval.ts)-log(minusone.ts)
> z



> returns<-z/log(minusone.ts)
> plot(returns,main="Plot of Log Returns;CNX NSE Nifty Jan-2012 to Jan-2013" )

 > acf(returns,main=" The Auto Correlation Plot;   Dotted line shows 95% confidence interval ")



The ACF plot shows that the correlations lie within the 95% confidence bounds, so there is a fairly good chance that the data are stationary.

> adf.test(returns)




The ADF test and its p-value confirm that the data are stationary.
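For completeness, adf.test() is provided by the tseries package; a minimal, self-contained version of this step, using the returns series computed above:

# adf.test() lives in the tseries package
library(tseries)
adf.test(returns)   # a small p-value rejects the unit-root null, i.e. the series is stationary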

# Now calculating the historical volatility of the data

> T<-252^0.5
> histvolatility<-sd(returns)/T

> histvolatility
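One caveat, offered as a note rather than a correction of the session: the usual convention annualises daily volatility by multiplying the standard deviation of the daily log returns by the square root of 252, rather than dividing. A sketch of that convention:

# conventional annualised historical volatility from daily log returns
daily_sd <- sd(returns)
histvolatility_annual <- daily_sd * sqrt(252)   # multiply (not divide) by sqrt(252) to annualise
histvolatility_annual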

Thursday, February 7, 2013

Business Applications Lab Session #5 on 5th Feb 2013


ASSIGNMENT 1 :

Converting the data into time-series format and then calculating the returns from it.
(Data taken: NSE MIDCAP 50, 31 July to 31 December 2012)

COMMANDS:

> z<-read.csv(file.choose(),header=T)
> Close<-z$Close
> Close

 [1] 1994.30 1993.30 2006.55 1990.00 2002.30 2033.70 2042.00 2046.85 2054.05
[10] 2057.85 2033.65 2063.55 2116.10 2155.80 2134.05 2191.65 2198.40 2203.40
[19] 2210.90 2216.90 2252.45 2269.65 2286.75 2298.00 2275.55 2255.90 2271.65
[28] 2238.95 2287.35 2286.05 2287.05 2254.00 2251.40 2281.30 2258.20 2258.80
[37] 2239.60 2228.80 2199.00 2188.10 2162.00 2174.40 2207.10 2226.45 2208.50
[46] 2214.35 2238.80 2242.30 2219.80 2229.75 2233.80 2233.70 2200.05 2178.80
[55] 2152.10 2168.00 2176.80 2176.10 2195.60 2226.20 2248.25 2288.45 2315.55
[64] 2332.05 2343.85 2369.60 2360.10 2377.95 2350.85 2361.85 2323.15 2347.85
[73] 2363.65 2388.25 2391.65 2379.35 2325.35 2327.45 2345.10 2334.00 2357.25
[82] 2369.50

> Close.ts<-ts(Close)
> Close.ts<-ts(Close,deltat=1/252)
> z1<-ts(data=Close.ts[10:95],frequency=1,deltat=1/252)
> z1.ts<-ts(z1)
> z1.ts

Time Series:
Start = 1
End = 86
Frequency = 1
 [1] 2057.85 2033.65 2063.55 2116.10 2155.80 2134.05 2191.65 2198.40 2203.40
[10] 2210.90 2216.90 2252.45 2269.65 2286.75 2298.00 2275.55 2255.90 2271.65
[19] 2238.95 2287.35 2286.05 2287.05 2254.00 2251.40 2281.30 2258.20 2258.80
[28] 2239.60 2228.80 2199.00 2188.10 2162.00 2174.40 2207.10 2226.45 2208.50
[37] 2214.35 2238.80 2242.30 2219.80 2229.75 2233.80 2233.70 2200.05 2178.80
[46] 2152.10 2168.00 2176.80 2176.10 2195.60 2226.20 2248.25 2288.45 2315.55
[55] 2332.05 2343.85 2369.60 2360.10 2377.95 2350.85 2361.85 2323.15 2347.85
[64] 2363.65 2388.25 2391.65 2379.35 2325.35 2327.45 2345.10 2334.00 2357.25
[73] 2369.50      NA      NA      NA      NA      NA      NA      NA      NA
[82]      NA      NA      NA      NA      NA

> z1.diff<-diff(z1)
> z2<-lag(z1.ts,k=-1)
> Returns<-z1.diff/z2
> plot(Returns,main="10th to 95th day returns")
> z3<-cbind(z1.ts,z1.diff,Returns)
> plot(z3,main="Data from 10th to 95th day, Difference, Returns")
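Note that Close has only 82 observations, so indexing positions 10:95 pads z1 with NAs (visible above). A sketch that avoids the padding, assuming the intent is simply to keep observations from the 10th one onwards:

# keep only the observations that actually exist (10th through last)
z1 <- ts(Close[10:length(Close)], deltat = 1/252)
z1.diff <- diff(z1)                 # day-over-day differences
z2 <- lag(z1, k = -1)               # previous day's value, aligned with today
Returns <- z1.diff / z2             # simple returns
plot(Returns, main = "Returns from the 10th observation onwards")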



------------------------------------------------------------------------------------------------------------

ASSIGNMENT 2 :

Do logit analysis for 700 data points and then predict for 150 data points.


COMMANDS:

> z<-read.csv(file.choose(),header=T)
> z1<-z[1:700,1:9]
> head(z1)

  age ed employ address income debtinc creddebt othdebt default
1  41  3     17      12    176     9.3    11.36    5.01       1
2  27  1     10       6     31    17.3     1.36    4.00       0
3  40  1     15      14     55     5.5     0.86    2.17       0
4  41  1     15      14    120     2.9     2.66    0.82       0
5  24  2      2       0     28    17.3     1.79    3.06       1
6  41  2      5       5     25    10.2     0.39    2.16       0

> z1$ed<-factor(z1$ed)

> z1.est<-glm(default ~ age + ed + employ + address + income + debtinc + creddebt + othdebt, data=z1, family = "binomial")
> summary(z1.est)

Call:
glm(formula = default ~ age + ed + employ + address + income +
    debtinc + creddebt + othdebt, family = "binomial", data = z1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-2.4322  -0.6463  -0.2899   0.2807   3.0255 

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -1.589302   0.605324  -2.626  0.00865 **
age          0.035514   0.017588   2.019  0.04346 * 
ed2          0.307623   0.251629   1.223  0.22151   
ed3          0.352448   0.339937   1.037  0.29983   
ed4         -0.085359   0.472938  -0.180  0.85677   
ed5          0.874942   1.293734   0.676  0.49886   
employ      -0.260737   0.033410  -7.804 5.99e-15 ***
address     -0.105426   0.023264  -4.532 5.85e-06 ***
income      -0.007855   0.007782  -1.009  0.31282   
debtinc      0.070551   0.030598   2.306  0.02113 * 
creddebt     0.625177   0.112940   5.535 3.10e-08 ***
othdebt      0.053470   0.078464   0.681  0.49558   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 804.36  on 699  degrees of freedom
Residual deviance: 549.56  on 688  degrees of freedom
AIC: 573.56

Number of Fisher Scoring iterations: 6

> forecast<-z[701:850,1:8]
> forecast$ed<-factor(forecast$ed)
> forecast$probability<-predict(z1.est, newdata=forecast, type="response")
> head(forecast)

    age ed employ address income debtinc creddebt othdebt probability
701  36  1     16      13     32    10.9     0.54    2.94  0.00783975
702  50  1      6      27     21    12.9     1.32    1.39  0.07044926
703  40  1      9       9     33    17.0     4.88    0.73  0.63780431
704  31  1      5       7     23     2.0     0.05    0.41  0.07471587
705  29  1      4       0     24     7.8     0.87    1.01  0.34464735
706  25  2      1       3     14     9.9     0.23    1.15  0.45584645
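To turn the predicted probabilities into predicted default classes, one could apply a cut-off (0.5 is a common, though arbitrary, choice); a small sketch using the forecast data frame built above:

# classify as default (1) when the predicted probability exceeds 0.5
forecast$predicted_default <- ifelse(forecast$probability > 0.5, 1, 0)
table(forecast$predicted_default)   # how many of the 150 cases are flagged as likely defaults
head(forecast)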



Tuesday, January 22, 2013

Business Applications Lab Session #3 on 22nd Jan 2013


Assignment 1A

Fit ‘lm’ and comment on the applicability of ‘lm’.
Plot 1: residuals vs the independent variable
Plot 2: standardized residuals vs the independent variable

file<-read.csv(file.choose(),header=T)
file
  mileage groove
1       0 394.33
2       4 329.50
3       8 291.00
4      12 255.17
5      16 229.33
6      20 204.83
7      24 179.00
8      28 163.83
9      32 150.33
x<-file$groove
x
[1] 394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33
y<-file$mileage
y
[1]  0  4  8 12 16 20 24 28 32
reg1<-lm(y~x)
res<-resid(reg1)
res

         1          2          3          4          5          6          7          8          9
 3.6502499 -0.8322206 -1.8696280 -2.5576878 -1.9386386 -1.1442614 -0.5239038  1.4912269  3.7248633

plot(x,res)
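The assignment also asks for the standardized-residual plot; a sketch using rstandard() on the fitted model above:

# Plot 2: standardized residuals vs the independent variable
stdres <- rstandard(reg1)
plot(x, stdres, main = "Standardized residuals vs groove")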


As the residual plot is parabolic, a simple linear regression is not appropriate for this data set.
------------------------------------------------------------------------------------------------------------

Assignment 1B -Alpha-Pluto Data

Fit ‘lm’ and comment on the applicability of ‘lm’.
Plot 1: residuals vs the independent variable
Plot 2: standardized residuals vs the independent variable
Also do:
QQ plot (qqnorm)
qqline

file<-read.csv(file.choose(),header=T)
file
   alpha pluto
1  0.150    20
2  0.004     0
3  0.069    10
4  0.030     5
5  0.011     0
6  0.004     0
7  0.041     5
8  0.109    20
9  0.068    10
10 0.009     0
11 0.009     0
12 0.048    10
13 0.006     0
14 0.083    20
15 0.037     5
16 0.039     5
17 0.132    20
18 0.004     0
19 0.006     0
20 0.059    10
21 0.051    10
22 0.002     0
23 0.049     5
x<-file$alpha
y<-file$pluto
x
 [1] 0.150 0.004 0.069 0.030 0.011 0.004 0.041 0.109 0.068 0.009 0.009 0.048
[13] 0.006 0.083 0.037 0.039 0.132 0.004 0.006 0.059 0.051 0.002 0.049
y
 [1] 20  0 10  5  0  0  5 20 10  0  0 10  0 20  5  5 20  0  0 10 10  0  5
reg1<-lm(y~x)
res<-resid(reg1)
res
         1          2          3          4          5          6          7
-4.2173758 -0.0643108 -0.8173877  0.6344584 -1.2223345 -0.0643108 -1.1852930
         8          9         10         11         12         13         14
 2.5653342 -0.6519557 -0.8914706 -0.8914706  2.6566833 -0.3951747  6.8665650
        15         16         17         18         19         20         21
-0.5235652 -0.8544291 -1.2396007 -0.0643108 -0.3951747  0.8369318  2.1603874
        22         23
 0.2665531 -2.5087486
 plot(x,res)
 qqnorm(res)
 qqline(res)
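As in Assignment 1A, the standardized-residual plot asked for in the assignment can be produced with rstandard() on the fitted model:

# Plot 2: standardized residuals vs the independent variable
stdres <- rstandard(reg1)
plot(x, stdres, main = "Standardized residuals vs alpha")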





----------------------------------------------------------------------------------------------------------------------------------
Assignment 2 - Justify Null Hypothesis using ANOVA

file<-read.csv(file.choose(),header=T)
file

   Chair Comfort.Level Chair1
1      I             2      a
2      I             3      a
3      I             5      a
4      I             3      a
5      I             2      a
6      I             3      a
7     II             5      b
8     II             4      b
9     II             5      b
10    II             4      b
11    II             1      b
12    II             3      b
13   III             3      c
14   III             4      c
15   III             4      c
16   III             5      c
17   III             1      c
18   III             2      c

file.anova<-aov(file$Comfort.Level~file$Chair1)
summary(file.anova)

            Df Sum Sq Mean Sq F value Pr(>F)
file$Chair1  2  1.444  0.7222   0.385  0.687
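Since the p-value (0.687) is well above 0.05, the null hypothesis of equal mean comfort levels across the three chairs is not rejected. A slightly more idiomatic form of the same call, using the data argument:

# same ANOVA written with a data argument; the conclusion is unchanged
file.anova <- aov(Comfort.Level ~ Chair1, data = file)
summary(file.anova)   # Pr(>F) = 0.687 > 0.05, so we fail to reject H0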




Tuesday, January 15, 2013

Business Applications Lab Session #2 on 15th Jan 2013

Assignment 1: Create the two matrices, select highlighted columns & use cbind to create a new matrix.

Commands included in the snapshot:
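The original snapshot is not reproduced here; a minimal illustrative sketch with hypothetical matrices (the actual matrices and highlighted columns were given in class):

# two hypothetical 3x3 matrices standing in for the ones given in class
m1 <- matrix(1:9,   nrow = 3)
m2 <- matrix(10:18, nrow = 3)

# pick one column from each matrix and bind them into a new matrix
new_matrix <- cbind(m1[, 2], m2[, 3])
new_matrix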



Assignment 2: Multiply given matrices 1 & 2
Commands:

z1%*%z2
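Here z1 and z2 are the two matrices given in class; a self-contained sketch with hypothetical conformable matrices:

# hypothetical matrices: z1 is 2x3 and z2 is 3x2, so z1 %*% z2 is defined
z1 <- matrix(c(1, 2, 3, 4, 5, 6),     nrow = 2, byrow = TRUE)
z2 <- matrix(c(7, 8, 9, 10, 11, 12),  nrow = 3, byrow = TRUE)
z1 %*% z2   # matrix (not element-wise) multiplication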



Assignment 3: Create a regression from the NSE data – NIFTY, 30 days, 1 Dec 2012 to 31 Dec 2012

Commands included in the snapshot:
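The snapshot with the exact commands is not reproduced here; a sketch of one way to run the regression, assuming the CSV has a Close column (as in the NIFTY files used in later sessions) and regressing the closing price on the trading-day index:

# read the NIFTY December 2012 CSV and regress Close on the day index
nifty <- read.csv(file.choose(), header = TRUE)
day   <- seq_along(nifty$Close)            # 1, 2, ..., number of trading days
reg   <- lm(nifty$Close ~ day)             # simple linear trend regression
summary(reg)
plot(day, nifty$Close, main = "NIFTY closing price, Dec 2012")
abline(reg, col = "red")                   # fitted regression line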



Assignment 4: Read about probability distributions from the manual. Generate a normal distribution data plot.

Commands:

x=seq(-4,4,length=200)
y=dnorm(x,mean=0,sd=1)
plot(x,y,type="l",lwd=2,col="red")