Sunday, March 31, 2013

IT & BA LAB Session 10: 26/03/2013

Assignment 1: Create three vectors x, y, and z with random values, ensuring they are of equal length, and combine them with
T <- cbind(x, y, z)
Create 3-dimensional plots of the result (all three types taught in class).

Commands :
> Random1<-rnorm(30,mean=0,sd=1)
> Random1
> x<-Random1[1:10]
> x
> y<-Random1[11:20]
> y
> z<-Random1[21:30]
> z
> T<-cbind(x,y,z)
> T
> plot3d(T[,1:3])


> plot3d(T[,1:3],col=rainbow(64))

> plot3d(T[,1:3],col=rainbow(64),type= 's')
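For reference, plot3d() is provided by the rgl package, which has to be loaded before the commands above will run. A self-contained sketch of the three plot variants (default points, coloured points, and spheres):

# plot3d() comes from the rgl package
# install.packages("rgl")   # if not already installed
library(rgl)

Random1 <- rnorm(30, mean = 0, sd = 1)
x <- Random1[1:10]
y <- Random1[11:20]
z <- Random1[21:30]
T <- cbind(x, y, z)

plot3d(T[, 1:3])                                  # type 1: default points
plot3d(T[, 1:3], col = rainbow(64))               # type 2: coloured points
plot3d(T[, 1:3], col = rainbow(64), type = "s")   # type 3: spheres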
Screenshots:



Assignment 2:
Read the documentation of rnorm and pnorm.
Create two random variables.
Create the following plots:
1. X vs Y
2. X vs Y given Z (introduce a variable z with 5 different categories and cbind it to x and y; hint: ?factor)
3. A colour-coded version of the graph
4. A smoothed/best-fit line for the curve

Commands :
> x<-rnorm(200,mean=5,sd=1)
> y<-rnorm(200,mean=3,sd=1)
> z1<-sample(letters,5)
> z2<-sample(z1,200,replace=TRUE)
> z<-as.factor(z2)
> t<-cbind(x,y,z)
> qplot(x,y)

> qplot(x,z,alpha=I(2/10))

> qplot(x,z)

> qplot(x,y,geom=c("point","smooth"))



> qplot(x,y,colour=z)



> qplot(log(x),log(y),colour=z)
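Note that qplot() is part of the ggplot2 package, so it must be loaded before these commands will run. A condensed, self-contained version of the steps above:

# qplot() comes from the ggplot2 package
# install.packages("ggplot2")   # if not already installed
library(ggplot2)

x  <- rnorm(200, mean = 5, sd = 1)
y  <- rnorm(200, mean = 3, sd = 1)
z1 <- sample(letters, 5)                  # 5 random category labels
z2 <- sample(z1, 200, replace = TRUE)     # assign a category to each point
z  <- as.factor(z2)

qplot(x, y)                               # plot 1: x vs y
qplot(x, y, colour = z)                   # plots 2 and 3: x vs y, colour-coded by z
qplot(x, y, geom = c("point", "smooth"))  # plot 4: points plus a smoothed best-fit line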



Screenshots:


Friday, March 22, 2013

Business Applications Lab Session #9 on 19th Mar 2013

Data Visualization Using Tableau

There is a great deal of discussion about the value of analytics and big data management in the technology industry today. In my view, data is only useful if it helps provide insight into what customers want (for finding opportunities), what customers do, and which patterns are not obvious from the raw data; such insights can be the difference between leading and losing.

In my research for this task I came across a number of tools that may be interesting. One of them is Tableau. Tableau is a data visualization tool that allows users to easily connect to data sets ranging from simple flat files (.csv, .xls, .txt) to complex SQL data structures (Hadoop, SQL Server, Oracle, etc.). Tableau can analyze data while it stays in the repository, or the data can be imported into Tableau for offline processing.

Some of the useful things you can do with Tableau are showcased here using small caselets:
Real Estate Industry
The real estate industry thrives on data. Your ability to get insight from it can set you apart.
  • Monitor trends for home prices, sales volume and foreclosures.
  • Do detailed site analysis using demographic data and Tableau's built-in mapping capabilities.
  • Provide clients with customized market reports.

 Real estate report

Retail Industry
Lots of data is already available to retailers to make good decisions – from loyalty programs and web analytics to third-party information and point-of-sale details. But there’s a big gap between having the data and putting it to work for you. Tableau’s analytical depth and visualization capabilities can help improve your retail analytics by allowing you to:
  • Create interactive dashboards that support real-time decisions
  • Incorporate geographical-based data for targeted segmentation
  • Blend multiple data sources for more robust analysis
Retail segmentation analysis

Health Care Analytics

Healthcare costs can quickly spin out of control. Misallocation of resources can quickly bring down quality of care. To keep efficiency and profitability moving in the right direction, you need to see all your key healthcare reporting metrics across hospitals, programs, and regions. You need to cut that data many different ways and share it with key employees in order to manage your business more effectively. Use Tableau to:
  • Understand profitability by specialties, HRGs (Healthcare Resource Groups), gender, and age.
  • Identify patterns of cost and profitability by admission method and specialty.
  • Provide interactive, web-based dashboards to staff so they can get exactly the data they need right on the floor and in real time.
Patient cycle time dashboard

Government reporting
Government data is complex and enormous, and so are the challenges facing those who work with it. With Tableau Desktop, you can query millions of rows of data in seconds, drag-and-drop to visualize any dataset, and even publish your analysis to Tableau Public to meet transparency reporting requirements. Governments and public-private organizations use Tableau to:
  • Present enormous countrywide datasets clearly and allow drill-down to local areas.
  • Provide online access to public data without programming.

government transparency dashboard
Banking Analytics
Banks distinguish themselves by the quality of their service. With Tableau you can offer customers a new level of insight and stand out from the competition. Customers from RBC Wealth Management to the Macquarie Group to Fifth Third Bank use Tableau for their banking analytics. Banks use Tableau to:
  • Provide web-based tools for clients and salespeople to track the value of savings and investments
  • Provide what-if analysis to help clients understand the effects of changes in investment decisions
  • Monitor loans and manage risk across geographies with interactive banking dashboards
  • Dynamically produce reports on outstanding accounts that require attention
investment dashboard

Regardless of role or industry, Tableau is rapid-fire business intelligence that equips anyone to analyze data quickly. Its intuitive user interface means there is no need for canned reports, dashboard widgets, or templates to get started. All you need is your data and the questions you want to answer.

In my personal experience as well, we used Tableau at the McKinsey Knowledge Center to deliver reports to various clients by linking it directly to our project management tool. Tableau works wonders when reports are linked to live data, and it makes it easy to set up reports at different levels for use throughout the client organisation.

Divij Sharma

Friday, March 15, 2013

Business Applications Lab Session #8 on 12th Mar 2013

Session # 8 :


In this session we learnt about panel data and its various estimation models.

Panel data combines cross-sectional and time-series data: the same entities are observed over several time periods. The basic function used for panel data estimation is plm(), from the plm package.

The data set used in this session is "Produc".

The variables in the data set are described below.

- state : the state
- year : the year
- pcap: public capital stock
- hwy: highways and streets
- pc: private capital stock
- gsp: gross state product
- emp: labor input measured by employment in non-agricultural payrolls
- unemp: state unemployment rate

Use the data set "Produc", a panel data set included in the plm package, for the panel estimations.
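A minimal sketch for loading the package and the data set (assuming plm is already installed):

library(plm)                      # provides plm(), pFtest(), plmtest(), phtest()
data("Produc", package = "plm")   # load the Produc panel data set
head(Produc)                      # inspect the first few rows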




Assignment:
Estimate all three models and decide which one best fits the data set for panel estimation.

Solution:
Step 1: Estimate the pooling model.
Step 2: Estimate the fixed-effects (within) model.
Step 3: Estimate the random-effects model.
A sketch of these three estimations is given below.
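This is a sketch only, assuming the regression of interest is log gross state product on the capital, employment and unemployment variables (the specification used in the plm documentation); the object names match the tests that follow.

form <- log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp   # assumed specification

# Step 1: pooling model (ignores the panel structure)
pooled  <- plm(form, data = Produc, index = c("state", "year"), model = "pooling")

# Step 2: fixed-effects ("within") model
fixed1  <- plm(form, data = Produc, index = c("state", "year"), model = "within")

# Step 3: random-effects model
random1 <- plm(form, data = Produc, index = c("state", "year"), model = "random")

summary(pooled)
summary(fixed1)
summary(random1)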

To choose the model that best fits the "Produc" data set, we run pairwise hypothesis tests among the three models and select the best fit at the end.

Test 1:
Pooling model vs fixed-effects model

Command:
pFtest(fixed1, pooled)
Test details:
H0 (null): the individual and time-based parameters are all zero.
Alternative hypothesis: at least one of the individual or time-based parameters is non-zero.

The p-value is very low, so the null hypothesis is rejected: the individual effects are significant.

Hence the fixed-effects model is better than the pooling model.


Test 2:
Pooling model vs random-effects model

Command:
plmtest(pooled)

Test details:
H0 (null): the individual and time-based parameters are all zero (pooling model).
Alternative hypothesis: at least one of the individual or time-based parameters is non-zero (random-effects model).
The p-value is very low, so the null hypothesis is rejected.

Hence the random-effects model is better than the pooling model.


Test 3:
Fixed-effects model vs random-effects model

Command:

We use the Hausman test:
phtest(random1, fixed1)
Test details:
H0 (null): the individual effects are not correlated with any regressor (random-effects model).
Alternative hypothesis: the individual effects are correlated with the regressors (fixed-effects model).
If the null is rejected, the random-effects estimator is inconsistent. The p-value is very low, so the null hypothesis is rejected.

Hence the fixed-effects model is better than the random-effects model.


Conclusion:
The fixed-effects model best fits the "Produc" panel data set, i.e. there is significant correlation between the individual effects and the regressors, and individual (index) effects exist.
Hence we choose the fixed-effects model to estimate the panel data in the "Produc" data set.






Wednesday, February 13, 2013

Business Applications Lab Session #6 on 12th Feb 2013


Assignment: 

Create log-returns data and calculate its historical volatility.

Commands : 

1) (log(S[t]) - log(S[t-1])) / log(S[t-1])
OR
2) log((S[t] - S[t-1]) / S[t-1])

Create an ACF plot for the log returns, run the ADF test, and analyse the results.
Data :
NSE Index – Jan 2012 to Jan 2013
NIFTY data – closing prices

Commands:-

> niftychart<-read.csv(file.choose(),header=T)
> closingval<-niftychart$Close

> closingval.ts<-ts(closingval,frequency=252)
> plot(log( closingval.ts))
> minusone.ts<-lag(closingval.ts,k=-1)
> plot(log( minusone.ts))
> z<-log(closingval.ts)-log(minusone.ts)
> z



> returns<-z/log(minusone.ts)
> plot(returns,main="Plot of Log Returns;CNX NSE Nifty Jan-2012 to Jan-2013" )

 > acf(returns,main=" The Auto Correlation Plot;   Dotted line shows 95% confidence interval ")



The ACF plot shows that the correlations lie within the 95% confidence bounds, so there is a fairly good chance that the data are stationary.

> adf.test(returns)




The ADF test and its p-value confirm that the data are stationary.
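For completeness, adf.test() is provided by the tseries package; a minimal, self-contained version of this step, using the returns series computed above:

# adf.test() lives in the tseries package
library(tseries)
adf.test(returns)   # a small p-value rejects the unit-root null, i.e. the series is stationary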

# Now calculating the historical volatility of the data

> T<-252^0.5
> histvolatility<-sd(returns)/T

> histvolatility
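One caveat, offered as a note rather than a correction of the session: the usual convention annualises daily volatility by multiplying the standard deviation of the daily log returns by the square root of 252, rather than dividing. A sketch of that convention:

# conventional annualised historical volatility from daily log returns
daily_sd <- sd(returns)
histvolatility_annual <- daily_sd * sqrt(252)   # multiply (not divide) by sqrt(252) to annualise
histvolatility_annual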

Thursday, February 7, 2013

Business Applications Lab Session #5 on 5th Feb 2013


ASSIGNMENT 1 :

Converting the data into time-series format and then calculating the returns from it.
(Data taken: NSE MIDCAP 50, 31 July to 31 December 2012)

COMMANDS:

> z<-read.csv(file.choose(),header=T)
> Close<-z$Close
> Close

 [1] 1994.30 1993.30 2006.55 1990.00 2002.30 2033.70 2042.00 2046.85 2054.05
[10] 2057.85 2033.65 2063.55 2116.10 2155.80 2134.05 2191.65 2198.40 2203.40
[19] 2210.90 2216.90 2252.45 2269.65 2286.75 2298.00 2275.55 2255.90 2271.65
[28] 2238.95 2287.35 2286.05 2287.05 2254.00 2251.40 2281.30 2258.20 2258.80
[37] 2239.60 2228.80 2199.00 2188.10 2162.00 2174.40 2207.10 2226.45 2208.50
[46] 2214.35 2238.80 2242.30 2219.80 2229.75 2233.80 2233.70 2200.05 2178.80
[55] 2152.10 2168.00 2176.80 2176.10 2195.60 2226.20 2248.25 2288.45 2315.55
[64] 2332.05 2343.85 2369.60 2360.10 2377.95 2350.85 2361.85 2323.15 2347.85
[73] 2363.65 2388.25 2391.65 2379.35 2325.35 2327.45 2345.10 2334.00 2357.25
[82] 2369.50

> Close.ts<-ts(Close)
> Close.ts<-ts(Close,deltat=1/252)
> z1<-ts(data=Close.ts[10:95],frequency=1,deltat=1/252)
> z1.ts<-ts(z1)
> z1.ts

Time Series:
Start = 1
End = 86
Frequency = 1
 [1] 2057.85 2033.65 2063.55 2116.10 2155.80 2134.05 2191.65 2198.40 2203.40
[10] 2210.90 2216.90 2252.45 2269.65 2286.75 2298.00 2275.55 2255.90 2271.65
[19] 2238.95 2287.35 2286.05 2287.05 2254.00 2251.40 2281.30 2258.20 2258.80
[28] 2239.60 2228.80 2199.00 2188.10 2162.00 2174.40 2207.10 2226.45 2208.50
[37] 2214.35 2238.80 2242.30 2219.80 2229.75 2233.80 2233.70 2200.05 2178.80
[46] 2152.10 2168.00 2176.80 2176.10 2195.60 2226.20 2248.25 2288.45 2315.55
[55] 2332.05 2343.85 2369.60 2360.10 2377.95 2350.85 2361.85 2323.15 2347.85
[64] 2363.65 2388.25 2391.65 2379.35 2325.35 2327.45 2345.10 2334.00 2357.25
[73] 2369.50      NA      NA      NA      NA      NA      NA      NA      NA
[82]      NA      NA      NA      NA      NA

> z1.diff<-diff(z1)
> z2<-lag(z1.ts,k=-1)
> Returns<-z1.diff/z2
> plot(Returns,main="10th to 95th day returns")
> z3<-cbind(z1.ts,z1.diff,Returns)
> plot(z3,main="Data from 10th to 95th day, Difference, Returns")
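Note that Close has only 82 observations, so indexing positions 10:95 pads z1 with NAs (visible above). A sketch that avoids the padding, assuming the intent is simply to keep observations from the 10th one onwards:

# keep only the observations that actually exist (10th through last)
z1 <- ts(Close[10:length(Close)], deltat = 1/252)
z1.diff <- diff(z1)                 # day-over-day differences
z2 <- lag(z1, k = -1)               # previous day's value, aligned with today
Returns <- z1.diff / z2             # simple returns
plot(Returns, main = "Returns from the 10th observation onwards")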



------------------------------------------------------------------------------------------------------------

ASSIGNMENT 2 :

Do logit analysis for 700 data points and then predict for 150 data points.


COMMANDS:

> z<-read.csv(file.choose(),header=T)
> z1<-z[1:700,1:9]
> head(z1)

  age ed employ address income debtinc creddebt othdebt default
1  41  3     17      12    176     9.3    11.36    5.01       1
2  27  1     10       6     31    17.3     1.36    4.00       0
3  40  1     15      14     55     5.5     0.86    2.17       0
4  41  1     15      14    120     2.9     2.66    0.82       0
5  24  2      2       0     28    17.3     1.79    3.06       1
6  41  2      5       5     25    10.2     0.39    2.16       0

> z1$ed<-factor(z1$ed)

> z1.est<-glm(default ~ age + ed + employ + address + income + debtinc + creddebt + othdebt, data=z1, family = "binomial")
> summary(z1.est)

Call:
glm(formula = default ~ age + ed + employ + address + income +
    debtinc + creddebt + othdebt, family = "binomial", data = z1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-2.4322  -0.6463  -0.2899   0.2807   3.0255 

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -1.589302   0.605324  -2.626  0.00865 **
age          0.035514   0.017588   2.019  0.04346 * 
ed2          0.307623   0.251629   1.223  0.22151   
ed3          0.352448   0.339937   1.037  0.29983   
ed4         -0.085359   0.472938  -0.180  0.85677   
ed5          0.874942   1.293734   0.676  0.49886   
employ      -0.260737   0.033410  -7.804 5.99e-15 ***
address     -0.105426   0.023264  -4.532 5.85e-06 ***
income      -0.007855   0.007782  -1.009  0.31282   
debtinc      0.070551   0.030598   2.306  0.02113 * 
creddebt     0.625177   0.112940   5.535 3.10e-08 ***
othdebt      0.053470   0.078464   0.681  0.49558   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 804.36  on 699  degrees of freedom
Residual deviance: 549.56  on 688  degrees of freedom
AIC: 573.56

Number of Fisher Scoring iterations: 6

> forecast<-z[701:850,1:8]
> forecast$ed<-factor(forecast$ed)
> forecast$probability<-predict(z1.est, newdata=forecast, type="response")
> head(forecast)

    age ed employ address income debtinc creddebt othdebt probability
701  36  1     16      13     32    10.9     0.54    2.94  0.00783975
702  50  1      6      27     21    12.9     1.32    1.39  0.07044926
703  40  1      9       9     33    17.0     4.88    0.73  0.63780431
704  31  1      5       7     23     2.0     0.05    0.41  0.07471587
705  29  1      4       0     24     7.8     0.87    1.01  0.34464735
706  25  2      1       3     14     9.9     0.23    1.15  0.45584645
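To turn the predicted probabilities into predicted default classes, one could apply a cut-off (0.5 is a common, though arbitrary, choice); a small sketch using the forecast data frame built above:

# classify as default (1) when the predicted probability exceeds 0.5
forecast$predicted_default <- ifelse(forecast$probability > 0.5, 1, 0)
table(forecast$predicted_default)   # how many of the 150 cases are flagged as likely defaults
head(forecast)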



Tuesday, January 22, 2013

Business Applications Lab Session #3 on 22nd Jan 2013


Assignment 1A

Fit ‘lm’ and comment on the applicability of ‘lm’.
Plot 1: residuals vs the independent variable
Plot 2: standardized residuals vs the independent variable

file<-read.csv(file.choose(),header=T)
file
  mileage groove
1       0 394.33
2       4 329.50
3       8 291.00
4      12 255.17
5      16 229.33
6      20 204.83
7      24 179.00
8      28 163.83
9      32 150.33
x<-file$groove
x
[1] 394.33 329.50 291.00 255.17 229.33 204.83 179.00 163.83 150.33
y<-file$mileage
y
[1]  0  4  8 12 16 20 24 28 32
reg1<-lm(y~x)
res<-resid(reg1)
res

         1          2          3          4          5          6          7          8          9
 3.6502499 -0.8322206 -1.8696280 -2.5576878 -1.9386386 -1.1442614 -0.5239038  1.4912269  3.7248633

plot(x,res)
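The assignment also asks for the standardized-residual plot; a sketch using rstandard() on the fitted model above:

# Plot 2: standardized residuals vs the independent variable
stdres <- rstandard(reg1)
plot(x, stdres, main = "Standardized residuals vs groove")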


As the residual plot is parabolic, a simple linear regression is not appropriate for this data set.
------------------------------------------------------------------------------------------------------------

Assignment 1B -Alpha-Pluto Data

Fit ‘lm’ and comment on the applicability of ‘lm’.
Plot 1: residuals vs the independent variable
Plot 2: standardized residuals vs the independent variable
Also do:
QQ plot (qqnorm)
qqline

file<-read.csv(file.choose(),header=T)
file
   alpha pluto
1  0.150    20
2  0.004     0
3  0.069    10
4  0.030     5
5  0.011     0
6  0.004     0
7  0.041     5
8  0.109    20
9  0.068    10
10 0.009     0
11 0.009     0
12 0.048    10
13 0.006     0
14 0.083    20
15 0.037     5
16 0.039     5
17 0.132    20
18 0.004     0
19 0.006     0
20 0.059    10
21 0.051    10
22 0.002     0
23 0.049     5
x<-file$alpha
y<-file$pluto
x
 [1] 0.150 0.004 0.069 0.030 0.011 0.004 0.041 0.109 0.068 0.009 0.009 0.048
[13] 0.006 0.083 0.037 0.039 0.132 0.004 0.006 0.059 0.051 0.002 0.049
y
 [1] 20  0 10  5  0  0  5 20 10  0  0 10  0 20  5  5 20  0  0 10 10  0  5
reg1<-lm(y~x)
res<-resid(reg1)
res
         1          2          3          4          5          6          7
-4.2173758 -0.0643108 -0.8173877  0.6344584 -1.2223345 -0.0643108 -1.1852930
         8          9         10         11         12         13         14
 2.5653342 -0.6519557 -0.8914706 -0.8914706  2.6566833 -0.3951747  6.8665650
        15         16         17         18         19         20         21
-0.5235652 -0.8544291 -1.2396007 -0.0643108 -0.3951747  0.8369318  2.1603874
        22         23
 0.2665531 -2.5087486
 plot(x,res)
 qqnorm(res)
 qqline(res)
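As in Assignment 1A, the standardized-residual plot asked for in the assignment can be produced with rstandard() on the fitted model:

# Plot 2: standardized residuals vs the independent variable
stdres <- rstandard(reg1)
plot(x, stdres, main = "Standardized residuals vs alpha")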





----------------------------------------------------------------------------------------------------------------------------------
Assignment 2 - Justify Null Hypothesis using ANOVA

file<-read.csv(file.choose(),header=T)
file

   Chair Comfort.Level Chair1
1      I             2      a
2      I             3      a
3      I             5      a
4      I             3      a
5      I             2      a
6      I             3      a
7     II             5      b
8     II             4      b
9     II             5      b
10    II             4      b
11    II             1      b
12    II             3      b
13   III             3      c
14   III             4      c
15   III             4      c
16   III             5      c
17   III             1      c
18   III             2      c

file.anova<-aov(file$Comfort.Level~file$Chair1)
summary(file.anova)

            Df Sum Sq Mean Sq F value Pr(>F)
file$Chair1  2  1.444  0.7222   0.385  0.687
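Since the p-value (0.687) is well above 0.05, the null hypothesis of equal mean comfort levels across the three chairs is not rejected. A slightly more idiomatic form of the same call, using the data argument:

# same ANOVA written with a data argument; the conclusion is unchanged
file.anova <- aov(Comfort.Level ~ Chair1, data = file)
summary(file.anova)   # Pr(>F) = 0.687 > 0.05, so we fail to reject H0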




Tuesday, January 15, 2013

Business Applications Lab Session #2 on 15th Jan 2013

Assignment 1: Create the two matrices, select highlighted columns & use cbind to create a new matrix.

Commands included in the snapshot:
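The original snapshot is not reproduced here; a minimal illustrative sketch with hypothetical matrices (the actual matrices and highlighted columns were given in class):

# two hypothetical 3x3 matrices standing in for the ones given in class
m1 <- matrix(1:9,   nrow = 3)
m2 <- matrix(10:18, nrow = 3)

# pick one column from each matrix and bind them into a new matrix
new_matrix <- cbind(m1[, 2], m2[, 3])
new_matrix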



Assignment 2: Multiply given matrices 1 & 2
Commands:

z1%*%z2
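Here z1 and z2 are the two matrices given in class; a self-contained sketch with hypothetical conformable matrices:

# hypothetical matrices: z1 is 2x3 and z2 is 3x2, so z1 %*% z2 is defined
z1 <- matrix(c(1, 2, 3, 4, 5, 6),     nrow = 2, byrow = TRUE)
z2 <- matrix(c(7, 8, 9, 10, 11, 12),  nrow = 3, byrow = TRUE)
z1 %*% z2   # matrix (not element-wise) multiplication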



Assignment 3: Create a regression from the NSE data – NIFTY, 30 days, 1 Dec 2012 to 31 Dec 2012

Commands included in the snapshot:
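The snapshot with the exact commands is not reproduced here; a sketch of one way to run the regression, assuming the CSV has a Close column (as in the NIFTY files used in later sessions) and regressing the closing price on the trading-day index:

# read the NIFTY December 2012 CSV and regress Close on the day index
nifty <- read.csv(file.choose(), header = TRUE)
day   <- seq_along(nifty$Close)            # 1, 2, ..., number of trading days
reg   <- lm(nifty$Close ~ day)             # simple linear trend regression
summary(reg)
plot(day, nifty$Close, main = "NIFTY closing price, Dec 2012")
abline(reg, col = "red")                   # fitted regression line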



Assignment 4: Read about probability distributions from the manual. Generate a normal distribution data plot.

Commands:

x=seq(-4,4,length=200)
y=dnorm(x,mean=0,sd=1)
plot(x,y,type="l",lwd=2,col="red")