Python-II

Better decisions rest on fast statistics and data analysis, and data analysis lies at the intersection of programming, statistics and business analysis. Various tools have come into being to help us understand data in a systematic and scientific way.

One such tool is Python. Before working with Python, some programs and packages need to be installed.

  1. Download and install Anaconda from https://www.continuum.io/downloads.
  2. Download and install the Jupyter Notebook interface: http://jupyter.readthedocs.org/en/latest/install.html
  3. We can use pip or easy_install to install packages. We can browse Python packages by topic at https://pypi.python.org/pypi?%3Aaction=browse .
  4. Install the pandas package from Anaconda with "conda install pandas".
  5. Install the pandasql, seaborn, ggplot and SQLAlchemy packages from Anaconda.

 

Install Packages from within Jupyter Notebook.

Steps:

  1. Open Jupyter Notebook. Click on New and choose the kernel Python Root.
  2. Import the package: import pandas as pd.
  3. Import data (CSV file): if the file is stored locally, we can use the os Python library to locate it: import os.
  4. Read data: variable = pd.read_csv("filename", header=None). Here the data file is Shooting.csv, so the command is

    shooting = pd.read_csv("Shooting.csv", encoding="cp1252")

    In order to get full details of the data set we use the following command in Python

    shooting.info()

    Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1512 entries, 0 to 1511
    Data columns (total 14 columns):
    id                         1512 non-null int64
    name                       1512 non-null object
    date                       1512 non-null object
    manner_of_death            1512 non-null object
    armed                      1511 non-null object
    age                        1479 non-null float64
    gender                     1512 non-null object
    race                       1434 non-null object
    city                       1512 non-null object
    state                      1512 non-null object
    signs_of_mental_illness    1512 non-null bool
    threat_level               1512 non-null object
    flee                       1498 non-null object
    body_camera                1512 non-null bool
    dtypes: bool(2), float64(1), int64(1), object(10)
    memory usage: 144.8+ KB

    Dropping a particular variable (here a column named 'Unnamed: 0', if the file has one): shooting2 = shooting.drop('Unnamed: 0', axis=1)

print(shooting2)

Output:

        id                              name       date   manner_of_death  \
0        3                        Tim Elliot   1/2/2015              shot   
1        4                  Lewis Lee Lembke   1/2/2015              shot   
2        5                John Paul Quintero   1/3/2015  shot and Tasered   
3        8                   Matthew Hoffman   1/4/2015              shot   
4        9                 Michael Rodriguez   1/4/2015              shot   
5       11                 Kenneth Joe Brown   1/4/2015              shot   
6       13               Kenneth Arnold Buck   1/5/2015              shot   
7       15                     Brock Nichols   1/6/2015              shot   
8       16                     Autumn Steele   1/6/2015              shot   
9       17                   Leslie Sapp III   1/6/2015              shot   
10      19                    Patrick Wetter   1/6/2015  shot and Tasered   
11      21                         Ron Sneed   1/7/2015              shot   
12      22    Hashim Hanif Ibn Abdul-Rasheed   1/7/2015              shot   
13      25            Nicholas Ryan Brickman   1/7/2015              shot   
14      27  Omarr Julian Maximillian Jackson   1/7/2015              shot   
15      29                     Loren Simpson   1/8/2015              shot   
16      32               James Dudley Barker   1/8/2015              shot   
17      36               Artago Damon Howard   1/8/2015              shot   
18      37                      Thomas Hamby   1/8/2015              shot   
19      38                     Jimmy Foreman   1/9/2015              shot   
20     325                     Andy Martinez   1/9/2015              shot   
21      42                       Tommy Smith  1/11/2015              shot   
22      43                     Brian Barbosa  1/11/2015              shot   
23      45                 Salvador Figueroa  1/11/2015  shot and Tasered   
24      46               John Edward O'Keefe  1/13/2015              shot   
25      48                 Richard McClendon  1/13/2015              shot   
26      49                     Marcus Golden  1/14/2015              shot   
27      50                    Michael Goebel  1/14/2015              shot   
28      51                      Mario Jordan  1/14/2015              shot   
29      52                  Talbot Schroeder  1/14/2015              shot

Selecting a range of rows

Input : shooting4 = shooting[0:12]

print(shooting4)

To get summary statistics for a column: shooting.age.describe()

Output :

count    1479.000000
mean       36.379310
std        12.730798
min         6.000000
25%              NaN
50%              NaN
75%              NaN
max        86.000000
Name: age, dtype: float64

To get the correlation between the numeric variables:

                               id       age  signs_of_mental_illness  body_camera
id                       1.000000 -0.035646                -0.062726     0.079815
age                     -0.035646  1.000000                 0.107638    -0.005177
signs_of_mental_illness -0.062726  0.107638                 1.000000     0.022973
body_camera              0.079815 -0.005177                 0.022973     1.000000
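Each entry in a correlation matrix like this one is a Pearson correlation coefficient. As a minimal sketch of what is being computed, the helper below implements the textbook formula using only the standard library. The name pearson is hypothetical and the short lists are made-up values, not rows from the shooting data. In pandas the whole matrix is typically produced by DataFrame.corr().

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squares (denominator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values (assumption, not real data)
ages = [20, 30, 40, 50]
scores = [1, 2, 3, 4]

r_up = pearson(ages, scores)          # scores rise with age
r_down = pearson(ages, scores[::-1])  # scores fall with age
print(r_up, r_down)
```

A value near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and values near 0 (like most entries in the table) mean little linear relationship.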

 

Using SQL: Python has the pandasql package, which lets us query a DataFrame with SQL.

from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

pysqldf("SELECT * FROM shooting2 LIMIT 5;")

Ranging using pysqldf

pysqldf("SELECT * FROM shooting2 WHERE age > 41;")
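pandasql uses SQLite syntax, so the two queries can be tried out without pandas at all. The sketch below runs them with Python's built-in sqlite3 module on a tiny table; the three rows and the reduced column list are invented for illustration and are not taken from Shooting.csv.

```python
import sqlite3

# In-memory database with a made-up, three-row stand-in for shooting2
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shooting2 (id INTEGER, name TEXT, age REAL)")
conn.executemany(
    "INSERT INTO shooting2 VALUES (?, ?, ?)",
    [(1, "Person A", 53.0), (2, "Person B", 23.0), (3, "Person C", 47.0)],
)

# LIMIT keeps only the first rows of the result
first_rows = conn.execute("SELECT * FROM shooting2 LIMIT 2;").fetchall()

# WHERE keeps only the rows that satisfy the condition
older = conn.execute(
    "SELECT name, age FROM shooting2 WHERE age > 41 ORDER BY id;"
).fetchall()

print(first_rows)
print(older)
```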

For data visualization we use the excellent seaborn package from http://stanford.edu/~mwaskom/software/seaborn/index.html. Histograms, boxplots, scatter plots and joint plots are very easily plotted using seaborn.

Making displot

Input : import seaborn as sns

import matplotlib.pyplot as plt

sns.displot(data=shooting, x="age")

Making boxplot

Input : ax = sns.boxplot(x="gender", y="age", data=shooting)

Making jointplot

Input : sns.jointplot(x="age", y="gender", data=shooting)

Making factor plot (in recent seaborn versions factorplot is called catplot, and size is called height)

Input : sns.factorplot(x="gender", y="age",
col="race", data=shooting, kind="box", size=4, aspect=.5);

For data visualization, I can also use the ggplot package created by Yhat:

from ggplot import *

p = ggplot(aes(x='age', y='gender', color='race'), data=shooting)
p + geom_point()

Regression models: regression is a widely used data science technique for business. For it I can use the statsmodels package.

Input: import statsmodels.formula.api as sm

boston = pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Boston.csv")

boston = boston.drop('Unnamed: 0', axis=1)

boston.head()

result = sm.ols(formula="medv ~ crim + zn + nox + ptratio + black + rm", data=boston).fit()
result.summary()

Dep. Variable:     medv              R-squared:           0.631
Model:             OLS               Adj. R-squared:      0.626
Method:            Least Squares     F-statistic:         142.0
Date:              Tue, 19 Jul 2016  Prob (F-statistic):  1.49e-104
Time:              03:04:33          Log-Likelihood:      -1588.2
No. Observations:  506               AIC:                 3190.
Df Residuals:      499               BIC:                 3220.
Df Model:          6
Covariance Type:   nonrobust

               coef  std err        t  P>|t|  [95.0% Conf. Int.]
Intercept   -0.3594    4.863   -0.074  0.941   -9.915     9.196
crim        -0.0991    0.034   -2.890  0.004   -0.167    -0.032
zn          -0.0064    0.014   -0.470  0.638   -0.033     0.020
nox        -10.8653    2.865   -3.793  0.000  -16.494    -5.237
ptratio     -1.0519    0.135   -7.796  0.000   -1.317    -0.787
black        0.0137    0.003    4.453  0.000    0.008     0.020
rm           6.9796    0.396   17.612  0.000    6.201     7.758

Omnibus:        298.859  Durbin-Watson:     0.808
Prob(Omnibus):  0.000    Jarque-Bera (JB):  3305.426
Skew:           2.385    Prob(JB):          0.00
Kurtosis:       14.577   Cond. No.          7.66e+03

Get the results:

result.params

Intercept    -0.359432
crim         -0.099122
zn           -0.006364
nox         -10.865295
ptratio      -1.051937
black         0.013737
rm            6.979587
dtype: float64

 


Operations in R

2. Operate data in R

Data Types in R

  • Variable
  • Tables

2.1 Variables: Numerics, Characters, Factors, Logicals

A variable allows you to store a value in R. You can later use the variable's name to easily access the value stored in it.

# Assign the value 50 to x
x <- 50
# Print out the value of the variable x
print(x)
Output = 50

Decimal values like 8.22 are called numerics.

Natural numbers like 86 are called integers. Integers are also numerics.
Boolean values (TRUE or FALSE) are called logical.
Text values are called characters.

# Set the variable my_apple to 42

my_apple <- 42

# Change my_character to be "Wow"

my_character <- "Wow"

# Change my_logical to be FALSE

my_logical <- FALSE

 

If we are given variables and do not know their data types, one way to find out is class(variable).

Taking the above variables:

# Check class of my_apple

class(my_apple)

"numeric"

# Check class of my_character

class(my_character)

"character"

# Check class of my_logical

class(my_logical)

"logical"

NOTE: Values of two different data types (for example a numeric and a character) cannot be combined in an operation directly; one has to be converted to the type of the other to get a result.

2.2 Vectors

# Sales of Apple

Sale_Apple <- c(2,4,5,6)

# Sales of Orange

Sale_Orange <- c(4,5,8,7)

# Calculate Total Sales

Total_Sales <- Sale_Apple + Sale_Orange

print(Total_Sales)
Output = 6 9 13 13

# Create a vector with the names of the days

Days <- c("Monday", "Tuesday", "Wednesday", "Thursday")

# Assign the names of the days to the vectors Sale_Apple, Sale_Orange.

names(Sale_Apple) <- Days

names(Sale_Orange) <- Days

 

2.3 Data Frames

A data frame has the variables of a data set as columns and the observations as rows.

Suppose we have loaded the "mtcars" data set (https://vincentarelbundock.github.io/Rdatasets/datasets.html).

To see the first or last observations of the data set we use the head() and tail() commands.

head() enables you to show the first observations of a data frame. Similarly, the function tail() prints out the last observations in your data set.

Data frames as we now know can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type.

2.4 Lists

A list allows you to gather a variety of objects: matrices, vectors, data frames, even other lists. These objects need not be related to each other in any way.

# Vector with numerics from 1 up to 70

my_vector <- 1:70

# First 10 rows of the built-in data frame mtcars

my_df <- mtcars[1:10,]

# Adapt list() call to give the components names

my_list <- list(vec = my_vector, df = my_df)

# Print out my_list

my_list

3. Simple Linear Regression

Linear regression analysis is the most widely used of all statistical techniques: it is the study of linear, additive relationships between variables. Let Y denote the "dependent" variable whose values you wish to predict, and let X denote the independent variable used to predict it.

We use the LungCapData data set to understand regression in R.

#import data in R
LungCapData <- read.csv(file.choose(), header=TRUE)

#attach the data
attach(LungCapData)

#correlation between Age and LungCap (result: 0.8196)
cor(Age, LungCap)

#Run Regression
mod <- lm(LungCap ~ Age)

#Ask Summary
summary(mod)

#Scatterplot of Age and LungCap
plot(Age, LungCap, main="scatterplot")

abline(mod)

#coefficient
coef(mod)

#confidence Interval
confint(mod)

#Change Confidence interval
confint(mod, level=.99)

#Building ANOVA TABLE
anova(mod)
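For a single predictor, the least-squares coefficients have a closed form: the slope is the sum of products of deviations of X and Y divided by the sum of squared deviations of X, and the intercept is mean(Y) minus slope times mean(X). A minimal Python sketch of that calculation follows. fit_line is a hypothetical helper, and the numbers are made up so that the answer is easy to check; they are not values from LungCapData.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data lying exactly on the line y = 0.5 + 0.5 * x (assumption)
age = [3, 5, 7, 9]
lung_cap = [2.0, 3.0, 4.0, 5.0]

intercept, slope = fit_line(age, lung_cap)
print(intercept, slope)
```

With real data the points do not lie exactly on a line, and the same formulas give the line that minimizes the sum of squared residuals.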

Assumptions of Linear Model

  • Y values can be expressed as a linear function of X variable.
  • The error terms are independent.
  • Variation of observation is constant around the regression line.
  • For given value of X,Y values are normally distributed.

How?

Steps

We have run a regression.

To get all 4 plots on a single page: par(mfrow=c(2,2))

plot(mod): we see 4 charts.

In the first plot: for linearity, the line should be fairly flat, and for homoskedasticity the spread of the points should be constant.

Second plot: QQ plot (quantile-quantile plot). The Y axis shows the ordered, observed, standardized residuals and the X axis shows the ordered theoretical residuals. If the Y values or error terms are normally distributed, the points lie on the diagonal line.

SQL-II

Sorting

To sort data in SQL we use the ORDER BY clause.

To sort the movies in decreasing order of box office rating, type:

SELECT * FROM movies

ORDER BY box_office_rating DESC;

DESC here implies descending order.

A query that returns the three lowest rated movies:

SELECT * FROM movies
ORDER BY imdb_rating ASC
LIMIT 3;
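To try ORDER BY and LIMIT without setting up a database server, the queries can be run through Python's built-in sqlite3 module. This is only a sketch: the four movies and their ratings are made up, and the table has just the two columns needed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Movie A", 7.1), ("Movie B", 5.4), ("Movie C", 8.8), ("Movie D", 6.0)],
)

# ASC sorts from lowest to highest; LIMIT 3 keeps the first three rows
lowest_three = conn.execute(
    "SELECT name FROM movies ORDER BY imdb_rating ASC LIMIT 3;"
).fetchall()

# DESC sorts from highest to lowest
highest_first = conn.execute(
    "SELECT name FROM movies ORDER BY imdb_rating DESC;"
).fetchall()

print(lowest_three)
print(highest_first)
```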

Functions

We use functions to quickly sum, average or count a particular column in a table.

Suppose we take a database of mobile applications that includes the columns id, name, category, downloads and price.

To view each price along with the number of applications at that price, counting only applications with more than 20000 downloads, we type

SELECT price, COUNT(*) FROM mobile_apps
WHERE downloads > 20000
GROUP BY price;

To count the applications with price 0 we type

SELECT COUNT(*) FROM mobile_apps WHERE price = 0;

To sum the downloads of all applications we use

SELECT SUM(downloads) FROM mobile_apps;

To get the largest number of downloads of any application we use

SELECT MAX(downloads) FROM mobile_apps;

Similarly, to get the smallest number of downloads we use

SELECT MIN(downloads) FROM mobile_apps;

To get the average number of downloads we use

SELECT AVG(downloads) FROM mobile_apps;

To round the average number of downloads to two decimal places for each price we use

SELECT price, ROUND(AVG(downloads), 2) FROM mobile_apps GROUP BY price;
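The aggregate queries can be checked on a small table using Python's built-in sqlite3 module. In this sketch the four applications are made up; only the column names follow the description of the mobile_apps table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mobile_apps "
    "(id INTEGER, name TEXT, category TEXT, downloads INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO mobile_apps VALUES (?, ?, ?, ?, ?)",
    [
        (1, "App A", "Games", 30000, 0.0),
        (2, "App B", "Games", 10000, 0.0),
        (3, "App C", "Tools", 50000, 1.99),
        (4, "App D", "Tools", 25000, 1.99),
    ],
)

# Number of apps per price, counting only apps with more than 20000 downloads
counts = dict(conn.execute(
    "SELECT price, COUNT(*) FROM mobile_apps WHERE downloads > 20000 GROUP BY price;"
).fetchall())

# Number of free apps
free_apps = conn.execute(
    "SELECT COUNT(*) FROM mobile_apps WHERE price = 0;"
).fetchone()[0]

# Several aggregates over the whole table in one query
total, largest, smallest, average = conn.execute(
    "SELECT SUM(downloads), MAX(downloads), MIN(downloads), AVG(downloads) "
    "FROM mobile_apps;"
).fetchone()

# Rounded average downloads for each price
avg_by_price = dict(conn.execute(
    "SELECT price, ROUND(AVG(downloads), 2) FROM mobile_apps GROUP BY price;"
).fetchall())

print(counts, free_apps, total, largest, smallest, average, avg_by_price)
```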

Multiple Tables

Two or more tables may be related to each other. Through SQL we can combine the data from tables that are related.

Suppose we already have a table named albums. We now create a second table named artists by typing CREATE TABLE artists(id INTEGER PRIMARY KEY, name TEXT);

Note that an artist can create many albums, but each album is produced by one artist.

Here id is a primary key, which means SQL ensures that none of the values in this column is NULL and that each value is unique.

To look at both the artists and albums tables we type the following:

SELECT * FROM artists WHERE id = 3;        -- table of artists

SELECT * FROM albums WHERE artist_id = 3;  -- table of albums

artist_id is the column in the albums table that holds the id of the artist. The id in the artists table is the same as the artist_id in the albums table.
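A small sketch with Python's built-in sqlite3 module shows the two tables side by side and the uniqueness guarantee of a primary key. The columns of the albums table and all of the rows are assumptions made for illustration; only the artists definition comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists(id INTEGER PRIMARY KEY, name TEXT)")
# Assumed layout for albums: its own id, a name, and the id of its artist
conn.execute(
    "CREATE TABLE albums(id INTEGER PRIMARY KEY, name TEXT, artist_id INTEGER)"
)

conn.execute("INSERT INTO artists VALUES (3, 'Artist Three')")
conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?)",
    [(10, "First Album", 3), (11, "Second Album", 3)],
)

# A second artist with the same id violates the primary key
try:
    conn.execute("INSERT INTO artists VALUES (3, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

artist = conn.execute("SELECT * FROM artists WHERE id = 3;").fetchall()
albums = conn.execute(
    "SELECT name FROM albums WHERE artist_id = 3 ORDER BY id;"
).fetchall()

print(duplicate_rejected, artist, albums)
```

One artist row matches two album rows, which is the one-to-many relationship described above.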

SQL Joins

A JOIN clause is used to combine records from two or more tables in a database.

We have the CUSTOMERS table:

ID  NAME   AGE  ADDRESS    SALARY
1   Ram    23   Delhi      10000
2   Kavi   25   Chennai    40000
3   Jerry  44   Bangalore  80000

And the ORDERS table:

ORDER_ID  DATE       CUSTOMER_ID  AMOUNT
100       2016-8-12  3            3000
101       2016-6-14  3            1500
102       2016-7-19  2            2000

 

Joining the two tables :

SELECT ID, NAME, AGE, AMOUNT
FROM CUSTOMERS, ORDERS
WHERE CUSTOMERS.ID = ORDERS.CUSTOMER_ID;

Here CUSTOMERS.ID = ORDERS.CUSTOMER_ID expresses the relation between the ID column in the CUSTOMERS table and the CUSTOMER_ID column in the ORDERS table.

Result

ID  NAME   AGE  AMOUNT
3   Jerry  44   3000
3   Jerry  44   1500
2   Kavi   25   2000
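The join can be reproduced with Python's built-in sqlite3 module using the customer and order rows from the tables above. This is a sketch: the column types are assumptions, and the order date column is named ORDER_DATE here purely to keep the example unambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMERS "
    "(ID INTEGER, NAME TEXT, AGE INTEGER, ADDRESS TEXT, SALARY INTEGER)"
)
conn.execute(
    "CREATE TABLE ORDERS "
    "(ORDER_ID INTEGER, ORDER_DATE TEXT, CUSTOMER_ID INTEGER, AMOUNT INTEGER)"
)
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?, ?)",
    [(1, "Ram", 23, "Delhi", 10000),
     (2, "Kavi", 25, "Chennai", 40000),
     (3, "Jerry", 44, "Bangalore", 80000)],
)
conn.executemany(
    "INSERT INTO ORDERS VALUES (?, ?, ?, ?)",
    [(100, "2016-8-12", 3, 3000),
     (101, "2016-6-14", 3, 1500),
     (102, "2016-7-19", 2, 2000)],
)

# Pair every order with the customer who placed it
rows = conn.execute(
    "SELECT ID, NAME, AGE, AMOUNT "
    "FROM CUSTOMERS, ORDERS "
    "WHERE CUSTOMERS.ID = ORDERS.CUSTOMER_ID;"
).fetchall()

for row in sorted(rows):
    print(row)
```

Ram does not appear in the result because no order refers to customer 1.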

There are several kinds of join in SQL:

INNER JOIN: selects the rows from both tables for which there is a match between the join columns.

LEFT JOIN: selects all rows from table 1, with the matching rows from table 2. The right side is NULL when there is no match.

RIGHT JOIN: selects all rows from table 2, with the matching rows from table 1. The left side is NULL when there is no match.

FULL JOIN: selects all rows from table 1 and from table 2; it combines the results of both LEFT and RIGHT joins.
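The difference between INNER JOIN and LEFT JOIN is easiest to see on a customer who has no orders. The sketch below uses Python's built-in sqlite3 module with cut-down versions of the customer and order tables (only the columns needed for the join; the layout is an assumption). RIGHT and FULL joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (ID INTEGER, NAME TEXT)")
conn.execute("CREATE TABLE ORDERS (ORDER_ID INTEGER, CUSTOMER_ID INTEGER, AMOUNT INTEGER)")
conn.executemany(
    "INSERT INTO CUSTOMERS VALUES (?, ?)",
    [(1, "Ram"), (2, "Kavi"), (3, "Jerry")],
)
conn.executemany(
    "INSERT INTO ORDERS VALUES (?, ?, ?)",
    [(100, 3, 3000), (101, 3, 1500), (102, 2, 2000)],
)

# INNER JOIN: only customers that have at least one order
inner_rows = conn.execute(
    "SELECT NAME, AMOUNT FROM CUSTOMERS "
    "INNER JOIN ORDERS ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID "
    "ORDER BY CUSTOMERS.ID, ORDERS.AMOUNT;"
).fetchall()

# LEFT JOIN: every customer; AMOUNT is NULL (None in Python) without an order
left_rows = conn.execute(
    "SELECT NAME, AMOUNT FROM CUSTOMERS "
    "LEFT JOIN ORDERS ON CUSTOMERS.ID = ORDERS.CUSTOMER_ID "
    "ORDER BY CUSTOMERS.ID, ORDERS.AMOUNT;"
).fetchall()

print(inner_rows)
print(left_rows)
```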

 

This is the end of the intro to SQL. Having gone through the basics, we can move on to more complex commands.

Git Tutorial

GIT ~ Let's Take Coordination Into the Picture

What is Git? What is it used for?

Imagine you and your teammate could work on the same project at the same time without disturbing each other.

Now the question is: is this possible?

Yes it is! Git allows two or more people to work on the same project, even at the same time, without interfering with each other's work, while still staying up to date with any changes in the project.

Git is used by many team leaders to get their projects executed effectively.

You can download Git from: https://git-scm.com/download/win

Now, how does it work? Suppose there is a team of 3 people, and the team leader assigns Member 1 (A) and Member 2 (B) a single project that needs different inputs from each of them.

The team keeps the folder "PROJECT", which contains all the project details, in a repository kept in Google Cloud. A repository is simply a place where the history of your work is stored.

But how can we tell whether Git can be used for that folder?

First the team leader gives you access to the repository by adding you, and then you can download the folder.

The next step is to open that folder. If you find a ".git" folder inside it, the folder is already set up as a Git repository.


Now let's understand the whole path:

Step 1: Download the "PROJECT" folder from the repository in Google Cloud.

Step 2: Save it on your desktop and perform your tasks.

Step 3: The next step is to submit your work. This again requires a few steps:

  • Master is the original data set, without any changes, present in origin. You make a copy of it, which is also called Master (but a local one), and from this local Master you make another copy called Branch 1. You do your task in Branch 1 without making any change in Master (neither the local one nor the one at GitHub). Member 2 works in a second copy of Master called Branch 2 (copied from Member 2's own local Master, since Member 2 also downloads Master to their desktop) without disturbing Master or Branch 1.

  • Suppose A completes the task and stages the files in Branch 1. Staged here means the files are ready to be committed, so that A can push the branch to the cloud, where it can be reviewed by the team leader. If the team leader approves A's task, the branch goes live and the Master in the cloud is updated with A's work.

NOTE: The Master in the repository in the cloud is updated by the team leader only. Team members first get access to that repository, then download Master and save it on their desktop.

  • It is important to note that the order in which tasks are delivered matters. A has already delivered the task and the Master in the cloud is updated. How will B get to know about this update? B switches to the local master, pulls the new master from the cloud, and merges it with B's own work. This requires a few steps:

# Switch to your local master

git checkout master

# To get the updated master locally.

git pull origin master

# To get yourself switched to Branch 2

git checkout branch2

# Now merge master into branch 2

git merge master

# Push your work, together with the updated master, to the cloud for review

git push origin branch2

Now let's set things up:

Start by making an account at GitHub and getting access to the repository. The terminal prompt below is currently in a directory named "PROJECT".

Next, initialize a Git repository.

# Initialize a Git repository

git init

# Stage the file PROJECT.txt

git add PROJECT.txt

# To check status

git status

OUTPUT:

# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached <file>..." to unstage)
#
#       new file:   PROJECT.txt

 

To Create a Branch: a Copy of Master

# Create a branch for Task 1

git branch Task1

With git branch you'll see two local branches: a main branch named master and your new branch named Task1.

You can switch branches using the git checkout Task1 command.

# Switch to the branch

git checkout Task1

 

Once A has done the task, A has to commit the work on the branch so that it can then be pushed to the origin at GitHub.

# Run the commit command with a message describing what we've changed

git commit -m "Task 1 completed"

To see the history of changes that have been made, we use the log command.

# Check changes

git log

OUTPUT:

commit b652edfd888cd3d5e7fcb857d0dabc5a0fcb5e28
Author: Try Git <try_git@github.com>
Date: Sat Oct 10 08:30:00 2020 -0500

    Task 1 completed
 

The push command tells Git to push our local changes to our origin repo (on GitHub).

# Push the local branch Task1

git push -u origin Task1

NOTE: The -u tells Git to remember the parameters, so that next time we can simply run git push and Git will know what to do.

Team Member B now gets up to date by pulling the new master from the cloud.

# Get the updated master from GitHub

git pull origin master

To look at what is different from the last commit, B uses:

# Difference from our last commit

git diff HEAD

To see the changes B has just staged:

# To see the changes you last staged

git diff --staged

To undo the last staging of a file, we reset the staging area for that file.

# Unstage files

git reset <lastly staged file>.txt

Files can be changed back to how they were at the last commit.

# Restore the file to its state at the last commit

git checkout -- PROJECT.txt

Further steps for Team Member B

 

# Switch to the branch for Task 2

git checkout Task2

# Now merge master into Task2

git merge master

# Push your work, together with the updated master, to the cloud for review

git push origin Task2

Here the team leader gets the work of both A and B, without A and B disturbing each other.

 

DATA VISUALIZATION WITH GGPLOT2

1.VISUALIZE YOUR DATA USING PACKAGE GGPLOT2

1.1 Scatter Plot

In my previous tutorials a scatter plot was built to present the data points of a sample without using any packages. Here we discuss how to do the same with ggplot2, which makes it simple and easy.

Note: Assume the X variable and the Y variable are continuous random variables.

  • First install the ggplot2 package in R:

#install ggplot2 package

install.packages("ggplot2")

  • Now let's get some scatter plots done, starting with the basic one.
  • I have downloaded the diamonds data from https://vincentarelbundock.github.io/Rdatasets/datasets.html
  • The next step is to load the data in R, then load the library, and continue with plotting.

#import data

diamonds <- read.csv(file.choose(), header=TRUE, stringsAsFactors=FALSE)

#load library

library("ggplot2")

#Scatter plot

ggplot(diamonds, aes(x=carat, y=price)) + geom_point()

 

The scatter plot above shows the positive relationship between carat and price.

Now we include another variable that accompanies this relationship; let's take the clarity of the diamonds.

#take color=clarity in aesthetic

ggplot(diamonds,aes(x=carat,y=price,color= clarity))+geom_point()

In this diagram clarity is taken into account: the relationship between the two variables is drawn with colored dots, where each color depicts a clarity grade. Red shows low clarity and a weak positive relation between the two variables, while blue shows a relatively stronger relation.

Now we add one more variable to this relationship by making the size of the scatter points depend on the cut of the diamond.

#take size of dots-cuts

ggplot(diamonds,aes(x=carat,y=price,color= clarity,size=cut))+geom_point()

Here each point is sized by the cut of the diamond, and the diagram shows the relationship between price and carat taking cut and clarity into consideration.

A scatter plot is a layer, so in order to add another layer, say a curve that shows the general trend between the X and Y variables, we use geom_smooth.

To show the line of best fit from a linear model:

#Line to show general trend

ggplot(diamonds,aes(x=carat,y=price))+geom_point()+geom_smooth(se=FALSE,method = lm)

The line shows the linear relationship between the two variables.

Faceting makes the relationship easier to understand when a third variable is taken into account.

#Faceting

ggplot(diamonds,aes(x=carat,y=price))+geom_point()+facet_wrap(~clarity)

1.2 Histogram

Now let's look at histograms with ggplot2.

Sometimes you need only one dimension of the data and want to observe its distribution; here we use a histogram.

#Histogram

ggplot(diamonds,aes(x=price))+ geom_histogram()

The count shows the frequency in each bin, and the histogram shows the distribution of price.

To change the width of the bins we simply pass the binwidth argument.

#Histogram width

ggplot(diamonds,aes(x=price))+ geom_histogram(binwidth = 3000)
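What a bin width does can be sketched outside of R as well. The Python snippet below counts made-up prices in bins of a given width; bin_counts is a hypothetical helper, and starting the bins at 0 is a simplification, since ggplot2 decides the bin alignment itself.

```python
from collections import Counter


def bin_counts(values, binwidth):
    """Count the values falling in each bin [k*binwidth, (k+1)*binwidth)."""
    # Map every value to the left edge of its bin, then count per edge
    counts = Counter((v // binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))


# Made-up prices (assumption, not rows from the diamonds data)
prices = [326, 950, 2800, 3100, 4500, 5999, 6000, 17500]

narrow = bin_counts(prices, 3000)
wide = bin_counts(prices, 6000)
print(narrow)
print(wide)
```

A wider bin merges neighbouring bars, so the histogram gets smoother but shows less detail.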

Let's use the fill option, so that the histogram shows the clarity of the diamonds along with their price.

#Histogram fill with clarity

ggplot(diamonds,aes(x=price,fill=clarity))+ geom_histogram()

1.3 Boxplot

A basic way in statistics to compare distributions is the boxplot.

A boxplot, as I have mentioned before, is a graphical representation of data that shows the highest, lowest and median values.

#Boxplot

ggplot(diamonds,aes(x=color,y=price))+geom_boxplot()

The dark line in the middle of each box shows the median, and the box spans from the 25th percentile to the 75th percentile.

The dark points above the boxes are outliers that go beyond the expected values.
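The numbers behind a boxplot can be sketched in a few lines of Python. The helper below (box_stats is a hypothetical name) computes the quartiles with the standard library and flags points more than 1.5 times the interquartile range beyond the box, which is the usual boxplot convention. The data is made up, and the exact quartile method may differ slightly from the one ggplot2 uses.

```python
import statistics


def box_stats(values):
    """Return (q1, median, q3, outliers) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return q1, median, q3, outliers


# Made-up values with one obviously large point (assumption)
prices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

q1, median, q3, outliers = box_stats(prices)
print(q1, median, q3, outliers)
```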

To get a better picture of the distribution we take the log of price.

#Boxplot taking log values

ggplot(diamonds,aes(x=color,y=price))+geom_boxplot()+scale_y_log10()

These are the very basic forms of data visualization, and they help to present the data well.

Data Visualization In R- II

Advanced Data Visualization in R

As data becomes larger, more diverse and more complex, we require more advanced forms of data visualization.

1.Data Visualization

  1. Mosaic Plot
  2. 3D Graphs
  3. Hexbin Binning

1.1 Mosaic Plot

What is a mosaic plot? Why choose a mosaic plot?

A mosaic plot is a graphical display that allows you to examine the relationship among two or more categorical variables.

If you wanted to compare the mortality rates between men and women using a mosaic plot, you would first divide the unit square according to the overall proportion of males and females.

In the figure, about 30% are female and 70% are male.

Next we include one more variable that depicts the survival of each sex.

Among females about 60% survived, and among males about 25% survived.
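The area of each tile in such a mosaic plot is the share of the whole sample that falls in that combination of categories. A short Python sketch with the approximate percentages quoted here (30% female, 70% male, survival rates of 60% and 25%) shows the arithmetic; the variable names are illustrative only.

```python
# Approximate shares read off the figure
sex_share = {"female": 0.30, "male": 0.70}
survival_rate = {"female": 0.60, "male": 0.25}

# Tile area = share of the sex * share of that sex with the given outcome
tiles = {}
for sex, share in sex_share.items():
    tiles[(sex, "survived")] = share * survival_rate[sex]
    tiles[(sex, "died")] = share * (1 - survival_rate[sex])

overall_survival = tiles[("female", "survived")] + tiles[("male", "survived")]

for key, area in tiles.items():
    print(key, round(area, 3))
print("overall survival:", round(overall_survival, 3))
```

The four areas add up to 1, the area of the unit square.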

The next step is to plot this mosaic plot.

 

#Plot Mosaic Plot

data(Survivals)

mosaicplot(Survivals)

 

1.2 3D Graphs

We use 3D graphs to plot data in three dimensions. A 3D plot depicts the relationship between all three variables and also shows the relationship between any two of them.

How to plot 3D Graphs

# Install rgl package and load library

install.packages("rgl")

library(rgl)

#plot 3D graph
plot3d(var1,var2,var3,xlab= " ",ylab= " ",zlab= " ",type= " ",col= " ",size= " ")

1.3 Hexbin Binning

Hexagon binning is a form of bivariate histogram useful for visualizing the structure in datasets with large n.

How to plot hexbins

#load library

library(hexbin)

# plot  hexbin

a=hexbin(X,Y,xbins=40)

#load color brewer

library(RColorBrewer)

#plot

plot(a)

To add colors for a better understanding of the data we use the following commands.

#load library

library(RColorBrewer)

#setting colramp

tp <- colorRampPalette(rev(brewer.pal(12,'Set3')))

#Plot

hexbinplot(Y~X, data=burgers, colramp=tp)

PURPOSE OF DATA VISUALIZATION

In the world of data we see how important it is to visualize data, and how much time it saves. In the end, though, what matters is that the visual conveys the message for which it was made.

Just think how much time it would take to explain what the data says by looking only at the data itself (I mean lots of figures) or at a table made from it.

Imagine how much time visualization saves, especially for decision makers.

Hence we now discuss how to perform data visualization and present data in R.

In this tutorial, we will create the following visualizations:

1.Basic Visualization

  1. Line Graph
  2. Bar Graph
  3. Scatter plot
  4. Histogram
  5. Pie Chart

1.1 Line Graph

What is the use of Line Graph?

It is commonly used for time-series data.

For example:

To see the annual rainfall over time.

To see how many people eat burgers in a restaurant over time, etc.

 

Line graph

How to plot Line Graph in R?

To plot the data set "UsaPopulation.csv" in R, the steps are as follows.

# import data in R
data1 <- read.csv(file.choose(), header=TRUE, stringsAsFactors=FALSE)

# plot 2012 population (read.csv renames the column 2012 to X2012)
plot(data1$X2012[1:51], type="l", col="red", axes=FALSE)

# label X axis
axis(1, at=1:51, labels=data1$states[1:51], las=2)

# label Y axis
axis(2, las=1)

# Creating title with fonts
title(main="Population", col.main="red", font.main=4)

The font.main values are:

1 = Plain

2 = Bold

3 = Italic

4 = Bold Italic

1.2 Bar Graph

What is the use of bar graph in data visualization?

Suppose a milkman wants to know on which day his sales were highest; an easy way to see this is a bar chart.

A bar chart is a chart with rectangular bars whose lengths are proportional to the values they represent.

Step

#import data in R
Book2 <- read.csv(file.choose(), header=TRUE)

BodyCap  Age  Height  Sex
6.475    6    62.1    male
10.152   18   74.7    male
9.55     16   69.7    male
11.125   14   71      female
4.8      5    66      female
6.225    11   63.3    female
4.95     8    39.2    female

#Count of male and female
bar1 <- table(Book2$Sex)

# Plot bar of Sex with color
barplot(bar1, col=c("red","blue"))
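The counting step that feeds the bar chart can be sketched in Python with collections.Counter, using the Sex column of the seven rows shown in the table. This only illustrates the tally behind the bars; it draws nothing.

```python
from collections import Counter

# Sex column of the seven rows in the table
sex = ["male", "male", "male", "female", "female", "female", "female"]

# One bar per category, with height equal to the count
counts = Counter(sex)

for category in sorted(counts):
    print(category, counts[category])
```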

1.3 Scatter plot

A scatter plot is used to show the relationship between two quantitative variables.

Positive correlation: the value of Y increases with X.

Negative correlation: the value of Y decreases with X.

No correlation: no relationship between X and Y.

#import data in R (mtcars data)
mtcars <- read.csv(file.choose(), header=TRUE)

#plot scatter plot of mtcars$mpg
plot(mtcars$mpg)

#correlation through scatter plot between mpg and hp
plot(mtcars$mpg, mtcars$hp)

1.4 Histogram

When to use a histogram?

If we have numerical data and need to see its frequency distribution, we use a histogram.

Steps

#import data (mtcars data)
mtcars <- read.csv(file.choose(), header=TRUE)

#plot histogram
hist(mtcars$mpg)

#plot histogram with breaks and color
p4 <- hist(mtcars$mpg, breaks=14, col=rainbow(14), labels=TRUE)

1.5 Pie Chart

A pie chart is a circular graph that shows the relative contribution of different categories to an overall total. It is generally used to show proportional or percentage data.

#import data in R (employee data)
Employee <- read.csv(file.choose(), header=TRUE, stringsAsFactors=FALSE)

#make Pie Chart
pie(Employee$SAL)

#modify pie chart
pie(Employee$SAL, main="Salary Pie Chart", col.main="darkgreen", labels=Employee$ENAME, col=rainbow(14))

#Percentage of salaries as labels:

SAL_labels <- round(Employee$SAL/sum(Employee$SAL)*100, 1)

SAL_labels

Lbls <- paste(Employee$ENAME, SAL_labels)

Lbls

#add percentages to labels

Lbls <- paste(Lbls, "%", sep="")

pie(Employee$SAL, main="Salary Pie Chart", labels=Lbls, col=rainbow(14))
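The label calculation is plain arithmetic: each salary is divided by the total, multiplied by 100 and rounded to one decimal place. A small Python sketch of the same steps follows; the three employees and their salaries are made up, since the employee file itself is not shown.

```python
# Made-up stand-ins for the ENAME and SAL columns (assumption)
names = ["A", "B", "C"]
salaries = [3000, 5000, 2000]

total = sum(salaries)

# Share of the total, as a percentage rounded to one decimal place
percentages = [round(s / total * 100, 1) for s in salaries]

# Combine each name with its percentage and a % sign
labels = [f"{name} {pct}%" for name, pct in zip(names, percentages)]

print(percentages)
print(labels)
```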