1. VISUALIZE YOUR DATA USING PACKAGE GGPLOT2
1.1 Scatter Plot
In my previous tutorials, scatter plots were built without any packages to present the data points in a sample. Here we will see how to do the same with ggplot2, which makes it simple and easy.
Note: Assume the X and Y variables are continuous random variables.
- First, install the ggplot2 package in R:
#install ggplot2 package
- Now let's get some scatter plots done: "The Basic One!"
- I have downloaded the diamonds data from https://vincentarelbundock.github.io/Rdatasets/datasets.html
- The next step is to load the data in R, then load the library and continue with plotting.
diamonds <- read.csv(file.choose(), header = TRUE, stringsAsFactors = FALSE)  # pick the downloaded CSV
library(ggplot2)
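The basic scatter plot for this step could look like the sketch below. It is only an illustration: the object name `p` is hypothetical, and if no `diamonds` object has been read from the CSV, the code uses the copy of the data bundled with ggplot2, which has the same carat and price columns.

```r
# Run once if ggplot2 is not installed yet:
# install.packages("ggplot2")
library(ggplot2)

# Assumption: `diamonds` is either the CSV loaded earlier or,
# in a fresh session, the data frame bundled with ggplot2.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()   # one dot per diamond
print(p)
```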
The scatter plot above shows a positive relationship between carat and price.
Now include another variable that accompanies this relationship; let's take the clarity of the diamonds.
#take color=clarity in aesthetic
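One way the colored scatter could be written, as a sketch. It assumes the `diamonds` data bundled with ggplot2, where clarity is an ordered factor, so the default palette may differ from the colors described in this tutorial. The name `p` is hypothetical.

```r
library(ggplot2)

# Assumption: `diamonds` is the data frame bundled with ggplot2.
# Mapping color to clarity inside aes() gives each clarity grade its own color.
p <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point()
print(p)
```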
In this diagram, clarity is added to the two-variable relationship through colored dots: each color represents a clarity grade and shows how the relationship between X and Y behaves for that grade. The red dots mark low clarity and show a weaker positive relationship between the two variables, while the blue dots show a relatively stronger one.
Now add one more variable to this relationship by mapping the size of the scatter points to the cut of the diamonds.
#take size of dots-cuts
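A possible version of this plot, sketched under the assumption that the `diamonds` data bundled with ggplot2 is used (the name `p` is hypothetical). ggplot2 warns that using size for a discrete variable such as cut is not advised, but the plot is still drawn.

```r
library(ggplot2)

# Assumption: `diamonds` is the data frame bundled with ggplot2.
# color shows clarity, size shows cut, both mapped inside aes().
p <- ggplot(diamonds, aes(x = carat, y = price,
                          color = clarity, size = cut)) +
  geom_point()
print(p)
```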
Here each point is sized by the cut of the diamond, so the diagram shows the relationship between price and carat while taking both cut and clarity into consideration.
A scatter plot is a layer, so to add another layer, say a curve that shows the general trend between the X and Y variables, we use geom_smooth.
Showing the line of best fit reminds me of the linear model:
#Curve to show general trend
ggplot(diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth(se = FALSE, method = "lm")
The line shows the linear relationship between the two variables.
Faceting splits the plot into panels by a third variable, which makes the relationship easier to read for each level of that variable.
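As a sketch of faceting, the plot can be split into one panel per clarity grade. The choice of clarity as the faceting variable and of `facet_wrap` (rather than `facet_grid`) are assumptions made for illustration, as are the bundled `diamonds` data and the name `p`.

```r
library(ggplot2)

# Assumption: `diamonds` is the data frame bundled with ggplot2.
# facet_wrap(~ clarity) draws the same scatter and trend line
# separately for every clarity grade.
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  facet_wrap(~ clarity)
print(p)
```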
Now let's build a histogram with ggplot2.
Sometimes you need to look at just one dimension of the data and observe its distribution; for that we use a histogram.
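A minimal histogram of price might look like this. It assumes the `diamonds` data bundled with ggplot2 and leaves the bins at their default, so ggplot2 prints a message suggesting a better bin width. The name `p` is hypothetical.

```r
library(ggplot2)

# Assumption: `diamonds` is the data frame bundled with ggplot2.
# Only x is mapped; geom_histogram() counts the rows in each bin.
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram()
print(p)
```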
The count axis shows the frequency in each bin, and the histogram shows the distribution of price.
To change the width of the bins, we simply pass the binwidth argument to geom_histogram.
ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 3000)
Let's also use the fill option, so that the histogram shows the clarity of the diamonds alongside their price.
#Histogram fill with clarity
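The filled histogram could be sketched as follows, assuming the `diamonds` data bundled with ggplot2 and reusing the bin width of 3000 from the previous step. The name `p` is hypothetical.

```r
library(ggplot2)

# Assumption: `diamonds` is the data frame bundled with ggplot2.
# fill = clarity stacks the bars so each bin is split by clarity grade.
p <- ggplot(diamonds, aes(x = price, fill = clarity)) +
  geom_histogram(binwidth = 3000)
print(p)
```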
A basic way in statistics to compare distributions is the boxplot.
A boxplot, as I have mentioned before, is a graphical representation of data that shows the highest, lowest and median values.
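A boxplot of price could be drawn as in the sketch below. Grouping by clarity on the x axis is an assumption made for illustration, as are the bundled `diamonds` data and the name `p`.

```r
library(ggplot2)

# Assumption: `diamonds` is the data frame bundled with ggplot2,
# and clarity is used as the grouping variable.
# One box is drawn per clarity grade.
p <- ggplot(diamonds, aes(x = clarity, y = price)) +
  geom_boxplot()
print(p)
```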
The dark middle line in the first boxplot shows the median, and the box spans the 25th to the 75th percentile.
The dark lines in the boxplot above are outliers, points that fall beyond the whiskers; there are so many of them that they overlap into what looks like a line.
To get a better picture of the distribution, we take the log of price.
#Boxplot taking log values
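A sketch of the log-scale boxplot, again assuming the `diamonds` data bundled with ggplot2 and clarity as the grouping variable (both are illustrative choices, as is the name `p`). The natural log is applied directly inside `aes()`.

```r
library(ggplot2)

# Assumption: `diamonds` is the data frame bundled with ggplot2,
# grouped by clarity. log(price) compresses the long upper tail,
# so the boxes are easier to compare.
p <- ggplot(diamonds, aes(x = clarity, y = log(price))) +
  geom_boxplot()
print(p)
```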
These are very basic forms of data visualization that help present the data clearly.