Saturday, March 17, 2018

How do you create a BOXPLOT from data?

First, we need some data. We will use the IndiaCensus.xlsx file created in an earlier post, with the data from that post saved as an IndiaCensus.csv file.

Use the following code to import the CSV file into R (adjust the path to wherever you saved IndiaCensus.csv):

> dataset <- read.csv("IndiaCensus.csv", header=TRUE)
> dataset
   ï..State.UT.Code                    State.UT    Total     Male   Female
1                 1           Jammu and Kashmir  2008670  1080662   927982
2                 2            Himachal Pradesh   763864   400681   363183
3                 3                      Punjab  2941570  1593262  1348308
4                 4                  Chandigarh   117953    63187    54766
5                 5                 Uttarakhand  1328844   704769   624075
6                 6                     Haryana  3297724  1802047  1495677
7                 7                       Delhi  1970510  1055735   914775
8                 8                   Rajasthan 10504916  5580212  4924004
9                 9               Uttar Pradesh 29728235 15653175 14075060
10               10                       Bihar 18582229  9615280  8966949
11               11                      Sikkim    61077    31418    29659
12               12           Arunachal Pradesh   202759   103430    99330
13               13                    Nagaland   285981   147111   138870
14               14                     Manipur   353237   182684   170553
15               15                     Mizoram   165536    83965    81571
16               16                     Tripura   444055   227354   216701
17               17                   Meghalaya   555822   282189   273633
18               18                       Assam  4511307  2305088  2206219
19               19                 West Bengal 10112599  5187264  4925335
20               20                   Jharkhand  5237582  2695921  2541661
21               21                      Odisha  5035650  2603208  2432442
22               22                Chhattisgarh  3584028  1824987  1759041
23               23              Madhya Pradesh 10548295  5516957  5031338
24               24                     Gujarat  7564464  3974286  3519890
25               25               Daman and Diu    25880    13556    12314
26               26      Dadra and Nagar Haveli    49196    25575    23621
27               27                 Maharashtra 12848375  6822262  6026113
28               28              Andhra Pradesh  8642686  4448330  4194356
29               29                   Karnataka  6855801  3527844  3327957
30               30                         Goa   139495    72669    66826
31               31                 Lakshadweep     7088     3715     3373
32               32                      Kerala  3322247  1695889  1626358
33               33                  Tamil Nadu  6894821  3542351  3352470
34               34                  Puducherry   127610    64932    62678
35               35 Andaman and Nicobar Islands    39497    20094    19403
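A side note on the first column name: the "ï.." prefix in ï..State.UT.Code is what a UTF-8 byte order mark (BOM) at the start of the file turns into when R reads it as part of the header. The sketch below is only an illustration under assumptions: the temporary file and its two-line contents are made up for the demo, and only the fileEncoding argument is the point.

```r
# Build a tiny CSV that starts with a UTF-8 byte order mark (EF BB BF).
# The file and its contents are hypothetical, created only for this demo.
csv_path <- tempfile(fileext = ".csv")
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw("State.UT.Code,State.UT,Total\n1,Jammu and Kashmir,2008670\n")
writeBin(c(bom, body), csv_path)

# Telling read.csv() about the BOM strips it before the header is parsed,
# so the first column name comes out clean.
clean <- read.csv(csv_path, header = TRUE, fileEncoding = "UTF-8-BOM")
names(clean)
```

The same argument could be added to the import of IndiaCensus.csv if the stray prefix gets in the way; the rest of this post does not depend on it, because only columns 3 to 5 are used.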

------------
Let us get an overview of the data. dim() returns the number of rows and columns:
> dim(dataset)
[1] 35  5
--------------
We have already looked at the full data set, but we could also have looked at just the first few rows, as shown:


> head(dataset)
  ï..State.UT.Code          State.UT   Total    Male  Female
1                1 Jammu and Kashmir 2008670 1080662  927982
2                2  Himachal Pradesh  763864  400681  363183
3                3            Punjab 2941570 1593262 1348308
4                4        Chandigarh  117953   63187   54766
5                5       Uttarakhand 1328844  704769  624075
6                6           Haryana 3297724 1802047 1495677

-----------------------------
Let us put the last three columns of the data into x and get a statistical summary of those three columns:
> x=dataset[,3:5]
> summary(x)

     Total               Male              Female       
 Min.   :    7088   Min.   :    3715   Min.   :    3373 
 1st Qu.:  184148   1st Qu.:   93698   1st Qu.:   90450 
 Median : 2008670   Median : 1080662   Median :  927982 
 Mean   : 4538846   Mean   : 2370060   Mean   : 2166757 
 3rd Qu.: 6875311   3rd Qu.: 3535098   3rd Qu.: 3340214 
 Max.   :29728235   Max.   :15653175   Max.   :14075060
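Selecting columns by position (3:5) works, but it silently picks the wrong columns if the file layout ever changes. Here is a minimal sketch of selecting by name instead. It assumes a small stand-in data frame built from the six rows printed by head(dataset); the code column is left out, so the positions differ from the real file.

```r
# Six rows copied from the head(dataset) output; a stand-in for the full file.
dataset <- data.frame(
  State.UT = c("Jammu and Kashmir", "Himachal Pradesh", "Punjab",
               "Chandigarh", "Uttarakhand", "Haryana"),
  Total  = c(2008670, 763864, 2941570, 117953, 1328844, 3297724),
  Male   = c(1080662, 400681, 1593262, 63187, 704769, 1802047),
  Female = c(927982, 363183, 1348308, 54766, 624075, 1495677)
)

# Selecting by name keeps working even if the column order changes.
x_by_name <- dataset[, c("Total", "Male", "Female")]

# Positional selection gives the same result here (columns 2 to 4 in this
# stand-in, because it has no code column).
x_by_pos <- dataset[, 2:4]

# One statistic per column, without the full summary() table.
medians <- sapply(x_by_name, median)
medians
```

sapply() applies the function to each column in turn, so the same pattern works for mean, min, max and so on.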

----
This is just a sample of the 'Total' column:
> head(x[1])
    Total
1 2008670
2  763864
3 2941570
4  117953
5 1328844
6 3297724

-----------
Let us BOXPLOT just the "Total" column.
> boxplot(x[1])
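To see the numbers behind the picture, boxplot.stats() computes what boxplot() draws without plotting anything. The sketch below retypes the 35 "Total" values from the table printed earlier into a vector named total (a name chosen here for illustration) so that it runs on its own.

```r
# The 35 values of the Total column, copied from the printed data set.
total <- c(2008670, 763864, 2941570, 117953, 1328844, 3297724, 1970510,
           10504916, 29728235, 18582229, 61077, 202759, 285981, 353237,
           165536, 444055, 555822, 4511307, 10112599, 5237582, 5035650,
           3584028, 10548295, 7564464, 25880, 49196, 12848375, 8642686,
           6855801, 139495, 7088, 3322247, 6894821, 127610, 39497)

# boxplot.stats() returns the statistics used for the plot.
bs <- boxplot.stats(total)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
bs$stats

# Values more than 1.5 times the box height beyond the hinges;
# boxplot() draws these as individual points.
bs$out
```

With this data the two largest values, Uttar Pradesh and Bihar, land in bs$out, which is why they appear as separate points above the upper whisker.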



-----------
A Boxplot can be created for a group of columns as well.
Here is the Boxplot for all three columns, with the values scaled to log10:
> z=log10(x)
> boxplot(z)
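An alternative to transforming the data is to leave the values alone and ask boxplot() for a logarithmic axis with log = "y". This is a sketch under assumptions, not a replacement for the plot above: it uses only the six rows shown by head(dataset), and the axis label is my own wording.

```r
# Six rows copied from the head(dataset) output; a stand-in for the full x.
x <- data.frame(
  Total  = c(2008670, 763864, 2941570, 117953, 1328844, 3297724),
  Male   = c(1080662, 400681, 1593262, 63187, 704769, 1802047),
  Female = c(927982, 363183, 1348308, 54766, 624075, 1495677)
)

# pdf(NULL) sends the drawing to a device that writes no file, so the sketch
# runs anywhere; remove pdf(NULL) and dev.off() to draw on screen instead.
pdf(NULL)
bp <- boxplot(x, log = "y", ylab = "Population (log scale)")
dev.off()

# boxplot() also returns the statistics it plotted, one column per box.
bp$names
bp$stats
```

The two approaches are not identical. With log = "y" the quartiles, whiskers and outliers are computed on the original values and only the axis is stretched, whereas boxplot(log10(x)) computes them on the transformed values, so the whiskers and outlier points can differ. The log axis needs strictly positive values, which holds for this data.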


For an explanation of BOXPLOT, read here.
