zondag 19 maart 2017

R : Working with data and some basic dataframe manipulations

Introduction

I wanted a long time to write a blogpost about the basic manipulations of data frames in R. Data frames in R are a fundamental concept and probably the most used feature to store table information.

This is the first in a series about the manipulations about dataframes. First I'll show the script and after that, an outputresult.

Create a data frame

First let's create a data frame for the manipulations.
rm(list=ls())

data.frame.x<-data.frame(names=c("Hennie","Robert","John","Jan"),
                         age=c(30,18,25,70),
                         Hobby=c("Tennis","Football","Tennis","motorcross"))

data.frame.y<-data.frame("Person_name"=c("Hennie","Robert","Kees","Henry"),
                         age=c(15,75,32,51),
                         "Pet"=c("Dog","Cat","Bird","Mouse"))
data.frame.x
data.frame.y

Below the result when the code is executed.
> data.frame.x<-data.frame(names=c("Hennie","Robert","John","Jan"),
+                          age=c(30,18,25,70),
+                          Hobby=c("Tennis","Football","Tennis","motorcross"))
> 
> data.frame.y<-data.frame("Person_name"=c("Hennie","Robert","Kees","Henry"),
+                          age=c(15,75,32,51),
+                          "Pet"=c("Dog","Cat","Bird","Mouse"))
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.y
  Person_name age   Pet
1      Hennie  15   Dog
2      Robert  75   Cat
3        Kees  32  Bird
4       Henry  51 Mouse

Select a column

There are multiple ways of selecting columns in R.
data.frame.x[,1]

data.frame.x$names       

Here is the output when the code is executed.
> data.frame.x[,1]
[1] Hennie Robert John   Jan   
Levels: Hennie Jan John Robert
> data.frame.x$names
[1] Hennie Robert John   Jan   
Levels: Hennie Jan John Robert  

Select multiple columns

It is also possible to select multiple columns of a dataframe.
data.frame.x[,1:3]
data.frame.x[,c(1:3)]
data.frame.x[,c(1,3)]    

And this results in the following output when executed.
 > data.frame.x[,1:3]
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.x[,c(1:3)]
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.x[,c(1,3)]
   names      Hobby
1 Hennie     Tennis
2 Robert   Football
3   John     Tennis
4    Jan motorcross

Select a row

The same we can do with rows is selecting a row.
data.frame.x[3,] 

Resulting in:
> data.frame.x[3,]
  names age  Hobby
3  John  25 Tennis

Selecting multiple rows

Selecting multiple rows can be done in a similar way as selecting multiple columns
data.frame.x[2:5,]
data.frame.x[c(2:5),]
data.frame.x[c(2:5),]     
 

Resulting in :
> data.frame.x[2:5,]
    names age      Hobby
2  Robert  18   Football
3    John  25     Tennis
4     Jan  70 motorcross
NA   <NA>  NA       <NA>
> data.frame.x[c(2:5),]
    names age      Hobby
2  Robert  18   Football
3    John  25     Tennis
4     Jan  70 motorcross
NA   <NA>  NA       <NA>
> data.frame.x[c(2:5),]
    names age      Hobby
2  Robert  18   Football
3    John  25     Tennis
4     Jan  70 motorcross
NA   <NA>  NA       <NA>

Notice the NA for the rows that are not found in the data frame

Selecting values in a data frame

It is also possible to select a cell value or to subset a data frame.
data.frame.x[1,2]
data.frame.x[c(2:3),c(1,3)]     

Resulting in:
> data.frame.x[1,2]
[1] 30
> data.frame.x[c(2:3),c(1,3)]
   names    Hobby
2 Robert Football
3   John   Tennis

Adding rows with rbind()

It is also possible to add rows to a data frame. Now there is a 'problem' with the factors in the dataframe. When R creates a factor, it only allows whatever is originally in the dataframe that was first loaded. Anything new that doesn’t fit into one of these categories is rejected as nonsense (becomes NA).

For instance "Jim" was not originally in the data frame and when the following code is executed an error occurs.

data.frame.x
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Football"))  

Resulting in:
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
5   <NA>  90   Football
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Football"))
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "Jim") :
  invalid factor level, NA generated

An error occurs because Jim was not one of the list (factors), Hennie, Robert, John, Jan and therefore an error happens. NA is inserted for Jim. When we execute the following code with factors that was already originally in the dataframe it will work.

data.frame.x <-rbind(data.frame.x, c("Hennie",90,"Football"))
data.frame.x     

No Na in the dataframe.
> data.frame.x <-rbind(data.frame.x, c("Hennie",90,"Football"))
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
5 Hennie  90   Football

But how do we add new values to list of factors? We hae add manually the new categories to the Factors as a new level with the level() function.

levels(data.frame.x$names) <- c(levels(data.frame.x$names), 'Jim')
levels(data.frame.x$Hobby) <- c(levels(data.frame.x$Hobby), 'Bridge')
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
data.frame.x

Here is the executed code.
> levels(data.frame.x$names) <- c(levels(data.frame.x$names), 'Jim')
> levels(data.frame.x$Hobby) <- c(levels(data.frame.x$Hobby), 'Bridge')
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
5    Jim  90     Bridge    

Or convert the columns to characters, add the rows and turn them back to factors.
data.frame.x$names <- as.character(data.frame.x$names)
data.frame.x$Hobby <- as.character(data.frame.x$Hobby)
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
str(data.frame.x)
data.frame.x
data.frame.x$names <- as.factor(data.frame.x$names)
data.frame.x$Hobby <- as.factor(data.frame.x$Hobby)
str(data.frame.x)   

The code executed..
> data.frame.x$names <- as.character(data.frame.x$names)
> data.frame.x$Hobby <- as.character(data.frame.x$Hobby)
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
> str(data.frame.x)
'data.frame': 10 obs. of  3 variables:
 $ names: chr  "Hennie" "Robert" "John" "Jan" ...
 $ age  : chr  "30" "18" "25" "70" ...
 $ Hobby: chr  "Tennis" "Football" "Tennis" "motorcross" ...
> data.frame.x
    names age      Hobby
1  Hennie  30     Tennis
2  Robert  18   Football
3    John  25     Tennis
4     Jan  70 motorcross
5     Jim  90     Bridge
6     Jim  90     Bridge
7     Jim  90     Bridge
8     Jim  90     Bridge
9     Jim  90     Bridge
10    Jim  90     Bridge
> data.frame.x$names <- as.factor(data.frame.x$names)
> data.frame.x$Hobby <- as.factor(data.frame.x$Hobby)
> str(data.frame.x)
'data.frame': 10 obs. of  3 variables:
 $ names: Factor w/ 5 levels "Hennie","Jan",..: 1 5 4 2 3 3 3 3 3 3
 $ age  : chr  "30" "18" "25" "70" ...
 $ Hobby: Factor w/ 4 levels "Bridge","Football",..: 4 2 4 3 1 1 1 1 1 1

Adding columns with cbind()

cbind() is a lot easier adding columns than adding rows to a dataframe, because the datatype of the column is only one.

data.frame.x
new.column<-c('190', '168','174', '123')
data.frame.x <-cbind(data.frame.x, new.column)
colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
data.frame.x  

And this is the executed code
> data.frame.x
   names age      Hobby height
1 Hennie  30     Tennis    190
2 Robert  18   Football    168
3   John  25     Tennis    174
4    Jan  70 motorcross    123
> new.column<-c('190', '168','174', '123')
> data.frame.x <-cbind(data.frame.x, new.column)
> colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
> data.frame.x
   names age      Hobby height height
1 Hennie  30     Tennis    190    190
2 Robert  18   Football    168    168
3   John  25     Tennis    174    174
4    Jan  70 motorcross    123    123

Option number 2 is adding an extra column in a following manner.
data.frame.x
data.frame.x$new.column<-c('190', '168','174', '123')
colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
data.frame.x      

Resulting in
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.x$new.column<-c('190', '168','174', '123')
> colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
> data.frame.x
   names age      Hobby height
1 Hennie  30     Tennis    190
2 Robert  18   Football    168
3   John  25     Tennis    174
4    Jan  70 motorcross    123 

Show first n rows with head()

The head() function is like the TOP keyword in SQL and gives the first n rows of a data frame.
data.frame.x
head(data.frame.x,2)      

And this is the executed code
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> head(data.frame.x,2)
   names age    Hobby
1 Hennie  30   Tennis
2 Robert  18 Football

Show bottom n rows with tail()

The tail function is the same as the head() function but then it returns the bottom n rows of the dataframe.
data.frame.x
tail(data.frame.x,2)

And the results are..
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> tail(data.frame.x,2)
  names age      Hobby
3  John  25     Tennis
4   Jan  70 motorcross

Getting some columninformation with class()

If information is need about a patriular column the class() function can be used
class(data.frame.x$age)

Resulting in.
> class(data.frame.x$age)
[1] "numeric"      
 

names()

names() is the name of the function for getting or setting the names of an object.
data.frame.x
names(data.frame.x) 

And this is the executed code.
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> names(data.frame.x)
[1] "names" "age"   "Hobby"  
 

dim()

knowing the dimensions of a dataframe can be handy sometimes and this can be executed with the dim function
data.frame.x
dim(data.frame.x)      

The executed code...
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> dim(data.frame.x)
[1] 4 3

Showing some statistial information about an object with Summary()

In case more information is need about the values in an object a function summary() is available.

summary(data.frame.x)   


>summary(data.frame.x)
    names        age               Hobby  
 Hennie:1   Min.   :18.00   Football  :1  
 Jan   :1   1st Qu.:23.25   motorcross:1  
 John  :1   Median :27.50   Tennis    :2  
 Robert:1   Mean   :35.75                 
            3rd Qu.:40.00                 
            Max.   :70.00 
Occurence of every name is 1, the average of age is 35.75 and tennis is counted twice.

Showing the structure of an object (data frame) with str()

If more information is needed about an object and in this particular case a data frame you can use the str() function in R.
str(data.frame.x) 

Resulting in the following output.
> str(data.frame.x)
'data.frame': 4 obs. of  3 variables:
 $ names: Factor w/ 4 levels "Hennie","Jan",..: 1 4 3 2
 $ age  : num  30 18 25 70
 $ Hobby: Factor w/ 3 levels "Football","motorcross",..: 3 1 3 2
 

Factors is a type that is like categorization of the values and used in statistical functions. Age is numerical column.

Count the rows with nrow()

With nrow() it is possible to count the rows in a data frame
nrow(data.frame.x)     

Resulting in
> nrow(data.frame.x)    # number of data rows
[1] 4

Count the columns with ncol()

Offcourse it is also possible to count the number of columns. this is done with the ncol() function
ncol(data.frame.x)     

Resulting in
> ncol(data.frame.x)
[1] 3

attributes()

The attributes function acesses the attributes of an object and in this example it is the dataframe.
data.frame.x
attributes(data.frame.x)    

And this is the result of the executed code.
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> attributes(data.frame.x)
$names
[1] "names" "age"   "Hobby"

$row.names
[1] 1 2 3 4

$class
[1] "data.frame"   
 

merge()

Now it is also possible to merge two data frames with the merge function. Now the first merge function in the code below tries to find similar columns and that could be age but there are now common ages and therefore the merge fails. the second one has the columnnames specified  and the all=TRUE is a full outer join between the two dataframes.

data.frame.x
data.frame.y
merge(data.frame.x, data.frame.y)
merge(data.frame.x, data.frame.y, by.x="names", by.y="Person_name",all=TRUE)     

And this is the executed code
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> data.frame.y
  Person_name age   Pet
1      Hennie  15   Dog
2      Robert  75   Cat
3        Kees  32  Bird
4       Henry  51 Mouse
> merge(data.frame.x, data.frame.y)
[1] age         names       Hobby       Person_name Pet        
<0 rows> (or 0-length row.names)
> 
> merge(data.frame.x, data.frame.y, by.x="names", by.y="Person_name",all=TRUE)
   names age.x      Hobby age.y   Pet
1 Hennie    30     Tennis    15   Dog
2    Jan    70 motorcross    NA  <NA>
3   John    25     Tennis    NA  <NA>
4 Robert    18   Football    75   Cat
5  Henry    NA       <NA>    51 Mouse
6   Kees    NA       <NA>    32  Bird 
 

subset()

The subset( ) function is an easy way to select variables and observations. In the following example, we select all rows that have a value of age greater than or equal to 20 or age less then 10 and the columns names and hobbty is returned

data.frame.x
subset.frame.x <- subset(data.frame.x, age >= 20 | age < 10, 
                  select=c(names, Hobby))
subset.frame.x

And this is the result...
> data.frame.x
   names age      Hobby
1 Hennie  30     Tennis
2 Robert  18   Football
3   John  25     Tennis
4    Jan  70 motorcross
> subset.frame.x <- subset(data.frame.x, age >= 20 | age < 10, 
+                   select=c(names, Hobby))
> subset.frame.x
   names      Hobby
1 Hennie     Tennis
3   John     Tennis
4    Jan motorcross  
 

Conclusion

In this blogpost I've discovered the different data frame manipulations in R. 

Greetz,
Hennie