Introduction
I wanted a long time to write a blogpost about the basic manipulations of data frames in R. Data frames in R are a fundamental concept and probably the most used feature to store table information.
This is the first in a series about the manipulations about dataframes. First I'll show the script and after that, an outputresult.
Create a data frame
First let's create a data frame for the manipulations.rm(list=ls())
data.frame.x<-data.frame(names=c("Hennie","Robert","John","Jan"),
age=c(30,18,25,70),
Hobby=c("Tennis","Football","Tennis","motorcross"))
data.frame.y<-data.frame("Person_name"=c("Hennie","Robert","Kees","Henry"),
age=c(15,75,32,51),
"Pet"=c("Dog","Cat","Bird","Mouse"))
data.frame.x
data.frame.y
Below the result when the code is executed.
> data.frame.x<-data.frame(names=c("Hennie","Robert","John","Jan"),
+ age=c(30,18,25,70),
+ Hobby=c("Tennis","Football","Tennis","motorcross"))
>
> data.frame.y<-data.frame("Person_name"=c("Hennie","Robert","Kees","Henry"),
+ age=c(15,75,32,51),
+ "Pet"=c("Dog","Cat","Bird","Mouse"))
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> data.frame.y
Person_name age Pet
1 Hennie 15 Dog
2 Robert 75 Cat
3 Kees 32 Bird
4 Henry 51 Mouse
Select a column
There are multiple ways of selecting columns in R.data.frame.x[,1]
data.frame.x$names
Here is the output when the code is executed.
> data.frame.x[,1]
[1] Hennie Robert John Jan
Levels: Hennie Jan John Robert
> data.frame.x$names
[1] Hennie Robert John Jan
Levels: Hennie Jan John Robert
Select multiple columns
It is also possible to select multiple columns of a dataframe.data.frame.x[,1:3]
data.frame.x[,c(1:3)]
data.frame.x[,c(1,3)]
And this results in the following output when executed.
> data.frame.x[,1:3] names age Hobby 1 Hennie 30 Tennis 2 Robert 18 Football 3 John 25 Tennis 4 Jan 70 motorcross > data.frame.x[,c(1:3)] names age Hobby 1 Hennie 30 Tennis 2 Robert 18 Football 3 John 25 Tennis 4 Jan 70 motorcross > data.frame.x[,c(1,3)] names Hobby 1 Hennie Tennis 2 Robert Football 3 John Tennis 4 Jan motorcross
Select a row
The same we can do with rows is selecting a row.data.frame.x[3,]
Resulting in:
> data.frame.x[3,] names age Hobby 3 John 25 Tennis
Selecting multiple rows
Selecting multiple rows can be done in a similar way as selecting multiple columnsdata.frame.x[2:5,]
data.frame.x[c(2:5),]
data.frame.x[c(2:5),]
Resulting in :
> data.frame.x[2:5,] names age Hobby 2 Robert 18 Football 3 John 25 Tennis 4 Jan 70 motorcross NA <NA> NA <NA> > data.frame.x[c(2:5),] names age Hobby 2 Robert 18 Football 3 John 25 Tennis 4 Jan 70 motorcross NA <NA> NA <NA> > data.frame.x[c(2:5),] names age Hobby 2 Robert 18 Football 3 John 25 Tennis 4 Jan 70 motorcross NA <NA> NA <NA>
Notice the NA for the rows that are not found in the data frame
Selecting values in a data frame
It is also possible to select a cell value or to subset a data frame.data.frame.x[1,2]
data.frame.x[c(2:3),c(1,3)]
Resulting in:
> data.frame.x[1,2]
[1] 30
> data.frame.x[c(2:3),c(1,3)]
names Hobby
2 Robert Football
3 John Tennis
Adding rows with rbind()
It is also possible to add rows to a data frame. Now there is a 'problem' with the factors in the dataframe. When R creates a factor, it only allows whatever is originally in the dataframe that was first loaded. Anything new that doesn’t fit into one of these categories is rejected as nonsense (becomes NA).For instance "Jim" was not originally in the data frame and when the following code is executed an error occurs.
data.frame.x
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Football"))
Resulting in:
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
5 <NA> 90 Football
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Football"))
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "Jim") :
invalid factor level, NA generated
An error occurs because Jim was not one of the list (factors), Hennie, Robert, John, Jan and therefore an error happens. NA is inserted for Jim. When we execute the following code with factors that was already originally in the dataframe it will work.
data.frame.x <-rbind(data.frame.x, c("Hennie",90,"Football"))
data.frame.x
No Na in the dataframe.
> data.frame.x <-rbind(data.frame.x, c("Hennie",90,"Football"))
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
5 Hennie 90 Football
But how do we add new values to list of factors? We hae add manually the new categories to the Factors as a new level with the level() function.
levels(data.frame.x$names) <- c(levels(data.frame.x$names), 'Jim')
levels(data.frame.x$Hobby) <- c(levels(data.frame.x$Hobby), 'Bridge')
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
data.frame.x
Here is the executed code.
> levels(data.frame.x$names) <- c(levels(data.frame.x$names), 'Jim')
> levels(data.frame.x$Hobby) <- c(levels(data.frame.x$Hobby), 'Bridge')
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
5 Jim 90 Bridge
Or convert the columns to characters, add the rows and turn them back to factors.
data.frame.x$names <- as.character(data.frame.x$names)
data.frame.x$Hobby <- as.character(data.frame.x$Hobby)
data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
str(data.frame.x)
data.frame.x
data.frame.x$names <- as.factor(data.frame.x$names)
data.frame.x$Hobby <- as.factor(data.frame.x$Hobby)
str(data.frame.x)
The code executed..
> data.frame.x$names <- as.character(data.frame.x$names)
> data.frame.x$Hobby <- as.character(data.frame.x$Hobby)
> data.frame.x <-rbind(data.frame.x, c("Jim",90,"Bridge"))
> str(data.frame.x)
'data.frame': 10 obs. of 3 variables:
$ names: chr "Hennie" "Robert" "John" "Jan" ...
$ age : chr "30" "18" "25" "70" ...
$ Hobby: chr "Tennis" "Football" "Tennis" "motorcross" ...
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
5 Jim 90 Bridge
6 Jim 90 Bridge
7 Jim 90 Bridge
8 Jim 90 Bridge
9 Jim 90 Bridge
10 Jim 90 Bridge
> data.frame.x$names <- as.factor(data.frame.x$names)
> data.frame.x$Hobby <- as.factor(data.frame.x$Hobby)
> str(data.frame.x)
'data.frame': 10 obs. of 3 variables:
$ names: Factor w/ 5 levels "Hennie","Jan",..: 1 5 4 2 3 3 3 3 3 3
$ age : chr "30" "18" "25" "70" ...
$ Hobby: Factor w/ 4 levels "Bridge","Football",..: 4 2 4 3 1 1 1 1 1 1
Adding columns with cbind()
cbind() is a lot easier adding columns than adding rows to a dataframe, because the datatype of the column is only one.data.frame.x
new.column<-c('190', '168','174', '123')
data.frame.x <-cbind(data.frame.x, new.column)
colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
data.frame.x
And this is the executed code
> data.frame.x
names age Hobby height
1 Hennie 30 Tennis 190
2 Robert 18 Football 168
3 John 25 Tennis 174
4 Jan 70 motorcross 123
> new.column<-c('190', '168','174', '123')
> data.frame.x <-cbind(data.frame.x, new.column)
> colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
> data.frame.x
names age Hobby height height
1 Hennie 30 Tennis 190 190
2 Robert 18 Football 168 168
3 John 25 Tennis 174 174
4 Jan 70 motorcross 123 123
Option number 2 is adding an extra column in a following manner.
data.frame.x
data.frame.x$new.column<-c('190', '168','174', '123')
colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
data.frame.x
Resulting in
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> data.frame.x$new.column<-c('190', '168','174', '123')
> colnames(data.frame.x)[colnames(data.frame.x)=="new.column"] <- "height"
> data.frame.x
names age Hobby height
1 Hennie 30 Tennis 190
2 Robert 18 Football 168
3 John 25 Tennis 174
4 Jan 70 motorcross 123
Show first n rows with head()
The head() function is like the TOP keyword in SQL and gives the first n rows of a data frame.data.frame.x
head(data.frame.x,2)
And this is the executed code
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> head(data.frame.x,2)
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
Show bottom n rows with tail()
The tail function is the same as the head() function but then it returns the bottom n rows of the dataframe.data.frame.x tail(data.frame.x,2)
And the results are..
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> tail(data.frame.x,2)
names age Hobby
3 John 25 Tennis
4 Jan 70 motorcross
Getting some columninformation with class()
If information is need about a patriular column the class() function can be usedclass(data.frame.x$age)
Resulting in.
> class(data.frame.x$age)
[1] "numeric"
names()
names() is the name of the function for getting or setting the names of an object.data.frame.x names(data.frame.x)
And this is the executed code.
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> names(data.frame.x)
[1] "names" "age" "Hobby"
dim()
knowing the dimensions of a dataframe can be handy sometimes and this can be executed with the dim functiondata.frame.x
dim(data.frame.x)
The executed code...
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> dim(data.frame.x)
[1] 4 3
Showing some statistial information about an object with Summary()
In case more information is need about the values in an object a function summary() is available.summary(data.frame.x)
>summary(data.frame.x)
names age Hobby
Hennie:1 Min. :18.00 Football :1
Jan :1 1st Qu.:23.25 motorcross:1
John :1 Median :27.50 Tennis :2
Robert:1 Mean :35.75
3rd Qu.:40.00
Max. :70.00
Occurence of every name is 1, the average of age is 35.75 and tennis is counted twice.Showing the structure of an object (data frame) with str()
If more information is needed about an object and in this particular case a data frame you can use the str() function in R.str(data.frame.x)
Resulting in the following output.
> str(data.frame.x)
'data.frame': 4 obs. of 3 variables:
$ names: Factor w/ 4 levels "Hennie","Jan",..: 1 4 3 2
$ age : num 30 18 25 70
$ Hobby: Factor w/ 3 levels "Football","motorcross",..: 3 1 3 2
Factors is a type that is like categorization of the values and used in statistical functions. Age is numerical column.
Count the rows with nrow()
With nrow() it is possible to count the rows in a data framenrow(data.frame.x)
Resulting in
> nrow(data.frame.x) # number of data rows [1] 4
Count the columns with ncol()
Offcourse it is also possible to count the number of columns. this is done with the ncol() functionncol(data.frame.x)
Resulting in
> ncol(data.frame.x)
[1] 3
attributes()
The attributes function acesses the attributes of an object and in this example it is the dataframe.data.frame.x
attributes(data.frame.x)
And this is the result of the executed code.
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> attributes(data.frame.x)
$names
[1] "names" "age" "Hobby"
$row.names
[1] 1 2 3 4
$class
[1] "data.frame"
merge()
Now it is also possible to merge two data frames with the merge function. Now the first merge function in the code below tries to find similar columns and that could be age but there are now common ages and therefore the merge fails. the second one has the columnnames specified and the all=TRUE is a full outer join between the two dataframes.data.frame.x
data.frame.y
merge(data.frame.x, data.frame.y)
merge(data.frame.x, data.frame.y, by.x="names", by.y="Person_name",all=TRUE)
And this is the executed code
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> data.frame.y
Person_name age Pet
1 Hennie 15 Dog
2 Robert 75 Cat
3 Kees 32 Bird
4 Henry 51 Mouse
> merge(data.frame.x, data.frame.y)
[1] age names Hobby Person_name Pet
<0 rows> (or 0-length row.names)
>
> merge(data.frame.x, data.frame.y, by.x="names", by.y="Person_name",all=TRUE)
names age.x Hobby age.y Pet
1 Hennie 30 Tennis 15 Dog
2 Jan 70 motorcross NA <NA>
3 John 25 Tennis NA <NA>
4 Robert 18 Football 75 Cat
5 Henry NA <NA> 51 Mouse
6 Kees NA <NA> 32 Bird
subset()
The subset( ) function is an easy way to select variables and observations. In the following example, we select all rows that have a value of age greater than or equal to 20 or age less then 10 and the columns names and hobbty is returneddata.frame.x subset.frame.x <- subset(data.frame.x, age >= 20 | age < 10, select=c(names, Hobby)) subset.frame.x
And this is the result...
> data.frame.x
names age Hobby
1 Hennie 30 Tennis
2 Robert 18 Football
3 John 25 Tennis
4 Jan 70 motorcross
> subset.frame.x <- subset(data.frame.x, age >= 20 | age < 10,
+ select=c(names, Hobby))
> subset.frame.x
names Hobby
1 Hennie Tennis
3 John Tennis
4 Jan motorcross
Conclusion
In this blogpost I've discovered the different data frame manipulations in R.
Greetz,
Hennie
Geen opmerkingen:
Een reactie posten