20150916 [Coursera] R Programming (4)
* Compiled from R Programming (Week 1) -- Reading Data

[ Week 1 course contents ]
* Introduction (skipped)
* Overview and History of R [16:07] (done)
* Getting Help [13:53] (skipped)
* Console Input and Evaluation [4:46] (done)
* Data Types - R Objects and Attributes [4:43] (done)
* Data Types - Vectors and Lists [6:27] (done)
* Data Types - Matrices [3:24] (done)
* Data Types - Factors [4:31] (done)
* Data Types - Missing Values [2:10] (done)
* Data Types - Data Frames [2:44] (done)
* Data Types - Names Attribute [1:49] (done)
* Data Types - Summary [0:43] (done)
* Reading Tabular Data [5:51]
* Reading Large Tables [7:08]
* Textual Data Formats [4:58]
* Connections: Interfaces to the Outside World [4:35]
* Subsetting - Basics
* Subsetting - Lists
* Subsetting - Matrices
* Subsetting - Partial Matching
* Subsetting - Removing Missing Values
* Vectorized Operations [3:46]
* Introduction to swirl

[ Contents of these notes ] Reading Data
* (I) Reading Tabular Data
* (II) Reading Large Tables
* (III) Textual Data Formats
* (IV) Connections: Interfaces to the Outside World

* (I) Reading Tabular Data

[ References ]
* Lecture slides: https://d396qusza40orc.cloudfront.net/rprog/lecture_slides/reading_tables.pdf

[ Key points ]
1. Functions for reading data
* Tabular data (the most common case)
  * read.table()
  * read.csv()
* Text files (read line by line, e.g. .txt files)
  * readLines()
* R code (e.g. .R files)
  * source()
  * dget()
* Binary data
  * load()
  * unserialize()

2. Functions for writing data
* Tabular data (the most common case)
  * write.table()
* Text files (written line by line, e.g. .txt files)
  * writeLines()
* R code (e.g. .R files)
  * dump()
  * dput()
* Binary data
  * save()
  * serialize()

3. read.table()
* Arguments
  * file: the name of a file or a connection
  * header: logical flag indicating whether the first line of the file is a header
  * sep: the separator between columns in the file
  * colClasses: character vector giving the class of each column in the table
  * nrows: the number of rows in the table
  * comment.char: the character that marks a comment in the file; everything after it on a line is ignored
  * skip: the number of lines to skip from the start of the file (used to jump over non-data lines)
  * stringsAsFactors: whether character variables are encoded as factors; the default was TRUE when these notes were written and has been FALSE since R 4.0.0
* Usage
  * data <- read.table("foo.txt")
    By default, no argument other than the file name is needed.

* (II) Reading Large Tables

[ References ]
* Lecture slides: https://d396qusza40orc.cloudfront.net/rprog/lecture_slides/large_tables.pdf

[ Key points ]
1. How can reading large datasets be made more efficient?
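One common way to approach this question is to let R infer the column classes from a small sample and then reuse them for the full read. The following is only a minimal sketch under stated assumptions: the temporary file, the `example` data frame, and the row counts are all made up for illustration and are not part of the course material.

```r
# Hypothetical example data, written to a temporary file so the sketch is self-contained.
tmp <- tempfile(fileext = ".txt")
example <- data.frame(id = 1:1000,
                      value = seq(0.5, 500, by = 0.5),
                      label = rep(c("a", "b"), 500),
                      stringsAsFactors = FALSE)
write.table(example, tmp, row.names = FALSE)

# Step 1: read only the first 100 rows and let R work out each column's class.
initial <- read.table(tmp, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Step 2: read the whole file, passing the classes so R does not have to guess,
# switching off comment scanning, and giving a mild over-estimate of the row count.
full <- read.table(tmp, header = TRUE, colClasses = classes,
                   nrows = 1100, comment.char = "")

# Remove the temporary file.
unlink(tmp)
```

The sample has to be representative: if a numeric column happened to contain only whole numbers in its first 100 rows, it would be inferred as integer and the full read could fail on later decimal values.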
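>> read.table()
* Read the R help page for read.table() carefully
* Estimate how much memory is needed to store the dataset
* If the file contains no comments, set comment.char = "" (an empty string)
* Use the colClasses argument to tell R the class of each column; this can reduce the run time
* Set the nrows argument to the number of rows to be read; this helps reduce memory usage

2. Know your system
* How much memory is available?
* Which other applications are running?
* Are other users logged in to the same system?
* Which operating system is it?
* Is the operating system 32-bit or 64-bit?

3. Calculating memory requirements
* [Example] A data frame has 1,500,000 rows and 120 columns, all of which are numeric. Roughly how much memory is needed to store it?
* Answer:
  * 1,500,000 × 120 × 8 bytes/numeric
  * = 1,440,000,000 bytes
  * = 1,440,000,000 / 2^20 bytes/MB
  * = 1,373.29 MB
  * = 1.34 GB

* (III) Textual Data Formats

[ References ]
* Lecture slides: https://d396qusza40orc.cloudfront.net/rprog/lecture_slides/textual.pdf

[ Key points ]
1. Textual formats in R
* Common commands:
  * dump: writes one or more R objects to a file as R code (read back with source())
  * dput: writes a single R object as R code (read back with dget())

2. Writing a single object and reading it back << dput, dget
  > y <- data.frame(a = 1, b = "a")
  > dput(y)
  structure(list(a = 1,
                 b = structure(1L, .Label = "a",
                               class = "factor")),
            .Names = c("a", "b"), row.names = c(NA, -1L),
            class = "data.frame")
  > dput(y, file = "y.R")
  > new.y <- dget("y.R")
  > new.y
    a b
  1 1 a

3. Writing several objects and reading them back << dump, source
  > x <- "foo"
  > y <- data.frame(a = 1, b = "a")
  > dump(c("x", "y"), file = "data.R")
  > rm(x, y)
  > source("data.R")
  > y
    a b
  1 1 a
  > x
  [1] "foo"

* (IV) Connections: Interfaces to the Outside World

[ References ]
* Lecture slides: https://d396qusza40orc.cloudfront.net/rprog/lecture_slides/connections.pdf

[ Key points ]
1. Interfaces through which R interacts with the outside world
* file: opens a connection to an (uncompressed) file
* gzfile: opens a connection to a file compressed with gzip (.gz files)
* bzfile: opens a connection to a file compressed with bzip2 (.bz2 files)
* url: opens a connection to a web page

2. File connections
  > str(file)
  function (description = "", open = "", blocking = TRUE,
            encoding = getOption("encoding"))
(1) description: the name of the file
(2) open: the mode in which the file is opened
  * "r" read
  * "w" write
  * "a" append
  * "rb" read (binary)
  * "wb" write (binary)
  * "ab" append (binary)
* [Example]
  con <- file("foo.txt", "r")
  data <- read.csv(con)
  close(con)
* This is equivalent to
  data <- read.csv("foo.txt")

3.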
Reading files or websites << readLines(), url()

(1) Reading a file -- readLines()
  > con <- gzfile("words.gz")
  > x <- readLines(con, 10)   # read the first 10 lines of the file with readLines()
  > x
   [1] "1080"     "10-point" "10th"     "11-point"
   [5] "12-point" "16-point" "18-point" "1st"
   [9] "2"        "20-point"

(2) Reading a website -- url(), readLines()
  ## This might take time
  con <- url("http://www.jhsph.edu", "r")   # open a connection to the website with url()
  x <- readLines(con)                       # read the lines of the page with readLines()
  > head(x)
  [1] ""
  [2] ""
  [3] ""
  [4] ""
  [5] "\t