Load .csv / .csv.gz file from a remote connection
This is a thin wrapper around data.table::fread.
Arguments passed on to
A single character string. The value is inspected and deferred to either file= (if no \n present), text= (if at least one \n is present) or cmd= (if no \n is present, at least one space is present, and it isn't a file name). Exactly one of
cmd= should be used in the same call.
File name in working directory, path to file (passed through path.expand for convenience), or a URL starting http://, file://, etc. Compressed files with extension .bz2 are supported if the R.utils package is installed.
The input data itself as a character vector of one or more lines, for example as returned by
A shell command that pre-processes the file; e.g.
fread(cmd=paste("grep",word,"filename")). See Details.
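To illustrate how the input forms relate, here is a minimal sketch that reads the same small table once through text= and once through file=. The sample data and the temporary file are invented for this example; cmd= is left out because it depends on the shell tools available on the machine.

```r
library(data.table)

# Inline data: the string contains "\n", so it is treated as text.
csv_text <- "id,name\n1,alpha\n2,beta\n"
dt_text <- fread(text = csv_text)

# The same rows written to a temporary file and read back via file=.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name", "1,alpha", "2,beta"), tmp)
dt_file <- fread(file = tmp)
unlink(tmp)
```

Both calls should produce the same two-row table, which is a quick way to check that a file on disk and an in-memory string are parsed alike.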
The separator between columns. Defaults to the character in the set [,\t |;:] that separates the sample of rows into the largest number of lines with the same number of fields. Use
"" to specify no separator; i.e. each line a single character column like
The separator within columns. A list column will be returned where each cell is a vector of values. This is much faster using less working memory than strsplit afterwards or similar techniques. For each column sep2 can be different and is the first character in the same set above [,\t |;], other than sep, that exists inside each field outside quoted regions in the sample. NB: sep2 is not yet implemented.
The maximum number of rows to read. Unlike read.table, you do not need to set this to an estimate of the number of rows in the file for better speed because that is already automatically determined by fread almost instantly using the large sample of lines. nrows=0 returns the column names and typed empty columns determined by the large sample; useful for a dry run of a large file or to quickly check format consistency of a set of files before starting to read any of them.
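The dry-run use of nrows=0 could look like the sketch below. The inline data is made up for illustration; in practice the input would be a large file or URL.

```r
library(data.table)

# Hypothetical sample input standing in for a large file.
csv_text <- "id,score,label\n1,0.5,a\n2,1.5,b\n"

# nrows = 0 reads no data rows, only the column names and detected types.
schema <- fread(text = csv_text, nrows = 0)

print(names(schema))
print(sapply(schema, class))
```

Comparing names(schema) across a set of files is one inexpensive way to check that they share a layout before reading any of them in full.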
Does the first data line contain column names? Defaults according to whether every non-empty field on the first data line is type character. If so, or TRUE is supplied, any empty column names are given a default name.
A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character is read as
,"", is unambiguous and read as an empty string. To read
na.strings="NA". To read ,, as blank string
na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when
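The quoting rule can be sketched with a small invented input: an unquoted NA field is expected to be read as missing, while a quoted "NA" is kept as a literal string.

```r
library(data.table)

# Row 1 has an unquoted NA, row 2 a quoted "NA", row 3 an ordinary value.
csv_text <- "id,label\n1,NA\n2,\"NA\"\n3,foo\n"
dt <- fread(text = csv_text, na.strings = "NA")

# Which entries of the label column were read as missing?
print(is.na(dt$label))
```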
Convert all character columns to factors?
Be chatty and report timings?
If 0 (default), starts on the first line and from there finds the first row with a consistent number of columns. This automatically avoids irregular header information before the column names row. skip>0 means ignore the first
"string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).
A vector of column names or numbers to keep, drop the rest. select may specify types too in the same way as colClasses; i.e., a vector of colname=type pairs, or a
type=col(s) pairs. In all forms of select, the order that the columns are specified determines the order of the columns in the result.
Vector of column names or numbers to drop, keep the rest.
utils::read.csv; i.e., an unnamed vector of types corresponding to the columns in the file, or a named vector specifying types for a subset of the columns by name. The default, NULL, means types are inferred from the data in the file. Further, data.table supports a named list of vectors of column names or numbers where the list names are the class names; see examples. The list form makes it easier to set a batch of columns to be a particular class. When column numbers are used in the list form, they refer to the column number in the file, not the column number after drop has been applied. If type coercion results in an error, introduces NAs, or would result in loss of accuracy, the coercion attempt is aborted for that column with a warning and the column's type is left unchanged. If you really desire data loss (e.g. reading
integer) you have to truncate such columns afterwards yourself explicitly so that this is clear to future readers of your code.
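A brief sketch of select ordering and the list form of colClasses, using invented column names and values:

```r
library(data.table)

csv_text <- "id,price,code\n1,2.5,007\n2,3.5,042\n"

# select: keep two columns; the result follows the order given here.
dt_sel <- fread(text = csv_text, select = c("code", "id"))

# colClasses in list form: the list names are classes, the values are columns.
# Reading "code" as character preserves its leading zeros.
dt_cls <- fread(text = csv_text, colClasses = list(character = "code"))
```

The list form is convenient when several columns share a class, for example list(character = c("code", "zip")), where "zip" is a hypothetical second column.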
"integer64" (default) reads columns detected as containing integers larger than 2^31 as type
utils::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".
The decimal separator as in utils::read.csv. If not "." (default) then usually ",". See details.
A vector of optional names for the variables (columns). The default is to use the header column if present or detected, or if not "V" followed by the column number. This is applied after
TRUE then the names of the variables in the data.table are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.
"unknown". Other possible options are
"Latin-1". Note: it is not used to re-encode the input, rather enables handling of encoded strings in their native encoding.
By default ("\""), if a field starts with a double quote, fread handles embedded quotes robustly as explained under Details. If it fails, then another attempt is made to read the field as is, i.e., as if quotes are disabled. By setting quote="", the field is always read as if quotes are disabled. It is not expected to ever need to pass anything other than "" to quote; i.e., to turn it off.
TRUE. Strips leading and trailing whitespace from unquoted fields. If FALSE, only header trailing spaces are removed.
logical (default is
TRUE then in case the rows have unequal length, blank fields are implicitly filled.
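A minimal sketch of filling, assuming a made-up input whose first data row is one field short:

```r
library(data.table)

# The header has three columns, but the first data row has only two fields.
csv_text <- "a,b,c\n1,2\n3,4,5\n"

# With fill = TRUE the missing trailing field is filled in as NA.
dt <- fread(text = csv_text, fill = TRUE)
print(dt)
```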
logical, default is
TRUE blank lines in the input are ignored.
Character vector of one or more column names which is passed to setkey. It may be a single comma separated string such as key="x,y,z", or a vector of names such as key=c("x","y","z"). Only valid when argument data.table=TRUE. Where applicable, this should refer to column names given in
Character vector or list of character vectors of one or more column names which is passed to setindexv. As with key, comma-separated notation like index="x,y,z" is accepted for convenience. Only valid when argument data.table=TRUE. Where applicable, this should refer to column names given in
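As a rough illustration of key and index, the sketch below reads a small invented table twice, once keyed and once indexed. It assumes the default data.table=TRUE.

```r
library(data.table)

csv_text <- "id,grp\n3,b\n1,a\n2,a\n"

# key = "id" passes the column to setkey, which sorts the result by id.
dt_key <- fread(text = csv_text, key = "id")
print(key(dt_key))

# index = "grp" passes the column to setindexv; row order is left unchanged.
dt_idx <- fread(text = csv_text, index = "grp")
print(indices(dt_idx))
```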
TRUE displays progress on the console if the ETA is greater than 3 seconds. It is produced in fread's C code where the very nice (but R level) txtProgressBar and tkProgressBar are not easily available.
TRUE returns a data.table. FALSE returns a data.frame. The default for this argument can be changed with
The number of threads to use. Experiment to see what works best for your data on your hardware.
If TRUE, a column containing only 0s and 1s will be read as logical, otherwise as integer.
If TRUE, a column containing numeric data with leading zeros will be read as character, otherwise leading zeros will be removed and converted to numeric.
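The effect of these two switches can be sketched on a small invented table with a 0/1 column and a zero-padded code column. Both arguments are set explicitly so the result does not depend on session options.

```r
library(data.table)

csv_text <- "flag,zip\n0,01234\n1,00501\n"

# Both switches off: flag is read as integer and zip loses its leading zeros.
dt_off <- fread(text = csv_text, logical01 = FALSE, keepLeadingZeros = FALSE)

# Both switches on: flag is read as logical and zip stays character.
dt_on <- fread(text = csv_text, logical01 = TRUE, keepLeadingZeros = TRUE)

print(sapply(dt_off, class))
print(sapply(dt_on, class))
```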
fread will attempt to parse (using yaml.load) the top of the input as YAML, and further to glean parameters relevant to improving the performance of fread on the data itself. The entire YAML section is returned as parsed into a
Deprecated and ignored with warning. Please use
Directory to use as the tmpdir argument for any tempfile calls, e.g. when the input is a URL or a shell command. The default is tempdir() which can be controlled by setting TMPDIR before starting the R session; see
Relevant to datetime values which have no Z or UTC-offset at the end, i.e. unmarked datetime, as written by utils::write.csv. The default tz="UTC" reads unmarked datetime as UTC POSIXct efficiently. tz="" reads unmarked datetime as type character (slowly) so that as.POSIXct can interpret (slowly) the character datetimes in local timezone; e.g. by using colClasses=. Note that fwrite() by default writes datetime in UTC including the final Z and therefore fwrite's output will be read by fread consistently and quickly without needing to use colClasses=. If the TZ environment variable is set to
"" on non-Windows where unset vs `""` is significant) then the R session's timezone is already UTC and tz="" will result in unmarked datetimes being read as UTC POSIXct. For more information, please see the news items from v1.13.0 and v1.14.0.
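A small sketch of reading unmarked datetimes with tz="UTC", assuming a data.table version that has the tz argument (the news items mentioned above are for v1.13.0 and v1.14.0). The timestamps are invented.

```r
library(data.table)

# Unmarked datetimes: no trailing Z and no UTC offset.
csv_text <- "id,when\n1,2020-01-01 12:00:00\n2,2020-06-15 08:30:00\n"

# tz = "UTC" reads them directly as POSIXct in UTC.
dt <- fread(text = csv_text, tz = "UTC")

print(class(dt$when))
print(format(dt$when, tz = "UTC", format = "%Y-%m-%d %H:%M:%S"))
```

Passing tz explicitly keeps the example independent of which default the installed version uses.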
a dataframe as created by