Introducing Julia/Working with text files


« Introducing Julia
Working with text files
»
Strings and characters Working with dates and times

从文件中读取 编辑

从文本文件获取信息的标准方法是使用 open(), read()close() 函数.

打开 编辑

要从文件中读取文本,请首先获取文件的 handle:

f = open("sherlock-holmes.txt")

f 现在是 Julia 与磁盘上的文件的连接。完成文件处理后,应使用以下命令关闭连接:

close(f)

通常,在 Julia 中处理文件的推荐方法是将任何文件处理函数包装在 do 代码块中:

open("sherlock-holmes.txt") do file
    # do stuff with the open file
end

当此代码块完成时,打开的文件将自动关闭。更多有关 do block 内容,见 Controlling the flow

由于块中局部变量的范围,您可能希望保留一些已处理的信息:

totaltime, totallines = open("sherlock-holmes.txt") do f
    linecounter = 0
    timetaken = @elapsed for l in eachline(f)
        linecounter += 1
    end
    (timetaken, linecounter)
end
julia> totaltime, totallines
(0.004484679, 76803)

一次性读取整个文件 ! 编辑

可以使用 read()一次性读取打开文件的全部内容:

julia> s = read(f, String)

这会将文件的内容存储在 s 中:

s = open("sherlock-holmes.txt") do file
    read(file, String)
end

你可以使用 readlines() 来以数组的形式读取整个文件,并且每行都有一个元素,请执行以下操作:

julia> f = open("sherlock-holmes.txt");

julia> lines = readlines(f)
76803-element Array{String,1}:
"THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE\r\n"
"\r\n"
"   I. A Scandal in Bohemia\r\n"
"  II. The Red-headed League\r\n"
...
"Holmes, rather to my disappointment, manifested no further\r\n"
"interest in her when once she had ceased to be the centre of one\r\n"
"of his problems, and she is now the head of a private school at\r\n"
"Walsall, where I believe that she has met with considerable success.\r\n"
julia> close(f)

现在,您可以对每行执行单步操作:

counter = 1
for l in lines
   println("$counter $l")
   counter += 1
end
1 THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE
2
3    I. A Scandal in Bohemia
4   II. The Red-headed League
5  III. A Case of Identity
6   IV. The Boscombe Valley Mystery
...
12638 interest in her when once she had ceased to be the centre of one
12639 of his problems, and she is now the head of a private school at
12640 Walsall, where I believe that she has met with considerable success.

有一种更好的方法可以做到这一点-参见下面的 enumerate()

您可能会发现 chomp() 函数很有用-它从字符串中删除后面的换行符。


一行行读取 编辑

The eachline() function turns a source into an iterator. This allows you to process a file a line at a time:

open("sherlock-holmes.txt") do file
    for ln in eachline(file)
        println("$(length(ln)), $(ln)")
    end
end
1, THE ADVENTURES OF SHERLOCK HOLMES by SIR ARTHUR CONAN DOYLE
2,
28,    I. A Scandal in Bohemia
29,   II. The Red-headed League
26,  III. A Case of Identity
35,   IV. The Boscombe Valley Mystery
…
62, the island of Mauritius. As to Miss Violet Hunter, my friend
60, Holmes, rather to my disappointment, manifested no further
66, interest in her when once she had ceased to be the centre of one
65, of his problems, and she is now the head of a private school at
70, Walsall, where I believe that she has met with considerable success.

另一种方法是一直读到文件的末尾。您可能需要跟踪您所在的行:

 open("sherlock-holmes.txt") do f
   line = 1
   while !eof(f)
     x = readline(f)
     println("$line $x")
     line += 1
   end
 end

一种更好的方法是在可迭代对象上使用 enumerate() -您将额外得到一个的行号:

open("sherlock-holmes.txt") do f
    for i in enumerate(eachline(f))
      println(i[1], ": ", i[2])
    end
end

如果您有一个特定的函数要对文件调用,则可以使用以下语法:

function shout(f::IOStream)
    return uppercase(read(f, String))
end
julia> shoutversion = open(shout, "sherlock-holmes.txt");
julia> shoutversion[30237:30400]
"ELEMENTARY PROBLEMS. LET HIM, ON MEETING A\nFELLOW-MORTAL, LEARN AT A GLANCE TO DISTINGUISH THE HISTORY OF THE\nMAN, AND THE TRADE OR  PROFESSION TO WHICH HE BELONGS. "

这将打开该文件,对其运行 shout()函数,然后再次关闭该文件,并将处理后的内容分配给该变量。

You can use the DelimitedFiles.readdlm() function to read lines delimited with certain characters, such as data files, arrays stored as text files, and tables. If you use the DataFrames package, there's also a readtable() specifically designed to read data into a table. See also the package CSV.jl.

对路径和文件名进行操作 编辑

These functions will be useful for working with filenames:

  • cd(path) changes the current directory
  • readdir(path) returns a lists of the contents of a named directory, or the current directory,
  • abspath(path) adds the current directory's path to a filename to make an absolute pathname
  • joinpath(str, str, ...) assembles a pathname from pieces
  • isdir(path) tells you whether the path is a directory
  • splitdir(path) - split a path into a tuple of the directory name and file name.
  • splitdrive(path) - on Windows, split a path into the drive letter part and the path part. On Unix systems, the first component is always the empty string.
  • splitext(path) - if the last component of a path contains a dot, split the path into everything before the dot and everything including and after the dot. Otherwise, return a tuple of the argument unmodified and the empty string.
  • expanduser(path) - replace a tilde character at the start of a path with the current user's home directory.
  • normpath(path) - normalize a path, removing "." and ".." entries.
  • realpath(path) - canonicalize a path by expanding symbolic links and removing "." and ".." entries.
  • homedir() - current user's home directory.
  • dirname(path) - get the directory part of a path.
  • basename(path)- get the file name part of a path.

To work on a restricted selection of files in a directory, use filter() and an anonymous function to filter the file names and just keep the ones you want. (filter() is more of a fishing net or sieve, rather than a coffee filter, in that it catches what you want to keep.)

for f in filter(x -> endswith(x, "jl"), readdir())
    println(f)
end

Astro.jl
calendar.jl
constants.jl
coordinates.jl
...
pseudoscience.jl
riseset.jl
sidereal.jl
sun.jl
utils.jl
vsop87d.jl

If you want to match a group of files using a regular expression, then use occursin(). Let's look for files with ".jpg" or ".png" suffixes (remembering to escape the "."):

for f in filter(x -> occursin(r"(?i)\.jpg|\.png", x), readdir())
    println(f)
end
034571172750.jpg
034571172750.png
51ZN2sCNfVL._SS400_.jpg
51bU7lucOJL._SL500_AA300_.jpg
Voronoy.jpg
kblue.png
korange.png
penrose.jpg
r-home-id-r4.png
wave.jpg

To examine a file hierarchy, use walkdir(), which lets you work through a directory, and examine the files in each directory in turn.

文件信息 编辑

If you want information about a specific file, use stat("pathname"), and then use one of the fields to find out the information. Here's how to get all the information and the field names listed for a file "i":

 for n in fieldnames(typeof(stat("i")))
    println(n, ": ", getfield(stat("i"),n))
end
device: 16777219
inode: 2955324
mode: 16877
nlink: 943
uid: 502
gid: 20
rdev: 0
size: 32062
blksize: 4096
blocks: 0
mtime:1.409769933e9
ctime:1.409769933e9

You can access these fields via a 'stat' structure:

julia> s = stat("Untitled1.ipynb")
StatStruct(mode=100644, size=64424)
julia> s.ctime
1.446649269e9

and you can also use some of them directly:

julia> ctime("Untitled2.ipynb")
1.446649269e9

although not size:

julia> s.size
64424

To work on specific files that meet conditions — all IPython files modified after a certain date, for example — you could use something like this:

using Dates
function output_file(path)
    println(stat(path).size, ": ", path)
end 

for afile in filter!(f -> endswith(f, "ipynb") && (mtime(f) > Dates.datetime2unix(DateTime("2015-11-03T09:00"))),
    readdir())
    output_file(realpath(afile))
end

与文件系统交互 编辑

The cp(), mv(), rm(), and touch() functions have the same names and functions as their Unix shell counterparts.

To convert filenames to pathnames, use abspath(). You can map this over a list of files in a directory:

julia> map(abspath, readdir())
67-element Array{String,1}:
"/Users/me/.CFUserTextEncoding"
"/Users/me/.DS_Store"
"/Users/me/.Trash"
"/Users/me/.Xauthority"
"/Users/me/.ahbbighrc"
"/Users/me/.apdisk"
"/Users/me/.atom"
...

To restrict the list to filenames that contain a particular substring, use an anonymous function inside filter() — something like this:

julia> filter(x -> occursin("re", x), map(abspath, readdir()))
4-element Array{String,1}:
"/Users/me/.DS_Store"
"/Users/me/.gitignore"
"/Users/me/.hgignore_global"
"/Users/me/Pictures"
...

To restrict the list to regular expression matches, try this:

julia> filter(x -> occursin(r"recur.*\.jl", x), map(abspath, readdir()))
2-element Array{String,1}:
 "/Users/me/julia/recursive-directory-scan.jl"
 "/Users/me/julia/recursive-text.jl"

写入文件 编辑

To write to a text file, open it using the "w" flag and make sure that you have permission to create the file in the specified directory:

open("/tmp/t.txt", "w") do f
    write(f, "A, B, C, D\n")
end

Here's how to write 20 lines of 4 random numbers between 1 and 10, separated by commas:

function fourrandom()
    return rand(1:10,4)
end

open("/tmp/t.txt", "w") do f
           for i in 1:20
              n1, n2, n3, n4 = fourrandom()
              write(f, "$n1, $n2, $n3, $n4 \n")
           end
       end

A quicker alternative to this is to use the DelimitedFiles.writedlm() function, described next:

using DelimitedFiles
writedlm("/tmp/test.txt", rand(1:10, 20, 4), ", ")


在文件中写入和读取数组 编辑

In the DelimitedFiles package are two convenient functions, writedlm() and readdlm(). These let you read/write an array or collection from/to a file.

writedlm() writes the contents of an object to a text file, and readdlm() reads the data from a file into an array:

julia> numbers = rand(5,5)
5x5 Array{Float64,2}:
0.913583  0.312291  0.0855798  0.0592331  0.371789
0.13747   0.422435  0.295057   0.736044   0.763928
0.360894  0.434373  0.870768   0.469624   0.268495
0.620462  0.456771  0.258094   0.646355   0.275826
0.497492  0.854383  0.171938   0.870345   0.783558

julia> writedlm("/tmp/test.txt", numbers)

You can see the file using the shell (type a semicolon ";" to switch):

<shell> cat "/tmp/test.txt"
.9135833328830523	.3122905420350348	.08557977218948465	.0592330821115965	.3717889559226475
.13747015238054083	.42243494637594203	.29505701073304524	.7360443978397753	.7639280496847236
.36089432672073607	.43437288984307787	.870767989032692	.4696243851552686	.26849468736154325
.6204624598015906	.4567706404666232	.25809436255988105	.6463554854347682	.27582613759302377
.4974916625466639	.8543829989347014	.17193814498701587	.8703447748713236	.783557793485824

The elements are separated by tabs unless you specify another delimiter. Here, a colon is used to delimit the numbers:

julia> writedlm("/tmp/test.txt", rand(1:6, 10, 10), ":")
shell> cat "/tmp/test.txt"
3:3:3:2:3:2:6:2:3:5
3:1:2:1:5:6:6:1:3:6
5:2:3:1:4:4:4:3:4:1
3:2:1:3:3:1:1:1:5:6
4:2:4:4:4:2:3:5:1:6
6:6:4:1:6:6:3:4:5:4
2:1:3:1:4:1:5:4:6:6
4:4:6:4:6:6:1:4:2:3
1:4:4:1:1:1:5:6:5:6
2:4:4:3:6:6:1:1:5:5

To read in data from a text file, you can use readdlm().

julia> numbers = rand(5,5)
5x5 Array{Float64,2}:
0.862955  0.00827944  0.811526  0.854526  0.747977
0.661742  0.535057    0.186404  0.592903  0.758013
0.800939  0.949748    0.86552   0.113001  0.0849006
0.691113  0.0184901   0.170052  0.421047  0.374274
0.536154  0.48647     0.926233  0.683502  0.116988
julia> writedlm("/tmp/test.txt", numbers)

julia> numbers = readdlm("/tmp/test.txt")
5x5 Array{Float64,2}:
0.862955  0.00827944  0.811526  0.854526  0.747977
0.661742  0.535057    0.186404  0.592903  0.758013
0.800939  0.949748    0.86552   0.113001  0.0849006
0.691113  0.0184901   0.170052  0.421047  0.374274
0.536154  0.48647     0.926233  0.683502  0.116988

There are also a number of Julia packages specifically designed for reading and writing data to files, including DataFrames.jl and CSV.jl. Look through the Julia package directory for these and more.

« Introducing Julia
Working with text files
»
Strings and characters Working with dates and times