In OCaml-4.02, the type string can be made immutable by a compiler switch. Although this won't be the default yet, it should be seen as the announcement of a quite disruptive change in the language: eventually, immutable strings will be the default in a future version. In this article I explain why I disagree with this particular plan, and which modifications would be better.
Of course, the fact that
string is mutable doesn't fit
well into a functional language. Nevertheless, it has been seen as
acceptable for a long time, probably because the developers of OCaml
did not pay much attention to strings, and felt that the benefits of a
somewhat cleaner concept wouldn't outweigh the practical disadvantages
of immutable strings. Apparently, this attitude changed, and we will
see a new
bytes type in OCaml-4.02. This type is
accompanied by a
Bytes module with library functions
supporting it. The compiler was also extended so that string and
bytes can be used interchangeably by default. If, however, the
-safe-string switch is set on the command-line, the compiler
treats string and bytes as two completely separate types.
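As a rough illustration of the separation, here is a minimal sketch that uses only functions from the Bytes module. The names buf and snapshot are made up for this example. With the switch enabled, a direct mutation of a string such as s.[0] <- 'x' would be rejected at compile time, so mutation has to go through bytes.

```ocaml
(* bytes is the mutable type: create a buffer and modify it in place. *)
let buf = Bytes.of_string "hello"
let () = Bytes.set buf 0 'H'

(* Bytes.to_string returns a fresh, independent copy of the contents. *)
let snapshot = Bytes.to_string buf

(* Later mutations of buf do not affect the copy. *)
let () = Bytes.set buf 1 'E'

let () =
  print_endline snapshot;               (* Hello *)
  print_endline (Bytes.to_string buf)   (* HEllo *)
```

The copy made by Bytes.to_string is what keeps the string immutable, and it is also where the cost comes from.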
This is a disruptive change (if enabled): Almost all code bases will need modifications in order to be compatible with the new concept. Although this will often be trivial, there are also harder cases where strings are frequently used as buffers.
Before discussing that in a bit more detail, let me point out why such disruptive changes are so problematic. So far there was an implicit guarantee that your code will be compatible with new compiler versions if you stick to the well-established parts of the language and avoid experimental additions. I indeed have code that was developed for OCaml-1.03 (the first version I checked out), and that code still runs. Especially in a commercial context this is a highly appreciated feature, because it protects the investment in the code base. As I'm trying to sell OCaml to companies in my career, this is a point that bothers me. Giving up this history of excellent backward compatibility is something we shouldn't do easily, and if we do, only if we get something highly valuable back. (Of course, if you only look at the open source and academic use of OCaml, you'll put less emphasis on the compatibility point, but it's also not completely unimportant there.)
I'm fully aware that immutable strings fix some problems (the
worst probably: so far even string literals can be mutated, which can be
very surprising). However, creating a completely new type bytes
also comes with some disadvantages:
There is now String.get and there is Bytes.get. The shorthand s.[k] is now restricted to strings. This is mostly a stylistic problem.
Converting between string and bytes requires a copy, e.g. with Bytes.to_string. You have to pay a performance penalty.
Suppose we wanted to implement the Lexing module of the standard library in pure OCaml without resorting to unsafe coding (currently it's done in C). This module implements the lexing buffer that backs the lexers generated with ocamllex. We now have to use bytes for the core of this buffer. There are three functions in Lexing for creating new buffers:

    val from_channel : in_channel -> lexbuf
    val from_string : string -> lexbuf
    val from_function : (string -> int -> int) -> lexbuf

The first observation is that we had better offer two more constructors to the users of this module:

    val from_bytes : bytes -> lexbuf
    val from_bytes_function : (bytes -> int -> int) -> lexbuf

So why do we need the ability to read from bytes, i.e. copy from one buffer to the other? We could just be a bad host and not offer these functions to the users of the module. However, it's unavoidable anyway for from_channel, because I/O buffers are of course bytes:

    let from_channel ch = from_bytes_function (Pervasives.input ch)

So whenever we implement buffers that also include I/O capabilities, it is likely that we need to handle both the bytes and the string case. This is not only a problem for the interface design. Because string and bytes are completely separated, we need two different implementations: from_string and from_bytes cannot share much code.
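To make the duplication concrete, here is a small sketch with a toy buffer type. The names toy_buf and contents are hypothetical and much simpler than the real Lexing.lexbuf; only the Bytes functions are real library calls. The two constructors differ only in which blit and length function they call, yet they cannot be merged without a conversion.

```ocaml
(* A toy stand-in for a lexing buffer; the real Lexing.lexbuf has more
   fields. This type is an assumption made for the example. *)
type toy_buf = { data : bytes; len : int }

(* Variant 1: the source is an immutable string. *)
let from_string (s : string) : toy_buf =
  let n = String.length s in
  let data = Bytes.create n in
  Bytes.blit_string s 0 data 0 n;
  { data; len = n }

(* Variant 2: the source is a mutable bytes buffer. Same logic, but a
   different length and blit function. *)
let from_bytes (b : bytes) : toy_buf =
  let n = Bytes.length b in
  let data = Bytes.create n in
  Bytes.blit b 0 data 0 n;
  { data; len = n }

(* Helper to inspect the buffer contents. *)
let contents (tb : toy_buf) : string = Bytes.sub_string tb.data 0 tb.len
```

In a real buffer implementation the duplicated part would be larger than two lines, which is the maintenance problem described above.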
This is the ironic part of the new concept: Although it tries to
make the handling of strings more sound and safe, the immediate
consequence in reality is that code needs to be duplicated because of
missing polymorphism. Any half-way intelligent programmer will of
course fall back to unsafe functions for casting bytes to strings and
vice versa (Bytes.unsafe_to_string, Bytes.unsafe_of_string), and this
only means that the new -safe-string option will be a driving force
for using unsafe language features.
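The shortcut that programmers are likely to take could look like the following sketch. The function names count_char and count_char_bytes are invented for the example; Bytes.unsafe_to_string is the real library function. The cast avoids both the copy and the second implementation, but it is only sound as long as nobody mutates the buffer while the string view is in use.

```ocaml
(* A function written once, for string. *)
let count_char (s : string) (c : char) : int =
  let n = ref 0 in
  String.iter (fun x -> if x = c then incr n) s;
  !n

(* Reuse it for bytes without copying. The type checker can no longer
   help here: soundness depends on b not being mutated during the call. *)
let count_char_bytes (b : bytes) (c : char) : int =
  count_char (Bytes.unsafe_to_string b) c
```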
Let's look at three modifications of the concept. Is there some easy fix?
Idea 1: string as a supertype of bytes
We just allow that bytes can officially be coerced to string:

    let s = (b : bytes :> string)

Of course, this weakens the immutability property: a string
may now be a read-only interface for a bytes buffer, and
this buffer can be mutated, and this mutation can be observed through
the string:

    let mutable_string () =
      let b = Bytes.make 1 'X' in
      let s = (b :> string) in
      (s, Bytes.set b 0)

    let (s, set) = mutable_string ()   (* s is now "X" *)
    let () = set 'Y'                   (* s is now "Y" *)

Nevertheless, this concept is not meaningless. In particular, if a function takes a string argument, it is guaranteed that the string isn't modified. Also, string literals are immutable. Only when a function returns a string, we cannot be sure that the string isn't modified by a side effect.
This variation of the concept also solves the polymorphism problem we
explained with the example of the Lexing module: It is now
sufficient to implement from_string, because bytes can always be
coerced to string:

    let from_bytes s = from_string (s :> string)

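The coercion from bytes to string is a proposal and is not accepted by the compiler, so the snippets above cannot be run as they are. The semantics can be approximated in today's OCaml with an abstract type, as in this sketch. The module Ro_view and its functions are hypothetical names; the view shares the buffer without copying, and a mutation of the buffer is visible through it.

```ocaml
(* A read-only view on a bytes buffer, standing in for the proposed
   "string as supertype of bytes". Hypothetical module for illustration. *)
module Ro_view : sig
  type t
  val of_bytes : bytes -> t       (* no copy: shares the buffer *)
  val get : t -> int -> char
  val length : t -> int
end = struct
  type t = bytes
  let of_bytes b = b
  let get = Bytes.get
  let length = Bytes.length
end

let b = Bytes.make 1 'X'
let v = Ro_view.of_bytes b
let before = Ro_view.get v 0      (* 'X' *)
let () = Bytes.set b 0 'Y'
let after = Ro_view.get v 0       (* 'Y': the mutation is observable *)
```

A function that receives a Ro_view.t cannot modify it, which corresponds to the guarantee for string arguments described above.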
Idea 2: add a stringlike supertype
Some people may feel uncomfortable with the implication of Idea 1 that
the immutability of string can be easily circumvented.
This can be avoided with a variation: Add a third type
stringlike as the common supertype of both string and
bytes. So we allow:

    let sl1 = (s : string :> stringlike)
    let sl2 = (b : bytes :> stringlike)

Of course, stringlike doesn't implement mutators (like string). It is nevertheless different from string: string is considered as absolutely immutable (there is no way to coerce bytes to string), whereas stringlike is seen as the read-only API for either string or bytes, and it is allowed to mutate a stringlike behind the back of this API.
stringlike is especially interesting for interfaces that
need to be compatible with both string and bytes. In the
Lexing example, we would just define

    val from_stringlike : stringlike -> lexbuf
    val from_stringlike_function : (stringlike -> int -> int) -> lexbuf

and then reduce the other constructors to just these two, e.g.

    let from_string s = from_stringlike (s :> stringlike)
    let from_bytes b = from_stringlike (b :> stringlike)

These other constructors are now only defined for the convenience of the user.
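Since stringlike is not part of the language, the following is only an approximation of the idea with an abstract type. The module Stringlike and the function index_of are hypothetical. The sketch assumes that string and bytes share one runtime representation, which is why of_string can use Bytes.unsafe_of_string internally; the module exports no mutators, so a string is never modified through it.

```ocaml
(* A read-only API over either string or bytes. Hypothetical module. *)
module Stringlike : sig
  type t
  val of_string : string -> t
  val of_bytes : bytes -> t
  val length : t -> int
  val get : t -> int -> char
end = struct
  type t = bytes
  (* The cast is confined to this module and never followed by a write. *)
  let of_string = Bytes.unsafe_of_string
  let of_bytes b = b
  let length = Bytes.length
  let get = Bytes.get
end

(* One implementation serves both source types. *)
let index_of (sl : Stringlike.t) (c : char) : int option =
  let n = Stringlike.length sl in
  let rec go i =
    if i >= n then None
    else if Stringlike.get sl i = c then Some i
    else go (i + 1)
  in
  go 0

(* Convenience wrappers, analogous to from_string and from_bytes. *)
let index_in_string s c = index_of (Stringlike.of_string s) c
let index_in_bytes b c = index_of (Stringlike.of_bytes b) c
```

The wrappers play the role of the convenience constructors: all the logic lives in the single stringlike version.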
Idea 3: base bytes on bigarrays
This idea doesn't fix any of the mentioned problems. Instead, the
thinking is: If we already accept the incompatibility between
string and bytes, let's at least do it
in a way so that we get the maximum out of it. Especially for I/O
buffers, bigarrays are way better suited than strings.
So let's define:

    type bytes = (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

Sure, there is now no way to unsafely cast strings to bytes and vice versa anymore, but arguably we shouldn't prefer one design over the other only for its unsafety.
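A minimal sketch of what working with such a type could look like is given below. The type is named big_bytes here to avoid a clash with the built-in bytes, and the helpers create, of_string and to_string are made up for the example; the Bigarray.Array1 functions are the real library calls. One property that matters for buffers is visible at the end: Array1.sub returns a slice that shares memory with the original instead of copying it.

```ocaml
(* Named big_bytes to avoid clashing with the built-in bytes type. *)
type big_bytes =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

let create (n : int) : big_bytes =
  Bigarray.Array1.create Bigarray.char Bigarray.c_layout n

(* Conversions now always copy, character by character. *)
let of_string (s : string) : big_bytes =
  let b = create (String.length s) in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b

let to_string (b : big_bytes) : string =
  String.init (Bigarray.Array1.dim b) (fun i -> Bigarray.Array1.get b i)

(* A slice shares memory with the buffer it was taken from. *)
let buf = of_string "hello world"
let tail = Bigarray.Array1.sub buf 6 5
let () = Bigarray.Array1.set tail 0 'W'
```

With older compilers the Bigarray module lives in a separate library that has to be linked explicitly.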
Regarding stringlike, it is indeed possible to define it,
but there is some runtime cost. As string and bytes
now have different representations, any accessor function for
stringlike would have to check at runtime whether it is
backed by a string or by bytes. At least, this
check is very cheap.
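The runtime check can be pictured with an ordinary variant type, as in this sketch. The constructors Str and Big and the type name big_bytes are assumptions made for the example, and a built-in stringlike would presumably do the dispatch inside the runtime rather than with a user-level match.

```ocaml
type big_bytes =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* stringlike backed by one of two different representations. *)
type stringlike =
  | Str of string
  | Big of big_bytes

(* Every accessor first checks which representation it got. *)
let length = function
  | Str s -> String.length s
  | Big b -> Bigarray.Array1.dim b

let get sl i =
  match sl with
  | Str s -> String.get s i
  | Big b -> Bigarray.Array1.get b i

(* Code written against the accessors works for both cases. *)
let to_string sl = String.init (length sl) (get sl)
```

The check is a single tag comparison per access, which matches the remark that it is very cheap.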
bytes. The latter is not desirable, of course, but it is surely the task of the language (designer) to make sound and safe string handling an attractive option. I've presented three ideas that would all improve the concept in some respect. In particular, the combination of ideas 2 and 3 seems to be very attractive: back bytes by bigarrays, and provide a stringlike supertype for easing the programming of application buffers.