How to destroy your application using :erlang.binary_to_term/1
Short story: you should not use :erlang.binary_to_term/1 to deserialize terms from untrusted input, as it might expose your application to a denial-of-service (DoS) attack.
You should instead use :erlang.binary_to_term(untrusted, [:safe]) or a serialization library.
The reason behind this simple rule is rooted in how Erlang stores atoms:
Atoms are not garbage-collected. Once an atom is created, it is never removed. The emulator terminates if the limit for the number of atoms (1,048,576 by default) is reached.
The :safe option makes decoding fail instead of creating new atoms. As a rule of thumb, code that decodes untrusted input should never create new atoms (this is why JSON libraries use binaries as keys instead of atoms).
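A minimal sketch of the safe path, under a few assumptions: the SafeDecode module and its function names are hypothetical and not part of any library, and the input binary is built by hand (131 is the format version byte, 119 is the SMALL_ATOM_UTF8_EXT tag) so that the atom it names does not exist in the VM before decoding is attempted.

```elixir
defmodule SafeDecode do
  # Hypothetical helper: hand-builds the external term format encoding of an
  # atom from its name, without ever creating that atom in this VM.
  # 131 = format version, 119 = SMALL_ATOM_UTF8_EXT (1 length byte + name).
  def encode_atom_name(name) when is_binary(name) and byte_size(name) < 256 do
    <<131, 119, byte_size(name), name::binary>>
  end

  # Decodes untrusted input with the :safe option and turns the
  # ArgumentError raised on refusal into an error tuple.
  def decode(binary) when is_binary(binary) do
    {:ok, :erlang.binary_to_term(binary, [:safe])}
  rescue
    ArgumentError -> {:error, :unsafe_or_invalid}
  end
end

# The name is assembled at runtime, so no matching atom exists yet.
unknown = "never_seen_atom_#{System.unique_integer([:positive])}"
refused = SafeDecode.decode(SafeDecode.encode_atom_name(unknown))

# :ok already exists in every VM, so decoding it is allowed.
accepted = SafeDecode.decode(:erlang.term_to_binary(:ok))

IO.inspect({refused, accepted})
```

With the plain :erlang.binary_to_term/1, the first binary would instead be decoded and a brand new atom would be added to the atom table on every such message.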
Still, a good question remains: is it a good idea to use the Erlang external term format rather than some other serialization format?
My opinionated answer to this question is: no, there are several good language-neutral formats around.
Some good serialization alternatives
With fixed structure
The following serialization formats require a well-known, pre-shared schema, and received messages need to adhere to it. They can also be easily mapped to an Elixir struct.
Protocol Buffers
Protocol buffers are a language-neutral mechanism for serializing structured data. Elixir has a great library, exprotobuf, which also handles all code generation.
Here is an example .proto file from Astarte:
message AstarteReference {
int32 object_type = 1;
bytes object_uuid = 2;
}
As you can see, each message field has a well-defined type that leaves no room for uncertainty about signedness or size. Moreover, each field has a numeric id, which makes it easy to keep backwards compatibility.
More advanced messages can be easily created: protobufs support optional and required fields, enums, repeated fields and union types.
Last but not least, protobuf messages are way smaller than Erlang serialized terms:
iex(1)> alias Astarte.Core.AstarteReference
iex(2)> %AstarteReference{object_type: 10, object_uuid: <<0, 1, 2, 3>>} |> AstarteReference.encode |> byte_size
8
iex(3)> %AstarteReference{object_type: 10, object_uuid: <<0, 1, 2, 3>>} |> :erlang.term_to_binary |> byte_size
97
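The 8 bytes above can be accounted for by hand. The sketch below is not how exprotobuf works internally: WireSketch is a hypothetical module that hard-codes the two fields of AstarteReference and only handles values small enough to fit in a single varint byte.

```elixir
defmodule WireSketch do
  import Bitwise

  # Protobuf wire types used by the two fields of the example message.
  @varint 0
  @length_delimited 2

  # Every field starts with a key byte: (field number <<< 3) ||| wire type.
  def key(field_number, wire_type), do: (field_number <<< 3) ||| wire_type

  # Simplified on purpose: values below 128 fit in a one-byte varint,
  # so no multi-byte varint encoding is implemented here.
  def encode_reference(object_type, object_uuid)
      when object_type in 0..127 and byte_size(object_uuid) < 128 do
    <<
      key(1, @varint), object_type,
      key(2, @length_delimited), byte_size(object_uuid), object_uuid::binary
    >>
  end
end

encoded = WireSketch.encode_reference(10, <<0, 1, 2, 3>>)
IO.inspect(encoded)            # <<8, 10, 18, 4, 0, 1, 2, 3>>
IO.inspect(byte_size(encoded)) # 8
```

Field names never appear on the wire, only field numbers, which is where most of the saving comes from.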
Cap’n Proto
Cap’n Proto is a fast data interchange format, although its speed advantage does not carry over to Erlang/Elixir, where everything has to be converted to a term anyway. It is quite similar to Protobuf, with one peculiarity: the in-memory format and the serialized format are the same. It also supports RPC. There are no actively developed libraries for Elixir.
FlatBuffers
FlatBuffers is another fast serialization format: data can be read in place in some languages, such as C++, but not in Elixir. Like Protobuf and Cap’n Proto, it requires a schema. There’s also a serialization library for Elixir: eflatbuffers. FlatBuffers’ speed comes at a cost: the previous example takes 36 bytes when serialized with FlatBuffers, compared to the 8 bytes required by Protobuf.
Schemaless
The following formats are similar to JSON: they can be deserialized without any prior knowledge of a schema, so they are generally deserialized to an Elixir map.
MessagePack
MessagePack is a compact schemaless format: it is similar to JSON, but binary. Its encoding is really compact and wastes very few bytes, and it is supported by 50 different languages, including Elixir with the msgpax library. Our example struct fits in 31 bytes when encoded with MessagePack.
Serialization with msgpax is quite simple:
%{object_type: 10, object_uuid: <<0, 1, 2, 3>>} |> Msgpax.pack!() |> :erlang.iolist_to_binary()
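To see where the 31 bytes go, the layout can be rebuilt by hand. This is a sketch under assumptions: MsgpackSketch is a hypothetical module (not part of msgpax), and it assumes both keys and the UUID are packed as short strings (fixstr) and the integer as a positive fixint, which is consistent with the size quoted above.

```elixir
defmodule MsgpackSketch do
  # fixstr: 3-bit prefix 0b101, 5-bit byte length (0..31), then the bytes.
  def fixstr(s) when byte_size(s) < 32, do: <<0b101::3, byte_size(s)::5, s::binary>>

  # positive fixint: a single byte holding 0..127.
  def fixint(n) when n in 0..127, do: <<n>>

  # fixmap: 4-bit prefix 0b1000, 4-bit pair count (0..15), then the
  # already-encoded keys and values, one pair after the other.
  def fixmap(pairs) when length(pairs) < 16 do
    body = for {k, v} <- pairs, into: <<>>, do: <<k::binary, v::binary>>
    <<0b1000::4, length(pairs)::4, body::binary>>
  end
end

encoded =
  MsgpackSketch.fixmap([
    {MsgpackSketch.fixstr("object_type"), MsgpackSketch.fixint(10)},
    {MsgpackSketch.fixstr("object_uuid"), MsgpackSketch.fixstr(<<0, 1, 2, 3>>)}
  ])

# 1 (map header) + 12 + 1 (first pair) + 12 + 5 (second pair)
IO.inspect(byte_size(encoded)) # 31
```

Unlike Protobuf, the key names travel with every message: 24 of the 31 bytes are spent on them.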
BSON
BSON is a schemaless format: it is similar to JSON, but binary. It is supported by ~30 different languages, including Elixir with the cyanide library. Our example struct needs 44 bytes when encoded with BSON.
Serialization with cyanide is really straightforward:
%{object_type: 10, object_uuid: <<0, 1, 2, 3>>} |> Bson.encode
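The 44 bytes can be accounted for in the same way. This is a sketch, not the cyanide implementation: BsonSketch is a hypothetical module, and it assumes the integer is stored as an int32 and the UUID as a generic binary (subtype 0), a layout chosen here because it is consistent with the size quoted above.

```elixir
defmodule BsonSketch do
  # int32 element: type byte 0x10, key as a NUL-terminated string,
  # then the value as 4 little-endian bytes.
  def int32_element(key, value) do
    <<0x10, key::binary, 0, value::little-signed-size(32)>>
  end

  # binary element: type byte 0x05, key as a NUL-terminated string,
  # int32 payload length, subtype byte (0x00 = generic), then the payload.
  def binary_element(key, bytes) do
    <<0x05, key::binary, 0, byte_size(bytes)::little-size(32), 0x00, bytes::binary>>
  end

  # document: int32 total size (counting the size field itself and the
  # trailing 0x00), the elements, and a terminating 0x00.
  def document(elements) do
    body = IO.iodata_to_binary(elements)
    total = byte_size(body) + 5
    <<total::little-size(32), body::binary, 0>>
  end
end

encoded =
  BsonSketch.document([
    BsonSketch.int32_element("object_type", 10),
    BsonSketch.binary_element("object_uuid", <<0, 1, 2, 3>>)
  ])

# 4 (size) + 17 (int32 element) + 22 (binary element) + 1 (terminator)
IO.inspect(byte_size(encoded)) # 44
```

The fixed-width lengths and the NUL terminators explain the gap between BSON and MessagePack on such a small message.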
CBOR
CBOR is a schemaless format similar to other JSON-like binary formats. It is specified by an RFC (RFC 7049), and it is also the data serialization layer recommended by CoAP. There are no actively developed libraries for Elixir.
Conclusions
The Erlang external term format is pretty easy to use and might be handy when storing cached terms or in any similar scenario. On the other hand, Protocol Buffers outperforms all the other alternatives (in terms of compactness) when the message schema is known in advance. In all other cases, MessagePack or BSON are valid choices. Last but not least: when sending and receiving payloads from Web UIs, JSON is good enough: it is the most widely used serialization format for that specific purpose and it has no special requirements.