This tutorial explains how to configure the log flush interval in Apache Kafka.
Abstract
Apache Kafka is a distributed, durable messaging system. While its durability guarantees rest primarily on message replication, Kafka brokers also persist messages to disk.
Since hitting the disk for every message would slow things down considerably, Apache Kafka leverages the OS page cache: it opens a handle to a file on disk and keeps writing messages to it without flushing. Unflushed data leaves application memory but stays in the OS page cache; only once the data is flushed is it actually written to disk.
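The same deferral happens with any file write on a modern OS. The sketch below (plain shell, nothing Kafka-specific; the /tmp path is just an example) shows data first landing in the page cache and then being forced to disk with sync, which is the OS-level equivalent of a log flush:

```shell
# Write 10 MB; dd returns as soon as the data is in the OS page cache,
# not necessarily on disk yet
dd if=/dev/zero of=/tmp/pagecache-demo bs=1M count=10 2>/dev/null

# Explicitly flush the cached data to disk -- this is what a Kafka
# log flush does for its log segment files
sync /tmp/pagecache-demo

# The file is now durably on disk
ls -l /tmp/pagecache-demo
```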
This deferred flushing of data to disk serves two purposes -
- Better performance - Syncing in batches performs better because the number of calls to disk is reduced.
- Fewer GC problems - Anyone with a little experience in JVM-based languages knows about garbage collection (GC) pauses. Since Apache Kafka also runs on the JVM, it is prone to them too. By pushing messages to the OS cache, Kafka reduces its heap footprint and thereby avoids GC-related issues.
Affected NFRs
Log flushing is generally the most expensive operation and affects the following aspects of Apache Kafka -
- Durability - Larger flush intervals keep many messages in the OS cache. Unflushed messages may be lost if you are not using message replication.
- Latency - Very large flush intervals may also cause latency spikes when a flush occurs, since there is a lot of data to flush at once.
- Throughput - Small flush intervals may lead to excessive disk seeks and hence lower throughput.
Hence, it is important to choose the flush interval wisely for your use case. A high-throughput application will often use a larger flush interval combined with message replication to avoid data loss. E.g. LinkedIn configures Kafka to flush every 2 minutes.
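For reference, a 2-minute interval like the one mentioned above would look as follows in a broker's server.properties (the exact configuration LinkedIn uses is an assumption here; this is simply the time-based setting described later in this tutorial):

```properties
# Flush the log to disk at most every 2 minutes (120000 ms)
log.flush.interval.ms=120000
```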
While the optimal flush interval varies from one use case to another, the sections below cover the settings Apache Kafka provides to configure it.
Configuring Log Flush Interval
Apache Kafka provides the following two schemes for configuring the log flush interval -
- Interval based on number of messages - Under this scheme, we configure the maximum number of messages that Kafka will accept before flushing data to disk.
Here is the config parameter that you can set in the properties file of your Kafka brokers. E.g. the configuration below makes the broker flush data after every 10000 messages.
# The number of messages to accept before forcing a flush of data to disk
log.flush.interval.messages=10000
It is also possible to apply this configuration at the Topic level. For example, the command below sets the flush interval to 5000 messages for a Topic named my-topic -
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config flush.messages=5000
Note: Topic level configuration will always override Broker level configurations.
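You can verify that the override took effect with the same tool (assuming the topic name and ZooKeeper address from the example above):

```shell
# Describe the topic; its Configs column should show flush.messages=5000
./bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic my-topic
```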
- Interval based on time period - Under this scheme, we configure the maximum amount of time that Kafka will wait before flushing data to disk.
Here is the config parameter that you can set in the properties file of your Kafka brokers. E.g. the configuration below makes the broker flush data after every 1 second (1000 ms).
# The maximum amount of time a message can sit in a log before we force a flush
log.flush.interval.ms=1000
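The two schemes are not mutually exclusive. If both are set in server.properties, a flush is triggered by whichever threshold is reached first:

```properties
# Flush after every 10000 messages ...
log.flush.interval.messages=10000
# ... or after 1000 ms, whichever comes first
log.flush.interval.ms=1000
```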
It is also possible to apply this configuration at the Topic level. For example, the command below sets the flush interval to 60000 ms (60 seconds) for a Topic named my-topic -
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config flush.ms=60000
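Note that newer Kafka releases also provide a dedicated kafka-configs.sh tool for altering topic-level configuration; the command below is the equivalent of the kafka-topics.sh invocation above (same assumed topic name and ZooKeeper address):

```shell
./bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics \
  --entity-name my-topic --alter --add-config flush.ms=60000
```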
Thank you for reading through the tutorial. If you have any feedback, questions, or concerns, please share them in the comments and we will get back to you as soon as possible.