List of Metrics

There are two types of metrics in Alluxio: cluster-wide aggregated metrics and per-process detailed metrics.

  • Cluster metrics are collected and calculated by the leading master and displayed in the metrics tab of the web UI. These metrics are designed to provide a snapshot of the cluster state and the overall amount of data and metadata served by Alluxio.

  • Process metrics are collected by each Alluxio process and exposed in a machine-readable format through any configured sinks. Process metrics are highly detailed and are intended to be consumed by third-party monitoring tools. Users can then view fine-grained dashboards with time-series graphs of each metric, such as data transferred or the number of RPC invocations.

Metrics in Alluxio have the following format for master node metrics:

Master.[metricName].[tag1].[tag2]...

Metrics in Alluxio have the following format for non-master node metrics:

[processType].[metricName].[tag1].[tag2]...[hostName]

There is generally an Alluxio metric for every RPC invocation, whether to Alluxio or to the under store.

Tags are additional pieces of metadata for the metric such as user name or under storage location. Tags can be used to further filter or aggregate on various characteristics.
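
The naming formats above can be illustrated with a small parsing sketch. It is not part of Alluxio: the function name, the returned keys, and the sample tag and host values are hypothetical, and it assumes that no individual component (tag or host name) contains a dot.

```python
def parse_metric_name(full_name: str) -> dict:
    """Split a dotted metric name into process type, metric name, tags, and host name."""
    parts = full_name.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected '<processType>.<metricName>...': {full_name!r}")

    process_type, metric_name, rest = parts[0], parts[1], parts[2:]

    host_name = None
    # Assumption: per the format above, only non-master metrics end with a host name.
    if process_type != "Master" and rest:
        host_name = rest[-1]
        rest = rest[:-1]

    return {
        "process_type": process_type,
        "metric_name": metric_name,
        "tags": rest,
        "host_name": host_name,
    }


# Hypothetical names, used only to exercise the two formats.
master = parse_metric_name("Master.CreateFileOps.User:alice")
worker = parse_metric_name("Worker.BytesReadDomain.User:alice.worker-1")
print(master)
print(worker)
```

Real tag values (for example under storage locations) may themselves contain dots, in which case a plain split like this would not be sufficient.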

Cluster Metrics

Workers and clients send metrics data to the Alluxio master through heartbeats. The intervals are defined by the properties alluxio.master.worker.heartbeat.interval and alluxio.user.metrics.heartbeat.interval, respectively.

Bytes metrics are values aggregated from workers or clients. Bytes throughput metrics are calculated on the leading master: each throughput value equals the corresponding bytes counter value divided by the metrics record time, and is shown in bytes per minute.
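
As a rough illustration of the throughput calculation described above, the following sketch divides a bytes counter by the elapsed record time. The function name and the use of milliseconds for the record time are assumptions made for this example, not details taken from Alluxio.

```python
def bytes_per_minute(bytes_counter: int, record_time_ms: int) -> float:
    """Return throughput in bytes per minute for a counter observed over record_time_ms."""
    if record_time_ms <= 0:
        # No time has elapsed yet, so no meaningful rate can be reported.
        return 0.0
    elapsed_minutes = record_time_ms / 60_000
    return bytes_counter / elapsed_minutes


# 120,000,000 bytes recorded over 2 minutes (120,000 ms).
throughput = bytes_per_minute(120_000_000, 120_000)
print(throughput)
```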

Name
Type
Description

Cluster.ActiveRpcReadCount

COUNTER

The number of active read-RPCs managed by workers

Cluster.ActiveRpcWriteCount

COUNTER

The number of active write-RPCs managed by workers

Cluster.BytesReadDirect

COUNTER

Total number of bytes read from all workers without external RPC involved. Data exists in worker storage or is fetched by workers from UFSes. This records data read by worker internal calls (e.g. clients embedded in workers).

Cluster.BytesReadDirectThroughput

GAUGE

Bytes read per minute throughput from all workers without external RPC involved. Data exists in worker storage or is fetched by workers from UFSes. This records data read by worker internal calls (e.g. clients embedded in workers).

Cluster.BytesReadDomain

COUNTER

Total number of bytes read from all workers via domain socket

Cluster.BytesReadDomainThroughput

GAUGE

Bytes read per minute throughput from all workers via domain socket

Cluster.BytesReadLocal

COUNTER

Total number of bytes short-circuit read reported by all clients. Each client reads data from the collocated worker data storage directly.

Cluster.BytesReadLocalThroughput

GAUGE

Bytes per minute throughput short-circuit read reported by all clients

Cluster.BytesReadPerUfs

COUNTER

Total number of bytes read from a specific UFS by all workers

Cluster.BytesReadRemote

COUNTER

Total number of bytes read from all workers via network (RPC). Data exists in worker storage or is fetched by workers from UFSes. This does not include short-circuit local reads and domain socket reads

Cluster.BytesReadRemoteThroughput

GAUGE

Bytes read per minute throughput from all workers via network (RPC calls). Data exists in worker storage or is fetched by workers from UFSes. This does not include short-circuit local reads and domain socket reads

Cluster.BytesReadUfsAll

COUNTER

Total number of bytes read from all Alluxio UFSes by all workers

Cluster.BytesReadUfsThroughput

GAUGE

Bytes read per minute throughput from all Alluxio UFSes by all workers

Cluster.BytesWrittenDomain

COUNTER

Total number of bytes written to all workers via domain socket

Cluster.BytesWrittenDomainThroughput

GAUGE

Throughput of bytes written per minute to all workers via domain socket

Cluster.BytesWrittenLocal

COUNTER

Total number of bytes short-circuit written to local worker data storage by all clients

Cluster.BytesWrittenLocalThroughput

GAUGE

Bytes per minute throughput written to local worker data storage by all clients

Cluster.BytesWrittenPerUfs

COUNTER

Total number of bytes written to a specific Alluxio UFS by all workers

Cluster.BytesWrittenRemote

COUNTER

Total number of bytes written to workers via network (RPC). Data is written to worker storage or is written by workers to underlying UFSes. This does not include short-circuit local writes and domain socket writes.

Cluster.BytesWrittenRemoteThroughput

GAUGE

Bytes written per minute throughput to workers via network (RPC). Data is written to worker storage or is written by workers to underlying UFSes. This does not include short-circuit local writes and domain socket writes.

Cluster.BytesWrittenUfsAll

COUNTER

Total number of bytes written to all Alluxio UFSes by all workers

Cluster.BytesWrittenUfsThroughput

GAUGE

Bytes written per minute throughput to all Alluxio UFSes by all workers

Cluster.CacheHitRate

GAUGE

Cache hit rate: (# bytes read from cache) / (# bytes requested)

Cluster.CapacityFree

GAUGE

Total free bytes on all tiers, on all workers of Alluxio

Cluster.CapacityTotal

GAUGE

Total capacity (in bytes) on all tiers, on all workers of Alluxio

Cluster.CapacityUsed

GAUGE

Total used bytes on all tiers, on all workers of Alluxio

Cluster.LeaderId

GAUGE

Display current leader id

Cluster.LeaderIndex

GAUGE

Index of current leader

Cluster.LostWorkers

GAUGE

Total number of lost workers inside the cluster

Cluster.RootUfsCapacityFree

GAUGE

Free capacity of the Alluxio root UFS in bytes

Cluster.RootUfsCapacityTotal

GAUGE

Total capacity of the Alluxio root UFS in bytes

Cluster.RootUfsCapacityUsed

GAUGE

Used capacity of the Alluxio root UFS in bytes

Cluster.Workers

GAUGE

Total number of active workers inside the cluster

Process Metrics

Metrics shared by all Alluxio server and client processes.

Name
Type
Description

Process.pool.direct.mem.used

GAUGE

The direct memory used by the NIO direct buffer pool

Server Metrics

Metrics shared by the Alluxio server processes.

Name
Type
Description

Server.JvmPauseMonitorInfoTimeExceeded

GAUGE

The total number of times that the JVM slept for longer than the info level threshold defined by alluxio.jvm.monitor.info.threshold

Server.JvmPauseMonitorTotalExtraTime

GAUGE

The total time that the JVM slept without performing GC

Server.JvmPauseMonitorWarnTimeExceeded

GAUGE

The total number of times that the JVM slept for longer than the warn level threshold defined by alluxio.jvm.monitor.warn.threshold

Master Metrics

Default master metrics:

Name
Type
Description

Master.AbsentCacheHits

GAUGE

Number of cache hits on the absent cache

Master.AbsentCacheMisses

GAUGE

Number of cache misses on the absent cache

Master.AbsentCacheSize

GAUGE

Size of the absent cache

Master.AbsentPathCacheQueueSize

GAUGE

Alluxio maintains a cache of absent UFS paths. This is the number of UFS paths being processed.

Master.AsyncPersistCancel

COUNTER

The number of cancelled AsyncPersist operations

Master.AsyncPersistFail

COUNTER

The number of failed AsyncPersist operations

Master.AsyncPersistFileCount

COUNTER

The number of files created by AsyncPersist operations

Master.AsyncPersistFileSize

COUNTER

The total size of files created by AsyncPersist operations

Master.AsyncPersistSuccess

COUNTER

The number of successful AsyncPersist operations

Master.AuditLogEntriesSize

GAUGE

The size of the audit log entries blocking queue

Master.BlockHeapSize

GAUGE

An estimate of the blocks heap size

Master.BlockReplicaCount

GAUGE

Total number of block replicas in Alluxio

Master.CachedBlockLocations

GAUGE

Total number of cached block locations

Master.CompleteFileOps

COUNTER

Total number of the CompleteFile operations

Master.CompletedOperationRetryCount

COUNTER

Total number of completed operations that have been retried by clients.

Master.CreateDirectoryOps

COUNTER

Total number of the CreateDirectory operations

Master.CreateFileOps

COUNTER

Total number of the CreateFile operations

Master.DeletePathOps

COUNTER

Total number of the Delete operations

Master.DirectoriesCreated

COUNTER

Total number of successful CreateDirectory operations

Master.EdgeCacheEvictions

GAUGE

Total number of edges (inode metadata) that were evicted from cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeCacheHits

GAUGE

Total number of hits in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeCacheLoadTimes

GAUGE

Total load times in the edge (inode metadata) cache that resulted from a cache miss. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeCacheMisses

GAUGE

Total number of misses in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeCacheSize

GAUGE

Total number of edges (inode metadata) cached. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeLockPoolSize

GAUGE

The size of master edge lock pool

Master.EmbeddedJournalLastSnapshotDownloadDiskSize

GAUGE

Describes the size on disk of the snapshot downloaded from other masters in the cluster the previous time the download occurred. Only valid when using the embedded journal.

Master.EmbeddedJournalLastSnapshotDownloadDurationMs

GAUGE

Describes the amount of time taken to download journal snapshots from other masters in the cluster the previous time the download occurred. Only valid when using the embedded journal.

Master.EmbeddedJournalLastSnapshotDownloadSize

GAUGE

Describes the size of the snapshot downloaded from other masters in the cluster the previous time the download occurred. Only valid when using the embedded journal.

Master.EmbeddedJournalLastSnapshotDurationMs

GAUGE

Describes the amount of time taken to generate the last local journal snapshots on this master. Only valid when using the embedded journal.

Master.EmbeddedJournalLastSnapshotEntriesCount

GAUGE

Describes the number of entries in the last local journal snapshots on this master. Only valid when using the embedded journal.

Master.EmbeddedJournalLastSnapshotReplayDurationMs

GAUGE

Represents the time the last restore from checkpoint operation took in milliseconds.

Master.EmbeddedJournalLastSnapshotReplayEntriesCount

GAUGE

Represents the number of entries replayed in the last restore from checkpoint operation.

Master.EmbeddedJournalLastSnapshotUploadDiskSize

GAUGE

Describes the size on disk of the snapshot uploaded to other masters in the cluster the previous time the upload occurred. Only valid when using the embedded journal.

Master.EmbeddedJournalLastSnapshotUploadDurationMs

GAUGE

Describes the amount of time taken to upload journal snapshots to another master in the cluster the previous time the upload occurred. Only valid when using the embedded journal.

Master.EmbeddedJournalLastSnapshotUploadSize

GAUGE

Describes the size of the snapshot uploaded to other masters in the cluster the previous time the upload occurred. Only valid when using the embedded journal.

Master.EmbeddedJournalSnapshotDownloadDiskHistogram

HISTOGRAM

Describes the size on disk of the snapshot downloaded from another master in the cluster. Only valid when using the embedded journal. Long running average.

Master.EmbeddedJournalSnapshotDownloadGenerate

TIMER

Describes the amount of time taken to download journal snapshots from other masters in the cluster. Only valid when using the embedded journal. Long running average.

Master.EmbeddedJournalSnapshotDownloadHistogram

HISTOGRAM

Describes the size of the snapshot downloaded from another master in the cluster. Only valid when using the embedded journal. Long running average.

Master.EmbeddedJournalSnapshotGenerateTimer

TIMER

Describes the amount of time taken to generate local journal snapshots on this master. Only valid when using the embedded journal. Use this metric to measure the performance of Alluxio's snapshot generation.

Master.EmbeddedJournalSnapshotInstallTimer

TIMER

Describes the amount of time taken to install a downloaded journal snapshot from another master. Only valid when using the embedded journal. Use this metric to determine the performance of Alluxio when installing snapshots from the leader. Higher numbers may indicate a slow disk or CPU contention.

Master.EmbeddedJournalSnapshotLastIndex

GAUGE

Represents the latest journal index that was recorded by this master in the most recent local snapshot or from a snapshot downloaded from another master in the cluster. Only valid when using the embedded journal.

Master.EmbeddedJournalSnapshotReplayTimer

TIMER

Describes the amount of time taken to replay a journal snapshot onto the master's state machine. Only valid when using the embedded journal. Use this metric to determine the performance of Alluxio when replaying a journal snapshot file. Higher numbers may indicate a slow disk or CPU contention.

Master.EmbeddedJournalSnapshotUploadDiskHistogram

HISTOGRAM

Describes the size on disk of the snapshot uploaded to another master in the cluster. Only valid when using the embedded journal. Long running average.

Master.EmbeddedJournalSnapshotUploadHistogram

HISTOGRAM

Describes the size of the snapshot uploaded to another master in the cluster. Only valid when using the embedded journal. Long running average.

Master.EmbeddedJournalSnapshotUploadTimer

TIMER

Describes the amount of time taken to upload journal snapshots to another master in the cluster. Only valid when using the embedded journal. Long running average.

Master.FileBlockInfosGot

COUNTER

Total number of successful GetFileBlockInfo operations

Master.FileInfosGot

COUNTER

Total number of successful GetFileInfo operations

Master.FileSize

GAUGE

File size distribution

Master.FilesCompleted

COUNTER

Total number of successful CompleteFile operations

Master.FilesCreated

COUNTER

Total number of successful CreateFile operations

Master.FilesFreed

COUNTER

Total number of successful FreeFile operations

Master.FilesPersisted

COUNTER

Total number of successfully persisted files

Master.FilesPinned

GAUGE

Total number of currently pinned files. Note that IDs for these files are stored in memory.

Master.FilesToBePersisted

GAUGE

Total number of files currently waiting to be persisted. Note that the IDs for these files are stored in memory.

Master.FreeFileOps

COUNTER

Total number of FreeFile operations

Master.GetFileBlockInfoOps

COUNTER

Total number of GetFileBlockInfo operations

Master.GetFileInfoOps

COUNTER

Total number of the GetFileInfo operations

Master.GetNewBlockOps

COUNTER

Total number of the GetNewBlock operations

Master.InodeCacheEvictions

GAUGE

Total number of inodes that were evicted from the cache.

Master.InodeCacheHitRatio

GAUGE

Inode Cache hit ratio

Master.InodeCacheHits

GAUGE

Total number of hits in the inodes (inode metadata) cache.

Master.InodeCacheLoadTimes

GAUGE

Total load times in the inodes (inode metadata) cache that resulted from a cache miss.

Master.InodeCacheMisses

GAUGE

Total number of misses in the inodes (inode metadata) cache.

Master.InodeCacheSize

GAUGE

Total number of inodes (inode metadata) cached.

Master.InodeHeapSize

GAUGE

An estimate of the inode heap size

Master.InodeLockPoolSize

GAUGE

The size of master inode lock pool

Master.JobCanceled

COUNTER

The number of jobs in canceled status

Master.JobCompleted

COUNTER

The number of jobs in completed status

Master.JobCount

GAUGE

The total number of jobs in any status

Master.JobCreated

COUNTER

The number of jobs in created status

Master.JobDistributedLoadBlockSizes

COUNTER

The total block size loaded by load commands

Master.JobDistributedLoadCancel

COUNTER

The number of cancelled DistributedLoad operations

Master.JobDistributedLoadFail

COUNTER

The number of failed DistributedLoad operations

Master.JobDistributedLoadFileCount

COUNTER

The number of files loaded by DistributedLoad operations

Master.JobDistributedLoadFileSizes

COUNTER

The total size of files loaded by DistributedLoad operations

Master.JobDistributedLoadRate

METER

The average DistributedLoad loading rate

Master.JobDistributedLoadSuccess

COUNTER

The number of successful DistributedLoad operations

Master.JobFailed

COUNTER

The number of jobs in failed status

Master.JobLoadBlockCount

COUNTER

The number of blocks loaded by load commands

Master.JobLoadBlockFail

COUNTER

The number of blocks that failed to be loaded by load commands

Master.JobLoadFail

COUNTER

The number of failed Load commands

Master.JobLoadRate

METER

The average loading rate of Load commands

Master.JobLoadSuccess

COUNTER

The number of successful Load commands

Master.JobRunning

COUNTER

The number of jobs in running status

Master.JournalCheckpointWarn

GAUGE

Returns 1 to indicate that a warning is required if the raft log index exceeds alluxio.master.journal.checkpoint.period.entries and the time since the last checkpoint exceeds alluxio.master.journal.checkpoint.warning.threshold.time; otherwise returns 0.

Master.JournalEntriesSinceCheckPoint

GAUGE

Journal entries since last checkpoint

Master.JournalFlushFailure

COUNTER

Total number of failed journal flushes

Master.JournalFlushTimer

TIMER

The timer statistics of journal flush

Master.JournalFreeBytes

GAUGE

Bytes left on the journal disk(s) for an Alluxio master. This metric is only valid on Linux and when embedded journal is used. Use this metric to monitor whether your journal is running out of disk space.

Master.JournalFreePercent

GAUGE

Percentage of free space left on the journal disk(s) for an Alluxio master. This metric is only valid on Linux and when embedded journal is used. Use this metric to monitor whether your journal is running out of disk space.

Master.JournalGainPrimacyTimer

TIMER

The timer statistics of journal gain primacy

Master.JournalLastAppliedCommitIndex

GAUGE

The last raft log index which was applied to the state machine

Master.JournalLastCheckPointTime

GAUGE

Last Journal Checkpoint Time

Master.JournalSequenceNumber

GAUGE

Current journal sequence number

Master.LastBackupEntriesCount

GAUGE

The total number of entries written in the last leading master metadata backup

Master.LastBackupRestoreCount

GAUGE

The total number of entries restored from backup when a leading master initializes its metadata

Master.LastBackupRestoreTimeMs

GAUGE

The process time of the last restore from backup

Master.LastBackupTimeMs

GAUGE

The process time of the last backup

Master.LastGainPrimacyTime

GAUGE

Last time the master gained primacy

Master.LastLosePrimacyTime

GAUGE

Last time the master lost primacy

Master.ListingCacheEvictions

COUNTER

The total number of evictions in master listing cache

Master.ListingCacheHits

COUNTER

The total number of hits in master listing cache

Master.ListingCacheLoadTimes

COUNTER

The total load time (in nanoseconds) in master listing cache that resulted from a cache miss.

Master.ListingCacheMisses

COUNTER

The total number of misses in master listing cache

Master.ListingCacheSize

GAUGE

The size of master listing cache

Master.LostBlockCount

GAUGE

Count of lost unique blocks

Master.LostFileCount

GAUGE

Count of lost files. This number is cached and may not be in sync with Master.LostBlockCount

Master.MetadataSyncActivePaths

COUNTER

The number of in-progress paths from all InodeSyncStream instances

Master.MetadataSyncExecutor

EXECUTOR_SERVICE

Metrics concerning the master metadata sync executor threads. Master.MetadataSyncExecutor.submitted is a meter of the tasks submitted to the executor. Master.MetadataSyncExecutor.completed is a meter of the tasks completed by the executor. Master.MetadataSyncExecutor.activeTaskQueue is an exponentially-decaying random reservoir of the number of active tasks (running or submitted) at the executor, calculated each time a new task is added to the executor. The max value is the maximum number of active tasks at any time during execution. Master.MetadataSyncExecutor.running is the number of tasks actively being run by the executor. Master.MetadataSyncExecutor.idle is the time spent idling by the submitted tasks (i.e. waiting in the queue before being executed). Master.MetadataSyncExecutor.duration is the time spent running the submitted tasks. If the executor is a thread pool executor, then Master.MetadataSyncExecutor.queueSize is the size of the task queue.

Master.MetadataSyncExecutorQueueSize

GAUGE

The number of queuing sync tasks in the metadata sync thread pool controlled by alluxio.master.metadata.sync.executor.pool.size

Master.MetadataSyncFail

COUNTER

The number of InodeSyncStream that failed, either partially or fully

Master.MetadataSyncNoChange

COUNTER

The number of InodeSyncStream that finished with no change to inodes.

Master.MetadataSyncOpsCount

COUNTER

The number of metadata sync operations. Each sync operation corresponds to one InodeSyncStream instance.

Master.MetadataSyncPathsCancel

COUNTER

The number of pending paths from all InodeSyncStream instances that are ignored in the end instead of processed

Master.MetadataSyncPathsFail

COUNTER

The number of paths that failed during metadata sync from all InodeSyncStream instances

Master.MetadataSyncPathsSuccess

COUNTER

The number of paths sync-ed from all InodeSyncStream instances

Master.MetadataSyncPendingPaths

COUNTER

The number of pending paths from all active InodeSyncStream instances, waiting for metadata sync

Master.MetadataSyncPrefetchCancel

COUNTER

Number of cancelled prefetch jobs from metadata sync

Master.MetadataSyncPrefetchExecutor

EXECUTOR_SERVICE

Metrics concerning the master metadata sync prefetch executor threads. Master.MetadataSyncPrefetchExecutor.submitted is a meter of the tasks submitted to the executor. Master.MetadataSyncPrefetchExecutor.completed is a meter of the tasks completed by the executor. Master.MetadataSyncPrefetchExecutor.activeTaskQueue is an exponentially-decaying random reservoir of the number of active tasks (running or submitted) at the executor, calculated each time a new task is added to the executor. The max value is the maximum number of active tasks at any time during execution. Master.MetadataSyncPrefetchExecutor.running is the number of tasks actively being run by the executor. Master.MetadataSyncPrefetchExecutor.idle is the time spent idling by the submitted tasks (i.e. waiting in the queue before being executed). Master.MetadataSyncPrefetchExecutor.duration is the time spent running the submitted tasks. If the executor is a thread pool executor, then Master.MetadataSyncPrefetchExecutor.queueSize is the size of the task queue.

Master.MetadataSyncPrefetchExecutorQueueSize

GAUGE

The number of queuing prefetch tasks in the metadata sync thread pool controlled by alluxio.master.metadata.sync.ufs.prefetch.pool.size

Master.MetadataSyncPrefetchFail

COUNTER

Number of failed prefetch jobs from metadata sync

Master.MetadataSyncPrefetchOpsCount

COUNTER

The number of prefetch operations handled by the prefetch thread pool

Master.MetadataSyncPrefetchPaths

COUNTER

Total number of UFS paths fetched by prefetch jobs from metadata sync

Master.MetadataSyncPrefetchRetries

COUNTER

Number of retries when getting results from prefetch jobs during metadata sync

Master.MetadataSyncPrefetchSuccess

COUNTER

Number of successful prefetch jobs from metadata sync

Master.MetadataSyncSkipped

COUNTER

The number of InodeSyncStream that are skipped because the Alluxio metadata is fresher than alluxio.user.file.metadata.sync.interval

Master.MetadataSyncSuccess

COUNTER

The number of InodeSyncStream that succeeded

Master.MetadataSyncTimeMs

COUNTER

The total time elapsed in all InodeSyncStream instances

Master.MetadataSyncUfsMount.

COUNTER

The number of UFS sync operations for a given mount point

Master.MigrateJobCancel

COUNTER

The number of cancelled MigrateJob operations

Master.MigrateJobFail

COUNTER

The number of failed MigrateJob operations

Master.MigrateJobFileCount

COUNTER

The number of MigrateJob files

Master.MigrateJobFileSize

COUNTER

The total size of MigrateJob files

Master.MigrateJobSuccess

COUNTER

The number of successful MigrateJob operations

Master.MountOps

COUNTER

Total number of Mount operations

Master.NewBlocksGot

COUNTER

Total number of successful GetNewBlock operations

Master.PathsDeleted

COUNTER

Total number of successful Delete operations

Master.PathsMounted

COUNTER

Total number of successful Mount operations

Master.PathsRenamed

COUNTER

Total number of successful Rename operations

Master.PathsUnmounted

COUNTER

Total number of successful Unmount operations

Master.RenamePathOps

COUNTER

Total number of Rename operations

Master.ReplicaMgmtActiveJobSize

GAUGE

Number of active block replication/eviction jobs. These jobs are created by the master to maintain the block replica factor. The value is an estimate with lag.

Master.ReplicationLimitedFiles

COUNTER

Number of files that have a replication count set to a non-default value. Note that these files have IDs that are stored in memory.

Master.RocksBlockBackgroundErrors

GAUGE

RocksDB block table. Accumulated number of background errors.

Master.RocksBlockBlockCacheCapacity

GAUGE

RocksDB block table. Block cache capacity.

Master.RocksBlockBlockCachePinnedUsage

GAUGE

RocksDB block table. Memory size for the entries being pinned.

Master.RocksBlockBlockCacheUsage

GAUGE

RocksDB block table. Memory size for the entries residing in block cache.

Master.RocksBlockCompactionPending

GAUGE

RocksDB block table. This metric reports 1 if at least one compaction is pending; otherwise, the metric reports 0.

Master.RocksBlockCurSizeActiveMemTable

GAUGE

RocksDB block table. Approximate size of active memtable in bytes.

Master.RocksBlockCurSizeAllMemTables

GAUGE

RocksDB block table. Approximate size of active and unflushed immutable memtables in bytes.

Master.RocksBlockEstimateNumKeys

GAUGE

RocksDB block table. Estimated number of total keys in the active and unflushed immutable memtables and storage.

Master.RocksBlockEstimatePendingCompactionBytes

GAUGE

RocksDB block table. Estimated total number of bytes a compaction needs to rewrite on disk to get all levels down to under target size. In other words, this metric relates to the write amplification in level compaction. Thus, this metric is not valid for compactions other than level-based.

Master.RocksBlockEstimateTableReadersMem

GAUGE

RocksDB block table. Estimated memory in bytes used for reading SST tables, excluding memory used in block cache (e.g., filter and index blocks). This metric records the memory used by iterators as well as filters and indices if the filters and indices are not maintained in the block cache. Basically this metric reports the memory used outside the block cache to read data.

Master.RocksBlockEstimatedMemUsage

GAUGE

RocksDB block table. This metric estimates the memory usage of the RocksDB block table by aggregating the values of Master.RocksBlockBlockCacheUsage, Master.RocksBlockEstimateTableReadersMem, Master.RocksBlockCurSizeAllMemTables, and Master.RocksBlockBlockCachePinnedUsage

Master.RocksBlockLiveSstFilesSize

GAUGE

RocksDB block table. Total size in bytes of all SST files that belong to the latest LSM tree.

Master.RocksBlockMemTableFlushPending

GAUGE

RocksDB block table. This metric returns 1 if a memtable flush is pending; otherwise it returns 0.

Master.RocksBlockNumDeletesActiveMemTable

GAUGE

RocksDB block table. Total number of delete entries in the active memtable.

Master.RocksBlockNumDeletesImmMemTables

GAUGE

RocksDB block table. Total number of delete entries in the unflushed immutable memtables.

Master.RocksBlockNumEntriesActiveMemTable

GAUGE

RocksDB block table. Total number of entries in the active memtable.

Master.RocksBlockNumEntriesImmMemTables

GAUGE

RocksDB block table. Total number of entries in the unflushed immutable memtables.

Master.RocksBlockNumImmutableMemTable

GAUGE

RocksDB block table. Number of immutable memtables that have not yet been flushed.

Master.RocksBlockNumLiveVersions

GAUGE

RocksDB block table. Number of live versions. More live versions often mean more SST files are held from being deleted, by iterators or unfinished compactions.

Master.RocksBlockNumRunningCompactions

GAUGE

RocksDB block table. Number of currently running compactions.

Master.RocksBlockNumRunningFlushes

GAUGE

RocksDB block table. Number of currently running flushes.

Master.RocksBlockSizeAllMemTables

GAUGE

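RocksDB block table. Approximate size of active, unflushed immutable, and pinned immutable memtables in bytes. Pinned immutable memtables are flushed memtables that are kept in memory to maintain write history in memory.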

Master.RocksBlockTotalSstFilesSize

GAUGE

RocksDB block table. Total size in bytes of all SST files.

Master.RocksInodeBackgroundErrors

GAUGE

RocksDB inode table. Accumulated number of background errors.

Master.RocksInodeBlockCacheCapacity

GAUGE

RocksDB inode table. Block cache capacity.

Master.RocksInodeBlockCachePinnedUsage

GAUGE

RocksDB inode table. Memory size for the entries being pinned.

Master.RocksInodeBlockCacheUsage

GAUGE

RocksDB inode table. Memory size for the entries residing in block cache.

Master.RocksInodeCompactionPending

GAUGE

RocksDB inode table. This metric reports 1 if at least one compaction is pending; otherwise, the metric reports 0.

Master.RocksInodeCurSizeActiveMemTable

GAUGE

RocksDB inode table. Approximate size of active memtable in bytes.

Master.RocksInodeCurSizeAllMemTables

GAUGE

RocksDB inode table. Approximate size of active and unflushed immutable memtables in bytes.

Master.RocksInodeEstimateNumKeys

GAUGE

RocksDB inode table. Estimated number of total keys in the active and unflushed immutable memtables and storage.

Master.RocksInodeEstimatePendingCompactionBytes

GAUGE

RocksDB inode table. Estimated total number of bytes a compaction needs to rewrite on disk to get all levels down to under target size. In other words, this metric relates to the write amplification in level compaction. Thus, this metric is not valid for compactions other than level-based.

Master.RocksInodeEstimateTableReadersMem

GAUGE

RocksDB inode table. Estimated memory in bytes used for reading SST tables, excluding memory used in block cache (e.g., filter and index blocks). This metric records the memory used by iterators as well as filters and indices if the filters and indices are not maintained in the block cache. Basically this metric reports the memory used outside the block cache to read data.

Master.RocksInodeEstimatedMemUsage

GAUGE

RocksDB inode table. This metric estimates the memory usage of the RocksDB inode table by aggregating the values of Master.RocksInodeBlockCacheUsage, Master.RocksInodeEstimateTableReadersMem, Master.RocksInodeCurSizeAllMemTables, and Master.RocksInodeBlockCachePinnedUsage

Master.RocksInodeLiveSstFilesSize

GAUGE

RocksDB inode table. Total size in bytes of all SST files that belong to the latest LSM tree.

Master.RocksInodeMemTableFlushPending

GAUGE

RocksDB inode table. This metric returns 1 if a memtable flush is pending; otherwise it returns 0.

Master.RocksInodeNumDeletesActiveMemTable

GAUGE

RocksDB inode table. Total number of delete entries in the active memtable.

Master.RocksInodeNumDeletesImmMemTables

GAUGE

RocksDB inode table. Total number of delete entries in the unflushed immutable memtables.

Master.RocksInodeNumEntriesActiveMemTable

GAUGE

RocksDB inode table. Total number of entries in the active memtable.

Master.RocksInodeNumEntriesImmMemTables

GAUGE

RocksDB inode table. Total number of entries in the unflushed immutable memtables.

Master.RocksInodeNumImmutableMemTable

GAUGE

RocksDB inode table. Number of immutable memtables that have not yet been flushed.

Master.RocksInodeNumLiveVersions

GAUGE

RocksDB inode table. Number of live versions. More live versions often mean more SST files are held from being deleted, by iterators or unfinished compactions.

Master.RocksInodeNumRunningCompactions

GAUGE

RocksDB inode table. Number of currently running compactions.

Master.RocksInodeNumRunningFlushes

GAUGE

RocksDB inode table. Number of currently running flushes.

Master.RocksInodeSizeAllMemTables

GAUGE

RocksDB inode table. Approximate size of active, unflushed immutable, and pinned immutable memtables in bytes. Pinned immutable memtables are flushed memtables that are kept in memory to maintain write history in memory.

Master.RocksInodeTotalSstFilesSize

GAUGE

RocksDB inode table. Total size in bytes of all SST files.

Master.RocksTotalEstimatedMemUsage

GAUGE

This metric gives an estimate of the total memory used by RocksDB by aggregating the values of Master.RocksBlockEstimatedMemUsage and Master.RocksInodeEstimatedMemUsage

Master.RoleId

GAUGE

The role ID of the master

Master.RpcQueueLength

GAUGE

Length of the master RPC queue. Use this metric to monitor the RPC pressure on master.

Master.RpcThreadActiveCount

GAUGE

The number of threads that are actively executing tasks in the master RPC executor thread pool. Use this metric to monitor the RPC pressure on master.

Master.RpcThreadCurrentCount

GAUGE

Current count of threads in the master RPC executor thread pool. Use this metric to monitor the RPC pressure on master.

Master.SetAclOps

COUNTER

Total number of SetAcl operations

Master.SetAttributeOps

COUNTER

Total number of SetAttribute operations

Master.StartTime

GAUGE

The start time of the master process

Master.TTLBuckets

GAUGE

The number of TTL buckets at the master. Note that these buckets are stored in memory.

Master.TTLInodes

GAUGE

The total number of inodes contained in TTL buckets at the master. Note that these inodes are stored in memory.

Master.ToRemoveBlockCount

GAUGE

Count of block replicas to be removed from the workers. If 1 block is to be removed from 2 workers, 2 will be counted here.

Master.TotalPaths

GAUGE

Total number of files and directories in the Alluxio namespace

Master.TotalRpcs

TIMER

Throughput of master RPC calls. This metric indicates how busy the master is serving client and worker requests

Master.UfsJournalCatchupTimer

TIMER

The timer statistics of journal catchup. Only valid when UFS journal is used. This provides a summary of how long a standby master takes to catch up with the primary master, and should be monitored if master transition takes too long

Master.UfsJournalFailureRecoverTimer

TIMER

The timer statistics of UFS journal failure recovery

Master.UfsJournalInitialReplayTimeMs

GAUGE

The processing time of the UFS journal initial replay. Only valid when UFS journal is used. It records the time taken by the very first journal replay. Use this metric to monitor when your master boot-up time is high.

Master.UfsStatusCacheChildrenSize

COUNTER

Total number of UFS file metadata cached. The cache is used during metadata sync.

Master.UfsStatusCacheSize

COUNTER

Total number of Alluxio paths being processed by the metadata sync prefetch thread pool.

Master.UniqueBlocks

GAUGE

Total number of unique blocks in Alluxio

Master.UnmountOps

COUNTER

Total number of Unmount operations
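The aggregation described for Master.RocksInodeEstimatedMemUsage in the table above can be checked by hand. The following is a minimal sketch, assuming the component gauges have already been fetched into a plain dictionary; the helper name `estimate_inode_mem_usage` and the values in `snapshot` are hypothetical and only serve as an illustration.

```python
# Component gauges that, per the table above, are summed into
# Master.RocksInodeEstimatedMemUsage.
INODE_MEM_COMPONENTS = (
    "Master.RocksInodeBlockCacheUsage",
    "Master.RocksInodeEstimateTableReadersMem",
    "Master.RocksInodeCurSizeAllMemTables",
    "Master.RocksInodeBlockCachePinnedUsage",
)


def estimate_inode_mem_usage(gauges):
    """Sum the component gauges (bytes); a missing gauge raises KeyError."""
    return sum(gauges[name] for name in INODE_MEM_COMPONENTS)


# Hypothetical gauge values in bytes, for illustration only.
MIB = 1024 * 1024
snapshot = {
    "Master.RocksInodeBlockCacheUsage": 64 * MIB,
    "Master.RocksInodeEstimateTableReadersMem": 8 * MIB,
    "Master.RocksInodeCurSizeAllMemTables": 16 * MIB,
    "Master.RocksInodeBlockCachePinnedUsage": 4 * MIB,
}
print(estimate_inode_mem_usage(snapshot) // MIB, "MiB")
```

Comparing this sum with the reported gauge is a quick way to confirm that a dashboard is reading the intended metrics.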

Dynamically generated master metrics:

Metric Name
Description

Master.CapacityTotalTier{TIER_NAME}

Total capacity in tier {TIER_NAME} of the Alluxio file system in bytes

Master.CapacityUsedTier{TIER_NAME}

Used capacity in tier {TIER_NAME} of the Alluxio file system in bytes

Master.CapacityFreeTier{TIER_NAME}

Free capacity in tier {TIER_NAME} of the Alluxio file system in bytes

Master.UfsSessionCount-Ufs:{UFS_ADDRESS}

The total number of currently opened UFS sessions to connect to the given {UFS_ADDRESS}

Master.{UFS_RPC_NAME}.UFS:{UFS_ADDRESS}.UFS_TYPE:{UFS_TYPE}.User:{USER}

The details of the UFS RPC operations performed by the current master

Master.PerUfsOp{UFS_RPC_NAME}.UFS:{UFS_ADDRESS}

The aggregated number of UFS operations {UFS_RPC_NAME} run on UFS {UFS_ADDRESS} by the leading master

Master.{LEADING_MASTER_RPC_NAME}

The duration statistics of RPC calls exposed on the leading master
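Monitoring tools often need to split a dynamically generated metric name back into its tags. The sketch below parses names of the form Master.{UFS_RPC_NAME}.UFS:{UFS_ADDRESS}.UFS_TYPE:{UFS_TYPE}.User:{USER}. It assumes the name appears literally as in that template; any escaping applied by a real deployment is not handled. The RPC name, address, and user in `example` are hypothetical.

```python
import re

# The address group is non-greedy and bounded by the literal tag markers,
# so dots inside the UFS address do not break the split.
UFS_RPC_PATTERN = re.compile(
    r"^Master\.(?P<rpc>[^.]+)"
    r"\.UFS:(?P<address>.+?)"
    r"\.UFS_TYPE:(?P<ufs_type>[^.]+)"
    r"\.User:(?P<user>.+)$"
)


def parse_ufs_rpc_metric(name):
    """Return the tags as a dict, or None if the name has another shape."""
    match = UFS_RPC_PATTERN.match(name)
    return match.groupdict() if match else None


# Hypothetical metric name, for illustration only.
example = "Master.GetFileStatus.UFS:s3://bucket/data.UFS_TYPE:s3.User:alice"
print(parse_ufs_rpc_metric(example))
```

Names that do not carry all three tags, such as Master.TotalPaths, yield None, so the function can be applied to a full metrics dump without pre-filtering.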

Worker Metrics

Default worker metrics:

Name
Type
Description

Worker.ActiveClients

COUNTER

The number of clients actively reading from or writing to this worker

Worker.ActiveRpcReadCount

COUNTER

The number of active read-RPCs managed by this worker

Worker.ActiveRpcWriteCount

COUNTER

The number of active write-RPCs managed by this worker

Worker.BlockReaderCompleteTaskCount

GAUGE

The approximate total number of block read tasks that have completed execution

Worker.BlockReaderThreadActiveCount

GAUGE

The approximate number of block read threads that are actively executing tasks in reader thread pool

Worker.BlockReaderThreadCurrentCount

GAUGE

The current number of read threads in the reader thread pool

Worker.BlockReaderThreadMaxCount

GAUGE

The maximum allowed number of block read threads in the reader thread pool

Worker.BlockRemoverBlocksRemovedCount

COUNTER

The total number of blocks successfully removed from this worker by the asynchronous block remover.

Worker.BlockRemoverRemovingBlocksSize

GAUGE

The size of blocks being removed from this worker at a given moment by the asynchronous block remover.

Worker.BlockRemoverTryRemoveBlocksSize

GAUGE

The number of blocks to be removed from this worker at a given moment by the asynchronous block remover.

Worker.BlockRemoverTryRemoveCount

COUNTER

The total number of blocks this worker attempted to remove with the asynchronous block remover.

Worker.BlockSerializedCompleteTaskCount

GAUGE

The approximate total number of block serialized tasks that have completed execution

Worker.BlockSerializedThreadActiveCount

GAUGE

The approximate number of block serialized threads that are actively executing tasks in serialized thread pool

Worker.BlockSerializedThreadCurrentCount

GAUGE

The current number of serialized threads in the serialized thread pool

Worker.BlockSerializedThreadMaxCount

GAUGE

The maximum allowed number of block serialized threads in the serialized thread pool

Worker.BlockWriterCompleteTaskCount

GAUGE

The approximate total number of block write tasks that have completed execution

Worker.BlockWriterThreadActiveCount

GAUGE

The approximate number of block write threads that are actively executing tasks in writer thread pool

Worker.BlockWriterThreadCurrentCount

GAUGE

The current number of write threads in the writer thread pool

Worker.BlockWriterThreadMaxCount

GAUGE

The maximum allowed number of block write threads in the writer thread pool

Worker.BlocksAccessed

COUNTER

Total number of times any one of the blocks in this worker is accessed.

Worker.BlocksCached

GAUGE

Total number of blocks used for caching data in an Alluxio worker

Worker.BlocksCancelled

COUNTER

Total number of aborted temporary blocks in this worker.

Worker.BlocksDeleted

COUNTER

Total number of deleted blocks in this worker by external requests.

Worker.BlocksEvicted

COUNTER

Total number of evicted blocks in this worker.

Worker.BlocksEvictionRate

METER

Block eviction rate in this worker.

Worker.BlocksLost

COUNTER

Total number of lost blocks in this worker.

Worker.BlocksPromoted

COUNTER

Total number of times any one of the blocks in this worker moved to a new tier.

Worker.BlocksReadLocal

COUNTER

Total number of local blocks read by this worker.

Worker.BlocksReadRemote

COUNTER

Total number of remote blocks read by this worker.

Worker.BlocksReadUfs

COUNTER

Total number of UFS blocks read by this worker.

Worker.BytesReadDirect

COUNTER

Total number of bytes read from this worker without external RPC involved. Data exists in worker storage or is fetched by this worker from underlying UFSes. This records data read by worker internal calls (e.g. a client embedded in this worker).

Worker.BytesReadDirectThroughput

METER

Throughput of bytes read from this worker without external RPC involved. Data exists in worker storage or is fetched by this worker from underlying UFSes. This records data read by worker internal calls (e.g. a client embedded in this worker).

Worker.BytesReadDomain

COUNTER

Total number of bytes read from this worker via domain socket

Worker.BytesReadDomainThroughput

METER

Bytes read throughput from this worker via domain socket

Worker.BytesReadPerUfs

COUNTER

Total number of bytes read from a specific Alluxio UFS by this worker

Worker.BytesReadRemote

COUNTER

Total number of bytes read from this worker via network (RPC). Data exists in worker storage or is fetched by this worker from underlying UFSes. This does not include short-circuit local reads and domain socket reads.

Worker.BytesReadRemoteThroughput

METER

Throughput of bytes read from this worker via network (RPC). Data exists in worker storage or is fetched by this worker from underlying UFSes. This does not include short-circuit local reads and domain socket reads.

Worker.BytesReadUfsThroughput

METER

Bytes read throughput from all Alluxio UFSes by this worker

Worker.BytesWrittenDirect

COUNTER

Total number of bytes written to this worker without external RPC involved. Data is written to worker storage or is written by this worker to underlying UFSes. This records data written by worker internal calls (e.g. a client embedded in this worker).

Worker.BytesWrittenDirectThroughput

METER

Throughput of bytes written to this worker without external RPC involved. Data is written to worker storage or is written by this worker to underlying UFSes. This records data written by worker internal calls (e.g. a client embedded in this worker).

Worker.BytesWrittenDomain

COUNTER

Total number of bytes written to this worker via domain socket

Worker.BytesWrittenDomainThroughput

METER

Throughput of bytes written to this worker via domain socket

Worker.BytesWrittenPerUfs

COUNTER

Total number of bytes written to a specific Alluxio UFS by this worker

Worker.BytesWrittenRemote

COUNTER

Total number of bytes written to this worker via network (RPC). Data is written to worker storage or is written by this worker to underlying UFSes. This does not include short-circuit local writes and domain socket writes.

Worker.BytesWrittenRemoteThroughput

METER

Bytes write throughput to this worker via network (RPC). Data is written to worker storage or is written by this worker to underlying UFSes. This does not include short-circuit local writes and domain socket writes.

Worker.BytesWrittenUfsThroughput

METER

Bytes write throughput to all Alluxio UFSes by this worker

Worker.CacheBlocksSize

COUNTER

Total number of bytes that are being cached through cache requests

Worker.CacheFailedBlocks

COUNTER

Total number of failed cache blocks in this worker

Worker.CacheManagerCompleteTaskCount

GAUGE

The approximate total number of block cache tasks that have completed execution

Worker.CacheManagerThreadActiveCount

GAUGE

The approximate number of block cache threads that are actively executing tasks in the cache manager thread pool

Worker.CacheManagerThreadCurrentCount

GAUGE

The current number of cache threads in the cache manager thread pool

Worker.CacheManagerThreadMaxCount

GAUGE

The maximum allowed number of block cache threads in the cache manager thread pool

Worker.CacheManagerThreadQueueWaitingTaskCount

GAUGE

The current number of tasks waiting in the work queue in the cache manager thread pool, bounded by alluxio.worker.network.async.cache.manager.queue.max

Worker.CacheRemoteBlocks

COUNTER

Total number of blocks that need to be cached from remote source

Worker.CacheRequests

COUNTER

Total number of cache requests received by this worker

Worker.CacheRequestsAsync

COUNTER

Total number of async cache requests received by this worker

Worker.CacheRequestsSync

COUNTER

Total number of sync cache requests received by this worker

Worker.CacheSucceededBlocks

COUNTER

Total number of successfully cached blocks in this worker

Worker.CacheUfsBlocks

COUNTER

Total number of blocks that need to be cached from local source

Worker.CapacityFree

GAUGE

Total free bytes on all tiers of a specific Alluxio worker

Worker.CapacityTotal

GAUGE

Total capacity (in bytes) on all tiers of a specific Alluxio worker

Worker.CapacityUsed

GAUGE

Total used bytes on all tiers of a specific Alluxio worker

Worker.MasterRegistrationSuccessCount

COUNTER

Total number of successful master registrations.

Worker.RpcQueueLength

GAUGE

Length of the worker RPC queue. Use this metric to monitor the RPC pressure on worker.

Worker.RpcThreadActiveCount

GAUGE

The number of threads that are actively executing tasks in the worker RPC executor thread pool. Use this metric to monitor the RPC pressure on worker.

Worker.RpcThreadCurrentCount

GAUGE

Current count of threads in the worker RPC executor thread pool. Use this metric to monitor the RPC pressure on worker.

Dynamically generated worker metrics:

Metric Name
Description

Worker.UfsSessionCount-Ufs:{UFS_ADDRESS}

The total number of currently opened UFS sessions to connect to the given {UFS_ADDRESS}

Worker.{RPC_NAME}

The duration statistics of RPC calls exposed on workers
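Several worker gauges are easier to read as ratios than as raw values. The following minimal sketch derives a capacity ratio and thread pool saturation from one worker's gauges. It assumes the gauges have been collected into a dictionary keyed by metric name; the helper names and the values in `sample` are hypothetical.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def worker_pressure(gauges):
    """Derive utilization ratios from one worker's gauge snapshot."""
    return {
        "capacity_used": ratio(gauges["Worker.CapacityUsed"],
                               gauges["Worker.CapacityTotal"]),
        "reader_pool": ratio(gauges["Worker.BlockReaderThreadActiveCount"],
                             gauges["Worker.BlockReaderThreadMaxCount"]),
        "writer_pool": ratio(gauges["Worker.BlockWriterThreadActiveCount"],
                             gauges["Worker.BlockWriterThreadMaxCount"]),
    }


# Hypothetical snapshot of a single worker, for illustration only.
sample = {
    "Worker.CapacityUsed": 750,
    "Worker.CapacityTotal": 1000,
    "Worker.BlockReaderThreadActiveCount": 8,
    "Worker.BlockReaderThreadMaxCount": 32,
    "Worker.BlockWriterThreadActiveCount": 16,
    "Worker.BlockWriterThreadMaxCount": 64,
}
print(worker_pressure(sample))
```

A pool ratio that stays close to 1.0 suggests the corresponding thread pool is saturated, while the capacity ratio shows how full the worker storage is.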

Client Metrics

Each client metric is recorded with its local hostname, or with alluxio.user.app.id if that property is configured. If alluxio.user.app.id is configured, multiple clients can be combined into a logical application.

Name
Type
Description

Client.BlockMasterClientCount

COUNTER

Number of instances in the BlockMasterClientPool.

Client.BlockReadChunkRemote

TIMER

The timer statistics of reading block data in chunks from remote Alluxio workers via the RPC framework. This metric is only recorded when alluxio.user.block.read.metrics.enabled is set to true

Client.BlockWorkerClientCount

COUNTER

Number of instances in the BlockWorkerClientPool.

Client.BusyExceptionCount

COUNTER

Total number of BusyException observed

Client.BytesReadLocal

COUNTER

Total number of bytes short-circuit read from worker data storage that collocates with the client

Client.BytesReadLocalThroughput

METER

Bytes throughput short-circuit read from worker data storage that collocated with this client

Client.BytesWrittenLocal

COUNTER

Total number of bytes short-circuit written to local storage by this client

Client.BytesWrittenLocalThroughput

METER

Bytes throughput short-circuit written to local storage by this client

Client.BytesWrittenUfs

COUNTER

Total number of bytes written to Alluxio UFS by this client

Client.CacheBytesDiscarded

METER

Total number of bytes discarded when restoring the page store.

Client.CacheBytesEvicted

METER

Total number of bytes evicted from the client cache.

Client.CacheBytesReadCache

METER

Total number of bytes read from the client cache.

Client.CacheBytesReadExternal

METER

Total number of bytes read from external storage due to a cache miss on the client cache.

Client.CacheBytesReadInStreamBuffer

METER

Total number of bytes read from the client cache's in stream buffer.

Client.CacheBytesRequestedExternal

METER

Total number of bytes the user requested to read which resulted in a cache miss. This number may be smaller than Client.CacheBytesReadExternal due to chunk reads.

Client.CacheBytesWrittenCache

METER

Total number of bytes written to the client cache.

Client.CacheCleanErrors

COUNTER

Number of failures when cleaning out the existing cache directory to initialize a new cache.

Client.CacheCleanupGetErrors

COUNTER

Number of failures when cleaning up a failed cache read.

Client.CacheCleanupPutErrors

COUNTER

Number of failures when cleaning up a failed cache write.

Client.CacheCreateErrors

COUNTER

Number of failures when creating a cache in the client cache.

Client.CacheDeleteErrors

COUNTER

Number of failures when deleting cached data in the client cache.

Client.CacheDeleteFromStoreErrors

COUNTER

Number of failures when deleting pages from page stores.

Client.CacheDeleteNonExistingPageErrors

COUNTER

Number of failures when deleting pages due to absence.

Client.CacheDeleteNotReadyErrors

COUNTER

Number of failures when cache is not ready to delete pages.

Client.CacheGetErrors

COUNTER

Number of failures when getting cached data in the client cache.

Client.CacheGetNotReadyErrors

COUNTER

Number of failures when cache is not ready to get pages.

Client.CacheGetStoreReadErrors

COUNTER

Number of failures when getting cached data in the client cache due to failed read from page stores.

Client.CacheHitRate

GAUGE

Cache hit rate: (# bytes read from cache) / (# bytes requested).

Client.CachePageReadCacheTimeNanos

METER

Time in nanoseconds taken to read a page from the client cache when the cache hits.

Client.CachePageReadExternalTimeNanos

METER

Time in nanoseconds taken to read a page from external source when the cache misses.

Client.CachePages

COUNTER

Total number of pages in the client cache.

Client.CachePagesDiscarded

METER

Total number of pages discarded when restoring the page store.

Client.CachePagesEvicted

METER

Total number of pages evicted from the client cache.

Client.CachePutAsyncRejectionErrors

COUNTER

Number of failures when putting cached data in the client cache due to failed injection to async write queue.

Client.CachePutBenignRacingErrors

COUNTER

Number of failures when adding pages due to racing eviction. This error is benign.

Client.CachePutErrors

COUNTER

Number of failures when putting cached data in the client cache.

Client.CachePutEvictionErrors

COUNTER

Number of failures when putting cached data in the client cache due to failed eviction.

Client.CachePutInsufficientSpaceErrors

COUNTER

Number of failures when putting cached data in the client cache due to insufficient space made after eviction.

Client.CachePutNotReadyErrors

COUNTER

Number of failures when cache is not ready to add pages.

Client.CachePutStoreDeleteErrors

COUNTER

Number of failures when putting cached data in the client cache due to failed deletes in page store.

Client.CachePutStoreWriteErrors

COUNTER

Number of failures when putting cached data in the client cache due to failed writes to page store.

Client.CachePutStoreWriteNoSpaceErrors

COUNTER

Number of failures when putting cached data in the client cache because the disk is full although the cache capacity has not been reached. This can happen if the storage overhead ratio to write data is underestimated.

Client.CacheShadowCacheBytes

COUNTER

Amount of bytes in the client shadow cache.

Client.CacheShadowCacheBytesHit

COUNTER

Total number of bytes that hit the client shadow cache.

Client.CacheShadowCacheBytesRead

COUNTER

Total number of bytes read from the client shadow cache.

Client.CacheShadowCacheFalsePositiveRatio

COUNTER

Probability that the working set bloom filter makes an error. The value ranges from 0 to 100. If it is too high, allocate more space.

Client.CacheShadowCachePages

COUNTER

Amount of pages in the client shadow cache.

Client.CacheShadowCachePagesHit

COUNTER

Total number of pages that hit the client shadow cache.

Client.CacheShadowCachePagesRead

COUNTER

Total number of pages read from the client shadow cache.

Client.CacheSpaceAvailable

GAUGE

Amount of bytes available in the client cache.

Client.CacheSpaceUsed

GAUGE

Amount of bytes used by the client cache.

Client.CacheSpaceUsedCount

COUNTER

Amount of bytes used by the client cache as a counter.

Client.CacheState

COUNTER

State of the cache: 0 (NOT_IN_USE), 1 (READ_ONLY) and 2 (READ_WRITE)

Client.CacheStoreDeleteTimeout

COUNTER

Number of timeouts when deleting pages from page store.

Client.CacheStoreGetTimeout

COUNTER

Number of timeouts when reading pages from page store.

Client.CacheStorePutTimeout

COUNTER

Number of timeouts when writing new pages to page store.

Client.CacheStoreThreadsRejected

COUNTER

Number of times I/O threads were rejected when submitting tasks to the thread pool, likely due to an unresponsive local file system.

Client.CloseAlluxioOutStreamLatency

TIMER

Latency of closing the Alluxio outstream

Client.CloseUFSOutStreamLatency

TIMER

Latency of closing the UFS outstream

Client.DefaultHiveClientCount

COUNTER

Number of instances in the DefaultHiveClientPool.

Client.FileSystemMasterClientCount

COUNTER

Number of instances in the FileSystemMasterClientPool.

Client.MetadataCacheSize

GAUGE

The total number of files and directories whose metadata is cached on the client-side. Only valid if the filesystem is alluxio.client.file.MetadataCachingBaseFileSystem.
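The ratio given for Client.CacheHitRate in the table above can be reproduced from the byte meters. The sketch below assumes that "bytes requested" is the sum of Client.CacheBytesReadCache and Client.CacheBytesRequestedExternal; the table only states the ratio, so this decomposition is an assumption, and the sample values are hypothetical.

```python
def cache_hit_rate(bytes_read_cache, bytes_requested_external):
    """Hit rate = bytes served from cache / total bytes requested.

    Assumes total requested bytes = cache hits + requested-on-miss bytes.
    Returns 0.0 when nothing has been requested yet.
    """
    requested = bytes_read_cache + bytes_requested_external
    return bytes_read_cache / requested if requested else 0.0


# Hypothetical meter counts in bytes, for illustration only.
print(cache_hit_rate(bytes_read_cache=300, bytes_requested_external=100))
```

Client.CacheBytesRequestedExternal is used here rather than Client.CacheBytesReadExternal because, as noted in the table, chunk reads can make the latter larger than what the user actually asked for.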

Fuse Metrics

Fuse is a long-running Alluxio client. Depending on how it is launched, Fuse metrics are reported as

  • client metrics when the Fuse client is launched in a standalone AlluxioFuse process.

  • worker metrics when the Fuse client is embedded in the AlluxioWorker process.

Fuse metrics include:

Name
Type
Description

Fuse.CachedPathCount

GAUGE

Total number of FUSE-to-Alluxio path mappings being cached. This value will be less than or equal to alluxio.fuse.cached.paths.max

Fuse.ReadWriteFileCount

GAUGE

Total number of files being opened for reading or writing concurrently.

Fuse.TotalCalls

TIMER

Throughput of JNI FUSE operation calls. This metric indicates how busy the Alluxio Fuse application is serving requests

The Fuse reading/writing file count can be used as an indicator of Fuse application pressure. If a large number of concurrent reads/writes occur in a short period of time, each read/write operation may take longer to finish.

When a user or an application runs a filesystem command under the Fuse mount point, the command is processed and translated by the operating system, which triggers the related Fuse operations exposed in AlluxioFuse. The number of times each operation is called and the duration of each call are recorded dynamically under the metric name Fuse.<FUSE_OPERATION_NAME>.

The important Fuse metrics include:

Metric Name
Description

Fuse.readdir

The duration metrics of listing a directory

Fuse.getattr

The duration metrics of getting the metadata of a file

Fuse.open

The duration metrics of opening a file for read or overwrite

Fuse.read

The duration metrics of reading a part of a file

Fuse.create

The duration metrics of creating a file for write

Fuse.write

The duration metrics of writing a file

Fuse.release

The duration metrics of closing a file after read or write. Note that release is async so fuse threads will not wait for release to finish

Fuse.mkdir

The duration metrics of creating a directory

Fuse.unlink

The duration metrics of removing a file or a directory

Fuse.rename

The duration metrics of renaming a file or a directory

Fuse.chmod

The duration metrics of modifying the mode of a file or a directory

Fuse.chown

The duration metrics of modifying the user and/or group ownership of a file or a directory
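Because the per-operation Fuse metrics are generated dynamically, a monitoring script usually has to discover them by prefix. The following minimal sketch ranks Fuse operations by mean duration. It assumes each timer has been reduced to a `(call count, mean duration in ms)` pair; that shape, the helper name, and the sample values are hypothetical.

```python
# The three fixed Fuse metrics listed earlier are not per-operation timers.
DEFAULT_FUSE_METRICS = {
    "Fuse.CachedPathCount",
    "Fuse.ReadWriteFileCount",
    "Fuse.TotalCalls",
}


def slowest_fuse_operations(timers, top_n=3):
    """Return (operation, count, mean_ms) tuples, slowest mean first."""
    ops = [
        (name[len("Fuse."):], count, mean_ms)
        for name, (count, mean_ms) in timers.items()
        if name.startswith("Fuse.") and name not in DEFAULT_FUSE_METRICS
    ]
    ops.sort(key=lambda op: op[2], reverse=True)
    return ops[:top_n]


# Hypothetical timer summaries, for illustration only.
timers = {
    "Fuse.readdir": (120, 35.0),
    "Fuse.getattr": (5000, 0.8),
    "Fuse.read": (20000, 2.5),
    "Fuse.TotalCalls": (25120, 2.2),
}
print(slowest_fuse_operations(timers, top_n=2))
```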

Fuse related metrics include:

  • Client.TotalRPCClients shows the total number of RPC clients that are in use or can be used to connect to the master or workers for operations.

  • Worker metrics with the Direct keyword. When Fuse is embedded in the worker process, it can go through the worker internal API to read from / write to this worker. The related metrics end with Direct. For example, Worker.BytesReadDirect shows how many bytes are served by this worker to its embedded Fuse client for read.

  • If alluxio.user.block.read.metrics.enabled=true is configured, Client.BlockReadChunkRemote will be recorded. This metric shows the duration statistics of reading data from remote workers via gRPC.

Client.TotalRPCClients and Fuse.TotalCalls metrics are good indicators of the current load of the Fuse applications. If applications (e.g. TensorFlow) are running on top of Alluxio Fuse but these two metrics show much lower values than before, the training job may be stuck in Alluxio.
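The stuck-job heuristic above can be expressed as a simple comparison against a baseline. The sketch below flags a possible stall when both metrics drop below a fraction of their earlier values. The 10% threshold, the helper name, and the sample readings are hypothetical and should be tuned for a real workload.

```python
LOAD_METRICS = ("Client.TotalRPCClients", "Fuse.TotalCalls")


def looks_stalled(baseline, current, drop_threshold=0.1):
    """True when every load metric fell below drop_threshold * baseline."""
    return all(
        current[name] < drop_threshold * baseline[name]
        for name in LOAD_METRICS
    )


# Hypothetical readings: Fuse.TotalCalls is taken as a calls-per-second rate.
baseline = {"Client.TotalRPCClients": 40, "Fuse.TotalCalls": 1200.0}
stalled = {"Client.TotalRPCClients": 2, "Fuse.TotalCalls": 15.0}
healthy = {"Client.TotalRPCClients": 38, "Fuse.TotalCalls": 1100.0}

print(looks_stalled(baseline, stalled))
print(looks_stalled(baseline, healthy))
```

Requiring both metrics to drop reduces false alarms from a job that is merely idle between phases, although it cannot distinguish a stall from a job that has finished.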

Process Common Metrics

The following metrics are collected on each instance (Master, Worker or Client).

JVM Attributes

Metric Name
Description

name

The name of the JVM

uptime

The uptime of the JVM

vendor

The current JVM vendor

Garbage Collector Statistics

Metric Name
Description

PS-MarkSweep.count

Total number of mark-and-sweep collections

PS-MarkSweep.time

The time used to mark and sweep

PS-Scavenge.count

Total number of scavenge collections

PS-Scavenge.time

The time used to scavenge
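The count and time metrics are cumulative, so they are most useful as differences between two samples. The following minimal sketch computes the mean time per collection over an interval. It assumes two snapshots keyed by metric name; the time unit is whatever the time metric reports, and the sample values are hypothetical.

```python
def mean_gc_pause(before, after, collector):
    """Mean time per collection between two snapshots.

    Returns None if the collector did not run in the interval.
    """
    runs = after[f"{collector}.count"] - before[f"{collector}.count"]
    spent = after[f"{collector}.time"] - before[f"{collector}.time"]
    return spent / runs if runs > 0 else None


# Hypothetical snapshots taken some time apart, for illustration only.
before = {"PS-Scavenge.count": 100, "PS-Scavenge.time": 2000}
after = {"PS-Scavenge.count": 150, "PS-Scavenge.time": 3500}
print(mean_gc_pause(before, after, "PS-Scavenge"))
```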

Memory Usage

Alluxio provides overall and detailed memory usage information. Detailed memory usage information of code cache, compressed class space, metaspace, PS Eden space, PS old gen, and PS survivor space is collected in each process.

A subset of the memory usage metrics is listed below:

Metric Name
Description

total.committed

The amount of memory in bytes that is guaranteed to be available for use by the JVM

total.init

The amount of memory in bytes that is available for use by the JVM at initialization

total.max

The maximum amount of memory in bytes that is available for use by the JVM

total.used

The amount of memory currently used in bytes

heap.committed

The amount of memory from heap area guaranteed to be available

heap.init

The amount of memory from heap area available at initialization

heap.max

The maximum amount of memory from heap area that is available

heap.usage

The amount of memory from heap area currently used in GB

heap.used

The amount of memory from heap area that has been used

pools.Code-Cache.used

Used memory of collection usage from the pool from which memory is used for compilation and storage of native code

pools.Compressed-Class-Space.used

Used memory of collection usage from the pool from which memory is used for class metadata

pools.PS-Eden-Space.used

Used memory of collection usage from the pool from which memory is initially allocated for most objects

pools.PS-Survivor-Space.used

Used memory of collection usage from the pool containing objects that have survived the garbage collection of the Eden space
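A common derived value is the fraction of the heap in use. The sketch below computes it from heap.used and heap.max. It assumes both values are reported in the same unit and that an undefined maximum shows up as a non-positive value; the helper name and sample values are hypothetical.

```python
def heap_used_fraction(metrics):
    """Return heap.used / heap.max, or None when heap.max is not defined."""
    max_bytes = metrics["heap.max"]
    if max_bytes <= 0:
        # Assumption: an undefined maximum is reported as a non-positive value.
        return None
    return metrics["heap.used"] / max_bytes


# Hypothetical values in bytes, for illustration only.
GIB = 1024 ** 3
print(heap_used_fraction({"heap.used": 3 * GIB, "heap.max": 4 * GIB}))
```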

ClassLoading Statistics

Metric Name
Description

loaded

The total number of classes loaded

unloaded

The total number of unloaded classes

Thread Statistics

Metric Name
Description

count

The current number of live threads

daemon.count

The current number of live daemon threads

peak.count

The peak live thread count

total_started.count

The total number of threads started

deadlock.count

The number of deadlocked threads

deadlock

The call stack of each deadlocked thread

new.count

The number of threads with new state

blocked.count

The number of threads with blocked state

runnable.count

The number of threads with runnable state

terminated.count

The number of threads with terminated state

timed_waiting.count

The number of threads with timed_waiting state
