List of Metrics

There are two types of metrics in Alluxio: cluster-wide aggregated metrics and per-process detailed metrics.

  • Cluster metrics are collected and calculated by the leading master and displayed in the metrics tab of the web UI. These metrics are designed to provide a snapshot of the cluster state and the overall amount of data and metadata served by Alluxio.

  • Process metrics are collected by each Alluxio process and exposed in a machine-readable format through any configured sinks. Process metrics are highly detailed and are intended to be consumed by third-party monitoring tools. Users can then view fine-grained dashboards with time-series graphs of each metric, such as data transferred or the number of RPC invocations.

Metrics in Alluxio have the following format for master node metrics:

Master.[metricName].[tag1].[tag2]...

Metrics in Alluxio have the following format for non-master node metrics:

[processType].[metricName].[tag1].[tag2]...[hostName]

There is generally an Alluxio metric for every RPC invocation, to Alluxio or to the under store.

Tags are additional pieces of metadata for the metric such as user name or under storage location. Tags can be used to further filter or aggregate on various characteristics.
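The naming formats above can be illustrated with a small parser. This is only a sketch under stated assumptions: it assumes that no component contains a dot, that every tag has a `key:value` form, and that the last component of a non-master metric without a colon is the host name. The helper `parse_metric_name` and the sample names in the usage note are hypothetical and not part of Alluxio.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MetricName:
    process_type: str                 # e.g. "Master", "Worker", "Client"
    metric_name: str
    tags: Dict[str, str] = field(default_factory=dict)
    host_name: Optional[str] = None   # only expected on non-master metrics


def parse_metric_name(full_name: str) -> MetricName:
    """Split a dotted metric name into process type, name, tags and host."""
    parts = full_name.split(".")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"not a valid metric name: {full_name!r}")
    process_type, metric_name, rest = parts[0], parts[1], parts[2:]

    host_name = None
    # Assumption: a trailing component without ':' on a non-master metric
    # is the host name; everything else after the metric name is a tag.
    if process_type != "Master" and rest and ":" not in rest[-1]:
        host_name = rest.pop()

    tags = {}
    for part in rest:
        key, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"tag without 'key:value' form: {part!r}")
        tags[key] = value
    return MetricName(process_type, metric_name, tags, host_name)
```

For example, `parse_metric_name("Worker.BytesReadPerUfs.UFS:hdfs_nn.User:alice.worker-1")` would yield the tags `UFS` and `User` and the host name `worker-1`; the tag values and host name here are made up for illustration.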

Cluster Metrics

Workers and clients send metrics data to the Alluxio master through heartbeats. The intervals are defined by the properties alluxio.master.worker.heartbeat.interval and alluxio.user.metrics.heartbeat.interval respectively.

Bytes metrics are values aggregated from workers or clients. Bytes throughput metrics are calculated on the leading master. The value of a bytes throughput metric equals the bytes metric counter value divided by the metrics record time, and is shown as bytes per minute.
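The throughput calculation described above can be sketched as follows. The function name and the choice of seconds as the unit of the record time are assumptions made for illustration; the master's internal implementation may differ.

```python
def bytes_per_minute(counter_bytes: int, record_time_seconds: float) -> float:
    """Divide a bytes counter by the elapsed record time, scaled to minutes."""
    if record_time_seconds <= 0:
        # No elapsed time recorded yet, so no meaningful rate can be derived.
        return 0.0
    elapsed_minutes = record_time_seconds / 60.0
    return counter_bytes / elapsed_minutes
```

With a counter of 6,000,000 bytes recorded over 120 seconds, the function returns 3,000,000 bytes per minute.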

Name
Type
Description

Cluster.ActiveRpcReadCount

COUNTER

The number of active read-RPCs managed by workers

Cluster.ActiveRpcWriteCount

COUNTER

The number of active write-RPCs managed by workers

Cluster.BytesReadDirect

COUNTER

Total number of bytes read from Alluxio storage managed by workers and underlying UFS if data cannot be found in the Alluxio storage without external RPC involved. This records data read by worker internal calls (e.g. clients embedded in workers).

Cluster.BytesReadDirectThroughput

GAUGE

Bytes read per minute throughput from Alluxio storage managed by workers and underlying UFS if data cannot be found in the Alluxio storage without external RPC involved. This records data read by worker internal calls (e.g. clients embedded in workers).

Cluster.BytesReadDomain

COUNTER

Total number of bytes read from Alluxio storage via domain socket reported by all workers

Cluster.BytesReadDomainThroughput

GAUGE

Bytes read per minute throughput from Alluxio storage via domain socket by all workers

Cluster.BytesReadLocal

COUNTER

Total number of bytes short-circuit read from local storage by all clients

Cluster.BytesReadLocalThroughput

GAUGE

Bytes per minute throughput short-circuit read from local storage by all clients

Cluster.BytesReadPerUfs

COUNTER

Total number of bytes read from a specific UFS by all workers

Cluster.BytesReadRemote

COUNTER

Total number of bytes read from Alluxio storage or underlying UFS if data does not exist in Alluxio storage reported by all workers. This does not include short-circuit local reads and domain socket reads

Cluster.BytesReadRemoteThroughput

GAUGE

Bytes read per minute throughput from Alluxio storage or underlying UFS if data does not exist in Alluxio storage reported by all workers. This does not include short-circuit local reads and domain socket reads

Cluster.BytesReadUfsAll

COUNTER

Total number of bytes read from all Alluxio UFSes by all workers

Cluster.BytesReadUfsThroughput

GAUGE

Bytes read per minute throughput from all Alluxio UFSes by all workers

Cluster.BytesWrittenDomain

COUNTER

Total number of bytes written to Alluxio storage via domain socket by all workers

Cluster.BytesWrittenDomainThroughput

GAUGE

Throughput of bytes written per minute to Alluxio storage via domain socket by all workers

Cluster.BytesWrittenLocal

COUNTER

Total number of bytes short-circuit written to local storage by all clients

Cluster.BytesWrittenLocalThroughput

GAUGE

Bytes per minute throughput written to local storage by all clients

Cluster.BytesWrittenPerUfs

COUNTER

Total number of bytes written to a specific Alluxio UFS by all workers

Cluster.BytesWrittenRemote

COUNTER

Total number of bytes written to Alluxio storage in all workers or the underlying UFS. This does not include short-circuit local writes and domain socket writes.

Cluster.BytesWrittenRemoteThroughput

GAUGE

Bytes written per minute throughput to Alluxio storage in all workers or the underlying UFS. This does not include short-circuit local writes and domain socket writes.

Cluster.BytesWrittenUfsAll

COUNTER

Total number of bytes written to all Alluxio UFSes by all workers

Cluster.BytesWrittenUfsThroughput

GAUGE

Bytes written per minute throughput to all Alluxio UFSes by all workers

Cluster.CacheHitRate

GAUGE

Cache hit rate: (# bytes read from cache) / (# bytes requested)

Cluster.CapacityFree

GAUGE

Total free bytes on all tiers, on all workers of Alluxio

Cluster.CapacityTotal

GAUGE

Total capacity (in bytes) on all tiers, on all workers of Alluxio

Cluster.CapacityUsed

GAUGE

Total used bytes on all tiers, on all workers of Alluxio

Cluster.LostWorkers

GAUGE

Total number of lost workers inside the cluster

Cluster.RootUfsCapacityFree

GAUGE

Free capacity of the Alluxio root UFS in bytes

Cluster.RootUfsCapacityTotal

GAUGE

Total capacity of the Alluxio root UFS in bytes

Cluster.RootUfsCapacityUsed

GAUGE

Used capacity of the Alluxio root UFS in bytes

Cluster.Workers

GAUGE

Total number of active workers inside the cluster
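As a rough illustration of how the cluster gauges in the table above can be combined, the sketch below computes the cache hit rate from the formula given for Cluster.CacheHitRate and a capacity utilization ratio from Cluster.CapacityUsed and Cluster.CapacityTotal. The function names and the dictionary-shaped snapshot are assumptions for this example, not an Alluxio API.

```python
from typing import Mapping


def cache_hit_rate(bytes_read_from_cache: int, bytes_requested: int) -> float:
    """(# bytes read from cache) / (# bytes requested), 0.0 when nothing was requested."""
    if bytes_requested <= 0:
        return 0.0
    return bytes_read_from_cache / bytes_requested


def capacity_utilization(snapshot: Mapping[str, float]) -> float:
    """Fraction of total worker capacity currently in use."""
    total = snapshot.get("Cluster.CapacityTotal", 0)
    used = snapshot.get("Cluster.CapacityUsed", 0)
    return used / total if total > 0 else 0.0
```

Both helpers guard against a zero denominator, which can occur right after startup before any worker has registered or any read has been served.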

Server Metrics

Metrics shared by the Alluxio server processes.

Name
Type
Description

Server.JvmPauseMonitorInfoTimeExceeded

GAUGE

The total number of times that the JVM slept and the sleep period was larger than the info level threshold defined by alluxio.jvm.monitor.info.threshold

Server.JvmPauseMonitorTotalExtraTime

GAUGE

The total time that the JVM slept and didn't do GC

Server.JvmPauseMonitorWarnTimeExceeded

GAUGE

The total number of times that the JVM slept and the sleep period was larger than the warn level threshold defined by alluxio.jvm.monitor.warn.threshold

Master Metrics

Default master metrics:

Name
Type
Description

Master.AbsentCacheHits

GAUGE

Number of cache hits on the absent cache

Master.AbsentCacheMisses

GAUGE

Number of cache misses on the absent cache

Master.AbsentCacheSize

GAUGE

Size of the absent cache

Master.AuditLogEntriesSize

GAUGE

The size of the audit log entries blocking queue

Master.BlockHeapSize

GAUGE

An estimate of the blocks heap size

Master.BlockReplicaCount

GAUGE

Total number of block replicas in Alluxio

Master.CompleteFileOps

COUNTER

Total number of the CompleteFile operations

Master.CompletedOperationRetryCount

COUNTER

Total number of completed operations that have been retried by clients.

Master.CreateDirectoryOps

COUNTER

Total number of the CreateDirectory operations

Master.CreateFileOps

COUNTER

Total number of the CreateFile operations

Master.DeletePathOps

COUNTER

Total number of the Delete operations

Master.DirectoriesCreated

COUNTER

Total number of successful CreateDirectory operations

Master.EdgeCacheEvictions

GAUGE

Total number of edges (inode metadata) that were evicted from cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeCacheHits

GAUGE

Total number of hits in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeCacheLoadTimes

GAUGE

Total load times in the edge (inode metadata) cache that resulted from a cache miss. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeCacheMisses

GAUGE

Total number of misses in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeCacheSize

GAUGE

Total number of edges (inode metadata) cached. The edge cache is responsible for managing the mapping from (parentId, childName) to childId.

Master.EdgeLockPoolSize

GAUGE

The size of master edge lock pool

Master.EmbeddedJournalSnapshotDownloadGenerate

TIMER

Describes the amount of time taken to download journal snapshots from other masters in the cluster. Only valid when using the embedded journal. Use this metric to determine if there are potential communication bottlenecks between Alluxio masters.

Master.EmbeddedJournalSnapshotGenerateTimer

TIMER

Describes the amount of time taken to generate local journal snapshots on this master. Only valid when using the embedded journal. Use this metric to measure the performance of Alluxio's snapshot generation.

Master.EmbeddedJournalSnapshotInstallTimer

TIMER

Describes the amount of time taken to install a downloaded journal snapshot from another master. Only valid when using the embedded journal. Use this metric to determine the performance of Alluxio when installing snapshots from the leader. Higher numbers may indicate a slow disk or CPU contention.

Master.EmbeddedJournalSnapshotLastIndex

GAUGE

Represents the latest journal index that was recorded by this master in the most recent local snapshot or from a snapshot downloaded from another master in the cluster. Only valid when using the embedded journal.

Master.EmbeddedJournalSnapshotReplayTimer

TIMER

Describes the amount of time taken to replay a journal snapshot onto the master's state machine. Only valid when using the embedded journal. Use this metric to determine the performance of Alluxio when replaying a journal snapshot file. Higher numbers may indicate a slow disk or CPU contention.

Master.FileBlockInfosGot

COUNTER

Total number of successful GetFileBlockInfo operations

Master.FileInfosGot

COUNTER

Total number of successful GetFileInfo operations

Master.FileSize

GAUGE

File size distribution

Master.FilesCompleted

COUNTER

Total number of successful CompleteFile operations

Master.FilesCreated

COUNTER

Total number of successful CreateFile operations

Master.FilesFreed

COUNTER

Total number of successful FreeFile operations

Master.FilesPersisted

COUNTER

Total number of successfully persisted files

Master.FilesPinned

GAUGE

Total number of currently pinned files

Master.FilesToBePersisted

GAUGE

Total number of files currently waiting to be persisted

Master.FreeFileOps

COUNTER

Total number of FreeFile operations

Master.GetFileBlockInfoOps

COUNTER

Total number of GetFileBlockInfo operations

Master.GetFileInfoOps

COUNTER

Total number of the GetFileInfo operations

Master.GetNewBlockOps

COUNTER

Total number of the GetNewBlock operations

Master.InodeCacheEvictions

GAUGE

Total number of inodes that were evicted from the cache.

Master.InodeCacheHitRatio

GAUGE

Inode Cache hit ratio

Master.InodeCacheHits

GAUGE

Total number of hits in the inodes (inode metadata) cache.

Master.InodeCacheLoadTimes

GAUGE

Total load times in the inodes (inode metadata) cache that resulted from a cache miss.

Master.InodeCacheMisses

GAUGE

Total number of misses in the inodes (inode metadata) cache.

Master.InodeCacheSize

GAUGE

Total number of inodes (inode metadata) cached.

Master.InodeHeapSize

GAUGE

An estimate of the inode heap size

Master.InodeLockPoolSize

GAUGE

The size of master inode lock pool

Master.JournalEntriesSinceCheckPoint

GAUGE

Journal entries since last checkpoint

Master.JournalFlushFailure

COUNTER

Total number of failed journal flushes

Master.JournalFlushTimer

TIMER

The timer statistics of journal flush

Master.JournalFreeBytes

GAUGE

Bytes left on the journal disk(s) for an Alluxio master. This metric is only valid on Linux and when embedded journal is used. Use this metric to monitor whether your journal is running out of disk space.

Master.JournalFreePercent

GAUGE

Percentage of free space left on the journal disk(s) for an Alluxio master. This metric is only valid on Linux and when embedded journal is used. Use this metric to monitor whether your journal is running out of disk space.

Master.JournalGainPrimacyTimer

TIMER

The timer statistics of journal gain primacy

Master.JournalLastAppliedCommitIndex

GAUGE

The last raft log index which was applied to the state machine

Master.JournalLastCheckPointTime

GAUGE

Last Journal Checkpoint Time

Master.JournalSequenceNumber

GAUGE

Current journal sequence number

Master.LastBackupEntriesCount

GAUGE

The total number of entries written in the last leading master metadata backup

Master.LastBackupRestoreCount

GAUGE

The total number of entries restored from backup when a leading master initializes its metadata

Master.LastBackupRestoreTimeMs

GAUGE

The process time of the last restore from backup

Master.LastBackupTimeMs

GAUGE

The process time of the last backup

Master.ListingCacheEvictions

COUNTER

The total number of evictions in master listing cache

Master.ListingCacheHits

COUNTER

The total number of hits in master listing cache

Master.ListingCacheLoadTimes

COUNTER

The total load time (in nanoseconds) in master listing cache that resulted from a cache miss.

Master.ListingCacheMisses

COUNTER

The total number of misses in master listing cache

Master.ListingCacheSize

GAUGE

The size of master listing cache

Master.LostBlockCount

GAUGE

Count of lost unique blocks

Master.LostFileCount

GAUGE

Count of lost files. This number is cached and may not be in sync with Master.LostBlockCount

Master.MountOps

COUNTER

Total number of Mount operations

Master.NewBlocksGot

COUNTER

Total number of successful GetNewBlock operations

Master.PathsDeleted

COUNTER

Total number of successful Delete operations

Master.PathsMounted

COUNTER

Total number of successful Mount operations

Master.PathsRenamed

COUNTER

Total number of successful Rename operations

Master.PathsUnmounted

COUNTER

Total number of successful Unmount operations

Master.RenamePathOps

COUNTER

Total number of Rename operations

Master.ReplicaMgmtActiveJobSize

GAUGE

Number of active block replication/eviction jobs. These jobs are created by the master to maintain the block replica factor. The value is an estimate with lag.

Master.RpcQueueLength

GAUGE

Length of the master rpc queue. Use this metric to monitor the RPC pressure on master.

Master.SetAclOps

COUNTER

Total number of SetAcl operations

Master.SetAttributeOps

COUNTER

Total number of SetAttribute operations

Master.TotalPaths

GAUGE

Total number of files and directories in the Alluxio namespace

Master.TotalRpcs

TIMER

Throughput of master RPC calls. This metric indicates how busy the master is serving client and worker requests

Master.UfsJournalCatchupTimer

TIMER

The timer statistics of journal catchup. Only valid when UFS journal is used. This provides a summary of how long a standby master takes to catch up with the primary master, and should be monitored if master transition takes too long

Master.UfsJournalFailureRecoverTimer

TIMER

The timer statistics of UFS journal failure recovery

Master.UfsJournalInitialReplayTimeMs

GAUGE

The process time of the UFS journal initial replay. Only valid when UFS journal is used. It records the time it took for the very first journal replay. Use this metric to monitor when your master boot-up time is high.

Master.UniqueBlocks

GAUGE

Total number of unique blocks in Alluxio

Master.UnmountOps

COUNTER

Total number of Unmount operations
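Several counters in the table above come in pairs: one counts every invocation of an operation (for example Master.CreateFileOps) and another counts only the successful ones (Master.FilesCreated). The sketch below derives a success ratio per operation from such pairs. The pairing table is read off the descriptions above; treating the two counters as directly comparable, the function name, and the dictionary-shaped snapshot are assumptions for illustration.

```python
from typing import Dict, Mapping

# Operation counter -> counter of successful operations, per the table above.
OP_TO_SUCCESS = {
    "Master.CompleteFileOps": "Master.FilesCompleted",
    "Master.CreateDirectoryOps": "Master.DirectoriesCreated",
    "Master.CreateFileOps": "Master.FilesCreated",
    "Master.DeletePathOps": "Master.PathsDeleted",
    "Master.FreeFileOps": "Master.FilesFreed",
    "Master.GetFileBlockInfoOps": "Master.FileBlockInfosGot",
    "Master.GetFileInfoOps": "Master.FileInfosGot",
    "Master.GetNewBlockOps": "Master.NewBlocksGot",
    "Master.MountOps": "Master.PathsMounted",
    "Master.RenamePathOps": "Master.PathsRenamed",
    "Master.UnmountOps": "Master.PathsUnmounted",
}


def success_ratios(snapshot: Mapping[str, int]) -> Dict[str, float]:
    """Return successes / attempts for every operation that was attempted."""
    ratios = {}
    for op_metric, ok_metric in OP_TO_SUCCESS.items():
        attempts = snapshot.get(op_metric, 0)
        if attempts > 0:  # skip operations that were never invoked
            ratios[op_metric] = snapshot.get(ok_metric, 0) / attempts
    return ratios
```

A ratio that drops well below 1.0 for an operation suggests that many invocations are failing and may be worth investigating in the master logs.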

Dynamically generated master metrics:

Metric Name
Description

Master.CapacityTotalTier<TIER_NAME>

Total capacity in tier <TIER_NAME> of the Alluxio file system in bytes

Master.CapacityUsedTier<TIER_NAME>

Used capacity in tier <TIER_NAME> of the Alluxio file system in bytes

Master.CapacityFreeTier<TIER_NAME>

Free capacity in tier <TIER_NAME> of the Alluxio file system in bytes

Master.UfsSessionCount-Ufs:<UFS_ADDRESS>

The total number of currently opened UFS sessions to connect to the given <UFS_ADDRESS>

Master.<UFS_RPC_NAME>.UFS:<UFS_ADDRESS>.UFS_TYPE:<UFS_TYPE>.User:

The details of the UFS RPC operations performed by the current master

Master.PerUfsOp<UFS_RPC_NAME>.UFS:<UFS_ADDRESS>

The aggregated number of UFS operations <UFS_RPC_NAME> run on UFS <UFS_ADDRESS> by the leading master

Master.<LEADING_MASTER_RPC_NAME>

The duration statistics of RPC calls exposed on leading master
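The dynamically generated names above are templates whose placeholders are filled in at runtime. As a minimal sketch, the per-tier capacity names can be expanded for a given list of tier names as follows. The helper name is hypothetical, and the tier names used in the usage note are example values only.

```python
from typing import List

# Name templates taken from the table above; "{tier}" stands for <TIER_NAME>.
TIER_TEMPLATES = (
    "Master.CapacityTotalTier{tier}",
    "Master.CapacityUsedTier{tier}",
    "Master.CapacityFreeTier{tier}",
)


def tier_metric_names(tier_names: List[str]) -> List[str]:
    """Expand every capacity template for every configured tier name."""
    return [t.format(tier=tier) for tier in tier_names for t in TIER_TEMPLATES]
```

Calling `tier_metric_names(["MEM"])` produces `Master.CapacityTotalTierMEM`, `Master.CapacityUsedTierMEM` and `Master.CapacityFreeTierMEM`, which can then be used as lookup keys in a monitoring tool.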

Worker Metrics

Default worker metrics:

Name
Type
Description

Worker.ActiveClients

COUNTER

The number of clients actively reading from or writing to this worker

Worker.ActiveRpcReadCount

COUNTER

The number of active read-RPCs managed by this worker

Worker.ActiveRpcWriteCount

COUNTER

The number of active write-RPCs managed by this worker

Worker.BlockReaderCompleteTaskCount

GAUGE

The approximate total number of block read tasks that have completed execution

Worker.BlockReaderThreadActiveCount

GAUGE

The approximate number of block read threads that are actively executing tasks in reader thread pool

Worker.BlockReaderThreadCurrentCount

GAUGE

The current number of read threads in the reader thread pool

Worker.BlockReaderThreadMaxCount

GAUGE

The maximum allowed number of block read threads in the reader thread pool

Worker.BlockRemoverBlocksRemovedCount

COUNTER

The total number of blocks successfully removed from this worker by asynchronous block remover.

Worker.BlockRemoverRemovingBlocksSize

GAUGE

The size of blocks being removed from this worker at a moment by asynchronous block remover.

Worker.BlockRemoverTryRemoveBlocksSize

GAUGE

The number of blocks to be removed from this worker at a moment by asynchronous block remover.

Worker.BlockRemoverTryRemoveCount

COUNTER

The total number of blocks this worker attempted to remove with asynchronous block remover.

Worker.BlockWriterCompleteTaskCount

GAUGE

The approximate total number of block write tasks that have completed execution

Worker.BlockWriterThreadActiveCount

GAUGE

The approximate number of block write threads that are actively executing tasks in writer thread pool

Worker.BlockWriterThreadCurrentCount

GAUGE

The current number of write threads in the writer thread pool

Worker.BlockWriterThreadMaxCount

GAUGE

The maximum allowed number of block write threads in the writer thread pool

Worker.BlocksAccessed

COUNTER

Total number of times any one of the blocks in this worker is accessed.

Worker.BlocksCached

GAUGE

Total number of blocks used for caching data in an Alluxio worker

Worker.BlocksCancelled

COUNTER

Total number of aborted temporary blocks in this worker.

Worker.BlocksDeleted

COUNTER

Total number of deleted blocks in this worker by external requests.

Worker.BlocksEvicted

COUNTER

Total number of evicted blocks in this worker.

Worker.BlocksEvictionRate

METER

Block eviction rate in this worker.

Worker.BlocksLost

COUNTER

Total number of lost blocks in this worker.

Worker.BlocksPromoted

COUNTER

Total number of times any one of the blocks in this worker moved to a new tier.

Worker.BlocksReadLocal

COUNTER

Total number of local blocks read by this worker.

Worker.BlocksReadRemote

COUNTER

Total number of remote blocks read by this worker.

Worker.BlocksReadUfs

COUNTER

Total number of UFS blocks read by this worker.

Worker.BytesReadDirect

COUNTER

Total number of bytes read from Alluxio storage managed by this worker and underlying UFS if data cannot be found in the Alluxio storage without external RPC involved. This records data read by worker internal calls (e.g. a client embedded in this worker).

Worker.BytesReadDirectThroughput

METER

Bytes read throughput from Alluxio storage managed by this worker and underlying UFS if data cannot be found in the Alluxio storage without external RPC involved. This records data read by worker internal calls (e.g. a client embedded in this worker).

Worker.BytesReadDomain

COUNTER

Total number of bytes read from Alluxio storage via domain socket by this worker

Worker.BytesReadDomainThroughput

METER

Bytes read throughput from Alluxio storage via domain socket by this worker

Worker.BytesReadPerUfs

COUNTER

Total number of bytes read from a specific Alluxio UFS by this worker

Worker.BytesReadRemote

COUNTER

Total number of bytes read from Alluxio storage managed by this worker and underlying UFS if data cannot be found in the Alluxio storage via external RPC channel. This does not include short-circuit local reads and domain socket reads.

Worker.BytesReadRemoteThroughput

METER

Bytes read throughput from Alluxio storage managed by this worker and underlying UFS if data cannot be found in the Alluxio storage via external RPC channel. This does not include short-circuit local reads and domain socket reads.

Worker.BytesReadUfsThroughput

METER

Bytes read throughput from all Alluxio UFSes by this worker

Worker.BytesWrittenDirect

COUNTER

Total number of bytes written to Alluxio storage managed by this worker without external RPC involved. This records data written by worker internal calls (e.g. a client embedded in this worker).

Worker.BytesWrittenDirectThroughput

METER

Bytes written throughput to Alluxio storage managed by this worker without external RPC involved. This records data written by worker internal calls (e.g. a client embedded in this worker).

Worker.BytesWrittenDomain

COUNTER

Total number of bytes written to Alluxio storage via domain socket by this worker

Worker.BytesWrittenDomainThroughput

METER

Throughput of bytes written to Alluxio storage via domain socket by this worker

Worker.BytesWrittenPerUfs

COUNTER

Total number of bytes written to a specific Alluxio UFS by this worker

Worker.BytesWrittenRemote

COUNTER

Total number of bytes written to Alluxio storage or the underlying UFS by this worker. This does not include short-circuit local writes and domain socket writes.

Worker.BytesWrittenRemoteThroughput

METER

Bytes write throughput to Alluxio storage or the underlying UFS by this worker. This does not include short-circuit local writes and domain socket writes.

Worker.BytesWrittenUfsThroughput

METER

Bytes write throughput to all Alluxio UFSes by this worker

Worker.CacheBlocksSize

COUNTER

Total number of bytes being cached through cache requests

Worker.CacheFailedBlocks

COUNTER

Total number of failed cache blocks in this worker

Worker.CacheRemoteBlocks

COUNTER

Total number of blocks that need to be cached from remote source

Worker.CacheRequests

COUNTER

Total number of cache requests received by this worker

Worker.CacheRequestsAsync

COUNTER

Total number of async cache requests received by this worker

Worker.CacheRequestsSync

COUNTER

Total number of sync cache requests received by this worker

Worker.CacheSucceededBlocks

COUNTER

Total number of successfully cached blocks in this worker

Worker.CacheUfsBlocks

COUNTER

Total number of blocks that need to be cached from local source

Worker.CapacityFree

GAUGE

Total free bytes on all tiers of a specific Alluxio worker

Worker.CapacityTotal

GAUGE

Total capacity (in bytes) on all tiers of a specific Alluxio worker

Worker.CapacityUsed

GAUGE

Total used bytes on all tiers of a specific Alluxio worker
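The block read counters in the table above (Worker.BlocksReadLocal, Worker.BlocksReadRemote and Worker.BlocksReadUfs) can be combined to see where a worker's reads are served from. The sketch below assumes the three counters describe disjoint categories, which the descriptions suggest but do not state explicitly; the function name and snapshot shape are hypothetical.

```python
from typing import Dict, Mapping


def read_source_shares(snapshot: Mapping[str, int]) -> Dict[str, float]:
    """Return the fraction of block reads served from each source."""
    counts = {
        "local": snapshot.get("Worker.BlocksReadLocal", 0),
        "remote": snapshot.get("Worker.BlocksReadRemote", 0),
        "ufs": snapshot.get("Worker.BlocksReadUfs", 0),
    }
    total = sum(counts.values())
    if total == 0:
        # Nothing read yet: report zero shares instead of dividing by zero.
        return {source: 0.0 for source in counts}
    return {source: n / total for source, n in counts.items()}
```

A growing UFS share over time may indicate that the working set no longer fits into Alluxio storage on this worker.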

Dynamically generated worker metrics:

Metric Name
Description

Worker.UfsSessionCount-Ufs:<UFS_ADDRESS>

The total number of currently opened UFS sessions to connect to the given <UFS_ADDRESS>

Worker.<RPC_NAME>

The duration statistics of RPC calls exposed on workers

Client Metrics

Each client metric is recorded with its local hostname, or with alluxio.user.app.id if it is configured. If alluxio.user.app.id is configured, multiple clients can be combined into a logical application.

Name
Type
Description

Client.BlockReadChunkRemote

TIMER

The timer statistics of reading block data in chunks from remote Alluxio workers via RPC framework. This metric will only be recorded when alluxio.user.block.read.metrics.enabled is set to true

Client.BytesReadLocal

COUNTER

Total number of bytes short-circuit read from local storage by this client

Client.BytesReadLocalThroughput

METER

Bytes throughput short-circuit read from local storage by this client

Client.BytesWrittenLocal

COUNTER

Total number of bytes short-circuit written to local storage by this client

Client.BytesWrittenLocalThroughput

METER

Bytes throughput short-circuit written to local storage by this client

Client.BytesWrittenUfs

COUNTER

Total number of bytes written to Alluxio UFS by this client

Client.CacheBytesEvicted

METER

Total number of bytes evicted from the client cache.

Client.CacheBytesReadCache

METER

Total number of bytes read from the client cache.

Client.CacheBytesReadExternal

METER

Total number of bytes read from external storage due to a cache miss on the client cache.

Client.CacheBytesRequestedExternal

METER

Total number of bytes the user requested to read which resulted in a cache miss. This number may be smaller than Client.CacheBytesReadExternal due to chunk reads.

Client.CacheBytesWrittenCache

METER

Total number of bytes written to the client cache.

Client.CacheCleanErrors

COUNTER

Number of failures when cleaning out the existing cache directory to initialize a new cache.

Client.CacheCleanupGetErrors

COUNTER

Number of failures when cleaning up a failed cache read.

Client.CacheCleanupPutErrors

COUNTER

Number of failures when cleaning up a failed cache write.

Client.CacheCreateErrors

COUNTER

Number of failures when creating a cache in the client cache.

Client.CacheDeleteErrors

COUNTER

Number of failures when deleting cached data in the client cache.

Client.CacheDeleteFromStoreErrors

COUNTER

Number of failures when deleting pages from page stores.

Client.CacheDeleteNonExistingPageErrors

COUNTER

Number of failures when deleting pages due to absence.

Client.CacheDeleteNotReadyErrors

COUNTER

Number of failures when cache is not ready to delete pages.

Client.CacheGetErrors

COUNTER

Number of failures when getting cached data in the client cache.

Client.CacheGetNotReadyErrors

COUNTER

Number of failures when cache is not ready to get pages.

Client.CacheGetStoreReadErrors

COUNTER

Number of failures when getting cached data in the client cache due to failed read from page stores.

Client.CacheHitRate

GAUGE

Cache hit rate: (# bytes read from cache) / (# bytes requested).

Client.CachePageReadCacheTimeNanos

METER

Time in nanoseconds taken to read a page from the client cache when the cache hits.

Client.CachePageReadExternalTimeNanos

METER

Time in nanoseconds taken to read a page from external source when the cache misses.

Client.CachePages

COUNTER

Total number of pages in the client cache.

Client.CachePagesEvicted

METER

Total number of pages evicted from the client cache.

Client.CachePutAsyncRejectionErrors

COUNTER

Number of failures when putting cached data in the client cache due to failed injection to async write queue.

Client.CachePutBenignRacingErrors

COUNTER

Number of failures when adding pages due to racing eviction. This error is benign.

Client.CachePutErrors

COUNTER

Number of failures when putting cached data in the client cache.

Client.CachePutEvictionErrors

COUNTER

Number of failures when putting cached data in the client cache due to failed eviction.

Client.CachePutInsufficientSpaceErrors

COUNTER

Number of failures when putting cached data in the client cache due to insufficient space made after eviction.

Client.CachePutNotReadyErrors

COUNTER

Number of failures when cache is not ready to add pages.

Client.CachePutStoreDeleteErrors

COUNTER

Number of failures when putting cached data in the client cache due to failed deletes in page store.

Client.CachePutStoreWriteErrors

COUNTER

Number of failures when putting cached data in the client cache due to failed writes to page store.

Client.CachePutStoreWriteNoSpaceErrors

COUNTER

Number of failures when putting cached data in the client cache because the disk is full although the cache capacity has not been reached. This can happen if the storage overhead ratio to write data is underestimated.

Client.CacheShadowCacheBytes

COUNTER

Amount of bytes in the client shadow cache.

Client.CacheShadowCacheBytesHit

COUNTER

Total number of bytes that hit the client shadow cache.

Client.CacheShadowCacheBytesRead

COUNTER

Total number of bytes read from the client shadow cache.

Client.CacheShadowCacheFalsePositiveRatio

COUNTER

Probability that the working set bloom filter makes an error. The value is 0-100. If it is too high, more space needs to be allocated

Client.CacheShadowCachePages

COUNTER

Amount of pages in the client shadow cache.

Client.CacheShadowCachePagesHit

COUNTER

Total number of pages that hit the client shadow cache.

Client.CacheShadowCachePagesRead

COUNTER

Total number of pages read from the client shadow cache.

Client.CacheSpaceAvailable

GAUGE

Amount of bytes available in the client cache.

Client.CacheSpaceUsed

GAUGE

Amount of bytes used by the client cache.

Client.CacheSpaceUsedCount

COUNTER

Amount of bytes used by the client cache as a counter.

Client.CacheState

COUNTER

State of the cache: 0 (NOT_IN_USE), 1 (READ_ONLY) and 2 (READ_WRITE)

Client.CacheStoreDeleteTimeout

COUNTER

Number of timeouts when deleting pages from page store.

Client.CacheStoreGetTimeout

COUNTER

Number of timeouts when reading pages from page store.

Client.CacheStorePutTimeout

COUNTER

Number of timeouts when writing new pages to page store.

Client.CacheStoreThreadsRejected

COUNTER

Number of rejections of I/O threads on submitting tasks to the thread pool, likely due to an unresponsive local file system.

Client.MetadataCacheSize

GAUGE

The total number of files and directories whose metadata is cached on the client-side. Only valid if the filesystem is alluxio.client.file.MetadataCachingBaseFileSystem.
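Client.CacheState in the table above is reported as a number. A monitoring script might map it back to the documented state names as sketched below. The enum and helper are illustrative names chosen for this example, not part of the Alluxio client API.

```python
from enum import IntEnum


class CacheState(IntEnum):
    """State values as documented for Client.CacheState."""
    NOT_IN_USE = 0
    READ_ONLY = 1
    READ_WRITE = 2


def decode_cache_state(value: int) -> CacheState:
    """Translate a reported value into a named state.

    Raises ValueError for values outside the documented range.
    """
    return CacheState(value)
```

For instance, `decode_cache_state(1).name` evaluates to `"READ_ONLY"`.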

Fuse Metrics

Fuse is a long-running Alluxio client. Depending on how it is launched, Fuse metrics are reported as

  • client metrics when the Fuse client is launched in a standalone AlluxioFuse process.

  • worker metrics when the Fuse client is embedded in the AlluxioWorker process.

Fuse metrics include:

Name
Type
Description

Fuse.BytesRead

COUNTER

Total number of bytes read through Fuse.read() operations.

Fuse.BytesToRead

COUNTER

Total number of bytes requested by Fuse.read() operations.

Fuse.CachedPathCount

GAUGE

Total number of FUSE-to-Alluxio path mappings being cached. This value will be less than or equal to alluxio.fuse.cached.paths.max

Fuse.ReadingFileCount

GAUGE

Total number of files being read concurrently.

Fuse.WritingFileCount

GAUGE

Total number of files being written concurrently.

Fuse reading/writing file counts can be used as an indicator of Fuse application pressure. If a large number of concurrent reads/writes occur in a short period of time, each read/write operation may take longer to finish.
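Fuse.BytesRead and Fuse.BytesToRead from the table above can be compared to see how much of the requested data was actually returned by Fuse.read() operations. The following is a minimal sketch; the function name and the dictionary-shaped snapshot are assumptions for illustration.

```python
from typing import Mapping, Optional


def fuse_read_fulfillment(snapshot: Mapping[str, int]) -> Optional[float]:
    """Return bytes read divided by bytes requested, or None if nothing was requested."""
    requested = snapshot.get("Fuse.BytesToRead", 0)
    if requested <= 0:
        return None
    return snapshot.get("Fuse.BytesRead", 0) / requested
```

A value below 1.0 is not necessarily an error, since reads near the end of a file can legitimately return fewer bytes than requested.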

When a user or an application runs a filesystem command under the Fuse mount point, the command is processed and translated by the operating system, which triggers the related Fuse operations exposed in AlluxioFuse. The number of times each operation is called and the duration of each call are recorded dynamically under the metric name Fuse.<FUSE_OPERATION_NAME>.

The important Fuse metrics include:

Metric Name
Description

Fuse.readdir

The duration metrics of listing a directory

Fuse.getattr

The duration metrics of getting the metadata of a file

Fuse.open

The duration metrics of opening a file for read

Fuse.read

The duration metrics of reading a part of a file

Fuse.create

The duration metrics of creating a file for write

Fuse.write

The duration metrics of writing a file

Fuse.release

The duration metrics of closing a file after read or write. Note that release is async so fuse threads will not wait for release to finish

Fuse.mkdir

The duration metrics of creating a directory

Fuse.unlink

The duration metrics of removing a file or a directory

Fuse.rename

The duration metrics of renaming a file or a directory

Fuse.chmod

The duration metrics of modifying the mode of a file or a directory

Fuse.chown

The duration metrics of modifying the user and/or group ownership of a file or a directory

Fuse related metrics include:

  • Worker metrics with the Direct keyword. When Fuse is embedded in the worker process, it can go through the worker internal API to read from / write to this worker. The related metrics end with Direct. For example, Worker.BytesReadDirect shows how many bytes are served by this worker to its embedded Fuse client for read.

  • If alluxio.user.block.read.metrics.enabled=true is configured, Client.BlockReadChunkRemote will be recorded. This metric shows the duration statistics of reading data from remote workers via gRPC.

Process Common Metrics

The following metrics are collected on each instance (Master, Worker or Client).

JVM Attributes

Metric Name
Description

name

The name of the JVM

uptime

The uptime of the JVM

vendor

The current JVM vendor

Garbage Collector Statistics

Metric Name
Description

PS-MarkSweep.count

Total number of mark and sweep

PS-MarkSweep.time

The time used to mark and sweep

PS-Scavenge.count

Total number of scavenge

PS-Scavenge.time

The time used to scavenge
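The count and time values above can be combined into an average time per collection. This sketch assumes both values are cumulative since JVM start and that the time is reported in a single consistent unit; the function name and snapshot shape are hypothetical.

```python
from typing import Mapping, Optional


def average_gc_time(snapshot: Mapping[str, float], collector: str) -> Optional[float]:
    """Return total collection time divided by collection count for one collector."""
    count = snapshot.get(f"{collector}.count", 0)
    total_time = snapshot.get(f"{collector}.time", 0)
    if count <= 0:
        # The collector has not run yet, so there is no average to report.
        return None
    return total_time / count
```

For example, a snapshot with `PS-MarkSweep.count` of 4 and `PS-MarkSweep.time` of 200 gives an average of 50.0 in the unit of the reported time.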

Memory Usage

Alluxio provides overall and detailed memory usage information. Detailed memory usage information of code cache, compressed class space, metaspace, PS Eden space, PS old gen, and PS survivor space is collected in each process.

A subset of the memory usage metrics is listed below:

Metric Name
Description

total.committed

The amount of memory in bytes that is guaranteed to be available for use by the JVM

total.init

The amount of the memory in bytes that is available for use by the JVM

total.max

The maximum amount of memory in bytes that is available for use by the JVM

total.used

The amount of memory currently used in bytes

heap.committed

The amount of memory from heap area guaranteed to be available

heap.init

The amount of memory from heap area available at initialization

heap.max

The maximum amount of memory from heap area that is available

heap.usage

The amount of memory from heap area currently used in GB

heap.used

The amount of memory from heap area that has been used

pools.Code-Cache.used

Used memory of collection usage from the pool from which memory is used for compilation and storage of native code

pools.Compressed-Class-Space.used

Used memory of collection usage from the pool from which memory is used for class metadata

pools.PS-Eden-Space.used

Used memory of collection usage from the pool from which memory is initially allocated for most objects

pools.PS-Survivor-Space.used

Used memory of collection usage from the pool containing objects that have survived the garbage collection of the Eden space
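As one way to use the memory metrics above, heap.used can be related to heap.max to estimate how close a process is to its heap limit. The sketch assumes both values are in the same unit and treats a non-positive maximum as undefined (the JVM reports -1 when no maximum is defined). The function name is hypothetical.

```python
from typing import Mapping, Optional


def heap_utilization(snapshot: Mapping[str, float]) -> Optional[float]:
    """Return heap.used / heap.max, or None when the maximum is undefined."""
    heap_max = snapshot.get("heap.max", -1)
    if heap_max <= 0:
        return None
    return snapshot.get("heap.used", 0) / heap_max
```

A ratio that stays close to 1.0 together with rising garbage collection times may point to a heap that is too small for the workload.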

ClassLoading Statistics

Metric Name
Description

loaded

The total number of classes loaded

unloaded

The total number of unloaded classes

Thread Statistics

Metric Name
Description

count

The current number of live threads

daemon.count

The current number of live daemon threads

peak.count

The peak live thread count

total_started.count

The total number of threads started

deadlock.count

The number of deadlocked threads

deadlock

The call stack of each thread involved in a deadlock

new.count

The number of threads with new state

blocked.count

The number of threads with blocked state

runnable.count

The number of threads with runnable state

terminated.count

The number of threads with terminated state

timed_waiting.count

The number of threads with timed_waiting state
