# Day 2: Data Ingestion

## What You Will Learn Today
- Types of data inputs
- Monitoring files and directories
- Network inputs (TCP/UDP)
- Configuring sourcetypes
- The basics of inputs.conf
## Types of Data Inputs

Splunk supports a wide variety of methods for ingesting data.

```mermaid
flowchart TB
    subgraph Inputs["Data Input Methods"]
        File["Files / Directories<br>monitor"]
        Network["Network<br>TCP / UDP"]
        Script["Scripted Input<br>scripted input"]
        HEC["HTTP Event Collector<br>HEC"]
        API["REST API<br>modular input"]
    end
    style File fill:#3b82f6,color:#fff
    style Network fill:#22c55e,color:#fff
    style Script fill:#f59e0b,color:#fff
    style HEC fill:#8b5cf6,color:#fff
    style API fill:#ef4444,color:#fff
```
| Input Method | Description | Use Case |
|---|---|---|
| File Monitor | Watches files and directories | Log files |
| Network | Receives data on TCP/UDP ports | syslog |
| HEC | Sends data over HTTP | Application logs |
| Scripted Input | Captures script output | Custom data collection |
| Modular Input | Data collection via add-ons | Cloud service integrations |
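
This page focuses on the first three methods plus HEC. As a taste of scripted inputs, here is a minimal inputs.conf sketch; the script path and names are hypothetical:

```conf
# Hypothetical scripted input: run disk_usage.sh every 300 seconds
# and index whatever it writes to stdout
[script:///opt/splunk/etc/apps/myapp/bin/disk_usage.sh]
interval = 300
sourcetype = disk_usage
index = main
disabled = false
```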
## File Monitoring (Monitor Input)

### Configuring via the Web UI

- Go to Settings > Data inputs > Files & directories
- Select New Local File & Directory
- Enter the path (e.g., `/var/log/syslog`)
- Choose or create a sourcetype
- Select an index
### Configuring via inputs.conf

```conf
# $SPLUNK_HOME/etc/system/local/inputs.conf

# Monitor a single file
[monitor:///var/log/syslog]
disabled = false
index = main
sourcetype = syslog

# Monitor all files in a directory
[monitor:///var/log/apache2/]
disabled = false
index = web
sourcetype = access_combined

# Monitor files matching a pattern
[monitor:///opt/app/logs/*.log]
disabled = false
index = application
sourcetype = app_log
```
### Monitor Configuration Options

| Parameter | Description | Example |
|---|---|---|
| `disabled` | Enable or disable the input | `false` |
| `index` | Destination index | `main` |
| `sourcetype` | Sourcetype assignment | `syslog` |
| `host` | Override the host name | `web-01` |
| `ignoreOlderThan` | Skip files older than this | `7d` |
| `whitelist` | Include files matching pattern | `\.log$` |
| `blacklist` | Exclude files matching pattern | `\.gz$` |
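
A sketch combining several of these options (the paths and names are illustrative):

```conf
# Monitor a directory, but only .log files, skipping compressed files
# and anything not modified within the last 7 days
[monitor:///opt/app/logs]
whitelist = \.log$
blacklist = \.gz$
ignoreOlderThan = 7d
index = application
sourcetype = app_log
```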
## Network Inputs

### Receiving syslog (UDP)

```conf
# inputs.conf
[udp://514]
disabled = false
sourcetype = syslog
index = syslog
```

Note: binding to a port below 1024 requires Splunk to run with root privileges. In production, a dedicated syslog server (or a higher port such as 5514) in front of Splunk is a common alternative.
### TCP Input

```conf
# Raw TCP input (plain-text events arriving on this port)
[tcp://5140]
disabled = false
sourcetype = raw_tcp
```

Note: plain `tcp://` stanzas receive raw data streams. Data from forwarders arrives in Splunk's own (cooked) format and requires a `splunktcp://` stanza instead; TCP port 9997 is the conventional port for that, as shown below.
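
A minimal receiving stanza on the indexer side; this is what running `splunk enable listen 9997` writes:

```conf
# inputs.conf on the indexer: receive cooked data from Universal Forwarders
[splunktcp://9997]
disabled = false
```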
## HTTP Event Collector (HEC)

HEC lets applications send data to Splunk directly over HTTP.

### Enabling HEC

- Go to Settings > Data inputs > HTTP Event Collector
- Under Global Settings, enable HEC
- Click New Token to create a token
### Sending Data

```bash
curl -k https://localhost:8088/services/collector/event \
  -H "Authorization: Splunk YOUR_HEC_TOKEN" \
  -d '{"event": "Hello from HEC!", "sourcetype": "manual", "index": "main"}'
```
### Sending JSON Data

```bash
curl -k https://localhost:8088/services/collector/event \
  -H "Authorization: Splunk YOUR_HEC_TOKEN" \
  -d '{
    "event": {
      "action": "login",
      "user": "alice",
      "status": "success",
      "ip": "192.168.1.10"
    },
    "sourcetype": "app_json",
    "index": "main"
  }'
```
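
HEC also accepts several events in one request: concatenate the JSON objects in the request body (no commas or wrapping array), which reduces HTTP overhead for chatty applications. A sketch:

```bash
curl -k https://localhost:8088/services/collector/event \
  -H "Authorization: Splunk YOUR_HEC_TOKEN" \
  -d '{"event": "first event"}{"event": "second event"}{"event": "third event"}'
```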
```mermaid
flowchart LR
    App["Application"]
    HEC["HEC<br>:8088"]
    Indexer["Indexer"]
    App -->|"HTTP POST<br>+ Token"| HEC --> Indexer
    style App fill:#3b82f6,color:#fff
    style HEC fill:#22c55e,color:#fff
    style Indexer fill:#f59e0b,color:#fff
```
## Sourcetypes

A sourcetype tells Splunk how to interpret and parse the format of your data.

### Built-in Sourcetypes
| Sourcetype | Description |
|---|---|
| `syslog` | Syslog format |
| `access_combined` | Apache Combined Log Format |
| `csv` | CSV format |
| `json` | JSON format |
| `_json` | Auto-detected JSON |
| `log4j` | Java Log4j format |
### Custom Sourcetypes

```conf
# props.conf
[my_app_log]
TIME_FORMAT = %Y-%m-%d %H:%M:%S
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 19
LINE_BREAKER = ([\r\n]+)
SHOULD_LINEMERGE = false
```
| Parameter | Description |
|---|---|
| `TIME_FORMAT` | Timestamp format string |
| `TIME_PREFIX` | Characters preceding the timestamp |
| `MAX_TIMESTAMP_LOOKAHEAD` | Max characters to scan for a timestamp |
| `LINE_BREAKER` | Pattern that separates events |
| `SHOULD_LINEMERGE` | Whether to merge multiple lines into one event |
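
After editing props.conf, you can check what Splunk actually resolved for your sourcetype with btool:

```bash
# Print the effective settings for my_app_log and the file each one comes from
$SPLUNK_HOME/bin/splunk btool props list my_app_log --debug
```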
## Creating Indexes

### Creating via the Web UI

- Go to Settings > Indexes
- Click New Index
- Enter a name (e.g., `web_logs`)
- Configure retention period and size limits
### Configuring via indexes.conf

```conf
# indexes.conf
[web_logs]
homePath = $SPLUNK_DB/web_logs/db
coldPath = $SPLUNK_DB/web_logs/colddb
thawedPath = $SPLUNK_DB/web_logs/thaweddb
maxDataSize = auto_high_volume
maxTotalDataSizeMB = 50000
# 90 days (keep comments on their own line; .conf files do not support inline comments)
frozenTimePeriodInSecs = 7776000
```
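
An index can also be created from the CLI with default settings (assuming Splunk is installed under /opt/splunk):

```bash
# Create the index with default settings; paths and limits can be tuned later
/opt/splunk/bin/splunk add index web_logs
```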
## The Data Lifecycle

```mermaid
flowchart LR
    Hot["Hot<br>New data<br>Writable"]
    Warm["Warm<br>Recent data<br>Read-only"]
    Cold["Cold<br>Older data<br>Slow storage"]
    Frozen["Frozen<br>Deleted or archived"]
    Hot --> Warm --> Cold --> Frozen
    style Hot fill:#ef4444,color:#fff
    style Warm fill:#f59e0b,color:#fff
    style Cold fill:#3b82f6,color:#fff
    style Frozen fill:#8b5cf6,color:#fff
```
| Bucket | Description | Characteristics |
|---|---|---|
| Hot | Newest data | Writable, fast storage |
| Warm | Recent data | Read-only, fast storage |
| Cold | Older data | Read-only, slower storage |
| Frozen | Expired data | Deleted or archived |
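
By default, frozen buckets are deleted. To archive them instead, point the index at an archive directory; a sketch (the path is illustrative):

```conf
# indexes.conf: copy buckets to an archive directory when they freeze
[web_logs]
coldToFrozenDir = /archive/splunk/web_logs
```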
## Universal Forwarder

In production environments, the Universal Forwarder (UF) is used to ship data to indexers.

```bash
# Install the Universal Forwarder (Linux)
# Get the current download URL from splunk.com; the one below is illustrative
wget -O splunkforwarder.tgz "https://download.splunk.com/products/universalforwarder/releases/latest/linux/splunkforwarder-latest-Linux-x86_64.tgz"
tar xvzf splunkforwarder.tgz -C /opt

# Initial setup
/opt/splunkforwarder/bin/splunk start --accept-license
/opt/splunkforwarder/bin/splunk add forward-server indexer01:9997
/opt/splunkforwarder/bin/splunk add monitor /var/log/syslog
```
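
The `add forward-server` command writes an outputs.conf on the forwarder roughly like the following (the group name shown is the CLI default), which you can also manage by hand or distribute via a deployment server:

```conf
# /opt/splunkforwarder/etc/system/local/outputs.conf
[tcpout]
defaultGroup = default-autolb-group

[tcpout:default-autolb-group]
server = indexer01:9997
```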
```mermaid
flowchart LR
    Server1["Web Server<br>UF"]
    Server2["App Server<br>UF"]
    Server3["DB Server<br>UF"]
    Indexer["Splunk<br>Indexer<br>:9997"]
    Server1 --> Indexer
    Server2 --> Indexer
    Server3 --> Indexer
    style Server1 fill:#3b82f6,color:#fff
    style Server2 fill:#3b82f6,color:#fff
    style Server3 fill:#3b82f6,color:#fff
    style Indexer fill:#22c55e,color:#fff
```
## Hands-On: Ingesting Data from Multiple Sources

Create the following three log files and ingest them into Splunk.

### access.log (Web Access Log)

```
192.168.1.10 - alice [30/Jan/2026:10:00:01 +0900] "GET /index.html HTTP/1.1" 200 2048
192.168.1.20 - bob [30/Jan/2026:10:00:05 +0900] "POST /api/login HTTP/1.1" 401 128
192.168.1.10 - alice [30/Jan/2026:10:00:10 +0900] "GET /dashboard HTTP/1.1" 200 4096
```
### app.log (Application Log)

```
2026-01-30 10:00:01 INFO [main] Application started successfully
2026-01-30 10:00:05 WARN [db-pool] Connection pool running low: 2/10
2026-01-30 10:00:10 ERROR [api] NullPointerException at UserService.java:42
```
### auth.log (Authentication Log)

```
Jan 30 10:00:01 server01 sshd[1234]: Accepted publickey for alice from 192.168.1.10 port 22
Jan 30 10:00:05 server01 sshd[1235]: Failed password for bob from 192.168.1.20 port 22
Jan 30 10:00:10 server01 sshd[1236]: Failed password for root from 10.0.0.1 port 22
```
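
One way to wire these up quickly, assuming an all-in-one Splunk under /opt/splunk and the three files saved under /tmp/sample_logs:

```bash
# Register each file with a sourcetype that matches its format
/opt/splunk/bin/splunk add monitor /tmp/sample_logs/access.log -sourcetype access_combined -index main
/opt/splunk/bin/splunk add monitor /tmp/sample_logs/app.log -sourcetype app_log -index main
/opt/splunk/bin/splunk add monitor /tmp/sample_logs/auth.log -sourcetype syslog -index main
```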
## Summary
| Concept | Description |
|---|---|
| Monitor Input | Watch files and directories for new data |
| Network Input | Receive data over TCP/UDP |
| HEC | Send data over HTTP |
| Sourcetype | Defines how Splunk parses data |
| Index | A data repository |
| Universal Forwarder | A lightweight agent for shipping data |
| inputs.conf | The configuration file for data inputs |
## Key Takeaways
- Monitor Input is the most common way to ingest log files
- HEC is ideal for sending data directly from applications
- Sourcetypes define the parsing rules for your data
- Use the Universal Forwarder in production environments
## Exercises

### Exercise 1: Basic

Ingest the three sample log files into Splunk, assigning a different sourcetype to each.

### Exercise 2: Applied

Enable HEC and use curl to send JSON data. Verify that the data appears in Splunk search results.

### Challenge

Create a custom sourcetype in props.conf with your own timestamp format and field extraction rules for a custom log format.
Next up: In Day 3, you will learn the fundamentals of searching, mastering the basics of SPL (Search Processing Language).