Kernel space programming topics.
In the diagram below you can see how IPFIRE works.
Communication between kernel and user space is handled by the functions
in ipfi_netl.c, together with the userspace library libnetl.c.
Two separate sockets carry control messages and logging information.
Packets arriving on the wire enter the firewall, which consults the policy database loaded
by the user (or users, since more than one is allowed) and allows or denies the traffic.
NAT can also be performed at this point. The verdict is sent to userspace through the
data socket.
Topics.
- IPFIRE Kernel programming.
- The IPFIRE-wall kernel code defines its own data structures in header files. They describe the shape of a rule
and the structures required to keep connection state and to perform network address translation.
As an example, the structure holding the information stored about a packet is shown below, taken from
includes/ipfi.h:
    /**
     * response: a negative number represents denial due to
     * the denial rule at position 'response', a positive one
     * represents permission due to match with permission
     * rule at position 'response'. 0 means no explicit rule
     * has been found.
     */
    typedef struct {
        /* see linux/skbuff.h */
        struct iphdr iphead;            /* ip header */
        union {
            struct tcphdr tcphead;      /* tcp header */
            struct udphdr udphead;
            struct icmphdr icmphead;
        } transport_header;
        unsigned short protocol;        /* packet protocol from driver */
        u8 direction:3,                 /* in, out or forward */
           state:1,                     /* if true, a match in state tables happened */
           nat:1,                       /* firewall has natted the connection */
           snat:1,
           badsum:1,                    /* bad checksum */
           external:1;                  /* packet is arriving on external interface (nat) */
        int response;
        deviceparams devpar;
    #ifdef ENABLE_RULENAME
        char rulename[RULENAMELEN];
    #endif
        /* id of packets: if counters reach ULONG_MAX, they are
         * reset to 0 */
        unsigned long packet_id;
        /* id of packet sent to userspace. Every time a packet
         * is sent to userspace, this counter is incremented */
        unsigned long logu_id;
        struct state_t st;
    } ipfire_info_t;
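The convention documented for the response field can be made concrete with a small decoder. This is a minimal userspace sketch, not code from IPFIRE: the names decode_response, struct decoded_response and the VERDICT_* constants are hypothetical and only illustrate the sign convention described in the comment above.

```c
/* Hypothetical verdict names, introduced only for this illustration. */
enum verdict {
    VERDICT_DENIED,     /* response < 0: a denial rule matched      */
    VERDICT_ALLOWED,    /* response > 0: a permission rule matched  */
    VERDICT_NO_MATCH    /* response == 0: no explicit rule found    */
};

struct decoded_response {
    enum verdict verdict;
    long rule_position;     /* position of the matching rule, 0 if none */
};

/* Split a 'response' value into a verdict and a rule position. */
struct decoded_response decode_response(int response)
{
    struct decoded_response d;

    if (response < 0) {
        d.verdict = VERDICT_DENIED;
        d.rule_position = -(long) response;
    } else if (response > 0) {
        d.verdict = VERDICT_ALLOWED;
        d.rule_position = response;
    } else {
        d.verdict = VERDICT_NO_MATCH;
        d.rule_position = 0;
    }
    return d;
}
```

For example, decode_response(-3) reports a denial caused by the denial rule at position 3, while decode_response(0) reports that no explicit rule matched.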
As the ipfire_info_t definition shows, the tcp, udp and icmp headers are present, together with some flags that keep track of events or particular situations. For instance, nat and snat indicate that the packet has been translated; badsum means that the packet arrived with a bad checksum (ip or tcp/udp), which will be signaled to userspace; external means that the packet belongs to a flow initiated by a host on an external network. The packet info itself stores the response of the firewall, and the connection state is kept in struct state_t. ipfire_info_t also has room for a short rule name (rulename) identifying the rule that matched, if any. This is the information that is sent to userspace and decoded by IPFIRE-user.
- Atomic context.
- Filtering and translation take place in an atomic context. This means that no special locks
should be needed when operating on the structures. Where the context is not atomic, it
was necessary to acquire the appropriate locks to prevent unexpected behaviour
caused by concurrent events. The retrieval of the user id of a process, described
later in this section, is one such case.
The particular context in which networking functions run did not allow me to develop a userspace filter without a great loss of performance, because network processing functions cannot sleep while waiting for a response from userspace.
When a packet addressed to the host arrives at the network card, it is saved in a temporary buffer in the device's memory. The network device then raises an interrupt. The interrupt handler allocates a new socket buffer (skb) and copies the packet into it. Once the data link protocol has been determined, the handler invokes netif_rx() to notify the Linux networking code that a new packet should be processed. The kernel uses a per-CPU queue for received packets, and execution continues in a softirq context (NET_RX_SOFTIRQ).
In this environment, the only way to implement a userspace firewall would have been to drop every packet received from the network, save it and send it to userspace for filtering. Each allowed packet would then have to be reinjected, reappearing out of nowhere to continue on its way. The poor performance of such a design is easy to see.
- Getting the owner of a process in userspace.
- The piece of code below shows how the owner of the process with a given pid
is retrieved. This ensures that a malicious user cannot forge his identity: he cannot appear to
have a user id different from the uid of the process he is running (in this case ipfire-userspace).
    struct task_struct* get_uid_from_pid(const pid_t uspace_pid)
    {
        struct task_struct* task;
        /* task list must not change while we are reading!
         * We are not in atomic */
        read_lock(&tasklist_lock);
        for_each_process(task) {
            if(task->pid == uspace_pid) {
                read_unlock(&tasklist_lock);
                return task;
            }
        }
        read_unlock(&tasklist_lock);
        return NULL;
    }
The get_uid_from_pid() code is taken from ipfi_netl.c, where the context is not atomic, and so read_lock is mandatory.
- Netlink sockets.
- The netlink interface is used to let kernel space and user space communicate
with each other. On the kernel side, a socket is created by means of
netlink_kernel_create(), which takes a pointer to the function to be invoked
when data arrives, as in the example below for the control socket:
    static int create_control_socket(void)
    {
        sknl_ipfi_control = netlink_kernel_create(NETLINK_IPFI_CONTROL,
                                                  nl_receive_control);
        userspace_control_pid = 0;
        if (sknl_ipfi_control == NULL) {
            printk("IPFIRE: create_socket(): failed to create netlink"
                   " control socket\n");
            return -1;
        } else
            return 0;
    }
nl_receive_control is called when data is available on the socket. The data then has to be extracted from the socket buffer:
    static inline void *extract_data(struct sk_buff *skb)
    {
        struct nlmsghdr *nlh;
        nlh = (struct nlmsghdr *) skb->data;
        return NLMSG_DATA(nlh);     /* pointer to data contained in nlh */
    }
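To make the message layout concrete, the following userspace sketch builds a netlink message in a plain buffer and reads its payload back with NLMSG_DATA(). It uses only the standard macros from linux/netlink.h; the helper names build_message and extract_payload are hypothetical, and the buffer stands in for skb->data, so this is an illustration of the idea rather than IPFIRE code.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allocate a netlink message that carries a copy
 * of 'payload'. The caller releases it with free(). */
struct nlmsghdr *build_message(const void *payload, size_t len)
{
    /* NLMSG_SPACE() = aligned header + aligned payload */
    struct nlmsghdr *nlh = calloc(1, NLMSG_SPACE(len));

    if (nlh == NULL)
        return NULL;
    nlh->nlmsg_len = NLMSG_LENGTH(len);     /* header + payload length */
    memcpy(NLMSG_DATA(nlh), payload, len);
    return nlh;
}

/* Same idea as extract_data(), applied to a plain buffer instead of
 * skb->data: the payload starts right after the aligned header. */
void *extract_payload(void *buffer)
{
    struct nlmsghdr *nlh = (struct nlmsghdr *) buffer;

    return NLMSG_DATA(nlh);
}
```

The payload pointer returned by extract_payload() lies NLMSG_HDRLEN bytes past the start of the message, which is why the receiver can cast it directly to the structure the sender stored there.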
NLMSG_DATA() is a macro that returns a pointer to the payload that follows the netlink message header. Netlink socket management is quite complex, so the interested reader is encouraged to study the sources patiently.
- Doubly linked lists.
- Kernel structures such as the connection state tables are stored in doubly linked lists. It is quite easy to insert and remove entries from such lists. The structure of a state table entry is shown in the next topic, where we also point out the presence of a timer structure.
- Timer structures.
- Timer structures and lists must be initialized. Timers are built in such a way that
when a timer expires, the function pointed to by a field of the structure is called.
In IPFIRE, timer expiry is used to delete an entry from a list. A timer can also be
modified to give an entry a longer life.
State table structure
    struct state_table {
        __u32 saddr;
        __u32 daddr;
        __u16 sport;
        __u16 dport;
        short direction;
        short id;
        unsigned short protocol;
        char in_devname[IFNAMSIZ];
        char out_devname[IFNAMSIZ];
    #ifdef ENABLE_RULENAME
        char rulename[RULENAMELEN];
    #endif
        struct timer_list timer_statelist;
        struct list_head list;
        struct state_t state;
    };
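The list member embedded in struct state_table follows the kernel's intrusive list pattern: the link lives inside the entry, and the entry is recovered from the link with container_of. The sketch below re-creates that pattern in userspace as an illustration. It is not the kernel's list.h: the names list_init, list_add_front, list_remove, struct toy_state and find_by_ports are hypothetical.

```c
#include <stddef.h>

/* Minimal userspace re-creation of an intrusive doubly linked list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its list member. */
#define container_of(ptr, type, member) \
    ((type *) ((char *) (ptr) - offsetof(type, member)))

/* An empty list is a head that points to itself in both directions. */
void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert 'entry' right after 'head'. */
void list_add_front(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink 'entry'; no traversal is needed because it knows its neighbours. */
void list_remove(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = entry->prev = entry;
}

/* Hypothetical, heavily reduced stand-in for a state table entry. */
struct toy_state {
    unsigned short sport, dport;
    struct list_head list;
};

/* Walk the list and return the entry with the given ports, or NULL. */
struct toy_state *find_by_ports(struct list_head *head,
                                unsigned short sport, unsigned short dport)
{
    struct list_head *pos;

    for (pos = head->next; pos != head; pos = pos->next) {
        struct toy_state *st = container_of(pos, struct toy_state, list);
        if (st->sport == sport && st->dport == dport)
            return st;
    }
    return NULL;
}
```

Because every entry carries both its neighbours, removing an expired entry is a constant-time operation, which suits entries that are deleted from a timer callback.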
In the state_table structure, you can see the struct timer_list. A timer structure is set up as in the following example, taken from ipfi_machine.c:
    void fill_timer_table_fields(struct state_table *state_t)
    {
        init_timer(&state_t->timer_statelist);
        state_t->timer_statelist.expires = jiffies + state_lifetime * HZ;
        state_t->timer_statelist.data = (unsigned long) state_t;
        state_t->timer_statelist.function = handle_keep_state_timeout;
    }
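The expiry arithmetic (jiffies + state_lifetime * HZ) and the idea of extending an entry's life can be modelled in userspace. This is a minimal sketch under stated assumptions: TOY_HZ, struct toy_timer and the toy_* functions are hypothetical, the tick counter is passed in explicitly instead of reading jiffies, and the comparison mirrors the wraparound-safe style of the kernel's time_after() macro.

```c
#define TOY_HZ 100UL    /* hypothetical tick rate: 100 ticks per second */

struct toy_timer {
    unsigned long expires;  /* tick count at which the entry dies */
};

/* True if tick 'a' is later than tick 'b', even across counter wraparound. */
int tick_after(unsigned long a, unsigned long b)
{
    return (long) (b - a) < 0;
}

/* Arm (or re-arm) the timer: the entry lives 'lifetime_secs' from 'now'.
 * Calling this again on a live entry gives it a longer life. */
void toy_timer_arm(struct toy_timer *t, unsigned long now,
                   unsigned long lifetime_secs)
{
    t->expires = now + lifetime_secs * TOY_HZ;
}

/* The timer has expired once 'now' has reached 'expires'. */
int toy_timer_expired(const struct toy_timer *t, unsigned long now)
{
    return !tick_after(t->expires, now);
}
```

Comparing the difference of the two counters, rather than the counters themselves, keeps the test correct when the tick counter overflows and restarts from zero.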
In fill_timer_table_fields(), the timer is first initialized, then its expiry moment is set. The data field holds the value passed to the function handle_keep_state_timeout() when the timer expires.
- Getsockopt interface.
- The getsockopt interface retrieves from the translation tables the original destination
address that was changed by destination address translation.
The interface is registered at startup and unregistered when the module is unloaded.
Have a look at ipfi_translation.c to see how this is achieved.
The functions involved are:
int get_original_dest(struct sock *sk, int optval, void __user *user, int *len);
int lookup_dnat_table_and_getorigdst(const ipfire_info_t* iit, struct sockaddr_in* sin);
and
int get_orig_from_dnat_entry(const struct dnatted_table* dnt, const ipfire_info_t* iit, struct sockaddr_in* sin);
- Distinguish between external and internal NAT.
- When applying destination NAT to a connection, things differ depending on whether the host that
originated the communication is external or internal with respect to the network. See
scenario 4 and scenario 5 for an explanation and
an example. The code that distinguishes these cases is quite simple and is a direct application of
the RFC 1918 specification:
    /* returns 1 if network address is a private one conforming
     * to rfc 1918, 0 otherwise */
    inline int private_address(__u32 addr)
    {
        __u32 haddr = ntohl(addr);
        /* See RFC 1918: "Address allocation for Private Internets" */
        /* Class A private network: 10.0.0.0 - 10.255.255.255 */
        if( (haddr >= 0x0a000000) && (haddr <= 0x0affffff) )
            return 1;
        /* Class B private network: 172.16.0.0 - 172.31.255.255 */
        if( (haddr >= 0xac100000) && (haddr <= 0xac1fffff) )
            return 1;
        /* Class C private network: 192.168.0.0 - 192.168.255.255 */
        if( (haddr >= 0xc0a80000) && (haddr <= 0xc0a8ffff) )
            return 1;
        return 0;
    }
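The three RFC 1918 ranges are the prefixes 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16, so the same test can also be written with prefix masks. The following userspace sketch is an alternative formulation for illustration, not the IPFIRE implementation; the function names is_rfc1918 and is_rfc1918_text are hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Return 1 if 'addr_net' (network byte order) lies in an RFC 1918
 * private range, 0 otherwise. */
int is_rfc1918(uint32_t addr_net)
{
    uint32_t h = ntohl(addr_net);

    if ((h & 0xff000000u) == 0x0a000000u)   /* 10.0.0.0/8     */
        return 1;
    if ((h & 0xfff00000u) == 0xac100000u)   /* 172.16.0.0/12  */
        return 1;
    if ((h & 0xffff0000u) == 0xc0a80000u)   /* 192.168.0.0/16 */
        return 1;
    return 0;
}

/* Convenience wrapper for dotted-quad strings.
 * Returns 1 (private), 0 (public) or -1 (not a valid IPv4 address). */
int is_rfc1918_text(const char *dotted)
{
    struct in_addr a;

    if (inet_pton(AF_INET, dotted, &a) != 1)
        return -1;
    return is_rfc1918(a.s_addr);
}
```

A few boundary addresses make a useful sanity check: 10.255.255.255, 172.31.255.255 and 192.168.255.255 are private, while 11.0.0.0, 172.32.0.1 and 192.169.0.1 are not.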
- Network structures and address translation.
- Network address translation is quite complex and the reader might
prefer to follow the examples in the scenarios
available.
All code related to network translation is located in ipfi_translation.c.
- Checking and calculating checksums.
- Network address translation implies manipulating the tcp/udp and ip headers.
For this reason, checksums must be recalculated after NAT. Once the checksum has been computed,
skb->ip_summed is set to CHECKSUM_NONE to signal that
the sum does not need to be calculated again.
An excerpt of recalculate_ip_checksum() is shown below:
    switch (iph->protocol) {
    case IPPROTO_TCP:
        th = skb_header_pointer(skb, skb->nh.iph->ihl * 4,
                                sizeof(tcphead), &tcphead);
        /* check not null and not malformed */
        if(check_tcp_header(th, datalen) < 0)
            return -1;
        th->check = 0;
        th->check = tcp_v4_check(th, len - 4 * iph->ihl,
                                 iph->saddr, iph->daddr,
                                 csum_partial((char *) th,
                                              len - 4 * iph->ihl, 0));
        break;
    /* ... */
    }
    /* see include/skbuff.h: "B.Checksumming on output":
     * CHECKSUM_NONE: skb is checksummed by protocol or csum is not required.
     * If checksum is calculated here, no need to recalculate it.
     */
    if( (direction == IPFI_OUTPUT ) || (direction == IPFI_OUTPUT_POST) )
        skb->ip_summed = CHECKSUM_NONE;
    return 0;
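The "zero the field, recompute, store" pattern used for th->check above is the same one that applies to the IP header checksum after an address has been rewritten. The sketch below shows it in userspace with the standard Internet checksum (one's complement sum of 16-bit words, RFC 1071). It is an illustration only: the kernel uses its own optimized helpers, and the names internet_checksum and ip_header_set_checksum are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum over 'len' bytes, treating the data as big-endian
 * 16-bit words. Intended for header-sized inputs (well under 64 KiB),
 * so the 32-bit accumulator cannot overflow. */
uint16_t internet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t) data[i] << 8) | data[i + 1];
    if (len & 1)                        /* odd trailing byte, zero padded */
        sum += (uint32_t) data[len - 1] << 8;
    while (sum >> 16)                   /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t) ~sum;
}

/* Recompute the checksum of an IPv4 header in place, e.g. after NAT
 * has rewritten an address. The checksum field is bytes 10 and 11. */
void ip_header_set_checksum(uint8_t *hdr, size_t hdr_len)
{
    uint16_t sum;

    hdr[10] = 0;
    hdr[11] = 0;
    sum = internet_checksum(hdr, hdr_len);
    hdr[10] = (uint8_t) (sum >> 8);
    hdr[11] = (uint8_t) (sum & 0xff);
}
```

A header whose checksum field is already correct sums to zero, so running internet_checksum() over a received header is also how a bad checksum is detected.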
Checksums are also verified in the prerouting chain when a packet arriving from the network has to be translated. Otherwise, translating a malformed packet would update its checksum and make the packet look valid to TCP!
- Registering with netfilter hooks.
- The kernel source tree does not need to be modified in order to use IPFIRE.
IPFIRE registers itself with the netfilter hooks, after which the kernel modules are ready to
use. For instance:
    int register_hooks(void)
    {
        /* pre routing */
        nfh_pre.pf = PF_INET;
        nfh_pre.hooknum = NF_IP_PRE_ROUTING;
        nfh_pre.priority = NF_IP_PRI_FIRST;     /* make our function first */
        nfh_pre.hook = deliver_process_by_direction;
        nf_register_hook(&nfh_pre);
        /* input, output, post, forward follow in ipfi.c */
    }
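The role of the priority field can be illustrated with a toy userspace model of a hook chain: hooks are kept sorted by priority, lower values run first, and traversal stops at the first hook that drops the packet. This is only a model written for this explanation, not the netfilter implementation; every name prefixed with toy_ or TOY_ is hypothetical.

```c
#include <stddef.h>

#define TOY_DROP      0
#define TOY_ACCEPT    1
#define TOY_MAX_HOOKS 8

typedef int (*toy_hook_fn)(const void *packet);

struct toy_hook {
    int priority;       /* lower value = called earlier */
    toy_hook_fn fn;
};

struct toy_chain {
    struct toy_hook hooks[TOY_MAX_HOOKS];
    size_t count;
};

/* Insert a hook keeping the array sorted by ascending priority. */
int toy_register_hook(struct toy_chain *c, int priority, toy_hook_fn fn)
{
    size_t i;

    if (c->count == TOY_MAX_HOOKS)
        return -1;
    for (i = c->count; i > 0 && c->hooks[i - 1].priority > priority; i--)
        c->hooks[i] = c->hooks[i - 1];
    c->hooks[i].priority = priority;
    c->hooks[i].fn = fn;
    c->count++;
    return 0;
}

/* Call each hook in order; the first TOY_DROP ends the traversal. */
int toy_traverse(const struct toy_chain *c, const void *packet)
{
    size_t i;

    for (i = 0; i < c->count; i++)
        if (c->hooks[i].fn(packet) == TOY_DROP)
            return TOY_DROP;
    return TOY_ACCEPT;
}

/* Two sample hooks used to observe the calling order. */
int toy_accept_calls;   /* how many times toy_accept_all() ran */

int toy_accept_all(const void *packet)
{
    (void) packet;
    toy_accept_calls++;
    return TOY_ACCEPT;
}

int toy_drop_null(const void *packet)
{
    return packet == NULL ? TOY_DROP : TOY_ACCEPT;
}
```

Registering toy_drop_null with a lower priority value than toy_accept_all makes it run first, so a packet it drops is never seen by the later hook. This is the effect a firewall obtains by asking for the first position in the chain.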
In register_hooks() you can see the protocol family, the hook involved, the registration priority and the pointer to the function to be invoked when a packet traverses the hook.
- Reading and writing to proc file system.
- IPFIRE has an interface to the /proc file system.
Through it the user can read or write the default policy of the filter.
The write procedure is reported here as an example (ipfi.c).
    int proc_write_ipfire(struct file *file, const char *page,
                          unsigned long count, void *data)
    {
        int len;
        /* do a range checking, don't overflow buffers in kernel modules */
        if(count > PROCENTRY_DATA_LEN)
            len = PROCENTRY_DATA_LEN;
        else
            len = count;
        /* use the copy_from_user function to copy page data
         * to our char. */
        if(copy_from_user(procentry_line, page, len)) {
            return -EFAULT;
        }
        /* zero terminate procentry_line; note that this requires
         * procentry_line to hold at least PROCENTRY_DATA_LEN + 1 bytes */
        procentry_line[len] = '\0';
        set_policy(procentry_line);
        return len;
    }
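The range check followed by explicit termination is the important part of the write handler. The userspace sketch below isolates that pattern, with memcpy standing in for copy_from_user. It is a minimal illustration under stated assumptions: LINE_CAPACITY and store_line are hypothetical names, and the clamp reserves one byte for the terminator so that the buffer needs exactly LINE_CAPACITY bytes.

```c
#include <stddef.h>
#include <string.h>

#define LINE_CAPACITY 16    /* hypothetical size of the destination buffer */

/* Copy at most LINE_CAPACITY - 1 bytes from 'src' into 'dst' and
 * NUL-terminate the result. Returns the number of bytes stored,
 * not counting the terminator. */
size_t store_line(char dst[LINE_CAPACITY], const char *src, size_t count)
{
    size_t len = count;

    /* range check: never write past the end of dst */
    if (len > LINE_CAPACITY - 1)
        len = LINE_CAPACITY - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

Returning the number of bytes actually stored lets the caller notice truncation by comparing the result with the count it asked for.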
These are just a small part of the many important topics involved in kernel programming. A deeper inspection of the IPFIRE code reveals other interesting aspects, for instance how header structures are manipulated, which macros are available to the programmer for looping over linked lists, or how an interface address is obtained when masquerading a device. Requests for information are greatly appreciated, and corrections are welcome!