License: Common Public Attribution License 1.0 (CPAL)
Keywords: address, email, php, rfc, smtp, validation, validator

RFC-compliant email address validator

<?php
/*
Copyright 2009 Dominic Sayers
    dominic_sayers@hotmail.com
    http://www.dominicsayers.com

Version 1.7

This source file is subject to the Common Public Attribution License Version 1.0 (CPAL) license.
The license terms are available through the world-wide-web at http://www.opensource.org/licenses/cpal_1.0
*/

// PHPLint modules
/*.
    require_module 'standard';
    require_module 'pcre';
.*/
/*.boolean.*/ function is_email (/*.string.*/ $email, $checkDNS = false) {
    // Check that $email is a valid address. Read the following RFCs to understand the constraints:
    //     (http://tools.ietf.org/html/rfc5322)
    //     (http://tools.ietf.org/html/rfc3696)
    //     (http://tools.ietf.org/html/rfc5321)
    //     (http://tools.ietf.org/html/rfc4291#section-2.2)
    //     (http://tools.ietf.org/html/rfc1123#section-2.1)

    // the upper limit on address lengths should normally be considered to be 256
    //     (http://www.rfc-editor.org/errata_search.php?rfc=3696)
    //     NB I think John Klensin is misreading RFC 5321 and the limit should actually be 254
    //     However, I will stick to the published number until it is changed.
    //
    // The maximum total length of a reverse-path or forward-path is 256
    // characters (including the punctuation and element separators)
    //     (http://tools.ietf.org/html/rfc5321#section-4.5.3.1.3)
    $emailLength = strlen($email);
    if ($emailLength > 256) return false; // Too long

    // Contemporary email addresses consist of a "local part" separated from
    // a "domain part" (a fully-qualified domain name) by an at-sign ("@").
    //     (http://tools.ietf.org/html/rfc3696#section-3)
    $atIndex = strrpos($email,'@');

    if ($atIndex === false) return false; // No at-sign
    if ($atIndex === 0) return false; // No local part
    // strrpos() returns a zero-based index, so a trailing at-sign sits at $emailLength - 1
    if ($atIndex === $emailLength - 1) return false; // No domain part

    // Sanitize comments
    // - remove nested comments, quotes and dots in comments
    // - remove parentheses and dots from quoted strings
    $braceDepth = 0;
    $inQuote = false;
    $escapeThisChar = false;

    for ($i = 0; $i < $emailLength; ++$i) {
        $char = $email[$i];
        $replaceChar = false;

        if ($char === '\\') {
            $escapeThisChar = !$escapeThisChar; // Escape the next character?
        } else {
            switch ($char) {
            case '(':
                if ($escapeThisChar) {
                    $replaceChar = true;
                } else {
                    if ($inQuote) {
                        $replaceChar = true;
                    } else {
                        if ($braceDepth++ > 0) $replaceChar = true; // Increment brace depth
                    }
                }

                break;
            case ')':
                if ($escapeThisChar) {
                    $replaceChar = true;
                } else {
                    if ($inQuote) {
                        $replaceChar = true;
                    } else {
                        if (--$braceDepth > 0) $replaceChar = true; // Decrement brace depth
                        if ($braceDepth < 0) $braceDepth = 0;
                    }
                }

                break;
            case '"':
                if ($escapeThisChar) {
                    $replaceChar = true;
                } else {
                    if ($braceDepth === 0) {
                        $inQuote = !$inQuote; // Are we inside a quoted string?
                    } else {
                        $replaceChar = true;
                    }
                }

                break;
            case '.': // Dots don't help us either
                if ($escapeThisChar) {
                    $replaceChar = true;
                } else {
                    if ($braceDepth > 0) $replaceChar = true;
                }

                break;
            }

            $escapeThisChar = false;
            if ($replaceChar) $email[$i] = 'x'; // Replace the offending character with something harmless
        }
    }

    $localPart = substr($email, 0, $atIndex);
    $domain = substr($email, $atIndex + 1);
    $FWS = "(?:(?:(?:[ \\t]*(?:\\r\\n))?[ \\t]+)|(?:[ \\t]+(?:(?:\\r\\n)[ \\t]+)*))"; // Folding white space
    // Let's check the local part for RFC compliance...
    //
    // local-part = dot-atom / quoted-string / obs-local-part
    // obs-local-part = word *("." word)
    //     (http://tools.ietf.org/html/rfc5322#section-3.4.1)
    //
    // Problem: need to distinguish between "first.last" and "first"."last"
    // (i.e. one element or two). And I suck at regexes.
    $dotArray = /*. (array[int]string) .*/ preg_split('/\\.(?=(?:[^\\"]*\\"[^\\"]*\\")*(?![^\\"]*\\"))/m', $localPart);
    $partLength = 0;

    foreach ($dotArray as $element) {
        // Remove any leading or trailing FWS
        $element = preg_replace("/^$FWS|$FWS\$/", '', $element);

        // Then we need to remove all valid comments (i.e. those at the start or end of the element)
        $elementLength = strlen($element);

        // The length guards avoid reading an offset of an empty string
        if ($elementLength > 0 && $element[0] === '(') {
            $indexBrace = strpos($element, ')');
            if ($indexBrace !== false) {
                if (preg_match('/(?<!\\\\)[\\(\\)]/', substr($element, 1, $indexBrace - 1)) > 0) {
                    return false; // Illegal characters in comment
                }
                $element = substr($element, $indexBrace + 1, $elementLength - $indexBrace - 1);
                $elementLength = strlen($element);
            }
        }

        if ($elementLength > 0 && $element[$elementLength - 1] === ')') {
            $indexBrace = strrpos($element, '(');
            if ($indexBrace !== false) {
                if (preg_match('/(?<!\\\\)(?:[\\(\\)])/', substr($element, $indexBrace + 1, $elementLength - $indexBrace - 2)) > 0) {
                    return false; // Illegal characters in comment
                }
                $element = substr($element, 0, $indexBrace);
                $elementLength = strlen($element);
            }
        }

        // Remove any leading or trailing FWS around the element (inside any comments)
        $element = preg_replace("/^$FWS|$FWS\$/", '', $element);

        // What's left counts towards the maximum length for this part
        if ($partLength > 0) $partLength++; // for the dot
        $partLength += strlen($element);

        // Each dot-delimited component can be an atom or a quoted string
        // (because of the obs-local-part provision)
        if (preg_match('/^"(?:.)*"$/s', $element) > 0) {
            // Quoted-string tests:
            //
            // Remove any FWS
            $element = preg_replace("/(?<!\\\\)$FWS/", '', $element);
            // My regex skillz aren't up to distinguishing between \" \\" \\\" \\\\" etc.
            // So remove all \\ from the string first...
            $element = preg_replace('/\\\\\\\\/', ' ', $element);
            if (preg_match('/(?<!\\\\|^)["\\r\\n\\x00](?!$)|\\\\"$|""/', $element) > 0) return false; // ", CR, LF and NUL must be escaped, "" is too short
        } else {
            // Unquoted string tests:
            //
            // Period (".") may...appear, but may not be used to start or end the
            // local part, nor may two or more consecutive periods appear.
            //     (http://tools.ietf.org/html/rfc3696#section-3)
            //
            // A zero-length element implies a period at the beginning or end of the
            // local part, or two periods together. Either way it's not allowed.
            if ($element === '') return false; // Dots in wrong place

            // Any ASCII graphic (printing) character other than the
            // at-sign ("@"), backslash, double quote, comma, or square brackets may
            // appear without quoting. If any of that list of excluded characters
            // are to appear, they must be quoted
            //     (http://tools.ietf.org/html/rfc3696#section-3)
            //
            // Any excluded characters? i.e. 0x00-0x20, (, ), <, >, [, ], :, ;, @, \, comma, period, "
            if (preg_match('/[\\x00-\\x20\\(\\)<>\\[\\]:;@\\\\,\\."]/', $element) > 0) return false; // These characters must be in a quoted string
        }
    }

    if ($partLength > 64) return false; // Local part must be 64 characters or less

    // Now let's check the domain part...

    // The domain name can also be replaced by an IP address in square brackets
    //     (http://tools.ietf.org/html/rfc3696#section-3)
    //     (http://tools.ietf.org/html/rfc5321#section-4.1.3)
    //     (http://tools.ietf.org/html/rfc4291#section-2.2)
    if (preg_match('/^\\[(.)+]$/', $domain) === 1) {
        // It's an address-literal
        $addressLiteral = substr($domain, 1, strlen($domain) - 2);
        $matchesIP = array();

        // Extract IPv4 part from the end of the address-literal (if there is one)
        if (preg_match('/\\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/', $addressLiteral, $matchesIP) > 0) {
            $index = strrpos($addressLiteral, $matchesIP[0]);

            if ($index === 0) {
                // Nothing there except a valid IPv4 address, so...
                return true;
            } else {
                // Assume it's an attempt at a mixed address (IPv6 + IPv4)
                if ($addressLiteral[$index - 1] !== ':') return false; // Character preceding IPv4 address must be ':'
                if (substr($addressLiteral, 0, 5) !== 'IPv6:') return false; // RFC5321 section 4.1.3

                $IPv6 = substr($addressLiteral, 5, ($index === 7) ? 2 : $index - 6);
                $groupMax = 6;
            }
        } else {
            // It must be an attempt at pure IPv6
            if (substr($addressLiteral, 0, 5) !== 'IPv6:') return false; // RFC5321 section 4.1.3
            $IPv6 = substr($addressLiteral, 5);
            $groupMax = 8;
        }

        $groupCount = preg_match_all('/^[0-9a-fA-F]{0,4}|\\:[0-9a-fA-F]{0,4}|(.)/', $IPv6, $matchesIP);
        $index = strpos($IPv6,'::');

        if ($index === false) {
            // We need exactly the right number of groups
            if ($groupCount !== $groupMax) return false; // RFC5321 section 4.1.3
        } else {
            if ($index !== strrpos($IPv6,'::')) return false; // More than one '::'
            $groupMax = ($index === 0 || $index === (strlen($IPv6) - 2)) ? $groupMax : $groupMax - 1;
            if ($groupCount > $groupMax) return false; // Too many IPv6 groups in address
        }

        // Check for unmatched characters
        array_multisort($matchesIP[1], SORT_DESC);
        if ($matchesIP[1][0] !== '') return false; // Illegal characters in address

        // It's a valid IPv6 address, so...
        return true;
    } else {
        // It's a domain name...

        // The syntax of a legal Internet host name was specified in RFC-952
        // One aspect of host name syntax is hereby changed: the
        // restriction on the first character is relaxed to allow either a
        // letter or a digit.
        //     (http://tools.ietf.org/html/rfc1123#section-2.1)
        //
        // NB RFC 1123 updates RFC 1035, but this is not currently apparent from reading RFC 1035.
        //
        // Most common applications, including email and the Web, will generally not
        // permit...escaped strings
        //     (http://tools.ietf.org/html/rfc3696#section-2)
        //
        // the better strategy has now become to make the "at least one period" test,
        // to verify LDH conformance (including verification that the apparent TLD name
        // is not all-numeric)
        //     (http://tools.ietf.org/html/rfc3696#section-2)
        //
        // Characters outside the set of alphabetic characters, digits, and hyphen MUST NOT appear in domain name
        // labels for SMTP clients or servers
        //     (http://tools.ietf.org/html/rfc5321#section-4.1.2)
        //
        // RFC5321 precludes the use of a trailing dot in a domain name for SMTP purposes
        //     (http://tools.ietf.org/html/rfc5321#section-4.1.2)
        $dotArray = /*. (array[int]string) .*/ preg_split('/\\.(?=(?:[^\\"]*\\"[^\\"]*\\")*(?![^\\"]*\\"))/m', $domain);
        $partLength = 0;

        if (count($dotArray) === 1) return false; // Mail host can't be a TLD

        foreach ($dotArray as $element) {
            // Remove any leading or trailing FWS
            $element = preg_replace("/^$FWS|$FWS\$/", '', $element);

            // Then we need to remove all valid comments (i.e. those at the start or end of the element)
            $elementLength = strlen($element);

            // The length guards avoid reading an offset of an empty string
            if ($elementLength > 0 && $element[0] === '(') {
                $indexBrace = strpos($element, ')');
                if ($indexBrace !== false) {
                    if (preg_match('/(?<!\\\\)[\\(\\)]/', substr($element, 1, $indexBrace - 1)) > 0) {
                        return false; // Illegal characters in comment
                    }
                    $element = substr($element, $indexBrace + 1, $elementLength - $indexBrace - 1);
                    $elementLength = strlen($element);
                }
            }

            if ($elementLength > 0 && $element[$elementLength - 1] === ')') {
                $indexBrace = strrpos($element, '(');
                if ($indexBrace !== false) {
                    if (preg_match('/(?<!\\\\)(?:[\\(\\)])/', substr($element, $indexBrace + 1, $elementLength - $indexBrace - 2)) > 0) {
                        return false; // Illegal characters in comment
                    }
                    $element = substr($element, 0, $indexBrace);
                    $elementLength = strlen($element);
                }
            }

            // Remove any leading or trailing FWS around the element (inside any comments)
            $element = preg_replace("/^$FWS|$FWS\$/", '', $element);
            $elementLength = strlen($element); // Keep the length in step with the trimmed element

            // What's left counts towards the maximum length for this part
            if ($partLength > 0) $partLength++; // for the dot
            $partLength += strlen($element);

            // The DNS defines domain name syntax very generally -- a
            // string of labels each containing up to 63 8-bit octets,
            // separated by dots, and with a maximum total of 255
            // octets.
            //     (http://tools.ietf.org/html/rfc1123#section-6.1.3.5)
            if ($elementLength > 63) return false; // Label must be 63 characters or less

            // Each dot-delimited component must be atext
            // A zero-length element implies a period at the beginning or end of the
            // domain, or two periods together. Either way it's not allowed.
            if ($elementLength === 0) return false; // Dots in wrong place

            // Any ASCII graphic (printing) character other than the
            // at-sign ("@"), backslash, double quote, comma, or square brackets may
            // appear without quoting. If any of that list of excluded characters
            // are to appear, they must be quoted
            //     (http://tools.ietf.org/html/rfc3696#section-3)
            //
            // If the hyphen is used, it is not permitted to appear at
            // either the beginning or end of a label.
            //     (http://tools.ietf.org/html/rfc3696#section-2)
            //
            // Any excluded characters? i.e. 0x00-0x20, (, ), <, >, [, ], :, ;, @, \, comma, period, "
            if (preg_match('/[\\x00-\\x20\\(\\)<>\\[\\]:;@\\\\,\\."]|^-|-$/', $element) > 0) {
                return false;
            }
        }

        if ($partLength > 255) return false; // Domain must be 255 characters or less

        // $element still holds the last label from the loop above, i.e. the TLD
        if (preg_match('/^[0-9]+$/', $element) > 0) return false; // TLD can't be all-numeric

        // Check DNS?
        if ($checkDNS && function_exists('checkdnsrr')) {
            if (!(checkdnsrr($domain, 'A') || checkdnsrr($domain, 'MX'))) {
                return false; // Domain doesn't actually exist
            }
        }
    }

    // Eliminate all other factors, and the one which remains must be the truth.
    //     (Sherlock Holmes, The Sign of Four)
    return true;
}
?>

There’s a gazillion regular expressions out there that claim to validate an email address. They don’t. Doug Lovell explains why here: http://www.linuxjournal.com/article/9585. The function that Doug made for his article is good, but it delegates the validation of the domain part of the address to the DNS servers of the world. This is a good approach but there are three issues with it:

  1. The RFCs reluctantly allow you to use an IP address rather than a domain name, so you need to check for that.
  2. The DNS may not be available to your function at the time it needs to check the address (maybe it’s an intranet application).
  3. There’s no need to add extra workload to the DNS servers of the world if the address is wrongly formatted in the first place.
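
To make the second and third points concrete, here is a minimal sketch of the "syntax first, DNS second" ordering. It is not part of is_email(): the helper name check_address() and both callbacks are hypothetical, and the stand-in syntax check is far looser than the RFC rules. The only real API it relies on is PHP's checkdnsrr().

```php
<?php
// Hypothetical helper, not part of is_email(): run the cheap syntax check
// first and touch DNS only when the address is well formed.
function check_address(string $email, callable $syntaxCheck, ?callable $dnsCheck = null): bool
{
    if (!$syntaxCheck($email)) {
        return false;   // wrongly formatted: no DNS query is made
    }
    if ($dnsCheck === null) {
        return true;    // DNS unavailable or not wanted (e.g. an intranet)
    }
    $domain = substr($email, strrpos($email, '@') + 1);
    return (bool) $dnsCheck($domain);
}

// Stand-in syntax check for the demo only: one at-sign with text on both sides.
$syntax = function (string $email): bool {
    $at = strrpos($email, '@');
    return $at !== false && $at > 0 && $at < strlen($email) - 1;
};

// A DNS check could be built on checkdnsrr(), for example:
$dns = function (string $domain): bool {
    return checkdnsrr($domain, 'MX') || checkdnsrr($domain, 'A');
};

var_dump(check_address('no-at-sign', $syntax, $dns)); // bool(false), DNS never consulted
var_dump(check_address('user@example.com', $syntax)); // bool(true), syntax only
```

Passing no DNS callback covers the case where DNS is not reachable, and a malformed address is rejected before any lookup happens.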

So I’ve come up with a PHP function that validates all parts of a given email address, according to RFCs 1123, 2396, 3696, 4291, 4343, 5321 & 5322. I’ve released it under a license that allows you to use it royalty-free in commercial or non-commercial work, subject to a few conditions.
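
As a rough illustration of one class of rule the function enforces, the sketch below checks only the length limits quoted in its comments: 256 characters overall, 64 for the local part, 255 for the domain and 63 per domain label. The name within_length_limits() is made up for this example, and unlike the full function it does not strip comments or folding white space before counting.

```php
<?php
// Hypothetical, simplified length check. Limits are those cited in the
// is_email() comments; comments and folding white space are NOT removed here.
function within_length_limits(string $email): bool
{
    $at = strrpos($email, '@');
    if ($at === false || strlen($email) > 256) {
        return false;
    }
    $local  = substr($email, 0, $at);
    $domain = (string) substr($email, $at + 1);
    if (strlen($local) > 64 || strlen($domain) > 255) {
        return false;
    }
    foreach (explode('.', $domain) as $label) {
        if (strlen($label) > 63) {
            return false;   // a single DNS label is too long
        }
    }
    return true;
}

var_dump(within_length_limits(str_repeat('a', 64) . '@example.com')); // bool(true)
var_dump(within_length_limits(str_repeat('a', 65) . '@example.com')); // bool(false)
```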

It’s almost certainly the first email address validator that correctly lets you put an IPv6 address in for the domain part…
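
For readers who only need the address-literal part, here is a small sketch of the same idea. It swaps the function's hand-written group counting for PHP's built-in filter_var() with FILTER_VALIDATE_IP, so it is an alternative technique rather than the author's code. The name is_address_literal() is hypothetical. Like the original, it expects the case-sensitive "IPv6:" tag from RFC 5321 section 4.1.3 in front of an IPv6 address.

```php
<?php
// Hypothetical helper: is $domain a bracketed IPv4 or tagged IPv6 literal?
// Uses filter_var() instead of the manual parsing in is_email().
function is_address_literal(string $domain): bool
{
    if (strlen($domain) < 3 || $domain[0] !== '[' || substr($domain, -1) !== ']') {
        return false;   // not enclosed in square brackets
    }
    $literal = substr($domain, 1, -1);

    if (strncmp($literal, 'IPv6:', 5) === 0) {
        // IPv6 literals must carry the "IPv6:" tag
        return filter_var(substr($literal, 5), FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
    }
    return filter_var($literal, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
}

var_dump(is_address_literal('[192.168.0.1]'));      // bool(true)
var_dump(is_address_literal('[IPv6:2001:db8::1]')); // bool(true)
var_dump(is_address_literal('[2001:db8::1]'));      // bool(false), tag missing
```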

I've added this function to Google Code, where you can be sure of getting the latest version: http://code.google.com/p/isemail/source/browse/trunk

Comments

28 Jan 2009 at 11:18 AM, by David Isaacson
Wow.
28 Jan 2009 at 12:53 PM, by Stou S.
I enclosed your code with the php directives in order to get the highlighting working... I don't know PHP so I am not sure if this breaks your code or if the directives are actually required. Let me know and I'll file a bug report with Pygments or fix the lexer myself.
29 Jan 2009 at 12:47 AM, by Dominic Sayers
Thanks Stou
26 Feb 2009 at 12:25 PM, by David Isaacson
So can we count on a new version every day now? : )