
We take our system uptime very seriously and know that your business depends on it.

Please visit this page for information on any confirmed outages or maintenance issues.

Last updated by Joseph Weisfish at 23:02 PST on 04/24/2019

Current Status – All systems operational. No known issues detected.

04/24/2019 – Non Planned
Severity – Limited
Systems affected – All non-VPN, NAT-based connections
Initial time 12:50pm PST
Duration 9h 40m
Status Resolved

Event Details

12:50PM Degraded network service detected, affecting all NATed traffic and non-VPN router connections. Client connectivity to private data resources was limited.
01:09PM An attempt to update the router firmware did not resolve the issue.
01:46PM Discovered that connectivity could be recovered intermittently through router resets, limiting downtime to about 5 minutes.
02:00PM Troubleshooting plan put in place: monitor connectivity and limit downtime through router resets until extended troubleshooting could take place after hours.
07:20PM The awaited router outage occurred, prompting a call to router vendor support.
09:30PM Traffic disruption isolated to a peer router and the router rules corrected. Verified that all connectivity resumed normally.

Root Cause Analysis

A local network peer router was configured incorrectly and was claiming the complete ARP table translation for the entire subnet. The upstream hop incorrectly advertised its ARP table to the wrong peer router. We contacted the peer's administrator and the necessary correction was made to their router.

Preventative Measures

We are implementing WAN segregation to avoid any future peer influence on our network.
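
Although the fix itself was on the peer's router, this class of problem (another device answering ARP for addresses it should not) can be caught with simple monitoring. The snippet below is a minimal, hypothetical sketch of an ARP-conflict watcher using the third-party Scapy library; the gateway address is a placeholder and this is an illustration, not a description of our production tooling.

```python
# Hypothetical ARP-conflict watcher (illustration only, not our production tooling).
# Requires the third-party "scapy" package and privileges to sniff traffic.
from scapy.all import ARP, sniff  # pip install scapy

GATEWAY_IP = "192.0.2.1"   # placeholder gateway address
seen_macs = set()

def check_arp(pkt):
    """Flag any additional MAC address that answers ARP for the gateway IP."""
    if pkt.haslayer(ARP) and pkt[ARP].op == 2 and pkt[ARP].psrc == GATEWAY_IP:
        seen_macs.add(pkt[ARP].hwsrc)
        if len(seen_macs) > 1:
            print(f"WARNING: multiple MACs claim {GATEWAY_IP}: {sorted(seen_macs)}")

sniff(filter="arp", prn=check_arp, store=False)
```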

07/20/2018 – Non Planned
Severity – Limited
Systems affected – Latency-sensitive UDP traffic
Initial time 11:46am PST
Duration 1h 00m
Status Resolved

Event Details

11:46AM Degraded network service detected, affecting latency-sensitive services, mostly UDP VoIP traffic. Clients reported lost or choppy RTP audio on calls.
12:09PM Service began to recover.
12:46PM Recovery from the degradation completed.

Root Cause Analysis

Upstream peers leaked more routes than usual, triggering a maximum-prefix limit condition.

Preventative Measures

Route detection and prefix-limiting settings have been implemented.
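
For context, a maximum-prefix limit shuts down or dampens a BGP session once a peer advertises more routes than an agreed threshold, which is the guard that the leaked routes tripped here. The snippet below is a simplified, hypothetical illustration of that logic in Python; the threshold is made up and this is not a representation of our router configuration.

```python
# Simplified illustration of a BGP maximum-prefix guard (hypothetical values).
MAX_PREFIXES = 500_000   # assumed per-peer limit
WARN_RATIO = 0.9         # warn when 90% of the limit is reached

def evaluate_peer(prefixes_received: int) -> str:
    """Return the action a router would take for a given number of received prefixes."""
    if prefixes_received > MAX_PREFIXES:
        return "shutdown-session"   # route leak exceeds the limit: drop the peer
    if prefixes_received > MAX_PREFIXES * WARN_RATIO:
        return "log-warning"        # approaching the limit: alert operators
    return "accept"

print(evaluate_peer(620_000))   # -> "shutdown-session"
```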

02/13/2018 – Non Planned
Severity – Limited
Systems affected – Hosted email
Initial time ~3:00pm PST
Duration overnight
Status Resolved

Event Details
Our upstream spam filtering vendor experienced an issue in which some domains went missing from their system. The domains were restored.
All delayed or missing emails have already been delivered, or may still be delivered, by the sending servers. No emails will be lost.

Not all client domains were affected.

Root Cause Analysis

Vendor link.

Preventative Measures
Vendor link.

02/22/2018 – Non Planned
Severity – Major
Systems affected – Hosted email
Initial time 16:38 PST
Duration 4h 33m
Status Resolved

Event Details
16:38 Exchange transport services froze due to a backup anomaly. The Exchange server’s LAN adapters would not enable following a normal reboot.
20:46 An on-site physical reset of the server’s LAN connections at the data center was necessary to force them back on.

Root Cause Analysis

The issue was traced to a defective port on the switch. The connection was re-cabled to an alternate port and all services instantly resumed, which was verified in the switch management software.

Although the server has 4 redundant ports, a single defective one hindered connectivity on the remaining 3, which would explain why the server wasn’t able to re-enable its own LAN ports. The switch management software now shows all 4 ports functioning normally.

Preventative Measures
We’ve added a second member to the switch stack and load balanced the LAN ports evenly across both members.
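
Because a single bad port masked the loss of redundancy, routinely checking the link state of every team member can surface this kind of failure earlier. The snippet below is a minimal sketch using the third-party psutil package; the interface names are placeholders and it is not a description of our monitoring system.

```python
# Hypothetical per-port link-state check for a NIC team (illustration only).
import psutil  # third-party package: pip install psutil

TEAM_PORTS = ["eth0", "eth1", "eth2", "eth3"]   # placeholder names for the 4 redundant ports

def down_ports(ports):
    """Return the team member interfaces that report no link."""
    stats = psutil.net_if_stats()
    return [p for p in ports if p not in stats or not stats[p].isup]

bad = down_ports(TEAM_PORTS)
if bad:
    print(f"ALERT: redundant port(s) down: {', '.join(bad)}")
else:
    print("All team ports report link up.")
```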

*Multiple Occurrences – 12/04/2017; 12/05/2017; 12/29/2017 – Non Planned
Severity – Major
Systems affected – All
Initial time 12:19pm PST 12/04/2017
Duration 0h 6m
Initial time 11:40am PST 12/05/2017
Duration 0h 11m
Initial time 12:16pm PST 12/29/2017
Duration – 3 separate occurrences of 2–10 minutes each
Status Resolved

Event Details
12/04/2017 12:19PM Primary router communication loss detected.
12/04/2017 12:25PM Router communication and data flow re-established without intervention.
12/05/2017 11:40AM Primary router communication loss detected.
12/05/2017 11:51AM Router communication and data flow re-established without intervention.
12/29/2017 12:16PM Primary router communication loss detected.
12/29/2017 1:03PM Router communication and data flow re-established without intervention (not contiguous).

01/25/2018 3:00PM New router acquired and installed as a replacement

Root Cause Analysis
12/05/2017 – Router log files indicate that multiple RTS (DoS) flood attacks took effect from both internal LAN and external WAN hosts, which overflowed the router’s communication capacity.
12/29/2017 – DoS flooding does not appear to be the significant factor, as at least one outage was recorded after the affected system was taken offline. We are investigating possible core switch processing capacity issues.

01/25/2018 – The old appliance had corrupt settings that were not applicable to the new router. Through log and settings analysis with vendor support, it was later discovered that the offending setting was a performance-related DPI option that throttled the router’s total connection limits.

Preventative Measures
12/04/2017 – Applied kernel patches and updates to the log-identified CentOS LAN hosts as a preventative measure. Enabled firewall Layer 2 flood protection with blacklisting at 3000 packets/sec, Layer 3 suspected WAN proxy protection at 1500 attempts/sec, and ICMP limiting at 200 packets/sec.
12/05/2017 – After the second incident, enabled full router logging and alerting, lowered the burst rate on identified low-priority LAN and WAN traffic sources to 85%, and applied a patch update to the cloud storage SAN appliance followed by a full reset.
12/30/2017 – Our sysadmin team does not believe the DoS traffic in question is a significant factor; this theory was reinforced when an outage occurred while the affected system was offline. Instead, we are focusing on the possibility that our core data center switch is over-utilized. As a precautionary measure we have decided to remove and upgrade our core switch, increasing our capacity to 10Gb throughput at the same time. The switch transition is tentatively scheduled for January 6th at 10pm PST. We have also upgraded our core router firmware to the most recent maintenance release as a precaution. A possible link to our issue was found in this knowledge base article: https://www.sonicwall.com/en-us/support/knowledge-base/170504539452924. Additionally, we have initiated internal and external recorded pings at 5-second intervals to all major IPs to gain more insight into the issue (a minimal sketch of this monitoring approach follows at the end of this entry).
01/01/2018 – As a precaution against possible traffic over-utilization, we have throttled our overall bandwidth by 10%.

01/25/2018 – Our new appliance supports automatic settings and firmware backups, which have been enabled so we can roll back any future setting corruption or anomalies.
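
The 5-second recorded pings mentioned under 12/30/2017 are straightforward to reproduce. Below is a minimal sketch of that style of monitor using only the Python standard library; the target IPs and log path are placeholders, not our actual monitored addresses, and the ping flags assume a Linux host.

```python
# Minimal recorded ping monitor, similar in spirit to the 5-second checks described above.
# Target IPs and log path are placeholders; "ping -c 1 -W 2" assumes Linux.
import subprocess
import time
from datetime import datetime

TARGETS = ["192.0.2.10", "198.51.100.20"]   # placeholder internal/external addresses
LOGFILE = "ping_log.csv"
INTERVAL = 5                                 # seconds between sweeps

while True:
    with open(LOGFILE, "a") as log:
        for ip in TARGETS:
            ok = subprocess.run(
                ["ping", "-c", "1", "-W", "2", ip],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            ).returncode == 0
            log.write(f"{datetime.now().isoformat()},{ip},{'up' if ok else 'DOWN'}\n")
    time.sleep(INTERVAL)
```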

12/21/2017 – Planned
Severity – None
Systems affected – None
Initial time 5am PST
Duration 6h
Status Pending re-scheduling

Event Details
USC ITS and FMS will be performing scheduled maintenance on UPS units in the ITS Data Center. During this maintenance, no impact to services housed in the data center is expected as they will be switched to an alternate power feed during testing.

If you or your users encounter issues with accessing colocation or cloud services during or after this maintenance period, please alert us at support@shadik.com.

12/13/2017 – Non Planned
Severity – Major
Systems affected – All
Initial time 10:58am PST
Duration 0h 15m
Status Resolved

Event Details
10:58AM Primary router communication loss detected. The router gateway was unreachable as well.
10:59AM Called the onsite data center technician, who confirmed they were aware of the issue and investigating. The outage appears to have been caused by enabling peering with the CENIC 100Gig route server and the subsequent BGP reconvergence. Connectivity resumed shortly thereafter.

Root Cause Analysis
11/21/2017 The datacenter's upstream provider performed maintenance on their network, which caused an outage for some partners. To work around this problem, they re-engineered traffic away from the affected circuit.
12/08/2017 The datacenter received notice from the provider that they had resolved their issues.
12/13/2017 Datacenter technicians attempted to resume sending traffic to the original circuit, but found that the traffic was being blackholed.

Preventative Measures
The datacenter is sending traffic away from the affected circuit and is working with the provider to resolve the issue.

11/16/2017 – Non Planned
Severity – Major
Systems affected – All
Initial time 14:33 PST
Duration 0h 33m
Status Resolved

Event Details
14:33 Primary router communication loss detected.
15:06 An onsite technician at the datacenter hard reset the router and service connectivity resumed shortly thereafter.

Root Cause Analysis
The primary datacenter router suffered an internal unhandled exception, causing communication loss and traffic delays. In the interest of time, a hard reset was initiated after multiple failed attempts to connect to the router to initiate a graceful shutdown.

Preventative Measures
Log analysis is forthcoming. No immediate changes are planned, as this occurrence appears to be an isolated incident. It has not been determined whether router teaming or failover could have prevented the situation: the primary router was still pingable, which would have prevented a high-availability secondary unit from taking over, and any takeover would itself have involved a normal communication disconnect in the process.
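
Since the router remained pingable while it was otherwise unresponsive, ICMP reachability alone was not a useful failover signal. The snippet below is a hypothetical sketch of an application-level reachability check; the management address, port, and check count are placeholders and this is not a description of our failover configuration.

```python
# Hypothetical application-level health check (illustration only).
# A device can answer ICMP yet refuse management/TCP sessions, so test a real service port.
import socket

ROUTER_MGMT = ("192.0.2.1", 443)   # placeholder management address and port
TIMEOUT = 3                        # seconds per connection attempt
CHECKS = 3                         # consecutive failures required before alerting

def tcp_alive(addr, timeout=TIMEOUT) -> bool:
    """Return True if a TCP connection to the address completes within the timeout."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

failures = sum(1 for _ in range(CHECKS) if not tcp_alive(ROUTER_MGMT))
if failures == CHECKS:
    print("ALERT: device may still answer ping but refuses TCP sessions; consider failover.")
```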

2/10/2016 – Non Planned
Severity – Major
Systems affected – All
Initial time 07:25am PST
Duration 1h 12m
Status Resolved

Event Details
06:10AM Los Nettos attempted to decommission a router that no longer supported any customer connections.
07:25AM The removal of this router caused an unanticipated routing anomaly that impacted STI traffic and other Associate connections.
08:37AM All services were restored when the router was reloaded.

Root Cause Analysis
Removal of old router hardware that, as it turned out, was still routing traffic for STI and other Associate connections.

Preventative Measures
Los Nettos will be analyzing their network configuration to determine the root cause before scheduling a new maintenance date for the removal of the old router.

3/14/2014 – Non Planned
Severity – Major
Systems affected – Isolated virtual machines, cloud-based data storage
Initial time 09:08am PST
Duration 1h 17m (sporadic)
Status Resolved

Event Details
10:19PM 3/13/2014 Regular security maintenance update was installed on a backend NAS.
09:08am 3/14/2014 The virtual host connecting to the NAS dropped access to the storage array; the iSCSI initiator connection to the array was manually re-established.
09:41am The array connection dropped again. Reset the host server, and deleted and recreated the iSCSI initiator target.
10:25am The array connection dropped again. Resetting the host server provided only temporary access to the NAS before the connection broke again shortly after. The host server was able to maintain NAS connectivity only after the NAS's default internal firewall was disabled.

Root Cause Analysis
The NAS security maintenance update overrode the custom internal firewall rules with a default subset. The issue was initially thought to be related to the host server, so we performed multiple resets of the host server to bring its NAS iSCSI target back up, and also deleted and recreated the iSCSI initiator target, without benefit. Unfortunately, the host resets only extended the outage. When we checked the NAS's internal rules and found they had been reset to defaults, we reverted to our custom rules and the NAS became stable. Direct WebDAV cloud-based storage access was also affected by the internal NAS firewall rules being reset to defaults.

Preventative Measures
The NAS's internal firewall has been disabled, as it provides no additional protection beyond the external firewall appliances.
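
More generally, this incident was a case of a vendor update silently reverting custom configuration to defaults. The snippet below is a minimal, hypothetical sketch of a post-update drift check; the file paths and export format are assumptions for illustration, not part of our actual tooling.

```python
# Hypothetical post-update configuration drift check (illustration only).
# Compares a ruleset exported after an update against a known-good baseline export.
import hashlib
from pathlib import Path

BASELINE = Path("baseline_rules.conf")   # placeholder: known-good exported ruleset
CURRENT = Path("current_rules.conf")     # placeholder: ruleset exported after the update

def digest(path: Path) -> str:
    """Return a SHA-256 digest of a ruleset export, ignoring blank lines and comments."""
    lines = [
        line.strip()
        for line in path.read_text().splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()

if digest(BASELINE) != digest(CURRENT):
    print("ALERT: ruleset changed after update; review before returning to service.")
else:
    print("Ruleset matches baseline.")
```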

3/15/2014 – Planned
Severity – None
Systems affected – None
Initial time 7am PST
Duration 8 – 12h
Status Resolved

Event Details
The Los Angeles Department of Water and Power (LADWP) has identified a potential issue with the electrical feed coming into the CAL building, which houses the primary data center and colocation facility. To address this issue, LADWP will need to cut utility power to the building for approximately 8 to 12 hours on Saturday, March 15, 2014, beginning at 7:00 a.m. As a result, the ITS data center and colocation facility will run on backup generator power for the duration of the utility outage.

The electrical circuits that feed your racks and equipment are protected by both uninterruptible power supplies and large capacity diesel generators. This infrastructure is designed to maintain constant electrical power when utility power is lost. Given this and the minimal risk factor inherent with any power outage, we expect no downtime for the data center or colocation facility equipment.

ITS and Facilities Management Services (FMS) staff are coordinating with LADWP for this maintenance, which is part of an ongoing effort to keep the data center and colocation facility’s physical infrastructure in the most optimal state possible. ITS and FMS regularly conduct testing of the electrical infrastructure at CAL to make sure the equipment is operating optimally.

2/24/2014 – Non Planned
Severity – Major
Systems affected – All
Initial time 06:25am PST
Duration 5h 38m
Status Resolved

Event Details
09:43am Los Nettos experienced an issue with one of its peering and transit routers located at USC. Engineering is replacing a faulty linecard.
11:25am The linecard replacement did not resolve the issue and we are calling in additional resources to help troubleshoot.
12:23pm The network outage that began this morning has been resolved. We are investigating the root cause and will send out additional information later.

Root Cause Analysis
Hardware failure of a 10-gigabit interface connecting a core switch to the Los Nettos peering and transit router located at USC. Due to the errors on the interface, the switch port disabled itself.

To remedy the situation, Los Nettos engineers installed a new linecard in the peering and transit router, but the new interface would not come up. After several reboots of the router failed to help, we replaced the transceiver, then a XENPAK, then the fiber jumper between the two devices, but the interface still remained down. We then configured a new port, shut down the original port and brought it back up, moved the fiber back to it, and the port finally came back up.

We have an open Cisco TAC case to find out why the reboot did not reset the state for the interface, as well as determining why the interface failed in the first place.

Preventative Measures
While resolving the issue, we also established a connection to a newer switch to provide additional backup capability for the future.

Please visit this page for more information as it becomes available.