We take our system uptime very seriously and know that your business depends on it.

Please visit this page to get information on any confirmed outages or maintenance issues.

Last updated by Joseph Weisfish at 09:55am PST on 01/01/2018

Current Status – All systems operational. No known issues detected.

WE APPRECIATE OUR CUSTOMERS' PATIENCE AS WE CONTINUE TO MONITOR THIS *ISSUE. WE ARE FULLY AWARE OF ITS SEVERITY TO YOUR BUSINESS AND OF YOUR DEPENDENCE ON ACCESS TO YOUR CLOUD RESOURCES.

*Multiple Occurrences – 12/04/2017; 12/05/2017; 12/29/2017 – Unplanned
Severity – Major
Systems affected – All
Initial time 12:19pm PST 12/04/2017
Duration 0h 6m
Initial time 11:40am PST 12/05/2017
Duration 0h 11m
Initial time 12:16pm PST 12/29/2017
Duration – 3 separate occurrences of 2 to 10 minutes each
Status Ongoing Investigation

Event Details
12/04/2017 12:19PM Primary router communication loss detected.
12/04/2017 12:25PM Router communication and data flow re-established without intervention.
12/05/2017 11:40AM Primary router communication loss detected.
12/05/2017 11:51AM Router communication and data flow re-established without intervention.
12/29/2017 12:16PM Primary router communication loss detected.
12/29/2017 1:03PM Router communication and data flow re-established without intervention (the outage was not contiguous; there were three separate drops in this window).

Root Cause Analysis
12/05/2017 – Router log files indicate that multiple RTS (DoS) flood attacks took effect from both internal LAN and external WAN hosts, overwhelming the router's communication capacity.
12/29/2017 – DoS flooding does not appear to be a significant factor; at least one outage was recorded after the affected system was taken offline. We are investigating possible core switch processing capacity issues.

Preventative Measures
12/04/2017 – Patched kernels and applied updates on CentOS servers as a preventative measure for the log-identified LAN hosts. Enabled the firewall's Layer 2 flood protection with 3,000 packets/sec blacklisting, Layer 3 suspected WAN proxy protection at 1,500 attempts/sec, and an ICMP threshold of 200 packets/sec (a simplified sketch of this kind of threshold check follows these entries).
12/05/2017 – After the second incident, enabled full router logging and alerting. Lowered the burst rate on identified low-priority LAN and WAN traffic sources to 85%. Applied a patch update to the cloud storage SAN appliance, followed by a full reset.
12/30/2017 – Our sysadmin team does not believe the DoS traffic in question is a significant factor; this theory was reinforced when an outage occurred after the affected system had been taken offline. Instead, we are focusing our efforts on the possibility that our core data center switch is being over-utilized. As a precautionary measure we have decided to remove and upgrade our core switch, increasing our capacity to 10Gb throughput at the same time. The switch transition is tentatively scheduled for January 6th at 10pm PST. We have also upgraded our core router firmware to the most recent maintenance release as a further precaution. A possibly related issue is described in this knowledge base article: https://www.sonicwall.com/en-us/support/knowledge-base/170504539452924. Additionally, we have initiated internal and external recorded pings at 5-second intervals to all major IPs in an effort to gain more insight into the issue (a sketch of this kind of monitor is included below).
01/01/2018 – As a precaution against possible traffic over-utilization, we have throttled our overall bandwidth by 10%.
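
For illustration only, the sketch below shows the kind of per-source rate check implied by the blacklisting thresholds above. It is a simplified Python example, not the firewall appliance's actual configuration; the input format, addresses, and helper names are our own placeholders.

    # Hypothetical sketch of a per-source packets-per-second check, similar in
    # spirit to the firewall's flood-protection blacklisting threshold.
    # The threshold values mirror those above; the input format and addresses
    # are placeholders, not real appliance data.
    from collections import defaultdict

    LAN_BLACKLIST_PPS = 3000   # Layer 2 flood blacklisting threshold (packets/sec)
    ICMP_LIMIT_PPS = 200       # ICMP threshold (packets/sec)

    def flag_flooding_sources(packets, threshold_pps):
        """packets: iterable of (epoch_second, source_ip) tuples.
        Returns the set of source IPs that sent more than threshold_pps
        packets within any single one-second bucket."""
        per_second = defaultdict(lambda: defaultdict(int))
        for second, src in packets:
            per_second[second][src] += 1
        flagged = set()
        for counts in per_second.values():
            for src, count in counts.items():
                if count > threshold_pps:
                    flagged.add(src)
        return flagged

    # Synthetic example: one LAN host far above the threshold gets flagged.
    sample = [(0, "10.0.0.5")] * 4500 + [(0, "10.0.0.9")] * 40
    print(flag_flooding_sources(sample, LAN_BLACKLIST_PPS))  # {'10.0.0.5'}

On the appliance itself this behavior is a configuration setting rather than a script; the sketch is only meant to make the thresholds concrete.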
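
Similarly, the recorded 5-second pings mentioned in the 12/30/2017 entry could look roughly like the following. The target addresses, interval, and log file are placeholders, and the script assumes a Linux ping binary; it is a sketch, not the monitoring we actually deployed.

    # Hypothetical sketch of recorded pings at 5-second intervals.
    # Targets, interval, and log path are placeholders; assumes the Linux
    # "ping" utility (-c count, -W timeout in seconds).
    import subprocess
    import time
    from datetime import datetime

    TARGETS = ["192.0.2.1", "198.51.100.1"]  # placeholder "major IPs"
    INTERVAL_SECONDS = 5
    LOG_PATH = "ping_log.csv"

    def ping_once(host, timeout_seconds=2):
        """Return True if a single ICMP echo to host succeeds."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_seconds), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        with open(LOG_PATH, "a") as log:
            while True:
                stamp = datetime.now().isoformat(timespec="seconds")
                for host in TARGETS:
                    status = "up" if ping_once(host) else "down"
                    log.write(f"{stamp},{host},{status}\n")
                log.flush()
                time.sleep(INTERVAL_SECONDS)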

12/21/2017 – Planned
Severity – None
Systems affected – None
Initial time 5am PST
Duration 6h
Status Pending re-scheduling

Event Details
USC ITS and FMS will be performing scheduled maintenance on UPS units in the ITS Data Center. During this maintenance, no impact to services housed in the data center is expected as they will be switched to an alternate power feed during testing.

If you or your users encounter issues with accessing colocation or cloud services during or after this maintenance period, please alert us at support@shadik.com.

12/13/2017 – Unplanned
Severity – Major
Systems affected – All
Initial time 10:58am PST
Duration 0h 15m
Status Resolved

Event Details
10:58AM Primary router communication loss detected. Router gateway unreachable as well.
10:59AM Called the onsite data center technician, who confirmed they were aware of the issue and investigating. The outage appears to have been caused by enabling peering with the CENIC 100Gig route server and the subsequent BGP reconvergence. Connectivity resumed shortly thereafter.

Root Cause Analysis
11/21/2017 – The datacenter's upstream provider performed maintenance on their network, which caused an outage for some partners. To work around this problem, traffic was re-engineered away from the affected circuit.
12/08/2017 – The datacenter received notice from the provider that their issues had been resolved.
12/13/2017 – Datacenter technicians attempted to resume sending traffic over the original circuit, but found that the traffic was being blackholed.

Preventative Measures
The datacenter is routing traffic away from the affected circuit and is working with the provider to resolve the issue.

11/16/2017 – Unplanned
Severity – Major
Systems affected – All
Initial time 2:33pm PST
Duration 0h 33m
Status Resolved

Event Details
2:33PM Primary router communication loss detected.
3:06PM The onsite technician at the datacenter hard reset the router and service connectivity resumed shortly thereafter.

Root Cause Analysis
The primary datacenter router suffered an internal unhandled exception, causing communication loss and traffic delays. In the interest of time, a hard reset was initiated after multiple failed attempts to connect to the router for a graceful shutdown.

Preventative Measures
Log analysis will be forthcoming. No immediate changes are planned, as this occurrence appears to be an isolated incident. It has not been determined whether router teaming or failover could have prevented the situation: the primary router remained pingable, which would have kept a high-availability secondary unit from taking over, and a takeover could not have happened without a normal communication disconnect in the process.
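
To illustrate the failover question above: a router that still answers ping can nevertheless fail to pass traffic, which is why a ping-only health check would not hand control to a secondary unit. Below is a minimal sketch of a check that combines ICMP reachability with a TCP path test; the addresses and port are hypothetical placeholders, not our monitoring configuration.

    # Hypothetical sketch: a router that answers ping may still be failing to
    # forward or process traffic, so a failover decision can combine an ICMP
    # check with a TCP check through the device. Addresses and the port are
    # placeholders, not our actual monitoring targets.
    import socket
    import subprocess

    ROUTER_IP = "192.0.2.1"                   # placeholder router address
    UPSTREAM_CHECK = ("198.51.100.10", 443)   # placeholder host reached via the router

    def icmp_reachable(host):
        """Single ICMP echo via the system ping utility (Linux flags)."""
        return subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0

    def tcp_reachable(host, port, timeout_seconds=3):
        """True if a TCP connection through the router can be established."""
        try:
            with socket.create_connection((host, port), timeout=timeout_seconds):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        ping_ok = icmp_reachable(ROUTER_IP)
        path_ok = tcp_reachable(*UPSTREAM_CHECK)
        if ping_ok and not path_ok:
            print("Router answers ping but the traffic path is down: failover candidate.")
        elif not ping_ok:
            print("Router unreachable.")
        else:
            print("Router and traffic path healthy.")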

2/10/2016 – Unplanned
Severity – Major
Systems affected – All
Initial time 07:25am PST
Duration 1h 12m
Status Resolved

Event Details
06:10AM Los Nettos attempted to decommission a router which no longer supported any customer connections.
07:25AM The removal of this router caused an unanticipated routing anomaly that impacted STI traffic and other Associate connections.
08:37AM All services were restored when the router was reloaded.

Root Cause Analysis
Removal of old router hardware that still appeared to route traffic for STI and other associate connections.

Preventative Measures
Los Nettos will be analyzing their network configuration to determine the root cause before scheduling a new maintenance date for the removal of the old router.

3/14/2014 – Unplanned
Severity – Major
Systems affected – Isolated Virtual Machines, Cloud based data storage
Initial time 09:08am PST
Duration 1h 17m (sporadic)
Status Resolved

Event Details
10:19PM 3/13/2014 A regular security maintenance update was installed on a backend NAS.
09:08AM 3/14/2014 The virtual host connecting to the NAS dropped access to the storage array; the iSCSI initiator was manually reconnected to the array.
09:41AM The array connection dropped again. The host server was reset, and the iSCSI initiator target was deleted and recreated.
10:25AM The array connection dropped again. Resetting the host server provided only temporary access to the NAS before the connection broke again shortly after. The host server was able to maintain NAS connectivity after the default internal NAS firewall was disabled.

Root Cause Analysis
The NAS security maintenance update overrode the custom internal firewall rules with a default subset. The issue was initially thought to be related to the host server, so we performed multiple resets of the host server in order to bring its NAS iSCSI target back up, and also deleted and recreated the iSCSI initiator target, without benefit. Unfortunately, the host resets only extended the outage. When we checked the NAS internal rules and found they had been reset to defaults, we reverted to our custom rules and the NAS became stable. Direct WebDAV cloud-based storage access was also affected by the reset of the internal NAS firewall rules.
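
For reference, the "delete and recreate the iSCSI initiator target" step might look roughly like the sketch below on a Linux host using open-iscsi's iscsiadm. The portal address and target IQN are placeholders, and the affected virtual host may have used a different initiator stack entirely.

    # Hypothetical sketch of tearing down and recreating an iSCSI session
    # with open-iscsi. The portal and IQN are placeholders.
    import subprocess

    PORTAL = "192.0.2.50:3260"                      # placeholder NAS portal
    TARGET_IQN = "iqn.2014-03.example.nas:storage"  # placeholder target IQN

    def run(args):
        """Run a command, echoing it for the record, and raise on failure."""
        print("+", " ".join(args))
        subprocess.run(args, check=True)

    def recreate_iscsi_session():
        # Log out of the existing session and drop the stale node record
        # (failures are ignored in case the session is already gone).
        subprocess.run(["iscsiadm", "-m", "node", "-T", TARGET_IQN,
                        "-p", PORTAL, "--logout"])
        subprocess.run(["iscsiadm", "-m", "node", "-T", TARGET_IQN,
                        "-p", PORTAL, "-o", "delete"])
        # Rediscover targets on the portal and log back in.
        run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", PORTAL])
        run(["iscsiadm", "-m", "node", "-T", TARGET_IQN, "-p", PORTAL, "--login"])

    if __name__ == "__main__":
        recreate_iscsi_session()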

Preventative Measures
The NAS internal firewall has been disabled, as it provides no additional protection beyond the external firewall appliances.

3/15/2014 – Planned
Severity – None
Systems affected – None
Initial time 7am PST
Duration 8 – 12h
Status Resolved

Event Details
The Los Angeles Department of Water and Power (LADWP) has identified a potential issue with the electrical feed coming into the CAL building, which houses the primary data center and colocation facility. To address this issue, LADWP will need to cut utility power to the building for approximately 8 to 12 hours on Saturday, March 15, 2014, beginning at 7:00 a.m. As a result, the ITS data center and colocation facility will run on backup generator power for the duration of the utility outage.

The electrical circuits that feed your racks and equipment are protected by both uninterruptible power supplies and large capacity diesel generators. This infrastructure is designed to maintain constant electrical power when utility power is lost. Given this and the minimal risk factor inherent with any power outage, we expect no downtime for the data center or colocation facility equipment.

ITS and Facilities Management Services (FMS) staff are coordinating with LADWP for this maintenance, which is part of an ongoing effort to keep the data center and colocation facility’s physical infrastructure in the most optimal state possible. ITS and FMS regularly conduct testing of the electrical infrastructure at CAL to make sure the equipment is operating optimally.

2/24/2014 – Unplanned
Severity – Major
Systems affected – All
Initial time 06:25am PST
Duration 5h 38m
Status Resolved

Event Details
09:43am Los Nettos experienced an issue with one of its peering and transit routers located at USC. Engineering is replacing a faulty linecard.
11:25am The linecard replacement did not resolve the issue and we are calling in additional resources to help troubleshoot.
12:23pm The network outage that began this morning has been resolved. We are investigating the root cause and will send out additional information later.

Root Cause Analysis
Hardware failure of a 10-gigabit interface connecting a core switch to the Los Nettos peering and transit router located at USC. Due to the errors on the interface, the switch port disabled itself.

To remedy the situation, Los Nettos engineers installed a new linecard in the peering and transit router, but the new interface would not come up. After several reboots of the router failed to restore it, we replaced the transceiver, then a XENPAK module, then the fiber jumper between the two devices, but the interface still remained down. We then configured a new port, shut the original port down and brought it back up, moved the fiber back to it, and the port finally came back up.

We have an open Cisco TAC case to find out why the reboot did not reset the state of the interface, as well as to determine why the interface failed in the first place.

Preventative Measures
While resolving this issue, we also established a connection to a newer switch to provide additional backup capability in the future.

Please visit this page for more information as it becomes available.