
Anthropic makes ‘jailbreak’ advance to stop AI models producing harmful results



Artificial intelligence start-up Anthropic has demonstrated a new technique to prevent users from generating harmful content with its models, as leading technology groups, including Microsoft and Meta, race to find safeguards for the cutting-edge technology.

In a paper published on Monday, the San Francisco-based start-up outlined a new system called “Constitutional Classifiers”. It is a model that acts as a protective layer on top of large language models, such as the one powering Anthropic’s Claude chatbot, and can monitor both inputs and outputs for harmful content.
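To make the architecture concrete, the sketch below shows the general wrapper pattern the paper describes: one check screens the prompt before it reaches the model, and another screens the response before it reaches the user. This is a minimal illustration only; the class and function names, and the placeholder keyword check, are assumptions for this sketch and do not reflect Anthropic’s actual code or API.

```python
# Illustrative sketch of an input/output classifier layer wrapped
# around a language model. All names here are hypothetical; a real
# system would use trained classifier models, not keyword matching.

from dataclasses import dataclass


@dataclass
class Verdict:
    harmful: bool
    reason: str = ""


class Classifier:
    """Stand-in for a trained classifier that screens text."""

    def __init__(self, blocked_topics: list[str]):
        self.blocked_topics = blocked_topics

    def screen(self, text: str) -> Verdict:
        # Placeholder logic so the sketch is runnable; a production
        # classifier would be a model scoring the text against rules.
        for topic in self.blocked_topics:
            if topic in text.lower():
                return Verdict(harmful=True, reason=f"matched '{topic}'")
        return Verdict(harmful=False)


def guarded_generate(model, prompt: str, clf: Classifier) -> str:
    # 1. Screen the input before it reaches the model.
    if clf.screen(prompt).harmful:
        return "Request refused by input classifier."
    # 2. Generate a response from the underlying model.
    response = model(prompt)
    # 3. Screen the output before it reaches the user.
    if clf.screen(response).harmful:
        return "Response withheld by output classifier."
    return response


if __name__ == "__main__":
    echo_model = lambda p: f"Echo: {p}"  # stand-in for a real LLM
    clf = Classifier(blocked_topics=["chemical weapon"])
    print(guarded_generate(echo_model, "Hello there", clf))
```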

The development by Anthropic, which is in talks to raise $2bn at a valuation of $60bn, comes amid growing industry concern over “jailbreaking”: attempts to manipulate AI models into producing illegal or dangerous information, such as instructions for making chemical weapons.

Other companies are also racing to deploy measures to protect against the practice, moves that could help them avoid regulatory scrutiny and convince businesses to adopt AI models safely. Microsoft introduced “Prompt Shields” last March, while Meta launched a Prompt Guard model in July last year; researchers quickly found ways to bypass it, but those flaws have since been fixed.

Mrinank Sharma, a member of technical staff at Anthropic, said: “The main motivation behind the work was for severe chemical [weapon] stuff [but] the real advantage of the method is its ability to respond quickly and adapt.”

Anthropic said it would not immediately deploy the system on its current Claude models but would consider implementing it if riskier models were released in the future. Sharma added: “The big takeaway from this work is that we think this is a tractable problem.”

The start-up’s proposed solution is built on a so-called “constitution” of rules that defines what is permitted and what is restricted, and which can be adapted to capture different types of material.
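As a rough illustration of that idea, a “constitution” can be thought of as adaptable data that classifiers are built against, rather than fixed logic. The schema below is invented for this sketch and is not the format used in Anthropic’s paper.

```python
# Hypothetical illustration of a "constitution": an editable list of
# rules naming restricted and permitted content categories. The schema
# is an assumption for this sketch, not taken from Anthropic's paper.

CONSTITUTION = [
    {"category": "chemical_weapons", "policy": "restricted",
     "description": "Instructions for synthesising chemical weapons."},
    {"category": "general_chemistry", "policy": "permitted",
     "description": "Ordinary educational chemistry content."},
]


def restricted_categories(constitution: list[dict]) -> list[str]:
    """Return the categories a classifier should flag."""
    return [rule["category"] for rule in constitution
            if rule["policy"] == "restricted"]


# Because the constitution is plain data, it can be updated to capture
# new types of material without rewriting the wrapper logic itself.
print(restricted_categories(CONSTITUTION))  # ['chemical_weapons']
```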

Some jailbreak attempts are well known, such as using unusual capitalisation in a prompt, or asking the model to adopt the persona of a grandmother telling a bedtime story about a nefarious topic.

To test the system’s effectiveness, Anthropic offered “bug bounties” of up to $15,000 to individuals who attempted to bypass its security measures. These testers, known as red teamers, spent more than 3,000 hours trying to break through the defences.

With the classifiers in place, Anthropic’s Claude 3.5 Sonnet model rejected more than 95 per cent of the attempts, compared with 14 per cent without the safeguards.

Leading technology companies are trying to reduce misuse of their models while preserving their helpfulness. Often, when moderation measures are put in place, models become overly cautious and reject benign requests, as happened with early versions of Google’s image generator and Meta’s Llama 2. Anthropic said its classifiers caused “only a 0.38 per cent absolute increase in refusal rates”.

However, adding these protections also incurs extra costs for companies already paying huge sums for the computing power required to train and run models. Anthropic said the classifiers would increase “inference overhead”, the cost of running the models.

[Chart: bar chart showing the effectiveness of the classifiers in tests on Anthropic’s latest Claude model]

Security experts have argued that the accessible nature of these generative chatbots has enabled ordinary people with no prior expertise to attempt to extract dangerous information.

“The threat actor we had in mind in 2016 was a really powerful nation-state adversary,” said Ram Shankar Siva Kumar, who leads the AI red team at Microsoft. “Now literally one of my threat actors is a teenager with a potty mouth.”


